
Code and data for "Machine learning surrogates for efficient hydrologic modeling: Insights from stochastic simulations of managed aquifer recharge"


Authors: Timothy Dai, Kate Maher, Zach Perzan
Type: Resource
Storage: The size of this resource is 3.9 GB
Created: Dec 16, 2024 at 2:24 a.m.
Last updated: Jan 09, 2025 at 2:09 p.m.
Published date: Jan 09, 2025 at 2:09 p.m.
DOI: 10.4211/hs.f0a31fbc3de148a98deb36795b4fac53
Sharing Status: Published

Abstract

This repository contains the data and code associated with the paper "Machine Learning Surrogates for Efficient Hydrologic Modeling: Insights from Stochastic Simulations of Managed Aquifer Recharge" by Dai et al. (2025) in the Journal of Hydrology (https://doi.org/10.1016/j.jhydrol.2024.132606). The study evaluates a hybrid modeling framework that combines process-based hydrologic simulations (with the integrated hydrologic code ParFlow-CLM) and machine learning (ML) surrogates to efficiently simulate managed aquifer recharge. This repository includes:

1) Sample ParFlow-CLM output for all three simulation stages
2) PyTorch dataset modules and utility functions that construct PyTorch tensors from raw ParFlow-CLM outputs
3) PyTorch modules to implement each of the eight ML architectures described in the paper (CNN3d, CNN4d, U-FNO3d, U-FNO4d, ViT3d, ViT4d, PredRNN++, and a CNN autoencoder)
4) PyTorch modules for custom layers implemented in each architecture
5) A PyTorch module that implements a normalized L2 loss function
6) Scripts to train and evaluate each surrogate architecture, including the autoencoder

Though this repository contains only sample ParFlow-CLM simulation output, complete ParFlow output files for all simulations used in the paper are publicly available in a separate repository (https://doi.org/10.25740/hj302gv2126).

Coverage

Spatial

Coordinate System/Geographic Projection: WGS 84 EPSG:4326
Coordinate Units: Decimal degrees
North Latitude: 40.7000°
East Longitude: -118.4000°
South Latitude: 34.8000°
West Longitude: -122.6000°

Content

README.md

Code and data for "Machine learning surrogates for efficient hydrologic modeling: Insights from stochastic simulations of managed aquifer recharge"

Overview

This repository contains the code for the paper "Machine Learning Surrogates for Efficient Hydrologic Modeling: Insights from Stochastic Simulations of Managed Aquifer Recharge" by Dai et al. (2025) in the Journal of Hydrology. The study evaluates a hybrid modeling framework that combines process-based hydrologic simulations (with the integrated hydrologic code ParFlow-CLM) and machine learning (ML) surrogates to efficiently simulate managed aquifer recharge.

This repository is organized as follows:

  1. data/sample_data contains sample output for all three simulation stages and sample data for autoencoder training. Instructions for unzipping the data are provided in the Installation section below.
  2. data also contains PyTorch dataset modules and utility functions that construct PyTorch tensors from raw ParFlow-CLM outputs.
  3. models contains PyTorch implementations of the eight surrogate architectures used in the study (CNN3d, CNN4d, U-FNO3d, U-FNO4d, ViT3d, ViT4d, PredRNN++, and a CNN autoencoder).
  4. layers contains custom PyTorch layers used in some of the surrogate architectures above.
  5. losses contains a PyTorch implementation of the normalized $L^p$-norm used as a loss function in this study.
  6. Finally, the base directory contains scripts to train and evaluate each surrogate architecture.
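The normalized loss in item 5 can be illustrated with a minimal sketch: a relative $L^p$ error, i.e. the $L^p$-norm of the residual divided by the $L^p$-norm of the target, averaged over the batch. This is an illustration of the general technique, not the repository's exact implementation; the class name and the `eps` stabilizer are assumptions.

```python
import torch


class NormalizedLpLoss(torch.nn.Module):
    """Relative L^p loss: ||pred - target||_p / ||target||_p, batch-averaged.

    A minimal sketch of the normalized-norm loss idea; the repository's own
    module in `losses` may differ in reduction and numerical safeguards.
    """

    def __init__(self, p: int = 2, eps: float = 1e-8):
        super().__init__()
        self.p = p
        self.eps = eps  # guards against division by zero for all-zero targets

    def forward(self, pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
        batch = pred.shape[0]
        diff = (pred - target).reshape(batch, -1)
        ref = target.reshape(batch, -1)
        num = torch.linalg.vector_norm(diff, ord=self.p, dim=1)
        den = torch.linalg.vector_norm(ref, ord=self.p, dim=1)
        return (num / (den + self.eps)).mean()
```

Because the error is normalized per sample, fields with very different magnitudes (e.g. pressure vs. saturation) contribute comparably to the objective.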

Installation

Install all required modules with pip install -r requirements.txt. For complete compatibility, create your virtual environment with Python 3.8.20. Other versions of Python have not been tested but may also work.

Get sample data

Unzip the sample data and set up the data directory hierarchy with sh data/sample_data/unzip_all.sh. For users who wish to train on the complete dataset used in the paper, complete ParFlow output files are available to the public in a separate repository.

Any external data, provided through the --data_dir option, must have its directory hierarchy structured similarly to the sample data.

Training

All surrogate architectures described in the paper can be trained using the train.py script. The script uses argparse to take in several command line arguments to specify the model, dataset, and hyperparameters. To view all command-line options, run python train.py --help.

As a warning, training models on stages 1, 2 or 3 can be computationally expensive. Each architecture requires anywhere from 2 to 32 GB of memory to train (see Table 3 in the paper) and can take several hours on a single GPU.
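As a rough sketch of what such an argparse interface looks like, the snippet below wires up the flags that appear in the commands in this section (--name, --mode, --model, --data_dir, --autoencoder_ckpt_path). The choices, defaults, and help strings are assumptions for illustration; the authoritative list comes from python train.py --help.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Illustrative sketch of the train.py CLI; only these flag names come from
    # the README commands, and all choices/defaults here are assumptions.
    parser = argparse.ArgumentParser(description="Train a surrogate model")
    parser.add_argument("--name", required=True, help="experiment name")
    parser.add_argument("--mode",
                        choices=["autoencoder", "stage1", "stage2", "stage3"],
                        help="which training stage to run")
    parser.add_argument("--model", help="architecture, e.g. CNN3d or UFNO4d")
    parser.add_argument("--data_dir", help="root of the data directory hierarchy")
    parser.add_argument("--autoencoder_ckpt_path", default=None,
                        help="pretrained autoencoder checkpoint (stage 3 only)")
    return parser
```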

To train an autoencoder

python train.py --name <name> \
    --mode autoencoder \
    --model CNNAutoencoder \
    --data_dir data/sample_data/autoencoder \
    [--OPTIONS]

where <name> is the name of the experiment (e.g., my_first_autoencoder) and CNNAutoencoder is the architecture to be used.

To train a Stage 1 surrogate

python train.py --name <name> \
    --mode stage1 \
    --model <model> \
    --data_dir data/sample_data/stage1 \
    [--OPTIONS]

where <name> is the name of the experiment and <model> is the name of the architecture to be used. Note that the --model option must be one of the following: CNN3d, CNN4d, PredRNN, UFNO3d, UFNO4d, ViT3d or ViT4d. All other options can be viewed with python train.py --help.

To train a Stage 2 surrogate

python train.py --name <name> \
    --mode stage2 \
    --model <model> \
    --data_dir data/sample_data/stage2 \
    [--OPTIONS]

To train a Stage 3 surrogate

python train.py --name <name> \
    --mode stage3 \
    --model <model> \
    --data_dir data/sample_data/stage3 \
    --autoencoder_ckpt_path <autoencoder_ckpt_path> \
    [--OPTIONS]

Instead of providing an autoencoder checkpoint in Stage 3 training, users can also use a randomly initialized autoencoder by omitting the --autoencoder_ckpt_path option.

Notable options:

  • Start --name with "test" to run without saving checkpoints or tensorboard data.
  • Use the --use_dummy_dataset flag to quickly load correctly sized but randomly initialized tensors.
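The idea behind the --use_dummy_dataset flag can be sketched as a dataset of correctly sized but random tensors, which lets the full training loop be exercised without reading any ParFlow-CLM output. The class name and the tensor shape below are hypothetical, not the study's actual grid dimensions.

```python
import torch
from torch.utils.data import Dataset


class DummyDataset(Dataset):
    """Random tensors standing in for real ParFlow-CLM samples, in the spirit
    of --use_dummy_dataset. The (channels, depth, height, width) shape below
    is a placeholder; real shapes come from the simulation domain."""

    def __init__(self, n_samples: int = 8, shape=(1, 16, 32, 32)):
        self.n_samples = n_samples
        self.shape = shape

    def __len__(self):
        return self.n_samples

    def __getitem__(self, idx):
        # Input/target pairs are independent random fields; useful only for
        # checking that a model runs end to end, not for learning anything.
        x = torch.randn(self.shape)
        y = torch.randn(self.shape)
        return x, y
```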

Evaluation

Testing occurs automatically at the end of training when the --train_only flag is not set. However, testing can also be initiated separately with the commands below.

To test an autoencoder

python test.py \
    --mode autoencoder \
    --ckpt <ckpt> \
    --data_dir data/sample_data/autoencoder \
    [--OPTIONS]

To test a Stage 1 surrogate

python test.py \
    --mode stage1 \
    --ckpt <ckpt> \
    --data_dir data/sample_data/stage1 \
    [--OPTIONS]

To test a Stage 2 surrogate

python test.py \
    --mode stage2 \
    --ckpt <ckpt> \
    --data_dir data/sample_data/stage2 \
    [--OPTIONS]

To test a Stage 3 surrogate

python test.py \
    --mode stage3 \
    --ckpt <ckpt> \
    --data_dir data/sample_data/stage3 \
    [--OPTIONS]

End-to-end (E2E) evaluation

Three checkpoints can be tested together in an end-to-end fashion using the following command:

python e2e.py \
    --name <name> \
    --stage1_ckpt <stage1_ckpt> \
    --stage2_ckpt <stage2_ckpt> \
    --stage3_ckpt <stage3_ckpt> \
    --stage1_data_dir data/sample_data/stage1 \
    --stage2_data_dir data/sample_data/stage2 \
    --stage3_data_dir data/sample_data/stage3 \
    --autoencoder_ckpt_path <autoencoder_ckpt_path> \
    [--OPTIONS]
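Conceptually, end-to-end evaluation chains the three stage surrogates so that each stage's prediction feeds the next. The sketch below illustrates only that chaining idea; the function name is hypothetical, and the repository's e2e.py likely passes additional per-stage inputs (forcings, checkpoints, the autoencoder) that are omitted here.

```python
import torch


def e2e_predict(stage1: torch.nn.Module,
                stage2: torch.nn.Module,
                stage3: torch.nn.Module,
                x: torch.Tensor) -> torch.Tensor:
    # Hypothetical chaining of the three stage surrogates: each stage's
    # output becomes the next stage's input. Evaluation only, so gradients
    # are disabled.
    with torch.no_grad():
        h1 = stage1(x)
        h2 = stage2(h1)
        return stage3(h2)
```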

Related Resources

This resource is referenced by Timothy Dai, Kate Maher, Zach Perzan, Machine learning surrogates for efficient hydrologic modeling: Insights from stochastic simulations of managed aquifer recharge, Journal of Hydrology, Volume 652, 2025, 132606, ISSN 0022-1694, https://doi.org/10.1016/j.jhydrol.2024.132606.

How to Cite

Dai, T., K. Maher, Z. Perzan (2025). Code and data for "Machine learning surrogates for efficient hydrologic modeling: Insights from stochastic simulations of managed aquifer recharge", HydroShare, https://doi.org/10.4211/hs.f0a31fbc3de148a98deb36795b4fac53

This resource is shared under the Creative Commons Attribution 4.0 International license (CC BY 4.0): http://creativecommons.org/licenses/by/4.0/
