Hi all,
I’ve been encountering a persistent issue running WRF on Derecho and wanted to see if others have experienced something similar or found a workaround.
Issue description:
WRF appears to stall immediately after successfully reading input data. In rsl.error.0000, the last message consistently indicates that the input data is acceptable (e.g., “input data is acceptable to use”), but the model does not proceed further. It then hangs indefinitely until the job reaches the walltime limit and is terminated.
Impact:
This behavior is costly in terms of both queue wait time and allocated walltime, especially when running multiple simulations. In some cases, I lose 10+ hours waiting in queue and another 10+ hours of walltime per failed attempt.
What I’ve tested so far:
- Processor count sensitivity:
Tested across a range of MPI task counts (64 to 110). The stall shows up at every task count tested.
- I/O and data size considerations:
- Reduced input data size (e.g., limiting FDDA data to ~10 days per file)
- Distributed concurrent runs across different storage locations (scratch, work, etc.)
These changes did not resolve the issue.
- Reproducibility:
The issue is intermittent: if the job stalls and is killed at walltime, a simple resubmission of the exact same job will often run successfully to completion.
#PBS -l select=1:ncpus=128:mpiprocs=110:mem=235gb
# Load modules to match compile-time environment of wrfv4.6.1
module --force purge
module load ncarenv/23.09
module reset
module swap hdf5 hdf5-mpi/1.12.2
module swap netcdf netcdf-mpi/4.9.2
module load intel-classic/2023.2.1
module load ncarcompilers/1.0.0
module load craype/2.7.23
module load cray-mpich/8.1.27
# Print the loaded modules (same set as when WRF was compiled)
module list
mpiexec -n 110 ./wrf.exe
(WRF version: v4.6.1)
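Since a resubmission often succeeds, a possible stopgap is to stop a stalled run early instead of letting it burn the full walltime. The following is only a minimal sketch under assumptions, not something I have validated on Derecho: the function names `log_is_stale` and `watch_for_stall`, the 1800 s threshold, and the 60 s polling interval are all hypothetical, and it assumes GNU `stat` and that rsl.error.0000 keeps being updated while WRF is making progress. A legitimately slow startup phase would need a larger threshold.

```shell
#!/bin/bash
# Hypothetical watchdog helpers; names and thresholds are illustrative only.

# Succeeds (exit 0) if FILE was last modified more than MAX_AGE seconds ago.
# Assumes GNU stat (stat -c %Y), as found on typical Linux clusters.
log_is_stale() {
    local file=$1 max_age=$2 now mtime
    [ -e "$file" ] || return 1          # no log yet: not treated as stale
    now=$(date +%s)
    mtime=$(stat -c %Y "$file")
    [ $(( now - mtime )) -gt "$max_age" ]
}

# Polls every INTERVAL seconds while PID is alive and sends SIGTERM to PID
# once FILE is stale. Returns 0 if the process ended on its own, 1 if the
# watchdog killed it.
watch_for_stall() {
    local pid=$1 file=$2 max_age=$3 interval=$4
    while kill -0 "$pid" 2>/dev/null; do
        if log_is_stale "$file" "$max_age"; then
            echo "watchdog: $file unchanged for more than ${max_age}s, killing $pid" >&2
            kill "$pid" 2>/dev/null
            return 1
        fi
        sleep "$interval"
    done
    return 0
}

# Possible use in the job script (commented out so the helpers can be sourced alone):
# mpiexec -n 110 ./wrf.exe &
# wrf_pid=$!
# watch_for_stall "$wrf_pid" rsl.error.0000 1800 60 || exit 1
# wait "$wrf_pid"
```

Whether SIGTERM to `mpiexec` cleanly tears down every rank is something I would still need to confirm; the point is only to free the allocation sooner so the job can be resubmitted.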
Questions:
- Has anyone observed similar behavior on Derecho or other HPC systems?
- Could this be related to MPI communication, I/O contention, or filesystem latency that is not obvious from logs?
- Are there recommended debugging steps or runtime flags to isolate where WRF is getting stuck after input validation?
- Any known stability issues with specific compiler/MPI combinations that might lead to this kind of hang?
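On the debugging question, one low-cost check I can run on the next stalled job is to compare where each MPI rank stopped logging, since only rsl.error.0000 is shown here. This is a rough sketch: the helper name `rsl_last_line_summary` is hypothetical, and it assumes every rank writes its own rsl.error.NNNN file in the run directory. It groups the files by their final line, so a rank that stopped somewhere different from the rest stands out.

```shell
#!/bin/bash
# Hypothetical helper: group the rsl.error.* files in DIR (default: current
# directory) by their last line, printing the most common last line first.
rsl_last_line_summary() {
    local dir=${1:-.} f
    for f in "$dir"/rsl.error.*; do
        [ -f "$f" ] || continue         # skip if the glob matched nothing
        tail -n 1 "$f"
    done | sort | uniq -c | sort -rn
}

# Possible use from the run directory of a stalled job:
# rsl_last_line_summary .
```

If all ranks end on the same line, that would point toward a collective operation or shared I/O step rather than a single misbehaving rank, though that reading is an inference and not something the logs confirm by themselves.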
rsl.error.0000
taskid: 0 hostname: dec2234
module_io_quilt_old.F 2931 F
Quilting with 1 groups of 0 I/O tasks.
Ntasks in X 10 , ntasks in Y 11
*************************************
Configuring physics suite 'conus'
mp_physics: 8* 8*
cu_physics: 6* 0*
ra_lw_physics: 1* 1*
ra_sw_physics: 1* 1*
bl_pbl_physics: 2* 2*
sf_sfclay_physics: 2* 2*
sf_surface_physics: 2 2
(* = option overrides suite setting)
*************************************
Domain # 1: dx = 20000.000 m
Domain # 2: dx = 4000.000 m
WRF V4.6.1 MODEL
git commit d66e442fccc04111067e29274c9f9eaccc3cef28
*************************************
Parent domain
ids,ide,jds,jde 1 137 1 115
ims,ime,jms,jme -4 21 -4 18
ips,ipe,jps,jpe 1 14 1 11
*************************************
DYNAMICS OPTION: Eulerian Mass Coordinate
alloc_space_field: domain 1 , 25419644 bytes allocated
RESTART run: opening wrfrst_d01_2016-10-01_00:00:00 for reading
Input data is acceptable to use: wrfrst_d01_2016-10-01_00:00:00
Timing for processing restart file for domain 1: 2.42051 elapsed seconds
Max map factor in domain 1 = 1.01. Scale the dt in the model accordingly.
D01: Time step = 75.00000 (s)
D01: Grid Distance = 20.00000 (km)
D01: Grid Distance Ratio dt/dx = 3.750000 (s/km)
D01: Ratio Including Maximum Map Factor = 3.784267 (s/km)
D01: NML defined reasonable_time_step_ratio = 6.000000
Climate GHG input from file from year 1765 to 2499
CO2 range = 277.913000000000 579.264000000000 ppm
N2O range = 274.372000000000 359.798000000000 ppb
CH4 range = 738.986000000000 997.311000000000 ppb
CFC11 range = 0.000000000000000E+000 1.400000000000000E-002 ppt
CFC12 range = 0.000000000000000E+000 2.88100000000000 ppt
Normal ending of CAMtr_volume_mixing_ratio file
GHG annual values from CAM trace gas file
Year = 2013 , Julian day = 274
CO2 = 3.962691149230108E-004 volume mixing ratio
N2O = 3.262382361148917E-007 volume mixing ratio
CH4 = 1.825235824713203E-006 volume mixing ratio
CFC11 = 2.339924882273381E-010 volume mixing ratio
CFC12 = 5.223582847073324E-010 volume mixing ratio
INITIALIZE THREE Noah LSM RELATED TABLES
Skipping over LUTYPE = USGS
LANDUSE TYPE = MODIFIED_IGBP_MODIS_NOAH FOUND 20 CATEGORIES
INPUT SOIL TEXTURE CLASSIFICATION = STAS
SOIL TEXTURE CLASSIFICATION = STAS FOUND 19 CATEGORIES
ThompMP: computing qr_acr_qg table: qr_acr_qg_V4.dat
Timing for table computation qr_acr_qg_V4.dat: 0.74196 elapsed seconds
Writing qr_acr_qg_V4.dat in Thompson MP init
ThompMP: computing qr_acr_qs
Writing qr_acr_qsV2.dat in Thompson MP init
ThompMP: computing freezeH2O
Writing freezeH2O.dat in Thompson MP init
D01 Spectral nudging for wind is turned on and Guv= 0.3000E-03 xwave= 4 ywavenum= 4
D01 Spectral nudging for temperature is turned on and Gt= 0.3000E-03 xwave= 4 ywavenum= 4
D01 Spectral nudging for geopotential is turned on and Gph= 0.3000E-03 xwave= 4 ywavenum= 4
D01 Spectral nudging for water vapor mixing ratio is turned on and Gq= 0.1000E-04 xwave= 4 ywavenum= 4
D01 Spectral nudging for wind is turned off within the PBL.
D01 Spectral nudging for temperature is turned off within the PBL.
D01 Spectral nudging for geopotential is turned off within the PBL.
D01 Spectral nudging for water vapor mixing ratio is turned off within the PBL.
*************************************
Nesting domain
ids,ide,jds,jde 1 331 1 271
ims,ime,jms,jme -4 45 -4 35
ips,ipe,jps,jpe 1 33 1 25
INTERMEDIATE domain
ids,ide,jds,jde 40 111 28 87
ims,ime,jms,jme 35 58 23 44
ips,ipe,jps,jpe 38 48 26 34
*************************************
d01 2016-10-01_00:00:00 alloc_space_field: domain 2 , 4342272 bytes allocated
d01 2016-10-01_00:00:00 alloc_space_field: domain 2 , 62768356 bytes allocated
RESTART: nest, opening wrfrst_d02_2016-10-01_00:00:00 for reading
d01 2016-10-01_00:00:00 Input data is acceptable to use: wrfrst_d02_2016-10-01_00:00:00
Any insights, suggestions, or similar experiences would be greatly appreciated.
Thanks in advance.
Xingang