WRF Stalling After Input Check on Derecho (Hangs Until Walltime Expires)

xgfan_wrf

Hi all,

I’ve been encountering a persistent issue running WRF on Derecho and wanted to see if others have experienced something similar or found a workaround.

Issue description:
WRF appears to stall immediately after successfully reading input data. In rsl.error.0000, the last message consistently indicates that the input data is acceptable (e.g., “input data is acceptable to use”), but the model does not proceed further. It then hangs indefinitely until the job reaches the walltime limit and is terminated.

Impact:
This behavior is costly in terms of both queue wait time and allocated walltime, especially when running multiple simulations. In some cases, I lose 10+ hours waiting in queue and another 10+ hours of walltime per failed attempt.

What I’ve tested so far:
  • Processor count sensitivity:
    Tested across a range of MPI task counts (64 to 110). The issue occurs consistently across all configurations.
  • I/O and data size considerations:
    • Reduced input data size (e.g., limiting FDDA data to ~10 days per file)
    • Distributed concurrent runs across different storage locations (scratch, work, etc.)
      These changes did not resolve the issue.
  • Reproducibility:
    The issue is intermittent: if a job stalls and is killed at walltime, a simple resubmission of the exact same job will often run successfully to completion.
Job configuration:
#PBS -l select=1:ncpus=128:mpiprocs=110:mem=235gb

# Load modules to match compile-time environment of wrfv4.6.1
module --force purge
module load ncarenv/23.09
module reset
module swap hdf5 hdf5-mpi/1.12.2
module swap netcdf netcdf-mpi/4.9.2
module load intel-classic/2023.2.1
module load ncarcompilers/1.0.0
module load craype/2.7.23
module load cray-mpich/8.1.27

# Print the list of loaded modules (should match the list used when WRF was compiled)
module list

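# Launch WRF with 110 MPI ranks (one per mpiprocs slot requested above)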
mpiexec -n 110 ./wrf.exe

(WRF version: v4.6.1)

Questions:
  • Has anyone observed similar behavior on Derecho or other HPC systems?
  • Could this be related to MPI communication, I/O contention, or filesystem latency that is not obvious from logs?
  • Are there recommended debugging steps or runtime flags to isolate where WRF is getting stuck after input validation?
  • Any known stability issues with specific compiler/MPI combinations that might lead to this kind of hang?
Additional info:

rsl.error.0000

taskid: 0 hostname: dec2234
module_io_quilt_old.F 2931 F
Quilting with 1 groups of 0 I/O tasks.
Ntasks in X 10 , ntasks in Y 11
*************************************
Configuring physics suite 'conus'
mp_physics: 8* 8*
cu_physics: 6* 0*
ra_lw_physics: 1* 1*
ra_sw_physics: 1* 1*
bl_pbl_physics: 2* 2*
sf_sfclay_physics: 2* 2*
sf_surface_physics: 2 2
(* = option overrides suite setting)
*************************************
Domain # 1: dx = 20000.000 m
Domain # 2: dx = 4000.000 m
WRF V4.6.1 MODEL
git commit d66e442fccc04111067e29274c9f9eaccc3cef28
*************************************
Parent domain
ids,ide,jds,jde 1 137 1 115
ims,ime,jms,jme -4 21 -4 18
ips,ipe,jps,jpe 1 14 1 11
*************************************
DYNAMICS OPTION: Eulerian Mass Coordinate
alloc_space_field: domain 1 , 25419644 bytes allocated
RESTART run: opening wrfrst_d01_2016-10-01_00:00:00 for reading
Input data is acceptable to use: wrfrst_d01_2016-10-01_00:00:00
Timing for processing restart file for domain 1: 2.42051 elapsed seconds
Max map factor in domain 1 = 1.01. Scale the dt in the model accordingly.
D01: Time step = 75.00000 (s)
D01: Grid Distance = 20.00000 (km)
D01: Grid Distance Ratio dt/dx = 3.750000 (s/km)
D01: Ratio Including Maximum Map Factor = 3.784267 (s/km)
D01: NML defined reasonable_time_step_ratio = 6.000000
Climate GHG input from file from year 1765 to 2499
CO2 range = 277.913000000000 579.264000000000 ppm
N2O range = 274.372000000000 359.798000000000 ppb
CH4 range = 738.986000000000 997.311000000000 ppb
CFC11 range = 0.000000000000000E+000 1.400000000000000E-002 ppt
CFC12 range = 0.000000000000000E+000 2.88100000000000 ppt
Normal ending of CAMtr_volume_mixing_ratio file
GHG annual values from CAM trace gas file
Year = 2013 , Julian day = 274
CO2 = 3.962691149230108E-004 volume mixing ratio
N2O = 3.262382361148917E-007 volume mixing ratio
CH4 = 1.825235824713203E-006 volume mixing ratio
CFC11 = 2.339924882273381E-010 volume mixing ratio
CFC12 = 5.223582847073324E-010 volume mixing ratio
INITIALIZE THREE Noah LSM RELATED TABLES
Skipping over LUTYPE = USGS
LANDUSE TYPE = MODIFIED_IGBP_MODIS_NOAH FOUND 20 CATEGORIES
INPUT SOIL TEXTURE CLASSIFICATION = STAS
SOIL TEXTURE CLASSIFICATION = STAS FOUND 19 CATEGORIES
ThompMP: computing qr_acr_qg table: qr_acr_qg_V4.dat
Timing for table computation qr_acr_qg_V4.dat: 0.74196 elapsed seconds
Writing qr_acr_qg_V4.dat in Thompson MP init
ThompMP: computing qr_acr_qs
Writing qr_acr_qsV2.dat in Thompson MP init
ThompMP: computing freezeH2O
Writing freezeH2O.dat in Thompson MP init
D01 Spectral nudging for wind is turned on and Guv= 0.3000E-03 xwave= 4 ywavenum= 4
D01 Spectral nudging for temperature is turned on and Gt= 0.3000E-03 xwave= 4 ywavenum= 4
D01 Spectral nudging for geopotential is turned on and Gph= 0.3000E-03 xwave= 4 ywavenum= 4
D01 Spectral nudging for water vapor mixing ratio is turned on and Gq= 0.1000E-04 xwave= 4 ywavenum= 4
D01 Spectral nudging for wind is turned off within the PBL.
D01 Spectral nudging for temperature is turned off within the PBL.
D01 Spectral nudging for geopotential is turned off within the PBL.
D01 Spectral nudging for water vapor mixing ratio is turned off within the PBL.
*************************************
Nesting domain
ids,ide,jds,jde 1 331 1 271
ims,ime,jms,jme -4 45 -4 35
ips,ipe,jps,jpe 1 33 1 25
INTERMEDIATE domain
ids,ide,jds,jde 40 111 28 87
ims,ime,jms,jme 35 58 23 44
ips,ipe,jps,jpe 38 48 26 34
*************************************
d01 2016-10-01_00:00:00 alloc_space_field: domain 2 , 4342272 bytes allocated
d01 2016-10-01_00:00:00 alloc_space_field: domain 2 , 62768356 bytes allocated
RESTART: nest, opening wrfrst_d02_2016-10-01_00:00:00 for reading
d01 2016-10-01_00:00:00 Input data is acceptable to use: wrfrst_d02_2016-10-01_00:00:00


Any insights, suggestions, or similar experiences would be greatly appreciated.

Thanks in advance.
Xingang
 
It happened again today, after a 21-hour wait in the queue. Now I have to resubmit and wait in the queue all over again.
When a run works, the timing lines for the two domains appear in the rsl.error.* files within about 1-2 minutes of launch, and the first three-hourly wrfout file is created within 3-4 minutes.
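
Given that a healthy run shows timing lines within the first couple of minutes, a watchdog in the batch script could at least keep a stalled job from burning its full walltime. A minimal, untested sketch (the 600-second grace period is an assumption to tune):

# Watchdog sketch: start WRF in the background, then kill the job early
# if rsl.error.0000 shows no progress within the grace period.
mpiexec -n 110 ./wrf.exe &
wrfpid=$!

sleep 600   # assumption: healthy runs print timing lines within ~2 minutes
if ! grep -q "Timing for main" rsl.error.0000; then
    echo "No timing lines after 600 s: assuming a stall, killing wrf.exe"
    kill $wrfpid
    exit 1  # non-zero exit marks the attempt for resubmission
fi

wait $wrfpid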

If anyone has had the same experience, please share it, along with any solutions you may have found.
Thanks,
Xingang
 
Hi Xingang,
Can you point us to the directory where a case has failed? Please make sure the namelist and rsl* files for the failed case are available in that directory. Thanks!
 
Hi, Kwerner:

The one that is going to fail (I am pretty sure) has now been running for 40 minutes without producing any wrfout; it will eventually be killed when the 1-hour walltime expires. My other runs integrate for 10 days, and the problem is the same. Please note that if the job is resubmitted, it will eventually succeed.
This one is at:
/glade/derecho/scratch/xfan/WRF/wrfv4.6.1/test/Testing_em_real_MPI1
The job submission script is:
PBS_wrfbmtest110_1.sh

I will not be running anything in that directory for now. Please go ahead and check it out.
Thanks,
Xingang
 
Hi, Kwerner:

Another one happened. It waited in the queue for at least 15 hours, and I killed the job after it had run for 1:25:00 without producing any output. Each job normally runs in a fresh directory with all data copied over. I have saved this one at (the number at the end of the folder name is the job ID):
/glade/work/xfan/tmp/Failed_job_mEd13rst112_5894138

I have resubmitted this same job, and as usual the resubmission works.
Xingang
 
Xingang,
Apologies for the delay. I've tested running your case (your namelist, wrfrst*, wrffdda_d01 and wrfbdy_d01 files from /glade/derecho/scratch/xfan/WRF/wrfv4.6.1/test/Testing_em_real_MPI1) and it ran okay without any stalling. I ran this on both the Derecho pre-compiled wrfv4.6.1 and wrfv4.7.1 code. If you haven't made any modifications to your WRF code, then the only differences I can see are our batch scripts, though I am using 110 processors, like you did.

If you'd like to take a look at one of the cases, you can find it in /glade/derecho/scratch/kkeene/xgfan_wrf/wrfv4.6.1/test/em_real

My batch submission script is called runwrf.sh. My loaded modules are:

Code:
Currently Loaded Modules:
  1) ncarenv/25.10  (S)   4) ncarcompilers/1.1.0   7) hdf5/1.14.6   10) ncl/6.6.2
  2) craype/2.7.34        5) libfabric/1.22.0      8) netcdf/4.9.3
  3) intel/2025.2.1       6) cray-mpich/8.1.32     9) ncview/2.1.9
 
Kwerner,

Thanks! I realize this is likely difficult to debug because simply resubmitting the same job usually allows it to complete normally.

However, this issue is still occurring almost daily. I typically have five WRF jobs running simultaneously, and the stalls appear somewhat random; sometimes one job stalls, sometimes two. In most cases a single resubmission resolves it (occasionally a second one is needed).

My guess is that this may be related to intermittent I/O blocking or filesystem contention that depends on overall system load at that moment. It seems possible that WRF occasionally gets stuck while waiting on a read/write operation and never recovers from that wait state. The behavior is usually easy to detect within a couple of minutes after launch. When the model is progressing normally, the rsl.error.* files continue printing timing lines such as:

Timing for main: time 2017-11-01_00:00:15 on domain 2: 0.31854 elapsed seconds
Timing for main: time 2017-11-01_00:00:30 on domain 2: 0.22356 elapsed seconds

When the stall occurs, output to the rsl.error.* files stops completely, with the last line always being the same message (I have included a few more lines before it for context):

D01 Spectral nudging for geopotential is turned off within the PBL.
D01 Spectral nudging for water vapor mixing ratio is turned off within the PBL.
*************************************
Nesting domain
ids,ide,jds,jde 1 331 1 271
ims,ime,jms,jme -4 45 -4 35
ips,ipe,jps,jpe 1 33 1 25
INTERMEDIATE domain
ids,ide,jds,jde 40 111 28 87
ims,ime,jms,jme 35 58 23 44
ips,ipe,jps,jpe 38 48 26 34
*************************************
d01 2017-11-01_00:00:00 alloc_space_field: domain 2 , 4342272 bytes allocated
d01 2017-11-01_00:00:00 alloc_space_field: domain 2 , 62768356 bytes allocated
RESTART: nest, opening wrfrst_d02_2017-11-01_00:00:00 for reading
d01 2017-11-01_00:00:00 Input data is acceptable to use: wrfrst_d02_2017-11-01_00:00:00

For a successful run, there are only about 40 more lines after the above last line and before the timing lines, shown below:
...
d01 2017-11-01_00:00:00 Input data is acceptable to use: wrfrst_d02_2017-11-01_00:00:00
Timing for processing restart file for domain 2: 7.83255 elapsed seconds
Climate GHG input from file from year 1765 to 2499
CO2 range = 277.913000000000 579.264000000000 ppm
N2O range = 274.372000000000 359.798000000000 ppb
CH4 range = 738.986000000000 997.311000000000 ppb
CFC11 range = 0.000000000000000E+000 1.400000000000000E-002 ppt
CFC12 range = 0.000000000000000E+000 2.88100000000000 ppt
Normal ending of CAMtr_volume_mixing_ratio file
GHG annual values from CAM trace gas file
Year = 2013 , Julian day = 274
CO2 = 3.962691149230108E-004 volume mixing ratio
N2O = 3.262382361148917E-007 volume mixing ratio
CH4 = 1.825235824713203E-006 volume mixing ratio
CFC11 = 2.339924882273381E-010 volume mixing ratio
CFC12 = 5.223582847073324E-010 volume mixing ratio
INITIALIZE THREE Noah LSM RELATED TABLES
Skipping over LUTYPE = USGS
LANDUSE TYPE = MODIFIED_IGBP_MODIS_NOAH FOUND 20 CATEGORIES
INPUT SOIL TEXTURE CLASSIFICATION = STAS
SOIL TEXTURE CLASSIFICATION = STAS FOUND 19 CATEGORIES
d01 2017-11-01_00:00:00 Input data is acceptable to use: wrffdda_d01
d01 2017-11-01_00:00:00 Input data processed for aux input 10 for domain 1
d01 2017-11-01_00:00:00 Input data is acceptable to use: wrfbdy_d01
d01 2017-11-01_00:00:00 WRF restart, LBC starts at 2017-11-01_00:00:00 and restart starts at 2017-11-01_00:00:00
LBC for restart: Starting valid date = 2017-11-01_00:00:00, Ending valid date = 2017-11-01_06:00:00
LBC for restart: Restart time = 2017-11-01_00:00:00
LBC for restart: Found the correct bounding LBC time periods
LBC for restart: Found the correct bounding LBC time periods for restart time = 2017-11-01_00:00:00
Timing for processing lateral boundary for domain 1: 0.33573 elapsed seconds
Tile Strategy is not specified. Assuming 1D-Y
WRF TILE 1 IS 1 IE 14 JS 1 JE 11
WRF NUMBER OF TILES = 1
D01 Spectral nudging read in new data at time = 2148480.000 min.
D01 Spectral nudging bracketing times = 2148480.000 2148840.000 min.
d01 2017-11-01_00:00:00 ----------------------------------------
d01 2017-11-01_00:00:00 W-DAMPING BEGINS AT W-COURANT NUMBER = 1.000000
d01 2017-11-01_00:00:00 ----------------------------------------
Tile Strategy is not specified. Assuming 1D-Y
WRF TILE 1 IS 1 IE 33 JS 1 JE 25
WRF NUMBER OF TILES = 1
Timing for main: time 2017-11-01_00:00:15 on domain 2: 0.31854 elapsed seconds
Timing for main: time 2017-11-01_00:00:30 on domain 2: 0.22356 elapsed seconds
...

Since resubmission usually succeeds without changing anything, this does not appear to be namelist- or input-data-related. It seems more likely to be an intermittent runtime/system-level issue (possibly I/O wait, filesystem latency, MPI synchronization, or resource contention), but I do not have enough visibility into the system to diagnose further.
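
One way to get a bit more visibility the next time a job stalls might be to dump the call stacks of the hung wrf.exe processes: backtraces sitting in an MPI wait versus a low-level read would at least separate an MPI hang from a filesystem stall. A rough sketch, assuming gdb is available and run from a shell on the compute node while the job is stalled:

# Dump all-thread backtraces for every wrf.exe rank on this node.
# Stacks parked in MPI_Wait/MPI_Bcast would suggest an MPI-level hang;
# stacks parked in read()/pread() would suggest a filesystem stall.
for pid in $(pgrep -u $USER wrf.exe); do
    echo "=== wrf.exe pid $pid ==="
    gdb -batch -p "$pid" -ex "thread apply all bt"
done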

I am not sure whether increasing the debug output level in the namelist would help; it might also lengthen the runtime by adding more I/O operations.
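
A lighter-weight option than raising the debug level might be the runtime diagnostics built into Cray MPICH and libfabric. The variables below are taken from the Cray MPICH documentation (man intro_mpi) and would need verifying against cray-mpich/8.1.27 on Derecho; they would go in the batch script before mpiexec:

# Echo the Cray MPICH settings in effect at startup (printed by rank 0).
export MPICH_ENV_DISPLAY=1
# Report MPI-IO statistics and timers when each file is closed; a stalled
# run that never prints them likely never finished its collective reads.
export MPICH_MPIIO_STATS=1
export MPICH_MPIIO_TIMERS=1
# Raise libfabric logging in case the hang sits in the network layer.
export FI_LOG_LEVEL=warn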
 