Hello, I've started running into an issue where a WRF run progresses nearly to the end of the simulation period and then stops updating the rsl.out.* files. The job does not fail; it keeps running until it hits the time limit in the Slurm queue. The rsl.error.* files contain many MPICH errors (see below). The failure always occurs at the same simulation time (2022-06-08_13:02:05). This simulation is one of a series of 20-day restart runs spanning 2022 over the southeastern US, and no other run in the series has hit this issue. I've also run this model for some historical periods in the 1970s and 1980s, so far without this issue. The model has two nested domains (d01 = 1008x698 at 4 km; d02 = 1133x1341 at 1 km) and uses spectral nudging, the Noah-MP LSM, and ERA5 input. I'm at a loss and not sure what to try next; any help would be greatly appreciated!
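For anyone who wants to confirm the stall time the same way, here is a minimal sketch that reports the last simulation time logged in each rsl.out.* file. It assumes the usual "Timing for main: time ... on domain N:" log lines; the function names `last_times` and `scan` are only illustrative, and ranks that do not log timing lines simply return an empty result.

```python
import re
from pathlib import Path

# Assumed log line format, for example:
#   Timing for main: time 2022-06-08_13:02:05 on domain   2:    1.5 elapsed seconds
TIMING = re.compile(r"Timing for main: time (\S+) on domain\s+(\d+):")


def last_times(text):
    """Return {domain number: last simulation time string} found in one log."""
    last = {}
    for match in TIMING.finditer(text):
        time_str, domain = match.groups()
        last[int(domain)] = time_str  # later lines overwrite earlier ones
    return last


def scan(run_dir="."):
    """Map each rsl.out.* file name to its last logged time per domain."""
    results = {}
    for path in sorted(Path(run_dir).glob("rsl.out.*")):
        results[path.name] = last_times(path.read_text(errors="replace"))
    return results


if __name__ == "__main__":
    for name, times in scan().items():
        print(name, times)
```

If every file that logs timing stops at the same time, the stall is at a single model step rather than one rank falling behind.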
To test, I've tried the following:
- Rerun with no changes [failure],
- Build new, shorter input files with real.exe that span the failed simulation time [success],
- Build new input files with real.exe for the whole simulation period [failure].
Platform: Cray GNU/Linux, Intel x86_64
SBATCH: 2400 processors, with 80 reserved for I/O quilting
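As a rough check on the task layout, the sketch below estimates the patch size per compute task. It rests on several assumptions: the 80 I/O tasks are counted within the 2400 total, the domain sizes are given as west-east x south-north, and the task grid is the near-square factorization of the compute task count (an approximation of what WRF does when nproc_x and nproc_y are not set, not its exact algorithm). The helper `near_square_factors` is hypothetical.

```python
import math


def near_square_factors(n):
    """Return (a, b) with a * b == n and a <= b, with a as close to sqrt(n) as possible."""
    for a in range(math.isqrt(n), 0, -1):
        if n % a == 0:
            return a, n // a


total_tasks = 2400   # from the SBATCH line
io_tasks = 80        # assumed to be included in the 2400
compute_tasks = total_tasks - io_tasks

# Assumption: near-square task grid with the smaller factor in x.
nproc_x, nproc_y = near_square_factors(compute_tasks)

# Assumption: sizes are (west-east, south-north) grid points.
domains = {"d01": (1008, 698), "d02": (1133, 1341)}
for name, (nx, ny) in domains.items():
    print(
        f"{name}: about {nx / nproc_x:.1f} x {ny / nproc_y:.1f} points per patch "
        f"on a {nproc_x} x {nproc_y} task grid"
    )
```

The numbers are only estimates, since the actual layout depends on the namelist settings and on how the model splits uneven remainders.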
Code:
forrtl: error (78): process killed (SIGTERM)
Image PC Routine Line Source
wrf.exe 0000000023352F9B for__signal_handl Unknown Unknown
libpthread-2.26.s 00001555529672D0 Unknown Unknown Unknown
libugni.so.0.6.0 000015554F031DEB Unknown Unknown Unknown
libmpich_intel.so 0000155552DB0F7B MPID_nem_gni_poll Unknown Unknown
libmpich_intel.so 0000155552D8EDB6 MPIDI_CH3I_Progre Unknown Unknown
libmpich_intel.so 0000155552C97A95 MPIR_Wait_impl Unknown Unknown
libmpich_intel.so 0000155552C97F68 MPI_Wait Unknown Unknown
wrf.exe 00000000232B3D94 Unknown Unknown Unknown
wrf.exe 00000000212F0D77 Unknown Unknown Unknown
wrf.exe 00000000211E45BA Unknown Unknown Unknown
wrf.exe 0000000020FC62DC Unknown Unknown Unknown
wrf.exe 0000000020181C1F Unknown Unknown Unknown
wrf.exe 0000000020182236 Unknown Unknown Unknown
wrf.exe 0000000020017911 Unknown Unknown Unknown
wrf.exe 00000000200178C9 Unknown Unknown Unknown
wrf.exe 0000000020017852 Unknown Unknown Unknown
libc-2.26.so 00001555525BD34A __libc_start_main Unknown Unknown
wrf.exe 000000002001776A Unknown Unknown Unknown