[SOLVED] WRF "quiet failure": Job continues with no progress until timeout, MPICH errors in rsl.error files

jbellino · Aug 7, 2023

Hello, I've started running into an issue where a WRF run progresses nearly to the end of the simulation period and then stops updating the rsl.out.* files. The job does not fail; it simply continues until it hits the time limit in the Slurm queue. After inspecting the rsl.error.* files I see lots of MPICH errors (see below). The failure always occurs at the same simulation time (2022-06-08_13:02:05).

This simulation is part of a string of 20-day restart runs spanning 2022 for the southeastern US, and I have yet to run into this issue with any other run. I've also run this model for some historical periods in the 1970s and 1980s, thus far without this issue either. This is a 2-domain nested model (d01=1008x698 @ 4km; d02=1133x1341 @ 1km) with spectral nudging and the Noah-MP LSM, using ERA5 for input. I'm at a loss here and not sure what to try next. Any help would be greatly appreciated!

To test, I've tried the following:

Rerun with no changes [failure],
Build new, shorter input files with real.exe that span the failed simulation time [success],
Build new input files with real.exe for the whole simulation period [failure].

WRF version: 4.4
Platform: Cray GNU/Linux, Intel x86_64
SBATCH: 2400 processors, with 80 reserved for I/O quilting

Code:

forrtl: error (78): process killed (SIGTERM)
Image              PC                Routine            Line        Source        
wrf.exe            0000000023352F9B  for__signal_handl     Unknown  Unknown
libpthread-2.26.s  00001555529672D0  Unknown               Unknown  Unknown
libugni.so.0.6.0   000015554F031DEB  Unknown               Unknown  Unknown
libmpich_intel.so  0000155552DB0F7B  MPID_nem_gni_poll     Unknown  Unknown
libmpich_intel.so  0000155552D8EDB6  MPIDI_CH3I_Progre     Unknown  Unknown
libmpich_intel.so  0000155552C97A95  MPIR_Wait_impl        Unknown  Unknown
libmpich_intel.so  0000155552C97F68  MPI_Wait              Unknown  Unknown
wrf.exe            00000000232B3D94  Unknown               Unknown  Unknown
wrf.exe            00000000212F0D77  Unknown               Unknown  Unknown
wrf.exe            00000000211E45BA  Unknown               Unknown  Unknown
wrf.exe            0000000020FC62DC  Unknown               Unknown  Unknown
wrf.exe            0000000020181C1F  Unknown               Unknown  Unknown
wrf.exe            0000000020182236  Unknown               Unknown  Unknown
wrf.exe            0000000020017911  Unknown               Unknown  Unknown
wrf.exe            00000000200178C9  Unknown               Unknown  Unknown
wrf.exe            0000000020017852  Unknown               Unknown  Unknown
libc-2.26.so       00001555525BD34A  __libc_start_main     Unknown  Unknown
wrf.exe            000000002001776A  Unknown               Unknown  Unknown
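
A stall like the one described in this post (rsl.out.* files no longer updating while the job keeps running) can be caught well before the Slurm time limit by checking the last simulation time written to a log and how long ago the log was touched. The following is a minimal sketch, not part of WRF. The function names are hypothetical, and the regular expression assumes the usual "Timing for main" line layout, so it may need adjusting for a given build.

```python
import re
from typing import Optional, Tuple

# Assumed layout of a WRF timing line, for example:
# "Timing for main: time 2022-06-08_13:02:05 on domain   2:    1.5 elapsed seconds"
TIMING_RE = re.compile(
    r"Timing for main: time (\d{4}-\d{2}-\d{2}_\d{2}:\d{2}:\d{2}) on domain\s+(\d+)"
)


def last_timing(log_text: str) -> Optional[Tuple[str, int]]:
    """Return (simulation_time, domain) from the last timing line, or None."""
    matches = TIMING_RE.findall(log_text)
    if not matches:
        return None
    stamp, domain = matches[-1]
    return stamp, int(domain)


def is_stalled(log_mtime: float, now: float, max_idle_seconds: float) -> bool:
    """True when the log has been idle for longer than max_idle_seconds."""
    return (now - log_mtime) > max_idle_seconds


# Possible use against a live run (paths and threshold are illustrative only):
#   import os, time
#   text = open("rsl.out.0000").read()
#   print(last_timing(text))
#   print(is_stalled(os.path.getmtime("rsl.out.0000"), time.time(), 1800))
```

If the reported simulation time stops advancing and the log stays idle for much longer than a normal time step takes, the job can be cancelled early instead of burning the rest of its allocation.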

jbellino · Aug 14, 2023

I've spent the last few days testing another approach in which I split a 40-day simulation period (two 20-day runs) into one 18-day run (ending at 2022-06-08_00:00:00) and one 22-day run (beginning at 2022-06-08_00:00:00). The first run completes without trouble, which is no surprise since the time I've been struggling with falls several hours after its end. When I start the next run from the restart files of the first, I now get an error saying that the input file was created with an older WRF preprocessor:

Code:

d01 2022-06-08_00:00:00 File name that is causing troubles = wrfrst_d02_2022-06-08_00:00:00_0000
d01 2022-06-08_00:00:00  You can try 1) ensure that the input file was created with WRF v4 pre-processors, or
d01 2022-06-08_00:00:00  2) use force_use_old_data=T in the time_control record of the namelist.input file
-------------- FATAL CALLED ---------------
FATAL CALLED FROM FILE:  <stdin>  LINE:     332
 ---- ERROR: The input file appears to be from a pre-v4 version of WRF initialization routines

Looking at the restart files, they have header data that can be read with ncdump, but they appear to be empty since the file size is only 122 bytes. I'll start digging into why the restart files aren't being generated properly.

Edit: I had mistakenly checked the file size of the symlinks, not of the actual files, which had been moved and relinked in the working area.
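
The symlink mix-up noted in this post is easy to reproduce: on Linux, the size reported for a symlink itself is the length of the target path it stores, not the size of the file it points to. The sketch below shows the difference between `os.lstat` (does not follow links) and `os.stat` (follows links). The helper names and the file names in the demo are hypothetical.

```python
import os
import tempfile


def entry_and_target_size(path):
    """Return (size of the directory entry itself, size of the file it resolves to)."""
    own_size = os.lstat(path).st_size     # does not follow symlinks
    target_size = os.stat(path).st_size   # follows symlinks to the real file
    return own_size, target_size


def demo():
    """Create a 4096-byte file plus a symlink to it and compare the two sizes."""
    with tempfile.TemporaryDirectory() as workdir:
        real = os.path.join(workdir, "wrfrst_real")   # hypothetical file name
        link = os.path.join(workdir, "wrfrst_link")   # hypothetical link name
        with open(real, "wb") as fh:
            fh.write(b"\0" * 4096)
        os.symlink(real, link)
        return entry_and_target_size(link), os.path.islink(link)


if __name__ == "__main__":
    (own, target), is_link = demo()
    print(f"is symlink: {is_link}, link entry: {own} bytes, target file: {target} bytes")
```

Checking the resolved size (or calling `os.path.realpath` first) avoids mistaking a small link entry for an empty restart file.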

Ming Chen · Aug 16, 2023

I suspect that this error message is misleading and that the real problem is not the restart file, although I don't yet know what is wrong in this case.

Have you looked at the wrfout and wrfrst files? Do the variables in these files look reasonable?

I would suggest that you turn off the quilting option, i.e.,
nio_tasks_per_group = 0,
nio_groups = 1,

Then try again. We know that this option can sometimes cause trouble in WRF runs.

Other than this, all other options in your namelist.input look fine.
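
On the quilting settings suggested above: WRF sets aside nio_groups × nio_tasks_per_group MPI ranks for output, and the remaining ranks do the computation. A small sketch of that arithmetic follows. The function name is hypothetical, and the 2 × 40 split used in the note below it is only an assumed way of arriving at the 80 I/O ranks mentioned in the first post.

```python
def split_ranks(total_ranks: int, nio_groups: int, nio_tasks_per_group: int):
    """Return (compute_ranks, io_ranks) for a given quilting setup."""
    io_ranks = nio_groups * nio_tasks_per_group
    if io_ranks >= total_ranks:
        raise ValueError("quilting would leave no ranks for computation")
    return total_ranks - io_ranks, io_ranks
```

With 2400 ranks, `split_ranks(2400, 2, 40)` gives 2320 compute ranks and 80 I/O ranks, while the suggested test setting `split_ranks(2400, 1, 0)` gives 2400 compute ranks and no I/O ranks, so the domain decomposition changes between the two runs.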

jbellino · Aug 17, 2023

Ming Chen said:
Have you looked at the wrfout and wrfrst files? Do the variables in these files look reasonable?

I ran some high-level checks on the wrfrst and wrfout files by comparing attributes, variable array sizes, and file sizes against a set of files from a separate run for a different date range, and I don't see anything out of the ordinary.

Ming Chen said:
I would suggest that you turn off the quilting option, i.e.,
nio_tasks_per_group = 0,
nio_groups = 1,

I haven't yet had a chance to re-run with quilting turned off, but will test this out when the cluster is free.

jbellino · Aug 17, 2023

Looking at the most recent bug fixes incorporated in WRF v4.5.1 (July 25, 2023), I saw this:

Fix an issue in the revised MM5 (sf_sfclay_physics=1) scheme, where the model could potentially encounter an infinite loop. In specific conditions floating point roundoff errors were preventing a convergence condition from ever being met.

I am using this surface layer scheme in both domains and I wonder if this could be happening with my model. Once I have an easily testable run setup that doesn't take 30 hours to fail, I will test it with sf_sfclay_physics=0.
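
To illustrate the class of bug described in that release note, here is a generic sketch (not the WRF code, and not necessarily how the WRF fix works) of an iteration whose convergence test can never be satisfied because round-off keeps the result bouncing between two adjacent floating-point values. The update function `flip_flop` is a hypothetical stand-in, and the iteration cap is one common safeguard.

```python
def iterate(step, x0, tol, max_iter):
    """Apply step repeatedly until successive values differ by at most tol.

    Returns (value, iterations, converged). The max_iter cap guarantees the
    loop ends even if round-off keeps the difference above tol forever.
    """
    x = x0
    for n in range(1, max_iter + 1):
        x_new = step(x)
        if abs(x_new - x) <= tol:
            return x_new, n, True
        x = x_new
    return x, max_iter, False


# Two adjacent double-precision values near 1.0 (about 2.2e-16 apart).
A = 1.0
B = 1.0 + 2.0 ** -52


def flip_flop(x):
    """Hypothetical update that bounces between A and B, as round-off can cause."""
    return B if x == A else A


if __name__ == "__main__":
    # Tolerance tighter than the spacing between A and B: never converges,
    # so only the iteration cap stops the loop.
    print(iterate(flip_flop, A, 1e-17, 50))
    # Tolerance looser than that spacing: the test passes on the first step.
    print(iterate(flip_flop, A, 1e-15, 50))
```

Without the cap, the first call would spin forever, which from the outside looks exactly like a job that keeps running but stops writing output.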

Ming Chen · Aug 17, 2023

It is worth trying WRF V4.5.1. Please keep me updated on the result. Thanks.

saeed tavakhsh · Aug 18, 2023

Ming Chen said:
It is worth trying WRF V4.5.1. Please keep me updated on the result. Thanks.

Hi, I am using WRF V4.5 and am facing this problem again. I have three nested domains. The run with the first two domains was successful. The problem emerged again when I added the third domain, which I had previously somehow overcome by generating the wrfrst files. Now I simply turned off the UCM and it worked. I have attached my namelist.input file.

Ming Chen · Aug 18, 2023

Note that the urban physics scheme only works with the Noah and Noah-MP LSMs. In your case, you are running with the Pleim-Xiu LSM, so urban physics must be turned off.

jbellino · Aug 19, 2023

Ming Chen said:
It is worth trying WRF V4.5.1. Please keep me updated on the result. Thanks.

Hi Ming, I cherry-picked commit 8723305 into a new local branch of WRF V4.4 to patch the infinite loop described in issue 1859. I'm happy to report that after recompiling WRF, I was able to successfully complete a run using the namelist file that was originally causing the problem. Given that I had reproduced the issue several times before patching this bug, I am going to assume that the problem I was experiencing was, in fact, the infinite loop, and I now consider this issue resolved. Hooray!!
