
Multiple WRFv4.7.1 restart runs killed with signal errors on Derecho

spacekace3005

Hello,

I am running two WRF simulations that each require a restart run to finish. The first half of each simulation ran successfully, but I am now having problems with the restart runs. I have successfully run four other simulations with the same model configuration, including restarts. The only changes between the first and second halves of these runs are the start time and the restart option (.true. instead of .false.) in the namelist.
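One way to confirm that the start time and the restart flag really are the only differences is to diff the two namelist.input files key by key. The following is a minimal sketch, not a full Fortran namelist parser: it assumes one `key = value` entry per line, and it treats everything after a `!` as a comment. The helper names and the sample values (start hours, restart interval) are hypothetical and are not taken from the actual runs.

```python
import re

ENTRY = re.compile(r"([A-Za-z_][A-Za-z0-9_]*)\s*=\s*(.*)")

def parse_namelist(text):
    """Return {(group, key): value} for a simple one-entry-per-line namelist."""
    entries = {}
    group = None
    for raw in text.splitlines():
        line = raw.split("!", 1)[0].strip()  # assumption: '!' only starts comments
        if not line:
            continue
        if line.startswith("&"):             # group header, e.g. &time_control
            group = line[1:].strip().lower()
            continue
        if line == "/":                      # end of group
            group = None
            continue
        m = ENTRY.match(line)
        if m and group is not None:
            # Normalise whitespace and the trailing comma so that
            # "12, 12," and "12,12" compare equal.
            value = re.sub(r"\s+", "", m.group(2)).rstrip(",")
            entries[(group, m.group(1).lower())] = value
    return entries

def diff_namelists(first_text, restart_text):
    """Return {(group, key): (old, new)} for every entry that differs."""
    a, b = parse_namelist(first_text), parse_namelist(restart_text)
    return {k: (a.get(k), b.get(k))
            for k in sorted(set(a) | set(b)) if a.get(k) != b.get(k)}

# Hypothetical fragments for illustration only.
first_half = """
&time_control
 start_hour       = 12, 12,
 restart          = .false.,
 restart_interval = 360,
/
"""
restart_half = """
&time_control
 start_hour       = 18, 18,
 restart          = .true.,
 restart_interval = 360,
/
"""

for (group, key), (old, new) in diff_namelists(first_half, restart_half).items():
    print(f"&{group} {key}: {old} -> {new}")
```

Any entry that shows up in the diff besides the start time and `restart` would be worth a second look before resubmitting.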

With the first run (/glade/derecho/scratch/kshourd/8July2009/precursor/WRF/test/em_real), there are no CFL or other errors in the rsl.* files. The last five lines of the WRF output file (wrf_pre.o4526758) are as follows:
starting wrf task 2301 of 4992
starting wrf task 2302 of 4992
starting wrf task 2303 of 4992
dec0260.hsn.de.hpc.ucar.edu: rank 128 exited with code 255
dec0392.hsn.de.hpc.ucar.edu: rank 3348 died from signal 15

The rsl.out.0000 file contains the following warning, which I have not seen before even though I have used MYNN Level 2.5/3 many times:
--- WARNING: MYNN is set to mix scalars, turning off scalar_pblmix
However, it is not clear that this is the problem, because wrf.exe keeps running after the warning and is only killed after it opens the associated d02 restart file. The rsl.out.0000 and wrf_pre.o4526758 files are attached and are also in the aforementioned folder on Derecho (the rsl files are in the folder /oldRSL).

My second run (/glade/derecho/scratch/kshourd/8July2009/convection/WRF/test/em_real) terminates in a similar way to the first (see wrf_cv.o4541589), though WRF runs for one timestep before being killed:
starting wrf task 4991 of 4992
starting wrf task 510 of 4992
starting wrf task 639 of 4992
dec2004.hsn.de.hpc.ucar.edu: rank 4430 died from signal 11 and dumped core
dec1574.hsn.de.hpc.ucar.edu: rank 299 died from signal 15
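
The rank lines from the two job logs above can be decoded mechanically, which helps separate the rank that failed first from the ranks that were merely cleaned up afterwards. The sketch below uses Python's standard `signal` module; the regular expression and the function name are assumptions based only on the five rank lines quoted in this post.

```python
import re
import signal

# Pattern inferred from the launcher lines quoted above (an assumption).
RANK_LINE = re.compile(
    r"^(?P<host>\S+): rank (?P<rank>\d+) "
    r"(?:exited with code (?P<code>\d+)|died from signal (?P<sig>\d+))"
)

def decode_rank_exits(log_text):
    """Return (rank, kind, number, signal_name) for each rank-exit line."""
    results = []
    for line in log_text.splitlines():
        m = RANK_LINE.match(line.strip())
        if not m:
            continue
        rank = int(m.group("rank"))
        if m.group("sig") is not None:
            num = int(m.group("sig"))
            try:
                name = signal.Signals(num).name
            except ValueError:
                name = "unknown"
            results.append((rank, "signal", num, name))
        else:
            results.append((rank, "exit", int(m.group("code")), ""))
    return results

sample = """\
dec0260.hsn.de.hpc.ucar.edu: rank 128 exited with code 255
dec0392.hsn.de.hpc.ucar.edu: rank 3348 died from signal 15
dec2004.hsn.de.hpc.ucar.edu: rank 4430 died from signal 11 and dumped core
dec1574.hsn.de.hpc.ucar.edu: rank 299 died from signal 15
"""

for rank, kind, num, name in decode_rank_exits(sample):
    print(f"rank {rank}: {kind} {num} {name}".rstrip())
```

Signal 15 is SIGTERM and signal 11 is SIGSEGV. A SIGTERM on one rank usually means the launcher terminated it after a different rank had already failed, so the ranks more likely to hold the original error are 128 in the first run (exit code 255) and 4430 in the second (segmentation fault). Their rsl.error.0128 and rsl.error.4430 files are probably the most informative ones to read.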

The difference with the second run is that some rsl files do have CFL errors in them:
rsl.error.4430:d02 2009-07-08_23:46:37+**/** 2 points exceeded v_cfl = 2 in domain d02 at time 2009-07-08_23:46:37+**/** hours
rsl.error.4430:d02 2009-07-08_23:46:37+**/** Max W: 465 1408 2 W: 45.38 w-cfl: 2.47 dETA: 0.01
rsl.error.4430:d02 2009-07-08_23:46:37+**/** 2 points exceeded v_cfl = 2 in domain d02 at time 2009-07-08_23:46:37+**/** hours
rsl.error.4430:d02 2009-07-08_23:46:37+**/** Max W: 465 1409 3 W: -104.11 w-cfl: 2.58 dETA: 0.01
rsl.error.4430:d02 2009-07-08_23:46:37+**/** 2 points exceeded v_cfl = 2 in domain d02 at time 2009-07-08_23:46:37+**/** hours
rsl.error.4430:d02 2009-07-08_23:46:37+**/** Max W: 465 1408 2 W: 81.05 w-cfl: 2.95 dETA: 0.01
rsl.error.4430:d02 2009-07-08_23:46:41+03/** 4 points exceeded v_cfl = 2 in domain d02 at time 2009-07-08_23:46:41+03/** hours
rsl.error.4430:d02 2009-07-08_23:46:41+03/** Max W: 465 1408 3 W: 97.01 w-cfl: 3.53 dETA: 0.01
rsl.error.4430:d02 2009-07-08_23:46:41+03/** 3 points exceeded v_cfl = 2 in domain d02 at time 2009-07-08_23:46:41+03/** hours
rsl.error.4430:d02 2009-07-08_23:46:41+03/** Max W: 465 1409 3 W: 81.00 w-cfl: 3.04 dETA: 0.01
rsl.error.4430:d02 2009-07-08_23:46:41+03/** 1 points exceeded v_cfl = 2 in domain d02 at time 2009-07-08_23:46:41+03/** hours
rsl.error.4430:d02 2009-07-08_23:46:41+03/** Max W: 465 1408 3 W: 103.90 w-cfl: 3.35 dETA: 0.01
rsl.error.4430:d02 2009-07-08_23:46:43+39/50 9 points exceeded v_cfl = 2 in domain d02 at time 2009-07-08_23:46:43+39/50 hours
rsl.error.4430:d02 2009-07-08_23:46:43+39/50 Max W: 465 1409 3 W: -37.25 w-cfl: 5.81 dETA: 0.01
rsl.error.4430:d02 2009-07-08_23:46:43+39/50 3 points exceeded v_cfl = 2 in domain d02 at time 2009-07-08_23:46:43+39/50 hours
rsl.error.4430:d02 2009-07-08_23:46:43+39/50 Max W: 465 1409 3 W: -125.83 w-cfl: 3.85 dETA: 0.01
rsl.error.4430:d02 2009-07-08_23:46:43+39/50 12 points exceeded v_cfl = 2 in domain d02 at time 2009-07-08_23:46:43+39/50 hours
rsl.error.4430:d02 2009-07-08_23:46:43+39/50 Max W: 465 1408 4 W: -571.49 w-cfl: 5.06 dETA: 0.01
rsl.error.4431:d02 2009-07-08_23:46:43+39/50 5 points exceeded v_cfl = 2 in domain d02 at time 2009-07-08_23:46:43+39/50 hours
rsl.error.4431:d02 2009-07-08_23:46:43+39/50 Max W: 466 1409 4 W: -413.53 w-cfl: 3.68 dETA: 0.01
rsl.out.4430:d02 2009-07-08_23:46:37+**/** 2 points exceeded v_cfl = 2 in domain d02 at time 2009-07-08_23:46:37+**/** hours
rsl.out.4430:d02 2009-07-08_23:46:37+**/** Max W: 465 1408 2 W: 45.38 w-cfl: 2.47 dETA: 0.01
rsl.out.4430:d02 2009-07-08_23:46:37+**/** 2 points exceeded v_cfl = 2 in domain d02 at time 2009-07-08_23:46:37+**/** hours
rsl.out.4430:d02 2009-07-08_23:46:37+**/** Max W: 465 1409 3 W: -104.11 w-cfl: 2.58 dETA: 0.01
rsl.out.4430:d02 2009-07-08_23:46:37+**/** 2 points exceeded v_cfl = 2 in domain d02 at time 2009-07-08_23:46:37+**/** hours
rsl.out.4430:d02 2009-07-08_23:46:37+**/** Max W: 465 1408 2 W: 81.05 w-cfl: 2.95 dETA: 0.01
rsl.out.4430:d02 2009-07-08_23:46:41+03/** 4 points exceeded v_cfl = 2 in domain d02 at time 2009-07-08_23:46:41+03/** hours
rsl.out.4430:d02 2009-07-08_23:46:41+03/** Max W: 465 1408 3 W: 97.01 w-cfl: 3.53 dETA: 0.01
rsl.out.4430:d02 2009-07-08_23:46:41+03/** 3 points exceeded v_cfl = 2 in domain d02 at time 2009-07-08_23:46:41+03/** hours
rsl.out.4430:d02 2009-07-08_23:46:41+03/** Max W: 465 1409 3 W: 81.00 w-cfl: 3.04 dETA: 0.01
rsl.out.4430:d02 2009-07-08_23:46:41+03/** 1 points exceeded v_cfl = 2 in domain d02 at time 2009-07-08_23:46:41+03/** hours
rsl.out.4430:d02 2009-07-08_23:46:41+03/** Max W: 465 1408 3 W: 103.90 w-cfl: 3.35 dETA: 0.01
rsl.out.4430:d02 2009-07-08_23:46:43+39/50 9 points exceeded v_cfl = 2 in domain d02 at time 2009-07-08_23:46:43+39/50 hours
rsl.out.4430:d02 2009-07-08_23:46:43+39/50 Max W: 465 1409 3 W: -37.25 w-cfl: 5.81 dETA: 0.01
rsl.out.4430:d02 2009-07-08_23:46:43+39/50 3 points exceeded v_cfl = 2 in domain d02 at time 2009-07-08_23:46:43+39/50 hours
rsl.out.4430:d02 2009-07-08_23:46:43+39/50 Max W: 465 1409 3 W: -125.83 w-cfl: 3.85 dETA: 0.01
rsl.out.4430:d02 2009-07-08_23:46:43+39/50 12 points exceeded v_cfl = 2 in domain d02 at time 2009-07-08_23:46:43+39/50 hours
rsl.out.4430:d02 2009-07-08_23:46:43+39/50 Max W: 465 1408 4 W: -571.49 w-cfl: 5.06 dETA: 0.01
rsl.out.4431:d02 2009-07-08_23:46:43+39/50 5 points exceeded v_cfl = 2 in domain d02 at time 2009-07-08_23:46:43+39/50 hours
rsl.out.4431:d02 2009-07-08_23:46:43+39/50 Max W: 466 1409 4 W: -413.53 w-cfl: 3.68 dETA: 0.01
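
To make the listing above easier to scan, the "Max W" lines can be reduced to the worst value reported by each rsl file. This is a minimal sketch that assumes grep-style input (`file:line`) in the format shown above. The three integers after "Max W:" are presumably the i, j, k grid indices, and the function and field names are hypothetical.

```python
import re

# Matches the grep-style "Max W" lines shown above.
MAX_W = re.compile(
    r"^(?P<file>rsl\.(?:error|out)\.\d+):(?P<dom>d\d+)\s+(?P<time>\S+)\s+"
    r"Max\s+W:\s+(?P<i>\d+)\s+(?P<j>\d+)\s+(?P<k>\d+)\s+"
    r"W:\s+(?P<w>-?\d+(?:\.\d+)?)\s+w-cfl:\s+(?P<cfl>\d+(?:\.\d+)?)"
)

def summarize_cfl(lines):
    """Return {rsl_file: record} holding the largest |W| seen in each file."""
    worst = {}
    for line in lines:
        m = MAX_W.match(line.strip())
        if not m:
            continue  # skip the "points exceeded v_cfl" lines and anything else
        rec = {
            "domain": m.group("dom"),
            "time": m.group("time"),
            "i": int(m.group("i")),   # assumed west-east index
            "j": int(m.group("j")),   # assumed south-north index
            "k": int(m.group("k")),   # assumed vertical level
            "w": float(m.group("w")),
            "w_cfl": float(m.group("cfl")),
        }
        key = m.group("file")
        if key not in worst or abs(rec["w"]) > abs(worst[key]["w"]):
            worst[key] = rec
    return worst

sample_lines = [
    "rsl.error.4430:d02 2009-07-08_23:46:43+39/50 9 points exceeded v_cfl = 2 in domain d02 at time 2009-07-08_23:46:43+39/50 hours",
    "rsl.error.4430:d02 2009-07-08_23:46:43+39/50 Max W: 465 1409 3 W: -37.25 w-cfl: 5.81 dETA: 0.01",
    "rsl.error.4430:d02 2009-07-08_23:46:43+39/50 Max W: 465 1408 4 W: -571.49 w-cfl: 5.06 dETA: 0.01",
    "rsl.error.4431:d02 2009-07-08_23:46:43+39/50 Max W: 466 1409 4 W: -413.53 w-cfl: 3.68 dETA: 0.01",
]

for name, rec in sorted(summarize_cfl(sample_lines).items()):
    print(f"{name}: W={rec['w']} at ({rec['i']}, {rec['j']}, {rec['k']}), "
          f"w-cfl={rec['w_cfl']}, time={rec['time']}")
```

In the listing above, every flagged point falls at 465-466, 1408-1409 in the horizontal and at levels 2-4, and |W| grows from about 45 to over 570 within a few seconds of model time. That suggests the instability starts in one small patch of d02 near the surface, at the location that rank 4430 reports.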

As I mentioned, I have successfully run two other cases (three other models) with the same configuration, parameters, and even restarts without issue. These successful runs are located here:
/glade/derecho/scratch/kshourd/10Aug2020/new_precursor1km/WRF/test/em_real/
/glade/derecho/scratch/kshourd/12May2022/1km_convection/WRF/test/em_real/
/glade/derecho/scratch/kshourd/12May2022/1km_precursor/WRF/test/em_real/

Thanks in advance for any help!
 

Attachments

  • rsl.out.0000 (3.1 KB)
  • wrf_cv.o4541589.txt (239 KB)
  • wrf_pre.o4526758.txt (239 KB)