Multiple WRFv4.7.1 restart runs killed with signal errors on Derecho

spacekace3005

New member
Hello,

I am running two WRF simulations that require restart runs to finish. The first half of each simulation ran successfully, but I am now experiencing issues with the restart runs. I successfully ran four other simulations with the same model configuration, including with restarts. The only changes between the first and second halves of these runs are the start time and the restart option (.true. instead of .false.) in the namelist.

With the first run (/glade/derecho/scratch/kshourd/8July2009/precursor/WRF/test/em_real), there are neither CFL nor other errors in the rsl.* files. The last five lines of the WRF output file (wrf_pre.o4526758) are as follows:
starting wrf task 2301 of 4992
starting wrf task 2302 of 4992
starting wrf task 2303 of 4992
dec0260.hsn.de.hpc.ucar.edu: rank 128 exited with code 255
dec0392.hsn.de.hpc.ucar.edu: rank 3348 died from signal 15

The rsl.out.0000 file issues the following warning, which I have not seen before, despite using MYNN Level 2.5/3 numerous times before:
--- WARNING: MYNN is set to mix scalars, turning off scalar_pblmix
However, it's not clear this is the issue, as wrf.exe continues to run before being killed after opening the associated d02 restart file. The rsl.out.0000 and wrf_pre.o4526758 files are attached and located in the aforementioned folder on Derecho (rsl files are in the folder /oldRSL).

My second run (/glade/derecho/scratch/kshourd/8July2009/convection/WRF/test/em_real) is terminated in a similar way as the first (see wrf_cv.o4541589), though WRF runs for one timestep before being killed:
starting wrf task 4991 of 4992
starting wrf task 510 of 4992
starting wrf task 639 of 4992
dec2004.hsn.de.hpc.ucar.edu: rank 4430 died from signal 11 and dumped core
dec1574.hsn.de.hpc.ucar.edu: rank 299 died from signal 15
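
For anyone triaging similar job output, here is a minimal Python sketch that picks out the launcher's per-rank exit lines and names the signal. The function name `describe_rank_exit` and the regular expression are illustrative assumptions based only on the lines quoted above. On Linux, signal 11 is SIGSEGV and signal 15 is SIGTERM; ranks reported with signal 15 are often just being shut down after another rank has already failed, though the logs here do not confirm that.

```python
import re
import signal

# Matches launcher lines such as
# "dec2004.hsn.de.hpc.ucar.edu: rank 4430 died from signal 11 and dumped core"
# or "dec0260.hsn.de.hpc.ucar.edu: rank 128 exited with code 255".
# The pattern is an assumption derived from the lines quoted in this post.
RANK_LINE = re.compile(
    r"^(?P<host>\S+): rank (?P<rank>\d+) "
    r"(?:died from signal (?P<sig>\d+)|exited with code (?P<code>\d+))"
)


def describe_rank_exit(line):
    """Return (host, rank, description) for a rank exit line, else None."""
    m = RANK_LINE.match(line.strip())
    if m is None:
        return None
    if m.group("sig") is not None:
        num = int(m.group("sig"))
        try:
            # Look up the platform's name for this signal number.
            name = signal.Signals(num).name
        except ValueError:
            name = "unknown"
        desc = f"signal {num} ({name})"
    else:
        desc = f"exit code {m.group('code')}"
    return m.group("host"), int(m.group("rank")), desc
```

Running this over the whole job output and looking for the first rank that did not die from signal 15 is one way to find which rsl.error.* file to read first.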

The difference with the second run is that some rsl files do have CFL errors in them:
rsl.error.4430:d02 2009-07-08_23:46:37+**/** 2 points exceeded v_cfl = 2 in domain d02 at time 2009-07-08_23:46:37+**/** hours
rsl.error.4430:d02 2009-07-08_23:46:37+**/** Max W: 465 1408 2 W: 45.38 w-cfl: 2.47 dETA: 0.01
rsl.error.4430:d02 2009-07-08_23:46:37+**/** 2 points exceeded v_cfl = 2 in domain d02 at time 2009-07-08_23:46:37+**/** hours
rsl.error.4430:d02 2009-07-08_23:46:37+**/** Max W: 465 1409 3 W: -104.11 w-cfl: 2.58 dETA: 0.01
rsl.error.4430:d02 2009-07-08_23:46:37+**/** 2 points exceeded v_cfl = 2 in domain d02 at time 2009-07-08_23:46:37+**/** hours
rsl.error.4430:d02 2009-07-08_23:46:37+**/** Max W: 465 1408 2 W: 81.05 w-cfl: 2.95 dETA: 0.01
rsl.error.4430:d02 2009-07-08_23:46:41+03/** 4 points exceeded v_cfl = 2 in domain d02 at time 2009-07-08_23:46:41+03/** hours
rsl.error.4430:d02 2009-07-08_23:46:41+03/** Max W: 465 1408 3 W: 97.01 w-cfl: 3.53 dETA: 0.01
rsl.error.4430:d02 2009-07-08_23:46:41+03/** 3 points exceeded v_cfl = 2 in domain d02 at time 2009-07-08_23:46:41+03/** hours
rsl.error.4430:d02 2009-07-08_23:46:41+03/** Max W: 465 1409 3 W: 81.00 w-cfl: 3.04 dETA: 0.01
rsl.error.4430:d02 2009-07-08_23:46:41+03/** 1 points exceeded v_cfl = 2 in domain d02 at time 2009-07-08_23:46:41+03/** hours
rsl.error.4430:d02 2009-07-08_23:46:41+03/** Max W: 465 1408 3 W: 103.90 w-cfl: 3.35 dETA: 0.01
rsl.error.4430:d02 2009-07-08_23:46:43+39/50 9 points exceeded v_cfl = 2 in domain d02 at time 2009-07-08_23:46:43+39/50 hours
rsl.error.4430:d02 2009-07-08_23:46:43+39/50 Max W: 465 1409 3 W: -37.25 w-cfl: 5.81 dETA: 0.01
rsl.error.4430:d02 2009-07-08_23:46:43+39/50 3 points exceeded v_cfl = 2 in domain d02 at time 2009-07-08_23:46:43+39/50 hours
rsl.error.4430:d02 2009-07-08_23:46:43+39/50 Max W: 465 1409 3 W: -125.83 w-cfl: 3.85 dETA: 0.01
rsl.error.4430:d02 2009-07-08_23:46:43+39/50 12 points exceeded v_cfl = 2 in domain d02 at time 2009-07-08_23:46:43+39/50 hours
rsl.error.4430:d02 2009-07-08_23:46:43+39/50 Max W: 465 1408 4 W: -571.49 w-cfl: 5.06 dETA: 0.01
rsl.error.4431:d02 2009-07-08_23:46:43+39/50 5 points exceeded v_cfl = 2 in domain d02 at time 2009-07-08_23:46:43+39/50 hours
rsl.error.4431:d02 2009-07-08_23:46:43+39/50 Max W: 466 1409 4 W: -413.53 w-cfl: 3.68 dETA: 0.01
rsl.out.4430:d02 2009-07-08_23:46:37+**/** 2 points exceeded v_cfl = 2 in domain d02 at time 2009-07-08_23:46:37+**/** hours
rsl.out.4430:d02 2009-07-08_23:46:37+**/** Max W: 465 1408 2 W: 45.38 w-cfl: 2.47 dETA: 0.01
rsl.out.4430:d02 2009-07-08_23:46:37+**/** 2 points exceeded v_cfl = 2 in domain d02 at time 2009-07-08_23:46:37+**/** hours
rsl.out.4430:d02 2009-07-08_23:46:37+**/** Max W: 465 1409 3 W: -104.11 w-cfl: 2.58 dETA: 0.01
rsl.out.4430:d02 2009-07-08_23:46:37+**/** 2 points exceeded v_cfl = 2 in domain d02 at time 2009-07-08_23:46:37+**/** hours
rsl.out.4430:d02 2009-07-08_23:46:37+**/** Max W: 465 1408 2 W: 81.05 w-cfl: 2.95 dETA: 0.01
rsl.out.4430:d02 2009-07-08_23:46:41+03/** 4 points exceeded v_cfl = 2 in domain d02 at time 2009-07-08_23:46:41+03/** hours
rsl.out.4430:d02 2009-07-08_23:46:41+03/** Max W: 465 1408 3 W: 97.01 w-cfl: 3.53 dETA: 0.01
rsl.out.4430:d02 2009-07-08_23:46:41+03/** 3 points exceeded v_cfl = 2 in domain d02 at time 2009-07-08_23:46:41+03/** hours
rsl.out.4430:d02 2009-07-08_23:46:41+03/** Max W: 465 1409 3 W: 81.00 w-cfl: 3.04 dETA: 0.01
rsl.out.4430:d02 2009-07-08_23:46:41+03/** 1 points exceeded v_cfl = 2 in domain d02 at time 2009-07-08_23:46:41+03/** hours
rsl.out.4430:d02 2009-07-08_23:46:41+03/** Max W: 465 1408 3 W: 103.90 w-cfl: 3.35 dETA: 0.01
rsl.out.4430:d02 2009-07-08_23:46:43+39/50 9 points exceeded v_cfl = 2 in domain d02 at time 2009-07-08_23:46:43+39/50 hours
rsl.out.4430:d02 2009-07-08_23:46:43+39/50 Max W: 465 1409 3 W: -37.25 w-cfl: 5.81 dETA: 0.01
rsl.out.4430:d02 2009-07-08_23:46:43+39/50 3 points exceeded v_cfl = 2 in domain d02 at time 2009-07-08_23:46:43+39/50 hours
rsl.out.4430:d02 2009-07-08_23:46:43+39/50 Max W: 465 1409 3 W: -125.83 w-cfl: 3.85 dETA: 0.01
rsl.out.4430:d02 2009-07-08_23:46:43+39/50 12 points exceeded v_cfl = 2 in domain d02 at time 2009-07-08_23:46:43+39/50 hours
rsl.out.4430:d02 2009-07-08_23:46:43+39/50 Max W: 465 1408 4 W: -571.49 w-cfl: 5.06 dETA: 0.01
rsl.out.4431:d02 2009-07-08_23:46:43+39/50 5 points exceeded v_cfl = 2 in domain d02 at time 2009-07-08_23:46:43+39/50 hours
rsl.out.4431:d02 2009-07-08_23:46:43+39/50 Max W: 466 1409 4 W: -413.53 w-cfl: 3.68 dETA: 0.01
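
The grep output above can be condensed with a short script. This is only a sketch: `worst_cfl` and the regular expression are hypothetical names written against the line format shown here, and the three integers after "Max W:" are assumed to be grid indices.

```python
import re

# Matches "Max W" lines as they appear in the grep output above, e.g.
# "rsl.error.4430:d02 <time> Max W: 465 1408 2 W: 45.38 w-cfl: 2.47 dETA: 0.01".
CFL_LINE = re.compile(
    r"^(?P<file>rsl\.(?:error|out)\.\d+):(?P<domain>d\d+)\s+(?P<time>\S+)\s+"
    r"Max W:\s+(?P<i>\d+)\s+(?P<j>\d+)\s+(?P<k>\d+)\s+"
    r"W:\s+(?P<w>-?\d+(?:\.\d+)?)\s+w-cfl:\s+(?P<cfl>\d+(?:\.\d+)?)"
)


def worst_cfl(lines):
    """Return the record with the largest w-cfl, or None if no line matches."""
    worst = None
    for line in lines:
        m = CFL_LINE.match(line.strip())
        if m is None:
            continue  # skip "points exceeded v_cfl" lines and anything else
        rec = {
            "file": m.group("file"),
            "domain": m.group("domain"),
            "time": m.group("time"),
            # Assumed to be grid indices of the reported point.
            "idx": (int(m.group("i")), int(m.group("j")), int(m.group("k"))),
            "w": float(m.group("w")),
            "cfl": float(m.group("cfl")),
        }
        if worst is None or rec["cfl"] > worst["cfl"]:
            worst = rec
    return worst
```

Feeding it the lines from `grep cfl rsl.*` gives the single worst point, which makes it easier to see whether the violations cluster at one location.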

As I mentioned, I have successfully run two other cases (three other models) with the same configuration, parameters, and even restarts without issue. These successful runs are located here:
/glade/derecho/scratch/kshourd/10Aug2020/new_precursor1km/WRF/test/em_real/
/glade/derecho/scratch/kshourd/12May2022/1km_convection/WRF/test/em_real/
/glade/derecho/scratch/kshourd/12May2022/1km_precursor/WRF/test/em_real/

Thanks in advance for any help!
 

Attachments

  • rsl.out.0000 (3.1 KB)
  • wrf_cv.o4541589.txt (239 KB)
  • wrf_pre.o4526758.txt (239 KB)
Hi,
Can you try setting max_step_increase_pct = 5, 51 in the &domains namelist record and then try again? I'm not sure this will do anything, but I'm curious. Thanks!
 
Hi @kwerner, and thanks for your reply. Unfortunately, this did not work. The precursor case (/glade/derecho/scratch/kshourd/8July2009/precursor/WRF/test/em_real) ran for about 10 min before being killed. No new wrfout files were produced, and the errors were the same (signal errors), although this time there were also some CFL errors where there were previously none:

rsl.error.4545:d02 2009-07-08_13:06:48 1 points exceeded v_cfl = 2 in domain d02 at time 2009-07-08_13:06:48 hours
rsl.error.4545:d02 2009-07-08_13:06:48 Max W: 52 1450 3 W: -2.74 w-cfl: 2.30 dETA: 0.01
rsl.error.4545:d02 2009-07-08_13:06:56 3 points exceeded v_cfl = 2 in domain d02 at time 2009-07-08_13:06:56 hours
rsl.error.4545:d02 2009-07-08_13:06:56 Max W: 52 1450 3 W: -1.40 w-cfl: 3.26 dETA: 0.01
rsl.error.4545:d02 2009-07-08_13:07:00 3 points exceeded v_cfl = 2 in domain d02 at time 2009-07-08_13:07:00 hours
rsl.error.4545:d02 2009-07-08_13:07:00 Max W: 52 1450 3 W: -46.00 w-cfl: 2.84 dETA: 0.01
rsl.error.4545:d02 2009-07-08_13:07:00 5 points exceeded v_cfl = 2 in domain d02 at time 2009-07-08_13:07:00 hours
rsl.error.4545:d02 2009-07-08_13:07:00 Max W: 52 1451 3 W: -110.54 w-cfl: 3.22 dETA: 0.01
rsl.error.4545:d02 2009-07-08_13:07:04 11 points exceeded v_cfl = 2 in domain d02 at time 2009-07-08_13:07:04 hours
rsl.error.4545:d02 2009-07-08_13:07:04 Max W: 52 1450 3 W: -3.67 w-cfl: 6.04 dETA: 0.01
rsl.error.4545:d02 2009-07-08_13:07:04 23 points exceeded v_cfl = 2 in domain d02 at time 2009-07-08_13:07:04 hours
rsl.error.4545:d02 2009-07-08_13:07:04 Max W: 52 1450 2 W: 80.48 w-cfl: 4.89 dETA: 0.01
rsl.error.4545:d02 2009-07-08_13:07:04 33 points exceeded v_cfl = 2 in domain d02 at time 2009-07-08_13:07:04 hours
rsl.error.4545:d02 2009-07-08_13:07:04 Max W: 53 1450 3 W: -232.63 w-cfl: 7.42 dETA: 0.01
rsl.error.4870:d02 2009-07-08_13:07:00 2 points exceeded v_cfl = 2 in domain d02 at time 2009-07-08_13:07:00 hours
rsl.error.4870:d02 2009-07-08_13:07:00 Max W: 200 1561 3 W: 8.21 w-cfl: 2.16 dETA: 0.01
rsl.error.4870:d02 2009-07-08_13:07:04 4 points exceeded v_cfl = 2 in domain d02 at time 2009-07-08_13:07:04 hours
rsl.error.4870:d02 2009-07-08_13:07:04 Max W: 199 1561 3 W: -68.55 w-cfl: 2.92 dETA: 0.01
rsl.error.4870:d02 2009-07-08_13:07:04 2 points exceeded v_cfl = 2 in domain d02 at time 2009-07-08_13:07:04 hours
rsl.error.4870:d02 2009-07-08_13:07:04 Max W: 200 1561 2 W: 217.85 w-cfl: 2.56 dETA: 0.01
rsl.error.4870:d02 2009-07-08_13:07:04 7 points exceeded v_cfl = 2 in domain d02 at time 2009-07-08_13:07:04 hours
rsl.error.4870:d02 2009-07-08_13:07:04 Max W: 200 1561 3 W: 67.17 w-cfl: 4.71 dETA: 0.01
rsl.out.4545:d02 2009-07-08_13:06:48 1 points exceeded v_cfl = 2 in domain d02 at time 2009-07-08_13:06:48 hours
rsl.out.4545:d02 2009-07-08_13:06:48 Max W: 52 1450 3 W: -2.74 w-cfl: 2.30 dETA: 0.01
rsl.out.4545:d02 2009-07-08_13:06:56 3 points exceeded v_cfl = 2 in domain d02 at time 2009-07-08_13:06:56 hours
rsl.out.4545:d02 2009-07-08_13:06:56 Max W: 52 1450 3 W: -1.40 w-cfl: 3.26 dETA: 0.01
rsl.out.4545:d02 2009-07-08_13:07:00 3 points exceeded v_cfl = 2 in domain d02 at time 2009-07-08_13:07:00 hours
rsl.out.4545:d02 2009-07-08_13:07:00 Max W: 52 1450 3 W: -46.00 w-cfl: 2.84 dETA: 0.01
rsl.out.4545:d02 2009-07-08_13:07:00 5 points exceeded v_cfl = 2 in domain d02 at time 2009-07-08_13:07:00 hours
rsl.out.4545:d02 2009-07-08_13:07:00 Max W: 52 1451 3 W: -110.54 w-cfl: 3.22 dETA: 0.01
rsl.out.4545:d02 2009-07-08_13:07:04 11 points exceeded v_cfl = 2 in domain d02 at time 2009-07-08_13:07:04 hours
rsl.out.4545:d02 2009-07-08_13:07:04 Max W: 52 1450 3 W: -3.67 w-cfl: 6.04 dETA: 0.01
rsl.out.4545:d02 2009-07-08_13:07:04 23 points exceeded v_cfl = 2 in domain d02 at time 2009-07-08_13:07:04 hours
rsl.out.4545:d02 2009-07-08_13:07:04 Max W: 52 1450 2 W: 80.48 w-cfl: 4.89 dETA: 0.01
rsl.out.4545:d02 2009-07-08_13:07:04 33 points exceeded v_cfl = 2 in domain d02 at time 2009-07-08_13:07:04 hours
rsl.out.4545:d02 2009-07-08_13:07:04 Max W: 53 1450 3 W: -232.63 w-cfl: 7.42 dETA: 0.01
rsl.out.4870:d02 2009-07-08_13:07:00 2 points exceeded v_cfl = 2 in domain d02 at time 2009-07-08_13:07:00 hours
rsl.out.4870:d02 2009-07-08_13:07:00 Max W: 200 1561 3 W: 8.21 w-cfl: 2.16 dETA: 0.01
rsl.out.4870:d02 2009-07-08_13:07:04 4 points exceeded v_cfl = 2 in domain d02 at time 2009-07-08_13:07:04 hours
rsl.out.4870:d02 2009-07-08_13:07:04 Max W: 199 1561 3 W: -68.55 w-cfl: 2.92 dETA: 0.01
rsl.out.4870:d02 2009-07-08_13:07:04 2 points exceeded v_cfl = 2 in domain d02 at time 2009-07-08_13:07:04 hours
rsl.out.4870:d02 2009-07-08_13:07:04 Max W: 200 1561 2 W: 217.85 w-cfl: 2.56 dETA: 0.01
rsl.out.4870:d02 2009-07-08_13:07:04 7 points exceeded v_cfl = 2 in domain d02 at time 2009-07-08_13:07:04 hours
rsl.out.4870:d02 2009-07-08_13:07:04 Max W: 200 1561 3 W: 67.17 w-cfl: 4.71 dETA: 0.01
 
Thanks for trying that. I notice that your dx/dy = 3000 and that your regular time_step variable is set to 54. The recommendation is that time_step (in seconds) be no larger than 6×dx (with dx in km), meaning that for dx = 3 km it should be set to 18 or smaller. Did you ever try a run with a smaller time_step (with or without the adaptive time step turned on)?
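
As a quick check of that arithmetic, here is a minimal Python sketch. The function names are hypothetical, dx is assumed to be given in metres as in the namelist, and the factor of 6 is the rule of thumb quoted above rather than a hard limit.

```python
def max_recommended_time_step(dx_m, factor=6.0):
    """Upper bound on time_step in seconds: factor * dx, with dx in km."""
    return factor * dx_m / 1000.0


def exceeds_recommendation(time_step_s, dx_m, factor=6.0):
    """True if time_step is larger than the rule-of-thumb bound."""
    return time_step_s > max_recommended_time_step(dx_m, factor)


# Values from this thread: dx = 3000 m, time_step = 54 s.
print(max_recommended_time_step(3000))   # 18.0
print(exceeds_recommendation(54, 3000))  # True
```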
 
I have not run with a smaller time_step, as I was under the impression that turning the adaptive timestep on made the time_step variable irrelevant. As I mentioned, these simulations each ran for about 10-13 hours, and the issue is now with the restart runs for some reason.
 
Apologies for the delay. You're right that if you use the adaptive time step, it should override your regular time_step value. To determine whether this is specific to restarts, can you run a test that is NOT a restart, from some time before the model stops to a bit after? If that run succeeds, we know the simulation is able to get past that point, which would tell us that the restart is, indeed, the issue.

I know you said you haven't tried to run with a lower time_step because you've always been using adaptive time-step, but I'm curious if you would be able to run this if you turned off adaptive time-step and then set the standard time_step to a value less than 6xDX.
 