
wrf.exe fails a few minutes after starting

ejones

Hi all,

I am trying to run a WRF simulation with 3 nests on Cheyenne. I have successfully run it for 12 hours with time_step under &domains in namelist.input set to 15s. However, when I make this longer (30s, 90s, 120s), it fails after a few minutes with the following message:

MPT ERROR: MPI_COMM_WORLD rank 413 [or another random number] has terminated without calling MPI_Finalize()

aborting job
MPT: Received signal 9

Checking the rsl files, I don't see an error on 0000 or 0413 (in this example) or the other corresponding ones. I have played around with trying a different number of cores to no avail. It seems only to work when time_step is 15s.

I am a little confused about the interpretation of time_step and how it fits in with the other domains. From other namelists and from looking online, I have seen that parent_time_step_ratio for 3 domains is set to 1, 3, 3 (which is the same as my parent_grid_ratio). Would this mean the outer domain is 15s, the middle is 5s, and the inner domain is 5/3 seconds? I would like the inner domain to be 15s, which I think would correspond to 45s for the middle and 135s for the outer? But when I try WRF with a time step other than 15s, I get the abort error above. So perhaps I am misinterpreting something, and it's somehow violating the CFL condition and causing the crash? But with the 15s time step, it takes about 1 hour of computing time per hour of simulation time, which seems slow. Or maybe this is a question for CISL?

Attached is my namelist.input file. Thanks for any insight on this (I am fairly new to all this).
 

Attachments

  • namelist.input.txt
    4.2 KB
Update on this: I tested it again and did find the following message in one of the rsl.error files:
42 points exceeded v_cfl = 2 in domain d01 at time 1999-09-16_00:04:30 hours

So I think it is something with the CFL condition. But I was using a time_step of 90, which, if my understanding is correct, shouldn't violate CFL: for a 27km grid the limit would be 27 * 6 = 162s, and 90 is well below that?
 
One final update: rsl.error.0552 shows the following:

d01 1999-09-16_00:04:30 36 points exceeded v_cfl = 2 in domain d01 at time 1999-09-16_00:04:30 hours
d01 1999-09-16_00:04:30 Max W: 6 401 27 W: -313.17 w-cfl: 2.78 dETA: 0.04
d01 1999-09-16_00:04:30 29 points exceeded v_cfl = 2 in domain d01 at time 1999-09-16_00:04:30 hours
d01 1999-09-16_00:04:30 Max W: 9 402 21 W: 76.36 w-cfl: 8.21 dETA: 0.03
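
When an rsl.error file holds many of these messages, a short script can help pull out the worst offenders. The following is a minimal sketch, assuming the lines follow the exact layout quoted above; the function name `parse_cfl_lines` and the regular expressions are illustrative, not part of WRF. The three integers after "Max W:" are treated as grid indices, and their order is not assumed here.

```python
import re

# Sample lines copied from the rsl.error.0552 excerpt quoted in the post.
SAMPLE = """\
d01 1999-09-16_00:04:30 36 points exceeded v_cfl = 2 in domain d01 at time 1999-09-16_00:04:30 hours
d01 1999-09-16_00:04:30 Max W: 6 401 27 W: -313.17 w-cfl: 2.78 dETA: 0.04
d01 1999-09-16_00:04:30 29 points exceeded v_cfl = 2 in domain d01 at time 1999-09-16_00:04:30 hours
d01 1999-09-16_00:04:30 Max W: 9 402 21 W: 76.36 w-cfl: 8.21 dETA: 0.03
"""

# Hypothetical patterns written against the layout above (an assumption).
COUNT_RE = re.compile(
    r"(\d+)\s+points exceeded v_cfl\s*=\s*(\S+)\s+in domain (d\d+) at time (\S+)"
)
MAXW_RE = re.compile(
    r"Max\s+W:\s+(\d+)\s+(\d+)\s+(\d+)\s+W:\s+(-?[\d.]+)"
    r"\s+w-cfl:\s+([\d.]+)\s+dETA:\s+([\d.]+)"
)


def parse_cfl_lines(text):
    """Return (counts, maxima) parsed from CFL diagnostic lines."""
    counts, maxima = [], []
    for line in text.splitlines():
        m = COUNT_RE.search(line)
        if m:
            counts.append(
                {"points": int(m.group(1)), "domain": m.group(3), "time": m.group(4)}
            )
            continue
        m = MAXW_RE.search(line)
        if m:
            maxima.append(
                {
                    # Grid indices as printed; their order is not assumed.
                    "indices": tuple(int(g) for g in m.group(1, 2, 3)),
                    "w": float(m.group(4)),
                    "w_cfl": float(m.group(5)),
                    "d_eta": float(m.group(6)),
                }
            )
    return counts, maxima


counts, maxima = parse_cfl_lines(SAMPLE)
worst = max(maxima, key=lambda rec: rec["w_cfl"])
print("total flagged points:", sum(c["points"] for c in counts))
print("largest w-cfl:", worst["w_cfl"], "at indices", worst["indices"])
```

For the four lines above, the largest w-cfl is 8.21 at indices (9, 402, 21). Running the same parser over every rsl.error.* file would show whether the failures keep clustering at the same grid points.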

I think it has something to do with vertical velocities, perhaps in mountainous regions? I did some googling and have tried adding "epssm = 0.2, 0.2, 0.2" and also tried 0.3 for each of these, which changes the number of points exceeded above but still produces the same failure. Attached are the rsl.out.0552 and rsl.error.0552 files for reference (it keeps failing on that one in each simulation attempt!), as well as an updated namelist.input. Thanks!
 

Attachments

  • namelist.input.txt
    4.2 KB
  • rsl.error.0552.txt
    18.9 KB
  • rsl.out.0552.txt
    8.8 KB
Thank you for your reply. I tried turning on w_damping and it did not work. I also tried 90s, 60s, 45s, and 30s, and none of those time_step values worked either. It still fails with the error:

xx points exceeded v_cfl = 2 in domain d01 at time xxxx hours (where xx and xxxx stand for the particular number of points and the time at which the CFL condition is apparently violated).

I still wonder if there is something with the vertical damping that is causing issues, since the time_step of 15s for the outer domain seems much too small for 6*27km? Even 90s is much lower than that. I may also be misinterpreting something with the namelist and how time_step works. Thanks for any other ideas and input I could try!
 
In my case, I use different parameters. But I think the best option to avoid the CFL errors would be changing to mp_physics=3, and a more brute-force approach would be increasing the epssm value.
 
Thanks for your reply. I tried this again, but it still crashed. Looking at the error, it appears to be associated with an instability near topography in Greenland (and, another time, over Canada). Both are right on the northern boundary of my outer domain. Do you have any recommendations for other modifications I could try with the namelist? Or should I modify the domain area (either by cutting those areas out or expanding it somehow)? Thanks again for your help.
 
@ejones, the time_step limit of 6xDX (time_step in seconds, DX in km) is okay, UNLESS you are running into CFL violations. If that happens, then it's necessary to reduce your time step, sometimes drastically. Take a look at this FAQ, which discusses CFL errors and ways to fix them. It specifically discusses issues along the boundary.
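
As a quick illustration of the 6xDX guideline, here is a minimal sketch that checks the time steps tried in this thread against it. The helper name `max_time_step_seconds` is hypothetical, and the 27 km grid spacing is taken from the earlier posts.

```python
def max_time_step_seconds(dx_km, factor=6.0):
    """Rule-of-thumb upper bound: time_step (s) <= factor * DX (km)."""
    return factor * dx_km


dx_km = 27.0  # outer-domain grid spacing mentioned in the thread
limit = max_time_step_seconds(dx_km)
print(f"6xDX limit for {dx_km:g} km: {limit:g} s")

# Time steps that were tried in this thread.
for dt in (15, 30, 45, 60, 90, 120):
    status = "within" if dt <= limit else "above"
    print(f"time_step = {dt:>3d} s is {status} the limit")
```

Every value tried here sits below 162 s, which is the point of the caveat above: the guideline is only an upper bound, and a run can still hit CFL violations well under it.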

Regarding your question about the time_step ratio: yes, the time_step value you give is for domain 01, and if you are using a 3:1 parent_time_step_ratio, then domain 02 is using a time_step of (d01 time_step)/(ratio).
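
The arithmetic can be sketched as follows. This is only an illustration, assuming each nest's parent is the domain listed just before it (d01 -> d02 -> d03); the function name `domain_time_steps` is hypothetical.

```python
from fractions import Fraction


def domain_time_steps(time_step, parent_time_step_ratio):
    """Time step per domain, given the d01 time_step and the ratios.

    Assumes a simple chain of nests where each domain's parent is the
    previous entry. The first ratio belongs to d01 and is ignored.
    """
    steps = [Fraction(time_step)]
    for ratio in parent_time_step_ratio[1:]:
        steps.append(steps[-1] / ratio)
    return steps


for dt in (15, 135):
    steps = domain_time_steps(dt, [1, 3, 3])
    pretty = ", ".join(f"d{n + 1:02d} = {float(s):.2f} s" for n, s in enumerate(steps))
    print(f"time_step = {dt}: {pretty}")
```

With time_step = 15 and ratios 1, 3, 3, this gives 15 s, 5 s, and 5/3 s for the three domains. A 15 s inner domain would need time_step = 135 on d01 (135 s, 45 s, 15 s), which is only usable if d01 stays free of CFL violations at that step.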

If you're able to get it running with smaller time steps, but it's running very slowly, you could also look into the adaptive time stepping option, which allows the model to use larger or smaller time steps as needed. I would recommend using the default values for min, max, etc.
 
Thank you very much for your feedback! I tried some of the options in the linked FAQ to no avail, unfortunately. I am going to try modifying my outer domain region since all of the failures are happening at high latitudes near steep topographical gradients. I will post an update here for what I can get to work. The adaptive time stepping option is also very interesting and I would like to look into that. Thanks again!
 