wrf.exe fails without error message

kyle_r
Hello,

I am attempting to run wrf.exe in parallel, but it keeps failing after 5-6 hours of wall time. real.exe ran successfully with the same specifications (mpirun -np 42). I've been unable to find anything in the rsl files that indicates why wrf.exe fails. It appears to be running in parallel with "Ntasks in X 6 , ntasks in Y 7". I have also tried 20, 30, and 60 cores, with the same result. To compile WRF, I first followed the directions in Full WRF and WPS Installation Example (GNU), then recompiled with the most recent stable mpich version (mpich-4.1.2), and most recently tried updated versions of all libraries, but I still get the same result. I'm not sure what else to try at this point. I have also attached one of my rsl files. Any help would be appreciated. Thank you.

Update
I've noticed that I get the following message in my rsl file:
**WARNING** Time in input file not equal to time on domain **WARNING**
**WARNING** Trying next time in file wrflowinp_d01 ...
Also, my wrfinput* files only have one time point (should have 48 for the entire simulation). I've checked namelist.input and start and end date/times are correct, and rsl* files show that real.exe completes successfully. I've also attached the real.exe rsl.error.0000 file (real_rsl.error.0000)
 

Attachments

  • rsl.error.0000
    972.1 KB · Views: 4
  • real_rsl.error.0000
    116.2 KB · Views: 1
Hi,
Can you package all of your rsl.* files from your wrf.exe run together into a single *.tar file and attach that? Will you also please attach your namelist.input file? And a couple other questions:
1) What is the wallclock limitation for your system?
2) Did you make any modifications to the wrf code, or is it pristine "out-of-the-box" code?
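
For anyone who prefers to script the packaging step, here is a minimal sketch in Python. It assumes the rsl.* files sit in the directory where wrf.exe was run. The function name `package_rsl_files` and the default archive name `rslfiles.tar` are made up for illustration and are not part of WRF.

```python
import tarfile
from pathlib import Path


def package_rsl_files(run_dir=".", archive_name="rslfiles.tar"):
    """Bundle every rsl.* file in run_dir into one uncompressed tar archive.

    Returns the archive path and the number of files added.
    Both names used here are hypothetical; adjust them to your setup.
    """
    run_path = Path(run_dir)
    # Collect the file list first so the archive itself is never picked up.
    rsl_files = sorted(p for p in run_path.glob("rsl.*") if p.is_file())
    archive_path = run_path / archive_name
    with tarfile.open(archive_path, "w") as tar:
        for path in rsl_files:
            # Store only the file name, not the full directory path.
            tar.add(path, arcname=path.name)
    return archive_path, len(rsl_files)
```

Calling `package_rsl_files("/path/to/run")` from the run directory should leave a single tar file containing rsl.error.* and rsl.out.* that can be attached or uploaded.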
 
Hi,

Sure. It was too large to post here, so I uploaded it to Nextcloud with the file name real_rslfiles.tar. The rsl* files are from running real.exe, which gives me only 1 output time in wrfinput*, although it shows that real.exe completed successfully. The namelist.input file was also uploaded. There are 5 domains listed, but I am only attempting 3 domains at this time (max_dom=3).
I have a 2-day wall clock limitation on our cluster.
I haven't made any modifications to the code.

Thank you
 
Hi,
The wrfinput* files are supposed to only have a single time. They are the initial condition files and are only valid at the initial time. I will need the rsl* files from the wrf.exe run, since that is when the model is failing. Can you please share those, instead?

To be clear, you have a wall clock time limit of 48 hours? Did you get an output file from your batch submission? If so, will you attach that, as well? Thanks!
 
Correct, I have 48 hours of wall clock time to run my simulation before the system will stop it. For the simulation files I am sending you, WRF failed after 4 hr 43 min. I included a wrfout* file for my innermost domain, and it contains 4 Times (it should contain 8 had the run not failed). The file name for the tar submitted to Nextcloud is kyle_r_rslfiles.tar. I specify my batch submission to output a "WRF.out" file, which is also included. I asked our HPC tech support, but they didn't seem to think the information in it was helpful. Thank you for your help.
 
Thanks for sending those. In your rsl files, I find the following CFL error messages:
Code:
rsl.error.0018:d01 2022-07-17_17:06:45           70  points exceeded v_cfl = 2 in domain d01 at time 2022-07-17_17:06:45 hours
rsl.error.0018:d01 2022-07-17_17:06:45 Max   W:     99     72      3 W:  114.84  w-cfl:   21.95  dETA:    0.01
rsl.out.0018:d01 2022-07-17_17:06:45           70  points exceeded v_cfl = 2 in domain d01 at time 2022-07-17_17:06:45 hours
rsl.out.0018:d01 2022-07-17_17:06:45 Max   W:     99     72      3 W:  114.84  w-cfl:   21.95  dETA:    0.01

Take a look at What is the most common reason for a segmentation fault? for information on how to overcome CFL errors.
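
If it helps to see which domains are affected and when the problem first appears, the messages above can be tallied with a short script. This is only a sketch: the regular expression is written against the message format quoted in this thread, and the names `summarize_cfl` and `scan_rsl_errors` are hypothetical helpers, not WRF utilities. It reads only rsl.error.* because, as in the listing above, the same message also appears in the matching rsl.out.* file.

```python
import re
from collections import Counter
from pathlib import Path

# Matches lines such as:
#   "... 70  points exceeded v_cfl = 2 in domain d01 at time 2022-07-17_17:06:45 hours"
CFL_PATTERN = re.compile(
    r"(?P<count>\d+)\s+points exceeded v_cfl\s*=\s*(?P<limit>[\d.]+)"
    r"\s+in domain\s+(?P<domain>d\d+)\s+at time\s+(?P<time>\S+)"
)


def summarize_cfl(lines):
    """Return (points flagged per domain, first flagged time per domain)."""
    points_per_domain = Counter()
    first_time = {}
    for line in lines:
        match = CFL_PATTERN.search(line)
        if match is None:
            continue
        domain = match.group("domain")
        points_per_domain[domain] += int(match.group("count"))
        # Keep only the first time seen for each domain.
        first_time.setdefault(domain, match.group("time"))
    return dict(points_per_domain), first_time


def scan_rsl_errors(run_dir="."):
    """Apply summarize_cfl to every rsl.error.* file in run_dir."""
    lines = []
    for path in sorted(Path(run_dir).glob("rsl.error.*")):
        with open(path, errors="replace") as handle:
            lines.extend(handle)
    return summarize_cfl(lines)
```

An empty result from `scan_rsl_errors()` only means that no message in this format was found. It does not show that the run is stable.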
 
Hello,

I've been working through some of the solutions provided in the segmentation fault help post, but my simulations are still failing. I reduced the time_step to 20, and I included ulimit -s unlimited in the batch script. I no longer get any CFL errors when I run grep cfl rsl.*, but the simulations still fail. I then found that some of the rsl* files included the following error messages:
Code:
rrtm: TBOUND exceeds table limit: reset    344.767
rrtm: TBOUND exceeds table limit: reset    360.409
rrtm: TBOUND exceeds table limit: reset    360.058
rrtm: TBOUND exceeds table limit: reset    361.988
I found a couple of posts discussing this message, so I reduced radt from 9 to 5, increased epssm from 0.1 to 0.5, and kept w_damping turned on. Unfortunately, the model kept failing, so I switched my radiation schemes from RRTM/Dudhia to RRTMG/RRTMG, but the simulations continue to fail. The only other error/warning message I've found is:
Code:
**WARNING** Time in input file not equal to time on domain **WARNING**
 **WARNING** Trying next time in file wrflowinp_d01 ...
From what I can tell, all of the wrfinput, wrfbdy, and wrflowinp files and namelist.input have the same dates, so I'm not sure why it gives me the warning.
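
In case it is useful when the TBOUND messages quoted earlier in this post come up again, here is a rough Python sketch that pulls out the number printed after "reset" and reports how many messages there were and the range of values. The pattern is based only on the four lines shown above, and `tbound_values` and `summarize_tbound` are names I made up for illustration.

```python
import re

# Matches lines such as:
#   "rrtm: TBOUND exceeds table limit: reset    344.767"
TBOUND_PATTERN = re.compile(
    r"rrtm: TBOUND exceeds table limit: reset\s+([0-9]+(?:\.[0-9]+)?)"
)


def tbound_values(lines):
    """Return the value printed after 'reset' for each TBOUND message."""
    values = []
    for line in lines:
        match = TBOUND_PATTERN.search(line)
        if match is not None:
            values.append(float(match.group(1)))
    return values


def summarize_tbound(lines):
    """Return message count and min/max value, or None if nothing matched."""
    values = tbound_values(lines)
    if not values:
        return None
    return {"count": len(values), "min": min(values), "max": max(values)}
```

Passing the lines of an rsl file, for example `summarize_tbound(open("rsl.error.0000", errors="replace"))`, would show whether the messages are isolated or frequent.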
 
Hi Kyle,
I'm glad to hear you were at least able to get past the CFL errors. Since the new error is different, would you mind posting it as a new thread? This helps keep the posts more readable in the future. Thanks!
 