Silent WRF fail

smeech84

New member
Usually I am able to detect WRF runtime errors like CFL violations because the rsl error and out log files print a breadcrumb trail of what may have gone wrong. If a problem occurs, I can grep for keywords like "error", "fatal", or "cfl" and find the problem in these logs.
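
A Python version of that check, just as a sketch (the keywords and file patterns are simply the ones I happen to look for):

```python
import glob

# Keywords I look for in the rsl logs (case-insensitive)
KEYWORDS = ("error", "fatal", "cfl")

# Scan every rsl.error.* and rsl.out.* file in the run directory
for path in sorted(glob.glob("rsl.error.*") + glob.glob("rsl.out.*")):
    with open(path, errors="replace") as f:
        for lineno, line in enumerate(f, start=1):
            if any(word in line.lower() for word in KEYWORDS):
                print(f"{path}:{lineno}: {line.rstrip()}")
```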

It is hard, though, when WRF fails silently.

I have a WRF run that is failing at *almost* the same point in the simulation every time I restart it (within about 7 seconds of the previous failed jobs). When it fails, there are no messages in the rsl files that show me what went wrong.
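
In case it helps, this is roughly how the last simulated time can be pulled out of rsl.out.0000 (a sketch only; it assumes the standard "Timing for main: time ..." lines are being written):

```python
# Report the last model time a run reached, based on the
# "Timing for main: time ..." lines WRF writes to rsl.out.0000.
last_time = None
with open("rsl.out.0000", errors="replace") as f:
    for line in f:
        # e.g. "Timing for main: time 2019-10-12_05:06:11 on domain 2: ..."
        if "Timing for main" in line and " time " in line:
            last_time = line.split(" time ")[1].split()[0]

print("last completed step:", last_time)
```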

I have been troubleshooting with the CISL helpdesk for the Cheyenne cluster, and they don't see anything obviously wrong with the job setup. I have other WRF jobs with identical namelists, except for the time frame, that have completed successfully.

Am I overlooking other keywords in the rsl files that could point me to the problem? Has anyone else encountered a "silent" WRF fail? Why would WRF not write an error to the log?

WRF version 3.9.1

Thanks for your help!
 
Hi,
As we also have access to Cheyenne, can you point me to your running directory (and any other relevant directories) so that I can take a look? Thanks!
 
Thanks for the reply!

Some extra information:
The parent job launched 6 sub-jobs that differ only in their dates, spanning the month of October 2019:
/glade/scratch/smeech/wrf_workflow/tests/workflow_output/NGIC_20191001_50vert/

Here is one of the troublesome directories:
/glade/scratch/smeech/wrf_workflow/tests/workflow_output/NGIC_20191001_50vert/20191011000000_20191017000000/wrfjob

Three of the parent date ranges finished, but three of them silently failed. The subdirectory I pasted above is an example of the silent fail. I tried restarting the job completely, starting from one of the two most recent WRF restart files, and also modifying the launch script with the help of the CISL helpdesk (the file named 'runscript' in the working directory).

Every time I relaunch with new settings, the simulation gets to about 2019-10-12_05:06:11 and silently quits without any errors in the log.

Thanks again for looking into this. We have been pretty puzzled about it.
 
Thanks for sending those. I took a look, and you're right: it's not evident what caused the stop. However, there are a few things I noticed in your namelist and setup that should be adjusted.

1) Your domains are too small. We suggest every domain be at least 100x100, if not larger, to provide reasonable results. Take a look at the best-practice WPS namelist page, which gives several domain set-up suggestions. If you're interested, there is a page for namelist.input as well.
2) radt should be set to 10 for all domains. You shouldn't modify it based on the resolution of each domain; only the resolution of d01 should be used to set it.
3) Your time_step is too large. It should be no larger than 6*DX (with DX in km). I see you are using adaptive time stepping, so if you are able to maintain stability (which it seems you are, since I don't see any CFL errors), then it is probably okay, but it may be worth a quick test with a smaller time step just to make sure that doesn't make a difference (see the sketch after this list for the arithmetic).
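
As a quick illustration of the radt and time_step rules of thumb (the DX value in the sketch is a placeholder, not taken from your namelist):

```python
# Placeholder value: parent-domain (d01) grid spacing in km.
dx_d01_km = 10.0

# Rule of thumb: time_step (seconds) should be no larger than about 6 * DX (km).
print(f"max recommended time_step: {6 * dx_d01_km:.0f} s")

# Rule of thumb: radt (minutes) is roughly 1 minute per km of d01 DX,
# and the same value is used for every domain.
print(f"suggested radt for all domains: {round(dx_d01_km)} min")
```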

Other than those changes, I'd suggest trying to do some debugging to figure out what is causing the problem.
  • I'd start by testing this with a newer version of WRF (the latest is v4.2.1).
  • If you've modified the code in any way, try a clean (unmodified) version of the code to make sure you didn't introduce any errors with the mods.
  • You could try a basic (default) namelist, only modifying the dates, domain size, etc. necessary for your run, to see whether that runs. If so, then you know it's a setting in your namelist, and you can work to narrow it down (see the namelist-diff sketch after this list).
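
For that last point, since you have nearly identical namelists that did complete, one way to narrow things down is to diff a working namelist against a failing one. A minimal sketch, assuming the f90nml Python package is available and using placeholder paths:

```python
import f90nml  # pip install f90nml

# Placeholder paths: a namelist from a run that completed and one that failed
good = f90nml.read("good_run/namelist.input")
bad = f90nml.read("failed_run/namelist.input")

# Report any setting that differs between the two namelists
for section in sorted(set(good) | set(bad)):
    g, b = good.get(section, {}), bad.get(section, {})
    for key in sorted(set(g) | set(b)):
        if g.get(key) != b.get(key):
            print(f"&{section} {key}: good={g.get(key)!r}  failed={b.get(key)!r}")
```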
 
I have run a few big cases (the grid is ~900 x 800 x 41 in D02 for 9-3 km nesting) on Cheyenne using 16 nodes. The jobs quickly failed with a segmentation fault. However, when I reduced the number of nodes to 4, all of the failed cases completed successfully, although very slowly. This is the situation for all WRF versions from V3.9.1 to V4.1.2. I don't know the reason yet.

Can you try rerunning your failed cases with a reduced number of nodes? Please let us know whether any of the failed cases run successfully with a different number of nodes.
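
One thing that changes with the node count is how many grid points each MPI task gets. A rough back-of-the-envelope sketch (it assumes 36 MPI tasks per Cheyenne node and a near-square decomposition; the grid size is from my case, not yours):

```python
import math

# Horizontal grid dimensions of D02 in my 9-3 km case (not the failing runs)
nx, ny = 900, 800

for nodes in (16, 4):
    tasks = nodes * 36           # assumes 36 MPI tasks per Cheyenne node
    # Approximate a near-square task layout
    px = int(math.sqrt(tasks))
    while tasks % px:
        px -= 1
    py = tasks // px
    print(f"{nodes:2d} nodes ({tasks} tasks): patch ~ {nx // px} x {ny // py} points")
```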
 
Interestingly enough, I finally got these runs to continue past their failure points by manually reducing the time step to 10 or lower. Adaptive time stepping was turned on previously, but it did not stop the runs from failing, presumably from a CFL violation. There was no indication that the previous failed runs were due to CFL violations (remember, nothing was printed in the logs), but I can't see how it could have been anything else, since reducing the time step seemed to fix it.
 