WRF v4 reanalysis runs dying

RCarpenter

I am using WRF v4 to run yearly reanalyses on a nested grid, and am having difficulties that I hope you can help me diagnose. I’m attaching the namelist.wps and namelist.input files as well as a plot of the domain. The grid is 27/9/3 km, with the outer grid covering W CONUS and the adjacent Pacific and the inner grid covering most of California. The problem is that the runs are dying without giving any error messages. The rsl files stop growing, yet the job continues to run without producing further output. Unfortunately, this makes it difficult to diagnose the underlying problem. I have tried many different options, but perhaps not the right set. Let me start with some observations:

  • I am using CFSv2 reanalysis data. The runs are initialized on July 29 of various years. The runs tend to crash at various times, but generally run for at least several days before the first crash.
  • The runs die without producing any error output. (Actually they continue to run without producing any further output.) I have searched the logs for statements about CFL errors but have not found anything. When running with debugging on, in one instance one processor’s last call was to SFCLAY.
  • The problems do not seem to be related to the computing system. I have tried varying the number of processors, quilting, etc. The runs often fail in the middle of an hour, indicating it is not an I/O issue. System logs do not indicate any problems, and a host of other unrelated WRF runs do not have any issues.
  • In most cases there is a strong dynamical feature near the time of the crash that I suspect is causing numerical instability. Near the outer grid boundary, the culprits have included small-scale convection producing vertical velocities of ±6 m/s, a hurricane, and strong upper-level jet gradients. On the 3-km grid, strong vertical motion (±6 m/s) is noted over steep terrain in the lee of the Sierra Nevada and other Southern California mountain ranges.
  • After a crash, the runs are restarted. Sometimes they are restarted without changing any dynamics settings, and sometimes (but not always), the runs proceed past the point where they initially failed. Other times, I have changed the settings (usually to reduce the timestep), and the runs will proceed for a while. But they always fail at some point.
  • I am using adaptive timestepping. I have adjusted some of the settings, including setting target_cfl and target_hcfl to small values (0.3 and 0.2). I've also tried adjusting max_step_increase_pct (25, 51, 51) and setting adaptation_domain to one of the nests (see the &domains fragment after this list). None of these solved the underlying problem, although they did reduce the timestep to 81/27/9 at times. A test with a fixed timestep of 120 s also failed.
  • These failures have been noted with the YSU scheme (bl_pbl_physics=1, sf_sfclay_physics=1, sf_surface_physics=2). When the MYJ scheme is used (2, 2, 2), the runs do not crash – they continue for a full 365 days.
  • Similarly, real-time runs using the YSU scheme do not crash. These runs have the same grid and same physics/dynamics options, but are initialized with 0.25° GFS.
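For reference, the adaptive-timestep portion of my &domains settings looks roughly like the fragment below. This is an illustrative sketch rather than the attached namelist; the per-domain values echo the numbers mentioned above, and the "!" comments are annotations for the reader, not part of a working namelist.input.

    &domains
     use_adaptive_time_step = .true.,
     step_to_output_time    = .true.,
     target_cfl             = 0.3,  0.3,  0.3,
     target_hcfl            = 0.2,  0.2,  0.2,
     max_step_increase_pct  = 25,   51,   51,    ! also tested 5, 51, 51
     starting_time_step     = -1,   -1,   -1,    ! -1 = let WRF pick the starting step
     max_time_step          = 162,  54,   18,    ! placeholder caps of ~6 x dx (km)
     min_time_step          = 27,   9,    3,     ! placeholder floors
     adaptation_domain      = 1,                 ! also tested with a nest here
    /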

Some general questions:
  • Why would a run with MYJ succeed while YSU runs fail?
  • Why would real-time YSU runs succeed while reanalysis runs fail?
  • Are these failures consistent with numerical instability, even though no CFL violations are reported?

And now some questions related to various options:

  1. In the User’s Guide there is a recommendation for boundary relaxation for regional climate runs (a wider relaxation zone with an exponential decay factor). Are these settings also recommended for long runs involving nested grids (see the sketch after this list)? It seems that what may be optimal for the outermost grid may not be optimal for the nests.
  2. If I understand correctly, setting diff_6th_slopeopt=1 turns OFF 6th-order diffusion in steep terrain. What would you recommend for this option (because of the very steep terrain that exists on the 3-km grid)?
  3. I have observed that a run that has been restarted following a crash will sometimes continue past the point of the crash, even if there have been no changes to the namelist other than those specifically related to restarting. Could it be that restarts with adaptive timestepping aren’t bit-reproducible?
  4. Can you explain the interplay between target_[h]cfl and adaptation_domain? What happens if the target CFL is exceeded on grid 3 but the adaptation_domain is grid 1?
  5. I have observed cases where the 3-km grid timestep is half that of the 9-km grid, even though parent_time_step_ratio = 1, 3, 3. In other words, the ratio is 2:1 instead of the expected 3:1. For instance, the timesteps might be 81/27/13.5. Is this to be expected?
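To make question 1 concrete, the kind of boundary settings I am asking about would look something like this (illustrative values only, not what I am currently running; "!" comments are annotations):

    &bdy_control
     spec_bdy_width = 10,       ! wider than the default of 5
     spec_zone      = 1,
     relax_zone     = 9,        ! spec_bdy_width - spec_zone
     spec_exp       = 0.33,     ! exponential decay across the relaxation zone
     specified      = .true.,
    /

And for question 2, the relevant &dynamics entries would be along these lines:

    &dynamics
     diff_6th_opt      = 2,    2,    2,
     diff_6th_factor   = 0.12, 0.12, 0.12,
     diff_6th_slopeopt = 1,    1,    1,      ! 1 = switch 6th-order diffusion off over steep slopes
     diff_6th_thresh   = 0.10, 0.10, 0.10,   ! slope threshold (placeholder value)
    /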

Apologies for the long letter, and thank you for your help.
 

Attachments

  • namelist.input (3.7 KB)
  • namelist.wps (981 bytes)
  • domain map.png (221.1 KB)
  • domain map 3.png (285.6 KB)
The symptoms you are describing sound like you have NaNs in some field which eventually spread to a location that causes a fatal crash. The most likely suspect is a missing or corrupt soil temperature or moisture value. Download and compile the read_wrf_nc program from the WRF site. Run "read_wrf_nc wrfinput_d01 -m" and look for suspicious values in any of the variables. (Run it for all 3 domains if necessary). If everything looks good, then you'll have to find when the NaNs first appear in the run. It could be hours before the actual crash. Use read_wrf_nc or ncview to look for NaNs.

I would start with the 27-km run only and see if that crashes or not. It would help to know what domain is causing the problem.

Also, adaptive time stepping should only be used for real-time runs, not 'science' runs. Adaptive time step runs are not reproducible. It's also not appropriate to use them on nested runs for the reasons you mentioned.

General answers:
A) Aspects of the YSU scheme could be more sensitive to the presence of NaNs than MYJ.
B) There may be some flaw in the CFSv2 input fields that is not present in the GFS fields. That's why the input files need to be carefully checked.
C) Yes. It could be something like WSM6 being unstable for the long time steps used on the 27-km grid. Once NaNs appear in a run, they can cause a crash without triggering a CFL error.
 
Please look at your wrfout files saved right before the model crashed. If everything looks fine, then please restart the failed case, save wrfout at every time step, and check these files to see whether you can find weird values/patterns, including NaNs. The first time/first point where unreasonable values appear will be a good starting point for tracing back what is wrong.
If the same case can run when driven by GFS but fails when driven by CFSv2, that may imply that the CFSv2 input data include certain errors. However, I am not 100% sure.

Please keep us updated if you find something wrong. Thanks.

Ming Chen
 
I've rerun with a 90-second fixed time step, but it still fails. I am attaching the summary from read_wrf_nc at the last time step on grid 3 before the run fails. I don't see anything unusual. I've also looked at some of the fields in ncview and don't see anything unusual there either. Other tests with only 2 grids did not fail. I see that 4.0.1 is out and I'll give that a try.
 

Attachments

  • stats.wrfout_d03_2017-08-23_131020.txt (16.1 KB)
I've done further testing with v3.9.1 and with longer forecasts using GFS data. In both cases the runs with YSU fail. In the 3.9.1 case, a similar run with MYJ succeeded. The runs initialized with GFS and without grid or spectral nudging fail after about 7 days. I've also tried runs with reduced optimization (Intel: keeping -O3 but not -xAVX or -xHOST).

So it appears that any long-range (more than a few days) YSU run fails. This happens with GFS and CFSR/CFSv2 data, and with 3.9.1, 4.0, and 4.0.1. I still do not see anything unusual in the logged output or in the data.
 
Without any error information, it is hard for me to tell why your case failed. We haven't made any changes to YSU in quite a while, which makes it puzzling that this scheme could lead to a failure.
I would like to confirm that you ran the case using WRF v3.9.1, with initial and boundary conditions derived from GFS.
Please upload your wrfinput, wrfbdy, and wrflowinp files so that I can repeat your case. I suppose the namelist.input in your 9/28 post is the one I should use. Please confirm this is correct, or upload the namelist.input that I should use to repeat your case.

Ming Chen
 
I don't have a 3.9.1 + GFS case, but I do have a 4.0 + GFS case. Will that work? I am trying to upload to the Nextcloud account but I do not have permission.
 
WRF v4.0 + GFS will also work. I will ask our computer manager about the permission issue.

Ming Chen
 
Hi,
We had a server change for the Nextcloud uploading service, and therefore the link changed. If you go back to the home page of the forum and refresh the page, the link should be updated and you should be able to upload your file(s) now. I apologize for the inconvenience.
 
I will look into this issue and get back to you once I know for sure what is wrong. It may take some time. Thank you for your patience.

Ming Chen
 
I repeated your case with a few changes in the namelist.input. Details are given below:

(1) time_step is changed from 180 to 150, because the maximum time step (in seconds) should be no more than about 6 x the grid distance (in km) of the outermost domain
(2) radt is changed from 6 to 27; it should be set based on the resolution (in km) of the outermost domain
(3) nio_groups = 4 and nio_tasks_per_group = 2 are changed to nio_groups = 1 and nio_tasks_per_group = 0, i.e., I/O quilting is turned off, because I personally don't like the I/O quilting option
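
In namelist terms, the three changes amount to something like this (an illustrative fragment; everything else is left as in your file, and the "!" comments are annotations only):

    &domains
     time_step = 150,            ! was 180; keep at or below ~6 x dx (km) of the 27-km grid
    /

    &physics
     radt = 27, 27, 27,          ! was 6; set following the outermost (27-km) grid spacing
    /

    &namelist_quilt
     nio_tasks_per_group = 0,    ! was 2; quilting off
     nio_groups          = 1,    ! was 4
    /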

This case ran to completion successfully.

Ming Chen
 
Did you run a case with the original namelist? It would be good to know if you can repeat the crashing, and what is causing it to crash.

I'll try running with your settings and see if I have success.
 
I didn't run with exactly the same namelist.input as yours, because I noticed that some of the settings are not that reasonable.

Ming Chen
 
If I run with the settings you suggest, the run succeeds. However, other cases with longer run times fail. This is why I asked you to rerun exactly the case I submitted, to see if you could reproduce the failure. Regarding the settings:

  • Note that the case I submitted used adaptive time stepping, so the value of time_step should be ignored. If I try the options you suggest and turn adaptive time stepping off, the run fails.
  • I have tried varying time step options, including small fixed time steps. I've tried adaptive time steps with small CFL limits. The runs always fail.
  • Why do you say that radt should follow the outermost domain resolution? The User's Guide says only that it should be the same on each domain. 27 minutes is a long time for the sun to be still! Large radt can lead to significant oscillations in the 2-m temperature.
  • I've tried numerous tests with quilting on and off, and they always fail. Is there some reason why quilting shouldn't work? In my case it improves the run time by 40%, which is important because these jobs will be running for several weeks.

After many tests, the only constant is that YSU runs longer than a few days always eventually fail. Runs with MYJ, for instance, never have this problem. I never see any CFL or other warning messages. This makes me think that there are NaNs in the grid somewhere. Could you suggest where to look for those NaNs?

Thank you.
 
The radiation scheme is computationally expensive, which is why I prefer not to call it frequently. radt can be different for different domains, but for two-way nests it is recommended to be the same for all domains. You can set it following the resolution of your innermost domain, which is completely fine.

We have run many cases with the YSU scheme, both for weather forecasting and for climate simulation. We are not aware of any problem in this scheme. Please keep us updated if you find anything wrong with it.

Quilting is an option not frequently used. However, it does work. What makes you think this option caused the failure?

As for identifying the NaN values, please save wrfout at every time step before the model crashes. Then you can use ncview to take a quick look at these files and figure out when and where the first NaN values appear.
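
For example, a restart shortly before the crash with very frequent history output could be set up roughly as below (a sketch only; the 90-second interval assumes the fixed 90 s time step you mentioned, and you would point the run at your latest wrfrst files):

    &time_control
     restart            = .true.,
     restart_interval   = 1440,         ! minutes; keep writing restart files
     history_interval_s = 90, 90, 90,   ! one history frame per (fixed) time step
     frames_per_outfile = 1,  1,  1,    ! one time level per wrfout file
    /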

Ming Chen
 
Ming, we've tried adjusting every setting we can think of. The one constant we observe is that runs involving YSU over multiple days always die. (I've changed the subject to reflect this.) Another group running on different hardware with a different compiler version also experiences this problem. Some of the settings in the namelist may not be optimal, but we believe they are reasonable, so it would be a big help to us if you could rerun the case we provided exactly as-is, using exactly the provided namelist. Thank you.
 
Can you rerun one of your failed cases with sf_sfclay_physics = 91 instead of 1?

Especially for longer runs, if you are not using adaptive time stepping, a more conservative time step should be used. If you are using adaptive stepping, I'd suggest setting max_step_increase_pct for domain 1 to 5%, as recommended.
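
Concretely, the suggested changes would be along these lines (an illustrative sketch; three domains assumed, other settings unchanged):

    &physics
     sf_sfclay_physics = 91, 91, 91,     ! old MM5 surface layer; option 1 is the revised MM5 scheme
    /

    &domains
     max_step_increase_pct = 5, 51, 51,  ! 5% on domain 1, as recommended for adaptive stepping
    /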
 
Can you upload a restart file? I would like to repeat your case from the restart file to save some run time.

Ming Chen
 
I'll work on creating restart files. In the meantime, I've run a few more long-range tests. Using sf_sfclay_physics = 91 succeeds. Using bl_pbl_physics = 5 with sf_sfclay_physics = 5 succeeds. However, bl_pbl_physics = 5 with sf_sfclay_physics = 1 fails. The time to failure is anywhere from 5 to 30+ days. This seems to isolate the problem to sf_sfclay_physics = 1 and long-duration runs. I wonder if there is some sort of counter that fills up?
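
In namelist terms, the combination that fails (with either PBL choice) is the one that includes surface-layer option 1, roughly:

    &physics
     bl_pbl_physics     = 1, 1, 1,     ! YSU; also fails with MYNN (5) when paired with sfclay option 1
     sf_sfclay_physics  = 1, 1, 1,     ! revised MM5 surface layer: the common factor in the failures
     sf_surface_physics = 2, 2, 2,     ! Noah LSM, as in the original runs
    /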

In answer to the question about time steps, I have tried many variations, including short fixed time steps (60 sec). The case I submitted has max_step_increase_pct = 5, 51, 51, and I've tried other values as well.

I am also looking into an issue where specifying bucket_mm=100 fails immediately. Setting bucket_J=1e9 succeeds. Are there any known issues with these bucket parameters?
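
For reference, the two bucket settings being compared are (illustrative &physics lines):

    &physics
     bucket_mm = 100.,    ! precipitation accumulation bucket (mm); this setting fails immediately here
     bucket_J  = 1.e9,    ! radiative energy accumulation bucket (J); this setting works
    /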
 