WRF crash with ERA5 forcing for April/May, but FNL works

ywang32

New member
Hello,

I am running WRF over a nested Southeast Asia domain and encountered a persistent crash when using ERA5 forcing for April and May 2024. The same setup worked for February and March 2024, which makes the issue confusing.

Model/domain setup:
- 3 nested domains
- d01: 50 km
- d02: 10 km
- d03: 2 km
- parent_grid_ratio = 1, 5, 5
- parent_time_step_ratio = 1, 5, 5
- e_vert = 45
- p_top_requested = 5000 Pa
- Original time_step = 150 s
- Original cu_physics = 1, 1, 1
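
For reference, the per-domain time steps implied by this setup can be worked out with a few lines of Python. This is only a sketch: it assumes d03 is nested inside d02 (parent_id is not given in the post), and it uses the common rule of thumb that time_step in seconds should not exceed about 6 times dx in km, which is a guideline rather than a stability guarantee.

```python
# Grid spacing (km) and nesting ratios taken from the setup listed above.
dx_km = [50.0, 10.0, 2.0]
parent_time_step_ratio = [1, 5, 5]
time_step_d01 = 150.0  # seconds

def domain_time_steps(dt_outer, ratios):
    """Return the time step (s) of each domain, dividing by each nest's ratio in turn.

    Assumes a simple chain d01 -> d02 -> d03 (not stated in the post).
    """
    steps = []
    dt = dt_outer
    for ratio in ratios:
        dt = dt / ratio
        steps.append(dt)
    return steps

steps = domain_time_steps(time_step_d01, parent_time_step_ratio)
for name, dx, dt in zip(["d01", "d02", "d03"], dx_km, steps):
    limit = 6.0 * dx  # rule-of-thumb upper bound, not a guarantee of stability
    print(f"{name}: dx={dx:g} km, dt={dt:g} s, 6*dx={limit:g} s")
```

With time_step = 150 s, every domain runs at half of the 6*dx guideline, so the original time step is not obviously too long for these grid spacings.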

The model crashes around 2024-04-28_12:00. With the original setup, the error appears in the Kain-Fritsch cumulus scheme:

WOULD GO OFF TOP: KF_ETA_PARA I,J,DPTHMX,DPMIN
136 127 NaN 5000.00000

Backtrace includes:
module_cu_kfeta_MOD_kf_eta_para
module_cu_kfeta_MOD_kf_eta_cps
module_cumulus_driver
solve_em
module_integrate

The MPI error is a segmentation fault, for example:

mpiexec noticed that process rank XX exited on signal 11 (Segmentation fault)

I also tried:
1. Reducing time_step from 150 to 120
- Still crashed with similar KF_ETA_PARA / NaN issue.

2. Turning off cumulus on d03:
cu_physics = 1, 1, 0
- Still crashed, again in KF, but at a different I/J point.

3. Turning off cumulus on all domains:
cu_physics = 0, 0, 0
- The crash moved from KF to the RRTM longwave radiation scheme.

The new backtrace includes:
module_ra_rrtm_MOD_taugas
module_ra_rrtm_MOD_gasabs
module_ra_rrtm_MOD_rrtm
module_ra_rrtm_MOD_rrtmlwrad
module_radiation_driver
solve_em

This suggests the problem may not be only the cumulus scheme, but possibly an unstable or bad model state around 2024-04-28_12:00.

I checked the ERA5 met_em files around the crash time:
- met_em.d01/d02/d03.2024-04-28_12:00:00.nc exist
- Times/header look normal
- cdo infon did not show obvious NaN values
- The original crash grid point on d03 is around lat ~17.4 N, lon ~101.0 E, with terrain height roughly 700–1100 m, so it is a mountainous area.

The most interesting test is:
- ERA5 + my GNU-compiled WRF build: crashes for April/May
- ERA5 + another GNU-compiled WRF build: crashes for April/May
- FNL + the other GNU-compiled WRF build: runs successfully

ERA5-driven WRF runs for February and March worked, but April and May crash. FNL works for the same general period.

So I am wondering whether this points to:
1. A problem in my ERA5/WPS/real.exe processing for April/May;
2. A bad or inconsistent ERA5 field around 2024-04-28_12:00;
3. ERA5 producing sharper meteorological gradients than FNL, causing WRF instability over mountainous Southeast Asia;
4. Some known issue with ERA5 surface/soil/SST/SKINTEMP/PSFC/PMSL/RH processing in WPS.

My planned next checks are:
- Compare ERA5 and FNL met_em files at 2024-04-28_12:00.
- Compare April ERA5 met_em/wrfinput ranges against successful March ERA5 files.
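
A rough sketch of how such a range comparison could be scripted is shown below. The helper names (summarize, compare_fields, load_fields) are made up for illustration, the numbers in the demo are invented, and load_fields assumes the netCDF4 and numpy Python packages are installed. PSFC and SKINTEMP are used because they are among the fields named above.

```python
import math

def summarize(values):
    """Return (min, max, nan_count) over an iterable of floats, ignoring NaNs for min/max."""
    lo, hi, nans = math.inf, -math.inf, 0
    for v in values:
        if math.isnan(v):
            nans += 1
        else:
            lo = min(lo, v)
            hi = max(hi, v)
    return lo, hi, nans

def compare_fields(reference, candidate):
    """Flag fields in `candidate` that contain NaNs or leave the range seen in `reference`."""
    report = {}
    for name in reference:
        if name not in candidate:
            report[name] = "missing in candidate"
            continue
        ref_lo, ref_hi, _ = summarize(reference[name])
        lo, hi, nans = summarize(candidate[name])
        if nans:
            report[name] = f"{nans} NaN value(s)"
        elif lo < ref_lo or hi > ref_hi:
            report[name] = f"range [{lo:g}, {hi:g}] outside reference [{ref_lo:g}, {ref_hi:g}]"
    return report

def load_fields(path, names):
    """Read the named variables from a NetCDF file as flat lists (needs netCDF4 and numpy)."""
    import numpy as np
    from netCDF4 import Dataset
    with Dataset(path) as ds:
        return {n: np.ma.filled(ds.variables[n][:].astype("f8"), np.nan).ravel().tolist()
                for n in names if n in ds.variables}

# Demo with invented numbers; real use would call load_fields() on two met_em files.
march = {"PSFC": [90000.0, 101000.0], "SKINTEMP": [285.0, 310.0]}
april = {"PSFC": [90500.0, 100800.0], "SKINTEMP": [290.0, float("nan")]}
print(compare_fields(march, april))
```

A flag from this check is only a prompt to look closer: April is expected to be warmer than March, so a slightly wider SKINTEMP range is not by itself evidence of bad data. Comparing ERA5 against FNL at the same valid time is the more direct test.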

Has anyone seen ERA5-driven WRF crash only in certain months while FNL works? Are there specific ERA5 fields or WPS processing steps I should check first?

Any suggestions would be greatly appreciated.

Thanks,
Yuanlin
 


Hi Yuanlin,

Where did you download ERA5? What Vtable did you use to ungrib ERA5?

Thanks.
Hello Ming,

I downloaded the ERA5 data from the official website, the Climate Data Store, via an API request.
The Vtable I used is Vtable.ECMWF.
It worked for the February and March simulations, but failed for April and May, even with both WRF builds.

Many thanks,
Yuanlin
 
Hi Yuanlin,

Your ERA5 processing looks fine. Can you first modify some namelist options and try again? I hope this fixes your problem. Please try:

(1) time_step = 300
(2) cu_physics = 1, 1, 0
(3) ra_lw_physics = 4, 4, 4
(4) ra_sw_physics = 4, 4, 4
(5) sf_sfclay_physics = 1, 1, 1
(6) gwd_opt = 1, 1, 0
(7) grid_fdda = 0, 0, 0 (this will turn off grid nudging)
(8) radt = 30, 30, 30
(9) feedback = 0

Note that the 50-10-2 km nesting may produce large spatial variations across the nest boundaries, which can eventually lead to model instability.

Please try and let me know whether your case can run successfully.
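
When testing several option changes like these, it can help to apply them with a script so that each run's namelist is reproducible. The sketch below is one possible way to do that; apply_overrides is a made-up helper, and the sample namelist text is a shortened illustration that contains only values quoted in this thread.

```python
import re

def apply_overrides(namelist_text, overrides):
    """Rewrite the value of every key that is already present in the text.

    Keys that are not found are returned in a list instead of being added,
    because the correct &section for a new entry has to be chosen by hand.
    """
    missing = []
    for key, value in overrides.items():
        pattern = re.compile(rf"^([ \t]*){re.escape(key)}[ \t]*=.*$", re.MULTILINE)
        replacement = lambda m, k=key, v=value: f"{m.group(1)}{k} = {v},"
        namelist_text, count = pattern.subn(replacement, namelist_text)
        if count == 0:
            missing.append(key)
    return namelist_text, missing

# Shortened, illustrative namelist; a real namelist.input has many more entries.
sample = """&domains
 time_step = 150,
/
&physics
 cu_physics = 1, 1, 1,
/
"""
overrides = {"time_step": "300", "cu_physics": "1, 1, 0", "radt": "30, 30, 30"}
new_text, missing = apply_overrides(sample, overrides)
print(new_text)
print("not found, add by hand:", missing)
```

The helper deliberately does not insert new keys, since placing an option in the wrong namelist section is an easy mistake to make.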
 
Hello Ming,

I have tested the model according to your suggestions.
WRF crashed in the second hour due to CFL issues, so I reduced time_step from 300 to 150 and resubmitted, and the run started normally. However, it then crashed at the 13th hour, as in many of my previous runs. Attached are the log files for the first test (CFL issue) and the second test.
I look forward to hearing from you.
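
The effect of halving time_step on a CFL problem can be illustrated with the Courant number, C = |u| * dt / dx, which scales linearly with the time step. In the sketch below, the updraft speed and layer thickness are hypothetical values chosen only for illustration, and d03 is assumed to be nested inside d02 so that its time step is the outer time step divided by 25.

```python
def courant_number(speed_m_s, dt_s, spacing_m):
    """Courant number C = |u| * dt / dx for one direction."""
    return abs(speed_m_s) * dt_s / spacing_m

# d03 time step for the two outer time steps tried (ratios 5 and 5, so divide by 25).
for outer_dt in (300.0, 150.0):
    dt_d03 = outer_dt / 25.0
    # Hypothetical updraft of 5 m/s through a hypothetical 50 m thick model layer.
    c = courant_number(5.0, dt_d03, 50.0)
    print(f"time_step={outer_dt:g} s -> d03 dt={dt_d03:g} s, vertical Courant number={c:g}")
```

Under these assumed values the Courant number drops from 1.2 to 0.6 when time_step goes from 300 to 150, which is consistent with the CFL crash disappearing while the later crash at hour 13 remains.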

Many thanks
Yuanlin
 

Attachments:
- Screenshot 2026-05-21 135434.png (88.5 KB)
- Screenshot 2026-05-21 135542.png (100 KB)
Hi Yuanlin,
I am sorry to hear that your case failed again. My guess is that this is because your domain covers part of the Tibetan Plateau, where steep topography can easily lead to numerical instability. Let's try one more option: please set epssm = 0.9, 0.9, 0.9, then rerun your case. I hope this option suppresses the instability. Let me know whether it works. Thanks.
 
Helpful discussion. The link between ERA5 forcing, steep terrain, and numerical stability is a useful point to check before treating it as only a data problem.
 
Hello Ming,

Thanks for your reply. I set 'epssm = 0.9, 0.9, 0.9' in my namelist, but the run crashed again for the same reason as last time. Attached are the log file and my updated namelist.
 

Attachments:
- Screenshot 2026-05-22 101024.png (127.8 KB)
- namelist.input (4.9 KB)
- Screenshot 2026-05-22 102132.png (39.7 KB)
Since none of these options worked, I think you will have to recompile WRF in debug mode (i.e., ./clean -a, then ./configure -D, then compile again) and rerun this case. The log file will tell you exactly when and where the model first crashes. This information can help you debug what is wrong.

The same case works when driven by FNL, which suggests that the ERA5 forcing data could be the issue. ERA5 has quarter-degree resolution, while your outermost domain has a 50 km grid spacing, which is coarser than ERA5. I think it may not be necessary to use ERA5 as input for this case, and FNL is a reasonable option.
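
As a quick check of the resolution comparison, the sketch below converts a 0.25 degree grid to kilometres at 17.4 N, the latitude of the original crash point reported earlier in the thread. It assumes a spherical Earth with a radius of 6371 km, so the numbers are approximate.

```python
import math

EARTH_RADIUS_KM = 6371.0  # spherical-Earth approximation

def spacing_km(delta_deg, lat_deg):
    """Return (north-south, east-west) spacing in km of a regular lat/lon grid at a latitude."""
    km_per_deg = math.pi * EARTH_RADIUS_KM / 180.0
    ns = delta_deg * km_per_deg
    ew = ns * math.cos(math.radians(lat_deg))
    return ns, ew

ns, ew = spacing_km(0.25, 17.4)
print(f"ERA5 0.25 deg at 17.4N: {ns:.1f} km north-south, {ew:.1f} km east-west; d01 dx = 50 km")
```

At this latitude the ERA5 grid is roughly 27 km in both directions, so the 50 km outer domain is nearly twice as coarse as the forcing data.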
 