Hello,
I am running WRF over a nested Southeast Asia domain and encountered a persistent crash when using ERA5 forcing for April and May 2024. The same setup worked for February and March 2024, which makes the issue confusing.
Model/domain setup:
- 3 nested domains
- d01: 50 km
- d02: 10 km
- d03: 2 km
- parent_grid_ratio = 1, 5, 5
- parent_time_step_ratio = 1, 5, 5
- e_vert = 45
- p_top_requested = 5000 Pa
- Original time_step = 150 s
- Original cu_physics = 1, 1, 1
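As a quick sanity check on the time step, here is a minimal Python sketch that derives each nest's grid spacing and time step from the ratios above and compares them with the commonly quoted WRF rule of thumb of about 6 s per km of grid spacing. The function name `nest_time_steps` and the `factor` argument are my own illustration, not namelist settings.

```python
def nest_time_steps(dx_km, time_step_s, grid_ratios, time_ratios, factor=6.0):
    """Return (domain, dx_km, dt_s, rule_of_thumb_limit_s) for each domain.

    Ratios follow the namelist convention: the first entry belongs to d01
    and is 1. `factor` is the rule-of-thumb seconds per km of grid spacing
    (an assumed guideline value, not something read from WRF).
    """
    rows = []
    dx, dt = float(dx_km), float(time_step_s)
    for i, (g, t) in enumerate(zip(grid_ratios, time_ratios), start=1):
        dx /= g  # each nest refines its parent's grid spacing
        dt /= t  # and shortens its parent's time step
        rows.append((f"d{i:02d}", dx, dt, factor * dx))
    return rows


# Check both time steps that were tried (150 s and 120 s).
for time_step in (150, 120):
    for name, dx, dt, limit in nest_time_steps(50, time_step, (1, 5, 5), (1, 5, 5)):
        status = "ok" if dt <= limit else "too long"
        print(f"time_step={time_step}: {name} dx={dx:g} km, dt={dt:g} s, "
              f"limit={limit:g} s, {status}")
```

By this measure both 150 s and 120 s are at or below half the rule-of-thumb limit on every domain, so the time step does not look obviously too long, although steep terrain can still demand a smaller value.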
The model crashes around 2024-04-28_12:00. With the original setup, the error appears in the Kain-Fritsch cumulus scheme:
WOULD GO OFF TOP: KF_ETA_PARA I,J,DPTHMX,DPMIN
136 127 NaN 5000.00000
Backtrace includes:
module_cu_kfeta_MOD_kf_eta_para
module_cu_kfeta_MOD_kf_eta_cps
module_cumulus_driver
solve_em
module_integrate
The MPI error is a segmentation fault, for example:
mpiexec noticed that process rank XX exited on signal 11 (Segmentation fault)
I also tried:
1. Reducing time_step from 150 to 120
- Still crashed with similar KF_ETA_PARA / NaN issue.
2. Turning off cumulus on d03:
cu_physics = 1, 1, 0
- Still crashed, again in KF, but at a different I/J point.
3. Turning off cumulus on all domains:
cu_physics = 0, 0, 0
- The crash moved from KF to the RRTM longwave radiation scheme.
The new backtrace includes:
module_ra_rrtm_MOD_taugas
module_ra_rrtm_MOD_gasabs
module_ra_rrtm_MOD_rrtm
module_ra_rrtm_MOD_rrtmlwrad
module_radiation_driver
solve_em
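This suggests the problem is not specific to the cumulus scheme. DPTHMX is already NaN when KF reports it, and RRTM fails once KF is switched off, so it looks more like the model state itself has become unstable or contains bad values around 2024-04-28_12:00, and each scheme is simply the first one to trip over it.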
I checked the ERA5 met_em files around the crash time:
- met_em.d01/d02/d03.2024-04-28_12:00:00.nc exist
- Times/header look normal
- cdo infon did not show obvious NaN values
- The original crash grid point on d03 is around lat ~17.4 N, lon ~101.0 E, with terrain height roughly 700–1100 m, so it is a mountainous area.
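To go beyond the cdo infon check, a per-field scan like the sketch below could be used. It assumes the third-party netCDF4 and numpy packages are installed; `summarize` and `scan_file` are hypothetical helper names of mine. Masked (missing or fill) values are counted together with NaN and Inf, so a nonzero count is only a prompt to look closer, not proof of a bad field.

```python
import math


def summarize(values):
    """Return min, max, and the number of non-finite entries in a flat list."""
    finite = [v for v in values if math.isfinite(v)]
    return {
        "min": min(finite) if finite else None,
        "max": max(finite) if finite else None,
        "n_bad": len(values) - len(finite),
    }


def scan_file(path):
    """Summarize every floating-point variable in one netCDF file.

    Assumes netCDF4 and numpy are available; they are imported here so the
    pure-Python helper above can be used without them.
    """
    import numpy as np
    from netCDF4 import Dataset

    report = {}
    with Dataset(path) as ds:
        for name, var in ds.variables.items():
            if getattr(var.dtype, "kind", "") != "f":
                continue  # skip character/integer variables such as Times
            # Masked (missing/fill) values become NaN and count as "bad".
            data = np.ma.filled(var[:], np.nan).ravel()
            report[name] = summarize(data.tolist())
    return report
```

Calling `scan_file("met_em.d03.2024-04-28_12:00:00.nc")` and printing the entries with `n_bad > 0`, or with a min/max that looks unphysical, would narrow down which fields deserve a closer look.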
The most informative comparison so far:
- ERA5 + my WRF build (GNU compilers): crashes for April/May
- ERA5 + a second, separately built WRF (also GNU compilers): crashes for April/May
- FNL + the second GNU build: runs successfully
ERA5-driven WRF runs for February and March worked, but April and May crash. FNL works for the same general period.
So I am wondering whether this points to:
1. A problem in my ERA5/WPS/real.exe processing for April/May;
2. A bad or inconsistent ERA5 field around 2024-04-28_12:00;
3. ERA5 producing sharper meteorological gradients than FNL, causing WRF instability over mountainous Southeast Asia;
4. A known issue in how WPS processes ERA5 surface, soil, SST, SKINTEMP, PSFC, PMSL, or RH fields.
My planned next checks are:
- Compare ERA5 and FNL met_em files at 2024-04-28_12:00.
- Compare April ERA5 met_em/wrfinput ranges against successful March ERA5 files.
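For these comparisons, something like the following Python sketch is what I have in mind: given per-field (min, max) pairs from a reference file and a suspect file, it flags fields whose range in the suspect file falls well outside the reference range. The name `compare_ranges` and the `tol` threshold are arbitrary choices of mine, and the numbers in the example are made up purely to show the call.

```python
def compare_ranges(ref, test, tol=0.5):
    """Flag fields whose min or max in `test` lies outside the `ref` range
    by more than `tol` times the reference span.

    `ref` and `test` map field name -> (min, max). `tol` is an arbitrary
    illustrative threshold, not a WRF or WPS setting.
    Returns (flagged fields, names present in only one of the two inputs).
    """
    flagged = {}
    for name in sorted(set(ref) & set(test)):
        rmin, rmax = ref[name]
        tmin, tmax = test[name]
        margin = tol * (rmax - rmin)
        if tmin < rmin - margin or tmax > rmax + margin:
            flagged[name] = {"ref": (rmin, rmax), "test": (tmin, tmax)}
    only_in_one = sorted(set(ref) ^ set(test))
    return flagged, only_in_one


# Made-up numbers purely to show the call; not taken from any real file.
march = {"SKINTEMP": (270.0, 320.0), "PSFC": (60000.0, 102000.0)}
april = {"SKINTEMP": (272.0, 321.0), "PSFC": (0.0, 102500.0)}
flagged, only_in_one = compare_ranges(march, april)
print("flagged:", flagged)
print("present in only one file:", only_in_one)
```

Field names that appear in only one file are listed as well, which may matter for the ERA5 vs FNL comparison if the two datasets end up with differently named soil layers.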
Has anyone seen ERA5-driven WRF crash only in certain months while FNL works? Are there specific ERA5 fields or WPS processing steps I should check first?
Any suggestions would be greatly appreciated.
Thanks,
Yuanlin