WRF repeatedly crashes near the same time/location after reducing timestep; Thompson microphysics overflow

a8626086s

Hi all,

I am running a 3-domain nested WRF-ARW case over Lagos using ERA5 forcing. The model repeatedly crashes around 2022-02-23 12:43-12:44 UTC.

Earlier tests around the same time/location produced CFL-related errors like:

points exceeded cfl=2 in domain d01

After reducing the timestep, the CFL errors no longer appear, but the model still crashes at nearly the same simulation time.

To investigate this, I restarted the run shortly before the crash using a debug build of wrf.exe. The crash appears to occur in the Thompson microphysics scheme, in the condensation iteration around this line:

fcd = qvs(k)*EXP(lvt2(k)*clap) - qv(k) + clap

I then added a local diagnostic guard before this EXP() call, so that the model stops before the original floating-point overflow and prints the relevant values. This diagnostic is not a native WRF message; I added it locally only to identify the problematic grid point and variables.

The diagnostic output is:

THOMPSON EXP overflow risk ii,jj,k,n= 112 60 2 1
qv,qvs,ssatw,clap,lvt2,arg=
3.473003 1.7083695E-03 2031.934 3.188251 51.96595 165.6805


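The very large qv (about 3.47, against a qvs of about 1.7E-03) and the resulting supersaturation (ssatw of about 2032) suggest that the immediate failure is a floating-point overflow in the Thompson condensation iteration, where the EXP() argument reaches about 165.7. The last normal history output before the crash does not show obvious NaN/Inf values.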
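The printed values can be cross-checked with a few lines of Python. This is only a sketch under two assumptions that are not stated in the post: that the guard's arg is lvt2*clap (as the quoted Fortran line suggests), and that WRF was built with default 4-byte reals, so EXP() overflows once its argument exceeds the natural log of the largest single-precision value. The helper name exp_overflows_float32 is hypothetical.

```python
import math

# Largest finite IEEE-754 single-precision value.
# Assumption: WRF was compiled with default 4-byte reals.
FLT_MAX = 3.4028235e38
EXP_ARG_LIMIT = math.log(FLT_MAX)  # roughly 88.7

def exp_overflows_float32(arg):
    """Return True if EXP(arg) would exceed the single-precision range."""
    return arg > EXP_ARG_LIMIT

# Values copied from the diagnostic print in the post.
qv, qvs = 3.473003, 1.7083695e-03
ssatw, clap, lvt2, arg = 2031.934, 3.188251, 51.96595, 165.6805

# Assumption: the guard printed arg = lvt2*clap and ssatw = qv/qvs - 1.
arg_check = lvt2 * clap
ssatw_check = qv / qvs - 1.0

print(f"lvt2*clap  = {arg_check:.4f} (printed arg   = {arg})")
print(f"qv/qvs - 1 = {ssatw_check:.3f} (printed ssatw = {ssatw})")
print(f"float32 EXP limit = {EXP_ARG_LIMIT:.2f}, "
      f"overflow = {exp_overflows_float32(arg)}")
```

If both recomputed values match the printed ones, the diagnostic is internally consistent, and the overflow follows directly from the size of qv rather than from the EXP() call itself.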

My question is: what is the recommended way to handle this kind of failure? Since reducing the timestep removed the CFL warning but did not prevent the crash, should I continue reducing the timestep, adjust nest feedback/smoothing, modify the domain/nesting setup, or switch microphysics schemes?

The namelist.input and relevant rsl.error/rsl.out files are attached. Any suggestions for diagnosing or stabilizing this case would be appreciated.
 


Hi, many apologies for the delay. If this is still an issue, could you run this again without the added prints/stop in the code? Afterward, please package all of the rsl* files into a single *.tar or zipped file and attach it, along with the namelist.input file you used (just in case there have been modifications since you originally posted this). Thanks!
 
Hi,

Thanks for your reply. I have attached the requested files, including the relevant rsl.error/rsl.out files, the namelist.input file, and the wrfout_d03 file generated shortly before the crash.

Please note that these files are not from the exact experiment for which I originally reported the issue. In this diagnostic run, the SIGSEGV backtrace maps to a different source location. However, the Lagos cases appear to show a similar failure pattern: before the crash, some urban grid cells develop extremely high surface skin temperatures and abnormal surface fluxes, and these cells lie within the MPI blocks where the warning/error messages are raised.

For this run, the model stopped around 2022-02-28 13:42 UTC. The explicit SIGSEGV messages were found in rsl.error.0066 and rsl.error.0078.

With the help of GPT-5.5-Codex-xhigh, I reached the following diagnosis.

Mapping the reported addresses with addr2line showed that the immediate crash occurred in the Kain-Fritsch cumulus code:

module_cu_kfeta_mp_kf_eta_para_
module_cu_kfeta_mp_kf_eta_cps_
module_cumulus_driver_mp_cumulus_driver_

However, the last available wrfout_d03 file before the crash shows a suspicious urban grid cell at:
d03 i=99, j=55
lat/lon = 6.67963, 3.56284
LU_INDEX = 13

At 2022-02-28 13:00 UTC, this grid cell had:

TSK = 344.77 K
HFX = 1563 W m-2
LH = -9955 W m-2
QFX = -0.00408 kg m-2 s-1
GRDFLX = -4646 W m-2

One hour earlier, the same point was much more reasonable:

TSK = 321.39 K
HFX = 323 W m-2
LH = 49.6 W m-2
QFX = 1.98e-5 kg m-2 s-1
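The two snapshots can be compared with a short Python sketch. It assumes the usual relation LH ≈ Lv * QFX with Lv near 2.5e6 J kg-1, which is not stated in the post, and the helper names implied_latent_heat and hourly_change are hypothetical. The numbers are copied from the values listed above.

```python
# Values for d03 i=99, j=55, copied from the post
# (12:00 and 13:00 UTC on 2022-02-28).
earlier = {"TSK": 321.39, "HFX": 323.0, "LH": 49.6, "QFX": 1.98e-5}
later = {"TSK": 344.77, "HFX": 1563.0, "LH": -9955.0, "QFX": -0.00408}

def implied_latent_heat(cell):
    """Latent heat of vaporization implied by LH / QFX, in J kg-1."""
    return cell["LH"] / cell["QFX"]

def hourly_change(before, after):
    """Difference (after - before) for every field present in both snapshots."""
    return {name: after[name] - before[name] for name in before if name in after}

for label, cell in (("12:00", earlier), ("13:00", later)):
    print(f"{label} UTC: LH/QFX = {implied_latent_heat(cell):.3e} J kg-1")

for name, delta in hourly_change(earlier, later).items():
    print(f"delta {name} = {delta:+.4g}")
```

Both ratios come out at roughly 2.4e6 to 2.5e6 J kg-1, so LH and QFX agree with each other at both times. The anomaly lies in the sign and magnitude of the moisture flux, together with a skin temperature rise of more than 23 K in one hour, rather than in a mismatch between the two output fields.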

Moreover, an RRTM warning was found in rsl.error.0066:
rrtm: TBOUND exceeds table limit: reset 344.769

I also tested the case with sf_urban_physics = 0, and the model ran successfully. However, this is not acceptable for my experiments, since the urban canopy scheme is required.

More generally, as long as sf_urban_physics is enabled, this type of problem seems difficult to avoid in my Lagos simulations, even when I change other physics options. In some cases, changing ra_lw_physics, ra_sw_physics, or mp_physics lets the model pass a crash time that occurred with another physics combination, but a similar failure tends to appear later in the run. So far I have been able to force one experiment to finish by changing some of these physics options, but I would like to know whether there is a better and more physically defensible way to handle this.

For comparison, I have also run several high-temperature experiments in China using ERA5 data and a similar model configuration, but I did not encounter this type of problem there.


Thank you very much.
 
