Observation Nudging Run Crash with No Error

Soroush · Dec 18, 2018

Hello Everyone,

I'm running a WRF (V3.8) simulation with a single domain, 12 km horizontal resolution, and a 60 s time step. I followed the WRF instructions (http://www2.mmm.ucar.edu/wrf/users/wrfv3.1/How_to_run_obs_fdda.html) to convert Little_R format data to the WRF input format and then merged all the files. I also added the required namelist options:

&time_control
auxinput11_interval = 1, 1, 1
auxinput11_end_h = 960, 960, 960
/

&fdda
obs_nudge_opt = 1,1,1
max_obs = 150000,
fdda_start = 0., 0., 0.
fdda_end = 57600., 57600., 57600.
obs_nudge_wind = 1,1,1
obs_coef_wind = 6.E-4,6.E-4,6.E-4
obs_nudge_temp = 0,0,0
obs_coef_temp = 6.E-4,6.E-4,6.E-4
obs_nudge_mois = 0,0,0
obs_coef_mois = 6.E-4,6.E-4,6.E-4
obs_rinxy = 50.,50.,50.
obs_rinsig = 0.1,
obs_twindo = 0.6666667,0.6666667,0.6666667
obs_npfi = 10,
obs_ionf = 1, 1, 1,
obs_idynin = 0,
obs_dtramp = 60.,
obs_prt_freq = 10, 10, 10
obs_prt_max = 10
obs_ipf_errob = .true.
obs_ipf_nudob = .true.
obs_ipf_in4dob = .true.
obs_ipf_init = .true.
/
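
As a quick sanity check on the time settings above, here is a minimal sketch that converts them to common units. It assumes the usual WRF conventions that fdda_end is given in minutes and obs_twindo in hours; the variable names are illustrative only and are not part of WRF.

```python
# Values copied from the namelist above (first domain only).
fdda_end_min = 57600.0      # fdda_end, assumed to be in minutes
auxinput11_end_h = 960      # auxinput11_end_h, in hours
obs_twindo_h = 0.6666667    # obs_twindo, assumed to be in hours

# Convert to common units so the settings can be compared directly.
fdda_end_h = fdda_end_min / 60.0
twindo_min = obs_twindo_h * 60.0

print(f"nudging ends after {fdda_end_h:.0f} h; obs input ends after {auxinput11_end_h} h")
print(f"obs_twindo half-window is about {twindo_min:.0f} min")

if fdda_end_h != auxinput11_end_h:
    print("warning: nudging period and obs input period differ")
```

Under those assumptions the two end times agree (960 h each), so a mismatch between the nudging period and the observation input period does not look like the cause.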

The problem is that near the beginning of the simulation (usually around 2 hr 30 min in), the model crashes without any error message, even with debug_level = 1000 (there is no problem when nudging is turned off). I tried increasing and decreasing max_obs, obs_rinxy, obs_twindo, and obs_ionf, but none of these changes resolved the problem. However, decreasing the time window (obs_twindo) to 0.3 let the simulation run somewhat longer (until around 3 hr).
The Little_R observations were obtained from NCEP Global Weather Data.
Does anyone have any idea what might be wrong with my case?
Thank you in advance.

Best,
Soroush

Ming Chen · Dec 18, 2018

Please take a look at all of your RSL files for possible error messages; they may not be in rsl.error.0000.
One option is to rebuild WRF with ./configure -D and then rerun this case. With the -D option, the model will report exactly which line of which source file the crash occurred in. That kind of information helps us figure out what is wrong.
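
For runs with many MPI ranks, checking every RSL file by hand is tedious. The sketch below is one way to scan them all at once. The function name scan_rsl and the keyword list are assumptions made for illustration, not anything WRF provides, and the keywords will likely need adjusting for your compiler and MPI stack.

```python
from pathlib import Path

# Hypothetical keyword list; adjust for your compiler and MPI stack.
KEYWORDS = ("fatal", "error", "segmentation fault", "sigsegv")


def scan_rsl(run_dir=".", keywords=KEYWORDS):
    """Return (file name, line number, text) for each line containing a keyword."""
    hits = []
    # rsl.out.NNNN and rsl.error.NNNN files both match this pattern.
    for path in sorted(Path(run_dir).glob("rsl.*")):
        with open(path, errors="replace") as handle:
            for number, line in enumerate(handle, start=1):
                lowered = line.lower()
                if any(word in lowered for word in keywords):
                    hits.append((path.name, number, line.rstrip()))
    return hits


if __name__ == "__main__":
    for name, number, text in scan_rsl():
        print(f"{name}:{number}: {text}")
```

Run it from the WRF run directory. If it prints nothing, no rank logged a message containing those keywords.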

Soroush · Dec 19, 2018

Thank you for the reply, Ming. I have contacted the cluster staff to reconfigure the model with the -D option to find out what is wrong.

kwthomas · Jan 7, 2019

Soroush...

WRF crashing without leaving an error message normally means one thing: OOM, as in Out Of Memory.

Your WRF run requested so much memory that the computer ran out. For computers, this is an emergency
state. Running out of memory can lead to a kernel hang (things stop working) or a kernel panic (computer
crashes and reboots). As a defense, the Linux kernel has an OOM killer that watches for this situation. When it is
encountered, SIGKILL, aka "kill -9", is sent to the offending processes. Any script that collects the exit status
should see that the process was killed by signal 9 (a shell reports this as exit status 137, which is 128 + 9).
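
To illustrate how a wrapper script might recognize this, here is a minimal sketch for POSIX systems. It relies on Python's subprocess reporting a signal-terminated child as a negative return code (-9 for SIGKILL), and on the shell convention of 128 plus the signal number (137). The helper describe_exit is hypothetical, and an MPI launcher may report a different code than the killed rank itself, so treat the result as a hint rather than proof.

```python
import signal
import subprocess
import sys


def _signal_name(number):
    """Return the symbolic name for a signal number, if the platform knows it."""
    try:
        return signal.Signals(number).name
    except ValueError:
        return f"signal {number}"


def describe_exit(returncode):
    """Explain a return code, flagging terminations by SIGKILL."""
    if returncode == 0:
        return "finished normally"
    if returncode < 0:
        # subprocess convention: -N means the child was killed by signal N.
        number = -returncode
    elif returncode > 128:
        # Shell convention: 128 + N usually means killed by signal N.
        number = returncode - 128
    else:
        return f"exited with status {returncode}"
    note = " (possible OOM kill)" if number == signal.SIGKILL else ""
    return f"killed by {_signal_name(number)}{note}"


if __name__ == "__main__":
    # Demo: a child process that sends SIGKILL to itself.
    child = [sys.executable, "-c",
             "import os, signal; os.kill(os.getpid(), signal.SIGKILL)"]
    print(describe_exit(subprocess.run(child).returncode))
```

In a real job script, the demo child would be replaced by the command that launches wrf.exe.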

The computer that triggered the SIGKILL *will* log the event. "dmesg" (or "dmesg -T" on newer systems, for
human-readable timestamps) will show the system log messages, including any killed processes.
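
A minimal sketch of searching that log for OOM kills follows. It assumes the kernel's usual "Out of memory: Killed process <pid> (<name>)" wording (older kernels print "Kill process"); the exact text varies between kernel versions, so the pattern may need adjusting. The function names are illustrative, and reading dmesg may require elevated privileges. It has to run on the node where the job ran.

```python
import re
import subprocess

# Assumed message wording; covers "Kill process" and "Killed process",
# and the cgroup variant "Memory cgroup out of memory: ...".
OOM_PATTERN = re.compile(
    r"out of memory: kill(?:ed)? process (\d+) \(([^)]+)\)",
    re.IGNORECASE,
)


def find_oom_kills(log_text):
    """Return a list of (pid, process name) for each OOM kill found in the text."""
    return [(int(pid), name) for pid, name in OOM_PATTERN.findall(log_text)]


def read_dmesg():
    """Return the kernel log as text; output is empty if access is denied."""
    result = subprocess.run(["dmesg", "-T"], capture_output=True,
                            text=True, check=False)
    return result.stdout
```

Calling find_oom_kills(read_dmesg()) on the compute node and looking for wrf.exe in the result would confirm or rule out the OOM explanation.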

If you are running MPI, you'll need to add more nodes to your job. If not running MPI, you'll have to reduce
the domain size.
