Observation Nudging Run Crash with No Error

Soroush

Hello Everyone,

I'm running a WRF (V3.8) simulation with a single domain at 12 km horizontal resolution and a 60 s time step. I followed the WRF instructions (http://www2.mmm.ucar.edu/wrf/users/wrfv3.1/How_to_run_obs_fdda.html) to convert the Little_R format data to the WRF input format and then merged all the files. I also added the required namelist options:
&time_control
auxinput11_interval = 1, 1, 1
auxinput11_end_h = 960, 960, 960
/

&fdda
obs_nudge_opt = 1,1,1
max_obs = 150000,
fdda_start = 0., 0., 0.
fdda_end = 57600., 57600., 57600.
obs_nudge_wind = 1,1,1
obs_coef_wind = 6.E-4,6.E-4,6.E-4
obs_nudge_temp = 0,0,0
obs_coef_temp = 6.E-4,6.E-4,6.E-4
obs_nudge_mois = 0,0,0
obs_coef_mois = 6.E-4,6.E-4,6.E-4
obs_rinxy = 50.,50.,50.
obs_rinsig = 0.1,
obs_twindo = 0.6666667,0.6666667,0.6666667
obs_npfi = 10,
obs_ionf = 1, 1, 1,
obs_idynin = 0,
obs_dtramp = 60.,
obs_prt_freq = 10, 10, 10
obs_prt_max = 10
obs_ipf_errob = .true.
obs_ipf_nudob = .true.
obs_ipf_in4dob = .true.
obs_ipf_init = .true.
/
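
A quick unit check on the time settings above may help when reading them. This is only a sketch: it assumes fdda_end is given in minutes and obs_twindo in hours (as I understand the WRF namelist documentation), and the variable names are just for illustration.

```python
# Values copied from the namelist above. The units in the comments are
# assumptions based on the WRF namelist documentation, not confirmed here.
fdda_end_min = 57600.0      # &fdda fdda_end (assumed minutes)
auxinput11_end_h = 960      # &time_control auxinput11_end_h (hours)
obs_twindo_h = 0.6666667    # &fdda obs_twindo (assumed hours)
time_step_s = 60            # model time step stated in the post

# Convert so that both end times are in hours and the window is in minutes.
fdda_end_h = fdda_end_min / 60.0
twindo_min = obs_twindo_h * 60.0
steps_per_half_window = twindo_min * 60.0 / time_step_s

print(f"fdda_end          = {fdda_end_h:.1f} h (auxinput11_end_h = {auxinput11_end_h} h)")
print(f"obs_twindo        = {twindo_min:.1f} min")
print(f"steps per window  = {steps_per_half_window:.1f} model steps (half-window)")
```

Under those assumptions, fdda_end works out to 960 h, which matches auxinput11_end_h, and obs_twindo is about 40 minutes, so the time settings look mutually consistent.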

The problem is that near the beginning of the simulation (usually around 2 hr 30 min in), the model crashes without any error message, even with debug_level = 1000. There is no problem when nudging is turned off. I tried increasing and decreasing max_obs, obs_rinxy, obs_twindo, and obs_ionf, but none of these resolved the problem. However, decreasing the time window (obs_twindo) to 0.3 let the simulation run slightly longer (to around 3 hr).
The Little_R observations were obtained from NCEP Global Weather Data.
Does anyone have any idea what might be wrong with my case?
Thank you in advance.


Best,
Soroush
 
Please look through all of your rsl.* files for error messages; they may not be in rsl.error.0000.
Another option is to rebuild WRF with ./configure -D and then rerun this case. With the -D option, the model reports exactly which line of which source file the crash occurred in. That kind of information helps us figure out what is wrong.
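
Checking every RSL file by hand gets tedious with many MPI ranks, so here is a minimal sketch of how the search could be automated. The keyword list and the function name scan_rsl are my own assumptions for illustration; WRF does not define them, and a real failure may use different wording.

```python
from pathlib import Path

# Hypothetical keywords; adjust to whatever your run actually prints.
KEYWORDS = ("error", "fatal", "segmentation", "sigsegv")


def scan_rsl(run_dir=".", keywords=KEYWORDS):
    """Return (file name, line number, text) for every matching line
    in the rsl.* files found directly inside run_dir."""
    hits = []
    for path in sorted(Path(run_dir).glob("rsl.*")):
        # errors="replace" avoids a crash on stray non-UTF-8 bytes.
        with open(path, errors="replace") as fh:
            for lineno, line in enumerate(fh, start=1):
                lowered = line.lower()
                if any(key in lowered for key in keywords):
                    hits.append((path.name, lineno, line.rstrip()))
    return hits


if __name__ == "__main__":
    for name, lineno, text in scan_rsl():
        print(f"{name}:{lineno}: {text}")
```

Run it from the directory that holds the rsl.out.* and rsl.error.* files. With a high debug_level the files are large, so expect some harmless matches that need to be read in context.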
 
Thank you for the reply, Ming. I have contacted the cluster staff to reconfigure the model with the -D option to find out what is wrong.
 
Soroush...

WRF crashing without leaving an error message normally means one thing: OOM, as in Out Of Memory.

Your WRF run requested so much memory that the computer ran out. For computers, this is an emergency
state. Running out of memory can lead to a kernel hang (things stop working) or a kernel panic (computer
crashes and reboots). As a defense, the operating system watches for this situation (on Linux, this is the
kernel's OOM killer). When it is encountered, SIGKILL, aka "kill -9", is sent to the offending processes.
Any script that collects the exit status should see that the process was killed by signal 9; most shells
report this as an exit status of 137 (128 + 9).
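
To make the exit-status point concrete, here is a small sketch of how a wrapper script could recognize a SIGKILL. It assumes a POSIX system and Python's subprocess convention, where a negative return code -N means the child was ended by signal N; the helper names describe_exit and run_and_kill_child are made up for this example, and the child here is a harmless sleeping Python process, not WRF.

```python
import signal
import subprocess
import sys


def describe_exit(returncode):
    """Interpret a return code from subprocess (negative = signal) or from
    a shell (128 + signal number). Codes above 128 are ambiguous, so treat
    the shell interpretation as a hint only."""
    if returncode == 0:
        return "exited normally"
    signum = -returncode if returncode < 0 else returncode - 128
    try:
        name = signal.Signals(signum).name
    except ValueError:
        return f"exited with status {returncode}"
    return f"terminated by {name}"


def run_and_kill_child():
    """Start a sleeping child, send it SIGKILL, and return its return code."""
    child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
    child.kill()          # on POSIX this sends SIGKILL, like "kill -9"
    return child.wait()


if __name__ == "__main__":
    code = run_and_kill_child()
    print(code, "->", describe_exit(code))
```

If your job script launches wrf.exe through mpirun or a batch scheduler, the status you see may be whatever that launcher reports, so check its documentation as well.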

The computer that triggered the SIGKILL *will* log the event. "dmesg" (or "dmesg -T" on newer systems,
for human-readable timestamps) will show the system log messages, including any processes that were killed.
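
If the log is long, the relevant lines can be filtered out with something like the sketch below. The search phrases are my assumption about how the kernel words these messages, and they vary between kernel versions, so widen the list if nothing turns up. The function names are hypothetical, and dmesg may need elevated privileges on some systems.

```python
import subprocess

# Assumed phrases; kernel wording differs between versions.
PATTERNS = ("out of memory", "oom-kill", "killed process")


def oom_lines(log_text, patterns=PATTERNS):
    """Return the lines of log_text that contain any pattern (case-insensitive)."""
    return [
        line
        for line in log_text.splitlines()
        if any(p in line.lower() for p in patterns)
    ]


def read_dmesg():
    """Return the output of "dmesg -T" as text (empty if the call fails)."""
    result = subprocess.run(
        ["dmesg", "-T"], capture_output=True, text=True, check=False
    )
    return result.stdout


if __name__ == "__main__":
    for line in oom_lines(read_dmesg()):
        print(line)
```

On a cluster this has to run on the compute node where the job died, not on the login node, which usually means asking the cluster staff or running it inside a job on the same node.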

If you are running MPI, you'll need to add more nodes to your job. If not running MPI, you'll have to reduce
the domain size.
 