
WRF hangs with nested domain

This post is from a previous version of the WRF & MPAS-A Support Forum. New replies have been disabled; if you have follow-up questions related to this post, please start a new thread from the forum home page.

parth17

New member
Hello,

I started using WRF a few months ago and have focused on two-way nesting. I have no issues with preprocessing up through real.exe: I have wrfinput files for both domains, along with met_em files for both domains. The coarse grid is 24 km with a parent grid ratio of 3; the parent domain is 320x360 grid points and the nested domain is 352x364. Following several forum threads, I have kept a thick buffer zone of roughly 1/3 of the parent domain between the nest and the parent boundaries in both directions.
Running under SLURM with OpenMPI, wrf.exe runs absolutely fine with 64 processes but hangs with a higher number of processes, showing no errors. Many people have resolved this issue by changing the number of processes or by enlarging the inner domain; I have tried multiple configurations, but WRF still hangs with more than 64 processes. It would be really helpful to get a hint on why this happens. I have even set debug_level to 9999, but still no errors show up.
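As a rough sanity check of my own (the numbers come from the rsl output below, so treat this as my reading, not something WRF reports directly): with 128 tasks WRF decomposes every domain over 8 tasks in X and 16 in Y. The intermediate domain spans only 123 x 127 points (i = 103..225, j = 113..239), so on average each task owns a patch of about 15 x 8 points there, while the commonly quoted guidance is to keep at least roughly 10 grid points per task in each direction on the smallest domain. With 64 tasks (presumably 8 x 8) the Y patches would be about twice as large. Could this decomposition limit be why the run hangs above 64 processes?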

rsl.error.0000:

Ntasks in X 8 , ntasks in Y 16
Setting blank km_opt entries to domain #1 values.
--> The km_opt entry in the namelist.input is now max_domains.
Setting blank diff_opt entries to domain #1 values.
--> The diff_opt entry in the namelist.input is now max_domains.
--- WARNING: traj_opt is zero, but num_traj is not zero; setting num_traj to zero.
--- NOTE: grid_fdda is 0 for domain 1, setting gfdda interval and ending time to 0 for that domain.
--- NOTE: both grid_sfdda and pxlsm_soil_nudge are 0 for domain 1, setting sgfdda interval and ending time to 0 for that domain.
--- NOTE: obs_nudge_opt is 0 for domain 1, setting obs nudging interval and ending time to 0 for that domain.
--- NOTE: grid_fdda is 0 for domain 2, setting gfdda interval and ending time to 0 for that domain.
--- NOTE: both grid_sfdda and pxlsm_soil_nudge are 0 for domain 2, setting sgfdda interval and ending time to 0 for that domain.
--- NOTE: obs_nudge_opt is 0 for domain 2, setting obs nudging interval and ending time to 0 for that domain.
--- NOTE: bl_pbl_physics /= 4, implies mfshconv must be 0, resetting
Need MYNN PBL for icloud_bl = 1, resetting to 0
--- NOTE: RRTMG radiation is not used, setting: o3input=0 to avoid data pre-processing
--- NOTE: num_soil_layers has been set to 4
WRF V3.8.1 MODEL
*************************************
Parent domain
ids,ide,jds,jde 1 320 1 360
ims,ime,jms,jme 274 325 331 365
ips,ipe,jps,jpe 281 320 338 360
*************************************
DYNAMICS OPTION: Eulerian Mass Coordinate
alloc_space_field: domain 1 , 71040788 bytes allocated
med_initialdata_input: calling input_input
Max map factor in domain 1 = 1.57. Scale the dt in the model accordingly.
INPUT LandUse = "MODIFIED_IGBP_MODIS_NOAH"
LANDUSE TYPE = "MODIFIED_IGBP_MODIS_NOAH" FOUND 33 CATEGORIES 2 SEASONS WATER CATEGORY = 17 SNOW CATEGORY = 15
Climatological albedo is used instead of table values
*************************************
Nesting domain
ids,ide,jds,jde 1 352 1 364
ims,ime,jms,jme 299 357 332 369
ips,ipe,jps,jpe 309 352 342 364
INTERMEDIATE domain
ids,ide,jds,jde 103 225 113 239
ims,ime,jms,jme 198 230 219 244
ips,ipe,jps,jpe 208 227 229 241
*************************************
alloc_space_field: domain 2 , 7886736 bytes allocated
alloc_space_field: domain 2 , 85629804 bytes allocated
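(For what it's worth, every rank's patch can be checked at once by grepping the decomposition lines out of all of the rsl files, e.g.

grep 'ips,ipe,jps,jpe' rsl.error.*

where each patch is (ipe-ips+1) x (jpe-jps+1) points; the numbers above are from rank 0 only.)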

namelist.input:

&time_control
run_days = 10,
run_hours = 0,
run_minutes = 0,
run_seconds = 0,
start_year = 2011, 2011,
start_month = 07, 07,
start_day = 01, 01,
start_hour = 12, 12,
start_minute = 00, 00,
start_second = 00, 00,
end_year = 2011, 2011,
end_month = 07, 07,
end_day = 11, 11,
end_hour = 12, 12,
end_minute = 00, 00,
end_second = 00, 00,
auxinput4_inname = "wrflowinp_d<domain>",
auxinput4_interval = 360, 360,
io_form_auxinput4 = 2,
interval_seconds = 21600,
input_from_file = .true.,.true.,
history_interval = 180, 180,
frames_per_outfile = 1, 1,
restart = .false.,
restart_interval = 720,
io_form_history = 2,
io_form_restart = 2,
io_form_input = 2,
io_form_boundary = 2,
debug_level = 0
/

&domains
time_step = 144,
time_step_fract_num = 0,
time_step_fract_den = 1,
max_dom = 2,
use_adaptive_time_step = .true.,
step_to_output_time = .true.,
target_cfl = 1.2, 1.2,
max_step_increase_pct = 5, 51,
starting_time_step = -1, -1
max_time_step = -1, -1
min_time_step = -1, -1
adaptation_domain = 1,
e_we = 320, 352,
e_sn = 360, 364,
e_vert = 50, 50,
p_top_requested = 1000,
num_metgrid_levels = 38,
num_metgrid_soil_levels = 4,
dx = 24000, 8000,
dy = 24000, 8000,
grid_id = 1, 2,
parent_id = 1, 1,
i_parent_start = 1, 105,
j_parent_start = 1, 115,
parent_grid_ratio = 1, 3,
parent_time_step_ratio = 1, 3,
feedback = 1,
smooth_option = 1,
/

&physics
mp_physics = 6, 6,
ra_lw_physics = 1, 1,
ra_sw_physics = 1, 1,
radt = 10, 10,
sf_sfclay_physics = 1, 1,
sf_surface_physics = 2, 2,
bl_pbl_physics = 1, 1,
bldt = 0, 0,
cu_physics = 1, 0,
cudt = 5, 0,
isfflx = 1,
ifsnow = 1,
icloud = 1,
surface_input_source = 1,
num_soil_layers = 4,
sf_urban_physics = 0, 0,
num_land_cat = 21,
/

&fdda
/

&dynamics
w_damping = 0,
diff_opt = 1,
km_opt = 4,
diff_6th_opt = 2, 2,
diff_6th_factor = 0.12, 0.12,
base_temp = 290.,
damp_opt = 1,
epssm = 0.5,
zdamp = 5000.,5000.,
dampcoef = 0.2, 0.2,
khdif = 0, 0,
kvdif = 0, 0,
non_hydrostatic = .true.,.true.,
moist_adv_opt = 1, 1,
scalar_adv_opt = 1, 1,
!time_step_sound = 4, 4,
/

&bdy_control
spec_bdy_width = 10,
spec_zone = 1,
relax_zone = 4,
spec_exp = 0.33,
specified = .true.,.false.,
nested = .false.,.true.,
/

&grib2
/

&namelist_quilt
nio_tasks_per_group = 0,
nio_groups = 1,
/
 
Hi,
I'm not sure whether any of these is causing the problem you are experiencing, but I do have a couple of suggestions for you, to at least rule out some things.
1) diff_opt is an option that should be set for each domain, so add a value in the 2nd column for this option (e.g., diff_opt = 1, 1,).
2) km_opt is also a variable that expects a value for every domain, so again, add a value in the 2nd column (e.g., km_opt = 4, 4,).
3) Set debug_level back to 0. This is something that was put in the namelist many years ago to be used for particular testing applications and was not removed until V4.0. It really is not useful, and just makes your rsl files much larger and hard to read through.
4) Do you know whether your code was built with large-file support? If not, try to recompile with that option set:
(csh example): setenv WRFIO_NCD_LARGE_FILE_SUPPORT 1
You would need to issue a 'clean -a' first, then set this variable, then reconfigure, and then recompile (the full sequence is spelled out after this list). Recompiling will not delete any of the data files you have already created.
5) If none of the above help, can you try running this with the most recent released version (4.0.3), just to see whether there was a problem in the older code that has since been corrected? If so, and if you still need to use the old code, we can try to track down the changes that were implemented to correct the problem and guide you through modifying your V3.8.1 files.
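For reference, the rebuild in suggestion 4 would look something like the following from the top-level WRF directory (csh; 'em_real' and the compile-log name are just examples, substitute whatever case you normally build):

./clean -a
setenv WRFIO_NCD_LARGE_FILE_SUPPORT 1
./configure
./compile em_real >& compile.log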

Thanks,
Kelly
 