
WRF gets stuck writing NC files and reports no errors or task failures when io_form_history = 13, nocolons = .true.

I have run into a problem with WRF compiled for parallel netCDF-4 output (io_form_history = 13, nocolons = .true.). After a few hours of simulation, WRF gets stuck writing a NetCDF file, and it reports no errors or task failures. WRF writes wrfout files hour by hour as normal during the first few hours of the run. If io_form_history = 2 is used instead, the simulation runs successfully.
My Linux is Rocky Linux 9.4,
Intel oneAPI 2023.1,
netcdf-c 4.9.2 (--enable-netcdf-4),
hdf5 1.12.1 (--enable-fortran --enable-parallel),
netcdf-fortran 4.6.0,
no pnetcdf,
file system: Lustre,
WRF is V4.7.1,
frames_per_outfile = 1.
Has anyone encountered this problem?
 
I don't think it should matter, but do you have striping enabled for the output directory, and no other job writing to that directory?
Also, what do the timings for output look like compared to serial netcdf4? (And how large is the domain?). Are there any messages from the system in your job output?
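For reference, striping on a Lustre directory can be checked and adjusted with the lfs utility (the path and stripe settings below are placeholders, not values from the thread):

lfs getstripe /path/to/output_dir              # show the current stripe count and stripe size
lfs setstripe -c 4 -S 4m /path/to/output_dir   # example: stripe new files in this directory across 4 OSTs with a 4 MB stripe size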
 
Thanks! I turned on ncd_nofill, and I also tested ncd_nofill = .false.; the result is the same.
And that directory is used only by this job.
Compared to serial netcdf4, the parallel netcdf4 job is faster.
Each hourly wrfout is 1.8 GB, on a 716×646×51 grid.
There are no messages from the system in my job output.
Every time, WRF gets stuck while writing a wrfout file.
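For scale, a rough estimate of how much data each of those writes involves (my own back-of-the-envelope numbers, assuming 4-byte floats, not figures from the thread):

716 × 646 × 51 = 23,589,336 grid points per 3D field
23,589,336 × 4 bytes ≈ 94 MB of uncompressed data per 3D float field
so a 1.8 GB wrfout corresponds to roughly 19 such fields' worth of data if stored uncompressed (less if compression is effective)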
 
Here is my namelist:
history_interval = 360, 60,
frames_per_outfile = 1, 1,
restart = .false.,
restart_interval = 1440,
nocolons = .true.,
io_form_history = 13,
io_form_restart = 13,
io_form_input = 2,
io_form_boundary = 2,
nwp_diagnostics = 1,
output_diagnostics = 1,
io_form_auxhist3 = 2,
auxhist3_interval = 1440, 1440,
frames_per_auxhist3 = 1, 1,
ncd_nofill = .true.,
adjust_output_times = .true.,
debug_level = 0,
 
Try io_form_history = 2 and io_form_restart = 2.
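A minimal sketch of that fallback in the &time_control namelist (the same switches already shown above, just set back to serial netCDF):

 io_form_history = 2,
 io_form_restart = 2,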
 
I didn't realize that the documentation hadn't been updated. I'll have to find where to do that.
If you have enough disk space, an alternative is to use pnetcdf output and then ncks to convert the files to compressed format. It would of course be better if the parallel write worked consistently, so it would be worth opening a new issue on the netCDF side to see if any ideas come up there. The WRF interface is essentially the same as the serial netCDF-4 one, with just some specific changes for parallel netCDF-4 (HDF5 underneath). Since netCDF is an interface on top of HDF5, the PHDF5 option should work with your current libraries. If that works, it would point to an issue in the netCDF-4 layer. (I'd have to check whether WRF's PHDF5 output supports compression, however; if it doesn't, it's not a good test.)
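A sketch of that workflow, with assumptions: io_form = 11 selects pnetcdf output in WRF builds that include the pnetcdf library (which, per the first post, is not currently installed), ncks comes from the NCO package, and the filenames are placeholders:

 io_form_history = 11,
 io_form_restart = 11,

ncks -4 -L 1 wrfout_d02_TIMESTAMP wrfout_d02_TIMESTAMP_compressed   # rewrite as netCDF-4 with deflate level 1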

It looks like you have a nested domain. Can you share the 'domains' namelist?

And does WRF always get stuck at the same output time or semi-randomly?
 
Thanks.
WRF gets stuck semi-randomly.
My max_dom is 2.
Disk space is 100 TB.

&time_control
run_days = 0,
run_hours = 96,
run_minutes = 0,
run_seconds = 0,
start_year = 2025, 2025, 2025,
start_month = 08, 08, 08,
start_day = 05, 05, 05,
start_hour = 12, 12, 12,
start_minute = 00, 00, 00,
start_second = 00, 00, 00,
end_year = 2025, 2025, 2025,
end_month = 08, 08, 08,
end_day = 09, 09, 10,
end_hour = 00, 00, 00,
end_minute = 00, 00, 00,
end_second = 00, 00, 00,
interval_seconds = 10800,
input_from_file = .true.,.true.,.true.,
history_interval = 360, 60, 60,
frames_per_outfile = 1, 1, 1,
restart = .false.,
restart_interval = 1440,
nocolons = .true.,
io_form_history = 13,
io_form_restart = 13,
io_form_input = 2,
io_form_boundary = 2,
nwp_diagnostics = 1,
output_diagnostics = 1,
io_form_auxhist3 = 2,
auxhist3_interval = 1440, 1440, 1440,
frames_per_auxhist3 = 1, 1, 1,
ncd_nofill = .true.,
adjust_output_times = .true.,
debug_level = 0,
/

&domains
time_step = 60,
time_step_fract_num = 0,
time_step_fract_den = 1,
max_dom = 2,
step_to_output_time = .true.,
e_we = 260, 716, 292,
e_sn = 260, 646, 262,
e_vert = 51, 51, 51,
p_top_requested = 5000,
num_metgrid_levels = 34,
num_metgrid_soil_levels = 4,
dx = 15000,3000,1000,
dy = 15000,3000,1000,
grid_id = 1, 2, 3,
parent_id = 0, 1, 2,
i_parent_start = 1, 59, 156,
j_parent_start = 1, 66, 109,
parent_grid_ratio = 1, 5, 5,
parent_time_step_ratio = 1, 5, 5,
feedback = 0,
smooth_option = 1,
smooth_cg_topo = .true.,
sfcp_to_sfcp = .false.,
/

&physics
mp_physics = 8, 8, 8,
gsfcgce_hail = 1,
aer_opt = 3,
use_aero_icbc =.true.,
progn = 1, 1, 1,
hail_opt = 1,
do_radar_ref = 1,
lightning_option = 2, 2, 2, 3,
ra_lw_physics = 4, 4, 4,
ra_sw_physics = 4, 4, 4,
ra_sw_eclipse = 1,
radt = 2, 2, 2,
sf_sfclay_physics = 5, 5, 5 ,1,
sf_surface_physics= 4, 4, 4, 3,
num_soil_layers = 4,
bl_pbl_physics = 5, 5, 5 ,1,
bl_mynn_closure = 2.6,
bldt = 0, 0, 0,
tke_budget = 0,0,0,1,
bl_mynn_tkeadvect = .true.,.true.,.true.,.false.,
bl_mynn_cloudpdf = 2,
bl_mynn_edmf = 1,1,1,
bl_mynn_edmf_mom = 1,1,1,
cu_physics = 3, 0, 0,
cugd_avedx = 1,
cudt = 0, 0, 0,
cu_rad_feedback = .true.,.false.,.false.,
cu_diag = 1, 0, 0,
isfflx = 1,
ifsnow = 1,
icloud = 3,
icloud_bl = 1,
surface_input_source = 3,
opt_thcnd = 1,
sf_surface_mosaic = 0,
usemonalb = .false.,
rdmaxalb = .false.,
rdlai2d = .false.,
mosaic_lu = 0,
mosaic_soil = 0,
mosaic_cat = 3,
num_land_cat = 21,
sf_urban_physics = 0, 0, 0,
tmn_update = 0,
maxiens =1,
maxens =3,
maxens2 =3,
maxens3 =16,
ensdim =144,
sf_ocean_physics = 1,
fractional_seaice = 0,
seaice_threshold = 271.4,
sst_update = 0,
/

&fdda
grid_fdda = 1, 0, 0,
gfdda_inname = "wrffdda_d<domain>",
gfdda_interval_m = 360, 360, 360,
gfdda_end_h = 276, 120, 360,
io_form_gfdda = 2,
fgdt = 0, 0, 0,
if_no_pbl_nudging_uv= 0, 0, 0,
if_no_pbl_nudging_t = 0, 0, 0,
if_no_pbl_nudging_q = 0, 0, 0,
if_zfac_uv = 0, 0, 0,
k_zfac_uv = 10, 10, 10,
if_zfac_t = 0, 0, 0,
k_zfac_t = 10, 10, 10,
if_zfac_q = 0, 0, 0,
k_zfac_q = 10 ,10,10,
guv = 0.0003,0.0003,0.0003,
gt = 0.0003,0.0003,0.0003,
gq = 0.0003,0.0003,0.0003,
if_ramping = 0,
dtramp_min = 60.0,
/

&dynamics
hybrid_opt = 2,
etac = 0.1,
rk_ord = 3,
zadvect_implicit = 1,
w_damping = 1,
diff_opt = 2, 2, 2,
km_opt = 4, 4, 4,
diff_6th_opt = 2, 2, 2,
diff_6th_factor = 0.12, 0.12, 0.12,
diff_6th_slopeopt = 1, 1, 1,
diff_6th_thresh = 0.05, 0.05, 0.05,
moist_mix6_off = .false.,
chem_mix6_off = .true.,
tracer_mix6_off = .true.,
scalar_mix6_off = .false.,
tke_mix6_off = .true.,
! base_temp = 290.,
damp_opt = 3,
zdamp = 5000., 5000., 5000.,
dampcoef = 0.2, 0.2, 0.2,
khdif = 0, 0, 0,
kvdif = 0, 0, 0,
smdiv = 0.1, 0.1, 0.1,
emdiv = 0.01, 0.01, 0.01,
epssm = 0.9, 0.9, 0.9, 0.1,
non_hydrostatic = .true., .true., .true.,
moist_adv_opt = 1, 1, 1,
scalar_adv_opt = 1, 1, 1,
h_mom_adv_order = 5, 5, 5,
v_mom_adv_order = 5, 5, 5,
h_sca_adv_order = 5, 5, 5,
v_sca_adv_order = 5, 5, 5,
/
 
I have run into an issue before that was caused by an interaction of netcdf with LSF. So it may well be a netcdf-file system issue. Although in that case I think it resulted in file corruption, not simply hanging.
 
Thank you so much!
I tested the newest netcdf-c and netcdf-fortran, but they did not resolve the problem.
I will test io_form = 13 with a newer HDF5.
Thanks again!
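For reference, a minimal sketch of rebuilding HDF5 with parallel and Fortran support using the Intel MPI wrappers (the wrapper names and install prefix are assumptions; recent oneAPI releases use mpiicx/mpiifx in place of mpiicc/mpiifort):

CC=mpiicc FC=mpiifort ./configure --enable-parallel --enable-fortran --prefix=$HOME/hdf5-parallel
make -j 8 && make check && make install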
 