MPICH2 error when restarting WRF simulations

samuel

Hi there,

I am running a WRF simulation with a relatively large domain (1350*1450 grid points). It ran successfully for 36 hours and then stopped because of a timeout. I am now trying to restart it from the wrfrst file, but the restart fails immediately with the error below (from the rsl.error file). An MPICH2 error appears first, followed by other errors such as "unrecoverable application network transaction error". I suspect the failure is due to the large domain, which produces a large wrfrst file (about 19 GB) and therefore requires too many computing resources. Does anybody have a clue how to resolve this issue?
FYI, the namelist is also posted after the error message.

rsl.error message:
=======================
WRF TILE 1 IS 82 IE 108 JS 1222 JE 1244
WRF NUMBER OF TILES = 1
MPICH2 ERROR [Rank 2703] [job id 36558962] [Tue Nov 24 13:07:34 2020] [c1-5c0s14n1] [nid11769] - MPID_nem_gni_check_localCQ(): GNI_CQ_EVENT_TYPE_SMSG had error (SOURCE_SSID:AT_MDD_INV:CPLTN_DREQ)
Rank 2703 [Tue Nov 24 13:07:34 2020] [c1-5c0s14n1] Fatal error in PMPI_Wait: Other MPI error, error stack:
PMPI_Wait(207)....................: MPI_Wait(request=0x44d57854, status=0x7ffffe7d47f0) failed
MPIR_Wait_impl(81)................:
MPIDI_CH3I_Progress(537)..........:
MPID_nem_mpich_blocking_recv(1125):
MPID_nem_gni_poll(1582)...........:
MPID_nem_gni_check_localCQ(651)...: unrecoverable application network transaction error
forrtl: error (76): Abort trap signal
Image PC Routine Line Source
wrf.exe 0000000022F80B84 for__signal_handl Unknown Unknown
libpthread-2.26.s 00002AAAAD2C62D0 Unknown Unknown Unknown
libc-2.26.so 00002AAAAD843520 gsignal Unknown Unknown


namelist:
=======================
&time_control
!run_days = ddd,
!run_hours = 0,
!run_minutes = 0,
!run_seconds = 0,
start_year = 2013,
start_month = 03,
start_day = 31,
start_hour = 12,
start_minute = 00,
start_second = 00,
end_year = 2013,
end_month = 04,
end_day = 02,
end_hour = 00,
end_minute = 00,
end_second = 00,
interval_seconds = 21600,
input_from_file = .true.,
history_interval = 180,
frames_per_outfile = 2,
restart = .true.,
restart_interval = 360,
write_hist_at_0h_rst = .true.,
io_form_history = 2,
io_form_restart = 2,
io_form_input = 2,
io_form_boundary = 2,
io_form_auxinput2 = 2,
debug_level = 0,
/

&domains
time_step = 4,
time_step_fract_num = 0,
time_step_fract_den = 1,
max_dom = 1,
e_we = 1351,
e_sn = 1451,
p_top_requested = 10000,
e_vert = 60,
num_metgrid_levels = 38,
dx = 1800,
dy = 1800,
grid_id = 1,
parent_id = 0,
i_parent_start = 1,
j_parent_start = 1,
parent_grid_ratio = 1,
parent_time_step_ratio = 1,
feedback = 1,
smooth_option = 0,
/

&physics
mp_physics = 17,
ra_lw_physics = 4,
ra_sw_physics = 4,
radt = 1,
sf_sfclay_physics = 2,
sf_surface_physics = 2,
bl_pbl_physics = 2,
sf_urban_physics = 2,
bldt = 0,
cu_physics = 0,
cudt = 1,
cugd_avedx = 3,
isfflx = 1,
ifsnow = 0,
icloud = 1,
surface_input_source = 1,
num_soil_layers = 4,
slope_rad = 1,
topo_shading = 1,
shadlen = 25000.,
mp_zero_out = 2,
/

&fdda
grid_fdda = 2,
gfdda_inname = "wrffdda_d<domain>",
gfdda_interval = 2880,
xwavenum = 3,
ywavenum = 3,
/

&dynamics
w_damping = 1,
diff_opt = 2,
km_opt = 2,
mix_full_fields = .true.,
diff_6th_opt = 0,
diff_6th_factor = 0.12,
base_temp = 290.,
damp_opt = 0,
zdamp = 1000.,
dampcoef = 0.2,
khdif = 0,
kvdif = 0,
non_hydrostatic = .true.,
moist_adv_opt = 1,
scalar_adv_opt = 1,
/

&bdy_control
spec_bdy_width = 15,
spec_zone = 1,
relax_zone = 14,
specified = .true.,
nested = .false.,
/

&grib2
/

&namelist_quilt
nio_tasks_per_group = 0,
nio_groups = 1,
/



Thank you,
Samuel
 
Samuel,

I am not sure whether the large wrfrst file is an issue here. Please let me know which version of WRF you are running. I will talk to our expert and get back to you once we know the answer.

By the way, can you run a small test case to confirm that the restart function of WRF works fine?

Thanks.
 
Thanks for the reply and the help.

I am using WRF version 4.0.

In fact, I had another simulation job using exactly the same namelist settings but without spectral nudging. I ran into a similar restart issue with that simulation, except that its restart stopped after running for 30 minutes. So I think the restart function of this WRF version works fine, right?

Thanks,
Sam
 
BTW, I have rerun my restart simulation several times, e.g., with fewer cluster nodes requested, to rule out the possibility that a node or some network component on the supercomputer was misbehaving.

Sam
 
Sam,
I have forwarded this message to our software engineer. Let's see whether he has an answer to your question. Thanks for your patience.
 
Hi,

FYI, I ran several tests and would like to update you:
1) I did a test run with a smaller domain (450*450 grid points), and the restart run worked well;
2) I re-ran the original simulation (i.e., with the large domain of 1350*1450 grid points) with io_form_restart=102, and the restart seems to run well (so far I have only tested it for a little more than one hour, but at least it did not fail at about 30 min).
I am not sure this is the right way to resolve the issue, since I don't understand the root reason why io_form_restart=102 is required. Also, a large number of restart files are generated, which I don't like.

I hope this is useful to you. I will continue testing.

Thank you,
Sam
 
Sam,
Thank you for the update. I do believe you are on the right track. Your new result confirms that this is a data issue caused by the large file size.
Setting io_form_restart=102 splits the restart file into several smaller files, which overcomes the problem caused by the large file size.
I am sorry that I didn't think of the io_form_restart=102 option earlier. I hope this issue didn't waste too much of your time.
Thanks again.
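
For reference, a minimal sketch of the change discussed in this thread, showing only the affected &time_control entries from the namelist posted earlier. The comments are assumptions based on the WRF user's guide description of io_form 102 as split netCDF output with one file per MPI task, so please check them against the guide for your WRF version.

```fortran
&time_control
restart = .true.,
restart_interval = 360,
io_form_history = 2,     ! unchanged: history output stays a single netCDF file
io_form_restart = 102,   ! was 2; assumed meaning: split netCDF, each MPI task writes its own wrfrst piece
/
```

If that reading is right, it would also explain the large number of restart files Sam noticed: one per MPI task per restart time. A restart from split files would then presumably need the same number of MPI tasks as the run that wrote them.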
 