Hi there,
I am running a WRF simulation with a relatively large domain (1350 x 1450 grid points). It ran 36 hours successfully and then stopped because of a timeout. I am now trying to restart it from the wrfrst file, but the restart fails immediately with the errors below (from the rsl.error file): first an MPICH2 error, followed by others such as "unrecoverable application network transaction error". I suspect the failure is due to the large domain, which produces a large wrfrst file (about 19 GB) and therefore requires too many computing resources. Does anybody have a clue how to resolve this issue?
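For a rough sense of scale, here is a back-of-the-envelope estimate of how big a single 3D field is on this grid and how many such fields would add up to the restart file size. This is only a sketch: it assumes 4-byte reals and unstaggered dimensions (e_we-1, e_sn-1, e_vert-1), it treats "about 19 GB" as 19 GiB, and the variable names are my own, not anything taken from WRF.

```python
# Back-of-the-envelope size of one single-precision 3D field on this grid.
# Assumptions (not taken from WRF itself): 4-byte reals, unstaggered
# dimensions of (e_we - 1) x (e_sn - 1) x (e_vert - 1).
nx, ny, nz = 1351 - 1, 1451 - 1, 60 - 1
bytes_per_value = 4

field_3d_bytes = nx * ny * nz * bytes_per_value
field_3d_gib = field_3d_bytes / 1024**3

# "About 19 GB" as reported for the wrfrst file, treated as GiB here.
restart_bytes = 19 * 1024**3
equivalent_3d_fields = restart_bytes / field_3d_bytes

print(f"one 3D field: {field_3d_gib:.2f} GiB")
print(f"restart file is roughly {equivalent_3d_fields:.0f} such fields")
```

If the estimate is in the right range, the restart file corresponds to a few dozen full 3D fields, each of which has to be read and distributed to the MPI ranks at startup.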
FYI, the namelist is also posted after the error message.
rsl.error message:
=======================
WRF TILE 1 IS 82 IE 108 JS 1222 JE 1244
WRF NUMBER OF TILES = 1
MPICH2 ERROR [Rank 2703] [job id 36558962] [Tue Nov 24 13:07:34 2020] [c1-5c0s14n1] [nid11769] - MPID_nem_gni_check_localCQ(): GNI_CQ_EVENT_TYPE_SMSG had error (SOURCE_SSID:AT_MDD_INV:CPLTN_DREQ)
Rank 2703 [Tue Nov 24 13:07:34 2020] [c1-5c0s14n1] Fatal error in PMPI_Wait: Other MPI error, error stack:
PMPI_Wait(207)....................: MPI_Wait(request=0x44d57854, status=0x7ffffe7d47f0) failed
MPIR_Wait_impl(81)................:
MPIDI_CH3I_Progress(537)..........:
MPID_nem_mpich_blocking_recv(1125):
MPID_nem_gni_poll(1582)...........:
MPID_nem_gni_check_localCQ(651)...: unrecoverable application network transaction error
forrtl: error (76): Abort trap signal
Image PC Routine Line Source
wrf.exe 0000000022F80B84 for__signal_handl Unknown Unknown
libpthread-2.26.s 00002AAAAD2C62D0 Unknown Unknown Unknown
libc-2.26.so 00002AAAAD843520 gsignal Unknown Unknown
namelist:
=======================
&time_control
!run_days = ddd,
!run_hours = 0,
!run_minutes = 0,
!run_seconds = 0,
start_year = 2013,
start_month = 03,
start_day = 31,
start_hour = 12,
start_minute = 00,
start_second = 00,
end_year = 2013,
end_month = 04,
end_day = 02,
end_hour = 00,
end_minute = 00,
end_second = 00,
interval_seconds = 21600,
input_from_file = .true.,
history_interval = 180,
frames_per_outfile = 2,
restart = .true.,
restart_interval = 360,
write_hist_at_0h_rst = .true.,
io_form_history = 2,
io_form_restart = 2,
io_form_input = 2,
io_form_boundary = 2,
io_form_auxinput2 = 2,
debug_level = 0,
/
&domains
time_step = 4,
time_step_fract_num = 0,
time_step_fract_den = 1,
max_dom = 1,
e_we = 1351,
e_sn = 1451,
p_top_requested = 10000,
e_vert = 60,
num_metgrid_levels = 38,
dx = 1800,
dy = 1800,
grid_id = 1,
parent_id = 0,
i_parent_start = 1,
j_parent_start = 1,
parent_grid_ratio = 1,
parent_time_step_ratio = 1,
feedback = 1,
smooth_option = 0,
/
&physics
mp_physics = 17,
ra_lw_physics = 4,
ra_sw_physics = 4,
radt = 1,
sf_sfclay_physics = 2,
sf_surface_physics = 2,
bl_pbl_physics = 2,
sf_urban_physics = 2,
bldt = 0,
cu_physics = 0,
cudt = 1,
cugd_avedx = 3,
isfflx = 1,
ifsnow = 0,
icloud = 1,
surface_input_source = 1,
num_soil_layers = 4,
slope_rad = 1,
topo_shading = 1,
shadlen = 25000.,
mp_zero_out = 2,
/
&fdda
grid_fdda = 2,
gfdda_inname = "wrffdda_d<domain>",
gfdda_interval = 2880,
xwavenum = 3,
ywavenum = 3,
/
&dynamics
w_damping = 1,
diff_opt = 2,
km_opt = 2,
mix_full_fields = .true.,
diff_6th_opt = 0,
diff_6th_factor = 0.12,
base_temp = 290.
damp_opt = 0,
zdamp = 1000.,
dampcoef = 0.2,
khdif = 0,
kvdif = 0,
non_hydrostatic = .true.,
moist_adv_opt = 1,
scalar_adv_opt = 1,
/
&bdy_control
spec_bdy_width = 15,
spec_zone = 1,
relax_zone = 14,
specified = .true.,
nested = .false.,
/
&grib2
/
&namelist_quilt
nio_tasks_per_group = 0,
nio_groups = 1,
/
Thank you,
Samuel