MPT ERROR SIGSEGV(11)

This post was from a previous version of the WRF & MPAS-A Support Forum. New replies have been disabled, and if you have follow-up questions related to this post, please start a new thread from the forum home page.

cjones

Member
Hello,

I am trying to transition to WRF 4.3 but am getting segmentation fault errors. I am using Cheyenne and created all WRF input files with WPS 4.3.

WRF integrates for about 10 minutes and then crashes with errors like this:
MPT ERROR: Rank 392(g:392) received signal SIGSEGV(11).
** There are no CFL errors.

The additional problem is that the exact same configuration and WRF input files work fine with version 4.2.2.
If it is possible to take a look at my runs, the paths are:

** version 4.3
/glade/work/cjones/WRF/test/em_labfees
** version 4.2.2
/glade/work/cjones/WRF-422/test/em_labfees/

Any suggestions or ideas are greatly appreciated.
My namelist.input file is below.

Cheers,

Charles.

---

&time_control
run_days = 0,
run_hours = 0,
run_minutes = 0,
run_seconds = 0,
start_year = 2018,2018,2018,2018,
start_month = 11, 11, 11, 11,
start_day = 06, 06, 06, 06,
start_hour = 00, 00, 00, 00,
start_minute = 00, 00, 00, 00,
start_second = 00, 00, 00, 00,
end_year = 2018,2018,2018,2018,
end_month = 11, 11,11,11,
end_day = 06, 06, 08, 08,
end_hour = 06, 06, 00, 00,
end_minute = 00, 00, 00, 00,
end_second = 00, 00, 00, 00,
interval_seconds = 3600,
input_from_file = .true., .true., .true., .true.,
history_interval = 720, 60, 60, 60,
frames_per_outfile = 999999, 999999, 500, 24,
auxinput1_inname = '/glade/scratch/cjones/labfees/prod/metgrid/met_em.d<domain>.<date>',
history_outname = '/glade/scratch/cjones/labfees/prod/camp/245-1.5km/wrfout_d<domain>_<date>',
adjust_output_times = .true.,
restart = .false.,
restart_interval = 7200,
write_hist_at_0h_rst = .true.,
auxinput4_inname = "wrflowinp_d<domain>"
auxinput4_interval = 60, 60, 180, 360,
io_form_auxhist23 = 2,
io_form_auxinput4 = 2,
all_ic_times = .false.,
io_form_history = 2,
io_form_restart = 102,
io_form_input = 2,
io_form_boundary = 2,
debug_level = 0,
/

&domains
time_step = 30,
time_step_fract_num = 0,
time_step_fract_den = 1,
max_dom = 1,
e_we = 831, 746, 796, 259,
e_sn = 881, 766, 787, 235,
e_vert = 55, 55, 55, 55,
p_top_requested = 5000,
num_metgrid_levels = 38,
num_metgrid_soil_levels = 4,
interp_type = 2,
extrap_type = 2,
t_extrap_type = 2,
lowest_lev_from_sfc = .false.
use_levels_below_ground = .true.
use_surface = .true.
lagrange_order = 1,
force_sfc_in_vinterp = 1,
zap_close_levels = 500,
dx = 7500, 1500, 1600,
dy = 7500, 1500, 1600,
grid_id = 1, 2, 3, 4,
parent_id = 0, 1, 2, 3,
i_parent_start = 1, 340, 132, 68,
j_parent_start = 1, 325, 105, 64,
parent_grid_ratio = 1, 5, 3, 3,
parent_time_step_ratio = 1, 5, 3, 3,
feedback = 1,
smooth_option = 0,
!nproc_x = 36,
!nproc_y = 72,
/

&physics
mp_physics = 6, 6, 6, 6,
ra_lw_physics = 4, 4, 4, 4,
ra_sw_physics = 4, 4, 4, 4,
radt = 10, 10, 10, 10,
sf_sfclay_physics = 2, 2, 5, 5,
sf_surface_physics = 4, 4, 4, 4,
bl_pbl_physics = 5, 5, 5, 5,
grav_settling = 2, 2, 2, 2,
bldt = 0, 0, 0, 0,
cu_physics = 6, 0, 0, 0,
kfeta_trigger = 1,
cudt = 0, 0, 0, 0,
ishallow = 0,
shcu_physics = 0, 0, 0, 0,
isfflx = 1,
ifsnow = 0,
icloud = 1,
cu_rad_feedback = .true., .true., .true., .true.,
cu_diag = 0,
topo_wind = 0, 0, 0, 0,
sf_surface_mosaic = 0,
mosaic_cat = 3,
slope_rad = 0, 1, 0, 1,
topo_shading = 0, 1, 1, 1,
shadlen = 25000.,
surface_input_source = 1,
num_soil_layers = 4,
! num_land_cat = 20,
sst_update = 1,
usemonalb = .true.,
tmn_update = 1,
lagday = 150,
sst_skin = 1,
sf_urban_physics = 0, 0, 0,
cam_abs_freq_s = 21600,
levsiz = 59,
paerlev = 29,
bucket_mm = 100.0,
bucket_J = 1.e9,
do_radar_ref = 1,
/
&noah_mp
dveg = 4,
opt_crs = 1,
opt_btr = 1,
opt_run = 1,
opt_sfc = 1,
opt_frz = 1,
opt_inf = 1,
opt_rad = 3,
opt_alb = 2,
opt_snf = 1,
opt_tbot = 2,
opt_stc = 1,
/

&fdda
grid_fdda = 1, 0, 0, 0,
gfdda_inname = "wrffdda_d<domain>",
gfdda_end_h = 99999999, 4320, 0, 0,
gfdda_interval_m = 60, 360, 0, 0,
fgdt = 0, 0, 0, 0,
fgdtzero = 0, 0, 0, 0,
if_no_pbl_nudging_uv = 0, 1, 0, 0,
if_no_pbl_nudging_t = 0, 1, 0, 0,
if_no_pbl_nudging_q = 0, 1, 0, 0,
if_zfac_uv = 0, 0, 0, 0,
k_zfac_uv = 0, 0, 0, 0,
if_zfac_t = 0, 0, 0, 0,
k_zfac_t = 0, 0, 0, 0,
if_zfac_q = 0, 0, 0, 0,
k_zfac_q = 0, 0, 0, 0,
guv = 0.0003, 0.0003, 0,
gt = 0.0003, 0.0003, 0,
gq = 0.0003, 0.0003, 0,
if_ramping = 1,
dtramp_min = 60.0,
io_form_gfdda = 2,
/

&dynamics
w_damping = 0,
diff_opt = 2,
km_opt = 4,
hybrid_opt = 0,
etac = 0.2,
diff_6th_opt = 0, 0, 0, 0,
diff_6th_factor = 0.12, 0.12, 0.12, 0.12,
base_temp = 300.
damp_opt = 3,
zdamp = 5000., 5000., 5000., 5000.,
dampcoef = 0.2, 0.2, 0.2, 0.2,
khdif = 0, 0, 0, 0,
kvdif = 0, 0, 0, 0,
non_hydrostatic = .true., .true., .true., .true.,
moist_adv_opt = 1, 1, 1, 1,
scalar_adv_opt = 1, 1, 1, 1,
/

&bdy_control
spec_bdy_width = 5,
spec_zone = 1,
relax_zone = 4,
specified = .true., .false., .false., .false.,
nested = .false., .true., .true., .true.,
/

&grib2
/

&namelist_quilt
nio_tasks_per_group = 0,
nio_groups = 1,
! nio_tasks_per_group = 36,
! nio_groups = 12,
/
 
Hi,
I was trying to look at your files, but it looks like you're currently running things in this directory. Let me know when you're done, or perhaps move/copy some of the necessary files (namelist.input, wrfinput*, wrfbdy_d01, rsl.*) to a different directory that I can view. A couple of quick thoughts:

1) I notice the namelists for V4.3 and V4.2.2 use different dates and input data intervals. It is possible that the input for one is erroneous. Have you tried to run this exact namelist.input (from V4.3) in WRF V4.2.2?

2) I looked at the wrfout_d02* file in your history_outname path and noticed it is extremely large (800+ GB). If you set frames_per_outfile to something smaller so that each output file is smaller, would that help? (See the sketch after this list.)

3) For domains of size 831x881 and 746x766, you could technically be using a lot more processors (see this FAQ regarding reasonable numbers of processors). Have you tried running with more?

4) I would also recommend taking a look at this FAQ to see if anything is useful there - especially regarding a simulation that stops immediately.

5) Are you able to run with only a single domain?
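
To illustrate 2), here is a rough sketch of the kind of change I mean; the values are placeholders for illustration only, not a tested recommendation for your output frequency:

&time_control
 history_interval   = 720, 60, 60, 60,
 frames_per_outfile = 1, 24, 24, 24,    ! e.g. one frame per d01 file, and one day of hourly output per d02/d03/d04 file

For 3), the processor count is set by the total number of MPI tasks you request rather than in the namelist; if I remember the FAQ guidance correctly, the decomposition works best when each patch ends up between roughly 25x25 and 100x100 grid points.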
 
Hi Kelly,

Sorry, I was running a job, but I have moved the files to where you can see them (listed below).
To answer your questions:
1) The different dates were probably because I was running a new job there and the namelist file had changed.
2) Can you please tell me which wrfout_d02 file you looked at? For the successful run, the wrfout_d02 file is 118 GB.
3) The choice of the number of nodes was just to get into the queue quickly. For version 4.2.2, which runs fine, I use 82 nodes (72 for computation + 12 for quilting; see the sketch below).
4) I will take a look at the FAQ link.
5) No, 4.3 crashes with the same error message.
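
For context on 3), the 12 quilting nodes roughly correspond to the commented-out &namelist_quilt lines in the namelist above. A sketch, assuming 36 MPI tasks per Cheyenne node:

&namelist_quilt
 nio_tasks_per_group = 36,   ! one full node of I/O servers per group
 nio_groups          = 12,   ! 12 groups -> 12 nodes dedicated to output quilting
 /

The quilting tasks are in addition to the compute tasks, so the total MPI task count requested from the scheduler has to cover both.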

The output from 4.3 is still in:
/glade/work/cjones/WRF/test/em_labfees

The output from 4.2.2 is in:
/glade/scratch/cjones/labfees/prod/camp/245-1.5km

Thanks,

Charles
 
Hi Charles,
Thanks for sharing those directories. I ran a test using your input, namelist, etc. and got the same segmentation fault. I then ran a test using the default namelist.input file, modifying only what was absolutely necessary to run with your input, and was able to run an hour-long simulation successfully, without any errors. I would suggest grabbing the namelist.input file I used and comparing it to yours. It may take a while, but you should add or remove settings one at a time to see if you can pinpoint the variable that caused the issue. Once you figure that out, let me know. You can find the namelist in /glade/scratch/kkeene/cjones/wrfv43/test/em_real.
 
I was having the same problem and found that the issue was using ECMWF input (operational HRES 137-model-level data in my case) with Noah-MP and WRF V4.3. I was getting SIGSEGV errors on the very first timestep. With everything else the same, changing to WRF 4.2.1 or to the Noah LSM resulted in no errors and the simulation proceeded. The specific MPI processes with the SIGSEGV errors were located over some of the Aleutian Islands, where (among other areas) the ECMWF input has SNOW of 10 m weq. I am wondering if some of the snow-related changes to Noah-MP in this version are responsible, but I have not looked further yet. Such large snow input values did not cause immediate crashes like this in past versions.
 
To follow up on my previous message: when I changed the cap on SNOW in phys/module_sf_noahmpdrv.F from 5000 mm weq back to 2000 mm weq, the segmentation fault errors at the first timestep went away.
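
For anyone wanting to try the same change, the edit is to the cap on the input snow water equivalent in the Noah-MP driver. A sketch of the idea is below; the variable names are paraphrased from memory and may not match the source exactly, so check phys/module_sf_noahmpdrv.F in your own copy:

! Cap the input snow water equivalent at initialization.
! V4.3 raised this limit to 5000 mm weq; reverting it to 2000 mm weq avoided the crash for me.
IF ( SNOW(I,J) > 2000.0 ) THEN                    ! was 5000.0 in V4.3
   SNOWH(I,J) = SNOWH(I,J) * 2000.0 / SNOW(I,J)   ! rescale snow depth to keep the density consistent
   SNOW(I,J)  = 2000.0                            ! SWE in mm
ENDIF

After editing the physics source, wrf.exe has to be recompiled before the change takes effect.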
 
Thanks for the suggestion, dsteinhoff!

Indeed, the MPT ERROR happens when the SNOW cap is 5000; changing it to 2000 works. I should say that this error seems to be case- and scale-dependent: another person in my group used the same combination of physics shown in my namelist in this thread, and it worked without any modification to the snow cap. That run was for a different domain and model resolution.

Cheers,

Charles.
 