
Fatal error in PMPI_Wait


kaelel18

Member
Hi guys. I would like to ask for your assistance. I am running WRF with 3 nested domains on an HPC system. real.exe runs fine, but there seems to be a problem with wrf.exe: it stops all of a sudden. I am using 16 processors for real.exe and 64 for wrf.exe. When I run
Code:
grep -i error rsl*
the rsl.error files show

Code:
WRF TILE   1 IS    234 IE    272 JS    125 JE    165
WRF NUMBER OF TILES =   1
Fatal error in PMPI_Wait: Unknown error class, error stack:
PMPI_Wait(203)........................: MPI_Wait(request=0x4a3b8f4, status=0x7ffd251757e0) failed
MPIR_Wait_impl(100)...................:
MPIDU_Complete_posted_with_error(1137): Process failed

wrf.exe produces only the following files, and then the job terminates.
wrfout_d01_2018-08-01_00:00:00
wrfout_d02_2018-08-01_00:00:00
wrfout_d03_2018-08-01_00:00:00

Here is my namelist.input

Code:
&time_control
run_days                 = 1,
run_hours                = 0,
run_minutes              = 0,
run_seconds              = 0,
start_year               = 2018,     2018,     2018,
start_month              = 08,       08,       08,
start_day                = 01,       01,       01,
start_hour               = 00,       00,       00,
start_minute             = 00,       00,       00,
start_second             = 00,       00,       00,
end_year                 = 2018,     2018,     2018,
end_month                = 08,       08,       08,
end_day                  = 02,       02,       02,
end_hour                 = 00,       00,       00,
end_minute               = 00,       00,       00,
end_second               = 00,       00,       00,
interval_seconds         = 21600,
input_from_file          = .true.,   .true.,   .true.,
history_interval         = 180,       60,       60,
frames_per_outfile       = 1,        1,        1,
restart                  = .false.,
restart_interval         = 5000,
io_form_history          = 2,
io_form_restart          = 2,
io_form_input            = 2,
io_form_boundary         = 2,
debug_level              = 0, 
/


&domains
eta_levels               = 1.000, 0.9947, 0.9895, 0.9843, 0.979,
                           0.9739, 0.9684, 0.9626, 0.9564, 0.9498,
                           0.9426, 0.9348, 0.9262, 0.9167, 0.9062,
                           0.8946, 0.8816, 0.8671, 0.8509, 0.833,
                           0.813, 0.7909, 0.7667, 0.7402, 0.7116,
                           0.6809, 0.6483, 0.6141, 0.5785, 0.5419,
                           0.5047, 0.4672, 0.4299, 0.3931, 0.357,
                           0.322, 0.2883, 0.256, 0.2253, 0.1963,
                           0.169, 0.1435, 0.1171, 0.0952, 0.0753,
                           0.0571, 0.0407, 0.0257, 0.0122, 0.000,
          
time_step                = 45,
time_step_fract_num      = 0,
time_step_fract_den      = 1,
max_dom                  = 3,
e_we                     = 161,      221,      311,
e_sn                     = 211,      261,      331,
e_vert                   = 50,       50,       50,
p_top_requested          = 10286.1,
num_metgrid_levels       = 32,
num_metgrid_soil_levels  = 4,
dx                       = 7500,     1500,      300,
dy                       = 7500,     1500,      300,
grid_id                  = 1,        2,        3,
parent_id                = 1,        1,        2,
i_parent_start           = 1,       46,       77,
j_parent_start           = 1,      117,      105,
parent_grid_ratio        = 1,        5,        5,
parent_time_step_ratio   = 1,        5,        5,
feedback                 = 1,
smooth_option            = 0,
/

&physics
mp_physics               = 6,        6,        6,
ra_lw_physics            = 1,        1,        1,
ra_sw_physics            = 1,        1,        1,
radt                     = 30,       30,       30,
sf_sfclay_physics        = 1,        1,        1,
sf_surface_physics       = 2,        2,        2,
bl_pbl_physics           = 1,        1,        1,
bldt                     = 0,        0,        0,
cu_physics               = 1,        1,        0,
cudt                     = 5,        5,        5,
isfflx                   = 1,
ifsnow                   = 0,
icloud                   = 1,
surface_input_source     = 1,
num_land_cat             = 20,
num_soil_layers          = 4,
sf_urban_physics         = 0,        0,        0,
maxiens                  = 1,
maxens                   = 3,
maxens2                  = 3,
maxens3                  = 16,
ensdim                   = 144,
/
&fdda
/

&dynamics
w_damping                = 0,
diff_opt                 = 1,
km_opt                   = 4,
diff_6th_opt             = 0,        0,        0,
diff_6th_factor          = 0.12,     0.12,     0.12,
base_temp                = 290.,
damp_opt                 = 0,
zdamp                    = 5000.,    5000.,    5000.,
dampcoef                 = 0.2,      0.2,      0.2,
khdif                    = 0,        0,        0,
kvdif                    = 0,        0,        0,
non_hydrostatic          = .true.,   .true.,   .true.,
moist_adv_opt            = 1,        1,        1,
scalar_adv_opt           = 1,        1,        1,
/

&bdy_control
spec_bdy_width           = 5,
spec_zone                = 1,
relax_zone               = 4,
specified                = .true.,  .false.,  .false.,
nested                   = .false.,   .true.,   .true.,
/

&grib2
/

&namelist_quilt
nio_tasks_per_group      = 0,
nio_groups               = 1,
/

I tried looking for answers online, but found nothing definite. Thanks for your help.
 
Hi,
Can you package all of your rsl* files together into one *.TAR file and attach that? Thank you!
 
Thanks for sending those! It looks like the model is stopping due to CFL errors. If you issue
Code:
grep cfl rsl.*

you will see several prints like these:
Code:
rsl.out.0045:d03 2018-08-02_07:40:41+03/05            6 points exceeded cfl=2 in domain d03 at time 2018-08-02_07:40:41+03/05 hours
rsl.out.0045:d03 2018-08-02_07:40:41+03/05  MAX AT i,j,k:          201         213           5 vert_cfl,w,d(eta)=   2.460042       2.515425      5.0999522E-03
rsl.out.0045:d03 2018-08-02_07:40:41+03/05            5 points exceeded cfl=2 in domain d03 at time 2018-08-02_07:40:41+03/05 hours
rsl.out.0045:d03 2018-08-02_07:40:41+03/05  MAX AT i,j,k:          202         213           3 vert_cfl,w,d(eta)=   3.518402      -32.82181      5.1999688E-03
rsl.out.0045:d03 2018-08-02_07:40:41+03/05           11 points exceeded cfl=2 in domain d03 at time 2018-08-02_07:40:41+03/05 hours
rsl.out.0045:d03 2018-08-02_07:40:41+03/05  MAX AT i,j,k:          201         213           4 vert_cfl,w,d(eta)=   3.998977      -50.32920      5.3000450E-03
rsl.out.0045:d03 2018-08-02_07:40:43+01/05           11 points exceeded cfl=2 in domain d03 at time 2018-08-02_07:40:43+01/05 hours
rsl.out.0045:d03 2018-08-02_07:40:43+01/05  MAX AT i,j,k:          202         213           3 vert_cfl,w,d(eta)=   9.816260      -175.5627      5.1999688E-03

Take a look at this FAQ that describes this error, and some ways you can try to fix it.
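
To get an overview of these messages across many rsl files, a small script can help. The following is a minimal Python sketch, not part of WRF. The function name summarize_cfl and the regular expression are assumptions based only on the line format shown above. It counts the "MAX AT i,j,k" lines per domain and keeps the one with the largest vert_cfl.

```python
import re

# Pattern for the "MAX AT i,j,k" lines shown above. This is an assumption
# derived from the sample output, not an official WRF log specification.
MAX_LINE = re.compile(
    r"(?P<domain>d\d+)\s+\S+\s+MAX AT i,j,k:\s+"
    r"(?P<i>\d+)\s+(?P<j>\d+)\s+(?P<k>\d+)\s+"
    r"vert_cfl,w,d\(eta\)=\s*(?P<cfl>\S+)\s+(?P<w>\S+)\s+(?P<deta>\S+)"
)


def summarize_cfl(lines):
    """Count CFL maxima per domain and keep the worst one (hypothetical helper)."""
    summary = {}
    for line in lines:
        match = MAX_LINE.search(line)
        if match is None:
            continue  # not a "MAX AT" line
        entry = summary.setdefault(match["domain"], {"count": 0, "worst": None})
        entry["count"] += 1
        record = {
            "vert_cfl": float(match["cfl"]),
            "w": float(match["w"]),
            "ijk": (int(match["i"]), int(match["j"]), int(match["k"])),
        }
        if entry["worst"] is None or record["vert_cfl"] > entry["worst"]["vert_cfl"]:
            entry["worst"] = record
    return summary


# Three lines copied from the grep output above.
SAMPLE = [
    "rsl.out.0045:d03 2018-08-02_07:40:41+03/05  MAX AT i,j,k:          201         213           5 vert_cfl,w,d(eta)=   2.460042       2.515425      5.0999522E-03",
    "rsl.out.0045:d03 2018-08-02_07:40:41+03/05  MAX AT i,j,k:          202         213           3 vert_cfl,w,d(eta)=   3.518402      -32.82181      5.1999688E-03",
    "rsl.out.0045:d03 2018-08-02_07:40:43+01/05  MAX AT i,j,k:          202         213           3 vert_cfl,w,d(eta)=   9.816260      -175.5627      5.1999688E-03",
]

summary = summarize_cfl(SAMPLE)
for domain, entry in sorted(summary.items()):
    print(domain, entry["count"], entry["worst"])
```

For a real run, the lines could be read from every file matching rsl.* instead of SAMPLE. If the worst points cluster at the same i,j location and low k values, as they do here, that points to a localized problem near the surface.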
 
Hi kwerner. I updated my namelist based on the recommendations given from your link.


Code:
&time_control
run_days                 = 15,
run_hours                = 0,
run_minutes              = 0,
run_seconds              = 0,
start_year               = 2018,     2018,     2018,
start_month              = 08,       08,       08,
start_day                = 01,       01,       01,
start_hour               = 00,       00,       00,
start_minute             = 00,       00,       00,
start_second             = 00,       00,       00,
end_year                 = 2018,     2018,     2018,
end_month                = 08,       08,       08,
end_day                  = 15,       15,       15,
end_hour                 = 00,       00,       00,
end_minute               = 00,       00,       00,
end_second               = 00,       00,       00,
interval_seconds         = 21600,
input_from_file          = .true.,   .true.,   .true.,
history_interval         = 180,       180,       60,
frames_per_outfile       = 1,        1,        1,
restart                  = .false.,
restart_interval         = 5000,
io_form_history          = 2,
io_form_restart          = 2,
io_form_input            = 2,
io_form_boundary         = 2,
debug_level              = 0,
/
&domains
eta_levels               = 1.000, 0.9947, 0.9895, 0.9843, 0.979,
                           0.9739, 0.9684, 0.9626, 0.9564, 0.9498,
                           0.9426, 0.9348, 0.9262, 0.9167, 0.9062,
                           0.8946, 0.8816, 0.8671, 0.8509, 0.833,
                           0.813, 0.7909, 0.7667, 0.7402, 0.7116,
                           0.6809, 0.6483, 0.6141, 0.5785, 0.5419,
                           0.5047, 0.4672, 0.4299, 0.3931, 0.357,
                           0.322, 0.2883, 0.256, 0.2253, 0.1963,
                           0.169, 0.1435, 0.1171, 0.0952, 0.0753,
                           0.0571, 0.0407, 0.0257, 0.0122, 0.000,

time_step                = 20,
time_step_fract_num      = 0,
time_step_fract_den      = 1,
max_dom                  = 3,
e_we                     = 161,      221,      311,
e_sn                     = 211,      261,      331,
e_vert                   = 50,       50,       50,
p_top_requested          = 10286.1,
num_metgrid_levels       = 32,
smooth_cg_topo           = .true.,
num_metgrid_soil_levels  = 4,
dx                       = 7500,     1500,      300,
dy                       = 7500,     1500,      300,
grid_id                  = 1,        2,        3,
parent_id                = 1,        1,        2,
i_parent_start           = 1,       46,       77,
j_parent_start           = 1,      117,      105,
parent_grid_ratio        = 1,        5,        5,
parent_time_step_ratio   = 1,        5,        5,
feedback                 = 1,
smooth_option            = 0,
/
&physics
mp_physics               = 6,        6,        6,
ra_lw_physics            = 1,        1,        1,
ra_sw_physics            = 1,        1,        1,
radt                     = 30,       30,       30,
sf_sfclay_physics        = 1,        1,        1,
sf_surface_physics       = 2,        2,        2,
bl_pbl_physics           = 1,        1,        1,
bldt                     = 0,        0,        0,
cu_physics               = 1,        1,        0,
cudt                     = 5,        5,        5,
isfflx                   = 1,
ifsnow                   = 0,
icloud                   = 1,
surface_input_source     = 1,
num_land_cat             = 20,
num_soil_layers          = 4,
sf_urban_physics         = 0,        0,        0,
maxiens                  = 1,
maxens                   = 3,
maxens2                  = 3,
maxens3                  = 16,
ensdim                   = 144,
/
&fdda
/

&dynamics
w_damping                = 0,
epssm                    = 0.5,
diff_opt                 = 1,
km_opt                   = 4,
diff_6th_opt             = 0,        0,        0,
diff_6th_factor          = 0.12,     0.12,     0.12,
base_temp                = 290.,
damp_opt                 = 0,
zdamp                    = 5000.,    5000.,    5000.,
dampcoef                 = 0.2,      0.2,      0.2,
khdif                    = 0,        0,        0,
kvdif                    = 0,        0,        0,
non_hydrostatic          = .true.,   .true.,   .true.,
moist_adv_opt            = 1,        1,        1,
scalar_adv_opt           = 1,        1,        1,
/

&bdy_control
spec_bdy_width           = 5,
spec_zone                = 1,
relax_zone               = 4,
specified                = .true.,  .false.,  .false.,
nested                   = .false.,   .true.,   .true.,
/

&grib2
/

&namelist_quilt
nio_tasks_per_group      = 0,
nio_groups               = 1,
/

Still, I am receiving the same error. What seems to be the reason for this?

Code:
rsl.error.0022:d03 2018-08-01_00:01:33+03/05            4 points exceeded cfl=2 in domain d03 at time 2018-08-01_00:01:33+03/05 hours
rsl.error.0022:d03 2018-08-01_00:01:33+03/05  MAX AT i,j,k:          202         213           7 vert_cfl,w,d(eta)=   2.858521       91.87828      5.8000088E-03
rsl.error.0022:d03 2018-08-01_00:01:34+02/05            8 points exceeded cfl=2 in domain d03 at time 2018-08-01_00:01:34+02/05 hours
rsl.error.0022:d03 2018-08-01_00:01:34+02/05  MAX AT i,j,k:          202         213          10 vert_cfl,w,d(eta)=   4.605796      -157.4470      7.2000027E-03
rsl.out.0022:d03 2018-08-01_00:01:33+03/05            4 points exceeded cfl=2 in domain d03 at time 2018-08-01_00:01:33+03/05 hours
rsl.out.0022:d03 2018-08-01_00:01:33+03/05  MAX AT i,j,k:          202         213           7 vert_cfl,w,d(eta)=   2.858521       91.87828      5.8000088E-03
rsl.out.0022:d03 2018-08-01_00:01:34+02/05            8 points exceeded cfl=2 in domain d03 at time 2018-08-01_00:01:34+02/05 hours
rsl.out.0022:d03 2018-08-01_00:01:34+02/05  MAX AT i,j,k:          202         213          10 vert_cfl,w,d(eta)=   4.605796      -157.4470      7.2000027E-03
 
Can you attach your namelist.wps file so that I can see your domain set-up? To attach files, when you are in the text edit box, click on the tab below with 3 horizontal lines. This will allow you to attach files. Thanks.
 
Hi,
Thank you for sending that. Your domains look okay, and it doesn't look like you have any high terrain in the domain. Is there extremely strong convection in your simulation? Do you mind attaching the met_em* files for each domain for the initial time period? If the files are too large to attach, please see the home page of this forum for instructions regarding uploading large files. Thanks!
 
Hi kwerner. When you said the initial time period, did you mean the first time step of the simulation? If so, the met_em files are attached. Once again, thanks!
 

Attachment: file.tar.gz (103.1 MB)

Yes, that is what I meant, and thank you for sending those, but I actually meant to ask for the first 2 time periods. Can you send the next time period for each domain (so, met_em.d0*.2018-08-01_06:00:00)? I apologize for the inconvenience.
 
Hi,
I think the problem is your p_top_requested setting in your namelist. You have it set as:
Code:
p_top_requested          = 10286.1,
which is pretty low in the atmosphere, and also an oddly specific value. The default for this setting is 5000 Pa. I was able to reproduce your problem with your namelist as you had it, but when I set this to 5000, the run completed (at least for the 6 hours of input data I had). I was also able to increase time_step back up to 45, since such a low value was no longer necessary once p_top_requested was changed. This will allow your run to move faster. Try that out and let me know if it works for you.
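
For context on the time_step values discussed here: each nest runs with its parent's time step divided by parent_time_step_ratio, and a commonly cited WRF rule of thumb puts the coarse-domain time_step at roughly 6 * dx, with dx in km. The Python sketch below works this out for the namelist in this thread. The function name nest_time_steps is hypothetical, and the rule of thumb is general guidance, not a guarantee of stability.

```python
def nest_time_steps(time_step, parent_id, parent_time_step_ratio):
    """Return {grid_id: time step in seconds} for each domain (hypothetical helper).

    Assumes grid ids run 1..N in list order and every parent is listed
    before its children, as in the namelist from this thread.
    """
    steps = {}
    pairs = zip(parent_id, parent_time_step_ratio)
    for grid, (parent, ratio) in enumerate(pairs, start=1):
        if grid == 1:
            steps[grid] = float(time_step)  # coarse domain uses time_step directly
        else:
            steps[grid] = steps[parent] / ratio
    return steps


# Values taken from the &domains section posted in this thread.
dx_m = [7500, 1500, 300]
parent_id = [1, 1, 2]
parent_time_step_ratio = [1, 5, 5]

# Rule-of-thumb upper bound for the coarse domain: 6 * dx (dx in km).
suggested = 6 * dx_m[0] / 1000.0
print("rule-of-thumb time_step for d01:", suggested, "s")

for time_step in (45, 20):
    steps = nest_time_steps(time_step, parent_id, parent_time_step_ratio)
    print("time_step =", time_step, "->", steps)
```

With time_step = 45 the innermost domain advances in 1.8 s steps, and with time_step = 20 in 0.8 s steps, so lowering time_step makes every domain take more steps and the run correspondingly slower.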
 
Hi kwerner, just wanted to give you an update. Regarding p_top_requested, I copied the value that WRF Domain Wizard gave for these eta levels. I tried running again, this time with p_top_requested = 5000, but 5 minutes into the simulation I get CFL errors.

When I run grep error rsl.*, I get these errors:

Code:
rsl.error.0023:Fatal error in PMPI_Wait: Unknown error class, error stack:
rsl.error.0023:MPIDU_Complete_posted_with_error(1137): Process failed
rsl.error.0031:Fatal error in PMPI_Wait: Unknown error class, error stack:
rsl.error.0031:MPIDU_Complete_posted_with_error(1137): Process failed

When I run grep cfl rsl.*, I get this:
Code:
rsl.error.0041:d03 2018-07-31_00:02:05+03/05            2 points exceeded cfl=2 in domain d03 at time 2018-07-31_00:02:05+03/05 hours
rsl.error.0041:d03 2018-07-31_00:02:05+03/05  MAX AT i,j,k:          198         180           6 vert_cfl,w,d(eta)=   2.172715      -39.81982      5.5000186E-03
rsl.error.0041:d03 2018-07-31_00:02:05+03/05            5 points exceeded cfl=2 in domain d03 at time 2018-07-31_00:02:05+03/05 hours
rsl.error.0041:d03 2018-07-31_00:02:05+03/05  MAX AT i,j,k:          198         180           6 vert_cfl,w,d(eta)=   2.621895      -12.91800      5.5000186E-03
rsl.error.0041:d03 2018-07-31_00:02:05+03/05            1 points exceeded cfl=2 in domain d03 at time 2018-07-31_00:02:05+03/05 hours
rsl.error.0041:d03 2018-07-31_00:02:05+03/05  MAX AT i,j,k:          198         180           6 vert_cfl,w,d(eta)=   2.039395       2.417800      5.5000186E-03
rsl.error.0041:d03 2018-07-31_00:02:07+01/05            6 points exceeded cfl=2 in domain d03 at time 2018-07-31_00:02:07+01/05 hours

I am attaching the rsl files and namelist again for your review. Thanks!
 

Attachments: rsl.tar (1.2 MB), namelist.input (4.7 KB)

Hi,
I failed to mention that when I ran the test (that was successful), I removed the 'eta_levels' listed in the namelist - I apologize for that. As a test, can you remove that and see if you still get the CFL errors? If so, then it's likely also a problem with those levels.
 
Hi kwerner, thank you for your response. Removing the specified eta levels actually worked. Another question comes to mind, though: how do I set my eta levels so that I won't get the same errors as above? Thanks
 
If you run without setting your own eta_levels, you can then view the "good" levels that are provided automatically by the model by issuing this command:
Code:
ncdump -v ZNW wrfout_d01_YYYY-MM-DD_HH:mm:ss

(obviously using the actual file name, in the place of the wrfout file listed above in the command).

You can then check the levels you are using to see where you deviate the most from the set the model provides. In general, you don't want your levels to be too close together. If you have a specific need for specifying your own levels, then using the eta_levels variable can be useful, but if not, you should probably just opt to not use eta_levels, and use the levels provided by the model.
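
One way to see where levels are closest together is to compute the thickness of each layer in eta. The Python sketch below does this for the eta_levels list from the original namelist. The function name layer_thickness is hypothetical, and no particular thickness threshold is implied. The same function could be applied to the ZNW values printed by ncdump for comparison.

```python
# eta_levels copied from the namelist.input posted at the start of this thread.
ETA_LEVELS = [
    1.000, 0.9947, 0.9895, 0.9843, 0.979,
    0.9739, 0.9684, 0.9626, 0.9564, 0.9498,
    0.9426, 0.9348, 0.9262, 0.9167, 0.9062,
    0.8946, 0.8816, 0.8671, 0.8509, 0.833,
    0.813, 0.7909, 0.7667, 0.7402, 0.7116,
    0.6809, 0.6483, 0.6141, 0.5785, 0.5419,
    0.5047, 0.4672, 0.4299, 0.3931, 0.357,
    0.322, 0.2883, 0.256, 0.2253, 0.1963,
    0.169, 0.1435, 0.1171, 0.0952, 0.0753,
    0.0571, 0.0407, 0.0257, 0.0122, 0.000,
]


def layer_thickness(levels):
    """Return (k, d_eta) pairs, with k counted from 1 at the surface.

    Hypothetical helper: d_eta is the difference between consecutive levels.
    """
    pairs = list(zip(levels, levels[1:]))
    if any(upper >= lower for lower, upper in pairs):
        raise ValueError("eta levels must decrease strictly from 1.0 to 0.0")
    return [(k, lower - upper) for k, (lower, upper) in enumerate(pairs, start=1)]


layers = layer_thickness(ETA_LEVELS)

# Show the five thinnest layers, thinnest first.
thinnest = sorted(layers, key=lambda item: item[1])[:5]
for k, d_eta in thinnest:
    print(f"layer {k}: d(eta) = {d_eta:.4f}")
```

For this list the thinnest layers are the five lowest ones, at about 0.005 in eta, which is consistent with the d(eta) values of roughly 5.1E-03 to 5.3E-03 reported at k = 3 to 5 in the CFL messages earlier in the thread.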
 
I am getting a similar error while running wrf.exe.
However, I do not have any CFL errors, and my p_top_requested is 5000.
The error:
rsl.error.0080:Fatal error in PMPI_Wait: Unknown error class, error stack:
rsl.error.0080:MPIDU_Complete_posted_with_error(1137): Process failed

I am using 64 processors. The wrfout files stop being generated after the first time step, but the job does not stop immediately.
Can anyone help me with this?
 
@ashishnavale,
Can you please package your rsl files together in a single *.tar file (not *.rar - we are unable to open that file type), and attach them? Please also send your namelist.input file and let me know which version of the WRF code you are using. Thanks!
 