Runtime error: Non-fatal temporary exhaustion of send tid dma descriptors

This post was from a previous version of the WRF&MPAS-A Support Forum. New replies have been disabled. If you have follow-up questions related to this post, please start a new thread from the forum home page.

SIGSEV

New member
Hi,
I'm doing LES for real-case simulations on big domains. I have an LES with dx=200m on a 555x835x81 grid, which runs smoothly. However, if I run a simulation with dx=40m on a 1550x1700x81 grid, I get these error messages on my HPC system:

"Non-fatal temporary exhaustion of send tid dma descriptors (elapsed=15.649s, source LID=0xaf3/context=8, count=1) (err=0)"

which keep repeating in my rsl files, but WRF is not crashing. I'm running on 2048 CPUs with 64GB each, so memory or hardware should not be causing any trouble, and as indicated above, other simulations run smoothly. Furthermore, I've already done simulations with the same dx=40m domain with WRF 3.9.1 and WRF 4.0.3, but after the change to V4.1 these errors keep appearing.

Has anybody seen these error messages before?

For reference, this is the namelist for the dx=40m LES
Code:
 &time_control
 run_days                            = 0,
 run_hours                           = 24,
 run_minutes                         = 0,
 run_seconds                         = 0,
 start_year                          = 2017,
 start_month                         = 11,
 start_day                           = 03,
 start_hour                          = 18,
 start_minute                        = 00,
 start_second                        = 00,
 end_year                            = 2017,
 end_month                           = 11,
 end_day                             = 04,
 end_hour                            = 18,
 end_minute                          = 00,
 end_second                          = 00,
 interval_seconds                    = 1800,
 input_from_file                     = .true.,
 fine_input_stream                   = 0,
 io_form_auxinput2                   = 2,
 history_interval                    = 30,
 frames_per_outfile                  = 1,
 restart                             = .false.,
 restart_interval                    = 60,
 io_form_history                     = 11,
 io_form_restart                     = 102,
 io_form_input                       = 2,
 io_form_boundary                    = 2,
 auxhist7_outname                    = "fastout_d<domain>_<date>"
 auxhist7_interval                   = 10,
 frames_per_auxhist7                 = 1,
 io_form_auxhist7                    = 11,
 auxhist8_outname                    = "meanout_d<domain>_<date>"
 auxhist8_interval                   = 30,
 frames_per_auxhist8                 = 1,
 io_form_auxhist8                    = 11,
 debug_level                         = 0,
 iofields_filename                   = "LES_IO.txt",
 ignore_iofields_warning             = .true.,
 /
&domains
 time_step                           = 0,
 time_step_fract_num                 = 1,
 time_step_fract_den                 = 4,
 max_dom                             = 1,
 e_we                                = 1551,
 e_sn                                = 1701,
 e_vert                              = 81, 81,
 p_top_requested = 4000.
 eta_levels      = 1.0000, 0.9974, 0.9949, 0.9924, 0.9898,
                   0.9871, 0.9843, 0.9814, 0.9784, 0.9752,
                   0.9718, 0.9681, 0.9642, 0.9600, 0.9555,
                   0.9507, 0.9455, 0.9399, 0.9339, 0.9274,
                   0.9206, 0.9132, 0.9053, 0.8970, 0.8881,
                   0.8786, 0.8686, 0.8580, 0.8468, 0.8350,
                   0.8226, 0.8096, 0.7960, 0.7818, 0.7670,
                   0.7515, 0.7355, 0.7189, 0.7017, 0.6840,
                   0.6658, 0.6470, 0.6278, 0.6081, 0.5880,
                   0.5676, 0.5467, 0.5256, 0.5043, 0.4827,
                   0.4610, 0.4392, 0.4173, 0.3954, 0.3736,
                   0.3519, 0.3304, 0.3091, 0.2881, 0.2674,
                   0.2472, 0.2274, 0.2081, 0.1904, 0.1811,
                   0.1717, 0.1540, 0.1373, 0.1215, 0.1066,
                   0.0927, 0.0798, 0.0677, 0.0565, 0.0461,
                   0.0366, 0.0278, 0.0198, 0.0125, 0.0059,
                   0.0,
 num_metgrid_levels                  = 138,
 num_metgrid_soil_levels             = 4,
 dx                                  = 40,
 dy                                  = 40,
 grid_id                             = 1,
 parent_id                           = 0,
 i_parent_start                      = 1,
 j_parent_start                      = 1,
 parent_grid_ratio                   = 1,
 parent_time_step_ratio              = 1,
 feedback                            = 0,
 smooth_option                       = 0,
 max_ts_locs                         = 200,
 max_ts_level                        = 50,
 interp_theta                        = .true.,
 interp_type                         = 2,
 extrap_type                         = 2,
 use_surface                         = .true.,
 smooth_cg_topo                      = .true.,
 tslist_unstagger_winds              = .true.,
/

 &physics
 mp_physics                          = 8,
 ra_lw_physics                       = 4,
 ra_sw_physics                       = 4,
 radt                                = 5,
 sf_sfclay_physics                   = 1,
 sf_surface_physics                  = 4,
 bl_pbl_physics                      = 0,
 bldt                                = 0,
 cu_physics                          = 0,
 cudt                                = 0,
 shcu_physics                        = 0,
 isfflx                              = 1,
 ifsnow                              = 1,
 icloud                              = 1,
 surface_input_source                = 3,
 num_soil_layers                     = 4,
 num_land_cat                        = 33,
 sf_urban_physics                    = 0,
 slope_rad                           = 1,
 topo_shading                        = 1,
 /

 &dynamics
 w_damping                           = 1,
 diff_opt                            = 2,
 km_opt                              = 2,
 diff_6th_opt                        = 0,
 diff_6th_factor                     = 0.12,
 base_temp                           = 290.
 damp_opt                            = 3,
 zdamp                               = 8000.,
 dampcoef                            = 0.2,
 khdif                               = 0,
 kvdif                               = 0,
 non_hydrostatic                     = .true.,
 moist_adv_opt                       = 1,
 scalar_adv_opt                      = 1,
 use_theta_m                         = 1,
 time_step_sound                     = 6,
 m_opt                               = 1,     1,
 epssm                               = 0.9,
 smdiv                               = 0.2,
 emdiv                               = 0.02,
 sfs_opt                             = 0,
 use_input_w                         = .false.,
 c_k                                 = 0.08,
 do_avgflx_em                        = 1,
 mix_isotropic                       = 1,
/

 &bdy_control
 spec_bdy_width                      = 20,
 spec_zone                           = 1,
 relax_zone                          = 19,
 specified                           = .true.,
 nested                              = .false.,
 have_bcs_moist                      = .true.,
 have_bcs_scalar                     = .true.,
 /

 &grib2
 /

 &namelist_quilt
 nio_tasks_per_group = 0,
 nio_groups = 0,
 /
 
Setting a higher debug_level indicates that the model always gets stuck when
"d01 2017-11-03_18:00:00 calling inc/HALO_EM_SCALAR_E_5_inline.inc"
is called.
I cannot understand why other halos work but this specific one fails.
 
Hi,
When you say that the model seems to get stuck on that particular halo, is the run still completing without errors?
We don't typically advise setting debug_level, as this sometimes can actually introduce problems, making the RSL files incredibly large, and we rarely get much useful information out of it. If the runs are completing, I would advise checking through some of the output to see if things look reasonable. The message you are seeing may be something that a systems administrator at your institution could help to diagnose, as it's likely specific to your particular environment.
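If it does turn out that the run is hanging rather than finishing, one rough, untested check is to compare the last line each MPI task wrote and to count how often that DMA-descriptor warning appears per task. This is only a sketch and assumes the standard rsl.out.* / rsl.error.* file names in your run directory:
Code:
 # Show the last message each task printed; if one halo call dominates,
 # the whole job is probably blocked in that communication step.
 for f in rsl.out.*; do tail -n 1 "$f"; done | sort | uniq -c | sort -rn | head

 # Count the DMA-descriptor warnings per task (switch to rsl.out.* if
 # the message lands there instead).
 grep -c "exhaustion of send tid dma descriptors" rsl.error.* | sort -t: -k2 -rn | head

 # Optionally, where gdb is available, your admins could attach it to one
 # hung wrf.exe process on a compute node to see which MPI call it is
 # blocked in (<PID> is a placeholder):
 gdb -p <PID> -batch -ex bt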
 
Hi,
The WRF job does not abort on the HPC system; it stops at that particular halo. As soon as this halo is reached in the code, the model stalls but does not die, so the CPUs are idling but still reserved on the HPC system. The model therefore does not complete, and I only get the first wrfout file, for the initialization time.
I'm already in contact with the admins, but so far they have no idea about this error message, so I thought I'd try my luck here :)
I'll keep you posted if we can find and solve the problem.
 
Hi,
Are you familiar with using git? If so, you could test out the commits that went into the code between V4.0.3 and V4.1 to see if you can track down the specific change that caused the new problem. In case you're interested, you can find the WRF repository here. Another option would be to try simplifying your run by using your same domain and dates, but using a namelist that is closer to the default namelist. You can then slowly add in options that you are using for this failed run to see if you can pinpoint what change causes the problem.
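If you do go the git route, a bisect is usually the quickest way to narrow the commits between the two releases down to a single one. This is just a rough sketch, assuming a clone of the public wrf-model/WRF GitHub repository and that the release tags are named v4.0.3 and v4.1; at every step you would rebuild and rerun the failing 40 m case before telling git the result:
Code:
 git clone https://github.com/wrf-model/WRF.git && cd WRF
 git bisect start
 git bisect bad  v4.1      # first release where the hang shows up (tag name assumed)
 git bisect good v4.0.3    # last release known to work (tag name assumed)
 # git now checks out a commit roughly halfway between the two tags;
 # rebuild (e.g. ./clean -a && ./configure && ./compile em_real) and
 # rerun the 40 m case, then report the result with ONE of:
 #   git bisect good       # if this build ran fine
 #   git bisect bad        # if it hung again
 # Repeat until git prints the first bad commit, then clean up:
 git bisect reset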
 