Runtime error: Non-fatal temporary exhaustion of send tid dma descriptors

Topics specifically related to running the model in an HPC environment

Post by SIGSEV » Mon Jul 15, 2019 8:52 am

Hi,
I'm running LES for real-case simulations on large domains. I have an LES with dx=200m on a 555x835x81 grid which runs smoothly. However, if I run a simulation with dx=40m on a 1550x1700x81 grid, I get these error messages on my HPC system:

"Non-fatal temporary exhaustion of send tid dma descriptors (elapsed=15.649s, source LID=0xaf3/context=8, count=1) (err=0)"

which keep repeating in my rsl files, but WRF is not crashing. I'm running on 2048 CPUs with 64GB each, so memory or hardware should not be causing any trouble and, as indicated above, other simulations run smoothly. Furthermore, I've already done simulations with the same dx=40m domain with WRF 3.9.1 and WRF 4.0.3, but after the change to V4.1 these errors keep appearing.
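
To see how often the message repeats and whether it comes from a few sources or from all of them, the rsl files can be scanned with a short script. The following is a minimal sketch in Python and not part of WRF: the regular expression is built from the single message quoted above (other variants may need a looser pattern), the helper names are illustrative, and the file pattern rsl.error.* assumes the default WRF log names in the run directory.

```python
import glob
import re
from collections import Counter

# Pattern derived from the one message quoted in this thread (assumption:
# all occurrences share this layout).
MSG_RE = re.compile(
    r"Non-fatal temporary exhaustion of send tid dma descriptors "
    r"\(elapsed=(?P<elapsed>[\d.]+)s, "
    r"source LID=(?P<lid>0x[0-9a-fA-F]+)/context=(?P<context>\d+), "
    r"count=(?P<count>\d+)\)"
)


def parse_messages(lines):
    """Return one dict per matching line, with numeric fields converted."""
    hits = []
    for line in lines:
        m = MSG_RE.search(line)
        if m:
            hits.append({
                "elapsed": float(m.group("elapsed")),
                "lid": m.group("lid"),
                "context": int(m.group("context")),
                "count": int(m.group("count")),
            })
    return hits


def hits_per_lid(hits):
    """Count how many messages were reported for each source LID."""
    return Counter(h["lid"] for h in hits)


if __name__ == "__main__":
    all_hits = []
    # Assumed default WRF log file names; adjust to your run directory.
    for path in sorted(glob.glob("rsl.error.*")):
        with open(path, errors="replace") as fh:
            all_hits.extend(parse_messages(fh))
    for lid, n in hits_per_lid(all_hits).most_common():
        print(lid, n)
```

A per-LID count like this may be useful to pass on to the system administrators, since it shows whether the messages are concentrated on particular sources.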

Has anybody seen these error messages before?

For reference, this is the namelist for the dx=40m LES

 &time_control
 run_days                            = 0,
 run_hours                           = 24,
 run_minutes                         = 0,
 run_seconds                         = 0,
 start_year                          = 2017,
 start_month                         = 11,
 start_day                           = 03,
 start_hour                          = 18,
 start_minute                        = 00,
 start_second                        = 00,
 end_year                            = 2017,
 end_month                           = 11,
 end_day                             = 04,
 end_hour                            = 18,
 end_minute                          = 00,
 end_second                          = 00,
 interval_seconds                    = 1800,
 input_from_file                     = .true.,
 fine_input_stream                   = 0,
 io_form_auxinput2                   = 2,
 history_interval                    = 30,
 frames_per_outfile                  = 1,
 restart                             = .false.,
 restart_interval                    = 60,
 io_form_history                     = 11,
 io_form_restart                     = 102,
 io_form_input                       = 2,
 io_form_boundary                    = 2,
 auxhist7_outname                    = "fastout_d<domain>_<date>"
 auxhist7_interval                   = 10,
 frames_per_auxhist7                 = 1,
 io_form_auxhist7                    = 11,
 auxhist8_outname                    = "meanout_d<domain>_<date>"
 auxhist8_interval                   = 30,
 frames_per_auxhist8                 = 1,
 io_form_auxhist8                    = 11,
 debug_level                         = 0,
 iofields_filename                   = "LES_IO.txt",
 ignore_iofields_warning             = .true.,
 /
&domains
 time_step                           = 0,
 time_step_fract_num                 = 1,
 time_step_fract_den                 = 4,
 max_dom                             = 1,
 e_we                                = 1551,
 e_sn                                = 1701,
 e_vert                              = 81, 81,
 p_top_requested = 4000.
 eta_levels      = 1.0000, 0.9974, 0.9949, 0.9924, 0.9898,
                   0.9871, 0.9843, 0.9814, 0.9784, 0.9752,
                   0.9718, 0.9681, 0.9642, 0.9600, 0.9555,
                   0.9507, 0.9455, 0.9399, 0.9339, 0.9274,
                   0.9206, 0.9132, 0.9053, 0.8970, 0.8881,
                   0.8786, 0.8686, 0.8580, 0.8468, 0.8350,
                   0.8226, 0.8096, 0.7960, 0.7818, 0.7670,
                   0.7515, 0.7355, 0.7189, 0.7017, 0.6840,
                   0.6658, 0.6470, 0.6278, 0.6081, 0.5880,
                   0.5676, 0.5467, 0.5256, 0.5043, 0.4827,
                   0.4610, 0.4392, 0.4173, 0.3954, 0.3736,
                   0.3519, 0.3304, 0.3091, 0.2881, 0.2674,
                   0.2472, 0.2274, 0.2081, 0.1904, 0.1811,
                   0.1717, 0.1540, 0.1373, 0.1215, 0.1066,
                   0.0927, 0.0798, 0.0677, 0.0565, 0.0461,
                   0.0366, 0.0278, 0.0198, 0.0125, 0.0059,
                   0.0,
 num_metgrid_levels                  = 138,
 num_metgrid_soil_levels             = 4,
 dx                                  = 40,
 dy                                  = 40,
 grid_id                             = 1,
 parent_id                           = 0,
 i_parent_start                      = 1,
 j_parent_start                      = 1,
 parent_grid_ratio                   = 1,
 parent_time_step_ratio              = 1,
 feedback                            = 0,
 smooth_option                       = 0,
 max_ts_locs                         = 200,
 max_ts_level                        = 50,
 interp_theta                        = .true.,
 interp_type                         = 2,
 extrap_type                         = 2,
 use_surface                         = .true.,
 smooth_cg_topo                      = .true.,
 tslist_unstagger_winds              = .true.,
/

 &physics
 mp_physics                          = 8,
 ra_lw_physics                       = 4,
 ra_sw_physics                       = 4,
 radt                                = 5,
 sf_sfclay_physics                   = 1,
 sf_surface_physics                  = 4,
 bl_pbl_physics                      = 0,
 bldt                                = 0,
 cu_physics                          = 0,
 cudt                                = 0,
 shcu_physics                        = 0,
 isfflx                              = 1,
 ifsnow                              = 1,
 icloud                              = 1,
 surface_input_source                = 3,
 num_soil_layers                     = 4,
 num_land_cat                        = 33,
 sf_urban_physics                    = 0,
 slope_rad                           = 1,
 topo_shading                        = 1,
 /

 &dynamics
 w_damping                           = 1,
 diff_opt                            = 2,
 km_opt                              = 2,
 diff_6th_opt                        = 0,
 diff_6th_factor                     = 0.12,
 base_temp                           = 290.
 damp_opt                            = 3,
 zdamp                               = 8000.,
 dampcoef                            = 0.2,
 khdif                               = 0,
 kvdif                               = 0,
 non_hydrostatic                     = .true.,
 moist_adv_opt                       = 1,
 scalar_adv_opt                      = 1,
 use_theta_m                         = 1,
 time_step_sound                     = 6,
 m_opt                               = 1,     1,
 epssm                               = 0.9,
 smdiv                               = 0.2,
 emdiv                               = 0.02,
 sfs_opt                             = 0,
 use_input_w                         = .false.,
 c_k                                 = 0.08,
 do_avgflx_em                        = 1,
 mix_isotropic                       = 1,
/

 &bdy_control
 spec_bdy_width                      = 20,
 spec_zone                           = 1,
 relax_zone                          = 19,
 specified                           = .true.,
 nested                              = .false.,
 have_bcs_moist                      = .true.,
 have_bcs_scalar                     = .true.,
 /

 &grib2
 /

 &namelist_quilt
 nio_tasks_per_group = 0,
 nio_groups = 0,
 /
 

Post by SIGSEV » Mon Jul 15, 2019 12:15 pm

Setting a higher debug_level indicates that the model always gets stuck when
"d01 2017-11-03_18:00:00 calling inc/HALO_EM_SCALAR_E_5_inline.inc"
is called.
I cannot understand why other halos work but this specific one fails.
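
With a higher debug_level, one way to check whether every rank stalls at the same halo is to compare the last "calling ..." line in each rsl file. Below is a rough Python sketch, assuming the log line format matches the one quoted above and that the logs use the default rsl.error.* names; the function names are illustrative only.

```python
import glob
import re
from collections import defaultdict

# Matches lines such as:
#   d01 2017-11-03_18:00:00 calling inc/HALO_EM_SCALAR_E_5_inline.inc
CALL_RE = re.compile(r"^\s*d(\d+)\s+(\S+)\s+calling\s+(\S+)")


def last_call(lines):
    """Return (model_time, called_item) of the last 'calling' line, or None."""
    last = None
    for line in lines:
        m = CALL_RE.match(line)
        if m:
            last = (m.group(2), m.group(3))
    return last


def group_ranks(last_calls):
    """Group log names by the last call they reported."""
    groups = defaultdict(list)
    for name, call in last_calls.items():
        groups[call].append(name)
    return dict(groups)


if __name__ == "__main__":
    last_calls = {}
    # Assumed default WRF log file names; adjust to your run directory.
    for path in sorted(glob.glob("rsl.error.*")):
        with open(path, errors="replace") as fh:
            last_calls[path] = last_call(fh)
    for call, names in group_ranks(last_calls).items():
        print(call, len(names), "file(s)")
```

If all files end at the same call, the stall is collective; if only a few ranks end earlier than the rest, those ranks are the ones worth looking at first.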

Post by kwerner » Mon Jul 15, 2019 7:33 pm

Hi,
When you say that the model seems to get stuck on that particular halo, is the run still completing without errors?
We don't typically advise setting debug_level: we rarely get much useful information out of it, it makes the RSL files incredibly large, and it can sometimes introduce problems of its own. If the runs are completing, I would advise checking through some of the output to see if things look reasonable. The message you are seeing may be something that a systems administrator at your institution could help to diagnose, as it's likely specific to your particular environment.
NCAR/MMM

Post by SIGSEV » Tue Jul 16, 2019 8:47 am

Hi,
the WRF job does not abort on the HPC system, but rather stops at that particular halo. As soon as this halo is reached in the code, the model stalls but does not die, so the CPUs are idling but still reserved on the HPC system. The model therefore does not complete, and I only get the first wrfout file for the initialization time.
I'm already in contact with the admins, but so far they have no idea what causes this error message, so I thought I'd try my luck here :)
I'll keep you posted if we can find and solve the problem.

Post by kwerner » Thu Jul 18, 2019 8:19 pm

Hi,
Are you familiar with using git? If so, you could test out the commits that went into the code between V4.0.3 and V4.1 to see if you can track down the specific change that caused the new problem. The WRF source code is kept in a public git repository, in case you're interested. Another option would be to simplify your run by keeping the same domain and dates, but using a namelist that is closer to the default namelist. You can then slowly add in the options that you are using for this failed run to see if you can pinpoint which change causes the problem.
NCAR/MMM
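
The suggestion to start from a near-default namelist and add options back one at a time is easier to follow with a list of what actually differs between the two files. Here is a minimal Python sketch under stated assumptions: parse_namelist and diff_namelists are hypothetical helpers (not a WRF or Fortran tool), the parser is simplified (one "name = value" per line with optional continuation lines, no commas or "!" inside quoted strings), and the baseline values in the example are placeholders, not actual WRF defaults.

```python
def _tokens(text):
    """Split a comma-separated value list into stripped, non-empty tokens."""
    return [t.strip() for t in text.split(",") if t.strip()]


def parse_namelist(text):
    """Parse a simple Fortran namelist into {group: {name: [tokens]}}."""
    groups = {}
    group = None
    key = None
    for raw in text.splitlines():
        line = raw.split("!", 1)[0].strip()  # drop comments (simplified)
        if not line:
            continue
        if line.startswith("&"):
            group = line[1:].strip().lower()
            groups[group] = {}
            key = None
        elif line == "/":
            group = None
            key = None
        elif group is not None:
            if "=" in line:
                name, value = line.split("=", 1)
                key = name.strip().lower()
                groups[group][key] = _tokens(value)
            elif key is not None:
                # Continuation line of a multi-line list such as eta_levels.
                groups[group][key].extend(_tokens(line))
    return groups


def diff_namelists(baseline, failing):
    """List (group, name, baseline_value, failing_value) for every
    entry in the failing namelist that differs from the baseline."""
    changes = []
    for group, entries in failing.items():
        base_entries = baseline.get(group, {})
        for name, value in entries.items():
            if base_entries.get(name) != value:
                changes.append((group, name, base_entries.get(name), value))
    return changes


if __name__ == "__main__":
    # Placeholder baseline values for illustration only.
    baseline_text = """
     &dynamics
     diff_opt = 1,
     km_opt   = 4,
     /
    """
    # Values taken from the namelist posted in this thread.
    failing_text = """
     &dynamics
     diff_opt      = 2,
     km_opt        = 2,
     mix_isotropic = 1,
     /
    """
    for change in diff_namelists(parse_namelist(baseline_text),
                                 parse_namelist(failing_text)):
        print(change)
```

Each reported entry is one candidate to re-enable on its own; testing them individually, or in halves, narrows down which option triggers the stall.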
