wrf.exe keeps running without any progress when increasing the number of processors

Topics specifically related to running the model in an HPC environment
Xinxu Zhao
Posts: 1
Joined: Wed Jan 13, 2021 2:34 pm

wrf.exe keeps running without any progress when increasing the number of processors

Post by Xinxu Zhao » Fri Jan 15, 2021 9:00 am

Hi All,
My simulation is based on WRF v3.9.1.1. Since I want to run another long-period test (~2 years), I would like to use more processors (i.e., more nodes) to speed up my run. The key settings in namelist.input related to the time step and domain size are:

Code:

&time_control
 run_days                            = 0,
 run_hours                           = 24,
 run_minutes                         = 0,
 run_seconds                         = 0, 
 start_year                          = 2018,   2018,   2018,
 start_month                         = 08,   08,   08,
 start_day                           = 01,   01,   01,
 start_hour                          = 00,   00,   00,
 start_minute                        = 00,   00,   00,
 start_second                        = 00,   00,   00,
 end_year                            = 2018,   2018,   2018,
 end_month                           = 08,   08,   08,
 end_day                             = 02,   02,   02,
 end_hour                            = 00,   00,   00,
 end_minute                          = 00,   00,   00,
 end_second                          = 00,   00,   00,
 interval_seconds                    = 21600,
 input_from_file                     = .true.,.true.,.true.,
 history_interval                    = 180,  60,   15,
 frames_per_outfile                  = 1000, 1000, 1000,
 write_hist_at_0h_rst		     = .true.,
 restart                             = .false.,
 restart_interval                    = 360,
 override_restart_timers             = .true.,
 rst_outname                         = 'wrfrst_d<domain>_<date>',
 history_outname                     = 'wrfout_d<domain>_<date>',
 auxinput1_inname                    = 'met_em.d<domain>.<date>',   
 io_form_history                     = 2,
 io_form_restart                     = 2,
 io_form_input                       = 2,
 io_form_boundary                    = 2,
 debug_level                         = 1,
 auxinput15_inname        	     = 'vprm_input_d<domain>_<date>',
 io_form_auxinput15       	     = 2,
 auxinput15_interval_m    	     = 1800,      1800,     1800,
 frames_per_auxinput15    	     = 1,         1,        1,
 auxinput5_inname 		     = 'wrfchemi_d<domain>_<date>',
 io_form_auxinput5                   = 2,
 auxinput5_interval_m                = 60, 60, 60,
 frames_per_auxinput5                = 1, 1, 1,
 /

 &domains
 time_step                           = 30,
 time_step_fract_num                 = 0,
 time_step_fract_den                 = 1,
 max_dom                             = 3,
 e_we              	             = 150, 126, 196, 
 e_sn              		     = 150, 126, 196, 
 e_vert                              = 46,    46,    46,
 p_top_requested                     = 5000,
 num_metgrid_levels                  = 138,
 num_metgrid_soil_levels             = 4,
 dx                                  = 10000, 2000,  400,
 dy                                  = 10000, 2000,  400,
 grid_id                             = 1,     2,     3,
 parent_id                           = 1,     1,     2,
 i_parent_start                      = 1,     63,    44,
 j_parent_start                      = 1,     63,    44,
 parent_grid_ratio                   = 1,     5,     5,
 parent_time_step_ratio              = 1,     5,     5,
 feedback                            = 0,
 smooth_option                       = 0
 eta_levels 			     = 1.000,0.998,0.996,0.994,0.992,0.990,0.988,0.984,0.980,0.976,0.970,0.964,
					0.958,0.952,0.945,0.938,0.930,0.922,0.914,0.904,0.894,0.884,0.874,0.860,0.846,
					0.832,0.818,0.800,0.775,0.750,0.720,0.700,0.650,0.600,0.550,0.500,
					0.450,0.400,0.350,0.300,0.250,0.200,0.150,0.100,0.050,0.000,
 /

 &physics
 mp_physics               = 3,       3,       3,
 ra_lw_physics            = 4,        4,        4,
 ra_sw_physics            = 4,        4,        4,
 radt                     = 30,       30,       30,
 sf_sfclay_physics        = 2,        2,        2,
 sf_surface_physics       = 2,        2,        2,
 bl_pbl_physics           = 2,        2,        2,
 bldt                     = 0,        0,        0,
 cu_physics               = 3,        0,        0,   
 cudt                     = 5,        5,        5,
 cu_diag		  = 1,        0,        0,	      
 isfflx                   = 1,
 ifsnow                   = 0,
 icloud                   = 1,
 surface_input_source     = 1,
 num_soil_layers          = 4,
 num_land_cat             = 40,
 sf_urban_physics         = 2,        2,        2,
 maxiens                  = 1,
 maxens                   = 3,
 maxens2                  = 3,
 maxens3                  = 16,
 ensdim                   = 144,
 /

&chem
 chem_opt                            = 17,17,17,
 emiss_opt                           = 17,17,17,
 vprm_opt			     = 'VPRM_table_EUROPE','VPRM_table_EUROPE','VPRM_table_EUROPE',
 phot_opt                            = 0,0,0,
 chem_in_opt                         = 0,0,0,
 io_style_emissions                  = 2,
 kemit                               = 7,
 chemdt                              = 1.,1.,1.,
 bioemdt                             = 30,30,30,
 photdt                              = 30,30,30,
 gas_drydep_opt 		     = 0,0,0,
 aer_drydep_opt 		     = 0,0,0,
 bio_emiss_opt                       = 17,17,17,
 emiss_inpt_opt                      = 16,16,16,
 biomass_burn_opt                    = 0,0,0,
 plumerisefire_frq                   = 0,0,0,
 gas_bc_opt                          = 1,1,1,
 gas_ic_opt                          = 1,1,1,
 aer_bc_opt                          = 1,1,1,
 aer_ic_opt                          = 1,1,1,
 gaschem_onoff                       = 0,0,0,
 aerchem_onoff                       = 0,0,0,
 vertmix_onoff                       = 1,1,1,
 chem_conv_tr                        = 1,0,0,
 have_bcs_chem                       = .true.,.true.,.true.,
 have_bcs_tracer                     = .true.,.true.,.true.,
 aer_ra_feedback                     = 0,0,0,
 wetscav_onoff                       = 0,0,0,
 cldchem_onoff                       = 0,0,0,
 conv_tr_wetscav 		     = 0,0,0,
 /

 &fdda
 /
 &dynamics
 w_damping                           = 0,
 diff_opt                            = 2,      2,      2,
 km_opt                              = 4,      4,      4,
 diff_6th_opt                        = 2,      2,       2,
 diff_6th_factor                     = 0.12,   0.12,   0.12,
 base_temp                           = 290.
 damp_opt                            = 0,
 zdamp                               = 5000.,  5000.,  5000.,
 dampcoef                            = 0.2,    0.2,    0.2
 khdif                               = 0,      0,      0,
 kvdif                               = 0,      0,      0,
 non_hydrostatic                     = .true., .true., .true.,
 moist_adv_opt                       = 1,      1,      1,     
 scalar_adv_opt                      = 1,      1,      1,     
 /

 &bdy_control
 spec_bdy_width                      = 5,
 spec_zone                           = 1,
 relax_zone                          = 4,
 specified                           = .true., .false.,.false.,
 nested                              = .false., .true., .true.,
 /

 &grib2
 /

 &namelist_quilt
 nio_tasks_per_group = 0,
 nio_groups = 1,
 /
I always used 12 nodes (28 cores per node, i.e., 336 MPI tasks) for my previous runs, which ran smoothly and worked well. The beginning of rsl.error.0000 is:

Code:

taskid: 0 hostname: i22r01c06s10
 module_io_quilt_old.F        2931 F
Quilting with   1 groups of   0 I/O tasks.
 Ntasks in X           16 , ntasks in Y           21
Then I tried to run the case using more than 12 nodes. When using 24 nodes (672 MPI tasks), wrf.exe seems to hang at the last line of the excerpt below, with no errors and no further progress, until the job runs out of wall-clock time. It looks like the run hangs while setting up the nest for d02.
In the rsl.error.0000:

Code:

taskid: 0 hostname: i22r01c01s07
 module_io_quilt_old.F        2931 F
Quilting with   1 groups of   0 I/O tasks.
 Ntasks in X           24 , ntasks in Y           28
--- WARNING: traj_opt is zero, but num_traj is not zero; setting num_traj to zero.
--- NOTE: sst_update is 0, setting io_form_auxinput4 = 0 and auxinput4_interval = 0 for all domains
--- NOTE: sst_update is 0, setting io_form_auxinput4 = 0 and auxinput4_interval = 0 for all domains
--- NOTE: sst_update is 0, setting io_form_auxinput4 = 0 and auxinput4_interval = 0 for all domains
--- NOTE: grid_fdda is 0 for domain      1, setting gfdda interval and ending time to 0 for that domain.
--- NOTE: both grid_sfdda and pxlsm_soil_nudge are 0 for domain      1, setting sgfdda interval and ending time to 0 for that domain.
--- NOTE: obs_nudge_opt is 0 for domain      1, setting obs nudging interval and ending time to 0 for that domain.
--- NOTE: grid_fdda is 0 for domain      2, setting gfdda interval and ending time to 0 for that domain.
--- NOTE: both grid_sfdda and pxlsm_soil_nudge are 0 for domain      2, setting sgfdda interval and ending time to 0 for that domain.
--- NOTE: obs_nudge_opt is 0 for domain      2, setting obs nudging interval and ending time to 0 for that domain.
--- NOTE: grid_fdda is 0 for domain      3, setting gfdda interval and ending time to 0 for that domain.
--- NOTE: both grid_sfdda and pxlsm_soil_nudge are 0 for domain      3, setting sgfdda interval and ending time to 0 for that domain.
--- NOTE: obs_nudge_opt is 0 for domain      3, setting obs nudging interval and ending time to 0 for that domain.
--- NOTE: bl_pbl_physics /= 4, implies mfshconv must be 0, resetting
Need MYNN PBL for icloud_bl = 1, resetting to 0
*************************************
No physics suite selected.
Physics options will be used directly from the namelist.
*************************************
--- NOTE: RRTMG radiation is in use, setting:  levsiz=59, alevsiz=12, no_src_types=6
--- NOTE: num_soil_layers has been set to      4
WRF V3.9.1.1 MODEL
 *************************************
 Parent domain
 ids,ide,jds,jde            1         150           1         150
 ims,ime,jms,jme           -4          14          -4          13
 ips,ipe,jps,jpe            1           7           1           6
 *************************************
DYNAMICS OPTION: Eulerian Mass Coordinate
   alloc_space_field: domain            1 ,               58299468  bytes allocated
  wrf main: calling open_r_dataset for wrfinput
  med_initialdata_input: calling input_input
  mminlu = 'MODIFIED_IGBP_MODIS_NOAH'
Timing for processing wrfinput file (stream 0) for domain        1:    7.42752 elapsed seconds
Max map factor in domain 1 =  0.98. Scale the dt in the model accordingly.
  WRF TILE   1 IS      1 IE      7 JS      1 JE      6
  set_tiles3: NUMBER OF TILES =   1
INPUT LandUse = "MODIFIED_IGBP_MODIS_NOAH"
 LANDUSE TYPE = "MODIFIED_IGBP_MODIS_NOAH" FOUND          40  CATEGORIES           2  SEASONS WATER CATEGORY =           17  SNOW CATEGORY =           15
  Do not have ozone.  Must read it in.
  Master rank reads ozone.
  Broadcast ozone to other ranks.
INITIALIZE THREE Noah LSM RELATED TABLES
Skipping over LUTYPE = USGS
 LANDUSE TYPE = MODIFIED_IGBP_MODIS_NOAH FOUND          20  CATEGORIES
 INPUT SOIL TEXTURE CLASSIFICATION = STAS
 SOIL TEXTURE CLASSIFICATION = STAS FOUND          19  CATEGORIES
*********************************************************************
*             PROGRAM:WRF-Chem V3.9.1.1 MODEL
*                                                                   *
*    PLEASE REPORT ANY BUGS TO WRF-Chem HELP at                     *
*                                                                   *
*              wrfchemhelp.gsd@noaa.gov                             *
*                                                                   *
*********************************************************************
WARNING: Users interested in the GHG options should check the comments/references in header of module_ghg_fluxes
Warning: the VPRM parameters may need to be optimized depending on the season, year and region!
The parameters provided here should be used for testing purposes only!
 *************************************
 Nesting domain
 ids,ide,jds,jde            1         126           1         126
 ims,ime,jms,jme           -4          20          -4          15
 ips,ipe,jps,jpe            1           6           1           5
 INTERMEDIATE domain
 ids,ide,jds,jde           61          91          61          91
 ims,ime,jms,jme           56          73          56          73
 ips,ipe,jps,jpe           59          63          59          63
 *************************************
d01 2018-08-01_18:00:00  alloc_space_field: domain            2 ,                7659360  bytes allocated
d01 2018-08-01_18:00:00  alloc_space_field: domain            2 ,               82253220  bytes allocated
I also tried other numbers of nodes (e.g., 14, 16, 20), which also did not work. Do you have any idea why this hang happens, and any suggestions on how to debug it?
Additionally, based on the info provided in this link: https://forum.mmm.ucar.edu/phpBB3/viewtopic.php?t=5082, I am also trying to figure out an appropriate number of processors for my run. But the recommended numbers of processors for my domains seem too small, i.e.:

For the smallest-sized domain (d02: e_we = 126, e_sn = 126):
((e_we)/25) * ((e_sn)/25) = maximum number of processors you should use,
so (126/25) * (126/25) ≈ 25 processors at most.

For the largest-sized domain (d03: e_we = 196, e_sn = 196):
((e_we)/100) * ((e_sn)/100) = minimum number of processors you should use,
so (196/100) * (196/100) ≈ 4 processors at least.

These limits do not make sense for my run, so I would like to know whether there is something wrong with my understanding.
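For reference, the same arithmetic as a small Python sketch (values taken from the namelist above; the helper names are only illustrative):

Code:

# FAQ rule of thumb from the linked post:
# the maximum comes from the smallest domain, the minimum from the largest domain.
def max_procs(e_we, e_sn):
    return (e_we / 25.0) * (e_sn / 25.0)

def min_procs(e_we, e_sn):
    return (e_we / 100.0) * (e_sn / 100.0)

print(round(max_procs(126, 126)))  # d02 (smallest domain): ~25 processors at most
print(round(min_procs(196, 196)))  # d03 (largest domain):  ~4 processors at least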
Thx a lot!
Xinxu

kwerner
Posts: 2251
Joined: Wed Feb 14, 2018 9:21 pm

Re: wrf.exe keeps running without any progress when increasing the number of processors

Post by kwerner » Fri Jan 15, 2021 6:00 pm

Hi Xinxu,
To explain the problem you're seeing, I would focus more on the first part of the FAQ question you mention.

We have a basic rule of thumb when it comes to choosing an appropriate number of processors. You will need to consider the decomposition of the processes in relation to the size of the domains. The decomposition is determined by the two closest factors of the total number of processors.

Depending on the total number of processors you use, your domain is divided up into tiles - 1 per processor. Each tile will have a minimum of 5 rows/columns on each side (called ‘halo’ regions), which are used to pass information from each tile/processor to its neighbors. You do not want your entire tile to be halo regions; you need some actual space for computation in the middle of each tile. If that computation space does not exist, the model can crash or produce unrealistic output.

The simple check we use for this is to take the total number of grid points in the west-east direction and divide it by the number of tiles in the x-direction [(e_we)/(x-tiles)]. You want the resulting number to be, at the very least, greater than 10. Then do the same for the south-north direction [(e_sn)/(y-tiles)], again making sure it’s greater than 10.
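A minimal Python sketch of that check, assuming (as described above) that the tasks are split over the two closest factors of the total task count; the function names are only illustrative:

Code:

import math

def closest_factors(n):
    """Factor pair (x, y) of n with x <= y and x as close to sqrt(n) as possible."""
    x = int(math.sqrt(n))
    while n % x != 0:
        x -= 1
    return x, n // x

def tiles_large_enough(e_we, e_sn, n_tasks, min_cells=10):
    """True if each tile keeps more than min_cells grid points in both directions."""
    x_tiles, y_tiles = closest_factors(n_tasks)
    return e_we / x_tiles > min_cells and e_sn / y_tiles > min_cells

# Domains from the namelist in the original post (e_we, e_sn):
domains = {"d01": (150, 150), "d02": (126, 126), "d03": (196, 196)}

for n_tasks in (336, 672):   # 12 and 24 nodes at 28 cores per node
    for name, (we, sn) in domains.items():
        ok = tiles_large_enough(we, sn, n_tasks)
        print(n_tasks, name, "ok" if ok else "tiles too small")

With 672 tasks this gives a 24 × 28 decomposition (matching the rsl output above), so d01 has only about 150/24 ≈ 6 grid points per tile in the west-east direction, well below the threshold; even the 336 tasks (16 × 21) that worked is already marginal for these domain sizes.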


So when you are using large numbers of processors, you have fewer than 10 grid cells in each tile, meaning there is no room for computation. I know you said you were able to use 12 nodes (336 total processors), but I'm actually surprised this ran okay. You are using V3.9.1.1 and we didn't put in a check for this until version 4 (I think). If you ran with that many processors in a more recent version, the model would stop immediately with an error, because we don't want you to use that many for a small domain. At some point, though, even with earlier versions of the model, it simply cannot run because of the decomposition problem. I know you want the model to run faster, but unfortunately for the domain sizes you're using, you cannot increase the number of processors. Per my calculations, you should be using no more than 3 nodes.
NCAR/MMM

Ming Chen
Posts: 1398
Joined: Mon Apr 23, 2018 9:42 pm

Re: wrf.exe keeps running without any progress when increasing the number of processors

Post by Ming Chen » Fri Jan 15, 2021 8:20 pm

We know for sure that WRF V3.9.1.1 has trouble running with too many processors. Based on the grid numbers specified in your namelist, I would guess that 16 processors are enough to run this case. Too many processors will cause the model to hang, or to crash with a segmentation fault.
WRF Help Desk
