
wrf.exe keeps running without any progress when increasing the number of processors

This post is from a previous version of the WRF & MPAS-A Support Forum. New replies have been disabled; if you have follow-up questions related to this post, please start a new thread from the forum home page.

Xinxu Zhao

New member
Hi All,
My simulation is based on WRF v3.9.1.1. Since I want to run another long-period test (~2 years), I would like to use more processors (i.e., more nodes) to speed up my run. The key settings in namelist.input related to the time step and domain sizes are:
Code:
&time_control
 run_days                            = 0,
 run_hours                           = 24,
 run_minutes                         = 0,
 run_seconds                         = 0, 
 start_year                          = 2018,   2018,   2018,
 start_month                         = 08,   08,   08,
 start_day                           = 01,   01,   01,
 start_hour                          = 00,   00,   00,
 start_minute                        = 00,   00,   00,
 start_second                        = 00,   00,   00,
 end_year                            = 2018,   2018,   2018,
 end_month                           = 08,   08,   08,
 end_day                             = 02,   02,   02,
 end_hour                            = 00,   00,   00,
 end_minute                          = 00,   00,   00,
 end_second                          = 00,   00,   00,
 interval_seconds                    = 21600,
 input_from_file                     = .true.,.true.,.true.,
 history_interval                    = 180,  60,   15,
 frames_per_outfile                  = 1000, 1000, 1000,
 write_hist_at_0h_rst		     = .true.,
 restart                             = .false.,
 restart_interval                    = 360,
 override_restart_timers             = .true.,
 rst_outname                         = 'wrfrst_d<domain>_<date>',
 history_outname                     = 'wrfout_d<domain>_<date>',
 auxinput1_inname                    = 'met_em.d<domain>.<date>',   
 io_form_history                     = 2,
 io_form_restart                     = 2,
 io_form_input                       = 2,
 io_form_boundary                    = 2,
 debug_level                         = 1,
 auxinput15_inname        	     = 'vprm_input_d<domain>_<date>',
 io_form_auxinput15       	     = 2,
 auxinput15_interval_m    	     = 1800,      1800,     1800,
 frames_per_auxinput15    	     = 1,         1,        1,
 auxinput5_inname 		     = 'wrfchemi_d<domain>_<date>',
 io_form_auxinput5                   = 2,
 auxinput5_interval_m                = 60, 60, 60,
 frames_per_auxinput5                = 1, 1, 1,
 /

 &domains
 time_step                           = 30,
 time_step_fract_num                 = 0,
 time_step_fract_den                 = 1,
 max_dom                             = 3,
 e_we              	             = 150, 126, 196, 
 e_sn              		     = 150, 126, 196, 
 e_vert                              = 46,    46,    46,
 p_top_requested                     = 5000,
 num_metgrid_levels                  = 138,
 num_metgrid_soil_levels             = 4,
 dx                                  = 10000, 2000,  400,
 dy                                  = 10000, 2000,  400,
 grid_id                             = 1,     2,     3,
 parent_id                           = 1,     1,     2,
 i_parent_start                      = 1,     63,    44,
 j_parent_start                      = 1,     63,    44,
 parent_grid_ratio                   = 1,     5,     5,
 parent_time_step_ratio              = 1,     5,     5,
 feedback                            = 0,
 smooth_option                       = 0
 eta_levels 			     = 1.000,0.998,0.996,0.994,0.992,0.990,0.988,0.984,0.980,0.976,0.970,0.964,
					0.958,0.952,0.945,0.938,0.930,0.922,0.914,0.904,0.894,0.884,0.874,0.860,0.846,
					0.832,0.818,0.800,0.775,0.750,0.720,0.700,0.650,0.600,0.550,0.500,
					0.450,0.400,0.350,0.300,0.250,0.200,0.150,0.100,0.050,0.000,
 /

 &physics
 mp_physics               = 3,       3,       3,
 ra_lw_physics            = 4,        4,        4,
 ra_sw_physics            = 4,        4,        4,
 radt                     = 30,       30,       30,
 sf_sfclay_physics        = 2,        2,        2,
 sf_surface_physics       = 2,        2,        2,
 bl_pbl_physics           = 2,        2,        2,
 bldt                     = 0,        0,        0,
 cu_physics               = 3,        0,        0,   
 cudt                     = 5,        5,        5,
 cu_diag		  = 1,        0,        0,	      
 isfflx                   = 1,
 ifsnow                   = 0,
 icloud                   = 1,
 surface_input_source     = 1,
 num_soil_layers          = 4,
 num_land_cat             = 40,
 sf_urban_physics         = 2,        2,        2,
 maxiens                  = 1,
 maxens                   = 3,
 maxens2                  = 3,
 maxens3                  = 16,
 ensdim                   = 144,
 /

&chem
 chem_opt                            = 17,17,17,
 emiss_opt                           = 17,17,17,
 vprm_opt			     = 'VPRM_table_EUROPE','VPRM_table_EUROPE','VPRM_table_EUROPE',
 phot_opt                            = 0,0,0,
 chem_in_opt                         = 0,0,0,
 io_style_emissions                  = 2,
 kemit                               = 7,
 chemdt                              = 1.,1.,1.,
 bioemdt                             = 30,30,30,
 photdt                              = 30,30,30,
 gas_drydep_opt 		     = 0,0,0,
 aer_drydep_opt 		     = 0,0,0,
 bio_emiss_opt                       = 17,17,17,
 emiss_inpt_opt                      = 16,16,16,
 biomass_burn_opt                    = 0,0,0,
 plumerisefire_frq                   = 0,0,0,
 gas_bc_opt                          = 1,1,1,
 gas_ic_opt                          = 1,1,1,
 aer_bc_opt                          = 1,1,1,
 aer_ic_opt                          = 1,1,1,
 gaschem_onoff                       = 0,0,0,
 aerchem_onoff                       = 0,0,0,
 vertmix_onoff                       = 1,1,1,
 chem_conv_tr                        = 1,0,0,
 have_bcs_chem                       = .true.,.true.,.true.,
 have_bcs_tracer                     = .true.,.true.,.true.,
 aer_ra_feedback                     = 0,0,0,
 wetscav_onoff                       = 0,0,0,
 cldchem_onoff                       = 0,0,0,
 conv_tr_wetscav 		     = 0,0,0,
 /

 &fdda
 /
 &dynamics
 w_damping                           = 0,
 diff_opt                            = 2,      2,      2,
 km_opt                              = 4,      4,      4,
 diff_6th_opt                        = 2,      2,       2,
 diff_6th_factor                     = 0.12,   0.12,   0.12,
 base_temp                           = 290.
 damp_opt                            = 0,
 zdamp                               = 5000.,  5000.,  5000.,
 dampcoef                            = 0.2,    0.2,    0.2
 khdif                               = 0,      0,      0,
 kvdif                               = 0,      0,      0,
 non_hydrostatic                     = .true., .true., .true.,
 moist_adv_opt                       = 1,      1,      1,     
 scalar_adv_opt                      = 1,      1,      1,     
 /

 &bdy_control
 spec_bdy_width                      = 5,
 spec_zone                           = 1,
 relax_zone                          = 4,
 specified                           = .true., .false.,.false.,
 nested                              = .false., .true., .true.,
 /

 &grib2
 /

 &namelist_quilt
 nio_tasks_per_group = 0,
 nio_groups = 1,
 /
I always used 12 nodes (28 cores per node, 336 MPI tasks in total) for my previous runs, which ran smoothly and worked well. The beginning of rsl.error.0000 is:
Code:
taskid: 0 hostname: i22r01c06s10
 module_io_quilt_old.F        2931 F
Quilting with   1 groups of   0 I/O tasks.
 Ntasks in X           16 , ntasks in Y           21
Then I tried to run the case using more than 12 nodes. When using 24 nodes, wrf.exe seems to hang (at the last line below) with no errors and no progress until the job runs out of wallclock time. It seems the run hangs when it gets to the nest d02.
In the rsl.error.0000:
Code:
taskid: 0 hostname: i22r01c01s07
 module_io_quilt_old.F        2931 F
Quilting with   1 groups of   0 I/O tasks.
 Ntasks in X           24 , ntasks in Y           28
--- WARNING: traj_opt is zero, but num_traj is not zero; setting num_traj to zero.
--- NOTE: sst_update is 0, setting io_form_auxinput4 = 0 and auxinput4_interval = 0 for all domains
--- NOTE: sst_update is 0, setting io_form_auxinput4 = 0 and auxinput4_interval = 0 for all domains
--- NOTE: sst_update is 0, setting io_form_auxinput4 = 0 and auxinput4_interval = 0 for all domains
--- NOTE: grid_fdda is 0 for domain      1, setting gfdda interval and ending time to 0 for that domain.
--- NOTE: both grid_sfdda and pxlsm_soil_nudge are 0 for domain      1, setting sgfdda interval and ending time to 0 for that domain.
--- NOTE: obs_nudge_opt is 0 for domain      1, setting obs nudging interval and ending time to 0 for that domain.
--- NOTE: grid_fdda is 0 for domain      2, setting gfdda interval and ending time to 0 for that domain.
--- NOTE: both grid_sfdda and pxlsm_soil_nudge are 0 for domain      2, setting sgfdda interval and ending time to 0 for that domain.
--- NOTE: obs_nudge_opt is 0 for domain      2, setting obs nudging interval and ending time to 0 for that domain.
--- NOTE: grid_fdda is 0 for domain      3, setting gfdda interval and ending time to 0 for that domain.
--- NOTE: both grid_sfdda and pxlsm_soil_nudge are 0 for domain      3, setting sgfdda interval and ending time to 0 for that domain.
--- NOTE: obs_nudge_opt is 0 for domain      3, setting obs nudging interval and ending time to 0 for that domain.
--- NOTE: bl_pbl_physics /= 4, implies mfshconv must be 0, resetting
Need MYNN PBL for icloud_bl = 1, resetting to 0
*************************************
No physics suite selected.
Physics options will be used directly from the namelist.
*************************************
--- NOTE: RRTMG radiation is in use, setting:  levsiz=59, alevsiz=12, no_src_types=6
--- NOTE: num_soil_layers has been set to      4
WRF V3.9.1.1 MODEL
 *************************************
 Parent domain
 ids,ide,jds,jde            1         150           1         150
 ims,ime,jms,jme           -4          14          -4          13
 ips,ipe,jps,jpe            1           7           1           6
 *************************************
DYNAMICS OPTION: Eulerian Mass Coordinate
   alloc_space_field: domain            1 ,               58299468  bytes allocated
  wrf main: calling open_r_dataset for wrfinput
  med_initialdata_input: calling input_input
  mminlu = 'MODIFIED_IGBP_MODIS_NOAH'
Timing for processing wrfinput file (stream 0) for domain        1:    7.42752 elapsed seconds
Max map factor in domain 1 =  0.98. Scale the dt in the model accordingly.
  WRF TILE   1 IS      1 IE      7 JS      1 JE      6
  set_tiles3: NUMBER OF TILES =   1
INPUT LandUse = "MODIFIED_IGBP_MODIS_NOAH"
 LANDUSE TYPE = "MODIFIED_IGBP_MODIS_NOAH" FOUND          40  CATEGORIES           2  SEASONS WATER CATEGORY =           17  SNOW CATEGORY =           15
  Do not have ozone.  Must read it in.
  Master rank reads ozone.
  Broadcast ozone to other ranks.
INITIALIZE THREE Noah LSM RELATED TABLES
Skipping over LUTYPE = USGS
 LANDUSE TYPE = MODIFIED_IGBP_MODIS_NOAH FOUND          20  CATEGORIES
 INPUT SOIL TEXTURE CLASSIFICATION = STAS
 SOIL TEXTURE CLASSIFICATION = STAS FOUND          19  CATEGORIES
*********************************************************************
*             PROGRAM:WRF-Chem V3.9.1.1 MODEL
*                                                                   *
*    PLEASE REPORT ANY BUGS TO WRF-Chem HELP at                     *
*                                                                   *
*              wrfchemhelp.gsd@noaa.gov                             *
*                                                                   *
*********************************************************************
WARNING: Users interested in the GHG options should check the comments/references in header of module_ghg_fluxes
Warning: the VPRM parameters may need to be optimized depending on the season, year and region!
The parameters provided here should be used for testing purposes only!
 *************************************
 Nesting domain
 ids,ide,jds,jde            1         126           1         126
 ims,ime,jms,jme           -4          20          -4          15
 ips,ipe,jps,jpe            1           6           1           5
 INTERMEDIATE domain
 ids,ide,jds,jde           61          91          61          91
 ims,ime,jms,jme           56          73          56          73
 ips,ipe,jps,jpe           59          63          59          63
 *************************************
d01 2018-08-01_18:00:00  alloc_space_field: domain            2 ,                7659360  bytes allocated
d01 2018-08-01_18:00:00  alloc_space_field: domain            2 ,               82253220  bytes allocated
I also tried other numbers of nodes (e.g., 14, 16, 20), which also did not work. Do you have any idea why this hang happens, or how to debug it?
Additionally, based on the info provided in this link: https://forum.mmm.ucar.edu/phpBB3/viewtopic.php?t=5082, I am also trying to figure out an appropriate number of processors to use for my run.
But it seems that far too few processors are recommended for my domains, i.e.:
For your smallest-sized domain:
((e_we)/25) * ((e_sn)/25) = most processors you should use
My smallest domain (d02) has e_we = 126 and e_sn = 126, so the maximum is (126/25) * (126/25) ≈ 25 processors.

For your largest-sized domain:
((e_we)/100) * ((e_sn)/100) = fewest processors you should use
My largest domain (d03) has e_we = 196 and e_sn = 196, so the minimum is (196/100) * (196/100) ≈ 4 processors.

These limits do not seem to make sense for my run; I would like to know whether something is wrong in my understanding.
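To make the arithmetic explicit, here is a minimal Python sketch of the two rule-of-thumb formulas (the function names are mine, purely for illustration, not part of WRF):
Code:
# Rule-of-thumb processor limits from the forum FAQ.
# Function names are illustrative, not part of WRF.

def max_procs(e_we, e_sn):
    # Upper limit, computed from the smallest domain.
    return (e_we / 25) * (e_sn / 25)

def min_procs(e_we, e_sn):
    # Lower limit, computed from the largest domain.
    return (e_we / 100) * (e_sn / 100)

print(max_procs(126, 126))  # d02 (smallest): ~25.4 -> at most ~25
print(min_procs(196, 196))  # d03 (largest):  ~3.8  -> at least ~4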
Thx a lot!
Xinxu
 
Hi Xinxu,
To explain the problem you're seeing, I would focus more on the first part of the FAQ question you mention.

We have a basic rule of thumb for choosing an appropriate number of processors. You need to consider how the processes are decomposed in relation to the size of the domains. The decomposition is determined by the two closest factors of the total number of processors.

Depending on the total number of processors you use, your domain is divided into tiles, one per processor. Each tile has a minimum of 5 rows/columns on each side (called 'halo' regions), which are used to pass information between each processor's tile and its neighbors. You do not want your entire tile to be halo regions; you need some actual space for computation in the middle of each tile. If that computation space does not exist, the model can crash, or the output can be unrealistic.

The simple test we use is to take the total number of grid cells in the west-east direction and divide it by the number of tiles in the x-direction [(e_we)/(x-tiles)]. You want the result to be, at the very least, greater than 10. Then do the same for the south-north direction [(e_sn)/(y-tiles)], again making sure it is greater than 10.
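As a rough illustration, that test can be sketched in a few lines of Python, assuming the default behavior where WRF splits the total MPI tasks into the two closest factors (the helper names here are just for illustration; the decomposition can also be forced with nproc_x/nproc_y):
Code:
import math

def closest_factors(n):
    # Split n MPI tasks into the two closest factors (x-tiles, y-tiles),
    # mimicking WRF's default domain decomposition.
    for x in range(int(math.sqrt(n)), 0, -1):
        if n % x == 0:
            return x, n // x

def decomposition_ok(e_we, e_sn, nprocs):
    # Each tile should keep more than ~10 cells in each direction.
    x_tiles, y_tiles = closest_factors(nprocs)
    return e_we / x_tiles > 10 and e_sn / y_tiles > 10

# Checked against the smallest domain in this thread, d02 (126 x 126):
print(decomposition_ok(126, 126, 336))  # False: 16 x 21 tiles -> ~7.9 x 6.0 cells
print(decomposition_ok(126, 126, 672))  # False: 24 x 28 tiles -> ~5.3 x 4.5 cells
print(decomposition_ok(126, 126,  84))  # True:   7 x 12 tiles -> 18.0 x 10.5 cells
Note that 336 tasks decompose to 16 x 21 and 672 tasks to 24 x 28, exactly the "Ntasks in X/Y" values in the rsl.error.0000 excerpts above, while 84 tasks (3 nodes of 28 cores) still passes the check, matching the estimate below.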


So when you are using large numbers of processors, you have fewer than 10 grid cells per tile, meaning there is no room for computation. I know you said you were able to use 12 nodes (336 total processors), but I'm actually surprised that ran okay. You are using V3.9.1.1, and we didn't put in a check for this until version 4 (I think). If you ran with that many processors in a more recent version, the model would stop immediately with an error, because we don't want you to use that many for a small domain. At some point, though, even with earlier versions of the model, it simply cannot run because of the decomposition problem. I know you want the model to run faster, but unfortunately, for the domain sizes you're using, you cannot increase the number of processors. Per my calculations, you should be using no more than 3 nodes.
 
We know for sure that WRF v3.9.1.1 has trouble running with too many processors. Based on the grid dimensions specified in your namelist, I guess 16 processors will be enough to run this case. Too many processors will cause the model to hang, or to crash with a segmentation fault.
 
Hi, I have a similar problem: I want to apply as many cores as possible to a small domain. I know about the limit of ~10 cells per processor, but is there any way to increase the speed of the run when there are still free cores on the computer?
 
@ag1993,
Unfortunately, you are limited by the size of your domain; even if you have access to many processors, you can't run with more than is appropriate. There may be other factors slowing down your run, such as specific physics settings or other options. If you'd like us to take a look, you can attach your namelist and let me know how many processors you're running with, how long the run typically takes, what version of the model you're using, and which compiler you used to build it. Thanks!
 
@kwerner, I would be very grateful if you could take a look at my namelist.input. I think I asked you something similar in another post, so please ignore that :D

The simulation I designed is very small because the objective is to get a long time series of wind at a single point, so I am looking for the fastest run I can get; related to this, I use only a few vertical levels.

In search of a shorter run time I use an adaptive time step, and because of some stability problems I set a maximum time step of 300 seconds on the parent domain.

As for the run itself, I compiled the model with GNU/gfortran (dmpar), so I can parallelize it with MPICH. I run the model with 16 processors (mpirun -np 16 ./wrf.exe), which splits the domain into tiles of nearly 10 x 10 cells. Right now the simulation of 1 coarse domain and 4 nests (two-way nesting) resolves a 300-second time step in about 2.5 seconds of wall time, so I still need a shorter run time to get a time series covering several years.

PS: there are still free processors and RAM available.

I will be grateful for any clue.
 

Attachments

  • namelist.input
    5.4 KB
@ag1993,
What type of input data are you using for this? What is the resolution of that input data?

If you are looking for realistic wind values for your time series, then you will still need larger domains and many more vertical levels; we now recommend 45+ vertical levels. If you do not care whether the results are close to accurate and only want values, then I suppose it is okay to use the smaller domains.

At this point, unfortunately I do not have any additional suggestions for making the simulation run faster.
 
Dear @kwerner

I used data from GFS (ds083.2) at 1° resolution, because that input dataset covers more years.

I appreciate your answer. I already tried with bigger domains and more vertical levels, but the results don't change very much. I will investigate the meteorological phenomena I am missing; if you have or come across any post or info about that, I would really appreciate it.

Thanks anyway.
 