
Segmentation fault at a certain domain size

lkugler

New member
Dear support team,
Thank you for the incredible support here. It has become a great resource for help even without opening new threads.

My team has some experience with WRF. In our current application we want to run simulations with large domains, but we are now running into segmentation faults.
The largest domain sizes that worked for me were: 2260x2260x100 and 2400x2400x80, but 2400x2400x100 failed.
I'm using 2304 processes (18 nodes with 128 processes each), decomposing the domain into 48x48 tasks.
HPC support told me they see an OOM kill in the log. When I checked on the compute nodes, each node used 65-75 GB of memory (out of 512 GB total).
Do you have any advice on how to proceed?

Below is rsl.error.0000; the namelist is attached (ideal case, LES).

taskid: 0 hostname: n3503-058
module_io_quilt_old.F 2931 F
Quilting with 1 groups of 0 I/O tasks.
Ntasks in X 48 , ntasks in Y 48
Domain # 1: dx = 250.000 m
WRF V4.5.1 MODEL
No git found or not a git repository, git commit version not available.
*************************************
Parent domain
ids,ide,jds,jde 1 2400 1 2400
ims,ime,jms,jme -4 57 -4 57
ips,ipe,jps,jpe 1 50 1 50
*************************************
DYNAMICS OPTION: Eulerian Mass Coordinate
alloc_space_field: domain 1 , 211220720 bytes allocated
med_initialdata_input: calling input_input
Input data is acceptable to use: wrfinput_d01
CURRENT DATE = 2008-07-30_12:00:00
SIMULATION START DATE = 2008-07-30_12:00:00
[n3503-058:mpi_rank_0][dreg_register] [Performance Impact Warning]: Entries are being evicted from the InfiniBand registration cache. This can lead to degraded performance. Consider increasing MV2_NDREG_ENTRIES_MAX (current value: 16384) and MV2_NDREG_ENTRIES (current value: 12800)
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image PC Routine Line Source
libpthread-2.28.s 0000148BF7F76CF0 Unknown Unknown Unknown
wrf.exe 0000000003451400 __intel_avx_rep_m Unknown Unknown
libmpi.so.12.1.1 0000148BF915FC78 MRAILI_Fill_start Unknown Unknown
libmpi.so.12.1.1 0000148BF914B388 MPIDI_CH3_Rendezv Unknown Unknown
libmpi.so.12.1.1 0000148BF914ACCB MPIDI_CH3_Rendezv Unknown Unknown
libmpi.so.12.1.1 0000148BF914AB23 MPIDI_CH3I_MRAILI Unknown Unknown
libmpi.so.12.1.1 0000148BF913EE5A MPIDI_CH3I_Progre Unknown Unknown
libmpi.so.12.1.1 0000148BF905F070 MPIR_Waitall_impl Unknown Unknown
libmpi.so.12.1.1 0000148BF8EFBBE2 MPIR_Scatterv Unknown Unknown
libmpi.so.12.1.1 0000148BF8EFB23A MPIR_Scatterv_imp Unknown Unknown
libmpi.so.12.1.1 0000148BF8EF9E6D MPI_Scatterv Unknown Unknown
wrf.exe 0000000000AE66E9 Unknown Unknown Unknown
wrf.exe 0000000000876BD5 Unknown Unknown Unknown
wrf.exe 0000000001781C81 Unknown Unknown Unknown
wrf.exe 000000000177CD98 Unknown Unknown Unknown
wrf.exe 00000000017727BC Unknown Unknown Unknown
wrf.exe 0000000001771D3A Unknown Unknown Unknown
wrf.exe 0000000001771786 Unknown Unknown Unknown
wrf.exe 0000000001CB0CC2 Unknown Unknown Unknown
wrf.exe 0000000001589461 Unknown Unknown Unknown
wrf.exe 0000000001655F21 Unknown Unknown Unknown
wrf.exe 00000000004145C0 Unknown Unknown Unknown
wrf.exe 00000000004134C7 Unknown Unknown Unknown
 

Attachments

  • namelist.input
    6.3 KB
If wrf.exe fails immediately after launch, possible reasons are: (1) the input data is wrong, (2) the memory is not sufficient, or (3) the file size is larger than 4 GB. WRF's large-file support enables file sizes larger than 2 GB, but they must still be smaller than 4 GB.

Could you please check how large your wrfout is for the 2400x2400x80 case, and estimate the likely file size for the 2400x2400x100 case?
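For a sense of scale (an illustrative back-of-the-envelope estimate, not a figure from this thread): a single 3D field at 2400x2400x100 in single precision is already above 2 GB, so a history file holding even a few such fields can easily cross the 4 GB limit mentioned above.

```python
# Size of one 3D single-precision (4-byte) field at 2400x2400x100.
# Illustrative only; actual wrfout size depends on the variable list.
nx, ny, nz = 2400, 2400, 100
bytes_per_field = nx * ny * nz * 4
print(bytes_per_field / 1e9, "GB")  # about 2.3 GB per 3D field
```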
 
wrf.exe fails after
med_initialdata_input: calling input_input
Input data is acceptable to use: wrfinput_d01
CURRENT DATE = 2008-07-30_12:00:00
SIMULATION START DATE = 2008-07-30_12:00:00

1) Input data: I checked multiple sizes, and all smaller sizes worked, e.g. 2400x2400x80 on 2304 processes and 200x200x200 on 4 processes.
2) Insufficient memory: HPC support confirmed that memory usage is too high (close to 500 GB per node across the 18 nodes); he said the result is basically the same with 120 nodes (15K cores).
3) wrfout file size: with io_form_history = 102 the size is 1.4 MB per process; with io_form_history = 11 (pnetcdf) it is 24 GB.

Extrapolating RAM usage
From 200x200x200: 24 GB of RAM on a single node; scaling the horizontal area up by a factor of 12^2 and dividing by 18 nodes suggests there should be enough RAM.
From 2400x2400x80: 65-75 GB per node (of 18), so memory usage for 100 levels should go up by about 25%, which is nowhere near 500 GB.
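The two extrapolations can be sketched numerically (illustrative arithmetic only; the per-node figures come from the measurements quoted above):

```python
# Rough per-node memory extrapolations for the 2400x2400x100 run.
# Figures taken from the thread; back-of-the-envelope only.
nodes = 18

# Estimate 1: scale the 200x200x200 single-node run (24 GB) by the
# horizontal area ratio (2400/200)^2 and spread it over 18 nodes.
est1 = 24 * (2400 / 200) ** 2 / nodes           # GB per node
print(f"from 200x200x200: {est1:.0f} GB/node")  # -> 192 GB/node

# Estimate 2: scale the 2400x2400x80 run (~70 GB/node) by the ratio
# of vertical levels, 100/80 (a 25% increase).
est2 = 70 * 100 / 80                            # GB per node
print(f"from 2400x2400x80: {est2:.1f} GB/node")  # -> 87.5 GB/node

# Both estimates sit well below the 512 GB available per node,
# which is why a plain out-of-memory condition looked unlikely.
```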
 
Thank you for the suggestion @islas.

With debug_level = 100 and the MPI_Gatherv/MPI_Scatterv fix, the log continues past the line SIMULATION START DATE = 2008-07-30_12:00:00 until another segfault occurs:
Code:
taskid: 0 hostname: n3501-020
 module_io_quilt_old.F        2931 F
Quilting with   1 groups of   0 I/O tasks.
 Ntasks in X           48 , ntasks in Y           48
  *************************************
  No physics suite selected.
  Physics options will be used directly from the namelist.
  *************************************
  Domain # 1: dx =   250.000 m
--- ERROR: ghg_input available only for these radiation schemes: CAM, RRTM, RRTMG, RRTMG_fast
           And the LW and SW schemes must be reasonably paired together:
           OK = CAM LW with CAM SW
           OK = RRTM, RRTMG LW or SW, RRTMG_fast LW or SW may be mixed
  --- WARNING: traj_opt is zero, but num_traj is not zero; setting num_traj to zero.
  --- NOTE: sst_update is 0, setting io_form_auxinput4 = 0 and auxinput4_interval = 0 for all domains
  --- NOTE: qna_update is 0, setting io_form_auxinput17 = 0 and auxinput17_interval = 0 for all domains
  --- NOTE: grid_fdda is 0 for domain      1, setting gfdda interval and ending time to 0 for that domain.
  --- NOTE: both grid_sfdda and pxlsm_soil_nudge are 0 for domain      1, setting sgfdda interval and ending time to 0 for that domain.
  --- NOTE: obs_nudge_opt is 0 for domain      1, setting obs nudging interval and ending time to 0 for that domain.
  --- NOTE: bl_pbl_physics /= 4, implies mfshconv must be 0, resetting
  --- NOTE: RRTMG radiation is not used, setting:  o3input=0 to avoid data pre-processing
  --- NOTE: num_soil_layers has been set to      5
WRF V4.5.1 MODEL
No git found or not a git repository, git commit version not available.
  wrf: calling alloc_and_configure_domain
 *************************************
 Parent domain
 ids,ide,jds,jde            1        2400           1        2400
 ims,ime,jms,jme           -4          57          -4          57
 ips,ipe,jps,jpe            1          50           1          50
 *************************************
DYNAMICS OPTION: Eulerian Mass Coordinate
   alloc_space_field: domain            1 ,              188187196  bytes allocated
  wrf: calling model_to_grid_config_rec
  wrf: calling set_scalar_indices_from_config
  wrf: calling init_wrfio
  Entering ext_gr1_ioinit
   setup_timekeeping:  set xtime to   0.0000000E+00
   setup_timekeeping:  set julian to    211.5000
  setup_timekeeping:  returning...
  wrf main: calling open_r_dataset for wrfinput
  med_initialdata_input: calling input_input
   Input data is acceptable to use: wrfinput_d01
   Warning LENGTH < 1 in ext_ncd_get_dom_ti.code CHAR, line         107
  mminlu = ''
   NOTE:  Ideal cases always use hypsometric_opt=1, regardless of namelist setting
 CURRENT DATE          = 2008-07-30_12:00:00
 SIMULATION START DATE = 2008-07-30_12:00:00
  med_initialdata_input: back from input_input
Timing for processing wrfinput file (stream 0) for domain        1:  225.98969 elapsed seconds
   checking boundary conditions for grid
   boundary conditions OK for grid
Max map factor in domain 1 =  0.00. Scale the dt in the model accordingly.
  start_domain_em: Before call to phy_init
  WRF TILE   1 IS      1 IE     50 JS      1 JE     50
  set_tiles3: NUMBER OF TILES =   1
  top of phy_init
   phy_init:  start_of_simulation =  T
  calling nl_get_iswater, nl_get_isice, nl_get_mminlu_loc
  after nl_get_iswater, nl_get_isice, nl_get_mminlu_loc
  start_domain_em: After call to phy_init
  start_em: calling lightning_init
  start_em: after calling lightning_init
  calling inc/HALO_EM_INIT_1_inline.inc
  calling inc/HALO_EM_INIT_2_inline.inc
  calling inc/HALO_EM_INIT_3_inline.inc
  calling inc/HALO_EM_INIT_4_inline.inc
  calling inc/HALO_EM_INIT_5_inline.inc
  calling inc/PERIOD_BDY_EM_INIT_inline.inc
  calling inc/PERIOD_BDY_EM_MOIST_inline.inc
  calling inc/PERIOD_BDY_EM_TKE_inline.inc
  calling inc/PERIOD_BDY_EM_SCALAR_inline.inc
  calling inc/PERIOD_BDY_EM_CHEM_inline.inc
  calling inc/HALO_EM_INIT_1_inline.inc
  calling inc/HALO_EM_INIT_2_inline.inc
  calling inc/HALO_EM_INIT_3_inline.inc
  calling inc/HALO_EM_INIT_4_inline.inc
  calling inc/HALO_EM_INIT_5_inline.inc
  calling inc/PERIOD_BDY_EM_INIT_inline.inc
  calling inc/PERIOD_BDY_EM_MOIST_inline.inc
  calling inc/PERIOD_BDY_EM_TKE_inline.inc
  calling inc/PERIOD_BDY_EM_SCALAR_inline.inc
  calling inc/PERIOD_BDY_EM_CHEM_inline.inc
  start_domain_em: Returning
  wrf: calling integrate
d01 2008-07-30_12:00:00 open_hist_w : opening ./output/wrfout_d01_2008-07-30_12_00_00 for writing.
d01 2008-07-30_12:00:00 calling wrf_open_for_write_begin in open_w_dataset
d01 2008-07-30_12:00:00  module_io.F: in wrf_open_for_write_begin, FileName = ./output/wrfout_d01_2008-07-30_12_00_00
d01 2008-07-30_12:00:00 calling outsub in open_w_dataset
d01 2008-07-30_12:00:00 back from outsub in open_w_dataset
d01 2008-07-30_12:00:00 calling wrf_open_for_write_commit in open_w_dataset
d01 2008-07-30_12:00:00  Information: NOFILL being set for writing to ./output/wrfout_d01_2008-07-30_12_00_00_0000
d01 2008-07-30_12:00:00 back from wrf_open_for_write_commit in open_w_dataset
d01 2008-07-30_12:00:00  med_hist_out: opened ./output/wrfout_d01_2008-07-30_12_00_00 as DATASET=HISTORY
 mediation_integrate.G        1242 DATASET=HISTORY
 mediation_integrate.G        1243  grid%id            1  grid%oid            1
Timing for Writing ./output/wrfout_d01_2008-07-30_12_00_00 for domain        1:    2.96829 elapsed seconds
d01 2008-07-30_12:00:00 module_integrate: calling solve interface
 Tile Strategy is not specified. Assuming 1D-Y
WRF TILE   1 IS      1 IE     50 JS      1 JE     50
WRF NUMBER OF TILES =   1
d01 2008-07-30_12:00:00  grid spacing, dt, time_step_sound=   250.0000       1.000000               4
d01 2008-07-30_12:00:00 calling inc/HALO_EM_MOIST_OLD_E_7_inline.inc
d01 2008-07-30_12:00:00 calling inc/PERIOD_BDY_EM_MOIST_OLD_inline.inc
d01 2008-07-30_12:00:00 calling inc/HALO_EM_A_inline.inc
d01 2008-07-30_12:00:00 calling inc/PERIOD_BDY_EM_A_inline.inc
d01 2008-07-30_12:00:00 calling inc/HALO_EM_PHYS_A_inline.inc
d01 2008-07-30_12:00:00 Top of Radiation Driver
d01 2008-07-30_12:00:00 calling inc/HALO_PWP_inline.inc
d01 2008-07-30_12:00:00 calling inc/HALO_EM_FDDA_SFC_inline.inc
d01 2008-07-30_12:00:00 calling inc/HALO_EM_TKE_C_inline.inc
d01 2008-07-30_12:00:00 calling inc/PERIOD_BDY_EM_A1_inline.inc
d01 2008-07-30_12:00:00 calling inc/HALO_EM_HELICITY_inline.inc
d01 2008-07-30_12:00:00 calling inc/HALO_EM_TKE_D_inline.inc
d01 2008-07-30_12:00:00 calling inc/HALO_EM_TKE_E_inline.inc
d01 2008-07-30_12:00:00 calling inc/PERIOD_BDY_EM_PHY_BC_inline.inc
d01 2008-07-30_12:00:00 calling inc/PERIOD_BDY_EM_CHEM_inline.inc
d01 2008-07-30_12:00:00 calling inc/HALO_EM_PHYS_DIFFUSION_inline.inc
d01 2008-07-30_12:00:00 calling inc/HALO_EM_TKE_5_inline.inc
d01 2008-07-30_12:00:00  ----------------------------------------
d01 2008-07-30_12:00:00  W-DAMPING  BEGINS AT W-COURANT NUMBER =    1.000000
d01 2008-07-30_12:00:00  ----------------------------------------
d01 2008-07-30_12:00:00 calling inc/HALO_EM_B_inline.inc
d01 2008-07-30_12:00:00 calling inc/PERIOD_BDY_EM_B_inline.inc
forrtl: error (78): process killed (SIGTERM)
Image              PC                Routine            Line        Source             
libpthread-2.28.s  00001498FB62ECF0  Unknown               Unknown  Unknown
libmpi.so.12.1.1   00001498FC48A0EA  MPIDI_CH3I_SMP_pu     Unknown  Unknown
libmpi.so.12.1.1   00001498FC488EC3  MPIDI_CH3I_SMP_re     Unknown  Unknown
libmpi.so.12.1.1   00001498FC474BE3  MPIDI_CH3I_Progre     Unknown  Unknown
libmpi.so.12.1.1   00001498FC394636  MPIR_Wait_impl        Unknown  Unknown
libmpi.so.12.1.1   00001498FC3942E0  MPI_Wait              Unknown  Unknown
wrf.exe            0000000003140B7C  Unknown               Unknown  Unknown
wrf.exe            00000000018AA833  Unknown               Unknown  Unknown
wrf.exe            00000000016B2C87  Unknown               Unknown  Unknown
wrf.exe            0000000001504D98  Unknown               Unknown  Unknown
wrf.exe            00000000005B9343  Unknown               Unknown  Unknown
wrf.exe            0000000000416BA1  Unknown               Unknown  Unknown
wrf.exe            0000000000416B5F  Unknown               Unknown  Unknown
wrf.exe            0000000000416AFD  Unknown               Unknown  Unknown
libc-2.28.so       00001498FB291D85  __libc_start_main     Unknown  Unknown
wrf.exe            0000000000416A1E  Unknown               Unknown  Unknown
 
Just to confirm: this combination of I/O formats should work, right?
io_form_history = 102,
io_form_restart = 102,
io_form_input = 2,
io_form_boundary = 2,
 
The mixture of 102 and 2 was the problem.

The solution was to use the same value for io_form_history and io_form_input; otherwise WRF fails with these odd MPI segmentation faults.
It worked with io_form_* set uniformly to 11 and, separately, uniformly to 102.
Below a domain size of about 2260x2260x100, mixed values for the different io_form_* parameters also seem to work.
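For reference, a consistent &time_control fragment along those lines might look like this (a sketch only; the poster confirmed that io_form_history and io_form_input must match, while applying the same value to restart and boundary streams is an assumption for consistency):

Code:
&time_control
  io_form_history  = 102,
  io_form_restart  = 102,
  io_form_input    = 102,   ! keep identical to io_form_history
  io_form_boundary = 102,   ! assumed; matching avoids mixed formats
/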

See also Writing large WRF model output files with pnetcdf
 