Dear support team,
Thank you for the incredible support here. It has become a great resource for help even without opening new threads.
My team has some experience with WRF. In our current application, we want to run simulations with large domains, but we have now run into segmentation faults.
The largest domain sizes that worked for me were: 2260x2260x100 and 2400x2400x80, but 2400x2400x100 failed.
I'm using 2304 processes (18 nodes with 128 processes each), decomposing the domain into 48x48 tasks.
HPC support told me they see an OOM kill in the log. When I checked on the compute nodes, each node was using 65-75 GB of memory (out of 512 GB total).
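A rough size check on the three domain sizes, sketched in Python under two assumptions that the log does not confirm: reals are 4 bytes (single precision), and a full global 3D field is held on one rank before it is distributed (the traceback does pass through MPI_Scatterv). The helper names are only illustrative.

```python
# Rough size check for the three domain sizes mentioned above.
# Assumptions (not confirmed by the log): 4-byte reals, and one full
# global 3D field being held on a single rank before it is scattered.

INT32_MAX = 2**31 - 1
BYTES_PER_REAL = 4  # assumption: single-precision build


def global_field_bytes(nx, ny, nz, bytes_per_real=BYTES_PER_REAL):
    """Bytes needed for one unstaggered global 3D field."""
    return nx * ny * nz * bytes_per_real


def patch_points(n_global, n_tasks):
    """Grid points per task along one direction, assuming an even split."""
    return n_global // n_tasks


cases = {
    "2260x2260x100 (worked)": (2260, 2260, 100),
    "2400x2400x80 (worked)": (2400, 2400, 80),
    "2400x2400x100 (failed)": (2400, 2400, 100),
}

for label, (nx, ny, nz) in cases.items():
    size = global_field_bytes(nx, ny, nz)
    over = size > INT32_MAX
    print(f"{label}: {size / 1e9:.3f} GB, exceeds signed 32-bit range: {over}")

# Decomposition check: 18 nodes x 128 processes versus the 48x48 task grid.
print("tasks:", 18 * 128, "vs", 48 * 48)
print("points per task in x:", patch_points(2400, 48))
```

Under these assumptions, only the failing 2400x2400x100 case needs more than 2^31 - 1 bytes for a single global 3D field (2.304 GB, against a limit of about 2.147 GB). So a 32-bit size or count limit might be worth ruling out alongside the OOM report. The 50 points per task agree with the ips,ipe range of 1 to 50 in the log.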
Do you have any advice on how to proceed?
The rsl.error.0000 output is below, and the namelist is attached (ideal LES case).
taskid: 0 hostname: n3503-058
module_io_quilt_old.F 2931 F
Quilting with 1 groups of 0 I/O tasks.
Ntasks in X 48 , ntasks in Y 48
Domain # 1: dx = 250.000 m
WRF V4.5.1 MODEL
No git found or not a git repository, git commit version not available.
*************************************
Parent domain
ids,ide,jds,jde 1 2400 1 2400
ims,ime,jms,jme -4 57 -4 57
ips,ipe,jps,jpe 1 50 1 50
*************************************
DYNAMICS OPTION: Eulerian Mass Coordinate
alloc_space_field: domain 1 , 211220720 bytes allocated
med_initialdata_input: calling input_input
Input data is acceptable to use: wrfinput_d01
CURRENT DATE = 2008-07-30_12:00:00
SIMULATION START DATE = 2008-07-30_12:00:00
[n3503-058:mpi_rank_0][dreg_register] [Performance Impact Warning]: Entries are being evicted from the InfiniBand registration cache. This can lead to degraded performance. Consider increasing MV2_NDREG_ENTRIES_MAX (current value: 16384) and MV2_NDREG_ENTRIES (current value: 12800)
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image PC Routine Line Source
libpthread-2.28.s 0000148BF7F76CF0 Unknown Unknown Unknown
wrf.exe 0000000003451400 __intel_avx_rep_m Unknown Unknown
libmpi.so.12.1.1 0000148BF915FC78 MRAILI_Fill_start Unknown Unknown
libmpi.so.12.1.1 0000148BF914B388 MPIDI_CH3_Rendezv Unknown Unknown
libmpi.so.12.1.1 0000148BF914ACCB MPIDI_CH3_Rendezv Unknown Unknown
libmpi.so.12.1.1 0000148BF914AB23 MPIDI_CH3I_MRAILI Unknown Unknown
libmpi.so.12.1.1 0000148BF913EE5A MPIDI_CH3I_Progre Unknown Unknown
libmpi.so.12.1.1 0000148BF905F070 MPIR_Waitall_impl Unknown Unknown
libmpi.so.12.1.1 0000148BF8EFBBE2 MPIR_Scatterv Unknown Unknown
libmpi.so.12.1.1 0000148BF8EFB23A MPIR_Scatterv_imp Unknown Unknown
libmpi.so.12.1.1 0000148BF8EF9E6D MPI_Scatterv Unknown Unknown
wrf.exe 0000000000AE66E9 Unknown Unknown Unknown
wrf.exe 0000000000876BD5 Unknown Unknown Unknown
wrf.exe 0000000001781C81 Unknown Unknown Unknown
wrf.exe 000000000177CD98 Unknown Unknown Unknown
wrf.exe 00000000017727BC Unknown Unknown Unknown
wrf.exe 0000000001771D3A Unknown Unknown Unknown
wrf.exe 0000000001771786 Unknown Unknown Unknown
wrf.exe 0000000001CB0CC2 Unknown Unknown Unknown
wrf.exe 0000000001589461 Unknown Unknown Unknown
wrf.exe 0000000001655F21 Unknown Unknown Unknown
wrf.exe 00000000004145C0 Unknown Unknown Unknown
wrf.exe 00000000004134C7 Unknown Unknown Unknown