Dear support team,
Thank you for the incredible support here. It has become a great resource for help even without opening new threads.
My team has some experience with WRF. In our current application, we want to run simulations with large domains, but we have now run into segmentation faults.
The largest domain sizes that worked for me were: 2260x2260x100 and 2400x2400x80, but 2400x2400x100 failed.
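A quick size check on these three configurations, as a rough sketch. It assumes 4-byte (single-precision) reals and ignores staggering, and the names below are made up for illustration. It computes the size of one full-domain 3D field and compares it with the largest value a signed 32-bit integer can hold. Whether that boundary has anything to do with the crash is an assumption, not something the log establishes.

```python
# Size of one full-domain 3D field for each configuration in the post.
# Assumption: 4-byte (single-precision) reals; staggering is ignored.
BYTES_PER_REAL = 4
INT32_MAX = 2**31 - 1  # largest value a signed 32-bit integer can hold

configs = {
    "2260x2260x100": (2260, 2260, 100),
    "2400x2400x80": (2400, 2400, 80),
    "2400x2400x100": (2400, 2400, 100),
}


def global_field_bytes(nx: int, ny: int, nz: int) -> int:
    """Bytes needed to hold one nx*ny*nz field on a single rank."""
    return nx * ny * nz * BYTES_PER_REAL


sizes = {name: global_field_bytes(*dims) for name, dims in configs.items()}

for name, size in sizes.items():
    verdict = "over" if size > INT32_MAX else "under"
    print(f"{name}: {size:,} bytes ({verdict} the signed 32-bit limit)")
```

Under these assumptions, the two working configurations stay below 2^31 - 1 bytes per field and the failing one is above it.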
I'm using 2304 processes (18 nodes with 128 processes each), decomposing the domain into 48x48 tasks.
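The layout can be checked with a few lines of arithmetic. This is only a sketch with illustrative variable names, and it assumes an even split of the 2400x2400 case. The resulting 50x50 patch agrees with the ips,ipe,jps,jpe line for rank 0 in rsl.error.0000.

```python
# Check the process layout and per-task patch size described above.
nodes, ranks_per_node = 18, 128
tasks_x, tasks_y = 48, 48
nx, ny = 2400, 2400  # horizontal grid points of the failing case

total_ranks = nodes * ranks_per_node
assert total_ranks == tasks_x * tasks_y  # both give 2304

# divmod returns the base patch width and any leftover grid points
patch_x, rem_x = divmod(nx, tasks_x)
patch_y, rem_y = divmod(ny, tasks_y)

print(f"{total_ranks} ranks, patch {patch_x} x {patch_y}, "
      f"remainders {rem_x}, {rem_y}")
```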
HPC support told me he sees an OOM kill in the log. When I checked the compute nodes, however, each node was using only 65-75 GB of memory (out of 512 GB total).
Do you have any advice on how to proceed?
The rsl.error.0000 output is below; the namelist (ideal LES case) is attached.
				
taskid: 0 hostname: n3503-058
 module_io_quilt_old.F        2931 F
Quilting with   1 groups of   0 I/O tasks.
 Ntasks in X           48 , ntasks in Y           48
  Domain # 1: dx =   250.000 m
WRF V4.5.1 MODEL
No git found or not a git repository, git commit version not available.
 *************************************
 Parent domain
 ids,ide,jds,jde            1        2400           1        2400
 ims,ime,jms,jme           -4          57          -4          57
 ips,ipe,jps,jpe            1          50           1          50
 *************************************
DYNAMICS OPTION: Eulerian Mass Coordinate
   alloc_space_field: domain            1 ,              211220720  bytes allocated
  med_initialdata_input: calling input_input
   Input data is acceptable to use: wrfinput_d01
 CURRENT DATE          = 2008-07-30_12:00:00
 SIMULATION START DATE = 2008-07-30_12:00:00
[n3503-058:mpi_rank_0][dreg_register] [Performance Impact Warning]: Entries are being evicted from the InfiniBand registration cache. This can lead to degraded performance. Consider increasing MV2_NDREG_ENTRIES_MAX (current value: 16384) and MV2_NDREG_ENTRIES (current value: 12800)
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image              PC                Routine            Line        Source
libpthread-2.28.s  0000148BF7F76CF0  Unknown               Unknown  Unknown
wrf.exe            0000000003451400  __intel_avx_rep_m     Unknown  Unknown
libmpi.so.12.1.1   0000148BF915FC78  MRAILI_Fill_start     Unknown  Unknown
libmpi.so.12.1.1   0000148BF914B388  MPIDI_CH3_Rendezv     Unknown  Unknown
libmpi.so.12.1.1   0000148BF914ACCB  MPIDI_CH3_Rendezv     Unknown  Unknown
libmpi.so.12.1.1   0000148BF914AB23  MPIDI_CH3I_MRAILI     Unknown  Unknown
libmpi.so.12.1.1   0000148BF913EE5A  MPIDI_CH3I_Progre     Unknown  Unknown
libmpi.so.12.1.1   0000148BF905F070  MPIR_Waitall_impl     Unknown  Unknown
libmpi.so.12.1.1   0000148BF8EFBBE2  MPIR_Scatterv         Unknown  Unknown
libmpi.so.12.1.1   0000148BF8EFB23A  MPIR_Scatterv_imp     Unknown  Unknown
libmpi.so.12.1.1   0000148BF8EF9E6D  MPI_Scatterv          Unknown  Unknown
wrf.exe            0000000000AE66E9  Unknown               Unknown  Unknown
wrf.exe            0000000000876BD5  Unknown               Unknown  Unknown
wrf.exe            0000000001781C81  Unknown               Unknown  Unknown
wrf.exe            000000000177CD98  Unknown               Unknown  Unknown
wrf.exe            00000000017727BC  Unknown               Unknown  Unknown
wrf.exe            0000000001771D3A  Unknown               Unknown  Unknown
wrf.exe            0000000001771786  Unknown               Unknown  Unknown
wrf.exe            0000000001CB0CC2  Unknown               Unknown  Unknown
wrf.exe            0000000001589461  Unknown               Unknown  Unknown
wrf.exe            0000000001655F21  Unknown               Unknown  Unknown
wrf.exe            00000000004145C0  Unknown               Unknown  Unknown
wrf.exe            00000000004134C7  Unknown               Unknown  Unknown
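
For comparison with the 65-75 GB per node mentioned above, a rough tally of the alloc_space_field figure from this log. It is a sketch that assumes every rank on a node allocates about as much as rank 0, which the log for a single rank cannot confirm, and the variable names are illustrative.

```python
# Scale the rank-0 alloc_space_field figure to a full node.
# Assumption: all 128 ranks on a node allocate roughly what rank 0 does.
alloc_bytes_rank0 = 211_220_720  # "bytes allocated" line in rsl.error.0000
ranks_per_node = 128

node_bytes = alloc_bytes_rank0 * ranks_per_node
node_gb = node_bytes / 1e9  # decimal gigabytes

print(f"about {node_gb:.1f} GB per node from alloc_space_field alone")
```

Under that assumption the domain allocation accounts for only part of the observed per-node usage; the rest is not broken down by anything in this log.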