Dear support team,
Thank you for the incredible support here. It has become a great resource for help even without opening new threads.
My team has some experience with WRF. In our current application, we want to run simulations with large domains, but we have now run into segmentation faults.
The largest domain sizes that worked for me were 2260x2260x100 and 2400x2400x80; 2400x2400x100 failed.
I'm using 2304 processes (18 nodes with 128 processes each), decomposing the domain into 48x48 tasks.
HPC support told me he sees an OOM kill in the log. However, when I checked on the compute nodes, each node was using 65-75 GB of memory (out of 512 GB total).
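As a rough sanity check on these numbers, here is a small Python sketch of the sizes involved. It assumes 4-byte (single-precision) reals, which is my assumption and not something the log states, and the function names are only for illustration. It computes the per-task patch extent and the size in bytes of one full 3D field for each domain, and compares that size with the signed 32-bit integer limit. Whether such a limit is actually involved in the crash is not confirmed by anything in the log; it is only a pattern in the arithmetic.

```python
# Back-of-the-envelope sizes for the three domains mentioned above.
# Assumption: one field value is a 4-byte (single-precision) real.
BYTES_PER_REAL = 4
INT32_MAX = 2**31 - 1  # largest value a signed 32-bit integer can hold


def global_field_bytes(nx, ny, nz, bytes_per_value=BYTES_PER_REAL):
    """Size in bytes of one full (undecomposed) 3D field."""
    return nx * ny * nz * bytes_per_value


def patch_extent(n_points, n_tasks):
    """Grid points per task along one dimension, rounded up."""
    return -(-n_points // n_tasks)


# Domain sizes and outcomes as reported in this post.
cases = {
    (2260, 2260, 100): "worked",
    (2400, 2400, 80): "worked",
    (2400, 2400, 100): "failed",
}

for (nx, ny, nz), outcome in cases.items():
    size = global_field_bytes(nx, ny, nz)
    print(
        f"{nx}x{ny}x{nz} ({outcome}): "
        f"patch {patch_extent(nx, 48)}x{patch_extent(ny, 48)}, "
        f"one 3D field = {size / 1e9:.2f} GB, "
        f"exceeds int32 byte count: {size > INT32_MAX}"
    )
```

Under this assumption, only the failing 2400x2400x100 case has a single 3D field larger than what a signed 32-bit byte count can express, while both working cases stay below it.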
Do you have any advice on how to proceed?
Below is rsl.error.0000; the namelist is attached (ideal LES case).
taskid: 0 hostname: n3503-058
module_io_quilt_old.F 2931 F
Quilting with 1 groups of 0 I/O tasks.
Ntasks in X 48 , ntasks in Y 48
Domain # 1: dx = 250.000 m
WRF V4.5.1 MODEL
No git found or not a git repository, git commit version not available.
*************************************
Parent domain
ids,ide,jds,jde 1 2400 1 2400
ims,ime,jms,jme -4 57 -4 57
ips,ipe,jps,jpe 1 50 1 50
*************************************
DYNAMICS OPTION: Eulerian Mass Coordinate
alloc_space_field: domain 1 , 211220720 bytes allocated
med_initialdata_input: calling input_input
Input data is acceptable to use: wrfinput_d01
CURRENT DATE = 2008-07-30_12:00:00
SIMULATION START DATE = 2008-07-30_12:00:00
[n3503-058:mpi_rank_0][dreg_register] [Performance Impact Warning]: Entries are being evicted from the InfiniBand registration cache. This can lead to degraded performance. Consider increasing MV2_NDREG_ENTRIES_MAX (current value: 16384) and MV2_NDREG_ENTRIES (current value: 12800)
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image              PC                Routine            Line     Source
libpthread-2.28.s  0000148BF7F76CF0  Unknown            Unknown  Unknown
wrf.exe            0000000003451400  __intel_avx_rep_m  Unknown  Unknown
libmpi.so.12.1.1   0000148BF915FC78  MRAILI_Fill_start  Unknown  Unknown
libmpi.so.12.1.1   0000148BF914B388  MPIDI_CH3_Rendezv  Unknown  Unknown
libmpi.so.12.1.1   0000148BF914ACCB  MPIDI_CH3_Rendezv  Unknown  Unknown
libmpi.so.12.1.1   0000148BF914AB23  MPIDI_CH3I_MRAILI  Unknown  Unknown
libmpi.so.12.1.1   0000148BF913EE5A  MPIDI_CH3I_Progre  Unknown  Unknown
libmpi.so.12.1.1   0000148BF905F070  MPIR_Waitall_impl  Unknown  Unknown
libmpi.so.12.1.1   0000148BF8EFBBE2  MPIR_Scatterv      Unknown  Unknown
libmpi.so.12.1.1   0000148BF8EFB23A  MPIR_Scatterv_imp  Unknown  Unknown
libmpi.so.12.1.1   0000148BF8EF9E6D  MPI_Scatterv       Unknown  Unknown
wrf.exe            0000000000AE66E9  Unknown            Unknown  Unknown
wrf.exe            0000000000876BD5  Unknown            Unknown  Unknown
wrf.exe            0000000001781C81  Unknown            Unknown  Unknown
wrf.exe            000000000177CD98  Unknown            Unknown  Unknown
wrf.exe            00000000017727BC  Unknown            Unknown  Unknown
wrf.exe            0000000001771D3A  Unknown            Unknown  Unknown
wrf.exe            0000000001771786  Unknown            Unknown  Unknown
wrf.exe            0000000001CB0CC2  Unknown            Unknown  Unknown
wrf.exe            0000000001589461  Unknown            Unknown  Unknown
wrf.exe            0000000001655F21  Unknown            Unknown  Unknown
wrf.exe            00000000004145C0  Unknown            Unknown  Unknown
wrf.exe            00000000004134C7  Unknown            Unknown  Unknown