ARW v3.9.1.1 with dm+sm

cwlam08

New member
I have a server with two Intel Xeon Gold 6148 CPUs, on which the OMP domains are distributed as follows:
CPUs 0-19 in domain 0
CPUs 20-39 in domain 1
I've compiled wrf.exe (ARW v3.9.1.1) with the Intel compilers (v19.1.3.304) using the dm+sm option and with AVX-512 instruction-set support (see the attached configure.wrf file). The build created all of the executables except real.exe (that was not a problem, because I already had a real.exe compiled with the dm-only option and AVX-512 support).
When I run wrf.exe with the command:
$ mpirun -np 2 -genv OMP_NUM_THREADS=19 -genv I_MPI_DEBUG=5 -genv I_MPI_PIN=1 -genv I_MPI_PIN_DOMAIN=omp -genv I_MPI_PIN_ORDER=compact -genv I_MPI_PIN_CELL=core -genv I_MPI_PIN_PROCESSOR_LIST=1-19,21-39 ./wrf.exe
it crashes with a segmentation fault. Here is the output:
--------------------------------------------------------------------------------------------------
[0] MPI startup(): Intel(R) MPI Library, Version 2019 Update 9 Build 20200923 (id: abd58e492)
[0] MPI startup(): Copyright (C) 2003-2020 Intel Corporation. All rights reserved.
[0] MPI startup(): library kind: release
[0] MPI startup(): libfabric version: 1.10.1-impi
[0] MPI startup(): libfabric provider: tcp;ofi_rxm
starting wrf task 1 of 2
[0] MPI startup(): Rank Pid Node name Pin cpu
[0] MPI startup(): 0 32308 localhost.localdomain {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18}
[0] MPI startup(): 1 32309 localhost.localdomain {19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37}
[0] MPI startup(): I_MPI_ROOT=/opt/intel/compilers_and_libraries_2020.4.304/linux/mpi
[0] MPI startup(): I_MPI_MPIRUN=mpirun
[0] MPI startup(): I_MPI_HYDRA_TOPOLIB=hwloc
[0] MPI startup(): I_MPI_PIN=1
[0] MPI startup(): I_MPI_PIN_PROCESSOR_LIST=1-19,21-39
[0] MPI startup(): I_MPI_PIN_CELL=core
[0] MPI startup(): I_MPI_PIN_DOMAIN=omp
[0] MPI startup(): I_MPI_PIN_ORDER=compact
[0] MPI startup(): I_MPI_INTERNAL_MEM_POLICY=default
[0] MPI startup(): I_MPI_DEBUG=5
starting wrf task 0 of 2

===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 0 PID 32308 RUNNING AT localhost.localdomain
= KILLED BY SIGNAL: 11 (Segmentation fault)
===================================================================================

===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 1 PID 32309 RUNNING AT localhost.localdomain
= KILLED BY SIGNAL: 9 (Killed)
===================================================================================
----------------------------------------------------------------------------------------------------
However, it runs without any problems when wrf.exe is compiled in dm-only mode:
$ mpirun -n 38 ./wrf.exe
See my attached namelist.input file and the WRF run logs.

Can anyone help me?
 

Attachments

  • configure.wrf (20.7 KB)
  • namelist.input (3.4 KB)
  • rsl.error.txt (748.7 KB)
  • rsl.out.txt (748.4 KB)
Hi,
Over the years there have been a lot of issues with dm+sm-compiled versions of WRF. We no longer test that option and do not recommend using it because it causes so many problems. Since your simulation is running okay with dmpar only, would you be okay with just using that instead?
 
Thanks, kwerner!
I'd like to decrease the run time of WRF (it currently takes more than 2 hours); that is the only reason I used dm+sm.
 
A couple of points that will not solve your model failure:
  • It is likely that DM-only is about as fast as DM+SM. If your entire purpose for DM+SM is performance, this may not be all that important for you.
  • Have you looked at different decompositions with MPI? The default is 5 (north-south) by 4 (east-west). Use the namelist options nproc_x and nproc_y for your testing; note that nproc_x * nproc_y must equal 40 (or however many MPI processes you are trying to run). A sketch of these settings follows this list.
  • Have you tried modifying the compiler flags in the configure.wrf file? This can sometimes provide a nice performance boost. After a clean and configure, edit the configure.wrf file and look for the FCOPTIM Makefile macro (it looks like a shell variable assignment); an example also follows this list.
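For reference, a minimal sketch of what that decomposition test could look like: the two lines below would go inside your existing &domains section of namelist.input. The 8 x 5 split is only an illustrative choice for 40 MPI tasks; any pair whose product matches your MPI task count works the same way.

  nproc_x = 8,
  nproc_y = 5,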

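And a purely illustrative example of the kind of FCOPTIM edit that bullet refers to, assuming the Intel build already targets AVX-512 (check your own configure.wrf for the actual stock line before changing it). A line such as

  FCOPTIM = -O3

could, for example, be extended to

  FCOPTIM = -O3 -xCORE-AVX512
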
A couple of points that may help you with your model failure:
  • Have you tried simplifying your mpirun command? I understand that the options are there for performance, not just decoration, but first WRF needs to complete a time step; see the stripped-down example after this list.
  • Rebuild the code with DM-only, run the model for a few time steps (turn off the debug switch), and send us that rsl.out.0000 file. I just want to look for anomalous timings.
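A minimal sketch of what a stripped-down dm+sm launch could look like, dropping the explicit pinning variables so Intel MPI falls back to its defaults (the task and thread counts simply mirror your original command; adjust as needed):

$ mpirun -np 2 -genv OMP_NUM_THREADS=19 ./wrf.exe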

A couple of points if you do manage to get DM+SM working:
  • Why not use all 40 cores (for example, 2 MPI tasks with 20 OpenMP threads each)? Even if you stay at 38, perhaps split the processors so that you use 19 on each of your 2 CPUs.
    • You may also want to try more MPI processes with a smaller number of threads per process, for example 8 MPI tasks with 5 OpenMP threads each; see the sketch below.
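As a purely illustrative sketch of those two splits, reusing the -genv style from your original command and assuming the dm+sm build with default pinning:

$ mpirun -np 2 -genv OMP_NUM_THREADS=20 ./wrf.exe
$ mpirun -np 8 -genv OMP_NUM_THREADS=5 ./wrf.exe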
 