wrf.exe in dm+sm mode

Please set OMP_NUM_THREADS to the number of OpenMP threads you want per MPI task, then run wrf.exe with the command `mpiexec -n N wrf.exe`, where N is the number of MPI processes you will use.
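A minimal sketch, assuming a bash shell and a 4-rank × 4-thread split (adjust both numbers to your own machine):

```
export OMP_NUM_THREADS=4     # OpenMP threads spawned by each MPI rank
mpiexec -n 4 ./wrf.exe       # 4 MPI ranks x 4 threads = 16 cores in total
```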
 
Many thanks, Ming Chen.
In my case I have a server with two Intel CPUs, and I want to run 10 threads on each of them. From your answer I understand that I have to set OMP_NUM_THREADS to 10 and run WRF with the command `mpirun -n 2 wrf.exe`.
I did that, but only 2 wrf.exe processes were started!
I am using the PGI compiler v19.04 and Open MPI v3.03.
 

Attachments

  • wrf_dm+sm.jpg (846.7 KB)
I am not sure what is going on here. One thing I need to know is: how many processors do you have in total? On NCAR's Cheyenne, for example, we have 36 processors per node, and if we want to run in dm+sm mode, we can do the following:

setenv OMP_NUM_THREADS N1   (or, in bash: export OMP_NUM_THREADS=N1)
mpiexec -n N2 wrf.exe

where N1 x N2 = 36.
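One note worth adding here: in dm+sm mode a plain process listing will only ever show N2 copies of wrf.exe, because the OpenMP threads live inside each MPI rank; you need a per-thread view to confirm they were actually spawned. A quick check, assuming a Linux system with the standard procps tools:

```
# one line per thread (LWP); with N2 ranks x N1 threads expect roughly N1*N2 lines
ps -eLf | grep '[w]rf.exe' | wc -l

# or watch per-thread CPU usage live
top -H -p "$(pgrep -d, wrf.exe)"
```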
 
Hi Ming Chen,
I have an AMD Ryzen 9 5900X CPU with the following topology:
Package L#0
  NUMANode L#0 (P#0 15GB)
  L3 L#0 (32MB)
    L2 L#0 (512KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0
      PU L#0 (P#0)
      PU L#1 (P#12)
    L2 L#1 (512KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1
      PU L#2 (P#1)
      PU L#3 (P#13)
    L2 L#2 (512KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2
      PU L#4 (P#2)
      PU L#5 (P#14)
    L2 L#3 (512KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3
      PU L#6 (P#3)
      PU L#7 (P#15)
    L2 L#4 (512KB) + L1d L#4 (32KB) + L1i L#4 (32KB) + Core L#4
      PU L#8 (P#4)
      PU L#9 (P#16)
    L2 L#5 (512KB) + L1d L#5 (32KB) + L1i L#5 (32KB) + Core L#5
      PU L#10 (P#5)
      PU L#11 (P#17)
  L3 L#1 (32MB)
    L2 L#6 (512KB) + L1d L#6 (32KB) + L1i L#6 (32KB) + Core L#6
      PU L#12 (P#6)
      PU L#13 (P#18)
    L2 L#7 (512KB) + L1d L#7 (32KB) + L1i L#7 (32KB) + Core L#7
      PU L#14 (P#7)
      PU L#15 (P#19)
    L2 L#8 (512KB) + L1d L#8 (32KB) + L1i L#8 (32KB) + Core L#8
      PU L#16 (P#8)
      PU L#17 (P#20)
    L2 L#9 (512KB) + L1d L#9 (32KB) + L1i L#9 (32KB) + Core L#9
      PU L#18 (P#9)
      PU L#19 (P#21)
    L2 L#10 (512KB) + L1d L#10 (32KB) + L1i L#10 (32KB) + Core L#10
      PU L#20 (P#10)
      PU L#21 (P#22)
    L2 L#11 (512KB) + L1d L#11 (32KB) + L1i L#11 (32KB) + Core L#11
      PU L#22 (P#11)
      PU L#23 (P#23)
I've compiled WRF-ARW v4.3.2 in DM+SM mode with AVX2 instruction-set support (Intel Parallel Studio XE).
When I run wrf.exe in hybrid MPI+OpenMP mode it crashes with the error:
*** longjmp causes uninitialized stack frame ***: ./wrf.exe terminated
I tried running WRF in several different ways:
1. `export OMP_NUM_THREADS=6 && mpirun -n 2 ./wrf.exe`
2. `mpirun -n 2 -genv OMP_NUM_THREADS=6 ./wrf.exe`
3. `mpirun -n 2 -genv OMP_NUM_THREADS=6 -genv I_MPI_PIN=1 -genv I_MPI_PIN_DOMAIN=omp -genv I_MPI_PIN_ORDER=compact -genv I_MPI_PIN_CELL=core -genv I_MPI_PIN_PROCESSOR_LIST=0-11 ./wrf.exe`
But the result was the same every time.
WRF runs without any problem in pure MPI mode:
`mpirun -n 12 ./wrf.exe`

Can you help me?
 
I would suggest that you stay with MPI mode.
At present I have no idea what is going on with the dm+sm mode. I will talk to our software engineers and get back to you if we have an answer.
 
Is there a problem with MPI+OpenMP on Intel hardware? I am not able to launch OpenMP threads correctly on Intel hardware, but I can on AMD. Any idea why?
 
I am sorry, but I don't know much about Intel hardware. I hope someone in the community can provide further information regarding this issue.
 
Came across this thread while searching for some answers. Hoping to piggyback off the previous comments.

I have been running WRF with a dm (MPI-only) build for quite some time, using 256 cores. It runs just fine, no problems.

I have started exploring the use of a dm+sm build. When I try running with 128 MPI tasks and 2 threads each (256 cores total), the model runs, but the run time increases by a factor of 2-3x. I figured I would see an improvement over the baseline, but that was not the case. Is there an optimal configuration you would suggest so I could potentially see an improvement? 64 tasks and 4 threads? 32 tasks and 8 threads? Or does sticking with my dm build seem to be the better option here? Thanks.
 
@andythewxman
It depends on the problem you're running and the architecture you're running on, but I've found that the following works well:

```
# Intel MPI process pinning
export I_MPI_PIN_DOMAIN=auto
export I_MPI_PIN_ORDER=bunch

# OpenMP threading and placement
export OMP_NUM_THREADS=2
export OMP_PROC_BIND=close
export OMP_PLACES=cores
export OMP_WAIT_POLICY=active
export KMP_STACKSIZE=128M    # per-thread stack size (Intel OpenMP runtime)

# 48 MPI ranks x 2 threads = 96 cores
mpiexec -np 48 wrf.exe
```

That setup works well for me on my 96-core machines; generally I'd keep to ~2-4 threads per MPI task. I'd also recommend checking the output from `top` or `htop` while wrf.exe is running to make sure your affinity settings aren't binding all the processes to one node (I've seen SLURM interfere with mpirun sometimes).
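For reference, a couple of quick placement checks (a sketch assuming standard Linux tools, nothing WRF-specific) that show where the ranks and threads actually landed:

```
# which logical CPU (PSR) each wrf.exe thread last ran on
ps -eLo pid,tid,psr,pcpu,comm | grep '[w]rf.exe'

# the CPU-affinity mask currently applied to each rank
for pid in $(pgrep wrf.exe); do taskset -cp "$pid"; done
```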
 