Slow run when using more than 1 CPU per node

MLandreau

Member
Hi,

I am running WRF v4.5 simulations on a cluster. I performed a few tests to determine which number of nodes/tasks/CPUs was optimal for my case. I tried several configurations and measured the time per timestep. Here is a sample of the results:

1 node, 1 task, 1 core: 15 s
1 node, 4 tasks, 4 cores: 53 s
1 node, 4 tasks, 48 cores: 53 s
4 nodes, 4 tasks, 4 cores: 7 s
4 nodes, 9 tasks, 9 cores: 53 s
9 nodes, 9 tasks, 9 cores: 3.5 s
16 nodes, 16 tasks, 16 cores: 2 s

The nodes involved are always the same ones. My simulation is composed of 3 nested domains of 120x120x52 cells each.
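(For reference, I read the per-timestep times directly from the WRF rsl log of rank 0, with something along these lines, assuming the default logging:)

```
# Per-timestep timings reported by WRF in the rank 0 log file
grep "Timing for main" rsl.out.0000 | head
```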

I have difficulty understanding these results. It seems that when multiple cores are used on the same node, the program is very slow. I would have inferred the opposite, since message passing inside a single node should be faster than between two nodes?
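(One thing that might help diagnose this, as far as I know, is asking Slurm to report where it places and binds each task, for example:)

```
# Placement/binding check (hypothetical job size, adapt to the allocation):
# --cpu-bind=verbose prints the CPU mask each task is bound to
srun --nodes=1 --ntasks=4 --cpu-bind=verbose hostname
```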

Mathieu
 
Hi Mathieu,

Is your nesting 3 separate nested domains in one parent domain or each sequentially nested (domain 3 nested within domain 2, domain 2 nested within domain 1)?

Also, is this only using MPI tasks or do you have OpenMP threads associated with each task as well?
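For reference, with a hybrid (dm+sm, MPI + OpenMP) build the job script would typically look something like the sketch below; the numbers are just placeholders, not a recommendation:

```
#!/bin/bash
#SBATCH --time=01:00:00
#SBATCH --nodes=1
#SBATCH --ntasks=4           # MPI tasks
#SBATCH --cpus-per-task=4    # OpenMP threads per MPI task

# Threads per task; a pure dmpar build has no threading and ignores this
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

time mpiexec -np $SLURM_NTASKS ./wrf.exe
```

With a pure dmpar build there is no threading, so each task uses a single core.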
 
Hi,

My simulation is of the (d03 in d02, d02 in d01) type. I am using the mpich-4.1.2 library, and WRF is built with configuration 34 (distributed memory).

I partially fixed my problem; it was related to the way I was using Slurm. In case someone faces the same issue, here is what worked for me:

I previously used the following Slurm script.
```
#!/bin/bash
#SBATCH --time=01:00:00
#SBATCH --nodes=1
#SBATCH --ntasks=9

# launch one WRF process per allocated Slurm task
time mpiexec -np $SLURM_NTASKS ./wrf.exe
```

And I replaced it with this script.
```
#!/bin/bash
#SBATCH --time=01:00:00
#SBATCH --nodes=1
#SBATCH --ntasks=9

export SLURM_CPU_BIND=verbose,socket
export SLURM_DISTRIBUTION="block:cyclic:block"
time srun --mpi=pmi2 ./wrf.exe
```

I can't tell exactly how this solves the problem, but it seems to work on the cluster I am using.
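As far as I understand the srun man page, the same thing can also be expressed with explicit flags instead of environment variables, along these lines (untested on my side):

```
# --cpu-bind=verbose,sockets        bind each task to a socket, print the mask
# --distribution=block:cyclic:block block across nodes, cyclic across sockets,
#                                   block across cores
time srun --mpi=pmi2 \
     --cpu-bind=verbose,sockets \
     --distribution=block:cyclic:block \
     ./wrf.exe
```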

Mathieu
 