Domain Decomposition and Runtime Performance

itwm-hpc · Jul 19, 2022

Greetings,

I have been investigating the performance of WRF w.r.t. different domain decomposition schemes (patches/tiles) in order to optimize for the turnaround time.

My approach is to reduce the number of patches on MPI level and to increase the number of tiles instead on OMP level. This should reduce the boundary data exchanged via MPI.

My expectation is that further decomposition of the patches into square tiles on OMP level should perform better than a 1D decomposition along the y axis. For two reasons:

First, we have better control to keep the data within caches when we do stencil accesses along all three axes.

Second, we reduce the additional data that each thread needs to access at the boundaries of the tiles by reducing the ratio of boundaries to inner points.

However, my benchmarks revealed the opposite. 1D decomposition performs significantly better than 2D square decomposition. I don't understand why this is the case.

To illustrate the point a series of benchmarks has been done. Our example has two domains, the first has 150x134x60 points, the second has 531x571x60 points.

I start 4 MPI ranks, one is pinned on each NUMA socket of a single node. WRF 4.3.1 was configured with "16", compiled with Intel compiler. MPI was Intel MPI. I selected one single OMP thread per rank. Depending on the benchmark this thread executes the tiles of each patch one after the other. No parallel execution of tiles within the same patch.

Domain one is executed every five executions of domain 2. Domain 2 dominates the runtime. We execute on a single cluster node with 2 AMD EPYC™ 7V73X (Milan-X) CPU, 512 GB RAM. The machine has 4 NUMA nodes 30 cores each, no hyper threading.

The tiling was done with 1D tiling along y-axis with 1,4,9,16,... tiles (still one thread per rank). And with 2D tiling along x/y axis 1, 2x2, 3x3, 4x4, ... (also single thread per rank). So the only difference between the two benchmark series is the order that the threads execute the data within each patch.

This setting excludes effects of OMP scheduling and false sharing between threads. It includes cache effects, TLB effects and overheads by loop pealing.

Plot 1 depicts the results. It plots the average runtime per iteration of domain1 (including 5 calls to domain2). We see that 1D tiling is performing always better than 2D tiling.

2D tiling gets significantly worse going from 1 to 2x2 and to 3x3. Beyond that we see a plateau. For very large numbers of tiles the runtime starts to raise again. We never see any positive cache effects by shrinking tiles.

1D tiling performs very constantly (always better than 2D) but gets worse for very high number of tiles. No visible negative cache effects or boundary effects caused by low extension of tiles in y direction.

Performance degradation for both 2D/1D tiles at high tile numbers should be caused by management overhead per tile.

Seemingly shorter extension in x direction seems to degrade the performance very quickly, e.g. going from 260 points to 130 points. However I don't see respective performance degradation when I use more ranks with smaller patches which have shorter extension in x-direction too.

I would appreciate any hints concerning the reason why we see performance degradation of 2D tiling (but less for 2D patches).

Thanks in advance!

kwerner · Jul 20, 2022

Hi, I've reached out to someone who may be able to offer some possible explanation or thoughts. They will hopefully get a response to you later this week, if possible. Thank you for your patience.

In the meantime, perhaps another user on the forum will have had some experience with this kind of thing and will be able to offer some insight.

itwm-hpc · Aug 1, 2022

Hi,

I further extended the experiments. In the 2D cases the configuration of the processes are changed from 2x2 to 1x4 and 4x1. The sub structure of the tiles is adapted accordingly so that we still get close to homogeneous quadratic tiles of the same size everywhere.

# tiles per proc = 4, tile size approx. 133x143:

config	# procs x	# procs y	# tiles x	# tiles y
normal	2	2	2	2
tall	4	1	1	4
fat	1	4	4	1

# tiles per proc = 16, tile size approx. 67x72:

config	# procs x	# procs y	# tiles x	# tiles y
normal	2	2	4	4
tall	4	1	2	8
fat	1	4	8	2

# tiles per proc = 36, tile size approx. 45x48:

config	# procs x	# procs y	# tiles x	# tiles y
normal	2	2	6	6
tall	4	1	3	12
fat	1	4	12	3

Other details as described in the first post.

The 2D plot summarizes the three cases tall, fat, normal. In all cases the tiles have approximately the same size and aspect ratio and we calculate on a single thread per process only. Nevertheless the fat case performs worst, the tall case performs best. The normal case performs somewhere in between.

The 1D plot shows the same experiments for 1D Tile distribution. E.g.

# tiles per proc = 4

config	# procs x	# procs y	# tiles x	# tiles y
normal	2	2	1	4
tall	4	1	1	4
fat	1	4	1	4

... and so on and so forth.

Here we see almost no difference between the three cases normal, tall and fat.

What could be the reason for this pretty mysterious behavior? Any ideas are welcome!

Domain Decomposition and Runtime Performance

itwm-hpc

New member

kwerner

Administrator

itwm-hpc

New member