I have been reading some of the entries on UCAR website about WRF performance and scaling, but I am confused about one thing: how are nested domains divided up between processors when running WRF that was compiled with dmpar options? I see from here that the x and y dimensions to the domain are divided by the factors of the number of cores. Does that apply to nested domains as well? What is the performance hit of adding a domain the same grid size with a grid ratio and time-step factor of 5?

The reason I am asking I am trying to find the hypothetical time it would have taken my simulation to run on a different machine. I ran a 18 hr simulation with time step of 40 seconds, 5 total domains, each nested with grid ratio and time step factor of 5. The grid sizes were 102x102 for the course and 106x106 for all the fine domains. I assume this means that ~7.81 million grid points are integrated at each time step, but I could be off. I ran with 32 cores (Intel Xeon Processor E5-2670, 2.60 GHz) in dmpar config and it took 31 hours to complete.

Given what I read online, I can't get the math to work out to get even the same order of magnitude for the running time. If the parallel run divides the domain in to 26x13 or 27x14 subdomains, are all the nested domains divided similarly, so each subdomain is still roughly 5 times as expensive as it's parent? For my example this would mean we are computing 286,171 grid points per processor per timestep, or 463 million grid points over the course of the simulation.

Additionally, how many floating point or equivalent operations per grid cell are needed to integrate at each timestep? If I had a rough number I could more easily translate the runtime between machines. Assuming the domain breakdown I mentioned above and a rough processor performance of 5 Gflops, that comes out to 1.2 million floating point ops per grid point. Does that sound like the right ballpark?

My apologies in advance if I have misunderstood something fundamental. I'm an undergraduate who is completely new to NWP and only recently began using WRF.

## How are nested domains divided in distributed memory execution?

### Re: How are nested domains divided in distributed memory execution?

Attached are a couple of screen shots.

The first is a depiction of a WRF simulation with five total domains. The yellow area is domain 1, the blue box is domain 2, and the cyan boxes are domains 3, 4, 5. These are set up just to show us the order of operations.

The second attachment shows the sequential procedure within the WRF model.

1. Domain 1 runs a single time step for 3 minutes (red arrow).

2. Domain 2 runs a single time step for 1 minute (yellow arrow). Before domain 2 can do the second time step, all of domain 2's children need to catch up.

3. Domain 3 runs multiple time steps to catch up to domain 2 (20 second time step).

4. Domain 4 runs multiple time steps to catch up to domain 2.

5. Domains 5 run multiple time steps to catch up to domain 2.

6. Domains 3, 4, and 5 feed back to domain 2.

7. The steps 2 - 6 are repeated until domain 2 has caught up with domain 1.

8. Domain 2 feeds back to domain 1.

9. Steps 1 - 8 are repeated until all domains are completed with the entire simulation.

For computational performance, generally the amount of time that the WRF model spends on a single grid column is nearly constant. If we assume that all domains are running the same physics, then the amount of computation is about the same for a single grid column, whether it be from domain 1 or domain 2. However, a nest causes the model to be expensive for two reasons.

1. The fine grid time step is smaller (typically, the time ratio is the same as the grid distance ratio). While the amount of time is similar to do a single grid column, the fine grid has to do that computation more times to keep pace with the larger time step from the parent domain.

2. A nested simulation has additional overhead: computing lateral boundaries at each parent time step, feedback at the end of the child time steps. And for a distributed memory machine, these also entail a good amount of communication.

Let's use the 5-domain example that I depicted. We have a 3:1 ratio for the grid distance ratios and the time step ratios. Let us assume that the number of grid cells is identical in each of the domains (for example, all are 100x100 or all of 500x500, it does not really matter).

Let us define the cost of running domain 1 as ONE unit of computation.

Without overhead (boundary update, feedback, extra nesting communications), the cost of domain 2 would then THREE units of computation (the same number grid points, just doing the computations three times more due to the smaller time step).

With the same arm waving, each of domains 3, 4, and 5 would cost NINE units of computation (due entirely to their smaller time step).

Without any nesting overhead, one would expect this five domain run to be 31x more expensive than the single domain case (1 + 3 + 9 + 9 + 9).

With nesting overhead, the costs are a bit unpredictable. We tend to see an additional factor of anywhere between 10-25%, depending on how expensive the communications interconnect is (hardware on the computer). The process of nesting is communication intensive. The more domains and levels of nests, the more the overhead.

For your particular case, there are about 8% more grid cells on the fine domains than d01.

d01 costs 1 unit of time

d02 costs 1.08 * 5 units of time

d03 costs 1.08 * 25 units of time

d04 costs 1.08 * 125 units of time

d05 costs 1.08 * 625 units of time

Compared to the d01 only (which is our unit measure of time), the entire simulation without overhead relative to the coarse domain is ( 1 + 1.08*(5 + 25 + 125 + 625 ) ) = 843.4x the cost of d01.

From this, you can see why the advice we give is to make the outer domains LARGE. Domains 1 and 2 together are not even 1% of the simulation's cost.

For your timing estimates, on your current machine, just run the first two domains for a few time steps. Throw away the initial time and make sure that a radiation step is included in your timing. You should be able to get a good scaling from domains 1 and 2 compared to your entire simulation.

Next, run the small d01 and d02 case on the new machine. Use the same scale factor to estimate what the five domain cost would be on the new machine.

The absolute amount of computations per grid cell and the speed of the chip are not a direct way to handle the timing performance, due to the relatively low percent of peak that WRF is able to obtain, the wild variations in compiler performance, the huge impact that I/O can have on a full simulation, etc.

Lastly, the default decomposition for each domain is based only on the number of MPI process and the number of grid cells in a domain. If the MPI processes are set up to decompose the domain into 8 chunks in the south-north direction, and 4 chunks in the west-east direction - that same decomposition is used for every domain.

The first is a depiction of a WRF simulation with five total domains. The yellow area is domain 1, the blue box is domain 2, and the cyan boxes are domains 3, 4, 5. These are set up just to show us the order of operations.

The second attachment shows the sequential procedure within the WRF model.

1. Domain 1 runs a single time step for 3 minutes (red arrow).

2. Domain 2 runs a single time step for 1 minute (yellow arrow). Before domain 2 can do the second time step, all of domain 2's children need to catch up.

3. Domain 3 runs multiple time steps to catch up to domain 2 (20 second time step).

4. Domain 4 runs multiple time steps to catch up to domain 2.

5. Domains 5 run multiple time steps to catch up to domain 2.

6. Domains 3, 4, and 5 feed back to domain 2.

7. The steps 2 - 6 are repeated until domain 2 has caught up with domain 1.

8. Domain 2 feeds back to domain 1.

9. Steps 1 - 8 are repeated until all domains are completed with the entire simulation.

For computational performance, generally the amount of time that the WRF model spends on a single grid column is nearly constant. If we assume that all domains are running the same physics, then the amount of computation is about the same for a single grid column, whether it be from domain 1 or domain 2. However, a nest causes the model to be expensive for two reasons.

1. The fine grid time step is smaller (typically, the time ratio is the same as the grid distance ratio). While the amount of time is similar to do a single grid column, the fine grid has to do that computation more times to keep pace with the larger time step from the parent domain.

2. A nested simulation has additional overhead: computing lateral boundaries at each parent time step, feedback at the end of the child time steps. And for a distributed memory machine, these also entail a good amount of communication.

Let's use the 5-domain example that I depicted. We have a 3:1 ratio for the grid distance ratios and the time step ratios. Let us assume that the number of grid cells is identical in each of the domains (for example, all are 100x100 or all of 500x500, it does not really matter).

Let us define the cost of running domain 1 as ONE unit of computation.

Without overhead (boundary update, feedback, extra nesting communications), the cost of domain 2 would then THREE units of computation (the same number grid points, just doing the computations three times more due to the smaller time step).

With the same arm waving, each of domains 3, 4, and 5 would cost NINE units of computation (due entirely to their smaller time step).

Without any nesting overhead, one would expect this five domain run to be 31x more expensive than the single domain case (1 + 3 + 9 + 9 + 9).

With nesting overhead, the costs are a bit unpredictable. We tend to see an additional factor of anywhere between 10-25%, depending on how expensive the communications interconnect is (hardware on the computer). The process of nesting is communication intensive. The more domains and levels of nests, the more the overhead.

For your particular case, there are about 8% more grid cells on the fine domains than d01.

d01 costs 1 unit of time

d02 costs 1.08 * 5 units of time

d03 costs 1.08 * 25 units of time

d04 costs 1.08 * 125 units of time

d05 costs 1.08 * 625 units of time

Compared to the d01 only (which is our unit measure of time), the entire simulation without overhead relative to the coarse domain is ( 1 + 1.08*(5 + 25 + 125 + 625 ) ) = 843.4x the cost of d01.

From this, you can see why the advice we give is to make the outer domains LARGE. Domains 1 and 2 together are not even 1% of the simulation's cost.

For your timing estimates, on your current machine, just run the first two domains for a few time steps. Throw away the initial time and make sure that a radiation step is included in your timing. You should be able to get a good scaling from domains 1 and 2 compared to your entire simulation.

Next, run the small d01 and d02 case on the new machine. Use the same scale factor to estimate what the five domain cost would be on the new machine.

The absolute amount of computations per grid cell and the speed of the chip are not a direct way to handle the timing performance, due to the relatively low percent of peak that WRF is able to obtain, the wild variations in compiler performance, the huge impact that I/O can have on a full simulation, etc.

Lastly, the default decomposition for each domain is based only on the number of MPI process and the number of grid cells in a domain. If the MPI processes are set up to decompose the domain into 8 chunks in the south-north direction, and 4 chunks in the west-east direction - that same decomposition is used for every domain.

Dave Gill

*NCAR/MMM*