WPS Compile Issue

cwrenn · Sep 4, 2021

kwerner--brief update: I used your python script alongside the domain size and location step with plotgrids_new.ncl. When

e_we = 180, 250, 220
e_sn = 180, 241, 241,

the python script suggests 180 processors and 9 nodes (when there are 20 cores per node). These particular values of processors and nodes hold until you get to

e_we = 200, 250, 253
e_sn = 200, 241, 253,

when the suggested number jumps to 400 processors and 20 nodes (when there are 20 cores per node).

Do you think that this is real? Is a setting of 200 in the first domain some sort of magic number?

cwrenn · Sep 5, 2021

kwerner--
I got some preliminary results from reducing the number of processors from 240 to 180, as suggested by your python script number_of_procs.py: the run seems to simulate about the same number of days with 25% fewer processors. Do you think that this is the result of not spending as much time calculating the halo regions? Thanks again!

kwerner · Sep 7, 2021

Hi,
I'm not sure I fully understand your question. When you say the run is simulating about the same number of days with 25% less processors, what do you mean by that? Do you mean it's simulating the same number of days in the same amount of time (real time), as if you had used 25% more processors? Are you satisfied with the results, or do you see them as being problematic? What do you mean by "do you think that this is the result of not spending as much time calculating the halo regions?" The reason you cannot use too many processors is that at some point the grid will be divided in a way that the halo regions are the majority of each grid cell, and there is no space in each grid cell left for calculating physical processes.

As for the question you asked in the email from Sept 4, if you uncomment the print statements in the python script and run it, you can see the full break-down. This script is meant to only calculate, based on the number of processors you want to use per node. Since you have 20 processors per node, if you have a domain size of 180x180, and you use 180 processors, the closest two factors of 180 are 12 and 15, meaning the processor break down will be:
180/12 = 15
180/15 = 12
As the minimum number of grid spaces allowed in each direction is 10, 180 processors is okay.

If, however, you tried to use 200 processors (which is adding 1 additional node with 20 processors), the breakdown would look like:
closest two factors: 10 and 20
180/10 = 18
180/20 = 9
While the east-west direction is okay, the number for north-south is 9, which is less than the minimum of 10. In this case, you should get an error indicating that you are using too many processors for this domain.

If your domain is size 200x200, then the decomposition becomes different (meaning the closest factor pair allows many more processors):
Using 400 processors: closest two factors are 20 and 20
200/20 = 10
200/20 = 10
which falls within the limit, and is okay. A domain of size 1 grid cell less would not meet this requirement.
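The arithmetic above can be sketched in a few lines of Python. This is only an illustration of the reasoning in this post, not the contents of number_of_procs.py; the function names are made up for the example, and the assignment of the smaller factor to west-east and the larger to south-north is an assumption taken from the numbers quoted in this thread.

```python
import math

MIN_PATCH = 10  # minimum grid cells per processor in each direction, as stated above


def closest_factor_pair(nprocs):
    """Return (a, b) with a * b == nprocs and a <= b, with a as close to sqrt(nprocs) as possible."""
    for a in range(math.isqrt(nprocs), 0, -1):
        if nprocs % a == 0:
            return a, nprocs // a


def patch_widths(e_we, e_sn, nprocs):
    # Assumption: smaller factor splits west-east, larger factor splits south-north.
    nproc_x, nproc_y = closest_factor_pair(nprocs)
    return e_we / nproc_x, e_sn / nproc_y


def decomposition_ok(e_we, e_sn, nprocs):
    """True when both patch widths meet the 10-cell minimum."""
    return min(patch_widths(e_we, e_sn, nprocs)) >= MIN_PATCH


# The cases discussed above: (domain edge, number of processors)
for edge, nprocs in [(180, 180), (180, 200), (200, 400), (199, 400)]:
    print(edge, nprocs, closest_factor_pair(nprocs),
          patch_widths(edge, edge, nprocs), decomposition_ok(edge, edge, nprocs))
```

Under these assumptions, 180 processors on a 180x180 domain and 400 processors on a 200x200 domain pass the check, while 200 processors on 180x180 and 400 processors on 199x199 do not.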

*Note: I've once again modified the script for finding the correct number of processors. I realized I did not account for a case in which you could not use more than a single node. This does not apply for you, but if you want the updated version, please find it here.

cwrenn · Sep 7, 2021

hey, kwerner!
Yes, the run is simulating about the same number of days with 25% fewer processors in the same amount of time (real time), as if I had used 25% more processors. Yes, I am satisfied with the results, and am glad that I am better seeing the effects of too many nodes. I like your explanation that "The reason you cannot use too many processors is that at some point the grid will be divided in a way that the halo regions are the majority of each grid cell, and there is no space in each grid cell left for calculating physical processes," but I am unsure of exactly what "divided in a way that the halo regions are the majority of each grid cell" means. Could you clarify?
Is there any way to show that the halo regions are the majority of each grid cell? For example, if I ran the same job with 200 processors, is there any way to see how the halo regions are becoming the majority of each grid cell?
It's always more satisfying to know what you're doing, and sometimes it helps to see the problem more explicitly.
Mahalo nui!

kwerner · Sep 13, 2021

Hi,
Okay, I'm attaching two figures I created to show this better. For the sake of time, I did not create figures using a large number of processors, but hopefully you'll understand with these.

halo_good.png

Domain is 192x192: 192 grid spaces in each direction

If you use 16 processors, the grid will be divided into a 4x4 grid, where each square is its own processor.

This means each processor is responsible for calculations in 48 grid points in the west/east direction, and 48 grid points in the south/north direction ( 192/4 = 48 ).

Each processor has a halo region on each side of its square - these are the columns/rows responsible for communicating to the cells next to them.

If the halo region is 5 rows along each edge, this leaves 38 cells available in the middle of each square for physical calculations ( 48 - 5 - 5 = 38 ).

And this is okay!!

halo_bad.png

Domain is 36x36: 36 grid spaces in each direction

If you use 16 processors, the grid will be divided into a 4x4 grid, where each square is its own processor.

This means each processor is responsible for calculations in 9 grid points in the west/east direction, and 9 grid points in the south/north direction ( 36/4 = 9 ).

Each processor has a halo region on each side of its square - these are the columns/rows responsible for communicating to the cells next to them.

If the halo region is 5 rows along each edge, this leaves NO cells available in the middle of each square for physical calculations ( 9 - 5 - 5 = -1 ).

And this is bad.

As for actually seeing this in the code or output, I'm not sure there is a way to do that, but hopefully the above explanation helps to make it more clear.
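One way to make the two figures concrete is to compute the numbers directly. The sketch below follows the simplified picture used in this post (a square domain split evenly among processors, with a halo of fixed width along each edge of every square). The halo width of 5 is the illustrative value from the figures, and the function names are hypothetical; none of this reads anything from WRF itself.

```python
def interior_width(domain_cells, procs_per_side, halo_width):
    """Cells left in one direction after removing the halo from both edges of a square."""
    patch = domain_cells // procs_per_side
    return patch - 2 * halo_width


def interior_fraction(domain_cells, procs_per_side, halo_width):
    """Share of each processor's square that is left for physical calculations."""
    patch = domain_cells // procs_per_side
    interior = max(interior_width(domain_cells, procs_per_side, halo_width), 0)
    return (interior * interior) / (patch * patch)


# halo_good.png: 192x192 domain, 4x4 processors, halo of 5
print(interior_width(192, 4, 5))      # 38
print(interior_fraction(192, 4, 5))

# halo_bad.png: 36x36 domain, 4x4 processors, halo of 5
print(interior_width(36, 4, 5))       # -1
print(interior_fraction(36, 4, 5))
```

In the first case roughly 63% of each square is interior; in the second case nothing is left, which is the situation the minimum patch size is meant to prevent.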

cwrenn · Sep 13, 2021

kwerner--I really appreciate your taking the time to make an accompanying graphic. I will study it and try to more fully understand the strengths and limitations of the domain size and processor allocation connections that I have been investigating lately.
Regarding the runs where I used both 240 processors and 180 processors (recommended by number_of_procs.py): I finally had the opportunity to do a 200-processor run (20 cores x 10 nodes) and it promptly crashed. I was surprised that it could run with 240 processors but not 200. The error message in rsl.error.0000 stated "Minimum decomposed computational patch size, either x or y 10 grid cells, e_we = 180 nproc_x = 10, e_sn = 180 nproc_y = 20 with cell width x-dir = 18, with cell width y-dir = 9." I'll puzzle through your description and graphic along with this error message to get a fuller understanding of the interrelationships so I can optimize my domain size and processor number for present and future experiments.
nui loa!

kwerner · Sep 13, 2021

Hi,
If you are using 200 processors, the closest factor pair is 10x20. If you use this, then 180/20 = 9, which is less than the minimum number of cells allowed in that direction (10).

If you use 240 processors, the closest factor pair is 15x16, where 180/16 = 11.25, which is above the minimum number of cells allowed in that direction (10).

Given this information, the script I wrote isn't perfect, and should probably be updated, but it gives a rough starting point.
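To see why 240 works even though 200 does not, it can help to check every whole-node processor count rather than stopping at the first one that fails. The following is a minimal sketch under the same assumptions used in this thread (closest factor pair, minimum of 10 cells per direction, larger factor applied to the second direction). It is not the logic of number_of_procs.py, and the function names are hypothetical.

```python
import math


def closest_factor_pair(nprocs):
    """Return (a, b) with a * b == nprocs and a <= b, with a as close to sqrt(nprocs) as possible."""
    for a in range(math.isqrt(nprocs), 0, -1):
        if nprocs % a == 0:
            return a, nprocs // a


def passing_counts(e_we, e_sn, cores_per_node, max_nodes, min_patch=10):
    """List the whole-node processor counts whose closest factor pair keeps both patch widths >= min_patch."""
    passing = []
    for nodes in range(1, max_nodes + 1):
        nprocs = nodes * cores_per_node
        nproc_x, nproc_y = closest_factor_pair(nprocs)
        if e_we / nproc_x >= min_patch and e_sn / nproc_y >= min_patch:
            passing.append(nprocs)
    return passing


# 180x180 domain, 20 cores per node, up to 15 nodes
print(passing_counts(180, 180, cores_per_node=20, max_nodes=15))
```

For the 180x180 domain this lists every multiple of 20 up to 180, skips 200 and 220, and then includes 240, which matches the behavior reported above.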

cwrenn · Oct 1, 2021

I have a question about the input for the number_of_procs.py script.

As you may remember, when I had e_sn = 180, e_we = 180, using nodes with 20 cores, the suggested number of tasks was 180 (nodes = 9). But what if I only use 18 of the 20 cores on each node: can I tell number_of_procs.py that the number of cores = 18? The reason I ask is that there appears to be some potential advantage to not using all the cores per node. In addition, because of the processor-to-tile constraints in WRF 4.3, there are only so many allowable configurations. I plan to do an experiment soon in which I only use 18 of the cores on each 20-core node (nodes = 10) to obtain 180 cores. But when I put 18 cores into number_of_procs.py, with the same south-north and east-west edges, it returns that the maximum number of cores is 324 using 18 nodes. Does that sound reasonable?

kwerner · Oct 4, 2021

Good question - yes, you can certainly do that!
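As a rough cross-check of the 324 figure, the sketch below searches for the largest whole-node processor count that still satisfies the 10-cell minimum when only 18 cores per node are used. It relies on the closest-factor-pair reasoning from earlier in this thread and is not the actual number_of_procs.py; the function names and the max_nodes search limit are assumptions made for the example.

```python
import math


def closest_factor_pair(nprocs):
    """Return (a, b) with a * b == nprocs and a <= b, with a as close to sqrt(nprocs) as possible."""
    for a in range(math.isqrt(nprocs), 0, -1):
        if nprocs % a == 0:
            return a, nprocs // a


def largest_fitting(e_we, e_sn, cores_per_node, max_nodes=40, min_patch=10):
    """Return (nodes, nprocs) for the largest whole-node count that passes, or None if none does."""
    best = None
    for nodes in range(1, max_nodes + 1):
        nprocs = nodes * cores_per_node
        nproc_x, nproc_y = closest_factor_pair(nprocs)
        if e_we / nproc_x >= min_patch and e_sn / nproc_y >= min_patch:
            best = (nodes, nprocs)
    return best


print(largest_fitting(180, 180, cores_per_node=18))  # (18, 324)
```

With 18 cores per node, 324 processors decompose as 18x18, giving 180/18 = 10 cells in each direction, which is exactly the minimum. The next step up, 342 = 18x19, gives 180/19 ≈ 9.5 and fails.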

cwrenn · Nov 8, 2021

kwerner--
I have now completed some runs and can give an update on what I've found when I reduce the number of cores per node from 20 to 18 for 20-core nodes. I also have two related questions:
A. How does the performance degrade when there are heterogeneous nodes? I.e., is it at the hardware level, the fibers, MPI, or the slurm daemons, or some combination of all of them?
B. What might cause non-orthogonal arrays to outperform orthogonal arrays (see below)?

1. I just finished a quick test comparing 5 nodes x 18 cores vs 5 nodes x 20 cores on the UH HPC, and the 18 core run went approximately 67% faster than the 20 core run.

2. For a different numerical experiment, I compared the performance of 10, 16, 17, and 18 nodes using 18 cores each.

a. The 18 nodes x 18 cores job was by far the slowest. It simulated one hour and 15 minutes in about an hour of clock time. (A paper by Moreno et al. shows that non-orthogonal arrays perform better than orthogonal arrays; 18 x 18 would be an example of an orthogonal array according to the authors in their paper "Analysis of New MPI process distribution for the WRF Model" (2020), https://www.hindawi.com/journals/sp/2020/8148373/)

b. The 17 nodes x 18 cores computed one simulated day in about 1 hour and 40 minutes (but this run is hard to compare with the rest b/c one of the nodes used was an older one, which slowed the overall computation down; I plan to rerun it later after excluding such nodes in my slurm script).

c. The 16 nodes x 18 cores simulated 1 day in about 1 hour and 33 minutes, using all the same nodes.

d. The 10 nodes x 18 cores run simulated 1 day in about 2 hours and 12 minutes when all nodes were the same. When an older, larger-memory node was included, the performance decreased by about 30% and the run took about 3 hours and 4 minutes per simulated day (see table below).

Well, that's all for now. Thanks again for all your help! I appreciate your having shared your number_of_procs.py script; I've been able to use it to optimize my wrf4.3 experiments.

Table:
Date    Time    Elapsed   Restart file
Nov 1    1:06             wrfrst_d03_1998-03-17_00:00:00
Nov 1    4:13   3:05      wrfrst_d03_1998-03-18_00:00:00
Nov 1    7:17   3:04      wrfrst_d03_1998-03-19_00:00:00
Nov 1   10:22   3:05      wrfrst_d03_1998-03-20_00:00:00
Nov 1   13:26   3:04      wrfrst_d03_1998-03-21_00:00:00
Nov 1   16:30   3:04      wrfrst_d03_1998-03-22_00:00:00
Nov 1   19:34   3:04      wrfrst_d03_1998-03-23_00:00:00

Nov 3   23:13             wrfrst_d03_1998-03-24_00:00:00
Nov 4    1:24   2:09      wrfrst_d03_1998-03-25_00:00:00
Nov 4    3:36   2:12      wrfrst_d03_1998-03-26_00:00:00
Nov 4    5:48   2:12      wrfrst_d03_1998-03-27_00:00:00
Nov 4    8:01   2:13      wrfrst_d03_1998-03-28_00:00:00
Nov 4   10:12   2:11      wrfrst_d03_1998-03-29_00:00:00
Nov 4   12:25   2:13      wrfrst_d03_1998-03-30_00:00:00
Nov 4   14:39   2:14      wrfrst_d03_1998-03-31_00:00:00
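
The "about 30%" figure can be checked from the elapsed times in the table. The short sketch below averages the elapsed wall-clock time per simulated day for each group of rows. It assumes, based on the description in item 2d, that the Nov 1 rows are the run that included the older node and the Nov 4 rows are the run on identical nodes; the variable names are made up for the example.

```python
def to_minutes(hhmm):
    """Convert an 'H:MM' string to minutes."""
    hours, minutes = hhmm.split(":")
    return int(hours) * 60 + int(minutes)


def mean_minutes(intervals):
    return sum(to_minutes(t) for t in intervals) / len(intervals)


# Elapsed times copied from the table above.
# Assumption: Nov 1 rows = run with the older node, Nov 4 rows = identical nodes.
with_older_node = ["3:05", "3:04", "3:05", "3:04", "3:04", "3:04"]
identical_nodes = ["2:09", "2:12", "2:12", "2:13", "2:11", "2:13", "2:14"]

slow = mean_minutes(with_older_node)
fast = mean_minutes(identical_nodes)

print(f"mean wall time per simulated day: {slow:.1f} min vs {fast:.1f} min")
print(f"extra wall time with the older node: {100 * (slow / fast - 1):.0f}%")
print(f"drop in simulated days per hour:     {100 * (1 - fast / slow):.0f}%")
```

Measured as throughput (simulated days per wall-clock hour), the drop is a little under 30%; measured as extra wall-clock time per simulated day, it is closer to 40%.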

kwerner · Nov 8, 2021

Hi,
Thanks for sharing those results. Hopefully they will be useful to us and/or another user in the future!
