Hello,
We are running parallelization tests and using the script referenced in "How many processors should I use to run WRF?" to determine the correct number of nodes. We are using four domains and running ndown to drive the nested domains. We have been able to run without issues up to the third domain.
To run WRF for the 4th domain, the Python script mentioned above returns a value of 58 nodes (each of our nodes has 48 cores).
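For reference, here is a minimal sketch of how we understand the calculation behind that FAQ page. It assumes the commonly cited rule of thumb of roughly 25 to 100 grid points per MPI task in each horizontal direction; the grid dimensions in the example are placeholders, not our actual d04 values.

import math

def processor_range(e_we, e_sn):
    """Return (min, max) MPI task counts for an e_we x e_sn domain,
    assuming ~100 grid points per task (lower bound) and ~25 grid
    points per task (upper bound) in each horizontal direction."""
    smallest = math.ceil(e_we / 100) * math.ceil(e_sn / 100)
    largest = math.floor(e_we / 25) * math.floor(e_sn / 25)
    return smallest, largest

def nodes_needed(tasks, cores_per_node=48):
    """Convert an MPI task count to whole nodes (48 cores per node here)."""
    return math.ceil(tasks / cores_per_node)

# Placeholder domain size; replace with the real e_we/e_sn of d04.
lo, hi = processor_range(1200, 1450)
print(f"tasks: {lo} to {hi}; nodes at the upper bound: {nodes_needed(hi)}")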
The problem we are seeing is that the process starts, writes a few lines (as shown in the attached rsl file), and then hangs: it keeps running but produces no further output, results, or log messages. It is eventually killed when the scheduled wall-clock time runs out. We let it run for 24 hours just to be certain that nothing was happening.
We have done several tests with different configurations, and the run finally completed when we used 40 nodes and a time_step of 2 (lower than we had before). Our questions are: why is there such a large difference between the node count suggested by the script and the one that actually worked for us? And why did we not get an error about it (previously we have gotten errors when using more nodes than needed)? Could this be related to some other issue, such as node memory?
Thanks for your help!