
Question about the number of nodes to run WRF

angelarc

New member
Hello,

We are running parallelization tests and using the script referenced in How many processors should I use to run WRF? to determine the number of nodes we need. Our setup has four domains, driven with ndown, and we have been able to run the first three domains without issues.

For the 4th domain, the Python script returns a value of 58 nodes (each of our nodes has 48 cores).

The problem is that the process starts, writes a few lines (see the attached rsl file), and then hangs: it keeps running but produces no output, no results, and no further log entries, until it is killed when the scheduled wall time runs out. We let it run for 24 hours just to be certain that nothing was happening.

After several tests with different configurations, the run finally succeeded with 40 nodes and a time_step of 2 (lower than we had before). Our questions: why is there such a difference between the value the script gave us and the one that worked? And why didn't we get an error about it (we have previously seen errors when using more nodes than needed)? Could this be related to something else, such as node memory?
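For reference, one thing to check when a run hangs silently is the per-rank patch size. The sketch below is an approximation (not the FAQ script itself): it estimates the patch dimensions assuming WRF's default near-square factorization of MPI ranks, and the often-quoted guideline of keeping each patch to at least roughly 10 x 10 grid points is general WRF forum advice, not something established in this thread.

```python
import math

def patch_size(e_we, e_sn, total_cores):
    """Rough per-rank patch size for a given grid and MPI rank count.

    Assumes WRF's default decomposition, which factors the rank count
    into nproc_x * nproc_y as close to square as possible. Patches much
    smaller than ~10 x 10 points can fail or stall in halo exchanges.
    """
    nproc_x = int(math.sqrt(total_cores))
    while total_cores % nproc_x != 0:
        nproc_x -= 1
    nproc_y = total_cores // nproc_x
    # e_we/e_sn count staggered points; mass points are one fewer.
    return (e_we - 1) // nproc_x, (e_sn - 1) // nproc_y

# d04 grid from this thread: 1141 x 1096, on 58 nodes x 48 cores
print(patch_size(1141, 1096, 58 * 48))  # roughly (23, 18)
```

By this estimate the d04 patches come out around 23 x 18 points, above the commonly cited floor, so over-decomposition alone may not explain the hang; per-node memory is also worth ruling out.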

Thanks for your help!
 

Attachments

  • namelist.input (9.2 KB)
  • namelist_wrf_d03.input (9 KB)
  • rsl.out.0000 (869 bytes)
Hi,

I looked at your D04 case, which has a grid of 1141 x 1096 points. For such a big case, your choice of 58 nodes (48 cores/node) is appropriate.
I don't have an answer at hand for why the model stopped running; sorry, we don't have much experience running cases this big.

I suspect this is not related to the number of CPUs you used. For a grid interval of 200 m (0.2 km), the maximum time step is about 1.2 s. I am curious how your case can run successfully with a time step of 2 s on a 0.2 km grid interval.
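For reference, the 1.2 s figure follows the standard WRF rule of thumb that the time step in seconds should be at most about six times the grid spacing in kilometers; a minimal sketch:

```python
# WRF Users' Guide rule of thumb: time_step (s) <= ~6 * dx (km);
# convection or steep terrain may require an even smaller value.
def max_time_step_s(dx_km):
    return 6.0 * dx_km

print(max_time_step_s(0.2))  # -> 1.2 s for the 200 m d04 grid
```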
 
Hi,

Thanks for your reply.

We will keep running tests with different configurations to see if we find one that yields consistent (and correct) results. If we manage to find something, we will add the information here.
 
Any success? I've been running into this issue ever since I started trying to run WRF on the Princeton clusters.
 
Yes, after much testing of the parameterizations in our namelists and the node assignment, we managed to get successful runs.

We ended up using 40 nodes (128 cores per node). I'm attaching a couple of our namelists; maybe they'll be helpful to you.
 

Attachments

  • namelist.input (9.6 KB)
  • namelist_d03.input (8.7 KB)
@angelarc
I assume you ran the D03 case using 40 nodes (128 cores/node); please let me know if I am wrong.
For a case with a grid of 679 x 583 points, that many cores is far too many.

@MichaelIgb
We recommend determining the largest number of processors you should use with the formula below:

(e_we/25) * (e_sn/25)

Hope this is helpful.
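Applied to the grids discussed in this thread, the formula gives the upper bounds below. The companion lower bound, (e_we/100) * (e_sn/100), comes from the same FAQ this thread references; treat both as guidance rather than hard limits. A minimal sketch:

```python
# Rule-of-thumb processor bounds from the WRF FAQ:
#   least: (e_we/100) * (e_sn/100)   most: (e_we/25) * (e_sn/25)
def core_bounds(e_we, e_sn):
    return (e_we // 100) * (e_sn // 100), (e_we // 25) * (e_sn // 25)

for name, (e_we, e_sn) in {"d03": (679, 583), "d04": (1141, 1096)}.items():
    lo, hi = core_bounds(e_we, e_sn)
    print(f"{name}: ~{lo}-{hi} cores "
          f"(~{max(lo // 48, 1)}-{hi // 48} nodes at 48 cores/node)")
```

By this guidance, d04 tops out around 1935 cores, i.e. roughly 40 nodes at 48 cores/node, noticeably fewer than the 2784 cores (58 nodes x 48) of the run that hung.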
 
Thank you so much
 
To answer your question: we ran d03 with 18 nodes and the last domain, d04, with 40 nodes. I suppose my earlier reply was not very clear on this point, sorry.

The namelist for that domain is attached below.
 

Attachments

  • namelist_wrf_d04.input (7.8 KB)