
Question about the number of nodes to run WRF

angelarc

New member
Hello,

We are running parallelization tests, using the script referenced in "How many processors should I use to run WRF?" to determine the appropriate number of nodes. Our setup has four domains, and we use ndown to run them. We have been able to run up to the third domain without issues.

To run WRF for the 4th domain, the Python script returns a value of 58 nodes (each of our nodes has 48 cores).

The problem we are seeing is that the process seems to start, writes a few lines (as shown in the attached rsl file), and then keeps running without appearing to do anything else: there are no outputs, no results, and no logs. It ends up being killed when the scheduled time is up. We let the process run for 24 hours just to be certain that nothing was happening.

After several tests with different configurations, the run finally worked when we used 40 nodes and a time_step of 2 (lower than we had before). Our question is why there is such a difference between the value we got from the script and the one that worked for us. Also, why didn't we get an error in this case? (We have previously gotten errors related to using more nodes than needed.) Could this be related to some other issue, such as node memory?

Thanks for your help!
 

Attachments

  • namelist.input (9.2 KB)
  • namelist_wrf_d03.input (9 KB)
  • rsl.out.0000 (869 bytes)

Ming Chen

Moderator
Staff member
Hi,

I looked at the case for D04, which has grid dimensions of 1141 x 1096 points. For such a big case, your choice of 58 nodes (48 cores/node) is appropriate.
I don't have an answer at hand as to why the model stopped running. Sorry, we don't have much experience running such a big case.
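
As a rough check on the node counts discussed here, the sketch below estimates how many grid points each MPI task would get for D04 (1141 x 1096) with 58 and with 40 nodes of 48 cores. It is not the script referenced in the thread. The helper names are hypothetical, and it assumes one MPI task per core, a near-square task layout, and a commonly cited lower bound of about 10 grid points per patch side (MIN_PATCH_POINTS); none of these come from the thread.

```python
import math

# Assumption: commonly cited minimum patch side, not stated in the thread.
MIN_PATCH_POINTS = 10


def near_square_factors(ntasks: int) -> tuple[int, int]:
    """Return (a, b) with a * b == ntasks, a <= b, and a as close to sqrt(ntasks) as possible."""
    a = math.isqrt(ntasks)
    while ntasks % a != 0:
        a -= 1
    return a, ntasks // a


def patch_estimate(e_we: int, e_sn: int, nodes: int, cores_per_node: int = 48):
    """Estimate the average patch size per MPI task, assuming one task per core."""
    ntasks = nodes * cores_per_node
    # Assumption: smaller factor along west-east, larger along south-north.
    nproc_x, nproc_y = near_square_factors(ntasks)
    return ntasks, nproc_x, nproc_y, e_we / nproc_x, e_sn / nproc_y


for nodes in (58, 40):
    ntasks, nx, ny, px, py = patch_estimate(1141, 1096, nodes)
    ok = min(px, py) >= MIN_PATCH_POINTS
    print(f"{nodes} nodes -> {ntasks} tasks as {nx} x {ny}, "
          f"patch ~{px:.1f} x {py:.1f} points, above minimum: {ok}")
```

Under these assumptions both layouts leave each patch well above the minimum side length, so neither node count looks too large for this grid.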

I suspect that this is not related to the number of CPUs you used. For a grid interval of 200 m (0.2 km), the maximum time step is 1.2 s. I am curious how your case can run successfully with a time step of 2 s at a 0.2 km grid interval.
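
The 1.2 s figure matches the usual WRF rule of thumb that time_step (in seconds) should be no larger than about 6 times the grid interval in km. A minimal sketch of that arithmetic follows; the function name and the default factor of 6 are illustrative assumptions, not something taken from the attached namelists.

```python
def max_time_step_seconds(dx_m: float, factor: float = 6.0) -> float:
    """Rule-of-thumb upper bound on time_step: factor * dx, with dx in km."""
    return factor * dx_m / 1000.0


dx_m = 200.0            # D04 grid interval from the thread
used_time_step = 2.0    # value reported to have run with 40 nodes

limit = max_time_step_seconds(dx_m)
print(f"dx = {dx_m:.0f} m -> suggested maximum time_step ~ {limit:.1f} s")
print(f"time_step = {used_time_step:.0f} s exceeds that limit: {used_time_step > limit}")
```

Since time_step in namelist.input is an integer number of seconds, a value such as 1.2 s would be expressed with the fractional settings time_step_fract_num and time_step_fract_den (for example 1 + 1/5).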
 

angelarc

New member
Hi,

Thanks for your reply.

We will keep running tests with different configurations to see if we can find one that yields consistent (and correct) results. If we manage to find something, we will add the information here.
 