WRF in High-Processor Parallel Execution on Supercomputers keeps running without any progress when increasing the number of processors

nicolas

Dear WRF users,

I am currently facing an issue related to the task distribution of the WRF model grid in a simulation based on WRF v4.3.3. I intend to run WRF at high resolution (3x3 and 1.5x1.5) over a larger domain, and I want to increase the number of processors to achieve faster simulation times. The supercomputer I am using is NPAD-UFRN (Brazil), which has 40 nodes with 60 tasks per node. To determine the maximum number of nodes and processors I can use, I ran the Python script (number_of_procs.py) recommended by kwerner.

However, I encountered a problem when attempting to increase the number of nodes and tasks beyond a certain limit. Currently, I am able to use only 3 nodes with a total of 169 tasks. Whenever I try to increase the number, such as using 4 nodes and 225 tasks, the WRF distribution appears to proceed normally, but the model eventually hangs or crashes without displaying any error message. I have exhaustively attempted various combinations, following the square rule to equally divide the domain.

As a newcomer to this field, I am uncertain whether this issue stems from a problem with the model itself or if it is a computational limitation. I have attached the WRF namelist.input and the rsl.out.0000 files for reference.

Thanks!

Nícolas
 

Attachments

  • namelist.input (6.3 KB)
  • rsl.out.0000 (3.4 KB)

The decomposition doesn't have to be perfectly square - it just shouldn't be drastically different. Can you try using all the processors on each node and see if that makes a difference?
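
To make the "not perfectly square" point concrete: 4 full nodes at 60 tasks each is 240 ranks, which factors as 15 x 16. Below is a rough sketch, assuming that when nproc_x and nproc_y are not set in the namelist the tasks are split into the factor pair closest to square (the x/y orientation here is also an assumption). The helper names and the 400 x 400 grid size are made up for illustration; substitute your own e_we and e_sn.

```python
import math


def closest_factor_pair(ntasks):
    """Return (nx, ny) with nx * ny == ntasks and nx <= ny, as close to square as possible."""
    nx = math.isqrt(ntasks)
    # Walk down from sqrt(ntasks) until we hit a divisor.
    while ntasks % nx != 0:
        nx -= 1
    return nx, ntasks // nx


def tile_report(ntasks, e_we, e_sn):
    """Approximate the smallest tile size (in grid points) for a given task count."""
    nx, ny = closest_factor_pair(ntasks)
    return {
        "nproc_x": nx,
        "nproc_y": ny,
        "min_tile_we": e_we // nx,  # fewest west-east points any tile gets
        "min_tile_sn": e_sn // ny,  # fewest south-north points any tile gets
    }


if __name__ == "__main__":
    E_WE, E_SN = 400, 400  # hypothetical domain size, not taken from the attached namelist
    for ntasks in (169, 180, 225, 240):
        print(ntasks, tile_report(ntasks, E_WE, E_SN))
```

If the smallest tile comes out very small for a given task count, that count is probably worth avoiding, although that alone may not explain a hang.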
 
Hello,

We are facing a similar issue as well. For some specific numbers of MPI ranks WRF works well, and for others it runs forever without any error or output.

@nicolas Were you able to resolve your issue?

Cheers,
Samir Shaikh
 
Hello,

I am also facing a similar issue. I am running at 1 km resolution over a mountainous region with steep topography gradients. The simulation completes with certain numbers of MPI processes, such as 640, 1200, or 1392, but hangs with others, such as 720, 768, 960, or 1440.

I assume this happens because a certain sub-domain, when split between two processes, makes the simulation unstable and causes the hang. If this could be true, how can I debug it, and how can I find out which process is working on which lat/lon?
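
As a starting point for the "which process owns which part of the grid" question, here is a rough sketch that maps an MPI rank to an approximate range of grid indices. It assumes ranks are laid out row by row (x varies fastest) and that grid points are split as evenly as possible; WRF's actual patch boundaries may differ by a point or two. The function names and the 480 x 600 grid are hypothetical. The resulting i/j ranges could then be looked up in the XLAT and XLONG fields of the wrfinput file to get lat/lon.

```python
def split_range(npoints, nparts):
    """Split 1..npoints into nparts contiguous (start, end) ranges, 1-based and inclusive."""
    base, extra = divmod(npoints, nparts)
    ranges = []
    start = 1
    for part in range(nparts):
        # The first `extra` parts each take one additional point.
        size = base + (1 if part < extra else 0)
        ranges.append((start, start + size - 1))
        start += size
    return ranges


def rank_to_tile(rank, nproc_x, nproc_y, n_we, n_sn):
    """Return (i_start, i_end, j_start, j_end) for a rank, assuming row-major rank order."""
    if not 0 <= rank < nproc_x * nproc_y:
        raise ValueError("rank is outside the nproc_x * nproc_y layout")
    ix = rank % nproc_x   # position in the west-east direction
    iy = rank // nproc_x  # position in the south-north direction
    i_start, i_end = split_range(n_we, nproc_x)[ix]
    j_start, j_end = split_range(n_sn, nproc_y)[iy]
    return i_start, i_end, j_start, j_end


if __name__ == "__main__":
    # Hypothetical example: 720 ranks as 24 x 30 over a 480 x 600 point grid.
    for rank in (0, 25, 719):
        print(rank, rank_to_tile(rank, 24, 30, 480, 600))
```

Comparing the tiles for a task count that works against one that hangs might show whether a particular patch boundary falls on the steep terrain.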

Thanks,
Sandeep Agrawal
 
"If this could be true, how can I debug it, and how can I find out which process is working on which lat/lon?"

I would recommend talking to a systems administrator at your institution to see what they suggest. The problem is likely related to your specific computing environment.
 