
wrf.exe not scaling beyond 8 MPI processes

vishnuas

Hello everyone,
I am running a WRF model with publicly available data from https://nomads.ncep.noaa.gov/pub/data/nccf/com/gfs/prod/gfs.20230313/00/atmos/. wrf.exe completed successfully in 34 hrs using 20 MPI processes. I tried to reduce the runtime by increasing the number of MPI processes, but the run did not get any faster. While debugging this, I found that wrf.exe stops scaling beyond 8 MPI processes: up to 8 processes the runtime scales linearly, but with 8 processes the runtime is 34 hrs and adding more processes does not reduce it. I could not find any system-related issues (insufficient memory, etc.). I am very new to WRF and am trying to understand whether any configuration parameters, such as those in namelist.input, could be causing this scalability issue. I would very much appreciate any help in solving this problem. The expected runtime was around 3 to 4 hours.
All the details are listed below, and I am also attaching the namelist.input and rsl.out.0000 files.

WRF v4.2.1 (configured with the dmpar option) and WPS v4.2
Input data: 48 files, each ~500 MB (gfs.t00z.pgrb2.0p25.f000 to gfs.t00z.pgrb2.0p25.f048), from the above-mentioned URL
Server: dual-socket server with 2 x Intel Xeon Gold 6230 CPUs (20 cores, 40 threads, 2.1 GHz) and 120 GB RAM (during the run, memory utilization is always below 40 GB)

Thank you
 

Attachments

  • namelist.input.txt
    4.2 KB · Views: 2
  • rsl.out.0000
    6 MB · Views: 6
Would you please confirm that runs of this case with 8, 20, and more processors all take the same CPU time?

Is the '34 hours' you mentioned wall-clock time or CPU time?
 
Hello, the 34 hours mentioned here is wall-clock time. Apologies for the ambiguity.

From 1 to 8 MPI processes, the wall-clock time decreased linearly. With 8 MPI processes, the wall-clock time was 34 hrs, and increasing the number of processes beyond 8 did not have much impact on it. With 40 MPI processes, the wall-clock time was ~35 hrs.

I am also attaching the output of the 'time' command for runs with 8 and 16 MPI processes. To collect these results without delay, I reduced run_hours in namelist.input to 6.
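To put numbers on the scaling described above, here is a minimal Python sketch that computes speedup and parallel efficiency relative to the 8-process run. The wall-clock figures are the ones reported in this thread; the function name is only illustrative.

```python
def parallel_efficiency(base_procs, base_hours, procs, hours):
    """Return (speedup, efficiency) of a run relative to a baseline run."""
    speedup = base_hours / hours      # how much faster than the baseline
    ideal = procs / base_procs        # speedup expected under perfect scaling
    return speedup, speedup / ideal

# Wall-clock times reported in this thread (hours), keyed by MPI process count.
runs = {8: 34.0, 20: 34.0, 40: 35.0}

for procs, hours in runs.items():
    speedup, eff = parallel_efficiency(8, runs[8], procs, hours)
    print(f"{procs:>3} procs: speedup x{speedup:.2f}, efficiency {eff:.0%}")
```

An efficiency that falls in proportion to the added processes, as it does here, means the extra processes contribute essentially nothing, which usually points at a bottleneck outside the compute kernels.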
 

Attachments

  • output_time_8mpi.JPG
    87.2 KB · Views: 14
  • output_time_16mpi.JPG
    61.1 KB · Views: 12
Would you please send me your rsl.out.0000 files for the 8-processor and 40-processor runs? Thanks.
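For anyone comparing rsl.out.0000 files from two runs, a rough Python sketch like the following can total the per-step timings for each domain. It assumes the log contains lines of the form "Timing for main: time <timestamp> on domain N: X elapsed seconds"; if your WRF build prints a different format, the regular expression needs adjusting. The function names are hypothetical.

```python
import re
from collections import defaultdict

# Assumed line shape (check against your own rsl.out.0000):
#   Timing for main: time <timestamp> on domain   1:    3.21000 elapsed seconds
TIMING_RE = re.compile(
    r"Timing for main: time (\S+) on domain\s+(\d+):\s+([\d.]+) elapsed seconds"
)

def summarize_timing(lines):
    """Return {domain: (step_count, total_seconds)} from rsl timing lines."""
    totals = defaultdict(lambda: [0, 0.0])
    for line in lines:
        match = TIMING_RE.search(line)
        if match:
            domain = int(match.group(2))
            totals[domain][0] += 1                      # one more time step
            totals[domain][1] += float(match.group(3))  # add its elapsed seconds
    return {d: (n, s) for d, (n, s) in totals.items()}

def summarize_file(path):
    """Summarize one rsl.out file; undecodable bytes are replaced, not fatal."""
    with open(path, errors="replace") as handle:
        return summarize_timing(handle)
```

Calling summarize_file on the 8-process and 40-process logs and dividing total seconds by step count gives an average cost per step for each domain, which shows whether the steps themselves got slower or the time went elsewhere (for example, I/O).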
 
Sure, please find attached the rsl.out.0000 file for the 40-processor run. I will share the 8-processor file here shortly. Thanks.

Edit: adding the rsl.out.0000 file for the 10-processor run. @Ming Chen, sorry for the delay; I didn't have this log and had to rerun the program. I am also attaching the output of the time command for this 10-process run, in case it helps.

One interesting observation: when I try the 8-processor run of the 48-hr simulation, it hangs after running for 30 hrs. It completes 42 hrs of the 48-hr simulation, with no errors in any of the logs but no further writes to the rsl files, wrfout*, etc. I cancelled the run after waiting another 10 hrs. I reran the same configuration, but every time it hangs at the same point. I am also attaching the rsl.out.0000 file for this failed run for reference.
 

Attachments

  • rsl.out.0000_40mpi.txt
    6 MB · Views: 3
  • output_timecommand_10mpi_48runhours.JPG
    67.2 KB · Views: 7
  • rsl.out.0000_8mpi_runhangs.txt
    5.2 MB · Views: 2
@Ming Chen

Could their issue be caused by using a non-square number of processors?

According to this FAQ, WRF likes square numbers (9, 16, 36, etc.).
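To see why square counts are attractive, here is a small Python sketch of how a process count splits into a 2D grid of patches. It assumes the decomposition picks the factor pair closest to a square, which is how WRF's default behaviour is usually described; which factor is assigned to x versus y is also an assumption here. The actual layout for a run is reported in rsl.out.0000 and can be set explicitly with nproc_x and nproc_y in namelist.input.

```python
import math

def closest_factor_pair(ntasks):
    """Return (nx, ny) with nx * ny == ntasks, nx <= ny, as square as possible."""
    for nx in range(math.isqrt(ntasks), 0, -1):
        if ntasks % nx == 0:
            return nx, ntasks // nx

def patch_size(e_we, e_sn, ntasks):
    """Return (nx, ny, approx_patch_we, approx_patch_sn) for one domain.

    Assumes nx splits the west-east direction and ny the south-north one.
    """
    nx, ny = closest_factor_pair(ntasks)
    return nx, ny, e_we // nx, e_sn // ny

# Domain sizes quoted in this thread, split over the process counts that were tried.
for ntasks in (8, 9, 16, 20, 40):
    print(ntasks, patch_size(150, 154, ntasks), patch_size(397, 592, ntasks))
```

Under these assumptions 8, 20, and 40 processes split as 2 x 4, 4 x 5, and 5 x 8, so none of the counts used in this thread produces a badly elongated grid.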
 
Thank you @Whatheway for the input. I am sharing the domains and the e_we and e_sn values used in this run. From my understanding of the above article, for this use case the maximum number of processors is 355 (e_we 397, e_sn 592) and the minimum is 2 (e_we 150, e_sn 154). Even though the server has 40 physical cores, I cannot get any performance improvement beyond 8 processes, and the minimum wall-clock time we could achieve was 32 hrs, which seems very high. Any guidance on how to approach this would help a lot. Thanks.
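As a cross-check on those bounds, here is a hedged Python sketch of the rule of thumb as it is commonly quoted from the WRF FAQ: the smallest domain sets the maximum, (e_we/25) * (e_sn/25), and the largest domain sets the minimum, (e_we/100) * (e_sn/100). The divisors and the pairing of domain to bound are assumptions to verify against the FAQ itself; the function name is hypothetical.

```python
def processor_bounds(smallest, largest, max_div=25, min_div=100):
    """Return (min_procs, max_procs) from (e_we, e_sn) of two domains.

    Assumed rule of thumb: the smallest domain limits how many processes
    are useful, the largest domain sets how many are needed.
    Results are truncated to whole processes.
    """
    max_procs = (smallest[0] / max_div) * (smallest[1] / max_div)
    min_procs = (largest[0] / min_div) * (largest[1] / min_div)
    return int(min_procs), int(max_procs)

# Domain sizes quoted in this thread.
low, high = processor_bounds(smallest=(150, 154), largest=(397, 592))
print(f"suggested range: {low} to {high} processes")
```

If that reading of the FAQ is right, the suggested range for these domains is roughly 23 to 36 processes rather than 2 to 355, so the numbers above are worth rechecking against the FAQ.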
 

Attachments

  • domains.JPG
    21.7 KB · Views: 4
I looked at your rsl files for the 8- and 40-processor runs. The model decomposition looks fine and reasonable. The only issue I can think of is that the run with 40 processors spends too much time on communication. However, this can hardly explain everything.
Would you please talk to your system administrator about this issue? Please keep me updated on any feedback. Thanks in advance.
 