wrf.exe not scaling beyond 8 MPI processes

vishnuas · Apr 5, 2023

Hello everyone,
I am running a WRF model with the publicly available data from https://nomads.ncep.noaa.gov/pub/data/nccf/com/gfs/prod/gfs.20230313/00/atmos/. wrf.exe completed successfully in 34 hrs using 20 MPI processes. I tried to reduce the runtime by increasing the number of MPI processes, but it did not run any faster. While debugging this, I found that wrf.exe stops scaling beyond 8 MPI processes: up to 8 processes the runtime drops linearly, but at 8 processes it is 34 hrs, and adding more processes does not reduce it further. I also could not find any system-related issues (insufficient memory, etc.). I am very new to WRF and am trying to understand whether any configuration parameters, such as those in namelist.input, could be causing this scalability issue. I would very much appreciate any help in solving this problem. The expected runtime was around 3 to 4 hours.
All the details are listed below, and I am also attaching the namelist.input and rsl.out.0000 files.

WRF v4.2.1 (configured with the dmpar option) and WPS v4.2
Input data: 48 files, each ~500 MB (gfs.t00z.pgrb2.0p25.f000 to gfs.t00z.pgrb2.0p25.f048), from the above-mentioned URL
Server: dual-socket server with 2 x Intel Xeon Gold 6230 CPUs (20 cores, 40 threads, 2.1 GHz each) and 120 GB RAM (during the run, memory utilization is always below 40 GB)

Thank you

Ming Chen · Apr 6, 2023

Would you please confirm that runs of this case with 8, 20, and more processors all take the same CPU time?

Is the '34 hours' you mentioned wall-clock time or CPU time?
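For readers unsure of the distinction being asked about, here is a minimal Python sketch. It is only an illustration, not part of the WRF workflow: the helper name `measure` is hypothetical, and `time.process_time` counts CPU time of the current process only. In the output of the shell `time` command, the `real` line is wall-clock time, while `user` and `sys` are CPU time.

```python
import time

def measure(fn):
    """Return (wall_seconds, cpu_seconds) spent while running fn()."""
    wall_start = time.perf_counter()   # wall-clock timer
    cpu_start = time.process_time()    # CPU time of this process only
    fn()
    wall = time.perf_counter() - wall_start
    cpu = time.process_time() - cpu_start
    return wall, cpu

# Sleeping consumes wall-clock time but almost no CPU time.
wall, cpu = measure(lambda: time.sleep(0.2))
print(f"wall-clock: {wall:.3f} s, CPU: {cpu:.3f} s")
```

A run that spends most of its time waiting (for example on communication or I/O) can show a large gap between the two numbers.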

vishnuas · Apr 10, 2023

Hello, the 34 hours mentioned here is wall-clock time. Apologies for the ambiguity.

From 1 MPI process to 8 MPI processes, the wall-clock time decreased linearly. With 8 MPI processes, the wall-clock time was 34 hrs, and increasing the number of processes beyond 8 had little effect on it. With 40 MPI processes, the wall-clock time was ~35 hrs.
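To put the numbers above in perspective, here is a small sketch that computes speedup and parallel efficiency relative to the 8-process run. The function name `relative_efficiency` is hypothetical; the inputs are the wall-clock times quoted in this post (34 hrs at 8 processes, roughly 35 hrs at 40).

```python
def relative_efficiency(base_procs, base_hours, procs, hours):
    """Speedup and efficiency of a run relative to a baseline run.

    speedup    = baseline wall-clock time / new wall-clock time
    efficiency = speedup / (procs / base_procs), where 1.0 is ideal scaling
    """
    speedup = base_hours / hours
    ideal_speedup = procs / base_procs
    return speedup, speedup / ideal_speedup

# Figures quoted in this thread: 8 processes -> 34 hrs, 40 processes -> ~35 hrs.
speedup, efficiency = relative_efficiency(8, 34.0, 40, 35.0)
print(f"speedup vs 8 procs: {speedup:.2f}, efficiency: {efficiency:.2f}")
```

An efficiency far below 1.0 means the extra 32 processes contribute almost nothing to the run.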

I am also attaching the output of the 'time' command for runs with 8 and 16 MPI processes. To collect the results without delay, I reduced run_hours in the namelist.input file to 6.

Ming Chen · Apr 11, 2023

Would you please send me your rsl.out.0000 files for the 8-processor and 40-processor runs? Thanks.

vishnuas · Apr 11, 2023

Ming Chen said:
Would you please send me your rsl.out.0000 files for the 8-processor and 40-processor runs? Thanks.

Sure, Please find attached the rsl.out.0000 file for the 40 processor run. I will share the 8-processor file here shortly. Thanks.

Editing the comment to include the rsl.out.0000 file for the 10-processor run. @Ming Chen Sorry for the delay; I didn't have this log and had to rerun the program. I am also attaching the output of the time command for this 10-process run, in case it helps.

One interesting observation: when I try to execute the 8-processor run for the 48-hr simulation, it hangs after running for 30 hrs. It completes 42 hrs of the 48-hr simulation and there are no errors in any of the logs, but there are no further writes to the rsl files, wrfout*, etc. I cancelled the run after waiting for another 10 hrs. I tried rerunning the same configuration, but every time it hangs at the same point. I am also attaching the rsl.out.0000 file for this failed run for reference.

Deleted member 3607 · Apr 12, 2023

vishnuas said:
Sure, Please find attached the rsl.out.0000 file for the 40 processor run. I will share the 8-processor file here shortly. Thanks.

Your file did not upload

vishnuas · Apr 12, 2023

Whatheway said:
Your file did not upload

Sorry, I have now updated the comment to attach the rsl.out.0000 file for the 40-process run.

Deleted member 3607 · Apr 12, 2023

Ming Chen said:
Would you please send me your rsl.out.0000 files for the 8-processor and 40-processor runs? Thanks.

@Ming Chen

Could their issue be caused by using a non-square number of processors?

According to this FAQ:

How many processors should I use to run WRF?

To choose an appropriate number of processors, you will need to consider the decomposition of the processes in relation to the size of the domains. For processing, your domain will be divided up into tiles, and the number of tiles depends on the total number of processors you use - 1 tile per...

forum.mmm.ucar.edu

It says that WRF likes square numbers (9, 16, 36, etc.).
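As a rough illustration of why square counts are favoured, the sketch below splits a process count into the two factors closest to each other. This is not WRF's actual decomposition routine; `near_square_factors` is a hypothetical helper that only shows how balanced or lopsided the resulting grid of tiles can be.

```python
import math

def near_square_factors(nprocs):
    """Return (a, b) with a * b == nprocs, a <= b, and a as close to b as possible."""
    for a in range(math.isqrt(nprocs), 0, -1):
        if nprocs % a == 0:
            return a, nprocs // a

# Process counts mentioned in this thread, plus two square counts for comparison.
for n in (8, 20, 40, 16, 36):
    a, b = near_square_factors(n)
    print(f"{n} processes -> {a} x {b}")
```

Square counts give equal factors, while counts such as 8 or 40 give a more elongated layout.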

vishnuas · Apr 18, 2023

Thank you @Whatheway for the input. I am sharing the domains and the e_we and e_sn values used in this run. From my understanding of the above article, for this use case the maximum number of processors to use is 355 (e_we 397, e_sn 592) and the minimum is 2 (e_we 150, e_sn 154). Even though the server in use has 40 physical cores, I am not able to get any performance improvement beyond 8 processes, and the minimum wall-clock time we could achieve was 32 hrs, which seems very high. Any guidance on how to approach this would really help. Thanks.
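For anyone wanting to redo this estimate, here is a sketch of one reading of the FAQ's rule of thumb. The constants are assumptions that are not stated in this thread: tiles of roughly 25 grid points per side give the upper bound, taken on the smallest domain, and tiles of roughly 100 per side give the lower bound, taken on the largest domain. The function name `processor_bounds` is hypothetical, and only the two domain sizes quoted in the post are used. Under this reading the result need not match the figures quoted above, so please check the constants and the pairing against the FAQ itself.

```python
def processor_bounds(domains, small_tile=25, large_tile=100):
    """Estimate (least, most) processor counts from a list of (e_we, e_sn) pairs.

    Assumption: 'most' uses the smallest domain with small_tile points per side,
    'least' uses the largest domain with large_tile points per side.
    """
    smallest = min(domains, key=lambda d: d[0] * d[1])
    largest = max(domains, key=lambda d: d[0] * d[1])
    most = (smallest[0] / small_tile) * (smallest[1] / small_tile)
    least = (largest[0] / large_tile) * (largest[1] / large_tile)
    return least, most

# The two (e_we, e_sn) pairs quoted in the post above.
least, most = processor_bounds([(397, 592), (150, 154)])
print(f"least: about {least:.1f}, most: about {most:.1f}")
```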

Ming Chen · Apr 18, 2023

I looked at your rsl files for the 8- and 40-processor runs. The model decomposition looks fine and reasonable. The only issue I can think of is that the run with 40 processors spends too much time on communication. However, this can hardly explain everything.
Would you please talk to your computer manager about this issue? Please keep me updated about any feedback. Thanks in advance.
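When comparing rsl.out.0000 files from two runs, it can help to total the per-step timings. The sketch below assumes the log contains lines of the form "Timing for main ... on domain N: X elapsed seconds". That pattern is an assumption about the log format, not something quoted in this thread, and `sum_main_timings` is a hypothetical helper.

```python
import re
from collections import defaultdict

# Assumed line shape: "Timing for main ... on domain   1:    2.50000 elapsed seconds"
TIMING_RE = re.compile(
    r"Timing for main.*?on domain\s+(\d+):\s+([0-9.]+)\s+elapsed seconds"
)

def sum_main_timings(lines):
    """Return {domain: (step_count, total_elapsed_seconds)} for matching lines."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for line in lines:
        match = TIMING_RE.search(line)
        if match:
            domain = int(match.group(1))
            totals[domain] += float(match.group(2))
            counts[domain] += 1
    return {domain: (counts[domain], totals[domain]) for domain in totals}
```

Passing an open file object for each run's rsl.out.0000 to `sum_main_timings` and comparing the totals per domain shows whether the time per model step actually changes with the process count.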

vishnuas · Apr 20, 2023

Thanks @Ming Chen for taking the time to look at the logs and model decomposition. I will keep you updated on any progress.
