How to improve the WRF running speed

This post was from a previous version of the WRF&MPAS-A Support Forum. New replies have been disabled and if you have follow up questions related to this post, then please start a new thread from the forum home page.

Hachi-ait

Dear WRF users,

I have installed WRF v4.2 with dmpar on a cluster that has three nodes with 8 CPUs each.
I test-ran this cluster with the CAMx model (air quality modeling), and the speed was 3 times higher than when using only 1 processor (i.e. without mpiexec).
However, when I test with WRF (with -np 9), the speed is almost the same as with 1 processor.
Please advise me how I can speed up the WRF simulation.
Let me explain my case:
When I set the run command as:
Code:
mpiexec --machinefile machinefile.txt -np 9 ./wrf.exe
The run prints:
Code:
starting wrf task       0   of 9 
starting wrf task       1   of 9 
starting wrf task       2  of 9 
starting wrf task       3   of 9 
starting wrf task       4   of 9 
starting wrf task       5   of 9 
starting wrf task       6   of 9 
starting wrf task       7   of 9
starting wrf task       8   of 9
Then 9 rsl* files were generated (rsl.out.0000, rsl.out.0001, ..., rsl.out.0008).
Reading these 9 files, I found that only the time for reading inputs is divided among the 9 cores, while "Timing for main" appears on only 1 processor. So I guess this is the reason why the simulation speed did not increase. How can we divide the "Timing for main" work among the cores?
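One quick check that does not involve WRF at all is whether the 9 ranks really land on different nodes. The sketch below is only an assumption about how one might check this, not something taken from the original post: it launches `hostname` instead of `wrf.exe` and counts how many ranks each node received. The helper name `count_ranks_per_host` is hypothetical.

```shell
# Count how many MPI ranks ran on each host.
# Reads one hostname per line on stdin and prints "<host> <count>", sorted by host.
count_ranks_per_host() {
    sort | uniq -c | awk '{ print $2, $1 }'
}

# Usage on the cluster (same launcher options as the WRF run above):
#   mpiexec --machinefile machinefile.txt -np 9 hostname | count_ranks_per_host
```

If all ranks report the same host, the machinefile is probably not being used the way it was intended.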

These are the last lines of rsl.out.0000, showing "Timing for main" running on this core:
Code:
Timing for main: time 2017-12-25_05:59:00 on domain   2:    9.60069 elapsed seconds
Timing for main: time 2017-12-25_05:59:10 on domain   3:    1.89823 elapsed seconds
Timing for main: time 2017-12-25_05:59:20 on domain   3:    1.86424 elapsed seconds
Timing for main: time 2017-12-25_05:59:30 on domain   3:    1.76740 elapsed seconds
Timing for main: time 2017-12-25_05:59:30 on domain   2:    9.53581 elapsed seconds
Timing for main: time 2017-12-25_05:59:40 on domain   3:    1.91286 elapsed seconds
Timing for main: time 2017-12-25_05:59:50 on domain   3:    1.82460 elapsed seconds
Timing for main: time 2017-12-25_06:00:00 on domain   3:    1.78574 elapsed seconds
Timing for Writing wrfout_d03_2017-12-25_06:00:00 for domain        3:    2.84209 elapsed seconds
Timing for main: time 2017-12-25_06:00:00 on domain   2:   12.28063 elapsed seconds
Timing for Writing wrfout_d02_2017-12-25_06:00:00 for domain        2:    1.90340 elapsed seconds
Timing for main: time 2017-12-25_06:00:00 on domain   1:   35.79001 elapsed seconds
Timing for Writing wrfout_d01_2017-12-25_06:00:00 for domain        1:    0.96697 elapsed seconds
d01 2017-12-25_06:00:00 wrf: SUCCESS COMPLETE WRF

And this is rsl.out.0001. All the other rsl* files look the same: they only handle the inputs and then wait for completion. "Timing for main" did not run on these cores.
Code:
INPUT LandUse = "MODIFIED_IGBP_MODIS_NOAH"
 LANDUSE TYPE = "MODIFIED_IGBP_MODIS_NOAH" FOUND          33  CATEGORIES           2  SEASONS WATER CATEGORY =           17  SNOW CATEGORY =           15
INITIALIZE THREE Noah LSM RELATED TABLES
INPUT LandUse = "MODIFIED_IGBP_MODIS_NOAH"
 LANDUSE TYPE = "MODIFIED_IGBP_MODIS_NOAH" FOUND          33  CATEGORIES           2  SEASONS WATER CATEGORY =           17  SNOW CATEGORY =           15
INITIALIZE THREE Noah LSM RELATED TABLES
INPUT LandUse = "MODIFIED_IGBP_MODIS_NOAH"
 LANDUSE TYPE = "MODIFIED_IGBP_MODIS_NOAH" FOUND          33  CATEGORIES           2  SEASONS WATER CATEGORY =           17  SNOW CATEGORY =           15
INITIALIZE THREE Noah LSM RELATED TABLES
 Tile Strategy is not specified. Assuming 1D-Y
WRF TILE   1 IS     28 IE     54 JS      1 JE     21
WRF NUMBER OF TILES =   1
 Tile Strategy is not specified. Assuming 1D-Y
WRF TILE   1 IS     34 IE     66 JS      1 JE     24
WRF NUMBER OF TILES =   1
d01 2017-12-25_06:00:00 wrf: SUCCESS COMPLETE WRF
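To compare a 1-process run and a 9-process run on equal terms, the "Timing for main" lines can be totalled per domain. This is a minimal sketch, assuming the line layout shown in the rsl.out.0000 excerpt above; `summarize_main_timing` is a hypothetical helper name.

```shell
# Summarize the "Timing for main" lines of one rsl file: per domain, the
# number of steps and the total elapsed seconds.
# Prints "d<domain> steps=<n> total=<seconds>", one line per domain.
summarize_main_timing() {
    awk '/Timing for main:/ {
             dom = $8; sub(/:$/, "", dom)    # field 8 is the domain id, e.g. "3:"
             steps[dom]++; total[dom] += $9  # field 9 is the elapsed seconds
         }
         END {
             for (d in total)
                 printf "d%02d steps=%d total=%.5f\n", d, steps[d], total[d]
         }' "$1" | sort
}

# Usage:
#   summarize_main_timing rsl.out.0000
# Count the timing lines in every rank's file:
#   grep -c "Timing for main" rsl.out.*
```

Running the helper on rsl.out.0000 from each run gives totals that can be compared directly, instead of a rough overall speed estimate.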

I have attached the namelist.input, rsl.out.0000 and rsl.out.0001 files.
Please help!
Thanks
Ha Chi
 

Attachments

  • namelist.input
  • rsl.out.0000.txt
  • rsl.out.0001.txt

Using more processors does not guarantee a faster run, because communication between processors takes time. We usually recommend that each processor cover around 20 grid points. For example, if you have 100 grid points, then 5-6 processors will give you a relatively fast run.
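The arithmetic behind that rule of thumb can be sketched as below. The reply does not say whether the 20 grid points are meant per direction or in total, so this sketch just reproduces the arithmetic of the example; the function name `suggest_procs` and its default of 20 are assumptions for illustration.

```shell
# Rule-of-thumb process count: grid points divided by points per process
# (20 by default, as suggested in the reply), never less than 1.
suggest_procs() {
    points=$1
    per_proc=${2:-20}
    n=$(( points / per_proc ))
    if [ "$n" -lt 1 ]; then
        n=1
    fi
    echo "$n"
}

suggest_procs 100    # prints 5, the lower end of the "5-6" in the reply
```

Treat the result as a starting point for experiments rather than a fixed answer.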
 
My setup has 3 nested domains with 50x50, 81x81 and 96x99 grid points, i.e. about 18,000 grid points in total.
With 9 processors, the speed is about 3 (i.e. 1 hour of wall-clock time simulates 3 hours of data). With only 1 processor, the speed is also about 3.

In this case, is it because my network connection is not good?
However, when I tested CAMx on the same domain, the run was much faster with 9 processors, nearly three times faster.
That is why I think the problem could be due to the way I set up WRF or the domains. Could you please advise whether the following WRF setup has problems?

- GNU compiler
- mpich3 v3.2.2
- netcdf v4.4.1.1
- configure option: dmpar
- nesting option: basic (1)

Thanks
 
Your compiling options look fine. For your grid settings, 9 processors definitely should run faster than 1 processor.

I am not sure yet what could be the possible reason for your case. I will talk to our software engineer and keep you updated if I get any feedback.
 
Let's do a simple matrix multiply to see if any MPI-parallel program is giving you speed-up. It is much easier to test MPI timing problems on a small program than with the huge WRF model.

Take a look at this file
https://gist.github.com/kmkurn/39ca673bb37946055b38

I built the code with:
Code:
mpicc mat_mul_mpi.c

On my 4-core laptop, I get reasonable parallel speed-up:
Code:
> time mpirun -np 2 a.out
mpi_mm has started with 2 tasks.
Done in 0.401521 seconds.
0.809u 0.063s 0:00.48 179.1%	0+0k 8336+0io 36pf+0w

> time mpirun -np 4 a.out
mpi_mm has started with 4 tasks.
Done in 0.145083 seconds.
0.612u 0.075s 0:00.19 357.8%	0+0k 88+0io 1pf+0w

> time mpirun -np 3 a.out
mpi_mm has started with 3 tasks.
Done in 0.206127 seconds.
0.634u 0.062s 0:00.25 276.0%	0+0k 64+0io 1pf+0w
 
Hello,

I got the code from your link and ran
Code:
mpicc mpi_mm.c
It generated an a.out file. Then I ran time mpirun, but got the result below. The a.out file is actually there, so why does it say "No such file or directory"?
Does it mean my MPI installation has a problem?
Code:
[root@hpc mpi-test]# time mpirun --machinefile machinefile.txt -np 6 a.out
[proxy:0:0@hpc.org] HYDU_create_process (utils/launch/launch.c:75): execvp error on file a.out (No such file or directory)
[proxy:0:0@hpc.org] HYDU_create_process (utils/launch/launch.c:75): execvp error on file a.out (No such file or directory)
[proxy:0:0@hpc.org] HYDU_create_process (utils/launch/launch.c:75): execvp error on file a.out (No such file or directory)
[proxy:0:0@hpc.org] HYDU_create_process (utils/launch/launch.c:75): execvp error on file a.out (No such file or directory)
[proxy:0:1@compute-0-0.local] HYDU_create_process (utils/launch/launch.c:75): execvp error on file a.out (No such file or directory)
[proxy:0:1@compute-0-0.local] HYDU_create_process (utils/launch/launch.c:75): execvp error on file a.out (No such file or directory)

real	0m0.649s
user	0m0.029s
sys	0m0.010s

I tested with no machinefile and with other values of -np; all gave the same error message.
Here is some information about my MPI build:
Code:
[root@hpc mpi-test]# mpirun --version
HYDRA build details:
    Version:                                 3.2
    Release Date:                            Wed Nov 11 22:06:48 CST 2015
    CC:                              gcc    
    CXX:                             g++    
    F77:                             gfortran   
    F90:                             gfortran   
    Configure options:                       '--disable-option-checking' '--prefix=/opt/mpich3/gnu' '--with-device=ch3:nemesis' '--enable-fast' '--enable-fortran' '--enable-shared' '--enable-sharedlibs=gcc' '--enable-threads=runtime' '--enable-romio' '--enable-smpcoll' 'FC=gfortran' '--cache-file=/dev/null' '--srcdir=.' 'CC=gcc' 'CFLAGS= -DNDEBUG -DNVALGRIND -O2' 'LDFLAGS=' 'LIBS=-lpthread ' 'CPPFLAGS= -I/export/home/repositories/rocks/src/roll/hpc/BUILD/mpich3-ethernet-gnu-3.2/mpich-3.2/src/mpl/include -I/export/home/repositories/rocks/src/roll/hpc/BUILD/mpich3-ethernet-gnu-3.2/mpich-3.2/src/mpl/include -I/export/home/repositories/rocks/src/roll/hpc/BUILD/mpich3-ethernet-gnu-3.2/mpich-3.2/src/openpa/src -I/export/home/repositories/rocks/src/roll/hpc/BUILD/mpich3-ethernet-gnu-3.2/mpich-3.2/src/openpa/src -D_REENTRANT -I/export/home/repositories/rocks/src/roll/hpc/BUILD/mpich3-ethernet-gnu-3.2/mpich-3.2/src/mpi/romio/include'
    Process Manager:                         pmi
    Launchers available:                     ssh rsh fork slurm ll lsf sge manual persist
    Topology libraries available:            hwloc
    Resource management kernels available:   user slurm ll lsf sge pbs cobalt
    Checkpointing libraries available:       
    Demux engines available:                 poll select
Please give me some comments or advice. Thanks!
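A likely cause (an assumption, since no reply in the thread confirms it) is that `a.out` was given without a path. The launcher's `execvp` call searches the directories in `PATH` for a name that contains no slash, and the current directory is usually not in `PATH`. The sketch below reproduces that lookup behaviour without MPI, using a throwaway script named `a.out` in a temporary directory created only for the demonstration.

```shell
# Demonstrate PATH lookup: a bare name is searched in PATH, "./name" is not.
demo_dir=$(mktemp -d)
cd "$demo_dir" || exit 1
printf '#!/bin/sh\necho "a.out ran"\n' > a.out
chmod +x a.out

# Bare name: fails when the current directory is not in PATH.
bare_result=$( (PATH=/usr/bin:/bin; a.out) 2>/dev/null || echo "not found" )

# Explicit relative path: no PATH search, the file in this directory runs.
path_result=$(./a.out)

echo "bare name:     $bare_result"
echo "explicit path: $path_result"

# The equivalent MPI launch would be:
#   mpirun --machinefile machinefile.txt -np 6 ./a.out
# With several nodes, the executable must also exist at that path on each node.
```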
 
Hello,
thank you for your advice.
I tested with mpi_mm.c and found that the optimal number of processors to use is 4.
Code:
> time mpirun --machinefile machinefile.txt -np 2 a.out
mpi_mm has started with 2 tasks.
Done in 0.834314 seconds.

3 tasks: 0.419650 seconds
4 tasks: 0.286857 seconds
5 tasks: 0.464718 seconds
6 tasks: 0.614124 seconds

Beyond that, the more processes I set with -np, the longer the run took. The optimal number is 4.
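The trend is easier to read as speed-up relative to the 2-task run, since no 1-task timing was reported. This is a minimal sketch using only the timings listed above; the helper name `speedup` is hypothetical.

```shell
# Speed-up of a run relative to a baseline: baseline_seconds / run_seconds.
speedup() {
    awk -v base="$1" -v t="$2" 'BEGIN { printf "%.2f\n", base / t }'
}

# Timings reported above, relative to the 2-task run (0.834314 s).
for entry in 3:0.419650 4:0.286857 5:0.464718 6:0.614124; do
    np=${entry%%:*}      # part before the colon: task count
    secs=${entry#*:}     # part after the colon: elapsed seconds
    echo "np=$np speedup=$(speedup 0.834314 "$secs")"
done
```

A speed-up that peaks and then falls as tasks are added is what one would expect if communication cost starts to outweigh the extra compute, which fits the suspicion about the network.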
I think this problem could be caused by our LAN and switch, so I will upgrade them and hope the WRF simulation speed will improve.
Please advise me if anything else could cause the speed to drop as -np increases.
Thanks
 
The mpirun command has additional options for binding processes to cores, for mapping specific processes to cores, and for selecting which cores to use. All of these options allow you to fine-tune parallel performance.

With your small test matrix multiply program you can tune your mpirun command fairly quickly.
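A tuning loop over task counts and binding choices might look like the sketch below. It assumes the MPICH Hydra launcher shown earlier in the thread, where the binding option is spelled `-bind-to`; other MPI launchers spell it differently. The function name `run_case` and the `DRY_RUN` switch are hypothetical, and by default the script only prints the commands it would run.

```shell
# Build one mpirun command line for a given task count and binding choice.
# DRY_RUN=1 (default) only prints the command; DRY_RUN=0 executes it.
run_case() {
    np=$1
    bind=$2
    cmd="mpirun --machinefile machinefile.txt -bind-to $bind -np $np ./a.out"
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$cmd"
    else
        echo "== $cmd"
        $cmd    # the test program prints its own "Done in ... seconds." line
    fi
}

# Try each task count with and without binding ranks to cores.
for np in 2 3 4 5 6; do
    for bind in none core; do
        run_case "$np" "$bind"
    done
done
```

Run it with `DRY_RUN=0` on the cluster to collect the timings, then keep the combination that gives the shortest time.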

We are not in a position to recommend specific settings for your system.
 