mpirun -n 32 ./wrf.exe runs very slow on a 16 VCPUs machine

mcanonic
Posts: 11
Joined: Tue May 26, 2020 8:21 am

mpirun -n 32 ./wrf.exe runs very slow on a 16 VCPUs machine

Post by mcanonic » Wed Jul 29, 2020 2:29 pm

Hi,
I followed the guide here without any problems:
https://www2.mmm.ucar.edu/wrf/OnLineTut ... torial.php

I ran WRF with MPI using the command mpirun -n 32 ./wrf.exe. It started correctly, but after a while it seemed that only one process was still working: the output and error files are no longer being updated, except for rsl.out.0000. The run continues very slowly:


-rw-rw-r-- 1 cc cc       9226 Jul 16 13:18 rsl.out.0029
-rw-rw-r-- 1 cc cc       9226 Jul 16 13:18 rsl.out.0028
-rw-rw-r-- 1 cc cc       9226 Jul 16 13:18 rsl.out.0023
-rw-rw-r-- 1 cc cc       9226 Jul 16 13:18 rsl.out.0021
-rw-rw-r-- 1 cc cc       9226 Jul 16 13:18 rsl.out.0016
-rw-rw-r-- 1 cc cc       9225 Jul 16 13:18 rsl.out.0003
-rw-rw-r-- 1 cc cc       9225 Jul 16 13:18 rsl.out.0002
-rw-rw-r-- 1 cc cc       9263 Jul 16 13:18 rsl.error.0029
-rw-rw-r-- 1 cc cc       9263 Jul 16 13:18 rsl.error.0028
-rw-rw-r-- 1 cc cc       9263 Jul 16 13:18 rsl.error.0023
-rw-rw-r-- 1 cc cc       9263 Jul 16 13:18 rsl.error.0021
-rw-rw-r-- 1 cc cc       9263 Jul 16 13:18 rsl.error.0016
-rw-rw-r-- 1 cc cc       9262 Jul 16 13:18 rsl.error.0003
-rw-rw-r-- 1 cc cc       9262 Jul 16 13:18 rsl.error.0002
-rw-rw-r-- 1 cc cc       9226 Jul 16 13:18 rsl.out.0030
-rw-rw-r-- 1 cc cc       9226 Jul 16 13:18 rsl.out.0027
-rw-rw-r-- 1 cc cc       9225 Jul 16 13:18 rsl.out.0001
-rw-rw-r-- 1 cc cc       9263 Jul 16 13:18 rsl.error.0030
-rw-rw-r-- 1 cc cc       9263 Jul 16 13:18 rsl.error.0027
-rw-rw-r-- 1 cc cc       9262 Jul 16 13:18 rsl.error.0001
-rw-rw-r-- 1 cc cc       9226 Jul 16 13:18 rsl.out.0031
-rw-rw-r-- 1 cc cc       9226 Jul 16 13:18 rsl.out.0018
-rw-rw-r-- 1 cc cc       9226 Jul 16 13:18 rsl.out.0017
-rw-rw-r-- 1 cc cc       9226 Jul 16 13:18 rsl.out.0013
-rw-rw-r-- 1 cc cc       9226 Jul 16 13:18 rsl.out.0012
-rw-rw-r-- 1 cc cc       9225 Jul 16 13:18 rsl.out.0009
-rw-rw-r-- 1 cc cc       9225 Jul 16 13:18 rsl.out.0008
-rw-rw-r-- 1 cc cc       9225 Jul 16 13:18 rsl.out.0004
-rw-rw-r-- 1 cc cc       9263 Jul 16 13:18 rsl.error.0031
-rw-rw-r-- 1 cc cc       9263 Jul 16 13:18 rsl.error.0018
-rw-rw-r-- 1 cc cc       9263 Jul 16 13:18 rsl.error.0017
-rw-rw-r-- 1 cc cc       9263 Jul 16 13:18 rsl.error.0013
-rw-rw-r-- 1 cc cc       9263 Jul 16 13:18 rsl.error.0012
-rw-rw-r-- 1 cc cc       9262 Jul 16 13:18 rsl.error.0009
-rw-rw-r-- 1 cc cc       9262 Jul 16 13:18 rsl.error.0008
-rw-rw-r-- 1 cc cc       9262 Jul 16 13:18 rsl.error.0004
-rw-rw-r-- 1 cc cc       9226 Jul 16 13:18 rsl.out.0026
-rw-rw-r-- 1 cc cc       9226 Jul 16 13:18 rsl.out.0025
-rw-rw-r-- 1 cc cc       9226 Jul 16 13:18 rsl.out.0024
-rw-rw-r-- 1 cc cc       9226 Jul 16 13:18 rsl.out.0022
-rw-rw-r-- 1 cc cc       9226 Jul 16 13:18 rsl.out.0020
-rw-rw-r-- 1 cc cc       9226 Jul 16 13:18 rsl.out.0011
-rw-rw-r-- 1 cc cc       9226 Jul 16 13:18 rsl.out.0010
-rw-rw-r-- 1 cc cc       9225 Jul 16 13:18 rsl.out.0006
-rw-rw-r-- 1 cc cc       9263 Jul 16 13:18 rsl.error.0026
-rw-rw-r-- 1 cc cc       9263 Jul 16 13:18 rsl.error.0025
-rw-rw-r-- 1 cc cc       9263 Jul 16 13:18 rsl.error.0024
-rw-rw-r-- 1 cc cc       9263 Jul 16 13:18 rsl.error.0022
-rw-rw-r-- 1 cc cc       9263 Jul 16 13:18 rsl.error.0020
-rw-rw-r-- 1 cc cc       9263 Jul 16 13:18 rsl.error.0011
-rw-rw-r-- 1 cc cc       9263 Jul 16 13:18 rsl.error.0010
-rw-rw-r-- 1 cc cc       9262 Jul 16 13:18 rsl.error.0006
-rw-rw-r-- 1 cc cc       9226 Jul 16 13:18 rsl.out.0019
-rw-rw-r-- 1 cc cc       9226 Jul 16 13:18 rsl.out.0015
-rw-rw-r-- 1 cc cc       9226 Jul 16 13:18 rsl.out.0014
-rw-rw-r-- 1 cc cc       9225 Jul 16 13:18 rsl.out.0007
-rw-rw-r-- 1 cc cc       9263 Jul 16 13:18 rsl.error.0019
-rw-rw-r-- 1 cc cc       9263 Jul 16 13:18 rsl.error.0015
-rw-rw-r-- 1 cc cc       9263 Jul 16 13:18 rsl.error.0014
-rw-rw-r-- 1 cc cc       9262 Jul 16 13:18 rsl.error.0007
-rw-rw-r-- 1 cc cc       9225 Jul 16 13:18 rsl.out.0005
-rw-rw-r-- 1 cc cc       9262 Jul 16 13:18 rsl.error.0005
-rw-rw-r-- 1 cc cc 2600727396 Jul 22 13:32 wrfout_d01_2008-01-26_06:00:00
-rw-rw-r-- 1 cc cc 2600727396 Jul 22 13:33 wrfout_d02_2008-01-26_06:00:00
-rw-rw-r-- 1 cc cc 2600727396 Jul 22 13:33 wrfout_d03_2008-01-26_06:00:00
-rw-rw-r-- 1 cc cc 2600727396 Jul 22 13:33 wrfout_d04_2008-01-26_06:00:00
-rw-rw-r-- 1 cc cc 2600727396 Jul 22 13:33 wrfout_d05_2008-01-26_06:00:00
-rw-rw-r-- 1 cc cc   22914356 Jul 22 14:14 rsl.out.0000
-rw-rw-r-- 1 cc cc   22914393 Jul 22 14:14 rsl.error.0000
Looking at the guide, it seems that no step addresses the number of VCPUs available. Is there something I can do to fully exploit the hardware in my machine?

Thanks in advance,
M
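
One way to confirm which rsl files have stopped being updated, instead of reading the listing by eye, is to compare modification times. The following is a minimal sketch, not part of WRF; the function name stale_rsl_files and the one-hour threshold are assumptions chosen for illustration.

```python
import time
from pathlib import Path


def stale_rsl_files(run_dir=".", max_age_seconds=3600, now=None):
    """Return the names of rsl.* files not modified within max_age_seconds.

    `now` defaults to the current time; it is a parameter only so the
    function can be checked against fixed timestamps.
    """
    now = time.time() if now is None else now
    stale = []
    for path in sorted(Path(run_dir).glob("rsl.*")):
        age = now - path.stat().st_mtime
        if age > max_age_seconds:
            stale.append(path.name)
    return stale


if __name__ == "__main__":
    # Run from the WRF run directory while wrf.exe is still active.
    for name in stale_rsl_files():
        print(name)
```

If every file except rsl.out.0000 and rsl.error.0000 is reported, that matches the listing above. It does not by itself show that the other ranks are idle, only that they have stopped writing to their log files.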

Ming Chen
Posts: 963
Joined: Mon Apr 23, 2018 9:42 pm

Re: mpirun -n 32 ./wrf.exe runs very slow on a 16 VCPUs machine

Post by Ming Chen » Thu Jul 30, 2020 5:40 pm

I assume you compiled WRF in dmpar mode; please let me know if I am wrong.

I am not sure how many grid points you have in this case. We usually recommend that each processor handle at least around 20 grid points. If the grid is small but you use many processors, communication between processors will take a lot of time and slow down the run.
WRF Help Desk
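
To make the point about too many processors concrete, here is a rough sketch of how large each processor's share of a domain becomes for a given rank count. It assumes the ranks are arranged in a near-square nx by ny layout, which is an assumption of this sketch and may not match what a particular WRF build actually chooses. The function names and the domain size of 100 x 100 are hypothetical.

```python
import math


def near_square_factors(n):
    """Return (a, b) with a * b == n, a <= b, and a as close to sqrt(n) as possible."""
    a = math.isqrt(n)
    while n % a != 0:
        a -= 1
    return a, n // a


def patch_size(e_we, e_sn, ranks):
    """Approximate grid points per rank in each direction (assumed layout)."""
    nx, ny = near_square_factors(ranks)
    return e_we // nx, e_sn // ny


if __name__ == "__main__":
    # Hypothetical 100 x 100 domain split across 32 ranks.
    print(near_square_factors(32))
    print(patch_size(100, 100, 32))
```

The smaller the per-rank patch, the larger the share of time spent exchanging boundary data relative to computing.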

mcanonic
Posts: 11
Joined: Tue May 26, 2020 8:21 am

Re: mpirun -n 32 ./wrf.exe runs very slow on a 16 VCPUs machine

Post by mcanonic » Fri Jul 31, 2020 8:19 am

Hi,
thanks for your answer.
In the compile phase for WRF, we selected option 34:
32. (serial) 33. (smpar) 34. (dmpar) 35. (dm+sm) GNU (gfortran/gcc)
and then option 1.

Concerning the grid, we have a big number: the number of points is around 3 * 10^6.

Is there any other setting that I have to check or is there a "hello world" example to run in order to figure out if I'm fully exploiting my computational power?

Thanks,
Massimo
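
As a first sanity check before any WRF-specific tuning, it may help to compare the number of MPI ranks requested with the number of CPUs the operating system reports. With more ranks than CPUs, several ranks have to share a core. This is a minimal sketch; the helper name oversubscription_ratio is hypothetical, and os.cpu_count() reports logical CPUs, which may differ from physical cores.

```python
import os


def oversubscription_ratio(requested_ranks, available_cpus=None):
    """Return requested MPI ranks divided by available logical CPUs."""
    if available_cpus is None:
        # os.cpu_count() can return None, so fall back to 1.
        available_cpus = os.cpu_count() or 1
    return requested_ranks / available_cpus


if __name__ == "__main__":
    # The case from the thread title: 32 ranks on a 16-VCPU machine.
    ratio = oversubscription_ratio(32, available_cpus=16)
    print(f"ranks per CPU: {ratio}")
```

A ratio above 1 means the run is oversubscribed, so trying mpirun with a rank count at or below the CPU count is a reasonable experiment.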

Ming Chen
Posts: 963
Joined: Mon Apr 23, 2018 9:42 pm

Re: mpirun -n 32 ./wrf.exe runs very slow on a 16 VCPUs machine

Post by Ming Chen » Fri Jul 31, 2020 5:15 pm

Massimo,
You can estimate the largest number of processors to use from the grid dimensions of your smallest domain. The formula is:
(e_we/25) * (e_sn/25)

This is not a strict rule; it is just based on our experience running WRF.
WRF Help Desk
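
The rule of thumb above can be worked through as a small calculation. In this sketch, e_we and e_sn are the namelist dimensions of the smallest domain. The function name max_processors, the choice to round each factor down, and the example dimensions 150 x 100 are assumptions for illustration, not values from this thread.

```python
def max_processors(e_we, e_sn, points_per_side=25):
    """Estimate the largest useful processor count: (e_we/25) * (e_sn/25).

    Each factor is rounded down (an assumption of this sketch), and the
    result is never less than 1.
    """
    return max(1, (e_we // points_per_side) * (e_sn // points_per_side))


if __name__ == "__main__":
    # Hypothetical smallest domain of 150 x 100 grid points.
    print(max_processors(150, 100))
```

For a nested run, apply the calculation to the smallest domain, since that domain limits how finely the work can be divided.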
