
mpirun -n 32 ./wrf.exe runs very slowly on a 16-vCPU machine

This post is from a previous version of the WRF & MPAS-A Support Forum. New replies have been disabled; if you have follow-up questions related to this post, please start a new thread from the forum home page.

mcanonic

New member
Hi,
I've followed the guide from here, without any problem:
https://www2.mmm.ucar.edu/wrf/OnLineTutorial/compilation_tutorial.php

I run WRF with MPI using the following command: mpirun -n 32 ./wrf.exe . It started correctly, but after a while it seemed that only one process was still working: the output and error files are no longer updated, except for rsl.out.0000, and the run continues very slowly:
Code:
-rw-rw-r-- 1 cc cc       9226 Jul 16 13:18 rsl.out.0029
-rw-rw-r-- 1 cc cc       9226 Jul 16 13:18 rsl.out.0028
-rw-rw-r-- 1 cc cc       9226 Jul 16 13:18 rsl.out.0023
-rw-rw-r-- 1 cc cc       9226 Jul 16 13:18 rsl.out.0021
-rw-rw-r-- 1 cc cc       9226 Jul 16 13:18 rsl.out.0016
-rw-rw-r-- 1 cc cc       9225 Jul 16 13:18 rsl.out.0003
-rw-rw-r-- 1 cc cc       9225 Jul 16 13:18 rsl.out.0002
-rw-rw-r-- 1 cc cc       9263 Jul 16 13:18 rsl.error.0029
-rw-rw-r-- 1 cc cc       9263 Jul 16 13:18 rsl.error.0028
-rw-rw-r-- 1 cc cc       9263 Jul 16 13:18 rsl.error.0023
-rw-rw-r-- 1 cc cc       9263 Jul 16 13:18 rsl.error.0021
-rw-rw-r-- 1 cc cc       9263 Jul 16 13:18 rsl.error.0016
-rw-rw-r-- 1 cc cc       9262 Jul 16 13:18 rsl.error.0003
-rw-rw-r-- 1 cc cc       9262 Jul 16 13:18 rsl.error.0002
-rw-rw-r-- 1 cc cc       9226 Jul 16 13:18 rsl.out.0030
-rw-rw-r-- 1 cc cc       9226 Jul 16 13:18 rsl.out.0027
-rw-rw-r-- 1 cc cc       9225 Jul 16 13:18 rsl.out.0001
-rw-rw-r-- 1 cc cc       9263 Jul 16 13:18 rsl.error.0030
-rw-rw-r-- 1 cc cc       9263 Jul 16 13:18 rsl.error.0027
-rw-rw-r-- 1 cc cc       9262 Jul 16 13:18 rsl.error.0001
-rw-rw-r-- 1 cc cc       9226 Jul 16 13:18 rsl.out.0031
-rw-rw-r-- 1 cc cc       9226 Jul 16 13:18 rsl.out.0018
-rw-rw-r-- 1 cc cc       9226 Jul 16 13:18 rsl.out.0017
-rw-rw-r-- 1 cc cc       9226 Jul 16 13:18 rsl.out.0013
-rw-rw-r-- 1 cc cc       9226 Jul 16 13:18 rsl.out.0012
-rw-rw-r-- 1 cc cc       9225 Jul 16 13:18 rsl.out.0009
-rw-rw-r-- 1 cc cc       9225 Jul 16 13:18 rsl.out.0008
-rw-rw-r-- 1 cc cc       9225 Jul 16 13:18 rsl.out.0004
-rw-rw-r-- 1 cc cc       9263 Jul 16 13:18 rsl.error.0031
-rw-rw-r-- 1 cc cc       9263 Jul 16 13:18 rsl.error.0018
-rw-rw-r-- 1 cc cc       9263 Jul 16 13:18 rsl.error.0017
-rw-rw-r-- 1 cc cc       9263 Jul 16 13:18 rsl.error.0013
-rw-rw-r-- 1 cc cc       9263 Jul 16 13:18 rsl.error.0012
-rw-rw-r-- 1 cc cc       9262 Jul 16 13:18 rsl.error.0009
-rw-rw-r-- 1 cc cc       9262 Jul 16 13:18 rsl.error.0008
-rw-rw-r-- 1 cc cc       9262 Jul 16 13:18 rsl.error.0004
-rw-rw-r-- 1 cc cc       9226 Jul 16 13:18 rsl.out.0026
-rw-rw-r-- 1 cc cc       9226 Jul 16 13:18 rsl.out.0025
-rw-rw-r-- 1 cc cc       9226 Jul 16 13:18 rsl.out.0024
-rw-rw-r-- 1 cc cc       9226 Jul 16 13:18 rsl.out.0022
-rw-rw-r-- 1 cc cc       9226 Jul 16 13:18 rsl.out.0020
-rw-rw-r-- 1 cc cc       9226 Jul 16 13:18 rsl.out.0011
-rw-rw-r-- 1 cc cc       9226 Jul 16 13:18 rsl.out.0010
-rw-rw-r-- 1 cc cc       9225 Jul 16 13:18 rsl.out.0006
-rw-rw-r-- 1 cc cc       9263 Jul 16 13:18 rsl.error.0026
-rw-rw-r-- 1 cc cc       9263 Jul 16 13:18 rsl.error.0025
-rw-rw-r-- 1 cc cc       9263 Jul 16 13:18 rsl.error.0024
-rw-rw-r-- 1 cc cc       9263 Jul 16 13:18 rsl.error.0022
-rw-rw-r-- 1 cc cc       9263 Jul 16 13:18 rsl.error.0020
-rw-rw-r-- 1 cc cc       9263 Jul 16 13:18 rsl.error.0011
-rw-rw-r-- 1 cc cc       9263 Jul 16 13:18 rsl.error.0010
-rw-rw-r-- 1 cc cc       9262 Jul 16 13:18 rsl.error.0006
-rw-rw-r-- 1 cc cc       9226 Jul 16 13:18 rsl.out.0019
-rw-rw-r-- 1 cc cc       9226 Jul 16 13:18 rsl.out.0015
-rw-rw-r-- 1 cc cc       9226 Jul 16 13:18 rsl.out.0014
-rw-rw-r-- 1 cc cc       9225 Jul 16 13:18 rsl.out.0007
-rw-rw-r-- 1 cc cc       9263 Jul 16 13:18 rsl.error.0019
-rw-rw-r-- 1 cc cc       9263 Jul 16 13:18 rsl.error.0015
-rw-rw-r-- 1 cc cc       9263 Jul 16 13:18 rsl.error.0014
-rw-rw-r-- 1 cc cc       9262 Jul 16 13:18 rsl.error.0007
-rw-rw-r-- 1 cc cc       9225 Jul 16 13:18 rsl.out.0005
-rw-rw-r-- 1 cc cc       9262 Jul 16 13:18 rsl.error.0005
-rw-rw-r-- 1 cc cc 2600727396 Jul 22 13:32 wrfout_d01_2008-01-26_06:00:00
-rw-rw-r-- 1 cc cc 2600727396 Jul 22 13:33 wrfout_d02_2008-01-26_06:00:00
-rw-rw-r-- 1 cc cc 2600727396 Jul 22 13:33 wrfout_d03_2008-01-26_06:00:00
-rw-rw-r-- 1 cc cc 2600727396 Jul 22 13:33 wrfout_d04_2008-01-26_06:00:00
-rw-rw-r-- 1 cc cc 2600727396 Jul 22 13:33 wrfout_d05_2008-01-26_06:00:00
-rw-rw-r-- 1 cc cc   22914356 Jul 22 14:14 rsl.out.0000
-rw-rw-r-- 1 cc cc   22914393 Jul 22 14:14 rsl.error.0000

Looking at the guide, it seems to me that no step addresses the number of vCPUs available. Is there anything I can do to fully exploit the hardware on my machine?

Thanks in advance,
M
 
I suppose you compiled WRF in dmpar mode; please let me know if I am wrong.

I am not sure how many grid points you have for this case. Usually we recommend that each processor handle at least around 20 grid points. If you don't have a large grid but use many processors, the communication between processors will take a lot of time and slow down the run.
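As a rough sanity check (the commands below are just an illustrative sketch for a Linux machine with Open MPI, not taken from your setup), you can compare the number of MPI ranks you launch against the cores the machine actually exposes, and confirm the decomposition WRF used:
Code:
# How many cores/vCPUs the machine exposes
nproc
lscpu | grep -E 'CPU\(s\)|Thread|Core|Socket'

# Launching more ranks than cores (e.g. -n 32 on 16 vCPUs) oversubscribes
# the machine and usually slows WRF down; one rank per core is safer
mpirun -np 16 ./wrf.exe

# The decomposition actually used is reported near the top of rsl.out.0000
grep -i ntasks rsl.out.0000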
 
Hi,
thanks for your answer.
In the compile phase for WRF, we selected option 34:
32. (serial) 33. (smpar) 34. (dmpar) 35. (dm+sm) GNU (gfortran/gcc)
and then option 1.

Concerning the grid, it is quite large: the number of points is around 3 * 10^6.

Is there any other setting I should check, or is there a "hello world" example I can run to figure out whether I'm fully exploiting my computational power?

Thanks,
Massimo
 
Massimo,
You can probably estimate the largest number of processors to use from the grid dimensions of your smallest domain; the formula is:
(e_we/25) * (e_sn/25)

This is not a very strict rule; it is just based on our experience of running WRF.
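For example (the numbers here are purely illustrative, not from Massimo's namelist), if the smallest domain had e_we = 100 and e_sn = 100, the suggested upper bound would be (100/25) * (100/25) = 4 * 4 = 16 processors.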
 
Hi guys,

I am also running WRF on a virtual machine (Linux guest on a Windows host).

I have noticed that when using metgrid with mpirun, the processors on the host are used as expected (i.e., 100% utilization on 5 processors if you specify -np 5).
The behavior of wrf and real, though, seems more like shared memory: all of the processors are used, but with different utilization percentages depending on how you specify -np.

I have yet to test this, but in the compile flags I noticed that WPS compiles with -D_MPI in the CPPFLAGS section, whereas WRF does not seem to compile with this flag as far as I can see. I am also compiling with option 34.
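One quick way to check whether the wrf.exe binary really was built with distributed-memory (MPI) support (a rough sketch; it assumes the standard WRF directory layout and a dynamically linked MPI library) is to look for the DM_PARALLEL define in configure.wrf and for an MPI library in the executable:
Code:
# dmpar builds carry -DDM_PARALLEL in configure.wrf
grep -i dm_parallel configure.wrf

# the executable should also be linked against an MPI library
ldd main/wrf.exe | grep -i mpi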

Not sure if this is the same issue or a different one.

Adam
 