Multiple running of wrf.exe by mpirun!

This post was from a previous version of the WRF&MPAS-A Support Forum. New replies have been disabled and if you have follow up questions related to this post, then please start a new thread from the forum home page.

Polly_LO

New member
Hi everyone!

I have 2 different cases to run with wrf.exe, and I tried to start 2 instances of wrf.exe at the same time.
I created a copy of the WRF run directory with all nested directories (/WRF_TEST/run/..., /WRF_TEST/main/...). Then I created 2 sets of wrf.exe input files (wrfinput_d01, wrfbdy_d01) and 2 different namelists.
As a result I have 2 independent directories, /WRF/run/... and /WRF_TEST/run/..., each with all the required input files (wrfinput_d01, wrfbdy_d01, namelist.input).
My PC has an AMD Ryzen 9 5950X 16-core processor.
At first I ran one case on 8 cores (mpirun -np 8 ./wrf.exe) and on 16 cores (mpirun -np 16 ./wrf.exe), and the 8-core calculation was faster.
I think the reason is that my domain is not very big (dx/dy - 160/160).
Therefore I decided to run 2 different calculations at the same time on 8 cores each, since the remaining cores are otherwise unused.

The main question is:
How do I start 2 different calculations in different directories at the same time?

I tried to run something like this:
mpirun -np 8 --wdir /WRF/run wrf.exe : -np 8 --wdir /WRF_TEST/run wrf_new.exe
In top I see that mpirun successfully started all the processes: 8 for wrf.exe and 8 for wrf_new.exe.
rsl.error.* files were created in both directories: 8 in /WRF/run and 8 in /WRF_TEST/run.
But wrfout_d01* files are created in only one directory, and the calculation progresses in only one directory.

Does anyone have experience running wrf.exe like this?
Or is there a way to tell mpirun which cores to use for wrf.exe? That would allow me to run the 2 different cases on 8 cores each without wasting computing resources.

With best regards!
Palina Zaiko
 
Hello Palina,

You should be able to do so with the following:

cd /WRF; mpirun -np 8 ./wrf.exe & cd /WRF_TEST; mpirun -np 8 ./wrf_new.exe &

You need to change into each directory first and run them in the background simultaneously.
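
A minimal sketch of this approach as a shell script, assuming the run directories and executable names from the question. Each case is started from its own directory by its own mpirun, so each run gets its own MPI_COMM_WORLD rather than sharing one 16-rank job. DRY_RUN and run.log are hypothetical names introduced here for illustration; with DRY_RUN=1 (the default) the script only prints what it would run.

```shell
#!/usr/bin/env bash
# Sketch: start two independent WRF runs, each from its own run directory.
# DRY_RUN and run.log are illustrative names, not part of WRF or mpirun.
DRY_RUN="${DRY_RUN:-1}"

launch_case() {
    # $1 = run directory, $2 = executable name, $3 = number of MPI ranks
    if [ "$DRY_RUN" = "1" ]; then
        echo "cd $1 && mpirun -np $3 ./$2 > run.log 2>&1 &"
    else
        # The subshell keeps the cd local; the trailing & backgrounds the run.
        ( cd "$1" && mpirun -np "$3" "./$2" > run.log 2>&1 ) &
    fi
}

launch_case /WRF/run      wrf.exe     8
launch_case /WRF_TEST/run wrf_new.exe 8
wait   # returns once both background runs have finished
```

Run it with DRY_RUN=0 to actually launch both cases; the final wait keeps the script alive until both have exited.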

Cheers,

Frank
 
Hi, Frank!
Thank you for your reply.

I tried to run:
cd /WRF/run; mpirun -np 8 ./wrf.exe &;cd /WRF_TEST/run/;mpirun -np 8 ./wrf_new.exe

This command ran successfully, and both wrf.exe and wrf_new.exe started, BUT the calculation speed dropped by a factor of 2.
I think the main reason is that the 2 programs started one after the other and use the same cores for the calculation.

I want to find a way to define which cores each run uses.

With best regards,
Palina
 
Hello Palina,

If you are using Linux you can monitor each mpirun process with a program such as htop. If you have not created any cpusets and the BIOS has not been configured to use the L3 cache in a certain way, each of the two runs should get its own 8 cores and both should run simultaneously.

Two simultaneous 8-core runs will not finish in the same time as a single run because of the memory channel limitations of the Ryzen 9 processor. Since it only has dual-channel capability, there are latencies because both sets of processes access the same RAM.

My running time for 2 x 8-core runs is 1.35 x the time for a single 8-core run, so not as large as the 2x factor you reported.

You can try the L3 cache configuration options in the BIOS (if they are available) and Linux cpusets or the bind-to-core parameter of mpirun to try to keep each 8-core run accessing a single channel of memory. Assuming you are using OpenMPI on a single CPU node, you can read more about the mpirun parameters for binding processes here: https://www.open-mpi.org/doc/v3.0/man1/mpirun.1.php

Unfortunately the Ryzen 9 has far fewer memory channels than other high-core-count processors, which limits its use in NWP.
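
One way to sketch the core pinning is with the Linux taskset utility, giving each run its own list of 8 logical CPUs. This assumes Open MPI (for --bind-to none), the paths and executable names from the question, and that Linux numbers the first hardware thread of cores 0-15 as CPUs 0-15, which is worth checking with lscpu -e. DRY_RUN is a hypothetical switch that only prints the commands. The sketch keeps the two runs on separate cores; it cannot guarantee anything about how memory channels are shared.

```shell
#!/usr/bin/env bash
# Sketch: pin each WRF run to its own set of 8 logical CPUs with taskset.
# CPU lists 0-7 and 8-15 are assumptions about this machine's numbering.
DRY_RUN="${DRY_RUN:-1}"

pinned_cmd() {
    # $1 = CPU list, $2 = executable name, $3 = number of MPI ranks
    # --bind-to none leaves placement to the taskset mask instead of
    # letting mpirun apply its own binding policy on top of it.
    echo "taskset -c $1 mpirun --bind-to none -np $3 ./$2"
}

run_pinned() {
    # $1 = run directory, $2 = CPU list, $3 = executable, $4 = MPI ranks
    if [ "$DRY_RUN" = "1" ]; then
        echo "(cd $1 && $(pinned_cmd "$2" "$3" "$4")) &"
    else
        ( cd "$1" && taskset -c "$2" mpirun --bind-to none -np "$4" "./$3" ) &
    fi
}

run_pinned /WRF/run      0-7  wrf.exe     8
run_pinned /WRF_TEST/run 8-15 wrf_new.exe 8
wait
```

mpirun also has its own binding and CPU-restriction options, described on the man page linked above; taskset is used here only because it works the same way regardless of the MPI version.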

Cheers,

Frank
 
Hi Frank!
Many thanks for your reply.
I ran some tests with 2 cases in parallel (2 x 8 cores). The parallel calculation was 2 times slower than a single 8-core run.
As I understand it, the main reason is the memory limit on my PC?
If I can increase the memory size on the server, would that help make the parallel calculation faster?

With best regards,
Palina
 
Hello Palina,

I suspect that this is not an issue of the amount of memory but rather that your two runs are actually running consecutively. To check though please use the following steps:

If you are using Linux then analyze using the command-line program htop:
Type htop on the command line. If this fails then you may need to install htop using "sudo apt install htop" or "sudo yum install htop" or the equivalent for your Linux version.

Set up htop:
Press F2 for setup (the keys for configuring are shown at the bottom of the console):
1. Ensure that the meters choice shows the CPUs option in the left and/or right column.
2. Ensure that the meters choice shows the memory.
3. Ensure that the meters choice shows the amount of swap.
4. Ensure that the columns choice shows the processor number.

Please check when both wrf executables are running in parallel:

The CPU threads:
1. The number of threads running is 16.
2. Looking at each thread pair per core (0 with 16, 1 with 17 ... 15 with 31), each pair has only 1 thread active. This indicates that no hyperthreading is active on any of the 16 cores, which ensures that the 16 cores are used optimally. This is a WRF limitation.
3. Further optimization, by ensuring that the first wrf runs on the first 8 cores and the second wrf runs on the second set of 8 cores, will need cpusets or MPI parameters.

The Memory:
Using the memory and swap meters from setup steps 2 and 3, ensure that the memory usage is less than the total and that the swap used is ideally less than 1 GB. The lower the swap usage the better.
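
The same checks can be sketched from the command line without htop, assuming the procps versions of ps and free. The psr column of ps is the logical CPU a process is currently assigned to; it is a snapshot, so unpinned processes may move between calls. The function names wrf_cpus and shared_cpus are hypothetical helpers introduced here.

```shell
#!/usr/bin/env bash
# Sketch: show which logical CPUs the WRF processes occupy, plus memory/swap.

wrf_cpus() {
    # Reads "PSR COMMAND" lines on stdin and prints, sorted, the CPU
    # numbers of processes whose command name starts with "wrf".
    awk '$2 ~ /^wrf/ { print $1 }' | sort -n
}

shared_cpus() {
    # Prints CPU numbers that occur more than once, meaning two WRF
    # processes were on the same logical CPU at the time of the snapshot.
    wrf_cpus | uniq -d
}

ps -eo psr=,comm= | wrf_cpus | tr '\n' ' '; echo   # CPUs in use by wrf*
ps -eo psr=,comm= | shared_cpus                    # empty if none are shared
free -h                                            # memory and swap usage
```

With both runs active, 16 distinct CPU numbers and an empty second line would match items 1 and 2 above.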

Cheers,

Frank
 
Hi, Frank!
Thank you for the answer.
Finally I was able to run the parallel test.

First I ran the command:
cd /WRF/run; mpirun -np 8 ./wrf.exe & cd /WRF_TEST/run/;mpirun -np 8 ./wrf_new.exe

A screenshot of the htop output is attached.
It looks like both processes are running in parallel, doesn't it?

With best regards,
Palina
 

Attachment: Htop_WRF.exe.jpg
 
Hello Palina,

Thanks for posting your screenshot:
1. It does appear that your two runs are running in parallel and using separate cores with no multi-threading on each core.
2. Your memory usage is low and hence the total amount of RAM is more than sufficient.

What are your motherboard and RAM configurations?
1. I presume your motherboard has support for dual memory channels.
2. I presume the number of installed DIMMs is a multiple of two to provide dual channel support and that it is running in dual channel mode.

To test whether you have dual channel mode running, execute:

sudo dmidecode -t 17

For each RAM DIMM you should see values for the Bank and Bank_Locator keywords: There should be an A and B indicating that you have two channels active.
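
The dmidecode output is long, so a small filter can help. This is a sketch that keeps only the slot, bank, size and speed lines of each memory device; the exact labels and values printed vary by motherboard and dmidecode version, and dimm_summary is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch: reduce "dmidecode -t 17" output to the lines relevant for
# checking which slots and banks are populated.

dimm_summary() {
    # Reads dmidecode output on stdin and keeps the Locator, Bank Locator,
    # Size and Speed lines of each memory device.
    grep -E '^[[:space:]]*(Locator|Bank Locator|Size|Speed):'
}

# Usage (needs root):
#   sudo dmidecode -t 17 | dimm_summary
```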

Cheers,

Frank
 
Hi, Frank!

I ran the command to check dual-channel support:
>sudo dmidecode -t 17

The attached file contains the output of this command.
From the results, I guess that 2 channels are active?
If so, is this the maximum possible calculation rate?
Are there any other options for speeding it up?

With best regards,
Palina
 

Attachment: dmi_out.txt
 
Hi Palina,

Sorry for the late reply.

I am sure that your BIOS settings are standard and hence L3 caching should not be the issue.

Could you post your rsl.out.0000 files for all 3 of the runs? That is, for the single 8-core run on its own as well as for the two 8-core runs running together.
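
For a quick comparison of the three runs, the per-step timings can be averaged directly from those files. This sketch assumes the usual "Timing for main: ... elapsed seconds" lines that WRF writes to rsl.out.0000 and the run directories from the question; mean_step_time is a hypothetical helper name. The first time step usually includes initialization, so it can skew a short run's average.

```shell
#!/usr/bin/env bash
# Sketch: average the elapsed seconds per model step reported in rsl.out.0000.

mean_step_time() {
    # $1 = path to an rsl.out.0000 file. The elapsed time is the third
    # field from the end of each "Timing for main" line.
    awk '/Timing for main/ { sum += $(NF-2); n++ }
         END { if (n) printf "%.2f\n", sum / n; else print "no timing lines found" }' "$1"
}

for f in /WRF/run/rsl.out.0000 /WRF_TEST/run/rsl.out.0000; do
    if [ -f "$f" ]; then
        echo "$f: $(mean_step_time "$f") s per step"
    fi
done
```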

Thanks,

Frank
 