
Performance improvement

This post was from a previous version of the WRF&MPAS-A Support Forum. New replies have been disabled, and if you have follow-up questions related to this post, please start a new thread from the forum home page.


Hi all.

I wanted to see if someone can lend me a hand verifying the performance of WRF's execution.

I have a Dell PowerEdge R540 with two Intel Xeon Gold processors of 16 cores each. WRF is compiled with Intel. The problem I have is that when I run a simulation, it makes almost no difference whether I use 16 or 32 cores; I only gain about a minute. I have tried several different simulations (short, long, higher resolution, lower resolution, etc.).

Does anyone know what may be happening?

Thank you.
I am not sure that we can provide much guidance, but I do not want to just let this sit unattended as if the post is being ignored.

This may not be so much a WRF issue. It may be more a problem of how to allocate the MPI ranks of a distributed-memory program on a small cluster, how to ensure that certain processes remain pinned to particular processors, and how to enforce that program's decomposition through environment variables and the mpirun command's arguments.

Maybe it is easier to first get parallel processing working as expected on a simple MPI benchmark case (a Google search shows that LLNL and Intel both offer such benchmarks). Once it is clear which environment variables and mpirun settings are required to get the small benchmark running evenly and efficiently across the processes of the selected architecture, those user-defined settings would be the first guess at how to launch a WRF job.
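As a concrete starting point for the pinning idea above, here is a hypothetical launch sketch. The Intel MPI environment variables and Open MPI flags shown are common pinning controls, but the right values for this particular machine would need to come out of the benchmark experiments.

```shell
# Hypothetical sketch: pin one MPI rank per physical core on a
# 2-socket, 16-cores-per-socket node. Assumes Intel MPI and that
# wrf.exe is in the current directory.

export I_MPI_PIN=1             # turn on process pinning
export I_MPI_PIN_DOMAIN=core   # one rank per physical core (no hyperthreads)
export I_MPI_DEBUG=4           # print the rank-to-core mapping at startup

mpirun -np 32 ./wrf.exe

# Rough Open MPI equivalent:
#   mpirun -np 32 --bind-to core --map-by core ./wrf.exe
```

The I_MPI_DEBUG output is useful here: it shows which core each rank actually landed on, so you can confirm that ranks are not being stacked onto one socket.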

This may be an instance where the broader user community is better able to provide assistance, especially if those community members have obtained good performance from a similarly setup system.

Your problem could be slow I/O performance for a variety of reasons.

Set "io_form_history" to "0" in namelist.input, redo your tests, and see what happens. Also make sure nothing non-system is running while you test.
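For reference, this is roughly where that setting lives in namelist.input; the sketch below shows only the one entry (the rest of the &time_control block is omitted). Setting it to 0 disables history output, so the run can be timed without history I/O in the picture.

```
&time_control
 io_form_history = 0,
/
```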

Thanks for the answers. I'm going to give a little more information. I have 3 domains (27, 9 and 3 km):

e_we = 100, 118, 106,
e_sn = 100, 118, 106,

The best decomposition I have found uses 20 cores, if I force:

nproc_x = 2,
nproc_y = 10,

I have tried to set "io_form_history" to "0", but it did not improve.

If I increase the number of cores above 20 (I can use up to 32), the execution time does not improve; it is even slower.
Hi JCollins,

Your domains are actually really small, and therefore it's not typically recommended to use a lot of processors for them. Take a look at this FAQ post that explains this a bit more:

Hello, kwener.

I have seen in the best-practice guidance that domains under 100x100 grid points are not recommended. If my domains are small, what domain size would be correct?

If I extend the first domain, I have to lower the resolution, because keeping the same resolution would make the domain very large.

Another thing that I saw is the following:
For your smallest-sized domain:
((e_we) / 25) * ((e_sn) / 25) = maximum number of processors you should use

According to that formula, I should use 16 processors taking into account the smallest domain. Still, it's better for me to use 20 processors.
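For reference, plugging the smallest domain quoted above (e_we = e_sn = 106) into that rule of thumb is simple integer arithmetic, e.g. in shell:

```shell
# Rule of thumb: max processors = (e_we / 25) * (e_sn / 25),
# using the smallest domain from this thread, 106 x 106.
e_we=106
e_sn=106
max_procs=$(( (e_we / 25) * (e_sn / 25) ))
echo "$max_procs"   # integer division gives 4 * 4 = 16
```

With integer division this comes out to 16, which matches the count mentioned above.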

I have tried increasing the domains to 160x160. In that case, using 32 cores is slightly faster than using 20, but only marginally. And if I again force 20 cores (nproc_x = 2 and nproc_y = 10), I once more get better results with 20 cores than with 32.
The FAQ page I pointed you to explains a 'rule of thumb,' which is just a loose formula to follow. The exact numbers may need to be played around with a bit to get better performance. Performance can depend on many different things: the size of your domains, the resolution, the input data you're using, the platform, the OS, the compiler, the versions of the platform/OS/compiler, the version of the code, the physics options you choose and how they interact with each other, and even background noise on your system. So if 20 is better for you, then use 20 instead of 16. The formula is only meant to give you a starting point, and you shouldn't veer extremely far from it.

Unfortunately, I also cannot tell you what domain sizes to use. It's true that we don't recommend domains smaller than 100x100, because they need to be large enough to resolve what you're interested in. In a domain much smaller than about 100x100, features will typically propagate out of the domain before anything of interest has time to develop. If your domains are 100x100 and you are satisfied with your results, then you should keep them at that size. It's always recommended to take baby steps and run short test cases to see if things make sense for your set-up before modifying all kinds of variables in the namelist or launching a full, large/long run. You will just need to play around with the domains and the number of processors to find what works best for you.