Few CORES and the SPEEDUP no longer IMPROVES on my domains


Hi everyone. I've been using WRF for a long time, and the performance gains from adding CPU cores have always been below my expectations. I know that, for a given hardware and compiler, the speedup depends on many factors, such as the size of the computational domain and the resolution. However, beyond a few cores (5, 6, or 7 at most) I no longer see any improvement. Am I doing something wrong, or is there a setting I don't know about?

Here is my latest system: Intel compiler, Linux machine, SSD, 16 GB RAM (little of it is used during simulations...)
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 46 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 24
On-line CPU(s) list: 0-23
Vendor ID: GenuineIntel
Model name: 12th Gen Intel(R) Core(TM) i9-12900T
CPU family: 6
Model: 151
Thread(s) per core: 2
Core(s) per socket: 16
Socket(s): 1
Stepping: 2
CPU(s) scaling MHz: 59%
CPU max MHz: 4900,0000

For the ARW, this is the configuration:
time_step = 40,
time_step_fract_num = 0,
time_step_fract_den = 1,
time_step_dfi = 40,
use_adaptive_time_step = .false.,
step_to_output_time = .false.,
target_cfl = 1.2,
max_step_increase_pct = 10,
starting_time_step = 40,
starting_time_step_den = 0,
max_time_step = 108,
min_time_step = 6,
adaptation_domain = 1,
max_dom = 1,
s_we = 1,
e_we = 198,
s_sn = 1,
e_sn = 198,
s_vert = 1,
e_vert = 51,
num_metgrid_levels = 61,
num_metgrid_soil_levels = 6
dx = 7000,
dy = 7000,
grid_id = 1,
parent_id = 1,
i_parent_start = 1,
j_parent_start = 1,
parent_grid_ratio = 1,
parent_time_step_ratio = 1,
numtiles = 1,
p_top_requested = 5000.
smooth_option = 0
feedback = 0

Each core seems to be well used in terms of CPU:
672414 master 20 0 3238616 566752 107684 R 100,0 3,5 55:47.97 wrf_arw.exe
672409 master 20 0 3249520 579408 122660 R 100,0 3,6 55:49.55 wrf_arw.exe
672410 master 20 0 3238620 566992 107684 R 100,0 3,5 56:00.34 wrf_arw.exe
672411 master 20 0 3236568 567452 110036 R 100,0 3,5 55:42.65 wrf_arw.exe
672413 master 20 0 3238620 567124 110512 R 100,0 3,5 55:54.50 wrf_arw.exe
672412 master 20 0 3240672 567712 109832 R 100,0 3,5 55:56.74 wrf_arw.exe
672415 master 20 0 3218644 543264 105232 R 100,0 3,4 55:44.20 wrf_arw.exe

I only use 7 cores here because, as I increase the number of cores, the run time drops quickly for the first few cores, but beyond 7 it no longer improves. I know the domain I use is small, but I didn't expect the improvements to stop at just 6 or 7 cores. I also tried increasing the domain size (all of Europe) and nothing changed!

A similar situation has always happened to me with other computers too...
What do you think? Any suggestions?

Thank you all!
I believe what you are hitting is the performance-core vs. efficient-core issue. I have noticed on my Intel 13900K that when I use more than the P-cores, my speed gets worse.

My computer science friends have told me that when the E-cores and P-cores work together, they run less efficiently than the P-cores by themselves.
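To make the P-core/E-core argument concrete, here is a minimal sketch of a toy model. It assumes the work is split into equal parts, one per core, and that every time step waits for the slowest core to finish. The function name `synchronous_speedup` and the relative E-core speed of 0.5 are assumptions for illustration only, not measurements of any real CPU or of WRF itself.

```python
def synchronous_speedup(core_speeds):
    """Speedup relative to one reference core when the work is split into
    equal parts and each time step waits for the slowest core."""
    if not core_speeds:
        raise ValueError("need at least one core")
    # Each core gets 1/n of the work, so the step time is set by the
    # slowest core: speedup = n * (speed of the slowest core).
    return len(core_speeds) * min(core_speeds)


P_SPEED = 1.0  # reference: one performance core
E_SPEED = 0.5  # ASSUMED relative speed of an efficient core (not measured)

for n_e in (0, 2, 4, 8):
    speeds = [P_SPEED] * 8 + [E_SPEED] * n_e
    print(f"8 P-cores + {n_e} E-cores: "
          f"modeled speedup {synchronous_speedup(speeds):.1f}")
```

Under these assumptions, adding a few slower cores to 8 fast ones lowers the modeled speedup, because the equal-sized pieces on the slow cores hold everyone else back. The real behavior depends on how the operating system schedules the processes, so treat this only as an illustration of the idea.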
Hello and thanks for your reply!

Some observations: this problem has always occurred for me, even on older PCs with multicore CPUs that I don't think have this hybrid P/E architecture... Even with the old NMM core of WRF, with the same domain and resolution, I couldn't get past 5 cores...

Also: I would then expect a performance improvement at least up to 8 cores, not 6 or 7 (5 in the case of the NMM, which I tested on the same computer).

Finally: the Intel Thread Director should be responsible for coordinating the P-cores and E-cores. It is a hardware solution designed to optimize task management by choosing, moment by moment, the most suitable core to execute a given thread. As needed, the Thread Director can therefore assign the heaviest tasks to the Performance cores and the lighter ones to the Efficient cores.

And then: same problem with another computer, but AMD (not Intel)...

I really can't understand...
I guess that with more processors, the communication between them takes more time, which eventually offsets the higher computational efficiency gained from the additional processors.
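A rough way to see how communication grows relative to computation is to look at how much halo (ghost-cell) area each patch carries as the domain is split into more pieces. The sketch below uses the 198 x 198 grid from the first post. The halo width of 3 points, the near-square factorization, and the helper names `best_2d_split` and `halo_fraction` are assumptions for illustration, and every patch is treated as if it had neighbors on all four sides, so edge patches are overestimated.

```python
import math


def best_2d_split(n_ranks):
    """Return the factor pair (px, py) of n_ranks that is closest to square."""
    px = math.isqrt(n_ranks)
    while n_ranks % px:
        px -= 1
    return px, n_ranks // px


def halo_fraction(nx, ny, n_ranks, halo=3):
    """Ratio of halo points to interior points for one average patch.

    halo=3 is an ASSUMED halo width, used only for illustration.
    """
    px, py = best_2d_split(n_ranks)
    patch_x = nx / px  # average patch width in grid points
    patch_y = ny / py  # average patch height in grid points
    interior = patch_x * patch_y
    with_halo = (patch_x + 2 * halo) * (patch_y + 2 * halo)
    return px, py, patch_x, patch_y, (with_halo - interior) / interior


for ranks in (1, 4, 7, 16, 24):
    px, py, sx, sy, frac = halo_fraction(198, 198, ranks)
    print(f"{ranks:2d} ranks -> {px} x {py} patches of about "
          f"{sx:.0f} x {sy:.0f} points, halo/interior = {frac:.2f}")
```

The ratio only grows as the patches shrink, which is the effect described above. It says nothing about the actual cost of each exchange, which depends on the hardware and the MPI library.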
I thought so too, but I see many applications that use dozens of cores, or even computer clusters (where the communication speed problem is much greater), and yet I stop at 6 or 7 cores?
Your grid is only 198 x 198 points, and 6-7 processors are already sufficient for this case. Further increasing the number of cores is not necessary and doesn't really help.
As I said at the beginning: "I also tried increasing the domain size (all of Europe) and nothing changed"...