server BIOS configuration

Topics specifically related to running the model in an HPC environment
MarcelloCasula
Posts: 4
Joined: Mon Jul 13, 2020 9:58 am

server BIOS configuration

Post by MarcelloCasula » Wed Nov 18, 2020 7:12 pm

Hi all,
I am running WRF 4.2 on a Huawei 2488 server equipped with 4 Intel(R) Xeon(R) Platinum 8168 CPUs @ 2.70GHz (4 x 24 = 96 cores in total) and 388 GB of RAM, and I have noticed some strange behavior. While running test simulations I encountered the following anomalies:
1) The calculation time remains unchanged whether the simulation is launched with 24 cores or with more (up to all 96).
2) Launching 2 identical runs simultaneously, each with 24 cores, the time of the single run doubles compared to the single run with 24 cores.
3) Launching 2 identical runs simultaneously, each with 12 cores, the time of the single run remains unchanged compared to the single run with 24 cores.
4) The model was compiled with both the Intel and GNU compilers, with the same results.
5) In htop, for a run with up to 24 cores the cores are almost always correctly used at 100%, while as the number of cores increases the utilization of each core drops roughly in proportion to the number of cores used.
6) In all the test runs the RAM in use stays around 20% of the total, so it is not a problem of insufficient installed memory.

It seems as if there were a ceiling on the maximum number of operations the system can perform per unit of time (a minimal test along these lines is sketched below).
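
For illustration only: the sketch below is not part of WRF and simply assumes Python 3 with numpy is available on the server. Each worker streams large arrays through memory, which is roughly what a model like WRF does; if the per-worker time grows as workers are added while total RAM usage stays low, the machine is hitting a memory-bandwidth ceiling rather than a CPU-core limit.

    import time
    import numpy as np
    from multiprocessing import Pool

    N = 30_000_000   # ~240 MB per float64 array; reduce on machines with less RAM
    SWEEPS = 20      # passes over the arrays per worker

    def stream(_):
        # Repeatedly stream two large arrays through memory; this stresses
        # memory bandwidth rather than arithmetic throughput.
        b = np.ones(N)
        c = np.ones(N)
        t0 = time.perf_counter()
        for _ in range(SWEEPS):
            b += 2.0 * c
        return time.perf_counter() - t0

    if __name__ == "__main__":
        for workers in (1, 4, 12, 24, 48):
            with Pool(workers) as pool:
                times = pool.map(stream, range(workers))
            avg = sum(times) / len(times)
            print(f"{workers:3d} workers: mean per-worker time {avg:6.1f} s")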

A small improvement turned up after disabling NUMA in the BIOS: the new scaling threshold became 48 processors. But, as above, when running two simulations at the same time with 48 cores each, the execution time of each simulation exactly doubles instead of remaining about the same. I'm quite sure this is not a problem of WRF itself; anyway, I knock at the door of the community's experience to get at least a tip on how to solve this issue.
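
Another thing worth checking, sketched below purely as an illustration (it assumes a Linux server and that the WRF executable keeps its default name wrf.exe), is whether the MPI ranks are actually pinned to distinct cores. If every rank is allowed to float over all 96 cores, the ranks can migrate across sockets and lose memory locality, which could also produce the utilization pattern seen in htop.

    import os

    def wrf_pids(name="wrf.exe"):
        # Scan /proc for processes whose executable name matches `name`.
        for pid in filter(str.isdigit, os.listdir("/proc")):
            try:
                with open(f"/proc/{pid}/comm") as f:
                    if f.read().strip() == name:
                        yield int(pid)
            except OSError:
                continue  # process exited while scanning

    if __name__ == "__main__":
        for pid in wrf_pids():
            cpus = sorted(os.sched_getaffinity(pid))  # Linux-only call
            print(f"PID {pid}: allowed on {len(cpus)} cores -> {cpus}")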

Does anybody have a suggestion?

Thanks in advance,

Marcello

kwerner
Posts: 2287
Joined: Wed Feb 14, 2018 9:21 pm

Re: server BIOS configuration

Post by kwerner » Thu Nov 19, 2020 11:10 pm

Hi Marcello,
I'm a little confused about what a "single run" means in your post. For instance, in 2), you said the two runs were identical, both with 24 cores. I assume one of the two identical simulations is the "single run," but what is the other one? I thought you meant a run with a single node, but given the content, I don't think that is correct.

Unfortunately, our team at NCAR likely won't be able to help with this, as it sounds like it's a system issue. Do you have a systems administrator at your institution that could help with the problem? And as you said, perhaps someone in the community with experience will be able to help. I hope so!
NCAR/MMM

MarcelloCasula
Posts: 4
Joined: Mon Jul 13, 2020 9:58 am

Re: server BIOS configuration

Post by MarcelloCasula » Mon Nov 23, 2020 12:32 pm

Sorry for the misleading expression "single" in:
2) Launching 2 identical runs simultaneously, each with 24 cores, the time of the single run doubles compared to the single run with 24 cores.
3) Launching 2 identical runs simultaneously, each with 12 cores, the time of the single run remains unchanged compared to the single run with 24 cores.

In points 2 and 3 I meant the following: if I run just one simulation in directory AAA using 24 of the 96 cores, it takes, for instance, 1 hour; but if I duplicate directory AAA into BBB and run 2 simulations at the same time, one in AAA (using 24 of the 96 cores) and the other in BBB (using 24 of the 96 cores), each one takes almost twice as long. So I observe a doubling in time even though the system is not exploiting all of its resources. This seems anomalous to me; is my impression correct? And if so, does anybody have a suggestion? (A small experiment that reproduces this pattern outside WRF is sketched below.)
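
The small sketch below (illustration only, not WRF; it assumes Python 3 with numpy) mirrors that AAA/BBB experiment outside the model: it times one memory-streaming job running alone and then two launched together. If each concurrent job takes roughly twice the solo time even though most cores sit idle, the jobs are competing for a shared resource such as memory bandwidth, which would point to a system-level limit rather than anything in WRF.

    import time
    import numpy as np
    from multiprocessing import Process

    N = 30_000_000   # ~240 MB per float64 array

    def job():
        # A memory-streaming workload standing in for one simulation.
        b, c = np.ones(N), np.ones(N)
        for _ in range(40):
            b += 2.0 * c

    def elapsed(n_jobs):
        # Launch n_jobs copies concurrently and return the total wall time,
        # which approximates the wall time of each individual job.
        procs = [Process(target=job) for _ in range(n_jobs)]
        t0 = time.perf_counter()
        for p in procs:
            p.start()
        for p in procs:
            p.join()
        return time.perf_counter() - t0

    if __name__ == "__main__":
        print(f"one job alone     : {elapsed(1):5.1f} s")
        print(f"two jobs together : {elapsed(2):5.1f} s")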

kwerner
Posts: 2287
Joined: Wed Feb 14, 2018 9:21 pm

Re: server BIOS configuration

Post by kwerner » Mon Nov 23, 2020 7:37 pm

Thanks for clarifying. That makes more sense. It does sound like a problem with your particular system. If you have anyone in the systems group at your institution that you can discuss the problem with, I'd recommend starting there. They should know how the machine should perform, what to expect, and solutions for correcting it.
NCAR/MMM
