About roles and ranks

This post was from a previous version of the WRF&MPAS-A Support Forum. Please do not add new replies here and if you would like the thread moved out of the Historical / Archive section then contact us, making sure to include the link of the thread to be moved.

Hello,
I just compiled the app and have started troubleshooting a few things. My first question is a simple one. When I use 1 rank, the message reads:
Code:
 My role is             3
 Role leader is             0
A bit odd, but it creates the log file with 'role03' inserted into the file name. The run never uses the GPU and eventually crashes (perhaps due to insufficient GPU memory). The next step was to use 2 GPUs, but this is more suspicious, as it produces:
Code:
 My role is             3
 Role leader is             0
 My role is             3
 Role leader is             0
and there is a single output file rather than two. Clearly something is off, so any pointers would be welcome. Thanks.
 

mgduda

Administrator
Staff member
Have you set the environment variables MPAS_DYNAMICS_RANKS_PER_NODE and MPAS_RADIATION_RANKS_PER_NODE as described in the documentation? The GPU-enabled model will probably require at least 4 MPI ranks -- two ranks to run the radiation on CPUs and two ranks to run the rest of the model on GPUs -- since two CPU sockets are assumed, and the code tries to distribute ranks equally between sockets.
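The four-rank arithmetic above can be sketched as a tiny launch script. This is a hypothetical sketch, not from the thread: the per-node values, the executable name `atmosphere_model`, and the use of `mpirun` are all assumptions, and the script only prints the launch command rather than executing it.

```shell
# Hypothetical sketch: variable values and the executable name
# (atmosphere_model) are assumptions, not taken from this thread.
export MPAS_DYNAMICS_RANKS_PER_NODE=2    # ranks running the rest of the model on GPUs
export MPAS_RADIATION_RANKS_PER_NODE=2   # ranks running radiation on CPUs

# Total MPI ranks on this single node = dynamics ranks + radiation ranks
NP=$(( MPAS_DYNAMICS_RANKS_PER_NODE + MPAS_RADIATION_RANKS_PER_NODE ))

# Print (rather than run) the launch command so the sketch is side-effect-free
echo "mpirun -np $NP ./atmosphere_model"
```

With both variables set to 2 this yields `mpirun -np 4 ./atmosphere_model`, matching the stated minimum of four ranks on a two-socket node.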
 
Thanks.
Maybe I didn't interpret the instructions correctly (some doubts crept in at the time). To clarify: I tried to run with 2 MPI ranks, and my system is very different from Summit. I was testing on a single node with 4 CPUs and 2 GPUs. I have three questions:
1) Would my configuration (single node with 2+ CPUs and 2 GPUs) be feasible?
2) Are two nodes required in order to have 2 CPU sockets? (Honestly, I'm unsure whether we're using the word 'socket' differently because of the Summit architecture.)
3) If I wanted to run on 2 CPUs + 2 GPUs, and assuming that MPAS_DYNAMICS_RANKS_PER_NODE and MPAS_RADIATION_RANKS_PER_NODE are both set to 2, would I have to call 'mpirun -np 2' or 'mpirun -np 4'?
 
I made a last attempt to run on 2 nodes rather than 1. Although it didn't crash (no error message), the app never engaged the GPUs and, as far as I can tell, made no progress even though it ran for a while. It's probably better to wait until v7 is ready and do a full round of testing then.
 