About roles and ranks

This post was from a previous version of the WRF&MPAS-A Support Forum. Please do not add new replies here and if you would like the thread moved out of the Historical / Archive section then contact us, making sure to include the link of the thread to be moved.

Hello,
I just compiled the app and have started troubleshooting a few things. My first question is a simple one. When I use 1 rank, the message reads:
Code:
 My role is             3
 Role leader is             0
A bit odd, but it creates the log file with 'role03' inserted into the file name. The run never uses the GPU and eventually crashes (perhaps due to insufficient GPU memory). The next step was to use 2 GPUs, but this is more suspicious, as it produces:
Code:
 My role is             3
 Role leader is             0
 My role is             3
 Role leader is             0
and there is a single output file rather than two. Clearly something is off, so any pointers would be welcome. Thanks.
 

mgduda

Administrator
Staff member
Have you set the environment variables MPAS_DYNAMICS_RANKS_PER_NODE and MPAS_RADIATION_RANKS_PER_NODE as described in the documentation? The GPU-enabled model will probably require at least 4 MPI ranks -- two ranks to run the radiation on CPUs and two ranks to run the rest of the model on GPUs -- since two CPU sockets are assumed, and the code tries to distribute ranks equally between sockets.
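The four-rank arithmetic above can be sketched as a tiny launch script. This is a hypothetical sketch, not from the thread: the per-node values, the executable name `atmosphere_model`, and the use of `mpirun` are all assumptions, and the script only prints the launch command rather than executing it.

```shell
# Hypothetical sketch: variable values and the executable name
# (atmosphere_model) are assumptions, not taken from this thread.
export MPAS_DYNAMICS_RANKS_PER_NODE=2    # ranks running the rest of the model on GPUs
export MPAS_RADIATION_RANKS_PER_NODE=2   # ranks running radiation on CPUs

# Total MPI ranks on this single node = dynamics ranks + radiation ranks
NP=$(( MPAS_DYNAMICS_RANKS_PER_NODE + MPAS_RADIATION_RANKS_PER_NODE ))

# Print (rather than run) the launch command so the sketch is side-effect-free
echo "mpirun -np $NP ./atmosphere_model"
```

With both variables set to 2 this yields `mpirun -np 4 ./atmosphere_model`, matching the stated minimum of four ranks on a two-socket node.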
 
Thanks.
Maybe I didn't interpret the instructions correctly (some doubts crept in at the time). To clarify: I tried to run with 2 MPI ranks, and my system is very different from Summit. I was testing on a single node with 4 CPUs and 2 GPUs. I have three questions:
1) Would my configuration (single node with 2+ CPUs and 2 GPUs) be feasible?
2) Are two nodes required in order to have 2 CPU sockets? (Honestly, I'm unsure whether we're using the word 'socket' differently because of the Summit architecture.)
3) If I wanted to run on 2 CPUs + 2 GPUs, and assuming that MPAS_DYNAMICS_RANKS_PER_NODE and MPAS_RADIATION_RANKS_PER_NODE are both set to 2, would I have to call 'mpirun -np 2' or 'mpirun -np 4'?
 
I made a last attempt to run on 2 nodes rather than 1. Although it didn't crash (no error message), the app never engaged the GPUs and, as far as I can tell, made no progress even though it ran for a while. It's probably better to wait until v7 is ready and do a full round of testing then.
 