Compiling and Running MPAS on GPU cluster

AKnight

New member
I want to run MPAS-A openacc v6 or v7 on our GPU cluster.

It is an HPE Apollo 6500 Linux cluster with 8 nodes, each equipped with 8 NVIDIA A100 GPUs, connected by an NVIDIA/Mellanox HDR100 InfiniBand interconnect. The compute nodes are structured as follows:
  • 4 compute nodes (A100 GPU): 128 cores (2x 64-core 2.00 GHz AMD EPYC Milan 7713), 1 TB RAM (16x 64 GB DDR4 dual-rank 3200 MHz), 8x NVIDIA A100 (80 GB), mig=1
  • 2 compute nodes (A100 GPU): 128 cores (2x 64-core 2.00 GHz AMD EPYC Milan 7713), 1 TB RAM (16x 64 GB DDR4 dual-rank 3200 MHz), 8x NVIDIA A100 (80 GB), mig=2
  • 2 compute nodes (A100 GPU): 128 cores (2x 64-core 2.00 GHz AMD EPYC Milan 7713), 1 TB RAM (16x 64 GB DDR4 dual-rank 3200 MHz), 8x NVIDIA A100 (80 GB), mig=7

We currently have openmpi/4.1.6 compiled with nvhpc/24.7. I know additional libraries would be required for compilation. However, I have a few questions:
1) Is this the latest documentation on GPU-enabled MPAS: "GPU-enabled MPAS-Atmosphere — MPAS Atmosphere documentation"?
2) What are the differences between the two branches atmosphere/v6.x-openacc and atmosphere/develop-openacc?
3) I am also looking for suggestions on how to ensure correct usage of the resources allocated in a batch/Slurm script once I have the model compiled.

Additionally, I have compiled MPAS v8.3.1 using nvhpc 25.1 (MPI, compilers, and CUDA) and parallel-netcdf. All of these libraries were installed locally in one of my directories, and I don't think my nvhpc MPI installation accepts srun. I do plan on switching to the system installations of nvhpc and openmpi mentioned above, but have not done so yet.
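
Roughly, my build environment looked like the following (the module name and paths are placeholders, not my exact setup):

# Sketch of the build environment; MPAS finds parallel-netcdf via $PNETCDF
module load nvhpc/25.1                      # module name may differ per system
export PNETCDF=$HOME/local/pnetcdf-1.14.1   # prefix of the local parallel-netcdf install
make -j 4 nvhpc CORE=atmosphere OPENACC=true
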
The executable was linked against the CUDA libraries when compiled, and the log file shows that the GPUs are being detected, but the model does not appear to be using any VRAM during the runs. Are there any suggestions to remedy this?
 
Hi there,

While I am not familiar with the previous GPU ports of MPAS (another staff member can perhaps help you better there), I can try to help with getting v8.3.1 to run on GPUs. Could you please share the following details:
  1. The full list of modules loaded at build-time, and your build command.
  2. The section of the Makefile specific to your machine (showing the build options)
  3. The model run log files + your command to launch the MPAS model on GPUs
  4. Also, how do you check the GPU/device memory during the model runs?
Thanks!
 
The full list of modules loaded at build-time, and your build command.
1. I used the module file that comes with the nvhpc 25.1 installation (attached to this post). Additionally, I installed pnetcdf/1.14.1 and loaded it with the other attached module file.
The section of the Makefile specific to your machine (showing the build options)
2. I may be misunderstanding the question, but I used the "nvhpc" build target. (make -j 4 CORE=atmosphere OPENACC=true)
The model run log files + your command to launch the MPAS model on GPUs
3. I am attaching my submission script and the log files. I have struggled with understanding some of the documentation, so don't judge the likely pitiful attempt at running it. I have successfully run multiple larger CPU runs, but this is completely new territory.
Also, how do you check the GPU/device memory during the model runs?
4. For starters, I kept running into the issue of multiple processes being assigned to one GPU, so I couldn't even get past that issue. To check, I used nvidia-smi and nvidia-smi --query-gpu=utilization.gpu,utilization.memory.

Additionally, I attached the output of "ldd atmosphere_model" in the ldd.txt file.
 

Attachments

  • 1.14.1.txt
  • 25.1.txt
  • ldd.txt
  • log.atmosphere.0000.txt
  • run_model.txt

Thanks for the context! So, a few things:

1. The environment variables MPAS_DYNAMICS_RANKS_PER_NODE and MPAS_RADIATION_RANKS_PER_NODE are not used in the ongoing GPU port of MPAS v8. This brings us to another important point: the version 8 GPU port is still in progress, and only the dynamical core is currently ported. If you notice that v8 runs much slower on GPUs than it does on CPUs, that is expected.

2. Your build command seems reasonable, but let's work on the command for the model run. To ensure that each MPI rank gets assigned to a dedicated device, we use the CUDA_VISIBLE_DEVICES environment variable, as you know. However, setting it needs to be offloaded to a separate wrapper script so that each MPI task uses a different value of CUDA_VISIBLE_DEVICES.

For example, to run the model on 4 MPI ranks with each rank using its own GPU:

mpirun -np 4 ./set_gpu_rank.sh ./atmosphere_model

where set_gpu_rank.sh contains

#!/bin/bash
# Assign each local MPI rank its own GPU, then run the wrapped command
export LOCAL_RANK=$SLURM_LOCALID
export GPUS=(0 1 2 3)
export CUDA_VISIBLE_DEVICES=${GPUS[$LOCAL_RANK]}
exec "$@"
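
As a rough sketch of how this could be wired into a batch job (partition/account directives omitted, and the GPU request syntax may differ slightly on your system):

#!/bin/bash
#SBATCH --job-name=mpas_gpu
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4     # one MPI rank per GPU in this example
#SBATCH --gpus-per-node=4       # or --gres=gpu:4, depending on your Slurm setup
#SBATCH --time=01:00:00

mpirun -np 4 ./set_gpu_rank.sh ./atmosphere_model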

Let me know if this fixes some of your issues.

3. Regarding nvidia-smi: if you launch the job, log in to the node, and then run nvidia-smi to look at the visual output, it can sometimes show no GPU usage due to a low sampling frequency. However, you might see the GPU being used if you save the query output to a CSV file, for example. If you still don't, try adjusting the query interval to a few milliseconds.
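
Something along these lines, run on the compute node while the model is executing (the fields and interval below are just a reasonable starting point):

# Log GPU and memory utilization every 200 ms to a CSV file in the background
nvidia-smi --query-gpu=timestamp,index,utilization.gpu,utilization.memory,memory.used \
           --format=csv -lms 200 > gpu_usage.csv &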

Alternatively, you can also use the Nvidia Nsight tools to confirm that it's running on the GPUs.
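
For instance, one quick check with Nsight Systems (shown here with a single rank just to keep the report simple; the tracing options can be adjusted):

# Profile one rank; the timeline should show CUDA/OpenACC kernels and
# host<->device transfers if the model is really using the GPU.
mpirun -np 1 ./set_gpu_rank.sh nsys profile --trace=cuda,openacc -o mpas_gpu_report ./atmosphere_model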
 
It has been a very busy month, so I apologize for the late response. Thank you for the information, as it was very helpful. After speaking with our admin, I have MPAS running on the GPUs. However, we changed the set_gpu_rank.sh script:

#!/bin/bash
# Slurm hands the job a comma-separated list of GPU/MIG device IDs in
# CUDA_VISIBLE_DEVICES; split it into an array and pick one per local rank,
# round-robin, so more ranks than GPUs can be requested.
export LOCAL_RANK=$SLURM_LOCALID
export GPUS=($(echo $CUDA_VISIBLE_DEVICES | tr , '\n'))
export CUDA_VISIBLE_DEVICES=${GPUS[$((SLURM_LOCALID % $SLURM_GPUS_ON_NODE))]}
exec $1   # launch the model executable passed as the first argument

This accounts for MIG instances, as they are not labeled (0, 1, 2, 3) like the full A100s (at least on our cluster). It also assigns more than one rank to each GPU if you request more tasks in the Slurm script, which in turn lets us request more CPU cores through Slurm.
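
To illustrate the mapping with made-up device IDs (not our actual MIG identifiers):

# Illustration only: pretend the job was handed 2 MIG devices by Slurm
export CUDA_VISIBLE_DEVICES="MIG-aaaa,MIG-bbbb"
export SLURM_GPUS_ON_NODE=2
GPUS=($(echo $CUDA_VISIBLE_DEVICES | tr , '\n'))
for SLURM_LOCALID in 0 1 2 3; do
    echo "rank $SLURM_LOCALID -> ${GPUS[$((SLURM_LOCALID % SLURM_GPUS_ON_NODE))]}"
done
# rank 0 -> MIG-aaaa, rank 1 -> MIG-bbbb, rank 2 -> MIG-aaaa, rank 3 -> MIG-bbbb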

I have run using both MIG instances and full A100s. Although I know the ranks are definitely being assigned to each MIG instance that I request, I have not been able to see the actual usage of the instances, because nvidia-smi does not show the same information for MIG instances as it does for the full A100s.

I first tested a 60-3 km (835586-cell) case with only 2 A100s and 2 CPU cores, and, as I assumed it would be, it was extremely slow because of the CPU workload. The A100s would range between 50-70% usage during the "split dynamics-transport integration 3" portion of each timestep.

I then requested 40 tasks (40 CPU cores) and 4 A100s (10 ranks per GPU). This increased the GPU usage across the 4 GPUs to 95-100% during the "split dynamics-transport integration 3" and dramatically reduced the time per timestep. The VRAM usage was still relatively low, with no more than 20 GB of the full 80 GB used per A100, which I expected.

While I do plan on experimenting with the Slurm request configuration and cases, do you have any suggestions on the most efficient approach to running the model, or insights on balancing the CPU and GPU workloads? Thank you again for the help.
 
Thanks for reporting back! And good to hear that it's working now.

With regards to the sluggish performance, I assume you're using the MPAS v8 GPU port? If so, that makes sense. The current develop branch of v8 only has the dynamical core kernels ported, it transfers data from host to device and back around every kernel execution, and, more importantly, all of the physics are still running on the CPUs.

Running with 40 MPI tasks and 4 GPUs will likely be much faster than with 4 MPI tasks and 4 GPUs. The biggest bottlenecks are the radiation and other physics running on the CPUs, and having more CPU processes helps with that. However, this decreases the grid size of the dycore kernels running on the GPUs at any one time, which brings down GPU usage and efficiency. So your observations make sense.

The GPU port of MPAS v8 currently remains experimental and has only been checked for correctness. There is a lot of optimization work that remains to be done. We are close to merging code changes that will reduce the host -> device data transfers, and are looking into implementing GPU-ported radiation schemes, so it will get faster. It might help to know some of your objectives with regards to using the GPU port so that we can offer more targeted advice.
 
Running with 40 MPI tasks and 4 GPUs will likely be much faster than with 4 MPI tasks and 4 GPUs. The biggest bottlenecks are the radiation and other physics running on the CPUs, and having more CPU processes helps with that. However, this decreases the grid size of the dycore kernels running on the GPUs at any one time, which brings down GPU usage and efficiency. So your observations make sense.
The reduction in usage is what I expected, but why did I see an increase in usage of the GPUs once I increased the ranks? I assume it's not the most efficient, but the increase in GPU usage threw me for a loop.

Regarding correctness: a while back, I thought I saw a thread in the forum that mentioned differing solutions for convective precip between CPU and GPU. I haven't had the opportunity to check, but have there been any differences in output between the two methods?

This is primarily a proof of concept for those in the meteorology department, showing that the model can be run here. I did not expect the GPU version to be faster, but it did allow me to become more familiar with GPU porting, both for this model and for software in general. Currently, the only practical benefit would be giving some of our students without access to our larger CPU machines the ability to run simulations on this smaller GPU machine without taking over our CPU resources.

There is a lot of optimization work that remains to be done. We are close to merging code changes that will reduce the host -> device data transfers, and are looking into implementing GPU-ported radiation schemes
Is there a general timeline for the expected changes that could increase the speed of MPAS simulations? While it has been very beneficial to learn the process of running the GPU-enabled version, having a release that increases the speed would greatly benefit some of our users.
 
That's a good question. Running the model on 40 MPI tasks/4 GPUs (compared to 4 MPI tasks on 4 GPUs) splits the bigger chunks of work into 10x as many smaller ones. There are more GPU kernel launches, and nvidia-smi is probably reporting that the device is being utilized more. However, each kernel execution is itself quite inefficient, as GPUs are generally most efficient when they are saturated with computations (a 'just fits' configuration).

I will inquire with others in my team regarding your other two questions and get back to you shortly. But it's likely that the issue with convective precip was with the previous OpenACC ports of MPAS.
 