More info needed about compiling and running MPAS v8.2.0 with GPU capabilities

mandachasteen

New member
With the v8.2.0 release that requires use of GPUs and OpenACC directives, it would be super helpful for users to have more information about how to compile and run the model with these changes.

Some questions:
-- What build target should be used?
-- What new dependencies are needed?
-- What are the best practices for determining the number of CPUs and GPUs to use for a given model run?
 
For those who are interested, I was able to get a 120-km global simulation to successfully run on Derecho with the following steps:

Loading the following modules in my compile environment:
ncarenv/23.09
nvhpc/24.3
cray-mpich/8.1.27
cuda/12.2.1
ncarcompilers/1.0.0
parallel-netcdf/1.12.3
netcdf-mpi/4.9.2
parallelio/2.6.2


Compiling the atmosphere core by running:
make nvhpc CORE=atmosphere AUTOCLEAN=true PRECISION=single OPENACC=true >& compile.log &

Previously, to run this simulation on CPUs (compiled with ifort), I used the following in my PBS script:
#PBS -l select=1:ncpus=128:mpiprocs=128

module purge
module load ncarenv/23.09 intel/2023.2.1 ncarcompilers/1.0.0 cray-mpich/8.1.27 craype/2.7.23 parallel-netcdf/1.12.3

mpiexec ./atmosphere_model


To adapt to GPUs, I instead used:
#PBS -l select=1:ncpus=64:mpiprocs=4:ngpus=4

module purge
module load ncarenv/23.09 nvhpc/24.3 cuda/12.2.1 cray-mpich/8.1.27 parallel-netcdf/1.12.3 parallelio/2.6.2

mpiexec -n 4 -ppn 4 set_gpu_rank ./atmosphere_model
 
Hello Manda,

Given your reply, it seems that you have sussed out the answers to your questions. Thank you for posting what has worked for you! Point taken: we will try to have some information out soon about the v8 OpenACC port.

-- What build target should be used?
Currently we are testing and developing with the nvhpc build target. The GNU and Cray compilers should support OpenACC directives, but we haven't yet added CFLAGS_ACC and FFLAGS_ACC for those targets or tested them.

-- What new dependencies are needed?
When using the NVHPC compilers on Derecho, the cuda module is the only new dependency. The modules you list in your post are almost exactly what I've been using to work on this port. I frequently drop PIO since it isn't required anymore.

-- What are the best practices for determining the number of CPUs and GPUs to use for a given model run?
I have been testing with 4 ranks and 4 GPUs (so 1 rank per GPU). The current port on the master branch covers only 2 routines, and we don't have the entire dycore running on GPUs. This means our time per timestep with the model is worse than for CPU-only runs, and the GPU memory required is less than for CPU-only runs. What I say next will be more accurate as we continue with the port.

I'd say the major concern is the amount of memory available: per node for CPU-only runs and per GPU otherwise. The amount of memory required depends greatly on the number of vertical levels, grid columns, physics suite, and I/O settings (esp. how many ranks perform I/O tasks). For the default settings (55 levels, the "mesoscale_reference" suite, and 1 I/O rank) you can expect the simulation to require about 175 KiB per grid column. With the 40GB A100 GPUs on Derecho, each GPU should be able to support about 239k columns.
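For anyone who wants to redo that arithmetic for a different GPU or per-column cost, here is a rough sketch in bash. The function names are made up for illustration and are not part of MPAS. It treats the 40GB as 40 GiB, which is what reproduces the ~239k figure, and it ignores any memory overhead beyond the per-column estimate, so take the results as an upper bound rather than a guarantee. The 40962-column count assumed for a 120-km global mesh is my assumption for the example.

```shell
#!/usr/bin/env bash
# Rough memory sizing from the figures quoted above (estimates, not measurements).

# columns_per_gpu GPU_MEM_GIB KIB_PER_COLUMN
# Prints how many grid columns fit in one GPU's memory (integer division).
columns_per_gpu() {
    local gpu_mem_gib=$1 kib_per_column=$2
    echo $(( gpu_mem_gib * 1024 * 1024 / kib_per_column ))
}

# min_gpus N_COLUMNS GPU_MEM_GIB KIB_PER_COLUMN
# Prints the smallest GPU count whose combined memory holds the mesh
# (ceiling division); this is a memory bound only, not a throughput estimate.
min_gpus() {
    local n_columns=$1
    local per_gpu
    per_gpu=$(columns_per_gpu "$2" "$3")
    echo $(( (n_columns + per_gpu - 1) / per_gpu ))
}

columns_per_gpu 40 175    # prints 239674
min_gpus 40962 40 175     # assumed 120-km mesh size; prints 1
```

By this estimate a 120-km mesh fits comfortably on a single GPU, so using 4 ranks on 4 GPUs for that case is a throughput choice and not a memory requirement.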

After the memory requirements, the number of ranks and GPUs used mostly affects your time per timestep (a.k.a. model throughput). Generally, using more ranks and/or GPUs should improve your model throughput, as long as there is more work that can be shared without adding too much communication overhead.
 
Thank you -- this is super helpful information!
 