Hello Manda,
Given your reply, it seems you have sussed out the answers to your questions. Thank you for posting what has worked for you! Point taken; we will try to have some information out about the v8 OpenACC port soon.
Currently we are testing and developing with the nvhpc build target. The GNU and Cray compilers should support OpenACC directives, but we haven't added the CFLAGS_ACC and FFLAGS_ACC settings for those targets or tested them yet.
When using the NVHPC compilers on Derecho, the cuda module is the only new dependency. The modules you list in your reply are almost exactly what I've been using to work on this port. I frequently drop PIO since it isn't required anymore.
I have been testing with 4 ranks and 4 GPUs (so 1 rank per GPU). The current port on the master branch covers only 2 routines, and we don't have the entire dycore running on GPUs yet. This means our time per timestep is currently worse than CPU-only runs, and the GPU memory required is less than what a CPU-only run uses. What I say next will become more accurate as we continue with the port.
I'd say the major concern is the amount of memory available: per node for CPU-only runs, and per GPU otherwise. The amount of memory required depends greatly on the number of vertical levels, the number of grid columns, the physics suite, and the I/O settings (especially how many ranks perform I/O tasks). For the default settings (55 levels, the "mesoscale_reference" suite, and 1 I/O rank) you can expect the simulation to require about 175 KiB per grid column. With the 40 GB A100 GPUs on Derecho, each GPU should be able to support about 239k columns.
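If it helps, here is a rough back-of-the-envelope version of that estimate in Python. The ~175 KiB/column figure and the 40 GB A100 come from the paragraph above; treating the 40 GB as 40 GiB of usable memory is my own rounding, and this is just an illustrative sketch, not anything from the MPAS code.

```python
# Rough estimate of how many grid columns fit on one GPU.
# Assumptions (mine): 40 GiB of usable A100 memory, ~175 KiB per column
# with the default settings (55 levels, "mesoscale_reference", 1 I/O rank).
GPU_MEMORY_BYTES = 40 * 1024**3      # 40 GiB A100
BYTES_PER_COLUMN = 175 * 1024        # ~175 KiB per grid column

max_columns_per_gpu = GPU_MEMORY_BYTES // BYTES_PER_COLUMN
print(f"Approx. columns per GPU: {max_columns_per_gpu:,}")  # ~239,000
```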
After the memory requirements, the number of ranks and GPUs used mostly affects your time per timestep (i.e., your model throughput). Generally, using more ranks and/or GPUs should improve your throughput, as long as there is more work that can be shared without adding too much communication overhead.
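As a loose illustration of how those two considerations interact: memory sets a floor on the number of GPUs you need, and adding GPUs beyond that floor mainly reduces the work per GPU (and so the time per timestep) until communication overhead catches up. The mesh size in the sketch below is purely hypothetical, and the per-column and per-GPU figures are the same assumptions as above.

```python
import math

# Hypothetical example: how many GPUs does memory alone demand?
# N_COLUMNS is made up for illustration; 175 KiB/column and 40 GiB/GPU
# are the same assumptions as in the previous sketch.
N_COLUMNS = 500_000                  # hypothetical mesh size
BYTES_PER_COLUMN = 175 * 1024        # ~175 KiB per grid column
GPU_MEMORY_BYTES = 40 * 1024**3      # 40 GiB A100

min_gpus = math.ceil(N_COLUMNS * BYTES_PER_COLUMN / GPU_MEMORY_BYTES)
print(f"Memory floor: {min_gpus} GPU(s)")

# Beyond that floor, more GPUs mainly shrink the work per GPU,
# as long as the extra communication stays cheap.
for n_gpus in (min_gpus, 2 * min_gpus, 4 * min_gpus):
    print(f"{n_gpus} GPU(s): ~{N_COLUMNS // n_gpus:,} columns each")
```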