Hello Manda,
Given your reply, it seems you have sussed out the answers to your questions. Thank you for posting what has worked for you! Point taken; we will try to have some information out about the v8 OpenACC port soon.
Currently we are testing and developing with the nvhpc build target. The GNU and Cray compilers should support OpenACC directives, but we haven't added the CFLAGS_ACC and FFLAGS_ACC settings for those targets or tested them yet.
When using the NVHPC compilers on Derecho, the cuda module is the only new dependency. The modules you list in your reply are almost exactly what I've been using to work on this port. I frequently drop PIO since it isn't required anymore.
I have been testing with 4 ranks and 4 GPUs (so 1 rank per GPU). The current port on the master branch covers only 2 routines, and we don't have the entire dycore running on GPUs yet. This means our time per timestep is currently worse than CPU-only runs, and the GPU memory required is less than what a CPU-only run uses. What I say next will become more accurate as we continue with the port.
I'd say the major concern is the amount of memory available: per node for CPU-only runs, and per GPU otherwise. The amount of memory required depends greatly on the number of vertical levels, the number of grid columns, the physics suite, and the I/O settings (especially how many ranks perform I/O tasks). For the default settings (55 levels, the "mesoscale_reference" suite, and 1 I/O rank) you can expect the simulation to require about 175 KiB per grid column. With the 40 GB A100 GPUs on Derecho, each GPU should be able to support about 239k columns.
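If it helps, here is a rough back-of-the-envelope version of that estimate in Python. The ~175 KiB/column figure and the 40 GB A100 come from the paragraph above; treating the 40 GB as 40 GiB of usable memory is my own rounding, and this is just an illustrative sketch, not anything from the MPAS code.

```python
# Rough estimate of how many grid columns fit on one GPU.
# Assumptions (mine): 40 GiB of usable A100 memory, ~175 KiB per column
# with the default settings (55 levels, "mesoscale_reference", 1 I/O rank).
GPU_MEMORY_BYTES = 40 * 1024**3      # 40 GiB A100
BYTES_PER_COLUMN = 175 * 1024        # ~175 KiB per grid column

max_columns_per_gpu = GPU_MEMORY_BYTES // BYTES_PER_COLUMN
print(f"Approx. columns per GPU: {max_columns_per_gpu:,}")  # ~239,000
```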
After the memory requirements, the number of ranks and GPUs used mostly affects your time per timestep (i.e., your model throughput). Generally, using more ranks and/or GPUs should improve your throughput, as long as there is more work that can be shared without adding too much communication overhead.
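As a loose illustration of how those two considerations interact: memory sets a floor on the number of GPUs you need, and adding GPUs beyond that floor mainly reduces the work per GPU (and so the time per timestep) until communication overhead catches up. The mesh size in the sketch below is purely hypothetical, and the per-column and per-GPU figures are the same assumptions as above.

```python
import math

# Hypothetical example: how many GPUs does memory alone demand?
# N_COLUMNS is made up for illustration; 175 KiB/column and 40 GiB/GPU
# are the same assumptions as in the previous sketch.
N_COLUMNS = 500_000                  # hypothetical mesh size
BYTES_PER_COLUMN = 175 * 1024        # ~175 KiB per grid column
GPU_MEMORY_BYTES = 40 * 1024**3      # 40 GiB A100

min_gpus = math.ceil(N_COLUMNS * BYTES_PER_COLUMN / GPU_MEMORY_BYTES)
print(f"Memory floor: {min_gpus} GPU(s)")

# Beyond that floor, more GPUs mainly shrink the work per GPU,
# as long as the extra communication stays cheap.
for n_gpus in (min_gpus, 2 * min_gpus, 4 * min_gpus):
    print(f"{n_gpus} GPU(s): ~{N_COLUMNS // n_gpus:,} columns each")
```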