
MPAS-A GPU Crashes

MyAtmosphere

I am trying to run MPAS-A with GPUs on a system that has 2 x 64 CPU cores, 4 A100 GPUs with 80 GB each, and 512 GB of RAM.
The model was compiled with the PGI compiler with OpenACC enabled.

I request 2 nodes and set:

export MPAS_DYNAMICS_RANKS_PER_NODE=24
export MPAS_RADIATION_RANKS_PER_NODE=16

I run using the command

srun -n 80 ./atmosphere_model

I get an empty history file, then the model crashes. The error I'm getting is:

Role leader is 0

My role is 1

Role leader is 1

slurmstepd: error: Detected 1 oom-kill event(s) in StepId=730788.0. Some of your processes may have been killed by the cgroup out-of-memory handler.

srun: error: system0034: task 53: Out Of Memory

The log file says:
ERROR: MPAS IO Error: Bad return value from PIO

ERROR: ********************************************************************************

ERROR: Error writing one or more output streams

CRITICAL ERROR: ********************************************************************************

Does anyone know what might be causing this error? I'm running a 15 km case.
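For reference, the setup described above corresponds to a batch script along these lines (the #SBATCH directives and GPU request syntax are assumptions about the system, not something MPAS requires):

#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=40           # 24 dynamics + 16 radiation ranks per node
#SBATCH --gres=gpu:a100:4              # assumed syntax for the 4 A100s on each node

# Per-node split between dynamics ("integration") ranks and radiation ranks
export MPAS_DYNAMICS_RANKS_PER_NODE=24
export MPAS_RADIATION_RANKS_PER_NODE=16

# 2 nodes x (24 + 16) ranks per node = 80 total MPI ranks
srun -n 80 ./atmosphere_model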
 
Hello,
Based on the info provided here and our previous work with MPAS, the issue is most likely stemming from PIO. We have seen the "MPAS IO Error: Bad return value from PIO" message when an incompatible PIO was used to run the OpenACC MPAS 6.3. Are you already using the latest available PIO version? If so, can you provide the version numbers of your NVHPC compiler, NetCDF, and PIO libraries?
 
Would it be possible to test with the latest PIO, which is PIO 2.5.10? We have had issues with PIO 2.5.4 in the past.
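If it helps, these are some common ways to check the relevant versions in an environment like yours (this assumes the usual *-config helper scripts are on your PATH; PIO itself usually has to be checked from its installed headers or from how it was built):

pgfortran --version       # or nvfortran --version with newer NVHPC releases
nc-config --version       # NetCDF-C
nf-config --version       # NetCDF-Fortran
pnetcdf-config --version  # Parallel-NetCDF (PnetCDF)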
 
It may also be worth trying to set
io_type="pnetcdf,cdf5"
in the definition of your "output" and "restart" streams. The default I/O type ("pnetcdf") writes files in CDF-2 format and cannot accommodate individual records larger than 4 GiB. On the 15-km mesh, for example, the 'ozmixm' field that is written to the default history files is around nMonths * nOznLevels * nCells * 4 bytes/value = 12 * 59 * 2621442 * 4 = 7.4 GB (and double that if you're writing double-precision fields).
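In the streams XML file (streams.atmosphere), that amounts to adding an io_type attribute to each of those stream definitions. For example (the filename templates and intervals below are just the usual defaults; keep whatever your file already has and only add the io_type attribute):

<immutable_stream name="restart"
                  type="input;output"
                  io_type="pnetcdf,cdf5"
                  filename_template="restart.$Y-$M-$D_$h.$m.$s.nc"
                  input_interval="initial_only"
                  output_interval="1_00:00:00"/>

<stream name="output"
        type="output"
        io_type="pnetcdf,cdf5"
        filename_template="history.$Y-$M-$D_$h.$m.$s.nc"
        output_interval="6:00:00">
        ...
</stream>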
 

Thanks! Now I get some data in the history file and the PIO error is gone. However, the job hangs and I get an out-of-memory error:
Some of your processes may have been killed by the cgroup out-of-memory handler.

Do you think the system is too small for this job? I requested 4 nodes and 160 total ranks.
 
It might be worth trying to get a lower-resolution simulation running first before trying with the 15-km mesh. One possibility would be to use the 60-km mesh (with 163842 horizontal grid cells) and to try running across 4 A100s on a single node. In this case, you could request 8 MPI ranks on one node, setting the environment variables
export MPAS_DYNAMICS_RANKS_PER_NODE=4
export MPAS_RADIATION_RANKS_PER_NODE=4
If the 60-km simulation works, that would be a good indicator that the library and compiler versions are all compatible with the MPAS-A GPU code.
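In Slurm terms, that single-node test might look roughly like this (the #SBATCH lines and GPU request syntax are again assumptions about your cluster):

#SBATCH --nodes=1
#SBATCH --ntasks-per-node=8            # 4 dynamics + 4 radiation ranks
#SBATCH --gres=gpu:a100:4              # assumed syntax for all 4 A100s on the node

export MPAS_DYNAMICS_RANKS_PER_NODE=4
export MPAS_RADIATION_RANKS_PER_NODE=4

srun -n 8 ./atmosphere_model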
 
OK, I was able to run the low-resolution case (120 km). I requested 1 node and 4 ranks each for dynamics and radiation, but I set the total MPI rank count to 16:
srun -n 16 --mpi=pmi2 ./atmosphere_model

My question is: how do I know whether it ran on the GPUs and not on the CPUs only?

The end of the file log.atmosphere.role01.0000.out reads:


timer_name total calls min max avg pct_tot pct_par par_eff
1 total time 1182.08386 1 1182.08313 1182.08386 1182.08350 100.00 0.00 1.00
2 initialize 5.03950 1 5.03937 5.03950 5.03943 0.43 0.43 1.00
2 time integration 948.59100 600 1.56835 2.07839 1.58096 80.25 80.25 1.00
3 atm_rk_integration_setup 5.03239 600 0.00757 0.00850 0.00829 0.43 0.53 0.99
3 atm_compute_moist_coefficients 5.39747 600 0.00831 0.00927 0.00897 0.46 0.57 1.00
3 physics_get_tend 2.79485 600 0.00387 0.00500 0.00447 0.24 0.29 0.96
3 atm_compute_vert_imp_coefs 13.98097 1800 0.00697 0.00842 0.00765 1.18 1.47 0.99
3 atm_compute_dyn_tend 433.47076 5400 0.06790 0.10689 0.07980 36.67 45.70 0.99
3 small_step_prep 26.35007 5400 0.00466 0.00549 0.00482 2.23 2.78 0.99
3 atm_advance_acoustic_step 107.47188 7200 0.01234 0.01729 0.01445 9.09 11.33 0.97
3 atm_divergence_damping_3d 10.66192 7200 0.00123 0.00206 0.00141 0.90 1.12 0.95
3 atm_recover_large_step_variables 71.73347 5400 0.01210 0.01627 0.01318 6.07 7.56 0.99
3 atm_compute_solve_diagnostics 237.04735 5400 0.03325 0.21557 0.04362 20.05 24.99 0.99
3 atm_rk_dynamics_substep_finish 10.43771 1800 0.00298 0.00853 0.00570 0.88 1.10 0.98
3 atm_rk_reconstruct 1.61023 600 0.00239 0.00388 0.00261 0.14 0.17 0.97
3 atm_rk_summary 0.90147 600 0.00114 0.00264 0.00132 0.08 0.10 0.88
3 mpas update GPU data on host 0.00103 600 0.00000 0.00000 0.00000 0.00 0.00 0.97

-----------------------------------------
Total log messages printed:
Output messages = 6324
Warning messages = 0
Error messages = 0
Critical error messages = 0
-----------------------------------------
Logging complete. Closing file at 2023/03/16 14:44:25

So, there is a total of 600 calls to update GPU data on the host.


The end of the file log.atmosphere.role02.0000.out reads:

timer_name total calls min max avg pct_tot pct_par par_eff
1 total time 2.91916 1 2.91754 2.91916 2.91817 100.00 0.00 1.00
2 initialize 2.91083 1 2.90945 2.91083 2.91003 99.71 99.71 1.00
2 time integration 0.00246 600 0.00000 0.00002 0.00000 0.08 0.08 0.98

-----------------------------------------
Total log messages printed:
Output messages = 908
Warning messages = 0
Error messages = 0
Critical error messages = 0
-----------------------------------------
Logging complete. Closing file at 2023/03/16 14:24:46

Is there any other way to check whether the GPUs on the node were utilized?
I should also mention that in the profiler output I get:


Percent of CPU this job got: 0%

This is very confusing to me.
Any explanation?

Thanks,
 
I appreciate that the reason may ultimately be system- or scheduler-specific, but is there a rationale for requesting 16 MPI ranks if you only use 4 ranks each for dynamics and for radiation (8 ranks total)?

Anyway, I'm not particularly knowledgeable when it comes to OpenACC profiling, so I don't think I have any good ideas for how to verify that specific parts of the code are running on GPUs. I believe there are NVIDIA profiling tools that can provide this sort of information, though. One simple test may be to just clean and recompile the code without OpenACC (i.e., don't specify OPENACC=true in your build command), and run again with 8 MPI tasks. If the model is running significantly slower, that would be a simple indication that OPENACC=true is effective.
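For what it's worth, two low-tech checks along those lines (the make target assumes the PGI build mentioned earlier, and nvidia-smi assumes you can get a shell on the compute node while the job is running; adjust both to your system):

# 1) While the job is running, check GPU activity on the compute node:
#    an atmosphere_model process in the list and nonzero GPU utilization
#    indicate the OpenACC regions really are executing on the GPUs.
nvidia-smi

# 2) CPU-only comparison: clean and rebuild without OpenACC, then rerun
#    with the same 8 ranks and compare the total runtime.
make clean CORE=atmosphere
make pgi CORE=atmosphere               # note: no OPENACC=true this time
srun -n 8 --mpi=pmi2 ./atmosphere_model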
 
If you get role1 logs, it proves that you are running on GPUs. If the code is compiled without OPENACC, or with OPENACC=false, a role3 log (but not role1 or role2) is written.

Role1 logs contain all of the timing information except for the radiation calls. Role2 logs contain the radiation calls and their associated timers.

Nick
 
Just to add some information regarding the "role" log files: as part of the work to port the model to GPUs, we had also introduced infrastructure to enable different parts of the model to run under different MPI intracommunicators (essentially by splitting MPI_COMM_WORLD). The determination of which MPI rank should run each part of the model is done by assigning "roles" to each rank according to the presence (or absence) of environment variables MPAS_DYNAMICS_RANKS_PER_NODE and MPAS_RADIATION_RANKS_PER_NODE. Strictly speaking, it would be possible for the model to run with lagged radiation (with "integration" roles and "radiation" roles) entirely on CPUs, and so I was reluctant to suggest that the simple presence of "role1" log files could be taken as proof that some parts of the model were actually executing on GPU hardware.
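One quick way to see how the roles were assigned in a particular run is simply to look at which per-role log files were produced (the role03 file name here is an assumption based on the role01/role02 naming above):

ls log.atmosphere.role01.*.out    # dynamics/"integration" ranks
ls log.atmosphere.role02.*.out    # radiation ranks
ls log.atmosphere.role03.*.out    # written instead when the role split is not in use (e.g., a build without OPENACC, per the note above)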
 