MPAS-A GPU Crashes

MyAtmosphere

New member
I am trying to run MPAS-A with GPUs on a system that has 2 x 64 CPU cores, 4 A100 GPUs (80 GB each), and 512 GB of RAM.
The model was compiled with the PGI compilers with OpenACC enabled.

I request 2 nodes and set:

export MPAS_DYNAMICS_RANKS_PER_NODE=24
export MPAS_RADIATION_RANKS_PER_NODE=16

I run using the command

srun -n 80 ./atmosphere_model
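The rank count given to the launcher has to be consistent with the two per-node settings. A minimal sketch of that bookkeeping follows. It assumes the total is nodes x (dynamics ranks + radiation ranks), which is inferred from the numbers used in this thread rather than taken from MPAS documentation, and the function name is hypothetical.

```python
def total_mpi_ranks(nodes, dynamics_per_node, radiation_per_node):
    """Total MPI ranks implied by the per-node settings (assumed relation)."""
    return nodes * (dynamics_per_node + radiation_per_node)


# 2 nodes, MPAS_DYNAMICS_RANKS_PER_NODE=24, MPAS_RADIATION_RANKS_PER_NODE=16
print(total_mpi_ranks(2, 24, 16))  # 80
```

Under that assumption, the value passed to -n should change whenever the node count or either environment variable changes.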

I get an empty history file and then the model crashes. The error I'm getting is:

Role leader is 0

My role is 1

Role leader is 1

slurmstepd: error: Detected 1 oom-kill event(s) in StepId=730788.0. Some of your processes may have been killed by the cgroup out-of-memory handler.

srun: error: system0034: task 53: Out Of Memory

The log file says:
ERROR: MPAS IO Error: Bad return value from PIO

ERROR: ********************************************************************************

ERROR: Error writing one or more output streams

CRITICAL ERROR: ********************************************************************************

Does anyone know what might be causing this error? I'm running a 15-km case.
 

rrpk

New member
Hello,
Based on the information provided here and our previous work with MPAS, the issue most likely stems from PIO. We have seen the "MPAS IO Error: Bad return value from PIO" error when an incompatible PIO version was used to run the OpenACC MPAS 6.3. Are you already using the latest available PIO version? If so, can you provide the version numbers of the NVHPC compiler and of the NetCDF and PIO libraries?
 

rrpk

New member
Would it be possible to test with the latest PIO, which is PIO 2.5.10? We have had issues with PIO 2.5.4 in the past.
 

mgduda

Administrator
Staff member
It may also be worth trying to set
io_type="pnetcdf,cdf5"
in the definition of your "output" and "restart" streams. The default I/O type ("pnetcdf") writes files in CDF-2 format and cannot accommodate individual records larger than 4 GiB. On the 15-km mesh, for example, the 'ozmixm' field that is written to the default history files is around nMonths * nOznLevels * nCells * 4 bytes/value = 12 * 59 * 2621442 * 4 = 7.4 GB (and double that if you're writing double-precision fields).
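The record-size arithmetic above can be checked with a short sketch. The dimension values are the ones quoted in the post; the function name and the 4-byte default are illustrative choices, not part of MPAS.

```python
GIB = 2**30  # bytes in one GiB


def record_bytes(n_months, n_ozn_levels, n_cells, bytes_per_value=4):
    """Size in bytes of one record of a field with these dimensions."""
    return n_months * n_ozn_levels * n_cells * bytes_per_value


size = record_bytes(12, 59, 2621442)  # 'ozmixm' on the 15-km mesh
print(size)                  # 7423923744
print(round(size / 1e9, 2))  # 7.42 (GB)
print(size > 4 * GIB)        # True: larger than the 4 GiB record limit
```

Passing bytes_per_value=8 gives the double-precision case mentioned above, which is twice as large.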
 

MyAtmosphere

New member
Thanks! Now I get some data in the history file and the PIO error has disappeared. However, the job hangs and I get an out-of-memory error:
Some of your processes may have been killed by the cgroup out-of-memory handler.

Do you think the system is too small for this job? I requested 4 nodes and 160 ranks in total.
 

mgduda

Administrator
Staff member
It might be worth trying to get a lower-resolution simulation running first before trying with the 15-km mesh. One possibility would be to use the 60-km mesh (with 163842 horizontal grid cells) and to try running across 4 A100s on a single node. In this case, you could request 8 MPI ranks on one node, setting the environment variables
export MPAS_DYNAMICS_RANKS_PER_NODE=4
export MPAS_RADIATION_RANKS_PER_NODE=4
If the 60-km simulation works, that would be a good indicator that the library and compiler versions are all compatible with the MPAS-A GPU code.
 

MyAtmosphere

New member
OK, I was able to run the low-resolution case (120 km). I requested 1 node and 4 ranks each for dynamics and radiation, but I set the total number of MPI ranks to 16:
srun -n 16 --mpi=pmi2 ./atmosphere_model

My question is how do I know if it ran on GPUs and not CPUs only?

The end of the file log.atmosphere.role01.0000.out reads:


timer_name                                total  calls         min         max         avg  pct_tot  pct_par  par_eff
1 total time                         1182.08386      1  1182.08313  1182.08386  1182.08350   100.00     0.00     1.00
2 initialize                            5.03950      1     5.03937     5.03950     5.03943     0.43     0.43     1.00
2 time integration                    948.59100    600     1.56835     2.07839     1.58096    80.25    80.25     1.00
3 atm_rk_integration_setup              5.03239    600     0.00757     0.00850     0.00829     0.43     0.53     0.99
3 atm_compute_moist_coefficients        5.39747    600     0.00831     0.00927     0.00897     0.46     0.57     1.00
3 physics_get_tend                      2.79485    600     0.00387     0.00500     0.00447     0.24     0.29     0.96
3 atm_compute_vert_imp_coefs           13.98097   1800     0.00697     0.00842     0.00765     1.18     1.47     0.99
3 atm_compute_dyn_tend                433.47076   5400     0.06790     0.10689     0.07980    36.67    45.70     0.99
3 small_step_prep                      26.35007   5400     0.00466     0.00549     0.00482     2.23     2.78     0.99
3 atm_advance_acoustic_step           107.47188   7200     0.01234     0.01729     0.01445     9.09    11.33     0.97
3 atm_divergence_damping_3d            10.66192   7200     0.00123     0.00206     0.00141     0.90     1.12     0.95
3 atm_recover_large_step_variables     71.73347   5400     0.01210     0.01627     0.01318     6.07     7.56     0.99
3 atm_compute_solve_diagnostics       237.04735   5400     0.03325     0.21557     0.04362    20.05    24.99     0.99
3 atm_rk_dynamics_substep_finish       10.43771   1800     0.00298     0.00853     0.00570     0.88     1.10     0.98
3 atm_rk_reconstruct                    1.61023    600     0.00239     0.00388     0.00261     0.14     0.17     0.97
3 atm_rk_summary                        0.90147    600     0.00114     0.00264     0.00132     0.08     0.10     0.88
3 mpas update GPU data on host          0.00103    600     0.00000     0.00000     0.00000     0.00     0.00     0.97





-----------------------------------------
Total log messages printed:
Output messages = 6324
Warning messages = 0
Error messages = 0
Critical error messages = 0
-----------------------------------------
Logging complete. Closing file at 2023/03/16 14:44:25

So, there is a total of 600 calls to update GPU data on the host.
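To compare timers without reading the table by eye, the rows can be parsed with a short script. This is a sketch that assumes the layout seen in the log above (a nesting level, a timer name that may contain spaces, then eight numeric columns); the function name is hypothetical and the three rows are copied from the log.

```python
def parse_timer_line(line):
    """Split one timer row into its level, name, total time and call count."""
    tokens = line.split()
    # The last eight tokens are the numeric columns; the name is what remains.
    total, calls = tokens[-8], tokens[-7]
    return {
        "level": int(tokens[0]),
        "name": " ".join(tokens[1:-8]),
        "total": float(total),
        "calls": int(calls),
    }


rows = [
    "1 total time 1182.08386 1 1182.08313 1182.08386 1182.08350 100.00 0.00 1.00",
    "3 atm_compute_dyn_tend 433.47076 5400 0.06790 0.10689 0.07980 36.67 45.70 0.99",
    "3 mpas update GPU data on host 0.00103 600 0.00000 0.00000 0.00000 0.00 0.00 0.97",
]
timers = [parse_timer_line(row) for row in rows]

# List timers from most to least expensive.
for t in sorted(timers, key=lambda t: t["total"], reverse=True):
    print(f'{t["name"]:<32} {t["total"]:>12.5f} s {t["calls"]:>6} calls')
```

Note that a call count only shows that a timed region was entered. On its own it does not establish that kernels actually executed on the device.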


The end of the file log.atmosphere.role02.0000.out reads:

timer_name                                total  calls         min         max         avg  pct_tot  pct_par  par_eff
1 total time                            2.91916      1     2.91754     2.91916     2.91817   100.00     0.00     1.00
2 initialize                            2.91083      1     2.90945     2.91083     2.91003    99.71    99.71     1.00
2 time integration                      0.00246    600     0.00000     0.00002     0.00000     0.08     0.08     0.98





-----------------------------------------
Total log messages printed:
Output messages = 908
Warning messages = 0
Error messages = 0
Critical error messages = 0
-----------------------------------------
Logging complete. Closing file at 2023/03/16 14:24:46

Is there any other way to check if GPUs on the node were utilized?
I should mention that in the profiler I get:


Percent of CPU this job got: 0%

This is very confusing to me.
Any explanation?

Thanks,
 

mgduda

Administrator
Staff member
I appreciate that the reason may ultimately be system- or scheduler-specific, but is there a rationale for requesting 16 MPI ranks if you only use 4 ranks each for dynamics and for radiation (8 ranks total)?

Anyway, I'm not particularly knowledgeable when it comes to OpenACC profiling, so I don't think I have any good ideas for how to verify that specific parts of the code are running on GPUs. I believe there are NVIDIA profiling tools that can provide this sort of information, though. One simple test may be to just clean and recompile the code without OpenACC (i.e., don't specify OPENACC=true in your build command), and run again with 8 MPI tasks. If the model is running significantly slower, that would be a simple indication that OPENACC=true is effective.
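One lightweight check, short of a full profiler, is to sample nvidia-smi on the compute node while the model is running and see whether the GPUs report non-zero utilization and memory use. The sketch below assumes nvidia-smi is installed and reachable from where the script runs, which depends on the system and scheduler. The helper names are hypothetical, and the sample text is made up to show the CSV layout; it is not measured output.

```python
import shutil
import subprocess

# Query index, utilization (%) and memory used (MiB) as bare CSV values.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,utilization.gpu,memory.used",
    "--format=csv,noheader,nounits",
]


def parse_gpu_csv(text):
    """Parse 'index, util, memory' lines into a list of dicts."""
    gpus = []
    for line in text.strip().splitlines():
        index, util, mem = (field.strip() for field in line.split(","))
        gpus.append({"index": int(index), "util_pct": int(util), "mem_mib": int(mem)})
    return gpus


def query_gpus():
    """Sample the GPUs on the current node; returns [] if nvidia-smi is absent."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_csv(result.stdout)


# Made-up sample in the same layout, only to exercise the parser.
sample = "0, 93, 30211\n1, 0, 3\n"
for gpu in parse_gpu_csv(sample):
    print(gpu["index"], "busy" if gpu["util_pct"] > 0 else "idle")
```

Calling query_gpus() every few seconds during time integration would show whether the devices are doing work. Utilization that stays at zero for the whole run would suggest the model is executing on the CPUs only.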
 