Scalability of the MPAS-A GPU

Sylvia · Jan 11, 2022

Hello,

I ran MPAS-A GPU version successfully using the following command:
$export MPAS_DYNAMICS_RANKS_PER_NODE=4
$export MPAS_RADIATION_RANKS_PER_NODE=2
$mpirun -np 6 ./atmosphere_model

The test case I used is the CFSR example (Sample real-data input files) provided on the MPAS-A official site (http://mpas-dev.github.io/).

But when I ran MPAS-A GPU version using the following command:
$export MPAS_DYNAMICS_RANKS_PER_NODE=24
$export MPAS_RADIATION_RANKS_PER_NODE=16
$mpirun -np 40 ./atmosphere_model

And I met the following segmentation error:

------------------------------------------------------------------------------------------------------------
WARNING: No preset parameters were found for the device that Open MPI
detected:

Local host: gpu
Device name: mlx5_0
Device vendor ID: 0x02c9
Device vendor part ID: 4123

Default device parameters will be used, which may result in lower
performance. You can edit any of the files specified by the
btl_openib_device_param_files MCA parameter to set values for your
device.

NOTE: You can turn off this warning by setting the MCA parameter
btl_openib_warn_no_device_params_found to 0.
--------------------------------------------------------------------------
[gpu:395574] [[59095,0],0] ORTE_ERROR_LOG: Out of resource in file ../../orte/util/show_help.c at line 507
[gpu:395574] 158 more processes have sent help message help-mpi-btl-openib.txt / no device params found
[gpu:395574] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
My role is 2
Role leader is 0
My role is 1
Role leader is 1
My role is 2
Role leader is 0
My role is 1
Role leader is 1
My role is 2
Role leader is 0
My role is 1
Role leader is 1
My role is 2
Role leader is 0
My role is 1
Role leader is 1
My role is 2
Role leader is 0
My role is 1
Role leader is 1
My role is 2
Role leader is 0
My role is 1
Role leader is 1
My role is 2
Role leader is 0
My role is 1
Role leader is 1
My role is 2
Role leader is 0
My role is 1
Role leader is 1
My role is 1
Role leader is 1
My role is 1
Role leader is 1
My role is 1
Role leader is 1
My role is 1
Role leader is 1
My role is 2
Role leader is 0
My role is 1
Role leader is 1
My role is 2
Role leader is 0
My role is 1
Role leader is 1
My role is 2
Role leader is 0
My role is 1
Role leader is 1
My role is 2
Role leader is 0
My role is 1
Role leader is 1
My role is 2
Role leader is 0
My role is 1
Role leader is 1
My role is 2
Role leader is 0
My role is 1
Role leader is 1
My role is 2
Role leader is 0
My role is 1
Role leader is 1
My role is 2
Role leader is 0
My role is 1
Role leader is 1
My role is 1
Role leader is 1
My role is 1
Role leader is 1
My role is 1
Role leader is 1
My role is 1
Role leader is 1
[gpu:395580] *** Process received signal ***
[gpu:395580] Signal: Segmentation fault (11)
[gpu:395580] Signal code: Address not mapped (1)
[gpu:395580] Failing at address: (nil)
[gpu:395580] [ 0] /usr/lib/gcc/x86_64-redhat-linux/8/../../../../lib64/libpthread.so.0(+0x12b20)[0x15054402db20]
[gpu:395580] [ 1] /root/bianqy/user/local/nvidia_hpc_sdk_multi/Linux_x86_64/22.1/comm_libs/openmpi/openmpi-3.1.5/lib/libopen-pal.so.40(mca_btl_smcuda_sendi+0x6e3)[0x150541c4c603]
[gpu:395580] [ 2] /root/bianqy/user/local/nvidia_hpc_sdk_multi/Linux_x86_64/22.1/comm_libs/openmpi/openmpi-3.1.5/lib/libmpi.so.40(mca_pml_ob1_isend+0x3a4)[0x1505463ae744]
[gpu:395580] [ 3] /root/bianqy/user/local/nvidia_hpc_sdk_multi/Linux_x86_64/22.1/comm_libs/openmpi/openmpi-3.1.5/lib/libmpi.so.40(ompi_coll_base_bcast_intra_binomial+0x1d0)[0x1505462cce50]
[gpu:395580] [ 4] /root/bianqy/user/local/nvidia_hpc_sdk_multi/Linux_x86_64/22.1/comm_libs/openmpi/openmpi-3.1.5/lib/libmpi.so.40(ompi_coll_tuned_bcast_intra_dec_fixed+0x35)[0x1505462dbc95]
[gpu:395580] [ 5] /root/bianqy/user/local/nvidia_hpc_sdk_multi/Linux_x86_64/22.1/comm_libs/openmpi/openmpi-3.1.5/lib/libmpi.so.40(MPI_Bcast+0xd9)[0x15054628dc99]
[gpu:395580] [ 6] ./atmosphere_model[0xae2805]
[gpu:395580] [ 7] ./atmosphere_model[0xae98f5]
[gpu:395580] [ 8] ./atmosphere_model[0xae7d6e]
[gpu:395580] [ 9] ./atmosphere_model[0xaaf31d]
[gpu:395580] [10] ./atmosphere_model[0xaaee93]
[gpu:395580] [11] ./atmosphere_model[0xa099c3]
[gpu:395580] [12] ./atmosphere_model[0xaa82a8]
[gpu:395580] [13] ./atmosphere_model[0xa220d6]
[gpu:395580] [14] ./atmosphere_model[0xa1edea]
[gpu:395580] [15] ./atmosphere_model[0xa1d689]
[gpu:395580] [16] ./atmosphere_model[0x5194c2]
[gpu:395580] [17] ./atmosphere_model[0x40d444]
[gpu:395580] [18] ./atmosphere_model[0x40b697]
[gpu:395580] [19] ./atmosphere_model[0x40b633]
[gpu:395580] [20] /usr/lib/gcc/x86_64-redhat-linux/8/../../../../lib64/libc.so.6(__libc_start_main+0xf3)[0x1505433cf493]
[gpu:395580] [21] ./atmosphere_model[0x40b52e]
[gpu:395580] *** End of error message ***
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 0 on node gpu exited on signal 11 (Segmentation fault).
-----------------------------------------------------------------------------------------------------

I just wonder if the test could only be run using limited processes? Or it has something to do with my platform? Followed are my platform information:
CPU: 2 AMD EPYC 7742 64-Core Processor
GPU: 8 Nvidia A100 card

I'm very new to MPAS-A GPU. Any suggestions are appreciated!

mgduda · Jan 12, 2022

The MPI ranks that handle dynamics run on GPUs, so MPAS_DYNAMICS_RANKS_PER_NODE=24 would attempt to assign more than one MPI rank to each of your 8 A100s. It could be that there's some issue in assigning multiple MPI ranks to each GPU on your system, and the details of how to do this would probably be system-specific. Could you try with the following?

Code:

export MPAS_DYNAMICS_RANKS_PER_NODE=8
export MPAS_RADIATION_RANKS_PER_NODE=8

In general, running across 8 GPUs should be no problem. We've found that some care is needed in setting up job scripts and mpiexec commands; see, e.g., the example ORNL Summit job script (which is probably not applicable to your case, but illustrates the degree to which we've tried to optimally lay out MPI ranks on Summit).

Sylvia · Jan 13, 2022

Hi mgduda,

Thank you for your reply. I tried the settings as you suggested,

export MPAS_DYNAMICS_RANKS_PER_NODE=8
export MPAS_RADIATION_RANKS_PER_NODE=8

it works well.

I have three questions:
(i) I wonder if MPAS-A GPU could assign multiple MPI ranks on one GPU? If it did, it maybe a specific problem for my machine that I cannot set MPAS_RADIATION_RANKS_PER_NODE to more than 8 ranks. And if I have 8 GPU cards, with the above setting, how many MPI ranks will each GPU card have? 1 or 2?

(ii) The second question is about the ORNL Summit job script. In order to understand the job script, I searched that the resource set of Summit supercomputer seems consists of 2 Power 9 CPU (each has 22 cores) and 6 Tesla V100 GPU. But I'm not quite sure about it. Followed is the ORNL Summit job script:
--------------------------------
#!/bin/bash
#BSUB -o %J.out
#BSUB -e %J.err
#BSUB -nnodes 2
#BSUB -alloc_flags "gpumps smt1"
#BSUB -P XXXXX
#BSUB -J MPAS_GPU
#BSUB -q batch
#BSUB -W 5

export OMP_NUM_THREADS=1
export OMP_STACKSIZE=64M

export PAMI_ENABLE_STRIPING=0
export PAMI_IBV_ENABLE_OOO_AR=1
export PAMI_IBV_QP_SERVICE_LEVEL=8
export PAMI_IBV_ADAPTER_AFFINITY=1
export PAMI_IBV_DEVICE_NAME="mlx5_0:1"
export PAMI_IBV_DEVICE_NAME_1="mlx5_3:1"

export MPAS_DYNAMICS_RANKS_PER_NODE=24
export MPAS_RADIATION_RANKS_PER_NODE=16

jsrun --smpiargs="-gpu" --stdio_mode=prepended --nrs 2 --cpu_per_rs 42 --gpu_per_rs 6 -d plane:40 --bind packed:1 --np 80 ./unset.sh ./atmosphere_model
-----------------------------------------
From the above script, we can see that it uses two resource set with 42 CPU and 6 GPU per resource set to run MPAS-A GPU, but "export MPAS_DYNAMICS_RANKS_PER_NODE=24" indicates it uses 4 (24/6) resource set? I may didn't understand it right! The script seems a little confusing to me.

(iii) I also tested other "MPAS_DYNAMICS_RANKS_PER_NODE" and "MPAS_RADIATION_RANKS_PER_NODE" settings, if both of them are less than or equal to 8, it works well. Others settings will produce a "Segmentation fault". But for
$export MPAS_DYNAMICS_RANKS_PER_NODE=8
$export MPAS_RADIATION_RANKS_PER_NODE=12
$mpirun --allow-run-as-root -np 20 ./atmosphere_model

it works well sometimes, and it also produce a "Segmentation fault" sometimes. I have no idea why this happens. I attached the log files and hope it helps.

Thank you!

Scalability of the MPAS-A GPU

Sylvia

New member

mgduda

Administrator

Sylvia

New member

Attachments