UCX error when running ACC version

Questions about and discussion of the GPU-enabled MPAS-Atmosphere branch.

Post by luise1030 » Sun May 09, 2021 3:44 pm

Hello,

When I run the `jw_baroclinic_wave` case with the ACC (OpenACC) version on my A100 environment, as specified in the `Environment` section below, the following UCX error occurs. After narrowing it down, I found that the MPI_Wait error follows an MPI_Isend operation issued inside an `acc host_data use_device(tempBuffer)` region. I ran the OSU benchmarks on the same platform to confirm that MPI point-to-point communication works, and it does.

Could anyone suggest what might be going wrong? Thanks.

[1620585067.066380] [4313c8b49592:281825:0] cma_ep.c:87 UCX ERROR process_vm_readv(pid=281827 length=144768) returned -1: Bad address
[1620585067.066380] [4313c8b49592:281827:0] cma_ep.c:87 UCX ERROR process_vm_readv(pid=281825 length=144560) returned -1: Bad address
[4313c8b49592:281825] *** An error occurred in MPI_Wait
[4313c8b49592:281825] *** reported by process [2477588481,1]
[4313c8b49592:281825] *** on communicator MPI COMMUNICATOR 6 DUP FROM 5
[4313c8b49592:281825] *** MPI_ERR_INTERN: internal error
[4313c8b49592:281825] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[4313c8b49592:281825] *** and potentially your MPI job)
[4313c8b49592:281820] 1 more process has sent help message help-mpi-errors.txt / mpi_errors_are_fatal
[4313c8b49592:281820] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages


Cases of execution:
`jw_baroclinic_wave` idealized case of v5

Commands for execution:
export MPAS_DYNAMICS_RANKS_PER_NODE=0
export MPAS_RADIATION_RANKS_PER_NODE=1
mpirun --allow-run-as-root -np 1 ./init_atmosphere_model

export MPAS_DYNAMICS_RANKS_PER_NODE=2
export MPAS_RADIATION_RANKS_PER_NODE=2
mpirun --allow-run-as-root -np 4 ./atmosphere_model

Environment:
docker: nvcr.io/nvidia/nvhpc:21.2-devel-cuda_multi-ubuntu20.04
GPU: A100x2
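
A check that may help narrow this down: the `cma_ep.c` lines in the log come from UCX's CMA transport, which copies between processes with `process_vm_readv` and works on host memory, so it seems worth confirming whether the MPI stack in the container reports CUDA support at all. The sketch below assumes the container's MPI is Open MPI (so that `ompi_info` exists) and that `ucx_info` is on the PATH; neither has been verified for this image, and `check_cuda_support` is just a hypothetical helper name. Missing tools are reported rather than treated as errors.

```shell
#!/bin/sh
# Sketch: report whether the MPI stack advertises CUDA support.
# Assumption: Open MPI (ompi_info) and UCX (ucx_info) are the tools in use.
# check_cuda_support is a hypothetical helper name, not part of MPAS.

check_cuda_support() {
    if command -v ompi_info >/dev/null 2>&1; then
        # A CUDA-aware Open MPI build prints a line ending in
        # "mpi_built_with_cuda_support:value:true".
        ompi_info --parsable --all | grep "mpi_built_with_cuda_support:value" \
            || echo "ompi_info: no mpi_built_with_cuda_support entry found"
    else
        echo "ompi_info: not found on PATH"
    fi

    if command -v ucx_info >/dev/null 2>&1; then
        # List UCX transports and devices, keeping only CUDA-related lines.
        ucx_info -d | grep -i "cuda" \
            || echo "ucx_info: no CUDA transports listed"
    else
        echo "ucx_info: not found on PATH"
    fi
}

check_cuda_support
```

If either check reports no CUDA support, the device pointer exposed by `host_data use_device` would presumably be handed to a host-only transport, which would be consistent with the `Bad address` result above, though that is a guess rather than a confirmed diagnosis.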

MPAS version:

atmosphere/v6.x-openacc @ 498393d2c5cf36f73db8925d717ae449c3660d40 with the following diff:

diff --git a/Makefile b/Makefile
index cf8a5e5f..04b35df6 100644
--- a/Makefile
+++ b/Makefile
@@ -1,4 +1,4 @@
-MODEL_FORMULATION =
+MODEL_FORMULATION = -DROTATED_GRID


dummy:
@@ -96,7 +96,7 @@ pgi:
"LDFLAGS_DEBUG = -O0 -g -Mbounds -Mchkptr -Ktrap=divz,fp,inv,ovf -traceback" \
"FFLAGS_OMP = -mp" \
"CFLAGS_OMP = -mp" \
- "FFLAGS_ACC = -Mnofma -acc -ta=tesla:cc70 -Minfo=accel" \
+ "FFLAGS_ACC = -Mnofma -acc -ta=tesla:cc80,cuda11.2 -Minfo=accel" \
"CFLAGS_ACC =" \
"CORE = $(CORE)" \
