UCX error when running ACC version

This post is from a previous version of the WRF & MPAS-A Support Forum. New replies have been disabled; if you have follow-up questions related to this post, please start a new thread from the forum home page.

luise1030

New member
Hello,

When I use `jw_baroclinic_wave` to run the ACC version on my A100 environment (as specified in the `Environment` section below), the following UCX error occurs. From a quick investigation, the MPI_Wait error is raised after an MPI_Isend issued inside an `acc host_data use_device(tempBuffer)` region. I ran the OSU benchmarks on my platform to make sure MPI point-to-point communication works, and it does.
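
For reference, the communication pattern in question looks roughly like the sketch below. This is a minimal standalone reproduction, not the actual MPAS halo-exchange code: the buffer name is taken from the directive above, while the sizes, tags, and ring exchange are invented, and it assumes a CUDA-aware MPI so that device addresses can be passed to the MPI calls.

! Minimal sketch of the pattern described above: MPI_Isend on a device
! buffer inside an OpenACC host_data region, followed by MPI_Wait.
! NOT the actual MPAS code; sizes, tags, and the ring exchange are invented,
! and a CUDA-aware MPI is assumed.
! Build: mpif90 -acc sketch.f90 ; Run: mpirun -np 2 ./a.out
program host_data_isend
   use mpi
   implicit none
   integer, parameter :: n = 1024
   real, allocatable :: tempBuffer(:), recvBuffer(:)
   integer :: ierr, rank, nranks, dest, src, req
   integer :: sstat(MPI_STATUS_SIZE), rstat(MPI_STATUS_SIZE)

   call MPI_Init(ierr)
   call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
   call MPI_Comm_size(MPI_COMM_WORLD, nranks, ierr)

   allocate(tempBuffer(n), recvBuffer(n))
   tempBuffer = real(rank)
   dest = mod(rank + 1, nranks)
   src  = mod(rank - 1 + nranks, nranks)

   !$acc data copyin(tempBuffer) copyout(recvBuffer)
   !$acc host_data use_device(tempBuffer, recvBuffer)
   ! Device addresses are handed to MPI here; a transport that cannot read
   ! GPU memory directly (such as CMA's process_vm_readv) fails on them
   ! with "Bad address", matching the log below.
   call MPI_Isend(tempBuffer, n, MPI_REAL, dest, 0, MPI_COMM_WORLD, req, ierr)
   call MPI_Recv(recvBuffer, n, MPI_REAL, src, 0, MPI_COMM_WORLD, rstat, ierr)
   !$acc end host_data
   call MPI_Wait(req, sstat, ierr)   ! the call that aborts in the log below
   !$acc end data

   print *, 'rank', rank, 'received buffer from rank', src
   call MPI_Finalize(ierr)
end program host_data_isend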

Could anyone give some ideas? Thanks

[1620585067.066380] [4313c8b49592:281825:0] cma_ep.c:87 UCX ERROR process_vm_readv(pid=281827 length=144768) returned -1: Bad address
[1620585067.066380] [4313c8b49592:281827:0] cma_ep.c:87 UCX ERROR process_vm_readv(pid=281825 length=144560) returned -1: Bad address
[4313c8b49592:281825] *** An error occurred in MPI_Wait
[4313c8b49592:281825] *** reported by process [2477588481,1]
[4313c8b49592:281825] *** on communicator MPI COMMUNICATOR 6 DUP FROM 5
[4313c8b49592:281825] *** MPI_ERR_INTERN: internal error
[4313c8b49592:281825] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[4313c8b49592:281825] *** and potentially your MPI job)
[4313c8b49592:281820] 1 more process has sent help message help-mpi-errors.txt / mpi_errors_are_fatal
[4313c8b49592:281820] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages


Case executed:
the `jw_baroclinic_wave` idealized case from v5

Commands for execution:
export MPAS_DYNAMICS_RANKS_PER_NODE=0
export MPAS_RADIATION_RANKS_PER_NODE=1
mpirun --allow-run-as-root -np 1 ./init_atmosphere_model

export MPAS_DYNAMICS_RANKS_PER_NODE=2
export MPAS_RADIATION_RANKS_PER_NODE=2
mpirun --allow-run-as-root -np 4 ./atmosphere_model
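
(In the v6.x-openacc code, these variables control how the launched ranks on each node are divided between the GPU dynamics work and the CPU radiation work.) As a hypothetical illustration of that kind of split, and not the actual MPAS implementation, a layout variable like the one above could drive a communicator split as sketched here; the fallback value and the single-node color rule are invented:

! Hypothetical sketch: divide MPI_COMM_WORLD according to a ranks-per-node
! layout variable. NOT the actual MPAS code; the env-var handling and the
! single-node color rule are invented for illustration.
program split_by_layout
   use mpi
   implicit none
   character(len=16) :: val
   integer :: ierr, rank, dynPerNode, stat, color, subcomm

   call MPI_Init(ierr)
   call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)

   dynPerNode = 2                      ! fallback matching the run above
   call get_environment_variable('MPAS_DYNAMICS_RANKS_PER_NODE', val, status=stat)
   if (stat == 0) read(val, *) dynPerNode

   ! Single-node assumption: the first dynPerNode ranks get color 0
   ! (dynamics), the remaining ranks get color 1 (radiation).
   color = 1
   if (rank < dynPerNode) color = 0
   call MPI_Comm_split(MPI_COMM_WORLD, color, rank, subcomm, ierr)
   print *, 'rank', rank, 'assigned color', color

   call MPI_Comm_free(subcomm, ierr)
   call MPI_Finalize(ierr)
end program split_by_layout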

Environment:
docker: nvcr.io/nvidia/nvhpc:21.2-devel-cuda_multi-ubuntu20.04
GPU: A100x2

MPAS version:

atmosphere/v6.x-openacc @ 498393d2c5cf36f73db8925d717ae449c3660d40, with the following diff (it defines -DROTATED_GRID and retargets the OpenACC flags from cc70, i.e. V100, to cc80 with CUDA 11.2 for the A100):

diff --git a/Makefile b/Makefile
index cf8a5e5f..04b35df6 100644
--- a/Makefile
+++ b/Makefile
@@ -1,4 +1,4 @@
-MODEL_FORMULATION =
+MODEL_FORMULATION = -DROTATED_GRID


dummy:
@@ -96,7 +96,7 @@ pgi:
"LDFLAGS_DEBUG = -O0 -g -Mbounds -Mchkptr -Ktrap=divz,fp,inv,ovf -traceback" \
"FFLAGS_OMP = -mp" \
"CFLAGS_OMP = -mp" \
- "FFLAGS_ACC = -Mnofma -acc -ta=tesla:cc70 -Minfo=accel" \
+ "FFLAGS_ACC = -Mnofma -acc -ta=tesla:cc80,cuda11.2 -Minfo=accel" \
"CFLAGS_ACC =" \
"CORE = $(CORE)" \
 