
run-time error "Illegal address during kernel execution" from the "!$acc enter data copyin(scalar_tend_save)" directive

kosakaguchi

New member
Thanks to the information and comments shared in the recent post, I was able to compile MPAS-Atmosphere v8.2.0 and v8.2.1 with the OpenACC option on the NERSC Perlmutter system.

However, in a test run using the 240km mesh, I quickly get the following error:

Code:
Accelerator Fatal Error: call to cuMemcpyHtoDAsync returned error 700: Illegal address during kernel execution
 File: /global/cfs/cdirs/wcm_code/MPAS-Atmosphere/models/ksa/MPAS-Model-v8.2.1/src/core_atmosphere/dynamics/mpas_atm_time_integration.F
 Function: atm_advance_scalars_work:3049
 Line: 3240

The same 240km simulation runs fine with the CPU version of the model executable (compiled without the OpenACC option).

The relevant source code section looks like this:

Code:
#ifndef DO_PHYSICS
      !$acc enter data create(scalar_tend_save)
#else
      !$acc enter data copyin(scalar_tend_save)
#endif


Line 3240 is the fourth line:
Code:
!$acc enter data copyin(scalar_tend_save)
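
(In case it helps localize the failure, one avenue I may try is the NVHPC runtime's event tracing together with CUDA's compute-sanitizer. This is only a sketch: the NV_ACC_NOTIFY bit values follow the NVHPC documentation, and the availability of compute-sanitizer on Perlmutter compute nodes is an assumption.)
Bash:
# trace OpenACC kernel launches and data transfers to see which one hits error 700
export NV_ACC_NOTIFY=3   # bitmask: 1 = kernel launches, 2 = data transfers, 3 = both
srun -n 4 -c 32 --cpu_bind=cores -G 4 --gpu-bind=none ./atmosphere_model

# or run a single rank under compute-sanitizer to get the first illegal access
srun -n 1 compute-sanitizer ./atmosphere_model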


I am testing the MPAS executable compiled with the debug option. To compile, I copied the entry for "nvhpc" in the top-level Makefile to create the following:

Code:
nvhpc-pm-gpu:   # BUILDTARGET Nvidia compilers on NERSC Perlmutter GPU node following nvhpc
    ( $(MAKE) all \
    "FC_PARALLEL = ftn" \
    "CC_PARALLEL = cc" \
    "CXX_PARALLEL = CC" \
    "FC_SERIAL = nvfortran" \
    "CC_SERIAL = nvc" \
    "CXX_SERIAL = nvc++" \
    "FFLAGS_PROMOTION = -r8" \
    "FFLAGS_OPT = -gopt -O4 -byteswapio -Mfree" \
    "CFLAGS_OPT = -gopt -O3" \
    "CXXFLAGS_OPT = -gopt -O3" \
    "LDFLAGS_OPT = -gopt -O3" \
    "FFLAGS_DEBUG = -O0 -g -Mbounds -Mchkptr -byteswapio -Mfree -Ktrap=divz,fp,inv,ovf -traceback" \
    "CFLAGS_DEBUG = -O0 -g -traceback" \
    "CXXFLAGS_DEBUG = -O0 -g -traceback" \
    "LDFLAGS_DEBUG = -O0 -g -Mbounds -Ktrap=divz,fp,inv,ovf -traceback" \
    "FFLAGS_OMP = -mp" \
    "CFLAGS_OMP = -mp" \
    "FFLAGS_ACC = -Mnofma -acc -gpu=cc70,cc80 -Minfo=accel" \
    "CFLAGS_ACC =" \
    "PICFLAG = -fpic" \
    "BUILD_TARGET = $(@)" \
    "CORE = $(CORE)" \
    "DEBUG = $(DEBUG)" \
    "USE_PAPI = $(USE_PAPI)" \
    "OPENMP = $(OPENMP)" \
    "OPENACC = $(OPENACC)" \
    "CPPFLAGS = $(MODEL_FORMULATION) -D_MPI -DCPRPGI" )


I also loaded the following modules:

Code:
module load PrgEnv-nvidia/8.5.0
module load gpu/1.0
module load cray-hdf5/1.12.2.3
module load cray-netcdf/4.9.0.9
module load cray-parallel-netcdf/1.12.3.9
module load cmake/3.24.3
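
The cray-netcdf and cray-parallel-netcdf modules export NETCDF_DIR and PNETCDF_DIR, which my build script below passes to MPAS; a quick sanity check that they are set (sketch):
Bash:
echo "NETCDF_DIR  = $NETCDF_DIR"    # exported to MPAS as NETCDF below
echo "PNETCDF_DIR = $PNETCDF_DIR"   # exported to MPAS as PNETCDF below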

Loading these results in the following modules at compile and run time:
Code:
 1) craype-x86-milan                        7) conda/Miniconda3-py311_23.11.0-2        13) cray-mpich/8.1.28     (mpi)   19) cray-hdf5/1.12.2.3            (io)
 2) libfabric/1.15.2.0                      8) evp-patch                               14) cray-libsci/23.12.5   (math)  20) cray-netcdf/4.9.0.9           (io)
 3) craype-network-ofi                      9) python/3.11                      (dev)  15) PrgEnv-nvidia/8.5.0   (cpe)   21) cray-parallel-netcdf/1.12.3.9 (io)
 4) xpmem/2.6.2-2.5_2.38__gd067c3f.shasta  10) nvidia/23.9                      (g,c)  16) cudatoolkit/12.2      (g)     22) cmake/3.24.3                  (buildtools)
 5) perftools-base/23.12.0                 11) craype/2.7.30                    (c)    17) craype-accel-nvidia80
 6) cpe/23.12                              12) cray-dsmml/0.2.2                        18) gpu/1.0


I followed examples from the NERSC documentation and their online tool for generating a batch script:

Code:
#SBATCH -q debug
#SBATCH -t 00:30:00
...
#SBATCH -C gpu&hbm40g
#SBATCH -G 4

export SLURM_CPU_BIND="cores"


srun -n 4 -c 32 --cpu_bind=cores -G 4 --gpu-bind=none  ./atmosphere_model
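
(As a quick sanity check before launching the model, each rank can list the GPUs it sees; this sketch assumes nvidia-smi is available on the compute nodes:)
Bash:
srun -n 4 -G 4 --gpu-bind=none nvidia-smi -L   # should list the node's A100s on every rank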

I'd appreciate any suggestions for solving the problem.

Best regards,

Koichi
 
Does the model run if you use an optimized (non-debug) executable? I vaguely recall that others have encountered inexplicable errors when building with `OPENACC=true` and `DEBUG=true`.
 
Thank you for the suggestion, @mgduda.

With DEBUG=false, I get an error when compiling mpas_atm_time_integration.f, which may indicate a problem in my build process that also causes the run-time error above.

The error message says it cannot find mpas_derived_types.mod (line 1384 in the attached log file):

Code:
 ftn -gopt -O4 -byteswapio -Mfree -Mnofma -acc -gpu=cc70,cc80 -Minfo=accel  -c -o mpas_atm_time_integration.o mpas_atm_time_integration.f
NVFORTRAN-F-0004-Unable to open MODULE file mpas_derived_types.mod (mpas_atm_time_integration.f: 9)
NVFORTRAN/x86-64 Linux 23.9-0: compilation aborted

Earlier in the log file, mpas_derived_types.mod appears to be removed (line 149 in the attached log file); I wonder why.

Code:
rm -f mpas_derived_types.o mpas_derived_types.mod

I get the same error in serial and parallel compilations (tested with -j 4 and -j 6). The attached log file is from the serial compilation.

Am I missing some flags in the top-level Makefile entry that I copied and modified (quoted in my first post above)? That's the only change I made to the build-related files. I am using version 8.2.1. Below is also a copy of the shell script I used to build the model.

I'd appreciate any further suggestions on solving the problem.

Best,

Koichi

Bash:
#!/bin/bash
set -e

MPASver="8.2.1"

...

imach="pm"  #Perlmutter

mpascore="atmosphere"

#issingle=true #single precision build, now default
doopenmp=false #OpenMP support
isdebug=false

modversion="2024-07"  #year-month of the major update in which these (default) modules were introduced (INC0182147)

#Modules --------------------------------------------------------------------
loading_script="${scriptdir}/load_module_pmgpu_mpas_${modversion}.sh"
source ${loading_script}
#----------------------------------------------------------------------------
#capture starting time for log file name
idate=$(date "+%Y-%m-%d-%H%M")

bldlog="make_mpas_${mpascore}_${idate}.log"

module list &> ${bldlog}

#set environment variables used by MPAS
export NETCDF=$NETCDF_DIR

export PNETCDF=$PNETCDF_DIR

echo NETCDF is $NETCDF &>> ${bldlog}
echo PNETCDF is $PNETCDF &>> ${bldlog}

#run make in the top directory ---------------------------
cd $MPASdir

echo "cd into $MPASdir to compile"

pwd

#DO NOT SET PIO environment variable so that the build script uses the Simple MPAS I/O Layer (SMIOL)
#export PIO="/global/common/software/m1867/MPAS/mpas_io/pio2_pm_${modversion}_nodarshan"
#echo PIO is $PIO &>> ${bldlog}
echo "PIO environment variable is not set; using SMOIL instead"
echo PIO is $PIO

#reset Makefile
rm -rf Makefile   
ln -s ${scriptdir}/Makefile_pm_v${MPASver} Makefile
    
#run make
make clean CORE=${mpascore} #clean the previously compiled core?

exec_name="${mpascore}_v${MPASver}_pm_gpu_module${modversion}_${idate}"

makeoptions="nvhpc-pm-gpu OPENACC=true AUTOCLEAN=true CORE=${mpascore}"

if [ "$isdebug" = true ]; then
    makeoptions+=" DEBUG=true"
    exec_name+="_debug"
fi

if [ "$doopenmp" = true ]; then
    makeoptions+=" OPENMP=true"
    exec_name+="_OpenMP"
fi

#make -j 6 ${makeoptions}  &>> ${bldlog}
set +e  #with set -e at the top, a failed make would abort before the status check below
make ${makeoptions}  &>> ${bldlog}
makeval=$?  #capture the make return code
set -e
    
# Post-build steps: rename/copy the executable and move the build log
if [ $makeval -eq 0  ]; then
    #rename the mpas executable
    echo "success, renaming & copying the executable"
    cp ./${mpascore}_model  ${MPASbindir}/${exec_name}
    
    #also copy log files
    mkdir -p ${MPASdir}/logs
    mv ${bldlog} ${MPASdir}/logs/
else
    echo "mpas build failed"
fi
 

Attachments

  • make_mpas_atmosphere_2024-08-15-1604.log (398.8 KB)
It looks like there's something odd going on with your build. Typically, compilation of source (*.F) files includes module search paths with -I<path> flags; e.g. from your build log:
Code:
ftn -D_MPI -DCPRPGI -DCORE_ATMOSPHERE -DMPAS_NAMELIST_SUFFIX=atmosphere -DMPAS_EXE_NAME=atmosphere_model -DMPAS_OPENACC -DSINGLE_PRECISION -DMPAS_NATIVE_TIMERS -DMPAS_GIT_VERSION=v8.2.1-dirty -DMPAS_BUILD_TARGET=nvhpc-pm-gpu -DMPAS_SMIOL_SUPPORT -DDO_PHYSICS -gopt -O4 -byteswapio -Mfree -Mnofma -acc -gpu=cc70,cc80 -Minfo=accel -c mpas_atm_iau.F -I/opt/cray/pe/netcdf/4.9.0.9/nvidia/23.3/include -I/opt/cray/pe/parallel-netcdf/1.12.3.9/nvidia/23.3/include -I/global/cfs/cdirs/wcm_code/MPAS-Atmosphere/models/ksa/MPAS-Model-v8.2.1/src/external/SMIOL -I/opt/cray/pe/netcdf/4.9.0.9/nvidia/23.3/include -I/opt/cray/pe/parallel-netcdf/1.12.3.9/nvidia/23.3/include -I/global/cfs/cdirs/wcm_code/MPAS-Atmosphere/models/ksa/MPAS-Model-v8.2.1/src/core_atmosphere/physics/physics_noahmp/drivers/mpas -I/global/cfs/cdirs/wcm_code/MPAS-Atmosphere/models/ksa/MPAS-Model-v8.2.1/src/core_atmosphere/physics/physics_noahmp/utility -I/global/cfs/cdirs/wcm_code/MPAS-Atmosphere/models/ksa/MPAS-Model-v8.2.1/src/core_atmosphere/physics/physics_noahmp/src -I.. -I../../framework -I../../operators -I../physics -I../physics/physics_wrf -I../physics/physics_mmm -I../../external/esmf_time_f90
However, it looks like your compilation is failing when compiling a file with a .f suffix:
Code:
ftn -gopt -O4 -byteswapio -Mfree -Mnofma -acc -gpu=cc70,cc80 -Minfo=accel  -c -o mpas_atm_time_integration.o mpas_atm_time_integration.f
I can't see why the default v8.2.1 build system would be trying to compile a file named mpas_atm_time_integration.f.

Can you start with a clean clone of the MPAS v8.2.1 code, modify just the top-level Makefile to add a Perlmutter build option (nvhpc-pm-gpu), and then compile manually (not from a script) with
Bash:
make nvhpc-pm-gpu CORE=atmosphere OPENACC=true
?
 
Thank you so much for the suggestion, and it worked :)

What I found is that the source code I downloaded as the tar.gz file from the release webpage compiles and runs fine, but cloning the master branch with
Code:
git clone --recurse-submodules git@github.com:MPAS-Dev/MPAS-Model.git
fails to compile, giving the above error involving mpas_atm_time_integration.f.

I wonder if cloning from the MPAS-Dev repository is not recommended. I usually obtain the code with `git clone`, especially right after a new release.
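
If cloning, checking out the release tag directly instead of master might avoid the problem (a sketch; it assumes the release is tagged v8.2.1, matching the version string above):
Bash:
git clone --recurse-submodules -b v8.2.1 https://github.com/MPAS-Dev/MPAS-Model.git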

I also found the following:

- As long as I use the tarred release source code, my bash script can compile the model, and the executable successfully finishes the test 240km simulation.
- With DEBUG=true, the executable compiled from the tarred source code still gives the same run-time error:
Code:
Accelerator Fatal Error: call to cuMemcpyHtoDAsync returned error 700: Illegal address during kernel execution

Best,

Koichi
 