
Error: cannot allocate memory in mpas_init_atm_gwd.F, MPAS v8.2.2

RSAPIAIN_DMC

New member
Hello.

I would like some help with a memory allocation error:
  • Should I create a single-precision build?
  • Or is the issue that MPAS requires GNU 12.x?
  • Or does the extraction of the regional domain have to be done from the "non-static" mesh file?

I built MPAS v8.2.2 with:
  • GOMPI 2024a + OpenMPI 4.1.6 (GNU gcc, g++, gfortran 13.3.0)
  • SMIOL
  • OpenMP: on
  • Double precision
  • NetCDF 4.9.2 and NetCDF-Fortran 4.6.1, built with the same GOMPI 2024a + OpenMPI 4.1.6 stack
I then created a regional domain from the 15-km static mesh file available on the MPAS-Atmosphere mesh downloads page.
Contents of the .pts file (used with the create_region script from GitHub - MPAS-Dev/MPAS-Limited-Area, the Python tool for creating a regional subset of a global MPAS mesh):
name: Sinoptico-15km
Type: Custom
Point: -33.5, -71.0
-55.0, -45.0
-55.0, -135.0
-15.0, -115.0
-15.0, -65.0
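For reference, the subsetting step looked roughly like this (file names below are placeholders; adjust them to the actual points file and the downloaded 15-km static file):
Code:
# MPAS-Limited-Area usage: create_region <points file> <global mesh file>
# (file names are illustrative)
create_region Sinoptico-15km.custom.pts x1.2621442.static.nc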

Partitioning the resulting .nc file with gpmetis (built with the same GNU 13 stack) worked without issues for 4, 64, and 128 MPI tasks. I also tried running with a single process (without mpirun), and it still crashes with the memory allocation error.
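The partitioning itself was the standard gpmetis call, for example for 64 tasks (graph file name is a placeholder):
Code:
# writes Sinoptico-15km.graph.info.part.64 for a 64-task run
gpmetis Sinoptico-15km.graph.info 64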

I'm getting the following error when running init_atmosphere:
[rsapiain@roddel run]$ time mpirun -n 4 ./init_atmosphere_model
In file 'mpas_init_atm_gwd.F', around line 562: Error allocating 13112510515200 bytes: Cannot allocate memory

Error termination. Backtrace:
In file 'mpas_init_atm_gwd.F', around line 562: Error allocating 18030996576000 bytes: Cannot allocate memory

Error termination. Backtrace:
In file 'mpas_init_atm_gwd.F', around line 562: Error allocating 18030996576000 bytes: Cannot allocate memory

Error termination. Backtrace:
In file 'mpas_init_atm_gwd.F', around line 562: Error allocating 17672006822400 bytes: Cannot allocate memory

There is no .err file created by the run.
The mpas_static files were downloaded from the MPAS website.

EDIT:
- Tried running on machines with an AMD EPYC 7313 (128 GB RAM) and an AMD EPYC 7543 (512 GB RAM).

Thank you in advance for the ideas.
 

Attachments

  • Chile_Sinoptico-15km.png (656.8 KB)
  • log.init_atmosphere.0000.out.txt (27 KB)
  • streams.init_atmosphere.txt (950 bytes)
  • namelist.init_atmosphere.txt (1.5 KB)
If you have used the MPAS-Limited-Area tool to subset a global (15 km) static file, there's no need to reprocess the static fields again on the limited-area mesh.

The static field processing assumes that the input mesh is defined on a unit sphere, while static files use a sphere radius of 6371.229 km. So it's likely that the calculation of the GWDO fields is using cells whose dimensions are 6371.229 times larger than expected, leading to attempts to allocate large amounts of memory.
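If it helps to confirm which kind of mesh file you're working with, the relevant global attributes can be checked with ncdump (the file name below is just an example):
Code:
# a static file, or a regional subset of one, should report an Earth-radius
# sphere_radius rather than the unit-sphere value of 1.0
ncdump -h Sinoptico-15km.static.nc | grep -i -E "on_a_sphere|sphere_radius"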
 
Michael, good evening from Chile.

Thank you for the help. I was able to run the MPAS init_atmosphere steps and the atmosphere_model for the specified domain, following the 2023 MPAS-A mini tutorial.

The only issue is that the system load is quite high; maybe that's a result of the OpenMP build.

I ran it with mpiexec -n 64 (the node has 64 cores, and the graph was partitioned for that), and ps shows me 64 processes.

For an efficient run (system load around 64), should I run with 16 MPI tasks (and create another partition), or is there an environment variable to run with just one OpenMP thread per task? I would like to test the timings.


EDIT:
OMP_STACKSIZE=512m OMP_PLACES=cores OMP_PROC_BIND=close mpiexec -n 16 gives me efficient machine usage (system load usually 60 to 64, but sometimes up to 80).
Any other ideas would be welcome.


Kind regards,
 

Attachments

  • 02_MPAS-running_system-load-320.png (188.4 KB)
With support for OpenMP enabled in MPAS-A, there will generally be a total of (# MPI tasks) x (# OpenMP threads) MPAS-A threads. You can use the OMP_NUM_THREADS environment variable to control the maximum number of OpenMP threads for each MPI task. So, for example, if you would like a maximum of 32 threads, you could use 8 MPI tasks each with 4 OpenMP threads with
Code:
export OMP_NUM_THREADS=4
mpiexec -n 8 ./atmosphere_model
In my experience, MPAS-A seems to give the highest throughput (simulation rate) for a given number of threads when using only MPI, without OpenMP.
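For example, to compare against a pure-MPI run on your 64-core node (assuming a graph partition for 64 tasks is available), something like the following should work:
Code:
export OMP_NUM_THREADS=1     # one thread per task, so only MPI parallelism is used
mpiexec -n 64 ./atmosphere_model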

Hopefully that helps, but if you have additional questions about controlling processor usage in MPAS-A, please don't hesitate to create a new thread for that purpose!
 
Michael, good morning from Chile.

Thank you for the insights; I will try a non-OpenMP build and test.
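What I have in mind for the rebuild is roughly the following (a sketch only, with OPENMP=true simply omitted so OpenMP stays disabled):
Code:
make clean CORE=atmosphere
make gnu CORE=atmosphere      # OPENMP=true omitted; double precision is the default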

I would also like to ask about the optimization flags: since we are on Zen 3 CPUs, these are our current flags:

Makefile:
gnu:   # BUILDTARGET GNU Fortran, C, and C++ compilers
    ( $(MAKE) all \
    "FC_PARALLEL = mpif90" \
    "CC_PARALLEL = mpicc" \
    "CXX_PARALLEL = mpicxx" \
    "FC_SERIAL = gfortran" \
    "CC_SERIAL = gcc" \
    "CXX_SERIAL = g++" \
    "FFLAGS_PROMOTION = -fdefault-real-8 -fdefault-double-8 -march=znver3 -mtune=znver3 -mavx2 -mprefer-vector-width=256" \
    "FFLAGS_OPT = -std=f2008 -O3 -ffree-line-length-none -fconvert=big-endian -ffree-form -march=znver3 -mtune=znver3 -mavx2 -mprefer-vector-width=256" \
    "CFLAGS_OPT = -O3 -march=znver3 -mtune=znver3 -mavx2 -mprefer-vector-width=256" \
    "CXXFLAGS_OPT = -O3 -march=znver3 -mtune=znver3 -mavx2 -mprefer-vector-width=256" \
    "LDFLAGS_OPT = -O3" \
    "FFLAGS_DEBUG = -std=f2008 -g -ffree-line-length-none -fconvert=big-endian -ffree-form -fcheck=all -fbacktrace -ffpe-trap=invalid,zero,overflow" \
    "CFLAGS_DEBUG = -g" \
    "CXXFLAGS_DEBUG = -g" \
    "LDFLAGS_DEBUG = -g" \
    "FFLAGS_OMP = -fopenmp" \
    "CFLAGS_OMP = -fopenmp" \
    "FFLAGS_ACC =" \
    "CFLAGS_ACC =" \
    "PICFLAG = -fPIC" \
    "BUILD_TARGET = $(@)" \
    "CORE = $(CORE)" \
    "DEBUG = $(DEBUG)" \
    "USE_PAPI = $(USE_PAPI)" \
    "OPENMP = $(OPENMP)" \
    "OPENACC = $(OPENACC)" \
    "CPPFLAGS = $(MODEL_FORMULATION) -D_MPI" )

Which other flags would be good to add, in your experience? I found these in the specification (a quick check of what znver3 already enables is shown after this list):
  • -ffast-math -mfma -m3dnow -fomit-frame-pointer
  • -flto
  • -funroll-all-loops
  • -fprefetch-loop-arrays --param prefetch-latency=300
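For context, a quick way to see what -march=znver3 already enables (FMA, AVX2, preferred vector width) is:
Code:
# list the target options GCC turns on for znver3; no source file is compiled
gcc -march=znver3 -mtune=znver3 -Q --help=target | grep -i -E "avx2|fma|prefer-vector"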
I have not tested CLANG/FLANG from AOCC with AMD Lib-M (I haven't yet been able to get those compilers to work).

For now we are using the GNU compilers only; we noticed about a 20% speedup compared to the latest Intel Classic 2022 compilers.
 