Memory leak in metgrid.exe

This post was from a previous version of the WRF & MPAS-A Support Forum. New replies have been disabled; if you have follow-up questions related to this post, please start a new thread from the forum home page.

bartbrashers

New member
I have something wrong with my compilation that causes a memory leak with metgrid.exe. When it runs, it uses more and more virtual memory until all of the physical RAM + swap is exhausted, at which point metgrid gets killed. Here's an example, for a 6-day run, on a compute node with 64 GB of RAM:
Code:
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND          
 6342 bbrashe+  20   0   34.4g  34.3g   5336 R  99.7 54.6   4:04.51 metgrid.exe

I get this behavior with multiple versions of WRF and WPS (from 3.8 to 4.1.3), on two different versions of CentOS, using three different versions of the PGI compilers. I just switched from the LLVM version to the non-LLVM version of PGI 19.10, and recompiled everything in the software stack:

jasper-1.900.1
libpng-1.6.37
zlib-1.2.11
hdf5-1.8.20
netcdf-c-4.7.2
netcdf-fortran-4.5.2

This is on CentOS Linux release 7.4.1708, using the 3.10.0-693.el7.x86_64 x86_64 kernel.

I can use ldd on the binaries produced by some of the libs above (e.g. libpng-1.6.37/bin/pngfix) to show it's using my compilations of zlib, etc.

I have tried this using the yum-installed versions of libpng, zlib, and jasper. I've tried with and without hdf5 (disabling netcdf4).

I'll post some output from ldd and compile scripts below, and attach my configure.wrf, configure.wps, and the logs from compilation. Hopefully someone can see something wrong.

If not, any clues as to how to track down where the memory leak is coming from would be greatly appreciated.
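
(For what it's worth, here is roughly how I've been watching the memory grow while metgrid runs - nothing fancy, just a polling loop around ps, assuming the process is named metgrid.exe:)

Code:
# Log metgrid.exe's virtual and resident memory once a minute
while true; do
    date
    ps -C metgrid.exe -o pid,vsz,rss,comm
    sleep 60
done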

Code:
# ldd libpng-1.6.37/bin/pngfix 
        linux-vdso.so.1 =>  (0x00007ffe8b9a1000)
        libpng16.so.16 => /usr/local/src/libpng-1.6.37/lib/libpng16.so.16 (0x00007f4c279a9000)
        libm.so.6 => /usr/lib/gcc/x86_64-redhat-linux/4.8.5/../../../../lib64/libm.so.6 (0x00007f4c276a1000)
        libz.so.1 => /usr/local/src/zlib-1.2.11/lib/libz.so.1 (0x00007f4c27489000)
        libpgmp.so => /usr/local/pgi/linux86-64-nollvm/19.10/lib/libpgmp.so (0x00007f4c27201000)
        libnuma.so.1 => /usr/local/pgi/linux86-64-nollvm/19.10/lib/libnuma.so.1 (0x00007f4c26ff1000)
        libpthread.so.0 => /usr/lib/gcc/x86_64-redhat-linux/4.8.5/../../../../lib64/libpthread.so.0 (0x00007f4c26dd1000)
        libpgmath.so => /usr/local/pgi/linux86-64-nollvm/19.10/lib/libpgmath.so (0x00007f4c269b9000)
        libpgc.so => /usr/local/pgi/linux86-64-nollvm/19.10/lib/libpgc.so (0x00007f4c26761000)
        libc.so.6 => /usr/lib/gcc/x86_64-redhat-linux/4.8.5/../../../../lib64/libc.so.6 (0x00007f4c26391000)
        libgcc_s.so.1 => /usr/lib/gcc/x86_64-redhat-linux/4.8.5/../../../../lib64/libgcc_s.so.1 (0x00007f4c26179000)
        /lib64/ld-linux-x86-64.so.2 (0x000055acdda1f000)

# ldd zlib-1.2.11/example64 
        linux-vdso.so.1 =>  (0x00007ffd09c79000)
        libpgmp.so => /usr/local/pgi/linux86-64-nollvm/19.10/lib/libpgmp.so (0x00007f7b47051000)
        libnuma.so.1 => /usr/local/pgi/linux86-64-nollvm/19.10/lib/libnuma.so.1 (0x00007f7b46e41000)
        libpthread.so.0 => /usr/lib/gcc/x86_64-redhat-linux/4.8.5/../../../../lib64/libpthread.so.0 (0x00007f7b46c21000)
        libpgmath.so => /usr/local/pgi/linux86-64-nollvm/19.10/lib/libpgmath.so (0x00007f7b46809000)
        libpgc.so => /usr/local/pgi/linux86-64-nollvm/19.10/lib/libpgc.so (0x00007f7b465b1000)
        libm.so.6 => /usr/lib/gcc/x86_64-redhat-linux/4.8.5/../../../../lib64/libm.so.6 (0x00007f7b462a9000)
        libc.so.6 => /usr/lib/gcc/x86_64-redhat-linux/4.8.5/../../../../lib64/libc.so.6 (0x00007f7b45ed9000)
        libgcc_s.so.1 => /usr/lib/gcc/x86_64-redhat-linux/4.8.5/../../../../lib64/libgcc_s.so.1 (0x00007f7b45cc1000)
        /lib64/ld-linux-x86-64.so.2 (0x00005584927de000)

# ldd hdf5-1.8.20.pgi/bin/h5copy 
        linux-vdso.so.1 =>  (0x00007ffde7721000)
        libhdf5.so.10 => /usr/local/src/hdf5-1.8.20.pgi/lib/libhdf5.so.10 (0x00007fca87599000)
        libz.so.1 => /usr/lib/gcc/x86_64-redhat-linux/4.8.5/../../../../lib64/libz.so.1 (0x00007fca87381000)
        libdl.so.2 => /usr/lib/gcc/x86_64-redhat-linux/4.8.5/../../../../lib64/libdl.so.2 (0x00007fca87179000)
        libm.so.6 => /usr/lib/gcc/x86_64-redhat-linux/4.8.5/../../../../lib64/libm.so.6 (0x00007fca86e71000)
        libpgmp.so => /usr/local/pgi/linux86-64-nollvm/19.10/lib/libpgmp.so (0x00007fca86be9000)
        libnuma.so.1 => /usr/local/pgi/linux86-64-nollvm/19.10/lib/libnuma.so.1 (0x00007fca869d9000)
        libpthread.so.0 => /usr/lib/gcc/x86_64-redhat-linux/4.8.5/../../../../lib64/libpthread.so.0 (0x00007fca867b9000)
        libpgmath.so => /usr/local/pgi/linux86-64-nollvm/19.10/lib/libpgmath.so (0x00007fca863a1000)
        libpgc.so => /usr/local/pgi/linux86-64-nollvm/19.10/lib/libpgc.so (0x00007fca86149000)
        libc.so.6 => /usr/lib/gcc/x86_64-redhat-linux/4.8.5/../../../../lib64/libc.so.6 (0x00007fca85d79000)
        libgcc_s.so.1 => /usr/lib/gcc/x86_64-redhat-linux/4.8.5/../../../../lib64/libgcc_s.so.1 (0x00007fca85b61000)
        /lib64/ld-linux-x86-64.so.2 (0x0000564e1f4fe000)

# ldd netcdf-c-4.7.2.pgi/build/bin/ncgen3
        linux-vdso.so.1 =>  (0x00007ffce1e01000)
        libnetcdf.so.15 => /usr/local/src/netcdf-c-4.7.2.pgi/build/lib/libnetcdf.so.15 (0x00007fa0ae831000)
        libhdf5_hl.so.10 => /usr/local/src/hdf5-1.8.20.pgi/lib/libhdf5_hl.so.10 (0x00007fa0ae5f9000)
        libhdf5.so.10 => /usr/local/src/hdf5-1.8.20.pgi/lib/libhdf5.so.10 (0x00007fa0adf99000)
        libm.so.6 => /usr/lib/gcc/x86_64-redhat-linux/4.8.5/../../../../lib64/libm.so.6 (0x00007fa0adc91000)
        libdl.so.2 => /usr/lib/gcc/x86_64-redhat-linux/4.8.5/../../../../lib64/libdl.so.2 (0x00007fa0ada89000)
        libz.so.1 => /usr/lib/gcc/x86_64-redhat-linux/4.8.5/../../../../lib64/libz.so.1 (0x00007fa0ad871000)
        libpgmp.so => /usr/local/pgi/linux86-64-nollvm/19.10/lib/libpgmp.so (0x00007fa0ad5e9000)
        libnuma.so.1 => /usr/local/pgi/linux86-64-nollvm/19.10/lib/libnuma.so.1 (0x00007fa0ad3d9000)
        libpthread.so.0 => /usr/lib/gcc/x86_64-redhat-linux/4.8.5/../../../../lib64/libpthread.so.0 (0x00007fa0ad1b9000)
        libpgmath.so => /usr/local/pgi/linux86-64-nollvm/19.10/lib/libpgmath.so (0x00007fa0acda1000)
        libpgc.so => /usr/local/pgi/linux86-64-nollvm/19.10/lib/libpgc.so (0x00007fa0acb49000)
        libc.so.6 => /usr/lib/gcc/x86_64-redhat-linux/4.8.5/../../../../lib64/libc.so.6 (0x00007fa0ac779000)
        libgcc_s.so.1 => /usr/lib/gcc/x86_64-redhat-linux/4.8.5/../../../../lib64/libgcc_s.so.1 (0x00007fa0ac561000)
        /lib64/ld-linux-x86-64.so.2 (0x000056006a494000)

# netcdf-c-4.7.2.pgi/build/bin/nc-config --all

This netCDF 4.7.2 has been built with the following features: 

  --cc            -> pgcc
  --cflags        -> -I/usr/local/src/netcdf-c-4.7.2.pgi/build/include -I/usr/local/src/hdf5-1.8.20.pgi/include
  --libs          -> -L/usr/local/src/netcdf-c-4.7.2.pgi/build/lib -lnetcdf
  --static        -> -lhdf5_hl -lhdf5 -lm -ldl -lz 

  --has-c++       -> no
  --cxx           -> 

  --has-c++4      -> no
  --cxx4          -> 

  --has-fortran   -> yes
  --fc            -> pgfortran
  --fflags        -> -I/usr/local/src/netcdf-c-4.7.2.pgi/build/include
  --flibs         -> -L/usr/local/src/netcdf-c-4.7.2.pgi/build/lib -lnetcdff -L/usr/local/src/netcdf-c-4.7.2.pgi/build/lib -lnetcdf -lnetcdf -ldl -lm
  --has-f90       -> 
  --has-f03       -> yes

  --has-dap       -> no
  --has-dap2      -> no
  --has-dap4      -> no
  --has-nc2       -> yes
  --has-nc4       -> yes
  --has-hdf5      -> yes
  --has-hdf4      -> no
  --has-logging   -> no
  --has-pnetcdf   -> no
  --has-szlib     -> no
  --has-cdf5      -> yes
  --has-parallel4 -> no
  --has-parallel  -> no

  --prefix        -> /usr/local/src/netcdf-c-4.7.2.pgi/build
  --includedir    -> /usr/local/src/netcdf-c-4.7.2.pgi/build/include
  --libdir        -> /usr/local/src/netcdf-c-4.7.2.pgi/build/lib
  --version       -> netCDF 4.7.2
  
# cat wrf/WRF-4.1.3/my.compile 
#!/bin/csh -f
setenv NETCDF     /usr/local/src/wrf/netcdf-c-4.7.2/build
setenv NETCDFHOME $NETCDF
setenv NETCDF_DIR $NETCDF

if !($?LD_LIBRARY_PATH) then
    setenv LD_LIBRARY_PATH $NETCDF_DIR/lib
else
    if ( "$LD_LIBRARY_PATH" !~ *$NETCDF_DIR/lib* ) then
        setenv LD_LIBRARY_PATH $NETCDF_DIR/lib:${LD_LIBRARY_PATH}
    endif
endif

setenv NETCDF4 0
setenv WRFIO_NCD_LARGE_FILE_SUPPORT 1

echo "Cleaning"
clean -a >& /dev/null
if (-e my.configure.wrf) then
    cp my.configure.wrf configure.wrf
    echo "Using existing my.configure.wrf"
else
    configure
    cp configure.wrf my.configure.wrf
    echo "Edit my.configure.wrf and re-run $0:t"
    exit
endif
echo "Compiling"
./compile em_real >&! compile.out.`date +%Y-%m-%d`
echo "Done"

# cat wrf/WPS-4.1/my.compile 
#!/bin/csh -f
setenv NETCDF     /usr/local/src/netcdf-c-4.7.2.pgi/build
setenv NETCDFHOME $NETCDF
setenv NETCDF_DIR $NETCDF

if !($?LD_LIBRARY_PATH) then
    setenv LD_LIBRARY_PATH $NETCDF_DIR/lib
else
    if ( "$LD_LIBRARY_PATH" !~ *$NETCDF_DIR/lib* ) then
        setenv LD_LIBRARY_PATH $NETCDF_DIR/lib:${LD_LIBRARY_PATH}
    endif
endif

setenv NETCDF4 0
setenv WRFIO_NCD_LARGE_FILE_SUPPORT 1

echo "Cleaning"
clean -a >& /dev/null
if (-e my.configure.wps) then
    cp my.configure.wps configure.wps
    echo "Using existing my.configure.wps"
else
    configure
    cp configure.wps my.configure.wps
    echo "Edit my.configure.wps and re-run $0:t"
    exit
endif
echo "Compiling"
compile >&! compile.out.`date +%Y-%m-%d`
echo "Done"


# ldd wrf/WRF-4.1.3/run/wrf.exe 
        linux-vdso.so.1 =>  (0x00007ffc245c1000)
        libnetcdff.so.7 => /usr/local/src/netcdf-c-4.7.2.pgi/build/lib/libnetcdff.so.7 (0x00007f6df81b1000)
        libnetcdf.so.15 => /usr/local/src/netcdf-c-4.7.2.pgi/build/lib/libnetcdf.so.15 (0x00007f6df7ea9000)
        libhdf5hl_fortran.so.10 => /usr/local/src/hdf5-1.8.20.pgi/lib/libhdf5hl_fortran.so.10 (0x00007f6df7c81000)
        libhdf5_hl.so.10 => /usr/local/src/hdf5-1.8.20.pgi/lib/libhdf5_hl.so.10 (0x00007f6df7a49000)
        libhdf5_fortran.so.10 => /usr/local/src/hdf5-1.8.20.pgi/lib/libhdf5_fortran.so.10 (0x00007f6df77f1000)
        libhdf5.so.10 => /usr/local/src/hdf5-1.8.20.pgi/lib/libhdf5.so.10 (0x00007f6df7191000)
        libm.so.6 => /usr/lib64/libm.so.6 (0x00007f6df6e89000)
        libz.so.1 => /usr/lib64/libz.so.1 (0x00007f6df6c71000)
        libmpi_usempif08.so.40 => /usr/local/pgi/linux86-64-nollvm/2019/mpi/openmpi-3.1.3/lib/libmpi_usempif08.so.40 (0x00007f6df6a01000)
        libmpi_usempi_ignore_tkr.so.40 => /usr/local/pgi/linux86-64-nollvm/2019/mpi/openmpi-3.1.3/lib/libmpi_usempi_ignore_tkr.so.40 (0x00007f6df67f1000)
        libmpi_mpifh.so.40 => /usr/local/pgi/linux86-64-nollvm/2019/mpi/openmpi-3.1.3/lib/libmpi_mpifh.so.40 (0x00007f6df6591000)
        libmpi.so.40 => /usr/local/pgi/linux86-64-nollvm/2019/mpi/openmpi-3.1.3/lib/libmpi.so.40 (0x00007f6df60d1000)
        libpgf90rtl.so => /usr/local/pgi/linux86-64/2019/lib/libpgf90rtl.so (0x00007f6df5ea9000)
        libpgf90.so => /usr/local/pgi/linux86-64/2019/lib/libpgf90.so (0x00007f6df5891000)
        libpgf90_rpm1.so => /usr/local/pgi/linux86-64/2019/lib/libpgf90_rpm1.so (0x00007f6df5689000)
        libpgf902.so => /usr/local/pgi/linux86-64/2019/lib/libpgf902.so (0x00007f6df5471000)
        libpgftnrtl.so => /usr/local/pgi/linux86-64/2019/lib/libpgftnrtl.so (0x00007f6df5211000)
        libpgmp.so => /usr/local/pgi/linux86-64/2019/lib/libpgmp.so (0x00007f6df4f89000)
        libnuma.so.1 => /usr/local/pgi/linux86-64/2019/lib/libnuma.so.1 (0x00007f6df4d79000)
        libpthread.so.0 => /usr/lib64/libpthread.so.0 (0x00007f6df4b59000)
        libpgmath.so => /usr/local/pgi/linux86-64/2019/lib/libpgmath.so (0x00007f6df4741000)
        libpgc.so => /usr/local/pgi/linux86-64/2019/lib/libpgc.so (0x00007f6df44e9000)
        librt.so.1 => /usr/lib64/librt.so.1 (0x00007f6df42e1000)
        libc.so.6 => /usr/lib64/libc.so.6 (0x00007f6df3f11000)
        libgcc_s.so.1 => /opt/ohpc/pub/compiler/gcc/7.3.0/lib64/libgcc_s.so.1 (0x00007f6df3cf9000)
        libdl.so.2 => /usr/lib/gcc/x86_64-redhat-linux/4.8.5/../../../../lib64/libdl.so.2 (0x00007f6df3af1000)
        /lib64/ld-linux-x86-64.so.2 (0x00005599f7d8e000)
        libopen-rte.so.40 => /usr/local/pgi/linux86-64-nollvm/2019/mpi/openmpi-3.1.3/lib/../lib/libopen-rte.so.40 (0x00007f6df3791000)
        libopen-pal.so.40 => /usr/local/pgi/linux86-64-nollvm/2019/mpi/openmpi-3.1.3/lib/../lib/libopen-pal.so.40 (0x00007f6df32e9000)
        librdmacm.so.1 => /usr/lib64/librdmacm.so.1 (0x00007f6df30d1000)
        libibverbs.so.1 => /usr/lib64/libibverbs.so.1 (0x00007f6df2eb9000)
        libutil.so.1 => /usr/lib64/libutil.so.1 (0x00007f6df2cb1000)
        libnl-route-3.so.200 => /usr/lib64/libnl-route-3.so.200 (0x00007f6df2a41000)
        libnl-3.so.200 => /usr/lib64/libnl-3.so.200 (0x00007f6df2819000)

# ldd wrf/WPS-4.1/metgrid/src/metgrid.exe 
        linux-vdso.so.1 =>  (0x00007fffe8e01000)
        libnetcdff.so.7 => /usr/local/src/netcdf-c-4.7.2.pgi/build/lib/libnetcdff.so.7 (0x00007fefc9711000)
        libnetcdf.so.15 => /usr/local/src/netcdf-c-4.7.2.pgi/build/lib/libnetcdf.so.15 (0x00007fefc9409000)
        libpgf90rtl.so => /usr/local/pgi/linux86-64-nollvm/19.10/lib/libpgf90rtl.so (0x00007fefc91e1000)
        libpgf90.so => /usr/local/pgi/linux86-64-nollvm/19.10/lib/libpgf90.so (0x00007fefc8bc9000)
        libpgf90_rpm1.so => /usr/local/pgi/linux86-64-nollvm/19.10/lib/libpgf90_rpm1.so (0x00007fefc89c1000)
        libpgf902.so => /usr/local/pgi/linux86-64-nollvm/19.10/lib/libpgf902.so (0x00007fefc87a9000)
        libpgftnrtl.so => /usr/local/pgi/linux86-64-nollvm/19.10/lib/libpgftnrtl.so (0x00007fefc8549000)
        libpgmp.so => /usr/local/pgi/linux86-64-nollvm/19.10/lib/libpgmp.so (0x00007fefc82c1000)
        libnuma.so.1 => /usr/local/pgi/linux86-64-nollvm/19.10/lib/libnuma.so.1 (0x00007fefc80b1000)
        libpthread.so.0 => /usr/lib/gcc/x86_64-redhat-linux/4.8.5/../../../../lib64/libpthread.so.0 (0x00007fefc7e91000)
        libpgmath.so => /usr/local/pgi/linux86-64-nollvm/19.10/lib/libpgmath.so (0x00007fefc7a79000)
        libpgc.so => /usr/local/pgi/linux86-64-nollvm/19.10/lib/libpgc.so (0x00007fefc7821000)
        librt.so.1 => /usr/lib/gcc/x86_64-redhat-linux/4.8.5/../../../../lib64/librt.so.1 (0x00007fefc7619000)
        libm.so.6 => /usr/lib/gcc/x86_64-redhat-linux/4.8.5/../../../../lib64/libm.so.6 (0x00007fefc7311000)
        libc.so.6 => /usr/lib/gcc/x86_64-redhat-linux/4.8.5/../../../../lib64/libc.so.6 (0x00007fefc6f41000)
        libgcc_s.so.1 => /usr/lib/gcc/x86_64-redhat-linux/4.8.5/../../../../lib64/libgcc_s.so.1 (0x00007fefc6d29000)
        libhdf5_hl.so.10 => /usr/local/src/hdf5-1.8.20.pgi/lib/libhdf5_hl.so.10 (0x00007fefc6af1000)
        libhdf5.so.10 => /usr/local/src/hdf5-1.8.20.pgi/lib/libhdf5.so.10 (0x00007fefc6491000)
        libz.so.1 => /usr/lib/gcc/x86_64-redhat-linux/4.8.5/../../../../lib64/libz.so.1 (0x00007fefc6279000)
        libdl.so.2 => /usr/lib/gcc/x86_64-redhat-linux/4.8.5/../../../../lib64/libdl.so.2 (0x00007fefc6071000)
        /lib64/ld-linux-x86-64.so.2 (0x000056033dce1000)
 

Attachments

  • configure.wrf
    20.5 KB · Views: 50
  • configure.wps
    3.6 KB · Views: 60
  • compile.out.2020-01-10.txt
    781.4 KB · Views: 52
  • compile.out.2020-01-10.txt
    106.3 KB · Views: 54
  • make.out.txt
    76.2 KB · Views: 57
  • netcdf-c.out.txt
    100.1 KB · Views: 52
  • netcdf-fortran.out.txt
    117 KB · Views: 52
In trying to compile WRF-4.1.3 with my own version of libz (aka zlib, compiled in /usr/local/src/zlib-1.2.11/lib), I specified the linking by updating configure.wrf from my previous posting, like so:

Code:
$ grep zlib configure.wrf
                      -L$(WRF_SRC_ROOT_DIR)/external/io_netcdf -lwrfio_nf -L/usr/local/src/netcdf-c-4.7.2.pgi/build/lib -lnetcdff -lnetcdf  -L/usr/local/src/hdf5-1.8.20.pgi/lib -lhdf5hl_fortran -lhdf5_hl -lhdf5_fortran -lhdf5 -L/usr/local/src/zlib-1.2.11/lib -lz -lm

But it doesn't appear to take:

Code:
$ ldd run/wrf.exe | grep libz
        libz.so.1 => /usr/lib64/libz.so.1 (0x00007f7373081000)

Even though "-lz" is included in my compile logs exactly 4 times, for each of wrf.exe, ndown.exe, tc.exe, and real.exe, like so:

Code:
$ grep /usr/local/src/zlib-1.2.11/lib compile.out.2020-01-11 | head -1
time /usr/local/pgi/linux86-64/2019/mpi/openmpi-3.1.3/bin/mpif90 -o wrf.exe -mp -Minfo=mp -Mrecursive -O3 -tp=istanbul  -w -Mfree -byteswapio -mp -Minfo=mp -Mrecursive    wrf.o ../main/module_wrf_top.o libwrflib.a /usr/local/src/wrf/WRF-4.1.3/external/fftpack/fftpack5/libfftpack.a /usr/local/src/wrf/WRF-4.1.3/external/io_grib1/libio_grib1.a /usr/local/src/wrf/WRF-4.1.3/external/io_grib_share/libio_grib_share.a /usr/local/src/wrf/WRF-4.1.3/external/io_int/libwrfio_int.a -L/usr/local/src/wrf/WRF-4.1.3/external/esmf_time_f90 -lesmf_time /usr/local/src/wrf/WRF-4.1.3/external/RSL_LITE/librsl_lite.a /usr/local/src/wrf/WRF-4.1.3/frame/module_internal_header_util.o /usr/local/src/wrf/WRF-4.1.3/frame/pack_utils.o  -L/usr/local/src/wrf/WRF-4.1.3/external/io_netcdf -lwrfio_nf -L/usr/local/src/netcdf-c-4.7.2.pgi/build/lib -lnetcdff -lnetcdf  -L/usr/local/src/hdf5-1.8.20.pgi/lib -lhdf5hl_fortran -lhdf5_hl -lhdf5_fortran -lhdf5 -L/usr/local/src/zlib-1.2.11/lib -lz -lm

zlib is in my $LD_LIBRARY_PATH, though not before /usr/lib64:

Code:
 $ echo $LD_LIBRARY_PATH | tr : "\n"
/opt/ohpc/pub/compiler/gcc/7.3.0/lib64
/usr/local/src/hdf5-1.8.20.pgi/lib
/usr/local/src/libpng-1.6.37/lib
/usr/local/src/netcdf-c-4.7.2.pgi/build/lib
/usr/local/pgi/linux86-64/2019/lib
/usr/local/pgi/linux86-64/2019/libso
/usr/lib64
/usr/local/src/szip-2.1.1/lib
/usr/local/src/zlib-1.2.11/lib

And rpm -qf /usr/lib64/libz.so.1 confirms that file is from the zlib-1.2.7-18.el7.x86_64 package.
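
I suppose I could either move the zlib directory ahead of /usr/lib64 in LD_LIBRARY_PATH, or bake an rpath into the link line in configure.wrf - something like this sketch (paths are the ones from above):

Code:
# Option 1 (run time): put the custom zlib ahead of /usr/lib64
setenv LD_LIBRARY_PATH /usr/local/src/zlib-1.2.11/lib:${LD_LIBRARY_PATH}

# Option 2 (link time): record the path in the executable via an rpath,
# by extending the link flags in configure.wrf:
#   -L/usr/local/src/zlib-1.2.11/lib -Wl,-rpath,/usr/local/src/zlib-1.2.11/lib -lz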

Can anyone suggest the correct way to use my own version of this lib?

Thanks,

Bart
 
If there's reason to suspect that the libraries may be at fault, one worthwhile test may be to recompile WRF with netCDF4 features disabled. You can do this by setting the NETCDF_classic environment variable to a value of 1, then cleaning, reconfiguring, and recompiling WRF and the WPS.
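
Roughly, in csh, that would look like the following for WRF (the WPS steps are analogous, with plain ./compile; this is just a sketch following your my.compile scripts):

Code:
setenv NETCDF_classic 1
./clean -a
./configure
./compile em_real >&! compile.netcdf_classic.out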

For what it's worth, I did try running the metgrid program from WPS v4.1 locally with valgrind to look for memory leaks. With the GNU compilers, there is indeed a leak when reading static fields, but that leak is not in any time loop, and the amount of memory that is lost doesn't increase with the number of time periods that metgrid processes. I wasn't able to get valgrind to work with the metgrid executable compiled with the PGI compilers (even after some fussing with the target architecture flags), but you may have better luck than I did. If you happen to have valgrind available, could you give the following a try?

Code:
valgrind --tool=memcheck --leak-check=full --show-leak-kinds=all ./metgrid.exe
 
Hi mgduda, thanks for the reply. I can upload all 1.8 MB of valgrind's output, if that would be helpful.

Valgrind didn't write out very much until the memory usage was about 98%, just a bunch of lines like this:

Code:
==10629== Memcheck, a memory error detector
==10629== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==10629== Using Valgrind-3.15.0 and LibVEX; rerun with -h for copyright info
==10629== Command: /usr/local/src/wrf/WPS-4.1/metgrid.exe
==10629==
...snip...
==10629== Warning: set address range perms: large range [0x59ea3040, 0xf466fb80) (undefined)
==10629== Warning: set address range perms: large range [0xf4670040, 0x18ee3cd90) (undefined)
==10629== Warning: set address range perms: large range [0xf4670028, 0x18ee3cda8) (noaccess)
==10629== Warning: set address range perms: large range [0xf4670040, 0x18ee3cfa0) (undefined)
==10629== Warning: set address range perms: large range [0x18ee3d040, 0x22960a1b0) (undefined)
==10629== Warning: set address range perms: large range [0x18ee3d028, 0x22960a1c8) (noaccess)

That was repeated with each new time-step (3 hours apart). When the memory was gone, valgrind wrote this:

Code:
==10629== Warning: set address range perms: large range [0x1f48cfd040, 0x1fe34ca5d0) (undefined)
0: ALLOCATE: 2591856000 bytes requested; not enough memory
==10629==
==10629== HEAP SUMMARY:
==10629==     in use at exit: 106,365,652,315 bytes in 14,940 blocks
==10629==   total heap usage: 6,874,915 allocs, 6,859,975 frees, 314,204,281,776 bytes allocated
==10629==
==10629== 1 bytes in 1 blocks are still reachable in loss record 1 of 2,312
==10629==    at 0x4C29E4B: malloc (vg_replace_malloc.c:309)
==10629==    by 0x8012CAD: H5MM_malloc (H5MM.c:64)
==10629==    by 0x8089D16: H5P_register_real (H5Pint.c:1069)
==10629==    by 0x80A558F: H5P__ocrt_reg_prop (H5Pocpl.c:154)
==10629==    by 0x807EB38: H5P_init (H5Pint.c:465)
==10629==    by 0x7E8A487: H5_init_library (in /usr/local/src/hdf5-1.8.20.pgi/lib/libhdf5.so.10.3.1)
==10629==    by 0x7F3E347: H5Eset_auto2 (H5E.c:1653)
==10629==    by 0x52D0D23: set_auto (hdf5internal.c:67)
==10629==    by 0x1FFEFF9CEF: ???
==10629==    by 0x52D002C: nc4_hdf5_initialize (hdf5internal.c:78)
==10629==    by 0x1FFEFF9CFF: ???
==10629==    by 0x52D9185: NC_HDF5_initialize (hdf5dispatch.c:120)
==10629==
==10629== 1 bytes in 1 blocks are still reachable in loss record 2 of 2,312
==10629==    at 0x4C29E4B: malloc (vg_replace_malloc.c:309)
==10629==    by 0x8012CAD: H5MM_malloc (H5MM.c:64)
==10629==    by 0x8089D16: H5P_register_real (H5Pint.c:1069)
==10629==    by 0x807A4DC: H5P_fcrt_reg_prop (H5Pfcpl.c:173)
==10629==    by 0x807EB38: H5P_init (H5Pint.c:465)
==10629==    by 0x7E8A487: H5_init_library (in /usr/local/src/hdf5-1.8.20.pgi/lib/libhdf5.so.10.3.1)
==10629==    by 0x7F3E347: H5Eset_auto2 (H5E.c:1653)
==10629==    by 0x52D0D23: set_auto (hdf5internal.c:67)
==10629==    by 0x1FFEFF9CEF: ???
==10629==    by 0x52D002C: nc4_hdf5_initialize (hdf5internal.c:78)
==10629==    by 0x1FFEFF9CFF: ???
==10629==    by 0x52D9185: NC_HDF5_initialize (hdf5dispatch.c:120)

That was repeated with the number of bytes increasing in each record, mostly naming "H5" (i.e. HDF5) routines. The "definitely lost" reports start at record 2,227:

Code:
==10629== 3,200 bytes in 20 blocks are definitely lost in loss record 2,227 of 2,312
==10629==    at 0x4C29E4B: malloc (vg_replace_malloc.c:309)
==10629==    by 0x5989CDE: __fort_malloc_without_abort (in /usr/local/pgi/linux86-64-nollvm/19.10/lib/libpgf90.so)
==10629==    by 0x5989F04: __fort_gmalloc_without_abort (in /usr/local/pgi/linux86-64-nollvm/19.10/lib/libpgf90.so)
==10629==    by 0x59F40D6: __alloc04_i8 (in /usr/local/pgi/linux86-64-nollvm/19.10/lib/libpgf90.so)
==10629==    by 0x59F1671: pgf90_alloc04a_i8 (in /usr/local/pgi/linux86-64-nollvm/19.10/lib/libpgf90.so)
==10629==    by 0x481872: storage_module_storage_get_levels_ (storage_module.f90:1080)
==10629==    by 0x463EAA: process_domain_module_derive_mpas_fields_ (process_domain_module.f90:1999)
==10629==    by 0x452C7E: process_domain_module_process_single_met_time_ (process_domain_module.f90:853)
==10629==    by 0x44AE8C: process_domain_module_process_domain_ (process_domain_module.f90:169)
==10629==    by 0x4123F3: MAIN_ (metgrid.f90:75)
==10629==    by 0x408473: main (in /usr/local/src/wrf/WPS-4.1/metgrid/src/metgrid.exe)

Line 1080 of WPS-4.1/metgrid/src/storage_module.f90 is indeed a call to allocate():

Code:
  1071        n = 0
  1072        ! At this point, name_cursor points to a valid head node for fieldname
  1073        data_cursor => name_cursor%fieldlist_head
  1074        do while ( associated(data_cursor) )
  1075           n = n + 1
  1076           if (.not. associated(data_cursor%next)) exit
  1077           data_cursor => data_cursor%next
  1078        end do
  1079
  1080        if (n > 0) allocate(list(n))
  1081
  1082        n = 1
  1083        do while ( associated(data_cursor) )
  1084           list(n) = get_level(data_cursor%fg_data)
  1085           n = n + 1
  1086           data_cursor => data_cursor%prev
  1087        end do

But is that problem reported by valgrind just that I'm out of memory, and can't allocate() anything?
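
(As I understand it, valgrind reports "definitely lost" for blocks that are allocated and never freed, regardless of whether memory ran out. A completely made-up pattern like the one below would be flagged the same way - I'm not claiming this is what storage_module.f90 actually does:)

Code:
program leak_sketch
   implicit none
   integer :: i
   do i = 1, 1000
      call work()             ! each call allocates and never frees
   end do
contains
   subroutine work()
      real, pointer :: list(:)
      allocate(list(1000))
      list = 0.0
      ! missing deallocate(list): once the pointer goes out of scope,
      ! valgrind reports the block as "definitely lost"
   end subroutine work
end program leak_sketch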

Valgrind's output ends like this:

Code:
==10629== 38,878,096,560 bytes in 15 blocks are possibly lost in loss record 2,311 of 2,312
==10629==    at 0x4C29E4B: malloc (vg_replace_malloc.c:309)
==10629==    by 0x5989CDE: __fort_malloc_without_abort (in /usr/local/pgi/linux86-64-nollvm/19.10/lib/libpgf90.so)
==10629==    by 0x5989F04: __fort_gmalloc_without_abort (in /usr/local/pgi/linux86-64-nollvm/19.10/lib/libpgf90.so)
==10629==    by 0x59F40D6: __alloc04_i8 (in /usr/local/pgi/linux86-64-nollvm/19.10/lib/libpgf90.so)
==10629==    by 0x59F1671: pgf90_alloc04a_i8 (in /usr/local/pgi/linux86-64-nollvm/19.10/lib/libpgf90.so)
==10629==    by 0x4767FA: read_met_module_read_next_met_field_ (read_met_module.f90:388)
==10629==    by 0x45442B: process_domain_module_process_intermediate_fields_ (process_domain_module.f90:1083)
==10629==    by 0x4524B8: process_domain_module_process_single_met_time_ (process_domain_module.f90:787)
==10629==    by 0x44AE8C: process_domain_module_process_domain_ (process_domain_module.f90:169)
==10629==    by 0x4123F3: MAIN_ (metgrid.f90:75)
==10629==    by 0x408473: main (in /usr/local/src/wrf/WPS-4.1/metgrid/src/metgrid.exe)
==10629==
==10629== 41,470,029,680 bytes in 16 blocks are possibly lost in loss record 2,312 of 2,312
==10629==    at 0x4C29E4B: malloc (vg_replace_malloc.c:309)
==10629==    by 0x5989CDE: __fort_malloc_without_abort (in /usr/local/pgi/linux86-64-nollvm/19.10/lib/libpgf90.so)
==10629==    by 0x5989F04: __fort_gmalloc_without_abort (in /usr/local/pgi/linux86-64-nollvm/19.10/lib/libpgf90.so)
==10629==    by 0x59F40D6: __alloc04_i8 (in /usr/local/pgi/linux86-64-nollvm/19.10/lib/libpgf90.so)
==10629==    by 0x59F1671: pgf90_alloc04a_i8 (in /usr/local/pgi/linux86-64-nollvm/19.10/lib/libpgf90.so)
==10629==    by 0x4767FA: read_met_module_read_next_met_field_ (read_met_module.f90:388)
==10629==    by 0x464F4E: process_domain_module_get_interp_masks_ (process_domain_module.f90:2096)
==10629==    by 0x454211: process_domain_module_process_intermediate_fields_ (process_domain_module.f90:1072)
==10629==    by 0x4524B8: process_domain_module_process_single_met_time_ (process_domain_module.f90:787)
==10629==    by 0x44AE8C: process_domain_module_process_domain_ (process_domain_module.f90:169)
==10629==    by 0x4123F3: MAIN_ (metgrid.f90:75)
==10629==    by 0x408473: main (in /usr/local/src/wrf/WPS-4.1/metgrid/src/metgrid.exe)
==10629==
==10629== LEAK SUMMARY:
==10629==    definitely lost: 25,933,015,056 bytes in 8,908 blocks
==10629==    indirectly lost: 120,960 bytes in 504 blocks
==10629==      possibly lost: 80,431,669,184 bytes in 2,454 blocks
==10629==    still reachable: 847,115 bytes in 3,074 blocks
==10629==         suppressed: 0 bytes in 0 blocks

The first errors were about the HDF5 libs; there were a few about the netCDF libs, a few about libnuma and other system libs, more about pgfortran's runtime libs, and some that point to locations in metgrid's code.

The first instance I can find that points to metgrid's own code is at interp_option_module.f90:97, which looks like this:

Code:
    88        ! Allocate one extra array element to act as the default
    89  ! BUG: Maybe this will not be necessary if we move to a module with query routines for
    90  !  parsing the METGRID.TBL
    91        num_entries = num_entries + 1
    92
    93        allocate(fieldname(num_entries))
    94        allocate(mpas_name(num_entries))
    95        allocate(interp_method(num_entries))
    96        allocate(v_interp_method(num_entries))
    97        allocate(masked(num_entries))
    98        allocate(fill_missing(num_entries))
    99        allocate(missing_value(num_entries))
   100        allocate(fill_lev_list(num_entries))
   101        allocate(interp_mask(num_entries))
   102        allocate(interp_land_mask(num_entries))
   103        allocate(interp_water_mask(num_entries))
   104        allocate(interp_mask_val(num_entries))
   105        allocate(interp_land_mask_val(num_entries))
   106        allocate(interp_water_mask_val(num_entries))
   107        allocate(interp_mask_relational(num_entries))
   108        allocate(interp_land_mask_relational(num_entries))
   109        allocate(interp_water_mask_relational(num_entries))
   110        allocate(level_template(num_entries))
   111        allocate(flag_in_output(num_entries))
   112        allocate(output_name(num_entries))
   113        allocate(from_input(num_entries))
   114        allocate(z_dim_name(num_entries))
   115        allocate(output_stagger(num_entries))
   116        allocate(output_this_field(num_entries))
   117        allocate(is_u_field(num_entries))
   118        allocate(is_v_field(num_entries))
   119        allocate(is_derived_field(num_entries))
   120        allocate(is_mandatory(num_entries))

I do not have the word "masked" in my namelist.wps, nor is it mentioned in namelist.wps.all_options. And I think the "BUG" warning in that comment describes a potential bug, not a current (known) one.

I'm at a loss to continue. Any ideas?

Bart
 
I have narrowed the memory leak down to runs that use high-resolution SST datasets.

In this case I was using ERA5 and the GHRSST 1.3 km MUR dataset from JPL's PODAAC (https://podaac.jpl.nasa.gov/dataset/MUR-JPL-L4-GLOB-v4.1). It comes in netCDF, so I wrote a couple of Fortran programs to essentially replace UNGRIB: one reads the netCDF and writes WPS Intermediate File Format, and the other interpolates those files to every interval_seconds (6 hours, in my case). I name the files SST:YYYY-MM-DD_HH and add 'SST' to the fg_name in namelist.wps.

(Because some of the GHRSST products from JPL's PODAAC are not global, I can't just remove the SST line from the Vtable for the UNGRIB run on the ERA5 files - I need to retain the SST values that are outside the GHRSST dataset's coverage. It's the same idea as when I use SNODAS for North America: there are parts of the domain that are outside the SNODAS datasets' coverage. If I don't interpolate the GHRSST data to every interval_seconds, then METGRID will use the ERA5 SST for the other three periods per day, which is not what I want.)
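
(In case it's useful, here is a bare-bones sketch of the "write" side of my converter. It just follows the record layout described in the WPS Users' Guide for the lat-lon (IPROJ=0) case, with placeholder dimensions and values rather than the real MUR grid; the file also has to be written big-endian, e.g. with -byteswapio under pgfortran:)

Code:
program write_sst_int
   implicit none
   integer, parameter :: nx = 360, ny = 180          ! placeholder grid, not the MUR grid
   integer            :: ifv = 5, iproj = 0, iunit = 10
   character(len=24)  :: hdate = '2014-12-31_12:00:00'
   real               :: xfcst = 0.0, xlvl = 200100.0  ! 200100 = surface field
   character(len=32)  :: map_source = 'GHRSST MUR (placeholder)'
   character(len=9)   :: field = 'SST'
   character(len=25)  :: units = 'K'
   character(len=46)  :: desc = 'Sea-surface temperature'
   character(len=8)   :: startloc = 'SWCORNER'
   real               :: startlat = -89.5, startlon = -179.5
   real               :: deltalat = 1.0, deltalon = 1.0
   real               :: earth_radius = 6367.470
   logical            :: is_wind_earth_rel = .false.
   real               :: slab(nx,ny)

   slab = 285.0     ! placeholder SST values, in K

   open(iunit, file='SST:2014-12-31_12', form='unformatted', status='replace')
   write(iunit) ifv
   write(iunit) hdate, xfcst, map_source, field, units, desc, xlvl, nx, ny, iproj
   write(iunit) startloc, startlat, startlon, deltalat, deltalon, earth_radius
   write(iunit) is_wind_earth_rel
   write(iunit) slab
   close(iunit)
end program write_sst_int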

When I compile WPS (many versions tested) I use the -mcmodel=medium flag for pgfortran. Only the global 1 km GHRSST product actually requires it; the others just barely fit. But my PODAAC-to-WPS_INT converter does require it, because the raw netCDF files from JPL are dimensioned 17999 latitudes by 36000 longitudes.
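
(The converter itself is built with something along these lines; the program name here is just a placeholder, and the netCDF paths are the ones from my stack above:)

Code:
pgfortran -mcmodel=medium -O2 -o podaac_to_int podaac_to_int.f90 \
    -I/usr/local/src/netcdf-c-4.7.2.pgi/build/include \
    -L/usr/local/src/netcdf-c-4.7.2.pgi/build/lib -lnetcdff -lnetcdf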

I didn't notice this before, because when I have a bunch of WPS runs going on a compute node, I see this:

Code:
# rtop 0
---- compute-0-0 ----
top - 15:30:03 up 84 days,  4:48,  0 users,  load average: 14.14, 14.36, 14.19
Tasks: 425 total,   5 running, 420 sleeping,   0 stopped,   0 zombie
Cpu(s): 18.4%us,  5.5%sy,  0.0%ni, 63.6%id, 12.2%wa,  0.0%hi,  0.3%si,  0.0%st
Mem:  66010804k total, 64595052k used,  1415752k free,     1708k buffers
Swap: 134215032k total,  7180512k used, 127034520k free, 33158420k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
12366 bbrasher  25   0 5175m 3.4g 2156 R 98.0  5.4  41:06.07 metgrid.exe
12175 bbrasher  25   0 5194m 4.9g 463m R 96.2  7.7 104:57.03 metgrid.exe
12427 bbrasher  25   0 5194m 4.9g 2152 R 96.2  7.7 106:22.14 metgrid.exe
12488 bbrasher  18   0 5200m 4.9g 2.4g R 30.9  7.8  71:37.03 metgrid.exe
11806 bbrasher  18   0 5175m 5.0g 2.4g D 29.1  7.9  37:18.50 metgrid.exe
11242 bbrasher  18   0 5173m 4.3g 2156 D 25.4  6.9  31:17.94 metgrid.exe
 5859 root      11  -5     0    0    0 S 18.2  0.0 267:03.04 rpciod/21
 5540 bbrasher  15   0 13024 1256  716 R  5.4  0.0   0:00.05 top
 6547 root      10  -5     0    0    0 S  3.6  0.0 983:10.61 nfsiod
    1 root      15   0 10368  516  484 S  0.0  0.0   2:54.12 init

Nothing really alarming there, except that the runs take waaaaay too long, like 80 hours. When I run just one on a compute node, I get this:

Code:
# rtop 1
---- compute-0-1 ----
top - 15:11:20 up 92 days,  4:10,  0 users,  load average: 4.58, 2.70, 2.25
Tasks: 381 total,   6 running, 375 sleeping,   0 stopped,   0 zombie
Cpu(s): 17.9%us,  2.9%sy,  0.0%ni, 75.8%id,  3.4%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  66010804k total, 65913700k used,    97104k free,     1464k buffers
Swap: 134215032k total, 84803852k used, 49411180k free,  5566232k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
  798 bbrasher  25   0  137g  56g 1648 R 100.1 90.4  56:42.94 metgrid.exe
 1113 root      20  -5     0    0    0 R 98.2  0.0  13:26.52 kswapd0
 1114 root      20  -5     0    0    0 R 98.2  0.0  14:15.10 kswapd1
 1116 root      12  -5     0    0    0 R 98.2  0.0  13:57.34 kswapd3
 1115 root      20  -5     0    0    0 R 83.1  0.0  13:42.50 kswapd2
 6023 bbrasher  15   0 12892 1212  716 R  5.7  0.0   0:00.05 top
    1 root      15   0 10368  516  484 S  0.0  0.0   0:28.59 init
    2 root      RT  -5     0    0    0 S  0.0  0.0   0:07.28 migration/0
    3 root      34  19     0    0    0 S  0.0  0.0   0:01.32 ksoftirqd/0
    4 root      RT  -5     0    0    0 S  0.0  0.0   0:00.00 watchdog/0

METGRID is using 137 GB of virtual memory, and swapping is consuming the equivalent of almost 4 CPUs. That's the cause of the slowness.

The SST:YYYY-MM-DD_HH files are 2.5 GB each; I can upload one somewhere if you'd like. I can also post the program that converts to WPS intermediate file format.

Suggestions?
 
Apologies for the long silence on my part! I spent a few minutes looking through the metgrid code in some of the areas mentioned in the valgrind output that you quoted; unfortunately, nothing obvious stood out as being the source of a memory leak.

If you could upload just one of the SST intermediate files to https://nextcloud.mmm.ucar.edu/nextcloud/index.php/s/DkPamzUKYhQkWEI (i.e., the Nextcloud link mentioned on the main forum page), I'll try to reproduce the leak locally using the PGI compilers.
 
OK, I've uploaded one file named SST_2014-12-31_12. I had to change the : to an _ because of Windows limitations; it should be SST:2014-12-31_12.

Let me know what you find. This problem has been plaguing me for quite a while now. I thought it appeared when I changed to a newer version of CentOS, and from RocksCluster.org to OHPC.community. But I was recently able to see it on the older OS too, so my attempts to get HDF5 to use my compilation of zlib were just a red herring.

Thanks!
 
Any progress on this? Would it help to upload another SST file? From watching 'top' as the files are created, I believe metgrid fails to deallocate memory at the end of each time step (6 h in my case).
 
Please see the related thread at https://forum.mmm.ucar.edu/phpBB3/viewtopic.php?f=31&t=9049.

I worked around this problem by extracting a sub-domain of the full (global, 0.01 degree, ~1km) SST dataset.
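
(One way to do that extraction, in case it helps anyone else, is with NCO's ncks; the file names and lat/lon bounds below are just placeholders covering the WRF domain plus a margin:)

Code:
# Cut the global MUR file down to a box covering the WRF domain (plus a margin)
# before converting it to WPS intermediate format
ncks -d lat,20.0,55.0 -d lon,-130.0,-60.0 mur_global.nc mur_subset.nc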

I included the LANDSEA mask in the SST:YYYY-MM-DD_HH Intermediate files - using the same name for the field as the LANDSEA in the ERA5 files that ungrib processes.

metgrid runs complete d01 and about 75% of the d02 files, but then crash with "ERROR: get_min(): No items left in the heap." That sounds like a memory issue, so it might be related to this thread.

I do not get this crash when I skip using the high-res SST file, and just rely on the ERA5 SST fields.

Thanks,

Bart
 