bartbrashers
New member
Something is wrong with my build that causes a memory leak in metgrid.exe. As it runs, it uses more and more virtual memory until all physical RAM and swap are exhausted, and then metgrid gets killed. Here's an example from a 6-day run on a compute node with 64 GB of RAM:
Code:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
6342 bbrashe+ 20 0 34.4g 34.3g 5336 R 99.7 54.6 4:04.51 metgrid.exe
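To see when the growth happens rather than relying on a single top snapshot, one option is to log the process's memory at a fixed interval while metgrid.exe runs. The following is a minimal sketch, assuming a Linux node where /proc/<pid>/status is readable; log_mem and the log file name are hypothetical names, not part of WPS.

```shell
#!/bin/sh
# Print "elapsed_seconds VmSize_kB VmRSS_kB" for one process at a fixed interval.
# Linux-only: reads /proc/<pid>/status. log_mem is a hypothetical helper name.
log_mem() {
    pid=$1
    interval=${2:-10}   # seconds between samples
    count=${3:-0}       # number of samples; 0 means "until the process exits"
    start=$(date +%s)
    n=0
    while [ -r "/proc/$pid/status" ]; do
        now=$(date +%s)
        # Pull the virtual and resident sizes (reported in kB) from the status file.
        awk -v t="$((now - start))" '
            /^VmSize:/ { vsz = $2 }
            /^VmRSS:/  { rss = $2 }
            END { if (vsz != "") print t, vsz, rss }
        ' "/proc/$pid/status" 2>/dev/null
        n=$((n + 1))
        if [ "$count" -gt 0 ] && [ "$n" -ge "$count" ]; then
            break
        fi
        sleep "$interval"
    done
}

# Example (assumes metgrid.exe is already running under the current user):
#   log_mem "$(pgrep -n metgrid.exe)" 10 > metgrid_mem.log
```

Comparing the timestamps in such a log with the time periods reported in metgrid.log may show whether the memory jumps once per processed time period or grows continuously within each one.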
I get this behavior with multiple versions of WRF and WPS (from 3.8 to 4.1.3), on two different versions of CentOS, using three different versions of the PGI compilers. I just switched from the LLVM to the non-LLVM version of PGI 19.10 and recompiled everything in the software stack:
jasper-1.900.1
libpng-1.6.37
zlib-1.2.11
hdf5-1.8.20
netcdf-c-4.7.2
netcdf-fortran-4.5.2
This is on CentOS Linux release 7.4.1708, using the 3.10.0-693.el7.x86_64 x86_64 kernel.
Running ldd on the binaries built by some of the libraries above (e.g. libpng-1.6.37/bin/pngfix) shows they are using my builds of zlib, etc.
I have tried this using the yum-installed versions of libpng, zlib, and jasper. I've tried with and without hdf5 (disabling netcdf4).
I'll post some output from ldd and compile scripts below, and attach my configure.wrf, configure.wps, and the logs from compilation. Hopefully someone can see something wrong.
If not, any clues as to how to track down where the memory leak is coming from would be greatly appreciated.
Code:
# ldd libpng-1.6.37/bin/pngfix
linux-vdso.so.1 => (0x00007ffe8b9a1000)
libpng16.so.16 => /usr/local/src/libpng-1.6.37/lib/libpng16.so.16 (0x00007f4c279a9000)
libm.so.6 => /usr/lib/gcc/x86_64-redhat-linux/4.8.5/../../../../lib64/libm.so.6 (0x00007f4c276a1000)
libz.so.1 => /usr/local/src/zlib-1.2.11/lib/libz.so.1 (0x00007f4c27489000)
libpgmp.so => /usr/local/pgi/linux86-64-nollvm/19.10/lib/libpgmp.so (0x00007f4c27201000)
libnuma.so.1 => /usr/local/pgi/linux86-64-nollvm/19.10/lib/libnuma.so.1 (0x00007f4c26ff1000)
libpthread.so.0 => /usr/lib/gcc/x86_64-redhat-linux/4.8.5/../../../../lib64/libpthread.so.0 (0x00007f4c26dd1000)
libpgmath.so => /usr/local/pgi/linux86-64-nollvm/19.10/lib/libpgmath.so (0x00007f4c269b9000)
libpgc.so => /usr/local/pgi/linux86-64-nollvm/19.10/lib/libpgc.so (0x00007f4c26761000)
libc.so.6 => /usr/lib/gcc/x86_64-redhat-linux/4.8.5/../../../../lib64/libc.so.6 (0x00007f4c26391000)
libgcc_s.so.1 => /usr/lib/gcc/x86_64-redhat-linux/4.8.5/../../../../lib64/libgcc_s.so.1 (0x00007f4c26179000)
/lib64/ld-linux-x86-64.so.2 (0x000055acdda1f000)
# ldd zlib-1.2.11/example64
linux-vdso.so.1 => (0x00007ffd09c79000)
libpgmp.so => /usr/local/pgi/linux86-64-nollvm/19.10/lib/libpgmp.so (0x00007f7b47051000)
libnuma.so.1 => /usr/local/pgi/linux86-64-nollvm/19.10/lib/libnuma.so.1 (0x00007f7b46e41000)
libpthread.so.0 => /usr/lib/gcc/x86_64-redhat-linux/4.8.5/../../../../lib64/libpthread.so.0 (0x00007f7b46c21000)
libpgmath.so => /usr/local/pgi/linux86-64-nollvm/19.10/lib/libpgmath.so (0x00007f7b46809000)
libpgc.so => /usr/local/pgi/linux86-64-nollvm/19.10/lib/libpgc.so (0x00007f7b465b1000)
libm.so.6 => /usr/lib/gcc/x86_64-redhat-linux/4.8.5/../../../../lib64/libm.so.6 (0x00007f7b462a9000)
libc.so.6 => /usr/lib/gcc/x86_64-redhat-linux/4.8.5/../../../../lib64/libc.so.6 (0x00007f7b45ed9000)
libgcc_s.so.1 => /usr/lib/gcc/x86_64-redhat-linux/4.8.5/../../../../lib64/libgcc_s.so.1 (0x00007f7b45cc1000)
/lib64/ld-linux-x86-64.so.2 (0x00005584927de000)
# ldd hdf5-1.8.20.pgi/bin/h5copy
linux-vdso.so.1 => (0x00007ffde7721000)
libhdf5.so.10 => /usr/local/src/hdf5-1.8.20.pgi/lib/libhdf5.so.10 (0x00007fca87599000)
libz.so.1 => /usr/lib/gcc/x86_64-redhat-linux/4.8.5/../../../../lib64/libz.so.1 (0x00007fca87381000)
libdl.so.2 => /usr/lib/gcc/x86_64-redhat-linux/4.8.5/../../../../lib64/libdl.so.2 (0x00007fca87179000)
libm.so.6 => /usr/lib/gcc/x86_64-redhat-linux/4.8.5/../../../../lib64/libm.so.6 (0x00007fca86e71000)
libpgmp.so => /usr/local/pgi/linux86-64-nollvm/19.10/lib/libpgmp.so (0x00007fca86be9000)
libnuma.so.1 => /usr/local/pgi/linux86-64-nollvm/19.10/lib/libnuma.so.1 (0x00007fca869d9000)
libpthread.so.0 => /usr/lib/gcc/x86_64-redhat-linux/4.8.5/../../../../lib64/libpthread.so.0 (0x00007fca867b9000)
libpgmath.so => /usr/local/pgi/linux86-64-nollvm/19.10/lib/libpgmath.so (0x00007fca863a1000)
libpgc.so => /usr/local/pgi/linux86-64-nollvm/19.10/lib/libpgc.so (0x00007fca86149000)
libc.so.6 => /usr/lib/gcc/x86_64-redhat-linux/4.8.5/../../../../lib64/libc.so.6 (0x00007fca85d79000)
libgcc_s.so.1 => /usr/lib/gcc/x86_64-redhat-linux/4.8.5/../../../../lib64/libgcc_s.so.1 (0x00007fca85b61000)
/lib64/ld-linux-x86-64.so.2 (0x0000564e1f4fe000)
# ldd netcdf-c-4.7.2.pgi/build/bin/ncgen3
linux-vdso.so.1 => (0x00007ffce1e01000)
libnetcdf.so.15 => /usr/local/src/netcdf-c-4.7.2.pgi/build/lib/libnetcdf.so.15 (0x00007fa0ae831000)
libhdf5_hl.so.10 => /usr/local/src/hdf5-1.8.20.pgi/lib/libhdf5_hl.so.10 (0x00007fa0ae5f9000)
libhdf5.so.10 => /usr/local/src/hdf5-1.8.20.pgi/lib/libhdf5.so.10 (0x00007fa0adf99000)
libm.so.6 => /usr/lib/gcc/x86_64-redhat-linux/4.8.5/../../../../lib64/libm.so.6 (0x00007fa0adc91000)
libdl.so.2 => /usr/lib/gcc/x86_64-redhat-linux/4.8.5/../../../../lib64/libdl.so.2 (0x00007fa0ada89000)
libz.so.1 => /usr/lib/gcc/x86_64-redhat-linux/4.8.5/../../../../lib64/libz.so.1 (0x00007fa0ad871000)
libpgmp.so => /usr/local/pgi/linux86-64-nollvm/19.10/lib/libpgmp.so (0x00007fa0ad5e9000)
libnuma.so.1 => /usr/local/pgi/linux86-64-nollvm/19.10/lib/libnuma.so.1 (0x00007fa0ad3d9000)
libpthread.so.0 => /usr/lib/gcc/x86_64-redhat-linux/4.8.5/../../../../lib64/libpthread.so.0 (0x00007fa0ad1b9000)
libpgmath.so => /usr/local/pgi/linux86-64-nollvm/19.10/lib/libpgmath.so (0x00007fa0acda1000)
libpgc.so => /usr/local/pgi/linux86-64-nollvm/19.10/lib/libpgc.so (0x00007fa0acb49000)
libc.so.6 => /usr/lib/gcc/x86_64-redhat-linux/4.8.5/../../../../lib64/libc.so.6 (0x00007fa0ac779000)
libgcc_s.so.1 => /usr/lib/gcc/x86_64-redhat-linux/4.8.5/../../../../lib64/libgcc_s.so.1 (0x00007fa0ac561000)
/lib64/ld-linux-x86-64.so.2 (0x000056006a494000)
# netcdf-c-4.7.2.pgi/build/bin/nc-config --all
This netCDF 4.7.2 has been built with the following features:
--cc -> pgcc
--cflags -> -I/usr/local/src/netcdf-c-4.7.2.pgi/build/include -I/usr/local/src/hdf5-1.8.20.pgi/include
--libs -> -L/usr/local/src/netcdf-c-4.7.2.pgi/build/lib -lnetcdf
--static -> -lhdf5_hl -lhdf5 -lm -ldl -lz
--has-c++ -> no
--cxx ->
--has-c++4 -> no
--cxx4 ->
--has-fortran -> yes
--fc -> pgfortran
--fflags -> -I/usr/local/src/netcdf-c-4.7.2.pgi/build/include
--flibs -> -L/usr/local/src/netcdf-c-4.7.2.pgi/build/lib -lnetcdff -L/usr/local/src/netcdf-c-4.7.2.pgi/build/lib -lnetcdf -lnetcdf -ldl -lm
--has-f90 ->
--has-f03 -> yes
--has-dap -> no
--has-dap2 -> no
--has-dap4 -> no
--has-nc2 -> yes
--has-nc4 -> yes
--has-hdf5 -> yes
--has-hdf4 -> no
--has-logging -> no
--has-pnetcdf -> no
--has-szlib -> no
--has-cdf5 -> yes
--has-parallel4 -> no
--has-parallel -> no
--prefix -> /usr/local/src/netcdf-c-4.7.2.pgi/build
--includedir -> /usr/local/src/netcdf-c-4.7.2.pgi/build/include
--libdir -> /usr/local/src/netcdf-c-4.7.2.pgi/build/lib
--version -> netCDF 4.7.2
# cat wrf/WRF-4.1.3/my.compile
#!/bin/csh -f
setenv NETCDF /usr/local/src/wrf/netcdf-c-4.7.2/build
setenv NETCDFHOME $NETCDF
setenv NETCDF_DIR $NETCDF
if !($?LD_LIBRARY_PATH) then
    setenv LD_LIBRARY_PATH $NETCDF_DIR/lib
else
    if ( "$LD_LIBRARY_PATH" !~ *$NETCDF_DIR/lib* ) then
        setenv LD_LIBRARY_PATH $NETCDF_DIR/lib:${LD_LIBRARY_PATH}
    endif
endif
setenv NETCDF4 0
setenv WRFIO_NCD_LARGE_FILE_SUPPORT 1
echo "Cleaning"
clean -a >& /dev/null
if (-e my.configure.wrf) then
    cp my.configure.wrf configure.wrf
    echo "Using existing my.configure.wrf"
else
    configure
    cp configure.wrf my.configure.wrf
    echo "Edit my.configure.wrf and re-run $0:t"
    exit
endif
echo "Compiling"
./compile em_real >&! compile.out.`date +%Y-%m-%d`
echo "Done"
# cat wrf/WPS-4.1/my.compile
#!/bin/csh -f
setenv NETCDF /usr/local/src/netcdf-c-4.7.2.pgi/build
setenv NETCDFHOME $NETCDF
setenv NETCDF_DIR $NETCDF
if !($?LD_LIBRARY_PATH) then
    setenv LD_LIBRARY_PATH $NETCDF_DIR/lib
else
    if ( "$LD_LIBRARY_PATH" !~ *$NETCDF_DIR/lib* ) then
        setenv LD_LIBRARY_PATH $NETCDF_DIR/lib:${LD_LIBRARY_PATH}
    endif
endif
setenv NETCDF4 0
setenv WRFIO_NCD_LARGE_FILE_SUPPORT 1
echo "Cleaning"
clean -a >& /dev/null
if (-e my.configure.wps) then
    cp my.configure.wps configure.wps
    echo "Using existing my.configure.wps"
else
    configure
    cp configure.wps my.configure.wps
    echo "Edit my.configure.wps and re-run $0:t"
    exit
endif
echo "Compiling"
compile >&! compile.out.`date +%Y-%m-%d`
echo "Done"
# ldd wrf/WRF-4.1.3/run/wrf.exe
linux-vdso.so.1 => (0x00007ffc245c1000)
libnetcdff.so.7 => /usr/local/src/netcdf-c-4.7.2.pgi/build/lib/libnetcdff.so.7 (0x00007f6df81b1000)
libnetcdf.so.15 => /usr/local/src/netcdf-c-4.7.2.pgi/build/lib/libnetcdf.so.15 (0x00007f6df7ea9000)
libhdf5hl_fortran.so.10 => /usr/local/src/hdf5-1.8.20.pgi/lib/libhdf5hl_fortran.so.10 (0x00007f6df7c81000)
libhdf5_hl.so.10 => /usr/local/src/hdf5-1.8.20.pgi/lib/libhdf5_hl.so.10 (0x00007f6df7a49000)
libhdf5_fortran.so.10 => /usr/local/src/hdf5-1.8.20.pgi/lib/libhdf5_fortran.so.10 (0x00007f6df77f1000)
libhdf5.so.10 => /usr/local/src/hdf5-1.8.20.pgi/lib/libhdf5.so.10 (0x00007f6df7191000)
libm.so.6 => /usr/lib64/libm.so.6 (0x00007f6df6e89000)
libz.so.1 => /usr/lib64/libz.so.1 (0x00007f6df6c71000)
libmpi_usempif08.so.40 => /usr/local/pgi/linux86-64-nollvm/2019/mpi/openmpi-3.1.3/lib/libmpi_usempif08.so.40 (0x00007f6df6a01000)
libmpi_usempi_ignore_tkr.so.40 => /usr/local/pgi/linux86-64-nollvm/2019/mpi/openmpi-3.1.3/lib/libmpi_usempi_ignore_tkr.so.40 (0x00007f6df67f1000)
libmpi_mpifh.so.40 => /usr/local/pgi/linux86-64-nollvm/2019/mpi/openmpi-3.1.3/lib/libmpi_mpifh.so.40 (0x00007f6df6591000)
libmpi.so.40 => /usr/local/pgi/linux86-64-nollvm/2019/mpi/openmpi-3.1.3/lib/libmpi.so.40 (0x00007f6df60d1000)
libpgf90rtl.so => /usr/local/pgi/linux86-64/2019/lib/libpgf90rtl.so (0x00007f6df5ea9000)
libpgf90.so => /usr/local/pgi/linux86-64/2019/lib/libpgf90.so (0x00007f6df5891000)
libpgf90_rpm1.so => /usr/local/pgi/linux86-64/2019/lib/libpgf90_rpm1.so (0x00007f6df5689000)
libpgf902.so => /usr/local/pgi/linux86-64/2019/lib/libpgf902.so (0x00007f6df5471000)
libpgftnrtl.so => /usr/local/pgi/linux86-64/2019/lib/libpgftnrtl.so (0x00007f6df5211000)
libpgmp.so => /usr/local/pgi/linux86-64/2019/lib/libpgmp.so (0x00007f6df4f89000)
libnuma.so.1 => /usr/local/pgi/linux86-64/2019/lib/libnuma.so.1 (0x00007f6df4d79000)
libpthread.so.0 => /usr/lib64/libpthread.so.0 (0x00007f6df4b59000)
libpgmath.so => /usr/local/pgi/linux86-64/2019/lib/libpgmath.so (0x00007f6df4741000)
libpgc.so => /usr/local/pgi/linux86-64/2019/lib/libpgc.so (0x00007f6df44e9000)
librt.so.1 => /usr/lib64/librt.so.1 (0x00007f6df42e1000)
libc.so.6 => /usr/lib64/libc.so.6 (0x00007f6df3f11000)
libgcc_s.so.1 => /opt/ohpc/pub/compiler/gcc/7.3.0/lib64/libgcc_s.so.1 (0x00007f6df3cf9000)
libdl.so.2 => /usr/lib/gcc/x86_64-redhat-linux/4.8.5/../../../../lib64/libdl.so.2 (0x00007f6df3af1000)
/lib64/ld-linux-x86-64.so.2 (0x00005599f7d8e000)
libopen-rte.so.40 => /usr/local/pgi/linux86-64-nollvm/2019/mpi/openmpi-3.1.3/lib/../lib/libopen-rte.so.40 (0x00007f6df3791000)
libopen-pal.so.40 => /usr/local/pgi/linux86-64-nollvm/2019/mpi/openmpi-3.1.3/lib/../lib/libopen-pal.so.40 (0x00007f6df32e9000)
librdmacm.so.1 => /usr/lib64/librdmacm.so.1 (0x00007f6df30d1000)
libibverbs.so.1 => /usr/lib64/libibverbs.so.1 (0x00007f6df2eb9000)
libutil.so.1 => /usr/lib64/libutil.so.1 (0x00007f6df2cb1000)
libnl-route-3.so.200 => /usr/lib64/libnl-route-3.so.200 (0x00007f6df2a41000)
libnl-3.so.200 => /usr/lib64/libnl-3.so.200 (0x00007f6df2819000)
# ldd wrf/WPS-4.1/metgrid/src/metgrid.exe
linux-vdso.so.1 => (0x00007fffe8e01000)
libnetcdff.so.7 => /usr/local/src/netcdf-c-4.7.2.pgi/build/lib/libnetcdff.so.7 (0x00007fefc9711000)
libnetcdf.so.15 => /usr/local/src/netcdf-c-4.7.2.pgi/build/lib/libnetcdf.so.15 (0x00007fefc9409000)
libpgf90rtl.so => /usr/local/pgi/linux86-64-nollvm/19.10/lib/libpgf90rtl.so (0x00007fefc91e1000)
libpgf90.so => /usr/local/pgi/linux86-64-nollvm/19.10/lib/libpgf90.so (0x00007fefc8bc9000)
libpgf90_rpm1.so => /usr/local/pgi/linux86-64-nollvm/19.10/lib/libpgf90_rpm1.so (0x00007fefc89c1000)
libpgf902.so => /usr/local/pgi/linux86-64-nollvm/19.10/lib/libpgf902.so (0x00007fefc87a9000)
libpgftnrtl.so => /usr/local/pgi/linux86-64-nollvm/19.10/lib/libpgftnrtl.so (0x00007fefc8549000)
libpgmp.so => /usr/local/pgi/linux86-64-nollvm/19.10/lib/libpgmp.so (0x00007fefc82c1000)
libnuma.so.1 => /usr/local/pgi/linux86-64-nollvm/19.10/lib/libnuma.so.1 (0x00007fefc80b1000)
libpthread.so.0 => /usr/lib/gcc/x86_64-redhat-linux/4.8.5/../../../../lib64/libpthread.so.0 (0x00007fefc7e91000)
libpgmath.so => /usr/local/pgi/linux86-64-nollvm/19.10/lib/libpgmath.so (0x00007fefc7a79000)
libpgc.so => /usr/local/pgi/linux86-64-nollvm/19.10/lib/libpgc.so (0x00007fefc7821000)
librt.so.1 => /usr/lib/gcc/x86_64-redhat-linux/4.8.5/../../../../lib64/librt.so.1 (0x00007fefc7619000)
libm.so.6 => /usr/lib/gcc/x86_64-redhat-linux/4.8.5/../../../../lib64/libm.so.6 (0x00007fefc7311000)
libc.so.6 => /usr/lib/gcc/x86_64-redhat-linux/4.8.5/../../../../lib64/libc.so.6 (0x00007fefc6f41000)
libgcc_s.so.1 => /usr/lib/gcc/x86_64-redhat-linux/4.8.5/../../../../lib64/libgcc_s.so.1 (0x00007fefc6d29000)
libhdf5_hl.so.10 => /usr/local/src/hdf5-1.8.20.pgi/lib/libhdf5_hl.so.10 (0x00007fefc6af1000)
libhdf5.so.10 => /usr/local/src/hdf5-1.8.20.pgi/lib/libhdf5.so.10 (0x00007fefc6491000)
libz.so.1 => /usr/lib/gcc/x86_64-redhat-linux/4.8.5/../../../../lib64/libz.so.1 (0x00007fefc6279000)
libdl.so.2 => /usr/lib/gcc/x86_64-redhat-linux/4.8.5/../../../../lib64/libdl.so.2 (0x00007fefc6071000)
/lib64/ld-linux-x86-64.so.2 (0x000056033dce1000)
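The ldd listings above are long enough that it is easy to miss a library resolved from an unexpected compiler tree. For example, the wrf.exe listing shows the libpg* runtime under linux86-64/2019 while the MPI libraries come from linux86-64-nollvm/2019. A small sketch for summarizing this, assuming the PGI install root is /usr/local/pgi as in the paths above; pgi_trees is a hypothetical helper name, and this only summarizes ldd output, it does not by itself identify the cause of the leak.

```shell
#!/bin/sh
# Read `ldd` output on stdin and print each distinct "<target>/<version>" tree
# under the PGI install root that supplies a shared library.
# pgi_trees is a hypothetical helper; the default root is taken from the
# paths shown in the listings above.
pgi_trees() {
    root=${1:-/usr/local/pgi}
    awk -v root="$root/" '
        $2 == "=>" && index($3, root) == 1 {
            rest = substr($3, length(root) + 1)   # e.g. linux86-64-nollvm/19.10/lib/libpgc.so
            split(rest, part, "/")
            print part[1] "/" part[2]             # keep only target and version
        }' | sort -u
}

# Usage:
#   ldd wrf/WPS-4.1/metgrid/src/metgrid.exe | pgi_trees
#   ldd wrf/WRF-4.1.3/run/wrf.exe | pgi_trees
```

More than one line of output means the executable loads libraries from more than one tree under the install root, which is worth checking against the compiler that was actually used for the build.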