"Segmentation Fault - Invalid Memory Reference" From Running MPAS

tomerburg

New member
This is an issue my department's IT team and I have been dealing with for several months while trying to compile and run MPAS v7.3 on our computing cluster after an OS update (CentOS 7.0), and we've hit a brick wall that hopefully folks here can help us resolve.

After installing NetCDF (NetCDF-C v4.9.0 and NetCDF-Fortran v4.6.0), PNetCDF v1.12.3, PIO v1.7.1, and MPICH (I believe v4.0.2), along with the latest available version of gcc (v12.2.0), I was able to compile MPAS with gfortran and create the static file and initial conditions file. To run MPAS, I use the qsub script attached here for reference. Normally the command we'd use to run MPAS would be as follows:

Code:
qsub -pe orte 256 ./run_mpas.qsub

Our IT team was unable to set up the orte parallel environment, so instead we have an "mpi" environment, run as follows:

Code:
qsub -pe mpi 256 ./run_mpas.qsub
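
For context, a minimal SGE job script for this kind of run would look roughly like the sketch below. This is only an illustration, not a copy of the attached run_mpas.qsub; the job name, the use of $NSLOTS, and the mpirun launch line are assumptions.

Code:
#!/bin/bash
#$ -cwd                  # run from the submission directory
#$ -N run_mpas           # job name (assumed); the PE and slot count come from the qsub command line
# launch the MPAS atmosphere core on all slots granted by the scheduler
mpirun -np $NSLOTS ./atmosphere_model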

When I run MPAS, as the attached log file shows, it gets through most of the steps before the simulation begins the first time step, spread out across 5 nodes with 56 cores / 128 GB RAM each (one of the nodes has 256 GB RAM). It then aborts abruptly, producing an error file (attached below as "run_mpas.qsub.o24") as well as two large binary core files, which I subsequently deleted. I posted a small snippet of the error file below:

Code:
Program received signal SIGSEGV: Segmentation fault - invalid memory reference.


Backtrace for this error:


Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

Backtrace for this error:
#0  0x2b86cb08226f in ???
#1  0x2b86c7a03414 in ???
#2  0x2b86c7a0553e in ???
#3  0x2b86c7a09a04 in ???
#4  0x2b86c7a09e49 in ???
#5  0x2b86c784408c in ???
#6  0x2b86c9c74200 in ???
#7  0x2b86c9c7397a in ???
#8  0x2b86c9c72448 in ???
#9  0x2b86c9c4e3a1 in ???
#10  0x2b86c9c4ecc5 in ???
#11  0x9ca5c4 in ncmpio_read_write
    at /home/dgoines/utils/pnetcdf-1.12.3/src/drivers/ncmpio/ncmpio_file_io.c:109
#12  0x9ad51e in get_varm
   at /home/dgoines/utils/pnetcdf-1.12.3/src/drivers/ncmpio/ncmpio_getput.c:516
#13  0x9ad51e in ncmpio_get_var
    at /home/dgoines/utils/pnetcdf-1.12.3/src/drivers/ncmpio/ncmpio_getput.c:592

Understanding where the error is coming from and how to fix it is beyond my skill level, but two things that may help narrow it down are: (1) a classmate who ran MPAS on Cheyenne got a similar error message at one point, which was resolved by running MPAS on large-memory nodes, and (2) the backtrace largely points to files associated with PNetCDF/PIO/MPI.

Has anyone encountered this before or has any suggestions on how to fix this? Any help would be greatly appreciated!
 

Attachments

  • log.atmosphere.0000.out.txt (8.3 KB)
  • run_mpas.qsub.o24.txt (7.8 KB)
  • run_mpas.qsub.txt (88 bytes)
There are a few options to try out that come to mind:

1) Does unlimiting the stack size just before the 'mpirun' command in your job script help? In bash, you can use the command
ulimit -s unlimited
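
In the job script, that would look something like this (just a sketch; match the launch line to whatever your script already uses, and note that on multi-node runs the limit may also need to be raised on the other nodes):

Code:
# remove the per-process stack size limit before launching the model
ulimit -s unlimited
mpirun -np $NSLOTS ./atmosphere_model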

2) If you recompile the model to use single-precision reals (using PRECISION=single in your build command), does that allow the model to run? If so, that might suggest the double-precision runs were hitting a memory limit.
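
The rebuild would look something like the following, assuming the gfortran build target mentioned in the first post (adjust the target name if yours differs):

Code:
# discard the existing double-precision build, then rebuild with 4-byte reals
make clean CORE=atmosphere
make gfortran CORE=atmosphere PRECISION=single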

3) Since the stack trace points to somewhere within PNetCDF or MPI-IO, you could try different configurations of the namelist options

&io
config_pio_num_iotasks = 0
config_pio_stride = 1
/

If the model fails in a different way when trying any of the above, that might offer some additional insight into what might be going wrong.
 

Unfortunately, none of these got MPAS to run, but the model did fail in a different way:

1) Using the "ulimit -s unlimited" command didn't change anything - it still crashed the same way.

2) Recompiling and re-running the model in single precision didn't allow the model to run, but the error message was different ("error_single_precision.txt"). The original error log in my first post (run_mpas.qsub.o24) had duplicates of most error messages, while this time there were no duplicates; the "log.atmosphere.0000.out" file was the same as before.

3) This one had the most notable differences. When I tried the following configuration:

&io
config_pio_num_iotasks = 1
config_pio_stride = 1
/

I got a shorter error message ("error_single_precision_config1.txt"), but the model didn't get as far along - the "log.atmosphere.0000.out" file only got as far as the following lines, with no subsequent output:

Reading streams configuration from file streams.atmosphere
Found mesh stream with filename template x4.535554.init.nc
Using default io_type for mesh stream
** Attempting to bootstrap MPAS framework using stream: input
Bootstrapping framework with mesh fields from input file 'x4.535554.init.nc'

When I tried setting "config_pio_num_iotasks", "config_pio_stride", or both to 2, no "log.atmosphere.0000.out" file was created at all. The model didn't crash, but it simply hung, and I got the following message in a "run_mpas.qsub.o35" file:

./atmosphere_model: error while loading shared libraries: libnl.so.1: cannot open shared object file: No such file or directory

I should also note that while recompiling MPAS, I noticed a group of warning messages pop up, which I've attached here as well (compilation_warning_messages.png) - I'm not sure what they mean, but I don't recall encountering them before when compiling MPAS. Additionally, running "init_atmosphere_model" produced the following warnings:

[WARNING] yaksa: 27 leaked handle pool objects
Note: The following floating-point exceptions are signalling: IEEE_UNDERFLOW_FLAG IEEE_DENORMAL
 

Attachments

  • compilation_warning_messages.png (1 MB)
  • error_single_precision_config1.txt (3.3 KB)
  • error_single_precision.txt (6.2 KB)