My department's IT team and I have spent several months trying to compile and run MPAS v7.3 on our computing cluster after an OS update (CentOS 7.0), and we've hit a brick wall that hopefully folks here can help us resolve.
After installing NetCDF (NetCDF-C v4.9.0 and NetCDF-fortran v4.6.0), PNetCDF v1.12.3, PIO v1.7.1, and mpich (I believe v4.0.2), using the latest version of gcc available (v12.2.0), I was able to compile MPAS with gfortran and create the static file and the initial conditions file. To run MPAS, I use a qsub script, attached here for reference. Normally, the command we'd use to submit the MPAS run would be:
Code:
qsub -pe orte 256 ./run_mpas.qsub
Our IT team was unable to set up the orte parallel environment, so instead we use an "mpi" parallel environment and submit the run as follows:
Code:
qsub -pe mpi 256 ./run_mpas.qsub
The job is spread across 5 nodes with 56 cores and 128 GB RAM each (one of the nodes has 256 GB RAM). As the attached log file shows, MPAS gets through most of the steps leading up to the first time step and then aborts abruptly. It leaves behind an error file (attached below as "run_mpas.qsub.o24") and two large binary core files, which I subsequently deleted. A small snippet of the error file is below:
Code:
Program received signal SIGSEGV: Segmentation fault - invalid memory reference.
Backtrace for this error:
Program received signal SIGSEGV: Segmentation fault - invalid memory reference.
Backtrace for this error:
#0 0x2b86cb08226f in ???
#1 0x2b86c7a03414 in ???
#2 0x2b86c7a0553e in ???
#3 0x2b86c7a09a04 in ???
#4 0x2b86c7a09e49 in ???
#5 0x2b86c784408c in ???
#6 0x2b86c9c74200 in ???
#7 0x2b86c9c7397a in ???
#8 0x2b86c9c72448 in ???
#9 0x2b86c9c4e3a1 in ???
#10 0x2b86c9c4ecc5 in ???
#11 0x9ca5c4 in ncmpio_read_write
at /home/dgoines/utils/pnetcdf-1.12.3/src/drivers/ncmpio/ncmpio_file_io.c:109
#12 0x9ad51e in get_varm
at /home/dgoines/utils/pnetcdf-1.12.3/src/drivers/ncmpio/ncmpio_getput.c:516
#13 0x9ad51e in ncmpio_get_var
at /home/dgoines/utils/pnetcdf-1.12.3/src/drivers/ncmpio/ncmpio_getput.c:592
Understanding where the error comes from and how to fix it is beyond my skill level, but two things may help: (1) a classmate who ran MPAS on Cheyenne got a similar error message at one point, which was resolved by running MPAS on large-memory nodes, and (2) the backtrace mostly points to files associated with PNetCDF/PIO/MPI.
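To put rough numbers on point (1), here is a back-of-the-envelope sketch of how much memory each MPI rank could get on our nodes. The node counts, core counts, and RAM sizes are the ones given above. The rest is an assumption on my part: that a node is fully packed with one rank per core and that all of its RAM is available to MPAS. I have not confirmed how the "mpi" environment actually distributes the 256 slots.

```shell
#!/bin/sh
# Figures taken from the description above.
ranks=256            # slots requested with "qsub -pe mpi 256"
nodes=5
cores_per_node=56
ram_gb_small=128     # four of the five nodes
ram_gb_large=256     # the one larger node

# Total cores available across the five nodes.
total_cores=$((nodes * cores_per_node))

# ASSUMPTION: a node is fully packed (one MPI rank per core) and all of its
# RAM is usable by MPAS. Integer division, so results are rounded down.
mb_per_rank_small=$((ram_gb_small * 1024 / cores_per_node))
mb_per_rank_large=$((ram_gb_large * 1024 / cores_per_node))

echo "cores available: $total_cores (ranks requested: $ranks)"
echo "MB per rank on a 128 GB node: $mb_per_rank_small"
echo "MB per rank on a 256 GB node: $mb_per_rank_large"
```

Under those assumptions, a rank on a 128 GB node gets a bit over 2 GB. Whether that is enough for our mesh is exactly what I can't judge.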
Has anyone encountered this before, or does anyone have suggestions on how to fix it? Any help would be greatly appreciated!