wrf_alessandro
Member
Hi All,
I am able to run MPAS-A without any problem when compiling with "intel".
To test the model's performance, I would like to run it on a GPU. To enable GPU support, I compiled as follows:
export PATH=/opt/nvidia/hpc_sdk/Linux_x86_64/24.5/comm_libs/mpi/bin/:$PATH
make -j 4 nvhpc CORE=init_atmosphere OPENACC=true
make -j 4 nvhpc CORE=atmosphere OPENACC=true
When I run either init_atmosphere or atmosphere, I get a segmentation fault when the code attempts to read the NetCDF file containing the grid information. For instance, this is the output of init_atmosphere:
----- done configuring registry-specified packages -----
Reading streams configuration from file streams.init_atmosphere
Found mesh stream with filename template x1.10242.grid.nc
Using default io_type for mesh stream
** Attempting to bootstrap MPAS framework using stream: input
while this is the error message I get:
mpirun -np 1 ./init_atmosphere_model
[canova:192675:0:192675] Caught signal 11 (Segmentation fault: address not mapped to object at address 0xd98f3dc0)
==== backtrace (tid: 192675) ====
0 0x0000000000016910 __funlockfile() ???:0
1 0x00000000000798b1 opal_info_dup() /var/jenkins/workspace/rel_nv_lib_hpcx_cuda12_x86_64/work/rebuild_ompi/ompi/build/opal/util/../../../opal/util/info.c:84
2 0x000000000006c780 PMPI_Info_dup() /var/jenkins/workspace/rel_nv_lib_hpcx_cuda12_x86_64/work/rebuild_ompi/ompi/build/ompi/mpi/c/profile/pinfo_dup.c:87
3 0x00000000000748f5 combine_env_hints() /home/mpas/UTIL/intel_libraries_intelmpi/sources/pnetcdf-1.14.0/src/dispatchers/file.c:177
4 0x0000000000075401 ncmpi_open() /home/mpas/UTIL/intel_libraries_intelmpi/sources/pnetcdf-1.14.0/src/dispatchers/file.c:663
5 0x00000000006d4785 SMIOL_open_file() /home/mpas/MODELS/MPAS/MPAS-Model/src/external/SMIOL/smiol.c:359
6 0x00000000006c3486 smiolf_smiolf_open_file_() /home/mpas/MODELS/MPAS/MPAS-Model/src/external/SMIOL/smiolf.F90:0
7 0x0000000000605fdb mpas_io_mpas_io_open_() /home/mpas/MODELS/MPAS/MPAS-Model/src/framework/mpas_io.F:437
8 0x000000000061f2d7 mpas_bootstrapping_mpas_bootstrap_framework_phase1_() /home/mpas/MODELS/MPAS/MPAS-Model/src/framework/mpas_bootstrapping.F:159
9 0x000000000040887c mpas_subdriver_mpas_init_() /home/mpas/MODELS/MPAS/MPAS-Model/src/driver/mpas_subdriver.F:356
10 0x0000000000407042 MAIN_() /home/mpas/MODELS/MPAS/MPAS-Model/src/driver/mpas.F:18
11 0x0000000000406fb1 main() ???:0
12 0x000000000003524d __libc_start_main() ???:0
13 0x0000000000406eaa _start() /home/abuild/rpmbuild/BUILD/glibc-2.31/csu/../sysdeps/x86_64/start.S:120
=================================
[canova:192675] *** Process received signal ***
[canova:192675] Signal: Segmentation fault (11)
[canova:192675] Signal code: (-6)
[canova:192675] Failing at address: 0x3f90002f0a3
[canova:192675] [ 0] /lib64/libpthread.so.0(+0x16910)[0x148ada502910]
[canova:192675] [ 1] /opt/nvidia/hpc_sdk/Linux_x86_64/24.5/comm_libs/12.4/hpcx/hpcx-2.19/ompi/lib/libopen-pal.so.40(opal_info_dup+0x21)[0x148ad18798b1]
[canova:192675] [ 2] /opt/nvidia/hpc_sdk/Linux_x86_64/24.5/comm_libs/12.4/hpcx/hpcx-2.19/ompi/lib/libmpi.so.40(MPI_Info_dup+0x80)[0x148ad966c780]
[canova:192675] [ 3] /home/mpas/UTIL/pnetcdf-1.14.0/lib/libpnetcdf.so.6(+0x748f5)[0x148ada5ad8f5]
[canova:192675] [ 4] /home/mpas/UTIL/pnetcdf-1.14.0/lib/libpnetcdf.so.6(ncmpi_open+0x131)[0x148ada5ae401]
[canova:192675] [ 5] ./init_atmosphere_model[0x6d4785]
[canova:192675] [ 6] ./init_atmosphere_model[0x6c3486]
[canova:192675] [ 7] ./init_atmosphere_model[0x605fdb]
[canova:192675] [ 8] ./init_atmosphere_model[0x61f2d7]
[canova:192675] [ 9] ./init_atmosphere_model[0x40887c]
[canova:192675] [10] ./init_atmosphere_model[0x407042]
[canova:192675] [11] ./init_atmosphere_model[0x406fb1]
[canova:192675] [12] /lib64/libc.so.6(__libc_start_main+0xef)[0x148ada03e24d]
[canova:192675] [13] ./init_atmosphere_model[0x406eaa]
[canova:192675] *** End of error message ***
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node canova exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
Do you have any idea why this happens? Do I need to recompile PnetCDF with the CUDA compiler to overcome this issue?
Thanks in advance for your help,
Alessandro
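The backtrace above fails inside MPI_Info_dup in the HPC-X Open MPI library, while frame 3 shows PnetCDF sources under a directory named intel_libraries_intelmpi. That suggests, though it does not prove, that libpnetcdf.so.6 was built against a different MPI than the one the nvhpc executable uses. A minimal sketch for checking this by comparing shared-library dependencies follows. The helper name mpi_flavor is hypothetical, the two paths are copied from the backtrace, and the sonames used for matching (libmpi.so.40 for Open MPI 4.x, libmpi.so.12 for Intel MPI and MPICH) are assumptions that may not hold for every installation.

```shell
#!/bin/sh
# Hypothetical helper: classify the MPI implementation named in `ldd`
# output read from stdin. Soname patterns are assumptions (see lead-in).
mpi_flavor() {
    awk '
        /libopen-pal\.so|libmpi\.so\.40/ { ompi = 1 }
        /libmpi\.so\.12/                 { mpich = 1 }
        END {
            if (ompi && mpich) print "mixed"
            else if (ompi)     print "openmpi"
            else if (mpich)    print "mpich-abi"
            else               print "none"
        }
    '
}

# Paths taken from the backtrace; adjust them to your installation.
for f in /home/mpas/UTIL/pnetcdf-1.14.0/lib/libpnetcdf.so.6 ./init_atmosphere_model; do
    if [ -e "$f" ]; then
        printf '%s: %s\n' "$f" "$(ldd "$f" | mpi_flavor)"
    fi
done
```

If the two files report different flavors, or one reports "mixed", rebuilding PnetCDF with the same MPI compiler wrappers used to build MPAS would be the usual thing to try. A result of "none" only means that no matching soname was found, for example because the library was linked statically.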