pnetcdf issue with nvhpc

Hi All,

I am able to run MPAS-A without any problem when compiling with "intel".

To test the model's performance, I would like to run it on a GPU. To enable GPU support, I compiled as follows:

export PATH=/opt/nvidia/hpc_sdk/Linux_x86_64/24.5/comm_libs/mpi/bin/:$PATH
make -j 4 nvhpc CORE=init_atmosphere OPENACC=true
make -j 4 nvhpc CORE=atmosphere OPENACC=true
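
As a sanity check that the build picks up the NVHPC MPI wrappers rather than another MPI on the system, the wrappers on the PATH can be inspected with something like this (the expected compilers assume the SDK path above):

which mpif90 mpicc
mpif90 --version   # expected to report nvfortran from the NVHPC SDK
mpicc --version    # expected to report nvc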


When I run both init_atmosphere and atmosphere, I get a segmentation fault when the code attempts to read the NetCDF file containing the grid information. For instance, this is the output of init_atmosphere:

----- done configuring registry-specified packages -----

Reading streams configuration from file streams.init_atmosphere
Found mesh stream with filename template x1.10242.grid.nc
Using default io_type for mesh stream
** Attempting to bootstrap MPAS framework using stream: input

while this is the error message I get:

mpirun -np 1 ./init_atmosphere_model
[canova:192675:0:192675] Caught signal 11 (Segmentation fault: address not mapped to object at address 0xd98f3dc0)
==== backtrace (tid: 192675) ====
0 0x0000000000016910 __funlockfile() ???:0
1 0x00000000000798b1 opal_info_dup() /var/jenkins/workspace/rel_nv_lib_hpcx_cuda12_x86_64/work/rebuild_ompi/ompi/build/opal/util/../../../opal/util/info.c:84
2 0x000000000006c780 PMPI_Info_dup() /var/jenkins/workspace/rel_nv_lib_hpcx_cuda12_x86_64/work/rebuild_ompi/ompi/build/ompi/mpi/c/profile/pinfo_dup.c:87
3 0x00000000000748f5 combine_env_hints() /home/mpas/UTIL/intel_libraries_intelmpi/sources/pnetcdf-1.14.0/src/dispatchers/file.c:177
4 0x0000000000075401 ncmpi_open() /home/mpas/UTIL/intel_libraries_intelmpi/sources/pnetcdf-1.14.0/src/dispatchers/file.c:663
5 0x00000000006d4785 SMIOL_open_file() /home/mpas/MODELS/MPAS/MPAS-Model/src/external/SMIOL/smiol.c:359
6 0x00000000006c3486 smiolf_smiolf_open_file_() /home/mpas/MODELS/MPAS/MPAS-Model/src/external/SMIOL/smiolf.F90:0
7 0x0000000000605fdb mpas_io_mpas_io_open_() /home/mpas/MODELS/MPAS/MPAS-Model/src/framework/mpas_io.F:437
8 0x000000000061f2d7 mpas_bootstrapping_mpas_bootstrap_framework_phase1_() /home/mpas/MODELS/MPAS/MPAS-Model/src/framework/mpas_bootstrapping.F:159
9 0x000000000040887c mpas_subdriver_mpas_init_() /home/mpas/MODELS/MPAS/MPAS-Model/src/driver/mpas_subdriver.F:356
10 0x0000000000407042 MAIN_() /home/mpas/MODELS/MPAS/MPAS-Model/src/driver/mpas.F:18
11 0x0000000000406fb1 main() ???:0
12 0x000000000003524d __libc_start_main() ???:0
13 0x0000000000406eaa _start() /home/abuild/rpmbuild/BUILD/glibc-2.31/csu/../sysdeps/x86_64/start.S:120
=================================
[canova:192675] *** Process received signal ***
[canova:192675] Signal: Segmentation fault (11)
[canova:192675] Signal code: (-6)
[canova:192675] Failing at address: 0x3f90002f0a3
[canova:192675] [ 0] /lib64/libpthread.so.0(+0x16910)[0x148ada502910]
[canova:192675] [ 1] /opt/nvidia/hpc_sdk/Linux_x86_64/24.5/comm_libs/12.4/hpcx/hpcx-2.19/ompi/lib/libopen-pal.so.40(opal_info_dup+0x21)[0x148ad18798b1]
[canova:192675] [ 2] /opt/nvidia/hpc_sdk/Linux_x86_64/24.5/comm_libs/12.4/hpcx/hpcx-2.19/ompi/lib/libmpi.so.40(MPI_Info_dup+0x80)[0x148ad966c780]
[canova:192675] [ 3] /home/mpas/UTIL/pnetcdf-1.14.0/lib/libpnetcdf.so.6(+0x748f5)[0x148ada5ad8f5]
[canova:192675] [ 4] /home/mpas/UTIL/pnetcdf-1.14.0/lib/libpnetcdf.so.6(ncmpi_open+0x131)[0x148ada5ae401]
[canova:192675] [ 5] ./init_atmosphere_model[0x6d4785]
[canova:192675] [ 6] ./init_atmosphere_model[0x6c3486]
[canova:192675] [ 7] ./init_atmosphere_model[0x605fdb]
[canova:192675] [ 8] ./init_atmosphere_model[0x61f2d7]
[canova:192675] [ 9] ./init_atmosphere_model[0x40887c]
[canova:192675] [10] ./init_atmosphere_model[0x407042]
[canova:192675] [11] ./init_atmosphere_model[0x406fb1]
[canova:192675] [12] /lib64/libc.so.6(__libc_start_main+0xef)[0x148ada03e24d]
[canova:192675] [13] ./init_atmosphere_model[0x406eaa]
[canova:192675] *** End of error message ***
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node canova exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------


Do you have any idea why this happens? Do I need to recompile PnetCDF with the CUDA compiler to overcome this issue?
Thanks in advance for your help,
Alessandro
 
Hi Alessandro,

Installing the software stack from source can have a lot of gotchas, but it is still the approach we MPAS developers prefer.

Do I need to recompile PnetCDF with the CUDA compiler to overcome this issue?

Quoting from the Additional Requirements section of the MPAS User's Guide:
"All libraries must be compiled with the same compilers that will be used to build MPAS."

Ideally, everything from the MPI library's dependencies all the way through to PnetCDF (and PIO, if used) should be built with the same compiler family and version. This is especially true for the NVIDIA compilers, since code compiled with them is not guaranteed to be compatible with code compiled by other compilers or even by other NVHPC versions. (This doesn't mean you shouldn't use those compilers or that the libraries won't be compatible, just be aware there is no guarantee.) Your backtrace points to exactly this kind of mismatch: the libpnetcdf.so.6 being loaded was built from a source tree under intel_libraries_intelmpi (per the debug symbols), but at run time it is calling into the HPC-X Open MPI bundled with the NVHPC SDK.
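
If you do rebuild, here is a minimal sketch of configuring PnetCDF against the MPI bundled with the NVHPC SDK. The SDK path is the one from your post, and the install prefix is just a hypothetical example; adjust both for your system:

# Put the NVHPC MPI wrappers first on the PATH
export PATH=/opt/nvidia/hpc_sdk/Linux_x86_64/24.5/comm_libs/mpi/bin:$PATH

# PnetCDF's configure honors these variables when choosing MPI compiler wrappers
export MPICC=mpicc
export MPICXX=mpicxx
export MPIF77=mpifort
export MPIF90=mpifort

# Build and install into a separate (hypothetical) prefix so the
# Intel-built copy stays untouched
cd pnetcdf-1.14.0
./configure --prefix=$HOME/UTIL/pnetcdf-1.14.0-nvhpc
make -j 4
make install

# After rebuilding MPAS against the new library, ldd can confirm that
# libpnetcdf and libmpi now resolve to matching installations
ldd ./init_atmosphere_model | grep -E 'libpnetcdf|libmpi'

Then point the MPAS build at the new installation (the build system reads the PNETCDF environment variable, e.g. PNETCDF=$HOME/UTIL/pnetcdf-1.14.0-nvhpc) and recompile both cores from a clean state.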

Best of luck getting things running!
 
Hi @gdicker, thanks for the prompt reply.

I can confirm that building all the libraries with the CUDA compiler solves the I/O issue.

I have just one more quick question: I am running and comparing an MPAS-A 240km_global_uniform experiment on a single CPU versus a GPU, and the performance on the GPU is only slightly better:

CPU:
Timing for integration step: 17.0024 s
Timing for integration step: 5.22745 s
Timing for integration step: 17.2153 s
Timing for integration step: 5.16383 s
Timing for integration step: 16.9864 s
Timing for integration step: 5.24485 s
Timing for integration step: 17.2145 s
Timing for integration step: 5.36377 s
Timing for integration step: 17.0783 s
Timing for integration step: 5.49571 s


GPU:
Timing for integration step: 13.6203 s
Timing for integration step: 4.10412 s
Timing for integration step: 14.1355 s
Timing for integration step: 4.30890 s
Timing for integration step: 14.2983 s
Timing for integration step: 4.25423 s
Timing for integration step: 14.4433 s
Timing for integration step: 4.65796 s
Timing for integration step: 14.2697 s
Timing for integration step: 4.64362 s

Is this expected, or am I setting something up incorrectly?
 