Hello,
I ran the MPAS-A GPU version successfully with the following commands:
$ export MPAS_DYNAMICS_RANKS_PER_NODE=4
$ export MPAS_RADIATION_RANKS_PER_NODE=2
$ mpirun -np 6 ./atmosphere_model
The test case I used is the CFSR example (the sample real-data input files) provided on the MPAS-A official site (http://mpas-dev.github.io/).
But when I ran the MPAS-A GPU version with the following commands:
$ export MPAS_DYNAMICS_RANKS_PER_NODE=24
$ export MPAS_RADIATION_RANKS_PER_NODE=16
$ mpirun -np 40 ./atmosphere_model
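If I understand the GPU port correctly, the value passed to -np should equal the sum of the dynamics and radiation ranks per node times the number of nodes; on my single node that is 24 + 16 = 40. Here is the assumption I am making, spelled out as a small shell sketch (my own, not from the MPAS docs; please correct me if this is wrong):

$ NP=$(( MPAS_DYNAMICS_RANKS_PER_NODE + MPAS_RADIATION_RANKS_PER_NODE ))   # 24 + 16 = 40
$ mpirun -np "$NP" ./atmosphere_model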
With this configuration, I got the following segmentation fault:
--------------------------------------------------------------------------
WARNING: No preset parameters were found for the device that Open MPI
detected:
Local host: gpu
Device name: mlx5_0
Device vendor ID: 0x02c9
Device vendor part ID: 4123
Default device parameters will be used, which may result in lower
performance. You can edit any of the files specified by the
btl_openib_device_param_files MCA parameter to set values for your
device.
NOTE: You can turn off this warning by setting the MCA parameter
btl_openib_warn_no_device_params_found to 0.
--------------------------------------------------------------------------
[gpu:395574] [[59095,0],0] ORTE_ERROR_LOG: Out of resource in file ../../orte/util/show_help.c at line 507
[gpu:395574] 158 more processes have sent help message help-mpi-btl-openib.txt / no device params found
[gpu:395574] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
My role is 2
Role leader is 0
My role is 1
Role leader is 1
[... many more similar "My role is ..." / "Role leader is ..." lines, repeated for the remaining ranks, omitted ...]
[gpu:395580] *** Process received signal ***
[gpu:395580] Signal: Segmentation fault (11)
[gpu:395580] Signal code: Address not mapped (1)
[gpu:395580] Failing at address: (nil)
[gpu:395580] [ 0] /usr/lib/gcc/x86_64-redhat-linux/8/../../../../lib64/libpthread.so.0(+0x12b20)[0x15054402db20]
[gpu:395580] [ 1] /root/bianqy/user/local/nvidia_hpc_sdk_multi/Linux_x86_64/22.1/comm_libs/openmpi/openmpi-3.1.5/lib/libopen-pal.so.40(mca_btl_smcuda_sendi+0x6e3)[0x150541c4c603]
[gpu:395580] [ 2] /root/bianqy/user/local/nvidia_hpc_sdk_multi/Linux_x86_64/22.1/comm_libs/openmpi/openmpi-3.1.5/lib/libmpi.so.40(mca_pml_ob1_isend+0x3a4)[0x1505463ae744]
[gpu:395580] [ 3] /root/bianqy/user/local/nvidia_hpc_sdk_multi/Linux_x86_64/22.1/comm_libs/openmpi/openmpi-3.1.5/lib/libmpi.so.40(ompi_coll_base_bcast_intra_binomial+0x1d0)[0x1505462cce50]
[gpu:395580] [ 4] /root/bianqy/user/local/nvidia_hpc_sdk_multi/Linux_x86_64/22.1/comm_libs/openmpi/openmpi-3.1.5/lib/libmpi.so.40(ompi_coll_tuned_bcast_intra_dec_fixed+0x35)[0x1505462dbc95]
[gpu:395580] [ 5] /root/bianqy/user/local/nvidia_hpc_sdk_multi/Linux_x86_64/22.1/comm_libs/openmpi/openmpi-3.1.5/lib/libmpi.so.40(MPI_Bcast+0xd9)[0x15054628dc99]
[gpu:395580] [ 6] ./atmosphere_model[0xae2805]
[gpu:395580] [ 7] ./atmosphere_model[0xae98f5]
[gpu:395580] [ 8] ./atmosphere_model[0xae7d6e]
[gpu:395580] [ 9] ./atmosphere_model[0xaaf31d]
[gpu:395580] [10] ./atmosphere_model[0xaaee93]
[gpu:395580] [11] ./atmosphere_model[0xa099c3]
[gpu:395580] [12] ./atmosphere_model[0xaa82a8]
[gpu:395580] [13] ./atmosphere_model[0xa220d6]
[gpu:395580] [14] ./atmosphere_model[0xa1edea]
[gpu:395580] [15] ./atmosphere_model[0xa1d689]
[gpu:395580] [16] ./atmosphere_model[0x5194c2]
[gpu:395580] [17] ./atmosphere_model[0x40d444]
[gpu:395580] [18] ./atmosphere_model[0x40b697]
[gpu:395580] [19] ./atmosphere_model[0x40b633]
[gpu:395580] [20] /usr/lib/gcc/x86_64-redhat-linux/8/../../../../lib64/libc.so.6(__libc_start_main+0xf3)[0x1505433cf493]
[gpu:395580] [21] ./atmosphere_model[0x40b52e]
[gpu:395580] *** End of error message ***
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 0 on node gpu exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
I wonder whether this test case can only be run with a limited number of processes, or whether the problem is related to my platform. My platform information is as follows:
CPU: 2x AMD EPYC 7742 64-core processors
GPU: 8x NVIDIA A100 cards
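For what it's worth, I am not doing any explicit rank-to-GPU binding; all 40 ranks start on the one node with all 8 GPUs visible. If binding matters, I imagine a small wrapper along these lines (gpu_bind.sh is just a hypothetical name of mine, not part of MPAS; OMPI_COMM_WORLD_LOCAL_RANK is set by Open MPI for each launched process):

#!/bin/bash
# Hypothetical wrapper (my own sketch): spread local MPI ranks across the 8 A100s.
export CUDA_VISIBLE_DEVICES=$(( OMPI_COMM_WORLD_LOCAL_RANK % 8 ))
exec ./atmosphere_model

which would be launched as:

$ mpirun -np 40 ./gpu_bind.sh

Is that something I should be doing here?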
I'm very new to the MPAS-A GPU version. Any suggestions would be appreciated!