I am running the WRF model in a Slurm environment using Open MPI.
I submit the job to the compute nodes with a Slurm script called "wrf.slurm", and the error message shown below was written to rsl.error.0018.
The same script ran normally two days ago, but it failed with this error yesterday and failed again today.
The same thing happened a week ago.
When I left the compute node unused for about a day and then ran the same script again a few days later, the error occurred.
I suspect something is wrong with the node, but I would like to know what the error message in rsl.error.0018 means.
Please help me.
rsl.error.0018 error message
[queue-1-dy-queue-1-cr-1-21:05078] *** Process received signal ***
[queue-1-dy-queue-1-cr-1-21:05078] *** Process received signal ***
[queue-1-dy-queue-1-cr-1-21:05078] Signal: Segmentation fault (11)
[queue-1-dy-queue-1-cr-1-21:05078] Signal code: Invalid permissions (2)
[queue-1-dy-queue-1-cr-1-21:05078] Failing at address: 0x40002ddbc1b8
[queue-1-dy-queue-1-cr-1-21:05078] Signal: Segmentation fault (11)
[queue-1-dy-queue-1-cr-1-21:05078] Signal code: Invalid permissions (2)
[queue-1-dy-queue-1-cr-1-21:05078] Failing at address: 0x40002ddbc1b8
[queue-1-dy-queue-1-cr-1-21:05078] [queue-1-dy-queue-1-cr-1-21:05078] [ 0] [ 0] linux-vdso.so.1(__kernel_rt_sigreturn+0x0)[0x40002be5278c] linux-vdso.so.1(__kernel_rt_sigreturn+0x0)[0x40002be5278c]
[queue-1-dy-queue-1-cr-1-21:05078] [queue-1-dy-queue-1-cr-1-21:05078] [ 1] [ 1] /shared/openmpi-4.1.4-acfl/lib/libopen-pal.so.40(opal_show_help_yylex+0x2e8)[0x40002dd7c728] [queue-1-dy-queue-1-cr-1-21:05078] [ 2] /shared/openmpi-4.1.4-acfl/lib/libopen-pal.so.40(opal_show_help_yylex+0x2e8)[0x40002dd7c728]
[queue-1-dy-queue-1-cr-1-21:05078] [ 2] /shared/openmpi-4.1.4-acfl/lib/libopen-pal.so.40(opal_show_help_vstring+0x188)[0x40002dd7c128]
[queue-1-dy-queue-1-cr-1-21:05078] [ 3] /shared/openmpi-4.1.4-acfl/lib/libopen-pal.so.40(opal_show_help_vstring+0x188)[0x40002dd7c128]
[queue-1-dy-queue-1-cr-1-21:05078] [ 3] /shared/openmpi-4.1.4-acfl/lib/libopen-rte.so.40(orte_show_help+0xa0)[0x40002dc759d0]
[queue-1-dy-queue-1-cr-1-21:05078] [ 4] /shared/openmpi-4.1.4-acfl/lib/libopen-rte.so.40(orte_show_help+0xa0)[0x40002dc759d0]
[queue-1-dy-queue-1-cr-1-21:05078] [ 4] /shared/openmpi-4.1.4-acfl/lib/libmpi.so.40(MPI_Abort+0x80)[0x40002d3c2f50]
[queue-1-dy-queue-1-cr-1-21:05078] [ 5] /shared/openmpi-4.1.4-acfl/lib/libmpi.so.40(MPI_Abort+0x80)[0x40002d3c2f50]
[queue-1-dy-queue-1-cr-1-21:05078] [ 5] /shared/openmpi-4.1.4-acfl/lib/libmpi_mpifh.so.40(mpi_abort+0x24)[0x40002d33b684]
[queue-1-dy-queue-1-cr-1-21:05078] [ 6] /shared/openmpi-4.1.4-acfl/lib/libmpi_mpifh.so.40(mpi_abort+0x24)[0x40002d33b684]
[queue-1-dy-queue-1-cr-1-21:05078] ./wrf.exe(+0x398dc4)[0xaaaaae568dc4] [ 6]
[queue-1-dy-queue-1-cr-1-21:05078] ./wrf.exe(+0x398dc4)[0xaaaaae568dc4] [ 7]
[queue-1-dy-queue-1-cr-1-21:05078] ./wrf.exe(+0x38f1b8)[0xaaaaae55f1b8] [ 7]
[queue-1-dy-queue-1-cr-1-21:05078] ./wrf.exe(+0x38f1b8)[0xaaaaae55f1b8] [ 8]
[queue-1-dy-queue-1-cr-1-21:05078] ./wrf.exe(+0x1d3b094)[0xaaaaaff0b094] [ 8]
[queue-1-dy-queue-1-cr-1-21:05078] ./wrf.exe(+0x1d3b094)[0xaaaaaff0b094] [ 9]
[queue-1-dy-queue-1-cr-1-21:05078] ./wrf.exe(+0x1d38c8c)[0xaaaaaff08c8c] [ 9]
[queue-1-dy-queue-1-cr-1-21:05078] ./wrf.exe(+0x1d38c8c)[0xaaaaaff08c8c] [10]
[queue-1-dy-queue-1-cr-1-21:05078] ./wrf.exe(+0x1d2c8bc)[0xaaaaafefc8bc] [10]
[queue-1-dy-queue-1-cr-1-21:05078] ./wrf.exe(+0x1d2c8bc)[0xaaaaafefc8bc] [11]
[queue-1-dy-queue-1-cr-1-21:05078] ./wrf.exe(+0x1da3c6c)[0xaaaaaff73c6c] [11]
[queue-1-dy-queue-1-cr-1-21:05078] ./wrf.exe(+0x1da3c6c)[0xaaaaaff73c6c] [12]
[queue-1-dy-queue-1-cr-1-21:05078] [12] /shared/arm/arm-linux-compiler-23.04.1_Ubuntu-20.04/lib/libomp.so(__kmp_invoke_microtask+0x9c)[0x40002d9ddabc]
[queue-1-dy-queue-1-cr-1-21:05078] *** End of error message ***
/shared/arm/arm-linux-compiler-23.04.1_Ubuntu-20.04/lib/libomp.so(__kmp_invoke_microtask+0x9c)[0x40002d9ddabc]
[queue-1-dy-queue-1-cr-1-21:05078] *** End of error message ***
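In case it helps anyone read the trace: the ./wrf.exe frames are only raw offsets, so they would need to be resolved to function names before the failing routine can be identified. Below is a minimal sketch of how I could pull out the unique offsets. The function name extract_wrf_offsets is my own (hypothetical), and resolving the offsets afterwards assumes that wrf.exe was built with debug symbols (-g) and that binutils addr2line is installed, neither of which I have confirmed.

```shell
#!/bin/bash
# Print the unique ./wrf.exe offsets found in an Open MPI backtrace file.
# extract_wrf_offsets is a hypothetical helper name, not part of WRF or Open MPI.
extract_wrf_offsets() {
    # Match frames such as "./wrf.exe(+0x398dc4)" and keep only the hex offset.
    grep -o 'wrf\.exe(+0x[0-9a-f]*)' "$1" \
        | sed 's/.*(+\(0x[0-9a-f]*\))/\1/' \
        | sort -u
}

# Example usage (assumption: wrf.exe contains debug symbols):
#   extract_wrf_offsets rsl.error.0018 | xargs addr2line -f -C -e ./wrf.exe
```

The duplicate frames in the log collapse to one offset each because of sort -u, so the interleaved output does not matter here.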
wrf.slurm
#!/bin/bash
#SBATCH --job-name=WRF
#SBATCH --nodes=6
#SBATCH --ntasks-per-node=4
#SBATCH --exclusive
#SBATCH --time=04:00:00
#SBATCH --exclude=queue-1-dy-queue-1-cr-1-[1-16]
export I_MPI_OFI_LIBRARY_INTERNAL=0
set -x
ulimit -s unlimited
ulimit -a
export OMP_NUM_THREADS=16
export FI_PROVIDER=efa
export I_MPI_FABRICS=ofi
export I_MPI_OFI_PROVIDER=efa
export I_MPI_PIN_DOMAIN=omp
export KMP_AFFINITY=compact
export I_MPI_DEBUG=4
export I_MPI_HYDRA_BOOTSTRAP=slurm
export I_MPI_ROOT=/shared/openmpi-4.1.4-acfl
time /shared/openmpi-4.1.4-acfl/bin/mpirun --map-by socket:PE=16 --bind-to core ./wrf.exe
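To check my suspicion that a particular node is at fault, something like the following sketch could count which hosts reported the segmentation fault across all rsl.error.* files in a run directory. The function name segfault_hosts is hypothetical, and it assumes every crash line starts with the "[hostname:pid]" prefix seen in rsl.error.0018.

```shell
#!/bin/bash
# Count segmentation-fault reports per host across the rsl.error.* files
# in the run directory given as $1. segfault_hosts is a hypothetical helper.
segfault_hosts() {
    # -h drops file names; the sed keeps only the host part of "[host:pid]".
    grep -h 'Signal: Segmentation fault' "$1"/rsl.error.* 2>/dev/null \
        | sed 's/^\[\([^]:]*\):[0-9]*].*/\1/' \
        | sort | uniq -c | sort -rn
}

# Example usage from the WRF run directory:
#   segfault_hosts .
```

If the same host name keeps appearing at the top over several failed runs, that would support excluding it with #SBATCH --exclude, as I already do for nodes 1-16.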