WRF Open MPI error on Slurm


I am running the WRF model in a Slurm environment using Open MPI.
I submitted a Slurm script called "wrf.slurm" to run the job on the compute nodes, and the error message below was written to rsl.error.0018.

The same script ran normally two days ago, but it failed with this error yesterday and failed again today.

I had the same situation a week ago.
After leaving the compute node unused for about a day, I ran the same script again a few days later and an error occurred.

It looks like something is wrong with the node, but I would like to know what the error message written in rsl.error.0018 means.

rsl.error.0018 error message

[queue-1-dy-queue-1-cr-1-21:05078] *** Process received signal ***
[queue-1-dy-queue-1-cr-1-21:05078] *** Process received signal ***
[queue-1-dy-queue-1-cr-1-21:05078] Signal: Segmentation fault (11)
[queue-1-dy-queue-1-cr-1-21:05078] Signal code: Invalid permissions (2)
[queue-1-dy-queue-1-cr-1-21:05078] Failing at address: 0x40002ddbc1b8
[queue-1-dy-queue-1-cr-1-21:05078] Signal: Segmentation fault (11)
[queue-1-dy-queue-1-cr-1-21:05078] Signal code: Invalid permissions (2)
[queue-1-dy-queue-1-cr-1-21:05078] Failing at address: 0x40002ddbc1b8
[queue-1-dy-queue-1-cr-1-21:05078] [queue-1-dy-queue-1-cr-1-21:05078] [ 0] [ 0][0x40002be5278c][0x40002be5278c]
[queue-1-dy-queue-1-cr-1-21:05078] [queue-1-dy-queue-1-cr-1-21:05078] [ 1] [ 1] /shared/openmpi-4.1.4-acfl/lib/[0x40002dd7c728] [queue-1-dy-queue-1-cr-1-21:05078] [ 2] /shared/openmpi-4.1.4-acfl/lib/[0x40002dd7c728]
[queue-1-dy-queue-1-cr-1-21:05078] [ 2] /shared/openmpi-4.1.4-acfl/lib/[0x40002dd7c128]
[queue-1-dy-queue-1-cr-1-21:05078] [ 3] /shared/openmpi-4.1.4-acfl/lib/[0x40002dd7c128]
[queue-1-dy-queue-1-cr-1-21:05078] [ 3] /shared/openmpi-4.1.4-acfl/lib/[0x40002dc759d0]
[queue-1-dy-queue-1-cr-1-21:05078] [ 4] /shared/openmpi-4.1.4-acfl/lib/[0x40002dc759d0]
[queue-1-dy-queue-1-cr-1-21:05078] [ 4] /shared/openmpi-4.1.4-acfl/lib/[0x40002d3c2f50]
[queue-1-dy-queue-1-cr-1-21:05078] [ 5] /shared/openmpi-4.1.4-acfl/lib/[0x40002d3c2f50]
[queue-1-dy-queue-1-cr-1-21:05078] [ 5] /shared/openmpi-4.1.4-acfl/lib/[0x40002d33b684]
[queue-1-dy-queue-1-cr-1-21:05078] [ 6] /shared/openmpi-4.1.4-acfl/lib/[0x40002d33b684]
[queue-1-dy-queue-1-cr-1-21:05078] ./wrf.exe(+0x398dc4)[0xaaaaae568dc4] [ 6]
[queue-1-dy-queue-1-cr-1-21:05078] ./wrf.exe(+0x398dc4)[0xaaaaae568dc4] [ 7]
[queue-1-dy-queue-1-cr-1-21:05078] ./wrf.exe(+0x38f1b8)[0xaaaaae55f1b8] [ 7]
[queue-1-dy-queue-1-cr-1-21:05078] ./wrf.exe(+0x38f1b8)[0xaaaaae55f1b8] [ 8]
[queue-1-dy-queue-1-cr-1-21:05078] ./wrf.exe(+0x1d3b094)[0xaaaaaff0b094] [ 8]
[queue-1-dy-queue-1-cr-1-21:05078] ./wrf.exe(+0x1d3b094)[0xaaaaaff0b094] [ 9]
[queue-1-dy-queue-1-cr-1-21:05078] ./wrf.exe(+0x1d38c8c)[0xaaaaaff08c8c] [ 9]
[queue-1-dy-queue-1-cr-1-21:05078] ./wrf.exe(+0x1d38c8c)[0xaaaaaff08c8c] [10]
[queue-1-dy-queue-1-cr-1-21:05078] ./wrf.exe(+0x1d2c8bc)[0xaaaaafefc8bc] [10]
[queue-1-dy-queue-1-cr-1-21:05078] ./wrf.exe(+0x1d2c8bc)[0xaaaaafefc8bc] [11]
[queue-1-dy-queue-1-cr-1-21:05078] ./wrf.exe(+0x1da3c6c)[0xaaaaaff73c6c] [11]
[queue-1-dy-queue-1-cr-1-21:05078] ./wrf.exe(+0x1da3c6c)[0xaaaaaff73c6c] [12]
[queue-1-dy-queue-1-cr-1-21:05078] [12] /shared/arm/arm-linux-compiler-23.04.1_Ubuntu-20.04/lib/[0x40002d9ddabc]
[queue-1-dy-queue-1-cr-1-21:05078] *** End of error message ***
[queue-1-dy-queue-1-cr-1-21:05078] *** End of error message ***

wrf.slurm


#SBATCH --job-name=WRF
#SBATCH --nodes=6
#SBATCH --ntasks-per-node=4
#SBATCH --exclusive
#SBATCH --time=04:00:00
#SBATCH --exclude=queue-1-dy-queue-1-cr-1-[1-16]

set -x
ulimit -s unlimited
ulimit -a

export FI_PROVIDER=efa
export I_MPI_FABRICS=ofi
export I_MPI_PIN_DOMAIN=omp
export KMP_AFFINITY=compact
export I_MPI_DEBUG=4
export I_MPI_ROOT=/shared/openmpi-4.1.4-acfl

time /shared/openmpi-4.1.4-acfl/bin/mpirun --map-by socket:pE=16 --bind-to core ./wrf.exe

Please help me.
Apologies for the delay due to the holidays.

It's difficult to say what the printed error means. I believe those messages are specific to your environment. If the failed simulation is identical to the successful one (i.e., same domain, dates, input data, namelist settings, etc.), that also indicates the issue is related to your specific environment. If you would like me to take a more in-depth look at your rsl files, please package all of them into a single *.tar file and attach it, along with your namelist.input file. Otherwise, I would suggest speaking to a systems administrator about the problem to see if they can help troubleshoot.
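
As a rough sketch of the packaging step, and of one way to get more out of the backtrace, something like the following might help. The function names (pack_rsl, extract_offsets, resolve_offsets) and the archive name used below are made up for illustration. The sketch assumes the run directory contains rsl.error.*, rsl.out.* and namelist.input, and that addr2line from GNU binutils is available. addr2line can only report source files and line numbers if wrf.exe was built with debug information, so the result may be just function names or "??".

```shell
#!/bin/bash
# Hypothetical helpers for collecting WRF diagnostics (names are illustrative).

# Bundle every rsl.* file plus namelist.input from a run directory into one tar file.
# Usage: pack_rsl <run_dir> <output.tar>
pack_rsl() {
    local run_dir="$1" out_tar="$2"
    # The subshell changes into the run directory so the archive stores relative paths;
    # the redirection to the output file is resolved in the caller's directory.
    ( cd "$run_dir" && tar -cf - rsl.error.* rsl.out.* namelist.input ) > "$out_tar"
}

# Print the unique wrf.exe offsets (the "+0x..." values) found in a backtrace.
# Usage: extract_offsets <rsl.error file>
extract_offsets() {
    grep -o 'wrf\.exe(+0x[0-9a-f]*)' "$1" \
        | sed 's/.*(+\(0x[0-9a-f]*\))/\1/' \
        | sort -u
}

# Ask addr2line to translate each offset into a function name and source location.
# Usage: resolve_offsets <rsl.error file> <path to wrf.exe>
resolve_offsets() {
    extract_offsets "$1" | while read -r offset; do
        echo "== $offset"
        addr2line -f -C -e "$2" "$offset"
    done
}
```

From the run directory, `pack_rsl . rsl_files.tar` would produce the archive to attach, and `resolve_offsets rsl.error.0018 ./wrf.exe` would list whatever symbol information is available for the wrf.exe frames in the trace. The frames inside the Open MPI and compiler libraries are printed without offsets in this log, so this approach does not resolve them.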