ehsantaghizadeh
Member
Hi,
I configured and compiled WRF using option 34 (dmpar).
The system I used has the following configuration:
"""
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 46 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 36
On-line CPU(s) list: 0-35
Vendor ID: GenuineIntel
Model name: Intel(R) Xeon(R) Gold 6140 CPU @ 2.30GHz
CPU family: 6
Model: 85
Thread(s) per core: 1
Core(s) per socket: 18
Socket(s): 2
Stepping: 4
"""
I tried to run WRF as follows:
$ ssh compute2
$ mpirun -np 72 ./wrf.exe
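As a side check on the command above, here is a minimal sketch that compares the number of MPI ranks requested with the number of logical CPUs implied by the lscpu output. The function names are hypothetical, and the numbers are copied from this post. It only quantifies the ratio; it does not show that this ratio is what triggers the signal.

```python
def logical_cpus(sockets, cores_per_socket, threads_per_core):
    """Logical CPU count implied by the lscpu fields."""
    return sockets * cores_per_socket * threads_per_core


def ranks_per_cpu(ranks, cpus):
    """How many MPI ranks would share each logical CPU."""
    return ranks / cpus


# Values taken from the lscpu output and the mpirun command in this post.
cpus = logical_cpus(sockets=2, cores_per_socket=18, threads_per_core=1)
ratio = ranks_per_cpu(ranks=72, cpus=cpus)
print(f"{cpus} logical CPUs, 72 ranks, {ratio:.1f} ranks per CPU")
```

With these values, the node exposes 36 logical CPUs while 72 ranks are requested, so each CPU would host two ranks.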
Although I received "SUCCESS COMPLETE WRF," some issues were observed in the rsl files:
"""
SUCCESS COMPLETE WRF
Program received signal SIGBUS: Access to an undefined portion of a memory object.
Backtrace for this error:
#0 0x7efe82623880 in ???
#1 0x7efe82622a25 in ???
#2 0x7efe7e43e6ef in ???
...
"""
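To see which ranks reported the signal and whether each of them also reached the success line, the rsl files can be scanned with a short script. This is a minimal sketch, not a WRF tool: the helper names are hypothetical, and the default file pattern `rsl.error.*` is an assumption about how the logs are named in the run directory.

```python
import glob
import re

SUCCESS = "SUCCESS COMPLETE WRF"
SIGNAL_RE = re.compile(r"Program received signal (\w+)")


def scan_rsl_text(text):
    """Return (finished, signals) for the contents of one rsl file."""
    finished = SUCCESS in text
    signals = SIGNAL_RE.findall(text)
    return finished, signals


def scan_rsl_files(pattern="rsl.error.*"):
    """Scan every file matching the pattern (the pattern is an assumption)."""
    report = {}
    for path in sorted(glob.glob(pattern)):
        with open(path, errors="replace") as fh:
            report[path] = scan_rsl_text(fh.read())
    return report


if __name__ == "__main__":
    # Print only the ranks that look suspicious.
    for path, (finished, signals) in scan_rsl_files().items():
        if signals or not finished:
            status = "finished" if finished else "NOT finished"
            print(path, status, signals)
```

A rank that shows both the success line and a signal, as in the excerpt above, would be listed as finished together with the signal name.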
Additionally, the following message was displayed upon completion of the run:
"""
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= PID 382978 RUNNING AT compute2
= EXIT CODE: 9
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
"""
It seems WRF ran to completion, and all output files were created, hopefully correctly. However, I'm unsure what to make of this signal. I found some threads (WRF core limit error with docker container, WRF nests in Dockerized environment : SIGBUS, Program received signal SIGBUS: Access to an undefined portion of a memory object) about the same "Program received signal SIGBUS: Access to an undefined portion of a memory object" message, but they all pertain to Docker, which might not apply to my case.
I spoke with the admin, whose only suspicion is that this memory error is related to the compute node running out of memory, since someone else might have been using the same node through a Slurm script. I'm curious how the run could finish successfully while these messages are displayed.
All rsl files are attached.
Any ideas would be appreciated.