Program received signal SIGBUS

Hi,

I configured and compiled WRF using option 34 (dmpar).

The system I used has the following configuration:

"""
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 46 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 36
On-line CPU(s) list: 0-35
Vendor ID: GenuineIntel
Model name: Intel(R) Xeon(R) Gold 6140 CPU @ 2.30GHz
CPU family: 6
Model: 85
Thread(s) per core: 1
Core(s) per socket: 18
Socket(s): 2
Stepping: 4
"""

I tried to run WRF as follows:

$ ssh compute2
$ mpirun -np 72 ./wrf.exe

Although I received "SUCCESS COMPLETE WRF," some issues were observed in the rsl files:

"""
SUCCESS COMPLETE WRF

Program received signal SIGBUS: Access to an undefined portion of a memory object.

Backtrace for this error:
#0 0x7efe82623880 in ???
#1 0x7efe82622a25 in ???
#2 0x7efe7e43e6ef in ???
...
"""

Additionally, the following message was displayed upon completion of the run:

"""
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES

= PID 382978 RUNNING AT compute2

= EXIT CODE: 9

= CLEANING UP REMAINING PROCESSES

= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
"""

It seems WRF ran successfully, and all outputs were created, hopefully correctly. However, I'm unsure about this signal. I found some threads about the same message ("WRF core limit error with docker container", "WRF nests in Dockerized environment : SIGBUS", and "Program received signal SIGBUS: Access to an undefined portion of a memory object"), but they all pertain to Docker, which does not seem to apply to my case.

I spoke with the admin, whose only suspicion is that this memory error is related to the compute node running out of memory, since someone else might have been using the same node through a Slurm script. I'm curious how the run could be successful with these messages being displayed.

All rsl files are attached.

Any ideas would be appreciated.
 

Attachments

  • rsl_sigbus.tar.gz
    2.9 MB
Hi,
It sounds like your systems administrator is likely right. Compiled programs can print errors that stem from system or environment issues (e.g., something going wrong with the processes right after the simulation completes, but before the processors are released from your run). I suggest looking through your output. If it all looks okay, you don't notice any missing data, and the results are reasonable, it's probably safe to assume the simulation completed as it should.
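
As a starting point for looking through the output, here is a minimal Python sketch that reports, for each rsl file, whether it contains the "SUCCESS COMPLETE WRF" message and which signals (if any) were logged. It assumes the rsl files sit in the run directory and match the pattern `rsl.*`; the function names `summarize_rsl_text` and `summarize_run` are made up for this illustration and are not part of WRF. It only scans the log text and says nothing about whether the wrfout data themselves are complete.

```python
import re
from pathlib import Path

# Marker strings taken from the rsl excerpts quoted in this thread.
SUCCESS_MARKER = "SUCCESS COMPLETE WRF"
SIGNAL_PATTERN = re.compile(r"Program received signal (SIG[A-Z0-9]+)")


def summarize_rsl_text(text):
    """Return (completed, signals) for the contents of one rsl file."""
    completed = SUCCESS_MARKER in text
    signals = SIGNAL_PATTERN.findall(text)
    return completed, signals


def summarize_run(run_dir="."):
    """Map each rsl.* file name in run_dir to its (completed, signals) pair.

    Assumption: the rsl files live directly in run_dir and match 'rsl.*'.
    """
    summary = {}
    for path in sorted(Path(run_dir).glob("rsl.*")):
        text = path.read_text(errors="replace")
        summary[path.name] = summarize_rsl_text(text)
    return summary


if __name__ == "__main__":
    for name, (completed, signals) in summarize_run().items():
        status = "completed" if completed else "NO success message"
        extra = ", signals: " + ", ".join(signals) if signals else ""
        print(f"{name}: {status}{extra}")
```

If every rsl file shows the success message and the signal appears only after it, that would be consistent with the problem occurring during shutdown rather than during the integration, though it is still worth checking the wrfout files for the expected times and fields.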
 