
WRF core limit error with docker container

wetter

New member
I'm currently running a small 200x150 domain in a WRFV4.5.2 docker container. The host machine has a 48-core/96-thread AMD CPU and runs Ubuntu 22.04.
I tested the timing and found that I only need around 40 threads to reach optimal performance with NetCDF + HDF5 and the GNU compilers, but I still want to know what causes the following issue.

The problem is that when running in docker, using 64 or more CPU threads (64, 72, 80, 90, 96, etc.) causes:
Program received signal SIGBUS: Access to an undefined portion of a memory object.
The most I can use in the docker container is 63 threads, while on the host machine the run is fine with all 96 threads or any other combination.
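For reference, the kind of invocation being described might look like this (the launcher and executable path are assumptions; the exact command isn't shown):

mpirun -np 64 ./wrf.exe   # 64 or more ranks inside the container: SIGBUS
mpirun -np 63 ./wrf.exe   # 63 ranks: runs fine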

I followed a previous post and set the docker container to privileged mode, but the problem remains.
I also configured with the -D (debug) option; the problem seems to be related to MPI, but I still can't figure out the cause. I hope someone can give me some guidance.
The error log and namelist are attached.
 

Attachments

  • wrf.log
    24.7 KB · Views: 2
  • namelist.input
    3.7 KB · Views: 3
I am sorry that we have little experience with docker; I hope someone in the community can provide more information.
 
Solution
Increase the shared memory size of the docker container,
e.g. --shm-size=512m
If the container already exists, you can change it in the container's hostconfig.json.
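A sketch of both options (the image and container names are illustrative):

# New container: request a larger /dev/shm at creation time.
docker run --shm-size=1g --name wrf_run my_wrf_image

# Existing container: stop it, then edit
#   /var/lib/docker/containers/<container-id>/hostconfig.json
# and set "ShmSize" in bytes (e.g. "ShmSize": 1073741824 for 1 GB),
# then restart the Docker daemon so the new value takes effect.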

Why
It may be a problem with MPI, if you use it. I looked into why MPI exited with code 135: exit codes above 128 mean the process was killed by a signal, and 135 - 128 = 7 is SIGBUS. (MPI implementations typically use /dev/shm for intra-node communication, which is why the container's shared-memory limit matters.)
The rsl.error file also points to a memory-access problem.
Obviously, the probability that the program itself has a memory-access bug is almost zero.
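A quick shell check of that signal mapping (exit statuses above 128 encode the terminating signal number):

kill -l $((135 - 128))   # prints: BUS, i.e. SIGBUS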

So I observed my container's shared memory. When it hit the 64 MB limit (the docker default), the program would core dump shortly afterwards.
When I raised the limit to 1 GB, the problem was gone.
I attach the shell script that I use to observe the shared memory. After 4 hours (32 processes), shared-memory usage had risen to 75 MB. This confirms what I think.
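For reference, the 64 MB default can be confirmed in a throwaway container (the image name here is just an example):

docker run --rm ubuntu:22.04 df -h /dev/shm
# typically reports a 64M shm filesystem mounted at /dev/shm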

The shell script shm_monitor.sh:

#!/bin/bash
# Log /dev/shm usage at a fixed interval while a given process is alive.
PROGRAM_PID=$1
INTERVAL=1
LOGFILE="shm_usage.log"

get_shm_usage() {
    # Print the "Used" column of df's /dev/shm line.
    df -h /dev/shm | awk 'NR==2 {print $3}'
}

monitor_shm_usage() {
    echo "Monitoring shared memory usage for PID $PROGRAM_PID" > "$LOGFILE"
    # kill -0 sends no signal; it only checks that the process still exists.
    while kill -0 "$PROGRAM_PID" 2>/dev/null; do
        SHM_USAGE=$(get_shm_usage)
        echo "$(date '+%Y-%m-%d %H:%M:%S') - Shared Memory Used: $SHM_USAGE" >> "$LOGFILE"
        sleep "$INTERVAL"
    done
    echo "$(date '+%Y-%m-%d %H:%M:%S') - Program terminated. Final Shared Memory Used: $SHM_USAGE" >> "$LOGFILE"
}

if kill -0 "$PROGRAM_PID" 2>/dev/null; then
    monitor_shm_usage
else
    echo "Error: Process with PID $PROGRAM_PID not found."
    exit 1
fi
Run it with ./shm_monitor.sh <PID>
(don't forget to chmod +x shm_monitor.sh first).
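For example, assuming the WRF executable is named wrf.exe, you can attach the monitor to the first matching process:

./shm_monitor.sh "$(pgrep -f wrf.exe | head -n 1)"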

PS:
I'm speculating that the same holds for other, similar memory-leak issues, depending on the time step set in namelist.input, the size of the forecast domain, the number of processes allocated, processor performance, and so on.
Slow processing causes task data to pile up, like a growing queue of tasks.
 