
MPI + OpenMP Results in Seg Fault a Couple of Minutes Into Run

RobWaters

Member
Hi WRF team,

We've been running WRF successfully using pure MPI (dmpar) and want to explore potential gains from switching to hybrid MPI + OpenMP (dm+sm). Unfortunately, we get a seg fault about two minutes into a model run. Any advice or pointers? I know it is usually advised to go with either a pure MPI or a pure OpenMP build, but I'd really like to find out what is causing the crash.

We are using:
- Intel oneAPI 2022.2.1
- Intel MPI 2021.7.1

The namelist is attached; it runs fine with MPI but crashes with MPI + OpenMP. The test run script is also attached; among other things, it sets the stack size to unlimited.
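For reference, the relevant settings in the script look roughly like this (a minimal sketch; the exact script is in the attachment, and the thread count and stack size here are just illustrative):

  # set in run_wrf.sh before launching wrf.exe
  ulimit -s unlimited          # shell stack size
  export OMP_NUM_THREADS=4     # OpenMP threads per MPI task (illustrative)
  export OMP_STACKSIZE=512M    # per-thread OpenMP stack (illustrative)
  mpirun -np 8 ./wrf.exe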

Let me know if I can provide any more info.

Thanks in advance for any help,

Rob
 

Attachments

  • configure.wrf.txt (23.3 KB)
  • rsl.error.0000 (10.5 KB)
  • namelist.input (5.6 KB)
  • rsl.out.0000 (10.5 KB)
  • run_wrf.sh.txt (304 bytes)
With OpenMP, you may need to increase OMP_STACKSIZE, which is separate from the shell stack size. The default is usually 4M (MB). If this were the problem, though, I would expect the seg fault to happen on the first time step. You can try setting it larger, e.g.,

setenv OMP_STACKSIZE 16M
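(That is csh syntax; in a bash script such as your run_wrf.sh, the equivalent is export OMP_STACKSIZE=16M.)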

On the other hand, the sudden decrease in the adaptive time step could mean the run has become unstable. You might try a constant, smaller dt to see if that is the problem.
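If you want to test that, a minimal sketch of the change in namelist.input (the time_step value is illustrative; a common starting point is roughly 6 x dx, with dx in km):

  &domains
   use_adaptive_time_step = .false.,
   time_step              = 60,
   ...
  /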
 
Thanks for your reply. Inside the run script (run_wrf.sh.txt) we do set the stack size; we even tried extremely large values, e.g. export OMP_STACKSIZE="30G".

The fact that the run gets going at all suggests to me that I'm not dealing with anything fundamental such as missing libraries.

Running with a single MPI task also results in the same issue.

Any other suggestions would be appreciated.
 
There was another suggestion in that reply: did you try a fixed, smaller dt? Instabilities can cause out-of-bounds array accesses in the physics, which show up as seg faults.
 
We're using the adaptive time step and have successfully run an identical case without OpenMP. With no CFL errors being reported, I'm not expecting this to be a physical instability.
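(For what it's worth, I checked by searching the logs with something like grep -i cfl rsl.error.*, and nothing comes up.)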
 
It is generally not recommended to run WRF in dm+sm mode, because the OpenMP parallelization is not well implemented in WRF. Furthermore, with WRF now in maintenance status, improving the OpenMP parallelization is no longer a top priority for us.
 