segmentation fault with WRF 4.0.3 and OpenMPI

This post is from a previous version of the WRF & MPAS-A Support Forum. New replies have been disabled. If you have follow-up questions related to this post, please start a new thread from the forum home page.

davidovens

New member
I've been trying to get WRF 4.0.3 running on our cluster here at the
University of Washington. I've built with icc/ifort (17.0.0.098) and OpenMPI
(2.1.1) on our Intel Xeon cluster. I had no trouble running this
simple 2-domain case with an SMP-compiled version, so I know that the
inputs and namelist.input are fine.

However, every build with OpenMPI -- trying different versions
of OpenMPI (1.8.8 and 2.1.1) and ifort (15.0.0.090 and 17.0.0.098) --
leads to a segmentation fault at the first time step (see below).
I have, of course, tried setting the stack size to a healthy 6000M (which is
more than enough for this small 36/12-km domain run).
I am not using the hybrid vertical coordinate. Does anyone have any
ideas or has anyone had similar problems with OpenMPI?
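
For reference, here is roughly how the run is launched (a sketch in csh syntax; -np 64 is just an example process count, and site-specific mpirun options are omitted):

limit stacksize 6000m        # tcsh; the 6000M shell stack limit mentioned above
mpirun -np 64 ./wrf.exe      # OpenMPI launch of the dmpar build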

Error seen in rsl.error.0000:
....
D01 3-D analysis nudging reads new data at time = 0.000 min.
D01 3-D analysis nudging bracketing times = 0.00 180.00 min.
mediation_integrate.G 1943 DATASET=HISTORY
mediation_integrate.G 1944 grid%id 2 grid%oid 3
Timing for Writing wrfout_d02_2019-01-04_12:00:00 for domain 2: 1.12362 elapsed seconds
Tile Strategy is not specified. Assuming 1D-Y
WRF TILE 1 IS 1 IE 46 JS 1 JE 24
WRF NUMBER OF TILES = 1
Timing for main: time 2019-01-04_12:01:12 on domain 2: 1.78872 elapsed seconds
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image PC Routine Line Source
wrf.exe.mpi211if2 00000000033FFB61 tbk_trace_stack_i Unknown Unknown
wrf.exe.mpi211if2 00000000033FDC9B tbk_string_stack_ Unknown Unknown
wrf.exe.mpi211if2 0000000003378974 Unknown Unknown Unknown
wrf.exe.mpi211if2 0000000003378786 tbk_stack_trace Unknown Unknown
wrf.exe.mpi211if2 00000000033023A7 for__issue_diagno Unknown Unknown
wrf.exe.mpi211if2 0000000003309FD0 for__signal_handl Unknown Unknown
libpthread-2.19.s 00007FADE38D6890 Unknown Unknown Unknown
wrf.exe.mpi211if2 000000000040D46E Unknown Unknown Unknown
libc-2.19.so 00007FADE353DB45 __libc_start_main Unknown Unknown
wrf.exe.mpi211if2 000000000040D369 Unknown Unknown Unknown

Thanks,
David Ovens
 
David...

I have to do:

setenv KMP_STACKSIZE 256m

on the systems that I use (TACC, PSC). If your MPI implementation is different, then the environment
variable name may be different.

This setting came from one of the TACC staff. Before I did this, my runs would also seg fault.

This is unrelated to setting the shell stacksize.
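
In csh that is just the one line above before launching; with OpenMPI you may also need to export the variable to the remote ranks via mpirun's -x option. A rough sketch (the process count is only an example):

setenv KMP_STACKSIZE 256m
mpirun -np 64 -x KMP_STACKSIZE ./wrf.exe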
 
Actually, KMP_STACKSIZE is used for OpenMP (the shared-memory processing, aka smp), not OpenMPI (distributed-memory processing, dmpar), at least on our machines. I set KMP_STACKSIZE when I run the smp code. But setting KMP_STACKSIZE to 4000M doesn't help this dmpar/OpenMPI code.

As a follow-up, I recompiled 4.0.3 with an older version (2.0.2) of OpenMPI. In this instance, I see this error in the rsl.error.0000 file:
WRF NUMBER OF TILES = 1
D01 3-D analysis nudging reads new data at time = 0.000 min.
D01 3-D analysis nudging bracketing times = 0.00 180.00 min.
d01 2019-01-04_12:00:00 18828 points exceeded cfl=2 in domain d01 at time 2019-01-04_12:00:00 hours
forrtl: severe (154): array index out of bounds
Image PC Routine Line Source
wrf.exe.mpi202if2 000000000388C141 tbk_trace_stack_i Unknown Unknown
wrf.exe.mpi202if2 000000000388A27B tbk_string_stack_ Unknown Unknown

The CFL error and the array-index-out-of-bounds problem are really odd, considering that the exact same wrfbdy and wrfinput files work fine
with my smp-compiled version of the same WRF 4.0.3 code. This seems to indicate that the OpenMPI compilation of the code has some
bad integers for array indices -- again, quite odd, since the smp/OpenMP compilation of the code does not have that issue.
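
One way to chase down the bad indices would be a debug rebuild with run-time checking turned on, roughly like this in configure.wrf (a sketch only -- these are standard ifort options, and the exact FCDEBUG/FCOPTIM lines depend on your configure.wrf):

FCDEBUG = -g -traceback -check bounds,uninit -fpe0   # catch out-of-bounds accesses and floating-point exceptions
FCOPTIM = -O0                                        # turn off optimization so the traceback points at real source lines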
 
David,

Would you please upload your namelist.input, wrfinput, wrfbdy, wrffdda, etc. for me to take a look? I will try to repeat your case.

We have ifort/icc version 17.0.1 here, which is quite similar to the version you use, but we have OpenMPI 3.0.1. Anyway, I will see whether I can repeat your problem first.
 
Thanks for checking in.

I have gotten it working by modifying some things in the configure.wrf file. I am isolating the cause of the problem now, but I suspect
it has something to do with mixing netCDF-4 and netCDF-3 libraries. I will post my findings once I've finished.

David
 
For WRF v4.0.3 on our Intel Xeon (E5-2650) chips, it turns out that the segmentation faults at the first time step are due
to using
"-fp-model fast=1" or
"-fp-model fast=2"
in ifort/icc 17.0.0.

Using
"-fp-model precise"
allows the code to run fine (using SMP or DMPAR mode) with the ifort/icc 17.0.0 compiler. Here is the key line in my configure.wrf file:
FCBASEOPTS_NO_G = -ip -fp-model precise -w -ftz -align all -fno-alias $(FORMAT_FREE) $(BYTESWAPIO)

Also, WRF 4.0.3 code compiled using "-xHost" (which implies "-fp-model fast=2") will suffer the segmentation faults unless you add the
"-fp-model precise" flag as shown above.

I also discovered that ifort/icc 15.0.0 works fine with all the
"-fp-model fast=1"
"-fp-model fast=2"
"-fp-model precise"
compiler options in SMP/OpenMP or DMPAR/OpenMPI compilations.

I am not sure whether this "bug" is an issue with the ifort/icc 17.0.0 compiler or an issue in the WRF 4.0.3 code.

David Ovens
 
David...

Definitely compiler.

Intel 17.x and Intel 18.x optimizations sometimes make a mess of the WRF code. Been there, done that.
 