WRF v4.5.1 SIGSEGV during nested domain initialization

vayu

New member
Hi all,

I am getting a SIGSEGV error while running wrf.exe. Any help fixing this would be much appreciated. Below is what we have tried so far.

System Information
  • WRF Version: V4.5.1
  • Compiler: GCC 14.3 via PrgEnv-gnu / Cray ftn wrapper
  • MPI: Cray MPICH 8.1.32
  • Build type: dmpar (distributed memory parallel), option 34
  • Number of MPI tasks: 64 (project allocation limit on Setonix)
  • Domains: 3 nested (27km/9km/3km), 163×112 cells each, 35 vertical levels
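For reference, the relevant &domains settings look roughly like this. The values are reconstructed from the description above; parent start indices and time-step settings are omitted, and the exact layout of our namelist may differ:

```
&domains
 max_dom                = 3,
 e_we                   = 163, 163, 163,
 e_sn                   = 112, 112, 112,
 e_vert                 = 35,  35,  35,
 dx                     = 27000,       ! d01 grid spacing in metres
 dy                     = 27000,
 parent_grid_ratio      = 1, 3, 3,     ! gives 27 km / 9 km / 3 km
 parent_time_step_ratio = 1, 3, 3,
/
```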
Stage 1 — First crash: SIGSEGV after Noah LSM and Thompson MP table reads

The very first run crashed immediately after reading the Thompson microphysics lookup tables and initializing the Noah LSM. At this stage the namelist had mp_physics = 8 (Thompson) and sf_surface_physics = 2 (Noah LSM).

ThompMP: read qr_acr_qg_V4.dat instead of computing
ThompMP: read qr_acr_qsV2.dat instead of computing
ThompMP: read freezeH2O.dat instead of computing
INITIALIZE THREE Noah LSM RELATED TABLES
LANDUSE TYPE = MODIFIED_IGBP_MODIS_NOAH FOUND 20 CATEGORIES
Program received signal SIGSEGV: Segmentation fault - invalid memory reference.
#4 0x38481b4 in ???

What we ruled out:
  • OOM: --exclusive flag confirmed; the full 245 GB of node memory was available
  • Stack overflow: ulimit -s unlimited set in both the batch script and the srun wrapper
  • Broken input files: the wrfinput and wrfbdy files were verified present and of the expected sizes

Stage 2 — Fixing namelist issues
We noticed the namelist contained several issues that should not be present in a standalone WRF configuration:

  • A &wrf_cmaq section left over from a WRF-CMAQ coupled build
  • auxhist2 chemistry output stream lines
  • Missing sst_update = 1 for a 73-day simulation
  • Missing auxinput4 SST update configuration

Changes made:
  • Removed &wrf_cmaq namelist block entirely
  • Removed auxhist2 lines
  • Added sst_update = 1 with auxinput4_inname, auxinput4_interval, io_form_auxinput4
  • Reran real.exe to generate wrflowinp_d01/02/03 files
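For anyone hitting the same thing, the SST-update addition was along these lines. The interval value shown is a placeholder (it is in minutes and must match the frequency of your SST input), so treat the numbers as illustrative:

```
&time_control
 auxinput4_inname   = "wrflowinp_d<domain>",
 auxinput4_interval = 360, 360, 360,   ! minutes; match your SST input frequency
 io_form_auxinput4  = 2,               ! netCDF
/
&physics
 sst_update = 1,
/
```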

Result: The crash address shifted from 0x38481b4 to 0x24efa74, but it was still a SIGSEGV in the same RSL_LITE region, and it still occurred after the Thompson MP table reads and Noah LSM initialization.

Stage 3 — Changing microphysics scheme

We suspected the Thompson MP (mp_physics = 8) lookup-table reads might be involved in the crash, so we switched to WSM6 (mp_physics = 6).

Result: No change; the same SIGSEGV at 0x24efa74. The Thompson tables were no longer read (confirming the scheme actually changed), but the crash location and address were identical. This ruled out Thompson MP as the cause.
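For completeness, the scheme swap was a single &physics edit, roughly:

```
&physics
 mp_physics = 6, 6, 6,   ! was 8, 8, 8 (Thompson); 6 = WSM6
/
```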


Stage 4 — Changing land surface model to Noah-MP

We found a WRF forum post suggesting that switching from Noah LSM (sf_surface_physics = 2) to Noah-MP (sf_surface_physics = 4) resolved a similar crash on another system. We made the switch, added a proper &noah_mp namelist section with the correct option names (dveg, opt_crs, opt_btr, etc., confirmed against Registry.EM_COMMON), and reran real.exe.
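The added &noah_mp block looked roughly like this. The option names are from Registry.EM_COMMON, but the values shown here are commonly used defaults, not a recommendation for any particular setup:

```
&noah_mp
 dveg    = 4,   ! dynamic vegetation option
 opt_crs = 1,   ! stomatal resistance
 opt_btr = 1,   ! soil moisture factor for stomatal resistance
 opt_run = 1,   ! runoff and groundwater
 opt_sfc = 1,   ! surface-layer drag coefficient
/
```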

Result: Significant progress — WRF now gets considerably further through initialization:
SOIL TEXTURE CLASSIFICATION = STAS FOUND 19 CATEGORIES
start_domain_em: After call to phy_init
start_em: calling lightning_init
start_em: after calling lightning_init
calling inc/HALO_EM_INIT_1_inline.inc
calling inc/HALO_EM_INIT_2_inline.inc
calling inc/HALO_EM_INIT_3_inline.inc
calling inc/HALO_EM_INIT_4_inline.inc
calling inc/HALO_EM_INIT_5_inline.inc
Program received signal SIGSEGV: Segmentation fault - invalid memory reference.
#4 0x24efa74 in ???
#5 0x24ef7e1 in ???
#6 0x24edaab in ???
#7 0x13cc4b7 in ???
#8 0x1b1c46e in ???
#9 0x17cf394 in ???
#10 0x12571be in ???
#11 0x406e5e in ???
#12 0x405b1b in ???


The crash now clearly occurs during halo-exchange initialization rather than during the LSM table reads. Noah-MP got us past the LSM issue but exposed the underlying RSL_LITE problem.

Stage 5 — Reducing compiler optimisation
Based on WRF GitHub Issue #1764, which describes RSL_LITE crashes with GCC 12+ and aggressive optimisation, we reduced the optimisation level in configure.wrf:

FCOPTIM      = -O1        (was -O2 -ftree-vectorize -funroll-loops)
CFLAGS_LOCAL = -w -O1 -c  (was -w -O3 -c)

A full clean recompile was performed; the build took roughly 8 minutes, which confirms it was not an incremental rebuild.


Result: The crash address remained at 0x24efa74, identical to before. Reducing optimisation had no effect on the crash.


Current Status
The crash is deterministic: always at 0x24efa74 (rsl_free), always during HALO_EM_INIT_5, and always before domain 2 initializes.
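For context, the rsl_free identification came from resolving the raw backtrace addresses against the executable, with something like the following (this assumes wrf.exe was built with debug symbols and not stripped; without -g the mapping may only resolve to a function name or not at all):

```
addr2line -f -e ./wrf.exe 0x24efa74
```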

Questions
  1. Is there a known issue with WRF V4.5.1 + GCC 14 on Cray systems causing rsl_free corruption during HALO_EM_INIT_5?
  2. Does our domain decomposition (8×8 processor grid, 163×112 cells, 3 nests with identical e_we/e_sn) trigger any known RSL_LITE edge cases during HALO_EM_INIT_5?
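In case it helps narrow down question 2: we can also try pinning the decomposition explicitly instead of letting RSL_LITE choose it, e.g. (values chosen to match our 64 tasks; purely a debugging experiment, not our current config):

```
&domains
 nproc_x = 8,
 nproc_y = 8,
/
```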