WRF Crashes With Seg-Fault After First Output [4.2.2]

kaiden_07

New member
I'm running a 3 km/1 km nested simulation across southeastern Canada and the upper northeastern US over a 96-hour period. These simulations use the BEP/BEM urban scheme with the MYNN PBL. I successfully ran this simulation to completion using the default geo_em files as input to metgrid, followed by real.exe and wrf.exe. When I switched to geo_em files that include LCZs (urbanized via WUDAPT-to-WRF), metgrid and real ran without issues, but wrf.exe now crashes with a seg-fault on the second timestep.
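For reference, a minimal sketch of the physics settings that combination implies is below; I'm assuming the usual option numbers (BEP+BEM as sf_urban_physics = 3, MYNN as bl_pbl_physics = 5 with sf_sfclay_physics = 5) and num_land_cat = 61 for the LCZ-augmented geo_em files. The attached namelist.input has my actual values.

Code:
&physics
 bl_pbl_physics     = 5,  5,   ! MYNN 2.5-level PBL (assumed option number)
 sf_sfclay_physics  = 5,  5,   ! MYNN surface layer (assumed option number)
 sf_urban_physics   = 3,  3,   ! BEP+BEM multi-layer urban scheme (assumed option number)
 num_land_cat       = 61,      ! MODIS + LCZ classes from the WUDAPT-to-WRF geo_em files (assumed)
/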

Snippet of rsl.error.0000 around error occurrence and traceback (debug_level 100):

Code:
d02 2022-05-19_00:00:00 calling inc/HALO_EM_SCALAR_E_5_inline.inc
d02 2022-05-19_00:00:05 module_integrate: back from solve interface
Timing for main: time 2022-05-19_00:00:05 on domain   2:    9.71652 elapsed seconds
d02 2022-05-19_00:00:05 module_integrate: calling solve interface
d02 2022-05-19_00:00:05  grid spacing, dt, time_step_sound=   1000.000       5.000000               4
d02 2022-05-19_00:00:05 calling inc/HALO_EM_MOIST_OLD_E_7_inline.inc
d02 2022-05-19_00:00:05 calling inc/PERIOD_BDY_EM_MOIST_OLD_inline.inc
d02 2022-05-19_00:00:05 calling inc/HALO_EM_A_inline.inc
d02 2022-05-19_00:00:05 calling inc/PERIOD_BDY_EM_A_inline.inc
d02 2022-05-19_00:00:05 calling inc/HALO_EM_PHYS_A_inline.inc
d02 2022-05-19_00:00:05 Top of Radiation Driver
d02 2022-05-19_00:00:05 calling inc/HALO_PWP_inline.inc
d02 2022-05-19_00:00:05 in MYNNSFC
[uagc22-06:95321:0:95321] Caught signal 11 (Segmentation fault: address not mapped to object at address 0xfffffffe07910de0)
==== backtrace (tid:  95321) ====
 0  /lib64/libucs.so.0(ucs_handle_error+0x2dc) [0x14706130b13c]
 1  /lib64/libucs.so.0(+0x2c31c) [0x14706130b31c]
 2  /lib64/libucs.so.0(+0x2c4ea) [0x14706130b4ea]
 3  ./wrf.exe() [0x2c37d3a]
 4  ./wrf.exe() [0x2c37a44]
 5  ./wrf.exe() [0x2c31f56]
 6  ./wrf.exe() [0x2c3050d]
 7  ./wrf.exe() [0x224a903]
 8  ./wrf.exe() [0x1b95bba]
 9  ./wrf.exe() [0x150c337]
10  ./wrf.exe() [0x13402bc]
11  ./wrf.exe() [0x5918ff]
12  ./wrf.exe() [0x591f16]
13  ./wrf.exe() [0x414e51]
14  ./wrf.exe() [0x414e0f]
15  ./wrf.exe() [0x414da2]
16  /lib64/libc.so.6(__libc_start_main+0xe5) [0x1472bfe518a5]
17  ./wrf.exe() [0x414cae]
=================================
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image              PC                Routine            Line        Source             
wrf.exe            0000000002DB9F2A  for__signal_handl     Unknown  Unknown
libpthread-2.28.s  00001472C02009F0  Unknown               Unknown  Unknown
wrf.exe            0000000002C37D3A  Unknown               Unknown  Unknown
wrf.exe            0000000002C37A44  Unknown               Unknown  Unknown
wrf.exe            0000000002C31F56  Unknown               Unknown  Unknown
wrf.exe            0000000002C3050D  Unknown               Unknown  Unknown
wrf.exe            000000000224A903  Unknown               Unknown  Unknown
wrf.exe            0000000001B95BBA  Unknown               Unknown  Unknown
wrf.exe            000000000150C337  Unknown               Unknown  Unknown
wrf.exe            00000000013402BC  Unknown               Unknown  Unknown
wrf.exe            00000000005918FF  Unknown               Unknown  Unknown
wrf.exe            0000000000591F16  Unknown               Unknown  Unknown
wrf.exe            0000000000414E51  Unknown               Unknown  Unknown
wrf.exe            0000000000414E0F  Unknown               Unknown  Unknown
wrf.exe            0000000000414DA2  Unknown               Unknown  Unknown
libc-2.28.so       00001472BFE518A5  __libc_start_main     Unknown  Unknown
wrf.exe            0000000000414CAE  Unknown               Unknown  Unknown

I use 72 processors for this simulation, so I don't think that is the issue. I also see no CFL errors in the rsl files. In my (bash) job script I call ulimit -s unlimited and set MP_STACK_SIZE to 64000000.

I noticed a thread from a few years ago (SegFault in MYNNSFC) that had a very similar issue to mine, with a few differences, especially in e_vert and eta_levels:
Code:
e_vert                              = 51,    51,    51,
eta_levels                          = 1.,
                                       0.998743415,0.99748677,0.996230185,0.9949736,0.993716955,
                                       0.992334723,0.990814209,0.989141703,0.987301886,0.98527813,
                                       0.983051956,0.980603218,0.977909565,0.974946558,0.971687257,
                                       0.968101978,0.964158237,0.959820092,0.955048144,0.949799001,
                                       0.94402492,0.937673509,0.930686891,0.923001587,0.914547801,
                                       0.905248582,0.895019472,0.883767486,0.871390283,0.857775331,
                                       0.842798889,0.826324821,0.80820334,0.788269699,0.7663427,
                                       0.742223024,0.715691328,0.68650645,0.65440315,0.619089544,
                                       0.580244482,0.537514985,0.49051252,0.438809812,0.381936818,
                                       0.319376528,0.250560224,0.17486228,0.0915945247,0.,
 /

I then tried modifying the zolri() subroutine in phys/module_sf_mynn.F, which was the fix implemented by the user with the aforementioned issue (see: Divide by zero error in phys/module_sf_mynn.F sub zolri() · Issue #1386 · wrf-model/WRF). This was also unsuccessful.
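For anyone finding this later: the change amounts to keeping a denominator in the zolri() iteration bounded away from zero before dividing. Below is only a generic, hypothetical sketch of that kind of guard; the names (safe_ratio, numer, denom, eps) are made up, and this is not the actual module_sf_mynn.F code.

Code:
! Hypothetical illustration of a divide-by-zero guard, not WRF source code.
real function safe_ratio(numer, denom)
   real, intent(in) :: numer, denom
   real, parameter  :: eps = 1.0e-6
   ! Keep the sign of denom but never let its magnitude drop below eps.
   safe_ratio = numer / sign( max(abs(denom), eps), denom )
end function safe_ratio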

Any help would be greatly appreciated! I've attached copies of my namelist.input, namelist.wps, wrf.log, and rsl.error.0000 files.
 

Attachments

  • namelist.wps (1.1 KB)
  • namelist.input (6.7 KB)
  • rsl.error.0000 (40.5 KB)
  • wrf.log (10.5 KB)
Hi,
Apologies for the long delay in response while our team tended to time-sensitive obligations. Thank you for your patience.

You mention the simulation runs without issues until you use the LCZ data during WPS. Is that the only difference between the failed and successful simulations? The rsl* file you sent indicates you are using 64 processors. For your domain sizes you could use many more. Just as a test, will you try something like 144 processors (12x12) to see if that makes any difference?
 
Hi,

Thank you for the response! Yes, the only difference between the failed and successful simulation was using the LCZ data during WPS to generate the geo_em's.

I tried an mpirun with 144 processors as you suggested, but I still get a seg-fault after the first timestep. I don't understand what you mean by running 12x12; could you please clarify?

Thank you!
 
Hi,

A quick follow-up: when I changed the PBL scheme from MYNN to MYJ (a change in sf_sfclay_physics was also necessary), the run completed successfully. With this in mind, I believe the error is likely related to the MYNN scheme.
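For anyone searching later, the switch was roughly the following change in &physics (a sketch, assuming the standard option numbers of 2 for the MYJ PBL and 2 for the Eta/Janjic surface layer; my full namelist is attached above):

Code:
&physics
 bl_pbl_physics     = 2,  2,   ! MYJ PBL (was 5 = MYNN)
 sf_sfclay_physics  = 2,  2,   ! Eta (Janjic) surface layer (was 5 = MYNN)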
 