SegFault in MYNNSFC

bartbrashers

I'm running a 12/4/1.33 km simulation in central Alaska in the wintertime (T2 ranges from -40 degC to -5 degC). For a different WRF domain covering the North Slope, the MYNN2.5 PBL scheme worked very well, so I'd like to try it in this case too.

With both WRF-4.2 and WRF-4.2.2, I get a Segmentation fault after a few minutes of simulated time. I set debug_level = 300, and see the following:

Code:
==> 2019-11-30/rsl.error.0012 <==
d02 2019-11-29_12:05:32 Top of Radiation Driver
d02 2019-11-29_12:05:32 SW surface irradiance interpolation
d02 2019-11-29_12:05:32 calling inc/HALO_PWP_inline.inc
d02 2019-11-29_12:05:32  call surface_driver
d02 2019-11-29_12:05:32 SST_UPDATE is on
d02 2019-11-29_12:05:32 in MYNNSFC
[c09:25092] *** Process received signal ***
[c09:25092] Signal: Segmentation fault (11)
[c09:25092] Signal code: Address not mapped (1)
[c09:25092] Failing at address: 0xfffffffe07fa40e4

For the other 31 MPI tasks, the last line printed in rsl.error.* is the same "calling inc/HALO_PWP_inline.inc" line shown above.

This project is required to match the vertical eta levels from previous simulations of the same area, done by others before me. The eta levels they used (for WRF-3.1) are pretty intense:

Code:
 e_vert                              = 39,      39,      39,      39,      39,
 eta_levels                          = 1.0000,  0.9995,  0.9990,  0.9984,  0.99705,
                                       0.99415, 0.99155, 0.9860,  0.9780,  0.9660,
                                       0.9500,  0.9340,  0.9180,  0.9020,  0.8860,
                                       0.8660,  0.8420,  0.8140,  0.7800,  0.7400,
                                       0.6940,  0.6480,  0.6020,  0.5560,  0.5100,
                                       0.4640,  0.4180,  0.3720,  0.3260,  0.2820,
                                       0.2400,  0.2000,  0.1630,  0.1280,  0.0960,
                                       0.0660,  0.0400,  0.0180,  0.0000,

The two lowest layers are about 3.4 m deep, assuming 1000 mb and 273 K.
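As a rough cross-check of that layer depth, here is a minimal sketch that applies the hypsometric equation to the lowest eta levels. It assumes an isothermal column, a pure terrain-following coordinate, and a model top of 50 hPa. The p_top value and the function name layer_depths are assumptions for illustration only, since the post does not give the p_top used in the run.

```python
import math

R_D = 287.05   # gas constant for dry air, J kg^-1 K^-1
G = 9.80665    # gravitational acceleration, m s^-2


def layer_depths(eta_levels, p_surface=100000.0, p_top=5000.0, temp=273.0):
    """Approximate layer depths (m) for an isothermal column.

    p_top is an assumed value; the run's actual model top is not stated.
    """
    scale_height = R_D * temp / G
    # Pressure at each eta level for a terrain-following coordinate.
    pressures = [p_top + eta * (p_surface - p_top) for eta in eta_levels]
    # Hypsometric equation between each pair of adjacent levels.
    return [scale_height * math.log(p_lo / p_hi)
            for p_lo, p_hi in zip(pressures[:-1], pressures[1:])]


# The five lowest eta levels from the namelist above.
lowest_eta = [1.0000, 0.9995, 0.9990, 0.9984, 0.99705]
depths = layer_depths(lowest_eta)
for k, dz in enumerate(depths, start=1):
    print(f"layer {k}: {dz:.2f} m")
```

With these assumptions the two lowest layers come out between 3 and 4 m deep, in the same range as the figure quoted above. The exact value shifts with the assumed p_top and temperature.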

If I use a different set of eta_levels with ~10m deep lowest layers, I can avoid this crash. That seems like a big hint to me.

What steps should I take next to find the bug that causes the SegFault?
 
Please recompile WRF in debug mode, i.e.,
./clean -a
./configure -D
Then recompile the code.
Then rerun the failed case with the executables built in debug mode. The rsl files will show the source file and line where the error first appears, and from there you can trace what is going wrong.
 
Thanks for the reply. I made a "debug" version as instructed, and ran it. The tail of rsl.error.0006 (running on compute node c08) shows:

Code:
d03 2019-11-29_12:10:04+07/25  DEBUG wrf_timetoa():  returning with str = [2019-11-29_12:10:04]
d03 2019-11-29_12:10:04+07/25  call radiation_driver
d03 2019-11-29_12:10:04+07/25 Top of Radiation Driver
d03 2019-11-29_12:10:04+07/25 SW surface irradiance interpolation
d03 2019-11-29_12:10:04+07/25 calling inc/HALO_PWP_inline.inc
d03 2019-11-29_12:10:04+07/25  call surface_driver
d03 2019-11-29_12:10:04+07/25 SST_UPDATE is on
d03 2019-11-29_12:10:04+07/25 in MYNNSFC
[c08:29024] *** Process received signal ***
[c08:29024] Signal: Floating point exception (8)
[c08:29024] Signal code: Floating point divide-by-zero (3)
[c08:29024] Failing at address: 0x3dc1945
[c08:29024] [ 0] /usr/lib/gcc/x86_64-redhat-linux/4.8.5/../../../../lib64/libpthread.so.0(+0xf5f0)[0x7fce0aa3d5f0]
[c08:29024] [ 1] /usr/local/src/wrf/WRF-4.2.2-debug/main/wrf.exe(module_sf_mynn_zolri_+0x155)[0x3dc1945]
[c08:29024] [ 2] /usr/local/src/wrf/WRF-4.2.2-debug/main/wrf.exe(module_sf_mynn_sfclay1d_mynn_+0x65c3)[0x3db60b3]
[c08:29024] [ 3] /usr/local/src/wrf/WRF-4.2.2-debug/main/wrf.exe(module_sf_mynn_sfclay_mynn_+0x3c75)[0x3daf9b5]
[c08:29024] [ 4] /usr/local/src/wrf/WRF-4.2.2-debug/main/wrf.exe(module_surface_driver_surface_driver_+0x12436)[0x2e89ef6]
[c08:29024] [ 5] /usr/local/src/wrf/WRF-4.2.2-debug/main/wrf.exe(module_first_rk_step_part1_first_rk_step_part1_+0x243da)[0x1f7597a]
[c08:29024] [ 6] /usr/local/src/wrf/WRF-4.2.2-debug/main/wrf.exe(solve_em_+0x8873)[0x15caa23]
[c08:29024] [ 7] /usr/local/src/wrf/WRF-4.2.2-debug/main/wrf.exe(solve_interface_+0x2587)[0x13c22f7]
[c08:29024] [ 8] /usr/local/src/wrf/WRF-4.2.2-debug/main/wrf.exe(module_integrate_integrate_+0x34a)[0x4e44aa]
[c08:29024] [ 9] /usr/local/src/wrf/WRF-4.2.2-debug/main/wrf.exe(module_integrate_integrate_+0xa5a)[0x4e4bba]
[c08:29024] [10] /usr/local/src/wrf/WRF-4.2.2-debug/main/wrf.exe(module_integrate_integrate_+0xa5a)[0x4e4bba]
[c08:29024] [11] /usr/local/src/wrf/WRF-4.2.2-debug/main/wrf.exe(module_wrf_top_wrf_run_+0x27)[0x48c937]
[c08:29024] [12] /usr/local/src/wrf/WRF-4.2.2-debug/main/wrf.exe(MAIN_+0x35)[0x48c4a5]
[c08:29024] [13] /usr/local/src/wrf/WRF-4.2.2-debug/main/wrf.exe(main+0x44)[0x48c444]
[c08:29024] [14] /usr/lib/gcc/x86_64-redhat-linux/4.8.5/../../../../lib64/libc.so.6(__libc_start_main+0xf5)[0x7fce09e08505]
[c08:29024] [15] /usr/local/src/wrf/WRF-4.2.2-debug/main/wrf.exe[0x48c339]
[c08:29024] *** End of error message ***

In phys/module_sf_mynn.F there's a REAL function zolri(ri,za,z0,zt,zol1) that seems like a possible culprit. It contains some unprotected divisions, such as

Code:
x1=x1-fx1/(fx2-fx1)*(x2-x1)

But that function also calls the REAL function zolri2(zol2,ri2,za,z0,zt). From the above rsl output, can we be confident that I don't need to look in zolri2?
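One hedged way to approach that question is to read the frames of the trace in order. Frame [0] is the signal handler in libpthread, so the first wrf.exe frame after it is the routine that was executing when the exception was raised. The sketch below pulls the symbol names out of a few lines copied from the trace. The regular expression and the helper names parse_frames and innermost_symbol are illustrative assumptions, not part of WRF or Open MPI.

```python
import re

# Matches lines such as "[ 1] /path/wrf.exe(symbol_+0x155)[0x3dc1945]".
FRAME_RE = re.compile(r"\[\s*(\d+)\]\s+(\S+?)\((\w*)\+0x[0-9a-fA-F]+\)")

TRACE = """\
[c08:29024] [ 0] /usr/lib/gcc/x86_64-redhat-linux/4.8.5/../../../../lib64/libpthread.so.0(+0xf5f0)[0x7fce0aa3d5f0]
[c08:29024] [ 1] /usr/local/src/wrf/WRF-4.2.2-debug/main/wrf.exe(module_sf_mynn_zolri_+0x155)[0x3dc1945]
[c08:29024] [ 2] /usr/local/src/wrf/WRF-4.2.2-debug/main/wrf.exe(module_sf_mynn_sfclay1d_mynn_+0x65c3)[0x3db60b3]
[c08:29024] [ 3] /usr/local/src/wrf/WRF-4.2.2-debug/main/wrf.exe(module_sf_mynn_sfclay_mynn_+0x3c75)[0x3daf9b5]
"""


def parse_frames(trace):
    """Return (frame number, binary name, symbol) for each frame line."""
    frames = []
    for line in trace.splitlines():
        match = FRAME_RE.search(line)
        if match:
            binary = match.group(2).rsplit("/", 1)[-1]
            frames.append((int(match.group(1)), binary, match.group(3)))
    return frames


def innermost_symbol(frames, binary="wrf.exe"):
    """Return the first named symbol belonging to the given binary."""
    for _, name, symbol in frames:
        if name == binary and symbol:
            return symbol
    return None


frames = parse_frames(TRACE)
print(innermost_symbol(frames))
```

Here the innermost wrf.exe symbol is module_sf_mynn_zolri_, and no frame names zolri2, which suggests the exception was raised in zolri itself and not inside a call to zolri2. This reading assumes the compiler did not inline zolri2 into zolri, which is the usual behaviour for an unoptimized debug build but is not confirmed here.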
 
Adding a check for (fx2-fx1) being too small (if it is smaller than 1.e-6, set it to 1.e-6) let WRF continue running past the point where it failed in two previous test runs, so I think that's the culprit.
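The idea behind that check can be sketched outside WRF as a secant-style update with a guarded denominator. This is a generic illustration in Python, not the code in module_sf_mynn.F. The names guarded_denominator and secant_root are hypothetical, and keeping the sign of the difference is an assumption, since the description above only says the value is set to 1.e-6.

```python
import math

EPS = 1.0e-6  # threshold from the post; whether it suits WRF is not established


def guarded_denominator(diff, eps=EPS):
    """Return diff, pushed away from zero so that dividing by it is safe."""
    if abs(diff) < eps:
        # Keep the original sign (an assumption, see the note above).
        return math.copysign(eps, diff)
    return diff


def secant_root(f, x1, x2, n_iter=20, tol=1.0e-10):
    """Generic secant iteration using the guarded denominator."""
    for _ in range(n_iter):
        fx1, fx2 = f(x1), f(x2)
        if abs(fx2) < tol:
            break
        denom = guarded_denominator(fx2 - fx1)
        x_new = x2 - fx2 * (x2 - x1) / denom
        x1, x2 = x2, x_new
    return x2


# A well-behaved case: the guard does not prevent convergence.
root = secant_root(lambda x: x * x - 2.0, 1.0, 2.0)
print(root)

# A degenerate case: f is flat, so fx2 - fx1 is exactly zero.
flat = secant_root(lambda x: 1.0, 1.0, 2.0)
print(math.isfinite(flat))
```

The guard only prevents the divide-by-zero. When it triggers, the update is no longer a true secant step, so the iterate it produces may not be meaningful, and it would be worth checking why the two function values coincide in the first place.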

What's next? Do you want me to file a bug report on the GitHub site?
 
Thanks for the detailed description of the problem. Please submit a PR on GitHub, and let's see what the developers say about this issue.
 