SegFault in MYNNSFC

bartbrashers
Posts: 73
Joined: Wed Aug 08, 2018 2:21 pm

Post by bartbrashers » Sun Jan 17, 2021 10:36 pm

I'm running a 12/4/1.33 km simulation in central Alaska in the wintertime (T2 ranges from -40 degC to -5 degC). For a different WRF domain covering the North Slope, the MYNN2.5 PBL scheme worked very well, so I'd like to try it in this case too.

With both WRF-4.2 and WRF-4.2.2, I get a Segmentation fault after a few minutes of simulated time. I set debug_level = 300, and see the following:

Code:

==> 2019-11-30/rsl.error.0012 <==
d02 2019-11-29_12:05:32 Top of Radiation Driver
d02 2019-11-29_12:05:32 SW surface irradiance interpolation
d02 2019-11-29_12:05:32 calling inc/HALO_PWP_inline.inc
d02 2019-11-29_12:05:32  call surface_driver
d02 2019-11-29_12:05:32 SST_UPDATE is on
d02 2019-11-29_12:05:32 in MYNNSFC
[c09:25092] *** Process received signal ***
[c09:25092] Signal: Segmentation fault (11)
[c09:25092] Signal code: Address not mapped (1)
[c09:25092] Failing at address: 0xfffffffe07fa40e4

For the other 31 MPI tasks, the last line printed in rsl.error.* is the same "calling inc/HALO_PWP_inline.inc" line shown above.
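
Checking every rsl.error.* file by hand gets tedious with 32 tasks, so here is a minimal Python sketch of how the check could be scripted. The function names `last_lines` and `files_with_signal` are made up for this illustration, and the marker string is simply the banner seen in the log above; other MPI libraries may print something different.

```python
from pathlib import Path

# Banner copied from the log above; treat it as an assumption for other MPI stacks.
SIGNAL_MARKER = "*** Process received signal ***"


def last_lines(run_dir, pattern="rsl.error.*"):
    """Map each matching file name to its last non-empty line."""
    result = {}
    for path in sorted(Path(run_dir).glob(pattern)):
        text = path.read_text(errors="replace")
        lines = [ln.rstrip() for ln in text.splitlines() if ln.strip()]
        result[path.name] = lines[-1] if lines else ""
    return result


def files_with_signal(run_dir, pattern="rsl.error.*", marker=SIGNAL_MARKER):
    """Return the names of files that contain the signal banner."""
    return [
        path.name
        for path in sorted(Path(run_dir).glob(pattern))
        if marker in path.read_text(errors="replace")
    ]
```

Calling `files_with_signal("2019-11-30")` would list the tasks that actually received the signal, and `last_lines("2019-11-30")` shows where each of the others stopped.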

This project is required to match the vertical eta levels from previous simulations of the same area, done by others before me. The eta levels they used (for WRF-3.1) are very finely spaced near the surface:

Code:

 e_vert                              = 39,      39,      39,      39,      39,
 eta_levels                          = 1.0000,  0.9995,  0.9990,  0.9984,  0.99705,
                                       0.99415, 0.99155, 0.9860,  0.9780,  0.9660,
                                       0.9500,  0.9340,  0.9180,  0.9020,  0.8860,
                                       0.8660,  0.8420,  0.8140,  0.7800,  0.7400,
                                       0.6940,  0.6480,  0.6020,  0.5560,  0.5100,
                                       0.4640,  0.4180,  0.3720,  0.3260,  0.2820,
                                       0.2400,  0.2000,  0.1630,  0.1280,  0.0960,
                                       0.0660,  0.0400,  0.0180,  0.0000,

The two lowest layers are each about 3.4 m deep, assuming 1000 mb and 273 K.

If I use a different set of eta_levels with lowest layers ~10 m deep, I can avoid this crash. That seems like a big hint to me.
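
For anyone who wants to check layer depths for their own eta_levels, here is a rough Python sketch of the arithmetic. It assumes a dry, isothermal layer, treats eta as a pure terrain-following coordinate (ignoring the hybrid option), and uses a model top of 5000 Pa, which is my assumption because the namelist excerpt above does not show p_top. The function name `layer_thickness_m` is made up for this example.

```python
import math

R_DRY = 287.05  # gas constant for dry air, J kg^-1 K^-1
G = 9.80665     # gravitational acceleration, m s^-2


def layer_thickness_m(eta_bottom, eta_top,
                      p_surface_pa=100000.0,
                      p_top_pa=5000.0,   # assumed; not given in the post
                      temp_k=273.0):
    """Hypsometric thickness (m) of the layer between two eta levels."""
    column = p_surface_pa - p_top_pa
    p_bottom = p_top_pa + eta_bottom * column
    p_upper = p_top_pa + eta_top * column
    return R_DRY * temp_k / G * math.log(p_bottom / p_upper)


if __name__ == "__main__":
    # First two layers of the eta_levels listed above.
    print(layer_thickness_m(1.0000, 0.9995))
    print(layer_thickness_m(0.9995, 0.9990))
```

With these assumed defaults the two lowest layers come out at a few meters each, the same order as the ~3.4 m quoted above. The exact value shifts with the model top pressure and the temperature used.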

What steps should I take next to find the bug causing the SegFault?

Ming Chen
Posts: 1290
Joined: Mon Apr 23, 2018 9:42 pm

Re: SegFault in MYNNSFC

Post by Ming Chen » Wed Jan 20, 2021 10:08 pm

Please recompile WRF in debug mode, i.e.,
./clean -a
./configure -D
Then recompile the code.
Please rerun the failed case with the executables built in debug mode. The RSL files will then show the source file and line where the error first appears, and from there you can trace what is wrong.
WRF Help Desk

Post by bartbrashers » Fri Jan 22, 2021 6:12 pm

Thanks for the reply. I made a "debug" version as instructed, and ran it. The tail of rsl.error.0006 (running on compute node c08) shows:

Code:

d03 2019-11-29_12:10:04+07/25  DEBUG wrf_timetoa():  returning with str = [2019-11-29_12:10:04]
d03 2019-11-29_12:10:04+07/25  call radiation_driver
d03 2019-11-29_12:10:04+07/25 Top of Radiation Driver
d03 2019-11-29_12:10:04+07/25 SW surface irradiance interpolation
d03 2019-11-29_12:10:04+07/25 calling inc/HALO_PWP_inline.inc
d03 2019-11-29_12:10:04+07/25  call surface_driver
d03 2019-11-29_12:10:04+07/25 SST_UPDATE is on
d03 2019-11-29_12:10:04+07/25 in MYNNSFC
[c08:29024] *** Process received signal ***
[c08:29024] Signal: Floating point exception (8)
[c08:29024] Signal code: Floating point divide-by-zero (3)
[c08:29024] Failing at address: 0x3dc1945
[c08:29024] [ 0] /usr/lib/gcc/x86_64-redhat-linux/4.8.5/../../../../lib64/libpthread.so.0(+0xf5f0)[0x7fce0aa3d5f0]
[c08:29024] [ 1] /usr/local/src/wrf/WRF-4.2.2-debug/main/wrf.exe(module_sf_mynn_zolri_+0x155)[0x3dc1945]
[c08:29024] [ 2] /usr/local/src/wrf/WRF-4.2.2-debug/main/wrf.exe(module_sf_mynn_sfclay1d_mynn_+0x65c3)[0x3db60b3]
[c08:29024] [ 3] /usr/local/src/wrf/WRF-4.2.2-debug/main/wrf.exe(module_sf_mynn_sfclay_mynn_+0x3c75)[0x3daf9b5]
[c08:29024] [ 4] /usr/local/src/wrf/WRF-4.2.2-debug/main/wrf.exe(module_surface_driver_surface_driver_+0x12436)[0x2e89ef6]
[c08:29024] [ 5] /usr/local/src/wrf/WRF-4.2.2-debug/main/wrf.exe(module_first_rk_step_part1_first_rk_step_part1_+0x243da)[0x1f7597a]
[c08:29024] [ 6] /usr/local/src/wrf/WRF-4.2.2-debug/main/wrf.exe(solve_em_+0x8873)[0x15caa23]
[c08:29024] [ 7] /usr/local/src/wrf/WRF-4.2.2-debug/main/wrf.exe(solve_interface_+0x2587)[0x13c22f7]
[c08:29024] [ 8] /usr/local/src/wrf/WRF-4.2.2-debug/main/wrf.exe(module_integrate_integrate_+0x34a)[0x4e44aa]
[c08:29024] [ 9] /usr/local/src/wrf/WRF-4.2.2-debug/main/wrf.exe(module_integrate_integrate_+0xa5a)[0x4e4bba]
[c08:29024] [10] /usr/local/src/wrf/WRF-4.2.2-debug/main/wrf.exe(module_integrate_integrate_+0xa5a)[0x4e4bba]
[c08:29024] [11] /usr/local/src/wrf/WRF-4.2.2-debug/main/wrf.exe(module_wrf_top_wrf_run_+0x27)[0x48c937]
[c08:29024] [12] /usr/local/src/wrf/WRF-4.2.2-debug/main/wrf.exe(MAIN_+0x35)[0x48c4a5]
[c08:29024] [13] /usr/local/src/wrf/WRF-4.2.2-debug/main/wrf.exe(main+0x44)[0x48c444]
[c08:29024] [14] /usr/lib/gcc/x86_64-redhat-linux/4.8.5/../../../../lib64/libc.so.6(__libc_start_main+0xf5)[0x7fce09e08505]
[c08:29024] [15] /usr/local/src/wrf/WRF-4.2.2-debug/main/wrf.exe[0x48c339]
[c08:29024] *** End of error message ***

In phys/module_sf_mynn.F there's a REAL function zolri(ri,za,z0,zt,zol1) that looks like a possible culprit. It contains some unprotected divisions like

Code:

x1=x1-fx1/(fx2-fx1)*(x2-x1)

But that function also calls REAL function zolri2(zol2,ri2,za,z0,zt). From the rsl output above, can we be confident that I don't need to look in zolri2?
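
One way to read the trace: the frames are listed innermost first, frame [0] is in libpthread (typically signal-handling machinery rather than model code), and the first frame inside wrf.exe is where the exception was raised. The Python sketch below pulls that frame out of the text. The regular expression is written against the frame format shown in the log above, and `parse_frames` and `innermost_frame_in` are names made up for this illustration.

```python
import re

# Matches lines like: [host:pid] [ 1] /path/wrf.exe(symbol_+0x155)[0x3dc1945]
FRAME_RE = re.compile(
    r"\[\s*(?P<index>\d+)\]\s+(?P<binary>\S+?)"
    r"\((?P<symbol>[^+()]*)\+0x(?P<offset>[0-9a-fA-F]+)\)"
    r"\[0x(?P<address>[0-9a-fA-F]+)\]"
)

# Three frames copied from the rsl.error.0006 output above.
SAMPLE = """\
[c08:29024] [ 0] /usr/lib/gcc/x86_64-redhat-linux/4.8.5/../../../../lib64/libpthread.so.0(+0xf5f0)[0x7fce0aa3d5f0]
[c08:29024] [ 1] /usr/local/src/wrf/WRF-4.2.2-debug/main/wrf.exe(module_sf_mynn_zolri_+0x155)[0x3dc1945]
[c08:29024] [ 2] /usr/local/src/wrf/WRF-4.2.2-debug/main/wrf.exe(module_sf_mynn_sfclay1d_mynn_+0x65c3)[0x3db60b3]
"""


def parse_frames(trace):
    """Return a list of dicts, one per backtrace line that matches FRAME_RE."""
    frames = []
    for line in trace.splitlines():
        match = FRAME_RE.search(line)
        if match:
            frames.append({
                "index": int(match.group("index")),
                "binary": match.group("binary"),
                "symbol": match.group("symbol"),
                "offset": int(match.group("offset"), 16),
                "address": int(match.group("address"), 16),
            })
    return frames


def innermost_frame_in(frames, binary_name="wrf.exe"):
    """Return the lowest-numbered frame whose binary path ends with binary_name."""
    for frame in sorted(frames, key=lambda f: f["index"]):
        if frame["binary"].endswith(binary_name):
            return frame
    return None


if __name__ == "__main__":
    print(innermost_frame_in(parse_frames(SAMPLE)))
```

Since no zolri2 frame appears above the zolri frame, the exception was raised in code attributed to zolri itself. The caveat is that a compiler could inline zolri2 into zolri, which seems less likely in a debug build but which I can't confirm from the trace alone. If the executable carries debug symbols, `addr2line -e wrf.exe 0x3dc1945` may map the failing address to a source line.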

Post by bartbrashers » Sun Jan 24, 2021 5:42 pm

I added a check for (fx2-fx1) being too small: if it is smaller than 1.e-6, it is set to 1.e-6. With that change WRF kept running past the point where it failed in 2 previous test runs, so I think that division is the culprit.
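
To illustrate the idea outside of WRF, here is a small Python sketch of that kind of guard applied to the update quoted earlier, x1 - fx1/(fx2-fx1)*(x2-x1). It is not the Fortran change itself. The helper names `floor_magnitude` and `secant_update` are made up for this example, and keeping the sign of the denominator when flooring is my assumption about how such a guard would best be written.

```python
import math

EPS = 1.0e-6  # floor value mentioned in the post


def floor_magnitude(value, eps=EPS):
    """Return value unchanged unless |value| < eps; then return eps with value's sign.

    An exact zero is treated as positive.
    """
    if abs(value) < eps:
        return math.copysign(eps, value)
    return value


def secant_update(x1, x2, fx1, fx2, eps=EPS):
    """One update of the form x1 - fx1/(fx2-fx1)*(x2-x1) with a guarded denominator."""
    denominator = floor_magnitude(fx2 - fx1, eps)
    return x1 - fx1 / denominator * (x2 - x1)


if __name__ == "__main__":
    # Well-conditioned case: the guard does not change the result.
    print(secant_update(1.0, 2.0, -1.0, 2.0))
    # fx1 == fx2: an unguarded division would raise ZeroDivisionError in Python.
    print(secant_update(0.0, 0.5, 0.25, 0.25))
```

The guard only prevents the divide-by-zero. When it triggers, the resulting step can be very large, so whether the value it produces is physically sensible is a separate question for the developers.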

What's next? Do you want me to file a bug report on the GitHub site?

Post by Ming Chen » Tue Jan 26, 2021 5:16 pm

Thanks for the detailed description of the problem. Please submit a PR on GitHub, and let's see what the developers say about this issue.
WRF Help Desk
