Segmentation fault with RRTMG 'rrtmg_lw_taumoltaumol_mp_taugb3_() module_ra_rrtmg_lw.f90:0'

LluisFB

Dear all,
I am trying to run a configuration with three nested domains (see the attached namelist.input) with WRF v4.5.1 using LCZ and the urban scheme, but after 18 hours I get a segmentation fault:
Code:
(...)
d01 2022-12-02_18:00:21  Input data is acceptable to use:
[irene2313:1139687:0:1139687] Caught signal 11 (Segmentation fault: address not mapped to object at address 0xffffffe007cee840)
==== backtrace (tid:1139687) ====
 0 0x0000000000012cf0 __funlockfile()  :0
 1 0x00000000026624f5 rrtmg_lw_taumoltaumol_mp_taugb3_()  module_ra_rrtmg_lw.f90:0
 2 0x0000000002625a44 rrtmg_lw_taumol_mp_taumol_()  ???:0
 3 0x000000000261d75c rrtmg_lw_rad_mp_rrtmg_lw_()  ???:0
 4 0x000000000260f3ec module_ra_rrtmg_lw_mp_rrtmg_lwrad_()  ???:0
 5 0x0000000001d238a3 module_radiation_driver_mp_radiation_driver_()  ???:0
 6 0x0000000002152d3c module_first_rk_step_part1_mp_first_rk_step_part1_()  ???:0
 7 0x00000000016e3702 solve_em_()  ???:0
 8 0x00000000014e4268 solve_interface_()  ???:0
 9 0x00000000005952f3 module_integrate_mp_integrate_()  ???:0
10 0x0000000000595910 module_integrate_mp_integrate_()  ???:0
11 0x0000000000595910 module_integrate_mp_integrate_()  ???:0
12 0x0000000000414951 module_wrf_top_mp_wrf_run_()  ???:0
13 0x000000000041490f MAIN__()  ???:0
14 0x00000000004148a2 main()  ???:0
15 0x000000000003ad85 __libc_start_main()  ???:0
16 0x00000000004147ae _start()  ???:0
=================================
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image              PC                Routine            Line        Source
wrf.exe            0000000003492B9A  for__signal_handl     Unknown  Unknown
libpthread-2.28.s  000014FEBAE22CF0  Unknown               Unknown  Unkno
(...)
I remember finding a similar thread, but I cannot find it now. It explained that the problem might be related to the cumulus scheme. I have tried options 1 (Kain-Fritsch), 5 (Grell-3) and 6 (Tiedtke) for the first (25 km) and second (5 km) domains, but not for the third (1 km), and none of them worked.

From the segfault message, it seems to happen in the taugb3 subroutine within phys/module_ra_rrtmg_lw.F:
Code:
     subroutine taugb3
!----------------------------------------------------------------------------
!
!     band 3:  500-630 cm-1 (low key - h2o,co2; low minor - n2o)
!                           (high key - h2o,co2; high minor - n2o)

I am trying to run it with a debug build, but I got stuck because I found another issue with NoahMP (see WRF-MPAS forum post 14788#6).
 

Attachments: namelist.input
 
Hi,
The NoahMP issue has been fixed. Please take a look at the document here and incorporate the changes for your future runs.

For the radiation problem, we first need to identify the possible causes. RRTMG has been well tested and widely used for quite a while, so for now I would like to believe it works fine. Since LCZ is relatively new in WRF, I am not 100% sure whether the LCZ option caused something to go wrong.

Would you please run the same case with LCZ off, and set the option:
radt = 10, 10, 10,

If this case can run to the end, we can then turn on LCZ to rerun the case.

Please let me know the results. These tests can help us to identify whether LCZ is the culprit for the failure.
 
Thanks Ming Chen,
I ran the test, but it failed too.
I am fully aware of how widely the RRTMG radiation scheme is used, so I am guessing that something is wrong in my forcing data. The run fails at the same time (around 2022-12-02_18:29:17), after the forcing data are read at 18:00:00.
I will try to figure out whether something is wrong there.

Lluís
 
Hi, I'm also getting errors in taugb3 of the radiation scheme. Does anyone know what this refers to?
I've run two sensitivity tests with two different radiation schemes.

In module_ra_rrtmg_lw.f90 the error is apparently in the line:
(ka_mn2o(jmn2o+1,indm,ig) - ka_mn2o(jmn2o,indm,ig))

In module_ra_rrtm.f90 the error is apparently in the line:
+ N2OMULT * ABSN2OAC3(IG)

Thanks! Emily
 
Hi @epotter1, my error is on exactly the same line in module_ra_rrtmg_lw.f90 and seems to be caused by the value of jmn2o being out of bounds. If I understand correctly, it should be between 1 and 9, given the dimensions of ka_mn2o defined in module rrlw_kg03.
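
To make the out-of-bounds condition concrete, here is a minimal Python sketch (an illustration only, not WRF code). It assumes, as described in this post, that the first dimension of ka_mn2o has size 9, and it takes from the failing line quoted earlier that both ka_mn2o(jmn2o,...) and ka_mn2o(jmn2o+1,...) are read. Under those assumptions, with Fortran's default 1-based indexing, jmn2o itself would have to stay within 1 to 8 for both reads to be valid. The function name and the sample values are hypothetical.

```python
# Size of the first dimension of ka_mn2o, as stated in the post (assumption).
KA_MN2O_FIRST_DIM = 9


def interpolation_index_is_safe(jmn2o, first_dim=KA_MN2O_FIRST_DIM):
    """Return True if both jmn2o and jmn2o + 1 are valid 1-based indices.

    Hypothetical helper: it mirrors the two reads on the failing line,
    ka_mn2o(jmn2o+1, indm, ig) and ka_mn2o(jmn2o, indm, ig).
    """
    return 1 <= jmn2o and jmn2o + 1 <= first_dim


if __name__ == "__main__":
    # Illustrative values only, not taken from an actual run.
    for candidate in (1, 8, 9, -2147483648):
        print(candidate, interpolation_index_is_safe(candidate))
```

The same comparison, placed just before the failing line in a debug build, would show whether jmn2o is the variable that goes out of range.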

Thanks! Maria
 
Hi All, In my case, I have identified the root cause of the error to be the calculation of coldry (which represents the dry air column density in mol/cm2) within the subroutine inatm. While analyzing the vertical profile, I observed that at a certain point, coldry(k) becomes zero. This is happening because the pressure vector has the same value in two subsequent grid points resulting in pz(l-1)-pz(l) = 0. At this moment, I am still trying to understand why the pressure vector is repeating itself in consecutive grid points.
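
The check described above can be sketched in a few lines of Python (again an illustration, not WRF code). It assumes a single column of interface pressures ordered from the surface upward, so that a healthy profile decreases strictly with height, and it flags every layer l where pz(l-1)-pz(l) is zero or negative. The function name, the ordering convention and the sample profile are assumptions made for this example.

```python
def find_zero_thickness_layers(pz, tol=0.0):
    """Return the layer indices l where pz[l-1] - pz[l] <= tol.

    pz is a list of interface pressures for one column, surface first
    (assumed ordering). Layer l lies between interfaces l-1 and l.
    """
    bad_layers = []
    for l in range(1, len(pz)):
        if pz[l - 1] - pz[l] <= tol:
            bad_layers.append(l)
    return bad_layers


if __name__ == "__main__":
    # Made-up profile in hPa with one repeated value (layer 3).
    sample_pz = [1000.0, 950.0, 900.0, 900.0, 850.0]
    print(find_zero_thickness_layers(sample_pz))
```

Running this kind of scan over the pressure columns written out just before the crash should show whether the repeated pressure values are confined to a few grid points or appear across the domain.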

Thanks, Maria
 
Hi Maria,

Thanks very much for your help. Can I ask, how did you identify this? Have you compiled in a way that gives more debugging information? (I am using the -g option). I'd like to see if I have the same issue.

I'm still trying to identify whether this is a bug with WRF (or the setup I've used), or if it's a combination with the HPC I'm using (ARCHER2 in the UK).

Cheers,

Emily
 
Thank you @mchinita,

Did you reach any solution or conclusion?

I have not had time to look into it. Do you think it could at least be fixed from a coding perspective (adding some sort of IF check to at least prevent the crash)?
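
As a rough idea of what such an IF could look like, here is a small Python sketch (not WRF code, and not a tested fix). It assumes the guard would run on one column right after coldry is computed and would stop with a readable message rather than let the run continue into the later table lookup. It does not repair the underlying pressure profile; it only turns a segmentation fault into a diagnosable stop. The exception class, the function name and the sample values are hypothetical.

```python
class DegenerateColumnError(ValueError):
    """Raised when a layer has a non-positive dry-air column amount."""


def check_coldry(coldry, i, j):
    """Stop with a readable message if any coldry value is <= 0.

    coldry: dry-air column amounts for the layers of one column.
    i, j: horizontal grid indices of the column, used only in the message.
    """
    for k, value in enumerate(coldry, start=1):
        if value <= 0.0:
            raise DegenerateColumnError(
                f"column (i={i}, j={j}): coldry({k}) = {value}"
            )


if __name__ == "__main__":
    # Made-up values; the third layer is degenerate.
    try:
        check_coldry([2.1e24, 1.9e24, 0.0, 1.5e24], i=12, j=34)
    except DegenerateColumnError as err:
        print("stopping before radiation:", err)
```

Reporting the grid indices makes it possible to go back to the input and forcing data at that location, which is where the repeated pressure values would have to be traced.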

Lluís
 
Hi all, I got the same message in my case:
[comput3:56433:0:56433] Caught signal 11 (Segmentation fault: address not mapped to object at address 0xffffffe02f339ec0)
==== backtrace (tid: 56433) ====
0 0x000000000004d455 ucs_debug_print_backtrace() ???:0
1 0x00000000023ea805 rrtmg_lw_taumoltaumol_mp_taugb3_() module_ra_rrtmg_lw.f90:0
2 0x00000000023add54 rrtmg_lw_taumol_mp_taumol_() ???:0
3 0x00000000023a5a6f rrtmg_lw_rad_mp_rrtmg_lw_() ???:0
4 0x00000000023976dc module_ra_rrtmg_lw_mp_rrtmg_lwrad_() ???:0
5 0x0000000001d3b5f3 module_radiation_driver_mp_radiation_driver_() ???:0
6 0x0000000001e70c6f module_first_rk_step_part1_mp_first_rk_step_part1_() ???:0
7 0x00000000016f828d solve_em_() ???:0
8 0x0000000001515aa8 solve_interface_() ???:0
9 0x00000000005ca671 module_integrate_mp_integrate_() ???:0
10 0x00000000004164c1 module_wrf_top_mp_wrf_run_() ???:0
11 0x000000000041647f MAIN__() ???:0
12 0x0000000000416412 main() ???:0
13 0x0000000000022555 __libc_start_main() ???:0
14 0x0000000000416329 _start() ???:0
=================================
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image PC Routine Line Source
wrf.exe 000000000327290A for__signal_handl Unknown Unknown
libpthread-2.17.s 00002B743FFB0630 Unknown Unknown Unknown
wrf.exe 00000000023EA805 Unknown Unknown Unknown
wrf.exe 00000000023ADD54 Unknown Unknown Unknown
wrf.exe 00000000023A5A6F Unknown Unknown Unknown
wrf.exe 00000000023976DC Unknown Unknown Unknown
wrf.exe 0000000001D3B5F3 Unknown Unknown Unknown
wrf.exe 0000000001E70C6F Unknown Unknown Unknown
wrf.exe 00000000016F828D Unknown Unknown Unknown
wrf.exe 0000000001515AA8 Unknown Unknown Unknown
wrf.exe 00000000005CA671 Unknown Unknown Unknown
wrf.exe 00000000004164C1 Unknown Unknown Unknown
wrf.exe 000000000041647F Unknown Unknown Unknown
wrf.exe 0000000000416412 Unknown Unknown Unknown
libc-2.17.so 00002B74403E3555 __libc_start_main Unknown Unknown
wrf.exe 0000000000416329 Unknown Unknown Unknown

I turned off the adaptive time step (use_adaptive_time_step = .false.) and set the time step to 75 (my log showed that the time step was 148 while the adaptive time step was on). It worked in my case!
 