
crashed when solving the MYNNPBL at the first time step of wrf.exe

jianglizhi

New member
Hello,
I am trying to run wrf.exe with the scheme combination Thompson + MYNN-EDMF + Noah.
Unfortunately, wrf.exe crashed at the first time step while calculating the MYNN PBL, throwing a segmentation fault.
I don't know what the real problem is. Could anyone give me some advice? Thanks in advance.

Error message (part of rsl.error.0002):
d01 2023-07-27_18:00:00 in MYNNPBL
[c739:63903:0:63903] Caught signal 11 (Segmentation fault: Sent by the kernel at address (nil))
==== backtrace (tid: 63903) ====
0 0x0000000000021423 ucs_debug_print_backtrace() /opt/oscer/EasyBuild/build/UCX/1.10.0/GCCcore-10.3.0/ucx-1.10.0/src/ucs/debug/debug.c:656
1 0x00000000001448ac rml::internal::MemoryPool::putToLLOCache() ???:0
2 0x00000000032ad806 for_dealloc_allocatable() ???:0
3 0x00000000029eb315 module_bl_mynn_wrapper_mp_mynnedmf_wrapper_run_() ???:0
4 0x00000000025f4edf module_pbl_driver_mp_pbl_driver_() ???:0
5 0x0000000001e8874f module_first_rk_step_part1_mp_first_rk_step_part1_() ???:0
6 0x00000000016f429d solve_em_() ???:0
7 0x0000000001511ab8 solve_interface_() ???:0
8 0x00000000005c9199 module_integrate_mp_integrate_() ???:0
9 0x0000000000415281 module_wrf_top_mp_wrf_run_() ???:0
10 0x000000000041523f MAIN__() ???:0
11 0x00000000004151d2 main() ???:0
12 0x0000000000022555 __libc_start_main() ???:0
13 0x00000000004150e9 _start() ???:0
=================================
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image PC Routine Line Source
wrf.exe 000000000326910A for__signal_handl Unknown Unknown
libpthread-2.17.s 00002ACFC16B4630 Unknown Unknown Unknown
libiomp5.so 00002ACFC1FF08AC Unknown Unknown Unknown
wrf.exe 00000000032AD806 for_dealloc_alloc Unknown Unknown
wrf.exe 00000000029EB315 Unknown Unknown Unknown
wrf.exe 00000000025F4EDF Unknown Unknown Unknown
wrf.exe 0000000001E8874F Unknown Unknown Unknown
wrf.exe 00000000016F429D Unknown Unknown Unknown
wrf.exe 0000000001511AB8 Unknown Unknown Unknown
wrf.exe 00000000005C9199 Unknown Unknown Unknown
wrf.exe 0000000000415281 Unknown Unknown Unknown
wrf.exe 000000000041523F Unknown Unknown Unknown
wrf.exe 00000000004151D2 Unknown Unknown Unknown
libc-2.17.so 00002ACFC1AE7555 __libc_start_main Unknown Unknown
wrf.exe 00000000004150E9 Unknown Unknown Unknown
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image PC Routine Line Source
wrf.exe 00000000032691AE for__signal_handl Unknown Unknown
libpthread-2.17.s 00002ACFC16B4630 Unknown Unknown Unknown
libiomp5.so 00002ACFC1FF1C06 Unknown Unknown Unknown
libiomp5.so 00002ACFC1FF18EB Unknown Unknown Unknown
libiomp5.so 00002ACFC1FF13E0 Unknown Unknown Unknown
libiomp5.so 00002ACFC1FF0229 Unknown Unknown Unknown
libiomp5.so 00002ACFC1F8F005 Unknown Unknown Unknown
libiomp5.so 00002ACFC1F6DD3B Unknown Unknown Unknown
libiomp5.so 00002ACFC1F6DCE8 Unknown Unknown Unknown
ld-2.17.so 00002ACFBF0B207A Unknown Unknown Unknown
libc-2.17.so 00002ACFC1AFECE9 Unknown Unknown Unknown
libc-2.17.so 00002ACFC1AFED37 Unknown Unknown Unknown
wrf.exe 000000000325F3CC for__issue_diagno Unknown Unknown
wrf.exe 000000000326910A for__signal_handl Unknown Unknown
libpthread-2.17.s 00002ACFC16B4630 Unknown Unknown Unknown
libiomp5.so 00002ACFC1FF08AC Unknown Unknown Unknown
wrf.exe 00000000032AD806 for_dealloc_alloc Unknown Unknown
wrf.exe 00000000029EB315 Unknown Unknown Unknown
wrf.exe 00000000025F4EDF Unknown Unknown Unknown
wrf.exe 0000000001E8874F Unknown Unknown Unknown
wrf.exe 00000000016F429D Unknown Unknown Unknown
wrf.exe 0000000001511AB8 Unknown Unknown Unknown
wrf.exe 00000000005C9199 Unknown Unknown Unknown
wrf.exe 0000000000415281 Unknown Unknown Unknown
wrf.exe 000000000041523F Unknown Unknown Unknown
wrf.exe 00000000004151D2 Unknown Unknown Unknown
libc-2.17.so 00002ACFC1AE7555 __libc_start_main Unknown Unknown
wrf.exe 00000000004150E9 Unknown Unknown Unknown
 

Attachments

  • rsl_files.zip
    33.8 KB · Views: 2
  • namelist.output.txt
    84.8 KB · Views: 2
  • namelist.input
    13.6 KB · Views: 9
Hello Kwerner,

Thank you for your reply. I read the essay and tried again. According to my domain configuration (1100x938), the suggested number of processors ranges from 104 to 1650.
I have tested the jobs with different ntasks values: 256, 192, and 128.
All of them crashed during the calculation of MYNN PBL.
Specifically, an OMP error appeared in some of the rsl.error files.
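The 104 to 1650 range quoted above is consistent with a commonly cited WRF rule of thumb: use at least (e_we/100) x (e_sn/100) and at most (e_we/25) x (e_sn/25) MPI tasks. The sketch below assumes that rule and that rounding (up for the minimum, down for the maximum); the function name is hypothetical and not part of WRF.

```python
import math

def suggested_task_range(e_we, e_sn, max_tile=100, min_tile=25):
    """Return (min_tasks, max_tasks) for a domain of e_we x e_sn grid points.

    Assumption: each MPI tile should be no larger than about
    max_tile x max_tile and no smaller than about min_tile x min_tile
    grid points. This is guidance, not a limit enforced by WRF.
    """
    min_tasks = math.ceil((e_we / max_tile) * (e_sn / max_tile))
    max_tasks = math.floor((e_we / min_tile) * (e_sn / min_tile))
    return min_tasks, max_tasks

low, high = suggested_task_range(1100, 938)
print(low, high)

# The three task counts tried in this post all fall inside the range.
for ntasks in (128, 192, 256):
    print(ntasks, low <= ntasks <= high)
```

Under these assumptions, all three tested ntasks values are within the suggested range, which suggests the decomposition itself is unlikely to be the cause of the crash.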

Regards,
Lizhi

rsl.error.0191
d01 2023-07-27_18:00:00 in MYNNPBL
[c817:25531:0:25531] Caught signal 11 (Segmentation fault: Sent by the kernel at address (nil))
==== backtrace (tid: 25531) ====
0 0x00000000031c8cc7 wrf_esmf_clockmod_mp_esmf_clockget_() ???:0
1 0x00000000005cab58 module_domain_mp_domain_get_current_time_() ???:0
2 0x00000000005cc26b module_domain_mp_domain_clock_get_() ???:0
3 0x00000000005d1cc1 get_current_time_string_() ???:0
4 0x0000000000af29b5 wrf_debug_() ???:0
5 0x0000000001e94686 module_first_rk_step_part1_mp_first_rk_step_part1_() ???:0
6 0x00000000016fc9a0 solve_em_() ???:0
7 0x000000000151a438 solve_interface_() ???:0
8 0x00000000005d2ffd module_integrate_mp_integrate_() ???:0
9 0x0000000000416341 module_wrf_top_mp_wrf_run_() ???:0
10 0x00000000004162ff MAIN__() ???:0
11 0x000000000041629d main() ???:0
12 0x0000000000022555 __libc_start_main() ???:0
13 0x00000000004161cb _start() ???:0
=================================
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image PC Routine Line Source
libpthread-2.17.s 00002B2F56A11630 Unknown Unknown Unknown

wrf.exe 00000000031C8CC7 Unknown Unknown Unknown
wrf.exe 00000000005CAB58 Unknown Unknown Unknown
wrf.exe 00000000005CC26B Unknown Unknown Unknown
wrf.exe 00000000005D1CC1 Unknown Unknown Unknown
wrf.exe 0000000000AF29B5 Unknown Unknown Unknown
wrf.exe 0000000001E94686 Unknown Unknown Unknown
wrf.exe 00000000016FC9A0 Unknown Unknown Unknown
wrf.exe 000000000151A438 Unknown Unknown Unknown
wrf.exe 00000000005D2FFD Unknown Unknown Unknown
wrf.exe 0000000000416341 Unknown Unknown Unknown
wrf.exe 00000000004162FF Unknown Unknown Unknown
wrf.exe 000000000041629D Unknown Unknown Unknown
libc-2.17.so 00002B2F58D4B555 __libc_start_main Unknown Unknown
wrf.exe 00000000004161CB Unknown Unknown Unknown
OMP: Error #13: Assertion failure at kmp_runtime.cpp(4351).
OMP: Hint Please submit a bug report with this message, compile and run commands used, and machine configuration info including native compiler and operating system versions. Faster response will be obtained by including all program sources. For information on submitting this issue, please see http://www.intel.com/software/products/support/.
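Because the OMP error appears in only some of the rsl.error files, it can help to list which ranks report which message. The following is a minimal sketch, assuming the files sit in the run directory under the usual rsl.error.NNNN names; the search patterns are taken from the messages quoted in this thread, and the function name is hypothetical.

```python
import re
from pathlib import Path

# Patterns taken from the messages quoted in this thread.
PATTERNS = {
    "segfault": re.compile(r"SIGSEGV|Segmentation fault"),
    "omp_error": re.compile(r"OMP: Error"),
}

def scan_rsl_errors(run_dir):
    """Map each rsl.error.* file name to the sorted labels of patterns found in it."""
    hits = {}
    for path in sorted(Path(run_dir).glob("rsl.error.*")):
        text = path.read_text(errors="replace")
        found = sorted(label for label, rx in PATTERNS.items() if rx.search(text))
        if found:
            hits[path.name] = found
    return hits

if __name__ == "__main__":
    # Scan the current directory; files without a match are not listed.
    for name, labels in scan_rsl_errors(".").items():
        print(name, ", ".join(labels))
```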
 

Attachments

  • rslfiles.tar.gz
    253.5 KB · Views: 0
Update:
I submitted the jobs with the debug version of wrf.exe. They crashed at the same point, but the rsl files give more information.
Part of rsl.error.0000:
d01 2023-07-27_18:00:00 in MYNNPBL
[c739:46983:0:46983] Caught signal 11 (Segmentation fault: Sent by the kernel at address (nil))
==== backtrace (tid: 46983) ====
0 0x000000000d817f67 wrf_esmf_clockmod_mp_esmf_clockget_() ???:0
1 0x000000000061357a module_domain_mp_domain_get_current_time_() /home/lzjiang/WRF/WRFV4.6.0_debug/frame/module_domain.f90:22398
2 0x0000000000615ace module_domain_mp_domain_clock_get_() /home/lzjiang/WRF/WRFV4.6.0_debug/frame/module_domain.f90:22789
3 0x000000000061dd9f get_current_time_string_() /home/lzjiang/WRF/WRFV4.6.0_debug/frame/module_domain.f90:23399
4 0x0000000001ae30b3 wrf_debug_() /home/lzjiang/WRF/WRFV4.6.0_debug/frame/wrf_debug.f90:33
5 0x000000000580146b module_first_rk_step_part1_mp_first_rk_step_part1_() /home/lzjiang/WRF/WRFV4.6.0_debug/dyn_em/module_first_rk_step_part1.f90:666
6 0x0000000004468d29 solve_em_() /home/lzjiang/WRF/WRFV4.6.0_debug/dyn_em/solve_em.f90:943
7 0x0000000003d4af83 solve_interface_() /home/lzjiang/WRF/WRFV4.6.0_debug/share/solve_interface.f90:123
8 0x00000000006209f8 module_integrate_mp_integrate_() /home/lzjiang/WRF/WRFV4.6.0_debug/frame/module_integrate.f90:329
9 0x0000000000416ec2 module_wrf_top_mp_wrf_run_() /home/lzjiang/WRF/WRFV4.6.0_debug/main/../main/module_wrf_top.f90:326
10 0x00000000004162dd MAIN__() /home/lzjiang/WRF/WRFV4.6.0_debug/main/wrf.f90:29
11 0x000000000041629d main() ???:0
12 0x0000000000022555 __libc_start_main() ???:0
13 0x00000000004161cb _start() ???:0
=================================
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image PC Routine Line Source
libpthread-2.17.s 00002B1981884630 Unknown Unknown Unknown
wrf.exe 000000000D817F67 Unknown Unknown Unknown
wrf.exe 000000000061357A module_domain_mp_ 22398 module_domain.f90
wrf.exe 0000000000615ACE module_domain_mp_ 22789 module_domain.f90
wrf.exe 000000000061DD9F get_current_time_ 23399 module_domain.f90
wrf.exe 0000000001AE30B3 wrf_debug_ 33 wrf_debug.f90
wrf.exe 000000000580146B module_first_rk_s 666 module_first_rk_step_part1.f90
wrf.exe 0000000004468D29 solve_em_ 943 solve_em.f90
wrf.exe 0000000003D4AF83 solve_interface_ 123 solve_interface.f90
wrf.exe 00000000006209F8 module_integrate_ 329 module_integrate.f90
wrf.exe 0000000000416EC2 module_wrf_top_mp 326 module_wrf_top.f90
wrf.exe 00000000004162DD MAIN__ 29 wrf.f90
wrf.exe 000000000041629D Unknown Unknown Unknown
libc-2.17.so 00002B1983BBE555 __libc_start_main Unknown Unknown
wrf.exe 00000000004161CB Unknown Unknown Unknown
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image PC Routine Line Source
libpthread-2.17.s 00002B1981884630 Unknown Unknown Unknown
libiomp5.so 00002B19815991DC Unknown Unknown Unknown
libiomp5.so 00002B19815991AA Unknown Unknown Unknown
libiomp5.so 00002B198159BFDF Unknown Unknown Unknown
libiomp5.so 00002B1981485CC3 Unknown Unknown Unknown
libiomp5.so 00002B19814FDA58 Unknown Unknown Unknown
libiomp5.so 00002B1981505EA1 Unknown Unknown Unknown
ld-2.17.so 00002B198061807A Unknown Unknown Unknown
libc-2.17.so 00002B1983BD5CE9 Unknown Unknown Unknown
libc-2.17.so 00002B1983BD5D37 Unknown Unknown Unknown
wrf.exe 000000000D8B220F Unknown Unknown Unknown
wrf.exe 00000000004153C0 Unknown Unknown Unknown
libpthread-2.17.s 00002B1981884630 Unknown Unknown Unknown
wrf.exe 000000000D817F67 Unknown Unknown Unknown
wrf.exe 000000000061357A module_domain_mp_ 22398 module_domain.f90
wrf.exe 0000000000615ACE module_domain_mp_ 22789 module_domain.f90
wrf.exe 000000000061DD9F get_current_time_ 23399 module_domain.f90
wrf.exe 0000000001AE30B3 wrf_debug_ 33 wrf_debug.f90
wrf.exe 000000000580146B module_first_rk_s 666 module_first_rk_step_part1.f90
wrf.exe 0000000004468D29 solve_em_ 943 solve_em.f90
wrf.exe 0000000003D4AF83 solve_interface_ 123 solve_interface.f90
wrf.exe 00000000006209F8 module_integrate_ 329 module_integrate.f90
wrf.exe 0000000000416EC2 module_wrf_top_mp 326 module_wrf_top.f90
wrf.exe 00000000004162DD MAIN__ 29 wrf.f90
wrf.exe 000000000041629D Unknown Unknown Unknown
libc-2.17.so 00002B1983BBE555 __libc_start_main Unknown Unknown
wrf.exe 00000000004161CB Unknown Unknown Unknown

Update2:
I decreased time_step to 6 s (2*dx), but the same error occurred.
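For context, the usual WRF guidance is a time_step of at most about 6 x dx seconds, with dx in km. A 6 s step described as 2*dx implies dx = 3 km; that value is inferred here, not stated in the post. A small sketch of the arithmetic, with a hypothetical helper name:

```python
def time_step_seconds(dx_km, factor=6.0):
    """Time step in seconds as factor * dx, with dx in kilometres."""
    return factor * dx_km

dx_km = 3.0  # assumption: inferred from "6 s (2*dx)", not stated in the post

print(time_step_seconds(dx_km))       # guideline upper value, factor 6
print(time_step_seconds(dx_km, 2.0))  # the reduced value tried above
```

Since the crash persists at a step well below the guideline, a numerical stability (CFL) problem seems an unlikely explanation.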
 

Attachments

  • rslfiles_debugversion.zip
    929.7 KB · Views: 3
Update3:
After changing bl_mynn_edmf = 2 to bl_mynn_edmf = 1,
the problem was solved and the program runs normally.
Option 2 is a new feature of WRF 4.6.0, so the main cause of the crash is the Total Energy Mass-Flux (TEMF) scheme.
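The workaround is a one-line change in the &physics section of namelist.input. The sketch below applies it to namelist text with a regular expression. The sample namelist is illustrative only (it is not the attached namelist.input), the helper name is hypothetical, and the physics option numbers shown are the standard WRF values for Thompson (8), MYNN (5), and Noah (2).

```python
import re

def set_namelist_option(text, name, value):
    """Replace the value(s) assigned to `name` in namelist text.

    Every per-domain entry on the line is set to `value`. Raises KeyError
    if the option is absent, so a silent no-op is not mistaken for success.
    """
    pattern = re.compile(rf"^(\s*{re.escape(name)}\s*=\s*)([^!\n]*)", re.MULTILINE)

    def repl(match):
        # Count the comma-separated entries so each domain keeps a value.
        entries = [v for v in match.group(2).split(",") if v.strip()]
        n_domains = max(1, len(entries))
        return match.group(1) + ", ".join([str(value)] * n_domains) + ","

    new_text, count = pattern.subn(repl, text)
    if count == 0:
        raise KeyError(name)
    return new_text

# Illustrative fragment, not the poster's actual namelist.input.
sample = """&physics
 mp_physics         = 8,
 bl_pbl_physics     = 5,
 sf_surface_physics = 2,
 bl_mynn_edmf       = 2,
/
"""

print(set_namelist_option(sample, "bl_mynn_edmf", 1))
```

Editing the file by hand works just as well; a script is mainly useful when the same switch has to be flipped across several test runs.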
 
Thanks for this testing and information. Are you okay with using bl_mynn_edmf=1 for now, or do you need to use option 2?
 