While chasing down a different out-of-bounds problem, I built WRF 3.9.1.1 with bounds checking using the Intel compilers and MPI. I got an error message such as the following:
forrtl: severe (408): fort: (3): Subscript #2 of the array H0ML has value 1 which is less than the lower bound of 169
in every single rsl.error.* except rsl.error.0000. I tracked this down to line 230 of phys/module_sf_oml.F:
WRITE(message,*)'Initializing OML with real HML0, h(1,1) = ', h0ml(1,1)
The same line appears in both WRF versions I tried, and it needs fixing: in any dmpar or dm+sm build of WRF run with two or more MPI tasks, each task's copy of h0ml is dimensioned with that task's own memory (patch) bounds, so the index (1,1) is in bounds only for MPI task 0. On the other tasks the lower bounds start at the task's own patch indices, which is exactly what the bounds checker reports above.
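Here is a minimal sketch of one possible fix, under the assumption that OMLINIT has the usual tile-start indices its and jts available (as most WRF physics initialization routines do); this is only an illustration of the idea, not the official patch:
Code:
! Hypothetical replacement for line 230 of phys/module_sf_oml.F:
! report the value at the task's own tile start, which is in bounds
! on every MPI task, instead of at the global (1,1) point.
WRITE(message,*)'Initializing OML with real HML0, h(its,jts) = ', h0ml(its,jts)
Alternatively, the diagnostic WRITE could be restricted to the one task whose patch actually contains the global (1,1) point.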
For completeness, the original out-of-bounds problem I ran across occurred when running WRF with the 4 km Sandy hurricane dataset; the error happens when running with 320 or fewer MPI tasks (for instance, with 1 or 2 threads per task on 8 or 16 nodes, respectively, of a cluster of Intel Xeon Gold 6138 processors, 20 cores per socket and 2 sockets per node). When built with Open MPI 4.0, the traceback was:
Code:
[helios006:279983:0:280004] Caught signal 11 (Segmentation fault: address not mapped to object at address 0xfffffffe05cc3980)
==== backtrace (tid: 280004) ====
0 0x0000000002457e38 module_ra_rrtm_mp_taugb3_() ???:0
1 0x00000000024542bd module_ra_rrtm_mp_gasabs_() ???:0
2 0x000000000244d410 module_ra_rrtm_mp_rrtm_() ???:0
3 0x000000000244bcf3 module_ra_rrtm_mp_rrtmlwrad_() ???:0
4 0x0000000001d698a2 module_radiation_driver_mp_radiation_driver_() ???:0
5 0x00000000000ed103 __kmp_invoke_microtask() ???:0
6 0x00000000000ae94b __kmp_invoke_task_func() /nfs/site/proj/openmp/promo/20180913/tmp/lin_32e-rtl_5_nor_dyn.rel.50.c0.s0.t1..h1.w1-fxe16lin03/../../src/kmp_runtime.cpp:7412
7 0x00000000000ade74 __kmp_launch_thread() /nfs/site/proj/openmp/promo/20180913/tmp/lin_32e-rtl_5_nor_dyn.rel.50.c0.s0.t1..h1.w1-fxe16lin03/../../src/kmp_runtime.cpp:5993
8 0x00000000000ed571 _INTERNAL_26_______src_z_Linux_util_cpp_51eec780::__kmp_launch_worker() /nfs/site/proj/openmp/promo/20180913/tmp/lin_32e-rtl_5_nor_dyn.rel.50.c0.s0.t1..h1.w1-fxe16lin03/../../src/z_Linux_util.cpp:585
9 0x0000000000007dd5 start_thread() pthread_create.c:0
10 0x00000000000fe02d __clone() ???:0
=================================
When built with Intel MPI, the traceback was:
Code:
forrtl: severe (154): array index out of bounds
Image              PC                Routine            Line     Source
wrf.exe_i19impi18  00000000034FB304  for__signal_handl  Unknown  Unknown
libpthread-2.17.s  00007F9E156BA5D0  Unknown            Unknown  Unknown
wrf.exe_i19impi18  0000000002458018  Unknown            Unknown  Unknown
wrf.exe_i19impi18  000000000245449D  Unknown            Unknown  Unknown
wrf.exe_i19impi18  000000000244D5F0  Unknown            Unknown  Unknown
wrf.exe_i19impi18  000000000244BED3  Unknown            Unknown  Unknown
wrf.exe_i19impi18  0000000001D69A82  Unknown            Unknown  Unknown
libiomp5.so        00007F9E150AE103  __kmp_invoke_micr  Unknown  Unknown
libiomp5.so        00007F9E1506F94B  Unknown            Unknown  Unknown
libiomp5.so        00007F9E1506EE74  Unknown            Unknown  Unknown
libiomp5.so        00007F9E150AE571  Unknown            Unknown  Unknown
libpthread-2.17.s  00007F9E156B2DD5  Unknown            Unknown  Unknown
libc-2.17.so       00007F9E14CF202D  clone              Unknown  Unknown
In addition, the error occurs in only 6 or 7 of the MPI tasks; for 320 tasks, those are tasks 114 through 119 or 120.
I'm attaching the namelist.input file for my job in case there is something blatantly wrong with it.
Saludos,
Gerardo
--
Gerardo Cisneros-Stoianowski
Senior Engineer, HPC Applications Performance
Mellanox Technologies