
Index out of bounds in WRF 3.9.1.1 (and 4.0.1)

This post was from a previous version of the WRF & MPAS-A Support Forum. New replies have been disabled. If you have follow-up questions related to this post, please start a new thread from the forum home page.

gerardo

New member
While chasing down a different out-of-bound problem, I built WRF 3.9.1.1 with bounds checking using Intel compilers and MPI. I got an error message such as the following:

forrtl: severe (408): fort: (3): Subscript #2 of the array H0ML has value 1 which is less than the lower bound of 169

in every single rsl.error.* except rsl.error.0000. I tracked this down to line 230 of phys/module_sf_oml.F:

WRITE(message,*)'Initializing OML with real HML0, h(1,1) = ', h0ml(1,1)

It is the same line in both WRF versions, and it needs fixing because "h0ml(1,1)" is only valid for MPI task 0 of any dmpar or dm+sm build of WRF that is run with two or more MPI tasks.
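For context, in a dmpar run each MPI task's copy of a field is dimensioned over that task's own memory patch, so a constant subscript like (1,1) only lies inside the patch of the rank that owns the domain corner. A rough sketch of the situation (the actual declaration in the module may differ slightly):

Code:
! Each rank's copy of the field covers only its own patch (memory dims), e.g.:
REAL, DIMENSION( ims:ime, jms:jme ) :: h0ml
! On the rank that produced the message above, jms = 169, so the constant
! reference h0ml(1,1) uses subscript 1 in dimension 2, below that rank's
! lower bound of 169 -- exactly what the bounds checker reports.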

For completeness, the original out-of-bounds problem I ran across occurred when running WRF with the 4km Sandy hurricane dataset; the error happens with 320 or fewer MPI tasks (for instance, 1 or 2 threads per task on 8 or 16 nodes, respectively, of a cluster of Intel 6138 processors: 20 cores/socket, 2 sockets/node). When built with Open MPI 4.0, the traceback was:

Code:
[helios006:279983:0:280004] Caught signal 11 (Segmentation fault: address not mapped to object at address 0xfffffffe05cc3980)
==== backtrace (tid: 280004) ====
 0 0x0000000002457e38 module_ra_rrtm_mp_taugb3_()  ???:0
 1 0x00000000024542bd module_ra_rrtm_mp_gasabs_()  ???:0
 2 0x000000000244d410 module_ra_rrtm_mp_rrtm_()  ???:0
 3 0x000000000244bcf3 module_ra_rrtm_mp_rrtmlwrad_()  ???:0
 4 0x0000000001d698a2 module_radiation_driver_mp_radiation_driver_()  ???:0
 5 0x00000000000ed103 __kmp_invoke_microtask()  ???:0
 6 0x00000000000ae94b __kmp_invoke_task_func()  /nfs/site/proj/openmp/promo/20180913/tmp/lin_32e-rtl_5_nor_dyn.rel.50.c0.s0.t1..h1.w1-fxe16lin03/../../src/kmp_runtime.cpp:7412
 7 0x00000000000ade74 __kmp_launch_thread()  /nfs/site/proj/openmp/promo/20180913/tmp/lin_32e-rtl_5_nor_dyn.rel.50.c0.s0.t1..h1.w1-fxe16lin03/../../src/kmp_runtime.cpp:5993
 8 0x00000000000ed571 _INTERNAL_26_______src_z_Linux_util_cpp_51eec780::__kmp_launch_worker()  /nfs/site/proj/openmp/promo/20180913/tmp/lin_32e-rtl_5_nor_dyn.rel.50.c0.s0.t1..h1.w1-fxe16lin03/../../src/z_Linux_util.cpp:585
 9 0x0000000000007dd5 start_thread()  pthread_create.c:0
10 0x00000000000fe02d __clone()  ???:0
=================================

When built with Intel MPI, the traceback was:

Code:
forrtl: severe (154): array index out of bounds
Image              PC                Routine            Line        Source             
wrf.exe_i19impi18  00000000034FB304  for__signal_handl     Unknown  Unknown
libpthread-2.17.s  00007F9E156BA5D0  Unknown               Unknown  Unknown
wrf.exe_i19impi18  0000000002458018  Unknown               Unknown  Unknown
wrf.exe_i19impi18  000000000245449D  Unknown               Unknown  Unknown
wrf.exe_i19impi18  000000000244D5F0  Unknown               Unknown  Unknown
wrf.exe_i19impi18  000000000244BED3  Unknown               Unknown  Unknown
wrf.exe_i19impi18  0000000001D69A82  Unknown               Unknown  Unknown
libiomp5.so        00007F9E150AE103  __kmp_invoke_micr     Unknown  Unknown
libiomp5.so        00007F9E1506F94B  Unknown               Unknown  Unknown
libiomp5.so        00007F9E1506EE74  Unknown               Unknown  Unknown
libiomp5.so        00007F9E150AE571  Unknown               Unknown  Unknown
libpthread-2.17.s  00007F9E156B2DD5  Unknown               Unknown  Unknown
libc-2.17.so       00007F9E14CF202D  clone                 Unknown  Unknown

In addition, the error occurs in only 6 or 7 of the MPI tasks; for 320 tasks, those are tasks 114 through 119 or 120.

I'm attaching the namelist.input file for my job in case there is something blatantly wrong with it.

Saludos,

Gerardo
--
Gerardo Cisneros-Stoianowski
Senior Engineer, HPC Applications Performance
Mellanox Technologies
 

Attachments

  • namelist.input (4.5 KB)
Hi,
Thank you for sharing the information you found to fix the problem with the phys/module_sf_oml.F file. I have tested this with a basic WRF namelist that uses the OML model and I see the same problem. Were you able to correct this yourself, or do you need a solution from us for this?

As for the original out of bounds problem - are you still having that problem, as well, or were you able to get past that?
 
kwerner said:
Hi,
Thank you for sharing the information you found to fix the problem with the phys/module_sf_oml.F file. I have tested this with a basic WRF namelist that uses the OML model and I see the same problem. Were you able to correct this yourself, or do you need a solution from us for this?

You're welcome. I corrected line 230 of phys/module_sf_oml.F myself, replacing (1,1) with indices that are valid for any arbitrary MPI rank, and I posted an issue on the WRF GitHub site (https://github.com/wrf-model/WRF/issues/986), but it hasn't been addressed yet.
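The change amounted to something like the following (a sketch; I used whatever patch/tile start indices are in scope at that point in OMLINIT, so the exact names may differ):

Code:
!  Original line 230 -- valid only on the rank whose patch contains (1,1):
!  WRITE(message,*)'Initializing OML with real HML0, h(1,1) = ', h0ml(1,1)
!  Replaced with indices local to each rank's own tile, for example:
   WRITE(message,*)'Initializing OML with real HML0, h = ', h0ml(its,jts)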

As for the original out of bounds problem - are you still having that problem, as well, or were you able to get past that?

Using the bounds-checking-enabled binary (with the OML fix), I traced the original out-of-bounds problem to indices into lookup tables that are computed from floating-point values and are not checked before use. Unfortunately, each time I inserted a bounds check for one of those computed indices, the out-of-bounds problem just moved elsewhere. I would like to give more details, but I'm attending the MultiCore 9 Workshop at NCAR and can't log in to the system that holds my sources at the moment. In any event, the problem stayed localized to the same few MPI tasks at each new attempt to contain it, so I suspect the selected physics options in my namelist.input may not be appropriate for the 4km Sandy wrfinput and wrfbdy files I'm working with.
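The pattern I was guarding against looks roughly like this (a hypothetical sketch with illustrative names, not the actual RRTM code):

Code:
! A table index derived from a floating-point value, used without a check:
ig = INT( (value - table_min) / table_step ) + 1
! Guard I inserted -- clamp to the valid range before indexing the table:
ig = MAX( 1, MIN( ig, table_size ) )
result = lookup_table(ig)
! Each such clamp only pushed the failure to another, similarly computed index.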

Saludos,

Gerardo
 
Hi,
I just responded to your issue on GitHub. We haven't committed the fix yet because the corrections need to be tweaked a bit, but we are in the process of doing so. Thank you for that.

As for the other problem, I have a few questions:
1) When you say you're using the Sandy 4km input - I assume this originated from the testing input that we provided a couple of years ago. Is that correct?
2) I notice that your namelist is set up for 2 domains. Do you have this problem when only running 1 domain?
3) I also notice that you have interval_seconds set to 86400, which does not match ours (set to 10800). Was that intentional? I would assume that if you only want to run for 12 hours, you would want this set to something more frequent than every 24 hours (see the &time_control snippet at the end of this post).
4) Do you have the problem when you turn off the ocean model?

To make your debugging process a bit easier, I would like you to try the coarse-resolution Sandy input (40 km) so that the runs are much smaller and faster. I'm attaching a tar file with the necessary input and namelist. I would run this first, as is, to establish a baseline. Then run with your namelist settings, but with the domain/date/time settings for the coarse 40 km case, to see if you still get the out-of-bounds problem. If so, I would go back to the baseline run and add in your options one at a time to see if you're able to narrow down the cause.
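For reference, interval_seconds lives in the &time_control section of namelist.input; matching our provided input would look like this:

Code:
&time_control
 interval_seconds = 10800,   ! boundary-data interval in seconds (3-hourly)
/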
 

Attachments

  • sandy_40km.tar (116.1 MB)
Sorry for the delay in my reply. My work pulled me towards other applications.
1) When you say you're using the Sandy 4km input - I assume this originated from the testing input that we provided a couple of years ago. Is that correct?
I believe that is correct. A friend at another company pointed me to the input.
2) I notice that your namelist is set up for 2 domains. Do you have this problem when only running 1 domain?
I have not tried running it with a single domain. I'm trying to see whether MPI collective offloads help vortex-following WRF at 4km the way they did at 2km.
3) I also notice that you have interval_seconds set to 86400, which does not match ours (set to 10800). Was that intentional? I would assume if you are only wanting to run for 12 hours, that you would want this to be set for something more frequent than every 24 hours.
I now realize that changing the interval_seconds setting was a mistake. I will retry my job with the correct setting.
4) Do you have the problem when you turn off the ocean model?
I have yet to try that.

Thanks for the 40km data.

Saludos,

Gerardo
 
Hi again.

a) Running with a single domain works.

b) Setting interval_seconds to 10800 didn't help the two-domain job.

c) How do I turn off the ocean model? (Although I assume the ocean model would be important when tracking a hurricane.)

Thanks again for your help.

Saludos,

Gerardo
 
Gerardo,
It's possible that you'll want to use the ocean model, but my questions are simply for troubleshooting: I was curious whether turning off the ocean model would get you past the problem, which would tell me the problem is related to having it turned on. The following is a setting in your namelist:
Code:
sf_ocean_physics                    = 1,
Just set that to 0 to do a test with it turned off.

If that does work, it's probably a good idea to know whether the ocean model is something you need before blindly turning it on. It's not something that everyone uses, even for hurricane tracking. You can read a bit more about the options here:
http://www2.mmm.ucar.edu/wrf/users/phys_references.html#OCEAN

Were you able to run the 40km default case with the default namelist provided? The namelist is set for only 1 domain, but if you simply change the following lines to:
Code:
input_from_file = .true., .false.
max_dom = 2
e_vert = 60, 60
then you should be able to run the vortex-following test with 2 domains.
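Grouped by namelist section, those edits would look roughly like this (other entries in each section stay as provided):

Code:
&time_control
 input_from_file = .true., .false.,
/

&domains
 max_dom = 2,
 e_vert  = 60, 60,
/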
 
Thanks for the link.

I tried with the ocean model turned off. On a 320-MPI-task, 2-domain job, I still got a crash in MPI ranks 114 through 120; the top five frames of the traceback read as follows:

Code:
 0 0x0000000002457e38 module_ra_rrtm_mp_taugb3_()  ???:0
 1 0x00000000024542bd module_ra_rrtm_mp_gasabs_()  ???:0
 2 0x000000000244d410 module_ra_rrtm_mp_rrtm_()  ???:0
 3 0x000000000244bcf3 module_ra_rrtm_mp_rrtmlwrad_()  ???:0
 4 0x0000000001d698a2 module_radiation_driver_mp_radiation_driver_()  ???:0

(I had earlier tracked this down to an out-of-bounds index into a table, which is computed from a floating-point value. Inserting a bounds check and choosing the first or last valid index only shifted the problem to another, similarly computed index; repeating the process eventually produced a crash due to a NaN in DPTHMX.)

Saludos,

Gerardo
 
Were you able to run the 40km default case with the default namelist provided? The namelist is set for only 1 domain, but if you simply change the following lines ...
No, I haven't tested the 40km case yet, but I should mention that I have run Sandy at 2km with a vortex-following nested domain on 1024 to 4096 MPI tasks without any trouble, so I'm not sure going to a lower resolution will help. In any case, I will give it a try later. (Also, the crashes I'm getting with the 4km input don't occur if I run with 640 or 1280 MPI tasks -- only 320 or fewer.)
 
Hi,
I just wanted to let you know I haven't forgotten about you. I've been conducting several different tests and am able to reproduce the problem even with namelist settings different from yours. I have passed this along to our software engineer, who is trying to figure out the problem. I'll keep you posted.
 