Floating point exception in interpolation routine

lnpilz · Oct 12, 2020

Hi there,

I am using WRF v4.2.2 compiled with Intel 18.0.1. In the second part of an ndown run, I get a floating-point overflow in sint.F, lines 301-304 (cf. https://github.com/wrf-model/WRF/blob/f311cd5e136631ebf3ebaa02b4b7be3816ed171f/share/sint.F#L301-L304).

Code:

d02 2019-07-21_06:00:00 calling inc/HALO_FORCE_DOWN_inline.inc
forrtl: error (72): floating overflow
Image              PC                Routine            Line        Source     
wrf.exe            0000000013DBA9AE  Unknown               Unknown  Unknown
libpthread-2.12.s  00002B58535EE7E0  Unknown               Unknown  Unknown
wrf.exe            0000000007142B4F  sintb_                    302  sint.f90
wrf.exe            00000000070B4580  bdy_interp1_             2557  interp_fcn.f90
wrf.exe            00000000070B073C  bdy_interp_              2365  interp_fcn.f90
wrf.exe            0000000003F99B1F  force_domain_em_p       11330  module_dm.f90
wrf.exe            0000000008317DB7  med_force_domain_         543  mediation_force_domain.f90
wrf.exe            00000000074F8A72  med_nest_force_           660  mediation_integrate.f90
wrf.exe            00000000005F6923  module_integrate_         361  module_integrate.f90
wrf.exe            0000000000414286  module_wrf_top_mp         324  module_wrf_top.f90
wrf.exe            00000000004136E5  MAIN__                     44  wrf.f90
wrf.exe            000000000041369E  Unknown               Unknown  Unknown
libc-2.12.so       00002B585381AD20  __libc_start_main     Unknown  Unknown
wrf.exe            00000000004135A9  Unknown               Unknown  Unknown

While trying to figure out what went wrong, I saw some strange behaviour in the vertical wind component (cf. attachment). Also, all fields associated with W are 0 in the wrfbdy_d01 file, which was generated by ndown.exe. However, I didn't see any errors in the rsl.error* logs of ndown.

Is there a possibility that the W fields being zero in the bdy file causes trouble in the interpolation routine?

Thanks in advance,

Lukas

PS: please find namelist.input file attached

kwerner · Oct 12, 2020

Hi,
I just checked a wrfbdy_d01 file I have lying around (to use for basic testing) and all "W" fields in there are also zero, probably because they are tendencies, so I don't think that's the problem. It looks like you're running WRF-Chem. If so, can you compile basic WRF and run this same test without the chemistry options and see if you still have the problem? You'll probably need to go through the steps again (starting from real.exe). Thanks!

lnpilz · Oct 15, 2020

Hi,
thanks for the quick reply.

With WRF-Chem not compiled in, it no longer breaks at the interpolation. However, it breaks 17 minutes into the simulation with a segmentation fault:

Code:

Rank 0:
Timing for main: time 2019-07-21_06:17:29 on domain   2:    0.10374 elapsed seconds

Rank 23:
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image              PC                Routine            Line        Source             
wrf.exe            000000000310FCED  for__signal_handl     Unknown  Unknown
libpthread-2.12.s  00002B442D7577E0  Unknown               Unknown  Unknown
wrf.exe            0000000001FCF4CC  Unknown               Unknown  Unknown
wrf.exe            0000000001FBCBB5  Unknown               Unknown  Unknown
wrf.exe            0000000001AF6239  Unknown               Unknown  Unknown
wrf.exe            000000000154C556  Unknown               Unknown  Unknown
wrf.exe            000000000138610C  Unknown               Unknown  Unknown
wrf.exe            000000000056F8C7  Unknown               Unknown  Unknown
wrf.exe            000000000056FEDE  Unknown               Unknown  Unknown
wrf.exe            0000000000412F81  Unknown               Unknown  Unknown
wrf.exe            0000000000412F3F  Unknown               Unknown  Unknown
wrf.exe            0000000000412EDE  Unknown               Unknown  Unknown
libc-2.12.so       00002B442D983D20  __libc_start_main     Unknown  Unknown
wrf.exe            0000000000412DE9  Unknown               Unknown  Unknown

I will try recompiling the code with debug flags to get a bit more information on where exactly it breaks.

lnpilz · Oct 15, 2020

Hi,

With ./configure -D, the run breaks much earlier (after 35 seconds) with a divide-by-zero error in phys/module_bl_shinhong.F:975

cf:

Code:

forrtl: error (73): floating divide by zero
Image              PC                Routine            Line        Source             
wrf.exe            000000000CA7DA1E  Unknown               Unknown  Unknown
libpthread-2.12.s  00002B1FD96517E0  Unknown               Unknown  Unknown
wrf.exe            00000000094F1794  module_bl_shinhon         975  module_bl_shinhong.f90
wrf.exe            00000000094B1601  module_bl_shinhon         219  module_bl_shinhong.f90
wrf.exe            0000000007FF2B02  module_pbl_driver        1186  module_pbl_driver.f90
wrf.exe            0000000004F4A0FD  module_first_rk_s         542  module_first_rk_step_part1.f90
wrf.exe            0000000003DB28A0  solve_em_                 897  solve_em.f90
wrf.exe            00000000036C8D13  solve_interface_          124  solve_interface.f90
wrf.exe            000000000058E743  module_integrate_         338  module_integrate.f90
wrf.exe            000000000059031E  module_integrate_         375  module_integrate.f90
wrf.exe            0000000000413AC6  module_wrf_top_mp         324  module_wrf_top.f90
wrf.exe            0000000000412F25  MAIN__                     44  wrf.f90
wrf.exe            0000000000412EDE  Unknown               Unknown  Unknown
libc-2.12.so       00002B1FD987DD20  __libc_start_main     Unknown  Unknown
wrf.exe            0000000000412DE9  Unknown               Unknown  Unknown

I assume (and I haven't checked this with a debugger yet) that wstar3(i) is 0 because it is set to 0 in line 773 if sfcflg(i) is False and then used in a division in the offending line 975.

lnpilz · Oct 16, 2020

Hey,
so I finally hooked it up to a debugger and it is a very weird bug indeed.

As it turns out, in line 975 rigs(i) is exactly the decimal value of 0.4 cast to float32 (0.400000006). This causes (1.+cpent/rigs(i)) (with cpent == -0.4) to become zero, thus causing the divide by zero exception.
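The cancellation can be reproduced outside WRF. Below is a minimal Python sketch, assuming (as the debugger output suggests) that both rigs(i) and cpent are held as single-precision reals. The helper `to_f32` is a hypothetical name introduced here to round a Python double to the nearest float32 value; it is not part of WRF.

```python
import struct

def to_f32(x):
    """Round a Python float (float64) to the nearest float32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]

rigs = to_f32(0.4)    # value the debugger reported for rigs(i)
cpent = to_f32(-0.4)  # value of cpent quoted in this thread

# Two float32 values of equal magnitude and opposite sign divide to exactly -1,
# so the denominator term 1 + cpent/rigs is exactly zero.
denom = to_f32(1.0 + to_f32(cpent / rigs))

print(f"{rigs:.9g}")  # 0.400000006
print(denom)          # 0.0
```

So no rounding luck is needed in the final step: once rigs(i) lands on the float32 representation of 0.4, the denominator is exactly zero.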

As far as I can see, this is not even a case of two parameters in the rigs computation being set to the same value and cancelling each other out, leaving a third parameter that is exactly 0.4. Rather, it seems that this is a genuine result of the calculations performed.

I have attached the parameter values as shown by the DDT debugger. However, these are probably float64, so they don't quite accurately represent the internal values.

Unfortunately I don't quite know how to proceed, as I couldn't find any documentation on the variables rigs and cpent. I'd appreciate any suggestions.

Thanks in advance,

Lukas

kwerner · Oct 20, 2020

Lukas,
I just want to clarify that you are saying, from your understanding, that rigs(i) = 0.4 is specific to this particular simulation, at this particular time - and not that this would always be the case? Is that correct?
I'm not sure what exactly rigs(i) is, and how much variance you can see in the value, but would it be possible to add some sort of if statement to the code, essentially saying that if the value of rigs(i) equals something that will cause the "divide by zero" problem, then rigs(i) = rigs(i) + (some value that makes sense and won't noticeably modify the results) above this line? If you do modify this code, you will need to recompile. You will NOT need to issue a 'clean -a' or reconfigure. Since it's just a small change in a physics routine, you can simply recompile and it should be pretty quick.
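A rough sketch of what such a guard could look like, written in Python rather than Fortran purely to illustrate the logic. The function name `guard_rigs`, the offset `eps`, and the choice to nudge upward are all assumptions for illustration, not anything taken from module_bl_shinhong.F.

```python
import struct

def to_f32(x):
    """Round a Python float (float64) to the nearest float32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]

def guard_rigs(rigs, cpent, eps=1.0e-4):
    """Return rigs unchanged unless 1 + cpent/rigs is exactly zero in float32.

    In that case return rigs nudged upward by eps (an arbitrary, assumed offset).
    """
    if to_f32(1.0 + to_f32(cpent / rigs)) == 0.0:
        return to_f32(rigs + eps)
    return rigs

cpent = to_f32(-0.4)
bad = to_f32(0.4)
safe = guard_rigs(bad, cpent)

print(safe != bad)                                    # True: the bad value was nudged
print(1.0 + cpent / safe != 0.0)                      # True: denominator is now nonzero
print(guard_rigs(to_f32(0.5), cpent) == to_f32(0.5))  # True: other values pass through
```

Note that an exact-equality check like this only catches the exact zero; values of rigs(i) very close to 0.4 would still give a very small denominator.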

lnpilz · Oct 30, 2020

Hi Kelly,
sorry for the delay. Yes, from my understanding, this might just be a freak case. Nudging rigs(i) will hotfix it for now, but the direction of the nudge might influence the physics a tad, as rigs(i) = 0.4 is a pole of the expression.
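To illustrate why the direction matters, here is a small Python sketch. It assumes only what is stated in this thread, namely that (1.+cpent/rigs(i)) appears as a divisor with cpent == -0.4. The function name `inv_term` and the offset of 1e-4 are made up for illustration, and the full expression on line 975 is not reproduced here.

```python
def inv_term(rigs, cpent=-0.4):
    """Reciprocal of (1 + cpent/rigs), the factor that blows up at rigs == 0.4."""
    return 1.0 / (1.0 + cpent / rigs)

above = inv_term(0.4 + 1.0e-4)  # nudged upward: large positive value
below = inv_term(0.4 - 1.0e-4)  # nudged downward: large negative value

print(above > 0.0, below < 0.0)  # True True
```

The factor changes sign across the pole, so the two possible nudges give results of opposite sign and large magnitude.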

Nevertheless, this is of course a bug that has to be fixed. I'll file a GitHub issue when I can find some time.

Also, unfortunately I couldn't find any time to continue debugging in the last week, but I will update you when I eventually get around to it.

Cheers, Lukas
