I have encountered a curious problem. Here is the setup:
Running the em_quarter_ss core from version 4.7.1. I did make some code changes to add a couple of diagnostic arrays to the microphysics driver, but that does not seem to be related to the problem.
I have successfully run hundreds, if not a thousand, simulations using the canonical input sounding for the supercell simulation on a 200 km x 200 km grid, with either 60 or 80 (unstaggered) vertical levels, at horizontal grid spacings ranging from 100 m to 4.0 km. I used a model time step of 1.0 s for every one of these simulations, and I also ran a 50-member ensemble that leveraged the SKEB scheme for ensemble diversity.
Never once had a crash.
Then I changed the design of the experiment: instead of varying the horizontal grid spacing, I fixed it at 3.0 km and began varying the model time step. I ran 50-member ensembles using time steps of 24, 20, 16, 12, 10, 8, 6, 4, 3, 2, and 1 seconds, again with zero crashes.
But then I decreased the time step further to 0.5 s. That's when the crash began.
The crash is repeatable. It happens in every one of the 50 members at the exact same time step (10 minutes, 35.0 seconds into the simulation).
I recompiled WRF in debug mode to get a full traceback and it led me to the module_sf_mynn.f90 source code file, this line:
psih_unstable = psih_unstab(nzol) + rzol*(psih_unstab(nzol+1)-psih_unstab(nzol))
I began inserting PRINT statements to trace the error. It turns out nzol had become NaN.
I traced this back to the top of the code in that file to learn that the problem was that dz8w had become NaN at at least one grid point.
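For anyone who wants to repeat the trace, a check along these lines is enough to flag the offending columns (a sketch only, not the exact code I used: ieee_is_nan comes from the ieee_arithmetic intrinsic module, its/ite, jts/jte, kts are the usual WRF tile bounds, and only the lowest level matters for the surface layer):

! Sketch: flag columns where the lowest-level dz8w coming into the MYNN
! surface layer is NaN. The USE line belongs with the other USE statements
! at the top of the routine.
USE ieee_arithmetic, ONLY: ieee_is_nan
...
DO j = jts, jte
   DO i = its, ite
      IF ( ieee_is_nan( dz8w(i,kts,j) ) ) THEN
         PRINT *, 'NaN in dz8w at i, j = ', i, j
      ENDIF
   ENDDO
ENDDO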
I then traced the computation of dz8w back to its calculation in the dyn/module_big_step_utilities_em.F source code file. It is in two subroutines, phy_prep and moist_physics_prep_em, but the code looks to be identical in both routines:
DO j = j_start, j_end
   DO k = k_start, kte
      DO i = i_start, i_end
         z_at_w(i,k,j) = (ph(i,k,j)+phb(i,k,j))/g
      ENDDO
   ENDDO
ENDDO

do j = j_start,j_end
   do k = k_start, kte-1
      do i = i_start, i_end
         dz8w(i,k,j) = z_at_w(i,k+1,j)-z_at_w(i,k,j)
      enddo
   enddo
enddo

One can see that dz8w depends on the variable z_at_w, which itself depends on the geopotential arrays ph and phb.
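A check of the same form, dropped in right after that second loop, would show whether the NaN is already present in ph or phb or only appears in the difference (again just a sketch, using the same loop bounds and variable names as the snippet above, with ieee_is_nan from the ieee_arithmetic intrinsic module):

! Sketch: report bad dz8w values together with the geopotential inputs
! that produced them.
DO j = j_start, j_end
   DO k = k_start, kte-1
      DO i = i_start, i_end
         IF ( ieee_is_nan( dz8w(i,k,j) ) .OR. dz8w(i,k,j) <= 0. ) THEN
            PRINT *, 'bad dz8w at i,k,j =', i, k, j,      &
                     ' dz8w =', dz8w(i,k,j),              &
                     ' ph =',  ph(i,k,j),  ph(i,k+1,j),   &
                     ' phb =', phb(i,k,j), phb(i,k+1,j)
         ENDIF
      ENDDO
   ENDDO
ENDDO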
I stopped tracing here because I am now deep into the model dynamics, which I am much less familiar with. I wish I could use that as an excuse to declare that the problem can't possibly be physical, but there are some peculiar factors:
- The problem goes away if I turn off the MYNN surface-layer (and PBL) scheme.
- The crash time of 10 m 35.0 s is roughly when the initial warm bubble is probably forming an intense updraft, which implies that vertical accelerations/motions could be playing a role somehow.
I almost wonder if the model levels are somehow being compressed so that they're touching. However, the grid point at which I saw dz8w become NaN was not directly under the center of the perturbation bubble, and in fact is probably outside the warm bubble altogether. So maybe that means it's a computational wave of some kind? But the more curious issue is why a really short time step would be the cause of this at all. It doesn't make sense that a model time step could be too small...smaller should always be better with NWP resolution parameters, right?
...right?
Anyway, I have attached some files to help if anyone wants to investigate. I'll try changing some other namelist options to see if anything else makes a difference.
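For reference, in case it matters for reproducing this: a sub-second step in WRF has to be specified with the fractional time-step entries in &domains, so the 0.5 s runs use something like the following (other entries as in the attached namelist):

&domains
 time_step           = 0,
 time_step_fract_num = 1,
 time_step_fract_den = 2,

That fraction is the "+01/02" that shows up in the time stamps in the log excerpt below.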
I am also including the bottom of the rsl.error.0000 file with a debug_level of 1000 (the whole thing is too large to attach here):
d01 0001-01-01_00:10:34+01/02 DEBUG wrf_timeinttoa(): returning with str = [0000000000_000:000:000]
DEBUG domain_clockadvance(): before WRFU_ClockAdvance, clock start time = 0001-01-01_00:00:00
DEBUG domain_clockadvance(): before WRFU_ClockAdvance, clock current time = 0001-01-01_00:10:34
DEBUG domain_clockadvance(): before WRFU_ClockAdvance, clock stop time = 0001-01-01_04:00:00
DEBUG domain_clockadvance(): before WRFU_ClockAdvance, clock time step = 0000000000_000:000:000
d01 0001-01-01_00:10:35 DEBUG wrf_timetoa(): returning with str = [0001-01-01_00:10:35]
d01 0001-01-01_00:10:35 DEBUG wrf_timetoa(): returning with str = [0001-01-01_00:00:00]
d01 0001-01-01_00:10:35 DEBUG wrf_timetoa(): returning with str = [0001-01-01_04:00:00]
d01 0001-01-01_00:10:35 DEBUG wrf_timeinttoa(): returning with str = [0000000000_000:000:000]
DEBUG domain_clockadvance(): after WRFU_ClockAdvance, clock start time = 0001-01-01_00:00:00
DEBUG domain_clockadvance(): after WRFU_ClockAdvance, clock current time = 0001-01-01_00:10:35
DEBUG domain_clockadvance(): after WRFU_ClockAdvance, clock stop time = 0001-01-01_04:00:00
DEBUG domain_clockadvance(): after WRFU_ClockAdvance, clock time step = 0000000000_000:000:000
d01 0001-01-01_00:10:35 module_integrate: back from solve interface
d01 0001-01-01_00:10:35 DEBUG wrf_timetoa(): returning with str = [0001-01-01_00:10:35]
Timing for main: time 0001-01-01_00:10:35 on domain 1: 0.28754 elapsed seconds
d01 0001-01-01_00:10:35 in med_latbound_in
d01 0001-01-01_00:10:35 module_integrate: calling solve interface
d01 0001-01-01_00:10:35 calling inc/HALO_EM_MOIST_OLD_E_7_inline.inc
d01 0001-01-01_00:10:35 calling inc/PERIOD_BDY_EM_MOIST_OLD_inline.inc
d01 0001-01-01_00:10:35 call rk_step_prep
d01 0001-01-01_00:10:35 calling inc/HALO_EM_A_inline.inc
d01 0001-01-01_00:10:35 calling inc/PERIOD_BDY_EM_A_inline.inc
d01 0001-01-01_00:10:35 call rk_phys_bc_dry_1
d01 0001-01-01_00:10:35 call init_zero_tendency
d01 0001-01-01_00:10:35 calling inc/HALO_EM_PHYS_A_inline.inc
d01 0001-01-01_00:10:35 call phy_prep
d01 0001-01-01_00:10:35 DEBUG wrf_timetoa(): returning with str = [0001-01-01_00:10:35]
d01 0001-01-01_00:10:35 call radiation_driver
d01 0001-01-01_00:10:35 Top of Radiation Driver
d01 0001-01-01_00:10:35 calling inc/HALO_PWP_inline.inc
d01 0001-01-01_00:10:35 call surface_driver
d01 0001-01-01_00:10:35 in MYNNSFC
[v101:2501166:0:2501166] Caught signal 11 (Segmentation fault: address not mapped to object at address 0xfffffffe2fa8c0e4)