
Model time step too small for MYNN (sfc layer)?

Jeff_Duda

I have encountered a curious problem. Here is the setup:

Running core em_quarter_ss from version 4.7.1 - I did make some code changes to add a couple of diagnostic arrays to the microphysics driver, but that does not seem to be related to the problem.

I have run hundreds, if not a thousand, simulations using the canonical input sounding for the supercell simulation on a 200 km x 200 km grid with either 60 or 80 (unstaggered) vertical levels, at horizontal grid spacings ranging from 100 m to 4.0 km, all successfully. I used a model time step of 1.0 s for every one of these simulations. I also ran a 50-member ensemble, leveraging the SKEB scheme for ensemble diversity.

Never once had a crash.

Then I changed experiment dimensions. Instead of varying the horizontal grid spacing, I fixed it at 3.0 km and began varying the model time step. I ran 50-member ensembles using time steps of 24, 20, 16, 12, 10, 8, 6, 4, 3, 2, and 1 seconds, again with zero crashes.

But then I decreased the time step further, to 0.5 s. That's when the crashes began.

The crash is repeatable. It happens in every one of the 50 members at the exact same time step (10 minutes, 35.0 seconds into the simulation).

I recompiled WRF in debug mode to get a full traceback, and it led me to this line in the module_sf_mynn.f90 source file:

   psih_unstable = psih_unstab(nzol) + rzol*(psih_unstab(nzol+1)-psih_unstab(nzol))

I began inserting PRINT statements to trace the error. It turns out nzol had become a NaN.
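For anyone wanting to do similar digging, here is a minimal standalone sketch (not the WRF code itself; the table size and values are made up) of the kind of checks I mean: a NaN never compares equal to itself, and an index derived from a NaN can land far outside the psih_unstab lookup table, which is how you end up with the "address not mapped" segfault shown in the log further down.

   PROGRAM nan_trace_demo
     ! Standalone sketch: the classic x /= x test, ieee_is_nan, and a bounds
     ! check on the derived index. Names mirror the MYNN snippet above, but
     ! the table size and values here are invented for illustration.
     USE, INTRINSIC :: ieee_arithmetic, ONLY: ieee_value, ieee_quiet_nan, ieee_is_nan
     IMPLICIT NONE
     REAL    :: rzol
     INTEGER :: nzol
     REAL    :: psih_unstab(0:1000)

     psih_unstab = 0.0
     rzol = ieee_value(rzol, ieee_quiet_nan)   ! stand-in for a value that went NaN upstream
     nzol = -999999                            ! stand-in for an index derived from that NaN

     IF (rzol /= rzol)      PRINT *, 'NaN detected (x /= x test): rzol =', rzol
     IF (ieee_is_nan(rzol)) PRINT *, 'NaN detected (ieee_is_nan): rzol =', rzol
     IF (nzol < LBOUND(psih_unstab,1) .OR. nzol+1 > UBOUND(psih_unstab,1)) &
          PRINT *, 'nzol outside psih_unstab bounds: nzol =', nzol
   END PROGRAM nan_trace_demo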

I traced this back toward the top of the code in that file and found that the root problem was dz8w becoming NaN at one or more grid points.

I then traced dz8w back to its calculation in the dyn_em/module_big_step_utilities_em.F source file. It appears in two subroutines, phy_prep and moist_physics_prep_em, but the code looks identical in both:

   DO j = j_start, j_end
     DO k = k_start, kte
       DO i = i_start, i_end
         z_at_w(i,k,j) = (ph(i,k,j)+phb(i,k,j))/g
       ENDDO
     ENDDO
   ENDDO

   do j = j_start, j_end
     do k = k_start, kte-1
       do i = i_start, i_end
         dz8w(i,k,j) = z_at_w(i,k+1,j)-z_at_w(i,k,j)
       enddo
     enddo
   enddo

One can see that dz8w depends on the variable z_at_w, which itself depends on the geopotential arrays ph and phb.
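As an aside, here is a standalone sketch (again not WRF code; the dimensions and the planted NaN are made up) of the kind of scan that pinpoints the first bad (i,k,j) in a 3-D field like z_at_w or dz8w:

   PROGRAM find_first_nan
     ! Standalone sketch: walk a 3-D field and report the first NaN found,
     ! the same idea used to locate the grid point where dz8w went bad.
     USE, INTRINSIC :: ieee_arithmetic, ONLY: ieee_value, ieee_quiet_nan, ieee_is_nan
     IMPLICIT NONE
     INTEGER, PARAMETER :: nx = 10, nz = 5, ny = 10
     REAL    :: dz8w(nx, nz, ny)
     INTEGER :: i, k, j

     dz8w = 250.0                                             ! plausible layer depths
     dz8w(4,3,7) = ieee_value(dz8w(4,3,7), ieee_quiet_nan)    ! plant one NaN

     DO j = 1, ny
       DO k = 1, nz
         DO i = 1, nx
           IF (ieee_is_nan(dz8w(i,k,j))) THEN
             PRINT *, 'first NaN in dz8w at i,k,j =', i, k, j
             STOP
           END IF
         ENDDO
       ENDDO
     ENDDO
     PRINT *, 'no NaN found'
   END PROGRAM find_first_nan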

I stopped tracing here because I'm now deep into the model dynamics, which I am less familiar with. I wish I could use that as an excuse to declare that the problem can't possibly be physical, but there are some peculiar factors:

- The problem goes away if I turn off the MYNN surface-layer (and PBL) scheme.
- The crash time of 10 min 35.0 s is roughly when the initial warm bubble should be producing an intense updraft, which implies that vertical accelerations/motions could somehow be playing a role.

I almost wonder if the model levels are somehow being compressed so that they're touching. However, the grid point at which I saw dz8w become NaN was not directly under the center of the perturbation bubble, and in fact is probably outside the warm bubble altogether. So maybe that means it's a computational wave of some kind? But the more curious issue is why a really short time step would be the cause of this at all. It doesn't make sense that a model time step could be too small...smaller should always be better with NWP resolution parameters, right?

...right?

Anyway, I have attached some files to help if anyone wants to investigate. I'll try changing some other namelist options to see if anything else makes a difference.
I am also including the bottom of the rsl.error.0000 file with a debug_level of 1000 (the whole thing is too large to attach here):

d01 0001-01-01_00:10:34+01/02 DEBUG wrf_timeinttoa(): returning with str = [0000000000_000:000:000]
DEBUG domain_clockadvance(): before WRFU_ClockAdvance, clock start time = 0001-01-01_00:00:00
DEBUG domain_clockadvance(): before WRFU_ClockAdvance, clock current time = 0001-01-01_00:10:34
DEBUG domain_clockadvance(): before WRFU_ClockAdvance, clock stop time = 0001-01-01_04:00:00
DEBUG domain_clockadvance(): before WRFU_ClockAdvance, clock time step = 0000000000_000:000:000
d01 0001-01-01_00:10:35 DEBUG wrf_timetoa(): returning with str = [0001-01-01_00:10:35]
d01 0001-01-01_00:10:35 DEBUG wrf_timetoa(): returning with str = [0001-01-01_00:00:00]
d01 0001-01-01_00:10:35 DEBUG wrf_timetoa(): returning with str = [0001-01-01_04:00:00]
d01 0001-01-01_00:10:35 DEBUG wrf_timeinttoa(): returning with str = [0000000000_000:000:000]
DEBUG domain_clockadvance(): after WRFU_ClockAdvance, clock start time = 0001-01-01_00:00:00
DEBUG domain_clockadvance(): after WRFU_ClockAdvance, clock current time = 0001-01-01_00:10:35
DEBUG domain_clockadvance(): after WRFU_ClockAdvance, clock stop time = 0001-01-01_04:00:00
DEBUG domain_clockadvance(): after WRFU_ClockAdvance, clock time step = 0000000000_000:000:000
d01 0001-01-01_00:10:35 module_integrate: back from solve interface
d01 0001-01-01_00:10:35 DEBUG wrf_timetoa(): returning with str = [0001-01-01_00:10:35]
Timing for main: time 0001-01-01_00:10:35 on domain 1: 0.28754 elapsed seconds
d01 0001-01-01_00:10:35 in med_latbound_in
d01 0001-01-01_00:10:35 module_integrate: calling solve interface
d01 0001-01-01_00:10:35 calling inc/HALO_EM_MOIST_OLD_E_7_inline.inc
d01 0001-01-01_00:10:35 calling inc/PERIOD_BDY_EM_MOIST_OLD_inline.inc
d01 0001-01-01_00:10:35 call rk_step_prep
d01 0001-01-01_00:10:35 calling inc/HALO_EM_A_inline.inc
d01 0001-01-01_00:10:35 calling inc/PERIOD_BDY_EM_A_inline.inc
d01 0001-01-01_00:10:35 call rk_phys_bc_dry_1
d01 0001-01-01_00:10:35 call init_zero_tendency
d01 0001-01-01_00:10:35 calling inc/HALO_EM_PHYS_A_inline.inc
d01 0001-01-01_00:10:35 call phy_prep
d01 0001-01-01_00:10:35 DEBUG wrf_timetoa(): returning with str = [0001-01-01_00:10:35]
d01 0001-01-01_00:10:35 call radiation_driver
d01 0001-01-01_00:10:35 Top of Radiation Driver
d01 0001-01-01_00:10:35 calling inc/HALO_PWP_inline.inc
d01 0001-01-01_00:10:35 call surface_driver
d01 0001-01-01_00:10:35 in MYNNSFC
[v101:2501166:0:2501166] Caught signal 11 (Segmentation fault: address not mapped to object at address 0xfffffffe2fa8c0e4)
 

Attachments

  • namelist.output.txt (85 KB)
  • rsl.error.0000.txt (109.5 KB)
UPDATE: FIXED (at least temporarily)

I was able to trace the issue to the time step. When using a model time step of less than 1.0 s, one sets the namelist variable time_step = 0 and then uses time_step_fract_num and time_step_fract_den to control the fractional time step.
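For reference, the relevant entries (in the &domains group, if I recall the namelist layout correctly) for a 0.5 s step look like this, giving 0 + 1/2 = 0.5 s, with everything else unchanged:

   &domains
    time_step           = 0,
    time_step_fract_num = 1,
    time_step_fract_den = 2,
   /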

It turns out the subroutine SETUP_RAND_PERTURB in dyn_em/module_stoch.F uses grid%time_step in its calculations. grid%time_step is an INTEGER set from the namelist variable of the same name (so it is 0 here), whereas grid%dt is the floating-point representation of the model time step. So I replaced grid%time_step with grid%dt in the call to SETUP_RAND_PERTURB and changed the type of the corresponding dummy argument, itime_step, to REAL as well.
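To illustrate the pitfall with a standalone sketch (this is not the actual module_stoch.F code, and the formula is schematic; the only point is the time step ending up in a denominator):

   PROGRAM timestep_pitfall
     ! Sketch of the integer-vs-real time step pitfall: with a fractional
     ! step, the integer namelist value time_step is 0, so anything that
     ! divides by it overflows to Infinity, while dt carries the true 0.5 s.
     IMPLICIT NONE
     INTEGER :: time_step
     REAL    :: dt, ztau, zsigma2

     time_step = 0        ! integer namelist value for a 0.5 s run
     dt        = 0.5      ! floating-point model time step (like grid%dt)
     ztau      = 600.0    ! decorrelation time, as in the SKEBS printout

     ! Schematic variance-like quantity; the real SKEBS formula differs,
     ! but any term with the time step in the denominator behaves this way.
     zsigma2 = 1.0 / (ztau * REAL(time_step))
     PRINT *, 'with integer time_step:', zsigma2   ! typically prints Infinity
     zsigma2 = 1.0 / (ztau * dt)
     PRINT *, 'with real dt:          ', zsigma2   ! finite
   END PROGRAM timestep_pitfall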

Strangely, the compilation then failed because itime_step is wrapped in FLOAT() calls, and my compiler treated FLOAT as if it were a user-defined variable; FLOAT is the old specific intrinsic that only accepts an INTEGER argument, so the call became invalid once itime_step was REAL. So I also removed the FLOAT() wrappers around itime_step in that subroutine.

But, with all those changes made, the simulation has now run past the crashing point from before and appears to be sailing smoothly.

I suppose one clue I could have seen beforehand (although why would I have known to check this?) is in the SKEBS status messages in rsl.out.0000:
==============================================
>> Initializing STREAMFUNCTION forcing pattern of <<
>> stochastic kinetic-energy backscatter scheme <<
Total backscattered energy, TOT_BACKSCAT_PSI 0.10000E-04
Exponent for energy spectra, REXPONENT_PSI =-0.22500E+01
Minimal wavenumber of streamfunction forcing, LMINFORC = 1
Maximal wavenumber of streamfunction forcing, LMAXFORC = 29
Minimal wavenumber of streamfunction forcing, KMINFORC = 1
Maximal wavenumber of streamfunction forcing, KMAXFORC = 29
skebs_vertstruc 1
Time step: itime_step= 0
Decorrelation time of noise, ZTAU_PSI = 0.60000E+03
Variance of noise, ZSIGMA2_EPS = Infinity
Autoregressive parameter 1-ALPH_PSI = 1.0E+03

==============================================
These values point to a problem in the SKEBS setup, although it's not immediately obvious from them where things would go wrong. I suppose that at a certain time step an extremely large noise value got drawn randomly, and that probably caused overflows or NaNs to start appearing in a way that filtered into dz8w and then into the MYNN surface-layer scheme.
 