
WRF Stops Prematurely if adaptive time step is used (RESOLVED)

This post was from a previous version of the WRF&MPAS-A Support Forum. New replies have been disabled and if you have follow up questions related to this post, then please start a new thread from the forum home page.

pvahmani

Hi,

I am using WRF 4.2.1 for a long simulation (40 years), restarting the code every 7 days, and I am 10 years in. The run now stops after 2 or 5 days with no errors and a SUCCESS COMPLETE WRF message, as if it were supposed to stop there. If I turn off the adaptive time step, the issue goes away, but I really need the adaptive time step for efficiency and cost, and each run has to be 7 days.
My relevant namelist options are:

&time_control
run_days = 7,
...
interval_seconds = 10800
input_from_file = .true.,
history_interval = 60,
frames_per_outfile = 300000,
restart = .true.,
restart_interval = 10080,
write_hist_at_0h_rst = .false.,
io_form_history = 2
io_form_restart = 2
io_form_input = 2
io_form_boundary = 2
io_form_auxinput4 = 2
io_form_auxhist1 = 2
auxinput4_inname = "wrflowinp_d<domain>",
auxinput4_interval_m = 360, 360, 360, 360,
iofields_filename = "myoutfields.txt","myoutfields.txt","myoutfields.txt",
auxhist1_outname = "wrfout_d<domain>_<date>_3hourly"
frames_per_auxhist1 = 300000,
auxhist1_interval = 180,
io_form_auxhist1 = 2
adjust_output_times = .true.
override_restart_timers = .true.,
/

&domains
time_step = 60,
time_step_fract_num = 0,
time_step_fract_den = 1,
max_dom = 1,
...
smooth_option = 1,
use_adaptive_time_step = .false.,
step_to_output_time = .true.
target_cfl = 1.0, 1.0, 1.0,
max_step_increase_pct = 51, 51, 51,
max_time_step = 140,
smooth_cg_topo = .true.
/

Would appreciate any help or advice!
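
A premature stop like the one described above can be caught automatically by comparing the timestamp on the SUCCESS line with the end time implied by run_days. The sketch below is a minimal illustration, not part of WRF: the function name finished_early is hypothetical, the SUCCESS line format is taken from the rsl.out excerpt quoted later in this thread, and the sample log line is made up for the example.

```python
import re
from datetime import datetime, timedelta

# Matches lines such as "d01 2020-05-22_23:00:00 wrf: SUCCESS COMPLETE WRF".
SUCCESS_RE = re.compile(
    r"^d\d+\s+(\d{4}-\d{2}-\d{2}_\d{2}:\d{2}:\d{2})\s+wrf: SUCCESS COMPLETE WRF"
)

def finished_early(rsl_lines, start, run_days, tolerance_s=0):
    """Return (is_early, end_time) for the first SUCCESS line in an rsl log.

    start is the start (or restart) time of this run segment and run_days
    mirrors the namelist value. end_time is None if no SUCCESS line is found.
    """
    expected_end = start + timedelta(days=run_days)
    for line in rsl_lines:
        match = SUCCESS_RE.match(line.strip())
        if match:
            end_time = datetime.strptime(match.group(1), "%Y-%m-%d_%H:%M:%S")
            early = (expected_end - end_time).total_seconds() > tolerance_s
            return early, end_time
    return False, None

# Made-up log tail: a 7-day segment that reports success after only 2 days.
sample = ["d01 1991-10-03_00:00:00 wrf: SUCCESS COMPLETE WRF"]
early, end_time = finished_early(sample, datetime(1991, 10, 1), run_days=7)
print(early, end_time)
```

In a restart workflow, a check like this could run before the next segment is submitted, so that a false success does not silently shift every later restart.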
 
I am also experiencing this same issue, but with version 3.9. I compared my version of adapt_timestep_em.F with the latest and it was very similar. I also tried the bug fix of adding " .AND. (time_to_bc .GT. 0.0)" to line ~309, mentioned in the other forum post, but it seemed to make no difference.

I am attempting to run a 6 hour forecast, but WRF "completes" after reaching the first BC time (+1 hour). The rsl.out ends with:

Code:
Timing for main (dt= 11.16): time 2020-05-22_22:59:39 on domain   1:    0.51416 elapsed seconds
Timing for main (dt= 11.16): time 2020-05-22_22:59:50 on domain   1:    0.57459 elapsed seconds
Timing for main (dt= 11.16): time 2020-05-22_23:00:01 on domain   1:    0.51417 elapsed seconds
Timing for Writing wrfwof_d01_2020-05-22_23_00_00 for domain        1:    1.82607 elapsed seconds
Timing for processing lateral boundary for domain        1:    0.50809 elapsed seconds
Timing for main (dt=  6.60): time 2020-05-22_23:00:08 on domain   1:    2.83671 elapsed seconds
Timing for main (dt= -8.40): time 2020-05-22_23:00:00 on domain   1:    0.58087 elapsed seconds
d01 2020-05-22_23:00:00 wrf: SUCCESS COMPLETE WRF

The run starts at 22Z, but once it reaches 23Z, it just stops. What's very strange is the negative dt, despite applying the fix from another thread. Does your rsl.out contain a similar message?
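
For anyone checking their own logs for the same symptom, a short script can pull out every "Timing for main" line with a negative dt. This is only a sketch: the regular expression is inferred from the rsl.out lines quoted above, the function name find_negative_steps is hypothetical, and other WRF versions may format these lines differently.

```python
import re

# Pattern inferred from lines such as:
# "Timing for main (dt= -8.40): time 2020-05-22_23:00:00 on domain   1: ..."
TIMING_RE = re.compile(
    r"Timing for main \(dt=\s*(-?\d+(?:\.\d+)?)\): time (\S+) on domain\s+(\d+)"
)

def find_negative_steps(lines):
    """Return (dt, model_time, domain) for every main step with dt < 0."""
    hits = []
    for line in lines:
        match = TIMING_RE.search(line)
        if match and float(match.group(1)) < 0:
            hits.append((float(match.group(1)), match.group(2), int(match.group(3))))
    return hits

# Three lines copied from the rsl.out excerpt above.
log = [
    "Timing for main (dt= 11.16): time 2020-05-22_23:00:01 on domain   1:    0.51417 elapsed seconds",
    "Timing for main (dt=  6.60): time 2020-05-22_23:00:08 on domain   1:    2.83671 elapsed seconds",
    "Timing for main (dt= -8.40): time 2020-05-22_23:00:00 on domain   1:    0.58087 elapsed seconds",
]
print(find_negative_steps(log))
```

To scan a real file, the list can be replaced with the lines of rsl.out.0000 or rsl.error.0000 read from disk.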
 
Hi @joshuam and @pvahmani,
The code for adaptive time-step has not been modified in some time now, so it should be the same for both versions you are using. Please look in the file WRF/dyn_em/adapt_timestep_em.F and look for line 299:
Code:
elseif (tmpTimeInterval .LE. dtInterval) then
Change that to
Code:
elseif ((tmpTimeInterval .LE. dtInterval) .AND. (time_to_bc .GT. 0.0)) then
Then save the file and you'll need to recompile the code. You do NOT need to do a 'clean -a' or reconfigure. You can simply recompile to incorporate the change. After that, try to run again and please let me know if that makes any difference. Thanks!
 
Hi @kwerner,

Thanks for your response! I have already tried that solution, from an older post, without success: the code still stops with the SUCCESS message before the expected end time is reached. I should mention that I am trying this on an ongoing simulation, meaning that I did not start the simulation over with the updated adapt_timestep_em.F; I restarted the run with the new adapt_timestep_em.F. I am not sure whether that matters.
What else can I try to get the adaptive time step working again? It was working fine for a while before it started with this strange behavior.

Thanks
Pouya
 
Hi @kwerner,

Thank you for your response. Yes, like @pvahmani, I also added that line of code from a previous post but am getting the same response.

I placed some write statements before (adapt_timestepA) and after (adapt_timestepB) that whole if/elseif/else block, and here is the result:

Code:
Timing for main (dt= 11.50): time 2020-05-22_22:59:49 on domain   1:   10.33783 elapsed seconds

  adapt_timestepA(dtInterval,curr_secs,time_to_bc,num,den,tmpTimeInterval,grid%stepping_to_time) =                    11                   61                  100           0 ,   3589.45996     ,   25.5412598     ,        2554 ,         100 ,                   25                   27                   50           0 , F
  
  adapt_timestepB(curr_secs,time_to_bc,num,den,tmpTimeInterval,grid%stepping_to_time) =    3589.45996     ,   25.5412598     ,        2554 ,         100 ,                   25                   27                   50           0 , F

Timing for main (dt= 11.61): time 2020-05-22_23:00:01 on domain   1:    9.01582 elapsed seconds

  adapt_timestepA(dtInterval,curr_secs,time_to_bc,num,den,tmpTimeInterval,grid%stepping_to_time) =                    11                   73                  100           0 ,   3601.07007     ,   13.9311523     ,        1393 ,         100 ,                   13                   93                  100           0 , F

  adapt_timestepB(curr_secs,time_to_bc,num,den,tmpTimeInterval,grid%stepping_to_time) =    3601.07007     ,   13.9311523     ,        1393 ,         100 ,                   13                   93                  100           0 , T

Timing for main (dt=  6.97): time 2020-05-22_23:00:08 on domain   1:    7.74518 elapsed seconds

  adapt_timestepA(dtInterval,curr_secs,time_to_bc,num,den,tmpTimeInterval,grid%stepping_to_time) =                    11                   73                  100           0 ,   3608.03491     ,   6.96606445     ,         697 ,         100 ,                    6                   97                  100           0 , F

  adapt_timestepB(curr_secs,time_to_bc,num,den,tmpTimeInterval,grid%stepping_to_time) =    3608.03491     ,   6.96606445     ,         697 ,         100 ,                 3600                    0                    0           0 , T

Timing for main (dt= -8.03): time 2020-05-22_23:00:00 on domain   1:    7.62586 elapsed seconds

d01 2020-05-22_23:00:00 wrf: SUCCESS COMPLETE WRF

I am still working to understand the code, but what I find interesting is that grid%stepping_to_time=T during two cycles. Also, the model does not land directly at the top of the hour but one second over (2020-05-22_23:00:01), and I wonder whether that has something to do with the issue. I should also note that this was compiled with GNU. @pvahmani, did you compile with Intel or GNU?

Any help would be greatly appreciated! Thanks!

Joshua
 
Thanks for the update. This has been a known issue for a while and we have yet to solve it - mostly because we aren't able to recreate the problem with any cases we've tested and when we've asked for user files in the past (several years ago), they would never respond. Can each of you attach the following so that I can try to test with your input, and possibly find a link between your cases and what may cause this?

1) Your full namelist.input file
2) wrfinput* and wrfbdy_d01 files, and anything else I'll need to run the test (e.g., a wrfrst* file)? These may be too large to attach. If so, take a look at the home page of this forum for instructions on sending large files.
3) Your full rsl.error.0000 file, showing the false success.

Thanks!
 
Hello,

Thanks for looking into this! Attached is my RSL error and namelist files. Also, the lines similar to "Timing for Writing wrfwof_d01_2020-05-22_23_00_00 for domain" are misleading; the filename is hardcoded to end with _00. If I had used the standard <date> template, it would match the seconds in the actual timestep (which sometimes could be 00, but often over or under by a few).

Here are the download links to the WRF input/bdy files:

https://storwrfmodelprod001.blob.core.windows.net/wrf/output/ModelRun20210408-100104/202005222200/wrf/wrfbdy_d01.1?sv=2020-04-08&st=2021-04-12T18%3A13%3A53Z&se=2021-05-13T18%3A13%3A00Z&sr=b&sp=r&sig=JIZsVmAjqvNTFeZhvi3v3uETfVylvVSzDt5U5FHmkKQ%3D

https://storwrfmodelprod001.blob.core.windows.net/wrf/output/ModelRun20210408-100104/202005222200/wrf/wrfinput_d01.1?sv=2020-04-08&st=2021-04-12T18%3A14%3A32Z&se=2021-05-13T18%3A14%3A00Z&sr=b&sp=r&sig=rJwggr%2BD%2FYHlpXDXrT%2BMQuhZFS41e5V4gTXoEGgrGkg%3D

Thanks again!
Joshua

Attachments: files.zip (7.9 KB)
 
I seem to have a temporary fix for my particular issue, but unfortunately I still do not have a clear understanding as to why the issue occurred in the first place. It's more of a bandaid fix. The changes I made were:

1. In adapt_timestep_em, around line 140, after "curr_secs=0", I added:
Code:
grid%dtbc = 0

2. In mediation_integrate, around line 1986, I replaced this line
Code:
IF ( (currentTime .EQ. grid%this_bdy_time)) grid%dtbc = 0.
With this
Code:
IF ( (currentTime .EQ. grid%this_bdy_time) .and. grid%itimestep .GT. 1) grid%dtbc = 0.

This seems to get it working for me so far (the run has not finished yet, but it is well beyond the last point of failure). Now the question is: why did it fail? Below is a sample of my RSL error with additional information that I had included from adapt_timestep (before I made changes):

Code:
Timing for processing wrfinput file (stream 0) for domain        1:    3.85851 elapsed seconds
Max map factor in domain 1 =  1.00. Scale the dt in the model accordingly.
  domain_get_advanceCount =            0
  grid%interval_seconds =         3600
  grid%dtbc =    885.000000    
  adapt_timestepA(dtInterval,curr_secs,time_to_bc,num,den,tmpTimeInterval,grid%stepping_to_time) =                    15                    0                  100           0 ,   0.00000000     ,   3600.00000     ,      360000 ,         100 ,                 3600                    0                  100           0 , F
  adapt_timestep(stepping_to_bc) =  F
  adapt_timestepB(curr_secs,time_to_bc,num,den,tmpTimeInterval,grid%stepping_to_time) =    0.00000000     ,   3600.00000     ,      360000 ,         100 ,                 3600                    0                  100           0 , F
  adapt_timestep(history_interval_sec) =        21600
  adapt_timestep(time_to_output) =    21600.0000    
  set grid%dt =    15.0000000    
INPUT LandUse = "MODIFIED_IGBP_MODIS_NOAH"
 LANDUSE TYPE = "MODIFIED_IGBP_MODIS_NOAH" FOUND          33  CATEGORIES           2  SEASONS WATER CATEGORY =           17  SNOW CATEGORY =           15
Climatological albedo is used instead of table values
INITIALIZE THREE LSM RELATED TABLES
 INPUT VEGPARM FOR MODI-RUC
 VEGPARM FOR USGS     FOUND          27  CATEGORIES
 Skipping USGS     table
 VEGPARM FOR MODIFIED FOUND          20  CATEGORIES
 Skipping MODIFIED table
 VEGPARM FOR NLCD40   FOUND          40  CATEGORIES
 Skipping NLCD40   table
 VEGPARM FOR USGS-RUC FOUND          28  CATEGORIES
 Skipping USGS-RUC table
 VEGPARM FOR MODI-RUC FOUND          21  CATEGORIES
 Found MODI-RUC table
 Reading MODI-RUC table
 INPUT SOIL TEXTURE CLASSIFICATION = STAS-RUC
 SOIL TEXTURE CLASSIFICATION = STAS-RUC FOUND          19  CATEGORIES
 icefallfac =    1.50000000    
 snowfallfac =    1.25000000    
 icefallopt =            3
 ehw0,ehlw0 =   0.899999976      0.899999976    
 Tile Strategy is not specified. Assuming 1D-Y
WRF TILE   1 IS      1 IE    150 JS      1 JE    100
WRF NUMBER OF TILES =   1
open_hist_w : error opening wrfout_d01_2020-05-22_22_00_00 for writing.      100
 mediation_integrate.G        1728 DATASET=HISTORY
 mediation_integrate.G        1729  grid%id            1  grid%oid            1
Timing for Writing wrfout_d01_2020-05-22_22_00_00 for domain        1:    0.05035 elapsed seconds
open_hist_w : error opening wrfwof_d01_2020-05-22_22_00_00 for writing.      100
Timing for Writing wrfwof_d01_2020-05-22_22_00_00 for domain        1:    0.05416 elapsed seconds
 THIS TIME 2020-05-22_15:00:00, NEXT TIME 2020-05-22_16:00:00
 THIS TIME 2020-05-22_16:00:00, NEXT TIME 2020-05-22_17:00:00
 THIS TIME 2020-05-22_17:00:00, NEXT TIME 2020-05-22_18:00:00
 THIS TIME 2020-05-22_18:00:00, NEXT TIME 2020-05-22_19:00:00
 THIS TIME 2020-05-22_19:00:00, NEXT TIME 2020-05-22_20:00:00
 THIS TIME 2020-05-22_20:00:00, NEXT TIME 2020-05-22_21:00:00
 THIS TIME 2020-05-22_21:00:00, NEXT TIME 2020-05-22_22:00:00
Timing for processing lateral boundary for domain        1:    0.90070 elapsed seconds
d01 2020-05-22_22:00:00  ----------------------------------------
d01 2020-05-22_22:00:00  WW_SPLIT:  ZADVECT_IMPLICT =            0
d01 2020-05-22_22:00:00  WW_SPLIT:  dt =    15.0000000
d01 2020-05-22_22:00:00  WW_SPLIT:  alpha_max/min =    1.00000000      0.800000012
d01 2020-05-22_22:00:00  ----------------------------------------
d01 2020-05-22_22:00:00  W_DAMP CRITICAL COURANT NUMBER =    2.00000000
d01 2020-05-22_22:00:00  ----------------------------------------
d01 2020-05-22_22:00:00  ----------------------------------------
d01 2020-05-22_22:00:00  SOLVE_EM:  RK_ORDER =            3
d01 2020-05-22_22:00:00  TOTAL NUMBER OF SMALL TIME STEPS =            7
d01 2020-05-22_22:00:00  ----------------------------------------
Timing for main (dt= 15.00): time 2020-05-22_22:00:15 on domain   1:   29.21122 elapsed seconds
  domain_get_advanceCount =            1
  grid%interval_seconds =         3600
  grid%dtbc =    0.00000000    
  adapt_timestepA(dtInterval,curr_secs,time_to_bc,num,den,tmpTimeInterval,grid%stepping_to_time) =                    10                    0                    1           0 ,   15.0000000     ,   3585.00000     ,      358500 ,         100 ,                 3585                    0                  100           0 , F
  adapt_timestep(stepping_to_bc) =  F
  adapt_timestepB(curr_secs,time_to_bc,num,den,tmpTimeInterval,grid%stepping_to_time) =    15.0000000     ,   3585.00000     ,      358500 ,         100 ,                 3585                    0                  100           0 , F
  adapt_timestep(history_interval_sec) =        21600
  adapt_timestep(time_to_output) =    21585.0000    
  set grid%dt =    10.0000000    
Timing for main (dt= 10.00): time 2020-05-22_22:00:25 on domain   1:    6.91731 elapsed seconds
  domain_get_advanceCount =            2
  grid%interval_seconds =         3600
  grid%dtbc =    10.0000000    
  adapt_timestepA(dtInterval,curr_secs,time_to_bc,num,den,tmpTimeInterval,grid%stepping_to_time) =                    10                    1                    2           0 ,   25.0000000     ,   3575.00000     ,      357500 ,         100 ,                 3575                    0                  100           0 , F
  adapt_timestep(stepping_to_bc) =  F
  adapt_timestepB(curr_secs,time_to_bc,num,den,tmpTimeInterval,grid%stepping_to_time) =    25.0000000     ,   3575.00000     ,      357500 ,         100 ,                 3575                    0                  100           0 , F
  adapt_timestep(history_interval_sec) =        21600
  adapt_timestep(time_to_output) =    21575.0000    
  set grid%dt =    10.5000000


The issue stems from grid%dtbc==0 when itimestep==1. As can be seen from the logs, the initial call to adapt_timestep happens before WRF processes the lateral boundary for the domain, as well as other things. At this time, for some reason, grid%dtbc==885. I'm not sure where this number came from. Then, when itimestep==1, grid%dtbc==0. And from then on, it increments correctly -- but unfortunately, it's behind by 1 dt. Thus, when the first BC time comes around, it oversteps by 1 dt, which then produces a -dt. So to me, it seems like the issue has to do with adapt_timestep being called too soon, prior to other initialization codes. But I am still trying to learn the codebase.

Thanks for your help!
Joshua
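
The "behind by 1 dt" explanation above can be pictured with a toy model. The sketch below is not WRF code and does not reproduce adapt_timestep_em.F: the function simulate and its reset_after_first_step flag are hypothetical, the numbers are round values chosen for clarity, and the half-step splitting visible in the logs is ignored. It only shows that if the counter for elapsed time since the last boundary update forgets the first step, the clock passes the boundary time before the "step to boundary" branch fires, and the step that lands on the boundary is negative.

```python
def simulate(interval_s, first_dt, nominal_dt, reset_after_first_step):
    """Advance a toy clock to the first boundary time; return the dt of each step.

    t is the true model clock. dtbc is the bookkeeping counter for time elapsed
    since the last boundary update. When reset_after_first_step is True, dtbc
    is zeroed after the first step, so it lags t by first_dt from then on.
    """
    t = 0
    dtbc = 0
    steps = []
    dt = first_dt
    while True:
        t += dt
        dtbc += dt
        steps.append(dt)
        if reset_after_first_step and len(steps) == 1:
            dtbc = 0  # the counter forgets the first step
        if t == interval_s:
            break  # landed exactly on the boundary time
        time_to_bc = interval_s - dtbc  # what the bookkeeping believes is left
        if time_to_bc <= nominal_dt:
            dt = interval_s - t  # step to the boundary on the true clock
        else:
            dt = nominal_dt
    return steps

healthy = simulate(3600, first_dt=15, nominal_dt=10, reset_after_first_step=False)
lagging = simulate(3600, first_dt=15, nominal_dt=10, reset_after_first_step=True)
print("healthy last step:", healthy[-1])
print("lagging last step:", lagging[-1])
```

With consistent bookkeeping, the last step is a short positive one. With the lag, the clock overshoots the boundary and the final step is negative, which matches the pattern in the logs where a negative dt is immediately followed by SUCCESS COMPLETE WRF.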
 
Hi @kwerner,

Thank you for your response!
I have uploaded relevant files to Nextcloud. Look for WRF_adaptive_timestep_debug.zip.

Thanks
Pouya
 
Thanks to both of you for sharing your files. I'll test some things out and let you know if I can figure out a solution. In the meantime, I'm very happy to hear that you found a work-around, Joshua.

Pouya, have you been able to test out Joshua's fix to see if it would work for you too?
 
Hi @kwerner,

I did try Joshua's solution, but it didn't change anything. Were you able to replicate the issue with the files I sent previously? I would really appreciate any help with this problem. Our Berkeley Lab team is conducting CONUS-scale simulations for a multi-institutional (PNNL, ORNL, LBNL, etc.) project with a tight deadline. We planned the release of the data to the other teams based on the initial performance we got from WRF. Now that the adaptive time step has stopped working, we are facing serious delays and disruptions to the project, and we would greatly appreciate any help we can get from UCAR.

Thanks
Pouya
 
Hi Pouya,
I tried to download the files, but your wrfbdy file is very large and the download keeps stopping before it completes. I assume you created that file for either the entire span of simulations or a large chunk of it? If so, could you create a smaller one that covers just the dates I'll need to run the 7-day simulation for the namelist you sent (1991-10-01_00 to 1991-10-08_00), and then upload that one? It would be a lot smaller and easier to download. Thanks!
 
The wrfbdy has data for 6 months. The input files are created by another team for these simulations, and I don't really know how to extract a subset of the file. If you can tell me how to modify the existing wrfbdy to make it smaller, I can do it. Or is there any other way I can send you the large file? Google Drive?

Thanks
Pouya
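
On the question of trimming a wrfbdy file, the thread does not give a procedure, so what follows is only a hedged sketch of one possible approach. It assumes the boundary records sit at a fixed interval along the file's Time dimension, that the NCO tool ncks is available, and that the 6-month file starts on 1991-07-01 (a made-up date; the real start time is not stated here). The helper bdy_record_range is hypothetical. It computes which zero-based Time indices cover the 7-day window with interval_seconds = 10800 and prints a hyperslab command.

```python
from datetime import datetime

def bdy_record_range(file_start, run_start, run_end, interval_seconds):
    """Return (first, last) zero-based Time indices covering [run_start, run_end)."""
    offset = (run_start - file_start).total_seconds()
    span = (run_end - run_start).total_seconds()
    if offset < 0 or offset % interval_seconds or span <= 0 or span % interval_seconds:
        raise ValueError("times must align with the boundary interval")
    first = int(offset // interval_seconds)
    count = int(span // interval_seconds)
    return first, first + count - 1

first, last = bdy_record_range(
    file_start=datetime(1991, 7, 1),   # hypothetical first time in the 6-month file
    run_start=datetime(1991, 10, 1),   # window from the namelist mentioned above
    run_end=datetime(1991, 10, 8),
    interval_seconds=10800,            # 3-hourly boundaries, as in the namelist
)
# ncks -d selects an index range along a dimension; file names are placeholders.
print(f"ncks -d Time,{first},{last} wrfbdy_d01 wrfbdy_d01_subset")
```

Before relying on a subset made this way, it would be worth confirming the first and last boundary times in the new file (for example with ncdump) against the run window.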
 
Yes, you can try to send through Google Drive, or any other link you may have available to send it. You could also just try to package the one wrfbdy file up for Nextcloud. That way it wouldn't also have the wrfrst* file in there. If you want to share a link for Google Drive here, you can. Otherwise, you can send it to wrfhelp at ucar dot edu (I'll delete that from this public forum after I get it - we could also remove your link to the file on Google Drive if you send it that way).
 
Hi, please find the files at this link:

https://drive.google.com/drive/folders/13UVOmcQx-zHte9QrcQvSIUIdEF3f-c1q?usp=sharing

Thanks
Pouya
 
Pouya,
Thank you for sending those. First, I'd like you to try the top of the repository WRF code to see if it's any better. There was a fix that was put in the code in Feb (after our last release). If you have access to git, can you issue the following:
Code:
git clone https://github.com/wrf-model/WRF.git
Then go into that new WRF directory and issue
Code:
git checkout release-v4.3
This release branch has not been released yet, but it includes the fix I mentioned. You'll then need to compile the code and try running your case again. Let me know if that works for you.
 
Hi,

I tried using the new release, but I get errors on VEGPARM.TBL, I think due to the new implementation of local climate zones. Basically, I cannot use the new code with my old input files unless you tell me how. Are there specific files from the new release that contain the 'fix' that I can put in the old (V4.2.1) code?

Thanks
Pouya
 
Pouya,
It's likely that you just need to link a particular table into your running directory. If you are running in the test/em_real directory, try issuing
Code:
ln -sf ../../run/*.TBL .
and just link-in all those tables. If you're still getting an error, please send me the error message so that I can diagnose it more precisely. Thanks!
 
Thanks!
I run wrf from test/em_real directory. I do have VEGPARM.TBL and other TBLs.
This is the error:

Input data is acceptable to use: wrfrst_d01_1991-10-01_00:00:00
Timing for processing restart file for domain 1: 11.61727 elapsed seconds
Max map factor in domain 1 = 1.05. Scale the dt in the model accordingly.
Using NLCD40 for Noah, redefine urban categories
INITIALIZE THREE Noah LSM RELATED TABLES
Skipping over LUTYPE = USGS
Skipping over LUTYPE = MODIFIED_IGBP_MODIS_NOAH
LANDUSE TYPE = NLCD40 FOUND 40 CATEGORIES
forrtl: severe (59): list-directed I/O syntax error, unit 19, file /global/cscratch1/sd/pvahmani/PROJECT_2021_IM3_Phase_2/STUDY_2020_Climate_Scenario_IM3/test_NERSC2/VEGPARM.TBL
Image PC Routine Line Source
wrf.exe 00000000232AFB6B for__io_return Unknown Unknown
wrf.exe 00000000232E659B for_read_seq_lis_ Unknown Unknown
wrf.exe 00000000232E4E05 for_read_seq_lis Unknown Unknown
wrf.exe 0000000022E32CEC Unknown Unknown Unknown
wrf.exe 0000000022E3C6B2 Unknown Unknown Unknown
 
Hi Pouya,
I think the reason you were getting that error is due to the fact that several things were changed with the VEGPARM.TBL, related to some urban PRs and since you were using older input (wrfrst*), it was complaining. We won't worry about that, then.

The good news is that I was able to use all of your files (thanks again for sending them) and recreate the problem. I then went back to the particular PR that had the adaptive time step fix and ran the case again and the modifications seem to fix your problem! In case you're interested, here is the code commit with those changes and information. Can you try this version? You'll need to issue the following commands:
Code:
git clone https://github.com/wrf-model/WRF.git
cd into the new cloned WRF code, and then
Code:
git checkout b71e0a837626d3809ec6
Git will then tell you that you're in a 'detached HEAD' state, so you'll need to create a named branch. You can just say
Code:
git checkout -b b71e0a837626d3809ec6
Then recompile the code and try that run again to see if it works. Please let me know. Thanks!
 