
WRF suddenly stopped reading restart files

Rtsquared

I've been running several simulations with WRF v4.6.0 on the Derecho computer and had no issues until Wednesday. Because Derecho limits jobs to 12 hours and my simulations take 23-26 hours to complete (one domain runs a little faster than the other; I have them both going simultaneously), I've been using restart files and then stitching the output files together. That worked fine, and I completed 5 of my planned 12 simulations, but then WRF suddenly became unable to read the restart files. I changed nothing on my end, and if I rerun a simulation from the beginning it runs fine until it's time to restart, but the restart step now fails and I have no idea why. I have gotten the following errors:

RESTART run: opening wrfrst_d01_2023-07-03_21:00:00 for reading
-------------- FATAL CALLED ---------------
FATAL CALLED FROM FILE: <stdin> LINE: 242
program wrf: error opening wrfrst_d01_2023-07-03_21:00:00 for reading

RESTART run: opening wrfrst_d01_2022-06-07_12:00:00 for reading
Error trying to read metadata
File name that is causing troubles = wrfrst_d01_2022-06-07_12:00:00
You can try 1) ensure that the input file was created with WRF v4 pre-processors, or
2) use force_use_old_data=T in the time_control record of the namelist.input file
-------------- FATAL CALLED ---------------
FATAL CALLED FROM FILE: <stdin> LINE: 342
---- ERROR: The input file appears to be from a pre-v4 version of WRF initialization routines

The second error message is a strange one: it's different from the one I got a couple of days ago (which was the same as the first), and I created those restart files with WRF v4.6.0 an hour or so before I attempted the restart run. I get the same errors when trying to use earlier restart files too (I write them every 3 hours). I assume this is a problem on the Derecho side (maybe a change to the loaded modules?), but I could be wrong.
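
For reference, the restart legs are controlled through the time_control record of namelist.input; a rough sketch of the relevant entries (the values here are illustrative, not my exact settings) is:

 restart          = .true.,
 restart_interval = 180,
 start_year       = 2023,
 start_month      = 07,
 start_day        = 03,
 start_hour       = 21,

(The force_use_old_data = .true. option that the second error suggests would also go in this record, though these restart files were definitely written by v4.6.0.)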
 
Regarding the first error, this can sometimes be due to a disk space issue. Where are you running this on Derecho? If it's in the /glade/u/home space, that may be the problem, since that disk has a much smaller capacity. If that's not the case, can you point me to your Derecho run path? That may also help me analyze the second problem, if need be.
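
If you want a quick check on usage yourself, something like the following gives a rough picture (the path is just an example):

 du -sh /glade/u/home/$USER
 df -h /glade/u/home/$USER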
 
I'm running my simulations in /glade/campaign/univ/uncs0044/roger/wrfv4.6.0/test/phoenix (first error) and atlanta (second error). I don't think it's a disk space issue, and I know the project still has enough allocation to finish my simulations. If disk space does turn out to be the problem, I can remove my old simulations anyway: that particular combination (Noah-MP + YSU + BEP-BEM) has an odd issue where the near-surface temperature in cities increases after dark in the lowest couple of model levels (that doesn't happen with BouLac, which is what I'm currently running, though both have surface winds that are too low).
 
The /glade/campaign space is meant to be used only for storage, I think. It is odd that you were able to run there until a certain point, though. I would first recommend checking with the CISL support group to make sure you're supposed to be running in that directory and to see if that could be causing the issue. Alternatively, you can try moving the entire wrfv4.6.0 directory to your /glade/scratch space and see if you still get the errors there.
 
I moved the wrfv4.6.0 directory to my scratch space (user rwturnau), and I'm getting the same errors. By the way, is there a way to have my jobs run for longer than 12 hours? If I could let them run for ~26 hours, I wouldn't need the restart files.
 
I don't think there is a way to run for longer than 12 hours, but you can ask CISL support to see if they have a method for that. Can you modify the permissions on your wrfv4.6.0 directory in your scratch space? I'm not able to look in that directory and it could be helpful for me to see what's going on in there. Thanks!
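
In the meantime, if it helps, one common way to handle runs longer than a single job allows is to submit the restart legs as separate PBS jobs chained with a dependency, so each leg starts automatically once the previous one finishes successfully. A rough sketch (the job script names are just placeholders):

 JOBID=$(qsub run_leg1.pbs)
 qsub -W depend=afterok:$JOBID run_leg2.pbs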
 
I still get a "permission denied" message, unfortunately. Can you try running the following command:

chmod -R o+rx wrfv4.6.0

You can also try just creating a new directory, putting in it the files that would be useful for me (namelist.input, wrfbdy_d01, the wrfrst* files for the time you're trying to restart from, and maybe the rsl.error.0000 file that shows the error(s) you're getting), and then issuing the above "chmod" command on that directory and each of those files. If I still can't access them, I'll ask you to package the files up and share them with me using the instructions for sharing large files on this forum's home page, but I'll let you know if it comes to that.
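
For the new-directory route, something along these lines would work (the directory name is just a placeholder):

 mkdir share_with_support
 cp namelist.input wrfbdy_d01 wrfrst_d01_2023-07-03_21:00:00 rsl.error.0000 share_with_support/
 chmod o+rx share_with_support
 chmod o+r share_with_support/*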
 
Permissions changed, hopefully it works this time. Were any packages used by WRF changed or updated last week? That could be what's causing the problem.
 
Ok, that worked and I can see everything now. Thank you!

So for the Phoenix test: if you list the wrfrst files with their sizes (ls -ls wrfrst*), you can see that the wrfrst_d0* files from July 1 at 21 UTC onward are all much smaller, meaning something went wrong when they were written during your last wrf.exe run. The reason wrf.exe can't open the file is that it's incomplete. You'll need to go back to the previous run and see if you can determine what went wrong.

Although the Atlanta test gives a different error, it looks like the same thing happened there: the restart files aren't complete for the time you're trying to start wrf.exe from.
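
If you want a quick sanity check on any individual restart file, listing sizes and dumping the netCDF header is usually enough to expose a truncated file; for example:

 ls -ls wrfrst_d01_2023-07-0*
 ncdump -h wrfrst_d01_2023-07-03_21:00:00 | head

A complete file will print its dimensions and variables, while a truncated one typically fails with a read error right away.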
 
That's quite odd. I wonder why the restart files aren't being written properly. I'll try rerunning the simulations with a shorter restart output interval (there's a strange bug that sometimes shows up where the simulation hangs while writing a restart file if the interval is too long, e.g. 12 hours or 1 day instead of 3 hours, so it seems related to the output interval), but that's the only thing I can think of. The only changes I made were lowering the Atlanta run's timestep from 18 to 12 because it was having some stability problems, and for the Phoenix simulation turning off urban physics and removing the urban areas (in the wrfinput file).
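
Concretely, the only change for the rerun is a shorter restart interval in the time_control record, something along the lines of (minutes; the exact value is still to be settled):

 restart_interval = 60,   ! was 180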
 
The only explanation I can offer for the model stumbling when the restart intervals are further apart is that the files are larger in that case, since more data goes into each one. That said, your files aren't tremendously large, so I'm not sure I have a great explanation. Let me know how it goes and whether you're able to get past this issue.
 
It's working for now. The Phoenix simulation should complete, but Atlanta only made it through 5 days (hourly restart files slow things down significantly), so we'll have to see tomorrow whether the files are usable.
 
Atlanta is also finished and I've started the next set, so things seem to be working after fiddling with the restart file output frequency. I'm still baffled about what happened in the first place and why making a minor change like that actually worked to get it going again.
 