WRF stuck when generating the first restart file

chenghao

I am using the latest version of WRF (v4.5.2) for a simulation with 4 nested domains. I found that wrf.exe becomes unresponsive when attempting to generate the first wrfrst file for domain 3 (d03). My configuration is set to produce wrfrst files every 24 hours. While the first restart files for d01 and d02 have been successfully generated, this process stalls when generating the file for d03. The job is not killed - it's simply not progressing.

I found a similar post here: wrfrst file error. I was wondering if NCAR has observed similar issues with WRF v4.5.2.

I have attached my namelist.input file for your reference.

P.S.: I am running another simulation with identical settings but a smaller timestep. So far I haven't seen this issue.

Thanks,

Chenghao
 

Attachments

  • namelist.input
    4.8 KB · Views: 33
When you set a larger restart_interval to prevent the output of wrfrst, for example restart_interval = 144000, can the case run successfully?
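For anyone trying this workaround, here is a rough Python sketch of why such a large value suppresses wrfrst output. It assumes restart_interval is given in minutes (consistent with the "7200 (120 hours)" example elsewhere in this thread) and that a restart is written at each multiple of the interval that falls within the run. The function name is hypothetical and is not part of WRF.

```python
def restart_write_count(run_hours, restart_interval_min):
    """Count restart times that fall within a run of the given length.

    Assumption: restart_interval is in minutes and a restart is written
    at every whole multiple of the interval up to the end of the run.
    """
    run_minutes = run_hours * 60
    return run_minutes // restart_interval_min


# A 60-hour run with daily restarts versus the suggested workaround value.
print(restart_write_count(60, 1440))
print(restart_write_count(60, 144000))
```

With restart_interval = 144000 the first restart time is 100 days into the run, so any shorter simulation never reaches it.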
 
Yes! Using a larger restart interval solved this issue. Do you know what triggered this weird behavior?

Thanks,

Chenghao
 
Chenghao,

This is an issue we are aware of, but we cannot reproduce the behavior at NCAR, which makes it hard for us to fix the problem. The larger restart interval is just a temporary fix.
 
Hello Ming,

I am running WRF-Chem (v4.6.0) on Derecho and saw the same error mentioned in this post. I have three domains, and the model runs successfully for 12 wall-clock hours when the restart interval is set to a very large number. However, when I set the restart interval to 1 hour, wrf.exe freezes while producing the restart file for the third domain.

Previously, increasing the number of CPUs solved this problem for a different model configuration, but this time the problem persists even when I use as many CPUs as possible (i.e., at least 10 grid points on each CPU).
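As a rough illustration of the "at least 10 grid points on each CPU" check, the sketch below computes the smallest patch dimension for a given processor layout. The function names and the grid sizes are made up for illustration; nproc_x and nproc_y here are simply the number of processors along each horizontal direction, and the integer division is an approximation of how a domain is split, not WRF's actual decomposition code.

```python
def min_points_per_patch(e_we, e_sn, nproc_x, nproc_y):
    """Approximate smallest patch dimension when a domain of
    e_we x e_sn points is split over nproc_x x nproc_y processors."""
    return min(e_we // nproc_x, e_sn // nproc_y)


def decomposition_ok(e_we, e_sn, nproc_x, nproc_y, min_points=10):
    """True if every patch keeps at least min_points along each direction."""
    return min_points_per_patch(e_we, e_sn, nproc_x, nproc_y) >= min_points


# Hypothetical 300 x 250 domain: 16 x 16 processors is fine, 32 x 32 is too many.
print(decomposition_ok(300, 250, 16, 16))
print(decomposition_ok(300, 250, 32, 32))
```

In a nested run the check would need to be applied to the smallest domain, since all domains share the same processor layout.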

In case you want to repeat the same behavior at NCAR, my WRF directories are listed below.
My WRF directory is at
/glade/work/qinjian/africa/WRF4/test/em_real

The WRF output directory is at
/glade/derecho/scratch/qinjian/africa/wrf4/chem

Thank you in advance!
Qinjian Jin
 
Qinjian,
I am not a WRF-Chem expert, but the issue you have possibly originates in WRF-ARW. We have seen similar issues before and managed to overcome them by using more processors.

I am able to repeat your case, and I am sorry that I don't have an immediate solution to the problem.

Is it possible for you to reduce the number of grid points in your D03 and retry this case? I expect that with fewer grid points, the case should be able to run.
 
Hello Ming,

Thanks for your response. I am sorry for my late response due to my travel during the past weeks.

I did more tests and found that the issue is more likely to occur when restart_interval is set to a longer time. For example, all restart files were created when it was set to 1 hour or 6 hours, but not when it was set to 12 hours or 24 hours.

Best,
Qinjian
 
Qinjian,

Thank you for the information. Please keep me updated if you find a solution to this issue.
 
Hello,
I wanted to add onto this post as I have been experiencing this problem more often than before. I am using WRF 4.6.0 with the following libraries:
gcc
openmpi
hdf5/1.12.2-mpi
netcdf-c/4.8.1-mpi
netcdf-fortran/4.5.4
jasper/3.0.3

I am simulating a portion of Greece centered on the Peloponnese.

Usually this happens to me once every handful of simulations, but yesterday it occurred consistently for 4 simulations (a, b, c, d) that are all quite similar to one another, with only slight differences (date of simulation or boundary layer scheme). For 3 of the simulations (a, b, c), I first tried decreasing the wrfrst write interval from 12 hours to 6 hours. Neither setting worked, so I set the interval to be larger than the run time, and those runs are currently running.

For the last simulation (d), I tried a 3-hour restart interval and it still didn't work. I then unfortunately ran real.exe instead of wrf.exe before I could save the wrfinput and wrfbdy files separately, so they were overwritten. After real.exe was run again, the simulation was able to run.

I mostly posted this just to keep a record of when these events occur, but I also hope that, since some of my cases failed consistently, they might also freeze on your servers if you are interested in trying them.

Please let me know if you would like me to send you the wrfinput, wrfbdy, and namelist.input files.

Other than the 4 that consistently failed, two other simulations (e and f) failed the first time wrf.exe was launched but were able to run when wrf.exe was launched again with no changes.

Note that the setup used for all of these simulations has been run on several other cases, and WRF completed successfully, so this problem does appear to be random.

Simulation descriptions
All simulations have almost the same namelist input, with slight differences in the boundary layer scheme and, for some of them, the simulation dates.
a) October 20 -> 23 bl: YSU
b) October 24 -> 27 bl: YSU
c) October 17 -> 20 bl: MYNN2.5
d) October 17 -> 20 bl: KEPS
e) October 27 -> 30 bl: KEPS
f) October 19 -> 21 bl: YSU
 
Hi,
I encountered a similar issue while running WRF for 156 hours. I mistakenly set restart = .false. and restart_interval = 7200 (120 hours). As a result, I only have wrfout files for the first 120 hours and wrfrst_d0*_2019-02-06_12:00:00 files.

Although the rsl.error.* and rsl.out.* files have stopped updating, I still see wrf.exe running in the top command. Could someone advise me on what steps to take next? Should I kill the wrf.exe process?
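A hang of this kind (wrf.exe still listed by top while the rsl files stop changing) can be detected from the file modification times. Below is a minimal Python sketch under that assumption. The function names and the 30-minute threshold are arbitrary choices for illustration, not anything provided by WRF, and a long quiet period is only a hint that the run is stuck, not proof.

```python
import glob
import os
import time


def seconds_since_last_rsl_update(run_dir, now=None):
    """Age in seconds of the most recently modified rsl.error.* or
    rsl.out.* file in run_dir, or None if no such files exist."""
    paths = glob.glob(os.path.join(run_dir, "rsl.error.*"))
    paths += glob.glob(os.path.join(run_dir, "rsl.out.*"))
    if not paths:
        return None
    newest = max(os.path.getmtime(p) for p in paths)
    if now is None:
        now = time.time()
    return now - newest


def looks_stalled(run_dir, threshold_s=1800, now=None):
    """True if the rsl files have not been updated for threshold_s seconds."""
    age = seconds_since_last_rsl_update(run_dir, now)
    return age is not None and age > threshold_s
```

Calling looks_stalled(".") from the run directory every few minutes would flag a run whose log files have gone quiet for longer than the threshold.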

Any help would be greatly appreciated. Should I create a new thread for this issue?
The namelist.input file is attached.

Best regards,
 

Attachments

  • namelist.input
    4 KB · Views: 7
I had the same issue using WRF V4.6.0 on a machine with AMD processors. In a 60-hour simulation with 2 domains, the first 24 hours of output were saved for both domains, but nothing was generated for the following hours. The solution we found was to follow Ming Chen's suggestion of using a larger restart_interval = 144000. With this adjustment, the model continued running smoothly after the initial 24 hours and completed the simulation successfully.
 
Using a larger restart_interval only works for short simulations where a restart is not required. For longer simulations that take several wall-clock days, a restart is required because of wall-clock limits, and the issue would show up again. We really need this issue fixed, but I have no idea who can help the WRF users experiencing it.
 
We are aware of the restart issue. One possible fix is to increase the number of processors used to run the case, which gives you more memory and helps with writing the wrfrst files.

Please try and let me know whether it works for you.
 
We are also experiencing this issue. AMD CPUs using AOCC, openMPI and WRF version 3.5.2. WRF just stops running after the wrfrst files are written out. When we increase the restart_interval, the WRF run succeeds.
 
This is what I expected: sometimes WRF cannot output wrfrst and subsequently fails. Sorry that we haven't fixed this issue yet.
 
Hello everyone, I had the same issue using WRF v4.5.2. After I changed io_form_restart from 2 to 102, the problem disappeared, even though the restart files (wrfrst) for my three domains are all smaller than 200 MB.
My questions are:
  1. I ran exactly the same case on two different machines and saw the same issue: wrfrst could not be written, yet wrf.exe kept running.
  2. I once ran the same case (only the simulation period was different; restart_interval was the same) and the restart file was written successfully, although I did not let that run finish.
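For anyone who wants to try the io_form_restart change, here is a small Python sketch that rewrites that entry in the text of a namelist.input. The helper name and the sample namelist are made up for illustration, and the regex only handles simple single-value "name = value," lines. Note that io_form_restart = 102 writes the restart as split files, one per MPI task, so a later restart would normally need the same number of processors.

```python
import re


def set_namelist_value(text, name, value):
    """Return namelist text with the value of a single-valued entry replaced.

    Hypothetical helper: matches lines of the form 'name = value,' only.
    """
    pattern = re.compile(
        r"^([ \t]*" + re.escape(name) + r"[ \t]*=[ \t]*)[^,\n]*",
        re.MULTILINE,
    )
    new_text, count = pattern.subn(lambda m: m.group(1) + str(value), text)
    if count == 0:
        raise KeyError(name + " not found in namelist text")
    return new_text


# Illustrative fragment, not taken from any attachment in this thread.
sample = "&time_control\n restart_interval = 1440,\n io_form_restart = 2,\n/\n"
print(set_namelist_value(sample, "io_form_restart", 102))
```

Reading namelist.input into a string, applying the helper, and writing the result to a copy keeps the original file available for comparison.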
 
Hi Lue,

This is a model behavior we have seen before, although we don't know the reason behind it. The 'wrfrst' failure seems to be randomly distributed across cases. Sometimes we can overcome this issue by increasing the number of processors, although this approach doesn't always work.
 