
wrfrst file error

@michal_galkowski
I ran your case on the NCAR HPC, using your wrfinput and wrfbdy files. I modified the namelist.input to make it work with WRF-ARW; please see the attached namelist.input I used. I ran this case with standard WRFv4.5 on 4 nodes (144 processors), and it completed successfully.
It is annoying that I cannot repeat the error you (and others) reported regarding the wrfrst issue. I am really perplexed by it and will continue to pay attention to it. We will keep our users updated once we know for sure what is going on...
 

Attachments

  • namelist.input.txt (5 KB)
@Ming Chen

Thanks for looking into this. It is frustrating indeed.

Over the past two weeks I've been experimenting with different settings but wasn't able to solve it. Just for posterity, I list my findings:

- I also ran the case without chemistry (chem_opt = 0) and it completed successfully. I tried it only once, so I cannot confirm with 100% certainty that it wasn't just luck (see below).
- This problem appears to occur at random, with roughly 50% probability (I didn't calculate the exact rate). Resubmitting the same job may succeed, but other times it hangs in the same way.
- Increasing the number of processors does not seem to improve the situation; the problem still occurs.
- Neither does decreasing the number of processors per node (thus increasing memory per process); I tested a reduction of about 10% while keeping the total number of processors similar.

As the problem is random, it is possible for me to complete my simulations (wasting some computing time in the process), but that doesn't solve the underlying issue. I need to stop diagnosing now, but if I pick it up in the future I'll be looking into:

- submitting to nodes with larger available memory, although I'm nowhere close to the memory limit on the ones I use. Still, I cannot exclude that the successful jobs happen to be assigned to such nodes and that this is what makes them run.
- building against a different netCDF version (currently running with 4.5.0).
- saving tiled wrfrst files (i.e. in chunks) and concatenating them later.

If somebody manages to solve this earlier, I'd appreciate a message posted here.
 
Thank you for the update. I do understand the frustration you have experienced.
I am really sorry to say that we have made no progress on this issue because we cannot reproduce the error on the NCAR HPC. I am not sure whether this is machine-related; we did receive a few reports, but we could not repeat any of the reported cases.
I will post an update here once we know for sure what is going on.
 
Is there any debug mode we can run WRF in to supply you with extra information for when it happens again?
It has just happened to me twice in a row, although the run immediately before that worked, and I changed nothing between runs.
I'm now in the process of submitting the same job for a third time... and it works. It just hangs on that third restart file completely at random.
 
Would you please recompile WRF in debug mode, i.e., ./clean -a, ./configure -D and ./compile em_real, then rerun the case with the restart issue? If the case just hangs until the requested run time is reached, hopefully the log files can tell us when and where it stops. This may give us hints about what is going on.

I apologize for the annoying issue and thank you for helping us to debug the problem.
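
For reference, the rebuild sequence described above, run from the top-level WRF directory (choose the same compiler and parallel option as in your original build; the log-file name is just an example):

    ./clean -a
    ./configure -D                        # -D builds with debugging flags and no optimization
    ./compile em_real >& compile.log
    # rerun the failing case, then check the rsl.out.* / rsl.error.* files once it hangs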
 
Hi all,

The same thing happened to me: the wrfrst files were never created. I tried different intervals for the wrfrst files, but none were generated. Finally, I added "write_hist_at_0h_rst = .true.," in the &domains section, and the model worked.
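
A minimal sketch of that namelist change (the placement shown here is my assumption: as far as I know the WRF Registry lists this flag under &time_control, so it may need to go there rather than in &domains):

    &time_control
     ...
     write_hist_at_0h_rst = .true.,
    /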
 
Sorry for the late answer; somehow this post was skipped. I just want to report the result of my test: the case completed successfully and I cannot reproduce the restart issue. A few users reported the same problem, but we have no clue what the reason behind it is. Without being able to reproduce the error, it is hard to debug the problem. Sorry for not being able to help.
 
I have found a solution that seems to work, in the sense that I haven't had the model freeze while writing restart files since:

change io_form_restart from 2 to 102.

This change splits the restart output into one file per processor (per output time). It does mean you can end up with hundreds of thousands of restart files if you have many domains, long runs and lots of processors... but it does work, at least.
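
As a minimal sketch of this change in namelist.input (the restart_interval value is just a placeholder):

    &time_control
     ...
     restart_interval = 1440,
     io_form_restart  = 102,   ! 2 = single netCDF restart file, 102 = one split file per MPI task
    /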
 
It worked! Thanks for the suggestion.
I was facing the same problem on an HPC different from UCAR’s.
 
To add more about this behavior: it does not seem to happen on UCAR’s HPC.
I feel it has to do with the HPC setup somehow. Memory management?! Just guessing.
It happened on the last domain on day one or two: with 2 domains the hang occurred on the d02 restart file, and with 3 domains it happened at the same time on d03.
 