wrfrst file error

haohe1982

I am running the latest WRF v4.4.1 and find that wrf.exe sometimes gets stuck while generating the first day's wrfrst file (the model does not stop or get killed; it just hangs there).

This error happens somewhat randomly (when I kill the job and resubmit it, sometimes wrf works fine). I turned on debug mode, but the log file has no useful information related to this error. Since the supercomputer I use is new, my hypothesis is that it could be related to the hardware: for instance, the job works when assigned to some compute nodes but not others. I am wondering if anyone has met a similar problem, and how I should fix it. Is it related to memory settings?

Sincerely appreciate your help.

Hao
 
Hao,
We are not aware of this issue and nobody has reported similar problems. Please talk to your system administrators and see whether they have any clue.
Please keep us updated if you figure out what is wrong.
 
Having the exact same problem right now with this model version. It randomly hangs and does nothing while writing the very first restart files. If it gets past the first one, there is never a problem.
Has there been an update with this issue?
 
Does this happen only if the case is big (large grid numbers), or does it also happen in small cases?
Also, have you tried other versions like WRFv4.4 and WRFv4.4.2? Thanks.
 
Haven't gotten around to trying the other model versions, but the problem doesn't happen every time, more like one in five or so runs. It's how random it is that makes it strange. Resubmit the exact same script and it works.

The innermost domain is relatively small (D3, nx*ny*nz = 251*251*70), but the outer domains aren't much bigger in horizontal extent.
 
It seems that if you run the case with more processors, this issue goes away and wrfrst can be generated successfully.
Please let me know whether this approach works for you. Thanks.
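As a rough illustration only (node counts, core counts, and scheduler details are placeholders that depend on your own system), requesting more MPI ranks under Slurm could look something like this:

Code:
#!/bin/bash
#SBATCH --job-name=wrf_run
#SBATCH --nodes=2                  # e.g. go from 1 node to 2 to double the rank count
#SBATCH --ntasks-per-node=128
#SBATCH --time=12:00:00

# launch wrf.exe across all allocated MPI tasks
srun ./wrf.exe
# or: mpirun -np $SLURM_NTASKS ./wrf.exe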
 
Hi, Thanks for investigating!
Unfortunately, with my setup, adding more CPUs would require another node, and then my node-hour request doubles. In the long term it would be very costly...

I get the same result (it works again) when I rerun real.exe and submit everything again. But then unfortunately I must check every job manually to make sure it isn't stuck.

Is the issue linked to not having enough processors? I've run into the problem with both 128 and 256 processors.
 
I did some further tests. I compiled an older version, WRF v4.2.2, on our supercomputer with the same compiler and libraries, the same input files, and the same namelists for both WPS and WRF. The v4.2.2 model runs well, but for v4.4.2 the restart problem is still there. I believe there is a bug in the latest WRF. Here are the library and compiler versions I used; hope they help.

gfortran and gcc 9.4.0,
openmpi 4.1.1,
netcdf-c 4.8.1,
netcdf-f 4.5.3,
hdf5 1.10.7
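
For reference, a minimal sketch of how a GNU build environment like the one above is typically set up before configuring WRF (the install paths are placeholders, not my actual ones):

Code:
export CC=gcc
export FC=gfortran
export NETCDF=/path/to/netcdf      # netcdf-c 4.8.1 and netcdf-fortran 4.5.3 under one prefix
export HDF5=/path/to/hdf5          # hdf5 1.10.7
export PATH=$NETCDF/bin:$PATH

cd WRF
./configure                        # select the gfortran/gcc dmpar option when prompted
./compile em_real >& compile.log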
 
Hi,
Would you please upload your namelist.input for me to take a look?
Also, how did you compile WRF and run it?

I was able to reproduce your issue when I ran a big case using a small number of processors. But the same case runs fine when I triple the number of processors. I also tested another case and got the same result, i.e., with a larger number of processors wrfrst can be written successfully.

By the way, I ran the standard WRFv4.5, compiled in dmpar mode using the Intel compiler.
 
Attached please find the namelist. I have tried reducing max_dom to 2, but it still has problems generating the wrfrst for D02 (the D01 wrfrst was successfully generated).
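
For clarity, the test only touches these namelist.input entries (illustrative values only, not the full attached namelist):

Code:
&time_control
 restart          = .false.,
 restart_interval = 1440,     ! write a wrfrst file every 24 h of model time
/

&domains
 max_dom = 2,                 ! reduced from 3 for this test
/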

By the way, I believe the model worked a couple of months ago but started having this problem recently. Our supercomputer is continuously changing with software updates. Is it possible these patches caused the problem?
 

Attachments

  • namelist.input (3.9 KB)
Your case is not a big one, and I am perplexed why wrfrst cannot be generated ....
This seems to be random model behavior, probably due to some issue in the model itself.
I will try a few more tests and hopefully I can reproduce this issue.
 
Some updates: our supercomputer had a system maintenance this Monday, and now WRF v4.4.2 can generate wrfrst. I believe there is something unstable in WRF, so it is hard to replicate. Hope you can figure out what the bug is. Good luck!
 
I tried the latest WRF v4.5, but this error still exists. I ran 4 cases: two of them ran well, but the other two hit this missing-wrfrst error. Very annoying, and it seems to be a random error. Tried adjusting the number of CPUs; it still did not help.
 
Can you run the two cases that failed to produce wrfrst files using WRFv4.4.2? Please let me know whether WRFv4.4.2 has the same issue.
The trouble for me is that I cannot reproduce the model behavior, which makes it hard to debug the issue. I apologize for the problem.
 
Hi, just wanted to report that I'm facing exactly the same issue running WRF v4.4.1, also an Intel dmpar compilation. Scaled my 2-domain configuration (grid points: 340x340 d01, 341x341 d02) up to 576 processes (12 x 48), and exactly the same erratic behaviour occurs: wrfrst_d01 is always written successfully, while wrfrst_d02 hangs at a size of 96 B with no errors. It just hangs (and eats the resources from my computing grant at my HPC centre...). Tried with quilting and without; currently testing bumping the number of processors to the maximum allowed.

Looks like it might be related to something similar reported back in 2022:

But in that case there was a proper error and a crash...

In any case, running with output format 102 for the restart files might help someone; it won't work for me, as I need to make some manual adjustments to the wrfrst files for my experiment, and that would require major recoding of my processing scripts...
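
For anyone who wants to try the two workarounds mentioned here, these are the relevant namelist.input entries (the values are only an example of the idea, not my settings):

Code:
&time_control
 io_form_restart = 102,       ! split restart: each MPI task writes its own wrfrst piece
/

&namelist_quilt
 nio_tasks_per_group = 4,     ! dedicated I/O (quilt) tasks; 0 disables quilting
 nio_groups          = 1,
/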
 
@Ming Chen

Just FYI, increasing the number of processors from 576 (12 nodes) to 960 (20 nodes) didn't change the situation.
Memory doesn't seem to be the issue, as Slurm reports maximum memory usage of 42.4% at any time during the run.
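
A sketch of one way to check peak memory per job with Slurm's accounting tools (whether these fields and the seff utility are available depends on the cluster):

Code:
# query the job's peak resident memory and the node where it occurred
sacct -j <jobid> --format=JobID,MaxRSS,MaxRSSNode,State
# or, if installed on your system:
seff <jobid>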

I've uploaded the files to the Dropbox here:


Just FYI, I'm using a compilation with chemistry enabled; we use chem_opt = 17 with extra tracers enabled, as in the attached namelist. I don't know whether this is directly related to WRF-Chem, considering that others encountered similar issues before, but I'll understand if the policy is to redirect WRF-Chem troubleshooting to the dedicated part of the forum. I'd still appreciate it if you could check the run with chem_opt = 0. Your call.

I'm currently waiting in the queue to check whether the issue occurs with chemistry disabled. Due to long queue times at our cluster, it will take a while before I can report back on this.
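
For completeness, the chemistry-off control run only flips this switch in the &chem block (a sketch; everything else stays as in the attached namelist):

Code:
&chem
 chem_opt = 0,        ! chemistry disabled for the control test
/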
 