
WRF model crashing/error after running for 4 hrs

Hi, I am trying to run a 5-day WRF forecast, and I am attaching the namelist file I am using.
I am running the model in parallel with MPI on 256 cores across HPC nodes.

I don't understand what is wrong here. When I run the 5-day forecast, the model crashes with the following error:
"srun: error: houcy1-n-cp320d52: task 254: Exited with exit code 1"

Another interesting thing is that when I run the model for just a one- or two-day forecast, with everything else in the namelist the same and the same GFS boundary conditions, it runs perfectly.
I am also attaching a zip file of the rsl.* error files. Please have a look and tell me where the issue might be.

Thank you
 

Attachments

  • all_rsl_files.zip (1 MB)
  • namelist.input (8.4 KB)

It looks like you're writing out a lot of files to your directory, and that you're writing to those files very frequently (every 15 or 30 minutes), so a lot of data is being added to them. Could you potentially be running out of disk space? If that's not the issue, would it make a difference if you set frames_per_outfile to something smaller, like =48 (so that you would get one history file for each day)?
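As a rough sketch of the arithmetic behind that suggestion: the number of frames that fills exactly one simulated day depends on the history interval. The function name below is hypothetical, and the sketch assumes history_interval is given in minutes and divides a day evenly.

```python
MINUTES_PER_DAY = 24 * 60


def frames_per_day(history_interval_min: int) -> int:
    """Return how many history frames cover one simulated day.

    Hypothetical helper: assumes the interval is in minutes and
    divides 1440 evenly, so each output file holds exactly one day.
    """
    if history_interval_min <= 0 or MINUTES_PER_DAY % history_interval_min != 0:
        raise ValueError("interval must be positive and divide 1440 evenly")
    return MINUTES_PER_DAY // history_interval_min


for interval in (15, 30, 60):
    print(f"history_interval = {interval} -> frames_per_outfile = {frames_per_day(interval)}")
```

With a 30-minute interval this gives 48 frames per file, and with a 60-minute interval it gives 24.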
 
Hi, disk space is not the issue: we have around 7 TB of space, of which only 2 TB is used so far.
I will try your suggestion of setting frames_per_outfile to 48 and see if that works.

Thank you so much for your suggestion, I'll update you when the model run is completed.
 
Hi, I tried frames_per_outfile = 24 with history_interval = 60,
so that I have one file for each day, but I am still getting errors after the model has run for 4 hours.

I am attaching the new rsl.error files and namelist file, please have a look.
 

Attachments

  • all_rsl_files.zip (1 MB)
  • namelist.input (8.3 KB)

Thanks for trying that. Based on the rsl* files, it looks like it's getting stuck when trying to write the restart file. As a test, can you try setting restart_interval to something smaller, like =180, and see if it's able to write the restart file at that time? (You can run a shorter simulation just to see if it gets past that restart output time.) If so, you can set it a bit larger (e.g., 720 - every 12 hours) and try again, running the full 5 days, or until it stops. If it stops at the same place again, can you try restarting the simulation from the last restart file that was fully written, and see if you're able to get past the original stop time?
 
Hi, I tried restart_interval = 180 for a 24-hour forecast. The model ran and gave output for up to 3 hours, then crashed with the same error, which means it only gave output up to the first restart_interval.
Then I tried restart_interval = 9360 for the 5-day forecast and it worked; the model ran successfully. :)

Thank you so much for your help. Can you please tell me what the restart_interval field in the namelist.input file is actually used for? Why does it influence the WRF model run?
 
That is very interesting, but I'm so glad you were able to find a work-around! The restart_interval is the interval at which a WRF restart file will be written out. When you have a restart file, you can run a simulation starting at that restart time, forward until the end of the full simulation period. This is sometimes necessary for those who run really long simulations on a batch queueing system that only allows a job to run for a certain number of wall-clock hours before it is stopped. In that case, they need to restart the model from the time of the restart file to be able to complete the full simulation. But it sounds like that's not something you need for this case, so that's good news!
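To illustrate, here is a minimal sketch of when restart files would be due. It assumes restart_interval is in minutes and that a restart file is only due at whole multiples of that interval counted from the start of the run; the function name is hypothetical.

```python
def restart_times(restart_interval_min: int, run_minutes: int) -> list[int]:
    """Minutes from model start at which a restart file would be due.

    Assumption: restarts fall only on whole multiples of the interval
    that lie within the run length.
    """
    if restart_interval_min <= 0:
        raise ValueError("restart interval must be positive")
    return list(range(restart_interval_min, run_minutes + 1, restart_interval_min))


# 24-hour test run with restart_interval = 180: first restart due at 180 min (3 hours).
print(restart_times(180, 24 * 60))

# 5-day run (7200 min) with restart_interval = 9360: no restart falls inside the run.
print(restart_times(9360, 5 * 24 * 60))
```

Under that assumption, an interval of 9360 minutes is longer than the 7200 minutes of a 5-day run, so no restart file would be due at all. That would be consistent with the work-around avoiding the restart write altogether rather than fixing whatever makes it fail.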
 