real.exe stops when running for several months of simulations on Cheyenne

Chapacha

I am testing WRF V3.9.1 for seasonal (6-month) forecasts driven by 6-hourly CFSv2 data. WPS processes the full 6 months of data without problems, but real.exe stops after processing about one month and 25 days' worth of the CFSv2 data. I tried starting real.exe at different initialization times, and it stops every time after processing roughly one month and 25 days' worth of data. I also issued "unlimit" before starting real.exe, but this did not help. The error message is: "MPT ERROR: MPI_COMM_WORLD rank 23 has terminated without calling MPI_Finalize() aborting job". Could you help me fix this problem?

Thanks,
Yongxin
 
Yongxin,

Can you please attach your namelist.input file, along with your run log files (e.g., rsl* files)? If you have several rsl* files, you can package them all into one *.tar file and attach that.
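
One way to bundle the log files is sketched below. It assumes the rsl.error.* and rsl.out.* naming that MPI builds of WRF normally use; `pack_rsl` is a hypothetical helper name, not part of WRF.

```bash
# Sketch: bundle all rsl log files in a run directory into one tar file.
# pack_rsl is a hypothetical helper; $1 is the run directory (default: current directory).
pack_rsl() {
    ( cd "${1:-.}" && tar -cf rsl.tar rsl.error.* rsl.out.* )
}
```

Calling `pack_rsl /path/to/run_dir` leaves rsl.tar in that directory, ready to attach.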

Thanks,
Kelly
 
Hi Kelly,

Thank you so much for your quick response. Attached please find my namelist.input file and the tarred up rsl files. Please let me know if you need any other files.

Yongxin
 

Attachments

  • namelist.input (3.9 KB)
  • rsl.tar (26.8 MB)

Hi Yongxin,

1) Can you try to run this with fewer processors (perhaps no more than 36 processors) to see if that makes a difference?

2) If that doesn't work, can you try to run this with only 1 domain to see if it continues further? If it does, try 2 domains to see whether that works as well. I know that you need 3 domains, but this is just a test.

3) If 1) doesn't work and you need to do 2), then afterwards please go to your run directory and issue:

ls -ls >& ls.txt

and attach that ls.txt file, along with your configure.wrf file that was created prior to compiling the code?

Thanks,
Kelly
 
Hi Kelly,

Thank you very much for all the suggestions. I tried using fewer processors (i.e., #PBS -l select=1:ncpus=36:mpiprocs=36 and #PBS -l select=1:ncpus=18:mpiprocs=18), but that did not make any difference. I then tried running Domain 1 only, and the run still stopped at the same point as before. Attached please find the ls.txt file produced by the command "ls -ls >& ls.txt" and my configure.wrf file.

Thanks!
Yongxin
 

Attachments

  • ls.txt (15.5 KB)
  • configure.wrf (23.1 KB)

Thanks for sending those. I don't see anything out of the ordinary in those files. Here are some suggestions:

1) Since this stops whether you have 1, 2, or 3 domains, I would suggest setting max_dom = 1 so that you're not running anything unnecessary when trying to track this down.

2) Since you said you can start this at varying times, and it always stops after the same amount of time, it seems that this is likely NOT related to the specific data at any certain time, and seems more likely to be a size issue. Can you check to see how much disk space you have? It's possible that as the wrfbdy_d01 file approaches a certain size, there is no more room to write any more.

3) This shouldn't be affecting your run, but as a side note, your domains are very small. We don't recommend using domains with a size any smaller than 100x100 grid spaces because there isn't enough space in the domain to get any reasonable results. This may be something you will need to consider changing for your simulation. For information on setting up a "good" domain, take a look at this WPS Best Practice web page:
http://www2.mmm.ucar.edu/wrf/users/namelist_best_prac_wps.html
There is also a page for the namelist.input file, in case you're interested:
http://www2.mmm.ucar.edu/wrf/users/namelist_best_prac_wrf.html
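
For the disk-space check in 2), a minimal sketch is given below. It assumes a POSIX `df` and `du`; `free_kb` and `file_kb` are hypothetical helper names, and the file name wrfbdy_d01 is the one mentioned above. Note that `df` reports free space on the whole filesystem, so a per-user quota could still be exhausted even when it shows room; if your system provides a quota tool (on GLADE this is, as far as I know, `gladequota`), check that too.

```bash
#!/usr/bin/env bash
# Sketch: report free space in the run directory and the current size of wrfbdy_d01.
# free_kb and file_kb are hypothetical helpers, not WRF utilities.

free_kb() {
    # Available space, in 1 KiB blocks, on the filesystem holding directory $1.
    df -Pk "$1" | awk 'NR==2 {print $4}'
}

file_kb() {
    # Disk usage of file $1 in KiB (prints 0 if the file does not exist yet).
    if [ -f "$1" ]; then
        du -k "$1" | awk '{print $1}'
    else
        echo 0
    fi
}

run_dir="${1:-.}"   # run directory; defaults to the current directory
echo "free space in ${run_dir}: $(free_kb "$run_dir") KiB"
echo "wrfbdy_d01 disk usage: $(file_kb "$run_dir/wrfbdy_d01") KiB"
```

Running this while real.exe is still writing, and again after it stops, shows whether wrfbdy_d01 stops growing at the point where free space runs out.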

Kelly
 
Hi Kelly,

Thank you so much for all these suggestions. Yes, I will set max_dom = 1 while we track down the issue. As for disk space, I checked and I should have enough, but I will launch the job under a different /glade space and see if that makes any difference.

My domains are indeed very small. My group has been using these domains for a few years already. We will try to enlarge them the next time we upgrade our system.

Thanks again,
Yongxin
 