Wrf.exe Error While Writing Restart Files

jbellino

New member
Hi all, I have a 2-domain nested model that runs with no errors, but when I introduce the "restart_interval" parameter (with "restart=.false."), wrf.exe runs all the way to the end and then bails while writing the wrfrst files. Often the wrfrst file for domain 1 appears to complete before the error occurs (EDIT: it seems that it doesn't finish writing domain 1, but gets somewhat close); in this example, however, the program failed while writing domain 1. The executables were compiled with "WRFIO_NCD_LARGE_FILE_SUPPORT=1", and I have no problem writing large (> 7 GB) input files with real.exe. Any help I can get with troubleshooting this problem would be much appreciated!
Timing for main: time 1974-01-02_04:59:40 on domain 1: 1.00705 elapsed seconds
Timing for main: time 1974-01-02_04:59:45 on domain 2: 0.16771 elapsed seconds
Timing for main: time 1974-01-02_04:59:50 on domain 2: 0.20322 elapsed seconds
Timing for main: time 1974-01-02_04:59:55 on domain 2: 0.20213 elapsed seconds
Timing for main: time 1974-01-02_05:00:00 on domain 2: 0.20210 elapsed seconds
mediation_integrate.G 1242 DATASET=HISTORY
mediation_integrate.G 1243 grid%id 2 grid%oid 4
Timing for Writing wrfout_d02_1974-01-02_05:00:00 for domain 2: 1.29688 elapsed seconds
Timing for main: time 1974-01-02_05:00:00 on domain 1: 2.31661 elapsed seconds
mediation_integrate.G 1242 DATASET=HISTORY
mediation_integrate.G 1243 grid%id 1 grid%oid 4
Timing for Writing wrfout_d01_1974-01-02_05:00:00 for domain 1: 0.56472 elapsed seconds
Timing for Writing restart for domain 1: 63.82768 elapsed seconds
forrtl: error (78): process killed (SIGTERM)
Image PC Routine Line Source
wrf.exe 00000000235EECBB for__signal_handl Unknown Unknown
libpthread-2.26.s 00001555529672D0 Unknown Unknown Unknown
libmpich_intel.so 0000155552DB1220 MPID_nem_gni_poll Unknown Unknown
libmpich_intel.so 0000155552D8EDB6 MPIDI_CH3I_Progre Unknown Unknown
libmpich_intel.so 0000155552CD442D MPIC_Wait Unknown Unknown
libmpich_intel.so 0000155552CD46C7 MPIC_Recv Unknown Unknown
libmpich_intel.so 0000155552CFA99C MPIR_CRAY_Bcast_T Unknown Unknown
libmpich_intel.so 0000155552CFC7E8 MPIR_CRAY_Bcast Unknown Unknown
libmpich_intel.so 0000155552C0C54B MPIR_Bcast_impl Unknown Unknown
libmpich_intel.so 0000155552CF8DB4 MPIR_CRAY_Gatherv Unknown Unknown
libmpich_intel.so 0000155552C1C904 MPIR_Gatherv_impl Unknown Unknown
libmpich_intel.so 0000155552C1CE19 MPI_Gatherv Unknown Unknown
wrf.exe 00000000206C5D07 Unknown Unknown Unknown
wrf.exe 00000000206B10C8 Unknown Unknown Unknown
wrf.exe 000000002136420C Unknown Unknown Unknown
wrf.exe 000000002116CB31 Unknown Unknown Unknown
wrf.exe 0000000021213D3C Unknown Unknown Unknown
wrf.exe 00000000212164E5 Unknown Unknown Unknown
wrf.exe 0000000020197EA7 Unknown Unknown Unknown
wrf.exe 0000000020017A91 Unknown Unknown Unknown
wrf.exe 0000000020017A49 Unknown Unknown Unknown
wrf.exe 00000000200179D2 Unknown Unknown Unknown
libc-2.26.so 00001555525BD34A __libc_start_main Unknown Unknown
wrf.exe 00000000200178EA Unknown Unknown Unknown
 

Attachments

  • namelist.input.txt
    7.7 KB
  • rsl.error.0000.txt
    1.8 MB

kwerner

Administrator
Staff member
Hi,
Can you check on/test out a couple of things to narrow down the issue?
1) If you only run a single domain, does the issue occur?
2) If you set the restart interval much larger (e.g., 1000) so that no restart file will be written, does the issue occur?
3) You said sometimes the wrfrst file completes before the error. Are you running different cases, dates, etc., and is this only happening on some of the simulations? When you do get a wrfrst file, what is the size of that file?
4) Can you check just to verify you have enough disk space to write the larger files?
5) Can you package up all of your rsl.error.* files into a single *.TAR file and attach that? Thanks!
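
For item 5, one way to bundle the logs is sketched below. The function name bundle_rsl and the paths in the usage line are placeholders, not anything from this thread, and the sketch assumes the rsl.error.* files sit directly in the run directory.

```shell
# Bundle every rsl.error.* file in a run directory into one tar archive.
# Usage: bundle_rsl <run-dir> <archive-path>
bundle_rsl() {
  # Change into the run directory inside a subshell so the archive members
  # carry no leading path; the archive path is resolved by the caller's shell.
  ( cd "$1" && tar -cf - rsl.error.* ) > "$2"
}
```

A call such as `bundle_rsl /path/to/run rsl.error.tar` produces the archive, and `df -h /path/to/run` gives a quick view of free space for item 4, although quota-managed file systems may need a site-specific command instead.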
 

jbellino

New member
Hello, thanks for the reply! Here are my answers to your questions:

1) If you only run a single domain, does the issue occur?
  • No, a 1-domain model simulation completes without error. The wrfrst file for domain 1 was about 2.4 GB.
2) If you set the restart interval much larger (e.g., 1000) so that no restart file will be written, does the issue occur?
  • I increased the restart interval by 100 minutes (1440 --> 1540) and was able to get the model to run and it completed without error and without writing restart files.
3) You said sometimes the wrfrst file completes before the error. Are you running different cases, dates, etc., and is this only happening on some of the simulations? When you do get a wrfrst file, what is the size of that file?
  • I've run the same-ish model (same domain and mp physics) for different time periods and simulation lengths and always get the same error. The wrfrst file for domain 1 is generally about 1.7 GB and the wrfrst file for domain 2 is about 800 kB. Given the size of the wrfrst file for the 1-domain test above my guess is that these runs did not finish writing domain 1 before failing—I had only inferred that to be the case from output text at the end of the rsl.error files.
4) Can you check just to verify you have enough disk space to write the larger files?
  • We are currently at about 35 TB of our 50 TB disk quota.
5) Can you package up all of your rsl.error.* files into a single *.TAR file and attach that? Thanks!
  • The .tar file is attached!
 

Attachments

  • rsl.error.tar
    625.2 KB

jbellino

New member
I was able to write wrfrst files using io_form_restart=102; however, I was under the impression that file size is unlimited with netCDF-4. Perhaps I'm using a library/configuration that doesn't support this? The libraries and variables I set when compiling/running WRF are listed below. My only concern with this approach is recombining the scattered restart files back into something useable by WRF for the restart run—I've seen the presentation on the JOINER program that joins model output, but it specifically says that it does not work with restart files 😕

EDIT: I'm able to write output files that are >>20 GB so I'm not sure that this is a netCDF file-size problem.

module switch PrgEnv-cray PrgEnv-intel
module load cray-netcdf-hdf5parallel # loads cray-netcdf-hdf5parallel/4.8.1.1
module load cray-hdf5-parallel # loads cray-hdf5-parallel/1.12.1.1
module load grib2/libpng/1.2.34
module load grib2/jasper/1.900.1

ulimit -s unlimited

export WRFIO_NCD_LARGE_FILE_SUPPORT=1
export NETCDF=${NETCDF_DIR}
export HDF5=${HDF5_DIR}
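
When comparing runs like these, it can help to pull the I/O-related settings out of each namelist in one step. The sketch below is only an illustration: the function name show_io_settings is made up here, and it assumes each setting sits on its own line in the usual `name = value` namelist layout.

```shell
# Print the I/O-related settings discussed in this thread from a WRF namelist.
# Usage: show_io_settings <namelist-file>
show_io_settings() {
  # Match lines of the form "  name = value" for the settings of interest.
  grep -E '^[[:space:]]*(io_form_history|io_form_restart|restart_interval|restart|nio_groups|nio_tasks_per_group)[[:space:]]*=' "$1"
}
```

Running `show_io_settings namelist.input` in two run directories and comparing the output is a quick way to confirm that both runs use the same io_form_restart value and quilting settings.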
 

kwerner

Administrator
Staff member
Hi,
Have you tried this without using quilting? I ran a test using your namelist (and my own generated input) and saw a similar issue to yours. Mine actually stopped just prior to finishing the last wrfout_d02* file, and before writing the restart file. I then tried it without turning on any quilting and it ran to completion. The quilting option can be finicky and we often see issues with it. Can you try without and let me know if that makes any difference? Thanks!
 

jbellino

New member
Ah, yes, I can complete a simulation if I turn off quilting. The simulation I'm preparing is a 50-year reanalysis of the southeastern USA at 1 km, and I'm concerned that turning off quilting will make the runs take an unreasonably long time to complete. I've tried adjusting the nio_groups and nio_tasks_per_group parameters, but those seem finicky too (2 or 8 tasks per group works, but WRF fails to run with any other number between 1 and 8). At this point it looks like I can use quilting *if* I specify the split netCDF format (102) for the restart files. Are there any off-the-shelf codes I could use to join them back together? I'm also open to any other suggestions.
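
One thing worth ruling out when changing nio_groups and nio_tasks_per_group is a mismatch in the MPI rank count. With quilting on, WRF sets aside nio_groups x nio_tasks_per_group ranks as I/O servers and uses the rest for computation, so the job has to be launched with the sum of the two. The sketch below just does that arithmetic; the function name total_ranks is made up, and the quilting values in the example call (1 group of 8 tasks) are illustrative rather than taken from the attached namelist.

```shell
# Total MPI ranks to request = compute ranks + quilt (I/O server) ranks.
# Usage: total_ranks <nproc_x> <nproc_y> <nio_groups> <nio_tasks_per_group>
total_ranks() {
  echo $(( $1 * $2 + $3 * $4 ))
}

# A 36 x 40 compute decomposition plus one I/O group of 8 tasks
total_ranks 36 40 1 8   # prints 1448
```

If nproc_x and nproc_y are fixed in the namelist, the launcher's rank count has to match this total exactly. This does not explain why only some task counts work, but it removes one possible cause.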
 

kwerner

Administrator
Staff member
Hi,
Yes, we actually do have a script that an NCAR colleague wrote that will join the files back together if using the io_form 102 option. You can find that script here. It's called "Joiner Script."
 

jbellino

New member
Hello, thank you for your continued help on this problem. I was able to get the joinwrf code compiled on our system, but so far have not been able to get any output after following the instructions for modifying the namelist file. I tested this on restart files (wrfrst*) created with the io_form_restart = 102 option but I noticed that the PowerPoint presentation accompanying the source code specifically says it will not work with restart files. The error I encountered is pasted below.

(base) jbellino@nid00023:/<clip>/jbellino/wrf/JOINER2> ./joinwrf namelist.join

###################################################################
###################################################################
#### ####
#### Welcome to JOINWRF ####
#### ####
#### A program that reads in patches of WRF history files ####
#### and join them into one large piece. ####
#### ####
###################################################################
###################################################################

Namelist wrfdfile read in successfully.
dir_extd = '/<clip>/jbellino/wrf/testing/join/',
io_form = 7,
grid_id = 1,
init_time_str = '1974-01-01_05:00:00',
start_time_str = '1974-01-01_05:00:00',
history_interval = ' 00_01:00:00',
end_time_str = '1974-01-01_05:00:00',

Namelist arpsgrid read in successfully.
proc_start_x = 0,
proc_start_y = 0,
nproc_x = 36,
nproc_y = 40,
nproc_xin = 36,


Namelist output was successfully read.
outdirname = '/<clip>/jbellino/wrf/testing/join/joined/',
outfiltail = '',
deflate_level = 0
nvarout = 0,
attadj = F,

Namelist debugging was successfully read.
debug = 2,


============================
WRF files to be read are:

ERROR: The WRF file
/<clip>/jbellino/wrf/testing/join/wrfout_d01_1974-01-01_05:00:00_0000
does not exist.



To be sure I was using the program correctly, I generated some output using the io_form_history = 102 option and tried to join those history files (wrfout*). It was able to read the history files in the directory and successfully (I think) wrote the dimensions and global attributes to the output file, but then choked when trying to copy the variables. As an aside, I had to increase the value of nmaxvars in joinlist.F (from 300 to 400) in order to handle all of the variables in my output files. I also modified the &patches section of the namelist file by removing the proc_sw variable (unused) and adding the proc_start_x and proc_start_y variables expected by joinwrf.F.
 

Attachments

  • namelist.join.txt
    5.3 KB

kwerner

Administrator
Staff member
Hi,
Unfortunately the joiner program is a script that was passed down to our group several years ago and is not maintained, supported, or updated. I do recall some others having issues with the program and have a couple of forum posts I can point you to in case any of the information in them helps (or may help along the way):

WRF output JOINER
Joiner program run time error
io_form_history=102 JOINER code

If you want to search for other threads in this forum, just type "joiner" in the search bar and it will bring up several.

If the presentation states that it will not work with restart files, unfortunately I think that's likely the case and is probably the reason for your error. It may be possible for you to make some modifications to the code that will allow that. If you do, we would welcome any code updates, which we could then provide to future users!
 

jbellino

New member
I wanted to update this thread with positive results from some test runs that just finished. I arbitrarily decreased the size of my inner domain from 1041 x 1333 to 601 x 837 and was able to successfully use quilting while also writing restart files. I ran a second test to determine if the restart files were valid and it seems to have run successfully as well. I'm not sure at this point whether it is a memory issue or something else, but will continue testing to see if I can find out at what point things fall apart in terms of the size of domain 2.
 

jbellino

New member
I was able to get the nested domain 2 up to a size of 801 x 1001 before the restart files fail to write. I did notice one discrepancy in the rsl files where the memory allocation size for domain 2 is listed—there are 2 entries in the failure run and just the 1 entry in the successful run (see below). Could this be symptomatic of the underlying cause for the eventual failure to write the restart files? I'm running these models on a Cray XC-50 with 1,520 processors spread across 38 compute nodes, each node having approximately 190 GB of available memory. I'm all out of ideas at this point and not sure what to try next...

Successful run:
*************************************
Nesting domain
ids,ide,jds,jde 1 801 1 1001
ims,ime,jms,jme -4 36 -4 36
ips,ipe,jps,jpe 1 23 1 25
INTERMEDIATE domain
ids,ide,jds,jde 162 367 80 335
ims,ime,jms,jme 157 179 75 97
ips,ipe,jps,jpe 160 169 78 87
*************************************
d01 2017-06-01_05:00:00 alloc_space_field: domain 2 , 74863904 bytes allocated
d01 2017-06-01_05:00:00 *** Initializing nest domain # 2 from an input file. ***

Failure run:
*************************************
Nesting domain
ids,ide,jds,jde 1 801 1 1013
ims,ime,jms,jme -4 36 -4 36
ips,ipe,jps,jpe 1 23 1 26
INTERMEDIATE domain
ids,ide,jds,jde 162 367 80 338
ims,ime,jms,jme 157 179 75 98
ips,ipe,jps,jpe 160 169 78 88
*************************************
d01 2017-06-01_05:00:00 alloc_space_field: domain 2 , 7597728 bytes allocated
d01 2017-06-01_05:00:00 alloc_space_field: domain 2 , 74873516 bytes allocated
d01 2017-06-01_05:00:00 *** Initializing nest domain # 2 from an input file. ***
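
To compare the alloc_space_field entries across several rsl files without reading them by eye, something like the sketch below could be used. The function name alloc_bytes is made up, and the parsing assumes the log lines have the same word order as the excerpts above ("domain <id> , <n> bytes allocated").

```shell
# Count and sum the alloc_space_field entries for one domain in an rsl file.
# Usage: alloc_bytes <rsl-file> <domain-id>
alloc_bytes() {
  awk -v dom="$2" '
    /alloc_space_field:/ {
      id = ""; bytes = ""
      for (i = 1; i <= NF; i++) {
        if ($i == "domain") id = $(i + 1)     # word after "domain" is the domain id
        if ($i == "bytes")  bytes = $(i - 1)  # word before "bytes" is the size
      }
      if (id == dom) { total += bytes; n++ }
    }
    END { printf "%d entries, %d bytes\n", n, total }
  ' "$1"
}
```

For the two excerpts above, `alloc_bytes success.rsl.error.0000 2` would be expected to report 1 entry and `alloc_bytes failure.rsl.error.0000 2` to report 2, assuming the attached files contain no other domain 2 allocation lines.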
 

Attachments

  • success.namelist.input
    7.9 KB
  • success.rsl.error.0000
    157.6 KB
  • failure.namelist.input
    7.9 KB
  • failure.rsl.error.0000
    158 KB

kwerner

Administrator
Staff member
Hi,
I'm sorry to hear you're still struggling with this, and unfortunately at this time, I don't have a lot of advice. I'm going to reach out to a couple colleagues to see if they have any advice, but we are lacking in our software engineering resources right now (I am not an SE).

For sanity purposes, I don't think the "alloc_space" message is the issue. I looked at some rsl* files for other runs with/without quilting and I see it listed twice in successful runs too. I'm not really sure why it splits it like that. I see you're no longer using split-netcdf files. Were you never able to get that figured out?
 

jbellino

New member
No, I'm writing standard netCDF files with io_form_history = 2 and io_form_restart = 2. We will likely have to switch to a single-domain model if we can't get quilting to work with the restart files. Thank you for helping me to dig into this issue!
 

kwerner

Administrator
Staff member
I spoke to a colleague who also suggested using option 102, and it shouldn't be necessary to join the wrfrst* files back together to run the restart simulation. As long as you're consistent with the output type, you would only need to piece together the regular wrfout* files when you're ready to use them for post-processing. I also learned that the quilting option has been unstable for a while and even when we did have our software engineer, they weren't able to get it fixed. It's up to you as to which path you want to take, but I just wanted to share that information!
 

jbellino

New member
Thank you for the info re: using option 102 for the restart files and just running the restart simulation from the scattered wrfrst* files—I tested that here and can confirm that it was not necessary to rejoin them to get the restart run to complete successfully. The solution was much more straightforward than what I thought; since I saw there was a separate utility to rejoin the regular wrfout* files I assumed that meant wrfrst* files would also need to be rejoined. But now I see the error in my assumption was in equating the need to conduct post-processing of wrfout* files vs. the internal mechanisms of a wrf restart run. Thanks for all of your help with this problem!!
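
Since the restart run reads the split files directly, it may be worth checking that the full set of patches is present before launching it. The sketch below is one way to do that under stated assumptions: the function name check_rst_patches is made up, and it assumes one file per compute rank named wrfrst_d<NN>_<time>_<RRRR> with a 4-digit rank suffix, by analogy with the wrfout_d01_..._0000 name in the joinwrf output earlier in the thread.

```shell
# Check that every split restart patch exists and is non-empty.
# Usage: check_rst_patches <dir> <domain-number> <time-string> <n-compute-ranks>
# Assumed naming: <dir>/wrfrst_d<NN>_<time>_<RRRR>, ranks numbered from 0000.
check_rst_patches() {
  missing=0
  rank=0
  while [ "$rank" -lt "$4" ]; do
    f=$(printf '%s/wrfrst_d%02d_%s_%04d' "$1" "$2" "$3" "$rank")
    if [ ! -s "$f" ]; then
      echo "missing or empty: $f"
      missing=$((missing + 1))
    fi
    rank=$((rank + 1))
  done
  echo "$missing of $4 patch files missing"
  # Return success only when the full set is present.
  [ "$missing" -eq 0 ]
}
```

For a 36 x 40 decomposition this would be called as `check_rst_patches . 1 1974-01-02_05:00:00 1440`, and again with 2 for the nested domain. A non-zero exit status means at least one patch is absent or empty.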
 