
Errors getting restart files

This post is from a previous version of the WRF & MPAS-A Support Forum. New replies have been disabled; if you have follow-up questions related to this post, please start a new thread from the forum home page.

dycrisis

New member
Hi,
I have made several tests based on the 60-3 km mesh and encountered problems getting the restart file.
Here is my namelist.atmosphere:
Code:
&nhyd_model
    config_dt = 15.0
    config_start_time = '2018-05-17_12:00:00'
    config_run_duration = '3:00:00'
    config_len_disp = 3000.0
/
&decomposition
    config_block_decomp_file_prefix = 'x20.chn.graph.info.part.'
/
&restart
    config_do_restart = false
/
&physics
    config_sst_update = false
    config_sstdiurn_update = false
    config_deepsoiltemp_update = false
    config_radtlw_interval = '00:30:00'
    config_radtsw_interval = '00:30:00'
    config_bucket_update = 'none'
    config_physics_suite = 'mesoscale_reference'
    config_convection_scheme = 'cu_kain_fritsch'
    config_microp_scheme = 'mp_thompson'
/
At first, I ran the model without the restart stream, so my streams.atmosphere file was:
Code:
<streams>
<immutable_stream name="input"
                  type="input"
                  filename_template="x20.chn.init.nc"
                  input_interval="initial_only" />

<stream name="output"
        type="output"
        precision="single"
        filename_template="history.$Y-$M-$D_$h.$m.$s.nc"
        output_interval="3:00:00" >

        <file name="stream_list.atmosphere.output"/>
</stream>

<stream name="diagnostics"
        type="output"
        precision="single"
        filename_template="diag.$Y-$M-$D_$h.$m.$s.nc"
        output_interval="1:00:00" >

        <file name="stream_list.atmosphere.diagnostics"/>
</stream>
</streams>
The model ran normally based on these settings. However, when I added the "restart" stream like this:
Code:
<streams>
<immutable_stream name="input"
                  type="input"
                  filename_template="x20.chn.init.nc"
                  input_interval="initial_only" />

<immutable_stream name="restart"
                  type="output"
                  filename_template="restart.$Y-$M-$D_$h.$m.$s.nc"
                  output_interval="3:00:00" />

<stream name="output"
        type="output"
        precision="single"
        filename_template="history.$Y-$M-$D_$h.$m.$s.nc"
        output_interval="3:00:00" >

        <file name="stream_list.atmosphere.output"/>
</stream>

<stream name="diagnostics"
        type="output"
        precision="single"
        filename_template="diag.$Y-$M-$D_$h.$m.$s.nc"
        output_interval="1:00:00" >

        <file name="stream_list.atmosphere.diagnostics"/>
</stream>
</streams>
The model ran smoothly until it was time to write the restart file, and then I got the following message:
Code:
ERROR: MPAS IO Error: Bad return value from PIO
ERROR: MPAS IO Error: Bad return value from PIO
ERROR: MPAS IO Error: Bad return value from PIO
ERROR: MPAS IO Error: Bad return value from PIO
ERROR: ********************************************************************************
ERROR: Error writing one or more output streams
CRITICAL ERROR: ********************************************************************************
Then I removed all of the other output streams and kept only the input stream and the restart stream; this time the model didn't start integration after initialization, and I got this message:
Code:
ERROR: Stream output does not exist in call to MPAS_stream_mgr_get_property().
On the other hand, I ran the test case from the MPAS website, which is based on the 120-km mesh, and the model wrote restart files as expected. Therefore, I really don't know what's wrong with my configuration. I have tried MPAS built with both the PIO 1.7 and PIO 2 libraries, and the error messages were similar. Please help me with this problem; I would greatly appreciate your assistance.
 
Hi...

First of all, I know nothing about the MPAS software, but based on your output I can make an educated guess. Are your restart files in NETCDF format, and are they greater than 2 GB in size?

You could be writing 32-bit NETCDF files. These have a 32-bit counter, which maxes out at about 2 GB. When the file exceeds 2 GB in size, the counter, which is supposed to hold the size of the file, goes negative! The result is that your NETCDF file is corrupt and unusable.

If you aren't sure, do "od -c some-good-netcdf-file | head -1". Columns 2-4 will be "C D F". Column 5 will be 001 or 002. 001 is for 32-bit NETCDF and 002 is for 64-bit NETCDF. The max number for 64-bit NETCDF is *MUCH* larger than for 32-bit.
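
For reference, here is that check as a runnable snippet; the file name is just a placeholder, and the signatures in the comments are the ones you would typically see:
Code:
# Inspect the first bytes of a netCDF file to identify its on-disk format.
# ("some-good-netcdf-file" is just a placeholder for one of your own files.)
od -c some-good-netcdf-file | head -1

# Typical signatures at the start of the first output line:
#   C D F 001   -> classic (32-bit offset) netCDF
#   C D F 002   -> 64-bit offset netCDF
#   C D F 005   -> CDF-5 netCDF (allows very large variables)
#   211 H D F   -> HDF5-based netCDF-4 file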

Check your documentation for what to do for 64-bit files. You'll have to clean and recompile your software once you find the "magic".

Restart files tend to be huge for WRF, so I assume the same for MPAS.
 
I think the issue may be, as @kwthomas suggested, one of file format. There are three file formats that can be written by MPAS streams: CDF2, CDF5, and HDF5. CDF2 allows for files larger than 4 GB, but individual variables or records must be less than 4 GB (see https://www.unidata.ucar.edu/software/netcdf/netcdf/Large-File-Support.html for a more detailed discussion); CDF2 is the default file format for MPAS streams. CDF5 allows for both files and variables/records larger than 4 GB, but this format is only supported by the parallel-netCDF library or newer versions of the netCDF library compiled with pnetcdf support. HDF5 also allows for files and variables/records larger than 4 GB.

With the 60-3 km mesh and a double-precision build of MPAS-Atmosphere, there will be some fields in the restart stream that exceed 4 GB in size (e.g., the 'ozmixm' field, which will be approximately 4.7 GB in size). In this case, you can add one of the following to the definition of your "restart" stream in the streams.atmosphere file:
Code:
io_type="pnetcdf,cdf5"
or
Code:
io_type="netcdf4"
Section 5.2 of the users' guide has more details.
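
For example (just a sketch, reusing the restart stream definition from your streams.atmosphere file above), the CDF5 option would look like:
Code:
<immutable_stream name="restart"
                  type="output"
                  io_type="pnetcdf,cdf5"
                  filename_template="restart.$Y-$M-$D_$h.$m.$s.nc"
                  output_interval="3:00:00" />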

We should probably update the mesh download page with some notes about file formats required by the 60-3 km mesh to avoid any future confusion!
 
Thanks to all, and sorry for my late reply; I was making some tests based on the suggestions by @kwthomas and @mgduda. The problem was a bit complicated, since the netCDF library on our cluster supported neither parallel I/O nor CDF5. Therefore, io_type="netcdf4" didn't work. The "pnetcdf,cdf5" option did work, and I finally got the nearly 40 GB restart file, but my post-processing program, which was built with that netCDF library, wasn't able to read the CDF5-format restart file.
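
As an aside, a quick way to check what a given netCDF installation supports is the nc-config utility that ships with netCDF-C; some of these flags only exist in newer netCDF releases, so this is just a rough sketch:
Code:
# Query the netCDF-C build for the capabilities relevant here.
# Note: flags such as --has-cdf5 only exist in newer netCDF releases.
nc-config --version
nc-config --has-nc4       # netCDF-4/HDF5 format support
nc-config --has-pnetcdf   # parallel-netCDF support for classic/CDF formats
nc-config --has-cdf5      # CDF-5 large-variable support (newer releases only)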

To fix all of these problems, I rebuilt all of the required libraries myself. The model now generally works fine, and I tested all of the "io_type" options; they all seem to produce 64-bit files. I tried the "od -c" command, and the results were as follows:
Code:
$ od -c x1.40962.init.nc.netcdf | head -1
0000000   C   D   F 002  \0  \0  \0 001  \0  \0  \0  \n  \0  \0  \0 020
$ od -c x1.40962.init.nc.netcdf4 | head -1
0000000 211   H   D   F  \r  \n 032  \n  \0  \0  \0  \0  \0  \b  \b  \0
$ od -c x1.40962.init.nc.pnetcdf | head -1
0000000   C   D   F 002  \0  \0  \0 001  \0  \0  \0  \n  \0  \0  \0 020
$ od -c x1.40962.init.nc.pnetcdf_cdf5 | head -1
0000000   C   D   F 005  \0  \0  \0  \0  \0  \0  \0 001  \0  \0  \0  \n
However, one problem still exists: when I set io_type to "netcdf" or "netcdf4", the model shows an "ERROR: MPAS IO Error: Bad return value from PIO" message, but the output file is actually generated, which is pretty weird. Maybe something went wrong when building the libraries, or it's just a false alarm, I guess...
 
Hi...

I *think* you are ending up with a NETCDF file that actually has HDF5 data, which makes it smaller than a pure NETCDF file. Your postprocessing program doesn't support that format.

Try running with "pnetcdf". This will create a pure netcdf format file, with the I/O in parallel.
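
In the streams.atmosphere file, that would look something like this (just a sketch based on the restart stream definition shown earlier in this thread):
Code:
<!-- io_type="pnetcdf" selects the parallel-netCDF library and the default
     CDF-2 file format -->
<immutable_stream name="restart"
                  type="output"
                  io_type="pnetcdf"
                  filename_template="restart.$Y-$M-$D_$h.$m.$s.nc"
                  output_interval="3:00:00" />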
 
Thanks, I have changed to another version of the netCDF library, and the error is gone. Actually, I found that the "netcdf4" output file is a little bit bigger than the output files from the other options. Currently I'm using netCDF 4.5.0; I noticed that during the configure step of building the PIO2 library, nc_set_log_level could not be found with this version (4.5.0) of netCDF, while it could with the version I used previously (4.6.2). Therefore, I think the nc_set_log_level function may be the cause of the error message. This is a total guess; since everything has returned to normal, I will make more model runs and see if any other problems exist.
 