
MPAS-A hanging while writing output files

nickheavens-cgg

For as long as I have been working with MPAS-A (since May 2023), I have been experiencing a problem in which the model hangs while writing output. This can happen several days into a simulation or after only a few hours. The symptom is an output file that is short of the expected size, with no error messages in the log: the model just keeps running without doing anything further until it is forcibly terminated. The problem is not reproducible. Restarting the simulation usually gets past the point at which the hang occurred, but the same basic problem can crop up with a later (or sometimes earlier) output file.

I have two queries associated with this behaviour.

1. Is this likely to be a problem stemming from using netcdf4 as io_type? For reasons related to my environment, which could be corrected with some effort, I am using libraries that are incompatible with NCAR ParallelIO. I'm hoping someone has seen this problem before and has some insight. If no one has encountered this problem before, I may conclude that ParallelIO is far more reliable and should be used in all circumstances.

2. Does anyone have a suggested strategy for forcing the model to retry writing an output file if there is a hang like this? I'd prefer it to be somehow progress-based rather than time-based, because the restart files can take a long time to write. I'm not looking for a worked solution, just any rough ideas.

Best regards,

Nick


 
Hi Nick,

Has this been with any particular version(s) of MPAS-A? If you aren't already, try using the latest version (v8.0.1) so you can use the Simple MPAS IO Layer (SMIOL) instead. On most HPC systems, this can be done by unsetting the PIO environment variable or simply not loading a PIO module. If MPAS is built with SMIOL, the end of the build output should contain lines like the following (note especially the last one):

Code:
*******************************************************************************
MPAS was built with default single-precision reals.
Debugging is off.
Parallel version is on.
Papi libraries are off.
TAU Hooks are off.
MPAS was built without OpenMP support.
MPAS was built without OpenMP-offload GPU support.
MPAS was built without OpenACC accelerator support.
Position-dependent code was generated.
MPAS was built with .F files.
The native timer interface is being used
Using the SMIOL library.
*******************************************************************************

To your questions:

1. It's possible, but my colleagues have mainly noted slower-than-expected performance with "netcdf4". My personal recommendation would be to use io_type="netcdf4" only if you know you need CDF-2 file format (e.g. your analysis scripts or software stack don't support CDF-5). Otherwise, either leave io_type unset so that the default is used, or set io_type="pnetcdf,cdf5" to ensure that most simulations can be written out (regardless of mesh size).

2. I'm not sure either. It would be interesting to see what others might have for this!
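
As one rough idea only (an untested sketch, not something built into MPAS): an external script could poll the size of the newest output file and flag a hang when the file is still short of its expected size and has stopped growing for several consecutive polls. The names has_stalled and watch_output, the polling interval, and the patience value are all hypothetical, and the sketch assumes that the file grows on disk while it is being written, which may not hold for every I/O library.

```python
import os
import time


def has_stalled(size_samples, expected_size, patience):
    """Return True if the file is short of expected_size and its size
    did not grow over the last `patience` polling intervals."""
    if len(size_samples) <= patience:
        return False  # not enough history to judge yet
    if size_samples[-1] >= expected_size:
        return False  # file reached its full size: the write finished
    recent = size_samples[-(patience + 1):]
    return all(later <= earlier for earlier, later in zip(recent, recent[1:]))


def watch_output(path, expected_size, poll_seconds=60, patience=10):
    """Poll `path` until it is complete (returns False) or stalled (returns True)."""
    samples = []
    while True:
        samples.append(os.path.getsize(path) if os.path.exists(path) else 0)
        if samples[-1] >= expected_size:
            return False
        if has_stalled(samples, expected_size, patience):
            return True
        time.sleep(poll_seconds)
```

If watch_output returns True, a wrapper script could terminate the job and resubmit it from the most recent restart file. Slow restart writes are tolerated as long as the file keeps growing between polls.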

Cheers,
Dylan
 
Dear Dylan,

Thank you so much for the SMIOL suggestion! I have had the problem with 6, 7, and 8, but I didn't realise SMIOL existed. I'll post on the thread once I have an idea if it resolves my issue. (I will have access to a faster system soon that should allow faster testing.)

Nick
 
I can build with SMIOL, but I get some strange errors when I try using it. -5 is a SMIOL Library Error, but I'm puzzled where "Feature is not yet supported" is coming from.

** Attempting to bootstrap MPAS framework using stream: input
Bootstrapping framework with mesh fields from input file 'init.nc'
ERROR: SMIOLf_get_var failed with error -5
ERROR: Feature is not yet supported.
ERROR: SMIOLf_get_var failed with error -5
ERROR: Feature is not yet supported.
ERROR: SMIOLf_get_var failed with error -5
ERROR: Feature is not yet supported.
ERROR: SMIOLf_get_var failed with error -5
ERROR: Feature is not yet supported.
ERROR: SMIOLf_get_var failed with error -5
ERROR: Feature is not yet supported.
ERROR: SMIOLf_get_var failed with error -5
ERROR: Feature is not yet supported.
ERROR: SMIOLf_get_var failed with error -5
ERROR: Feature is not yet supported.
ERROR: SMIOLf_get_var failed with error -5
ERROR: Feature is not yet supported.
ERROR: SMIOLf_get_var failed with error -5
ERROR: Feature is not yet supported.
 
The "Feature is not yet supported" message is likely coming from the PnetCDF library's ncmpi_strerror function, which we call from SMIOL_lib_error_string when a library read error is returned by SMIOLf_get_var in the low-level mpas_io module.

Can you run 'ncdump -k init.nc' to see what file format your init.nc file is using? Perhaps the latest PnetCDF library is different, but most versions that I've worked with can only read CDF-2 and CDF-5, and SMIOL uses the PnetCDF library exclusively.

In case your init.nc file is in HDF5 format, for example, you may need to recreate your initial conditions in CDF-2 or CDF-5 format.
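
If ncdump is not at hand, the same check can be sketched in Python by reading the first bytes of the file. This is a minimal sketch: the function name netcdf_kind and the returned labels are made up for illustration (they are not ncdump's exact output), and it assumes that the HDF5 signature sits at the very start of the file, which is the usual case for netCDF-4 files.

```python
def netcdf_kind(path):
    """Guess the on-disk format of a netCDF file from its leading bytes."""
    with open(path, "rb") as f:
        magic = f.read(8)
    # netCDF-4 files are HDF5 files and begin with the 8-byte HDF5 signature.
    if magic == b"\x89HDF\r\n\x1a\n":
        return "netCDF-4 (HDF5)"
    # Classic-family files begin with "CDF" followed by a version byte.
    if len(magic) >= 4 and magic[:3] == b"CDF":
        versions = {
            1: "classic (CDF-1)",
            2: "64-bit offset (CDF-2)",
            5: "64-bit data (CDF-5)",
        }
        return versions.get(magic[3], "unknown")
    return "unknown"
```

For example, netcdf_kind("init.nc") returning "netCDF-4 (HDF5)" would point to the situation described above, in which the file needs to be recreated in CDF-2 or CDF-5 format before SMIOL can read it.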
 
Dear mgduda,

Yes, these are netcdf4 files (i.e. HDF5 backend, as you know, of course). So that's the problem. Thank you! I'm going to go through the whole static and init file process with SMIOL (it seems to be working so far) and see how I go.

Cheers,

Nick
 
I have managed to get SMIOL to work with init_atmosphere and atmosphere_model with SMIOL-generated initial conditions. Thanks! I will post on this thread once I see if the change in I/O system has any impact on the writing problem. I am very impressed by the 10x improvement in reading the init file, though part of that improvement arises because library issues had previously forced me into a non-PnetCDF I/O strategy.
 
I am delighted to say that, for the first time, I have managed to run 10 days (99 files per day) without encountering this problem. This suggests a reliability rate better than 99.89% for SMIOL compared to my former I/O method, which seemed to average ~ 99.66% (95-99.7%) [statistics based on memory]. Thanks for your help!
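
The 99.89% figure appears consistent with the following arithmetic (a sketch under my own assumption about how the bound was derived): with 10 x 99 = 990 files written and no hangs, the per-file success rate must be better than it would have been had exactly one write failed. The function name success_rate_floor is made up for illustration.

```python
def success_rate_floor(files_written):
    """Per-file success rate if exactly one of `files_written` writes had failed.

    A run with zero failures has done better than this.
    """
    return 1 - 1 / files_written


files = 10 * 99  # 10 simulated days, 99 files per day
print(f"{files} files, success rate better than {success_rate_floor(files):.2%}")
```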
 
@gdicker , @mgduda

FYI, SMIOL also solves the problem of writing very large NetCDF files. In my case, an initial-condition file was ~43 GB. I had trouble writing a file that large with PIO, but SMIOL handled it like magic, and quickly. Thanks a lot!
 