MPAS-A Restart Files (Error Checking Ideas?)

nickheavens-cgg

Dear all,

As you may know, I run MPAS-A as part of a climate model on my company's growing low-latency HPC cluster. I have generally been running short simulations in stable environments, so I have not needed to use restart files very much. But this way of working is changing, and there have been some surprises.

The most recent surprise, which I think will be of general interest, is a situation in which MPAS-A appeared to complete safely (having written out timing statistics) but failed to write the zgrid variable properly to the restart file. The error was small, consisting of 4 cells in the middle of the array, but writing zeros into zgrid leads to obvious errors that show up immediately in the longwave radiation scheme.
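For anyone who wants to check their own zgrid quickly, a minimal sketch of the kind of check I mean is below (the file name is just an example from my run, and I am assuming the usual layout with cells along the first dimension after squeezing out time):

Python:
import numpy as np
import xarray as xr

# Example file name; substitute your own restart file.
ds = xr.open_dataset('restart.2023-01-02_00.00.00.nc')

# zgrid holds geometric heights, which should never be zero throughout a column,
# so an all-zero column flags a cell that was never written.
zgrid = np.squeeze(ds['zgrid'].values)
bad_cells = np.where(~zgrid.any(axis=1))[0]
print(str(bad_cells.size) + ' all-zero zgrid columns at cell indices: ' + str(bad_cells))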

I think the problem may be some bad timing in signalling when the restart is written at the end of the simulation. However, MPAS-A was not designed to interface with OASIS3-MCT (the coupler that is probably at the root of the problem), so I don't expect this problem to be within the scope of support.

That said, I wonder whether restart miscopying has been noted in more supported applications, and whether there are error-handling techniques or other tips that could be applied to make sure the restart files are completely written before the model stops.

Best regards,

Nicholas G. Heavens
Innovation Project Manager, Viridien
West Sussex, UK
 
Dear all,

I'm no longer convinced this error is down to bad signalling. I get it even when I run the model a few timesteps beyond the restart write, and I have also found similar errors in a history file. I'm looking into various possible causes, but if anyone has experience with MPAS-A writing netCDF files with small errors like this, I'd appreciate hearing about it.

Nick
 
I now have a solution to this problem. I was able to eliminate it (so far) by changing the clobber mode in streams.atmosphere from "overwrite" to "replace_files" (a sketch of the relevant stream entry is included at the end of this post). If you think you have this problem, try looking for variable miscopying using Python code along these lines:

Python:
import numpy as np
import xarray as xr

# Example file names from my run; substitute your own restart/history pair.
ds_rest = xr.open_dataset('restart.2023-01-02_00.00.00.nc')
ds_hist = xr.open_dataset('history.2023-01-02_00.00.00.nc')

rest_keys = list(ds_rest.keys())
rest_pczero = np.zeros(len(rest_keys))

hist_keys = list(ds_hist.keys())
hist_pczero = np.zeros(len(hist_keys))

# Heuristic for each 2-D cell-based field (nCells x vertical): clip |values| > 1
# to 1, then compare the mean of the clipped values over columns whose 5th
# vertical level is exactly zero against the mean over the whole field. A large
# difference (in per cent) flags blocks of cells that were written as zeros.
for ii, i in enumerate(rest_keys):
    var = np.squeeze(ds_rest[i].values)
    if np.size(np.shape(var)) == 2:
        if (np.shape(var)[0] == 1024002) & (np.shape(var)[1] > 4):  # nCells of the uniform 24 km mesh
            tester = var[np.squeeze(np.where(var[:, 4] == 0)), :]
            tester_full = var + 0
            if (np.size(tester) > 0) & (np.size(tester_full) > 0):
                tester[np.nonzero(np.abs(tester) > 1)] = 1
                tester_full[np.nonzero(np.abs(tester_full) > 1)] = 1
                rest_pczero[ii] = (100 * (1 - (np.sum(tester) / np.size(tester)))) \
                                  - (100 * (1 - (np.sum(tester_full) / np.size(tester_full))))

for ii, i in enumerate(hist_keys):
    var = np.squeeze(ds_hist[i].values)
    if np.size(np.shape(var)) == 2:
        if (np.shape(var)[0] == 1024002) & (np.shape(var)[1] > 4):
            tester = var[np.squeeze(np.where(var[:, 4] == 0)), :]
            tester_full = var + 0
            if (np.size(tester) > 0) & (np.size(tester_full) > 0):
                tester[np.nonzero(np.abs(tester) > 1)] = 1
                tester_full[np.nonzero(np.abs(tester_full) > 1)] = 1
                hist_pczero[ii] = (100 * (1 - (np.sum(tester) / np.size(tester)))) \
                                  - (100 * (1 - (np.sum(tester_full) / np.size(tester_full))))



If hist_pczero or rest_pczero for any variable is much larger than 10%, you may have a few small blocks of zeroes. In a few variables, this is normal behaviour, e.g., nr, tlag, and greenfrac. But if you find it in a field like theta or rho, it will crash the radiative transfer in MPAS-A.
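For reference, the clobber setting mentioned above lives in the stream definitions in streams.atmosphere. Roughly, my restart stream entry with the change applied looks like the sketch below; every attribute other than clobber_mode is a placeholder from my own setup and should be left as whatever you already have:

XML:
<immutable_stream name="restart"
                  type="input;output"
                  filename_template="restart.$Y-$M-$D_$h.$m.$s.nc"
                  input_interval="initial_only"
                  clobber_mode="replace_files"
                  output_interval="1_00:00:00"/>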
 
Here is a further update to this saga. I crashed the model in the midst of writing a restart once more, and soon I was seeing the same corruption when writing restarts again. I changed the clobber setting back to "overwrite" and the files were written correctly once more.


Nick
 
Dear all,

This problem keeps recurring and seems to be a fundamental problem with MPAS I/O in my configuration: it is reproducible across compilers, across PnetCDF versions, and with Parallel-IO used in place of SMIOL.

At its heart, the problem seems to be that in some cases data fails to be written along the nCells dimension, leaving small gaps (of three or four cells) of zeros in one, two, or three places. This is happening in restart files, history files, and diagnostic files, though inconsistently.
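To see whether your own files show the same pattern, a small sketch along these lines finds the zero gaps along nCells and reports how long each one is (zgrid and the file name are only examples; any 2-D cell-based field can be substituted):

Python:
import numpy as np
import xarray as xr

# Example file name; substitute the restart, history, or diagnostic file of interest.
ds = xr.open_dataset('restart.2023-01-04_00.00.00.nc')

# Assumed layout after squeezing: cells along the first dimension, vertical along the second.
field = np.squeeze(ds['zgrid'].values)
zero_cells = np.where(~field.any(axis=1))[0]  # cells whose entire column is zero

# Group consecutive cell indices into runs so the length of each gap is visible.
if zero_cells.size > 0:
    breaks = np.where(np.diff(zero_cells) > 1)[0] + 1
    for run in np.split(zero_cells, breaks):
        print('gap of ' + str(run.size) + ' cells starting at cell index ' + str(run[0]))
else:
    print('no all-zero cell columns found')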

The easiest way to reproduce it is to do a simulation that writes out restart and history files every 24 hours for ten days.

For example, I have run a case like this starting from 1 January 2023 on four AMD EPYC Genoa nodes using MPAS-A 8.2.0 with no changes other than to the Makefile. The grid is globally uniform at 24 km. I/O is handled by one node with 192 cores using SMIOL. The code is compiled on the login node (also AMD EPYC Genoa architecture) with gcc-12.2 using:

gfortran: # BUILDTARGET GNU Fortran, C, and C++ compilers
( $(MAKE) all \
"FC_PARALLEL = mpif90" \
"CC_PARALLEL = mpicc" \
"CXX_PARALLEL = mpicxx" \
"FC_SERIAL = gfortran" \
"CC_SERIAL = gcc" \
"CXX_SERIAL = g++" \
"FFLAGS_PROMOTION = -fdefault-real-8 -fdefault-double-8" \
"FFLAGS_OPT = -O3 -ffree-line-length-none -fconvert=big-endian -ffree-form" \
"CFLAGS_OPT = -O3" \
"CXXFLAGS_OPT = -O3" \
"LDFLAGS_OPT = -O3" \
"FFLAGS_DEBUG = -g -ffree-line-length-none -fconvert=big-endian -ffree-form -fcheck=all -fbacktrace -ffpe-trap=invalid,zero,overflow" \
"CFLAGS_DEBUG = -g" \
"CXXFLAGS_DEBUG = -g" \
"LDFLAGS_DEBUG = -g" \
"FFLAGS_OMP = -fopenmp" \
"CFLAGS_OMP = -fopenmp" \
"FFLAGS_ACC =" \
"CFLAGS_ACC =" \
"PICFLAG = -fPIC" \
"BUILD_TARGET = $(@)" \
"CORE = $(CORE)" \
"DEBUG = $(DEBUG)" \
"USE_PAPI = $(USE_PAPI)" \
"OPENMP = $(OPENMP)" \
"OPENACC = $(OPENACC)" \
"CPPFLAGS = $(MODEL_FORMULATION) -D_MPI" )

In this case, I get these results:

Date of restart in 2023 | Time for stream output at restart write (includes diagnostic and history files, ~33 GB total) [seconds] | Restart file good or bad
2 January | 163 | Good
3 January | 384 | Good
4 January | 678 | Bad
5 January | 597 | Bad
6 January | 734 | Bad
7 January | 609 | Bad
8 January | 639 | Good
9 January | 1062 | Bad
10 January | 751 | Bad
11 January | 746 | Bad

In other words, file writing slows down as the simulation goes on, and slower writes are typically associated with bad restart files.

We are currently trying to profile the code to see why output slows down.

If you wish to test restart files for this issue, I have written and attached a more efficient version of the code above that takes the file name of interest on the command line as input.


Python:
import sys
from multiprocessing import Pool

import numpy as np
import xarray as xr

# Usage (example): python check_zero_holes.py restart.2023-01-04_00.00.00.nc
# (the script name is just what I call it locally; the file name is passed on the command line)


def var_eval_hole(n):
    # Same heuristic as the script in my earlier post: clip |values| > 1 to 1, then
    # compare columns whose 5th vertical level is exactly zero against the field
    # as a whole. Differences above 10 per cent are printed.
    var_test = np.squeeze(ds_use[rest_keys[n]]).values
    pczero = 0
    if np.size(np.shape(var_test)) == 2:
        if (np.shape(var_test)[0] > 100000) & (np.shape(var_test)[1] > 4):
            tester = var_test[np.squeeze(np.where(var_test[:, 4] == 0)), :]
            tester_full = np.squeeze(ds_use[rest_keys[n]].values)
            if (np.size(tester_full) > 0) & (np.size(tester) > 0):
                tester[np.nonzero(np.abs(tester) > 1)] = 1
                tester_full[np.nonzero(np.abs(tester_full) > 1)] = 1
                pczero = (100 * (1 - (np.sum(tester) / np.size(tester)))) \
                         - (100 * (1 - (np.sum(tester_full) / np.size(tester_full))))
                if pczero > 10:
                    print(rest_keys[n] + ' ' + str(pczero))
    return pczero


def initialiser(fname):
    # Runs once in each worker (and once in the parent to count the variables):
    # opens the dataset and publishes it and its variable list as globals.
    # The dataset is left open so the workers can read values lazily.
    global ds_use, rest_keys
    ds_use = xr.open_dataset(fname)
    rest_keys = list(ds_use.keys())
    return np.shape(rest_keys)


if __name__ == '__main__':
    print('Restart File Check')
    fileName = sys.argv[1]
    print(fileName)

    with Pool(processes=16, initializer=initialiser, initargs=(fileName,)) as p:
        pctest = p.map(var_eval_hole, range(initialiser(fileName)[0]))

    print('Done')
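As a usage note, the attached script can be swept over all ten daily restarts from the run above with a short driver like the one below (check_zero_holes.py is just a placeholder for whatever you save the attachment as):

Python:
import subprocess
import sys
from datetime import date, timedelta

script = 'check_zero_holes.py'  # placeholder name for the attached checker

start = date(2023, 1, 2)
for day in range(10):
    d = start + timedelta(days=day)
    fname = 'restart.' + d.strftime('%Y-%m-%d') + '_00.00.00.nc'
    print('--- ' + fname + ' ---')
    subprocess.run([sys.executable, script, fname], check=False)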
 