MPAS-A Restart Files (Error Checking Ideas?)

nickheavens-cgg

Dear all,

As you may know, I run MPAS-A as part of a climate model on my company's growing low-latency HPC cluster. I've generally been running short simulations in stable environments, so I have not needed to use restart files very much. But this way of working is changing, and there have been some surprises.

The most recent surprise, which I think will be of general interest, is a situation in which MPAS-A appeared to complete safely (having written out timing statistics) but failed to copy the zgrid variable properly to the restart file. The miscopy was small: 4 cells in the middle of the array were written as zeros. But zeros in zgrid lead to obvious errors that show up immediately in the longwave radiation scheme.

I think the problem may be some bad timing in signalling when writing the restart at the end of the simulation. However, MPAS-A was not designed to interface with OASIS3-MCT (the coupler that is probably at the root of the problem), so I can't expect this problem to be worthy of support.

That said, I wonder whether restart miscopying has been noted in more supported applications, and whether there are additional error-handling techniques or other tips that could be applied to make sure the restart files are completely written before the model stops.
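
A post-run check of the kind asked about here could look something like the minimal sketch below. It assumes the field has already been read into a (nCells, nLevels) NumPy array and that a value of exactly zero at a level above the surface marks a miscopied cell. The helper name `zero_cells`, the choice of level index 4, and the synthetic array are illustrative assumptions, not part of MPAS-A.

```python
import numpy as np


def zero_cells(field, level=4):
    """Return indices of cells whose value at `level` is exactly zero.

    `field` is a (nCells, nLevels) array. The default level is an
    arbitrary choice above the surface, where a height of zero in a
    field like zgrid is not expected (assumption).
    """
    field = np.asarray(field)
    return np.flatnonzero(field[:, level] == 0)


# Synthetic stand-in for zgrid: 10 cells, 6 interface levels.
zgrid = np.tile(np.arange(6) * 500.0, (10, 1))
zgrid[3:7, :] = 0.0  # mimic a small block of miscopied cells
bad = zero_cells(zgrid)
print(bad)
```

With a real file, the array could come from something like `xr.open_dataset(path)['zgrid'].values`, and a non-empty result would be a reason to regenerate the restart before using it.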

Best regards,

Nicholas G. Heavens
Innovation Project Manager, Viridien
West Sussex, UK
 
Dear all,

I'm no longer convinced this error is caused by bad signalling. I still get it when I run the model a few timesteps beyond the restart write, and I have also found similar errors in a history file. I'm looking into various possible causes, but if anyone has experience with slight mistakes in netCDF files written by MPAS-A, I'd appreciate hearing about it.

Nick
 
I now have a solution to this problem. I was able to eliminate it (so far) by changing the clobber mode in streams.atmosphere from "overwrite" to "replace_files". If you think you have this problem, you should try looking for variable miscopying using some Python code that looks like this:

Python:
import xarray as xr
import numpy as np

# Number of cells in my mesh; change this to match yours.
N_CELLS = 1024002

ds_rest = xr.open_dataset('restart.2023-01-02_00.00.00.nc')
ds_hist = xr.open_dataset('/history.2023-01-02_00.00.00.nc')


def excess_pczero(ds):
    """For each (nCells, nLevels) variable, return the percentage of zeros in
    cells that are zero at level index 4 minus the percentage of zeros overall."""
    keys = list(ds.keys())
    pczero = np.zeros(len(keys))
    for ii, key in enumerate(keys):
        var = np.squeeze(ds[key].values)
        # Only check 2-D cell-by-level fields with more than 4 levels.
        if var.ndim != 2 or var.shape[0] != N_CELLS or var.shape[1] <= 4:
            continue
        # Cells whose value at level index 4 is exactly zero are suspect.
        tester = var[var[:, 4] == 0, :]
        if tester.size == 0:
            continue  # no suspect cells; leave the score at 0
        # Clip magnitudes above 1 to 1 so the sum roughly counts nonzero entries.
        tester[np.abs(tester) > 1] = 1
        tester_full = var.copy()
        tester_full[np.abs(tester_full) > 1] = 1
        pczero[ii] = (100 * (1 - np.sum(tester) / np.size(tester))
                      - 100 * (1 - np.sum(tester_full) / np.size(tester_full)))
    return keys, pczero


rest_keys, rest_pczero = excess_pczero(ds_rest)
hist_keys, hist_pczero = excess_pczero(ds_hist)



If hist_pczero or rest_pczero for any variable is much larger than 10%, you may have a few small blocks of zeroes. In a few variables, this is normal behaviour, e.g., nr, tlag, and greenfrac. But if you find it in a field like theta or rho, it will crash the radiative transfer in MPAS-A.
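
To turn those scores into a short list of suspect variables, something like the following sketch may help. It assumes a list of variable names and a matching array of scores, as produced by the check above. The helper name `flag_variables` and the example scores are made up for illustration; the 10% threshold and the three excluded variables are taken from the paragraph above.

```python
import numpy as np

# Variables in which blocks of zeros are normal (from the post above).
EXPECTED_ZEROS = {"nr", "tlag", "greenfrac"}


def flag_variables(keys, pczero, threshold=10.0, expected=EXPECTED_ZEROS):
    """Return (name, score) pairs whose score exceeds `threshold`,
    skipping variables known to contain zeros, highest score first."""
    flagged = [(k, float(p)) for k, p in zip(keys, pczero)
               if p > threshold and k not in expected]
    return sorted(flagged, key=lambda kp: kp[1], reverse=True)


# Made-up scores for illustration only.
keys = ["theta", "rho", "greenfrac", "u"]
scores = np.array([35.0, 0.0, 60.0, 12.5])
result = flag_variables(keys, scores)
for name, score in result:
    print(f"{name}: {score:.1f}%")
```

An empty result does not prove a file is clean; it only means this particular test found nothing above the threshold.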
 
Here is a further update to this saga. I crashed the model in the midst of writing a restart once more and soon had the same corruption problem when writing restarts. I changed the clobber settings back to overwrite and the files wrote correctly once more.
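
Since the fix has now involved switching the clobber setting in both directions, it can be useful to confirm which mode each stream is actually using before a run. The sketch below reads a streams file as XML and reports the setting per stream. It assumes the usual layout of a `<streams>` root with `<stream>` and `<immutable_stream>` children that carry `name` and `clobber_mode` attributes. The example file contents are hypothetical and trimmed down, and `clobber_modes` is an illustrative helper, not an MPAS tool.

```python
import xml.etree.ElementTree as ET


def clobber_modes(xml_text):
    """Map each stream name to its clobber_mode attribute (None if unset)."""
    root = ET.fromstring(xml_text)
    return {el.get("name"): el.get("clobber_mode") for el in root.iter()
            if el.tag in ("stream", "immutable_stream")}


# Hypothetical, trimmed-down streams file for illustration.
example = """
<streams>
  <immutable_stream name="restart" type="input;output"
                    filename_template="restart.$Y-$M-$D_$h.$m.$s.nc"
                    clobber_mode="replace_files"/>
  <stream name="output" type="output"
          filename_template="history.$Y-$M-$D_$h.$m.$s.nc"/>
</streams>
"""
modes = clobber_modes(example)
print(modes)
```

For a real run, the text could be read with `open('streams.atmosphere').read()`. A value of `None` means the stream does not set the attribute and falls back to the model default.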


Nick
 