nickheavens-cgg
Dear all,
As you may know, I run MPAS-A as part of a climate model on my company's growing low-latency HPC cluster. I've generally been running short simulations in stable environments, so I have not needed to use restart files very much. But this way of working is changing, and there have been some surprises.
The most recent surprise, which I think will be of general interest, is a situation in which MPAS-A appeared to complete safely (having written out timing statistics) but failed to copy the zgrid variable properly to the restart file. The copy mistake was small, consisting of 4 cells in the middle of the array that were written as zeros. But zeros in zgrid lead to obvious errors that show up immediately in the longwave radiation scheme.
I think the problem may be some bad timing in signalling when writing the restart at the end of the simulation. However, MPAS-A was not designed to interface with OASIS3-MCT (the coupler that is probably at the root of the problem), so I can't expect this problem to be worthy of support.
That said, I wonder whether restart miscopying has been noted in more supported applications, and whether there are additional error-handling techniques or other tips that could be applied to make sure the restart files are completely written before the model stops.
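For context, the kind of post-run check I have in mind is sketched below in Python. It is only a sketch under assumptions: it takes zgrid to hold interface heights laid out as (nCells, nVertLevelsP1), so that each cell's column should be finite and strictly increasing with height, and a column of zeros fails that test. The function name find_bad_columns and the toy data are hypothetical, and reading the variable from the actual restart file (for example with a netCDF library) is left out.

```python
import math


def find_bad_columns(zgrid):
    """Return indices of cells whose height column looks corrupted.

    zgrid is assumed to be a sequence of per-cell columns of interface
    heights, ordered from the surface upward. A healthy column is finite
    and strictly increasing; a column of zeros is not.
    """
    bad = []
    for cell, column in enumerate(zgrid):
        finite = all(math.isfinite(z) for z in column)
        increasing = all(lo < hi for lo, hi in zip(column, column[1:]))
        if not (finite and increasing):
            bad.append(cell)
    return bad


# Toy data standing in for zgrid read from a restart file (hypothetical values).
good = [0.0, 50.0, 120.0, 250.0]
zgrid = [good, good, [0.0, 0.0, 0.0, 0.0], good]

print(find_bad_columns(zgrid))  # prints [2]
```

Running something like this on every restart file before resubmitting would catch the zeroed cells before they reach the longwave radiation scheme, though it would not explain why the write went wrong.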
Best regards,
Nicholas G. Heavens
Innovation Project Manager, Viridien
West Sussex, UK