nickheavens-cgg
New member
For as long as I have been working with MPAS-A (since May 2023), I have been experiencing a problem in which the model hangs while writing output. This can happen a few hours or several days into a simulation. The symptoms are no error messages in the log and an output file that is smaller than expected. The model keeps running without doing anything further until it is forcibly terminated. The problem is not reproducible: restarting the simulation usually gets past the point at which the hang occurred, but the same basic problem can crop up with a later (or sometimes earlier) output file.
I have two queries associated with this behaviour.
1. Is this likely to be a problem stemming from using netcdf4 as io_type? For reasons having to do with my environment, which are correctable with some effort, I am using libraries that are incompatible with NCAR ParallelIO. I'm hoping someone has seen this problem before and has some insight. If no one has encountered this problem before, I may conclude that ParallelIO is far more reliable and should be used in all circumstances.
2. Does anyone have a suggested strategy for forcing the output file to try again if there is a hang like this? I'd prefer it to be somehow progress-based rather than time-based, because the restart files can take a long time to write. I'm not looking for a worked solution, just any rough ideas.
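To make the second question more concrete, here is a rough sketch of the kind of progress-based check I have in mind. It is only an illustration, not something I have run against MPAS-A: the class and function names are made up, and it assumes that a healthy write keeps changing the size or modification time of the output file (or the log) at least once per idle window, which may not hold for every netcdf4 write pattern.

```python
import os
from dataclasses import dataclass
from typing import Iterable, Optional, Tuple

Signature = Tuple[Tuple[int, int], ...]


def file_signature(paths: Iterable[str]) -> Signature:
    """Return (size, mtime_ns) for each watched file; missing files count as (0, 0)."""
    sig = []
    for path in paths:
        try:
            st = os.stat(path)
            sig.append((st.st_size, st.st_mtime_ns))
        except FileNotFoundError:
            sig.append((0, 0))
    return tuple(sig)


@dataclass
class StallDetector:
    """Hypothetical helper: reports a stall when nothing has changed for too long.

    The clock resets every time the signature changes, so a slow but
    steadily growing restart file is never treated as a hang.
    """
    max_idle_seconds: float
    _last_signature: Optional[tuple] = None
    _last_change: Optional[float] = None

    def update(self, signature: tuple, now: float) -> bool:
        """Record an observation; return True if idle longer than the limit."""
        if signature != self._last_signature:
            # Progress was made (or this is the first observation).
            self._last_signature = signature
            self._last_change = now
            return False
        return (now - self._last_change) > self.max_idle_seconds
```

A wrapper script could poll `file_signature([...])` on the current output file and the log every minute or so while the model runs, and if `update()` returns True, terminate the job and relaunch it from the most recent complete restart file.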
Best regards,
Nick