I'm having trouble with a restart run in WRF. The initial run is fine until it reaches the Slurm job time limit, but the restart run fails with the following error (always in rsl.error.0128 and rsl.error.0256):
MPICH ERROR [Rank 128] [job id 4908348.0] [Mon Nov 20 14:49:35 2023] [nid003702] - Abort(201352719) (rank 128 in comm 0): Fatal error in MPIR_CRAY_Bcast_Tree: Other MPI error, error stack:
MPIR_CRAY_Bcast_Tree(183): message sizes do not match across processes in the collective routine: Received 4 but expected 2100
This looks like a compilation error, but I don't understand how it can be when the non-restart run is fine (and can run past this time).
Has anyone come across this before? Any help would be much appreciated.
I've attached the namelist and rsl.error.0128, renamed to rsl.error.0000 (only permissible attachment).