Restart run error: Fatal error in MPIR_CRAY_Bcast_Tree:

epotter1

New member
I'm having trouble with a restart run in WRF. The initial run is fine until it hits the Slurm submission time limit, but the restart run gives the following error (always in rsl.error.0128 and rsl.error.0256):

Code:
MPICH ERROR [Rank 128] [job id 4908348.0] [Mon Nov 20 14:49:35 2023] [nid003702] - Abort(201352719) (rank 128 in comm 0): Fatal error in MPIR_CRAY_Bcast_Tree: Other MPI error, error stack:
MPIR_CRAY_Bcast_Tree(183): message sizes do not match across processes in the collective routine: Received 4 but expected 2100

This looks like a compilation error, but I don't understand how that can be the case when the non-restart run is fine (and can run past this time).

Has anyone come across this before? Any help would be much appreciated.

I've attached the namelist and rsl.error.0128, renamed to rsl.error.0000 (the only permissible attachment name).
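
For context, the restart-related settings in my namelist look roughly like this (the exact values are in the attached file; the interval here is just illustrative):

Code:
&time_control
 restart          = .true.,
 restart_interval = 360,
/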
 

Attachments

  • namelist.input
    6.9 KB
  • rsl.error.0000
    2.5 KB

kwerner

Administrator
Staff member
Hi,
This is not a compilation error. You're right that since you were able to run the initial case, everything was compiled correctly. Can you do a test for me?
Do a non-restart test, starting from a few hours prior to the start time of your restart run (e.g., start from 2019-05-02_12:00:00), and then run for maybe 24-36 hours, just to make sure the data are okay for this time period. You'll need to re-run real.exe for this time period as well, so that your initial condition file matches the new initial time. I assume that will be okay.

If so, please package all the rsl* files from the failed restart run into a single *.tar file and attach that here so I can take a look.
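
For example, something like this from your run directory (the exact archive name doesn't matter):

Code:
tar -cf restart_rsl.tar rsl.out.* rsl.error.*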

Will you also issue the following?

Code:
ls -hail wrfrst* >& rst_size.txt

and attach that rst_size.txt file, as well? Please also check that you do have enough disk space to write any additional large files in your running directory. Thanks!
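
If you're unsure how to check the space, something along these lines, run from your running directory, will show the free space on that file system (your HPC center may also provide its own quota tools):

Code:
df -h .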
 

epotter1

New member
Hi,

Thanks very much for your reply. Sorry it's taken a while to respond; even test runs take a while to get through on ARCHER2 (the HPC system I'm using). I've put the rst_size.txt file in the original_wrf_run... folder.

The whole setup works fine if I run on 1 node (with 128 tasks, so still in parallel), but it crashes on 2 nodes (crash in rsl.error.0128) and on 3 nodes (crashes in rsl.error.0128 and rsl.error.0256). I realise this is probably an issue with the exact setup on ARCHER2, but the ARCHER2 team are unable to help because they are not WRF specialists. Is there a way to set the domain partitioning in the namelist, and do you think that might help? I can't work out why the error always occurs on the first task of each node after the first.
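
For what it's worth, I was imagining something like the following in &domains, though I'm not sure whether these are even the right variables, which is partly why I'm asking (values just illustrative for 256 tasks):

Code:
&domains
 nproc_x = 16,
 nproc_y = 16,
/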

Many thanks,

Emily
 

Attachments

  • restart_wrf_run_rsl_files.zip
    905.6 KB
  • original_wrf_run_rsl_files.zip
    2 MB