Restart run error: Fatal error in MPIR_CRAY_Bcast_Tree:

epotter1

New member
I'm having trouble with a restart run in WRF. The initial run is fine until it hits the Slurm submission time limit, but the restart run gives the following error (always in rsl.error.0128 and rsl.error.0256):

MPICH ERROR [Rank 128] [job id 4908348.0] [Mon Nov 20 14:49:35 2023] [nid003702] - Abort(201352719) (rank 128 in comm 0): Fatal error in MPIR_CRAY_Bcast_Tree: Other MPI error, error stack:
MPIR_CRAY_Bcast_Tree(183): message sizes do not match across processes in the collective routine: Received 4 but expected 2100

This looks like a compilation error, but I don't understand how that can be when the non-restart run works fine (and runs past this point in the simulation).

Has anyone come across this before? Any help would be much appreciated.

I've attached the namelist and rsl.error.0128, renamed to rsl.error.0000 (the only permissible attachment name).
 

Attachments

  • namelist.input
    6.9 KB · Views: 1
  • rsl.error.0000
    2.5 KB · Views: 1
Hi,
This is not a compilation error. You're right that since you were able to run the initial case, everything was compiled correctly. Can you do a test for me?
Do a non-restart test, starting from a few hours prior to the start time of your restart run (e.g., start from 2019-05-02_12:00:00) and then run for maybe 24-36 hours, just to make sure the data is okay for this time period. You'll need to re-run real.exe for this time period, as well, so that your initial condition file matches the initial time. I assume that will be okay. If so, please package all the rsl* files from the failed restart run into a single *.tar file and attach that here so I can take a look.
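
For reference, the &time_control changes for that test would look something like the sketch below (shown for a single domain and with an illustrative 36-hour window; your real namelist will have per-domain values, and real.exe needs the same start/end times):

Code:
&time_control
 start_year  = 2019,
 start_month = 05,
 start_day   = 02,
 start_hour  = 12,
 end_year    = 2019,
 end_month   = 05,
 end_day     = 04,
 end_hour    = 00,
 restart     = .false.,   ! non-restart test
/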

Will you also issue the following?

Code:
ls -hail wrfrst* >& rst_size.txt

and attach that rst_size.txt file, as well? Please also check that you do have enough disk space to write any additional large files in your running directory. Thanks!
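
A quick way to check the free space is to run df from your running directory, e.g.:

Code:
df -h .    # free space on the filesystem that holds your running directory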
 
Hi,

Thanks very much for your reply. Sorry it's taken a while to respond; even test runs take a while on ARCHER2 (HPC). I've put the rst_size.txt file in the original_wrf_run... folder. The whole setup works fine if I run on 1 node (with 128 tasks, so still in parallel), but it crashes on 2 nodes (crash in rsl.error.0128) and on 3 nodes (crashes in rsl.error.0128 and rsl.error.0256). I realise this is probably an issue with the exact setup on ARCHER2, but the ARCHER2 support team are unable to help as they are not WRF specialists. Is there a way to set the processor partitioning in the namelist, and do you think that might help? I can't work out why the error is always on the first task of each node after the first.

Many thanks,

Emily
 

Attachments

  • restart_wrf_run_rsl_files.zip
    905.6 KB · Views: 0
  • original_wrf_run_rsl_files.zip
    2 MB · Views: 0
Hi Emily,

You're right that if the model runs fine within a single node, then unfortunately it's probably an issue specific to your system. Are you able to run other (non-WRF) jobs across multiple nodes without any problems? Are you able to run with fewer processors on each node (e.g., 2 nodes with 64 processors each, instead of the full 128), just as a test? If so, then you could potentially run across several nodes as long as you aren't using all of the processors on each node. Unfortunately I'm not a software engineer and am not familiar with your system, so I'm probably not able to help out too much.
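
If it helps, the Slurm directives for that 2-node/64-task test would look something like this sketch. I'm not familiar with ARCHER2 specifically, so the job name, partition, QOS, account, and wall-time lines are placeholders you'd need to replace with whatever you normally use:

Code:
#!/bin/bash
#SBATCH --job-name=wrf_2x64_test       # placeholder name for the test job
#SBATCH --nodes=2                      # two nodes...
#SBATCH --ntasks-per-node=64           # ...with only 64 MPI tasks each (half the cores)
#SBATCH --cpus-per-task=1
#SBATCH --time=02:00:00                # placeholder wall time
#SBATCH --partition=<your_partition>   # placeholder - site-specific
#SBATCH --qos=<your_qos>               # placeholder - site-specific
#SBATCH --account=<your_account>       # placeholder - site-specific

srun ./wrf.exe    # srun inherits the node/task layout from the directives above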

There is an option in the namelist that allows you to set up the number of processors in each direction, but not the number of nodes used. If you're interested, you can try the nproc_x and nproc_y options in the &domains record.
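
Just as an illustration (the exact split is an assumption; the product of nproc_x and nproc_y has to equal the total number of MPI tasks you launch), a 256-task run could be decomposed like this:

Code:
&domains
 nproc_x = 16,   ! MPI tasks in the west-east direction
 nproc_y = 16,   ! MPI tasks in the south-north direction; 16 x 16 = 256 tasks
/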
 