
MPI_LAUNCH_TIMEOUT

cwrenn

hey, WRF & MPAS Forum!
occasionally, instead of running ./real.exe directly, i submit it as a batch script (PBS) to cut the run time (1 hour --> 5 minutes), and it works fine.
the files i'm working with now are so large that i tried to do the same thing with ./ungrib.exe, i.e., submitting it as a batch script, but it times out after about a minute of execution and returns the following error:

MPT: Launcher network accept (MPI_LAUNCH_TIMEOUT) timed out
MPT: Launcher on r12i6n26 failed to receive connection(s) from: r12i6n26.ib0.cheyenne.ucar.edu r12i6n23.ib0.cheyenne.ucar.edu
MPT: MPT ERROR: Check network connectivity between hosts.
Retry after increasing value of MPI_LAUNCH_TIMEOUT.
See MPI(1) for details.

MPT ERROR: could not launch executable
(HPE MPT 2.25 08/14/21 03:06:24)
/var/spool/pbs/mom_priv/jobs/4434641.chadmin1.ib0.cheyenne.ucar.edu.SC: line 16: 43005 Killed mpiexec_mpt ./ungrib.exe

i've tried a couple of things, but nothing seems to work. e.g., i increased MPI_LAUNCH_TIMEOUT, but its max is 60 seconds.
i also tried adding :nodetype=largemem to the "#PBS -l select" line, but it had no effect.
Any suggestions?

note: when i run ./ungrib.exe directly from the command line, i expect it to take about 3 hours, or to time out after about an hour or so
batch script (PBS):
#!/bin/bash
#PBS -N ungrib_e5
#PBS -l walltime=06:00:00
#PBS -q economy
#PBS -j oe
#PBS -k eod
#PBS -l select=2:ncpus=36:mpiprocs=36:nodetype=largemem
#PBS -m abe
###
export TMPDIR=/glade/scratch/wrenn/temp
export MPI_LAUNCH_TIMEOUT=60

mkdir -p $TMPDIR
###
cd /glade/scratch/wrenn/WRFV3.7_2/WPS-3.7
mpiexec_mpt ./ungrib.exe

source: MPT Startup Failures: Workarounds - HECC Knowledge Base
 
Hi,
The reason you are seeing this issue is that ungrib must be run serially. Even if you compiled WPS with a distributed-memory option, the ungrib program can only use serial processing. You can still use a batch script for this, but you will need to request only one processor.
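
For illustration, a serial version of the script from the original post might look like the sketch below. It has not been tested. It assumes the same paths, queue, and walltime as the original post, requests a single core with select=1:ncpus=1, and calls ./ungrib.exe directly instead of through mpiexec_mpt. The reasoning for dropping mpiexec_mpt is an assumption on my part: a serial executable most likely never connects back to the MPT launcher, which would explain the MPI_LAUNCH_TIMEOUT message. Whether the economy queue and the default memory are a good fit for a one-core job is worth checking against your site's documentation.

```shell
#!/bin/bash
#PBS -N ungrib_e5
#PBS -l walltime=06:00:00
#PBS -q economy
#PBS -j oe
#PBS -k eod
# One chunk with one core, since ungrib is serial (assumed sufficient for this job)
#PBS -l select=1:ncpus=1
#PBS -m abe

# Scratch directory for temporary files (path copied from the original post)
export TMPDIR=/glade/scratch/wrenn/temp
mkdir -p "$TMPDIR"

# WPS directory copied from the original post; stop if it is missing
cd /glade/scratch/wrenn/WRFV3.7_2/WPS-3.7 || exit 1

# Run the serial executable directly, without the MPI launcher
./ungrib.exe
```

With no MPI launcher involved, the MPI_LAUNCH_TIMEOUT setting should no longer matter, so it is left out here.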
 