Hey, WRF & MPAS Forum!
Occasionally, instead of running ./real.exe directly, I submit it as a PBS batch script to cut the run time (1 hour --> 5 minutes), and that works fine.
The files I'm working with now are so large that I tried the same thing with ./ungrib.exe, i.e., submitting it as a batch script, but the job fails after about a minute of execution and returns the following error:
MPT: Launcher network accept (MPI_LAUNCH_TIMEOUT) timed out
MPT: Launcher on r12i6n26 failed to receive connection(s) from: r12i6n26.ib0.cheyenne.ucar.edu r12i6n23.ib0.cheyenne.ucar.edu
MPT: MPT ERROR: Check network connectivity between hosts.
Retry after increasing value of MPI_LAUNCH_TIMEOUT.
See MPI(1) for details.
MPT ERROR: could not launch executable
(HPE MPT 2.25 08/14/21 03:06:24)
/var/spool/pbs/mom_priv/jobs/4434641.chadmin1.ib0.cheyenne.ucar.edu.SC: line 16: 43005 Killed mpiexec_mpt ./ungrib.exe
I've tried a couple of things, but nothing seems to work. For example, I increased MPI_LAUNCH_TIMEOUT, but its maximum is 60 seconds.
I also tried appending :nodetype=largemem to the select line (the sixth #PBS directive), but it had no effect.
Any suggestions?
Note: when I run ./ungrib.exe directly from the command line, I expect it to take about 3 hours, or to time out after an hour or so.
PBS script:
#!/bin/bash
#PBS -N ungrib_e5
#PBS -l walltime=06:00:00
#PBS -q economy
#PBS -j oe
#PBS -k eod
#PBS -l select=2:ncpus=36:mpiprocs=36:nodetype=largemem
#PBS -m abe
###
export TMPDIR=/glade/scratch/wrenn/temp
export MPI_LAUNCH_TIMEOUT=60
mkdir -p $TMPDIR
###
cd /glade/scratch/wrenn/WRFV3.7_2/WPS-3.7
mpiexec_mpt ./ungrib.exe
source: MPT Startup Failures: Workarounds - HECC Knowledge Base
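One variation that might be worth comparing against, sketched under an assumption this post does not confirm: ungrib.exe is normally built as a serial program, so it may not need mpiexec_mpt or more than one node at all. The snippet below only writes a single-process version of the script above to a file. The file name ungrib_serial.pbs, the select=1:ncpus=1 request, and dropping the nodetype and MPI_LAUNCH_TIMEOUT settings are all illustrative guesses, not a tested fix.

```shell
#!/bin/bash
# Sketch: write a single-process PBS job script for ungrib.exe.
# Assumption: ungrib.exe is a serial executable, so it is started directly
# instead of through mpiexec_mpt, and one core on one node is requested.
job_script=ungrib_serial.pbs   # hypothetical file name

# Quoted 'EOF' keeps $TMPDIR unexpanded so it is evaluated when the job runs.
cat > "$job_script" <<'EOF'
#!/bin/bash
#PBS -N ungrib_e5
#PBS -l walltime=06:00:00
#PBS -q economy
#PBS -j oe
#PBS -k eod
#PBS -l select=1:ncpus=1
#PBS -m abe
export TMPDIR=/glade/scratch/wrenn/temp
mkdir -p "$TMPDIR"
# Stop early if the WPS directory is not reachable from the compute node.
cd /glade/scratch/wrenn/WRFV3.7_2/WPS-3.7 || exit 1
./ungrib.exe
EOF

echo "wrote $job_script"
```

The generated file would then be submitted with qsub ungrib_serial.pbs. If ungrib.exe really is serial, this would run no faster than the interactive command, but it would run under the job's 6-hour walltime, which is longer than the 3 hours the run is expected to need.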