MPI Error Running real.exe for a Domain Configuration Stable on Cheyenne

ejones

Hello, I posted this question in the WRF general forum (Running WRF Domain Configuration on Derecho That Was Stable on Cheyenne) but have not had much response there, so I thought I would try posting here:

I am attempting to run a WRF simulation with three nested domains on Derecho that was stable on Cheyenne. I used 16 nodes on Cheyenne for a total of 576 processes. However, when I attempt to run a similar configuration on Derecho, real.exe fails on the third, innermost domain (the one with the highest resolution). I first tried the exact same domain specification I used on Cheyenne, but that failed with these MPI errors in tail.rsl.0000:

MPIDI_OFI_send_normal(368)..: OFI tagged senddata failed (ofi_send.h:368:MPIDI_OFI_send_normal:Invalid argument)
aborting job:
Fatal error in PMPI_Scatterv: Other MPI error, error stack:
PMPI_Scatterv(416)..........: MPI_Scatterv(sbuf=0x14f62a2bd020, scnts=0xd74f870, displs=0x9ec8f90, MPI_CHAR, rbuf=0x7ffc2eb0cc60, rcount=4380288, MPI_CHAR, root=0, comm=comm=0xc4000000) failed
MPIR_CRAY_Scatterv(462).....:
MPIC_Isend(511).............:
MPID_Isend_coll(610)........:
MPIDI_isend_coll_unsafe(176):
MPIDI_OFI_send_normal(368)..: OFI tagged senddata failed (ofi_send.h:368:MPIDI_OFI_send_normal:Invalid argument)

I next tried loading some specific modules at the recommendation of CISLhelp:
module --force purge
module load ncarenv/23.06 intel-oneapi/2023.0.0
module load ncarcompilers/1.0.0
module load netcdf/4.9.2 hdf5/1.12.2
module load craype/2.7.20 cray-mpich/8.1.25
mpiexec -n 640 -ppn 128 ./real.exe

That gave the same errors. I then removed the inner domain and was able to run real.exe and wrf.exe for the outer two, coarser domains. Next I tried running only the original high-resolution inner domain (so a 1-domain simulation), but real.exe failed with the same errors.

Based on the advice of this post (WRFv4.5.2 on Derecho for very large nx,ny) I tried reducing the size of my inner domain to 1498 x 1498 grid points, and it ran successfully! So there seems to be some problem handling a large domain size. Unfortunately, 1498 x 1498 grid points is not exactly what I need. I am somewhat perplexed as to why this is happening, since dividing the grid into even chunks for processing does not seem to be the issue when running real.exe or wrf.exe.

I look forward to hearing back about possible other tactics to try from the WRF forum side! Thanks for any insight you can provide!

Best,
Evan
 
Unfortunately I can't say why an identical simulation worked properly on Cheyenne, but not Derecho. It could have to do with the differing libraries, compilers, etc. That would have to be a question for the CISL support group at NCAR.

Regarding your domain set-up, though, if you want to keep the large size for d03, an option would be to use the ndown program (scroll down a bit to the ndown section on that page) to run your d03, after running d01 and d02, since d03 needs more processors. In case you haven't already seen this, it may be helpful to read Choosing an Appropriate Number of Processors - just to know how many processors you may need for each domain.
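
As a rough illustration of that processor guidance, here is a minimal sketch. It assumes the rule of thumb as I recall it from that page: use at least (e_we/100) * (e_sn/100) processors and at most (e_we/25) * (e_sn/25) for a given domain. Please check the page itself before relying on these numbers. The function name processor_bounds is made up for this example, and 1498 x 1498 is the reduced d03 size mentioned above, not the original d03 size (which is not given in this thread).

```python
def processor_bounds(e_we, e_sn):
    """Return (fewest, most) MPI tasks for one domain.

    Assumed rule of thumb (verify against the forum page):
      fewest = (e_we / 100) * (e_sn / 100), rounded up
      most   = (e_we / 25)  * (e_sn / 25),  rounded down
    """
    points = e_we * e_sn
    fewest = -(-points // (100 * 100))  # integer ceiling division
    most = points // (25 * 25)          # integer floor division
    return fewest, most


if __name__ == "__main__":
    # Reduced inner domain from this thread: 1498 x 1498 grid points.
    fewest, most = processor_bounds(1498, 1498)
    print(f"d03 1498 x 1498: between {fewest} and {most} MPI tasks")
```

Under that assumption, the 640 tasks from the mpiexec line above fall inside the suggested range for the reduced domain. Running the same check on d01 and d02 should show whether one task count can suit all three domains, or whether running d03 separately with ndown makes more sense.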
 