Hello, I posted this question in the WRF general forum (Running WRF Domain Configuration on Derecho That Was Stable on Cheyenne) but have not had much response there, so I thought I would try posting here:
I am attempting to run WRF on Derecho for a simulation with three nested domains that was stable on Cheyenne. On Cheyenne I used 16 nodes for a total of 576 processes. However, when I run a similar configuration on Derecho, real.exe fails on the third, innermost domain (the one with the highest resolution). I first tried the exact same domain specification I used on Cheyenne, but that failed with these MPI errors in tail.rsl.0000:
MPIDI_OFI_send_normal(368)..: OFI tagged senddata failed (ofi_send.h:368:MPIDI_OFI_send_normal:Invalid argument)
aborting job:
Fatal error in PMPI_Scatterv: Other MPI error, error stack:
PMPI_Scatterv(416)..........: MPI_Scatterv(sbuf=0x14f62a2bd020, scnts=0xd74f870, displs=0x9ec8f90, MPI_CHAR, rbuf=0x7ffc2eb0cc60, rcount=4380288, MPI_CHAR, root=0, comm=comm=0xc4000000) failed
MPIR_CRAY_Scatterv(462).....:
MPIC_Isend(511).............:
MPID_Isend_coll(610)........:
MPIDI_isend_coll_unsafe(176):
MPIDI_OFI_send_normal(368)..: OFI tagged senddata failed (ofi_send.h:368:MPIDI_OFI_send_normal:Invalid argument)
I next tried loading some specific modules at the recommendation of CISLhelp:
module --force purge
module load ncarenv/23.06 intel-oneapi/2023.0.0
module load ncarcompilers/1.0.0
module load netcdf/4.9.2 hdf5/1.12.2
module load craype/2.7.20 cray-mpich/8.1.25
mpiexec -n 640 -ppn 128 ./real.exe
That gave the same errors. I then removed the inner domain and was able to run real.exe and wrf.exe for the two outer, coarser domains. Next I tried running only the original high-resolution inner domain (a 1-domain simulation), but real.exe failed with the same errors.
Based on the advice of this post (WRFv4.5.2 on Derecho for very large nx,ny) I tried reducing the size of my inner domain to 1498 x 1498 grid points, and it ran successfully! So the failure seems to be tied to the size of the domain: above some size, real.exe fails on Derecho. Unfortunately, 1498 x 1498 grid points is not exactly what I need. I am somewhat perplexed as to why this is happening, since the problem does not appear to be one of dividing the grid into even chunks across processes when running real.exe or wrf.exe.
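One rough check that may help narrow this down is to compare the total size of the failing scatter against the 32-bit integer limit. This is only a sketch, not a confirmed diagnosis. MPI_Scatterv takes its send counts and displacements as C int arrays, so a send buffer larger than 2^31 - 1 bytes cannot be addressed by those displacements on platforms where int is 32 bits. The Python below assumes that every rank receives roughly the rcount=4380288 bytes shown in the error stack, and that the job uses the 640 ranks from the mpiexec line. Both are assumptions: the real per-rank counts can differ, and the function names are made up for illustration.

```python
# Rough size check for the failing MPI_Scatterv call.
# ASSUMPTION: every rank receives about `rcount` bytes (the value printed in
# the error stack) and the job runs on the rank count from the mpiexec line.
# The true per-rank counts may differ, so treat the result as an estimate.

INT32_MAX = 2**31 - 1  # largest value a 32-bit C int count/displacement holds


def scatter_total_bytes(rcount_bytes: int, n_ranks: int) -> int:
    """Estimated size of the root's send buffer if all ranks get rcount_bytes."""
    return rcount_bytes * n_ranks


def max_ranks_within_int32(rcount_bytes: int) -> int:
    """Largest rank count whose estimated total still fits in a 32-bit int."""
    return INT32_MAX // rcount_bytes


if __name__ == "__main__":
    rcount = 4380288  # from "rcount=4380288, MPI_CHAR" in the error stack
    ranks = 640       # from "mpiexec -n 640 -ppn 128 ./real.exe"

    total = scatter_total_bytes(rcount, ranks)
    print(f"estimated send buffer: {total} bytes")
    print(f"32-bit int limit:      {INT32_MAX} bytes")
    print(f"exceeds limit:         {total > INT32_MAX}")
    print(f"max ranks under limit: {max_ranks_within_int32(rcount)}")
```

If the estimate is in the right range, it would be consistent with the smaller 1498 x 1498 domain succeeding, since a smaller domain means a smaller buffer for each scattered field. It does not establish that this is the actual cause.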
I look forward to hearing back about possible other tactics to try from the WRF forum side! Thanks for any insight you can provide!
Best,
Evan