Running WRF Domain Configuration on Derecho That Was Stable on Cheyenne

ejones

Hello,

I am attempting to run WRF on Derecho for a nested simulation with three domains that was stable on Cheyenne. I used 16 nodes on Cheyenne for a total of 576 processes. However, when I attempt to run a similar configuration on Derecho, real.exe fails on the third, innermost domain (which has the highest resolution).

I am wondering if there is a way to get this configuration, which previously worked on Cheyenne, running stably without having to change anything about the domain setup. I am aware of posts on here about optimizing the number of processes (Choosing an Appropriate Number of Processors); a rough sketch of that rule-of-thumb arithmetic is at the end of this post. I also tried using a similar number of total processes on Derecho to what I used on Cheyenne (128 cores per node * 5 nodes = 640), but the third domain still failed. The following is a snippet from the tail of the rsl.error file for real.exe:

MPIDI_OFI_send_normal(368)..: OFI tagged senddata failed (ofi_send.h:368:MPIDI_OFI_send_normal:Invalid argument)
aborting job:
Fatal error in PMPI_Scatterv: Other MPI error, error stack:
PMPI_Scatterv(416)..........: MPI_Scatterv(sbuf=0x14f62a2bd020, scnts=0xd74f870, displs=0x9ec8f90, MPI_CHAR, rbuf=0x7ffc2eb0cc60, rcount=4380288, MPI_CHAR, root=0, comm=comm=0xc4000000) failed
MPIR_CRAY_Scatterv(462).....:
MPIC_Isend(511).............:
MPID_Isend_coll(610)........:
MPIDI_isend_coll_unsafe(176):
MPIDI_OFI_send_normal(368)..: OFI tagged senddata failed (ofi_send.h:368:MPIDI_OFI_send_normal:Invalid argument)

When I tried 9 nodes, that turned out to be too many processes for my outer domain to run. Do you have any insight into how WRF scales on Derecho versus Cheyenne, and how to get a stable configuration? I have attached a zip of my namelist.input, rsl.error, rsl.out, and runrealc.sh files. Thanks for any assistance you can provide!
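
For reference, here is the rule-of-thumb arithmetic I am working from, based on my reading of that processor-count post. This is only a rough sketch, and the e_we/e_sn values below are placeholders rather than my actual namelist values:

#!/bin/bash
# Rule of thumb as I understand it from the "Choosing an Appropriate Number of
# Processors" post: the domain with the fewest grid points caps how many MPI
# ranks you should use, and the domain with the most grid points sets the floor.
# The dimensions here are hypothetical, just to show the calculation.
E_WE_SMALL=400 ; E_SN_SMALL=400      # hypothetical dimensions of the smallest domain
E_WE_LARGE=1000 ; E_SN_LARGE=1200    # hypothetical dimensions of the largest domain
echo "most ranks   ~ $(( (E_WE_SMALL / 25) * (E_SN_SMALL / 25) ))"
echo "fewest ranks ~ $(( (E_WE_LARGE / 100) * (E_SN_LARGE / 100) ))"

With those placeholder numbers this prints a ceiling of 256 ranks and a floor of 120; the idea is to plug in the real e_we/e_sn values from namelist.input.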

Evan
 

Attachments

  • derecho_files.zip
    3 MB

Just following up on this: I was able to run a successful WRF test simulation all the way through when I removed the inner domain that it was failing on. That indicates to me that there really is something going on with how the grid of that inner domain is being split up across the MPI processes. I am not sure how to get that inner domain stable without trial and error, which seems like it would take a while. Do you have any other ideas of what I could try?
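
In case it is useful, this is a quick way to confirm what decomposition real.exe actually used, assuming the Derecho rsl files still contain the usual "Ntasks in X / ntasks in Y" line (I have not double-checked the exact wording in this build, so treat the grep pattern as an assumption):

#!/bin/bash
# Check how many MPI ranks actually ran (one rsl.error file per rank) and what
# X/Y task split the rank-0 log reports, if that line is present in this build.
ls rsl.error.* | wc -l
grep -i "ntasks" rsl.out.0000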
 
Following up on this:
I went ahead and changed the inner domain to 1960 x 2128 grid points, so that the total number of grid points is evenly divisible by 128 (the decomposition arithmetic I was checking is sketched after the error output below). Unfortunately, when I went through all the steps and re-ran up to real.exe, I still got the same MPI errors:

cxil_map: write error
MPICH ERROR [Rank 0] [job id f58b1d16-e860-4acd-b255-ef4f929a3411] [Mon May 13 19:02:24 2024] [dec1326] - Abort(136960015) (rank 0 in comm 0): Fatal error in PMPI_Scatterv: Other MPI error, error stack:
PMPI_Scatterv(416)..........: MPI_Scatterv(sbuf=0x14dcf8d29020, scnts=0x9ff54f0, displs=0x9fe25e0, MPI_CHAR, rbuf=0x7ffeb674a9e0, rcount=2193312, MPI_CHAR, root=0, comm=comm=0xc4000000) failed
MPIR_CRAY_Scatterv(462).....:
MPIC_Isend(511).............:
MPID_Isend_coll(610)........:
MPIDI_isend_coll_unsafe(176):
MPIDI_OFI_send_normal(368)..: OFI tagged senddata failed (ofi_send.h:368:MPIDI_OFI_send_normal:Invalid argument)

aborting job:
Fatal error in PMPI_Scatterv: Other MPI error, error stack:
PMPI_Scatterv(416)..........: MPI_Scatterv(sbuf=0x14dcf8d29020, scnts=0x9ff54f0, displs=0x9fe25e0, MPI_CHAR, rbuf=0x7ffeb674a9e0, rcount=2193312, MPI_CHAR, root=0, comm=comm=0xc4000000) failed
MPIR_CRAY_Scatterv(462).....:
MPIC_Isend(511).............:
MPID_Isend_coll(610)........:
MPIDI_isend_coll_unsafe(176):
MPIDI_OFI_send_normal(368)..: OFI tagged senddata failed (ofi_send.h:368:MPIDI_OFI_send_normal:Invalid argument)
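
For what it is worth, here is the decomposition arithmetic I was checking for that 1960 x 2128 domain on 5 x 128 = 640 ranks. I am assuming WRF would split 640 into a close-to-square factor pair like 20 x 32; I have not confirmed that is exactly the pair it picks, so this is only a rough sketch:

#!/bin/bash
# Inner domain of 1960 x 2128 grid points on 640 MPI ranks, assuming a 20 x 32 split.
NX=20 ; NY=32
echo "x-direction: $(( 1960 / NX )) points per patch"
echo "y-direction: $(( 2128 / NY )) points per patch, with $(( 2128 % NY )) leftover rows spread across some ranks"

Under that assumed split the patches come out to roughly 98 x 66 points each, with the leftover rows absorbed by some ranks.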

One thing I will note is that when I was running metgrid.exe (the step before real.exe), there were a number of odd temporary output files ending in .p#### while it was writing the met_em files for domain 3. I am not sure what those are for, or why they did not show up for the other two domains, but metgrid appeared to run successfully (according to the executable's output, anyway). So I wonder if there is some issue going on with metgrid.exe too?

Thanks for any additional insight you can provide!
 