Hello,
I am attempting to run WRF on Derecho for a simulation with three nested domains that was stable on Cheyenne. On Cheyenne I used 16 nodes for a total of 576 processes. However, when I run a similar configuration on Derecho, real.exe fails on the third, innermost domain (the one with the highest resolution).
Is there a way to get the configuration that worked on Cheyenne running stably on Derecho without changing anything about the domain setup? I am aware of the posts here about optimizing the number of processes (Choosing an Appropriate Number of Processors). I also tried a total process count on Derecho close to the one I used on Cheyenne (128 cores per node * 5 nodes = 640 processes), but the third domain still failed. The following is a snippet from the tail of the rsl.error file for real.exe:
MPIDI_OFI_send_normal(368)..: OFI tagged senddata failed (ofi_send.h:368:MPIDI_OFI_send_normal:Invalid argument)
aborting job:
Fatal error in PMPI_Scatterv: Other MPI error, error stack:
PMPI_Scatterv(416)..........: MPI_Scatterv(sbuf=0x14f62a2bd020, scnts=0xd74f870, displs=0x9ec8f90, MPI_CHAR, rbuf=0x7ffc2eb0cc60, rcount=4380288, MPI_CHAR, root=0, comm=comm=0xc4000000) failed
MPIR_CRAY_Scatterv(462).....:
MPIC_Isend(511).............:
MPID_Isend_coll(610)........:
MPIDI_isend_coll_unsafe(176):
MPIDI_OFI_send_normal(368)..: OFI tagged senddata failed (ofi_send.h:368:MPIDI_OFI_send_normal:Invalid argument)
When I tried 9 nodes, the process count was too large for my outer domain to run. Do you have any insight into how WRF scales on Derecho versus Cheyenne, so that I can find a stable configuration? I have attached a zip containing the namelist.input, rsl.error, rsl.out and runrealc.sh files. Thanks for any assistance you can provide!
Evan
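For reference, the process-count arithmetic above, together with the rule of thumb I understand the "Choosing an Appropriate Number of Processors" post to give (roughly 100x100 down to 25x25 grid points per process), can be sketched in Python as below. This is only a rough sketch: the function names are made up for illustration, the domain sizes are hypothetical placeholders rather than the values in my attached namelist.input, and the rule of thumb is an assumption on my part, not an official limit.

```python
# Cores per node: 36 on Cheyenne (16 nodes -> 576 processes, as above), 128 on Derecho.
CHEYENNE_CORES_PER_NODE = 36
DERECHO_CORES_PER_NODE = 128


def total_ranks(nodes, cores_per_node):
    """Total MPI processes when every core on every node is used."""
    return nodes * cores_per_node


def rank_bounds(e_we, e_sn):
    """Assumed rule of thumb for one domain: (fewest, most) processes.

    Fewest: about 100x100 grid points per process.
    Most:   about 25x25 grid points per process.
    """
    fewest = max(1, (e_we // 100) * (e_sn // 100))
    most = max(1, (e_we // 25) * (e_sn // 25))
    return fewest, most


def usable_range(domains):
    """Process counts that satisfy the rule of thumb for every domain at once.

    The largest domain sets the lower bound, the smallest sets the upper bound.
    """
    bounds = [rank_bounds(e_we, e_sn) for e_we, e_sn in domains]
    lowest = max(b[0] for b in bounds)
    highest = min(b[1] for b in bounds)
    return lowest, highest


if __name__ == "__main__":
    # Hypothetical (e_we, e_sn) pairs for d01, d02, d03; NOT my real namelist values.
    example_domains = [(300, 300), (400, 400), (700, 700)]

    print("Cheyenne, 16 nodes:", total_ranks(16, CHEYENNE_CORES_PER_NODE))
    print("Derecho, 5 nodes:", total_ranks(5, DERECHO_CORES_PER_NODE))
    print("Derecho, 9 nodes:", total_ranks(9, DERECHO_CORES_PER_NODE))
    print("Rule-of-thumb range:", usable_range(example_domains))
```

If the lower bound comes out above the upper bound, no single process count satisfies the rule of thumb for all three domains, which seems to be the kind of conflict I am running into between the outer and innermost domains.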