
WRF domain 3 error on derecho

ananyas

New member
Hi there,

I am attempting to run real.exe on derecho for a simulation with 3 nested domains (namelist attached below), but real.exe seems to be failing on the 3rd domain with the following error:
Code:
cxil_map: write error
cxil_map: write error
cxil_map: write error
cxil_map: write error
MPICH ERROR [Rank 0] [job id b00ca03d-a489-44f4-ae0c-620377b3aa0d] [Fri Mar 21 23:26:58 2025] [dec0037] - Abort(539613199) (rank 0 in comm 0): Fatal error in PMPI_Scatterv: Other MPI error, error stack:
PMPI_Scatterv(416)..........: MPI_Scatterv(sbuf=0x148af85d7020, scnts=0xa0429c0, displs=0xfd322d0, MPI_CHAR, rbuf=0x7fff3a921200, rcount=17799936, MPI_CHAR, root=0, comm=comm=0xc4000000) failed
MPIR_CRAY_Scatterv(462).....:
MPIC_Isend(511).............:
MPID_Isend_coll(610)........:
MPIDI_isend_coll_unsafe(176):
MPIDI_OFI_send_normal(372)..: OFI tagged senddata failed (ofi_send.h:372:MPIDI_OFI_send_normal:Invalid argument)

aborting job:
Fatal error in PMPI_Scatterv: Other MPI error, error stack:
PMPI_Scatterv(416)..........: MPI_Scatterv(sbuf=0x148af85d7020, scnts=0xa0429c0, displs=0xfd322d0, MPI_CHAR, rbuf=0x7fff3a921200, rcount=17799936, MPI_CHAR, root=0, comm=comm=0xc4000000) failed
MPIR_CRAY_Scatterv(462).....:
MPIC_Isend(511).............:
MPID_Isend_coll(610)........:
MPIDI_isend_coll_unsafe(176):
MPIDI_OFI_send_normal(372)..: OFI tagged senddata failed (ofi_send.h:372:MPIDI_OFI_send_normal:Invalid argument)

This is my config for running real.exe
Code:
...
#PBS -l select=4:ncpus=96:mem=128GB
...
export WRFIO_NCD_LARGE_FILE_SUPPORT=1
mpirun -np 156 ./real.exe

After running into the issue with namelist_old (please ignore the start and end times), and having seen a similar issue before, I tried increasing the domain 1 grid dimensions (namelist_new), but I am still running into the same error. Before I experiment with other domain 1 grid sizes (domain 3 covers the entire area of interest, so I was avoiding shrinking it), I was curious whether anyone had pointers on how to approach this. Any other insights would be helpful as well.

Thanks in advance,
Ananya
 

Attachments

  • namelist_old.txt
    1.1 KB · Views: 2
  • namelist_new.txt
    1.1 KB · Views: 3
Hi Ananya,
Can you try running real with 2 domains, using the same number of processors? If that works, then it's likely just the size of d03, and you may simply need more processors. If you continue to increase the processor count and it still fails, will you package all of the rsl* files into a single *.tar file and attach that, along with your namelist.input file, so I can take a look? Thanks!
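If it comes to that, something like the following (run from the directory where real.exe ran; the path below is just a placeholder) should bundle the logs:
Code:
cd /path/to/your/run/directory    # placeholder: wherever the rsl files were written
tar -cf rsl_files.tar rsl.out.* rsl.error.*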
 
This looks to be related to:

This is not in the latest release yet, but you could either try using the develop branch or apply the changes noted in the pull request.
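If you want to go the develop-branch route, a rough sketch would be something like the following (assuming a fresh clone; the configure and compile choices depend on your compiler/MPI setup on derecho):
Code:
# Grab the WRF source and switch to the develop branch
git clone https://github.com/wrf-model/WRF.git
cd WRF
git checkout develop
# Reconfigure and rebuild; pick the options appropriate for your environment
./configure
./compile em_real >& compile.log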
 
Thanks very much for the response. I applied the changes @islas suggested in frame/collect_on_comm.c, but I am still running into the same issue.

@kwerner 2 domains work fine, so it does indeed seem to be an issue with the size of domain 3. I tried increasing the number of processors and nodes, but I still hit the same error. In my latest attempt I used this:

Code:
#PBS -l select=6:ncpus=96:mem=196GB
#PBS -o wrf_output.log

### Set temp to scratch
export TMPDIR=/glade/derecho/scratch/${USER}/tmp && mkdir -p ${TMPDIR}
export WRFIO_NCD_LARGE_FILE_SUPPORT=1
mpirun -np 256 ./real.exe

Please find my namelist.input file attached. The rsl.* files should be at /glade/work/ananyas/wrfv4.5/run. Please let me know if you cannot access that directory and I will tar them up and upload them here.

Thanks again for all your help!
 

Attachments

  • namelist.input.txt
    3.8 KB · Views: 1
I tried recompiling with the fix @islas suggested again today, and it worked this time. I might have been using the wrong binary yesterday; I'm not sure.
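For anyone who hits this later, the rebuild after editing frame/collect_on_comm.c is just the standard WRF recompile, roughly something like this (the path is a placeholder), and make sure the job script points at the freshly built real.exe:
Code:
cd /path/to/WRF                      # placeholder: your WRF source directory
./clean                              # optional: clear old objects, keeps configure.wrf
./compile em_real >& compile.log     # rebuild real.exe/wrf.exe with the patched file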

Thanks again for your help :)
 