cxil_map: write error MPICH ERROR [Rank 0] [job id 8446943.0] [Thu Nov 14 10:18:16 2024] [nid001080] - Abort(2713103) (rank 0 in comm 0): Fatal error

Good morning. We have been running ndown.exe with WRF v4.4 between d02 and d03, but we get the following error message:

cxil_map: write error
cxil_map: write error
MPICH ERROR [Rank 0] [job id 8446943.0] [Thu Nov 14 10:18:16 2024] [nid001080] - Abort(2713103) (rank 0 in comm 0): Fatal error in PMPI_Gatherv: Other MPI error, error stack:
PMPI_Gatherv(412)......: MPI_Gatherv failed(sbuf=0x6fc1f7c0, scount=912000, MPI_CHAR, rbuf=0x14712fc71010, rcnts=0x6fb8fe70, displs=0x6fb76cf0, datatype=MPI_CHAR, root=0, comm=comm=0xc4000000) failed
MPIR_CRAY_Gatherv(456).:
MPIC_Irecv(594)........:
MPID_Irecv(529)........:
MPIDI_irecv_unsafe(163):
MPIDI_OFI_do_irecv(356): OFI tagged recv failed (ofi_recv.h:356:MPIDI_OFI_do_irecv:Invalid argument)


I saw this error in other posts and tried all the recommendations, but nothing works for our test. Could someone please help? Thanks.
I attach the files.
 

Attachments

  • files.tar.gz
    16.6 KB · Views: 3
Hi,
So I assume you don't get this error when you run real.exe or wrf.exe initially for the coarse domain? Can you first try to run ndown with fewer processors? I'm not sure if this would cause the issue, but you don't need that many processors to run ndown. Try something like 256.

Assuming that won't make any difference, can you package all of the rsl* files from the ndown.exe submission into a single *.tar file and attach that? Thanks.
 
Good morning,
Thanks for the reply. Yes, I also get this error when I run real.exe in parallel, but it worked when I ran it in serial. I'll run ndown.exe with fewer processors and let you know the results.
 
Good morning @kwerner,
I managed to run real.exe in parallel without problems and used fewer processors with ndown.exe, but I get the same error. I attach the *.tar with the rsl* files and the namelist. What do you think the problem could be? Thanks for your help.
 

Attachments

  • logs_wrf.tar.gz
    2.3 MB · Views: 1
  • namelist_ndow_d03.input
    6.4 KB · Views: 2
  • submit_script.txt
    1.9 KB · Views: 1
Hi,
I see at the top of the rsl* files, the decomposition is shown as

Code:
Ntasks in X            3 , ntasks in Y           43

This is a very uneven decomposition. Can you try to use a number of processors that is more equal in the X and Y directions? Maybe try 144 total processors, so that the decomposition is 12x12.
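
To check a processor count before submitting, here is a rough sketch that finds the factor pair closest to a square for a given number of MPI tasks. It is only an illustration: the function name most_square_factors is made up for this example, and WRF's own decomposition routine may also take the domain dimensions into account, so the values reported at the top of the rsl* files remain the authoritative ones.

```python
import math

def most_square_factors(ntasks):
    """Return the (smaller, larger) factor pair of ntasks closest to a square.

    Hypothetical helper for illustration only; it is not WRF's actual routine.
    """
    # Walk down from floor(sqrt(ntasks)); the first divisor found gives the
    # most balanced pair. The loop always ends because 1 divides everything.
    for small in range(math.isqrt(ntasks), 0, -1):
        if ntasks % small == 0:
            return small, ntasks // small

# 129 is the task count implied by the 3 x 43 layout in the rsl* files.
for ntasks in (129, 144, 256):
    small, large = most_square_factors(ntasks)
    print(f"{ntasks} tasks -> {small} x {large}")
```

Counts with a large prime factor (129 = 3 x 43) cannot be split evenly, while perfect squares such as 144 or 256 can.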
 