Hi Zhan,
This error message comes from SUBROUTINE read_CAMgases in phys/module_ra_clWRF_support.F.
Could you add some print statements right before the following line to check the model and date values for this case?
201 CALL wrf_error_fatal("CLWRF: 'CAMtr_volume_mixing_ratio' does not exist")
Specifically, I would like to know values of the following variables:
model, yr, julian, max_years
Thanks.
Hello Ming,
Sorry for the late reply. Last time we were able to avoid restarting, and a later restart worked, so we moved past this error and ignored it.
However, this time the restart failed again, and I am trying everything I can to figure it out.
So I printed the following variables: model, yr, julian, and max_years, and also whether the file exists under its absolute path and its relative path. One of the ranks reported that the CAMtr file does not exist, while others confirmed that it does. It is very interesting to me that the ranks behave differently like this. Do you have any thoughts? Thank you.
rsl.out.0224:
Checking file: [CAMtr_volume_mixing_ratio]
/pscratch/sd/z/zhanshi/ensemble_18h/run
relative exists = F
absolute exists = F
model=RRTMG yr= 2023 julian= 159.0000 max_years= 233
-------------- FATAL CALLED ---------------
FATAL CALLED FROM FILE: <stdin> LINE: 215
CLWRF: 'CAMtr_volume_mixing_ratio' does not exist
-------------------------------------------
taskid: 224 hostname: nid004800
rsl.out.0220:
D01: NML defined reasonable_time_step_ratio = 6.000000
Checking file: [CAMtr_volume_mixing_ratio]
/pscratch/sd/z/zhanshi/ensemble_18h/run
relative exists = T
absolute exists = T
Normal ending of CAMtr_volume_mixing_ratio file
GHG annual values from CAM trace gas file
Year = 2023 , Julian day = 159
CO2 = 4.230768799241307E-004 volume mixing ratio
N2O = 3.343851326344848E-007 volume mixing ratio
CH4 = 1.941778516043003E-006 volume mixing ratio
CFC11 = 2.100264328202455E-010 volume mixing ratio
CFC12 = 4.810798593491203E-010 volume mixing ratio
taskid: 220 hostname: nid004581
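The T/F flags above only say whether a rank saw the file, not why the lookup failed. A minimal C sketch of the same kind of check that also reports the reason could look like the following. It is only an illustration and not WRF code: `file_visible` is a hypothetical helper name, and it assumes a POSIX system where `stat` sets `errno` on failure.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>

/* Hypothetical helper (not part of WRF): returns 1 if `path` can be
 * stat'ed, 0 otherwise. On failure the errno text is copied into `why`,
 * so "No such file or directory" can be told apart from other errors
 * such as permission or I/O problems. */
int file_visible(const char *path, char *why, size_t whylen)
{
    struct stat sb;

    assert(why != NULL && whylen > 0);

    if (stat(path, &sb) == 0) {
        snprintf(why, whylen, "ok");
        return 1;
    }

    int err = errno;  /* save errno before any other library call */
    snprintf(why, whylen, "%s", strerror(err));
    return 0;
}
```

Calling this on the failing rank with both the relative name and the absolute path would show whether the reported reason is really a missing file or something else, which might be useful information for HPC support.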
Besides the old problem, a new problem has appeared.
I also suspect that this problem might be sensitive to the domain decomposition. For the previous successful case, I used 900 ranks; the run later stopped and restarted successfully, again with 900 ranks. This time I used 1024 ranks for the first run. For the restart, when I use a smaller number of processors, such as 256, I get "CLWRF: 'CAMtr_volume_mixing_ratio' does not exist". When I use a larger number of processors, such as the 1024 I used for the first run, I get "Normal ending of CAMtr_volume_mixing_ratio file", but then a different error message appears: "cxil_map: write error". The restart run never reached wrfrst_d03 and crashed after reading wrfrst_d02.
*** subr move_sections - method = 20
*** subr move_sections - idiag = 0
dep_init: initializing for 3 domains
start_domain_em: numgas = 141
*************************************
Nesting domain
ids,ide,jds,jde 1 301 1 268
ims,ime,jms,jme -4 21 93 120
ips,ipe,jps,jpe 1 10 103 110
INTERMEDIATE domain
ids,ide,jds,jde 28 133 29 123
ims,ime,jms,jme 23 42 55 77
ips,ipe,jps,jpe 26 32 65 67
*************************************
d01 2023-06-08_21:00:00 alloc_space_field: domain 2 , 51352560 bytes allocated
d01 2023-06-08_21:00:00 alloc_space_field: domain 2 , 301273920 bytes allocated
RESTART: nest, opening wrfrst_d02_2023-06-08_21:00:00 for reading
d01 2023-06-08_21:00:00 Input data is acceptable to use:
cxil_map: write error
cxil_map: write error
cxil_map: write error
MPICH ERROR [Rank 384] [job id 54199333.0] [Tue Jun 9 00:25:52 2026] [nid004876] - Abort(405397519) (rank 384 in comm 0): Fatal error in PMPI_Scatterv: Other MPI error, error stack:
PMPI_Scatterv(416).....: MPI_Scatterv(sbuf=0x4aec2340, scnts=0x4af77c40, displs=0x4af78c50, dtype=0x4c000427, rbuf=0x7fff9e475900, rcount=28800, dtype=0x4c000427, root=0, comm=comm=0xc4000000) failed
MPIR_CRAY_Scatterv(502):
MPIC_Recv(194).........:
MPID_Recv(380).........:
MPIDI_recv_unsafe(87)..:
MPIDI_OFI_do_irecv(356): OFI tagged recv failed (ofi_recv.h:356:MPIDI_OFI_do_irecv:Bad address)
aborting job:
Fatal error in PMPI_Scatterv: Other MPI error, error stack:
PMPI_Scatterv(416).....: MPI_Scatterv(sbuf=0x4aec2340, scnts=0x4af77c40, displs=0x4af78c50, dtype=0x4c000427, rbuf=0x7fff9e475900, rcount=28800, dtype=0x4c000427, root=0, comm=comm=0xc4000000) failed
MPIR_CRAY_Scatterv(502):
MPIC_Recv(194).........:
MPID_Recv(380).........:
MPIDI_recv_unsafe(87)..:
MPIDI_OFI_do_irecv(356): OFI tagged recv failed (ofi_recv.h:356:MPIDI_OFI_do_irecv:Bad address)
I realized this might be related to the bug reported and fixed in the following post (I have copied its text below in case you cannot open the website):
MPI_Gatherv/MPI_Scatterv displacements overflow in frame/collect_on_comm.c
https://github.com/wrf-model/WRF/issues/2156
Determine MPI Data Types in col_on_comm() & dst_on_comm() to prevent displacements overflow.
TYPE: bug fix
KEYWORDS: prevent displacements overflow in MPI_Gatherv() and MPI_Scatterv() operations
SOURCE: Benjamin Kirk & Negin Sobhani (NSF NCAR / CISL)
DESCRIPTION OF CHANGES:
Problem:
The MPI_Gatherv() and MPI_Scatterv() operations require integer displacements into the communications buffers. Historically everything is passed as an MPI_CHAR, causing these displacements to be larger than otherwise necessary. For large domain sizes this can cause the displace[] offsets to exceed the maximum int, wrapping to negative values.
Solution:
This change introduces additional error checking and then uses the function MPI_Type_match_size() (available since MPI-2.0) to determine a suitable MPI_Datatype given the input *typesize. The result then is that the displace[] offsets are in terms of data type extents, rather than bytes, and less likely to overflow.
ISSUE: Fixes #2156
LIST OF MODIFIED FILES:
M frame/collect_on_comm.c
TESTS CONDUCTED:
Failed cases run now.
RELEASE NOTE:
Determine MPI Data Types in col_on_comm() & dst_on_comm() to prevent displacements overflow.
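To illustrate the overflow described in that fix, here is a minimal C sketch. It is not the code in frame/collect_on_comm.c: `displacements_fit` is a hypothetical helper, and the counts in the usage note are made-up numbers chosen only to trigger the overflow. It walks the per-rank element counts and checks whether every displacement still fits in a C `int`, first when offsets are counted in bytes and then when they are counted in whole elements.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

/* Hypothetical helper (not part of WRF): returns 1 if the displacement
 * of every rank fits in a C int, 0 otherwise.
 *   counts[i] : number of elements belonging to rank i
 *   scale     : bytes per displacement unit. scale = typesize mimics
 *               byte offsets (everything sent as MPI_CHAR); scale = 1
 *               mimics offsets counted in whole elements. */
int displacements_fit(const int *counts, int nranks, int scale)
{
    int64_t offset = 0;  /* accumulate in 64 bits so the check itself cannot wrap */

    assert(scale > 0);

    for (int i = 0; i < nranks; i++) {
        assert(counts[i] >= 0);
        if (offset > INT_MAX) {
            return 0;    /* displacement for rank i would not fit in int */
        }
        offset += (int64_t)counts[i] * scale;
    }
    return 1;
}
```

For example, with the made-up counts `{1000000000, 1000000000, 1}` and 4-byte elements, `displacements_fit(counts, 3, 4)` returns 0 because the second rank's byte offset is 4,000,000,000, while `displacements_fit(counts, 3, 1)` returns 1 because the largest element offset is 2,000,000,000.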
I modified the code accordingly and recompiled. Unfortunately, it didn't work for me.
I have little experience in debugging the code. Please let me know if you have any advice or experience in solving this kind of problem. I will also work with HPC support, because this problem might not be purely related to WRF-Chem itself.
Thank you very much for your time and help.