
WRF 4.5.2 crash help

jamesrup

Member
Hi there,

I can't for the life of me figure out what's causing my WRF job to crash. I've tried just about everything, from my original goal of a very large, high-resolution (1 km) domain down to a very coarse (25 km) domain. I originally thought the crashes were due to a lack of nests stepping down from the coarse boundary-condition data, but with the current 25 km domain there's no way that could be the cause.
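For context, by "stepping down" I mean adding intermediate nests between the coarse boundary-condition data and the high-resolution grid, i.e. &domains settings roughly along these lines (illustrative values only, not my actual namelist):

 &domains
 max_dom           = 3,
 dx                = 25000, 5000, 1000,
 dy                = 25000, 5000, 1000,
 parent_id         = 1,     1,    2,
 parent_grid_ratio = 1,     5,    5,
 /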

I've turned on w_damping and increased epssm to 0.5, and I've also tried the DFI initialization. None of this has helped.
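Concretely, those settings look like this (the exact values are in the namelist attached below; the dfi_opt value here is just an illustration of trying DFI):

 &dynamics
 w_damping = 1,
 epssm     = 0.5,
 /

 &dfi_control
 dfi_opt   = 3,
 /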

Attached is my namelist. I'm running on Derecho using the attached configure setup. I've tried various numbers of nodes from very small (128) to very large (thousands).

I need to get these jobs running for fieldwork support in the near future. Thanks in advance for any help!

James
 

Attachments

  • namelist.input.txt (4.5 KB)
  • configure.wrf.txt (21.1 KB)
KWerner
Hi James,

If you don't mind, can you let me know the path to this case so I can take a look on Derecho? Otherwise, will you please package all of your rsl.* files into a single *.tar file and attach that? Thanks!
 
jamesrup
Hi KWerner, thanks for your reply!

The directory I've been working from is

/glade/work/ruppert/wrf-piccolo/WRF/run

But it's actually running the jobs in

/glade/derecho/scratch/ruppert/piccolo/run

I've been playing with a very large number of nodes more recently so it's generating RSL files out to #29439!
 
jamesrup
Hi KWerner,

I just wanted to follow up with you on this. I believe I've narrowed the issue down to something about parallelization with large nx and ny; it even fails with just real.exe.

I've tried just a single domain: when I increase from nx = ny = 1000 to about 2000, it crashes with the error below. The run and rsl files are located in /glade/derecho/scratch/ruppert/piccolo/run3
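For reference, the failing single-domain test is configured roughly like this (grid dimensions only; the full namelist is in run3):

 &domains
 max_dom = 1,
 e_we    = 2000,   ! crashes around this size; 1000 x 1000 runs fine
 e_sn    = 2000,
 /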

The WRF build used for that run is in /glade/work/ruppert/wrf-piccolo/wrftests/WRF, in case you want to see my configure.wrf file (I used configure option #50).
I'll also copy my module list below.

---------------------------------------------------

Currently Loaded Modules:
  1) ncarenv/23.06 (S)
  2) craype/2.7.20
  3) intel/2023.0.0
  4) ncarcompilers/1.0.0
  5) cray-mpich/8.1.25
  6) hdf5/1.12.2
  7) netcdf/4.9.2
  8) conda/latest

---------------------------------------------------

metgrid input_wrf.F first_date_input = 2023-08-09_00:00:00
metgrid input_wrf.F first_date_nml = 2023-08-09_00:00:00
MPICH ERROR [Rank 0] [job id f92d2ea2-ecd4-429f-aa06-aa61ddb43685] [Sat Jan 27 17:13:06 2024] [dec0356] - Abort(136960015) (rank 0 in comm 0): Fatal error in PMPI_Scatterv: Other MPI error, error stack:
PMPI_Scatterv(416)..........: MPI_Scatterv(sbuf=0x14a1256a9020, scnts=0x29a56d90, displs=0x29b0b470, MPI_CHAR, rbuf=0x7ffd633bde00, rcount=2029104, MPI_CHAR, root=0, comm=comm=0xc4000000) failed
MPIR_CRAY_Scatterv(462).....:
MPIC_Isend(511).............:
MPID_Isend_coll(610)........:
MPIDI_isend_coll_unsafe(176):
MPIDI_OFI_send_normal(372)..: OFI tagged senddata failed (ofi_send.h:372:MPIDI_OFI_send_normal:Invalid argument)

aborting job:
Fatal error in PMPI_Scatterv: Other MPI error, error stack:
PMPI_Scatterv(416)..........: MPI_Scatterv(sbuf=0x14a1256a9020, scnts=0x29a56d90, displs=0x29b0b470, MPI_CHAR, rbuf=0x7ffd633bde00, rcount=2029104, MPI_CHAR, root=0, comm=comm=0xc4000000) failed
MPIR_CRAY_Scatterv(462).....:
MPIC_Isend(511).............:
MPID_Isend_coll(610)........:
MPIDI_isend_coll_unsafe(176):
MPIDI_OFI_send_normal(372)..: OFI tagged senddata failed (ofi_send.h:372:MPIDI_OFI_send_normal:Invalid argument)
 