WRFv4.5.2 on Derecho for very large nx,ny

jamesrup

New member
Hi all,

I posted on here with an issue getting WRF 4.5.2 to run (crashing) on Derecho with a very large domain. I'm working with a domain of nx,ny = 3551, 2441. Are there expected limitations to WRF's ability to scale up to a domain of this size?

My original setup was:

module load intel-classic/2023.0.0
module load ncarcompilers/1.0.0
module load cray-mpich/8.1.25
module load craype/2.7.20
module load netcdf-mpi/4.9.2

And configure option #50, as was recommended to me on this thread. I tried many values of "select" to get more CPUs (from 1024 up to 2304) and it kept failing. I have therefore been trying different configure options, including #24 and #15.
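(For context on the "select" values: Derecho nodes have 128 cores, so the PBS resource request maps to MPI ranks roughly as below. This is an illustrative snippet, not my exact job script, and it assumes all cores on each node are used as MPI ranks.)

#PBS -l select=8:ncpus=128:mpiprocs=128
# 8 nodes x 128 ranks per node = 1024 MPI ranks
mpiexec ./wrf.exe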

Most recently, I've been trying configure option #78 (Intel oneAPI), since the latest WRF version is supposed to support it. I therefore did module load intel-oneapi/2023.0.0 before compiling, but it failed to compile. I'm attaching the text output from that compile command. Any ideas what's going wrong here?
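(Roughly the sequence I used for that oneAPI attempt; treat it as a sketch rather than a verbatim record, and the option number is just what I selected at the configure prompt:)

module load intel-oneapi/2023.0.0
cd WRF
./clean -a                      # start from a clean build tree
./configure                     # selected option #78 (Intel oneAPI) at the prompt
./compile em_real >& compile.out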

I really need to get these runs going in support of an ongoing virtual field campaign, so, huge appreciation in advance if anyone has some insights.

Cheers, James
 

Attachments

  • compile.out.txt
Copying my reply from another thread here, since it's directly relevant.

Hi KWerner,

I just wanted to follow up with you on this. I believe I've narrowed the issue down to something about parallelization with large nx, ny. It's even failing with just real.exe.

I've tried just a single domain, and when I go from nx = ny = 1000 to about 2000 it stops working: it crashes and I get the error below. The run and rsl files are located in /glade/derecho/scratch/ruppert/piccolo/run3

The WRF build used for that run is in /glade/work/ruppert/wrf-piccolo/wrftests/WRF, in case you want to see my configure.wrf file (I used option #50).
I'll also copy below my module list.

---------------------------------------------------

Currently Loaded Modules:
1) ncarenv/23.06 (S) 3) intel/2023.0.0 5) cray-mpich/8.1.25 7) netcdf/4.9.2
2) craype/2.7.20 4) ncarcompilers/1.0.0 6) hdf5/1.12.2 8) conda/latest

---------------------------------------------------

metgrid input_wrf.F first_date_input = 2023-08-09_00:00:00
metgrid input_wrf.F first_date_nml = 2023-08-09_00:00:00
MPICH ERROR [Rank 0] [job id f92d2ea2-ecd4-429f-aa06-aa61ddb43685] [Sat Jan 27 17:13:06 2024] [dec0356] - Abort(136960015) (rank 0 in comm 0): Fatal error in PMPI_Scatterv: Other MPI error, error stack:
PMPI_Scatterv(416)..........: MPI_Scatterv(sbuf=0x14a1256a9020, scnts=0x29a56d90, displs=0x29b0b470, MPI_CHAR, rbuf=0x7ffd633bde00, rcount=2029104, MPI_CHAR, root=0, comm=comm=0xc4000000) failed
MPIR_CRAY_Scatterv(462).....:
MPIC_Isend(511).............:
MPID_Isend_coll(610)........:
MPIDI_isend_coll_unsafe(176):
MPIDI_OFI_send_normal(372)..: OFI tagged senddata failed (ofi_send.h:372:MPIDI_OFI_send_normal:Invalid argument)

aborting job:
Fatal error in PMPI_Scatterv: Other MPI error, error stack:
PMPI_Scatterv(416)..........: MPI_Scatterv(sbuf=0x14a1256a9020, scnts=0x29a56d90, displs=0x29b0b470, MPI_CHAR, rbuf=0x7ffd633bde00, rcount=2029104, MPI_CHAR, root=0, comm=comm=0xc4000000) failed
MPIR_CRAY_Scatterv(462).....:
MPIC_Isend(511).............:
MPID_Isend_coll(610)........:
MPIDI_isend_coll_unsafe(176):
MPIDI_OFI_send_normal(372)..: OFI tagged senddata failed (ofi_send.h:372:MPIDI_OFI_send_normal:Invalid argument)
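(For reference, the kind of single-domain &domains block I mean is roughly the following; the values are illustrative placeholders, not a copy of my actual namelist.input:)

&domains
 max_dom  = 1,
 e_we     = 2000,     ! failures start around nx = ny = 2000
 e_sn     = 2000,
 e_vert   = 45,       ! placeholder vertical level count
 dx       = 1000.,    ! placeholder grid spacing (m)
 dy       = 1000.,
 nproc_x  = -1,       ! let WRF choose the process decomposition
 nproc_y  = -1,
/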
 
Hi James,
What error message did you see in your rsl files? Basically, we recommend determining the smallest number of processors using the formula below:

(e_we/100) * (e_sn/100)

In your case, I suppose 1024 processors should work.
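For your domain, that works out to (3551/100) * (2441/100) ≈ 35.5 * 24.4 ≈ 866 processors at minimum, so 1024 is above that threshold.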

We don't have much experience running such a big case. I would appreciate it if you could keep me updated on your progress.
 
Hi Ming Chen,

I didn't see any clear error messages in my rsl files aside from what I quoted in the message above, starting with "MPICH ERROR".

I've tested running with a very wide range of CPU counts, e.g., nselect from 20 to 200 on Derecho.
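(Assuming the full 128 MPI ranks per Derecho node, nselect of 20 to 200 corresponds to roughly 20 * 128 = 2,560 up to 200 * 128 = 25,600 ranks.)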

I'm surprised to learn that no one has experience running WRF at this large a scale. Is there any chance that people with WRF development expertise could share wisdom from running MPAS at high resolution? I know folks are doing that at very high resolution...
 
We did have trouble running WRF with very large grid dimensions (e.g., larger than 3000 x 3000).

For MPAS, I know that we can run the global 3-km mesh on Derecho. Is this resolution sufficient for you?
 
Hi Ming,

We really need 1-km grid spacing and to use WRF. That said, I suspect the MPAS setup involves an even larger number of grid cells, so if that can run, hopefully the same environment will work for us. Can you share the module and environment setup you use for the global 3-km mesh? I'll try it.
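(For scale: the 3551 x 2441 WRF domain is about 8.7 million horizontal grid points, while a quasi-uniform global 3-km mesh is on the order of 65 million cells, roughly Earth's ~5.1e8 km^2 surface divided by ~7.8 km^2 per hexagonal cell, so the MPAS case should indeed be the larger one.)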

Cheers,
James
 