WRFv4.5.2 on Derecho for very large nx,ny

jamesrup

New member
Hi all,

I posted on here with an issue getting WRF 4.5.2 to run (crashing) on Derecho with a very large domain. I'm working with a domain of nx,ny = 3551, 2441. Are there expected limitations to WRF's ability to scale up to a domain of this size?

My original setup was:

module load intel-classic/2023.0.0
module load ncarcompilers/1.0.0
module load cray-mpich/8.1.25
module load craype/2.7.20
module load netcdf-mpi/4.9.2

And configure option #50, as was recommended to me on this thread. I tried many values of "select" to get more CPUs (from 1024 up to 2304) and it kept failing. I have therefore been trying different configure options, including #24 and #15.
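(For context on the "select" values: Derecho nodes have 128 cores, so the PBS resource request maps to MPI ranks roughly as below. This is an illustrative snippet, not my exact job script, and it assumes all cores on each node are used as MPI ranks.)

#PBS -l select=8:ncpus=128:mpiprocs=128
# 8 nodes x 128 ranks per node = 1024 MPI ranks
mpiexec ./wrf.exe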

Most recently, I've been trying configure option #78 (Intel oneAPI), since the latest WRF version is supposed to support it. I therefore did module load intel-oneapi/2023.0.0 before compiling, but it failed to compile. I'm attaching the text output from that compile command. Any ideas what's going wrong here?
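(Roughly the sequence I used for that oneAPI attempt; treat it as a sketch rather than a verbatim record, and the option number is just what I selected at the configure prompt:)

module load intel-oneapi/2023.0.0
cd WRF
./clean -a                      # start from a clean build tree
./configure                     # selected option #78 (Intel oneAPI) at the prompt
./compile em_real >& compile.out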

I really need to get these runs going in support of an ongoing virtual field campaign, so, huge appreciation in advance if anyone has some insights.

Cheers, James
 

Attachments

  • compile.out.txt
Copying my reply from another thread here, since it's directly relevant.

Hi KWerner,

I just wanted to follow up with you on this. I believe I've narrowed the issue down to something about parallelization with large nx, ny. It's even failing with just real.exe.

I've tried just a single domain, and when I go from nx = ny = 1000 to about 2000 it stops working: it crashes and I get the error below. The run and rsl files are located in /glade/derecho/scratch/ruppert/piccolo/run3

The WRF build used for that run is in /glade/work/ruppert/wrf-piccolo/wrftests/WRF, in case you want to see my configure.wrf file (I used option #50).
I'll also copy below my module list.

---------------------------------------------------

Currently Loaded Modules:
1) ncarenv/23.06 (S) 3) intel/2023.0.0 5) cray-mpich/8.1.25 7) netcdf/4.9.2
2) craype/2.7.20 4) ncarcompilers/1.0.0 6) hdf5/1.12.2 8) conda/latest

---------------------------------------------------

metgrid input_wrf.F first_date_input = 2023-08-09_00:00:00
metgrid input_wrf.F first_date_nml = 2023-08-09_00:00:00
MPICH ERROR [Rank 0] [job id f92d2ea2-ecd4-429f-aa06-aa61ddb43685] [Sat Jan 27 17:13:06 2024] [dec0356] - Abort(136960015) (rank 0 in comm 0): Fatal error in PMPI_Scatterv: Other MPI error, error stack:
PMPI_Scatterv(416)..........: MPI_Scatterv(sbuf=0x14a1256a9020, scnts=0x29a56d90, displs=0x29b0b470, MPI_CHAR, rbuf=0x7ffd633bde00, rcount=2029104, MPI_CHAR, root=0, comm=comm=0xc4000000) failed
MPIR_CRAY_Scatterv(462).....:
MPIC_Isend(511).............:
MPID_Isend_coll(610)........:
MPIDI_isend_coll_unsafe(176):
MPIDI_OFI_send_normal(372)..: OFI tagged senddata failed (ofi_send.h:372:MPIDI_OFI_send_normal:Invalid argument)

aborting job:
Fatal error in PMPI_Scatterv: Other MPI error, error stack:
PMPI_Scatterv(416)..........: MPI_Scatterv(sbuf=0x14a1256a9020, scnts=0x29a56d90, displs=0x29b0b470, MPI_CHAR, rbuf=0x7ffd633bde00, rcount=2029104, MPI_CHAR, root=0, comm=comm=0xc4000000) failed
MPIR_CRAY_Scatterv(462).....:
MPIC_Isend(511).............:
MPID_Isend_coll(610)........:
MPIDI_isend_coll_unsafe(176):
MPIDI_OFI_send_normal(372)..: OFI tagged senddata failed (ofi_send.h:372:MPIDI_OFI_send_normal:Invalid argument)
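(For reference, the kind of single-domain &domains block I mean is roughly the following; the values are illustrative placeholders, not a copy of my actual namelist.input:)

&domains
 max_dom  = 1,
 e_we     = 2000,     ! failures start around nx = ny = 2000
 e_sn     = 2000,
 e_vert   = 45,       ! placeholder vertical level count
 dx       = 1000.,    ! placeholder grid spacing (m)
 dy       = 1000.,
 nproc_x  = -1,       ! let WRF choose the process decomposition
 nproc_y  = -1,
/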
 
Hi James,
What error message did you see in your rsl files? Basically, we recommend determining the smallest number of processors using the formula below:

(e_we/100) * (e_sn/100)

In your case, I suppose 1024 processors should work.
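For your domain, that works out to (3551/100) * (2441/100) ≈ 35.5 * 24.4 ≈ 866 processors at minimum, so 1024 is above that threshold.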

We don't have much experience running such a big case. I would appreciate it if you could keep me updated on your progress.
 
Hi Ming Chen,

I didn't see any clear error messages in my rsl files aside from what I quoted in the message above, starting with "MPICH ERROR".

I've tested running with a very wide range of CPU counts, e.g., nselect from 20 to 200 on Derecho.
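(Assuming the full 128 MPI ranks per Derecho node, nselect of 20 to 200 corresponds to roughly 20 * 128 = 2,560 up to 200 * 128 = 25,600 ranks.)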

I'm surprised to learn that no one has experience running WRF at this large a scale. Is there any chance that people with WRF development expertise could share wisdom from running MPAS at high resolution? I know folks are doing that at very high resolution...
 
We did have trouble running WRF with very large grid dimensions (e.g., larger than 3000 x 3000).

For MPAS, I know that we can run the global 3-km mesh on Derecho. Is this resolution sufficient for you?
 
Hi Ming,

We really need 1-km grid spacing and to use WRF. That said, I suspect the MPAS setup involves an even larger number of grid cells, so if that can run, hopefully the same environment will work for us. Can you share the module and environment setup you use for the global 3-km mesh? I'll try it.
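(For scale: the 3551 x 2441 WRF domain is about 8.7 million horizontal grid points, while a quasi-uniform global 3-km mesh is on the order of 65 million cells, roughly Earth's ~5.1e8 km^2 surface divided by ~7.8 km^2 per hexagonal cell, so the MPAS case should indeed be the larger one.)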

Cheers,
James
 