MPICH ERROR when running WRF on Derecho

htan2013

Dear all experts,

I want to run a modified version of WRF on Derecho. The module I modified is module_sf_bep_bem.F, and I compiled the model with the GCC compiler. My module list is below:

Currently Loaded Modules:
1) ncarenv/23.09 (S)
2) craype/2.7.23
3) gcc/13.2.0
4) ncarcompilers/1.0.0
5) hdf5/1.14.3
6) netcdf/4.9.2
7) cray-mpich/8.1.27

When I ran WRF, I got the following MPICH error:

WRF TILE 1 IS 303 IE 353 JS 177 JE 201
WRF NUMBER OF TILES = 1
MPICH ERROR [Rank 62] [job id abd2453e-5b59-4625-a3a3-a3e727c5c071] [Mon Aug 5 11:16:06 2024] [dec1380] - Abort(405417871) (ra>
PMPI_Wait(221).................: MPI_Wait(request=0x9168cd4, status=0x7ffcfa7415b0) failed
MPIR_Wait(93)..................:
MPIR_Wait_impl(41).............:
MPID_Progress_wait(201)........:
MPIDI_Progress_test(97)........:
MPIDI_OFI_handle_cq_error(1067): OFI poll failed (ofi_events.c:1069:MPIDI_OFI_handle_cq_error:Input/output error - PTLTE_NOT_FO>

aborting job:
Fatal error in PMPI_Wait: Other MPI error, error stack:
PMPI_Wait(221).................: MPI_Wait(request=0x9168cd4, status=0x7ffcfa7415b0) failed
MPIR_Wait(93)..................:
MPIR_Wait_impl(41).............:
MPID_Progress_wait(201)........:
MPIDI_Progress_test(97)........:
MPIDI_OFI_handle_cq_error(1067): OFI poll failed (ofi_events.c:1069:MPIDI_OFI_handle_cq_error:Input/output error - PTLTE_NOT_FO>

I have two questions regarding this issue:
1) I tried to compile it with the Intel compiler, but I got some compilation errors. Is it reasonable for my modified module file to produce compilation errors with Intel but not with gfortran?

2) I have tried reducing the number of processors and the time step, but that didn't work. The WRF directory on Derecho is /glade/derecho/scratch/htan2013/New_BEP_BEM/WRF4.5_Alberto_Tree/WRF/test/em_real, and I am also attaching the rsl files here. I would really appreciate any advice. Thank you.

HT
 

Attachments: rsl.error.0062.txt, rsl.out.0062.txt

Hi,

1) I tried to compile it with the Intel compiler, but I got some compilation errors. Is it reasonable for my modified module file to produce compilation errors with Intel but not with gfortran?
Yes, that is possible. Some compilers catch things that others let through.

2) I have tried reducing the number of processors and the time step, but that didn't work. The WRF directory on Derecho is /glade/derecho/scratch/htan2013/New_BEP_BEM/WRF4.5_Alberto_Tree/WRF/test/em_real, and I am also attaching the rsl files here. I would really appreciate any advice. Thank you.
I tried to look for all of your rsl files, but I don't see any in the directory you provided. I did take a look at your namelist, however, and I have a few suggestions. These may not fix the issue, but it's certainly worth a try.

1. It looks like you're running this with only 128 processors. Given the size of your domains, you may need more than that. You could potentially use up to 10 nodes (1280 processors). You likely don't need that many, but you could try something in between.

2. I see that your parent domain has a grid spacing (resolution) of 4.5 km. What type of input data are you using? If the grid spacing of the input data is much coarser than about 5 x DX (5 x 4.5 km = 22.5 km), you should try adding another, coarser-resolution parent domain around the 4.5 km domain. Otherwise the resolution ratio between the input data and the domain is too large, which can cause issues.
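
The arithmetic behind the two suggestions above can be sketched in a few lines of Python. This is only an illustration: the 128 cores per node figure is inferred from "10 nodes (or 1280 processors)", the factor of 5 is the rule of thumb quoted above rather than a hard limit, and all function names are hypothetical.

```python
import math

# Assumption: 128 cores per node, inferred from "10 nodes (or 1280 processors)".
CORES_PER_NODE = 128


def nodes_needed(total_ranks, cores_per_node=CORES_PER_NODE):
    """Smallest whole number of nodes that can host total_ranks MPI ranks."""
    return math.ceil(total_ranks / cores_per_node)


def max_input_spacing_km(dx_km, max_ratio=5.0):
    """Coarsest input-data grid spacing suggested by the ~5 x DX rule of thumb."""
    return max_ratio * dx_km


def needs_coarser_parent(input_spacing_km, dx_km, max_ratio=5.0):
    """True if the input data is coarser than max_ratio times the domain spacing."""
    return input_spacing_km > max_input_spacing_km(dx_km, max_ratio)
```

For the 4.5 km parent domain, `max_input_spacing_km(4.5)` gives the 22.5 km threshold, and `nodes_needed` converts a trial rank count into the number of nodes to request.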

If the model is still stopping immediately, there is likely something else going on. If that's the case, can you try this exact simulation with a pristine version of WRF (non-modified) and see if you still have issues? If so, then we know it's not your modifications that introduced the problems.
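
When comparing the modified and pristine runs, it can help to see which ranks actually reported an error. The following is a minimal sketch, assuming the rsl.error.* files sit in the run directory and that the failure is marked by the strings "MPICH ERROR" or "Fatal error", as in the log posted above. The function name is hypothetical.

```python
from pathlib import Path

# Marker strings taken from the error log in the original post.
ERROR_MARKERS = ("MPICH ERROR", "Fatal error")


def find_failed_ranks(run_dir, markers=ERROR_MARKERS):
    """Map each rsl.error.* file name to the first line containing a marker.

    Files without any marker are left out of the result.
    """
    hits = {}
    for path in sorted(Path(run_dir).glob("rsl.error.*")):
        with path.open(errors="replace") as handle:
            for line in handle:
                if any(marker in line for marker in markers):
                    hits[path.name] = line.rstrip("\n")
                    break
    return hits
```

Calling `find_failed_ranks(".")` from the run directory returns a dictionary keyed by file name. If only a few ranks appear, the tiles handled by those ranks are a reasonable place to start looking.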
 
Thank you so much! I will try using more processors as well. I realize you have recommended that to me before. Thanks!
 