WRF crashing with some ensemble members and not others

MattW

Hi WRF folks,

I've been using WRF to run some short ensemble forecasts, and I've found that it crashes for some ensemble members and not others, despite using an identical namelist and submit script on Derecho. I've also been able to run identically configured ensembles for two other cases without this issue, so I'm not sure why this case keeps crashing. When the model crashes, I get the following error in the rsl.error files:

taskid: 2 hostname: dec0508
module_io_quilt_old.F 2931 F
Quilting with 1 groups of 0 I/O tasks.
mediation_integrate.G 1242 DATASET=HISTORY
mediation_integrate.G 1243 grid%id 1 grid%oid 1
mediation_integrate.G 1242 DATASET=HISTORY
mediation_integrate.G 1243 grid%id 2 grid%oid 2
mediation_integrate.G 1242 DATASET=HISTORY
mediation_integrate.G 1243 grid%id 1 grid%oid 2
mediation_integrate.G 1242 DATASET=HISTORY
mediation_integrate.G 1243 grid%id 2 grid%oid 2
mediation_integrate.G 1242 DATASET=HISTORY
mediation_integrate.G 1243 grid%id 1 grid%oid 2
mediation_integrate.G 1242 DATASET=HISTORY
mediation_integrate.G 1243 grid%id 2 grid%oid 2
mediation_integrate.G 1242 DATASET=HISTORY
mediation_integrate.G 1243 grid%id 1 grid%oid 2
mediation_integrate.G 1242 DATASET=HISTORY
mediation_integrate.G 1243 grid%id 2 grid%oid 2
mediation_integrate.G 1242 DATASET=HISTORY
mediation_integrate.G 1243 grid%id 1 grid%oid 2
mediation_integrate.G 1242 DATASET=HISTORY
mediation_integrate.G 1243 grid%id 2 grid%oid 2
mediation_integrate.G 1242 DATASET=HISTORY
mediation_integrate.G 1243 grid%id 1 grid%oid 2
mediation_integrate.G 1242 DATASET=HISTORY
mediation_integrate.G 1243 grid%id 2 grid%oid 2
forrtl: error (78): process killed (SIGTERM)
Image PC Routine Line Source
libc.so.6 00001494B8E42900 Unknown Unknown Unknown
libmpi_intel.so.1 00001494B9F87D5F Unknown Unknown Unknown
libmpi_intel.so.1 00001494BAD808C6 Unknown Unknown Unknown
libmpi_intel.so.1 00001494BAD9428C Unknown Unknown Unknown
libmpi_intel.so.1 00001494BAC99155 Unknown Unknown Unknown
libmpi_intel.so.1 00001494B920EA31 Unknown Unknown Unknown
libmpi_intel.so.1 00001494B920EBEE Unknown Unknown Unknown
libmpi_intel.so.1 00001494BAF867A2 Unknown Unknown Unknown
libmpi_intel.so.1 00001494B920FBD7 PMPI_Alltoall Unknown Unknown
wrf.exe 0000000003654EF7 Unknown Unknown Unknown
wrf.exe 00000000009244C2 Unknown Unknown Unknown
wrf.exe 0000000001CD9B25 Unknown Unknown Unknown
wrf.exe 000000000165101D Unknown Unknown Unknown
wrf.exe 00000000005D550C Unknown Unknown Unknown
wrf.exe 0000000000418821 Unknown Unknown Unknown
wrf.exe 00000000004187E1 Unknown Unknown Unknown
wrf.exe 000000000041877D Unknown Unknown Unknown
libc.so.6 00001494B8E2BE6C Unknown Unknown Unknown
libc.so.6 00001494B8E2BF35 __libc_start_main Unknown Unknown

Stack trace terminated abnormally.

I've tried a range of different timestep values (initially thinking this was a CFL issue) and got a crash each time, although the timing of the crash varied somewhat. I've also tried running with different numbers of nodes on Derecho, with different I/O quilting configurations, and with I/O quilting turned off; every configuration crashes with the same error. An example of one of the failed ensemble members is in /glade/derecho/scratch/mawilson/OSSE_WRF/free_06/member2 on Derecho. Let me know if you've seen this issue before or if you have any suggestions.
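
For reference, the settings I was varying live in namelist.input; here is a minimal sketch of the relevant sections (the values shown are illustrative, not the exact ones from these runs):

&domains
 time_step = 30,            ! seconds; varied to rule out CFL violations
/

&namelist_quilt
 nio_tasks_per_group = 0,   ! 0 disables I/O quilting (matches "1 groups of 0 I/O tasks" in the log above)
 nio_groups = 1,
/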

Thanks!

Matt Wilson
 
An update on this: I figured out the problem. I had accidentally copied the wrong boundary conditions into my run directory, so WRF was running with initial conditions from June 4, 2021 and boundary conditions from July 19, 2022. It is slightly concerning that some of the ensemble members managed to complete their forecasts without any errors despite that large a mismatch in boundary conditions; I'm surprised WRF didn't throw some kind of error.
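
For anyone who hits something similar, one quick sanity check before submitting (assuming the netCDF utilities are available on your system, and using the standard wrfinput_d01/wrfbdy_d01 file names in the run directory) is to compare the times stored in the input and boundary files:

ncdump -v Times wrfinput_d01 | sed -n '/^data:/,$p'
ncdump -v Times wrfbdy_d01 | sed -n '/^data:/,$p'

The first entry in the wrfbdy_d01 Times list should match the time in wrfinput_d01; in my broken runs they were more than a year apart.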
 