Hi WRF folks,
I've been using WRF to run some short ensemble forecasts, and it crashes for some ensemble members but not others, even though every member uses an identical namelist and submit script on Derecho. I've also run identically configured ensembles for two other cases without this issue, so I'm not sure why this case keeps crashing. When the model crashes, I get the following error in the rsl.error files:
taskid: 2 hostname: dec0508
module_io_quilt_old.F 2931 F
Quilting with 1 groups of 0 I/O tasks.
mediation_integrate.G 1242 DATASET=HISTORY
mediation_integrate.G 1243 grid%id 1 grid%oid 1
mediation_integrate.G 1242 DATASET=HISTORY
mediation_integrate.G 1243 grid%id 2 grid%oid 2
mediation_integrate.G 1242 DATASET=HISTORY
mediation_integrate.G 1243 grid%id 1 grid%oid 2
mediation_integrate.G 1242 DATASET=HISTORY
mediation_integrate.G 1243 grid%id 2 grid%oid 2
mediation_integrate.G 1242 DATASET=HISTORY
mediation_integrate.G 1243 grid%id 1 grid%oid 2
mediation_integrate.G 1242 DATASET=HISTORY
mediation_integrate.G 1243 grid%id 2 grid%oid 2
mediation_integrate.G 1242 DATASET=HISTORY
mediation_integrate.G 1243 grid%id 1 grid%oid 2
mediation_integrate.G 1242 DATASET=HISTORY
mediation_integrate.G 1243 grid%id 2 grid%oid 2
mediation_integrate.G 1242 DATASET=HISTORY
mediation_integrate.G 1243 grid%id 1 grid%oid 2
mediation_integrate.G 1242 DATASET=HISTORY
mediation_integrate.G 1243 grid%id 2 grid%oid 2
mediation_integrate.G 1242 DATASET=HISTORY
mediation_integrate.G 1243 grid%id 1 grid%oid 2
mediation_integrate.G 1242 DATASET=HISTORY
mediation_integrate.G 1243 grid%id 2 grid%oid 2
forrtl: error (78): process killed (SIGTERM)
Image PC Routine Line Source
libc.so.6 00001494B8E42900 Unknown Unknown Unknown
libmpi_intel.so.1 00001494B9F87D5F Unknown Unknown Unknown
libmpi_intel.so.1 00001494BAD808C6 Unknown Unknown Unknown
libmpi_intel.so.1 00001494BAD9428C Unknown Unknown Unknown
libmpi_intel.so.1 00001494BAC99155 Unknown Unknown Unknown
libmpi_intel.so.1 00001494B920EA31 Unknown Unknown Unknown
libmpi_intel.so.1 00001494B920EBEE Unknown Unknown Unknown
libmpi_intel.so.1 00001494BAF867A2 Unknown Unknown Unknown
libmpi_intel.so.1 00001494B920FBD7 PMPI_Alltoall Unknown Unknown
wrf.exe 0000000003654EF7 Unknown Unknown Unknown
wrf.exe 00000000009244C2 Unknown Unknown Unknown
wrf.exe 0000000001CD9B25 Unknown Unknown Unknown
wrf.exe 000000000165101D Unknown Unknown Unknown
wrf.exe 00000000005D550C Unknown Unknown Unknown
wrf.exe 0000000000418821 Unknown Unknown Unknown
wrf.exe 00000000004187E1 Unknown Unknown Unknown
wrf.exe 000000000041877D Unknown Unknown Unknown
libc.so.6 00001494B8E2BE6C Unknown Unknown Unknown
libc.so.6 00001494B8E2BF35 __libc_start_main Unknown Unknown
Stack trace terminated abnormally.
I've tried a range of time step values (initially thinking this was a CFL issue) and got a crash each time, although the time at which the crash occurred varied somewhat. I've also tried running on different numbers of nodes on Derecho, with different I/O quilting configurations, and with I/O quilting turned off; every configuration crashes with the same error. An example of one of the failed ensemble members is in /glade/derecho/scratch/mawilson/OSSE_WRF/free_06/member2 on Derecho. Let me know if you've seen this issue before or if you have any suggestions.
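Since a SIGTERM often just means a rank was killed after a different rank failed first, one way to narrow this down would be to look through all of the rsl.error.* files for any runtime message other than SIGTERM, or for any mention of CFL. Below is a minimal sketch of such a scan, not something from the run itself: the function name scan_rsl_errors is hypothetical, and it assumes the rsl.error.* files sit directly in the member's run directory and that Intel runtime messages begin with "forrtl:" as in the trace above.

```python
from pathlib import Path


def scan_rsl_errors(run_dir):
    """Return {filename: [matching lines]} for rsl.error.* files in run_dir.

    A line matches if it is an Intel runtime message ("forrtl:") that is not
    a SIGTERM notice, or if it mentions "cfl" (case-insensitive).
    Hypothetical helper, written only for illustration.
    """
    hits = {}
    for path in sorted(Path(run_dir).glob("rsl.error.*")):
        matches = []
        for line in path.read_text(errors="replace").splitlines():
            lowered = line.strip().lower()
            # Skip SIGTERM notices: those ranks were killed, not the culprit.
            is_runtime_msg = lowered.startswith("forrtl:") and "sigterm" not in lowered
            if is_runtime_msg or "cfl" in lowered:
                matches.append(line.strip())
        if matches:
            hits[path.name] = matches
    return hits


if __name__ == "__main__":
    import sys

    run_dir = sys.argv[1] if len(sys.argv) > 1 else "."
    for name, lines in scan_rsl_errors(run_dir).items():
        print(name)
        for line in lines:
            print("   ", line)
```

Run from the member directory (or with the directory as the only argument), this prints only the rank files that contain something other than the SIGTERM notice. If it prints nothing, every rank was killed externally, which would point away from a numerical blow-up on a single tile.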
Thanks!
Matt Wilson