Hi,
I am running simulations with the 3-60 km mesh (x20.835586) using different numbers of PEs. Everything runs as expected with 1008 PEs or fewer, but at that core count the forecast runs too slowly. When I try 1728, 2304, or 3456 PEs, segfaults (as shown below) start to appear and the job is killed altogether. I wonder why this is happening and how to resolve it. I am using MPAS-v8.1.
Thanks!
Code:
Program received signal SIGSEGV: Segmentation fault - invalid memory reference.
Backtrace for this error:
#0 0x7fd52cfefd8f in ???
#1 0x59a6e9 in ???
#2 0x59ab70 in ???
#3 0x59d02a in ???
#4 0x6f424b in ???
#5 0x552d54 in ???
#6 0x511876 in ???
#7 0x49dc07 in ???
#8 0x49ed90 in ???
#9 0x4067dc in ???
#10 0x405cba in ???
#11 0x7fd52cfdaeaf in ???
#12 0x7fd52cfdaf5f in ???
#13 0x405d04 in ???
#14 0xffffffffffffffff in ???
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
An MPI communication peer process has unexpectedly disconnected. This
usually indicates a failure in the peer process (e.g., a crash or
otherwise exiting without calling MPI_FINALIZE first).
Although this local MPI process will likely now behave unpredictably
(it may even hang or crash), the root cause of this problem is the
failure of the peer -- that is what you need to investigate. For
example, there may be a core file that you can examine. More
generally: such peer hangups are frequently caused by application bugs
or other external events.
Local host: **-01-09
Local PID: 968029
Peer host: **-01-12
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 148 with PID 1060249 on node **-01-12 exited on