Hello,
My run failed after 20 days of an hourly simulation. This looks strange to me, and I am confused about why it happened after such a long time: if there were an issue with the model, shouldn't it have shown up earlier?
Timing for main: time 2020-01-20_06:49:36 on domain 2: 0.53781 elapsed seconds
Timing for main: time 2020-01-20_06:49:36 on domain 1: 3.02211 elapsed seconds
Timing for main: time 2020-01-20_06:49:36 on domain 1: 3.02211 elapsed seconds
Timing for main: time 2020-01-20_06:49:38 on domain 2: 0.82501 elapsed seconds
Timing for main: time 2020-01-20_06:49:38 on domain 2: 0.82501 elapsed seconds
Timing for main: time 2020-01-20_06:49:40 on domain 2: 0.54874 elapsed seconds
Timing for main: time 2020-01-20_06:49:40 on domain 2: 0.54874 elapsed seconds
Timing for main: time 2020-01-20_06:49:43 on domain 2: 0.54124 elapsed seconds
Timing for main: time 2020-01-20_06:49:43 on domain 2: 0.54124 elapsed seconds
Timing for main: time 2020-01-20_06:49:45 on domain 2: 0.53853 elapsed seconds
Timing for main: time 2020-01-20_06:49:45 on domain 2: 0.53853 elapsed seconds
d02 2020-01-20_06:49:45+03/05 330 points exceeded cfl=2 in domain d02 at time 2020-01-20_06:49:45+03/05 hours
d02 2020-01-20_06:49:45+03/05 MAX AT i,j,k: 73 72 69 vert_cfl,w,d(eta)= 1.98137449E+20 -9.28796725E+18 2.99999979E-03
d02 2020-01-20_06:49:45+03/05 330 points exceeded cfl=2 in domain d02 at time 2020-01-20_06:49:45+03/05 hours
d02 2020-01-20_06:49:45+03/05 MAX AT i,j,k: 73 72 69 vert_cfl,w,d(eta)= 1.98137449E+20 -9.28796725E+18 2.99999979E-03
d02 2020-01-20_06:49:45+03/05 340 points exceeded cfl=2 in domain d02 at time 2020-01-20_06:49:45+03/05 hours
d02 2020-01-20_06:49:45+03/05 MAX AT i,j,k: 74 67 43 vert_cfl,w,d(eta)= 7.29251637E+17 1.90167925E+19 1.29999965E-02
d02 2020-01-20_06:49:45+03/05 340 points exceeded cfl=2 in domain d02 at time 2020-01-20_06:49:45+03/05 hours
d02 2020-01-20_06:49:45+03/05 MAX AT i,j,k: 74 67 43 vert_cfl,w,d(eta)= 7.29251637E+17 1.90167925E+19 1.29999965E-02
d02 2020-01-20_06:49:45+03/05 1020 points exceeded cfl=2 in domain d02 at time 2020-01-20_06:49:45+03/05 hours
d02 2020-01-20_06:49:45+03/05 MAX AT i,j,k: 69 72 43 vert_cfl,w,d(eta)= 1.69698721E+18 -7.66308235E+19 1.29999965E-02
d02 2020-01-20_06:49:45+03/05 1020 points exceeded cfl=2 in domain d02 at time 2020-01-20_06:49:45+03/05 hours
d02 2020-01-20_06:49:45+03/05 MAX AT i,j,k: 69 72 43 vert_cfl,w,d(eta)= 1.69698721E+18 -7.66308235E+19 1.29999965E-02
Timing for main: time 2020-01-20_06:49:48 on domain 2: 0.53766 elapsed seconds
Timing for main: time 2020-01-20_06:49:48 on domain 2: 0.53766 elapsed seconds
Timing for main: time 2020-01-20_06:49:48 on domain 1: 3.33517 elapsed seconds
Timing for main: time 2020-01-20_06:49:48 on domain 1: 3.33517 elapsed seconds
[belt01:267845:0:267845] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x10804000)
[belt01:267837:0:267837] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x12a78000)
==== backtrace (tid: 267845) ====
==== backtrace (tid: 267837) ====
0 /opt/ucx-1.8.0/lib/libucs.so.0(ucs_handle_error+0xe4) [0x7f5e05198384]
1 /opt/ucx-1.8.0/lib/libucs.so.0(+0x236ac) [0x7f5e051986ac]
0 /opt/ucx-1.8.0/lib/libucs.so.0(ucs_handle_error+0xe4) [0x7f1d02917384]
1 /opt/ucx-1.8.0/lib/libucs.so.0(+0x236ac) [0x7f1d029176ac]
2 /opt/ucx-1.8.0/lib/libucs.so.0(+0x2391b) [0x7f5e0519891b]
3 /lib64/libpthread.so.0(+0xf630) [0x7f5e1c955630]
4 ./coawstM() [0x2ad0fac]
5 ./coawstM() [0x2ae3ea3]
7 ./coawstM() [0x1b32bec]
7 ./coawstM() [0x1b32bec]
8 ./coawstM() [0x14424d9]
9 ./coawstM() [0x126d618]
8 ./coawstM() [0x14424d9]
9 ./coawstM() [0x126d618]
10 ./coawstM() [0x471982]
11 ./coawstM() [0x406972]
12 ./coawstM() [0x405f7d]
10 ./coawstM() [0x471982]
11 ./coawstM() [0x406972]
12 ./coawstM() [0x405f7d]
13 /lib64/libc.so.6(__libc_start_main+0xf5) [0x7f1d19e1a555]
14 ./coawstM() [0x405fb4]
13 /lib64/libc.so.6(__libc_start_main+0xf5) [0x7f5e1c59a555]
14 ./coawstM() [0x405fb4]
=================================
=================================
Program received signal SIGSEGV: Segmentation fault - invalid memory reference.
Backtrace for this error:
Program received signal SIGSEGV: Segmentation fault - invalid memory reference.
Backtrace for this error:
#0 0x7f5e1d2d2dfd in ???
#1 0x7f5e1d2d2013 in ???
#2 0x7f5e1c95562f in ???
#3 0x2ad0fac in ???
#4 0x2ae3ea2 in ???
#0 0x7f1d1ab52dfd in ???
#1 0x7f1d1ab52013 in ???
#2 0x7f1d1a1d562f in ???
#3 0x2ad0fac in ???
#4 0x2ae3ea2 in ???
#5 0x24d5e43 in ???
#5 0x24d5e43 in ???
#6 0x1b32beb in ???
#6 0x1b32beb in ???
#7 0x14424d8 in ???
#8 0x126d617 in ???
#7 0x14424d8 in ???
#8 0x126d617 in ???
#9 0x471981 in ???
#10 0x406971 in ???
#9 0x471981 in ???
#10 0x406971 in ???
#11 0x405f7c in ???
#11 0x405f7c in ???
#12 0x7f1d19e1a554 in ???
#13 0x405fb3 in ???
#14 0xffffffffffffffff in ???
#12 0x7f5e1c59a554 in ???
#13 0x405fb3 in ???
#14 0xffffffffffffffff in ???
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 0 on node belt01 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
Can anyone please comment on why this happened?