CFL issue in the middle of the wrf run

kgeorge · Oct 7, 2023

Hello,

I have a failure after 20 days of an hourly simulation. It looks strange to me, because I don't understand why it happened after such a long time. If there were an issue with the model, shouldn't it have shown up earlier?

Timing for main: time 2020-01-20_06:49:36 on domain 2: 0.53781 elapsed seconds
Timing for main: time 2020-01-20_06:49:36 on domain 1: 3.02211 elapsed seconds
Timing for main: time 2020-01-20_06:49:36 on domain 1: 3.02211 elapsed seconds
Timing for main: time 2020-01-20_06:49:38 on domain 2: 0.82501 elapsed seconds
Timing for main: time 2020-01-20_06:49:38 on domain 2: 0.82501 elapsed seconds
Timing for main: time 2020-01-20_06:49:40 on domain 2: 0.54874 elapsed seconds
Timing for main: time 2020-01-20_06:49:40 on domain 2: 0.54874 elapsed seconds
Timing for main: time 2020-01-20_06:49:43 on domain 2: 0.54124 elapsed seconds
Timing for main: time 2020-01-20_06:49:43 on domain 2: 0.54124 elapsed seconds
Timing for main: time 2020-01-20_06:49:45 on domain 2: 0.53853 elapsed seconds
Timing for main: time 2020-01-20_06:49:45 on domain 2: 0.53853 elapsed seconds
d02 2020-01-20_06:49:45+03/05 330 points exceeded cfl=2 in domain d02 at time 2020-01-20_06:49:45+03/05 hours
d02 2020-01-20_06:49:45+03/05 MAX AT i,j,k: 73 72 69 vert_cfl,w,d(eta)= 1.98137449E+20 -9.28796725E+18 2.99999979E-03
d02 2020-01-20_06:49:45+03/05 330 points exceeded cfl=2 in domain d02 at time 2020-01-20_06:49:45+03/05 hours
d02 2020-01-20_06:49:45+03/05 MAX AT i,j,k: 73 72 69 vert_cfl,w,d(eta)= 1.98137449E+20 -9.28796725E+18 2.99999979E-03
d02 2020-01-20_06:49:45+03/05 340 points exceeded cfl=2 in domain d02 at time 2020-01-20_06:49:45+03/05 hours
d02 2020-01-20_06:49:45+03/05 MAX AT i,j,k: 74 67 43 vert_cfl,w,d(eta)= 7.29251637E+17 1.90167925E+19 1.29999965E-02
d02 2020-01-20_06:49:45+03/05 340 points exceeded cfl=2 in domain d02 at time 2020-01-20_06:49:45+03/05 hours
d02 2020-01-20_06:49:45+03/05 MAX AT i,j,k: 74 67 43 vert_cfl,w,d(eta)= 7.29251637E+17 1.90167925E+19 1.29999965E-02
d02 2020-01-20_06:49:45+03/05 1020 points exceeded cfl=2 in domain d02 at time 2020-01-20_06:49:45+03/05 hours
d02 2020-01-20_06:49:45+03/05 MAX AT i,j,k: 69 72 43 vert_cfl,w,d(eta)= 1.69698721E+18 -7.66308235E+19 1.29999965E-02
d02 2020-01-20_06:49:45+03/05 1020 points exceeded cfl=2 in domain d02 at time 2020-01-20_06:49:45+03/05 hours
d02 2020-01-20_06:49:45+03/05 MAX AT i,j,k: 69 72 43 vert_cfl,w,d(eta)= 1.69698721E+18 -7.66308235E+19 1.29999965E-02
Timing for main: time 2020-01-20_06:49:48 on domain 2: 0.53766 elapsed seconds
Timing for main: time 2020-01-20_06:49:48 on domain 2: 0.53766 elapsed seconds
Timing for main: time 2020-01-20_06:49:48 on domain 1: 3.33517 elapsed seconds
Timing for main: time 2020-01-20_06:49:48 on domain 1: 3.33517 elapsed seconds
[belt01:267845:0:267845] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x10804000)
[belt01:267837:0:267837] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x12a78000)
==== backtrace (tid: 267845) ====
==== backtrace (tid: 267837) ====
0 /opt/ucx-1.8.0/lib/libucs.so.0(ucs_handle_error+0xe4) [0x7f5e05198384]
1 /opt/ucx-1.8.0/lib/libucs.so.0(+0x236ac) [0x7f5e051986ac]
0 /opt/ucx-1.8.0/lib/libucs.so.0(ucs_handle_error+0xe4) [0x7f1d02917384]
1 /opt/ucx-1.8.0/lib/libucs.so.0(+0x236ac) [0x7f1d029176ac]
2 /opt/ucx-1.8.0/lib/libucs.so.0(+0x2391b) [0x7f5e0519891b]
3 /lib64/libpthread.so.0(+0xf630) [0x7f5e1c955630]
4 ./coawstM() [0x2ad0fac]
5 ./coawstM() [0x2ae3ea3]
7 ./coawstM() [0x1b32bec]
7 ./coawstM() [0x1b32bec]
8 ./coawstM() [0x14424d9]
9 ./coawstM() [0x126d618]
8 ./coawstM() [0x14424d9]
9 ./coawstM() [0x126d618]
10 ./coawstM() [0x471982]
11 ./coawstM() [0x406972]
12 ./coawstM() [0x405f7d]
10 ./coawstM() [0x471982]
11 ./coawstM() [0x406972]
12 ./coawstM() [0x405f7d]
13 /lib64/libc.so.6(__libc_start_main+0xf5) [0x7f1d19e1a555]
14 ./coawstM() [0x405fb4]
13 /lib64/libc.so.6(__libc_start_main+0xf5) [0x7f5e1c59a555]
14 ./coawstM() [0x405fb4]
=================================
=================================

Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

Backtrace for this error:

Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

Backtrace for this error:
#0 0x7f5e1d2d2dfd in ???
#1 0x7f5e1d2d2013 in ???
#2 0x7f5e1c95562f in ???
#3 0x2ad0fac in ???
#4 0x2ae3ea2 in ???
#0 0x7f1d1ab52dfd in ???
#1 0x7f1d1ab52013 in ???
#2 0x7f1d1a1d562f in ???
#3 0x2ad0fac in ???
#4 0x2ae3ea2 in ???
#5 0x24d5e43 in ???
#5 0x24d5e43 in ???
#6 0x1b32beb in ???
#6 0x1b32beb in ???
#7 0x14424d8 in ???
#8 0x126d617 in ???
#7 0x14424d8 in ???
#8 0x126d617 in ???
#9 0x471981 in ???
#10 0x406971 in ???
#9 0x471981 in ???
#10 0x406971 in ???
#11 0x405f7c in ???
#11 0x405f7c in ???
#12 0x7f1d19e1a554 in ???
#13 0x405fb3 in ???
#14 0xffffffffffffffff in ???
#12 0x7f5e1c59a554 in ???
#13 0x405fb3 in ???
#14 0xffffffffffffffff in ???
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 0 on node belt01 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

Can anyone please comment on why this happened?

kwerner · Oct 12, 2023

Hi,
It's difficult to say why the CFL errors happen so long into the simulation. CFL errors are typically caused by complex terrain, but they can also be related to strong updrafts and/or very unstable conditions. Is it possible that there was some sort of strong system in your domain when you experienced the errors?

If you have a restart file prior to the time the CFL errors occur, you can try to run a restart and see if it still stops at the same time - just as a test.
Are you able to get past the CFL errors using some of the suggestions in "What is the most common reason for a segmentation fault?"
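One way to see where the instability sits is to pull the "MAX AT i,j,k" lines out of the log and list the unique grid points. The following is a minimal sketch that assumes the line layout shown in the log above; the function name `parse_cfl_maxima` is made up for illustration and is not part of WRF. Values of w on the order of 1e18 or more, as in the posted log, are far beyond anything physical, which suggests the model had already blown up by the time these lines were printed.

```python
import re

# Pattern for lines such as:
# d02 2020-01-20_06:49:45+03/05 MAX AT i,j,k: 73 72 69 vert_cfl,w,d(eta)= 1.98E+20 -9.28E+18 2.99E-03
MAX_LINE = re.compile(
    r"^(d\d+)\s+(\S+)\s+MAX AT i,j,k:\s+(\d+)\s+(\d+)\s+(\d+)\s+"
    r"vert_cfl,w,d\(eta\)=\s*(\S+)\s+(\S+)\s+(\S+)"
)


def parse_cfl_maxima(lines):
    """Return unique (domain, stamp, i, j, k, vert_cfl, w, deta) tuples.

    Hypothetical helper: duplicates are dropped because the same message
    can appear more than once when several output streams are merged.
    """
    seen = set()
    records = []
    for line in lines:
        m = MAX_LINE.match(line.strip())
        if not m:
            continue  # not a "MAX AT" line
        domain, stamp, i, j, k, cfl, w, deta = m.groups()
        rec = (domain, stamp, int(i), int(j), int(k),
               float(cfl), float(w), float(deta))
        if rec not in seen:
            seen.add(rec)
            records.append(rec)
    return records


if __name__ == "__main__":
    sample = [
        "d02 2020-01-20_06:49:45+03/05 MAX AT i,j,k: 73 72 69 vert_cfl,w,d(eta)= 1.98137449E+20 -9.28796725E+18 2.99999979E-03",
        "d02 2020-01-20_06:49:45+03/05 MAX AT i,j,k: 73 72 69 vert_cfl,w,d(eta)= 1.98137449E+20 -9.28796725E+18 2.99999979E-03",
    ]
    for rec in parse_cfl_maxima(sample):
        print(rec)
```

Comparing the i,j indices with the terrain and land mask of the domain shows whether the points lie over the sea, near a boundary, or over steep terrain.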

kgeorge · Oct 23, 2023

Hello @kwerner,

Thank you very much for the reply. I tried some reruns, restarting from the previous restart file, but it failed again. I found out that there is a storm at this time, and my time step is 6dx, which I am thinking of changing to 3dx. Will that help? I don't have any memory issues, and this happens over the sea, not along the boundary or over rough terrain.
Any other suggestions, please?

kwerner · Oct 23, 2023

If your time_step is 8xDX, that is almost certainly why you are getting CFL errors. It should be no larger than 6xDX. You can try 6x and other lower settings to see what works for you. You can test this by starting with the restart time so that you don't have to run so much before you know if you can get past this point.
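As a rough illustration of the 6xDX rule, with DX taken in kilometres and time_step in seconds, the upper bound can be computed as below. This is only a sketch: the function name is made up, and the 9 km grid spacing is a hypothetical value, since the thread does not state the actual grid spacing.

```python
def max_time_step_seconds(dx_m, factor=6.0):
    """Upper bound on time_step (s) from the 'factor x DX (km)' rule of thumb.

    dx_m   : grid spacing in metres (as in the namelist dx entry)
    factor : 6 is the usual upper limit; smaller values are more conservative
    """
    if dx_m <= 0:
        raise ValueError("dx_m must be positive")
    return factor * dx_m / 1000.0


if __name__ == "__main__":
    dx_m = 9000.0  # hypothetical parent-domain grid spacing, not from the thread
    for factor in (6, 4, 3):
        print(f"{factor} x DX -> time_step <= {max_time_step_seconds(dx_m, factor)} s")
```

For the hypothetical 9 km grid this gives limits of 54, 36, and 27 seconds for factors of 6, 4, and 3.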

kgeorge · Oct 23, 2023

Hello @kwerner
Thank you once again for your reply. I am sorry, that was a typo; it was 6dx the first time. I have now reduced it to 3dx and the run is in progress. One interesting thing about this error: of the three configurations I tested, two had this same CFL issue and the third did not, even though all three are different configurations. I am rerunning to check whether it occurs again. May I please ask whether the presence of a storm can be a reason for it?

kwerner · Oct 23, 2023

Yes, as I indicated above, a strong updraft and unstable conditions can be related to CFL issues. The atmosphere is unstable, and there are strong updrafts during storms.
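To make the link between updraft strength and the time step concrete, here is a sketch of the textbook vertical Courant number, |w| * dt / dz. This is a simplification: WRF computes its vert_cfl diagnostic in its own vertical coordinate, so the numbers will not match the log exactly. All values below are hypothetical.

```python
def vertical_courant(w_ms, dt_s, dz_m):
    """Textbook vertical Courant number |w| * dt / dz (dimensionless)."""
    if dz_m <= 0:
        raise ValueError("dz_m must be positive")
    return abs(w_ms) * dt_s / dz_m


if __name__ == "__main__":
    dz_m = 100.0  # hypothetical layer thickness in metres
    w_ms = 10.0   # hypothetical storm updraft in m/s
    for dt_s in (18.0, 9.0):  # hypothetical time steps in seconds
        print(f"dt={dt_s} s -> Courant number {vertical_courant(w_ms, dt_s, dz_m):.2f}")
```

Halving the time step halves the Courant number for the same updraft, which is why a smaller time_step can carry a run through a storm that a larger one cannot.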
