Hello,
Could someone please confirm if the SEGFAULT I'm encountering is due to a CFL error? If so, does this mean the only solution is to reduce the timestep? Currently, I'm running a double-domain (2-way nested) simulation with horizontal resolutions of 7 km and 2.333... km (ratio = 3) on 65 vertical levels (refer to the attached namelist). In parallel, I'm also running the same simulation but with 33 levels (half the vertical resolution), and it is running without any CFL errors.
Code:
cat rsl.error.0000 | tail -n 25
Timing for main: time 2013-10-22_20:57:46 on domain 2: 0.05754 elapsed seconds
Timing for main: time 2013-10-22_20:58:00 on domain 2: 0.05740 elapsed seconds
Timing for main: time 2013-10-22_20:58:00 on domain 1: 0.33857 elapsed seconds
Timing for main: time 2013-10-22_20:58:13 on domain 2: 0.05045 elapsed seconds
Timing for main: time 2013-10-22_20:58:26 on domain 2: 0.05993 elapsed seconds
Timing for main: time 2013-10-22_20:58:40 on domain 2: 0.05952 elapsed seconds
Timing for main: time 2013-10-22_20:58:40 on domain 1: 0.37335 elapsed seconds
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image PC Routine Line Source
wrfexe 000000000288329D for__signal_handl Unknown Unknown
libpthread-2.19.s 00002AAAADE23870 Unknown Unknown Unknown
wrfexe 0000000001BD7F6C Unknown Unknown Unknown
wrfexe 0000000001BD19EE Unknown Unknown Unknown
wrfexe 0000000001BCC8DC Unknown Unknown Unknown
wrfexe 0000000001BCB0B1 Unknown Unknown Unknown
wrfexe 00000000016370C7 Unknown Unknown Unknown
wrfexe 000000000170B069 Unknown Unknown Unknown
wrfexe 0000000001166BBF Unknown Unknown Unknown
wrfexe 000000000102609C Unknown Unknown Unknown
wrfexe 00000000005263E5 Unknown Unknown Unknown
wrfexe 000000000040EF01 Unknown Unknown Unknown
wrfexe 000000000040EEBF Unknown Unknown Unknown
wrfexe 000000000040EE5E Unknown Unknown Unknown
libc-2.19.so 00002AAAAE052B25 __libc_start_main Unknown Unknown
wrfexe 000000000040ED69 Unknown Unknown Unknown
Below are the first and last lines of the output from grepping the rsl.error* files for CFL (the complete output is in the attached rsl.error* files):
Code:
grep -i CFL rsl.error.0*
rsl.error.0013:d01 2013-10-22_20:45:20 1 points exceeded cfl=2 in domain d01 at time 2013-10-22_20:45:20 hours
rsl.error.0013:d01 2013-10-22_20:45:20 MAX AT i,j,k: 14 20 26 vert_cfl,w,d(eta)= 2.004570 5.555908 1.0499954E-02
rsl.error.0013:d01 2013-10-22_20:46:00 1 points exceeded cfl=2 in domain d01 at time 2013-10-22_20:46:00 hours
rsl.error.0013:d01 2013-10-22_20:46:00 MAX AT i,j,k: 14 20 26 vert_cfl,w,d(eta)= 2.017011 5.513463 1.0499954E-02
rsl.error.0013:d01 2013-10-22_20:46:40 1 points exceeded cfl=2 in domain d01 at time 2013-10-22_20:46:40 hours
rsl.error.0013:d01 2013-10-22_20:46:40 MAX AT i,j,k: 14 20 26 vert_cfl,w,d(eta)= 2.031362 5.400991 1.0499954E-02
rsl.error.0013:d01 2013-10-22_20:47:20 1 points exceeded cfl=2 in domain d01 at time 2013-10-22_20:47:20 hours
...
rsl.error.0013:d01 2013-10-22_20:56:40 13 points exceeded cfl=2 in domain d01 at time 2013-10-22_20:56:40 hours
rsl.error.0013:d01 2013-10-22_20:56:40 MAX AT i,j,k: 14 20 28 vert_cfl,w,d(eta)= 11.27256 -246.8004 1.3000011E-02
rsl.error.0025:d01 2013-10-22_20:56:00 2 points exceeded cfl=2 in domain d01 at time 2013-10-22_20:56:00 hours
rsl.error.0025:d01 2013-10-22_20:56:00 MAX AT i,j,k: 14 21 26 vert_cfl,w,d(eta)= 2.140012 6.960732 1.0499954E-02
rsl.error.0025:d01 2013-10-22_20:56:40 3 points exceeded cfl=2 in domain d01 at time 2013-10-22_20:56:40 hours
rsl.error.0025:d01 2013-10-22_20:56:40 MAX AT i,j,k: 14 21 28 vert_cfl,w,d(eta)= 4.002279 17.52446 1.3000011E-02
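To see how the violations grow over time, the "MAX AT" lines can be reduced to the worst point per timestamp. This is only a minimal sketch: the regular expression assumes the line layout shown in the grep output above, and the function names `parse_cfl_lines` and `worst_per_time` are made up for illustration.

```python
import re

# Assumed layout, taken from the grep output above:
# rsl.error.0013:d01 <time> MAX AT i,j,k: 14 20 26 vert_cfl,w,d(eta)= 2.004570 5.555908 1.0499954E-02
MAX_LINE = re.compile(
    r"(?P<domain>d\d+)\s+(?P<time>\S+)\s+MAX AT i,j,k:\s+"
    r"(?P<i>\d+)\s+(?P<j>\d+)\s+(?P<k>\d+)\s+"
    r"vert_cfl,w,d\(eta\)=\s*(?P<cfl>\S+)\s+(?P<w>\S+)\s+(?P<deta>\S+)"
)


def parse_cfl_lines(lines):
    """Extract one record per 'MAX AT' line; all other lines are skipped."""
    records = []
    for line in lines:
        m = MAX_LINE.search(line)
        if m is None:
            continue
        records.append({
            "domain": m["domain"],
            "time": m["time"],
            "i": int(m["i"]),
            "j": int(m["j"]),
            "k": int(m["k"]),
            "vert_cfl": float(m["cfl"]),
            "w": float(m["w"]),
            "d_eta": float(m["deta"]),
        })
    return records


def worst_per_time(records):
    """Keep the record with the largest vert_cfl for each (domain, time)."""
    worst = {}
    for rec in records:
        key = (rec["domain"], rec["time"])
        if key not in worst or rec["vert_cfl"] > worst[key]["vert_cfl"]:
            worst[key] = rec
    return worst


# A few lines copied from the grep output above.
sample = [
    "rsl.error.0013:d01 2013-10-22_20:45:20 1 points exceeded cfl=2 in domain d01 at time 2013-10-22_20:45:20 hours",
    "rsl.error.0013:d01 2013-10-22_20:45:20 MAX AT i,j,k: 14 20 26 vert_cfl,w,d(eta)= 2.004570 5.555908 1.0499954E-02",
    "rsl.error.0013:d01 2013-10-22_20:56:40 MAX AT i,j,k: 14 20 28 vert_cfl,w,d(eta)= 11.27256 -246.8004 1.3000011E-02",
    "rsl.error.0025:d01 2013-10-22_20:56:40 MAX AT i,j,k: 14 21 28 vert_cfl,w,d(eta)= 4.002279 17.52446 1.3000011E-02",
]

for (domain, time), rec in sorted(worst_per_time(parse_cfl_lines(sample)).items()):
    print(domain, time, rec["i"], rec["j"], rec["k"], rec["vert_cfl"], rec["w"])
```

In practice the `sample` list would be replaced by the lines read from all rsl.error.* files.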
Further information: I'm running both simulations on a 121x121 horizontal grid (same size on both domains) on 144 processors (+1 dedicated to writing). I read the FAQ on the maximum number of processors and am aware that I am at the upper limit given the size of my domains: might this be a problem?
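As a rough check of how small the tiles get, here is a sketch of the per-task extent. It assumes the 144 compute tasks are laid out as 12 x 12 and that the points are split as evenly as possible; neither is stated in the log excerpt, and the real decomposition may differ. The helper `tile_extent` is hypothetical.

```python
def tile_extent(n_points, n_tasks):
    """Return (smallest, largest) points per task along one dimension
    when n_points are split as evenly as possible over n_tasks."""
    base, extra = divmod(n_points, n_tasks)
    return base, base + (1 if extra else 0)


GRID_POINTS = 121   # points per side, as stated in the post
NPROC_X = 12        # assumption: 144 compute tasks arranged 12 x 12
NPROC_Y = 12

x_min, x_max = tile_extent(GRID_POINTS, NPROC_X)
y_min, y_max = tile_extent(GRID_POINTS, NPROC_Y)
print(f"tile size between {x_min}x{y_min} and {x_max}x{y_max} points")
```

Under these assumptions each task holds only about 10 to 11 points per side, which is why the processor count is flagged as a possible concern.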
Geographically, the cells where vert_cfl > 2 occurs are located over open ocean (i.e. there is no steep terrain nearby).
Code:
ncks -H -d Time,1 -d west_east,14 -d south_north,21 -v HGT wrfout_d01_2013-10-21_00\:00\:00
netcdf wrfout_d01_2013-10-21_00:00:00 {
dimensions:
Time = UNLIMITED ; // (1 currently)
south_north = 1 ;
west_east = 1 ;
variables:
float HGT(Time,south_north,west_east) ;
float XLAT(Time,south_north,west_east) ;
float XLONG(Time,south_north,west_east) ;
data:
HGT =
0 ;
XLAT =
-19.96004 ;
XLONG =
-152.2943 ;
The current timestep is 40 seconds, and I'm eager to keep it as close as possible to this value to avoid increasing my calculation time significantly. I was thinking that perhaps 36 seconds would be sufficient. Do you have any advice?
Currently, I'm running two experiments, with timesteps of 36 s and 30 s respectively. I should know in a few hours whether they crash at the same time and location.
Out of curiosity, can anyone confirm whether the fact that I'm not encountering SEGFAULT/CFL errors in the 33-level run could be explained by the greater height between each individual level?
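The reasoning behind both questions can be sketched numerically. This assumes that the reported vert_cfl behaves like |vertical velocity in eta coordinates| x dt / d(eta), so that for an unchanged vertical velocity it scales linearly with the timestep and inversely with the layer thickness. That scaling is an assumption, not something taken from the log, and `scaled_cfl` is a hypothetical helper.

```python
def scaled_cfl(cfl, dt_old, dt_new, deta_old=1.0, deta_new=1.0):
    """Rescale a vertical CFL number for a new timestep and/or layer thickness.

    Assumes cfl is proportional to dt / d(eta) with the vertical velocity
    held fixed (an assumption, see the text above).
    """
    return cfl * (dt_new / dt_old) * (deta_old / deta_new)


FIRST_CFL = 2.004570   # first reported violation, at dt = 40 s
DT_OLD = 40.0

for dt_new in (36.0, 30.0):
    print(f"dt = {dt_new:.0f} s -> vert_cfl ~ {scaled_cfl(FIRST_CFL, DT_OLD, dt_new):.3f}")

# Doubling d(eta), roughly what going from 65 to 33 levels would do
# if the levels were thinned uniformly (an assumption).
print(f"doubled d(eta) -> vert_cfl ~ {scaled_cfl(FIRST_CFL, DT_OLD, DT_OLD, 1.0, 2.0):.3f}")
```

This only applies to the first, marginal violations. Once the vertical velocity itself starts to grow (w reaches -246.8 in the last lines of the grep output), a fixed-velocity scaling no longer says anything useful.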
Thank you for your time