(RESOLVED) forrtl: severe (174): SIGSEGV, segmentation fault occurred while running WRF

Arty · Oct 5, 2023

Hello,

Could someone please confirm if the SEGFAULT I'm encountering is due to a CFL error? If so, does this mean the only solution is to reduce the timestep? Currently, I'm running a double-domain (2-way nested) simulation with horizontal resolutions of 7 km and 2.333... km (ratio = 3) on 65 vertical levels (refer to the attached namelist). In parallel, I'm also running the same simulation but with 33 levels (half the vertical resolution), and it is running without any CFL errors.

Code:

cat rsl.error.0000 | tail -n 25
Timing for main: time 2013-10-22_20:57:46 on domain   2:    0.05754 elapsed seconds
Timing for main: time 2013-10-22_20:58:00 on domain   2:    0.05740 elapsed seconds
Timing for main: time 2013-10-22_20:58:00 on domain   1:    0.33857 elapsed seconds
Timing for main: time 2013-10-22_20:58:13 on domain   2:    0.05045 elapsed seconds
Timing for main: time 2013-10-22_20:58:26 on domain   2:    0.05993 elapsed seconds
Timing for main: time 2013-10-22_20:58:40 on domain   2:    0.05952 elapsed seconds
Timing for main: time 2013-10-22_20:58:40 on domain   1:    0.37335 elapsed seconds
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image              PC                Routine            Line        Source    
wrfexe             000000000288329D  for__signal_handl     Unknown  Unknown
libpthread-2.19.s  00002AAAADE23870  Unknown               Unknown  Unknown
wrfexe             0000000001BD7F6C  Unknown               Unknown  Unknown
wrfexe             0000000001BD19EE  Unknown               Unknown  Unknown
wrfexe             0000000001BCC8DC  Unknown               Unknown  Unknown
wrfexe             0000000001BCB0B1  Unknown               Unknown  Unknown
wrfexe             00000000016370C7  Unknown               Unknown  Unknown
wrfexe             000000000170B069  Unknown               Unknown  Unknown
wrfexe             0000000001166BBF  Unknown               Unknown  Unknown
wrfexe             000000000102609C  Unknown               Unknown  Unknown
wrfexe             00000000005263E5  Unknown               Unknown  Unknown
wrfexe             000000000040EF01  Unknown               Unknown  Unknown
wrfexe             000000000040EEBF  Unknown               Unknown  Unknown
wrfexe             000000000040EE5E  Unknown               Unknown  Unknown
libc-2.19.so       00002AAAAE052B25  __libc_start_main     Unknown  Unknown
wrfexe             000000000040ED69  Unknown               Unknown  Unknown

Below are the first and last lines of the output of the grep for CFL that I ran on the rsl.error* files (the complete output is in the attached rsl.error* files):

Code:

grep -i CFL rsl.error.0*
rsl.error.0013:d01 2013-10-22_20:45:20            1  points exceeded cfl=2 in domain d01 at time 2013-10-22_20:45:20 hours
rsl.error.0013:d01 2013-10-22_20:45:20  MAX AT i,j,k:           14          20          26  vert_cfl,w,d(eta)=   2.004570       5.555908      1.0499954E-02
rsl.error.0013:d01 2013-10-22_20:46:00            1  points exceeded cfl=2 in domain d01 at time 2013-10-22_20:46:00 hours
rsl.error.0013:d01 2013-10-22_20:46:00  MAX AT i,j,k:           14          20          26  vert_cfl,w,d(eta)=   2.017011       5.513463      1.0499954E-02
rsl.error.0013:d01 2013-10-22_20:46:40            1  points exceeded cfl=2 in domain d01 at time 2013-10-22_20:46:40 hours
rsl.error.0013:d01 2013-10-22_20:46:40  MAX AT i,j,k:           14          20          26  vert_cfl,w,d(eta)=   2.031362       5.400991      1.0499954E-02
rsl.error.0013:d01 2013-10-22_20:47:20            1  points exceeded cfl=2 in domain d01 at time 2013-10-22_20:47:20 hours
...
rsl.error.0013:d01 2013-10-22_20:56:40           13  points exceeded cfl=2 in domain d01 at time 2013-10-22_20:56:40 hours
rsl.error.0013:d01 2013-10-22_20:56:40  MAX AT i,j,k:           14          20          28  vert_cfl,w,d(eta)=   11.27256      -246.8004      1.3000011E-02
rsl.error.0025:d01 2013-10-22_20:56:00            2  points exceeded cfl=2 in domain d01 at time 2013-10-22_20:56:00 hours
rsl.error.0025:d01 2013-10-22_20:56:00  MAX AT i,j,k:           14          21          26  vert_cfl,w,d(eta)=   2.140012       6.960732      1.0499954E-02
rsl.error.0025:d01 2013-10-22_20:56:40            3  points exceeded cfl=2 in domain d01 at time 2013-10-22_20:56:40 hours
rsl.error.0025:d01 2013-10-22_20:56:40  MAX AT i,j,k:           14          21          28  vert_cfl,w,d(eta)=   4.002279       17.52446      1.3000011E-02
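
For long logs, the "MAX AT i,j,k" lines can be summarized instead of read by eye. The following is a minimal Python sketch, assuming the line layout shown in the grep output above; the names LINE_RE, parse_cfl_lines and worst_by_level are hypothetical helpers, not part of WRF.

```python
import re

# Matches the "MAX AT i,j,k" lines as they appear in the grep output above.
LINE_RE = re.compile(
    r"(?P<domain>d\d+)\s+(?P<time>\S+)\s+MAX AT i,j,k:\s+"
    r"(?P<i>\d+)\s+(?P<j>\d+)\s+(?P<k>\d+)\s+"
    r"vert_cfl,w,d\(eta\)=\s*(?P<cfl>\S+)\s+(?P<w>\S+)\s+(?P<deta>\S+)"
)

def parse_cfl_lines(lines):
    """Turn matching log lines into dictionaries; other lines are skipped."""
    records = []
    for line in lines:
        m = LINE_RE.search(line)
        if m:
            records.append({
                "domain": m["domain"],
                "time": m["time"],
                "i": int(m["i"]), "j": int(m["j"]), "k": int(m["k"]),
                "vert_cfl": float(m["cfl"]),
                "w": float(m["w"]),
                "d_eta": float(m["deta"]),
            })
    return records

def worst_by_level(records):
    """Return {k: largest vert_cfl seen at that vertical level}."""
    worst = {}
    for r in records:
        if r["k"] not in worst or r["vert_cfl"] > worst[r["k"]]:
            worst[r["k"]] = r["vert_cfl"]
    return worst

# Three lines copied from the output above.
sample = [
    "rsl.error.0013:d01 2013-10-22_20:45:20  MAX AT i,j,k:           14          20          26  vert_cfl,w,d(eta)=   2.004570       5.555908      1.0499954E-02",
    "rsl.error.0013:d01 2013-10-22_20:56:40  MAX AT i,j,k:           14          20          28  vert_cfl,w,d(eta)=   11.27256      -246.8004      1.3000011E-02",
    "rsl.error.0025:d01 2013-10-22_20:56:00  MAX AT i,j,k:           14          21          26  vert_cfl,w,d(eta)=   2.140012       6.960732      1.0499954E-02",
]
print(worst_by_level(parse_cfl_lines(sample)))
```

In practice the input would be the lines of the rsl.error.* files rather than a hard-coded list.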

Further information: I'm running both simulations on a 121x121 horizontal grid (same size for both domains) on 144 processors (+1 dedicated to writing). I read the FAQ on the maximum number of processors and am aware that I'm at the upper limit given my domain size: might this be a problem?

Geographically, the cells where vert_cfl > 2 occurs are located in the open ocean (i.e. no steep slopes nearby).

Code:

ncks -H -d Time,1 -d west_east,14 -d south_north,21 -v HGT wrfout_d01_2013-10-21_00\:00\:00
netcdf wrfout_d01_2013-10-21_00:00:00 {
  dimensions:
    Time = UNLIMITED ; // (1 currently)
    south_north = 1 ;
    west_east = 1 ;

  variables:
    float HGT(Time,south_north,west_east) ;

    float XLAT(Time,south_north,west_east) ;

    float XLONG(Time,south_north,west_east) ;

  data:
    HGT =
    0 ;

    XLAT =
    -19.96004 ;

    XLONG =
    -152.2943 ;

The current timestep is 40 seconds, and I'm eager to keep it as close as possible to this value to avoid increasing my calculation time significantly. I was thinking that perhaps 36 seconds would be sufficient. Do you have any advice?
Currently, I'm running 2 experiments with 36 s and 30 s timesteps respectively. I should know in a few hours whether they crash at the same time and location...

Out of curiosity, can anyone confirm whether the fact that I'm not encountering SEGFAULT/CFL errors in the 33-level run could be explained by the greater height between each individual level?

Thank you for your time
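
On the 33-level question, a rough way to reason about it is the textbook definition of the vertical Courant number, |w|·Δt/Δz: the value scales linearly with the time step and inversely with the layer thickness. The Python sketch below applies that scaling to the first vert_cfl value reported above. The function name rescale_vert_cfl is hypothetical, the formula is the generic definition rather than WRF's exact diagnostic, and the estimate assumes the flow itself does not change, which a real run will not honor exactly.

```python
def rescale_vert_cfl(cfl, dt_old, dt_new, dz_ratio=1.0):
    """Estimate a vertical Courant number after changing the time step
    and/or the layer thickness, for an unchanged vertical velocity.

    dz_ratio is (new layer thickness) / (old layer thickness).
    """
    return cfl * (dt_new / dt_old) / dz_ratio

# First breach reported in the log above, with time_step = 40 s.
cfl_40s = 2.004570

# Same flow with a 36 s time step.
print(rescale_vert_cfl(cfl_40s, dt_old=40, dt_new=36))

# Same flow and time step, but layers twice as thick (roughly 33 vs 65 levels).
print(rescale_vert_cfl(cfl_40s, dt_old=40, dt_new=40, dz_ratio=2.0))
```

Under these assumptions, going from 40 s to 36 s lowers a value of 2.004570 to about 1.80, and doubling the layer thickness roughly halves it, which would be consistent with the 33-level run staying below the cfl=2 threshold.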

Arty · Oct 7, 2023

Each solution below (independently) worked:

- Timestep reduced to 36s instead of 40s
- w_damping = 1 instead of 0
- epssm = 0.2 instead of 0.1

Nevertheless, I would appreciate more concrete details on the effects of the w_damping and epssm parameters, as I didn't find much to read about them in the User Guide.
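
Each of the three fixes listed above is a one-value change in namelist.input (time_step is in &domains; w_damping and epssm are in &dynamics). A minimal Python sketch for applying one change at a time to the namelist text follows. The helper name set_namelist_value and the sample fragment are hypothetical and are not taken from the attached namelist.

```python
import re

def set_namelist_value(text, key, value):
    """Replace the right-hand side of `key = ...` in namelist text.

    Only lines that start with the key (after optional whitespace) are
    changed, so e.g. `max_time_step` is not touched when key is `time_step`.
    """
    pattern = re.compile(rf"^(\s*{re.escape(key)}\s*=\s*)[^\n!]*", re.MULTILINE)
    new_text, count = pattern.subn(lambda m: m.group(1) + value + ",", text)
    if count == 0:
        raise KeyError(f"{key} not found in namelist text")
    return new_text

# Hypothetical fragment holding the original values from this thread.
sample = """&domains
 time_step = 40,
/
&dynamics
 w_damping = 0,
 epssm     = 0.1,
/
"""

# Apply exactly one of the three fixes, e.g. the shorter time step.
print(set_namelist_value(sample, "time_step", "36"))
```

Changing one value per test run keeps it clear which setting was responsible when a run succeeds or fails.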

kwerner · Oct 11, 2023

Hi,
I'm glad you were able to get it working. Thank you for providing the solution.
Regarding damping, I will point you to the WRF Technical Note. As for epssm, the entirety of my knowledge on this is what is mentioned in the FAQ What is the most common reason for a segmentation fault, but if you desire additional information, I can reach out to a colleague to try to find out more information. Just let me know, specifically, what you'd like to know. You can also look for it in the code to see how it is used.

Arty · Oct 11, 2023

kwerner said:
Hi,
I'm glad you were able to get it working. Thank you for providing the solution.
Regarding damping, I will point you to the WRF Technical Note. As for epssm, the entirety of my knowledge on this is what is mentioned in the FAQ What is the most common reason for a segmentation fault, but if you desire additional information, I can reach out to a colleague to try to find out more information. Just let me know, specifically, what you'd like to know. You can also look for it in the code to see how it is used.

That's OK, thank you. I read about w_damping in Skamarock et al., 2008 (using WRF V3.6). I understand it's a mathematical artifice with no connection to physics. As for the epssm variable, the concept of "off-centering" is not yet clear to me, but I'll take time to read further about it.

On the other hand, even though I read about best practice for the number of x/y processors, I'd be glad to know more about whether it really is problematic to run on 10 cells/cpu. I'm doing it right now and it doesn't seem to cause any problem, but I would rather be sure the calculations are correct. Would you have something for me on that matter? Thanks.

William.Hatheway · Oct 12, 2023

Arty said:
That's OK, thank you. I read about w_damping in Skamarock et al., 2008 (using WRF V3.6). I understand it's a mathematical artifice with no connection to physics. As for the epssm variable, the concept of "off-centering" is not yet clear to me, but I'll take time to read further about it.

On the other hand, even though I read about best practice for the number of x/y processors, I'd be glad to know more about whether it really is problematic to run on 10 cells/cpu. I'm doing it right now and it doesn't seem to cause any problem, but I would rather be sure the calculations are correct. Would you have something for me on that matter? Thanks.

The quick answer about the number of x/y processors is that WRF likes square numbers. However, I use 8 because it is more efficient than 9 or 10 on my personal CPU.

Arty · Oct 12, 2023

William.Hatheway said:
The quick answer about the number of x/y processors is that WRF likes square numbers. However, I use 8 because it is more efficient than 9 or 10 on my personal CPU.

Thank you. In my case I tested different configurations, from 20 grid cells/cpu (36 cpus total) to 8 grid cells/cpu (225 cpus total, which crashed), and got the best speed at 10 grid cells/cpu, hence the choice. Other technical constraints apply (28 cpus/node, etc.).

But my point here is that, speed aside, I'm really looking for confirmation that 10 grid cells/cpu is not causing bad results due to numerical artefacts, for example.

William.Hatheway · Oct 12, 2023

@kwerner would be the best person to answer that.

But from my experience, when I ran those performance tests over all 27 cores, I did not see any difference in the plots I made from the data.

kwerner · Oct 12, 2023

Arty said:
But my point here is that, speed aside, I'm really looking for confirmation that 10 grid cells/cpu is not causing bad results due to numerical artefacts, for example.

As long as you're staying within the limitations of the model (i.e., no fewer than 10 grid cells surrounding processing tile - and the model will stop and let you know if you violate this), then it is fine. Theoretically, you should get the same results, regardless of the number of processors, or decomposition. The only issues arise when you use too few or too many, based on the size of the domain. Those issues are explained in Choosing an Appropriate Number of Processors (which you've probably already read).
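
The tile sizes quoted in this thread (20, 10 and 8 grid cells per processor side for 36, 144 and 225 processors on a 121x121 grid) can be reproduced with a short Python sketch. It assumes the processors are arranged in the most nearly square factor pair, with the first factor along west-east; the function names are hypothetical, and WRF's actual decomposition (or an explicit nproc_x/nproc_y setting) may differ.

```python
import math

def near_square_factors(nprocs):
    """Return the factor pair (nx, ny) of nprocs closest to square, nx <= ny."""
    nx = math.isqrt(nprocs)
    while nprocs % nx:
        nx -= 1
    return nx, nprocs // nx

def smallest_tile_side(n_we, n_sn, nprocs):
    """Smallest tile side length (in grid cells) for the assumed layout."""
    nx, ny = near_square_factors(nprocs)
    return min(n_we // nx, n_sn // ny)

# The three configurations mentioned in this thread, on a 121x121 grid.
for nprocs in (36, 144, 225):
    print(nprocs, near_square_factors(nprocs), smallest_tile_side(121, 121, nprocs))
```

With these assumptions, 144 processors sit right at the 10-cell limit mentioned above, and 225 processors fall below it, which matches the crash reported for that configuration.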

Arty · Oct 12, 2023

kwerner said:
As long as you're staying within the limitations of the model (i.e., no fewer than 10 grid cells surrounding processing tile - and the model will stop and let you know if you violate this), then it is fine. Theoretically, you should get the same results, regardless of the number of processors, or decomposition. The only issues arise when you use too few or too many, based on the size of the domain. Those issues are explained in Choosing an Appropriate Number of Processors (which you've probably already read).

Thank you Ms. Werner. Indeed I did. Especially this part made me wonder about potential mathematical artifacts :

" You do not want your entire tile to be halo regions, as you will want some actual space for computation in the middle of each tile. If the computation space does not exist, it can cause the model to crash, or the output to be unrealistic. "

But I am reassured now.

Arty · Oct 14, 2023

Arty said:
Each solution below (independently) worked:

- Timestep reduced to 36s instead of 40s
- w_damping = 1 instead of 0
- epssm = 0.2 instead of 0.1

Nevertheless, I would appreciate more concrete details on the effects of the w_damping and epssm parameters, as I didn't find much to read about them in the User Guide.

For further "learning from experience":

From the tests mentioned above, after a one-month run for each configuration, I decided to continue with only w_damping = 1 (for calculation-time reasons).

Eventually it did crash again, this time with vert_cfl reaching an extreme value of 166.22 (it should not exceed 2).

Code:

20150801_20150831_wdamp1_fail> grep vert_cfl rsl.e*
rsl.error.0017:d01 2015-08-04_09:58:40  MAX AT i,j,k:           52          16          26  vert_cfl,w,d(eta)=   2.006587       4.800121      1.0499954E-02
rsl.error.0017:d01 2015-08-04_09:59:20  MAX AT i,j,k:           52          16          26  vert_cfl,w,d(eta)=   2.003651       4.450675      1.0499954E-02
rsl.error.0017:d01 2015-08-04_10:00:00  MAX AT i,j,k:           52          16          26  vert_cfl,w,d(eta)=   2.020375       4.692252      1.0499954E-02
rsl.error.0017:d01 2015-08-04_10:00:00  MAX AT i,j,k:           52          16          26  vert_cfl,w,d(eta)=   2.005452       4.053613      1.0499954E-02
rsl.error.0017:d01 2015-08-04_10:00:40  MAX AT i,j,k:           52          16          26  vert_cfl,w,d(eta)=   2.095181       4.289389      1.0499954E-02
rsl.error.0017:d01 2015-08-04_10:00:40  MAX AT i,j,k:           52          16          26  vert_cfl,w,d(eta)=   2.018242       3.701325      1.0499954E-02
rsl.error.0017:d01 2015-08-04_10:01:20  MAX AT i,j,k:           52          16          26  vert_cfl,w,d(eta)=   2.180107       4.058156      1.0499954E-02
rsl.error.0017:d01 2015-08-04_10:01:20  MAX AT i,j,k:           52          16          26  vert_cfl,w,d(eta)=   2.006206       7.780139      1.0499954E-02
rsl.error.0017:d01 2015-08-04_10:01:20  MAX AT i,j,k:           52          16          26  vert_cfl,w,d(eta)=   2.060689       3.051877      1.0499954E-02
rsl.error.0017:d01 2015-08-04_10:02:00  MAX AT i,j,k:           52          16          26  vert_cfl,w,d(eta)=   2.264330       3.525656      1.0499954E-02
rsl.error.0017:d01 2015-08-04_10:02:00  MAX AT i,j,k:           52          16          26  vert_cfl,w,d(eta)=   2.124609       8.451050      1.0499954E-02
rsl.error.0017:d01 2015-08-04_10:02:00  MAX AT i,j,k:           52          16          26  vert_cfl,w,d(eta)=   2.092540       3.593448      1.0499954E-02
rsl.error.0017:d01 2015-08-04_10:02:40  MAX AT i,j,k:           52          16          26  vert_cfl,w,d(eta)=   2.307505       4.224587      1.0499954E-02
rsl.error.0017:d01 2015-08-04_10:02:40  MAX AT i,j,k:           52          16          26  vert_cfl,w,d(eta)=   2.303792       8.177484      1.0499954E-02
rsl.error.0017:d01 2015-08-04_10:02:40  MAX AT i,j,k:           52          16          26  vert_cfl,w,d(eta)=   2.164794       4.687529      1.0499954E-02
rsl.error.0017:d01 2015-08-04_10:03:20  MAX AT i,j,k:           52          16          26  vert_cfl,w,d(eta)=   2.416075       4.979859      1.0499954E-02
rsl.error.0017:d01 2015-08-04_10:03:20  MAX AT i,j,k:           52          16          26  vert_cfl,w,d(eta)=   2.495229       7.892102      1.0499954E-02
rsl.error.0017:d01 2015-08-04_10:03:20  MAX AT i,j,k:           52          16          26  vert_cfl,w,d(eta)=   2.233529       5.602085      1.0499954E-02
rsl.error.0017:d01 2015-08-04_10:04:00  MAX AT i,j,k:           52          16          26  vert_cfl,w,d(eta)=   2.395652       6.173892      1.0499954E-02
rsl.error.0017:d01 2015-08-04_10:04:00  MAX AT i,j,k:           52          16          26  vert_cfl,w,d(eta)=   2.750925       6.869215      1.0499954E-02
rsl.error.0017:d01 2015-08-04_10:04:00  MAX AT i,j,k:           52          16          28  vert_cfl,w,d(eta)=   2.272073       7.827534      1.3000011E-02
rsl.error.0017:d01 2015-08-04_10:04:40  MAX AT i,j,k:           52          16          28  vert_cfl,w,d(eta)=   2.758431       7.491847      1.3000011E-02
rsl.error.0017:d01 2015-08-04_10:04:40  MAX AT i,j,k:           52          16          26  vert_cfl,w,d(eta)=   2.967172       5.401500      1.0499954E-02
rsl.error.0017:d01 2015-08-04_10:04:40  MAX AT i,j,k:           52          16          26  vert_cfl,w,d(eta)=   2.662990       10.35507      1.0499954E-02
rsl.error.0017:d01 2015-08-04_10:05:20  MAX AT i,j,k:           52          16          28  vert_cfl,w,d(eta)=   3.172976       4.987368      1.3000011E-02
rsl.error.0017:d01 2015-08-04_10:05:20  MAX AT i,j,k:           52          16          26  vert_cfl,w,d(eta)=   3.202723       5.884050      1.0499954E-02
rsl.error.0017:d01 2015-08-04_10:05:20  MAX AT i,j,k:           52          16          26  vert_cfl,w,d(eta)=   3.711579       11.34943      1.0499954E-02
rsl.error.0017:d01 2015-08-04_10:06:00  MAX AT i,j,k:           52          16          28  vert_cfl,w,d(eta)=   3.089234       5.116210      1.3000011E-02
rsl.error.0017:d01 2015-08-04_10:06:00  MAX AT i,j,k:           52          16          33  vert_cfl,w,d(eta)=   166.2205      -2627.772      2.5999963E-02

1/2

Arty · Oct 14, 2023

I resumed the run after editing the namelist to set epssm=0.2 in addition to w_damping=1; it crashed even earlier:

Code:

20150801_20150831> grep vert_cfl rsl.e*
rsl.error.0016:d01 2015-08-04_09:39:20  MAX AT i,j,k:           50          16          26  vert_cfl,w,d(eta)=   2.018993       4.973598      1.0499954E-02
rsl.error.0016:d01 2015-08-04_09:40:00  MAX AT i,j,k:           50          16          26  vert_cfl,w,d(eta)=   2.085662       4.962131      1.0499954E-02
rsl.error.0016:d01 2015-08-04_09:40:40  MAX AT i,j,k:           50          16          26  vert_cfl,w,d(eta)=   2.102881       4.877876      1.0499954E-02
rsl.error.0016:d01 2015-08-04_09:40:40  MAX AT i,j,k:           50          16          26  vert_cfl,w,d(eta)=   2.056139       6.399289      1.0499954E-02
rsl.error.0016:d01 2015-08-04_09:41:20  MAX AT i,j,k:           50          16          26  vert_cfl,w,d(eta)=   2.115517       5.038543      1.0499954E-02
rsl.error.0016:d01 2015-08-04_09:41:20  MAX AT i,j,k:           50          16          26  vert_cfl,w,d(eta)=   2.139940       6.231495      1.0499954E-02
rsl.error.0016:d01 2015-08-04_09:42:00  MAX AT i,j,k:           50          16          26  vert_cfl,w,d(eta)=   2.109290       5.427678      1.0499954E-02
rsl.error.0016:d01 2015-08-04_09:42:00  MAX AT i,j,k:           50          16          26  vert_cfl,w,d(eta)=   2.185308       5.974369      1.0499954E-02
rsl.error.0016:d01 2015-08-04_09:42:00  MAX AT i,j,k:           50          16          26  vert_cfl,w,d(eta)=   2.034342       5.580784      1.0499954E-02
rsl.error.0016:d01 2015-08-04_09:42:40  MAX AT i,j,k:           50          16          26  vert_cfl,w,d(eta)=   2.021821       5.585244      1.0499954E-02
rsl.error.0016:d01 2015-08-04_09:42:40  MAX AT i,j,k:           50          16          26  vert_cfl,w,d(eta)=   2.224082       5.577233      1.0499954E-02
rsl.error.0016:d01 2015-08-04_09:42:40  MAX AT i,j,k:           50          16          26  vert_cfl,w,d(eta)=   2.003055       6.032197      1.0499954E-02
rsl.error.0016:d01 2015-08-04_09:43:20  MAX AT i,j,k:           50          16          28  vert_cfl,w,d(eta)=   2.128371       6.663923      1.3000011E-02
rsl.error.0016:d01 2015-08-04_09:43:20  MAX AT i,j,k:           50          16          26  vert_cfl,w,d(eta)=   2.229564       5.418715      1.0499954E-02
rsl.error.0016:d01 2015-08-04_09:44:00  MAX AT i,j,k:           50          16          28  vert_cfl,w,d(eta)=   2.194541       6.514169      1.3000011E-02
rsl.error.0016:d01 2015-08-04_09:44:00  MAX AT i,j,k:           50          16          26  vert_cfl,w,d(eta)=   2.186398       5.326013      1.0499954E-02
rsl.error.0016:d01 2015-08-04_09:44:00  MAX AT i,j,k:           50          16          26  vert_cfl,w,d(eta)=   2.177926       6.647155      1.0499954E-02
rsl.error.0016:d01 2015-08-04_09:44:40  MAX AT i,j,k:           50          16          28  vert_cfl,w,d(eta)=   2.183123       6.428658      1.3000011E-02
rsl.error.0016:d01 2015-08-04_09:44:40  MAX AT i,j,k:           50          16          26  vert_cfl,w,d(eta)=   2.135717       5.196759      1.0499954E-02
rsl.error.0016:d01 2015-08-04_09:44:40  MAX AT i,j,k:           50          16          26  vert_cfl,w,d(eta)=   2.225829       6.602267      1.0499954E-02
rsl.error.0016:d01 2015-08-04_09:45:20  MAX AT i,j,k:           50          16          28  vert_cfl,w,d(eta)=   2.078398       6.602423      1.3000011E-02
rsl.error.0016:d01 2015-08-04_09:45:20  MAX AT i,j,k:           50          16          28  vert_cfl,w,d(eta)=   2.174050       8.213039      1.3000011E-02
rsl.error.0016:d01 2015-08-04_09:45:20  MAX AT i,j,k:           50          16          26  vert_cfl,w,d(eta)=   2.297877       6.312939      1.0499954E-02
rsl.error.0016:d01 2015-08-04_09:46:00  MAX AT i,j,k:           50          16          27  vert_cfl,w,d(eta)=   2.106392       6.144927      1.3000011E-02
rsl.error.0016:d01 2015-08-04_09:46:00  MAX AT i,j,k:           50          16          28  vert_cfl,w,d(eta)=   2.396022       7.854481      1.3000011E-02
rsl.error.0016:d01 2015-08-04_09:46:00  MAX AT i,j,k:           50          16          26  vert_cfl,w,d(eta)=   2.232093       6.069819      1.0499954E-02
rsl.error.0016:d01 2015-08-04_09:46:40  MAX AT i,j,k:           50          16          30  vert_cfl,w,d(eta)=   2.158721       7.210779      1.8500030E-02
rsl.error.0016:d01 2015-08-04_09:46:40  MAX AT i,j,k:           50          16          28  vert_cfl,w,d(eta)=   2.464130       7.357839      1.3000011E-02
rsl.error.0016:d01 2015-08-04_09:46:40  MAX AT i,j,k:           50          16          26  vert_cfl,w,d(eta)=   2.150765       7.344514      1.0499954E-02
rsl.error.0016:d01 2015-08-04_09:47:20  MAX AT i,j,k:           50          16          26  vert_cfl,w,d(eta)=   2.499514       7.102536      1.0499954E-02
rsl.error.0016:d01 2015-08-04_09:47:20  MAX AT i,j,k:           50          16          28  vert_cfl,w,d(eta)=   2.498658       7.846545      1.3000011E-02
rsl.error.0016:d01 2015-08-04_09:47:20  MAX AT i,j,k:           50          16          26  vert_cfl,w,d(eta)=   2.685980       6.836717      1.0499954E-02
rsl.error.0016:d01 2015-08-04_09:48:00  MAX AT i,j,k:           50          16          26  vert_cfl,w,d(eta)=   2.923602       7.392207      1.0499954E-02
rsl.error.0016:d01 2015-08-04_09:48:00  MAX AT i,j,k:           50          16          27  vert_cfl,w,d(eta)=   2.720652       10.82692      1.3000011E-02
rsl.error.0016:d01 2015-08-04_09:48:00  MAX AT i,j,k:           50          16          28  vert_cfl,w,d(eta)=   3.279955       13.96713      1.3000011E-02
rsl.error.0016:d01 2015-08-04_09:48:40  MAX AT i,j,k:           50          16          26  vert_cfl,w,d(eta)=   3.020925       6.357178      1.0499954E-02
rsl.error.0016:d01 2015-08-04_09:48:40  MAX AT i,j,k:           50          16          30  vert_cfl,w,d(eta)=   4.128247       5.263690      1.8500030E-02
rsl.error.0016:d01 2015-08-04_09:48:40  MAX AT i,j,k:           50          16          36  vert_cfl,w,d(eta)=   12.21906       40.53259      3.1500041E-02
rsl.error.0017:d01 2015-08-04_09:48:40  MAX AT i,j,k:           51          16          28  vert_cfl,w,d(eta)=   2.053180       5.880367      1.3000011E-02
rsl.error.0017:d01 2015-08-04_09:48:40  MAX AT i,j,k:           51          16          25  vert_cfl,w,d(eta)=   2.095456       5.506284      1.0500014E-02
rsl.error.0017:d01 2015-08-04_09:48:40  MAX AT i,j,k:           51          16          26  vert_cfl,w,d(eta)=   3.155609       2.279270      1.0499954E-02

Only the 36 s (instead of 40 s) run is still running.

Compared to the first crash, the breach is again around the same level, in the southern region of the domain (where there is no terrain slope):

Code:

20131001_20131031_wdamp0_fail> grep vert_cfl rsl.e*
rsl.error.0013:d01 2013-10-22_20:45:20  MAX AT i,j,k:           14          20          26  vert_cfl,w,d(eta)=   2.004570       5.555908      1.0499954E-02
rsl.error.0013:d01 2013-10-22_20:46:00  MAX AT i,j,k:           14          20          26  vert_cfl,w,d(eta)=   2.017011       5.513463      1.0499954E-02
rsl.error.0013:d01 2013-10-22_20:46:40  MAX AT i,j,k:           14          20          26  vert_cfl,w,d(eta)=   2.031362       5.400991      1.0499954E-02
rsl.error.0013:d01 2013-10-22_20:47:20  MAX AT i,j,k:           14          20          26  vert_cfl,w,d(eta)=   2.003102       5.453300      1.0499954E-02
...
rsl.error.0013:d01 2013-10-22_20:56:00  MAX AT i,j,k:           14          20          26  vert_cfl,w,d(eta)=   6.982184       7.462337      1.0499954E-02
rsl.error.0013:d01 2013-10-22_20:56:40  MAX AT i,j,k:           14          20          28  vert_cfl,w,d(eta)=   11.27256      -246.8004      1.3000011E-02
rsl.error.0025:d01 2013-10-22_20:56:00  MAX AT i,j,k:           14          21          26  vert_cfl,w,d(eta)=   2.140012       6.960732      1.0499954E-02
rsl.error.0025:d01 2013-10-22_20:56:40  MAX AT i,j,k:           14          21          28  vert_cfl,w,d(eta)=   4.002279       17.52446      1.3000011E-02

So it appears I have no choice but to reduce time_step. I am wondering whether there is any impact from adjusting time_step only when a run fails, i.e. as long as the run proceeds properly at 40 s, I let it run, and I reduce the time_step only for the month that fails?
Would my simulation still be coherent that way, with the time_step varying back and forth over the whole period I'm running (30 years)?

2/2

kwerner · Oct 16, 2023

Arty said:
I am wondering whether there is any impact from adjusting time_step only when a run fails, i.e. as long as the run proceeds properly at 40 s, I let it run, and I reduce the time_step only for the month that fails?
Would my simulation still be coherent that way, with the time_step varying back and forth over the whole period I'm running (30 years)?

Hmm...I'm actually not sure if this would cause any issues. I want to say it wouldn't, but you could perform a quick test to see what happens. You could run your simulation for, say, 3 days. You could then run another test on the same 3 days, with everything identical, except run day 2 using a smaller time step, and then compare the output after 3 days. Another alternative is to look into using adaptive time stepping.
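
The comparison step of the 3-day test suggested above could be sketched in Python as follows. The sketch assumes the same field (for example a 2-D variable at the final output time) has already been read from both wrfout files and flattened into plain sequences of floats; reading the files is not shown, and the function name compare_fields is hypothetical.

```python
import math

def compare_fields(a, b):
    """Return (max_abs_diff, rmse) for two equal-length sequences of floats."""
    if len(a) != len(b):
        raise ValueError("fields must have the same number of points")
    if not a:
        raise ValueError("fields must not be empty")
    diffs = [x - y for x, y in zip(a, b)]
    max_abs = max(abs(d) for d in diffs)
    rmse = math.sqrt(sum(d * d for d in diffs) / len(diffs))
    return max_abs, rmse

# Made-up values standing in for the same field from the two test runs.
run_constant_dt = [290.1, 291.4, 289.8, 292.0]
run_variable_dt = [290.1, 291.5, 289.8, 291.9]
print(compare_fields(run_constant_dt, run_variable_dt))
```

Whether a given difference is acceptable is a judgment call that depends on the variable and on the purpose of the simulation.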

Arty · Dec 22, 2023

Just in case:

I did several other experiments on the same domain but with varying physics schemes and topography. I encountered some CFL crashes even with w_damping = 1 (which I now activate by default). Sometimes, setting epssm = 0.2 fixes the problem; sometimes not. I ran some experiments with epssm = 0.25 and even 0.5 that eventually crashed. As a last resort, the time step needs to be reduced anyway. So, in order not to lose too much time re-running runs that crashed on CFL errors, I would try the fixes in this order:

1) Activate w_damping = 1
2) Reduce the time step by 10%
3) Increase epssm up to 0.5 (which appears to be the max. value) as a last resort

I would also like to note that, even though they do not emerge at the same date and time, the CFL breaches all occur in the south-west quadrant of the domain, independently of the physics schemes and topography configuration, and always around vertical levels 26 to 28, which appear to be between 1300 and 1700 meters high according to the sigma levels. Please feel free to share your point of view on that matter.
