
Persistent segmentation faults

tanksnr

New member
Hi all,

I'm currently running WRFv4.4.1.
This is for a retrospective simulation forced with GFS (analysis and archived forecast; d084001).
Domain is 480x490 at 10km resolution with 41 levels.
The runs segfault around 15:00 - 17:00.

Things I've tried:
- Increased the domain extent (moved the boundaries away from complex topography) - this worked and stopped the cu scheme segfaults
- Increased the domain height (ptop from 7000 Pa to 5000 Pa) and went from 35 to 41 levels - this also helped with the cu scheme segfaults
- Changed the mp scheme from WDM5 to Morrison (both) and now to Thompson (option 8) - this helped for a single day, then it segfaulted the next
- Reduced the time step from 30 s to 25 s and now to 20 s - this helps for a day or two, but now even 20 s gives segfaults

Currently the segfaults seem to occur while in SFCLAY (I'm using "revised MM5" scheme with Noah LSM).
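For reference, if I have the WRF option numbers right, the corresponding entries in the &physics section of my namelist.input (attached) are roughly:

Code:
 mp_physics                          = 8,
 sf_sfclay_physics                   = 1,
 sf_surface_physics                  = 2,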

I've had a look at the segfault thread, but it doesn't seem to apply to my run.
1) Halos should be large enough with 48 processors
2) Plenty available disk space
3) The segfault is towards the end of the runs. I did look at the real.exe output and there are no NaNs in the surface variables.
4) There's no listed CFL errors in the rsl files
5) There should be enough memory (at least 2x64GB).

I've attached my namelist.input and the rsl.error file for the core with the segfault.

I'm not sure what else to try. Maybe increase the resolution to 9 km? The physics/dynamics set-up has worked many times before for different (smaller) domains and (higher) resolutions.
 

Attachments

  • namelist.input (7.3 KB)
  • rsl.error.0019.gz (619.9 KB)
Hi,
Can you turn off debug_level (i.e., debug_level = 0) and run this again? This shouldn't fix the problem, but since that option rarely provides any useful information, and instead makes the rsl* files huge and difficult to read, it's better to have it off. After it fails again, please package all of your rsl* files together into a *.tar or *.gz file and attach that so I can take a look. Thanks!
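For reference, that's just the debug_level entry in the &time_control section of namelist.input:

Code:
 debug_level                         = 0,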
 
In most cases epssm should be larger than the default value of 0.1 (I don't know why it is set so low by default). Try around 0.5; if you see improvement but still get crashes here and there, go up to 1 without worrying.
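In namelist.input, epssm goes in the &dynamics section, e.g.:

Code:
 epssm                               = 0.5,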
 
Thank you all for responding.
Please see attached for namelist.input, namelist.output (just in case) and the rsl files.
@kwerner you're right, debug_level set so high made the rsl files difficult to read.

@meteoadriatic and @William.Hatheway the files attached now refer to a run with epssm set at 0.5.
The segfault still occurs at the same time.
I haven't considered etac yet (first time hearing about this namelist variable).

For what it's worth (and why I included info on previous segfaults in the original post), earlier runs failed with "WOULD GO OFF TOP" errors.
This was the case for both the KF (1) and MSKF (11) cu schemes.
True enough, my original domain included Mt Kilimanjaro at the northern edge, so I moved the boundary northward and increased ptop and the number of model levels. What was weird, though, was that the error pointed to a cell in the middle of a very flat, low-lying desert (Namibia).
Changing the domain size and the MP physics from WDM5 to Thompson helped the simulation get past 4 days. The current segmentation fault provides no clues (to me at least!) as to what the cause could be.
 

Attachments

  • namelist.input (7.3 KB)
  • namelist.output.txt (84.7 KB)
  • rslfiles.tar.gz (62.1 KB)
Can you upload your namelist.wps? I have some time, so I can try to figure it out.
 
e_vert: end index in the z (vertical) direction; staggered dimension for full levels (most variables are on unstaggered levels); vertical dimensions must be the same for all domains

Code:
e_vert                              = 40, 35,

Dx and Dy for domain 2 need to be 3333.33 based on your 1:3 ratio (see the corrected snippet after the block below):

Code:
 dx                                  = 10000,  6000,
 dy                                  = 10000,  6000,
 grid_id                             = 1,      2,
 parent_id                           = 1,      1,
 i_parent_start                      = 1,      31,
 j_parent_start                      = 1,      29,
 parent_grid_ratio                   = 1,      3,
 parent_time_step_ratio              = 1,      3,
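So if you did intend to run a nest, the second column would need to look roughly like this (only relevant when max_dom > 1):

Code:
 dx                                  = 10000,  3333.33,
 dy                                  = 10000,  3333.33,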
 
Thank you for having a look.
I've attached the wps namelist. Also METGRID.TBL and Vtable I used (I added SST input into metgrid instead of separate stream).

On the grid definitions, note that this is a single-domain simulation (max_dom = 1), so I assume the second element of each namelist variable is ignored. If not, the runs should have aborted on startup, right?
 

Attachments

  • METGRID.TBL.txt (37.9 KB)
  • namelist.wps (1.2 KB)
  • Vtable.GFS.txt (8.5 KB)
Yes, you don't need to worry about the second column.

Some time ago I was trying RRTMG-K (ra*physics = 14) and had a lot of crashes, while the model was fine with the standard RRTMG. I suggest trying 4 instead of 14 unless you have a specific reason to stick with exactly that one.

Also, if your domain includes very steep topography, I would not necessarily trust the log files in terms of where within the domain the crash happened. The slopes might just be too steep. You might have to use smoothing in GEOGRID.TBL: look at the HGT field and this line, "smooth_option = smth-desmth_special; smooth_passes=1" ... try setting those passes to, say, 3, rerun geogrid.exe, and see what happens. Although at 10 km dx I doubt this is the reason for the instability.
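For the smoothing change, the HGT_M entry in GEOGRID.TBL would end up looking something like this (other lines in that entry left as they are):

Code:
 name = HGT_M
   ...
   smooth_option = smth-desmth_special; smooth_passes=3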

Also to try:

Comment out these (put ! in front like below or remove completely):
!dzstretch_s = 1.1,
!dzstretch_u = 1.2,

If there is still no improvement, remove everything from the namelist except the bare minimum, so that all defaults are used.
 
Thanks Meteoadriatic.
I'll try the radiation schemes first. No real reason for using the newest one.

For the dzstretch parameters, do you mean keep them at the default?
My intention in modifying those was to keep the layer thicknesses below, or well below, 1 km so as to minimize the chance of unresolved convection and CFL errors.
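If I understand the automatic vertical-level options correctly, an alternative would be to leave dzstretch_s/dzstretch_u at their defaults and cap the layer thickness explicitly with max_dz (which I believe defaults to 1000 m), i.e. something like:

Code:
 max_dz                              = 1000.,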
 
I'm also going to test your namelists on my machine and see if I can reproduce the results. It's 05:40 right now here in the States, but I didn't forget about this.
 
For the dzstretch parameters, do you mean keep them at the default?
My intention in modifying those was to keep the layer thicknesses below, or well below, 1 km so as to minimize the chance of unresolved convection and CFL errors.
I know, but... you might end up with too large a distance between two levels. If you do that, try increasing the number of vertical levels. Or just try the default stretching to see what happens; if it works, then experiment with modifications.
 
(attached image: 1732283403659.png, showing the proposed domain)

That's the domain you're wanting, correct?
 
I know, but... you might end up with too large a distance between two levels. If you do that, try increasing the number of vertical levels. Or just try the default stretching to see what happens; if it works, then experiment with modifications.
Apologies, I actually did mean that I was trying to keep the layer thickness below 1 km, so that the distance between two levels isn't too large and doesn't cause CFL errors.
 
OK, so it turned out to be the radiation scheme. I switched to the standard RRTMG and have had no issues so far.
I did not even consider this (even though, in retrospect, I can see it would heavily influence the sfclay scheme).
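Concretely, the change was just switching the radiation options in the &physics section back to standard RRTMG (option 4) from RRTMG-K (14):

Code:
 ra_lw_physics                       = 4,
 ra_sw_physics                       = 4,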

I'll start playing around with maybe going back to the Morrison MP scheme, and I'll certainly increase the time step.
Thanks for all the help @meteoadriatic , @William.Hatheway and @kwerner
 