
Persistent segmentation faults

tanksnr

New member
Hi all,

I'm currently running WRFv4.4.1.
This is for a retrospective simulation forced with GFS (analysis and archived forecast; d084001).
Domain is 480x490 at 10km resolution with 41 levels.
The runs segfault around 15:00 - 17:00.

Things I've tried:
- Increased the domain extent (moved the boundaries away from complex topography) - this worked and stopped the cu scheme segfaults
- Increased the domain height (ptop from 7000 Pa to 5000 Pa) and went from 35 to 41 levels - this also helped with the cu scheme segfaults
- Changed the mp scheme from WDM5 to Morrison (both) and now to Thompson (option 8) - this helped for a single day, then it segfaulted the next
- Reduced the time step from 30 s to 25 s and now to 20 s - this helps for a day or two, but now even 20 s gives segfaults

Currently the segfaults seem to occur while in SFCLAY (I'm using "revised MM5" scheme with Noah LSM).
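For reference, if I have the WRF option numbers right, the corresponding entries in the &physics section of my namelist.input (attached) are roughly:

Code:
 mp_physics                          = 8,
 sf_sfclay_physics                   = 1,
 sf_surface_physics                  = 2,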

I've had a look at the segfault thread, but it doesn't seem to apply to my run.
1) Halos should be large enough with 48 processors
2) Plenty available disk space
3) The segfault is towards the end of the runs. I did look at the real.exe output and there are no NaNs in the surface variables.
4) There's no listed CFL errors in the rsl files
5) There should be enough memory (at least 2x64GB).

I've attached my namelist.input and the rsl.error file for the core with the segfault.

I'm not sure what else to try. Maybe increase the resolution to 9 km? The physics/dynamics set-up has worked many times before for different (smaller) domains and (higher) resolutions.
 

Attachments

  • namelist.input (7.3 KB)
  • rsl.error.0019.gz (619.9 KB)
Hi,
Can you turn off debug_level (i.e., debug_level = 0) and run this again? This shouldn't fix the problem, but since that option rarely provides any useful information, and instead makes the rsl* files huge and difficult to read, it's better to have it off. After it fails again, please package all of your rsl* files together into a *.tar or *.gz file and attach that so I can take a look. Thanks!
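For reference, that's just the debug_level entry in the &time_control section of namelist.input:

Code:
 debug_level                         = 0,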
 
In most cases epssm should be larger than the default value of 0.1 (I don't know why it is set so low by default). Try around 0.5; if you see improvement but still get crashes here and there, go up to 1 without worrying.
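In namelist.input, epssm goes in the &dynamics section, e.g.:

Code:
 epssm                               = 0.5,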
 
Thank you all for responding.
Please see attached for namelist.input, namelist.output (just in case) and the rsl files.
@kwerner you're right, debug_level set so high made the rsl files difficult to read.

@meteoadriatic and @William.Hatheway the files attached now refer to a run with epssm set at 0.5.
The segfault still occurs at the same time.
I haven't considered etac yet (first time hearing about this namelist variable).

For what it's worth (and why I included info on previous segfaults in the original post), earlier runs failed with "WOULD GO OFF TOP" errors.
This was the case for both the KF (1) and MSKF (11) cu schemes.
True enough, my original domain included Mt Kilimanjaro at the northern edge, so I moved the boundary northward and increased ptop and the number of model levels. What was weird, though, was that the error pointed to a cell in the middle of a very flat, low-lying desert (Namibia).
Changing the domain size and the MP physics from WDM5 to Thompson helped the simulation get past 4 days. The current segmentation fault provides no clues (to me at least!) as to what the cause could be.
 

Attachments

  • namelist.input (7.3 KB)
  • namelist.output.txt (84.7 KB)
  • rslfiles.tar.gz (62.1 KB)
Can you upload your namelist.wps? I have some time, so I can try to figure it out.
 
e_vert: end index in the z (vertical) direction; staggered dimension for full levels (most variables are on unstaggered levels); vertical dimensions must be the same for all domains

Code:
e_vert                              = 40, 35,

Dx and Dy for domain 2 need to be 3333.33 based on your 1:3 ratio (see the corrected snippet after the block below):

Code:
 dx                                  = 10000,  6000,
 dy                                  = 10000,  6000,
 grid_id                             = 1,      2,
 parent_id                           = 1,      1,
 i_parent_start                      = 1,      31,
 j_parent_start                      = 1,      29,
 parent_grid_ratio                   = 1,      3,
 parent_time_step_ratio              = 1,      3,
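So if you did intend to run a nest, the second column would need to look roughly like this (only relevant when max_dom > 1):

Code:
 dx                                  = 10000,  3333.33,
 dy                                  = 10000,  3333.33,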
 
Thank you for having a look.
I've attached the wps namelist. Also METGRID.TBL and Vtable I used (I added SST input into metgrid instead of separate stream).

On the grid definitions, note that this is a single-domain simulation (max_dom = 1), so I assume the second element of each namelist variable is ignored. If not, the runs should have aborted on startup, right?
 

Attachments

  • METGRID.TBL.txt (37.9 KB)
  • namelist.wps (1.2 KB)
  • Vtable.GFS.txt (8.5 KB)
Yes, you don't need to worry about the second column.

Some time ago I was trying RRTMG-K (ra*physics = 14) and had a lot of crashes, while the model was fine with the standard RRTMG. I suggest trying 4 instead of 14 unless you have a specific reason to stick with exactly that one.

Also, if your domain includes very steep topography, I would not necessarily trust the log files in terms of where within the domain the crash happened. The slopes might just be too steep. You might have to use smoothing in GEOGRID.TBL: look at the HGT field and this line, "smooth_option = smth-desmth_special; smooth_passes=1" ... try setting those passes to, say, 3, rerun geogrid.exe, and see what happens. Although at 10 km dx I doubt this is the reason for the instability.
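For the smoothing change, the HGT_M entry in GEOGRID.TBL would end up looking something like this (other lines in that entry left as they are):

Code:
 name = HGT_M
   ...
   smooth_option = smth-desmth_special; smooth_passes=3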

Also to try:

Comment out these (put ! in front like below or remove completely):
!dzstretch_s = 1.1,
!dzstretch_u = 1.2,

If there is still no improvement, remove everything from the namelist except the bare minimum, so that all defaults are used.
 
Thanks Meteoadriatic.
I'll try the radiation schemes first. No real reason for using the newest one.

For the dzstretch parameters, do you mean keep them at the default?
My intention in modifying those was to keep the layer thicknesses below, or well below, 1 km so as to minimize the chance of unresolved convection and CFL errors.
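If I understand the automatic vertical-level options correctly, an alternative would be to leave dzstretch_s/dzstretch_u at their defaults and cap the layer thickness explicitly with max_dz (which I believe defaults to 1000 m), i.e. something like:

Code:
 max_dz                              = 1000.,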
 
I'm also going to test your namelists on my machine and see if I can reproduce the results. It's 05:40 right now here in the States, but I didn't forget about this.
 
For the dzstretch parameters, do you mean keep them at the default?
My intention in modifying those was to keep the layer thicknesses below, or well below, 1 km so as to minimize the chance of unresolved convection and CFL errors.
I know, but... you might end up with too large a distance between two levels. If you do that, try increasing the number of vertical levels. Or just try the default stretching to see what happens; if it works, then experiment with modifications.
 
(attached image: 1732283403659.png, showing the proposed domain)

That's the domain you're wanting, correct?
 
I know, but... you might end up with too large a distance between two levels. If you do that, try increasing the number of vertical levels. Or just try the default stretching to see what happens; if it works, then experiment with modifications.
Apologies, I actually did mean that I was trying to keep the layer thickness below 1 km, so that the distance between two levels isn't too large and doesn't cause CFL errors.
 
OK, so it turned out to be the radiation scheme. I switched to the standard RRTMG and have had no issues so far.
I did not even consider this (even though, in retrospect, I can see it would heavily influence the sfclay scheme).
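Concretely, the change was just switching the radiation options in the &physics section back to standard RRTMG (option 4) from RRTMG-K (14):

Code:
 ra_lw_physics                       = 4,
 ra_sw_physics                       = 4,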

I'll start playing around with maybe going back to the Morrison MP scheme, and I'll certainly increase the time step.
Thanks for all the help @meteoadriatic , @William.Hatheway and @kwerner
 