V4.3.3: SIGSEGV, segmentation fault after several hours (model time)

This post was from a previous version of the WRF&MPAS-A Support Forum. New replies have been disabled and if you have follow up questions related to this post, then please start a new thread from the forum home page.

cwaigl

Hi WRF community,

I'm in the early stages of a project to run WRF using ERA5 reanalysis data, ultimately over a two-way nested domain (12 km and 4 km spatial resolution). Right now my test runs (run_hours = 21) have all finished fine, but once I extend the runs I encounter segmentation faults.

The last messages from rsl.error on the core that segfaults (with debug_level set to 200, and running only the outer domain) seem to be getting caught by the spam filter here, so I am attaching the file instead.

The segfault always occurs after "CALL rrtmg_lw". It usually occurs right after the midnight UTC timestamp has rolled around, but occasionally it has happened at another timestep. Also, I am using adaptive time steps here, but the behavior is identical with fixed time step (72 = 6 times dx, and same with time_step = 60).
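For reference, the "6 times dx" rule of thumb mentioned above can be written out as a small sketch. The function name and the 4 km example are illustrative only; the rule takes dx in kilometres and gives a time step in seconds.

```python
def rule_of_thumb_time_step(dx_m: float, factor: float = 6.0) -> float:
    """Return factor * dx (dx in km) as a time step in seconds.

    dx_m is the grid spacing in metres, as it appears in namelist.input.
    The factor of 6 is the rule of thumb quoted in the post, not a hard limit.
    """
    return factor * dx_m / 1000.0


# 12 km outer domain, as in the post
print(rule_of_thumb_time_step(12000.0))
# 4 km nest, shown only for comparison
print(rule_of_thumb_time_step(4000.0))
```

For the 12 km domain this gives the 72 s quoted above; the 60 s and 48 s values tried in this thread are more conservative than the rule.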

I initially suspected a change I made to METGRID.TBL to cap snow fields (SNOW variable from SD variable in ERA5), but this last segfault occurred after I switched back and regenerated all met_em files.

Attached namelist.input and namelist.wps. (Note that I pre-generated three days' worth of met_em files for both domains. This has not made a difference in the past, but if you advise so I can generate the met_em files for the exact same test run that I run WRF on later.)

Any help is greatly appreciated - I'm a bit stuck here.
 

Attachments

  • namelist.input.202203142027.txt
    5.3 KB · Views: 24
  • namelist.wps.202203142027.txt
    593 bytes · Views: 18
  • rsl.error.0009.txt
    2.7 KB · Views: 21
Hi,
Can you actually turn off debug_level (or set to 0) and then re-run this? We have removed the debug parameter from more recent versions of WRF because it never tends to provide much information, and it just adds a lot of junk to the rsl* files, making them large and difficult to comb through. Once you re-run, please package all of your rsl.error* files into a single *.TAR file and attach that so that I can take a look. Thanks!
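One way to package the files, as a minimal sketch using only the Python standard library. The function name and default archive name are hypothetical, and it assumes the per-process files are named rsl.error.NNNN (as in rsl.error.0009). The numeric-suffix filter keeps an existing rsl.error.tar from being packed into itself.

```python
import os
import re
import tarfile

# Assumed naming: rsl.error.0000, rsl.error.0001, ...
RSL_ERROR = re.compile(r"rsl\.error\.\d+")


def pack_rsl_errors(run_dir=".", archive="rsl.error.tar"):
    """Bundle every rsl.error.NNNN file in run_dir into one uncompressed tar."""
    names = sorted(n for n in os.listdir(run_dir) if RSL_ERROR.fullmatch(n))
    with tarfile.open(archive, "w") as tar:
        for name in names:
            # arcname drops the directory part so the archive unpacks flat
            tar.add(os.path.join(run_dir, name), arcname=name)
    return names


# Example call from the WRF run directory:
#   pack_rsl_errors(".", "rsl.error.tar")
```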
 
Hi,

Thanks for getting back to me!

1. I did what you advised. rsl.error files are attached.
2. While continuing to debug on my side I inspected all variables visually using ncview. The one that stood out is VAR_SSO. It looks cut off at 60°N (see attached image). Regenerating it (with geogrid.exe) did not give a different result. Could this be the cause, and if yes, what do you advise? (I think it shouldn't matter at this stage, as I have set gwd_opt = 0.) I am using the latest version of WRF-ARW.
 

Attachments

  • met_em_d01_VAR_SSO.png
    90 KB · Views: 468
  • rsl.error.tar
    490 KB · Views: 20
I made some changes to the physics options (to align with previously successful runs a few years ago) and then did a 54 h run (my target length: 6 h spin-up and then two full days, midnight to midnight UTC) with just the large domain and dt = 48, and it didn't crash.

Then I started the exact same run again with the only change to set max_domains to 2 (both domains). The segmentation fault occurred right after midnight UTC rolled around the first time.

Any insight would be very helpful - I'm stumped right now. New namelist.input and rsl.error files are attached.
 

Attachments

  • rsl.error.20220316.tar
    1.1 MB · Views: 15
  • namelist.input.20220316.txt
    5.3 KB · Views: 21
Hi,
I apologize for the delay. VAR_SSO is a variable that is included to assist with the topo_wind option, and it doesn't look like you're using that option, so it shouldn't be the issue. It is known that there is no VAR_SSO data above 60 degrees.

I see that you're running WRF-Chem. Does this same issue happen when you only run a basic WRF simulation? If so, then we can try to dig into this in more detail. Otherwise, it would point to a wrf-chem specific problem, and if that's the case, you should post your issue in the WRF-Chem section of this forum. There is a different team that supports those inquiries and they'd be able to assist you better.
 
Hi -

Thanks for getting back to me. I was surprised to hear that I was running WRF-Chem, as that was unintentional. (This is in fact the first time I intended to run vanilla WRF, not WRF-Chem.) It seems I accidentally compiled with chemistry (I got the code from GitHub, and apparently it's included) by having the WRF_CHEM environment variable set to 1. I just re-compiled without chemistry and am doing a test. I'll report back whether the behavior changes.
 
Unfortunately, switching from WRF-Chem to plain WRF did not change anything. The crash happens at exactly the same point in the simulation, right after the model time passes midnight (00:00 UTC) for the first time. (In other runs the crash did not always occur at this point, but it often did.)

I attach new rsl.error and namelist.input files. Any help is, as always, appreciated.
 

Attachments

  • namelist.input.20220325.txt
    5.3 KB · Views: 15
  • rsl.error.20220325.tar
    860 KB · Views: 14
Thanks for sending those, and I'm glad you realized you needed to only run basic wrf, instead of chemistry!

There are so many different reasons why a segmentation fault can occur. I recommend taking a look at this FAQ that discusses some of them.

Although the model is stopping when the input data for d02 is being processed at 2021-06-02_00, it's not actually integrating at all, which means it's pretty much stopping immediately. So keep that in mind when looking at that FAQ.

I would also suggest trying to run this simulation with just a very basic namelist, e.g., the namelist.input file that comes with the model code by default. You will need to modify the domain information, dates, etc. to match your input, but leave everything else the same (e.g., keep the default physics, don't turn on sst_update) and see if that's able to run. If it doesn't run, the problem is likely with the input; if it does, you can slowly add in the namelist options you want to use, one at a time, to see which one causes the crash.
 
Thanks for your continued help - I think I'm getting closer, but don't have a solution yet. I did what you suggested: took the default namelist, ONLY updated the domain and start/end time variables, and ran it: No crash. Then I ONLY added auxinput4 and sst_update = 1 ... and it is crashing.
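When adding options back one at a time like this, it can help to list exactly which settings differ between the namelist that runs and the one that crashes. The following is a minimal sketch, not a full Fortran namelist parser: the function names are hypothetical, and it assumes one "key = value" entry per line with no continuation lines and no "!" inside quoted strings.

```python
def parse_namelist(text):
    """Map (group, key) -> tuple of value strings for a simple namelist."""
    settings = {}
    group = None
    for raw in text.splitlines():
        line = raw.split("!", 1)[0].strip()  # drop trailing comments
        if not line:
            continue
        if line.startswith("&"):
            group = line[1:].strip().lower()  # e.g. &physics
        elif line == "/":
            group = None  # end of the current group
        elif "=" in line and group is not None:
            key, value = line.split("=", 1)
            values = tuple(v.strip() for v in value.split(",") if v.strip())
            settings[(group, key.strip().lower())] = values
    return settings


def diff_namelists(base_text, trial_text):
    """Return {(group, key): (base_values, trial_values)} for changed entries."""
    base, trial = parse_namelist(base_text), parse_namelist(trial_text)
    keys = sorted(set(base) | set(trial))
    return {k: (base.get(k), trial.get(k)) for k in keys if base.get(k) != trial.get(k)}


# Example usage with two files on disk (paths are placeholders):
#   with open("namelist.input.default") as a, open("namelist.input") as b:
#       for key, (old, new) in diff_namelists(a.read(), b.read()).items():
#           print(key, old, "->", new)
```

An entry that is present in only one file shows up with None on the other side, which makes newly added options such as sst_update easy to spot.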

What is the next step here?
 
It's great you were able to find a culprit so quickly! Based on your namelist.wps file, it doesn't look like you're using any additional SST input beyond whatever is provided by your primary input data source (e.g., GFS). Is this true? If you are not providing higher-resolution (temporal or spatial) SST data, then there shouldn't be a reason to turn on sst_update, since your run is only a little over 2 days long. We typically recommend the sst_update option only for runs of 5 or more days. To use this option, you need to have access to time-varying SST and sea-ice fields.

I can't say for sure this is the only cause of the segmentation fault, but perhaps try turning on some of your other options without sst_update to see if it's able to run.
 
Thanks again! I thought I was providing time-varying SST with the ERA5 single level data (from RDA, not Copernicus). After discussion in our team we do want to update it, even for a 2-day run.

I am confused about what I need to do. Time-varying SST is being picked up in the wrflowinp_d0x files, but shouldn't it be masked with the correct (high-res) mask?

As additional information, some rsl.out files have huge blocks of this before the crash, and I understand from reading the forum that it's a sign of model instability. (Changing radt did not make a difference, for what it's worth):

Code:
Flerchinger USEd in NEW version. Iterations=          10
 
Thanks for that information. Often Flerchinger errors indicate there are unreasonable soil moisture or soil temperature values in the initial conditions. As a first check, it might be worth plotting or at least scanning through the soil moistures and soil temperatures to see that they all look reasonable. I'm not sure why this would cause the simulation to crash only when using sst_update, unless it has to do with the way the variables interact. If the soil values look okay, take a look to make sure the values in your wrflowinp* files look reasonable, as well.
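As a rough sketch of such a scan, the helper below flags values that are NaN or fall outside a plausibility range. The helper name and the bounds in the usage note are assumptions for illustration, not WRF-defined limits.

```python
import math


def flag_out_of_range(values, low, high):
    """Return (index, value) pairs that are NaN or outside [low, high]."""
    return [
        (i, v)
        for i, v in enumerate(values)
        if math.isnan(v) or not (low <= v <= high)
    ]


# Tiny example with made-up soil moisture values (m3/m3)
sample = [0.25, 1.5, float("nan"), 0.0]
for index, value in flag_out_of_range(sample, 0.0, 1.0):
    print(index, value)
```

In practice the values would come from the flattened soil moisture and soil temperature fields in the initial-condition files (for example SMOIS and TSLB in wrfinput_d01), with assumed bounds such as 0 to 1 m3/m3 and roughly 200 to 350 K. Water points may carry fill values, so flagged cells should be checked against the land mask before drawing conclusions.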
 