
Segmentation fault with RRTMG and adaptive time step.

Manuarii

Member
Hi everyone,

I am experiencing an issue while running WRF. Although both WPS and real.exe completed successfully, running wrf.exe with adaptive time stepping leads to a segmentation fault. To work around the problem, I have to set a constant time step of 10 seconds for the parent domain. I am currently using WRF version 4.4.2 with 81 processors. Attached are the rsl.error.0001 and namelist.input files for your reference. Could this be due to the time step or to the size of my inner domain?
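For context, a minimal sketch of the two time-step configurations being compared (these lines are illustrative, not copied from the attached namelist):

Code:
! working configuration: fixed 10 s step on the parent domain
time_step                           = 10,
use_adaptive_time_step              = .false.,

! failing configuration: adaptive time stepping enabled
! use_adaptive_time_step            = .true.,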

Thanks a lot in advance,

Vazquez Ballesta Manuarii
 

Attachments

  • namelist.input
    6.3 KB · Views: 19
  • rsl.error.0001.txt
    26.9 KB · Views: 7
Hi,
Can you try a couple things?

1) First, set debug_level = 0. I know you were probably trying to get some additional output information, but this option usually just adds a lot of useless junk to the rsl files, making them more difficult to read.

2) Try running just a single domain with adaptive time stepping on. If that works, try 2 domains. I'd like to know which nest (if any) causes the issue.

3) Try removing the eta_levels from your namelist and running that way - with however many domains it takes to make it fail.

Let me know the results and then package all of your rsl* files into a single *.tar file and attach that, along with the updated namelist.input file so I can take a look. Thanks!
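For reference, the suggested changes amount to something like this in namelist.input (a sketch only; everything else stays as it is):

Code:
debug_level                         = 0,
max_dom                             = 1,     ! then 2, to find which nest triggers the failure
! and remove (or comment out) the eta_levels list in &domains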
 
Hi, Thanks for your reply.

I attempted to run the simulation by setting max_dom = 2 while using the same adaptive time step options, and it worked without issues. This indicates that the third domain is likely the cause of the problem. With this in mind, I experimented with different configurations in the namelist.input file to prevent the simulation from failing. However, even when using the smallest possible time step with the adaptive time step, nothing changed.

What’s puzzling is that the simulation consistently stops at the exact same point in time: "time 2017-02-20_00:10:00 on domain 3". In some cases, I don’t even encounter any CFL errors, yet the issue persists when running with three domains.

Attached is a zip file that contains the error logs for both the two-domain and three-domain configurations.

Manuarii
 

Attachments

  • rsl_error_3do.zip
    395.4 KB · Views: 2
  • rsl_error_2do.zip
    1.8 MB · Views: 1
Update: I've identified that the error is specifically related to the longwave radiation scheme. In the namelist.input, the radt parameter is set to 10 minutes, and the model crashes exactly when the simulation reaches that 10-minute mark. I tried adjusting this value to both 1 minute and 20 minutes, but the model still fails once it hits the corresponding time.

What steps can I take to properly use the adaptive time step with this configuration?

Manuarii
 
In some cases, I don’t even encounter any CFL errors, yet the issue persists when running with three domains.
Are you changing anything with your setup when you don't receive any CFL errors? If you're not changing anything, the CFL errors should exist every time. The rsl files you shared with me show multiple CFL errors, and unfortunately you will have to overcome that issue before anything else can be resolved. See Segmentation Faults - Helpful Information, which includes information about getting past CFL errors. There are some additional options than just decreasing the time_step. It may be that you can solve the CFL issue without having to use adaptive time step. If you can get rid of the CFL errors, and it's still failing, send your modified namelist.input file, as well as the packaged-up rsl files again. Thanks!
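For reference, these are the kinds of namelist options commonly tried when working past CFL errors (a sketch only; the values are illustrative and not a prescription for this particular case):

Code:
time_step                           = 45,       ! reduce below ~6*dx (dx in km) if needed
w_damping                           = 1,        ! damp excessive vertical motion
epssm                               = 0.5,      ! more off-centering for vertical sound waves (default 0.1)
smooth_cg_topo                      = .true.,   ! smooth coarse-grid topography near the lateral boundaries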
 
Thanks for your response. Yes, I adjusted my setup to avoid CFL errors. In the attached example (rsl.error and namelist.input), where I don't encounter CFL errors, I can increase the time step up to 1 s for the inner domain, and the model simulates up to 10 minutes before failing again.

That said, I am still encountering a segmentation fault, even with these adjustments.

Manuarii
 

Attachments

  • namelist.input
    7.8 KB · Views: 4
  • rsl_error_no_cfl_error.zip
    713.1 KB · Views: 1
Thanks for sending those. I just looked closer at your namelist and I see you are using a parent_grid_ratio of 9. We recommend never using more than a 5:1 ratio, whether or not using adaptive time-step. Can you try to set up a run that uses either 3:1 or 5:1 and see if that makes any difference?
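For reference, a 3:1 nesting for three domains would be expressed roughly like this in &domains (a sketch; parent_time_step_ratio is normally kept consistent with the grid ratio):

Code:
parent_grid_ratio                   = 1,     3,     3,
parent_time_step_ratio              = 1,     3,     3,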
 
Hi,

Thank you for your reply. Initially, we configured the model with five domains, each with a nesting ratio of 3. However, we encountered issues similar to the ones we have now. We also observed that the three-domain configuration yielded values for various meteorological variables that were closer to ground observations than the five-domain configuration did.

It appears that the smallest domain, with a resolution of 1/9 km, is the primary issue, as it fails when using an adaptive time step. What remains puzzling is that the model successfully runs with a constant time step, though only with a very small step size.

But I can run the five-domain setup again and send you the rsl.error files.
 
Hi,

I haven't found time yet to relaunch my five-domain setup to show that I get the same issues there. However, since the failure seemed linked to the radiation scheme, I tried replacing my initial scheme (RRTMG, option 4) with RRTM (option 1), but the problem persists... I'm really out of ideas, because in other WRF simulations with completely different configurations I have used the RRTMG scheme without any problem.
 
Hi,

When you use the adaptive time step you should also control the minimum allowed time step, which is currently set in your namelist to "-1". This means the default value will be used, which is equal to 3*DX (DX in km), which in your case with dx = 0.111 km equals 0.333 s. This *might* be too large a time step for such a small grid.

For the test, set min_time_step for the smallest domain to, say, 0.1 (1/10) of a second or less, and try again. If this works, the fact that you need extremely small time steps in order to avoid CFL crashes may be an indicator that something in the simulation is a little on the extreme side, probably too-steep terrain slopes or something like that.

I can't tell exactly what causes the crashes, but you should also set an epssm value between 0.5 and 1 (instead of the default 0.1) and maybe even do some terrain smoothing in GEOGRID.TBL in order to allow the model to run with larger time steps. That is under the assumption that letting the adaptive time step drop much lower actually works... if it doesn't, then something else is shaky.

EDIT: target_hcfl = 0.84 is probably too large, too.
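Putting those suggestions together, the relevant namelist entries might look roughly like this (a sketch only; the values are just a starting point, and how to specify a sub-second min_time_step depends on your WRF version, so check the adaptive-time-step section of the Users' Guide):

Code:
use_adaptive_time_step              = .true.,
target_cfl                          = 1.2,   1.2,   1.2,
target_hcfl                         = 0.84,  0.84,  0.80,   ! lowered on the innermost domain
epssm                               = 0.5,   0.5,   0.5,    ! instead of the 0.1 default
! min_time_step: -1 uses the default discussed above; set an explicit,
! smaller floor for the 1/9 km domain if your WRF version allows it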
 

Thank you for your reply, Meteoadriatic.

Unfortunately, I have already tried all possible combinations of target_cfl and min_time_step. I also attempted using the smallest time step to avoid CFL errors, but the simulation still fails at the exact same point when calling the longwave radiation scheme.

As I mentioned earlier, using a sufficiently small, constant time step allows the simulation to run without errors. However, this approach significantly increases computation time, and the same error may still occur depending on the simulation day.

The issue seems to be specifically related to using the third domain, as the model runs perfectly with only two domains.

There is something unusual happening here, and I haven’t retained all the rsl files from my tests to investigate further. The error doesn’t appear to be caused by CFL violations but seems to be linked to the longwave radiation scheme, as the simulation consistently stops at the time step defined by radt for the third domain. Changing the radiation scheme, as I mentioned earlier, doesn’t resolve the issue either.

One possible cause might be the WRF version I'm using, as I've successfully run other simulations with three domains in much more complex terrain at the same resolution. Alternatively, the issue might be related to the land-use dataset. We created a custom table that uses Corine Land Cover directly, without reclassifying it to USGS, as suggested by Pineda et al. Of course, we changed what was needed in Noah-MP and the associated tables so the model can read it, since the dataset does not have the same number of classes.
 
Thanks for the updates. Before I can dig further into this, I need you to try two different tests:
1) Try this using an unmodified version of WRF (preferably the latest version - v4.6.1) and using the default landuse dataset to see if either of those factors is the culprit.
2) Try using a 5:1 or 3:1 parent_grid_ratio. As I mentioned, we do not ever recommend using anything larger than 5:1.

If it still fails using an unmodified version of WRF and landuse, and using a reasonable parent_grid_ratio, please share your wrfinput_d0* and wrfbdy_d01 files, along with the current namelist.input file so that I can test it here. These files will likely be too large to attach, so see the home page of this forum for instructions on sharing large files. Thanks.
 
Hi Kwerner,

Thank you for your reply. I followed your suggestions and tested the latest version (4.6.1) with a 3:1 parent grid ratio and the default USGS land cover. By setting a minimal target_cfl value of 0.2 and fine-tuning the maximum and minimum time step parameters for the adaptive time step, I successfully ran 9 hours of my simulation. This was the first time I achieved such progress with the adaptive time step enabled.

However, at 9 hours and 10 minutes, the simulation failed with the same issue as before. I reviewed the rsl.error files, and no CFL errors were detected. The error consistently pointed to the Long-Wave Radiation scheme, resulting in a segmentation fault.

I also tested my previous setup with a 9:1 parent grid ratio. As expected, the simulation failed after approximately 10 minutes of runtime, again with segmentation faults related to the radiation scheme.

For further investigation, I’ve attached the namelist.input and namelist.wps files. Additionally, the wrfbdy_d01 file and the wrfinput files for all domains are available in the cloud forum for your review (https://nextcloud.mmm.ucar.edu/index.php/s/n8in8C8PAAaYqAy/download/wrfbdy.zip and https://nextcloud.mmm.ucar.edu/index.php/s/n8in8C8PAAaYqAy/download/wrfin_all.zip).

Lastly, just to clarify, I’ve modified the Noah-MP files (MPTABLE.TBL and VEGPARM.TBL) as well as the LANDUSE.TBL to integrate CLC land cover. However, I believe this should not affect the simulation when using the default USGS configuration directly. To ensure alignment, I’ve included the modified tables and Noah-MP files as well. While I don’t think the issue originates from these modifications, I’d appreciate your confirmation.

Thank you for your time and assistance.
 

Attachments

  • namelist.wps
    1.9 KB · Views: 2
  • namelist.input
    6.8 KB · Views: 4
  • Modified_noahmp_and_Table.zip
    147.9 KB · Views: 0
Hello,
I found that this often needs to be reduced as well to keep runs stable with the adaptive time step:
target_hcfl = .84,
down to 0.8 or maybe a bit lower in some extreme situations.

whereas a vertical CFL target of 0.2 makes no sense to me. If you haven't yet, I suggest trying:
target_cfl = 0.8,
target_hcfl = 0.7,

Good luck!
 
Hi Meteoadriatic,

Thank you for the suggestion!

However, I’ve already tested different sets of values for target_cfl and target_hcfl, ranging from 1.0 to 0.2, but unfortunately, the simulation continues to fail for the same reason. As I mentioned earlier, since the error occurs without a CFL violation, it doesn’t seem to be related to these parameters.

The simulation consistently fails when calling the longwave radiation (LW) scheme, regardless of the specific radiation scheme I use. This is the first time I've encountered this issue, which is surprising because I've successfully used WRF for simulations in other regions with more complex terrain, including steep slopes. With a bit of smoothing, I've been able to achieve good agreement with observations at a horizontal resolution of 1/9 km, with 2 meters for the first mass point vertically. Typically, for those other simulations, I use a target_cfl of 0.5 with target_hcfl = 0.84 without any problem...

I appreciate your help nonetheless!
 
Try changing this

Code:
radt                                = 10,    10,    10,   10,    10,

to

Code:
 radt                                = 9,    9,    9,   9,    9,

radt, according to the WRF Users' Guide, is the number of minutes between radiation physics calls; the recommendation is 1 minute per km of dx (e.g., set to 10 for a 10 km grid), with the same value used for all domains. I don't think this would be the issue; I have run cases without strictly staying aligned with this. I would try the tests that @kwerner recommended with this change and see if that helps, then with the changes @meteoadriatic suggested.

You could also try turning this one on too:


Code:
w_damping                           = 0,

to

Code:
w_damping                           = 1,
 
What is obvious is that you need, for some reason, an extremely small time step in order to keep the simulation going (you said that a constant time step of 10 seconds for the parent domain /9 km grid/ enables you to complete the simulation). But that is really small, as the usual time step that "should" work is about 6x dx, which for 9 km would be 54 seconds.

When I think more about it: if your run does not work even with CFL targets around 0.5, I don't think you actually have CFL violations; rather, when you reduce those targets to very low numbers, the adaptive time step code consequently keeps the time step very low in order to keep the actual CFL values below the targets.

But anyway, with a super-low constant time step or super-low CFL targets, your run will take a huge amount of time if you don't find and fix the underlying cause of these crashes.

With all that being said:
1) Are you sure that the input data is OK?
2) Can you possibly try without your Noah-MP and table modifications, just as a test, to narrow down the issue?
3) Most likely of all: I have found the QNSE scheme to be very unstable. Is it required that you use that PBL scheme? Can you change it, at least for a test? (A sketch of such a change follows below.)

Good luck!
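For item 3, switching the PBL scheme away from QNSE for a test might look roughly like this (a sketch; option numbers follow the standard WRF physics tables, and the QNSE PBL is normally paired with the QNSE surface layer, so both entries change together):

Code:
! current: QNSE PBL with the QNSE surface layer
! bl_pbl_physics                    = 4,     4,     4,
! sf_sfclay_physics                 = 4,     4,     4,

! test alternative: e.g. MYNN 2.5 PBL with the MYNN surface layer
bl_pbl_physics                      = 5,     5,     5,
sf_sfclay_physics                   = 5,     5,     5,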
 
Hi meteoadriatic,

The simulation requires a very small time step. Interestingly, the time it takes to run is not too bad; with a constant time step, I can achieve 1 hour of simulation in 1 hour of real time. However, there is certainly room for improvement in terms of speed.

That said, depending on the day I simulate, even a constant time step of 10 seconds is sometimes insufficient, and in certain cases, even 5 seconds is not enough.

To answer your three questions:
  1. Yes, the input data is fine; there are no issues. I'm confident in this because I use the same configurations from a previous PhD thesis, and from my wrfout files, I obtain similar results. However, the previous work only simulated two days, whereas I need to simulate a much larger number of days.
  2. I’ll test it and provide feedback once I have the results.
  3. As mentioned earlier, I am using the same parameters as those used in the previous PhD work, which showed good agreement with observational data. Nevertheless, I am open to testing changes to these parameters to explore any potential improvements.
Additionally, one approach I haven’t yet tried is smoothing the topography, particularly for areas with steep slopes. Although I don’t have many steep slopes in my domain, do you think this could help if the other tests don’t yield improvements? For now, I am still experimenting with setting radt to 9 minutes and w_damping to 1, as suggested by William Hathaway.
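If it comes to that, terrain smoothing is controlled per field in GEOGRID.TBL; an illustrative fragment for the model terrain entry (HGT_M) is shown below, where the number of smoothing passes is just an example, and geogrid.exe (and then real.exe) would have to be rerun afterwards:

Code:
name = HGT_M
        smooth_option = smth-desmth_special; smooth_passes=2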
 
Hello,

Yes, smoothing will help if the issue is too-steep slopes. Sure, try it, but I still would not bet on that. About QNSE: in the PhD work, was QNSE used on a different case study? That might explain why it worked. I don't have a lot of experience with it, but any time I tried to test it, it crashed very quickly near the beginning of the simulation (though not at the very beginning, which tells me that it might work under different conditions and that the crashes depend on something in the simulation itself).
 
Hi Meteoadriatic,

I’ve tried the recommendation from William.Hatheway, setting radt to 9 minutes and enabling w_damping by setting it to 1. However, with the same parameters for target_cfl and other settings, the simulation only ran for 2 hours before failing again for the same reason.

Previously, using radt = 10 and w_damping = 0, the simulation ran for 9 hours before encountering the same error. Interestingly, reverting to the unmodified Noah-MP and its associated tables also allows the simulation to run for 9 hours before failing again for the same reason.

My next step is to test the changes to the PBL scheme, as you suggested.

Thank you again for your help.
 