restart run encountered segmentation fault

yutinghe

Hello all,
I'm using WRF version 4.5 to perform a restart run test, but I encountered a segmentation fault like the following at the beginning of the restart run:

[ib0306:55202:0:55202] Caught signal 11 (Segmentation fault: address not mapped to object at address 0xfffffffe07e74c40)
==== backtrace (tid: 55202) ====
0 0x000000000004d455 ucs_debug_print_backtrace() ???:0
1 0x0000000002d59f91 module_sf_sfclayrev_mp_psim_stable_() ???:0
2 0x0000000002d55394 module_sf_sfclayrev_mp_sfclayrev1d_() ???:0
3 0x0000000002d52f66 module_sf_sfclayrev_mp_sfclayrev_() ???:0
4 0x00000000025e2086 module_surface_driver_mp_surface_driver_() ???:0
5 0x0000000001e739ba module_first_rk_step_part1_mp_first_rk_step_part1_() ???:0
6 0x00000000016e841c solve_em_() ???:0
7 0x0000000001506428 solve_interface_() ???:0
8 0x00000000005b9d33 module_integrate_mp_integrate_() ???:0
9 0x00000000005ba350 module_integrate_mp_integrate_() ???:0
10 0x0000000000417591 module_wrf_top_mp_wrf_run_() ???:0
11 0x000000000041754f MAIN__() ???:0
12 0x00000000004174e2 main() ???:0
13 0x0000000000022555 __libc_start_main() ???:0
14 0x00000000004173e9 _start() ???:0
=================================
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image PC Routine Line Source
wrf.exe 000000000322C97A for__signal_handl Unknown Unknown
libpthread-2.17.s 00002B350FF84630 Unknown Unknown Unknown
wrf.exe 0000000002D59F91 Unknown Unknown Unknown
wrf.exe 0000000002D55394 Unknown Unknown Unknown
wrf.exe 0000000002D52F66 Unknown Unknown Unknown
wrf.exe 00000000025E2086 Unknown Unknown Unknown
wrf.exe 0000000001E739BA Unknown Unknown Unknown
wrf.exe 00000000016E841C Unknown Unknown Unknown
wrf.exe 0000000001506428 Unknown Unknown Unknown
wrf.exe 00000000005B9D33 Unknown Unknown Unknown
wrf.exe 00000000005BA350 Unknown Unknown Unknown
wrf.exe 0000000000417591 Unknown Unknown Unknown
wrf.exe 000000000041754F Unknown Unknown Unknown
wrf.exe 00000000004174E2 Unknown Unknown Unknown
libc-2.17.so 00002B35103B7555 __libc_start_main Unknown Unknown
wrf.exe 00000000004173E9 Unknown Unknown Unknown
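
For readers of the backtrace above: the frame names can be split into module and procedure, which shows the crash sits in psim_stable inside module_sf_sfclayrev, reached through the surface driver. Below is a minimal Python sketch, assuming the Intel Fortran naming pattern `<module>_mp_<procedure>_` (the "forrtl" message suggests an Intel build). The helper name `split_symbol` is made up for illustration.

```python
import re

# Assumption: Intel Fortran names a module procedure "<module>_mp_<procedure>_".
MODULE_PROC = re.compile(r"^(?P<module>\w+?)_mp_(?P<proc>\w+?)_$")


def split_symbol(symbol):
    """Return (module, procedure); module is None for non-module symbols."""
    match = MODULE_PROC.match(symbol)
    if match is None:
        # Plain external procedure such as solve_em_: strip trailing underscores.
        return None, symbol.rstrip("_")
    return match.group("module"), match.group("proc")


# A few frames copied from the backtrace, innermost first.
frames = [
    "module_sf_sfclayrev_mp_psim_stable_",
    "module_sf_sfclayrev_mp_sfclayrev1d_",
    "module_surface_driver_mp_surface_driver_",
    "solve_em_",
]

for symbol in frames:
    module, proc = split_symbol(symbol)
    print(f"{module or '(no module)'} :: {proc}")
```

This only relabels the frames; it does not say why psim_stable failed.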

The initial run starts at 2010-12-01_00:00:00, and the restart run starts at 2010-12-02_00:00:00.
By the way, I modified some source code to add a variable before compiling the model. I'm not sure whether this affects how restarts work.
I have attached my namelist and rsl* files. I hope you can give me some help. Many thanks!

Best,
Yuting
 

Attachments

  • namelist.input
    5.7 KB · Views: 10
  • rsl.tar.gz
    225.8 KB · Views: 4
Hi,
Can you compile a new (unmodified) version of WRF and run this to see if it still stops in the same place? That will tell us whether your modifications have made an impact.
 
Hi kwerner,
I have compiled an unmodified version of WRF V4.5 and run the same case, but unfortunately the restart run still stopped at the same place.
In case you need to check my compilation environment, I have attached my compile.log and the bash script for running wrf.exe.
If you need any other information, please let me know.
Regards,
Yuting
 

Attachments

  • compile.log
    953.9 KB · Views: 1
  • wrf_job.sh.txt
    185 bytes · Views: 3
  • env.sh.txt
    1.1 KB · Views: 3
Thanks for trying that. I'm glad to hear you didn't introduce any issues with your modifications.
- Since the model stops immediately, can you take a look at the wrfrst* files you're using to make sure they are complete and nothing seems to be missing or unreasonable?
- It may also be helpful to know whether the model would run with 4, 3, 2, or 1 domain, as opposed to 5. This could at least help to determine which domain is causing the issue.
- You aren't using too many processors, but you are very close to the limit for the size of your domains. As a test, can you try to use fewer to see if that makes any difference?
- Finally, is there a reason why you need to run a restart at day 2, instead of just running the simulation for the full time period, without a break to restart?
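
As an aside on the processor-count point above, here is a rough Python sketch of a rule of thumb often quoted on the WRF forum: at most (e_we/25) * (e_sn/25) tasks based on the smallest domain, and at least (e_we/100) * (e_sn/100) based on the largest. The constants 25 and 100 are an assumption taken from that rule of thumb, not from this thread, and the domain sizes are hypothetical.

```python
def task_range(domains):
    """Return (fewest, most) suggested MPI tasks for a list of (e_we, e_sn) domains.

    Assumption: the 100 and 25 divisors are a forum rule of thumb, not a hard limit.
    """
    largest = max(domains, key=lambda d: d[0] * d[1])
    smallest = min(domains, key=lambda d: d[0] * d[1])
    fewest = max(1, int((largest[0] / 100) * (largest[1] / 100)))
    most = max(1, int((smallest[0] / 25) * (smallest[1] / 25)))
    return fewest, most


# Hypothetical domain sizes, not the ones from the attached namelist.
example_domains = [(300, 300), (100, 100)]
print(task_range(example_domains))
```

If the task count in the job script is near the upper value, trying a smaller count is a cheap test.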
 
Thanks for your suggestions.
- Regarding the wrfrst* files, everything looks fine when I check them with ncdump (I have attached the ncdump output for the wrfrst* files). Should I drill down into specific variables in the wrfrst* files?
- About the last question: what I actually ran was a ten-year simulation, but unfortunately the model stopped after just over a year of integration due to system issues. When I tried to do a restart run, I encountered the error mentioned above. I have tried different restart times and the error is the same. So I am doing the restart test at day 2 to save time and computational expense.
Next I will test your second and third suggestions and post the results as soon as I have them.
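
On the question of drilling down into specific variables, one option is to scan each field for NaN/Inf or out-of-range values rather than reading the ncdump header. The sketch below is pure Python and works on plain lists. Reading the arrays out of a wrfrst file would need a NetCDF library, which is not shown. The function name, sample values and the T2 limits are illustrative assumptions.

```python
import math


def suspect_fields(fields, limits=None):
    """Return {name: reason} for fields with NaN/Inf or values outside given limits."""
    limits = limits or {}
    problems = {}
    for name, values in fields.items():
        values = list(values)
        if any(not math.isfinite(v) for v in values):
            problems[name] = "contains NaN or Inf"
            continue
        low, high = limits.get(name, (-math.inf, math.inf))
        if values and (min(values) < low or max(values) > high):
            problems[name] = f"outside [{low}, {high}]"
    return problems


# Made-up values standing in for flattened arrays from a restart file.
sample = {
    "TSK": [285.2, 290.1, float("nan")],
    "UST": [0.2, 0.35, 0.1],
    "T2": [1.0e9, 280.0],
}
print(suspect_fields(sample, limits={"T2": (150.0, 350.0)}))
```

Surface fields that feed the surface-layer scheme would be a reasonable place to start, since the backtrace ends there.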
 

Attachments

  • wrfrst_var.txt
    178.1 KB · Views: 1
Hi kwerner,
Following your suggestions, I tried reducing the number of domains and processors, but the restart still does not work.
I also tried different versions of WRF (4.2.2 and 4.4.2), and none of them can do a restart run for my case.
Finally, I found that the problem comes from the following two parameters in namelist.input:
Code:
sf_surface_mosaic = 1
mosaic_cat = 3
I don't know the specific reason behind this, so I hope you can report the problem to the developers.
 
Thank you so much for doing those different tests, and especially for tracking down the namelist parameters causing the issue. Are you wanting to use the mosaic option, or are you okay proceeding without it?
 
Hi, kwerner,
My case is fine without the mosaic option, but I still hope this issue can be resolved.
Thanks for your patient help!
 
Okay, thanks for letting me know. I'll put this on the list of issues we need to look into, since you are able to move forward with your simulations now.
 
Hi kwerner,
I now have to activate the Noah-mosaic option in WRF to finish my work. However, I always get a segmentation fault when I set this in namelist.input:
sf_surface_mosaic = 1
mosaic_cat = 9
How do I activate the Noah-mosaic option correctly?
Thanks, looking forward to your reply.
 
What is your setting for sf_surface_physics? Note that the mosaic option only works with Noah LSM (sf_surface_physics = 2).


If your sf_surface_physics = 2, then can you try with mosaic_cat = 3?
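
The compatibility rule above can be checked before submitting a job. This is a minimal Python sketch, assuming a simple one-setting-per-line namelist. It reads only the first (domain 1) value of each setting, and the function names are hypothetical.

```python
import re


def read_settings(namelist_text):
    """Collect 'name = value' pairs, keeping only the first value on each line."""
    settings = {}
    for line in namelist_text.splitlines():
        line = line.split("!", 1)[0]  # drop trailing namelist comments
        match = re.match(r"\s*(\w+)\s*=\s*([^,\s]+)", line)
        if match:
            settings[match.group(1).lower()] = match.group(2)
    return settings


def mosaic_problems(namelist_text):
    """Return a list of messages if the mosaic option is on without Noah LSM."""
    settings = read_settings(namelist_text)
    problems = []
    mosaic_on = settings.get("sf_surface_mosaic") == "1"
    if mosaic_on and settings.get("sf_surface_physics") != "2":
        problems.append(
            "sf_surface_mosaic = 1 requires sf_surface_physics = 2 (Noah LSM)"
        )
    return problems


# Illustrative fragment, not the attached namelist.input.
example = """
 sf_surface_physics = 4, 4,
 sf_surface_mosaic  = 1
 mosaic_cat         = 3
"""
print(mosaic_problems(example))
```

An empty list means only that this one rule is satisfied; it does not rule out the restart crash discussed in this thread.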
 
Dear Ming,
Thanks for your reply. I had already tried mosaic_cat = 3, but the result is the same error as before. I can't find the reason. My WRF version is V4.4.1. I have attached my namelist.input, rsl.out.0000 and rsl.error.0000 files.
Best wishes.
 

Attachments

  • namelist.input
    3.9 KB · Views: 7
  • rsl.error.0000
    126.2 KB · Views: 2
  • rsl.out.0000
    125.9 KB · Views: 1
Hi Tongyu ,

I just ran a test case using your physics options and my own data. I ran two cases: an initial run and a restart run. Both completed successfully, which indicates that the option sf_surface_mosaic = 1 works fine.
By the way, I ran with WRFV4.5.1. However, because no changes related to the mosaic option have been made since WRFV4.4.1, I believe WRFV4.4.1 should also work fine.
For your failed case, I suspect a data issue, since your case failed immediately after wrf.exe started. Please try the steps below to narrow down possible causes:
(1) Run over a single domain to check whether the model works. This can be done for both the initial run and the restart run.
(2) If (1) is successful, move on to the nested case.
(3) If (1) fails, turn off sf_surface_mosaic and rerun the case. If it still fails, that confirms something is wrong in your data.
Please try and let me know the results.
 
Hi Ming,
Thanks for your suggestion. I ran a single domain and the initial run works fine, but the restart run of the same single domain fails. I have attached the rsl.out.0000 file.
I also ran two domains without sf_surface_mosaic and mosaic_cat, using the same data and physics options. That works fine and restarts successfully.
Please let me know if you have any thoughts on solving the problem. Looking forward to your reply!
 

Attachments

  • rsl.out.0000
    31.3 KB · Views: 3
In your rsl file, there are many "Flerchinger USEd in NEW version" messages, which indicate that something went wrong in the physics, possibly in the land model.
However, I am perplexed that the nested case can run to the end while the single-domain run failed. This is quite unusual and I have no explanation at hand.

I would suggest that you recompile WRF in debug mode (i.e., ./configure -D) and rerun the case. This way you will find exactly when and where the model crashes, and can then trace back to find possible causes.
 
OK. Thank you, Ming!
I will try to find the cause by following your suggestions. I will post the related information here if I solve the problem.
Best wishes.
 