Program received signal SIGBUS: Access to an undefined portion of a memory object.

This post was from a previous version of the WRF & MPAS-A Support Forum. New replies have been disabled; if you have follow-up questions related to this post, please start a new thread from the forum home page.

todor_Gai

New member
Hi,
I have a problem running WRF on a domain with a nest inside.
Running WRF fails with the following error:
Code:
===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID 1376 RUNNING AT 4f09ebffaa69
=   EXIT CODE: 7
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Bus error (signal 7)
This typically refers to a problem with your application.
Please see the FAQ page for debugging suggestions
INFO:__main__:Time wrf.exe, #procs: 28 : 9.36809492111206 seconds
WARNING:__main__:Executing wrf.exe failed. Error:  starting wrf task           11  of           28

In the rsl.error.00* files I see only this kind of error:
Code:
Program received signal SIGBUS: Access to an undefined portion of a memory object.

Backtrace for this error:
#0  0x7f807deed32a
#1  0x7f807deec503
#2  0x7f807d569fcf
#3  0x7f807d6b9d8f
#4  0x7f807e41ccd4
#5  0x7f807e4075ff
#6  0x7f807e411526
#7  0x7f807e3a376a
#8  0x7f807e2f1988
#9  0x7f807e2f21b1
#10  0x7f807e2f27c0
#11  0x55ada6e73b88
#12  0x55ada6cfcb69
#13  0x55ada6cfd1ca
#14  0x55ada7931a73
#15  0x55ada7933829
#16  0x55ada7930a2d
#17  0x55ada7930503
#18  0x55ada7939105
#19  0x55ada7c69bb6
#20  0x55ada7820115
#21  0x55ada779cbbe
#22  0x55ada787ad6a
#23  0x55ada787be55
#24  0x55ada787d121
#25  0x55ada6a3493f
#26  0x55ada6a34f88
#27  0x55ada69c6b17
#28  0x55ada69c647e
#29  0x7f807d54cb96
#30  0x55ada69c64b9
#31  0xffffffffffffffff

Can somebody help me with this problem?

This is my namelist.input file:
Code:
 &time_control
 run_days                            = 0,
 run_hours                           = 00,
 run_minutes                         = 0,
 run_seconds                         = 0,
 start_year                          = 2020, 2020, 
 start_month                         = 06,   06,   
 start_day                           = 09,   09,   
 start_hour                          = 00,   00,   
 start_minute                        = 00,   00,   
 start_second                        = 00,   00,   
 end_year                            = 2020, 2020, 
 end_month                           = 06,   06,   
 end_day                             = 12,   12,   
 end_hour                            = 00,   00,   
 end_minute                          = 00,   00,   
 end_second                          = 00,   00,   
 interval_seconds                    = 10800
 input_from_file                     = .true.,.true.,
 history_interval                    = 60,  60,
 frames_per_outfile                  = 1000, 1000,
 restart                             = .false.,
 restart_interval                    = 5000,
 io_form_history                     = 2
 io_form_restart                     = 2
 io_form_input                       = 2
 io_form_boundary                    = 2
 debug_level                         = 0
 /

 &domains                 
 time_step                = 162,
 time_step_fract_num      = 0,
 time_step_fract_den      = 1,
 max_dom                  = 2,
 e_we                     = 195,      301, 
 e_sn                     = 162,      223, 
 e_vert                   = 35,       35,  
 p_top_requested          = 5000,
 num_metgrid_levels       = 34,
 num_metgrid_soil_levels  = 4,
 dx                       = 27000,     9000,
 dy                       = 27000,     9000,
 grid_id                  = 1,        2,    
 parent_id                = 1,        1,    
 i_parent_start           = 1,       47,    
 j_parent_start           = 1,       45,    
 parent_grid_ratio        = 1,        3,    
 parent_time_step_ratio   = 1,        3,    
 feedback                 = 1,
 smooth_option            = 0,
 /

 &physics
 mp_physics                          = 7,     7,
 use_aero_icbc                       = .true.,   
 use_rap_aero_icbc                   = .true.,
 aer_opt                             = 2, 
 aer_aod550_opt                      = 1,
 aer_aod550_val                      = 0.12,
 aer_angexp_opt                      = 1,
 aer_angexp_val                      = 1.7,
 aer_ssa_opt                         = 1,
 aer_asy_opt                         = 1,
 shcu_physics                        = 2,     2, 
 ra_lw_physics                       = 3,     3, 
 ra_sw_physics                       = 3,     3, 
 radt                                = 30,    30,
 bl_mynn_edmf                        = 0,     0, 
 sf_sfclay_physics                   = 5,     5, 
 sf_surface_physics                  = 2,     2, 
 bl_pbl_physics                      = 5,     5, 
 bldt                                = 0,     0, 
 cu_physics                          = 1,     1, 
 cudt                                = 5,     5, 
 isfflx                              = 1,
 ifsnow                              = 0,
 icloud                              = 1,
 surface_input_source                = 1,
 num_soil_layers                     = 4,
 sf_urban_physics                    = 0,     0,
 maxiens                             = 1,
 maxens                              = 3,
 maxens2                             = 3,
 maxens3                             = 16,
 ensdim                              = 144,
 /

 &fdda
 /

 &dynamics
 w_damping                           = 0,
 diff_opt                            = 1,
 km_opt                              = 4,
 diff_6th_opt                        = 0,      0,     
 diff_6th_factor                     = 0.12,   0.12,  
 base_temp                           = 290.
 damp_opt                            = 0,
 zdamp                               = 5000.,  5000., 
 dampcoef                            = 0.2,    0.2,   
 khdif                               = 0,      0,     
 kvdif                               = 0,      0,     
 non_hydrostatic                     = .true., .true.,
 moist_adv_opt                       = 1,      1,     
 scalar_adv_opt                      = 1,      1,     
 /

 &bdy_control
 spec_bdy_width                      = 5,
 spec_zone                           = 1,
 relax_zone                          = 4,
 specified                           = .true., .false.,
 nested                              = .false., .true.,
 /

 &grib2
 /

 &namelist_quilt
 nio_tasks_per_group = 0,
 nio_groups = 1,
 /

I've attached 10 of the 12 rsl.error files that contain the error: rsl.error.0001.txt, rsl.error.0002.txt, rsl.error.0003.txt, rsl.error.0005.txt, rsl.error.0006.txt, rsl.error.0009.txt, rsl.error.0010.txt, rsl.error.0011.txt, rsl.error.0018.txt, rsl.error.0021.txt

Thank you in advance,
Todor
 
I tried executing the model with fewer processors, hoping that something might change.
Typically the run ends within a few seconds with the error above.
With 20 processors the model ran for 38 minutes, but in the end it failed again.
The error logs are the same, except the error appears in only 3 of the rsl.error files.
I don't know what to do with so little information about the error.
 
Hi,
I think the number of processors you're using is okay. I agree that seg-faults are difficult to figure out.

1) Take a look at this FAQ regarding common reasons for segmentation faults:
https://forum.mmm.ucar.edu/phpBB3/viewtopic.php?f=73&t=133

2) As you sent several of the rsl* files, it's likely not the case, but just check to make sure there are no CFL errors in any of the files:
Code:
grep cfl rsl*
If you do see them, see the link above.

3) Make sure you have enough space on the disk to which you're trying to write the output files (a quick check is sketched below).

4) You can also look at this FAQ regarding ways to debug your run:
https://forum.mmm.ucar.edu/phpBB3/viewtopic.php?f=73&t=316
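
For 3), and for the stack-size limit that often comes up with segmentation faults, the checks are just the usual Linux ones; a rough sketch in bash (adjust the path to wherever you write output):
Code:
df -h /path/to/run/directory    # free space on the filesystem holding the output
ulimit -s unlimited             # remove the shell's stack-size limit before launching wrf.exe
If you launch through a batch system, the ulimit line may need to go in the job script instead, so that it applies to the MPI tasks.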
 

Hi kwerner,
I tried the suggestions above: I could not find any trace of CFL errors in the rsl.* files.
I also tried making the stack size bigger, but it didn't help and I got the same errors.
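For anyone reading later, by "bigger stack size" I mean the usual shell limit, something along the lines of:
Code:
ulimit -s unlimited        # bash/sh
# or, in csh/tcsh:
# limit stacksize unlimited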

So I enabled debugging, rebuilt the model, and started it again.
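The rebuild was roughly the standard WRF debug procedure, something like this (I believe the -d option gives an unoptimized build with debug symbols; check the configure options for your WRF version):
Code:
cd wrf/WRF
./clean -a
./configure -d                           # pick the same compiler/dmpar option as before
./compile em_real >& compile_debug.log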
I got the following stack trace in the rsl.* files:

Code:
Program received signal SIGBUS: Access to an undefined portion of a memory object.

Backtrace for this error:
#0  0x7fb97f02b2ed in ???
#1  0x7fb97f02a503 in ???
#2  0x7fb97e6a7fcf in ???
#3  0x7fb97e7f7d8f in ???
#4  0x7fb97f55acd4 in ???
#5  0x7fb97f5455ff in ???
#6  0x7fb97f54f526 in ???
#7  0x7fb97f4e176a in ???
#8  0x7fb97f42f988 in ???
#9  0x7fb97f4301b1 in ???
#10  0x7fb97f4307c0 in ???
#11  0x55db1b7310a8 in ???
#12  0x55db1b42b73f in wrf_patch_to_global_generic_
	at wrf/WRF/frame/module_dm.f90:7047
#13  0x55db1b42badb in wrf_patch_to_global_real_
	at wrf/WRF/frame/module_dm.f90:6869
#14  0x55db1c66f3bb in collect_generic_and_call_pkg_
	at wrf/WRF/frame/module_io.f90:23120
#15  0x55db1c6717d9 in collect_real_and_call_pkg_
	at wrf/WRF/frame/module_io.f90:22820
#16  0x55db1c66e354 in collect_fld_and_call_pkg_
	at wrf/WRF/frame/module_io.f90:22742
#17  0x55db1c66de2a in wrf_write_field1_
	at wrf/WRF/frame/module_io.f90:22524
#18  0x55db1c677b63 in wrf_write_field_
	at wrf/WRF/frame/module_io.f90:22320
#19  0x55db1caef4e8 in wrf_ext_write_field_
	at wrf/WRF/share/wrf_ext_write_field.f90:177
#20  0x55db1c3f530f in output_wrf_
	at wrf/WRF/share/output_wrf.f90:1398
#21  0x55db1c2eff6c in __module_io_domain_MOD_output_history
	at wrf/WRF/share/module_io_domain.f90:392
#22  0x55db1c4c54f8 in med_hist_out_
	at wrf/WRF/share/mediation_integrate.f90:896
#23  0x55db1c4cde96 in med_before_solve_io_
	at wrf/WRF/share/mediation_integrate.f90:65
#24  0x55db1af7ad1c in __module_integrate_MOD_integrate
	at wrf/WRF/frame/module_integrate.f90:318
#25  0x55db1af7b8ff in __module_integrate_MOD_integrate
	at wrf/WRF/frame/module_integrate.f90:362
#26  0x55db1af0b3a3 in __module_wrf_top_MOD_wrf_run
	at ../main/module_wrf_top.f90:324
#27  0x55db1af0a0d1 in wrf
	at wrf/WRF/main/wrf.f90:29
#28  0x55db1af0a132 in main
	at wrf/WRF/main/wrf.f90:6

Do you have any idea what might be the cause of the error?

Thank you
 
Hmm, that traceback isn't very informative.
1) Do you have enough disk space? Sometimes that can be the problem.
2) Did you check the met_em* files to make sure all the variables look okay at all levels?
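For 2), one quick way to eyeball the met_em fields (assuming the standard netCDF utilities are installed; the file name below is just an example, substitute your own dates):
Code:
ncdump -h met_em.d01.2020-06-09_00:00:00.nc | less    # list the variables and dimensions
ncview  met_em.d01.2020-06-09_00:00:00.nc             # step through each field and level visually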

If neither of those help, can you send the following files so I can run some tests:
wrfbdy_d01
wrfinput_d01
wrfinput_d02
The met_em.d01 and met_em.d02 files for the first two times

It's very likely these files are going to be too large to post here, so take a look at the home page of this forum for instructions on uploading large files. Thanks!
 
UPDATE: I made a mistake. I've been running it with 32 processes and it hasn't finished yet.
Hi.
Sorry for the late answer.
I have enough disk space.
I looked at the met_em* files but I haven't spotted anything strange.
I've also uploaded the files you asked for to Nextcloud so that you can take a look. They are in a zip file named sigbus_error_files.zip.
The met_em files have different times because I've been working on another time period.
I also tried running the model with 14 processes and it is still running with no failure. I think the problem has something to do with the number of processes used.

Thank you
Todor
 
Todor,
With the update, do you mean that you are running with 32 processors, instead of 14? Are you waiting to see if that completes without errors?
 
Sorry for the late response.
Yes, the update meant that I made a mistake and I was running with 32 processes. Unfortunately it failed again.
 
Hi,
Thanks for getting those files to me, and for letting me know. I ran this using your wrfinput* and wrfbdy_d01 files, along with the namelist.input file you provided. I tried with 32 processors, and with 108, and both simulations ran without any problems. Unfortunately the problem seems to be related to your specific system. If you made any modifications to the code, I would try with unmodified code to ensure you didn't introduce a problem. Otherwise, I'd advise discussing the issue with a systems administrator at your institution to see if they have any ideas about the failure (e.g., space issues).
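If the source came from the WRF GitHub repository, a quick way to confirm nothing was changed locally (these are just standard git commands, run from the WRF directory):
Code:
cd wrf/WRF
git status
git diff --stat
If the tree was installed from a tarball instead, comparing against a freshly unpacked copy works as well.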
 