Segmentation fault on first nest move

milancurcic · Sep 8, 2020

I'm setting up a vortex-following run with WRF 4.2.1. It's built on IBM Power9 using the XL compilers.

The model runs; however, I hit a segmentation fault as soon as the inner nest makes a move. I tested this by varying the time_to_move variable. The end of the log looks like this:

Code:

...
Timing for main: time 2019-08-30_00:01:55 on domain   2:    0.68377 elapsed seconds
Timing for main: time 2019-08-30_00:02:00 on domain   2:    0.69296 elapsed seconds
ATCF 2019-08-30_00:02:00    22.91   -67.75  995.2   45.5
 2019-08-30_00:02:00 vortex center (in nest x and y):  225.6396637 210.9163208
 2019-08-30_00:02:00 grid   center (in nest x and y):  200.0000000 200.0000000
 2019-08-30_00:02:00 disp          :  6.000000000 6.000000000
 2019-08-30_00:02:00 move (rel cd) :  1 1
  moving  2 1 1

  Signal received: SIGSEGV - Segmentation violation

  Traceback:

And here are the vortex-following related namelist parameters:

Code:

&domains
    time_step = 15
    time_step_fract_num = 0
    time_step_fract_den = 1
    max_dom = 2
    e_we = 664, 400
    e_sn = 700, 400
    e_vert = 40, 40
    p_top_requested = 5000
    num_metgrid_levels = 34
    num_metgrid_soil_levels = 4
    dx = 3000, 1000
    dy = 3000, 1000
    grid_id = 1, 2, 3
    parent_id = 1, 1, 2
    i_parent_start = 1, 471
    j_parent_start = 1, 208
    parent_grid_ratio = 1, 3
    parent_time_step_ratio = 1, 3
    feedback = 1
    smooth_option = 0
    vortex_interval  = 5, 5
    max_vortex_speed = 15, 15
    corral_dist = 8, 8
    track_level = 70000
    time_to_move = 0, 2
    use_surface = .false.
/
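To double-check the nest geometry in the namelist above, here is a minimal Python sketch. The `Domain` class and `nest_footprint` function are hypothetical, and the rules they encode are my summary of common WRF nesting constraints, not code taken from WRF: the nest's e_we-1 and e_sn-1 must be multiples of parent_grid_ratio, the nest's footprint must lie inside the parent, and corral_dist is assumed to be a minimum gap to the parent boundary measured in parent cells. I also interpret the logged "move (rel cd) :  1 1" as a shift of one parent cell in i and in j.

```python
# Hypothetical sanity check of the d02 footprint on d01, using values from the namelist above.
# ASSUMPTIONS (not WRF source code):
#   * (e_we - 1) and (e_sn - 1) of the nest must be multiples of parent_grid_ratio;
#   * i_parent_start/j_parent_start and e_we/e_sn are staggered-grid indices/counts;
#   * corral_dist is a minimum gap to the parent boundary, measured in parent cells.
from dataclasses import dataclass


@dataclass
class Domain:
    e_we: int
    e_sn: int


def nest_footprint(parent, nest, i_start, j_start, ratio):
    """Return the parent-grid index range covered by the nest and its gaps to each boundary."""
    if (nest.e_we - 1) % ratio or (nest.e_sn - 1) % ratio:
        raise ValueError("nest e_we-1 and e_sn-1 must be multiples of parent_grid_ratio")
    i_end = i_start + (nest.e_we - 1) // ratio
    j_end = j_start + (nest.e_sn - 1) // ratio
    if i_start < 1 or j_start < 1 or i_end > parent.e_we or j_end > parent.e_sn:
        raise ValueError("nest footprint extends past the parent grid")
    gaps = {
        "west": i_start - 1,
        "east": parent.e_we - i_end,
        "south": j_start - 1,
        "north": parent.e_sn - j_end,
    }
    return (i_start, i_end, j_start, j_end), gaps


d01 = Domain(e_we=664, e_sn=700)
d02 = Domain(e_we=400, e_sn=400)
corral_dist = 8

# Initial position from the namelist (i_parent_start = 471, j_parent_start = 208).
extent, gaps = nest_footprint(d01, d02, i_start=471, j_start=208, ratio=3)
# Assumed position after the logged move "move (rel cd) :  1 1".
moved_extent, moved_gaps = nest_footprint(d01, d02, i_start=472, j_start=209, ratio=3)

print(extent, gaps)
print(moved_extent, moved_gaps)
print("closest approach to a boundary:", min(moved_gaps.values()), "vs corral_dist", corral_dist)
```

Under these assumptions the nest covers parent cells 471 to 604 in i and 208 to 341 in j, and after one move it is still 59 or more parent cells from every d01 edge, far more than corral_dist = 8.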

If it helps, I have attached the full namelist.input and rsl.error.0000 files.

How can I best diagnose this?

Thank you!
Milan
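One generic first step for a crash like this is to find out which MPI ranks actually received the signal and what each one logged just before it. Below is a minimal Python sketch; `scan_run_dir` and `last_lines_before_signal` are hypothetical helpers, and the sketch assumes the usual per-rank `rsl.error.NNNN` file naming in the run directory.

```python
# Hypothetical helper: find which MPI ranks hit SIGSEGV and show the lines leading up to it.
# Assumes the standard WRF per-rank log names rsl.error.NNNN in the run directory.
from collections import deque
from pathlib import Path


def last_lines_before_signal(path, marker="SIGSEGV", context=8):
    """Return up to `context` lines preceding the first line containing `marker`, or None."""
    window = deque(maxlen=context)  # keeps only the most recent `context` lines
    with path.open(errors="replace") as fh:
        for line in fh:
            if marker in line:
                return list(window)
            window.append(line.rstrip("\n"))
    return None


def scan_run_dir(run_dir="."):
    """Map each rsl.error.* file that contains the marker to the lines before it."""
    hits = {}
    for path in sorted(Path(run_dir).glob("rsl.error.*")):
        before = last_lines_before_signal(path)
        if before is not None:
            hits[path.name] = before
    return hits


if __name__ == "__main__":
    for name, lines in scan_run_dir().items():
        print(f"== {name} ==")
        print("\n".join(lines))
```

Comparing the last lines across ranks can show whether every rank fails inside the same nest-move step or only a few do.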

kwerner · Sep 10, 2020

Hi Milan,
The problem could be that you're not using enough processors to run this. You're only using 40 processors, and for a domain of 664 x 700 you will likely need more than that. Even though the model fails at the first nest move, the problem could be related to the size of d01. Take a look at this FAQ, which helps you determine a good number of processors to use: https://forum.mmm.ucar.edu/phpBB3/viewtopic.php?f=73&t=5082
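The FAQ's guidance can be turned into a quick calculation. This sketch assumes the rule of thumb usually quoted from it: each processor's tile should be no smaller than roughly 25 x 25 grid points and no larger than roughly 100 x 100. `processor_bounds` is a hypothetical helper, and the FAQ itself remains the authority.

```python
# Rough processor-count bounds for a WRF domain.
# ASSUMPTION: tiles should be no smaller than ~25 x 25 and no larger than ~100 x 100 points.
def processor_bounds(e_we, e_sn, min_tile=25, max_tile=100):
    """Return (fewest, most) suggested processors for an e_we x e_sn domain."""
    points = e_we * e_sn
    fewest = -(-points // (max_tile * max_tile))  # integer ceiling division
    most = points // (min_tile * min_tile)        # integer floor division
    return fewest, most


for name, (nx, ny) in {"d01": (664, 700), "d02": (400, 400)}.items():
    lo, hi = processor_bounds(nx, ny)
    print(f"{name}: roughly {lo} to {hi} processors")
```

For d01 this gives roughly 47 to 743 processors, so 40 falls just below the suggested minimum. Since every domain is decomposed over the same set of MPI tasks, the same rule applied to the 400 x 400 nest gives an upper bound of about 256.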
If that's not the problem, please send your new rsl.* files (it's best if you can package all of them in a *.tar file and send that). Thanks!
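Packaging the rsl files as requested can also be scripted. This is a minimal Python sketch using only the standard-library `tarfile` module; `pack_rsl_files` is a hypothetical helper, and it assumes the default `rsl.out.NNNN` / `rsl.error.NNNN` names.

```python
# Bundle the per-rank WRF logs into one uncompressed archive for attaching to a post.
# pack_rsl_files is a hypothetical helper; file names follow the default rsl.out.NNNN /
# rsl.error.NNNN pattern (an assumption about your run directory).
import tarfile
from pathlib import Path


def pack_rsl_files(run_dir=".", archive="rsl.tar"):
    """Add every rsl.out.* and rsl.error.* file in run_dir to run_dir/archive.

    Only the two log patterns are matched, so the archive itself (rsl.tar) is never
    added to itself. Returns the member names in the order they were written.
    """
    run = Path(run_dir)
    members = sorted(run.glob("rsl.out.*")) + sorted(run.glob("rsl.error.*"))
    with tarfile.open(str(run / archive), "w") as tar:
        for path in members:
            tar.add(str(path), arcname=path.name)  # store bare names, not full paths
    return [path.name for path in members]
```

Calling `pack_rsl_files()` from the WRF run directory writes rsl.tar next to the logs.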

milancurcic · Sep 11, 2020

Thank you, Kelly! I re-ran on 320 processors with seemingly the same result. I have attached all rsl files as rsl.tar. Thank you!

kwerner · Sep 16, 2020

Hi,
I attempted to run your case using your namelist and GFS-FNL data for the dates you are using. I ran with V4.2.1, compiled with vortex following, and I didn't get the problem you had. The only difference was that I set history_interval to hourly instead of every minute, as that was taking far too long to integrate. Is there a reason you need the history interval to be 1 minute? If you set it to 60 instead, does that make any difference? Probably not, but I just wanted to check.
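To see why minute-by-minute output slows the run down, here is a rough Python calculation of how often history is written relative to the model time step. It assumes history_interval is given in minutes (as in WRF's &time_control) and that the nest step is time_step / parent_time_step_ratio, i.e. 15 s / 3 = 5 s here; `steps_between_writes` is a hypothetical helper.

```python
# Rough cost of history output for the namelist in this thread.
# ASSUMPTIONS: history_interval is in minutes; the nest time step is
# time_step / parent_time_step_ratio.
def steps_between_writes(time_step_s, history_interval_min):
    """Number of model time steps between consecutive history writes."""
    return history_interval_min * 60 / time_step_s


d01_dt = 15.0          # time_step from &domains, in seconds
d02_dt = d01_dt / 3    # parent_time_step_ratio = 3 -> 5 s

for interval in (1, 60):
    print(
        f"history_interval = {interval:>2} min: "
        f"{60 // interval} writes per simulated hour per domain, "
        f"every {steps_between_writes(d01_dt, interval):g} d01 steps "
        f"and every {steps_between_writes(d02_dt, interval):g} d02 steps"
    )
```

With 1-minute output, each domain writes a full history frame 60 times per simulated hour (every 4 d01 steps and every 12 d02 steps) instead of once.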

milancurcic · Sep 18, 2020

Thanks a lot; that confirms that the segfault is not due to a namelist misconfiguration. I only used history_interval = 1 to diagnose this issue.

At this point I can only suspect that this may have something to do with my Registry issue reported in https://forum.mmm.ucar.edu/phpBB3/viewtopic.php?f=37&t=9444.

I will keep digging there and report back what I find.
