
(RESOLVED) Segmentation fault on ndown


tmazzett

Howdy, I am working in /glade/scratch/tmazzett/WRF/run on the Cheyenne system.
I have attempted running ndown a few times with different numbers of processors, and segmentation faults keep occurring and crashing the process.
The segmentation faults usually happen on the same rsl. file numbers, but they sometimes change randomly between identical runs.
Does anyone have any ideas on how to fix this?
Thomas
 
Hi Thomas,
There are a couple of things I see that could potentially be the culprit.
1) This may be because of the placement of your d02 inside d01. It is entirely too close to the outer edge of d01. As a rule of thumb, we advise leaving about 1/3 of d01 as a buffer around all sides of d02. Take a look at this page for recommended settings and best practices, specifically for domain set-up.
2) You are using a 9:1 parent_grid_ratio. We recommend 3:1 or 5:1, and never anything larger than 7:1; we see the best results with 3:1 and 5:1 (see the namelist sketch after this list).
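For illustration only (every number below is a placeholder, not a recommendation for your case), the relevant &domains entries in namelist.input for a 3:1 nest with roughly a 1/3 buffer of d01 on every side might look like this, with matching &geogrid entries in namelist.wps:
Code:
&domains
 max_dom            = 2,
 e_we               = 150, 151,
 e_sn               = 150, 151,
 i_parent_start     = 1,   51,
 j_parent_start     = 1,   51,
 parent_grid_ratio  = 1,   3,
/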
 
Thank you for the reply.
It seems to me that the changes you suggest would make the runs more numerically reliable.
But would either of them cause the failure I am seeing when processing wrfout_d01 with wrfndi_d02 to create wrfinput_d02 and wrfbdy_d02?
My errors are not CFL errors or runtime segmentation faults; the failure is just in setting up the inputs.
 
Thomas,

Maybe this is related to memory usage, as the ndown job stops just before doing anything with the fine grid.
Code:
NDOWN_EM V4.2 PREPROCESSOR
  ndown_em: calling alloc_and_configure_domain coarse
 *************************************
 Parent domain
 ids,ide,jds,jde            1         138           1         187
 ims,ime,jms,jme           -4          30          -4          24
 ips,ipe,jps,jpe            1          23           1          17
 *************************************
DYNAMICS OPTION: Eulerian Mass Coordinate
   alloc_space_field: domain            1 ,              313760848  bytes allocated
   DEBUG wrf_timetoa():  returning with str = [2017-01-05_18:00:00]
   DEBUG wrf_timetoa():  returning with str = [2017-01-05_18:00:00]
   DEBUG wrf_timetoa():  returning with str = [2017-01-07_00:00:00]
   DEBUG wrf_timeinttoa():  returning with str = [0000000000_000:000:005]
 DEBUG setup_timekeeping():  clock after creation,  clock start time = 2017-01-05_18:00:00
 DEBUG setup_timekeeping():  clock after creation,  clock current time = 2017-01-05_18:00:00

The Cheyenne machine has larger-memory nodes available, with more than 2.4x the memory of the default nodes. It looks like you were already heading down this path on your own (using only 11 of the 36 cores on each node).
Code:
#PBS -l select=6:ncpus=11:mpiprocs=11

Looking at the namelist, the resultant horizontal decomposition is approximately 176x137, which is a bit larger than average, but certainly not a problem. However, the big deal could be the 540 vertical levels! You might win the November award for most levels (darn, if we only had that contest).
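For anyone following along, here is roughly where that patch size comes from, assuming the 1054x1504 fine grid mentioned further down and the 66-task job above:
Code:
66 MPI tasks      -> 6 x 11 process layout (WRF picks factors close to square)
1054 / 6  ~ 176   -> west-east points per patch on the fine grid
1504 / 11 ~ 137   -> south-north points per patch
176 x 137 x 540   -> roughly 13 million points per patch for each 3d field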

Try just a couple of time periods with the large-memory nodes on Cheyenne, and use only two processors per node. Go big, and try 72 nodes. The "mem=109GB" option tells the scheduler to run your job on the large-memory node partition.
Code:
#PBS -l select=72:ncpus=2:mpiprocs=2:mem=109GB

The 72x2 (nodes * processors/node) layout gives a total of 144 MPI processes, and that decomposition will work with your relatively small coarse grid. Again, just try a couple of time periods to see whether this fixes the problem. This is 28x the memory of your first effort, so it is definitely "going big". If this idea does work, dial back the number of nodes over a few iterations; try 36x4, 24x6, and 18x8 setups.
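For reference, a complete job script along these lines might look like the sketch below; the job name, project code, queue, and walltime are placeholders, and mpiexec_mpt is assumed as the MPI launch command.
Code:
#!/bin/bash
#PBS -N ndown_bigmem
#PBS -A <project_code>
#PBS -q regular
#PBS -j oe
#PBS -l walltime=02:00:00
#PBS -l select=72:ncpus=2:mpiprocs=2:mem=109GB

cd /glade/scratch/tmazzett/WRF/run
mpiexec_mpt ./ndown.exe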

Later on down the road ...

The WRF model itself will be less of a memory issue, as we can choose an MPI decomposition for your large (1054x1504) fine domain without also needing to fit the coarse domain into the mix. However, with 540 vertical levels, you will probably also be on the large-memory nodes for the WRF model. Once you are ready for WRF:
  • I'll give you some info on how to reduce the thousands (maybe tens of thousands) of rsl files you are about to create. It is a compile-time option.
  • You may want to consider your objective regarding output. For example, do you need every 3d field at 540 levels? The WRF model has I/O options (and you would need them mostly for "O"), but you can also reduce your file size by removing fields from the output stream. At 100 m resolution (so approximately 0.5 s for a time step), outputting the data at an interval of 15 min would be a snapshot every 1800 time steps. That is very crude temporal resolution.
  • You may want to consider generating station data, which would give you data per time step at discrete locations (both options are sketched after this list).
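As a sketch of those last two points (field names, file names, and the station location below are placeholders, not recommendations): fields can be dropped from the history stream with a runtime I/O file referenced from &time_control,
Code:
&time_control
 iofields_filename       = "reduce_d01.txt", "reduce_d02.txt",
 ignore_iofields_warning = .true.,
/

where each line of reduce_d01.txt removes the listed fields from history stream 0, for example
Code:
-:h:0:W,TKE_PBL,QGRAUP

and per-time-step point output comes from a "tslist" file placed in the run directory (station name, prefix, latitude, longitude):
Code:
#-----------------------------------------------#
# 24 characters for name | pfx |  LAT  |   LON  |
#-----------------------------------------------#
Tower site A             twra    40.00  -105.00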
 
:( I tried 66 large-memory nodes with 1 process on each, and it still failed in the same way.

Could it have something to do with the vertical refinement of the grid spacing?
 
Thomas,
OK, looks like you are in for some detective work.

The problem does not appear to be memory related.

  • The two other pieces that you mention could be an issue: the vertical refinement or the nesting ratio. It would be worthwhile to test each separately. The vertical refinement requires turning off a switch and filling in e_vert for d02 (see the sketches after this list). The nesting ratio would require that you reconstruct a new IC file (all the way back to geogrid).
  • It might be instructive to see where this is dying. If you rebuild the code, use "configure -D" (the debug build). That should point to a particular line number where things are failing.
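To make those two suggestions concrete, here is a hedged sketch (the level count is a placeholder taken from the 540 levels discussed above). Turning off the vertical refinement for the ndown test means setting vert_refine_method to 0 and giving d02 the same number of levels as d01:
Code:
&domains
 vert_refine_method = 0,   0,
 e_vert             = 540, 540,
/

And a debug rebuild, assuming the standard WRF build sequence, would look something like:
Code:
./clean -a
./configure -D
./compile em_real >& compile.log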
 
OK, I will keep diving in.

We tried going from 900 m to 300 m with ndown instead of to 100 m, and were able to get it to run just by deleting vert_refine_method.
 
So as not to confound the test, can you try the original 900 m -> 100 m run, without the vertical refinement?
 
Good idea, but there would be no point in running the simulation after that test, because there would be nowhere near enough vertical levels for numerical stability, right?
 
Thomas,
First, let's see why the ndown program is not working. The vertical spacing may be a problem later in WRF, but ndown is just interpolating.
 
Yeah, exactly. Thank you for the help.

For the moment, though, we are going to stick with the 300 m run we accomplished.

This is for a class project, and it might just take us too long to get the 100 m run working, but this thread will be useful in my future battles with LES, and hopefully to other forum users.
 