Thomas,
Maybe this is related to memory usage, as the ndown job stops just before doing anything with the fine grid.
Code:
NDOWN_EM V4.2 PREPROCESSOR
ndown_em: calling alloc_and_configure_domain coarse
*************************************
Parent domain
ids,ide,jds,jde 1 138 1 187
ims,ime,jms,jme -4 30 -4 24
ips,ipe,jps,jpe 1 23 1 17
*************************************
DYNAMICS OPTION: Eulerian Mass Coordinate
alloc_space_field: domain 1 , 313760848 bytes allocated
DEBUG wrf_timetoa(): returning with str = [2017-01-05_18:00:00]
DEBUG wrf_timetoa(): returning with str = [2017-01-05_18:00:00]
DEBUG wrf_timetoa(): returning with str = [2017-01-07_00:00:00]
DEBUG wrf_timeinttoa(): returning with str = [0000000000_000:000:005]
DEBUG setup_timekeeping(): clock after creation, clock start time = 2017-01-05_18:00:00
DEBUG setup_timekeeping(): clock after creation, clock current time = 2017-01-05_18:00:00
The Cheyenne machine has large-memory nodes available, with more than 2.4x the memory of the default nodes. It looks like you were already going down this path on your own (using only 11 of the 36 cores per node).
Code:
#PBS -l select=6:ncpus=11:mpiprocs=11
Looking at the namelist, the resulting horizontal decomposition gives per-process patches of approximately 176x137 grid points, which is a bit larger than average, but certainly not a problem. However, the real issue could be the 540 vertical levels! You might win the November award for most levels (darn, if only we had that contest).
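Just to show where that estimate comes from, here is a rough back-of-the-envelope sketch, assuming WRF splits your 66 MPI tasks into 6 along west-east and 11 along south-north (which is what a 176x137 patch implies for the 1054x1504 fine grid):
Code:
# per-process patch size for the 1054x1504 fine grid split across 6 x 11 = 66 MPI tasks
echo "west-east points per task:   $(( (1054 + 6 - 1) / 6 ))"    # ~176 (ceiling division)
echo "south-north points per task: $(( (1504 + 11 - 1) / 11 ))"  # ~137 (ceiling division)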
Try just a couple of time periods with the large-memory nodes on Cheyenne, and use only two MPI processes per node. Go big and try 72 nodes. The "mem=109GB" option tells the scheduler to place your job on the large-memory node partition.
Code:
#PBS -l select=72:ncpus=2:mpiprocs=2:mem=109GB
The 72x2 (nodes * processes/node) setup gives a total of 144 MPI processes, and that decomposition will still work with your relatively small coarse grid. Again, just try a couple of time periods to see whether this fixes the problem. This is about 28x the memory of your first attempt, so it is definitely "going big". If it does work, dial back the number of nodes over a few iterations, for example 36x4, 24x6, or 18x8 setups.
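For those dial-back experiments, the select line would look something like one of these (same large-memory request, still 144 MPI processes in total; pick one per job):
Code:
#PBS -l select=36:ncpus=4:mpiprocs=4:mem=109GB
#PBS -l select=24:ncpus=6:mpiprocs=6:mem=109GB
#PBS -l select=18:ncpus=8:mpiprocs=8:mem=109GB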
Later on down the road ...
The WRF model itself will be less of a memory issue, since the MPI decomposition only has to handle your large domain (1054x1504) without also fitting the small domain into the mix. However, with 540 vertical levels, you will probably still need the large-memory nodes for the WRF model. Once you are ready for WRF:
- I'll give you some info on how to reduce the thousands (maybe tens of thousands) of rsl files you are about to create. It is a compile-time option.
- You may want to consider your objective regarding output. For example, do you need every 3d field at 540 levels? The WRF model has I/O options (and you would need them mostly for the "O"), and you can also reduce file size by removing fields from the output stream (see the namelist sketch after this list). At 100 m resolution (so approximately 0.5 s per time step), outputting data every 15 minutes means a snapshot only every 1800 time steps, which is very crude temporal resolution.
- You may want to consider generating station data, which would give you data every time step at discrete locations (see the tslist sketch below).
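For trimming fields from the output stream without recompiling, here is a rough sketch of the runtime I/O field mechanism; the file name and the field names below are only placeholders for whatever you decide to drop. Point &time_control at a text file:
Code:
&time_control
 iofields_filename       = "reduce_output_d01.txt",
 ignore_iofields_warning = .true.,
/
and in that file list the fields to remove ("-") from history stream 0, e.g.
Code:
-:h:0:W,TKE,QGRAUP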
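And for the station-data idea, a sketch of WRF's time-series output: put a file named tslist in the run directory (the station names, prefixes, and lat/lon values below are made up) and set the matching options in &domains.
Code:
#-----------------------------------------------#
# 24 characters for name | pfx |  LAT  |   LON  |
#-----------------------------------------------#
tower_a                    twra  40.000 -105.250
tower_b                    twrb  40.125 -105.000
Code:
&domains
 max_ts_locs  = 2,
 ts_buf_size  = 200,
 max_ts_level = 15,
/
WRF then writes per-time-step files such as twra.d01.TS for each station.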