Quilting w/ very large domains that break 32-bit limit



Post by gustafson » Mon Mar 15, 2021 11:51 pm


Has anybody worked with very large domains and quilting? Specifically, I am trying to use a domain that has 2146x2776x180 grid cells. I have tried up to 108 cores per I/O group and I still get the error "Possible 32-bit overflow on output server. Try larger nio_tasks_per_group in namelist." I could use some suggestions on ways to get around this problem.
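To put numbers on the domain size above, here is a rough sketch of the arithmetic. It assumes 4-byte (single-precision) values and an even split of each field across the I/O tasks in a group; both are illustrative assumptions, and this is not the actual check WRF performs inside the quilting code. The helper name `max_fields_per_server` is made up for this example.

```python
import ctypes

INT32_MAX = 2**31 - 1  # largest value a signed 32-bit integer can hold

# Domain size quoted in the post.
nx, ny, nz = 2146, 2776, 180
BYTES_PER_VALUE = 4  # assumption: single-precision output

cells = nx * ny * nz
field_bytes = cells * BYTES_PER_VALUE

# What a signed 32-bit byte counter would hold for ONE full 3-D field.
wrapped = ctypes.c_int32(field_bytes).value


def max_fields_per_server(nio_tasks_per_group):
    """Hypothetical estimate: how many full 3-D fields fit under INT32_MAX
    bytes on one I/O task, assuming each field is split evenly across the
    tasks in the group. Not WRF's real buffer accounting."""
    return (INT32_MAX * nio_tasks_per_group) // field_bytes


print("cells per 3-D field:", cells)
print("bytes per 3-D field:", field_bytes)
print("exceeds signed 32-bit range:", field_bytes > INT32_MAX)
print("value after 32-bit wraparound:", wrapped)
print("3-D fields per server at 108 tasks:", max_fields_per_server(108))
```

Under these assumptions, a single 3-D field is already about 4.29e9 bytes, roughly twice what a signed 32-bit byte count can represent, so any per-server total kept in a default integer only stays in range if the work is split finely enough.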

Options I have thought of:
1) Use more I/O cores per group. Unfortunately, I cannot go any larger at the moment because that is already the maximum number of tile divisions we can use for the outer domain, which is smaller than the big inner domain. (My understanding is that quilting limits the number of cores per I/O group to the number of MPI ranks in the Y direction.) Even so, I don't really want to devote more cores to I/O, since this two-domain setup would already require at least 4 I/O groups to handle two domains' worth of wrfout and wrfrst files. I ultimately plan to use multiple output history files, which will increase the required number of I/O groups accordingly.

2) Drop the outer domain and just use one domain with ndown. This would be another way to allow me to use more I/O cores per group. I haven't tried it yet, but it is an option. I am not happy with this approach because it will compromise the accuracy of the solution. The inner domain is an LES setup that will require VERY frequent boundary updates to get good small-scale behavior. This will aggravate my I/O cost issues, particularly because input is essentially serial.

3) Use a larger outer domain. Sure, this is an obvious approach. However, even though the outer domain is cheaper to run, a large dx=500 m domain is still really expensive. I would rather spend that computing for the inner domain if possible.

4) Use pnetcdf without quilting. We've tried this and it works (with the CDF-5 netcdf format; see the aside below). However, I believe this will be too slow in the end. We need output at intervals of about 1 minute or less for at least part of the run, so I am sensitive to I/O cost. The good thing about this working is that it at least proves netcdf can handle such big output, even if WRF struggles with it.

5) Compile WRF with -i8 instead of -i4. I've tried compiling WRF with 64-bit integers in a couple of different ways, but so far to no avail. There are multiple F90 interfaces that break because pre-defined 32-bit and 64-bit integer versions end up mapped to the same routine. Ultimately, I think this approach is what is needed, but it would mean an unknown number of modifications to the code infrastructure. Has anybody attempted and/or succeeded in doing this? I would love to hear others' experiences. Is it worth the trouble? Will it even solve my problem?
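
For reference, the I/O core cost mentioned under option 1 can be tallied with a small sketch. It follows the reasoning in the post (one I/O group per output stream per domain) and is only an illustration; the function name `io_cores` and the stream counts passed to it are assumptions for this example, not WRF settings.

```python
def io_cores(n_domains, streams_per_domain, nio_tasks_per_group):
    """Return (I/O groups, total I/O cores), assuming one I/O group per
    output stream per domain, as reasoned in the post."""
    nio_groups = n_domains * streams_per_domain
    return nio_groups, nio_groups * nio_tasks_per_group


# Two domains, each writing wrfout and wrfrst, at 108 tasks per group.
groups, cores = io_cores(n_domains=2, streams_per_domain=2,
                         nio_tasks_per_group=108)
print(groups, "I/O groups,", cores, "I/O cores")
```

Each extra history stream per domain adds another group of the same size under this assumption, which is why more output files quickly become expensive.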

As an aside, we already had to convert WRF to output using the CDF-5 format instead of CDF-2 because of the domain size. A domain this large breaks the CDF-2 limit on the size of a single variable. If others are using really big domains, check out Schwitalla (2020, GMD, https://doi.org/10.5194/gmd-13-1959-2020) for details, and the associated code changes at https://doi.org/10.5281/zenodo.3550622.

Thanks for any help and shared experiences.

-Bill Gustafson (PNNL)
