[Cheyenne] MPI task partitioning for MPAS-A on 15-3km variable mesh

jpiers

New member
I am in the process of running a 5-day simulation of MPAS-A on Cheyenne using a 15-3km variable mesh obtained here: MPAS-Atmosphere mesh downloads. The mesh comes with a range of mesh partition files (e.g., x5.6488066.graph.info.part.256, x5.6488066.graph.info.part.1024, etc.) for running MPAS-Atmosphere in parallel.

What is the most efficient way to run MPAS in parallel with high-resolution meshes like the 15-3km mesh? In other words: 1) how many MPI tasks should be used, and 2) how should they be split among Cheyenne nodes?

See the attached sample PBS batch script I used to submit a job, modified from /glade/p/mmm/wmr/mpas_tutorial/job_scripts/run_model.pbs. Here, I used 8 MPI processes per node on 32 nodes for a total of 256 MPI tasks. MPAS ran error-free, but it took 12 hours of walltime for 12 hours of simulation time. I would like to speed this up in a way that does not overwhelm the Cheyenne compute nodes.
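
In PBS terms, that resource request corresponds to a select line of roughly this form (a sketch inferred from the 32 nodes and 8 MPI tasks per node described above, with ncpus=36 reflecting Cheyenne's 36 cores per node, not a verbatim copy of the attached script):

#PBS -l select=32:ncpus=36:mpiprocs=8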

Thanks!
 

Attachments

  • run_model.txt (806 bytes)
Apologies for the delay in posting a reply here!

The short answer is that, on Cheyenne, we find that MPAS-A scales reasonably well (with >70% parallel efficiency) out to the point where each MPI task owns roughly 100 grid columns. So in principle, you could run a simulation on the x5.6488066 mesh with around 6488066/100 ≈ 64,880 MPI tasks and still make reasonably efficient use of your allocation. In practice, requesting large numbers of nodes can lead to long wait times in the queue, so most of us typically scale the model out to somewhere between 300 and 1000 grid columns per task (equating to roughly 6.5k to 21.6k MPI tasks).
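
Restating that arithmetic for the x5.6488066 mesh (6,488,066 grid columns), as a quick shell check:

echo $(( 6488066 / 100 ))    # ~64880 tasks at the ~100 columns/task scaling limit
echo $(( 6488066 / 1000 ))   # ~6488 tasks at 1000 columns/task
echo $(( 6488066 / 300 ))    # ~21626 tasks at 300 columns/task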

If you'd like to make the most of your allocation on Cheyenne, I think using fully-subscribed nodes with 36 MPI tasks per node is probably the best approach. I have Metis installed in my home directory on Cheyenne in ~duda/metis-5.1.0-intel, so you could use the gpmetis program from ~duda/metis-5.1.0-intel/bin/gpmetis as described in Section 4.1 of the User's Guide to create your own graph partition files. For example, you could create a partition file for 7200 tasks that would fully utilize 200 nodes with 36 MPI tasks per node.
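
As a concrete sketch of that gpmetis step (assuming x5.6488066.graph.info sits in your run directory; gpmetis writes the partition file alongside it):

~duda/metis-5.1.0-intel/bin/gpmetis x5.6488066.graph.info 7200
# produces x5.6488066.graph.info.part.7200 in the same directory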
 
Hi - just wanted to comment here that your suggestion worked well.

I used your Metis installation in ~duda/metis-5.1.0-intel to create my own graph partition file for 7200 tasks. Since I am using the 15-3km variable mesh, the new partition file is named x5.6488066.graph.info.part.7200.

This allows for 200 fully-subscribed nodes:
#PBS -l select=200:ncpus=36:mpiprocs=36
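
For anyone following along, here is a rough sketch of how that select line fits into a full job script, modeled on the structure of the tutorial's run_model.pbs (the job name, project code, queue, and walltime below are placeholders, and the launch command assumes Cheyenne's default MPT environment):

#!/bin/bash
#PBS -N mpas_15-3km
#PBS -A <project_code>
#PBS -q regular
#PBS -j oe
#PBS -l walltime=06:00:00
#PBS -l select=200:ncpus=36:mpiprocs=36

# 200 nodes x 36 ranks/node = 7200 MPI tasks; MPAS-A picks up
# x5.6488066.graph.info.part.7200 based on the task count and the
# config_block_decomp_file_prefix setting in namelist.atmosphere
mpiexec_mpt ./atmosphere_model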

Thank you for your help!
 