Best Practices

zmenzo · May 7, 2021

I need to run a series of 3-month simulations using the 60-3km mesh. While I can get the model to run successfully, it only completes about 3 simulated days per 12-hour job submission, which is much too slow and computationally expensive for my purposes. I have recently removed all unnecessary variables and have attempted to optimize the nodes:cpu:mpi ratio. While this cut the allocation requirement in half, I still need to make the model more efficient. Does anyone have any suggestions on how this may be done? Possibly reducing the time step in my namelist? Does the number of soundings I'm submitting make that big a difference? Any thoughts or suggestions would be greatly appreciated.

Thank you all.

mgduda · May 24, 2021

Apologies for the long delay in posting a reply. I had a few thoughts that might help:

Compiling the model to use default single-precision real values might increase the computational throughput over default double-precision reals by around 30-35% on Cheyenne; you can select a single-precision build by adding "PRECISION=single" to your "make" command.
The small number of soundings you're writing probably doesn't negatively impact the runtime in any significant way.
Writing restart files every 3 simulated hours (i.e., output_interval="3:00:00") may be unnecessary (unless you need the restart files for a purpose other than restarting the simulation after your jobs hit their wallclock limit). If you know that you can reliably get through three simulated days per job submission, it may be worth setting the restart output interval to three days and also setting the run duration to three days in your namelist.atmosphere file.
We've found that, on Cheyenne, the MPAS-Atmosphere model scales reasonably efficiently down to around 150 grid columns per MPI task, so with 835586 columns in the 60-3 km mesh, you could potentially increase your MPI task count to around 5500 to enable you to simulate more days per job submission.
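As a rough illustration of the last point, the task-count estimate is just the column count divided by the minimum columns per task. The function name below is made up for this sketch, and the 150-column figure is the approximate Cheyenne scaling limit mentioned above, not a hard rule:

```python
def max_mpi_tasks(n_columns: int, min_columns_per_task: int = 150) -> int:
    """Largest MPI task count that still leaves at least
    `min_columns_per_task` grid columns on every task."""
    if n_columns <= 0 or min_columns_per_task <= 0:
        raise ValueError("both arguments must be positive")
    return n_columns // min_columns_per_task


# 60-3 km mesh: 835586 columns, ~150 columns per task as a lower bound
columns_60_3km = 835586
print(max_mpi_tasks(columns_60_3km))  # 5570, i.e. "around 5500"
```

In practice the task count would likely be rounded down to a multiple of the ranks per node, and to a count for which a graph partition file is available.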

Hopefully these will help (especially compiling with single-precision reals), and if you have any questions about any of these points, I'd be glad to clarify.
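To get a feel for what the single-precision build alone might buy, here is a back-of-the-envelope sketch. It assumes the 30-35% throughput gain applies directly to the current rate of 3 simulated days per 12-hour job, and it approximates a 3-month run as 90 days; both are assumptions for illustration, and the function name is hypothetical:

```python
import math


def jobs_needed(total_days: float, days_per_job: float) -> int:
    """Number of job submissions needed to cover `total_days` of simulation."""
    return math.ceil(total_days / days_per_job)


baseline_days_per_job = 3.0   # current rate: 3 simulated days per 12-hour job
total_days = 90.0             # assumption: a 3-month run is roughly 90 days

for speedup in (1.00, 1.30, 1.35):
    days_per_job = baseline_days_per_job * speedup
    n_jobs = jobs_needed(total_days, days_per_job)
    print(f"speedup {speedup:.2f}: {days_per_job:.2f} days/job, {n_jobs} jobs")
```

Under these assumptions the job count drops from 30 to roughly 23-24 before any change to the MPI task count.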

zmenzo · May 26, 2021

Thank you very much for the suggestions. I will attempt to make the changes. However, to get to the optimized MPI task count, what nodes:ncpu:mpiproc would you use? Mostly I am unsure of the node:cpu ratio that would allow for the most efficient run without taking more than my fair share of computational space (nodes).

Thanks again, Zach

mgduda · May 26, 2021

If you're running on Cheyenne in any of the "economy", "regular", or "premium" queues, you're automatically given exclusive use of the nodes allocated to your job, so using ncpus=36 and mpiprocs=36 would make maximal use of the nodes. It's been a while since I've tried, but using SMT on the Broadwell cores (i.e., with ncpus=72:mpiprocs=72) might give some improvement in throughput for a given node count, too.

As for the node count, that would depend on the total number of MPI tasks that you're using. For example, if you were to use 3600 MPI tasks, the node count would be 100 (with 36 MPI ranks per node) or 50 (with 72 MPI ranks per node).
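The node arithmetic above can be sketched as a small helper that also builds the matching PBS resource string. The function name is hypothetical; the `select=...:ncpus=...:mpiprocs=...` form follows the ncpus/mpiprocs settings mentioned above, and the sketch assumes every node is fully populated with ranks:

```python
import math


def pbs_select(total_tasks: int, ranks_per_node: int = 36) -> tuple[int, str]:
    """Return (node_count, PBS select string) for a given MPI task count.

    Rounds up, so the last node may be only partly filled if
    `total_tasks` is not a multiple of `ranks_per_node`.
    """
    nodes = math.ceil(total_tasks / ranks_per_node)
    spec = f"select={nodes}:ncpus={ranks_per_node}:mpiprocs={ranks_per_node}"
    return nodes, spec


# 3600 MPI tasks without SMT (36 ranks/node) and with SMT (72 ranks/node)
print(pbs_select(3600, 36))  # (100, 'select=100:ncpus=36:mpiprocs=36')
print(pbs_select(3600, 72))  # (50, 'select=50:ncpus=72:mpiprocs=72')
```

The resulting string would go on a `#PBS -l` line in the job script.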
