WRF on high cores-per-node, e.g., AMD EPYC

gustafson
New member

Hi,

I'm curious whether anybody has done work to compare the performance of WRF on the newer high-cores-per-node CPUs vs. the prior generation. What best practices are folks finding? How have you been forced to change how you run WRF? In my case, I am moving from an Intel-based cluster with 36 cores per node to a new AMD-based cluster with EPYC CPUs that have 128 cores per node. Having this many cores on a node is leading to I/O issues and makes my life difficult. I suspect folks starting to use DOE NERSC's new Perlmutter machine will run into similar difficulties (it uses EPYC chips as well). I don't know the core count on NCAR's Cheyenne replacement, but the press release I found said it will also use EPYC chips, so it might have similar problems.

Specifically, the new AMD chips with 128 cores per node lead to a bottleneck getting I/O out of the node to the filesystem. For example, on a previous cluster with 36 cores per node, I could generally get satisfactory behavior with all 36 cores acting as MPI ranks and outputting data to disk via parallel-netCDF. With 128 cores per node, however, this is more than an order of magnitude slower, because the work is more concentrated per node and the single pipe from the node to the network cannot handle all 128 cores writing to disk at the same time. To compensate, I have had to switch from an MPI-only computational approach (WRF's dmpar) to an MPI+OpenMP approach (WRF's dm+sm). This is less efficient computationally and slows down the math part of the model, but because only the MPI ranks participate in I/O, the I/O speeds up dramatically and more than offsets the less efficient calculations. I am still trying to find the optimal balance, but either 8 MPI ranks x 16 OpenMP threads or 16 ranks x 8 threads per node seems about right.
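For concreteness, here is roughly what I mean by an 8:16 layout on a 128-core node (Slurm syntax as an example; the exact launcher flags and thread pinning will differ by site and MPI stack):

    # 8 MPI ranks x 16 OpenMP threads per 128-core node (a dm+sm build of WRF)
    export OMP_NUM_THREADS=16
    srun --ntasks-per-node=8 --cpus-per-task=16 ./wrf.exe

    # namelist.input: parallel-netCDF (PnetCDF) output is io_form 11
    &time_control
      io_form_history = 11
      io_form_restart = 11
    /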

All this is quite dependent on the size of one's domain and the number of cores used. In my case I am pushing the limits of WRF: the biggest domain is about 2500 points in each horizontal dimension, and I am using ~7100 cores right now. Running this at LES scales with frequent output pushes the limits of any computer, so trying to optimize it is important. If you have learned anything comparing WRF behavior on these newer CPUs, e.g., comparing NERSC's Cori and Perlmutter machines, I would love to hear about your experiences.
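To put those numbers in perspective, the rough per-rank arithmetic (illustrative only, using the figures above) looks like this:

    2500 x 2500 points / ~7100 MPI ranks (pure dmpar)                    -> patches of roughly 30 x 30 points per rank
    2500 x 2500 points / ~444 MPI ranks (~7100 cores at 16 threads/rank) -> patches of roughly 120 x 120 points per rank

So the hybrid layout also means far fewer, larger writes per output file, which is part of why the I/O behavior changes so dramatically.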

Thanks,
Bill Gustafson
Pacific Northwest National Laboratory