How to predict RAM usage

This post was from a previous version of the WRF&MPAS-A Support Forum. New replies have been disabled and if you have follow up questions related to this post, then please start a new thread from the forum home page.

ruikang

Hello,

My MPI job stops after about 10 calculations.

During the MPI run, there appear to be X secondary processes using about 8-10 GB of RAM.

On top of that, there is a primary process that uses an unknown maximum amount of RAM. If I run this as a single-CPU job, the job peaks at about 150 GB of RAM.

The secondary processes each hold a set amount of RAM (the size of their data), and the primary process collects data from them. Its memory use grows well beyond theirs, so it may be the process that ends up dying for lack of RAM.

Our cluster administrator and I don't know what the data structures are, or how the work is split between the primary process and the secondary processes. Since the primary process is the main issue, we would have to request 150 GB of RAM for the job, which may delay the run until a node with that much RAM is free.
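To make the sizing concern concrete, here is a rough sketch of the arithmetic behind a memory request. It assumes the 8-10 GB figure is per secondary process, that all ranks land on one node, and that the 150 GB single-CPU peak is a usable upper bound for the primary process. None of that has been confirmed, and the function name and the rank count in the usage note are hypothetical.

```python
def job_memory_request_gb(n_secondary, secondary_gb, primary_gb):
    """Upper-bound RAM request (GB) for one node hosting every rank.

    Assumptions (not confirmed for WRF): each secondary process uses
    at most `secondary_gb`, and the primary process uses at most
    `primary_gb`, with no memory shared between processes.
    """
    if n_secondary < 0:
        raise ValueError("n_secondary must be non-negative")
    return n_secondary * secondary_gb + primary_gb
```

For example, with a hypothetical 4 secondary processes at 10 GB each plus a 150 GB primary, `job_memory_request_gb(4, 10, 150)` gives 190 GB.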

So my question is how to predict job sizing (and RAM usage), and whether there is any way to modify or manage that behavior, or to split the computation into discrete parts that can be run sequentially.

FYI, for our case (a 60 km × 60 km domain with 300 m grid spacing and 67 vertical layers), we need roughly 9 days to run a 1-day simulation with 640 CPUs.
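For reference, the grid size implied by those numbers can be worked out with a short sketch. It assumes a single square domain with no nests. The count of 3-D fields passed to `field_memory_gb` is purely hypothetical (I don't know how many arrays WRF actually allocates), so the memory figure is only an order-of-magnitude illustration.

```python
def grid_summary(domain_km=60.0, dx_m=300.0, n_levels=67, n_ranks=640):
    """Return (points per side, columns, total cells, columns per rank)."""
    n_side = round(domain_km * 1000.0 / dx_m)  # grid points along one side
    columns = n_side * n_side                  # horizontal grid columns
    cells = columns * n_levels                 # 3-D grid cells
    return n_side, columns, cells, columns / n_ranks


def field_memory_gb(cells, n_fields, bytes_per_value=4):
    """Memory (GB) for `n_fields` single-precision 3-D arrays.

    `n_fields` is a hypothetical input, not a value taken from WRF.
    """
    return cells * n_fields * bytes_per_value / 1e9


if __name__ == "__main__":
    n_side, columns, cells, per_rank = grid_summary()
    print(n_side, columns, cells, per_rank)
    print(field_memory_gb(cells, n_fields=100))
```

With the defaults this gives 200 points per side, 40,000 columns, 2,680,000 cells, and about 62.5 columns per rank when spread over 640 CPUs.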

Many thanks,
Ruikang
 
In more detail, our questions are as follows:

How can we predict the RAM usage of the model, both as a single-core job and as an MPI job? In particular, how much RAM should the primary MPI process be expected to use compared with the secondaries?

Is there a way to split the processing into discrete chunks? I don't know how to formulate that question since I don't know what WRF does, but if it works as some sort of Step A, Step B, Step C sequence, can we, for example, run Step A and Step B, save the output, and then run Step C separately, if that reduces RAM usage?
 