Hello,
My MPI job stops after about 10 calculations.
During the MPI run, there appear to be X secondary processes, each using about 8-10 GB of RAM.
On top of that, there is a primary process whose maximum RAM usage is unknown. If I run this as a single-CPU job, it maxes out at about 150 GB of RAM.
Since the secondary processes each use a fixed amount of RAM (the size of their data), and the primary process collects data from them, the primary grows much larger than that, and it may be the one that ends up dying from lack of RAM.
Our cluster administrator and I don't know what the data structures are or how the work is split between the primary and secondary processes. Since the primary process is the main issue, we would have to request 150 GB of RAM for the job, which may delay it until a node with that much free memory becomes available.
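To make my suspicion concrete, here is a minimal sketch (not the actual model code, which I haven't seen; the buffer names and sizes are made up) of the kind of gather pattern that would explain this behavior: every rank holds only its own subdomain, but rank 0 also allocates a buffer for the whole domain when collecting.

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Hypothetical subdomain size per rank (made-up number). */
    const int local_n = 1000000;
    double *local = calloc((size_t)local_n, sizeof(double));

    /* Rank 0 needs room for every rank's subdomain at once, so its
       footprint grows with the total domain size / number of ranks,
       while the other ranks stay flat. */
    double *global = NULL;
    if (rank == 0)
        global = calloc((size_t)nprocs * local_n, sizeof(double));

    MPI_Gather(local, local_n, MPI_DOUBLE,
               global, local_n, MPI_DOUBLE,
               0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("rank 0 holds %ld doubles\n", (long)nprocs * local_n);

    free(local);
    free(global);
    MPI_Finalize();
    return 0;
}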
So my question is: how can I predict job sizing (in particular RAM usage), and is there any way to modify or manage this behavior, or to split the computation into discrete parts that can be run sequentially?
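For sizing, one thing I could do is instrument a short test run so that each rank reports its own peak resident memory at the end; then we would know exactly how much the primary and the secondaries need before requesting a full allocation. A minimal sketch of that (using getrusage; on Linux, ru_maxrss is reported in kilobytes):

#include <stdio.h>
#include <sys/resource.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* ... the actual computation would run here ... */

    /* Peak resident set size of this process (kilobytes on Linux). */
    struct rusage ru;
    getrusage(RUSAGE_SELF, &ru);
    long peak_kb = ru.ru_maxrss;

    /* Find the largest peak across all ranks and report it on rank 0. */
    long max_kb = 0;
    MPI_Reduce(&peak_kb, &max_kb, 1, MPI_LONG, MPI_MAX, 0, MPI_COMM_WORLD);

    printf("rank %d peak RSS: %ld MB\n", rank, peak_kb / 1024);
    if (rank == 0)
        printf("largest peak across all ranks: %ld MB\n", max_kb / 1024);

    MPI_Finalize();
    return 0;
}

That would also show whether the primary's peak really scales with the total domain size or with the number of ranks.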
FYI, for our case (a 60 km x 60 km domain with 300 m grid spacing and 67 vertical layers), it takes roughly 9 days of wall-clock time to simulate 1 day with 640 CPUs.
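For reference, my back-of-the-envelope for that grid: 60 km / 300 m = 200 points in each horizontal direction, so 200 x 200 x 67 is about 2.7 million grid points. One double-precision 3D field over the full domain is then roughly 2.7 million x 8 bytes, i.e. about 21 MB, so the total memory must be dominated by how many such fields (time levels, halos, work arrays, etc.) the code keeps, which is exactly the part we don't know.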
Many thanks,
Ruikang