
MPAS-A issues with parallelization: Segmentation fault

BAMS

Dear all,

I'm having some (hopefully standard) issues running MPAS-A.

I'm running a regional circular domain of 1500 km diameter at 3 km resolution.
Using ERA5 as initial and boundary conditions, and OSTIA for sea surface temperature, ./init_atmosphere_model runs fine and the init.nc and lbc*.nc look plausible (checked with convert_mpas and ncview).
Then ./atmosphere_model runs fine on 2 cores, but is obviously very slow. However, when increasing the number of cores to 4 or 8, it crashes.

I attached the following files:
- streams.atmosphere
- namelist.atmosphere
- run_mpas800: bash script to submit to SLURM
- slurm.err: the slurm error file
- log.atmosphere.0000.out

No log.atmosphere.*.err files were created.

Unfortunately, I couldn't find a solution on the forum.

Many thanks in advance!
 

Hi,

In your namelist.atmosphere, I found the following settings:

Code:
&io
config_pio_num_iotasks = 0
config_pio_stride = 1
/

These settings don't look right for a multi-processor run. I don't know how many processors you used for this case, so below is an example to demonstrate how to keep the processor count and the &io settings consistent.

(1) Suppose I run a case using 32 nodes, where each node has 36 processors, for a total of 1152 processors. Then I set

Code:
&io
config_pio_num_iotasks = 32
config_pio_stride = 36
/

(2) I should also have the decomposition file "generic.graph.info.part.1152" in my work directory.
(3) In my PBS script to run this case, I should set
#PBS -l select=32:ncpus=36:mpiprocs=36

Note that I specify 32 nodes and 36 processors per node, so the total number of processors matches what I specified in &io: config_pio_num_iotasks × config_pio_stride = 32 × 36 = 1152.

Hope this is helpful for you.
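Since the original poster submits through SLURM (run_mpas800) rather than PBS, a hypothetical SLURM equivalent of the PBS example above might look like this (the exact directives depend on your cluster):

```shell
#!/bin/bash
# Hypothetical SLURM analogue of the PBS example above:
# 32 nodes x 36 MPI tasks per node = 1152 tasks, matching
# config_pio_num_iotasks (32) x config_pio_stride (36) = 1152.
#SBATCH --nodes=32
#SBATCH --ntasks-per-node=36

srun -n 1152 ./atmosphere_model
```

The same rule applies at any scale: the decomposition file (e.g. generic.graph.info.part.1152) and the &io settings should both agree with the MPI task count requested from the scheduler.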
 
From your log.atmosphere.0000.out file, it looks like you're using hybrid MPI+OpenMP:
Code:
 Compile-time options:
   Build target: gfortran
   OpenMP support: yes
   OpenACC support: no
   Default real precision: double
   Compiler flags: optimize
   I/O layer: SMIOL

 Run-time settings:
   MPI task count: 4
   OpenMP max threads: 64
and that you're also running MPAS v8.0.0:
Code:
 MPAS-Atmosphere Version 8.0.0

The MPAS v8.0.1 release does correct an issue with OpenMP; from the release notes:
Fix an OpenMP error in the deallocation of an array (rthdynten) when neither
the Grell-Freitas nor the Tiedtke/nTiedtke cumulus schemes are used. (PR #1099)
Can you try updating to MPAS v8.0.1 to see whether that improves the situation? I do see in your namelist that you're using the nTiedtke cumulus scheme, but nonetheless I think it would be worth updating to incorporate all released bugfixes.
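For reference, a rebuild at v8.0.1 might look roughly like the following (a sketch, assuming a git checkout of the MPAS-Model repository and a gfortran toolchain; build targets vary by system):

```shell
# Sketch: check out and rebuild MPAS-Atmosphere at v8.0.1.
# Assumes the gfortran build target; adjust for your compiler/MPI stack.
git checkout v8.0.1
make clean CORE=atmosphere
make gfortran CORE=atmosphere    # add OPENMP=true to build with OpenMP
```

Building once without OpenMP and once with it can also help isolate whether the crash is specific to the threaded code paths.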
 
Just some clarification here: the default settings
Code:
&io
    config_pio_num_iotasks = 0
    config_pio_stride = 1
/
instruct MPAS to use all MPI tasks as I/O tasks. For larger runs, this may not provide optimal performance, but it isn't necessarily incorrect to do this.
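To make that default explicit, here is a small sketch (not MPAS code; the helper name is made up) of how the effective number of I/O ranks follows from these settings:

```shell
# Sketch (hypothetical helper, not actual MPAS code): the effective
# number of I/O ranks for a given MPI task count and &io settings.
effective_io_tasks() {
  ntasks=$1
  iotasks=$2
  if [ "$iotasks" -eq 0 ]; then
    echo "$ntasks"    # default: every MPI task performs I/O
  else
    echo "$iotasks"   # otherwise, only the requested number of ranks do I/O
  fi
}

effective_io_tasks 8 0       # prints 8
effective_io_tasks 1152 32   # prints 32
```

So with the defaults, a 4- or 8-task run simply uses 4 or 8 I/O tasks, which is valid; tuning these values mainly matters for I/O performance at larger task counts.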
 
Thank you both for your quick responses!
We rebuilt with v8.0.1 and also disabled OpenMP, and that did the trick.
We have not yet tried v8.0.1 with OpenMP enabled, as suggested.
 
Thanks for following up! If you do happen to try v8.0.1 with OpenMP enabled and find any issues, please don't hesitate to let us know and we can try to debug further.
 