Segmentation Fault when Running MPAS (scale_region mesh discontinuity)

aroseman

Member
I keep getting a strange segmentation fault when generating lateral boundary conditions (LBCs) with MPAS v8.3.1. I have tried increasing the number of MPI processes (as seen in the attached .sh file) as well as undersubscribing, but the fault still occurs. It also happens at a random point in the run: in log.init_atmosphere.0000.out, it sometimes stops where the attached example shows, and other times it manages to write a few LBC files before crashing.
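
For context, my submit script follows the standard Derecho PBS pattern; the sketch below uses placeholder project, queue, and walltime values rather than my exact script:
Code:
#!/bin/bash
#PBS -N mpas_8.3.1_BCs
#PBS -A <project_code>
#PBS -q main
#PBS -l select=4:ncpus=128:mpiprocs=64
#PBS -l walltime=01:00:00

# 4 nodes x 64 ranks per node = 256 MPI ranks,
# matching the 256-way graph partition file
mpiexec -n 256 -ppn 64 ./init_atmosphere_model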

Note: I am using a regional 60-3 km mesh. I generated a 256-way partition by applying "gpmetis -minconn -contig -niter=200 ${name}.graph.info 256" to the mesh's .graph.info file to get a .graph.part.256 file. I then used the new mesh-scaling tool "scale_region" to scale the mesh by a factor of 3, giving a 20-1 km mesh.
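
For reproducibility, the mesh preparation amounted to roughly the following. The gpmetis command is exactly what I ran; the scale_region invocation is only illustrative, since I don't remember the exact argument syntax (check the MPAS-Tools documentation):
Code:
# 256-way partition of the regional mesh for MPI
gpmetis -minconn -contig -niter=200 ${name}.graph.info 256

# scale the 60-3 km mesh by a factor of 3 to get a 20-1 km mesh
# (illustrative syntax only -- see the MPAS-Tools docs)
scale_region.py ${name}.grid.nc 3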


Checking memory usage of the failed job:
Code:
qhist -j 3277247

Job ID   User      Queue  Nodes  NCPUs  NGPUs  End      Mem    CPU    Elap
-------  --------  -----  -----  -----  -----  -------  -----  -----  ----
3277247  aroseman  cpu    4      512    0      01-1644  66.87  37.37  0.01

What should I try to figure out the cause and avoid the error?
 

Attachments

  • submit_mpas_256_BCs.sh.txt (754 bytes)
  • log.init_atmosphere.0000_BC_36.out.txt (648 bytes)
  • mpas_8.3.1_BCs.e3276877.txt (4.3 KB)
Although I don't have any good ideas as to why the init_atmosphere_model program may be randomly stopping with a segmentation fault, I did notice that you appear to be using a rather old compiler from the intel/2023.0.0 module on Derecho. It could be worth trying either the intel/2025.1.0 module or the gcc/12.4.0 module. Perhaps best would be to use the same modules described in the "0. Prerequisites and environment setup" section of the most recent MPAS-A tutorial practice guide:
Code:
module --force purge
module load ncarenv/24.12
module load craype/2.7.31
module load gcc/12.4.0
module load ncarcompilers/1.0.0
module load cray-mpich/8.1.29
module load parallel-netcdf/1.14.0
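
After switching modules, be sure to do a full clean and rebuild so that no objects compiled with the old toolchain are left over. Something along these lines should work (the gfortran target below matches the GCC modules; substitute the target for your compiler):
Code:
# remove everything built with the previous compiler
make clean CORE=init_atmosphere
make clean CORE=atmosphere

# rebuild both cores with the freshly loaded modules
make gfortran CORE=init_atmosphere
make gfortran CORE=atmosphere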
 
Thanks! I was using the 2024 guide before. I will try redoing everything from scratch following the 2025 workshop guide. I will use the Intel option, since I am used to that; if that doesn't work, I will try compiling with GCC next.

This is very strange, though, since this issue never occurred before with the exact same setup; maybe it's something small that simply didn't surface previously.
 
Hi Michael,

I ran with the newest modules, using Intel, and was able to get through the static, IC, and BC steps!

Code:
module --force purge
module load ncarenv/24.12
module load craype/2.7.31
module load intel/2025.1.0
module load ncarcompilers/1.0.0
module load cray-mpich/8.1.29
module load parallel-netcdf/1.14.0
module load netcdf/4.9.2

Thanks for the help!
 
However, unfortunately, after launching the model with atmosphere_model, a segmentation fault occurs shortly after startup:

Code:
PBS Job Id: 3279777.desched1
Job Name: mpas_8.3.1_Run
Execution terminated
Exit_status=174
resources_used.cpupercent=18438
resources_used.cput=00:48:21
resources_used.mem=121042440kb
resources_used.ncpus=512
resources_used.vmem=60053264kb
resources_used.walltime=00:00:16

I also noticed something quite strange when viewing the initial conditions with ncvis: there is a region of NaN surface pressure in the center of the domain. It doesn't occur for every variable, but it does for others like rho and relative humidity, and it extends through multiple vertical levels.

Note: I have used this data before without issue.
 

Attachments

  • model_directory.txt (5.1 KB)
  • log.atmosphere.0000.out.txt (1.3 KB)
  • mpas_8.3.1_Run.e3279777_mpi256.txt (1.8 KB)
  • submit_mpas_256_Run.sh.txt (865 bytes)
  • grid.png (247.3 KB)
Could you try cleaning and rebuilding MPAS with DEBUG=true? Hopefully the log files will then have more debugging info for us to look at.
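
Something like the following should do it (shown with the intel-mpi build target as an example; use whichever target you normally build with):
Code:
# discard the optimized build, then rebuild with debug flags
make clean CORE=atmosphere
make intel-mpi CORE=atmosphere DEBUG=true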

Thanks!
 
Yes, will do that next.

Though I did find another possibility:

I am using the "scale_region.py" tool from the Meshes & Mesh Utilities page of the MPAS-Atmosphere documentation. I scaled the 60-3 km regional mesh to a 20-1 km mesh. (Note: I applied the tool to grid.nc, not static.nc.) I just ran the static and IC steps again with the unscaled 60-3 km regional mesh and found no missing points in the center (the 1 km region in the scaled case) extending up through the vertical levels, as I did in the 20-1 km case.
==> It seems there is some issue with the scaling.
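
One quick way to confirm this is to dump a static field such as the terrain height (ter) from each static file and count lines containing NaN values; a rough sketch, with placeholder file names:
Code:
# ncdump prints NaN floats as "NaNf"
ncdump -v ter static_20-1km.nc | grep -c -i nanf
ncdump -v ter static_60-3km.nc | grep -c -i nanf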
 

Attachments

  • 20-1km_GRID.png (247.3 KB)
  • 20-1km_grid_NCVIS.png (210.5 KB)
  • 60-3km_GRID.png (255.9 KB)
  • 60-3km_NCVIS.png (143.9 KB)
Here are the output logs from the compile with DEBUG=true.

Couldn't see any obvious issues.
 

Attachments

  • init_atmosphere_compile.log (98.2 KB)
  • atmosphere_compile.log (451.2 KB)
Please send me the path to the directory where you ran the failed case (sorry, I logged out of Derecho and lost the path you told me yesterday).

Thanks.
 
Wei helped me find a fix that works.

In the failed version, I used:
Code:
config_supersample_factor = 3
config_lu_supersample_factor = 1
config_30s_supersample_factor = 1

In the working version, I used:
Code:
config_supersample_factor = 9
config_lu_supersample_factor = 3
config_30s_supersample_factor = 3

It seems that, on the scaled mesh, some cells received no static data values at all unless a higher supersampling factor was used, which left NaN values in those cells (presumably because ~1 km cells are comparable in size to the 30-arc-second source pixels, so without supersampling some cells end up with no source points).
 
Thank you for the update. We found that the ter (terrain height) value is NaN at certain cells, which leads to NaN values of PSFC; all other variables whose calculation involves PSFC are affected in turn.

With a smaller supersampling factor on an extremely high-resolution mesh, such as the 1-km mesh you used, errors may also occur in other fields like IVGTYP.

Many thanks for raising this issue, which I was not aware of before.
 