
Segfaults when more PEs are used

xtian15

Member
Hi,
I am running simulations with the 3-60km mesh (x20.835586) using different numbers of PEs. Everything runs as expected when I use 1008 PEs or fewer, but the forecast is too slow with that many cores. When I try 1728, 2304, or 3456 PEs, segfaults (shown below) start to appear and get the job killed altogether. I wonder why this is happening and how to resolve it. I am using MPAS-v8.1.

Thanks!

Code:
Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

Backtrace for this error:
#0  0x7fd52cfefd8f in ???
#1  0x59a6e9 in ???
#2  0x59ab70 in ???
#3  0x59d02a in ???
#4  0x6f424b in ???
#5  0x552d54 in ???
#6  0x511876 in ???
#7  0x49dc07 in ???
#8  0x49ed90 in ???
#9  0x4067dc in ???
#10  0x405cba in ???
#11  0x7fd52cfdaeaf in ???
#12  0x7fd52cfdaf5f in ???
#13  0x405d04 in ???
#14  0xffffffffffffffff in ???
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
An MPI communication peer process has unexpectedly disconnected.  This
usually indicates a failure in the peer process (e.g., a crash or
otherwise exiting without calling MPI_FINALIZE first).

Although this local MPI process will likely now behave unpredictably
(it may even hang or crash), the root cause of this problem is the
failure of the peer -- that is what you need to investigate.  For
example, there may be a core file that you can examine.  More
generally: such peer hangups are frequently caused by application bugs
or other external events.

  Local host: **-01-09
  Local PID:  968029
  Peer host:  **-01-12
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 148 with PID 1060249 on node **-01-12 exited on
 
In some of the cases with more than 1008 PEs, I was also given the following errors, which never appear when using <= 1008 PEs:
Code:
Flerchinger USEd in NEW version. Iterations= 10
Flerchinger USEd in NEW version. Iterations= 10
Flerchinger USEd in NEW version. Iterations= 10
Flerchinger USEd in NEW version. Iterations= 10
Flerchinger USEd in NEW version. Iterations= 10
Flerchinger USEd in NEW version. Iterations= 10
Flerchinger USEd in NEW version. Iterations= 10
Flerchinger USEd in NEW version. Iterations= 10
2 total processes killed (some possibly by mpirun during cleanup)
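In case it helps with debugging, this is how I would try to make the backtrace readable; it assumes the executable is atmosphere_model and can be rebuilt with debug symbols (the gfortran target is just an example):
Code:
# rebuild with debug symbols so backtrace addresses resolve to source lines
# (DEBUG=true is the stock MPAS debug build option)
make clean CORE=atmosphere
make gfortran CORE=atmosphere DEBUG=true

# translate a backtrace address into a file/line, e.g. frame #1 above
addr2line -e atmosphere_model 0x59a6e9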
 
Have you looked at your results when you run the case with 1008 processors? Do all variables have reasonable values?

I ask this because the message "Flerchinger USEd in NEW version. Iterations= 10" indicates something wrong in the physics. If this is the same case, I would expect the run with 1008 processors to fail as well.
 
Yes, everything in the 1008-PE run looks reasonable, and the run goes all the way to the end. Strangely, the segfault and/or Flerchinger errors only show up when more cores are used.
 
Please upload your namelist.atmosphere, streams.atmosphere, and initial data for me to take a look.
It will be helpful if you can also upload the graph.info files for 1008 and 3456 PEs.

I have never seen such a case before. I hope we can reproduce your case and figure out what is wrong.
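For context, I am assuming the partition files for each PE count were generated from the same graph.info with METIS, along these lines:
Code:
# one partition file per PE count; gpmetis writes graph.info.part.<N>
gpmetis graph.info 1008
gpmetis graph.info 3456
If any of the higher-count partition files were produced from a graph.info that does not match the mesh in your init file, I would expect exactly this kind of crash, so that is worth double-checking as well.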
 
Hi,

Thank you for uploading the files. I have downloaded all of them.

However, it seems that your data "init.conus.2023080100.nc" is either wrong or damaged: I cannot read the file, and MPAS cannot run with it. By issuing the command:

ncdump -h init.conus.2023080100.nc

I got the following error message:
Code:
ncdump: init.conus.2023080100.nc: NetCDF: Attempt to use feature that was not turned on when netCDF was built.
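To help rule out a transfer problem, could you compare what both machines report for the file kind and a checksum? The commands below are only a suggestion:
Code:
# report the on-disk netCDF format variant (classic, 64-bit offset, cdf5, netCDF-4, ...)
ncdump -k init.conus.2023080100.nc

# the first bytes also identify the format (CDF-5 files start with "C D F \005")
od -An -c -N4 init.conus.2023080100.nc

# compare checksums on the sending and receiving machines
md5sum init.conus.2023080100.nc
The error above is what I would expect if, for example, the file is in CDF-5 (64-bit data) format and my local netCDF library was built without CDF-5 support, so the reported kind may already tell us something.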
 
That's really odd. I downloaded the nc file from the shared folder myself, and it looks perfectly fine on my end. I have sent the file via sendbig to your email address again; let's see if it works this time.
 
I still cannot read the data you uploaded.
However, I ran a few tests over the 60-3km mesh using different numbers of processors, and all of them work just fine. This makes me believe that the MPAS code itself is working correctly.

When you ran your case with the 60-3km mesh, did you relocate the refinement to a different region? If so, did you produce a static.nc data file for the new 60-3km mesh after the relocation?
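If the refinement was relocated, my understanding is that the static fields must be re-interpolated onto the new mesh before the init file is created. A minimal sketch of the static step in namelist.init_atmosphere (assuming the standard init_atmosphere_model workflow) would look something like:
Code:
&nhyd_model
    config_init_case = 7
/
&preproc_stages
    config_static_interp = true       ! interpolate terrain, land use, soil categories, ...
    config_native_gwd_static = true   ! static gravity-wave-drag fields
    config_vertical_grid = false      ! off for the static-only step
    config_met_interp = false
    config_input_sst = false
    config_frac_seaice = false
/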
 