Segmentation fault/corrupted backtrace when running MPAS-A with more than 144 CPUs

kniezgoda

New member
Hello, I am running MPAS-atmosphere in a Docker container and am encountering the dreaded stack corruption/null pointer/segmentation fault errors and corrupted backtraces when running with more than approximately 144 CPUs. For context, here is the hardware and software environment in which I am running MPAS-A:

Hardware:
I tested this on two different systems:
1) A machine with 192 AMD EPYC CPUs
2) A machine with 192 Intel Xeon Gold CPUs

Software:
Inside the Docker container, the operating system is Red Hat Enterprise Linux 9.4.
MPAS-atmosphere: 8.2.0
gfortran: 11.5.0
HDF5: 1.14.4-3
NETCDF-C: 4.9.2
NETCDF-F: 4.6.1
PNETCDF: 1.13.0
MPI: openmpi 5.0.3
I/O layer: SMIOL

For this post, I will show results from three example runs of a simulation on the 60-km global mesh (x1.163842), each with a different outcome, to demonstrate the problem I am encountering.

Example 1: Successful run with 96 CPUs
I execute the atmosphere_model at the command line with mpirun rather than with a job scheduler. For example, to run with 96 CPUs I enter `mpirun -n 96 atmosphere_model`. In this case, the model executes without error, as evidenced by the wrap-up messaging at the end of the log.atmosphere.0000.out:
-----------------------------------------
Total log messages printed:
Output messages = 654
Warning messages = 3
Error messages = 0
Critical error messages = 0
-----------------------------------------

Example 2: Failed run with 128 CPUs, fixed by changing &io section of namelist.atmosphere
If I try to run with 128 CPUs, the model fails with something that looks like a stack corruption error after about 10 seconds. Here is a snippet of what is printed to the terminal:

#16 0xa93c4c in mgetput
at /mpas/libs/pnetcdf/pnetcdf-1.13.0/src/drivers/ncmpio/ncmpio_wait.c:2477
#17 0xa93c4c in req_aggregation
at /mpas/libs/pnetcdf/pnetcdf-1.13.0/src/drivers/ncmpio/ncmpio_wait.c:1741
#18 0xa93c4c in wait_getput
at /mpas/libs/pnetcdf/pnetcdf-1.13.0/src/drivers/ncmpio/ncmpio_wait.c:2228
#19 0xa951b2 in req_commit
at /mpas/libs/pnetcdf/pnetcdf-1.13.0/src/drivers/ncmpio/ncmpio_wait.c:980
#20 0xa951b2 in ncmpio_wait
at /mpas/libs/pnetcdf/pnetcdf-1.13.0/src/drivers/ncmpio/ncmpio_wait.c:1134
#21 0xa08595 in ???
#22 0x968bfe in ???
#23 0x9de612 in ???
#24 0x9850ae in ???
#25 0x988e16 in ???
#26 0x4c5ee2 in ???
#27 0x406a3c in ???
#28 0x405f0a in ???
#29 0x798ce73025cf in ???
#30 0x798ce730267f in ???
#31 0x405f54 in ???
#32 0xffffffffffffffff in ???

This is repeated many times (I'm guessing 128 times, once for each CPU?). In addition, no log files are generated (log.atmosphere.0000.out or .err) and no data files are created. However, if I change the &io section of the namelist.atmosphere to be:

&io
    config_pio_num_iotasks = 1,
    config_pio_stride = 128,
/

Then the model runs successfully! I also tried other combinations of iotasks and stride whose product is 128 (e.g. 2 and 64, or 4 and 32); they all work. However...

Example 3: Failed run with 144 CPUs, cannot be fixed by changing the &io namelist section
Running the model with 144 CPUs produces a similar (but more cryptic) corrupted backtrace terminal output:

#14 0xa088b6 in ???
#15 0x9ffd0b in ???
#16 0x976e33 in ???
#17 0x9dcb22 in ???
#18 0x9f950f in ???
#19 0x980412 in ???
#20 0x9873e2 in ???
#21 0x988279 in ???
#22 0x4c956f in ???
#23 0x4075e1 in ???
#24 0x405f00 in ???
#25 0x7c67999a25cf in ???
#26 0x7c67999a267f in ???
#27 0x405f54 in ???
#28 0xffffffffffffffff in ???

Unfortunately, modifying the &io settings as in Example 2 does not resolve this. With the &io change in place (I tried 1 and 144, and 2 and 72), the model runs approximately 10 seconds longer before crashing; it generates a log.atmosphere.0000.out (but not a .err log file), and it writes the file from the "output" stream with one timestep in it (but not the file from the "diagnostics" stream). The log for this run is attached as log.atmosphere.0000.out_144fail.txt. I have also tried 192 CPUs with the same result. (Note: the 144-CPU partition file for this mesh is not included in the base distribution, so I created it myself with gpmetis.)
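For reference, the partition file was generated with something along these lines (a sketch from memory rather than my exact command):
Bash:
# Sketch: generate a 144-way partition of the 60-km mesh with METIS' gpmetis.
# The graph file ships with the mesh; gpmetis writes x1.163842.graph.info.part.144,
# which the model then finds through the decomposition file prefix in the namelist.
gpmetis x1.163842.graph.info 144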

Final thoughts
1. At first, I suspected the system was running out of memory, but this does not appear to be the case. I monitored system memory usage with `top` after launching the mpirun command, and there was plenty of available memory at the time of failure in every case. Furthermore, both systems have ample memory for a simulation with 163,842 grid cells (approximately 750 GiB on the AMD machine and 500 GiB on the Intel machine), and I am the only user on these machines.

2. I tried changing the PnetCDF version because the stack corruption error from Example 2 suggests a potential issue with that library. I tried PnetCDF 1.14.0 and 1.8.1, but the same behavior appeared.

3. Likewise, I tried switching the MPI library to MPICH 4.3, and I tried using PIO 2.6.2 rather than the SMIOL layer, but neither change had an effect.

4. `ulimit` returns "unlimited".

5. Apologies for cross-pollinating the MPAS forums with WRF material, but for what it's worth I am seeing essentially the same behavior when running the real.exe program for WRF simulations.

I will also attach the namelist.atmosphere I use for all of these runs; in the attached file I have left the &io section at the default values (0 and 1). Please let me know if any other files would be helpful. I had to attach my files with a .txt suffix because the forum wouldn't let me upload them under their original file names.

Thanks,
Kyle Niezgoda
 

Attachments

  • namelist.atmosphere.txt (2 KB)
  • log.atmosphere.0000.out_144fail.txt (19.7 KB)
My top suggestion would be rebuilding with something like make gfortran CORE=atmosphere DEBUG=true to get more information. You could then also examine any core dumps generated by that run with gdb, though this may be of limited use in the presence of stack corruption.
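A rough sketch of that workflow (the core-file name is just a placeholder; exact names vary by system):
Bash:
# Rebuild from scratch with debugging symbols
make clean CORE=atmosphere
make gfortran CORE=atmosphere DEBUG=true
# If the crash leaves a core file (e.g. core.12345), load it together with the
# executable and print the backtrace with 'bt' inside gdb
gdb ./atmosphere_model core.12345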

If you could attach the stdout and stderr output from any run as a file, I may be able to make a better guess from whatever stack trace is given. It would also help if you attached your streams.atmosphere file, though I'd guess you used pretty much the default values.

Some ideas/questions:
  • What signal (or signals) are you getting near the stack traces?
  • Do you have enough disk space left on these machines?
  • What happens if you turn off all model output? I.e. set output_interval="none" for all entries in streams.atmosphere. This would help confirm whether writing output really is the issue.
    • Further experiments with different io_type or precision could be helpful.
  • How are the MPI and other libraries installed in Docker? Are the MPI libraries well configured for the systems you are running on?
  • Have you tested other MPI programs with this container on the systems? Do you only get errors with PnetCDF programs?
  • What are all the limits on your machine? On newer Linux distributions (like RHEL 9.4) you can get a full list with prlimit, or with ulimit -a. There may be limits killing your MPI program other than the file-size limit that a bare ulimit reports.

E.g., the limits on an NCAR supercomputer:
Bash:
prlimit
RESOURCE   DESCRIPTION                             SOFT      HARD UNITS
AS         address space limit                unlimited unlimited bytes
CORE       max core file size                 unlimited unlimited bytes
CPU        CPU time                           unlimited unlimited seconds
DATA       max data size                      unlimited unlimited bytes
FSIZE      max file size                      unlimited unlimited bytes
LOCKS      max number of file locks held      unlimited unlimited locks
MEMLOCK    max locked-in-memory address space unlimited unlimited bytes
MSGQUEUE   max bytes in POSIX mqueues            819200    819200 bytes
NICE       max nice prio allowed to raise             0         0
NOFILE     max number of open files               16384     16384 files
NPROC      max number of processes              2060868   2060868 processes
RSS        max resident set size              unlimited unlimited bytes
RTPRIO     max real-time priority                     0         0
RTTIME     timeout for real-time tasks        unlimited unlimited microsecs
SIGPENDING max number of pending signals        2060868   2060868 signals
STACK      max stack size                     307200000 unlimited bytes

---

Aside: Getting repeats of the stack trace is expected; I think it is one per MPI rank (CPU). Sometimes these can be interleaved and are simply too much information. With OpenMPI, you can send stdout and stderr to rank-specific files by adding --output-filename mpasa_out --merge-stderr-to-stdout to your mpirun command. This would create files mpasa_out.000 through mpasa_out.143 for your 144-rank case.
  • Alternatively, you could prefix the output lines with the rank that produced them: use OpenMPI's mpirun with --tag-output, or set MPIEXEC_PREFIX_DEFAULT (to any value) with MPICH and mpiexec. You can then use grep or other tools on the saved output to get a clearer picture.
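A minimal sketch of the tagging approach (the exact prefix format depends on the OpenMPI version, so adjust the grep pattern to whatever appears in your output):
Bash:
# Tag every line with its MPI rank, keep everything in one file, then filter per rank
mpirun --tag-output -n 144 atmosphere_model > all_ranks.log 2>&1
grep '\[1,0\]' all_ranks.log    # e.g. only the lines from rank 0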
 
Thank you for the thorough response. This has been very helpful.

I re-compiled MPAS with debug turned on, and have attached the captured output of the crash. Following your advice, I ran
mpirun --output-filename mpasa_out --merge-stderr-to-stdout -n 192 atmosphere_model
which produced files named like mpasa_out.prterun-ada74bd51455-563498@1.000.out, as you suggested. I have attached that file in this post. This should capture stdout and stderr for a single process, so hopefully this satisfies what you asked for.

Regarding the streams.atmosphere file, you are correct that I use the default file, but I have included it in this post as well.

The crash does not produce core dumps. Could I still utilize gdb despite this? I do not have experience using gdb, but it is installed in my container.

The earliest frame in the backtrace in the mpasa_out file points to src/framework/mpas_dmpar.F, line 2837 (MPAS-Model/src/framework/mpas_dmpar.F at release-v8.2.0 · MPAS-Dev/MPAS-Model). I had a look through that file, but I have to admit I'm not entirely sure what to do with this knowledge. It feels like progress, though.


As for the rest of your questions and suggestions:

What signal (or signals) are you getting near the stack traces?
They are SIGBUS signals.

Do you have enough disk space left on these machines?
Yes, there are several hundred GB free. Additionally, I am able to run the model with 96 CPUs, and as far as I know the number of processors should not meaningfully affect the disk size of the model output.

What happens if you turn off all model output? I.e. set output_interval="none"for all entries in streams.atmosphere. This would really just ensure that writing output is the issue.
Great suggestion, but this did not solve the problem.
In some cases when the output is turned on, the model writes a single timestep to the output file before it crashes with SIGBUS. The data in that timestep looks real (i.e. it is not all NaNs or totally unphysical), which suggests to me that writing output is not the issue.

Further experiments with different io_type or precision could be helpful.
I know how to change the precision when you compile MPAS, but how would one go about changing the io_type? Do you mean we should use PIO2 instead of SMIOL?

Side note: I do not add a precision flag when I compile MPAS. E.g. I run:
make gfortran CORE=atmosphere DEBUG=true
which means the compilation should default to double precision according to the user's guide. However, based on the header information in my log.atmosphere.0000.out (attached to my first post), the model is actually built in single precision. When I try PRECISION=double the build fails, so I imagine the make process is smart enough to catch that case and fall back to single precision. Credit to the MPAS engineers.
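For completeness, a build line that makes the precision explicit would look roughly like this (my understanding is that PRECISION is an accepted make variable; treat this as a sketch rather than a tested command):
Bash:
# Explicitly request single precision alongside the debug build
make clean CORE=atmosphere
make gfortran CORE=atmosphere DEBUG=true PRECISION=single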

How are the MPI and other libraries installed in Docker? Are the MPI libraries well configured for the systems you are running on?
For MPI, we have the openmpi 5.0.3 tarball downloaded locally on our machines. We extract it in Docker and then run the standard install commands:
./configure --prefix=/usr/local/; make -j8 all; make install
Other libraries use the same installation process. Here are some configuration flags we have for the other libraries:
  • hdf5: ./configure --enable-fortran --enable-cxx
  • netcdf-c: ./configure --disable-byterange
  • pnetcdf: ./configure --with-mpi
Otherwise, libraries do not include any configuration flags besides the prefix/install location.

For openmpi, I have run the suite of tests included in the version we use, and they all pass. I have also tried building MPAS and openmpi with OpenMP support, but was unable to run the model at all after making that change. Beyond this, I am not a system administrator by any means, so I’m not entirely confident in my ability to diagnose whether OpenMPI is optimally configured for our system.
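That said, here are the kinds of quick sanity checks I can run inside the container if it would help (the grep pattern is just a guess at what is useful to look for):
Bash:
# Confirm which OpenMPI the container picks up and which transport components were built,
# and check that many ranks launch and exit cleanly without involving MPAS at all
ompi_info --version
ompi_info | grep -i "MCA btl"
mpirun -n 144 hostname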

Have you tested other MPI programs with this container on the systems? Do you only get errors with PnetCDF programs?
We have run WRF with the same Docker container (and therefore the same environment, with identical libraries, dependencies, etc.). We see similar issues in the parallelized components of WRF (e.g. we can run real.exe with 96 MPI tasks but not with 144 or more). We have not tested simpler MPI programs, but I have run the test suite included with the openmpi 5.0.3 distribution, and it all passes.

Following your suggestion, I ran make ptests in the pnetcdf/examples/F90 folder and everything passed. By default this suite uses at most 8 MPI tasks, so I modified the Makefile to include a test with 192 MPI tasks; all tests passed in that case as well.

What are all the limits on your machine?
I compared the results of prlimit on our machines and got the same results as the NCAR computer, with one exception:
Our machine: CORE max core file size 0 unlimited bytes
NCAR machine: CORE max core file size unlimited unlimited bytes
Is this a meaningful difference?

Thanks again for taking the time to address my post.

Cheers,
Kyle
 

Attachments

  • streams.atmosphere.txt (1.4 KB)
  • mpasa_out.prterun-ada74bd51455-563498@1.000.out.txt (1.1 KB)
Thank you for the files and info you sent back! One thing to check is whether you are getting exactly the same error each time: is it always the same signal and stack trace when you run with the same parameters? If not, this could require some very deep debugging. If it is the same error, that's good news!

What signal (or signals) are you getting near the stack traces?
They are SIGBUS signals.
Darn, this could still mean many things. SIGBUS occurs by definition when trying to access truly inaccessible memory, whereas SIGSEGV is the more common signal and is about accessing memory the process isn't allowed to. Nowadays SIGBUS usually means a misaligned memory access (apparently not always, though; I haven't run into SIGBUS before).

Could you provide more details about how you are running the job from within the container? Also maybe your docker-compose.yml file? I've seen some suggestions that SIGBUS can occur with memory issues in the container, especially low swap space. I don't think this should happen by default with Docker...
  • If you are setting memory limits for your container, or just want to experiment, you can control swap with docker run: setting --memory-swap equal to --memory disables swap for the container, while setting it somewhat larger (e.g. 20g above the memory limit) ensures you have enough swap space. A sketch follows below.
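A minimal sketch, with a placeholder image name and made-up limits:
Bash:
# Swap disabled: --memory-swap equal to --memory
docker run --memory=32g --memory-swap=32g my-mpas-image mpirun -n 144 atmosphere_model
# Generous swap: --memory-swap larger than --memory (the difference is the swap allowance)
docker run --memory=32g --memory-swap=52g my-mpas-image mpirun -n 144 atmosphere_model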
---

Other replies:
The crash does not produce core dumps. Could I still utilize gdb despite this? I do not have experience using gdb, but it is installed in my container.
You can run gdb without core dumps, but that is more for "live" debugging. The ??? entries in the stack trace limit how helpful this could be: some part of the software was compiled without the debugging symbols added by the -g flag that would make the trace more understandable (the module and function names are not kept in the compiled code). Explore outside the forum for more information about gdb and debugging.
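If you do want to try live debugging, a couple of common patterns look roughly like this (rank counts and process names are placeholders):
Bash:
# Attach to one already-running rank; type 'continue' in gdb, then wait for the signal
gdb -p "$(pgrep -f atmosphere_model | head -n 1)"
# Or launch a small run with every rank under gdb in its own xterm (needs a working X display)
mpirun -n 4 xterm -e gdb --args ./atmosphere_model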

You may want to consider Valgrind or other debuggers to check the memory consistency if you continue to get SIGBUS errors.

What are all the limits on your machine?
I compared the results of prlimit on our machines and got the same results as the NCAR computer, with one exception:
Our machine: CORE max core file size 0 unlimited bytes
NCAR machine: CORE max core file size unlimited unlimited bytes
Is this a meaningful difference?
Ah! The CORE limit is why you aren't seeing the core dumps. Unless you raise the limit in your job script (or run script, shell session, etc.), that soft limit of 0 bytes means no core file will be created. You could set it to 'unlimited' or to some size you think is appropriate for the system or the run you are doing. Examples:
  • prlimit --core=unlimited: will set the limit for the whole session or the scope of whatever script executes it (the shell builtin ulimit -c unlimited does the same for the current shell). Including the trailing colon ensures you only set the soft limit, just in case you have root access.
  • prlimit --core=4096 mpirun -n 4 ./hello_world will set the limit only for that single command.
Further experiments with different io_type or precision could be helpful.
I know how to change the precision when you compile MPAS, but how would one go about changing the io_type? Do you mean we should use PIO2 instead of SMIOL?
You can change io_type by adding it to the entries in streams.atmosphere. Look at section 5 (esp. 5.2) of the v8.2.2 MPAS User's Guide. Though as you indicate, this may not be related to I/O at all.

Also, the default build precision of the model is single with MPAS v8. It is concerning that you can't build with PRECISION=double, but that would be something to investigate another time (and in another thread).

The disk space, PnetCDF test, and configurations all seem appropriate to me.

---

This is a fun thread. I'm learning more about Linux and MPI debugging through this :)
 