Hello, I am running MPAS-Atmosphere in a Docker container and am encountering the dreaded stack corruption/null pointer/segmentation fault errors and corrupted backtraces when running with approximately 144 CPUs or more. For context, here is the hardware and software environment in which I am running MPAS-A:
Hardware:
I tested this on two different systems:
1) A machine with 192 AMD EPYC CPUs
2) A machine with 192 Intel Xeon Gold CPUs
Software:
Inside the Docker container, the operating system is Red Hat Enterprise Linux 9.4.
MPAS-Atmosphere: 8.2.0
gfortran: 11.5.0
HDF5: 1.14.4-3
NetCDF-C: 4.9.2
NetCDF-Fortran: 4.6.1
PnetCDF: 1.13.0
MPI: Open MPI 5.0.3
I/O layer: SMIOL
For this post, I will show three examples of running a simulation on the 60-km global mesh (x1.163842), each with a different outcome, to demonstrate the problem I am encountering.
Example 1: Successful run with 96 CPUs
I execute the atmosphere_model at the command line with mpirun rather than with a job scheduler. For example, to run with 96 CPUs I enter `mpirun -n 96 atmosphere_model`. In this case, the model executes without error, as evidenced by the wrap-up messaging at the end of the log.atmosphere.0000.out:
-----------------------------------------
Total log messages printed:
Output messages = 654
Warning messages = 3
Error messages = 0
Critical error messages = 0
-----------------------------------------
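For reference, here is roughly how I launch and check each run (the run directory path below is just a placeholder):

cd /path/to/run_directory        # placeholder; contains namelist.atmosphere, the streams files, and the mesh/partition files
mpirun -n 96 atmosphere_model
tail log.atmosphere.0000.out     # wrap-up block shown above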
Example 2: Failed run with 128 CPUs, fixed by changing the &io section of namelist.atmosphere
If I try to run with 128 CPUs, the model fails with something that looks like a stack corruption error after about 10 seconds. Here is a snippet of what is printed to the terminal:
#16 0xa93c4c in mgetput
at /mpas/libs/pnetcdf/pnetcdf-1.13.0/src/drivers/ncmpio/ncmpio_wait.c:2477
#17 0xa93c4c in req_aggregation
at /mpas/libs/pnetcdf/pnetcdf-1.13.0/src/drivers/ncmpio/ncmpio_wait.c:1741
#18 0xa93c4c in wait_getput
at /mpas/libs/pnetcdf/pnetcdf-1.13.0/src/drivers/ncmpio/ncmpio_wait.c:2228
#19 0xa951b2 in req_commit
at /mpas/libs/pnetcdf/pnetcdf-1.13.0/src/drivers/ncmpio/ncmpio_wait.c:980
#20 0xa951b2 in ncmpio_wait
at /mpas/libs/pnetcdf/pnetcdf-1.13.0/src/drivers/ncmpio/ncmpio_wait.c:1134
#21 0xa08595 in ???
#22 0x968bfe in ???
#23 0x9de612 in ???
#24 0x9850ae in ???
#25 0x988e16 in ???
#26 0x4c5ee2 in ???
#27 0x406a3c in ???
#28 0x405f0a in ???
#29 0x798ce73025cf in ???
#30 0x798ce730267f in ???
#31 0x405f54 in ???
#32 0xffffffffffffffff in ???
This is repeated many times (I'm guessing 128 times, once for each CPU?). In addition, no log files are generated (log.atmosphere.0000.out or .err) and no data files are created. However, if I change the &io section of the namelist.atmosphere to be:
&io
config_pio_num_iotasks = 1,
config_pio_stride = 128,
/
Then the model runs successfully! I also tried other combinations whose product is 128 (e.g. 2 and 64, or 4 and 32); they all work. However...
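For completeness, one of the other working combinations looked like this (the specific pairing doesn't seem to matter, as long as the product equals the number of MPI tasks):

&io
config_pio_num_iotasks = 4,
config_pio_stride = 32,
/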
Example 3: Failed run with 144 CPUs, not fixed by changing the &io section of namelist.atmosphere
Running the model with 144 CPUs produces a similar (but even more cryptic) corrupted backtrace in the terminal:
#14 0xa088b6 in ???
#15 0x9ffd0b in ???
#16 0x976e33 in ???
#17 0x9dcb22 in ???
#18 0x9f950f in ???
#19 0x980412 in ???
#20 0x9873e2 in ???
#21 0x988279 in ???
#22 0x4c956f in ???
#23 0x4075e1 in ???
#24 0x405f00 in ???
#25 0x7c67999a25cf in ???
#26 0x7c67999a267f in ???
#27 0x405f54 in ???
#28 0xffffffffffffffff in ???
Unfortunately, modifying the &io settings as in Example 2 does not resolve this. With the &io change applied (I tried 1 and 144, and 2 and 72), the model runs approximately 10 seconds longer before crashing; it generates a log.atmosphere.0000.out (but no .err log file), and it writes the file from the "output" stream with 1 timestep in it (but not the file from the "diagnostics" stream). The log for this case is attached as log.atmosphere.0000.out_144fail.txt. I have also tried 192 CPUs with the same result. (Note: I created the 144-task partition file for this mesh myself using gpmetis, since it is not included in the base distribution; the command is sketched below.)
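The gpmetis invocation was along these lines (the extra METIS flags are illustrative; the essential inputs are the mesh's graph.info file and the number of partitions):

gpmetis -minconn -contig -niter=200 x1.163842.graph.info 144
# writes the partition file x1.163842.graph.info.part.144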
Final thoughts
1. At first, I suspected the system was running out of memory, but this does not appear to be the case. I monitored system memory usage with `top` after launching the mpirun command, and there was sufficient available memory at the time of failure in every case. Furthermore, both systems have plenty of memory for a 163,842-cell simulation (approximately 750 GiB on the AMD machine and 500 GiB on the Intel machine), and I am the only user on these machines.
2. I tried changing the PnetCDF version, because the backtrace from Example 2 points into that library. I tried PnetCDF 1.14.0 and 1.8.1, but the behavior was the same.
3. Likewise, I tried switching the MPI implementation to MPICH 4.3, and I tried using PIO 2.6.2 rather than the SMIOL layer; neither change had any effect.
4. `ulimit` returns "unlimited" (a sketch of the limit checks is included after this list).
5. Apologies for cross-pollinating the MPAS forums with WRF material, but for what it's worth I am seeing essentially the same behavior when running the real.exe program for WRF simulations.
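For reference, a sketch of the limit checks from item 4 (exact commands are illustrative; a bare `ulimit` in bash reports only the file-size limit, so the stack limit is queried separately):

ulimit -a    # all resource limits inside the container
ulimit -s    # stack size limit specifically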
I will also attach the namelist.atmosphere I am using for all of these runs; in it, I have left the &io section at the default values (0 and 1). Please let me know if any other files would be helpful. I had to attach everything with a .txt suffix because the forum would not let me upload the files with their original names.
Thanks,
Kyle Niezgoda