Limited-area simulation crashes when using multiple processors

This post is from a previous version of the WRF & MPAS-A Support Forum. New replies have been disabled; if you have follow-up questions related to this post, please start a new thread from the forum home page.

yuanlian

New member
Hi,

I was trying to run a limited-area simulation within a circular domain (about 1800 cells), which was created from x1.10242 and partitioned with "gpmetis -minconn -contig -niter=200 graph.info N".

I was able to run the model with a single processor without any issue, but the model crashed after one time step once I used more than one processor (e.g., 2, 4, 6, ...). The error message was "CRITICAL ERROR: NaNs detected in w field". I then tried a very small time step; the model would run a few time steps before the wind fields blew up.

Has anyone seen similar issues with parallel runs of limited-area simulations? Any thoughts on what could lead to the numerical instabilities (halo exchange, perhaps)?

Thanks,
Yuan
 
Generally, the model should give bit-identical results for one MPI task compared with N (e.g., 2, 4, 8, etc.) MPI tasks, for both global and regional simulations. (This ignores some optimization issues, and in some cases getting bit-identical results requires disabling some compiler optimizations.)

Does the model run correctly with multiple MPI tasks for a global simulation on the x1.10242 mesh, using the same atmospheric fields that are used to initialize your regional simulation?

If a global simulation on the x1.10242 mesh works in parallel, you could try comparing the initial history files from your regional simulations using 1 MPI task and using 2 MPI tasks; if the initial history fields don't match, that might suggest that something is being corrupted in the reading of the initial state.
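In case it's useful, below is a minimal sketch (not an official MPAS utility) of that comparison: it reads one field from two initial history files and reports the largest absolute difference. The file names history_1task.nc and history_2task.nc and the field name 'theta' are only placeholders, the field is assumed to have exactly three dimensions, and the program requires the netCDF Fortran library.

    program compare_history_field
       use netcdf
       implicit none
       ! Placeholder file and field names; substitute your own history files
       character(len=*), parameter :: file1 = 'history_1task.nc'
       character(len=*), parameter :: file2 = 'history_2task.nc'
       character(len=*), parameter :: fieldname = 'theta'
       real, allocatable :: a(:,:,:), b(:,:,:)

       call read_field(file1, fieldname, a)
       call read_field(file2, fieldname, b)

       if (any(shape(a) /= shape(b))) then
          write(*,*) 'Field shapes differ between the two files'
       else
          write(*,*) 'max |diff| for '//fieldname//':', maxval(abs(a - b))
       end if

    contains

       subroutine read_field(filename, varname, field)
          character(len=*), intent(in) :: filename, varname
          real, allocatable, intent(out) :: field(:,:,:)
          integer :: ncid, varid, dimids(3), dimlens(3), i

          call check( nf90_open(filename, NF90_NOWRITE, ncid) )
          call check( nf90_inq_varid(ncid, varname, varid) )
          ! Assumes the field has exactly three dimensions,
          ! e.g. (nVertLevels, nCells, Time) in Fortran order
          call check( nf90_inquire_variable(ncid, varid, dimids=dimids) )
          do i = 1, 3
             call check( nf90_inquire_dimension(ncid, dimids(i), len=dimlens(i)) )
          end do
          allocate(field(dimlens(1), dimlens(2), dimlens(3)))
          call check( nf90_get_var(ncid, varid, field) )
          call check( nf90_close(ncid) )
       end subroutine read_field

       subroutine check(status)
          integer, intent(in) :: status
          if (status /= nf90_noerr) then
             write(*,*) trim(nf90_strerror(status))
             stop 1
          end if
       end subroutine check

    end program compare_history_field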
 
Hi Michael,

The global simulations with the same grid x1.10242 work fine with any number of processors. Thanks for the tips, I will compare the initial history files and report back.

Best,
Yuan
 
Hi Michael,

I checked variables such as terrain, temperature, and wind in the initial history files from one MPI task and from multiple MPI tasks; they were all identical. I also switched off optimization during compilation, but it made no difference. Do you have other thoughts on debugging the issue?

Thanks much,
Yuan
 
The issue has become a bit more complicated. I tested the Intel compiler + MPICH (previously gfortran + OpenMPI), and now the model crashes with a single MPI task as well. I will debug further to see where the problem is.
 
I certainly wouldn't claim that the MPAS-Atmosphere code is completely bug-free, but the released code is rather well tested. Given the curious nature of the issues you're seeing, I'm inclined to suspect that there might be some issues with the software environment on your system (whether it's the compilers, MPI library, I/O libraries, etc.). If I'm able to think of any other suggestions to try, I'll follow up here. Otherwise, if you are able to get limited-area simulations running in parallel, it would be great if you could post here with any solutions that you've found.
 
Since I mentioned software environment issues in my previous post, I suppose it's worth asking which versions of the GNU and Intel compilers you've been using, and which versions of the I/O libraries (NetCDF, Parallel-NetCDF, PIO) you have.

Also, when you mentioned that you tried compiling without optimizations, did you manually remove the -O3 flag in the top-level Makefile? Or did you compile the code with DEBUG=true? If you haven't already done so, trying the latter might be worthwhile.
 
Hi Michael,

I tried both debug mode and removing the -O3 flag, and they behaved the same.

The MPAS version I am using is 7.0. The systems I have tried are:
MacOS: gfortran 11.2.0 + OpenMPI 4.1.2 + netCDF 4.8.1 + PnetCDF 1.12.2 + PIO 2.5.2
Pleiades: Intel compilers 2020.0.166 with the Intel MPI library + mpi-hpe/mpt.2.25 + netCDF 4.4.1 + PnetCDF 1.8.1 + PIO 1.7.1

The global model works on both systems without any issues. The limited-area model, on the other hand, behaves differently on the two systems:
MacOS: model runs with single MPI task but crashes with two or more MPI tasks
Pleiades: model crashes regardless of number of MPI tasks

I also turned off all physics and got the same result. I have read that the differences between gfortran and Intel Fortran are mostly in the treatment of automatic arrays and allocatable arrays. Since others don't have the same issue I am facing, it is likely related to some modifications I made to the dynamical core.

I will keep debugging and report back.

Thanks,
Yuan
 
Just a follow-up: the issue has been resolved :)

The issue came from the additional tracers I added to the model. In several places in mpas_atm_time_integration.F (i.e., before atm_bdy_adjust_scalars and atm_bdy_set_scalars), scalars_driving is allocated but not initialized right after allocation; instead, its values are provided by the function mpas_atm_get_bdy_state. When I added many extra tracers, I did not use mpas_atm_get_bdy_state to set all of the scalar fields, which caused memory access issues when looping over all scalar fields in atm_bdy_adjust_scalars and atm_bdy_set_scalars. The issue was gone once I initialized scalars_driving to zero right after it was allocated.
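For anyone who runs into something similar, here is a minimal, self-contained sketch of the pitfall and the fix (this is not the actual MPAS source; the array name is reused only for illustration, and the dimension names and sizes are made up): zero the allocatable array right after allocation, so that any tracer slots not later filled from boundary data still hold defined values.

    ! Illustrative only: an allocatable array that is only partially filled
    ! after allocation. Without the zero-initialization, the untouched entries
    ! hold whatever happens to be in memory, which can appear as NaNs or
    ! crashes that vary with compiler and number of MPI tasks.
    program scalars_driving_init_sketch
       implicit none
       integer, parameter :: num_scalars = 5, nVertLevels = 3, nCells = 4
       real, allocatable :: scalars_driving(:,:,:)
       integer :: iScalar

       allocate(scalars_driving(num_scalars, nVertLevels, nCells))

       ! The fix described above: initialize the whole array right after allocation
       scalars_driving(:,:,:) = 0.0

       ! Only some of the scalars are subsequently filled (a stand-in for the
       ! values that mpas_atm_get_bdy_state provides for boundary-driven tracers)
       do iScalar = 1, 2
          scalars_driving(iScalar,:,:) = 1.0
       end do

       write(*,*) 'min/max of scalars_driving:', minval(scalars_driving), maxval(scalars_driving)
    end program scalars_driving_init_sketch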

The following plots show a passive tracer field after six hours of simulation in a polar cap (60 degrees north). All look good now.

[Attached plots: qinit.png, q6hrs.png]

EDIT: both GNU Fortran and Intel Fortran produced identical results.
 