
Model Integration hangs using 1024 cores

This post was from a previous version of the WRF & MPAS-A Support Forum. New replies have been disabled. If you have follow-up questions related to this post, please start a new thread from the forum home page.

makinde

New member
Good day
Appreciation goes to everyone on this platform who has made it a duty to help tackle errors and shed light on shadowed areas of model simulation with MPAS.

I am having an issue with running MPAS model integration using a larger number of cores (e.g., 1024). I have previously and successfully run MPAS_Atmos for both stretched-grid and uniform-resolution simulations. Whenever I try to run MPAS at any resolution with more than 512 cores (e.g., 1024), the model hangs with no output or error. If I run the same resolution with a maximum of 512 cores, it works fine.
I have tested this on 60km_uniform, 60-15km_variable, and 60-10km_variable; it consistently stops at
Bootstrapping framework with mesh fields from input file 'x.XXXXXXX.init.nc'
even when I leave it running for up to 48 hours.

Please find attached the log.atmosphere.0000.out file.

Thanks.
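
(For context, here is a minimal sketch of how a run like this is typically prepared and launched; the graph file name and launcher below are placeholders rather than the actual ones from this case:)

    gpmetis x.XXXXXXX.graph.info 1024      # METIS writes the partition file x.XXXXXXX.graph.info.part.1024
    mpiexec -n 1024 ./atmosphere_model     # MPAS reads the .part.<N> file matching the MPI task count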
 

Attachments

  • log.atmosphere.0000.out.txt
    1.6 KB · Views: 42
The "bootstrapping" part of the model start-up involves reading in the fields from the input NetCDF file that describe the horizontal mesh, reading the mesh partition (graph.info.part.X) information, and working out the cells/edges/vertices that belong in the "halo" regions on each MPI task. My initial suspicion is that there may be an issue in reading the input NetCDF file in parallel. I think you may have mentioned this in another post, but could you remind me which versions of the netCDF, parallel-netCDF, and PIO libraries you're using? Also, which compiler (and version) are you using?

Do you know of other applications that run successfully on >1024 cores of the cluster, and that also perform parallel file I/O? If so, that would help to isolate the issue to MPAS.
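
(As a quick check, and only as a sketch, the on-disk format of the input file can be inspected with the ncdump utility that ships with netCDF; 'x.XXXXXXX.init.nc' below stands in for the actual init file name:)

    ncdump -k x.XXXXXXX.init.nc    # prints the file kind: classic, 64-bit offset, cdf5, or netCDF-4

Classic, 64-bit offset, and CDF-5 files are normally read through parallel-netCDF, while netCDF-4 (HDF5) files go through a different I/O path, so knowing the kind helps narrow down which library is involved.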
 
mgduda said:
The "bootstrapping" part of the model start-up involves reading in the fields from the input NetCDF file that describe the horizontal mesh, reading the mesh partition (graph.info.part.X) information, and working out the cells/edges/vertices that belong in the "halo" regions on each MPI task. My initial suspicion is that there may be an issue in reading the input NetCDF file in parallel. I think you may have mentioned this in another post, but could you remind me which versions of the netCDF, parallel-netCDF, and PIO libraries you're using? Also, which compiler (and version) are you using?

Do you know of other applications that run successfully on >1024 cores of the cluster, and that also perform parallel file I/O? If so, that would help to isolate the issue to MPAS.

Thank you, mgduda. I can only guess at this information because I was not the one who compiled and installed MPAS on the cluster I am using. Moreover, there is more than one compiler on the cluster. Checking my PATH, I found that both the Intel and GCC compilers are referenced or loaded, but echoing the environment variable CC gives icc as the compiler. The netCDF version is 4.1.3. I don't know how to get the parallel-netCDF version.

Based on your explanation above, does that mean that during the initialization run I must make all the output files, such as the static and met files, pnetcdf?

Thanks
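
(For anyone reading this later, a few commands that can often report these versions, assuming the usual helper utilities were installed alongside the libraries; exact names and availability vary from cluster to cluster:)

    nc-config --version          # netCDF C library version
    nf-config --version          # netCDF Fortran interface version, if installed separately
    pnetcdf-config --version     # parallel-netCDF (PnetCDF) version on newer installs
    pnetcdf_version              # alternative utility shipped with older PnetCDF releases
    icc --version                # Intel C compiler version (relevant here since CC points to icc)
    mpif90 -show                 # which compiler the MPI wrappers invoke (MPICH/Intel MPI; Open MPI uses --showme)

The PIO version is harder to query from the command line; it is usually recorded in the PIO installation itself (for example, in its include files).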
 