
atmosphere_model malfunctions without error or stopping


mcauliffej12

New member
We are running into a problem when running the atmosphere_model. We are using 144 processors (4 nodes with 36 processors each), but the model seems to hang: it doesn't stop or report an error, yet it remains in the same part of the code for 2+ hours. We assume that something is wrong, and I have attached our log file (failed.log) to show where the program stalls.

I have also attached my namelist.atmosphere and streams.atmosphere.

We tried running with 48 processors total (1 node with 48 processors) and with 144 processors total (4 nodes with 36 processors each). Each node has 250 GB of memory.

Thank you!
 

Attachments

  • namelist.atmosphere.txt
  • streams.atmosphere.txt
  • failed.log

From the log file, it looks like the model may be stalling at the point where it reads initial conditions from the "input" stream.

I'm not sure whether this is the issue, but I have found in the past that the NetCDF-4 format can give really poor parallel I/O performance. It does appear from the io_type="netcdf4" specification in the definition of your "input" stream that you may have created your initial conditions in this format with the init_atmosphere_model program.
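
For reference, an "input" stream definition that selects NetCDF-4 would look something like the following in streams.atmosphere (the filename here is only a placeholder; your attribute values will differ):
Code:
<immutable_stream name="input"
                  type="input"
                  filename_template="x1.init.nc"
                  input_interval="initial_only"
                  io_type="netcdf4"/>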

I think the 24-km mesh may have few enough cells that you could create your initial conditions file in the default format (CDF-2 using the Parallel-NetCDF library), especially if you write the initial conditions in single precision (rather than double precision; see Section 3.4 of the User's Guide for details on how to compile with single-precision reals). The Parallel-NetCDF library (again, in my experience) offers much better parallel I/O performance.
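
To write the initial conditions in the default format, it should be enough to remove the io_type attribute (or set it to "pnetcdf") from the initial-conditions output stream in your streams.init_atmosphere file and re-run init_atmosphere_model, leaving the other attributes as they are in your file. A minimal sketch with a placeholder filename:
Code:
<!-- no io_type attribute: the default CDF-2 format via Parallel-NetCDF is used -->
<immutable_stream name="output"
                  type="output"
                  filename_template="x1.init.nc"
                  output_interval="initial_only"/>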

I do see in your namelist that you're using 4 I/O tasks with a stride of 18:
Code:
&io
    config_pio_num_iotasks = 4
    config_pio_stride = 18
/
If you're running on four nodes with 36 MPI tasks per node, you might try either
Code:
&io
    config_pio_num_iotasks = 8
    config_pio_stride = 18
/
or
Code:
&io
    config_pio_num_iotasks = 4
    config_pio_stride = 36
/
to hopefully improve I/O performance. (Note that the product of the number of I/O tasks and the stride is equal to 144, your total number of MPI tasks.)
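
The same rule of thumb would apply to your single-node, 48-task run; as a sketch following the same product rule, something like
Code:
&io
    config_pio_num_iotasks = 2
    config_pio_stride = 24
/
would give 2 × 24 = 48, matching the total number of MPI tasks in that case.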
 