Quilting error


zxdawn

Member
Hi everyone,

I'm trying to use quilting for the I/O tasks, but I got this error after running for two days (it appears only in rsl.error.0378):

Code:
taskid: 378 hostname: cn10122
 Quilting with   2 groups of   5 I/O tasks.
 Namelist logging not found in namelist.input. Using registry defaults for varia
 bles in logging.
wrf.exe: nc4internal.c:902: nc4_type_free: Assertion `type->rc' failed.
forrtl: error (76): Abort trap signal

Other error logs related to quilting are like this:

Code:
taskid: 377 hostname: cn10122
 Quilting with   2 groups of   5 I/O tasks.
 Namelist logging not found in namelist.input. Using registry defaults for varia
 bles in logging.

The log of submitting job:

Code:
yhrun: error: cn11078: task 378: Aborted
yhrun: First task exited 60s ago
yhrun: tasks 0-377,379-383: running
yhrun: task 378: exited abnormally
yhrun: Terminating job step 9546397.0
yhrun: Job step aborted: Waiting up to 2 seconds for job step to finish.
slurmd[cn10868]: *** STEP 9546397.0 KILLED AT 2019-01-07T23:24:15 WITH SIGNAL 9 ***
slurmd[cn10868]: *** STEP 9546397.0 KILLED AT 2019-01-07T23:24:15 WITH SIGNAL 9 ***
yhrun: error: cn11078: task 372: Killed

Here's the part of quilting settings:

Code:
&domains
 nproc_x                            = 11,
 nproc_y                            = 34,
 /

&namelist_quilt
 nio_tasks_per_group                = 5,
 nio_groups                         = 2,
 /

Here's the environment (enable netcdf4):

Code:
Currently Loaded Modulefiles:
  1) intel-compilers/15.0.1           5) netcdf/4.3.3.1/01-CF-15          9) udunits/2.2.26-icc15
  2) MPI/Intel/MPICH/3.1-icc15-dyn    6) pnetcdf/1.6.1/00-icc15          10) nco/4.6.0-icc15
  3) zlib/1.2.8-icc15                 7) jasper/1.900.1/01-CF-15-libpng
  4) hdf5/1.8.13/06-CF-15             8) ncview/2.1.5

Regards,
Xin
 
Xin...

The key words are "KILL WITH SIGNAL 9". This is SIGKILL, aka kill -9.

One or more nodes of your job ran out of memory. The computer has to take immediate action, otherwise,
it can hang or crash and reboot. The OOM ("Out Of Memory") daemon process watches for this and sends
SIGKILL to the processes using the most memory.

If you have logon access to the compute nodes, you can run "dmesg -T" or "dmesg" if "-T" isn't supported
and it will show that SIGKILL was logged.
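
For example, something along these lines should surface the OOM-killer entries (a rough sketch; the exact kernel message wording varies by distribution):

Code:
# run on the compute node that hosted the killed task
dmesg -T | grep -i -E "out of memory|killed process"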

You need to run with more nodes.
 
Xin,

Did you write output files before the model crashed (after running for two days)? Is this a nesting case or a single-domain case?
I would also like to know how big the grid is, how many domains you have, and how many processors you are using. Are you writing output to any other auxiliary streams? I also want to take a look at the rsl.error.0000 file.
Please send me your namelist.input.
 
kwthomas said:
Xin...

The key words are "KILL WITH SIGNAL 9". This is SIGKILL, aka kill -9.

One or more nodes of your job ran out of memory. The computer has to take immediate action, otherwise,
it can hang or crash and reboot. The OOM ("Out Of Memory") daemon process watches for this and sends
SIGKILL to the processes using the most memory.

If you have logon access to the compute nodes, you can run "dmesg -T" or "dmesg" if "-T" isn't supported
and it will show that SIGKILL was logged.

You need to run with more nodes.

Thanks, Kevin! I'll let the administrator check that.
 
Ming Chen said:
Xin,

Did you write output files before the model crashed (after running for two days)? Is this a nesting case or a single-domain case?
I would also like to know how big the grid is, how many domains you have, and how many processors you are using. Are you writing output to any other auxiliary streams? I also want to take a look at the rsl.error.0000 file.
Please send me your namelist.input.

Ming Chen,

Yes, output files were written, and it's a single-domain case.
Here's the size of wrfout* files:
Code:
......
....
1.1G Jan  7 23:04 wrfout_d01_2014-05-19_12:00:00
1.1G Jan  7 23:08 wrfout_d01_2014-05-19_12:30:00
1.1G Jan  7 23:11 wrfout_d01_2014-05-19_13:00:00
1.1G Jan  7 23:15 wrfout_d01_2014-05-19_13:30:00
0 Jan  7 23:16 wrfout_d01_2014-05-19_14:00:00
0 Jan  7 23:20 wrfout_d01_2014-05-19_14:30:00
0 Jan  7 23:23 wrfout_d01_2014-05-19_15:00:00


You can find more detail in the namelist.input file and rsl.error.0000 in the attached logs.tar.gz.


I'm using 384 processors: 11 * 34 = 374 for the simulation and 5 * 2 = 10 for the I/O tasks.

Code:
&domains
 nproc_x                            = 11,
 nproc_y                            = 34,
 /

&namelist_quilt
 nio_tasks_per_group                = 5,
 nio_groups                         = 2,
 /
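
For reference, here is how those numbers add up to the total job size, with a minimal sketch of the launch command (an assumption on my part that yhrun accepts srun-style options, as the job log above suggests):

Code:
# compute tasks: nproc_x * nproc_y                = 11 * 34 = 374
# I/O tasks:     nio_tasks_per_group * nio_groups =  5 *  2 =  10
# total MPI tasks                                 = 374 + 10 = 384  (tasks 0-383 in the job log)
yhrun -n 384 ./wrf.exe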
 
Xin,

I looked at your files. I have a few concerns about this case:

(1) The number of processors is way too large for the grid size (i.e., 384 processors for a 430 x 270 grid).

(2) I am not sure why you want to run with the quilt option. Again, I ask this because I don't think it is necessary to turn on this option for this case.

(3) If for some reason you do want to run with the quilt option, can you try the settings:

Code:
&namelist_quilt
 nio_tasks_per_group                = 10,
 nio_groups                         = 1,
 /

My personal experience with the quilt option is that, for a single domain case, usually nio_groups = 1 works better.

I have no explanation for why the I/O was successful for the first two days and then failed. I will talk to our expert and get back to you if we figure out the reason.
 
Ming Chen,

Thank you for your help!

1) How do I determine a good number of processors for a given grid size?
I'm always confused by this.

2) Because the I/O tasks take a lot of time if I don't use the quilt option. You can read my blog post for details: https://dreambooker.site/2018/12/02/Speedup-WRF-Chem/

3) I'm restarting the simulation now; I'll try nio_groups = 1 if it happens again.

Regards,
Xin
 
Xin,

Please see my answers below:

1) How do I determine a good number of processors for a given grid size?
I'm always confused by this.

There is no strict rule. Personally, I prefer to have each processor handle a patch of about 25-30 grid points in each direction. For example, if each processor handles 25 x 25 grid points, then (e_we / 25) * (e_sn / 25) is roughly the number of processors you should use.
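
As a worked example (a rough estimate, applying that rule of thumb to the 430 x 270 grid discussed above):

Code:
(e_we / 25) * (e_sn / 25) = (430 / 25) * (270 / 25)
                          ≈ 17 * 11
                          ≈ 187 processors

That is roughly half of the 11 * 34 = 374 compute processors used in the failing run.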

2) Because the I/O tasks take a lot of time if I don't use the quilt option. You can read my blog post for details: https://dreambooker.site/2018/12/02/Speedup-WRF-Chem/

I am not sure whether this is related to chemistry. My test case with the same grid size as yours indicates that I/O is pretty fast.

3) I'm restarting the simulation now; I'll try nio_groups = 1 if it happens again.
Please try and let me know if you still have the same problem.
 
Ming...

My experience with netCDF I/O in WRF is that it is horribly slow, even with optimal Lustre striping.

For a 3 km CONUS run, it takes about 2 minutes to write an hourly history file. This is on STAMPEDE2 (TACC)
and BRIDGES (PSC). When you are doing 60-hour forecasts, that adds up to a *long* time, not to mention
the wasted SUs.

I used to do splitfile runs, as CAPS software can handle them. When the HWT (NOAA Hazardous Weather Testbed) started using UPP for postprocessing, that went out the window, as UPP can only handle joined files.

That's when I started using PNETCDF and quilting; I/O was very quick and ran while the forecast
continued.
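
For readers who want to try the same combination, a minimal sketch of the relevant namelist entries (assuming your WRF build was compiled with pnetcdf support; io_form = 11 selects pnetcdf output, while 2 is classic netCDF):

Code:
&time_control
 io_form_history                    = 11,
 /

&namelist_quilt
 nio_tasks_per_group                = 5,
 nio_groups                         = 2,
 /

The best nio_tasks_per_group and nio_groups values are machine- and case-dependent, so treat these as placeholders rather than recommendations.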

This past spring, I was doing 6-minute forecast I/O for 12 of the 60 hours, via CAPS code mods. This required
me to set "nio_groups" to 2. Contrary to the documentation in the code, this works fine. I double-checked
the file logic, and no problems were detected during operations.

The slow I/O is due to the way netCDF works: the larger the file, the more (and larger) disk seeks occur.
Small files are quick; large files aren't.

HDF has always had the same problems for the same reasons.
 
Kevin,

Thank you for the kind information. PNETCDF and I/O quilting are good options for people who have trouble with slow I/O. We actually don't have the same trouble here at NCAR, thanks to the great supercomputer we have.

As for the quilting option, honestly I don't have much experience with it. Very often I just run a small case with different combinations of nio_tasks_per_group and nio_groups and see which options work well. I am not sure whether these options are machine-dependent and case-dependent.

Our software engineers are working to incorporate PIO into WRF. I hope this new feature will be available soon to address the I/O issue.

Thanks again.

Ming
 