
metgrid crashes: get_min(): No items left in the heap

This post was from a previous version of the WRF&MPAS-A Support Forum. New replies have been disabled and if you have follow up questions related to this post, then please start a new thread from the forum home page.

bartbrashers

I can run ungrib, ghrsst-to-intermediate (writing SST:* files a bit bigger than my WRF domain), and avg_tsfc, just fine.

When I get to the metgrid stage, it gets through processing d01 (for a 5.5-day run I have 23 files), then writes 15 met_em.d02.* files, then stops with this message in both metgrid.out and metgrid.log:

ERROR: get_min(): No items left in the heap.

I'm not finding many clues using Google.

I'm using the 2019 version of PGI-CE:

# pgf90 -V

pgf90 19.10-0 64-bit target on x86-64 Linux -tp piledriver
PGI Compilers and Tools
Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.

I've attached my configure.wrf and configure.wps.

The compute nodes I'm running on have MemTotal: 65932648 kB according to /proc/meminfo. I was running 8 copies (8 consecutive 5.5-day periods) at the time, but the failure happens at the same stage regardless of how many other copies of metgrid are running at once.

I had accidentally set $OPAL_PREFIX to the PGI-supplied version, not the openmpi-3.1.2 version, when I compiled. But that shouldn't matter for metgrid, right? That's only a run-time thing.

Any ideas?
 

Attachments

  • configure.wps
    3.4 KB · Views: 77
  • configure.wrf
    20.3 KB · Views: 71
Could this be related to this oddity? wrf.exe and real.exe use my version of zlib, but the WPS components do not.

Code:
% ldd WRF-4.1.3/run/real.exe | grep libz
        libz.so.1 => /usr/local/src/wrf/LIBS/lib/libz.so.1 (0x00007f8e5af17000)

% ldd WRF-4.1.3/run/wrf.exe | grep libz
        libz.so.1 => /usr/local/src/wrf/LIBS/lib/libz.so.1 (0x00007f71f446f000)

% ldd WPS-4.1/ungrib.exe | grep libz
        libz.so.1 => /usr/lib/gcc/x86_64-redhat-linux/4.8.5/../../../../lib64/libz.so.1 (0x00007ff927317000)

% ldd WPS-4.1/metgrid.exe | grep libz
        libz.so.1 => /usr/lib/gcc/x86_64-redhat-linux/4.8.5/../../../../lib64/libz.so.1 (0x00007f4da9217000)

I told WPS to use my version, in both LDFLAGS and COMPRESSION_LIBS:

Code:
% grep -C1 -i -e wrf -e lz WPS-4.1/configure.wps

WRF_DIR = ../WRF-4.1.3

WRF_INCLUDE     =       -I$(WRF_DIR)/external/io_netcdf \
                        -I$(WRF_DIR)/external/io_grib_share \
                        -I$(WRF_DIR)/external/io_grib1 \
                        -I$(WRF_DIR)/external/io_int \
                        -I$(WRF_DIR)/inc \
                        -I$(NETCDF)/include

WRF_LIB         =       -L$(WRF_DIR)/external/io_grib1 -lio_grib1 \
                        -L$(WRF_DIR)/external/io_grib_share -lio_grib_share \
                        -L$(WRF_DIR)/external/io_int -lwrfio_int \
                        -L$(WRF_DIR)/external/io_netcdf -lwrfio_nf \
                        -L$(NETCDF)/lib -lnetcdff -lnetcdf
--
#
COMPRESSION_LIBS    = -L/usr/local/src/wrf/LIBS/lib -ljasper -lpng -lz
COMPRESSION_INC     = -I/usr/local/src/wrf/LIBS/include
FDEFS               = -DUSE_JPEG2000 -DUSE_PNG
--
FNGFLAGS            = $(FFLAGS)
LDFLAGS             = -L/usr/local/src/wrf/LIBS/lib -lz -ljasper
CFLAGS              = -O -tp=istanbul

Do I have a problem in my configure.wps?
 
I don't think it is a compiling issue.
I suppose you built WPS in MPI mode, please let me know if I am wrong. In this case,

(1) can you run with a single processor, e.g.,
mpirun -np 1 metgrid.exe, and see whether you have the same problem?

(2) You can also try to run metgrid from the time when it failed, and see whether it can successfully process the data.
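
For suggestion (2), the restart time can be worked out from the number of met_em files already written. The following is a minimal sketch, assuming a 6-hourly interval_seconds (consistent with 23 files over 5.5 days) and assuming the files already written are complete; the dates and the helper names metgrid_times and restart_date are hypothetical.

```python
from datetime import datetime, timedelta

WPS_FMT = "%Y-%m-%d_%H:%M:%S"  # date format used by start_date/end_date in namelist.wps


def metgrid_times(start, end, interval_seconds):
    """Return every time metgrid would process, inclusive of both ends."""
    t0 = datetime.strptime(start, WPS_FMT)
    t1 = datetime.strptime(end, WPS_FMT)
    step = timedelta(seconds=interval_seconds)
    times = []
    while t0 <= t1:
        times.append(t0)
        t0 += step
    return times


def restart_date(start, end, interval_seconds, n_written):
    """Return the start_date to use when resuming after n_written complete files."""
    times = metgrid_times(start, end, interval_seconds)
    if n_written >= len(times):
        return None  # nothing left to process
    return times[n_written].strftime(WPS_FMT)


# Hypothetical 5.5-day period at 6-hourly intervals (dates are made up).
start, end = "2019-01-01_00:00:00", "2019-01-06_12:00:00"
print(len(metgrid_times(start, end, 21600)))   # 23
print(restart_date(start, end, 21600, 15))     # 2019-01-04_18:00:00
```

The returned string would go into start_date for the affected domain in namelist.wps before re-running metgrid.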
 
Because the d01 processing completed, re-starting the d02 processing at the point of failure (3/4 of the way through) completes OK, and writes all 4 domains' met_em* files. If I start at the beginning and run until right before the crash, all 4 domains' met_em* files are created.

But I have a few hundred of these runs to do, and breaking them up like that would not be ideal.

I ran a test WPS run of a single 5.5-day period, on an AMD compute node with 132 GB of RAM (crash was on an Intel node with 64 GB of RAM). The AMD run completed without this error.

Ten 5.5-day WPS runs (consecutive 5.5-day periods) all running on the same AMD node work just fine.

A single WPS run on the Intel node crashed with the same message, as did a single WPS run on a different Intel node.

I was watching via 'top' as the runs on the Intel nodes progressed, and have verified it was not memory swapping - did not run out of RAM.

I used the same OS image to deploy to all compute nodes (using OpenHPC).

Conclusion: this is architecture-dependent. I have been using a compiler switch (-tp=istanbul) to maintain support for a few very old compute nodes. The Intel nodes are -tp=sandybridge.

Any suggestions for compiler switches, or how to dig into this further, would be greatly appreciated.
 
One more detail:

I was using my new ghrsst-to-intermediate program to use MUR SST in these runs. If I turn that off and just use the ERA5 SST field, this error does not happen, even on the Intel nodes.

https://github.com/bbrashers/WPS-ghrsst-to-intermediate
https://github.com/bbrashers/WPS-interp-intermediate

See related posts:
https://forum.mmm.ucar.edu/phpBB3/viewtopic.php?f=65&t=9050
https://forum.mmm.ucar.edu/phpBB3/viewtopic.php?f=31&t=9032

Could this problem be fixed by manipulating METGRID.TBL? I have one LANDSEA mask in the ERA5 ungribbed FILE:* files, and a different LANDSEA mask (with the same name) in the SST:* files.
 
Based on the test that did NOT have this problem on AMD nodes, I ran 77 WPS runs on AMD nodes overnight, with no more than 6 runs per node.

Every single run HAD THE SAME CRASH.

Any thoughts on where I should look?
 
Now that WRF-4.2 and WPS-4.2 are released, could we pick this thread up again?

Summary:

Writing an SST field followed by a LANDSEA mask to each SST:YYYY-MM-DD_HH Intermediate file causes metgrid to crash with the error "get_min(): No items left in the heap".

Writing just the SST field to each Intermediate file does not cause the error.

This thread https://forum.mmm.ucar.edu/phpBB3/viewtopic.php?f=31&t=9032 says it's best to include the LANDSEA mask in each Intermediate file.

I just completed an annual simulation using MUR SST with no LANDSEA mask in the Intermediate files. Performance was good, and I can see no weirdness near the shorelines in the 1.33km domain.

Thanks!
 
If your SST data is spatially continuous, e.g., there is "SST" (equivalent to "SKINTEMP") over the land, then it is OK not to include LANDSEA in your SST files. In this case, I suppose "SKINTEMP" near the coast might be used in interpolation to produce SST at points near the coast. This certainly is not that accurate, but at least the data still look reasonable.
If your SST data only have values over the ocean, then omitting the landsea mask will introduce problems in metgrid. This is why I suggest that you include the landsea mask in your SST file.
I don't know yet why metgrid crashed if you include landsea mask in your SST file. Can you send me your intermediate file, SST file that includes landsea mask, and namelist.wps to take a look? I would like to repeat your problem first, which might give me some idea what is wrong.
 
Hi Bart.

I also have some similar issues on how to create intermediate files of the MUR SST.
Another researcher and I wrote a Fortran code to read the MUR SST NetCDF files and write the intermediate files (SST:YYYY-MM-DD_HH). First, we didn't put any information in the code to write the SST mask. I thought that "missing_value" in METGRID.TBL was able to isolate the SST and, in this way, an appropriate metgrid interpolation could be done. However, after some detailed analysis, I noticed that there were some unrealistic SST gradients very close to the coastline. These SST gradients were created because, close to the coastline, the real program joined the MUR SST and the SST from the global atmospheric model (GFS in my case). I provided more information on the subject here
https://forum.mmm.ucar.edu/phpBB3/viewtopic.php?f=34&t=5588&p=10811&hilit=sst+landmask#p10811.
In some periods, when the MUR SST field was almost the same as the GFS SST field, the interpolated SST field looked OK near the coast. However, when the MUR SST field was very different from the GFS SST field, the interpolated SST field presented an unrealistic gradient near the coast. In the study region that I am interested in, this occurs due to coastal upwelling, which is better represented in the MUR SST and is not represented in the GFS SST (https://link.springer.com/article/10.1007/s00703-018-0622-5).
After noticing this, I realized that the use of an SST mask was necessary and I followed the suggestions presented here: https://www2.mmm.ucar.edu/wrf/users/FAQ_files/FAQ_wps_input_data.html.
So, I adapted the Fortran code to write the SST mask separately from the SST, into MASK:YYYY-MM-DD_HH files. Then, I used the mask in the simulation by modifying the METGRID.TBL and the namelist.wps. As the SST mask field is time-invariant, I supplied it through "constants_name" in the namelist.wps, and in the METGRID.TBL I added the lines masked=land and interp_mask=MASK(2). I used the value 2 because the SST MASK has values equal to 2 over the land surface.
I haven't touched it for a while and, as you can see here https://forum.mmm.ucar.edu/phpBB3/viewtopic.php?f=34&t=5588&p=10811&hilit=sst+landmask#p10811, I haven't yet solved the issue of properly interpolating the MUR SST to the WRF model. I probably didn't modify METGRID.TBL or the namelist.wps correctly, or perhaps the masks were not created correctly.
I hope this information can be useful. And now, with the information you provided, I would like to try some modification to the procedure I was using to see if we solve it.

Ian
 
Hi Ian,

Does the problem boil down to having TWO LANDSEA masks, one for the MUR SST dataset and a second one for the rest of the global data (e.g. ERA5 or GFS)?

If you give only the higher-resolution MUR SST LANDSEA as an invariant file in the "constants_name" in namelist.wps, does that work?

Bart
 
Hi Bart.

Thanks for sharing your code (ghrsst-to-intermediate.f90)! I've used this code to convert a MUR SST file and it was easy to use. I will explain the tests I did.

- I successfully ran metgrid using the files FILE:YYYY-MM-DD_HH and SST:YYYY-MM-DD_HH at the same time-frequency, without the LANDSEA mask.

- I successfully ran metgrid using the files FILE:YYYY-MM-DD_HH with a 3-hour temporal resolution and a constant SST:YYYY-MM-DD_HH. In this case, I didn't include the LANDSEA mask. For this case, I considered
fg_name = 'FILE',
constants_name = 'SST:YYYY-MM-DD_HH',

- I ran metgrid without success using the files FILE:YYYY-MM-DD_HH and SST:YYYY-MM-DD_HH at the same time-frequency, with the LANDSEA mask. When the domain d03 was being processed, metgrid stopped with the same error you got: ERROR: get_min(): No items left in the heap.

After getting these results, my idea was to give the higher-resolution MUR SST LANDSEA as an invariant file in the "constants_name" in namelist.wps. According to the information I found at Q5 (https://www2.mmm.ucar.edu/wrf/users/FAQ_files/FAQ_wps_input_data.html), I thought it was possible to do this. Then, I did the following tests:

- I ran metgrid without success using the files FILE:YYYY-MM-DD_HH with a 3-hour temporal resolution and a constant SST:YYYY-MM-DD_HH including the LANDSEA mask in the same file. For this case I considered
fg_name = 'FILE',
constants_name = 'SST:YYYY-MM-DD_HH',

- I ran metgrid without success using the files FILE:YYYY-MM-DD_HH with a 3-hour temporal resolution and a constant SST:YYYY-MM-DD_HH and LANDSEA:YYYY-MM-DD_HH. In this case, the LANDSEA mask wasn't in the same file as the SST. For this, I considered
fg_name = 'FILE',
constants_name = 'SST:YYYY-MM-DD_HH', 'LANDSEA:YYYY-MM-DD_HH',

In the last two tests, I received the following error message: "ERROR: Cannot combine time-independent data with time-dependent data for field LANDSEA.mask". I still don't know how to avoid this error.

Ian
 
Hi Ian,

When using a constants_name file, it should be constant - contain ONLY the land-sea mask. My program can generate a file like that for you, for example:

Code:
# ghrsst-to-intermediate -l /path/to/20190101090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc

That will create a file named LANDSEA:YYYY-MM-DD_HH, without any SST fields in it. Try using that file for the constants_name file.

My latest idea is to avoid a possible name-collision, and rename (in the ghrsst-to-intermediate.f90 code) LANDSEA to SST_MASK, then refer to that in METGRID.TBL, something like this:

Code:
========================================
name=SKINTEMP
mpas_name=skintemp
        interp_option=sixteen_pt+four_pt+wt_average_4pt+wt_average_16pt+search
        masked=both
        interp_land_mask  = SST_MASK(1)
        interp_water_mask = SST_MASK(0)
        fill_missing=0.
========================================
name=SST
        interp_option=sixteen_pt+four_pt
        fill_missing=0.
        missing_value=-1.E30
        interp_mask=SST_MASK(1)
        flag_in_output=FLAG_SST
========================================

Perhaps METGRID is getting confused by finding two different fields named LANDSEA, one from UNGRIB and one from ghrsst-to-intermediate.

I have confirmed that using the --nolandsea flag runs. I have not found any really bad-looking SST gradients near the coast, but I have not looked very closely.

Bart

Code:
ghrsst-to-intermediate --help
 Usage:  ghrsst-to-intermediate [Options] file.nc
 Options:
     -g geo_em.d01.nc        Output sub-grid that covers this domain.
     -b Slat,Nlat,Wlon,Elon  Specify sub-grid manually (no spaces).
     -s | --sst              Output SST:YYYY-MM-DD_HH files.
     -i | --seaice           Output SEAICE:YYYY-MM-DD_HH files.
     -l | --landsea          Output LANDSEA:YYYY-MM-DD_HH files.
     -n | --nolandsea        Don't append LANDSEA mask to SST ICE files.
     -d | --debug            Print debug messages to the screen.
     -V | --version          Print version number and exit.
     -h | --help             Show this help message.
 Required:
     file.nc                GHRSST file downloaded from the JPL PODAAC.

 Converts GDS version 2 GHRSST datasets from JPL PODAAC (in netCDF format)
 to WPS intermediate format. See
 https://podaac.jpl.nasa.gov/datasetlist?ids=ProcessingLevel&values=*4*&search=GHRSST&view=list

 Example:
 ghrsst-to-intermediate -sst -l -g ../ERA5/geo_em.d01.nc
 2014/20140101090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc
 
Hi Bart,

Thanks for the information!
In the second test I did, I considered the SST to be constant. I did this to try to control all the possibilities and understand the errors. In addition, it's a common practice to consider some fields constant in short simulations, such as the SST
(slide 40, https://www2.mmm.ucar.edu/wrf/users/tutorial/202001/duda_wps_general.pdf).
In the last test, I considered what you suggested. I converted the MUR SST file and created a file named LANDSEA:YYYY-MM-DD_HH, without any SST fields in it. To do it, I used the -l option. However, I received the following error message: "ERROR: Cannot combine time-independent data with time-dependent data for field LANDSEA.mask".

The idea of avoiding a possible name-collision sounds good!

Ian
 
I can confirm that this METGRID crash - get_min(): No items left in the heap - does not occur if the land-sea mask in SST WPS Intermediate files is not named the same as in the main output from UNGRIB. That is, if you are writing your own Intermediate files, don't name the land-sea mask LANDSEA. I named it SST_MASK, and made the changes to METGRID.TBL as in my previous post, and avoided this crash.

It makes sense that LANDSEA is "visible" globally in METGRID. I had assumed when reading in an SST Intermediate file it would be a local variable.
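
To check whether two Intermediate files carry a field with the same name, the field names can be read straight from the file headers. Below is a minimal sketch, assuming the version-5 WPS Intermediate layout (big-endian Fortran sequential records, with the 9-character FIELD name at byte offset 60 of the header record, after HDATE, XFCST and MAP_SOURCE). The helpers list_fields and shared_fields are hypothetical names, not part of WPS.

```python
import struct


def _read_record(fh):
    """Read one Fortran sequential record (big-endian 4-byte length markers)."""
    head = fh.read(4)
    if len(head) < 4:
        return None  # end of file
    (nbytes,) = struct.unpack(">i", head)
    data = fh.read(nbytes)
    fh.read(4)  # trailing length marker
    return data


def list_fields(path):
    """Return the FIELD name of every slab in a version-5 Intermediate file."""
    names = []
    with open(path, "rb") as fh:
        while True:
            rec = _read_record(fh)
            if rec is None:
                break
            (version,) = struct.unpack(">i", rec[:4])
            if version != 5:
                raise ValueError(f"unsupported Intermediate format version {version}")
            header = _read_record(fh)
            # Assumed header layout: HDATE(24) XFCST(4) MAP_SOURCE(32) FIELD(9) ...
            names.append(header[60:69].decode("ascii").strip())
            for _ in range(3):  # skip projection record, wind flag, data slab
                _read_record(fh)
    return names


def shared_fields(path_a, path_b):
    """Return the sorted field names that appear in both files."""
    return sorted(set(list_fields(path_a)) & set(list_fields(path_b)))
```

For example, shared_fields("FILE:2019-01-01_00", "SST:2019-01-01_00") (hypothetical file names) would be expected to include LANDSEA for the kind of files that triggered the crash, and no mask name once the SST mask is renamed to SST_MASK.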
 
Hi Bart,
I've changed my METGRID.TBL according to your suggestion:
Code:
name=SST
        interp_option=sixteen_pt+four_pt
        fill_missing=0.
        missing_value=-1.E30
        interp_mask=SST_MASK(1)
        flag_in_output=FLAG_SST
Code:
name=SKINTEMP
mpas_name=skintemp
        interp_option=sixteen_pt+four_pt+wt_average_4pt+wt_average_16pt+search
        masked=both
        interp_land_mask  = SST_MASK(1)
        interp_water_mask = SST_MASK(0)
        fill_missing=0.
I also renamed LANDSEA to SST_MASK in the ghrsst-to-intermediate.f90 program and recompiled it.
However, when I ran the metgrid program, it reported that:
Code:
[cloud@igplogin WPS.2]$ ./metgrid.exe 
Processing domain 1 of 2
    SST_MASK:2014-07-05_00
WARNING: Entry in METGRID.TBL not found for field SST_MASK. Default options will be used for this field!
 Processing 2014-07-05_00
    FILE
Here are my namelist options:
Code:
&ungrib
 out_format = 'WPS',
 prefix     = 'FILE',
/

&metgrid
 fg_name         = 'FILE'
 constants_name  = 'SST_MASK:2014-07-05_00',
 io_form_metgrid = 2,
Please help me to solve it!
Thank you very much!
 
All that warning means is that when creating SST_MASK, it can't find a mask to apply.

I think you can safely ignore it. Do the SKINTEMP fields look OK, especially near the land-sea boundaries?
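
The warning text itself says that METGRID.TBL has no name=SST_MASK entry of its own, so default options are used for that field. A quick way to list mask fields that are referenced by other entries but have no entry themselves is sketched below. It assumes entries are separated by lines of "=" characters and that "#" starts a comment; parse_metgrid_tbl and masks_without_entry are hypothetical helper names.

```python
import re

MASK_KEYS = ("interp_mask", "interp_land_mask", "interp_water_mask")


def parse_metgrid_tbl(text):
    """Split METGRID.TBL text into a list of {key: value} entries."""
    entries, current = [], {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments (assumed '#' syntax)
        if line.startswith("====="):
            if current:
                entries.append(current)
            current = {}
        elif "=" in line:
            key, value = (part.strip() for part in line.split("=", 1))
            current[key] = value
    if current:
        entries.append(current)
    return entries


def masks_without_entry(text):
    """Return mask field names that are referenced but have no name= entry."""
    entries = parse_metgrid_tbl(text)
    names = {e["name"] for e in entries if "name" in e}
    referenced = set()
    for e in entries:
        for key in MASK_KEYS:
            if key in e:
                # Strip the "(value)" suffix, e.g. SST_MASK(1) -> SST_MASK
                referenced.add(re.sub(r"\(.*\)$", "", e[key]).strip())
    return sorted(referenced - names)


# Abbreviated version of the two entries posted earlier in this thread.
TABLE = """\
========================================
name=SKINTEMP
        masked=both
        interp_land_mask  = SST_MASK(1)
        interp_water_mask = SST_MASK(0)
========================================
name=SST
        interp_mask=SST_MASK(1)
========================================
"""
print(masks_without_entry(TABLE))  # ['SST_MASK']
```

Whether adding an explicit name=SST_MASK entry changes the interpolated fields is not established in this thread; the sketch only reports which names lack an entry.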
 
Hi Bart,
I've tried to ignore this warning and ran the real.exe program from WRF, however it failed.
Code:
 starting wrf task            0  of            1
 module_io_quilt_old.F        2931 T
-------------- FATAL CALLED ---------------
-------------- FATAL CALLED ---------------
FATAL CALLED FROM FILE:  <stdin>  LINE:    5837
FATAL CALLED FROM FILE:  <stdin>  LINE:    5837
 check comm_start, nest_pes_x, nest_pes_y settings in namelist for comm            1
 check comm_start, nest_pes_x, nest_pes_y settings in namelist for comm            1
-------------------------------------------
-------------------------------------------
In addition, I have attached my SKINTEMP map. I think there is something wrong over the land area.
Please help me to solve this problem.
 

Attachments

  • Capture.PNG
    212.8 KB · Views: 3,073