real.exe crashing on large domains with "p_top_requested < grid%p_top possible"

This post was from a previous version of the WRF & MPAS-A Support Forum. New replies have been disabled; if you have follow-up questions related to this post, please start a new thread from the forum home page.


New member
Hi there,
I've opened a thread here because my problem seems to be more general than the one discussed before.
I'm using WRF/WPS v4.0.1 (Intel compiler, MPI, netCDF-4) to run test simulations over large domains in Europe with ECMWF model data as input. I have used ECMWF data to drive WRF a lot in the past, so I'm very confident this has nothing to do with my input data.

Step 1: Set up a domain with 1000x1000 grid points
-> geogrid works
-> ungrib works
-> metgrid works
-> real works (with 80 levels and p_top_requested=5000)
-> wrf works and produced wrfout-files
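
The working Step 1 setup corresponds to &domains entries along these lines (a sketch assembled from the values quoted above, not copied from the attached namelists):

```
&domains
 e_we            = 1000,
 e_sn            = 1000,
 e_vert          = 80,
 p_top_requested = 5000,
/
```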

Step 2: Set up a domain with 2000x2000 grid points. The only values changed in namelist.wps are "e_we" and "e_sn". The ECMWF GRIB data fully covers the extended domain.
-> geogrid works
-> ungrib works
-> metgrid works
-> real FAILS with "p_top_requested < grid%p_top possible"

I tried:
- changing the number of CPUs
- a different location of the domain
- changing from io_form = 2 to io_form = 102 for metgrid, auxinput1, input and bdy (with the same number of processors while running metgrid.exe/real.exe, of course)
- switching to WRFV
but with no success.

However, I found a post on the wrf-users mailing list, dating back to 2009, where a user raised this issue, and it sounds exactly like the problem I am facing right now.

Does anybody have further information on this, or has anyone managed to run real.exe on such a huge domain?
Any hints will be greatly appreciated!
Please send me your namelist.wps and namelist.input (for both the 1000x1000 and 2000x2000 grids). It will also be helpful if you tell us which ECMWF data you are trying to ungrib.

Ming Chen
Hi Ming,
I'm using ECMWF IFS forecast data in GRIB1 format on 137 model levels at a three-hourly interval. The ECMWF GRIB data covers the area lat = 80.0° to 30.0° by 0.1°, lon = -40.0° to 50.0° by 0.1°. Do you need some specific information?
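
As a sanity check on coverage, the stated extent implies the following source-grid dimensions (my own quick arithmetic on the numbers above, not a figure from the data itself):

```python
def n_points(start, stop, step):
    """Count grid points on an inclusive, regular lat/lon axis."""
    return round(abs(stop - start) / step) + 1

# lat 80.0 to 30.0 by 0.1 deg; lon -40.0 to 50.0 by 0.1 deg
n_lat = n_points(80.0, 30.0, 0.1)   # -> 501
n_lon = n_points(-40.0, 50.0, 0.1)  # -> 901
print(n_lat, n_lon)
```

At 0.1° spacing the 901x501 source grid easily covers both target domains, consistent with the claim that the GRIB data fully covers the extended domain.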
I ungrib the data and then run calc_ecmwf_p.exe on it before running metgrid.exe.
From my perspective ungrib.exe seems to work fine, because the met_em files produced look reasonable, so I suspect the problem is associated with real.exe.

Here are my namelists for WRF and WPS for both grids:
View attachment namelists_2000x2000.tar.gz
View attachment namelists_1000x1000.tar.gz

I've also uploaded my met_em files for the 1000x1000 grid and the 2000x2000 grid, together with the namelists, to your Nextcloud as SIGSEV_20181026_met_em_1000x1000.tar and SIGSEV_20181026_met_em_2000x2000.tar.
I only get the file "SIGSEV_20181026_met_em_1000x1000.tar".
It seems that SIGSEV_20181026_met_em_2000x2000.tar is not uploaded.
Seems like the upload somehow crashed, maybe the file was too big?
I re-uploaded the met_em files for two time steps as separate archives (met_em_2000x2000_0.tar and met_em_2000x2000_1.tar), and I think it worked now.
We looked at your data. We found that TT at grid point (1930, 1959, 28) is zero, which triggered a floating-point overflow when running the REAL program.

However, the error message in the rsl file is:

-------------- FATAL CALLED ---------------
p_top_requested < grid%p_top possible from data

This is quite misleading. We actually found no problem in the pressure field.
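
A pre-check along these lines can catch such bad values before real.exe is run (a sketch, not part of the thread: the scanner itself is plain Python; reading an actual met_em file would additionally need a NetCDF reader such as the netCDF4 package, which is only hinted at in a comment):

```python
def find_zero_temps(tt, tol=1.0):
    """Scan a 3-D temperature field (nested lists, Kelvin) and return
    the (k, j, i) indices of values below `tol` K. A zero TT, like the
    one found at (1930, 1959, 28), is unphysical and can blow up
    later computations in the REAL program."""
    bad = []
    for k, slab in enumerate(tt):
        for j, row in enumerate(slab):
            for i, val in enumerate(row):
                if abs(val) < tol:
                    bad.append((k, j, i))
    return bad

# With a real met_em file one would load TT via e.g.
# netCDF4.Dataset("met_em.d01...nc").variables["TT"] (assumed usage).
# Here, a tiny synthetic field with one bad value:
field = [[[280.0, 281.5], [0.0, 279.0]]]
print(find_zero_temps(field))  # -> [(0, 1, 0)]
```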

Ming Chen
We noticed some memory issues related to the large grid settings in this case. We will look into it further and get back to you as soon as possible. It seems that the original met_em data are correct, but somehow the model messed up the reading and produced wrong values.

Ming Chen
We have looked through the entire REAL program but haven't figured out yet what is wrong. We will keep looking into it and it may take time. Thank you for your patience.
Have you tried to build a "debug" version with subscript checking? There is always the
possibility that an "alloc" failed without checking the return code.

Another possibility is the compiler. You said Intel. I've had major problems with Intel 17.x and
Intel 18.x, especially on STAMPEDE2 (TACC). Try building with lower level of optimization.
I know for one WRF release, I had to run "real.exe" with -O0 (no optimization).

Intel 16 apparently was the last "stable" compiler, however, it doesn't support modern
hardware families.
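
The lower-optimization suggestion above can be applied by editing the generated configure.wrf before compiling (a sketch; a stand-in FCOPTIM line is created here for illustration, since the real configure.wrf one varies by platform):

```shell
# After running ./configure (or ./configure -d for a debug build with
# subscript/bounds checking), drop the optimization level by editing
# the FCOPTIM line in configure.wrf before compiling.
# Stand-in configure.wrf line for illustration (a real one is longer):
printf 'FCOPTIM         =       -O3 -ip\n' > configure.wrf
# Replace whatever optimization flags are set with -O0:
sed -i 's/^FCOPTIM.*/FCOPTIM = -O0/' configure.wrf
grep '^FCOPTIM' configure.wrf   # prints: FCOPTIM = -O0
# then rebuild: ./compile em_real >& compile.log
```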
I have worked with our software engineer to thoroughly examine the REAL program, and we didn't find any problem that leads to such behavior. Basically we think this is a data issue. Specifically, please pay attention to the surface variables, e.g., land-use type, land mask, etc.
Unfortunately we don't have the manpower to explore various datasets and fix possible problems related to a specific dataset.