NetCDF error: NetCDF: One or more variable sizes violate format constraints

philipdumont

Dear WRF gurus,

The group I work for recently tried upgrading from WPS/WRF version 3.6.1 to 3.8.1.  (Why only 3.8.1, when there are so many newer versions available?  Long story.  Let's not get into it here.)  The builds of the two versions were identical -- same compilers, options, libraries, hardware -- only the NCAR source versions differed.

Most of the jobs we've run on 3.8.1 work fine.  But there's one job -- which ran fine on 3.6.1 -- that fails on 3.8.1 in real.exe with error message "real: error opening wrfinput for writing"

Early in my debugging, it occurred to me to wonder what would happen if I took the output of WPS3.8.1 for the job, and ran real.exe version 3.6.1 on it.  First of all, if real.exe3.6.1 ran successfully on the WPS3.8.1 output, it would tend to indicate that the problem was *not* with WPS3.8.1 generating bad output that real tripped on, but rather that the problem really was in real.exe3.8.1.  Also, if real.exe3.6.1 could run on the same data that real.exe3.8.1 failed on, it might be instructive to compare the two runs to see where/why/when the 3.8.1 version failed.

Well, real.exe3.6.1 did run successfully on the WPS3.8.1 output.

The next thing I did to debug was run both versions of real.exe via the Linux strace(1) command to compare system call results.  And they were identical, right up to where 3.8.1 did a write of the error message it produced.  There was no indication, with respect to system call results, as to *why* the error message was printed.  In particular, despite the wording of the error message, the open(2) system call that opened file "wrfinput_d01" succeeded, as did the few writes to the file afterwards.  This would seem to indicate that, whatever the problem is, it is *not* something OS/syscall related.

Next, I turned up verbosity by changing the namelist.input entry for "debug_level" from 0 to 10.  When I did, I got this message in the 3.8.1 output that did not show up in the 3.6.1 output: "NetCDF error: NetCDF: One or more variable sizes violate format constraints"

So I did a bit of web searching for this error message.  The first page I landed on was this one: https://www.unidata.ucar.edu/support/help/MailArchives/netcdf/msg11872.html

Based on what it says about the 32-bit file offset limit, and the fact that this failing job is, I think, bigger than any of the ones that have succeeded on 3.8.1 (45 degrees of both longitude and latitude, 5km resolution), I have a shrewd guess as to what's going on.  I'm thinking that, for whatever reason (more info?  higher precision info?), the 3.8.1 version of real.exe just generates more output than the 3.6.1 version for the same input.  And the NetCDF library pre-computes how much space will be needed, and if it's "too much", fails before attempting to write it (but not before attempting to open the file).  And so 3.8.1, though perhaps generating "better" output, cannot handle as big a job.
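
To put rough numbers on this guess, here is a minimal Python sketch of the kind of check I have in mind. It assumes the classic NetCDF format needs each variable's starting offset within a record to stay below roughly 2 GiB, and it ignores the file header and any non-record variables. The grid dimensions, level count and number of 3-D fields are hypothetical round numbers (about 45 degrees at 5km), not values taken from the attached namelist, and the helper names are mine.

```python
# Rough estimate only: a simplified model of classic-format offsets,
# not the NetCDF library's actual layout code.

CLASSIC_OFFSET_LIMIT = 2**31  # bytes; approximate 32-bit offset ceiling (assumption)


def record_offsets(var_sizes):
    """Return the starting byte offset of each variable within one record."""
    offsets = []
    position = 0
    for size in var_sizes:
        offsets.append(position)
        position += size
    return offsets


def fits_classic(var_sizes, limit=CLASSIC_OFFSET_LIMIT):
    """True if every variable starts below the assumed 32-bit offset limit."""
    return all(offset < limit for offset in record_offsets(var_sizes))


# Hypothetical grid: ~45 degrees at ~5 km is on the order of 1000 x 1000 points.
nx = ny = 1000
nz = 50                 # assumed number of vertical levels
bytes_per_value = 4     # single-precision float
size_3d = nx * ny * nz * bytes_per_value  # bytes per 3-D field per time record

n_3d_fields = 15        # assumed count of 3-D fields in wrfinput
sizes = [size_3d] * n_3d_fields

print("bytes per 3-D field:", size_3d)
print("fits 32-bit offsets:", fits_classic(sizes))
```

Under these assumptions no single field is anywhere near 2 GiB, but a dozen or so of them stacked in one record push the later offsets past the limit. If that is right, one or two extra 3-D fields in the 3.8.1 output could be enough to tip a job that just squeaked through on 3.6.1.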

Can any of you confirm (or refute) this guess?  Bonus kudos/gratitude if you can provide some sort of quantification as to relative constraints of the 3.8.1 version of real.exe compared to the 3.6.1 version.  And/or how we might get around these constraints.

I'm near positive the OS/FS we are using is quite capable of "large files" (64-bit offsets).  So I suppose the 32-bit limit mentioned in that link is a NetCDF limit?  By the way, I did see the mention of the "special conditions" under which the 2GBytes limit could be exceeded, and tried to follow the reference, but the link pointed to no such page.

I've attached namelist.input (same for both runs), and the 3.6.1 and 3.8.1 versions of the namelist.output and rsl.error.0000 files.  (Only one real.exe process, so only one rsl file per run.)

Thanks.
 

Attachments

  • namelist.input.txt (6.5 KB)
  • namelist.output-3.6.1.txt (80.9 KB)
  • namelist.output-3.8.1.txt (82.8 KB)
  • rsl.error.0000.3.6.1.txt (585 KB)
  • rsl.error.0000.3.8.1.txt (257.3 KB)

I talked to our expert, and we don't see any reason why wrfinput cannot be written. Have you set the environment variable for large file support? If not, please set it, recompile WRF, and let me know whether real.exe works. Another option is to build WRF with netCDF4, which allows larger variables. We have run cases with similarly large grids and did not have such problems.
 
Woohoo!

I did not know (or had forgotten) that large file handling was an environment-variable-controlled option of the WRF build. Thanks for the reminder!

I did a rebuild -- first exporting WRFIO_NCD_LARGE_FILE_SUPPORT=1. And now the job that was failing in real.exe gets through real.exe.
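
For anyone who wants to confirm that a rebuild like this changed the file format, a small Python sketch along these lines reads the first bytes of the output file. It relies on the NetCDF magic numbers: "CDF" followed by a version byte of 1 for classic, 2 for 64-bit offset and 5 for CDF-5, with netCDF-4 files starting with the HDF5 signature. The function names are mine, and I am assuming, not asserting, that a build with WRFIO_NCD_LARGE_FILE_SUPPORT=1 writes the 64-bit offset variant.

```python
# Identify the on-disk NetCDF variant from the file's leading magic bytes.

CDF_VERSIONS = {
    1: "classic (32-bit offsets)",
    2: "64-bit offset",
    5: "64-bit data (CDF-5)",
}


def netcdf_kind(header: bytes) -> str:
    """Classify a file from its first few bytes."""
    if len(header) >= 4 and header[:3] == b"CDF":
        return CDF_VERSIONS.get(header[3], "unknown CDF version")
    if header[:4] == b"\x89HDF":
        return "netCDF-4/HDF5"
    return "not recognized"


def netcdf_kind_of_file(path: str) -> str:
    """Read the first 8 bytes of a file and classify them."""
    with open(path, "rb") as f:
        return netcdf_kind(f.read(8))
```

Calling netcdf_kind_of_file("wrfinput_d01") on the output of the old and the new build should then show whether the format differs between the two.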

Thanks again!