philipdumont
Dear WRF gurus,
The group I work for recently tried upgrading from WPS/WRF version 3.6.1 to 3.8.1. (Why only 3.8.1, when there are so many newer versions available? Long story. Let's not get into it here.) The builds of the two versions were identical -- same compilers, options, libraries, hardware -- only the NCAR source versions differed.
Most of the jobs we've run on 3.8.1 work fine. But there's one job -- which ran fine on 3.6.1 -- that fails on 3.8.1 in real.exe with error message "real: error opening wrfinput for writing"
Early in my debugging, it occurred to me to wonder what would happen if I took the output of WPS3.8.1 for the job, and ran real.exe version 3.6.1 on it. First of all, if real.exe3.6.1 ran successfully on the WPS3.8.1 output, it would tend to indicate that the problem was *not* with WPS3.8.1 generating bad output that real tripped on, but rather that the problem really was in real.exe3.8.1. Also, if real.exe3.6.1 could run on the same data that real.exe3.8.1 failed on, it might be instructive to compare the two runs to see where/why/when the 3.8.1 version failed.
Well, real.exe3.6.1 did run successfully on the WPS3.8.1 output.
The next thing I did to debug was run both versions of real.exe via the Linux strace(1) command to compare system call results. And they were identical, right up to where 3.8.1 did a write of the error message it produced. There was no indication, with respect to system call results, as to *why* the error message was printed. In particular, despite the wording of the error message, the open(2) system call that opened file "wrfinput_d01" succeeded, as did the few writes to the file afterwards. This would seem to indicate that, whatever the problem is, it is *not* something OS/syscall related.
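For anyone wanting to repeat the strace comparison, here is a minimal sketch of how two traces can be lined up to find the first call that differs. It assumes each trace was saved to a text file (for example with strace's `-o` option); the normalization rules are my own guesses at what differs harmlessly between runs (PIDs, addresses), not anything WRF-specific.

```python
import re
from itertools import zip_longest

# Values that differ between any two runs and would mask the real divergence.
# These patterns are assumptions about typical strace output.
_NOISE = [
    (re.compile(r"^\d+\s+"), ""),               # leading PID column
    (re.compile(r"0x[0-9a-fA-F]+"), "0xADDR"),  # pointers and mmap addresses
]

def normalize(line):
    """Strip run-specific noise from one strace line."""
    line = line.rstrip("\n")
    for pattern, repl in _NOISE:
        line = pattern.sub(repl, line)
    return line

def first_divergence(lines_a, lines_b):
    """Return (index, a_line, b_line) for the first differing call, or None."""
    for i, (a, b) in enumerate(zip_longest(lines_a, lines_b)):
        na = normalize(a) if a is not None else None
        nb = normalize(b) if b is not None else None
        if na != nb:
            return i, na, nb
    return None
```

Feeding it the lines of the two saved traces (file names are up to you) points at the first syscall where the runs part ways. Paths that legitimately differ between the two installs would need another normalization rule.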
Next, I turned up verbosity by changing the namelist.input entry for "debug_level" from 0 to 10. When I did, I got this message in the 3.8.1 output that did not show up in the 3.6.1 output: "NetCDF error: NetCDF: One or more variable sizes violate format constraints"
So I did a bit of web searching for this error message. The first page I landed on was this one: https://www.unidata.ucar.edu/support/help/MailArchives/netcdf/msg11872.html
Based on what it says about the 32-bit file offset limit, and the fact that this failing job is, I think, bigger than any of the ones that have succeeded on 3.8.1 (45 degrees of both longitude and latitude, 5 km resolution), I have a shrewd guess as to what's going on. I'm thinking that, for whatever reason (more info? higher precision info?), the 3.8.1 version of real.exe just generates more output than the 3.6.1 version for the same input. And the NetCDF library pre-computes how much space will be needed, and if it's "too much", fails before attempting to write any data (but not before opening the file). And so 3.8.1, though perhaps generating "better" output, cannot handle as big a job.
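To put rough numbers on that guess, here is a back-of-envelope sketch. Only the 45-degree extent and 5 km spacing come from this job; the kilometres-per-degree figure, the level count, and single-precision storage are assumptions for illustration, and the 2 GiB figure is simply 2^31 bytes, i.e. what a signed 32-bit offset can address.

```python
# Back-of-envelope size estimate. Apart from the domain extent and grid
# spacing, every input is an assumption, not a value from namelist.input.
KM_PER_DEGREE = 111.0    # approximate; east-west spacing shrinks with latitude
DOMAIN_DEGREES = 45.0    # from the post
GRID_SPACING_KM = 5.0    # from the post
NUM_LEVELS = 40          # hypothetical vertical level count
BYTES_PER_VALUE = 4      # assumes single-precision floats
TWO_GIB = 2**31          # range of a signed 32-bit file offset

def points_per_side(degrees, spacing_km, km_per_degree=KM_PER_DEGREE):
    """Approximate number of grid points along one side of the domain."""
    return round(degrees * km_per_degree / spacing_km)

def field_bytes(nx, ny, nz=1, bytes_per_value=BYTES_PER_VALUE):
    """Bytes needed to store one field of nx * ny * nz values."""
    return nx * ny * nz * bytes_per_value

def fields_until_limit(bytes_per_field, limit=TWO_GIB):
    """How many same-sized fields fit below the limit."""
    return limit // bytes_per_field

n = points_per_side(DOMAIN_DEGREES, GRID_SPACING_KM)
one_3d = field_bytes(n, n, NUM_LEVELS)
print(f"{n} x {n} points, {one_3d / 2**20:.0f} MiB per 3D field")
print(f"{fields_until_limit(one_3d)} such fields fit under 2 GiB")
```

Under these assumptions the domain is roughly a thousand points on a side and a little over a dozen full 3D fields already add up to 2 GiB, so it would not take many extra output variables in 3.8.1 to cross a 32-bit limit that 3.6.1 stayed just under.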
Can any of you confirm (or refute) this guess? Bonus kudos/gratitude if you can provide some sort of quantification as to relative constraints of the 3.8.1 version of real.exe compared to the 3.6.1 version. And/or how we might get around these constraints.
I'm near positive the OS/FS we are using is quite capable of "large files" (64-bit offsets). So I suppose the 32-bit limit mentioned in that link is a NetCDF limit? By the way, I did see the mention of the "special conditions" under which the 2GBytes limit could be exceeded, and tried to follow the reference, but the link pointed to no such page.
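One way to check which NetCDF variant a file actually uses is to look at its first four bytes. A small sketch follows; the magic numbers are the standard netCDF and HDF5 signatures, while the function name and labels are just illustrative. It could be pointed at the wrfinput_d01 that the 3.6.1 run produced.

```python
import sys

# Leading bytes that identify each on-disk netCDF variant.
MAGIC = {
    b"CDF\x01": "classic (32-bit offsets)",
    b"CDF\x02": "64-bit offset",
    b"CDF\x05": "CDF-5 (64-bit data)",
    b"\x89HDF": "netCDF-4 / HDF5",
}

def netcdf_kind(path):
    """Classify a file by its first four bytes; 'unknown' if unrecognized."""
    with open(path, "rb") as f:
        head = f.read(4)
    return MAGIC.get(head, "unknown")

if __name__ == "__main__":
    # Usage: python netcdf_kind.py wrfinput_d01 [more files...]
    for p in sys.argv[1:]:
        print(p, "->", netcdf_kind(p))
```

If the file reports the classic format, the 32-bit limit would be a property of the file format rather than of the OS or filesystem.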
I've attached namelist.input (same for both runs), and the 3.6.1 and 3.8.1 versions of the namelist.output and rsl.error.0000 files. (Only one real.exe process, so only one rsl file per run.)
Thanks.