
wrf.exe keeps running with no progress.


philipdumont

New member
We have been using WPS/WRF version 3.6 for some time.

Recently, we started tripping on a bug in nested domains when the region of interest was centered on the 180 degree longitude. We wanted a fix for this problem, heard that it is fixed in version 3.9, and so we are attempting to upgrade to WPS/WRF 3.9.latest. (We avoided going to WRF 4, expecting such a move to require a more difficult/involved integration of our software.)

To the extent possible, we are trying to use the same configs/namelists in 3.9 as we did in 3.6.

One exception to this is that we found the default value for o3input changed between 3.6 and 3.9. (I think the doc said it changed at 3.7.) Since our namelist.input file was not providing a value for o3input, we were getting a different value (a different default) in 3.9 than we had been in 3.6, which caused wrf.exe to fail immediately at launch, complaining that it could not find a file ozone.something-or-other. This was easily fixed by giving an explicit "o3input=0" in namelist.input. (0 was the default value in 3.6.)
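For illustration, a quick shell check along these lines shows whether the namelist is relying on the version default (filenames as in our setup; the entry to add is the one mentioned above):

  # Does namelist.input set o3input explicitly? If not, WRF falls back to the
  # version default, which changed after 3.6 and expects an ozone data file.
  grep -n "o3input" namelist.input \
    || echo "o3input not set -- add ' o3input = 0,' to the &physics section"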

With that change, wrf.exe 3.9 starts. And it generates some output -- one or two of the wrfout_d* files show up (out of an expected few dozen). But then output just stops: nothing is written to wrfout*, rsl*, or any other file. The wrf.exe processes are still running, still using as much CPU as they can get, but not doing anything.

When it gets to this state, I grab all of the wrf.exe processes with strace(1) (multiple -p options), and all I see is a whole lot of sched_yield(2) system calls. If I add an option to the strace command line to trace everything *but* sched_yield, I see that none of the wrf.exe processes are calling any other system calls at all.
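For concreteness, the invocations looked roughly like this (a sketch; in practice I collected the PIDs with pgrep):

  # Attach to every running wrf.exe rank at once:
  strace $(pgrep wrf.exe | sed 's/^/-p /')
  # Same, but filter out the sched_yield flood to see if anything else happens:
  strace -e trace='!sched_yield' $(pgrep wrf.exe | sed 's/^/-p /')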

Any idea what might cause this? Any idea how to go about debugging it?

Thanks.
 
Hello,

I experienced that with v3.9 in some installations. It depended on geographic location. For example, the exact same setup would randomly stall for only one area, while two identical domains set up for a different area ran fine every time.

Now, although that might suggest that something about the static data is the cause, I still think it is not. The problem actually went away after I changed some of the physics options. To be honest, I don't remember which one, but I think it was the YSU PBL scheme that was "responsible" for the stalling. I don't have proof, but it could even be a compiler problem (I was using ifort 2016 back then).

Maybe that can help you to narrow down the cause.
Ivan
 
None of the rsl.* files have any blatant error messages.

But there is something a bit weird about them.

With version 3.6, the rsl.* files had lots of lines of the form:

Timing for main (dt= NN.NN): time YYYY-MM-DD_hh:mm:ss on domain N: N.NNNNN elapsed seconds

The 3.9 rsl.* files have absolutely none of these lines, despite the fact that, as I said in my previous post, some wrfout* output was generated. These timing messages have to do with generated wrfout* output, right?
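(A quick check, for anyone wanting to confirm this on their own rsl files, is something like:

  # Count the per-step timing lines in each rsl file -- 0 everywhere for the 3.9 run:
  grep -c "Timing for main" rsl.out.* rsl.error.*
)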

Maybe it doesn't mean much. I suppose it's possible that wrf.exe wrote some such messages, but they're just stuck in memory in an I/O buffer that never got flushed to disk before I gave up and killed it.
 
All,

Here's a whole lot more information about variations on the build and results of those different builds.

Our original build of 3.6 -- the one that was working fine (except for the 180 degree longitude problem) -- was built with Intel compilers. All dependent libraries -- netcdf, hdf5, and all the rest -- were just stock libraries from installed RPMS.

On my first attempt to build 3.9, I tried to build it the same way -- same compilers, same dependent libraries, same versions of them all. But I found that I couldn't. Version 3.9 of WPS has (at least) one new source file -- metgrid/src/scan_input.F -- and that file (and maybe others) contains a "use netcdf", which requires the compiled Fortran module netcdf.mod, and the build couldn't find that file. Well, the file was installed with the netcdf-dev package; it was just in a place the build wasn't looking. But even after rearranging things so the build could find it, I was still getting a failure, because the installed netcdf.mod had been built with a GNU compiler and the Intel compiler didn't like it.
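For illustration, locating the distro's copy of the module went something like this (the exact package name is an assumption -- it varies between netcdf-dev, netcdf-devel, and netcdf-fortran-devel depending on the distro):

  # Where did the installed package put netcdf.mod?
  rpm -ql netcdf-devel 2>/dev/null | grep netcdf.mod
  # Or just search the usual prefixes:
  find /usr -name netcdf.mod 2>/dev/null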

(Side note: I found out after the fact that version 3.6 of WPS also has a source file or two with "use netcdf" in them. I assume (I haven't looked) that the binaries built from those sources failed to build. But since we don't use those binaries, I don't care -- indeed, I hadn't even noticed.)

Anyway, I certainly didn't want to use GNU-compiled WPS/WRF. Much too slow. So it looked like the only alternative was to make a home-grown, Intel-built netcdf, and use that in the WPS/WRF builds.

In order to minimize the number of changed variables, I went and got the srcrpm for the same version of netcdf that we had installed, and built it in exactly the same way that the srcrpm's SPEC file did -- except, of course, for the compilers used: Intel instead of GNU.

Armed with an intel-built netcdf, I built WPS/WRF 3.9.latest, pointing them to this home-built netcdf.
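The mechanics were the usual ones -- a sketch, with the install prefix and source directory names as placeholders (WPS/WRF pick the library up from the NETCDF environment variable; em_real assumed, since this is a real-data case):

  # Point the builds at the Intel-built netcdf and rebuild both packages.
  export NETCDF=/opt/netcdf-intel
  cd WRFV3 && ./clean -a && ./configure && ./compile em_real >& compile.log
  cd ../WPS && ./clean -a && ./configure && ./compile >& compile.log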

That's the wrf.exe that's stalling.

Now, somewhere along the way, it occurred to me to wonder whether the trouble I was having with wrf.exe 3.9 was not really a problem with 3.9, but with my home-grown netcdf.

To find out, I tried the following 2 variant builds.

First, I rebuilt 3.6 in exactly the same way I had built it before, except that I pointed it to the home-grown netcdf that I'd used with my 3.9 build. I even double-checked my work by using the linux ldd(1) command to ensure the resulting binaries were *not* loading the installed, gnu-built netcdf library. I used this build to run the same job that wrf.exe 3.9 was stalling on. The 3.6 version did not stall -- it ran to completion. This would tend to clear the home-built netcdf libraries of being the culprit.
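The ldd check was nothing fancy -- something along these lines (the path to the binary depends on where you run it from):

  # Confirm wrf.exe resolves libnetcdf from the home-built (Intel) tree,
  # not from the system RPM under /usr/lib64:
  ldd main/wrf.exe | grep -i netcdf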

Next, I rebuilt WPS/WRF 3.9 with the installed, gnu-built netcdf. Since the Intel compiler did not want to use that netcdf.mod, I had no choice but to build WPS/WRF with the GNU compilers. Not something I'd want to use in production, but for the purpose of diagnosing the current problem, a worthwhile exercise. I ran the same job through this build, and it failed in roughly (not exactly) the same place, but in a different way. It generated a large fraction (maybe 1/5) of the first wrfout_d* file, but then, instead of stalling, it crashed.

Now, I have no idea whether the two different failures of the two different 3.9 builds are slightly different manifestations of the same problem, or two completely different problems. I'm hoping the former. So I present here the error message that came out in the rsl files, in the hope that someone will have a clue what it means, and better yet, what to do about it. See attached.

(Another side note. I ran mpirun with option "-n 16", and top/ps showed me all 16 wrf.exe processes. But I only got rsl files for rank 0. What's up with that? Usually I get two rsl files for each process. But, that's with Intel mpi. I built the gnu-compiled WPS/WRF against OpenMPI, and it seems to be quite different.)
 

Attachments

  • rsl.error.0000.txt (5.3 KB)
  • rsl.out.0000.txt (5.1 KB)
WRFV3.6 has some issues for nests that cross the dateline. The bug has been fixed in WRFV3.9.
Would you please send me more information about your case running with WRFV3.9, i.e., your namelist.input, namelist.wps, and the forcing data for your case?
Thanks.
 
I made a mistake in my prior post.

I'd indicated that with the gnu-compiler-3.9 build, I'd used OpenMPI. I intended to do this because I had no idea whether or not Intel's MPI would work with gnu-compiled stuff.

Well, I intended to use OpenMPI. But I didn't edit my build script quite right, and so ended up building with the Intel MPI compiler wrappers -- mpifort, mpicc, etc. (It's worth noting that even though I was using Intel's MPI compiler wrappers, they honored the configuration's request that the gnu compilers be used, so I was indeed using the gnu compilers I intended to.) But when I went to run a job, since I thought I'd built it with OpenMPI, I ran the job with OpenMPI's mpirun. It was this mismatch of Intel/Open MPI that caused all the weirdness of that build -- the crash, the missing rsl files. When I ran that build with Intel MPI's mpirun, there was no crash -- or stall -- it ran to completion, and all the rsl files showed up. So that answered that question: yes, you can use Intel MPI with the gnu compilers (as long as you consistently use Intel MPI).
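Two quick sanity checks would have caught the mismatch -- a sketch, assuming dynamically linked MPI libraries and the usual binary path:

  # Which MPI library is the binary actually linked against?
  ldd main/wrf.exe | grep -i mpi
  # And which mpirun is first on the PATH about to launch it?
  which mpirun && mpirun --version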

So then, just for fun, I retried what I had tried before but got wrong: Gnu compilers, OpenMPI compiler wrappers (got it right this time), OpenMPI mpirun, WRF 3.9. That worked fine too.

Anyway, the fact that the gnu-compiled 3.9 builds (with either MPI) run okay, without a stall or a crash, would seem to indicate that there's nothing fundamentally wrong with our namelists etc. I think.

And then I tried one more thing: Intel compilers, OpenMPI compiler wrappers, OpenMPI mpirun, WRF 3.9. Didn't build. Intel's MPI compiler wrappers seem to be okay with wrapping the gnu compilers; OpenMPI's compiler wrappers don't do so well wrapping the Intel compilers. The Intel compilers complained that OpenMPI's compiled Fortran modules were not compiled by Intel compilers. So I gave up on that idea.

In summary, here are the builds I tried (not including the one mixed MPI goof), and their results:

3.6, Intel compilers, Intel MPI, stock netcdf: works
3.6, Intel compilers, Intel MPI, Intel-compiled netcdf: works
3.9, gnu compilers, either MPI, stock netcdf: works
3.9, Intel compilers, OpenMPI, Intel-compiled netcdf: won't build because of gnu-compiled OpenMPI modules
3.9, Intel compilers, Intel MPI, stock netcdf: won't build because of gnu-compiled netcdf module
3.9, Intel compilers, Intel MPI, Intel-compiled netcdf: stalls

One more thing. I also tried a debug version of 3.9, Intel compilers, Intel MPI, Intel-compiled netcdf (uncommented the FCDEBUG line in configure.wrf, clean, compile). Didn't help. Still stalled. No extra useful info in rsl files.
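For reference, the debug rebuild was roughly the following (a sketch: the sed assumes the commented-out debug flags sit after the "=" on the FCDEBUG line, which is how my configure.wrf looked):

  ./clean -a
  ./configure                                              # same Intel dmpar option as before
  sed -i '/^FCDEBUG/s/=[[:space:]]*#/= /' configure.wrf    # uncomment the debug flags
  ./compile em_real >& compile_debug.log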
 
Ming Chen said:
WRFV3.6 has some issues for nests that cross the dateline. The bug has been fixed in WRFV3.9.
Would you please send me more information about your case running with WRFV3.9, i.e., your namelist.input, namelist.wps, and the forcing data for your case?
Thanks.

Ming Chen,

Attached are the namelist for geogrid (geogrid.nl), the namelist for whatever other WPS tools use one (ungrib, metgrid: namelist.wps), and the WRF namelist (namelist.input). I don't know what "forcing data" means.
 

Attachments

  • geogrid.nl.txt (714 bytes)
  • namelist.input.txt (6.5 KB)
  • namelist.wps.txt (427 bytes)
I'm giving up on 3.9. Every variation of an Intel build of 3.9.anything that I've attempted has failed in wrf.exe.

I tried 3.8.latest. I'm able to get an Intel build of that which does not fail. But it doesn't have the dateline fix either.

However, if I use git's format-patch command to generate a patch containing only the dateline-fix revision (without any of the rest of 3.9), apply that patch to the 3.8 source, and Intel-build the result, wrf.exe runs to completion and does not have the nested dateline interpolation problem.
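The backport itself was just the standard git workflow -- a sketch, with the commit hash as a placeholder for the actual dateline-fix revision and the directory names assumed:

  # Export just the one fix from the WRF git history as a patch...
  git format-patch -1 <dateline-fix-commit> -o /tmp/patches
  # ...and apply it on top of an unpacked 3.8 source tree:
  cd ../WRFV3.8
  patch -p1 < /tmp/patches/0001-*.patch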
 
It turns out this problem seems to have been related to an Intel Parallel Studio bug. Whether it was the Fortran compiler, the C compiler, or the MPI runtime, I have no idea (though I have a guess). But when I updated my Intel packages from 2016 update 2 to 2019 update 5, everything else being equal, and rebuilt WRF 3.9.1.1, the job that used to hang on me just worked.

A special shoutout to mgduda (I think) who convinced me, at the Winter WRF Tutorial, that a compiler bug is *not* so extraordinarily unlikely that it's not worth trying an update.
 