restarts not producing identical output

dnolan · Mar 26, 2019

Hello,

I've been doing idealized hurricane simulations with both WRF3.4.1 and WRF 3.9.1.1.

Recently I've found that the output from restarts is not coming out identical to the original run. The results are identical when re-run from the initial conditions, but not from restart. This is true for both 3.4.1 and 3.9.1.1.

I actually had this problem before about 7 years ago, but then I got it fixed and then I had many runs with WRF3.4.1 that were repeated correctly.

One difference now is that these are idealized runs, with no wrfbdy files. The namelist is attached.

I'm running on 512 Intel processors using ifort (dm).

If anyone has had a similar experience, and/or solutions, please let me know!

Dave Nolan

kwerner · Mar 28, 2019

Hi Dave,
I've been trying to test this out, using some of the components from your namelist, but haven't had any success in replicating the issue yet. I have a few questions that may help:

1) When you compare the output, are there differences on every domain? If not, on which domain(s) is it evident?
2) The namelist you sent seems to be for the initial run, so I don't know much about the restart - for instance, how long are your restart runs? Can you also attach your namelist for the restarts?
3) At which times are you comparing your output files?
4) Have you checked to see if this also happens in the latest version of the code (v4.0.3)?
5) Has any of the code been modified from the pristine "out-of-the-box" WRF code?

Thanks,
Kelly

dnolan · Mar 28, 2019

Hi Kelly,

Thanks for your response.

To answer your questions:

Yes, there are changes to the code. These are:

1) Change to the surface drag formula as a function of wind speed in the YSU surface layer scheme;

2) Change to the vortex center-finding algorithm for the moving nests; the geopotential is smoothed before the center is found;

Based on your replies, I'll do a test with a pristine version of the code for 3.9.1.1. For consistency with ongoing research projects, it is not possible to go up to 4.0.3.

If the pristine 3.9.1.1 case fails, I'll send a more complete namelist and I can also post the input files on our web site for you to work with.

Dave Nolan

dnolan · Mar 31, 2019

Hi again,

So I recompiled WRF 3.9.1.1 from scratch with no modifications. Then I ran again one of my idealized hurricane simulations. To run faster I only used three domains.

First I ran for 6 days. Then I restarted from the beginning of day 5.

The results were different, very quickly. You can see differences in plots by 6 hours.

I know that WRF in the past has had trouble with restarts when moving nests are used. So then I ran again with only one domain, again for 6 days. I restarted from the beginning of day 6. Again, the results were different (not as clear to the eye so quickly, but there are numerical differences within 3 hours).

Finally, I ran the one-domain restart case again, to see if it would be randomly different every time. It was not; the two cases that start from the restart file are identical (at least for 24 hours).

It's conceivable that there is something about our supercomputer that is causing this. But it's also possible that it's WRF. We are using a lot of physics - double moment microphysics, SW and LW radiation, etc. There could be a problem that was not noticed before.

We were also wondering about single and double precision. The netcdf files are definitely in single precision. How can we tell if WRF is also running in single?

If you want to try it yourself, I can give you both my namelist.input and the wrfinput_d01 file. That should be all you need.

Thanks for your help.

Dave Nolan

kwerner · Apr 1, 2019

Hi Dave,
Yes, if you don't mind, can you attach those files for me? If the wrfinput file is too large to attach, take a look at the home page for this forum for information on uploading larger files.

Thanks,
Kelly

dnolan · Apr 1, 2019

Or perhaps you can download them from my site?

namelist:

https://drive.google.com/file/d/1NAtqsH3233pDK-ePt115_1qVEbbfqANC/view?usp=sharing

wrfinput_d01:

https://drive.google.com/file/d/1iki25d91qlZd-QxxEIK61TztLQnqKCB-/view?usp=sharing

Let me know if that works. Thanks!

kwerner · Apr 2, 2019

Hi,
Yes, that works. I have downloaded the files and will run some tests. I'll keep you posted.

Kelly

kwerner · Apr 5, 2019

Dave,
I haven't forgotten about you. It's taken me a while to carry out these tests. I have been able to repeat what you're seeing, and have determined that I can see differences if I run for only 3 days, and then restart from day 2-3, comparing output at the day 3 mark (running with only a single domain). I have also found that increasing the time step (I changed it to 150) makes the problem go away, but I haven't figured out why that is yet. I'll keep looking to see if I can find anything helpful and will keep you posted.

Kelly

dnolan · Apr 14, 2019

Sorry for not replying sooner, I was at a conference in Vienna.

Anyway, that's remarkable that you have seen the same problem. I was becoming sure that it was a problem with our supercomputer. And the time step aspect is really strange. I'm going to try that myself and see if it has the same result. Let me know if you want me to try anything else.

kwerner · Apr 15, 2019

Hi Dave,
I still don't have an underlying reason for why, but I have found that with your namelist (only running 1 domain/3 hours), I can get rid of the diffs if I change either or both of the following:
1) increase the time_step from 60 to 150 (didn't try values in between)
2) change damp_opt from 3 to 2 (which meant I also had to change dampcoef from 0.05 to 0.0003 to keep it from crashing immediately)

If I'm correctly interpreting the vague description of damp_opt in the README.namelist file, it seems that option 3 was only meant to be used for a real-data case, which may mean you would want to change this anyway, but I'm not certain. You could try a test using option 2 to see if you are still satisfied with the results, and if it gets rid of the diffs for you, as well.

Kelly

dnolan · Apr 16, 2019

Wow. That's very interesting.

I'm not inclined to change the time step. However, for the upper-level damping option, I originally was using damp_opt=3 with a much smaller coefficient, 0.0033, because I thought that should be the inverse of the dimensional damping time scale (i.e., 300 sec). However, Joe Klemp (who made that scheme) told me himself that the coefficient should be much larger because it is non-dimensional relative to the vertical propagation speed of gravity waves.

So first I will try it again with my original value for damp_opt=3 and see what happens. If that still fails to reproduce I will go to damp_opt=2.

I will let you know what happens.

dnolan · Apr 19, 2019

Hello again,

Some bad news, unfortunately.

First I tried damp_opt=3 with a smaller coefficient. Then I tried damp_opt=2 with the value you mentioned, 0.003. Then I tried that with dt = 150. Then I tried damp_opt=0 with dt = 150.

All of these produced discernibly different results from restart within 24 hours. A screenshot of the comparison for the last case is attached.

About the cases that you said that got matching restarts, I have to ask...are you sure? I did notice that with some of the changes above the fields took longer to diverge.

Thanks for your help.

Dave Nolan

kwerner · Apr 22, 2019

Hi Dave,
I'm pretty sure... I'm attaching files for a run in which I only modified damp_opt (set to 2) and dampcoef (set to 0.003). This is how I conducted this test:
run1
ran for 3 days:
2015-09-05_00
2015-09-08_00
see namelist.input.run1.txt

restart
ran for 1 day:
2015-09-07_00
2015-09-08_00
see namelist.input.restart.txt

I then used the diffwrf utility (found in external/io_netcdf/) to compare the 2 files at time 2015-09-08_00, and I get 0 diffs (see file diffwrf.txt). I'm also attaching the namelist you sent - the baseline I started from (namelist.input.dnolan.txt) so that you can compare the diffs between those files. The only thing I can think of is that perhaps you are simulating all 3 domains, and that makes a difference, or perhaps it's that I'm using the newly-released v4.1 code (because I initially saw the same diffs with this version and v3.9.1.1). I will try this again with v3.9.1.1 and with all 3 domains to see if I can get diffs, and keep you posted.
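
To make concrete what "0 diffs" means in a bit-for-bit comparison, here is a minimal Python sketch. It is not diffwrf and does not read netCDF files; the function names and the sample values are hypothetical, and the lists stand in for one variable taken from the original-run file and the restart-run file at the same time. It assumes the fields are stored as 32-bit reals, as the wrfout files in this thread are.

```python
import struct

def float32_bits(value):
    """Return the 32-bit IEEE-754 pattern of value after rounding to single precision."""
    return struct.unpack("<I", struct.pack("<f", value))[0]

def count_bitwise_diffs(field_a, field_b):
    """Count positions where two equally sized sequences differ in their float32 bits."""
    if len(field_a) != len(field_b):
        raise ValueError("fields must have the same number of points")
    return sum(float32_bits(a) != float32_bits(b) for a, b in zip(field_a, field_b))

# Hypothetical values standing in for one variable from the two output files.
original_run = [0.0123, 0.0125, 0.0131, 0.0140]
restart_run = [0.0123, 0.0125, 0.0131, 0.0140]
perturbed_run = [0.0123, 0.0125, 0.01310001, 0.0140]

print(count_bitwise_diffs(original_run, restart_run))    # identical fields
print(count_bitwise_diffs(original_run, perturbed_run))  # one point changed slightly
```

A comparison like this treats any change in the stored bits as a difference, which is stricter than checking whether two plots look alike.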

dnolan · Apr 22, 2019

Your dates are different from mine. I'm starting at 09-01_00, running to 09-05_00, but then doing the restart from 09-03 to 09-04. I think when I sent you my namelist, it was from the restart test which had a later date. But of course this should not matter at all to this issue.

We are also mixed up on the damping coefficient for damp_opt = 2. Your earlier email on this said 0.0003, and that is what I put into my namelist when I first tried damp_opt = 2. Then I just wrote back to you and said 0.003 by mistake. The namelist you just sent has 0.003, which I suspect is what you were using anyway.

I am definitely using only one domain. I think the fact that you went from 3.9.1.1 to 4.1 could matter a lot.

Thank you again for working on this.

Dave Nolan

kwerner · Apr 24, 2019

Hi Dave,
I tried a few more tests, but am still getting the same result - using damp_opt = 2 and dampcoef = 0.003 gives bit-for-bit identical results for restart output vs. original run output.

This is what I tried this time:
1) I changed to the exact dates you mention (even though, as you said, that shouldn't matter), running from 9/1 to 9/5 and restarting from 9/3 to 9/4. Original run and restart file at the start of 9/4 were identical. I also checked the times between 9/3 and 9/4 and they were also identical. I am changing 'frames_per_outfile' to 1 just so that I can have the exact date/time in each file to compare. I don't think this should make a difference with your runs, though.

2) I then realized that you are using Intel and I was using GNU for all of these tests. So I recompiled with Intel (V17.0.1) and tried the above test again - with this run, I do receive 1 difference in the REFL_10CM field, which is a known problem that was corrected with V4.1 (https://github.com/wrf-model/WRF/commit/86f767535923). Again, I don't think this is related to the many diffs you are seeing, as this should only be a problem if using nwp_diagnostics.

I'm attaching the same comparison you did (for Q2 at the 7th time within 9/3 - which of course is 2015-09-03_21:00:00).

At this point, it seems that the problem could be related to the environment and/or the specific version of Intel. If you haven't already compiled with no optimization (configure -d), then I would suggest that to see if it helps (along with the damp* changes). I also just talked to one of our software engineers who said they've seen similar problems with bit-for-bit results with the MPAS model when using particular versions of Intel. He suggested adding "-fp-model precise" to your configure.wrf script in the line "FCOPTIM." You'll need to make sure the line isn't commented out, and obviously you will need to first clean and reconfigure before compiling again.
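
For context on why a flag such as -fp-model precise can matter for bit-for-bit comparisons: floating-point addition is not associative, so an optimizer that is allowed to regroup arithmetic can change the last bits of a result. The Python sketch below only demonstrates that arithmetic fact with arbitrary values; it is not a statement about what the Intel compiler actually does to the WRF code.

```python
# Floating-point addition is not associative: regrouping the same three
# terms can change the last bit of the result.
a, b, c = 0.1, 0.2, 0.3

left_to_right = (a + b) + c
regrouped = a + (b + c)

# The two sums agree to about 16 digits but are not the same bit pattern.
print(left_to_right == regrouped)
print(left_to_right.hex(), regrouped.hex())
```

In a chaotic simulation a difference of this size in a single operation is enough to make two runs diverge visibly after some hours.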

dnolan · Apr 30, 2019

Hello again,

Here is the latest. Based on your previous message, I tried the following:

1) cleaned and configured again and then compiled with the flag -fp-model precise added to the optimization;

2) cleaned and configured again with configure -d and recompiled with no optimization;

3) cleaned and configured again with configure -d and then added -fp-model precise.

All of these were run with damp_opt = 2 and 0.003 and dt = 60 on d01 only.

All of them produced different results after restart.

At this point, I am ready to give up on getting bitwise reproduction from restart with WRF on our system.

(By the way, configure -d is not feasible anyway, it runs 10 times slower.)

There is some hope, in that UM is about to get a new supercomputer with IBM processors. WRF was originally developed on IBM and our previous system - for which I was able to get perfect reproduction - was also IBM. (Although, to be clear, I was able to get restart reproduction before with our Intel system.) But I am hopeful that it will be different on the new system.

I did want to ask, am I correct in interpreting what you wrote before that you did not get restart reproduction with different damp_opt and different time steps? And it sounds like that was even true for WRF 4.1? Obviously, WRF 4.1 (or any version) should reproduce from restart for any options.

I also had one other question that I am not sure was answered. It appears to me that WRF is running in single precision, and I know for sure that the outputs to wrfout and wrfrst are single precision. Is there some way to make sure it is not running in double but saving in single?

Thanks again for your time.

Dave Nolan

davegill · May 1, 2019

Dave,
Just a quick comment on your question about double precision.

You can build the WRF model (and the real program at the same time) to use 64-bit reals by default by adding a flag to the configure script:

Code:

> ./configure -r8

The WRF model is able to read either 32-bit or 64-bit reals. However, on output, the model will always use the default precision defined in the build.

Take a look at the ZNW (vertical coordinate, eta levels) from the model output. This is an easy way to determine the floating point precision of the build.

Double precision example:

Code:

> ncdump -v ZNW wrfout_d01_2000-01-24_12:00:00
data:

 ZNW =
  1, 0.993814761866313, 0.985950661559845, 0.976014242459852, 
    0.963557542641533, 0.948093083960134, 0.929123783404496, 
    0.906191230656412, 0.878942359453986, 0.847207978684442, 
    0.811077789981358, 0.770949002887757, 0.727525396211584, 
    0.681755337912341, 0.634503563426089, 0.586030473548833, 
    0.539504984171196, 0.49636880084888, 0.456375018160999, 
    0.419294717344076, 0.38491565599337, 0.353041053217325, 
    0.323488463291546, 0.296088731365236, 0.270685025242722, 
    0.247131937698118, 0.225294654184901, 0.205048181176488, 
    0.186276630720942, 0.168872557114707, 0.15273634189857, 
    0.137775623655674, 0.123904769347815, 0.111044384164019, 
    0.0991208570758672, 0.0880659394983561, 0.077816354644643, 
    0.0683134353386523, 0.0595027882124464, 0.0513339823662699, 
    0.0437602607092023, 0.036738272328173, 0.0302278243534581, 
    0.0241916519003705, 0.0185952047703237, 0.013406449690374, 
    0.00859568695929155, 0.00413538045066428, 0,

Single precision example:

Code:

> ncdump -v ZNW wrfout_d01_2000-01-24_12:00:00
data:

 ZNW =
  1, 0.9938147, 0.9859506, 0.9760143, 0.9635575, 0.9480931, 0.9291238, 
    0.9061912, 0.8789424, 0.847208, 0.8110777, 0.770949, 0.7275254, 
    0.6817554, 0.6345036, 0.5860305, 0.539505, 0.4963688, 0.456375, 
    0.4192947, 0.3849156, 0.353041, 0.3234884, 0.2960886, 0.2706849, 
    0.2471318, 0.2252945, 0.2050481, 0.1862765, 0.1688725, 0.1527362, 
    0.1377755, 0.1239046, 0.1110443, 0.09912077, 0.08806589, 0.07781631, 
    0.06831341, 0.05950278, 0.05133395, 0.04376025, 0.03673827, 0.03022783, 
    0.02419166, 0.01859524, 0.01340648, 0.008595719, 0.004135413, 0 ;
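
The difference between the two listings can be reproduced in a few lines. This is a small Python sketch, assuming only the standard library; the helper name to_float32 is hypothetical. It rounds one of the double precision ZNW values above to a 32-bit real and shows how many digits survive.

```python
import struct

def to_float32(value):
    """Round a Python float (64-bit) to the nearest 32-bit float and back."""
    return struct.unpack("<f", struct.pack("<f", value))[0]

znw_double = 0.993814761866313   # second ZNW value from the double precision listing
znw_single = to_float32(znw_double)

print(repr(znw_double))               # about 15 to 16 significant digits
print(f"{znw_single:.9g}")            # only about 7 digits are meaningful here
print(abs(znw_double - znw_single))   # rounding error from the narrower type
```

So a ZNW listing with roughly 7 significant digits points to a 32-bit build, and one with roughly 15 points to a build configured with -r8.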

dnolan · May 5, 2019

Hello again.

Some significant developments.

Following a suggestion of a colleague, I did a much simpler version of the same case we have been using. No radiation, no cumulus, and warm rain microphysics (mp=1).

This reproduced from restart perfectly.

Then I tried mp=16 (WDM6), and also brought back cu = 6. This also reproduced perfectly.

Then I brought back radiation, sw=4 and lw=4. This did not reproduce.

I also tried sw=1 and lw=1. This also does not reproduce.

So the problem seems to be in the radiation schemes, or even the radiation driver.

Any ideas on what might be happening?

sw=5 and lw=5 is in the queue.

Dave Nolan
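
One general way an intermittently called scheme can break restart reproducibility is if something it computes is reused between calls but is not carried through the restart. The Python toy below illustrates only that general mechanism. It is not WRF code and not a diagnosis: the thread does not establish the cause, and every name and number here is hypothetical.

```python
def run(n_steps, rad_interval, state=0.0, cached_heating=0.0, start_step=0):
    """Advance a toy scalar with a heating rate that is refreshed every rad_interval steps."""
    for step in range(start_step, start_step + n_steps):
        if step % rad_interval == 0:
            # The expensive scheme runs only occasionally; its result is reused in between.
            cached_heating = 0.01 * (300.0 - state)
        state += cached_heating
    return state, cached_heating

RAD_INTERVAL = 4

# Reference: one uninterrupted run of 10 steps.
full_state, _ = run(10, RAD_INTERVAL)

# Stop after 6 steps, which is between two calls of the occasional scheme.
state_6, heating_6 = run(6, RAD_INTERVAL)

# Restart that restores both the state and the cached heating rate.
complete_state, _ = run(4, RAD_INTERVAL, state_6, heating_6, start_step=6)

# Restart that restores the state only, so the cached heating rate is lost.
incomplete_state, _ = run(4, RAD_INTERVAL, state_6, 0.0, start_step=6)

print(full_state == complete_state)     # restart with full information matches
print(full_state == incomplete_state)   # restart with missing information does not
```

In this toy the mismatch is fully deterministic, so repeating the incomplete restart gives the same wrong answer every time.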

kwerner · May 6, 2019

Hi Dave,
That is interesting information. At this time, I'm not sure why that would be. I'd like to test this out more. Unfortunately our super computing system is down all week, so I can't perform any large runs, but I am at least running a test on a basic namelist vs. a basic namelist with radiation on my small machine. Even that is taking quite a while, though. I'll keep you posted as soon as I have any useful information. Do you mind attaching the namelist for your latest test run just so that I can see exactly what you're doing? I'd like to make my tests as identical to yours as possible.

Thanks,
Kelly

dnolan · May 8, 2019

Here is the latest:

sw=5 and lw=5 did not reproduce.

Then I tried the SW and LW schemes individually. Note that this required that I hack the code in module_radiation_driver.F, as it will not allow you to run with only one of the SW or LW parameterizations turned on.

Either way, with sw=0 and lw=4, or sw=4 and lw=0, it did not reproduce.

I don't think you need to do much to do the same test. For example, try exactly the same case as before that did not reproduce, but with both radiations turned off.

That being said, I am wondering now if you did do the same test as me. I looked again at the pictures of the surface Q2 fields that you sent before, and they do not look like mine at all. (Look further down in the posts.) Were you using my wrfinput_d01 file?

Dave Nolan
