NWP_Diagnostic failure v3.9.1.1 and v4.0 HAILCAST

ccalkins · Oct 11, 2018

Dear WRF helpdesk,

I have been trying to run WRF v3.9.1.1 with the v4.0 HAILCAST modifications. With nwp_diagnostics set to 0, it runs its course (rsl 0002). With nwp_diagnostics set to 1, on AWS c5.9xlarge (rsl 0000) and c5.18xlarge (rsl 0001) instances (the nwp_diagnostics = 0 run was also on c5.18xlarge), WRF dies at 00:00:20 calling inc/HALO_EM_D3_5_inline.inc and does not continue to the next step at 00:00:20, calling inc/PERIOD_BDY_EM_D3_inline.inc. I have included the namelist with NWP turned on. Note also that nwp_diagnostics fails even if HAILCAST is turned off.

Is this a current problem with this version or is there likely some other reason as to why this would be happening?

Chase Calkins

kwerner · Oct 12, 2018

Hi,
You mentioned that you've made the v4.0 hailcast modifications to V3.9.1.1 code. Have you tried to run out-of-the-box (without your modifications) 3.9.1.1 code (or v4.0 code)? If not, can you do a test with that to see if it works? Thanks.

ccalkins · Oct 15, 2018

Hello,

I have run WRFV3.9.1.1 out of the box with nwp_diagnostics set to 1 (rsl 0004), and it stops progressing at 00:01:20 calling inc/HALO_EM_HELICITY_inline.inc, after which I interrupt the wrf run. With nwp_diagnostics set to 0 (rsl 0003), it keeps running until I interrupt it. hailcast_opt and haildt have been removed, and the corresponding aux parameters have been commented out.

Chase

kwerner · Oct 15, 2018

Thanks for doing that test. Okay, I want you to try a couple different things:

1) See if increasing the number of processors helps. You have a pretty large domain and should probably be using more processors anyway. Can you try at a bare minimum something like 96? If that doesn't help, try to increase to around 120.

2) If the above doesn't help, can you try a basic namelist (start with the default namelist that comes with version 3.9.1.1), and only change what is necessary for use with your input files (dates, times, e_we, e_sn, dx/dy, etc.). Don't add any extra variables, or modify anything else, except for adding nwp_diagnostics = 1. If this works, then you know it's not your data, and it's something related to your namelist. You can then add/modify 1 or 2 variables at a time from your namelist until you can track down what is causing the problem. For what it's worth, I also tested this option (nwp_diagnostics = 1) with V3.9.1.1, and a very small/basic case, which worked without any errors.
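
The bisection described in step 2 can be made a little more systematic by listing exactly which settings differ between the default namelist and the failing one. The sketch below is a minimal illustration in Python and is not part of WRF: parse_namelist and namelist_diff are hypothetical helper names, the parser assumes one "key = value" entry per line with no "!" inside quoted strings, and the sample values are placeholders rather than anyone's actual namelist.

```python
import re


def parse_namelist(text):
    """Parse a simple Fortran namelist into {(group, key): value_string}."""
    settings = {}
    group = None
    for raw in text.splitlines():
        line = raw.split("!", 1)[0].strip()  # drop trailing comments
        if not line:
            continue
        if line.startswith("&"):
            group = line[1:].strip().lower()
        elif line == "/":
            group = None
        elif group is not None and "=" in line:
            key, value = line.split("=", 1)
            # Remove whitespace and a trailing comma so "1," compares equal to "1"
            value = re.sub(r"\s+", "", value).rstrip(",")
            settings[(group, key.strip().lower())] = value
    return settings


def namelist_diff(baseline, candidate):
    """Return (group, key, baseline_value, candidate_value) for every difference."""
    base = parse_namelist(baseline)
    cand = parse_namelist(candidate)
    diffs = []
    for group_key in sorted(set(base) | set(cand)):
        if base.get(group_key) != cand.get(group_key):
            diffs.append((group_key[0], group_key[1],
                          base.get(group_key), cand.get(group_key)))
    return diffs


# Placeholder namelists for illustration only
baseline = """
&time_control
 run_hours = 12,
/
&domains
 e_we = 74,
/
"""
candidate = """
&time_control
 run_hours = 30,
 nwp_diagnostics = 1,
/
&domains
 e_we = 300,   ! placeholder value
/
"""
for group, key, old, new in namelist_diff(baseline, candidate):
    print(f"&{group} {key}: {old} -> {new}")
```

Each reported difference is a candidate to reintroduce into the basic namelist, one or two at a time, until the failure reappears.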

ccalkins · Oct 17, 2018

Hello,

The problem seems to lie with the radiation physics, specifically when ra_lw_physics is set to 1 and ra_sw_physics is set to 2. Other combinations run to completion: (lw = 4, sw = 4), (lw = 1, sw = 4), (lw = 4, sw = 2). Is there a reason why lw = 1 and sw = 2 would be incompatible, or unable to run in this case?

Chase

RCarpenter · Oct 17, 2018

I've had the same problem. I can confirm that using (4, 4) instead of (1, 1) solved the problem for me.

ccalkins · Oct 17, 2018

RCarpenter said:
I've had the same problem. I can confirm that using (4, 4) instead of (1, 1) solved the problem for me.

Dear RCarpenter,

Do you mean (lw = 1, sw = 2) instead?

RCarpenter · Oct 17, 2018

No, I had been using 1 for both options.

kwerner · Oct 18, 2018

Hi,
I've run basic cases using nwp_diagnostics = 1 while setting lw/sw = 1, and also with lw = 1 and sw = 2. I am able to run without a problem in both circumstances. There must be something specific to your cases. Can you try to compile with debugging to see if that gives you any answers?
You'll need to issue a './clean -a' and then reconfigure with ./configure -d
Then open your configure.wrf file and find the FCDEBUG line. You should see part of that line commented out with a "#". Remove the # to uncomment it, save the file, then recompile and try to run again. You should hopefully see the line where the code stops printed out in your rsl.out.0000 file.

1) Did you increase the number of processors? If not, please do that too.
2) You currently have debug_level = 100 in your namelist.input file. We actually do not recommend setting this at all, as it was originally implemented for specific developmental work and wasn't removed until V4.0. It really doesn't give much information and just adds a lot of junk to the rsl files, making them difficult to read. So set that to 0 before running again.
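
When checking where a run stopped, it can help to look at the last line written by every rank rather than only rsl.out.0000. The following is a minimal Python sketch, not a WRF utility: last_lines is a hypothetical helper name, and it assumes the rsl.out.* files sit in the run directory.

```python
import glob
import os


def last_lines(run_dir, pattern="rsl.out.*", n=1):
    """Return {filename: last n non-empty lines} for each matching rsl file."""
    result = {}
    for path in sorted(glob.glob(os.path.join(run_dir, pattern))):
        with open(path, "r", errors="replace") as fh:
            lines = [ln.rstrip("\n") for ln in fh if ln.strip()]
        result[os.path.basename(path)] = lines[-n:]
    return result


if __name__ == "__main__":
    # Print the final non-empty line of each rsl.out.* file in the current directory
    for name, tail in last_lines(".").items():
        print(name, "|", " / ".join(tail))
```

Passing pattern="rsl.error.*" does the same for the error logs.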

RCarpenter · Oct 18, 2018

Please see my comments at http://forum.mmm.ucar.edu/phpBB3/viewtopic.php?f=53&t=196

ccalkins · Feb 20, 2019

Dear All,

I am having some new issues running nwp_diagnostics in WRFV3.9.1.1 with ERAI data on AWS c5.18xlarge (72 vCPU, 144 GiB).
1) When I output UP_HELI_MAX for April 1 and 2, 2009, every one of the 30 hour outputs for both days is 0.0. Is there a reason for what could be causing this issue?
2) I have an issue running with REFD_MAX. It is listed in the NWP diagnostics section of Registry.EM_COMMON. If I leave it in, real.exe stops without giving an error code, and the last line states begin wrf.input. I am trying to work out whether this is caused by something else, or whether I instead have to make changes in the registry and then clean and recompile WRFV3.9.1.1.

package nwp_output nwp_diagnostics==1 - state:wspd10max,w_up_max,w_dn_max,up_heli_max,w_mean,grpl_max
package radar_refl compute_radar_ref==1 - state:refl_10cm,refd_max
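
Reading the two package lines above mechanically shows which package, and which condition, each state variable is listed under. This is a minimal Python sketch and not the WRF Registry parser: parse_package_line is a hypothetical helper, and it assumes the whitespace-separated layout exactly as quoted here.

```python
def parse_package_line(line):
    """Split a Registry 'package' entry into (name, condition, state variables)."""
    fields = line.split()
    if len(fields) < 5 or fields[0] != "package":
        raise ValueError("not a package line: %r" % line)
    name, condition = fields[1], fields[2]
    variables = []
    for field in fields[4:]:
        if field.startswith("state:"):
            variables = [v for v in field[len("state:"):].split(",") if v]
    return name, condition, variables


# The two lines as quoted in this post
lines = [
    "package nwp_output nwp_diagnostics==1 - state:wspd10max,w_up_max,w_dn_max,up_heli_max,w_mean,grpl_max",
    "package radar_refl compute_radar_ref==1 - state:refl_10cm,refd_max",
]

# Map each state variable to the package and condition it is listed under
owner = {}
for line in lines:
    name, condition, variables = parse_package_line(line)
    for var in variables:
        owner[var] = (name, condition)

print("refd_max:", owner["refd_max"])
print("up_heli_max:", owner["up_heli_max"])
```

In the lines as quoted, refd_max appears under radar_refl (compute_radar_ref==1), while up_heli_max appears under nwp_output (nwp_diagnostics==1).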

Chase

kwerner · Feb 21, 2019

Hi Chase,
I just ran a test using V3.9.1.1, along with your namelists and your 'my_file_d02.txt'. When I had REFD_MAX in my_file_d02.txt, I was able to run real.exe without any problems. I then ran WRF and in the hail.d01* files, I begin to see values for UP_HELI_MAX within about 3 hours.

The differences in my run vs. yours are that I used my own input data (GFS FNL), and I had to make a couple of modifications to the namelist:
1) I had to remove 'haildt', as this parameter was not put into the code until V4.0 - did you make modifications to your 3.9.1.1 code to allow for this?
2) I set debug_level = 0. This is an old variable that was put into the namelist many years ago for a specific testing purpose and wasn't removed until v4.0. We have found that it rarely provides any useful information, and instead just makes the rsl* files junky and difficult to read through.
3) I used 360 processors to run WRF, since this is a large domain. If you were able to run your case with only 36 actual processors (72 virtual processors), then I guess that is okay, and probably shouldn't make a difference with the problems you are seeing.

So perhaps it's the input data? Or if you have made modifications to your code, something could be wrong there. If so, you could run a test with the pristine (out-of-the-box) v3.9.1.1 code to see if that makes any difference. You could also try this with v4.0.

Kelly
