
wrf.exe crashing at initialization before writing 1st wrfout

This post was from a previous version of the WRF & MPAS-A Support Forum. New replies have been disabled; if you have follow-up questions related to this post, please start a new thread from the forum home page.

frediani

New member
Hi,

I'm trying to run a specific case study, but I can't get the simulation past initialization. I've tried a few different model versions, compilers, and CPU configurations. The model doesn't integrate through a single time step; it crashes as it tries to write the 1st wrfout.

With 16 processors from 2 CPUs (8+8), I get this error:

Code:
forrtl: severe (408): fort: (2): Subscript #4 of the array SCALAR has value 2 which is greater than the upper bound of 1
Image              PC                Routine            Line        Source             
wrf.exe            000000000CF3DBCF  Unknown               Unknown  Unknown
wrf.exe            0000000001097AE9  force_domain_em_p       11088  module_dm.f90

With 12 processors from 1 CPU, I get this:

Code:
forrtl: error (73): floating divide by zero
Image              PC                Routine            Line        Source             
wrf.exe            000000000CF465FB  Unknown               Unknown  Unknown
libpthread.so.0    00002B33D88A4B00  Unknown               Unknown  Unknown
wrf.exe            00000000091E72C3  module_diffusion_        6963  module_diffusion_em.f90

Line 6963 in module_diffusion_em.f90 is this statement:
Code:
rdzw(i,k,j) = 1.0 / ( z_at_w(i,k+1,j) - z_at_w(i,k,j) )
I calculated z_at_w and rdzw from wrfinput_d01 and wrfinput_d02 and they look OK. You'll find the netCDF files with these variables in the tarball with the real.exe files.
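For anyone who wants to reproduce that check, something along these lines works (a minimal sketch, assuming Python with the netCDF4 and numpy packages; z_at_w is reconstructed from PH and PHB, and a zero or negative layer thickness is exactly what would trigger the divide by zero above):

Code:
# Sketch: verify that w-level heights in the wrfinput files are strictly increasing.
# A repeated level makes z_at_w(i,k+1,j) - z_at_w(i,k,j) zero in module_diffusion_em.
import numpy as np
from netCDF4 import Dataset

for fname in ("wrfinput_d01", "wrfinput_d02"):
    with Dataset(fname) as nc:
        ph  = nc.variables["PH"][0]    # perturbation geopotential on w levels
        phb = nc.variables["PHB"][0]   # base-state geopotential on w levels
        z_at_w = (ph + phb) / 9.81     # height of w (full) levels in meters
        dz = np.diff(z_at_w, axis=0)   # layer thickness between adjacent w levels
        print(fname, "min dz =", float(dz.min()))
        if (dz <= 0).any():
            k, j, i = np.unravel_index(np.argmin(dz), dz.shape)
            print("  non-positive thickness at k, j, i =", k, j, i)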

The source code for the attached test runs is release-v4.2.2 (1e93b7e3), compiled with intel/19.1.1 and impi on Cheyenne.

I'm including the relevant files from real.exe, and from wrf.exe for the 12- and 16-processor tests. Tarball names with "-D" correspond to executables compiled in debug mode with -D, and "noD" corresponds to the standard compilation.

I also tried release-v4.0.1 and release-v4.0.3, compiled with gnu and intel, but I'm not including the test files for these.

The input data to WPS is from HRRR v3 at pressure levels, downloaded from http://hrrr.chpc.utah.edu/. Let me know if you'd like to see the WPS files.

I'd really appreciate any tips on how to identify the issue. Thank you so much!
 

Attachments

  • wrf_4.2.2_intel1911-D-orig_cpu1x12.tgz (72.2 MB)
  • wrf_4.2.2_intel1911-D-orig_cpu2x8.tgz (72.3 MB)
  • wrf_4.2.2_intel1911-noD_cpu1x12.tgz (18.2 MB)
  • real_4.2.2_intel1911_noD.tgz (149 MB)
*Updated*

Hi,
Thank you for providing the files. I ran a couple of tests using your namelist and wrfbdy/wrfinput* files. The first test was everything "as-is" (i.e., using your namelist exactly as it's set up); as expected, the simulation failed immediately. The second test was for a single domain only, to check whether the problem was specific to d02, and that test ran without any problems.

I can't say for sure, but I feel fairly confident the problem is related to your parent_grid_ratio (and parent_time_step_ratio). You are currently using a 9:1 ratio, which is a very large jump. We typically recommend a 3:1 or 5:1 grid ratio, and never more than 7:1. I'm curious whether adding an additional nest between your parent and fine grids would make a difference.

Additionally, there are a few things you should modify in your namelist (see the example excerpt after this list).
1) debug_level. Set this to 0. This option was originally added to the namelist for testing purposes, but it has since been removed from the default namelists because it is rarely useful and only adds a lot of junk to your rsl files, making them gigantic and difficult to read through.
2) radt. This should be set to the same value for each domain, and should be roughly 1 minute per km of grid spacing (dx). Since your d01 resolution is 1 km, set radt = 1, 1 (or = 1, 1, 1 if you add a 3rd domain).
3) diff_opt. Set this to the same value for all domains (diff_opt = 2, 2).
4) km_opt. Set this to the same value for all domains (km_opt = 2, 2).
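Put together, an illustrative namelist.input excerpt with these changes would look something like the following. This is only a sketch of the records touched above, not a complete namelist: the three-column values assume you add the intermediate nest with 3:1 ratios as discussed (drop to two columns if you keep two domains), and everything else stays as you have it.

Code:
&time_control
 debug_level            = 0,
/

&domains
 max_dom                = 3,
 parent_grid_ratio      = 1, 3, 3,
 parent_time_step_ratio = 1, 3, 3,
/

&physics
 radt                   = 1, 1, 1,
/

&dynamics
 diff_opt               = 2, 2, 2,
 km_opt                 = 2, 2, 2,
/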

I would also recommend using more processors than 12 or 16. For your domain size, you would be perfectly safe to use 36 (a rough rule of thumb for choosing a task count is sketched below). As for domain set-up, you can refer to the best-practices page for recommended settings of the namelist.wps parameters.
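A common rule of thumb for WRF domain decomposition is that each MPI task's patch should end up somewhere between roughly 25 x 25 and 100 x 100 grid points. The sketch below just codifies that guideline; the e_we/e_sn values in the usage line are placeholders, not taken from your namelist.

Code:
# Rough decomposition rule of thumb (a sketch, not an official tool): aim for
# patches between about 25x25 and 100x100 grid points per MPI task.
def mpi_task_range(e_we, e_sn):
    fewest = max(1, (e_we // 100) * (e_sn // 100))
    most = (e_we // 25) * (e_sn // 25)
    return fewest, most

# Hypothetical 301 x 301 domain -> (9, 144) tasks
print(mpi_task_range(301, 301))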
 