WRF 4.0.1 Hangs After First Time-step

This post is from a previous version of the WRF & MPAS-A Support Forum. New replies have been disabled; if you have follow-up questions related to this post, please start a new thread from the forum home page.

hellyj

New member
Aloha. We are seeing an increasing frequency of WRF jobs hanging after writing the first time-step output. The problem appears to be MPI-related, but we have no indication of what is causing the failure since the job just sits there until it times out. We've seen this problem occasionally over the years with earlier versions, but now we are attempting to solve it. Does anyone have any insight into what it might be and/or how to debug it?
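One way to get more information out of a silently hanging MPI job (a sketch, assuming gdb is available on the compute nodes and that <PID> is replaced with a real wrf.exe process ID) is to attach a debugger to a few of the stalled ranks and look at their backtraces:

# On a compute node where the job is stuck:
pgrep -u $USER wrf.exe                          # list the wrf.exe ranks running on this node
gdb -batch -ex "thread apply all bt" -p <PID>   # print a backtrace for one stalled rank
# If most ranks are blocked inside an MPI call while one rank is somewhere else,
# that odd rank is usually the place to start looking.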
 
Can you provide more information about those hanging jobs? Specifically, how many grid numbers? How did you build the code? How many processors are you using to run the job? Is there any common feature among those hanging jobs?
 
Aloha. Not sure what you mean by grid numbers. There is a 3 km grid nested within a 9 km grid. The code was compiled on comet.sdsc.edu; configure.wrf is attached, along with rsl.error.0000. We can't find anything in the rsl.error.* files and are scratching our heads trying to figure out how to debug this. Will try increasing the debug level.
Thanks for looking at this.

Had to change the *.0000 to *.0000.txt to be able to upload it.
J.
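For reference, the debug level mentioned above is set in namelist.input; a minimal sketch of the usual way to raise it (the value 200 is only an example):

grep -n debug_level namelist.input     # check whether it is already set
# Then, in the &time_control section of namelist.input, set e.g.
#   debug_level = 200,
# Higher values make each rank write considerably more detail to its
# rsl.out.* / rsl.error.* files, which can help pin down where the run stalls.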
 

Attachments

  • configure.wrf (22.9 KB)
  • rsl.error.0000.txt (4.5 KB)
Please send me your namelist.input and namelist.wps to take a look. Did you compile WRF in dmpar mode? Is this a cold-start case?
 
Aloha. Cold starts only. I don't know what dmpar mode is, so I don't think I compiled it that way. Wouldn't that be reflected in configure.wrf? Here are the namelists.
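(For what it's worth, the build mode is recorded in configure.wrf, and rsl.error.* files normally only appear for an MPI build in the first place. A quick check, assuming a standard configure.wrf layout:)

grep -E "DMPARALLEL|DM_FC|DM_CC" configure.wrf
# A dmpar build typically shows DMPARALLEL = 1 and MPI compiler wrappers
# (e.g. mpif90 / mpicc) for DM_FC and DM_CC.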
 

Attachments

  • namelist.input (5.7 KB)
  • namelist.wps (944 bytes)
I don't think you are working with the standard WRFV4.0.1 code. Have you modified anything?

In your namelist.input, I see some options that are not available in WRFV4.0.1:
mgram_opt = 1,
num_mgram = 9,
num_mgram_lev = 49,
meteo_name = "ACV", "BBY", "PTS", "SBA", "CCO", "SAC", "TCY", "VIS","LMC"
meteo_lats = 40.9720, 38.3191, 36.3042, 34.4294, 39.6999, 38.3000, 37.6800, 36.3100, 39.2374
meteo_lons = -124.1100, -123.0728, -121.8881, -119.8468, -121.9075, -121.4200, -121.4400, -119.3900, -123.1630
meteo_hghts = 200., 300., 400., 500., 600., 700., 800., 900., 1000., 1100., 1200., 1300., 1400., 1500., 1600., 1700., 1800., 1900., 2000., 2100., 2200., 2300., 2400., 2500., 2600., 2700., 2800., 2900., 3000., 3100., 3200., 3300., 3400., 3500., 3600., 3700., 3800., 3900., 4000., 4100., 4200., 4300., 4400., 4500., 4600., 4700., 4800., 4900., 5000.

Please clarify.
 
Yes, we are using a modified code. We've been working with Dave Gill and Wei Wang on this for a couple of years. The problem has been around for a few years but has recently gotten much worse, possibly related to maintenance done on Comet last February. We have some evidence that there is a memory leak in the codebase and are trying to isolate it.
 
Aloha.
I think this is the problem; it has been raised previously, but I've never seen a solution. I've also included the output from nc-config, since some posts suggested it was a netCDF-related problem. I think I remember an old problem with colons and underscores but don't remember the context. Any help appreciated.
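(If the half-remembered colon/underscore issue is the one about colons in output file names, WRF has a namelist switch for that; a sketch, with no claim that it is the cause of the hang here:)

grep -n nocolons namelist.input
# In &time_control, setting
#   nocolons = .true.,
# makes WRF write file names like wrfout_d02_2019-11-27_00_00_00 instead of
# ...00:00:00, which works around filesystems and IO layers that dislike colons.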

================================================ rsl.error.0000 tail ==========================================
[hellyj@comet-ln3 run]$ tail -f ~/rsl.error.0000
mediation_integrate.G 1944 DATASET=HISTORY
mediation_integrate.G 1945 grid%id 2 grid%oid 3
Timing for Writing wrfout_d02_2019-11-27_00:00:00 for domain 2: 5.41066 elapsed seconds
Timing for Writing QPFhourly_d02_2019-11-27_00:00:00 for domain 2: 0.00900 elapsed seconds
open_aux_u : error opening auxinput5_d02_2019-11-27_00:00:00 for reading. 100
d02 2019-11-27_00:00:00 Input data processed for aux input 5 for domain 2
Tile Strategy is not specified. Assuming 1D-Y
WRF TILE 1 IS 1 IE 26 JS 1 JE 17
WRF NUMBER OF TILES = 1
Timing for main (dt= 15.00): time 2019-11-27_00:00:15 on domain 2: 5.65952 elapsed seconds
================================================ rsl.error.0000 tail ==========================================
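A quick way to check whether every rank stalled at the same point (a sketch; run it in the run directory that holds the rsl files):

# Print the last line of every rank's error log; if a handful of ranks stop at a
# different place than the rest, those are the ones worth attaching a debugger to.
for f in rsl.error.*; do
  printf '%s: %s\n' "$f" "$(tail -n 1 "$f")"
done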


================================================ nc-config ==========================================
This netCDF 4.6.1 has been built with the following features:

--cc -> mpicc
--cflags -> -I/share/apps/compute/netcdf/intelmpi/v2018/include -I/share/apps/compute/hdf5/intelmpi/v2018/include
--libs -> -L/share/apps/compute/netcdf/intelmpi/v2018/lib -lnetcdf

--has-c++ -> no
--cxx ->

--has-c++4 -> no
--cxx4 ->

--has-fortran-> yes
--fc -> mpif90
--fflags -> -I/share/apps/compute/netcdf/intelmpi/v2018/include
--flibs -> -L/share/apps/compute/netcdf/intelmpi/v2018/lib -lnetcdff -L/share/apps/compute/netcdf/intelmpi/v2018/lib -L/share/apps/compute/hdf5/intelmpi/v2018/lib -lnetcdf -lnetcdf
--has-f90 -> no
--has-f03 -> yes

--has-dap -> yes
--has-dap4 -> yes
--has-nc2 -> yes
--has-nc4 -> yes
--has-hdf5 -> yes
--has-hdf4 -> no
--has-logging-> no
--has-pnetcdf-> no
--has-szlib -> no
--has-parallel -> yes
--has-cdf5 -> yes

--prefix -> /share/apps/compute/netcdf/intelmpi/v2018
--includedir-> /share/apps/compute/netcdf/intelmpi/v2018/include
--libdir -> /share/apps/compute/netcdf/intelmpi/v2018/lib
--version -> netCDF 4.6.1
 