
wrf hangs when initiating domain 2

This post was from a previous version of the WRF & MPAS-A Support Forum. New replies have been disabled. If you have follow-up questions related to this post, please start a new thread from the forum home page.

martin

New member
Dear WRF Support,

My trial wrfchem simulation runs fine with the coarse domain (max_dom = 1). However, the wrf executable hangs when the intermediate and fine domains are introduced (max_dom = 3). I also tried to run the simulation without the chemistry namelist, but the wrf executable hung again when initiating domain 2 (the intermediate domain). I also tried running the simulation with different numbers of cores (100, 84 and 24) in order to vary the number of grid points per core, but this did not solve the issue.
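
For reference, each attempt was launched the same way, only varying the core count (a sketch; the MPI launcher name depends on the cluster):

  mpirun -np 24 ./wrf.exe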

Although the chemistry namelist was excluded, as shown in the attached namelist file, the rsl.error file still printed the following error:

open_aux_u : error opening auxinput5_d01_2015-04-17_00:00:00 for reading. 100



The auxinput5 parameter was also set to 0,0,0, but the same error was repeated.
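
For context, the per-domain setting I changed lives in &time_control; a sketch of that block (the io_form line is the other standard knob for switching a stream off, included here for completeness):

  &time_control
   auxinput5_interval_m = 0, 0, 0,   ! no periodic reads on any domain
   io_form_auxinput5    = 0,         ! 0 = stream disabled entirely
  /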



With and without the chemistry namelist, the wrf executable hangs when initiating domain 2. Is this issue related to the configuration of wrf on our cluster, or am I missing some parameters in my namelist?



Any guidance would be appreciated.



P.S. My WRF version is 3.8.1



Regards

Martin
 

Attachments

  • namelist.txt (8 KB)
  • rsl.error.txt (62.5 KB)
  • rsl.out.txt (62.4 KB)
Hi...

Is the problem repeatable on all runs? If not, it can mean a bad node.

In MPI, tasks are run on many processors. At times, an MPI program must wait until all processors can get to the same point before continuing. Write I/O is one of those occasions: you can't write data until all the processors have their data available.

If one of the processors never gets to that point, any MPI program, such as WRF, will hang and do nothing useful.

Check *all* rsl* files for a complaint. There may be an error message suggesting what the problem may be.
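
A quick way to sweep them all at once (a sketch; adjust the patterns to taste):

  grep -iE "error|fatal|cfl" rsl.error.* rsl.out.* | sort | uniq -c | sort -rn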

You can also raise "debug_level" to a large number. I use 9999. This will make the run more verbose, so you might find out what is happening this way.
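
In "namelist.input" that is a one-line change in &time_control:

  &time_control
   debug_level = 9999   ! very verbose; expect much larger rsl files
  /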
 
Dear Kevin,

Thanks for getting back with some suggestions.

The wrf executable hangs every time the simulation initiates the intermediate domain. The debug level was set to 9999, but there were no particular errors except the one I got previously: open_aux_u : error opening auxinput5_d01_2015-04-17_00:00:00 for reading. 100

In order to get rid of this error, the chemistry namelist was deleted from the namelist.input file, as shown in the attached namelist file. The number of cores was also lowered from 100 to 18 so that I would be able to check each rsl file. But the same error was listed in each rsl* file.
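
For the record, this is how I confirmed it (a sketch; lists every rsl file containing the message):

  grep -l "error opening auxinput5" rsl.error.* rsl.out.*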

If the chemistry namelist was deleted, why is an error related to the anthropogenic input being printed in each rsl* file?

Do you have other suggestions on how to solve this issue?

Thanks and regards
Martin
 

Attachments

  • namelist.input (5 KB)
  • rsl.error.0000.txt (733.5 KB)
  • rsl.out.0000.txt (733.5 KB)
Hi Martin...

I have an idea.

It may be your netcdf build. What version are you using? Whatever your NETCDF environment variable points to, find out how it was compiled. There are many ways it could be compiled, and I suspect it was built in a way that doesn't support the "classic" format.

Depending on how NETCDF is accessed, you may somehow be going through an HDF5 interface. In that protocol, colons in filenames are ILLEGAL: they are a special character, so they have a meaning that isn't what you really want. WRF compensates for this by changing colons to underscores, but the developers may have missed a write statement, which would explain your error message.

Try copying or symlinking "auxinput5_d01_2015-04-17_00:00:00" to "auxinput5_d01_2015-04-17_00_00_00".
That's the colons changed to underscores. That might not work, but it could.
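
In shell terms, that rename is just (a sketch; run in the wrf run directory):

  ln -s auxinput5_d01_2015-04-17_00:00:00 auxinput5_d01_2015-04-17_00_00_00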
 
Dear Kevin,

Thanks for getting back.

Regarding the NETCDF classic format, I have to contact the cluster administrator.

But the issue is that I am running a wrf simulation without chemistry, and the rsl* files are still showing an error related to the anthropogenic emissions file. Do you think such an issue is related to the NETCDF library?

Regards
Martin
 
Dear Kevin,

I have contacted our cluster administrator, and his reply is below:

Dear Martin,

I had a look at what the forum said. NetCDF installs the classic format by default. The only thing one can do is to disable HDF5 support, but then I am not sure whether we will run into other issues...

This is what the Netcdf website says:

"Starting from version 4.4.0, netCDF included the support of CDF-5 format. In order to allow defining large array variables with more than 4-billion elements, CDF-5 replaces most of the 32-bit integers used to describe metadata in the file header with 64-bit integers. In addition, it supports the following new external data types: NC_UBYTE, NC_USHORT, NC_UINT, NC_INT64, and NC_UINT64. The CDF-5 format specifications can be found at http://cucis.ece.northwestern.edu/projects/PnetCDF/CDF-5.html.
The classic file formats now refer to the collection of CDF-1, 2 and 5 formats. By default, netCDF uses the classic format (CDF-1). To use the CDF-2, CDF-5, or netCDF-4/HDF5 format, set the appropriate constant in the file mode argument when creating the file."



Now, from what I understood, we can build NetCDF with the classic library ONLY by using --disable-netcdf-4. Alternatively, an older version of netCDF could be used (< 3.6.0).
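
Roughly, that rebuild would look like this (a sketch; the install prefix is a placeholder):

  ./configure --disable-netcdf-4 --prefix=/opt/netcdf-classic
  make && make install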

But removing HDF5 will disable parallel execution of wrf, so another library needs to be added. I believe this library would be PnetCDF.

Can you confirm with the forum whether doing this would solve the issue? WRF doesn't explain much, and since I don't use the software, the forum might offer better instructions.

Then, if we have to re-compile netCDF or use an older version, I guess wrf needs to be recompiled as well. In that case I would have to create a new profile so as not to destroy the current installation.


Kris

Any suggestions from your side?

Regards
Martin
 
Martin...

I've never been able to get the HDF5 interface running as anything but slow, as in minutes to write out a history file. The domain I'm using is a 3 km CONUS run for 60 hours, so we have to have fast I/O.

In addition, other software that I'm using specifically says not to use the HDF5 capability of NETCDF, as the code doesn't support it.

I'm using NETCDF 3.6.2 on one supercomputer as their NETCDF 4.x builds don't work with WRF.

To solve the performance problem, I use PNETCDF. There is a bit of a learning curve, as you'll get hangs
or crashes if there is an inconsistency in "namelist.input". If you are interested, there are two files that
will need code mods. Let me know.
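
For reference, the namelist side of the switch looks roughly like this (a sketch; WRF's io_form codes use 2 for serial netCDF and 11 for PnetCDF):

  &time_control
   io_form_history  = 11,   ! 11 = PnetCDF, parallel writes
   io_form_restart  = 11,
   io_form_input    = 2,    ! inputs can stay on serial netCDF
   io_form_boundary = 2,
  /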
 
martin said:
But the issue is that I am running a wrf simulation without chemistry, and the rsl* files are still showing an error related to the anthropogenic emissions file. Do you think such an issue is related to the NETCDF library?

Hi Martin,

I've found in the past that running WRF-Chem without chemistry is not exactly the same as running pure WRF. I'd recommend compiling a pure WRF executable, and trying your simulation first with that. Once you've got that working then switch back to the WRF-Chem executable, with chemistry added back into your namelist, and see if you can get that to work.
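
If it helps, a chemistry-free rebuild follows the standard WRF build steps (a sketch; run in the WRF source directory, making sure WRF_CHEM is not set in your environment):

  unset WRF_CHEM                      # ensure chemistry is not compiled in
  ./clean -a
  ./configure                         # choose your platform's dmpar option
  ./compile em_real >& compile.log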
 