To whom it may concern,
I am running 10-minute forecasts for 40 different ensemble members on the Stampede2 system. Unfortunately, I have encountered a problem: not all ensemble members completed, even though they share nearly identical configurations. The errors appear to occur in the final I/O portion of the code: in particular, the failing members were not able to write out the netCDF forecast files for the second forecast step. I have come across this problem on other supercomputing systems in the past, and the fix has usually involved reducing the number of cores per node, which gives each MPI process on a given node access to more memory. To test this hypothesis, I created a working (debug) directory for one of the failing members (member 4) and varied the number of cores per node, trying 25, 34, and 50, on the development partition of the KNL compute nodes. Unfortunately, even the 25 cores/node test failed to produce the required netCDF files.
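For reference, the cores-per-node changes were made through the Slurm directives in each test's job submission script. A minimal sketch of what I varied is below; the job name, node count, and wall-clock limit are only illustrative, and the exact values used in each test are in the job.slurm files described in the next paragraph:

    #!/bin/bash
    #SBATCH -J wrf_mem04_test        # job name (illustrative)
    #SBATCH -p development           # KNL development partition
    #SBATCH -N 2                     # number of nodes (illustrative)
    #SBATCH --ntasks-per-node=25     # varied across tests: 25, 34, 50
    #SBATCH -t 02:00:00              # wall-clock limit (illustrative)

    # Fewer MPI tasks per node leaves more memory per rank,
    # which is the hypothesis being tested here.
    ibrun ./wrf.exe > wrf.output 2>&1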
The directories in which those tests were conducted are under the path /scratch/06421/hch/systematic_experiments/test2knl_AERI_11july2015. In this directory you will find subdirectories named test_*, which contain the experiments described above. Note that the mem01 directory contains results from the successful member, while mem04 is the member in which the WRF model fails to run to completion. Within each of those subdirectories you will also find the job submission script, named job.slurm. The standard output/error files of interest are rsl.out.0000, rsl.error.0000, and wrf.output. I am currently running a final ‘extreme’ experiment, in which the WRF model for member 4 is run in a single core/single node configuration. The subdirectory for this experiment is test4_mem04.
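For completeness, the single core/single node run in test4_mem04 is submitted with directives roughly like the following (again only a sketch; the actual script is test4_mem04/job.slurm):

    #SBATCH -p development    # KNL development partition
    #SBATCH -N 1              # single node
    #SBATCH -n 1              # single MPI task

    ibrun ./wrf.exe > wrf.output 2>&1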
I would greatly appreciate it if you could look into the problem!
Regards,
Hristo