
Writing large WRF model output files with pnetcdf

jkukulies

Member
I am running an idealized case (em_quarter_ss) on a large domain at 250 m grid spacing, with 2496 grid cells in both directions and 96 vertical levels. While the same simulation at 500 m (with 1248 x 1248 grid cells) runs without problems on Derecho, the 250 m simulation hangs when it tries to write the first output file. I used io_form = 102 for wrfinput_d01, but io_form = 2 for the output. I have tried up to 60 nodes (128 cores each) on Derecho and wonder whether this problem can be solved by going even higher, or whether I should find a different solution, such as using the parallel-netcdf library.

I have tried to compile WRF with Pnetcdf, but was not successful.

Attached are the namelist and error logs from the simulation attempt, as well as the compile log and my code changes to wrf_io.F90 for using the pnetcdf library for large netCDF file output.
 

Attachments

  • namelist.input.txt
    5.4 KB · Views: 3
  • rsl.error.0000.txt
    7.1 KB · Views: 4
  • rsl.error.7679.txt
    28.4 KB · Views: 2
  • compile_with_pnetcdf.log
    816 KB · Views: 1
  • wrf_io.F90.txt
    123.6 KB · Views: 3
If you want to have a look at the case, it is located here: /glade/work/kukulies/WRF/TempScripts/19_2011-07-13_CTRL_Midwest_-Loc1_MCS_Storm-Nr_JJA-8-TH5/250
 
Hi,
I would like to apologize for the long delay in response. Between holidays, unplanned time out of the office, the AMS conference, and planning for our WRF tutorial taking place this week, we got pretty behind on forum questions. Thank you so much for your patience.

Are you still experiencing issues with this? I don't believe you need additional processors. Take a look at Choosing an Appropriate Number of Processors to determine a suitable number for your case.

I do believe, however, that once you use the io_form = 102 option for any of the executables, you have to keep using that format for the remaining executables. That is, because you used it to create the wrfinput file, you probably have to use it when running WRF as well.
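
As a sketch of what that looks like in the namelist (stream names taken from the standard &time_control section; exactly which streams need to stay at 102 in your case is what we are trying to sort out here):

Code:
&time_control
  io_form_input   = 102,   ! wrfinput was written as split (per-processor) files
  io_form_history = 102,   ! keep the other streams consistent with that choice
  io_form_restart = 102,
/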
 
No problem, and thank you for the answer! Unfortunately I still have trouble getting the simulation to run. I tried different numbers of processors based on the guidelines you linked to, but with io_form = 2 it always hangs when wrfinput is written. The same simulation with a quarter of the grid cells (dx = 500 m) runs quickly, and there is no problem writing the output files. Do you think the problem is related to the size of the domain, and that with io_form = 2 only one MPI process writes the file while the others wait? I know it is a large domain, but I thought it should still be within what WRF can handle with the normal output format.

I also tried io_form = 102 for both IDEAL and WRF, which works fine. The problem there is that I could not get the joiner program to work on Derecho, and it seems that nobody has managed to do so thus far?
 
Again, I'd like to apologize for the delay. Right after the tutorial, I unexpectedly had to be out of the office for several days. Are you still struggling with this? I tried to look around in the directory you pointed to above, but it looks like the rsl files stop while running ideal.exe, not wrf.exe. Are you now having problems with that as well? If so, I do see the error

Code:
ERROR: ghg_input available only for these radiation schemes: CAM, RRTM, RRTMG, RRTMG_fast
           And the LW and SW schemes must be reasonably paired together:
           OK = CAM LW with CAM SW
           OK = RRTM, RRTMG LW or SW, RRTMG_fast LW or SW may be mixed

To overcome this problem, set "ghg_input = 0" in the &physics section of the namelist.
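
In namelist form, that is just one line in &physics (everything else in that section stays as you have it):

Code:
&physics
  ghg_input = 0,   ! turn off the GHG input file, since it is not supported with your radiation choice
/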

A couple other thoughts:
1) Yes, I believe the code is struggling to write files that large.
2) Set debug_level = 0. Turning this on rarely provides any useful information and just makes the output files very large and difficult to read. I don't think this is causing your issue, but it could down the road.
3) For a domain this size, you should probably be using something like 15K processors, and at the VERY minimum, you should probably use around 1000 (see the rough numbers below). Have you tried using that many?
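
For reference, if I remember the rule of thumb on that page correctly (fewest ~ (e_we/100) x (e_sn/100), most ~ (e_we/25) x (e_sn/25)), the rough numbers for your 2496 x 2496 domain are:

Code:
fewest:  (2496/100) * (2496/100) ~  25 *  25 ~    625 processors
most:    (2496/25)  * (2496/25)  ~ 100 * 100 ~ 10,000 processors

Given how large this case is, I would push toward (or even past) the top of that range, which is why I mentioned something like 15K.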
 
Thanks for all the tips! I set debug_level and ghg_input to 0 and tried to run IDEAL again on 120 nodes (15,360 processors), but without success. No error comes up (I think that even before, when ghg_input was not yet set to 0, this error did not cause IDEAL to fail).

It looks like the problem is that the program cannot finish writing the wrfinput file within the 12-hour wall time, even with more than 15,000 cores. I am wondering whether only one processor does the actual output writing when io_form is not set to 102? If so, would it be an option to compile WRF with pnetcdf, or how would you approach this given that I cannot use the joiner program on Derecho? As it stands, an unreasonable number of core hours are burned simply because writing the output takes so long.
 

Attachments

  • rsl.error.00015359.txt
    744 bytes · Views: 3
**Edited 2/26/24**

I should have been clearer that when I suggested using up to 15K processors, that was for running wrf.exe, not ideal.exe. For some idealized cases, you can only use a single processor to run ideal.exe, but fortunately for this one you can use more. I should mention that in my 12 years of experience, I've never seen a case where someone used more than a little over 1000 grid cells in each direction, so your case is more than double the size of any case I've encountered.

Regarding the "joiner" program, have you seen this post, and would that help you?
 
Hi again,

I believe the issue is related to the number of vertical levels you're using. I ran a test with your set-up and was able to run ideal.exe with 1024 processors when I used the default number of vertical levels (41), but when I use 96, as you are requesting, it hangs - meaning something is wrong. If you want more than 41 levels, you may have to do some quick tests to find the maximum you can use, but at least we know that ideal.exe can run with your domain size - you'll just need to modify the number of vertical levels.
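
For the quick tests, you would just re-run ideal.exe while stepping the value in &domains up from the number that is known to work, e.g.:

Code:
&domains
  e_vert = 41,   ! works for this domain; try 48, 64, ... to find where it starts hanging
/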
 
Thanks for running a test with a smaller number of vertical levels. That is a good point! However, my aim is to compare the 250 m simulation with the same simulation at different grid spacings, so ideally I would keep the same number of vertical levels.

Since I have the parallel output files (io_form = 102), I am back to trying to make the joiner program work. Do you know if there is a working version somewhere on Casper? I was not able to compile it on Derecho despite the changes you pointed me to in the other post.

If this does not work, do you have another strategy for combining the files (e.g. using different software or a programming language other than Fortran)?
 
Unfortunately, I'm not aware of any working version of the joiner program, or of any other software people have used for this. The joiner program was given to us by an outside colleague and is not officially supported by our group. I would recommend commenting on other posts that mention patching the tiles back together to see whether those users have any tips for you. If you do happen to get something working, and you don't mind, please share the code with us so that we can share it with others. Unfortunately, we don't currently have the resources to work on the code ourselves. Thanks.
 
[I've been updating this as I make headway]

Hi all, now that the CISL and WRF teams have helped solve a strange MPI-backend-related issue (this GitHub thread) with large domain sizes on Derecho, I'm returning to the issue of finding a suitable I/O option, since that is now my major bottleneck. I'm running a domain of 2511 x 1926 x 80 (WRF v4.6.0), and the standard io_form_history = 2 is extremely slow: about 10 minutes per output time step!

There seem to be two options:

1. io_form = 102. This option worked for me, yielding an incredible ~600x speedup of the write-out steps. BUT multiple threads around here seem to indicate that the JOINER software is defunct, so I don't know how to actually exploit this option, since working with the individual per-processor files is impractical. Or am I wrong... does anyone have that package working on Derecho?

2. I've now been playing with io_form = 13, available only in recent versions of WRF, which does parallel I/O with netCDF-4/HDF5 compression as described in README.netcdf4par. (Note: this io_form option is not shown in the WRF Users' Guide yet.) After some troubleshooting, I got this to work. Tip: do not use modules/packages referred to as "parallel-netcdf", since that is a different approach. What worked for me on Derecho, with WRF configure option #50 and basic nesting, was simply adding NETCDFPAR=$NETCDF to my environment setup (attached), which points to the same netCDF package I would normally use with io_form = 2 (the default). On Derecho, $NETCDF points to /glade/u/apps/derecho/23.06/spack/opt/spack/netcdf/4.9.2/oneapi/2023.0.0/iijr
The key, as described in the README, is that setting the environment variable NETCDFPAR triggers a few other things in the WRF configuration. Once that worked, I also had to remember to include "nocolons = .true." in my namelist, as the README tried to tell me. :) That worked; see the recipe below.
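
To spell out the recipe (a sketch of what I did, not an official procedure): set NETCDFPAR=$NETCDF in the environment before running ./configure and rebuilding, then switch the history stream and filename handling in the namelist:

Code:
&time_control
  io_form_history = 13,       ! parallel netCDF-4/HDF5 compressed output (see README.netcdf4par)
  nocolons        = .true.,   ! required for this option, as the README says
/

Other output streams can presumably be switched to 13 the same way; I have only tested the history stream so far.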

I'm next trying to see if I can get the best of both worlds: parallel write-out to a single file, but faster, by turning off the compression.

Any guidance and/or feedback on best practices here from other WRF experts welcome!
 

Attachments

  • bashrc_wrf.txt
    1 KB · Views: 0
Hi @jamesrup!

To your first question: I have talked to multiple WRF experts from NCAR and Argonne National Laboratory, and it seems there is currently no working version of the joiner program. I have a Python script that joins the output tiles from io_form = 102, and it did not take too long for a simulation with 2496 x 2496 x 96 grid cells. It is not really in publishable shape, but I am happy to share it (attached) as a starting point for developing an efficient Python tool for this task.
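
In the meantime, here is a rough, hypothetical sketch of the general approach (this is not the attached script; function and file names are made up for illustration, and it assumes the standard WEST-EAST_ / SOUTH-NORTH_PATCH_* global attributes that WRF normally writes into every split file): read each tile, use the patch attributes to work out where the tile belongs, and copy its data into a full-size output file.

Code:
import glob
from netCDF4 import Dataset

# Decomposed dimension name -> (attribute prefix, staggering suffix) used
# by WRF for the corresponding patch-bound global attributes.
_DECOMP = {
    "west_east":        ("WEST-EAST", "UNSTAG"),
    "west_east_stag":   ("WEST-EAST", "STAG"),
    "south_north":      ("SOUTH-NORTH", "UNSTAG"),
    "south_north_stag": ("SOUTH-NORTH", "STAG"),
}


def global_size(t0, dim):
    """Full-domain length of a dimension, taken from the template tile."""
    we = t0.getncattr("WEST-EAST_GRID_DIMENSION")    # staggered x size (e_we)
    sn = t0.getncattr("SOUTH-NORTH_GRID_DIMENSION")  # staggered y size (e_sn)
    return {"west_east": we - 1, "west_east_stag": we,
            "south_north": sn - 1, "south_north_stag": sn}.get(
        dim, len(t0.dimensions[dim]))                # z/time/etc. are not decomposed


def patch_slice(tile, dim):
    """Slice of the full domain that this tile covers along one dimension."""
    if dim not in _DECOMP:
        return slice(0, len(tile.dimensions[dim]))   # copy non-decomposed dims as-is
    axis, stag = _DECOMP[dim]
    start = tile.getncattr(f"{axis}_PATCH_START_{stag}")
    end = tile.getncattr(f"{axis}_PATCH_END_{stag}")
    return slice(start - 1, end)                     # WRF patch indices are 1-based, inclusive


def join_tiles(pattern, outfile):
    """Stitch split (io_form = 102) files matching `pattern` into `outfile`."""
    tiles = sorted(glob.glob(pattern))
    with Dataset(outfile, "w") as out:
        # Use the first tile as a template for dimensions, variables, and attributes.
        # (The PATCH_* attributes copied here describe tile 0 and could be fixed up afterwards.)
        with Dataset(tiles[0]) as t0:
            out.setncatts({k: t0.getncattr(k) for k in t0.ncattrs()})
            for name, dim in t0.dimensions.items():
                out.createDimension(name, None if dim.isunlimited()
                                    else global_size(t0, name))
            for name, var in t0.variables.items():
                v = out.createVariable(name, var.dtype, var.dimensions)
                v.setncatts({k: var.getncattr(k) for k in var.ncattrs()
                             if k != "_FillValue"})  # _FillValue must be set at creation time
        # Copy every tile's patch into its place in the full domain.
        for path in tiles:
            with Dataset(path) as t:
                for name, var in t.variables.items():
                    idx = tuple(patch_slice(t, d) for d in var.dimensions)
                    out.variables[name][idx] = var[:]


# e.g. join_tiles("wrfout_d01_2011-07-13_00:00:00_*", "wrfout_d01_2011-07-13_00:00:00")

The heavy lifting is just index bookkeeping, so it should be straightforward to adapt or parallelize (e.g. one output time per worker).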
 

Attachments

  • join_wrf_tiles.py.txt
    4.7 KB · Views: 2
Maybe I'm misunderstanding, but pnetcdf (io_form = 11) works for me on my workstation and gives a great increase in write speed, though the file size is a little larger; I'm not sure why that is.
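
For anyone wanting to try the same route: as far as I know this is the classic pnetcdf path, i.e. build the parallel-netcdf library, point the PNETCDF environment variable at it before running ./configure, and then select format 11 for the streams you care about, along these lines:

Code:
&time_control
  io_form_history = 11,   ! parallel-netcdf (pnetcdf) I/O
/

Treat that as a sketch rather than gospel; the exact build steps depend on your system.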
 