
Optimizing Model Setup/Core Count to Minimize Solution Times for WRF

jbellino

Hi all,

TL;DR
There are some really long solution times (greater than 100 s) that occur at regular 5-hour intervals, don't seem to be associated with I/O, and I can't figure out how, or whether, I can reduce them. Throwing more CPUs at the problem has only shaved a few seconds off each time-step.

Background
I have a 20-day nested 4-km/1-km simulation for the Southeastern US and noticed that time-steps can be generally sorted into 3 categories based on the wall-time required for a solution:
  • Group 1–less than 10 seconds;
  • Group 2–greater than 10 seconds to 100 seconds; and
  • Group 3–greater than 100 seconds.

Group 1 solution times coincide with time-steps that fall between data ingestion periods (micro-physics, radiation, etc.). Group 2 solution times are associated with time-steps that coincide with the hourly input data intervals. Group 3 solution times occur every 5 hours on the first time-step of the hour, but the pattern resets on the following day (e.g. 1975-03-17 05:00:20, 1975-03-17 10:00:20, 1975-03-17 15:00:20, 1975-03-17 20:00:20, then 1975-03-18 05:00:20, etc.). Given that I'm running this model for a total of about 45 years of simulation time in 20-day chunks, saving 4-6 hours per run (2-3 per domain) would save several weeks' worth of wall-time overall.

To this end, I ran some tests scaling the CPU count to try to decrease my wall-time, but so far I have been largely unsuccessful in achieving appreciable reductions in the third group, the one associated with the 5-hourly interval. This has me wondering whether something else is going on in the model at these times, separate and apart from the computations, that is causing the slowdown. Some of the relevant processes in the model: nudging occurs every 3 hours in domain 1 and not in domain 2, history output is hourly, and the input interval is hourly. I/O should not be a bottleneck, as I'm working with a brand new machine that has a Lustre file system (75 GB/s read and write, 12 OSTs) and quilting enabled using 120 CPUs. I've looked at the radiation scheme time-steps and these are all very short (about 1 s). The model was compiled with Intel (ftn/icc) on a Cray XC, no debugging, and run with slurm -hint=nomultithreading. If anyone could suggest what could be happening in the model at these times, I would greatly appreciate it.
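For reference, quilted I/O in WRF is configured in the &namelist_quilt block of namelist.input. Below is a minimal sketch of the kind of setup described above; only the 120-task total is from this post, and the split between tasks-per-group and groups is a made-up example:

Code:
&namelist_quilt
 nio_tasks_per_group = 12,   ! I/O server ranks per group (hypothetical split)
 nio_groups          = 10,   ! number of server groups; 12 x 10 = 120 quilting ranks total
/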


[Attached plot showing solution times for time-steps coincident with micro-physics time-step intervals, for each domain, for runs using 2,300 CPUs and 3,600 CPUs (quilting CPUs counted separately).]
 

Attachments

  • namelist.19750317_19750406.input (9.3 KB)
  • rsl.out.0000.gz (5 MB)
Hi,
Among all the WRF physics schemes, the radiation scheme takes the longest time to compute. In your namelist.input, you set radt = 15, 4, which means that the radiation scheme will be called every 15 minutes for D01 and every 4 minutes for D02. Note that I/O may also take a relatively long time during the integration. 'Group 2' may be related to I/O, but I have no explanation for why the pattern in 'Group 3' shows a 5-hour interval.
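For context, the setting being referred to lives in the &physics block of namelist.input; a sketch with all other entries omitted (values as quoted above, in minutes per domain):

Code:
&physics
 radt = 15, 4,   ! radiation called every 15 min on D01 and every 4 min on D02
/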
I will forward your post to our software engineer, and hopefully he can answer your question.
 
Looking through the rsl.out.0000, I don't think we can necessarily rule out I/O as the culprit. Just before those long timesteps there is output along the lines of:
Code:
Timing for Writing /caldera/hovenweep/projects/usgs/water/cfwsc/jbellino/wrf/testing/B4B_check_19750317_19750406/output/15-minute_d02_1975-03-17_10:00:00 for domain 2: 104.80788 elapsed seconds

For instance, looking at the second instance where this occurs, the write accounts for almost the entirety of the time for domain 2 @ 10:00:05. This is also the case for domain 1, to which the timing penalty presumably propagates. While the timesteps are different, it is quite suspicious that these long writes always coincide with the *next* timestep. I suspect this is an off-by-one logging issue where we have something like:
Code:
Timing for main: time 1975-03-17_10:00:00 on domain   1:    1.06802 elapsed seconds
...finalize processing of previous timestep (long write)...
...since we "finished" logging time for 10:00:00 any new time must be counted toward the next time step(?)...
...now penalize the next time step which actually probably ran fine...
Timing for main: time 1975-03-17_10:00:20 on domain   1:  120.63121 elapsed seconds

This, of course, does not actually answer why it is happening every 5th hour, nor why it resets at the day.

For the reset at the day, I suspect that it may come down to a flushing of the aux history stream when it goes to a new file, thus starting the cycle again. If this is the case, it does start to sound more like an I/O issue. A good way to test this might be to change frames_per_auxhist24 to something beyond a day, and see whether, after 20:00:05 on domain 2, the next spike is logged at 01:00:05 the next day (+5 hours but no reset).
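A hedged sketch of what that test could look like in &time_control, assuming the 15-minute output seen in the timing log is going to the auxhist24 stream (the stream number, interval, and frame counts here are assumptions, not values from the attached namelist). With 15-minute output there are 96 frames per day, so any frames_per_auxhist24 larger than 96 keeps the stream writing into the same file across the day boundary:

Code:
&time_control
 auxhist24_outname    = "15-minute_d<domain>_<date>",
 auxhist24_interval   = 15, 15,      ! minutes -> 96 frames per domain per day
 frames_per_auxhist24 = 192, 192,    ! > 96, so no new file is started at the day boundary
 io_form_auxhist24    = 2,           ! netCDF
/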

Why it would be exactly at the 5th hour probably comes down to the same reason it might be an I/O issue - a coincidence of multiple events writing data out.
 
Thank you both for your quick replies!
For instance, looking at the second instance where this occurs, the write accounts for almost the entirety of the time for domain 2 @ 10:00:05. This is also the case for domain 1, to which the timing penalty presumably propagates. While the timesteps are different, it is quite suspicious that these long writes always coincide with the *next* timestep. I suspect this is an off-by-one logging issue where we have something like
Thanks for clarifying that I/O timing can actually pollute the timing for solution steps; this means that trying to gauge how much time the model is spending on micro-physics by selecting time-steps that coincide with micro-physics intervals could be off base. The overall view of runtime and the aggregation of solution times based on their length (e.g. Groups 1-3 in my post above) is probably still useful though.

The problem with my previous thinking is that I had assumed that using quilting servers to shunt I/O tasks would push those processes into the background... which is true, but only if the compute cores don't have to wait for the quilting servers to finish dumping to disk before they can hand off the new data to be written and move on to the next time step.

To see if I/O was a bottleneck in this case, I started a test run with the same number of CPUs devoted to computations and double the number of I/O groups (nio_groups). Preliminary output looks promising: the long timesteps at 05:00:20 and 15:00:20 are gone in the list below. This is also reflected in the new plot. I think that adding further I/O groups could make a significant dent in model runtimes.

[Attached: time-step listing and updated solution-time plot from the test run.]
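For concreteness, the change being tested amounts to something like the following in &namelist_quilt (the specific counts are hypothetical; only the idea of holding compute CPUs fixed and doubling nio_groups is from this post). Note that the total number of MPI ranks requested from slurm has to grow to match, since the quilting servers use nio_groups x nio_tasks_per_group ranks on top of the compute ranks:

Code:
&namelist_quilt
 nio_tasks_per_group = 12,   ! unchanged (hypothetical value)
 nio_groups          = 20,   ! doubled from 10 (hypothetical); 240 quilting ranks instead of 120
/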
 