I open-sourced a from-scratch, GPU-native WRF-ARW v4 reimplementation (JAX): looking for test feedback and advice on how to proceed.

Nric

New member
Dear NCAR/UCAR team and all,


First, thanks for your amazing work on this project. I have the utmost respect for your work and I am fairly intimidated to even post this here:


TL;DR: Can someone test this on a large GPU, give feedback, and ideally take over the Git repository if this kernel is of interest?


Background: I have a hobby project where I want to offer the people of the Canary Islands a free, actually working weather forecast that is better than the current public one, which seems best suited as a random number generator.

Due to the islands' steep, varied terrain and strong microclimates, I built a physics-ML hybrid (which uses a 1km WRF grid as its physics backbone). The ML layers are fast af, but the WRF physics component was overwhelming my old workstation, which will need to run this every night in the future (ideally off solar power only). So I needed something that is vastly faster and more energy efficient than WRF v4.

I found this to be a project where the coding and planning abilities of Opus 4.8 and the math and physics abilities of GPT 5.5 complement each other. I also wanted a challenging, large, multi-week, self-contained, all-AI project anyway, to learn and understand the current abilities and limitations of how agents arrange themselves to execute a large but verifiable task (Fable 5 did some final kernel speed and memory optimizations in the few days it was available, LOL).

The main work I personally did was to create an agent swarm using initial prompts and skill files. This swarm, primarily composed of Opus 4.8 Max agents teamed up with or working adversarially against GPT 5.5 XHigh, was coordinated by an Opus 4.8 XHigh manager who built and managed the project roadmap. The swarm completed a fast, GPU-optimized rewrite that is (mostly) true to WRF v4, at least for the tiny cases I tested.

Regarding meteorology and the WRF-code side, I have little idea what I'm doing; this is a completely new domain for me. Therefore, the outcome was not quite what I expected, but it's still good enough to mention to you (I think, looking for feedback):

After an initial kernel deathmatch phase and reading some papers about nvfortran ports with predictable scaling limitations, the agents autonomously decided to rewrite WRF completely from scratch as an XLA/JAX dycore kernel.

It took them about 2-3 days until the kernel was fast and stable, but then another ~3 weeks, ~3000 sprints (agent runs), and ~2000 commits to reach this point, where at least the core elements are all implemented, tested, documented, stable, and memory-optimized.

While the outcome didn't quite meet my goal of speeding up my nested 1km Canary compute by at least a factor of 2 compared to WRF on my CPU, it luckily met the 3x energy consumption reduction requirement.

The issue is not that the kernel is sub-optimal but that the tiny grid and the "tiny" RTX 5090 are not suited for this architecture at all. The 5090 and all consumer GPUs have abysmal FP64 performance (sadly, attempts to reduce the pressure kernel mechanics to FP32 and keep them long-term stable using various computational tricks have all failed so far). More importantly, the kernel initialization overhead heavily impacts small grid calculations. In theory, though, both issues become negligible for real GPUs and large grids: B200, GB300, and even NVL72-GB300 should support this by design, and there should be no limit to the number of microkernels it can set up and compute in parallel. An NVL72-GB300 should be able to hold the entire world at a 1km grid in memory on one system and compute with brutal efficiency.
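To illustrate why long-running accumulations are so sensitive to FP32, here is a minimal sketch in plain Python (standard library only). It is not code from the repository: the helper names `to_f32` and `accumulate`, the increment, and the step count are all hypothetical, and the loop only mimics the general pattern of adding a small tendency to a large state many times. In JAX itself, FP64 is enabled with `jax.config.update("jax_enable_x64", True)`; the sketch emulates FP32 rounding with `struct` so it runs without a GPU.

```python
import struct

_F32 = struct.Struct("f")  # IEEE 754 single-precision packer


def to_f32(x: float) -> float:
    """Round a Python float (FP64) to the nearest FP32 value."""
    return _F32.unpack(_F32.pack(x))[0]


def accumulate(increment: float, steps: int, single: bool) -> float:
    """Add `increment` to a running total `steps` times.

    With single=True the increment and every partial sum are rounded
    to FP32, which emulates doing the whole accumulation in FP32.
    """
    inc = to_f32(increment) if single else increment
    total = 0.0
    for _ in range(steps):
        total += inc
        if single:
            total = to_f32(total)
    return total


# Hypothetical numbers, chosen only to make the drift visible.
steps = 1_000_000
increment = 0.1
reference = steps * increment

err32 = abs(accumulate(increment, steps, single=True) - reference)
err64 = abs(accumulate(increment, steps, single=False) - reference)

print(f"FP32 accumulated error: {err32:.6g}")
print(f"FP64 accumulated error: {err64:.6g}")
```

The FP32 error grows because, once the running total is large, each small increment is rounded to the coarse FP32 spacing near that total, and those rounding errors do not cancel. Compensated summation and similar tricks reduce this in simple loops; the sketch makes no claim about which tricks were tried in the kernel.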

The speed-up factor for my case was ~135% vs. a 12-rank CPU run. More interesting is that the compute-per-kWh advantage for large grids and real GPUs should be somewhere between 3x and 10x (and the GPU-CPU gap is getting larger). The compute per unit of money should also be vastly better, quite apart from the fact that GPU/TPU compute availability on Earth will soon outstrip CPU by a ridiculous margin.

So, in summary, my personal minimum goal was only partially met, and I am pivoting to a generative-AI-based method for my highly local, repetitive, and pattern-based problem, which has no primary claim to extreme weather correctness. It was a bit foolish to let AI design the kernel in a way that is optimized for huge systems I cannot even test :). I guess it's true, a fool with a tool is still only a fool :)

Anyway, here we are. The early version I released (v0.17) is not feature-complete; it is missing over 40 rarely used schemes. Version 0.18 will probably have them all, plus some additional cool kernel improvements: in particular, it should close the remaining issues with RAINNC and QVAPOR and minimize multi-GPU communication. All identity tests I ran were bit-identical (where possible), and for "long-term" real-world simulations all variables remained stable and close to the WRF v4 solution. I have included a scripted, ready-to-go "hardcore test" (for my tiny GPU) in the Git repository that simulates a typical winter day in the Swiss Alps. It seems to align with the WRF v4 solution (accumulated RAINNC and QVAPOR errors are accepted as described, and a kernel-level fix is in flight). This test requires 32GB of VRAM to run (see GitHub). The Canary test (the only one I need personally) is all green compared to the WRF v4 numbers. I also tested it with 30 hindcasts and compared it via TOST to several hundred real weather stations; as expected, skill is identical to WRF and hence works well for me. See the GitHub README for what this rewrite is and what it isn't.
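For anyone unfamiliar with TOST (two one-sided tests), it tests for equivalence rather than difference: two models count as equivalent only if the mean difference in their errors is shown to lie inside a chosen margin. Below is a minimal sketch of a paired TOST in plain Python. It is an assumption about one common way to set this up, not the procedure used in the repository: it uses a large-sample normal approximation instead of the t-distribution, and the function name `tost_paired`, the margins, and the synthetic per-station differences are all hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def tost_paired(diffs, margin, alpha=0.05):
    """Paired TOST using a large-sample normal approximation.

    diffs  : per-station differences (e.g. error of model A minus error of model B)
    margin : equivalence margin in the same units (hypothetical choice)
    Returns (p_value, equivalent), where equivalence is declared if p_value < alpha.
    """
    n = len(diffs)
    if n < 2:
        raise ValueError("need at least two paired differences")
    se = stdev(diffs) / sqrt(n)
    if se == 0.0:
        raise ValueError("differences have zero variance")
    m = mean(diffs)
    z = NormalDist()
    # Test 1, H0: true mean difference <= -margin
    p_lower = 1.0 - z.cdf((m + margin) / se)
    # Test 2, H0: true mean difference >= +margin
    p_upper = z.cdf((m - margin) / se)
    p_value = max(p_lower, p_upper)  # both nulls must be rejected
    return p_value, p_value < alpha


# Synthetic, deterministic differences for 300 hypothetical stations.
diffs = [0.02 * ((i % 7) - 3) for i in range(300)]

p_wide, eq_wide = tost_paired(diffs, margin=0.1)
p_tight, eq_tight = tost_paired(diffs, margin=0.0001)
print(f"margin 0.1:    p={p_wide:.3g}, equivalent={eq_wide}")
print(f"margin 0.0001: p={p_tight:.3g}, equivalent={eq_tight}")
```

The result depends entirely on the margin, so any equivalence claim should state the margin it was tested against. With only a few stations, the t-distribution should replace the normal approximation.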

But my request here is something else:

The issue is that this was just a sub-project of a hobby project that I have zero funding and not much time for. I completed this using only free tokens from my company and nightly compute on my home workstation. I did it primarily to learn and to see if swarms of the latest AI models could self-organize and handle a coding project on such a notoriously difficult codebase. And I am happy to acknowledge that I have very little idea of what I am talking about here; my everyday work is on completely unrelated topics.

I do not have the time nor the compute nor any funding to run large scale validations in the regime where it would be scientifically or commercially interesting.

And I am not sure if this is even of interest to anyone here or if you already have better solutions.

But my hope was that someone here has the resources and know-how to test this at scale and give full reports so the agents can build a 1.0.0 release at some point. Even more preferably, someone would take over the Git repository and continue the project toward something validated and fully usable for everyone.


Best,
Enric
 
Update: 0.18.1 is released and is now feature-complete relative to WRF v4.

Let me know if there is interest or the ability to test this re-write on large scale systems.

I'm pivoting to training a generative downscaling AI that produces forecasts with skill (nearly) equal to 1km from 3km or 9km data in raw field space. This is probably more useful for my case, as it should run in minutes, not hours, on my workstation. But I'm happy to ship another version of this physics-only WRF GPU rewrite or hand the project over if there is interest.

Best,
Enric
 
Update 0.18.2: significant VRAM usage improvements; VRAM usage is now mostly on par with WRF v4's RAM usage for the same cases.

That raises the theoretical maximum speedup achievable on a consumer GPU vs. WRF on CPU, because much bigger grids can now be loaded; it would put an RTX 5090 at 200-300% of a 12-rank CPU running WRF v4. This is untested for the reasons explained above (it is not my use case, and a consumer GPU will never be near the optimal use case of large grids on large, FP64-native GPU devices).

I assume there is no interest in this here and will stop annoying you.

Best,
Enric
 