Fine-tuning Chronos-2 was supposed to close the gap. Instead, the first pass made things worse.
That regression was frustrating, but it was also useful. It forced me to stop looking at top-line metrics and start asking what the 79% P90 coverage figure was actually measuring — and whether the question I was asking was the right one.
The Metric That Was Hiding the Problem
A 79% P90 coverage rate sounds promising. In a rigorous backtesting setup, having the forecast envelope cover the observed accumulation 79% of the time is a reasonable result for a zero-shot model in a novel environment.
The problem is what that number is averaging over.
The Sierra Nevada has somewhere between 15 and 20 meaningful storm events per season. Everything else is flat snowpack — days where the snow on the ground isn't changing much, and where predicting "approximately zero accumulation" is both easy and statistically dominant. When you weight every hour equally, a model that nails the quiet days and misses the storms looks fine.
Strip out the quiet hours and ask the harder question — when snow is actually accumulating, does the model cover the real accumulation? — and 79% collapses to 7.3%.
That's a classic class imbalance trap. The metric was telling me the model was performing well because most of the data it was being evaluated on was easy. The actual task — predicting real snowfall during real storms — was failing at a rate that made the model nearly useless for the core use case.
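The gap between the headline number and the conditional one is easy to reproduce. Here's a minimal sketch with synthetic data — the arrays and the 1cm "storm hour" threshold are illustrative assumptions, not my actual evaluation code:

```python
import numpy as np

def p90_coverage(actual, p90_upper, mask=None):
    """Fraction of observations falling at or below the P90 upper bound.

    mask: optional boolean array selecting which hours to evaluate.
    """
    actual = np.asarray(actual, dtype=float)
    p90_upper = np.asarray(p90_upper, dtype=float)
    if mask is None:
        mask = np.ones(actual.shape, dtype=bool)
    return float((actual[mask] <= p90_upper[mask]).mean())

# Synthetic hourly series: ~5% storm hours, the rest flat snowpack.
rng = np.random.default_rng(0)
actual = np.zeros(1000)
actual[:50] = rng.uniform(5, 80, 50)   # storm-hour accumulation, cm
p90 = np.full(1000, 2.0)               # a model that mostly predicts "near zero"

overall = p90_coverage(actual, p90)                          # flattered by quiet hours
storms_only = p90_coverage(actual, p90, mask=actual > 1.0)   # the harder question
```

With these numbers, `overall` comes out at 0.95 while `storms_only` is 0.0 — the same shape as the 79% vs. 7.3% split, produced entirely by what the average is taken over.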
What's Actually Working
The regression from fine-tuning exposed the problem clearly enough, but before committing to a solution, I wanted to understand what the baseline was getting right.
Storm onset detection is working. The zero-shot model is picking up HRRR atmospheric signals and starting to forecast accumulation before the SNOTEL sensors detect rising snow depth. That's meaningful: it means the atmospheric covariate integration is functioning. The model isn't ignoring the weather data — it's reading it and responding.
A covariate sensitivity diagnostic confirmed this more precisely. Running inference with a zeroed covariate baseline versus the real HRRR inputs produced a +15.5cm P90 delta. The model is genuinely using the atmospheric signal.
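The diagnostic itself is structurally simple. Sketched generically — `predict_fn` here is a hypothetical wrapper around the model's actual inference call, which I'm not reproducing:

```python
import numpy as np

def covariate_sensitivity(predict_fn, context, covariates):
    """Compare a quantile forecast driven by real covariates against a
    zeroed-covariate baseline.

    predict_fn(context, covariates) -> forecast quantile value; this is a
    stand-in signature, not the model's real API.
    """
    real = predict_fn(context, covariates)
    zeroed = predict_fn(context, np.zeros_like(covariates))
    # A positive delta means the atmospheric signal is pushing the forecast up.
    return real - zeroed
```

If the model were ignoring the covariates, the delta would sit near zero; the +15.5cm P90 shift is direct evidence it isn't.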
The failure modes are also specific, which is a good sign — diffuse failure is harder to fix than concentrated failure. Two patterns dominate: the model undershoots magnitude on large events (anything above 50cm of accumulation), and it keeps forecasting accumulation after a storm has passed — an echo effect where it's learned that a storm means snow for a while, but not how to read the signal that it's ending.
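Stratifying forecast bias by event size is how the first pattern shows up. A sketch of that check — per-storm totals and the 50cm threshold mirror the pattern described above, but the function is illustrative, not my actual diagnostic code:

```python
import numpy as np

def stratified_bias(actual_event_totals, forecast_event_totals, threshold_cm=50.0):
    """Mean forecast bias (forecast - actual) split by storm size.

    Inputs are per-storm accumulation totals in cm. A strongly negative
    large-event bias is the undershoot pattern.
    """
    actual = np.asarray(actual_event_totals, dtype=float)
    fcst = np.asarray(forecast_event_totals, dtype=float)
    large = actual > threshold_cm
    return {
        "large_event_bias_cm": float((fcst[large] - actual[large]).mean()),
        "small_event_bias_cm": float((fcst[~large] - actual[~large]).mean()),
    }
```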
Both of those are addressable. But fixing them requires giving the model better inputs to work with.
The Problem with PRATE
My original precipitation covariate was NOAA HRRR PRATE — instantaneous precipitation rate. For a short-range nowcast, PRATE is reasonable. For a 48-hour forecast intended to tell someone whether to book a ski weekend, it's the wrong question.
PRATE tells you it's already snowing. What I need are covariates that tell me a storm is coming.
This realization pointed directly at the data engineering layer. The model isn't misreading the atmospheric inputs — it's working with inputs that don't contain enough leading-indicator signal to close the storm magnitude gap. Fine-tuning a model on the wrong covariates won't fix that. So before round two of fine-tuning, the covariate set needs to change.
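For context on units: PRATE is reported in kg m⁻² s⁻¹, which for liquid water is mm per second. Turning an instantaneous rate into a window total means assuming the rate holds constant across the window — exactly the approximation a model-accumulated field like APCP avoids:

```python
def prate_to_accum_mm(prate_kg_m2_s, hours=1.0):
    """Convert an instantaneous precipitation rate (kg m^-2 s^-1, i.e. mm of
    liquid-equivalent per second) to an accumulated depth over the window,
    assuming the rate is constant for the whole window.
    """
    return prate_kg_m2_s * 3600.0 * hours
```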
Refetching 10 Years
I'm refetching the full 10-year HRRR archive with four targeted replacements:
- APCP — accumulated precipitation over the forecast window, rather than the instantaneous rate
- PWAT — precipitable water column. Atmospheric rivers show up in this field 12–24 hours before surface precipitation begins. It's one of the clearest leading indicators available in the HRRR output.
- VVEL at 700mb — vertical velocity at 700 millibars. Upward air motion at this level is a direct precursor to orographic precipitation. When moist air gets forced up the Sierra, this is where you see it first.
- 850mb temperature — the snow level proxy. The difference between a heavy snow event and rain-on-snow (which compacts the pack rather than adding to it) often comes down to whether the 850mb temperature is above or below freezing.
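For the fetch itself, the Herbie library is the usual route to archived HRRR GRIB files. This is a sketch of the plan, not the running pipeline — the GRIB search strings and the product choice are my best guesses at the HRRR field identifiers and should be checked against the file inventory:

```python
# GRIB search strings for the four replacement covariates (assumptions to verify
# against the HRRR inventory before trusting them).
HRRR_COVARIATES = {
    "apcp": ":APCP:surface",      # accumulated precipitation
    "pwat": ":PWAT:",             # precipitable water column
    "vvel_700": ":VVEL:700 mb",   # vertical velocity at 700mb
    "tmp_850": ":TMP:850 mb",     # 850mb temperature (snow level proxy)
}

def fetch_hour(date, fxx=0, product="prs"):
    """Pull the four covariates for one HRRR cycle (requires network; not run here).

    "prs" carries the pressure-level fields; surface fields like APCP also
    appear in "sfc" — the product choice is an assumption to verify.
    """
    from herbie import Herbie
    h = Herbie(date, model="hrrr", product=product, fxx=fxx)
    return {name: h.xarray(search) for name, search in HRRR_COVARIATES.items()}
```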
The full refetch is running now. Three more years of cold-storage history to pull. After that, I'll regenerate the training dataset in the Arrow format Chronos-2 requires and run the second fine-tuning pass.
The gap from 7.3% to something worth shipping isn't a modeling problem. It's a signal problem. Round two is about giving the model the inputs that actually carry the information it needs to predict what's coming.
I'm Jon Eby. I'm building PowDay.AI as a solo project on consumer hardware (an RTX 4070 Ti), and writing about what I'm learning as I go. If this kind of work interests you, connect on LinkedIn.