Critical interpretations and detailed explanations of DeepMind Weather AI concepts
1. Input Preparation: Take current weather state Xt and previous state Xt-1 (6 hours ago) on lat-lon grid. Variables include temperature, wind, humidity at 37 pressure levels.
2. Encoding: Map from lat-lon grid (721×1440) to icosahedral mesh (40,962 nodes). Single GNN layer projects features to latent space.
3. Processing: 16 message-passing layers on the multi-mesh. In each layer, edge MLPs compute messages from the nodes they connect, and node MLPs aggregate incoming messages to update node features. Multi-scale edges (M0-M6) enable both local and global information flow.
4. Decoding: Map back to lat-lon grid. Output is Δ (change), added to current state to get Xt+1.
5. Autoregressive rollout: Output becomes input for next step. Repeat 40 times for 10-day forecast (6h × 40 = 240h).
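A minimal JAX sketch of this encode-process-decode rollout. All dimensions, projections, and layer internals here are illustrative placeholders (the real model uses learned GNN encoders/decoders on a 721×1440 grid with 40,962 mesh nodes), not the actual implementation:

```python
import jax
import jax.numpy as jnp

# Tiny stand-in dimensions; the real model uses a 721x1440 grid, 40,962 mesh
# nodes, and many more channels. All weights/projections below are placeholders.
N_GRID, N_MESH, C, D = 1000, 100, 8, 32

def init_params(key):
    k1, k2, k3, k4, k5 = jax.random.split(key, 5)
    return {
        "g2m": jax.random.normal(k1, (N_MESH, N_GRID)) / N_GRID,  # grid -> mesh
        "enc": jax.random.normal(k2, (2 * C, D)) / C,             # encoder projection
        "proc": jax.random.normal(k3, (16, D, D)) / D,            # 16 processor layers
        "dec": jax.random.normal(k4, (D, C)) / D,                 # decoder projection
        "m2g": jax.random.normal(k5, (N_GRID, N_MESH)) / N_MESH,  # mesh -> grid
    }

def one_step(x_t, x_prev, p):
    """One 6-hour step: encode to the mesh, run message passing, decode a delta."""
    feats = jnp.concatenate([x_t, x_prev], axis=-1)   # (N_GRID, 2C): Xt and Xt-1
    h = jnp.tanh(p["g2m"] @ feats @ p["enc"])         # encode: (N_MESH, D)
    for w in p["proc"]:                               # stand-in for message passing
        h = h + jnp.tanh(h @ w)
    delta = p["m2g"] @ (h @ p["dec"])                 # decode: (N_GRID, C)
    return x_t + delta                                # output is an increment, not a state

def rollout(x_t, x_prev, p, n_steps=40):
    """Autoregressive forecast: 40 steps x 6 h = 240 h (10 days)."""
    def body(carry, _):
        cur, prev = carry
        nxt = one_step(cur, prev, p)
        return (nxt, cur), nxt
    _, traj = jax.lax.scan(body, (x_t, x_prev), None, length=n_steps)
    return traj                                       # (n_steps, N_GRID, C)
```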
The multi-mesh is the key innovation — retaining edges from all refinement levels means a single message pass can propagate information both locally (M5-M6 edges) and globally (M0-M2 edges).
Problems with lat-lon grids:
• Grid cells shrink toward the poles, so resolution is highly uneven and the polar regions are heavily oversampled.
• Meridians converging at the poles create singularities and distorted neighborhoods for convolution or message passing.
Icosahedral advantages:
• Near-uniform node spacing over the whole sphere, with no pole singularities.
• Successive refinements (M0-M6) provide a natural multi-scale hierarchy for message passing.
Forward process (training):
Progressively add Gaussian noise to the weather state X0 (here t indexes the diffusion/noise level, not forecast time):
Xt = √ᾱt·X0 + √(1-ᾱt)·ε,  with ε ~ N(0, I)
As t→T, XT ≈ pure noise.
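A short sketch of this forward (noising) process in JAX. The linear beta schedule is an assumption for illustration; GenCast's actual noise schedule differs:

```python
import jax
import jax.numpy as jnp

# Assumed linear beta schedule (illustrative only).
T = 1000
betas = jnp.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bar = jnp.cumprod(alphas)                    # ᾱ_t

def q_sample(x0, t, key):
    """Draw Xt ~ N(√ᾱt·X0, (1-ᾱt)·I) and return the noise used (the training target)."""
    eps = jax.random.normal(key, x0.shape)
    x_t = jnp.sqrt(alpha_bar[t]) * x0 + jnp.sqrt(1.0 - alpha_bar[t]) * eps
    return x_t, eps

# Example: noise a toy "weather field" at step t=500.
x0 = jax.random.normal(jax.random.PRNGKey(0), (64, 64))
x_t, eps = q_sample(x0, t=500, key=jax.random.PRNGKey(1))
```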
Reverse process (inference):
Learn a neural network to predict the noise ε. Given Xt, predict ε, then compute Xt-1. Iterate from pure noise back to a realistic weather state.
Score function insight:
∇log p(x) ≈ -ε/σ for the noised state. Learning to predict noise = learning the score. Score points toward high-density (realistic) weather states.
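A minimal DDPM-style denoising step, continuing the sketch above (reusing betas, alphas, alpha_bar) and assuming a hypothetical predict_eps(x_t, t, cond) noise-prediction network; the noise-to-score conversion is shown inline:

```python
def reverse_step(x_t, t, predict_eps, cond, key):
    """One denoising step Xt -> Xt-1, given a network that predicts the noise ε."""
    eps_hat = predict_eps(x_t, t, cond)              # hypothetical noise-prediction net
    beta, a, a_bar = betas[t], alphas[t], alpha_bar[t]

    # Score estimate from the predicted noise (shown only to illustrate the relation):
    score = -eps_hat / jnp.sqrt(1.0 - a_bar)         # ∇log p(Xt) ≈ -ε̂/σt

    # DDPM posterior mean; add fresh noise except at the final step.
    mean = (x_t - beta / jnp.sqrt(1.0 - a_bar) * eps_hat) / jnp.sqrt(a)
    noise = jax.random.normal(key, x_t.shape)
    return jnp.where(t > 0, mean + jnp.sqrt(beta) * noise, mean)
```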
GenCast specifics:
• Conditions on Xt-1, Xt (previous weather states)
• Sparse transformer (attend to mesh neighbors only)
• Different random seeds → different ensemble members
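Ensemble generation is just re-running the sampler with different PRNG keys; a sketch assuming a hypothetical sample_forecast(key, conditioning) function:

```python
import jax

def make_ensemble(sample_forecast, conditioning, n_members=50, seed=0):
    """Each PRNG key produces one ensemble member from the same conditioning."""
    keys = jax.random.split(jax.random.PRNGKey(seed), n_members)
    # vmap runs the sampler once per key; members differ only in their noise draws.
    return jax.vmap(lambda k: sample_forecast(k, conditioning))(keys)
```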
GenCast: Iterative diffusion sampling. Multiple denoising steps per forecast. Slow but flexible.
FGN: Single forward pass. Inject 32D noise vector into conditional layer-norm across all layers. Same z everywhere → spatial coherence.
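A sketch of that noise injection via conditional layer norm. Names and shapes are assumptions for illustration, not FGN's actual code:

```python
import jax.numpy as jnp

def conditional_layer_norm(h, z, w_scale, w_shift, eps=1e-6):
    """LayerNorm whose scale and shift are predicted from the shared noise vector z.

    h: (..., C) activations at some layer; z: (32,) noise vector; w_scale, w_shift:
    (32, C) projections. The same z modulates every layer and every grid point,
    which is what pushes the sampled forecast toward spatial coherence.
    """
    mean = h.mean(axis=-1, keepdims=True)
    var = h.var(axis=-1, keepdims=True)
    h_norm = (h - mean) / jnp.sqrt(var + eps)
    scale = 1.0 + z @ w_scale        # (C,) per-channel scale derived from z
    shift = z @ w_shift              # (C,) per-channel shift derived from z
    return h_norm * scale + shift
```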
Why it works:
Trained with a per-location CRPS loss. But because one global z is shared across all locations, the easiest way to minimize CRPS everywhere is to learn physically consistent, spatially coherent patterns.
Result: 8× faster than GenCast, 6.5% better CRPS, ~24h earlier cyclone skill.
CRPS (key metric):
CRPS = E[|X-y|] - 0.5·E[|X-X'|], where X, X' are independent ensemble members and y is the observation.
Measures both calibration and sharpness. Lower is better.
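A straightforward Monte Carlo estimate of this form from ensemble members (a sketch; operational scoring tools add member-count bias corrections):

```python
import jax.numpy as jnp

def crps_ensemble(members, obs):
    """CRPS = E[|X-y|] - 0.5·E[|X-X'|], estimated from an (M, ...) ensemble.

    members: ensemble forecasts; obs: observation with matching trailing shape.
    Returns CRPS per location; lower is better.
    """
    skill = jnp.abs(members - obs).mean(axis=0)                              # E[|X-y|]
    spread = jnp.abs(members[:, None] - members[None, :]).mean(axis=(0, 1))  # E[|X-X'|]
    return skill - 0.5 * spread
```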
Brier Score (binary events):
BS = E[(p - I(event))²], where p is the forecast probability and I(event) is 1 if the event occurred, else 0.
For questions like "P(T > 35°C)?". Lower is better.
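As code (a one-line sketch):

```python
import jax.numpy as jnp

def brier_score(p_forecast, event_occurred):
    """p_forecast: forecast probabilities in [0, 1]; event_occurred: 0/1 outcomes."""
    return jnp.mean((p_forecast - event_occurred) ** 2)
```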
Spread-Skill Ratio:
SSR = spread / RMSE
Should be ≈1.0. SSR<1 = overconfident, SSR>1 = underconfident.
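A sketch using the ensemble-mean RMSE for skill and the mean ensemble variance for spread (exact conventions vary slightly between papers):

```python
import jax.numpy as jnp

def spread_skill_ratio(members, obs):
    """SSR = ensemble spread / RMSE of the ensemble mean; ≈1 when well calibrated."""
    ens_mean = members.mean(axis=0)
    rmse = jnp.sqrt(jnp.mean((ens_mean - obs) ** 2))
    spread = jnp.sqrt(jnp.mean(members.var(axis=0)))   # root of mean ensemble variance
    return spread / rmse
```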
Rank Histograms:
Where does observation fall in sorted ensemble? Should be uniform if calibrated.
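A minimal sketch (ties between members and the observation are ignored here):

```python
import jax.numpy as jnp

def rank_histogram(members, obs):
    """Histogram of the observation's rank within the sorted ensemble; flat = calibrated."""
    n_members = members.shape[0]
    ranks = jnp.sum(members < obs, axis=0)                     # rank in 0..n_members
    counts = jnp.bincount(ranks.ravel(), length=n_members + 1)
    return counts / counts.sum()
```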
Data Parallelism:
• Same model replicated across devices
• Different data batches per device
• Gradient all-reduce after backward pass
• Use pmap in JAX for SPMD parallelization
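A minimal data-parallel training step with pmap and a pmean all-reduce. The loss function and batch layout are assumptions for illustration (newer JAX favors jit + sharding, but pmap still works):

```python
import jax
import jax.numpy as jnp
from functools import partial

def loss_fn(params, batch):
    # Hypothetical stand-in loss: MSE of a linear prediction.
    pred = batch["x"] @ params["w"]
    return jnp.mean((pred - batch["y"]) ** 2)

@partial(jax.pmap, axis_name="devices")
def train_step(params, batch):
    """Each device computes gradients on its own batch shard, then all-reduces them."""
    loss, grads = jax.value_and_grad(loss_fn)(params, batch)
    grads = jax.lax.pmean(grads, axis_name="devices")            # gradient all-reduce
    new_params = jax.tree_util.tree_map(lambda p, g: p - 1e-4 * g, params, grads)
    return new_params, jax.lax.pmean(loss, axis_name="devices")

# Usage sketch: replicate params across local devices, shard the batch along axis 0.
# params = jax.device_put_replicated(params, jax.local_devices())
# batch = {k: v.reshape(jax.local_device_count(), -1, *v.shape[1:]) for k, v in batch.items()}
# params, loss = train_step(params, batch)
```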
Model Parallelism (if needed):
• Split layers across devices
• Pipeline parallelism for large models
• Careful about communication bottlenecks
Memory Optimization:
• Gradient checkpointing (recompute activations in backward)
• Mixed precision (bfloat16 on TPU)
• Sharded optimizer states
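The first two items in JAX, as a sketch (jax.checkpoint, also exposed as jax.remat, recomputes the wrapped block's activations during the backward pass):

```python
import jax
import jax.numpy as jnp

@jax.checkpoint        # recompute this block's activations in the backward pass
def processor_block(h, w):
    return h + jnp.tanh(h @ w)

def forward(h, weights):
    # Mixed precision: run activations in bfloat16; keep params/optimizer state in fp32.
    h = h.astype(jnp.bfloat16)
    for w in weights:
        h = processor_block(h, w.astype(jnp.bfloat16))
    return h.astype(jnp.float32)
```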
Speed: Minutes on single TPU vs hours on supercomputer with 10,000+ processors. Enables rapid updates and more ensemble members.
Accuracy: GraphCast/GenCast now beat ECMWF HRES/ENS — the previous gold standards developed over decades.
Ensemble efficiency: Cheap to generate many scenarios → better uncertainty quantification.
Applications:
Wind Power:
• Forecast 100m hub-height winds (turbine level)
• Predict ramp events (sudden production changes)
• Probabilistic capacity factor forecasts
Solar Power:
• Cloud cover prediction (affects irradiance)
• Direct vs diffuse radiation
• Dust/aerosol effects
Grid Optimization:
• Load forecasting (temperature → AC demand)
• Renewable integration planning
• Day-ahead market bidding
Extreme Events:
• Storm preparation (utility crews)
• Grid resilience under extreme heat/cold
• Wildfire risk (wind + humidity)