🧠 Technical Deep Dives

1. ARCHITECTURE: Walk me through how GraphCast makes a 10-day forecast.
Answer Framework:

1. Input Preparation: Take the current weather state X_t and the previous state X_{t-1} (6 hours earlier) on the lat-lon grid. Variables include temperature, wind, and humidity at 37 pressure levels.

2. Encoding: Map from lat-lon grid (721×1440) to icosahedral mesh (40,962 nodes). Single GNN layer projects features to latent space.

3. Processing: 16 message-passing layers on the multi-mesh. In each layer, edges compute messages from their endpoint nodes, then nodes aggregate incoming messages and update. Multi-scale edges (M0-M6) enable both local and global information flow.

4. Decoding: Map back to the lat-lon grid. Output is a Δ (change), added to the current state to get X_{t+1}.

5. Autoregressive rollout: Output becomes input for next step. Repeat 40 times for 10-day forecast (6h × 40 = 240h).
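
A minimal sketch of the rollout loop described in steps 1-5 (the model call and names are illustrative, not GraphCast's actual API):

# Autoregressive rollout: one 6 h step per iteration, 40 steps = 240 h
def rollout(model, x_prev, x_curr, num_steps=40):
    forecasts = []
    for _ in range(num_steps):
        delta = model(x_prev, x_curr)        # encode -> process -> decode; predicts the change
        x_next = x_curr + delta              # next state = current state + predicted delta
        forecasts.append(x_next)
        x_prev, x_curr = x_curr, x_next      # outputs feed back in as inputs
    return forecasts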

💡 Key Point to Emphasize

The multi-mesh is the key innovation — retaining edges from all refinement levels means a single message pass can propagate information both locally (M5-M6 edges) and globally (M0-M2 edges).

2. REPRESENTATION: Why use an icosahedral mesh instead of a lat-lon grid?
Answer Framework:

Problems with lat-lon grids:

  • Polar singularity: Grid cells shrink to points at 90°N/S
  • Aspect ratio: Cells near poles have extreme elongation
  • Uneven area: 1° of longitude varies from 111km at equator to 0km at poles
  • CFL constraint: Tiny polar cells limit global timestep

Icosahedral advantages:

  • Uniform resolution: Near-constant node spacing everywhere, with no polar clustering
  • Equal-area cells: Important for conservation laws
  • ~5-6 neighbors: Efficient for message passing
  • Natural hierarchy: Each refinement roughly quadruples the node count (see the sketch below)
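
As a quick check of the hierarchy, the standard counts for an icosahedron subdivided r times are 10·4^r + 2 vertices and 30·4^r edges, so level M6 gives the 40,962 nodes mentioned above:

# Node/edge counts per refinement level of an icosahedral mesh
for r in range(7):                       # M0 .. M6
    nodes = 10 * 4**r + 2                # vertices of an r-times subdivided icosahedron
    edges = 30 * 4**r                    # edge count quadruples with each refinement
    print(f"M{r}: {nodes:>6} nodes, {edges:>7} edges")
# M6 -> 40962 nodes, matching the mesh size quoted for GraphCast
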
3. PROBABILISTIC: Explain how diffusion models work in GenCast.
Answer Framework:

Forward process (training):
Progressively add Gaussian noise to weather state X0:
X_t = √(ᾱ_t)·X_0 + √(1 - ᾱ_t)·ε, with ε ~ N(0, I).
As t → T, X_T approaches pure noise.
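
A minimal NumPy sketch of this forward noising step (ᾱ_t is the cumulative noise-schedule product; names are illustrative):

import numpy as np

def q_sample(x0, alpha_bar_t, rng):
    """Noise a clean state: x_t = sqrt(abar_t)*x0 + sqrt(1 - abar_t)*eps, eps ~ N(0, I)."""
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar_t) * x0 + np.sqrt(1.0 - alpha_bar_t) * eps
    return x_t, eps    # the denoiser is trained to recover eps from x_t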

Reverse process (inference):
Train a neural network to predict the noise ε. Given X_t, predict ε̂, use it to compute X_{t-1}, and iterate from pure noise back to a realistic weather state.

Score function insight:
∇_x log p(X_t) = -ε/σ_t, where σ_t = √(1 - ᾱ_t) is the noise scale. Learning to predict the noise is therefore equivalent to learning the score, which points toward high-density (realistic) weather states.
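
For intuition, here is the generic DDPM-style reverse update written out; this is the textbook ancestral-sampling step, not GenCast's exact sampler:

import numpy as np

def reverse_step(x_t, eps_hat, alpha_t, alpha_bar_t, sigma_t, rng):
    """One denoising step: use the predicted noise eps_hat to move x_t toward x_{t-1}."""
    mean = (x_t - (1.0 - alpha_t) / np.sqrt(1.0 - alpha_bar_t) * eps_hat) / np.sqrt(alpha_t)
    z = rng.standard_normal(x_t.shape) if sigma_t > 0 else 0.0    # no noise on the final step
    return mean + sigma_t * z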

GenCast specifics:
• Conditions on the previous weather states X_{t-1}, X_t
• Sparse transformer (attend to mesh neighbors only)
• Different random seeds → different ensemble members

4. FGN: What's the key innovation in FGN vs GenCast?
Answer Framework:

GenCast: Iterative diffusion sampling. Multiple denoising steps per forecast. Slow but flexible.

FGN: Single forward pass. Inject 32D noise vector into conditional layer-norm across all layers. Same z everywhere → spatial coherence.

# FGN inference (pseudocode)
z = sample_normal(32)                     # one 32-D noise vector shared across the whole forecast
for layer in model.layers:
    x = conditional_layer_norm(x, z)      # z conditions the layer-norm scale and shift (γ, β)
# Output: a coherent global weather field
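
A minimal sketch of what conditional_layer_norm above could look like; the linear projection from z to the scale/shift is an assumption of the standard conditioning pattern, not FGN's exact parameterization:

import numpy as np

def conditional_layer_norm(x, z, w_gamma, w_beta, eps=1e-6):
    """Normalize features, then scale/shift with gamma(z), beta(z) computed from the shared noise z."""
    x_hat = (x - x.mean(axis=-1, keepdims=True)) / np.sqrt(x.var(axis=-1, keepdims=True) + eps)
    gamma = z @ w_gamma      # (32,) @ (32, d) -> per-feature scale, assumed linear map
    beta = z @ w_beta        # (32,) @ (32, d) -> per-feature shift, assumed linear map
    return (1.0 + gamma) * x_hat + beta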

Why it works:
Trained with a per-location CRPS loss, but because the same global z is shared everywhere, the easiest way to minimize CRPS at every location is to learn physically consistent global patterns.

Result: 8× faster than GenCast, 6.5% better CRPS, ~24h earlier cyclone skill.

5. METRICS: How do you evaluate probabilistic weather forecasts?
Answer Framework:

CRPS (key metric):
CRPS = E[|X - y|] - 0.5·E[|X - X'|]
where X, X' are independent samples from the forecast ensemble and y is the observation. Measures both calibration and sharpness. Lower is better.
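
A direct NumPy implementation of the ensemble form above (members along axis 0; this is the simple 1/M² estimator, the "fair" variant divides the second term by M(M-1)):

import numpy as np

def crps_ensemble(members, obs):
    """CRPS = E|X - y| - 0.5 E|X - X'| per grid point, for members of shape (M, ...) and obs (...)."""
    skill = np.abs(members - obs).mean(axis=0)
    spread = np.abs(members[:, None] - members[None, :]).mean(axis=(0, 1))
    return skill - 0.5 * spread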

Brier Score (binary events):
BS = E[(p - I(event))²]
where p is the forecast probability and I(event) is 1 if the event occurred, 0 otherwise. For questions like "P(T > 35°C)?". Lower is better.

Spread-Skill Ratio:
SSR = spread / RMSE
Should be ≈1.0. SSR<1 = overconfident, SSR>1 = underconfident.

Rank Histograms:
Where does observation fall in sorted ensemble? Should be uniform if calibrated.
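
A small sketch of both calibration diagnostics (ensemble members along axis 0; names are illustrative):

import numpy as np

def spread_skill_ratio(members, obs):
    """Mean ensemble spread divided by RMSE of the ensemble mean; ~1.0 for a calibrated ensemble."""
    spread = members.std(axis=0).mean()
    rmse = np.sqrt(((members.mean(axis=0) - obs) ** 2).mean())
    return spread / rmse

def rank_histogram(members, obs):
    """Histogram of where the observation ranks within the sorted ensemble; flat if calibrated."""
    ranks = (members < obs).sum(axis=0).ravel()               # rank of obs among M members: 0..M
    return np.bincount(ranks, minlength=members.shape[0] + 1)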

⚙️ System Design & Engineering

6. DATA PIPELINE: Design a data pipeline for training on ERA5.
Answer Framework:
Pipeline architecture:

1. Storage: ERA5 in Zarr format (chunked, compressed)
   - Chunk by time (e.g., 1 month per chunk)
   - Enable parallel reads across workers
2. Loading: xarray + dask for lazy loading
   - Only load data when needed
   - Automatic parallelization
3. Preprocessing:
   - Normalize: (X - mean) / std per variable/level
   - Construct training examples: (X_{t-1}, X_t) → X_{t+1}
4. Batching:
   - Variable-length rollouts for training
   - Prefetch next batch while the GPU computes
5. Augmentation:
   - Longitude rotation (Earth has no preferred origin)
   - Possibly random crops in time
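
A hedged sketch of the loading and pair-construction steps with xarray + dask (the Zarr path, chunking, and variable names are placeholders):

import xarray as xr

ds = xr.open_zarr("gs://my-bucket/era5.zarr", chunks={"time": 1})    # lazy, dask-backed reads

# Per-variable statistics for normalization (computed lazily over the training period)
mean = ds.mean(dim=("time", "latitude", "longitude"))
std = ds.std(dim=("time", "latitude", "longitude"))

def get_example(t):
    """Return ((X_{t-1}, X_t), X_{t+1}) as normalized states for one training step."""
    window = (ds.isel(time=slice(t - 1, t + 2)) - mean) / std        # three consecutive 6 h states
    x_prev, x_curr, x_next = (window.isel(time=i) for i in range(3))
    return (x_prev, x_curr), x_next
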
7. DISTRIBUTED: How would you scale training to multiple TPU pods?
Answer Framework:

Data Parallelism:
• Same model replicated across devices
• Different data batches per device
• Gradient all-reduce after backward pass
• Use pmap in JAX for SPMD parallelization

Model Parallelism (if needed):
• Split layers across devices
• Pipeline parallelism for large models
• Careful about communication bottlenecks

Memory Optimization:
• Gradient checkpointing (recompute activations in backward)
• Mixed precision (bfloat16 on TPU)
• Sharded optimizer states

# JAX distributed training pattern (data parallelism across devices)
import functools
import jax

@functools.partial(jax.pmap, axis_name="batch")
def train_step(params, batch):
    loss, grads = jax.value_and_grad(loss_fn)(params, batch)    # loss_fn defined elsewhere
    grads = jax.lax.pmean(grads, axis_name="batch")             # all-reduce gradients
    params = jax.tree_util.tree_map(lambda p, g: p - 1e-3 * g, params, grads)  # simple SGD step; swap in optax in practice
    return params, loss
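
For the memory-optimization bullets above, a minimal example of gradient checkpointing in JAX: wrapping a block in jax.checkpoint (a.k.a. jax.remat) drops its activations and recomputes them during the backward pass (the block itself is just an illustrative layer):

import jax
import jax.numpy as jnp

@jax.checkpoint
def processor_block(params, x):
    # Activations inside this block are not stored; they are recomputed when gradients are taken.
    return jax.nn.relu(x @ params["w"] + params["b"])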

🔬 Research & Impact

8. IMPACT: Why does ML weather prediction matter?
Answer Framework:

Speed: Minutes on single TPU vs hours on supercomputer with 10,000+ processors. Enables rapid updates and more ensemble members.

Accuracy: GraphCast/GenCast now beat ECMWF HRES/ENS — the previous gold standards developed over decades.

Ensemble efficiency: Cheap to generate many scenarios → better uncertainty quantification.

Applications:

  • Disaster prep: Earlier cyclone warnings save lives
  • Renewable energy: Wind/solar production forecasting
  • Agriculture: Planting/harvesting decisions
  • Aviation: Route optimization, delay prediction
  • Grid management: Load forecasting
9. LIMITATIONS: What's missing from current ML weather models?
Answer Framework:
  • Explicit physical constraints: Energy conservation, mass conservation not explicitly enforced
  • Precipitation: Still challenging due to sparse, non-Gaussian nature
  • Coupled systems: Ocean-atmosphere interaction, sea ice
  • Long timescales: Subseasonal-to-seasonal (S2S) prediction
  • Distribution shift: Climate change means future ≠ past
  • Interpretability: Why does model predict X?
  • Initialization: Still depends on traditional data assimilation
10. EXTENSIONS: How would you extend this work to energy applications?
Answer Framework:

Wind Power:
• Forecast 100m hub-height winds (turbine level)
• Predict ramp events (sudden production changes)
• Probabilistic capacity factor forecasts

Solar Power:
• Cloud cover prediction (affects irradiance)
• Direct vs diffuse radiation
• Dust/aerosol effects

Grid Optimization:
• Load forecasting (temperature → AC demand)
• Renewable integration planning
• Day-ahead market bidding

Extreme Events:
• Storm preparation (utility crews)
• Grid resilience under extreme heat/cold
• Wildfire risk (wind + humidity)

📋 4-Week Study Plan

Week 1-2: Foundations
  • Read GraphCast paper (especially Methods)
  • Complete Distill GNN tutorial
  • Implement simple message passing in JAX (see the sketch after this list)
  • Understand icosahedral mesh construction
  • Read GenCast paper
  • Study diffusion model basics
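
A minimal sketch of the kind of message-passing step to implement (edge list as sender/receiver index arrays; weights and names are illustrative):

import jax
import jax.numpy as jnp

def message_passing_step(node_feats, senders, receivers, params):
    """One GNN layer: build messages on edges, sum them at receiver nodes, update node features."""
    edge_inputs = jnp.concatenate([node_feats[senders], node_feats[receivers]], axis=-1)
    messages = jnp.tanh(edge_inputs @ params["w_edge"])                    # (num_edges, d)
    aggregated = jax.ops.segment_sum(messages, receivers,
                                     num_segments=node_feats.shape[0])     # (num_nodes, d)
    node_inputs = jnp.concatenate([node_feats, aggregated], axis=-1)
    return node_feats + jnp.tanh(node_inputs @ params["w_node"])           # residual update
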
Week 3-4: Advanced + Practice
  • Read FGN paper (WeatherNext 2)
  • Understand functional perturbations
  • Study transformer attention
  • Practice system design questions
  • Review ERA5 data structure
  • Prepare personal narrative