🧠 Technical Deep Dives

1. ARCHITECTURE: Walk me through how GraphCast makes a 10-day forecast.
Answer Framework:

1. Input Preparation: Take the current weather state X_t and the previous state X_{t-1} (6 hours earlier) on the lat-lon grid. Variables include temperature, wind, and humidity at 37 pressure levels.

2. Encoding: Map from lat-lon grid (721×1440) to icosahedral mesh (40,962 nodes). Single GNN layer projects features to latent space.

3. Processing: 16 message-passing layers on the multi-mesh. In each layer, edges compute messages from their endpoint nodes, then nodes aggregate incoming messages and update. Multi-scale edges (M0-M6) enable both local and global information flow.

4. Decoding: Map back to the lat-lon grid. Output is a Δ (change), added to the current state to get X_{t+1}.

5. Autoregressive rollout: Output becomes input for next step. Repeat 40 times for 10-day forecast (6h × 40 = 240h).
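
A minimal sketch of the rollout loop described in steps 1-5 (the model call and names are illustrative, not GraphCast's actual API):

# Autoregressive rollout: one 6 h step per iteration, 40 steps = 240 h
def rollout(model, x_prev, x_curr, num_steps=40):
    forecasts = []
    for _ in range(num_steps):
        delta = model(x_prev, x_curr)        # encode -> process -> decode; predicts the change
        x_next = x_curr + delta              # next state = current state + predicted delta
        forecasts.append(x_next)
        x_prev, x_curr = x_curr, x_next      # outputs feed back in as inputs
    return forecasts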

💡 Key Point to Emphasize

The multi-mesh is the key innovation — retaining edges from all refinement levels means a single message pass can propagate information both locally (M5-M6 edges) and globally (M0-M2 edges).

2. REPRESENTATION: Why use an icosahedral mesh instead of a lat-lon grid?
Answer Framework:

Problems with lat-lon grids:

  • Polar singularity: Grid cells shrink to points at 90°N/S
  • Aspect ratio: Cells near poles have extreme elongation
  • Uneven area: 1° of longitude varies from 111km at equator to 0km at poles
  • CFL constraint: Tiny polar cells limit global timestep

Icosahedral advantages:

  • Uniform resolution: Near-constant node spacing everywhere, with no polar clustering
  • Equal-area cells: Important for conservation laws
  • ~5-6 neighbors: Efficient for message passing
  • Natural hierarchy: Each refinement roughly quadruples the node count (see the sketch below)
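
As a quick check of the hierarchy, the standard counts for an icosahedron subdivided r times are 10·4^r + 2 vertices and 30·4^r edges, so level M6 gives the 40,962 nodes mentioned above:

# Node/edge counts per refinement level of an icosahedral mesh
for r in range(7):                       # M0 .. M6
    nodes = 10 * 4**r + 2                # vertices of an r-times subdivided icosahedron
    edges = 30 * 4**r                    # edge count quadruples with each refinement
    print(f"M{r}: {nodes:>6} nodes, {edges:>7} edges")
# M6 -> 40962 nodes, matching the mesh size quoted for GraphCast
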
3. PROBABILISTIC: Explain how diffusion models work in GenCast.
Answer Framework:

Forward process (training):
Progressively add Gaussian noise to weather state X0:
X_t = √(ᾱ_t)·X_0 + √(1 - ᾱ_t)·ε, with ε ~ N(0, I).
As t → T, X_T approaches pure noise.
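
A minimal NumPy sketch of this forward noising step (ᾱ_t is the cumulative noise-schedule product; names are illustrative):

import numpy as np

def q_sample(x0, alpha_bar_t, rng):
    """Noise a clean state: x_t = sqrt(abar_t)*x0 + sqrt(1 - abar_t)*eps, eps ~ N(0, I)."""
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar_t) * x0 + np.sqrt(1.0 - alpha_bar_t) * eps
    return x_t, eps    # the denoiser is trained to recover eps from x_t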

Reverse process (inference):
Train a neural network to predict the noise ε. Given X_t, predict ε̂, use it to compute X_{t-1}, and iterate from pure noise back to a realistic weather state.

Score function insight:
∇_x log p(X_t) = -ε/σ_t, where σ_t = √(1 - ᾱ_t) is the noise scale. Learning to predict the noise is therefore equivalent to learning the score, which points toward high-density (realistic) weather states.
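
For intuition, here is the generic DDPM-style reverse update written out; this is the textbook ancestral-sampling step, not GenCast's exact sampler:

import numpy as np

def reverse_step(x_t, eps_hat, alpha_t, alpha_bar_t, sigma_t, rng):
    """One denoising step: use the predicted noise eps_hat to move x_t toward x_{t-1}."""
    mean = (x_t - (1.0 - alpha_t) / np.sqrt(1.0 - alpha_bar_t) * eps_hat) / np.sqrt(alpha_t)
    z = rng.standard_normal(x_t.shape) if sigma_t > 0 else 0.0    # no noise on the final step
    return mean + sigma_t * z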

GenCast specifics:
• Conditions on the previous weather states X_{t-1}, X_t
• Sparse transformer (attend to mesh neighbors only)
• Different random seeds → different ensemble members

4. FGN: What's the key innovation in FGN vs GenCast?
Answer Framework:

GenCast: Iterative diffusion sampling. Multiple denoising steps per forecast. Slow but flexible.

FGN: Single forward pass. Inject 32D noise vector into conditional layer-norm across all layers. Same z everywhere → spatial coherence.

# FGN inference (pseudocode)
z = sample_normal(32)                     # one 32-D noise vector shared across the whole forecast
for layer in model.layers:
    x = conditional_layer_norm(x, z)      # z conditions the layer-norm scale and shift (γ, β)
# Output: a coherent global weather field
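
A minimal sketch of what conditional_layer_norm above could look like; the linear projection from z to the scale/shift is an assumption of the standard conditioning pattern, not FGN's exact parameterization:

import numpy as np

def conditional_layer_norm(x, z, w_gamma, w_beta, eps=1e-6):
    """Normalize features, then scale/shift with gamma(z), beta(z) computed from the shared noise z."""
    x_hat = (x - x.mean(axis=-1, keepdims=True)) / np.sqrt(x.var(axis=-1, keepdims=True) + eps)
    gamma = z @ w_gamma      # (32,) @ (32, d) -> per-feature scale, assumed linear map
    beta = z @ w_beta        # (32,) @ (32, d) -> per-feature shift, assumed linear map
    return (1.0 + gamma) * x_hat + beta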

Why it works:
Trained with a per-location CRPS loss, but because the same global z is shared everywhere, the easiest way to minimize CRPS at every location is to learn physically consistent global patterns.

Result: 8× faster than GenCast, 6.5% better CRPS, ~24h earlier cyclone skill.

5. METRICS: How do you evaluate probabilistic weather forecasts?
Answer Framework:

CRPS (key metric):
CRPS = E[|X - y|] - 0.5·E[|X - X'|]
where X, X' are independent samples from the forecast ensemble and y is the observation. Measures both calibration and sharpness. Lower is better.
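
A direct NumPy implementation of the ensemble form above (members along axis 0; this is the simple 1/M² estimator, the "fair" variant divides the second term by M(M-1)):

import numpy as np

def crps_ensemble(members, obs):
    """CRPS = E|X - y| - 0.5 E|X - X'| per grid point, for members of shape (M, ...) and obs (...)."""
    skill = np.abs(members - obs).mean(axis=0)
    spread = np.abs(members[:, None] - members[None, :]).mean(axis=(0, 1))
    return skill - 0.5 * spread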

Brier Score (binary events):
BS = E[(p - I(event))²]
where p is the forecast probability and I(event) is 1 if the event occurred, 0 otherwise. For questions like "P(T > 35°C)?". Lower is better.

Spread-Skill Ratio:
SSR = spread / RMSE
Should be ≈1.0. SSR<1 = overconfident, SSR>1 = underconfident.

Rank Histograms:
Where does observation fall in sorted ensemble? Should be uniform if calibrated.
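
A small sketch of both calibration diagnostics (ensemble members along axis 0; names are illustrative):

import numpy as np

def spread_skill_ratio(members, obs):
    """Mean ensemble spread divided by RMSE of the ensemble mean; ~1.0 for a calibrated ensemble."""
    spread = members.std(axis=0).mean()
    rmse = np.sqrt(((members.mean(axis=0) - obs) ** 2).mean())
    return spread / rmse

def rank_histogram(members, obs):
    """Histogram of where the observation ranks within the sorted ensemble; flat if calibrated."""
    ranks = (members < obs).sum(axis=0).ravel()               # rank of obs among M members: 0..M
    return np.bincount(ranks, minlength=members.shape[0] + 1)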

⚙️ System Design & Engineering

6. DATA PIPELINE: Design a data pipeline for training on ERA5.
Answer Framework:
Pipeline architecture:

1. Storage: ERA5 in Zarr format (chunked, compressed)
   - Chunk by time (e.g., 1 month per chunk)
   - Enable parallel reads across workers
2. Loading: xarray + dask for lazy loading
   - Only load data when needed
   - Automatic parallelization
3. Preprocessing:
   - Normalize: (X - mean) / std per variable/level
   - Construct training examples: (X_{t-1}, X_t) → X_{t+1}
4. Batching:
   - Variable-length rollouts for training
   - Prefetch next batch while the GPU computes
5. Augmentation:
   - Longitude rotation (Earth has no preferred origin)
   - Possibly random crops in time
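
A hedged sketch of the loading and pair-construction steps with xarray + dask (the Zarr path, chunking, and variable names are placeholders):

import xarray as xr

ds = xr.open_zarr("gs://my-bucket/era5.zarr", chunks={"time": 1})    # lazy, dask-backed reads

# Per-variable statistics for normalization (computed lazily over the training period)
mean = ds.mean(dim=("time", "latitude", "longitude"))
std = ds.std(dim=("time", "latitude", "longitude"))

def get_example(t):
    """Return ((X_{t-1}, X_t), X_{t+1}) as normalized states for one training step."""
    window = (ds.isel(time=slice(t - 1, t + 2)) - mean) / std        # three consecutive 6 h states
    x_prev, x_curr, x_next = (window.isel(time=i) for i in range(3))
    return (x_prev, x_curr), x_next
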
7. DISTRIBUTED: How would you scale training to multiple TPU pods?
Answer Framework:

Data Parallelism:
• Same model replicated across devices
• Different data batches per device
• Gradient all-reduce after backward pass
• Use pmap in JAX for SPMD parallelization

Model Parallelism (if needed):
• Split layers across devices
• Pipeline parallelism for large models
• Careful about communication bottlenecks

Memory Optimization:
• Gradient checkpointing (recompute activations in backward)
• Mixed precision (bfloat16 on TPU)
• Sharded optimizer states

# JAX distributed training pattern (data parallelism across devices)
import functools
import jax

@functools.partial(jax.pmap, axis_name="batch")
def train_step(params, batch):
    loss, grads = jax.value_and_grad(loss_fn)(params, batch)    # loss_fn defined elsewhere
    grads = jax.lax.pmean(grads, axis_name="batch")             # all-reduce gradients
    params = jax.tree_util.tree_map(lambda p, g: p - 1e-3 * g, params, grads)  # simple SGD step; swap in optax in practice
    return params, loss
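
For the memory-optimization bullets above, a minimal example of gradient checkpointing in JAX: wrapping a block in jax.checkpoint (a.k.a. jax.remat) drops its activations and recomputes them during the backward pass (the block itself is just an illustrative layer):

import jax
import jax.numpy as jnp

@jax.checkpoint
def processor_block(params, x):
    # Activations inside this block are not stored; they are recomputed when gradients are taken.
    return jax.nn.relu(x @ params["w"] + params["b"])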

🔬 Research & Impact

8. IMPACT: Why does ML weather prediction matter?
Answer Framework:

Speed: Minutes on single TPU vs hours on supercomputer with 10,000+ processors. Enables rapid updates and more ensemble members.

Accuracy: GraphCast/GenCast now beat ECMWF HRES/ENS — the previous gold standards developed over decades.

Ensemble efficiency: Cheap to generate many scenarios → better uncertainty quantification.

Applications:

  • Disaster prep: Earlier cyclone warnings save lives
  • Renewable energy: Wind/solar production forecasting
  • Agriculture: Planting/harvesting decisions
  • Aviation: Route optimization, delay prediction
  • Grid management: Load forecasting
9. LIMITATIONS: What's missing from current ML weather models?
Answer Framework:
  • Explicit physical constraints: Energy conservation, mass conservation not explicitly enforced
  • Precipitation: Still challenging due to sparse, non-Gaussian nature
  • Coupled systems: Ocean-atmosphere interaction, sea ice
  • Long timescales: Subseasonal-to-seasonal (S2S) prediction
  • Distribution shift: Climate change means future ≠ past
  • Interpretability: Why does model predict X?
  • Initialization: Still depends on traditional data assimilation
10. EXTENSIONS: How would you extend this work to energy applications?
Answer Framework:

Wind Power:
• Forecast 100m hub-height winds (turbine level)
• Predict ramp events (sudden production changes)
• Probabilistic capacity factor forecasts

Solar Power:
• Cloud cover prediction (affects irradiance)
• Direct vs diffuse radiation
• Dust/aerosol effects

Grid Optimization:
• Load forecasting (temperature → AC demand)
• Renewable integration planning
• Day-ahead market bidding

Extreme Events:
• Storm preparation (utility crews)
• Grid resilience under extreme heat/cold
• Wildfire risk (wind + humidity)

📋 4-Week Study Plan

Week 1-2: Foundations
  • Read GraphCast paper (especially Methods)
  • Complete Distill GNN tutorial
  • Implement simple message passing in JAX (see the sketch after this list)
  • Understand icosahedral mesh construction
  • Read GenCast paper
  • Study diffusion model basics
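
A minimal sketch of the kind of message-passing step to implement (edge list as sender/receiver index arrays; weights and names are illustrative):

import jax
import jax.numpy as jnp

def message_passing_step(node_feats, senders, receivers, params):
    """One GNN layer: build messages on edges, sum them at receiver nodes, update node features."""
    edge_inputs = jnp.concatenate([node_feats[senders], node_feats[receivers]], axis=-1)
    messages = jnp.tanh(edge_inputs @ params["w_edge"])                    # (num_edges, d)
    aggregated = jax.ops.segment_sum(messages, receivers,
                                     num_segments=node_feats.shape[0])     # (num_nodes, d)
    node_inputs = jnp.concatenate([node_feats, aggregated], axis=-1)
    return node_feats + jnp.tanh(node_inputs @ params["w_node"])           # residual update
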
Week 3-4: Advanced + Practice
  • Read FGN paper (WeatherNext 2)
  • Understand functional perturbations
  • Study transformer attention
  • Practice system design questions
  • Review ERA5 data structure
  • Prepare personal narrative