🧠 Graph Neural Networks

Message Passing Neural Network (MPNN)
xi(k) = γ(k)( xi(k-1), ⊕j∈N(i) φ(k)(xi(k-1), xj(k-1), eji) )
Where:
  • xi(k) = embedding of node i at layer k
  • N(i) = neighbors of node i
  • φ = message function (typically MLP)
  • ⊕ = permutation-invariant aggregation (sum, mean, max)
  • γ = update function (typically MLP)
  • eji = edge features from j to i
Three Steps of Message Passing
  1. MESSAGE: Each neighbor j computes a message for node i using φ(xi, xj, eji)
  2. AGGREGATE: Combine all incoming messages using ⊕ (sum/mean). Must be permutation-invariant!
  3. UPDATE: Node i updates its representation using γ, combining its old state with aggregated messages
Why Permutation Invariance?

Graph nodes have no natural ordering. The same graph with nodes labeled [A,B,C] or [C,A,B] should produce the same output.

Sum/mean aggregation guarantees this:
sum([mA, mB, mC]) = sum([mC, mA, mB])

```python
# One message passing layer
import numpy as np

def message_passing(nodes, edges, senders, receivers, message_fn, update_fn):
    # 1. MESSAGE: one message per edge from sender state, receiver state, edge features
    messages = message_fn(nodes[senders], nodes[receivers], edges)
    # 2. AGGREGATE: sum messages at each receiver node (permutation invariant)
    aggregated = np.zeros((nodes.shape[0], messages.shape[1]))
    np.add.at(aggregated, receivers, messages)  # segment sum over receivers
    # 3. UPDATE: combine each node's old state with its aggregated messages
    return update_fn(np.concatenate([nodes, aggregated], axis=-1))
```
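To make the three steps concrete, here is a runnable toy sketch on a 3-node graph. The simple message function (sender state + edge feature) and residual update are illustrative stand-ins for the learned MLPs, and the example checks that relabeling the nodes permutes the output the same way:

```python
import numpy as np

# Toy graph: 3 nodes, 4 directed edges (j -> i), 2-dim features.
rng = np.random.default_rng(0)
nodes = rng.normal(size=(3, 2))
senders = np.array([0, 1, 2, 1])    # j for each edge
receivers = np.array([1, 2, 0, 0])  # i for each edge
edges = rng.normal(size=(4, 2))

def layer(nodes, perm=None):
    """One message passing step; `perm` relabels nodes to test invariance."""
    if perm is not None:
        nodes = nodes[perm]                 # new position k holds old node perm[k]
        inv = np.argsort(perm)              # old id -> new position
        s, r = inv[senders], inv[receivers]
    else:
        s, r = senders, receivers
    messages = nodes[s] + edges             # MESSAGE: phi(x_j, e_ji) = x_j + e_ji
    agg = np.zeros_like(nodes)
    np.add.at(agg, r, messages)             # AGGREGATE: sum at receivers
    return nodes + agg                      # UPDATE: gamma = residual add

out = layer(nodes)
perm = np.array([2, 0, 1])
out_perm = layer(nodes, perm)
# Same graph, relabeled nodes: outputs match after undoing the relabeling.
print(np.allclose(out, out_perm[np.argsort(perm)]))  # True
```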

🌀 Diffusion Models

Forward Process (Adding Noise)
Xt = √ᾱt · X0 + √(1-ᾱt) · ε
Progressive noise addition where ε ~ N(0, I).
As t → T, XT approaches pure Gaussian noise.
ᾱt = cumulative noise schedule (decreases with t)
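Because the forward process has a closed form, xt can be sampled in one shot from x0. A minimal sketch with a linear β schedule (the schedule values here are illustrative, not from any specific paper):

```python
import numpy as np

# Linear beta schedule; alpha_bar is the cumulative product, decreasing with t.
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)

def q_sample(x0, t, rng):
    """Sample xt ~ q(xt | x0) directly, without iterating t steps."""
    eps = rng.normal(size=x0.shape)
    xt = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return xt, eps

rng = np.random.default_rng(0)
x0 = rng.normal(size=(8,))
xt, eps = q_sample(x0, t=T - 1, rng=rng)
# At t near T, alpha_bar is tiny, so xt is dominated by the Gaussian noise term.
```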
Reverse Process (Denoising)
Xt-1 = ( Xt - (1-αt)/√(1-ᾱt) · εθ ) / √αt + σt·z
Neural network εθ predicts the noise.
Iteratively denoise from XT (noise) to X0 (data).
Each step removes a bit of noise.
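One reverse step can be sketched as below. The `eps_model` here is a hypothetical zero placeholder for the trained noise predictor εθ, and σt² = βt is one common variance choice:

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)

def eps_model(xt, t):
    return np.zeros_like(xt)  # placeholder for the learned noise predictor

def p_sample_step(xt, t, rng):
    # Remove the predicted noise contribution, then rescale.
    coef = (1.0 - alphas[t]) / np.sqrt(1.0 - alpha_bar[t])
    mean = (xt - coef * eps_model(xt, t)) / np.sqrt(alphas[t])
    sigma = np.sqrt(betas[t])            # common choice: sigma_t^2 = beta_t
    z = rng.normal(size=xt.shape) if t > 0 else 0.0  # no noise at the final step
    return mean + sigma * z

rng = np.random.default_rng(0)
x = rng.normal(size=(8,))                # start from pure noise x_T
for t in reversed(range(T)):
    x = p_sample_step(x, t, rng)         # iteratively denoise toward x_0
```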
Score Function
x log pt(x) = -ε / σt
The score function is the gradient of the log probability density. It points toward regions of higher density.

Key insight: Learning to predict the noise ε is equivalent to learning the score function. Moving opposite to the noise → moving toward the data distribution.
Denoising Score Matching Loss
L = 𝔼t, x0, ε[ ||εθ(xt, t) - ε||² ]
Training objective: Predict the noise that was added to the data.
Sample timestep t uniformly, add noise to get xt, train network to predict ε.
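The training loop above fits in a few lines. A sketch with a hypothetical linear "network" standing in for εθ so the loss is computable end to end:

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)

def eps_theta(xt, t, W):
    return xt @ W                        # stand-in for the noise-prediction net

def loss(x0_batch, W, rng):
    t = rng.integers(0, T, size=len(x0_batch))           # sample t uniformly
    eps = rng.normal(size=x0_batch.shape)                # target noise
    a = alpha_bar[t][:, None]
    xt = np.sqrt(a) * x0_batch + np.sqrt(1.0 - a) * eps  # forward process
    return np.mean((eps_theta(xt, t, W) - eps) ** 2)     # ||eps_hat - eps||^2

rng = np.random.default_rng(0)
x0 = rng.normal(size=(32, 8))
W = np.zeros((8, 8))
val = loss(x0, W, rng)  # with eps_theta = 0, this is roughly E[eps^2], near 1
```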

⚡ Functional Generative Networks

Conditional Layer Normalization
CLN(x, z) = γ(z) · (x - μ) / σ + β(z)
FGN's key mechanism: A 32-dimensional noise vector z conditions the normalization parameters γ and β throughout the entire network.

Same z applied globally → Forces spatial coherence. The easiest way for the model to minimize error everywhere is to learn physically consistent patterns.
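A minimal sketch of the mechanism, assuming γ and β come from small linear maps of z (the parameter names `W_g`, `b_g`, `W_b`, `b_b` are illustrative, not FGN's actual layer layout):

```python
import numpy as np

rng = np.random.default_rng(0)
d_z, d_x = 32, 16
W_g, b_g = 0.01 * rng.normal(size=(d_z, d_x)), np.ones(d_x)   # scale head
W_b, b_b = 0.01 * rng.normal(size=(d_z, d_x)), np.zeros(d_x)  # shift head

def cln(x, z, eps=1e-5):
    # Standard layer norm over features, but gamma/beta are functions of z.
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    gamma = z @ W_g + b_g
    beta = z @ W_b + b_b
    return gamma * (x - mu) / (sigma + eps) + beta

z = rng.normal(size=(d_z,))          # one shared 32-dim noise vector
x = rng.normal(size=(4, d_x))        # e.g. activations at 4 spatial locations
out = cln(x, z)                      # the same z conditions every location
```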
Why 32D → 87M dimensions works

The weather has structure! Not all 87M outputs are independent.

  • Temperature gradients are smooth
  • Pressure systems are coherent
  • Wind follows physical laws

The 32D manifold captures the modes of variation in plausible weather states.

Marginals → Joints

FGN is trained only on per-location CRPS (marginals). Yet it learns realistic joint distributions!

Why? The shared global noise z means the only way to minimize CRPS everywhere simultaneously is to output spatially coherent fields.

📊 Evaluation Metrics

CRPS (Continuous Ranked Probability Score)
CRPS = 𝔼[|X - y|] - 0.5 · 𝔼[|X - X'|]
The key metric for probabilistic forecasts.

• X, X' = independent samples from forecast distribution
• y = actual observation
• First term: expected distance to truth (accuracy)
• Second term: spread of ensemble (sharpness)
Lower is better
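The expectation can be estimated directly from an ensemble by averaging over members and member pairs. A simple sketch, with synthetic ensembles to show that an accurate, sharp forecast scores lower:

```python
import numpy as np

def crps_ensemble(members, y):
    """CRPS estimate for one observation y: E|X - y| - 0.5 * E|X - X'|."""
    members = np.asarray(members, dtype=float)
    term1 = np.mean(np.abs(members - y))                          # accuracy
    term2 = 0.5 * np.mean(np.abs(members[:, None] - members[None, :]))  # spread
    return term1 - term2

rng = np.random.default_rng(0)
good = rng.normal(loc=0.0, scale=1.0, size=200)    # centered on the truth
biased = rng.normal(loc=3.0, scale=1.0, size=200)  # systematically off
print(crps_ensemble(good, y=0.0) < crps_ensemble(biased, y=0.0))  # True
```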
RMSE (Root Mean Square Error)
RMSE = √( 𝔼[(Xpred - Xtrue)²] )
Standard metric for deterministic forecasts.

• Measures average magnitude of errors
• Penalizes large errors more than small ones
• For ensembles, typically use the ensemble mean
Lower is better
ACC (Anomaly Correlation Coefficient)
ACC = corr(Xpred - Xclim, Xtrue - Xclim)
Correlation of anomalies from climatology.
Tests if forecast captures departures from normal.
Higher is better (1.0 = perfect, 0.0 = no skill)
Brier Score (for binary events)
BS = 𝔼[(ppred - 𝕀(event))²]
For probabilistic prediction of binary events.
E.g., P(temperature > 35°C) = 0.7, actual = 1 (yes)
Lower is better (0 = perfect)
Spread-Skill Ratio
SSR = ensemble_spread / RMSE
Tests ensemble calibration.

• SSR ≈ 1.0: well-calibrated (spread matches actual error)
• SSR < 1.0: overconfident (spread too narrow)
• SSR > 1.0: underconfident (spread too wide)
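The calibration cases above can be reproduced on synthetic data. In this illustrative setup the truth carries no predictable signal, so a calibrated ensemble just needs its spread to match the error of its mean:

```python
import numpy as np

rng = np.random.default_rng(0)
n_cases, n_members = 5000, 20
truth = rng.normal(size=n_cases)

def ssr(members, truth):
    ens_mean = members.mean(axis=1)
    rmse = np.sqrt(np.mean((ens_mean - truth) ** 2))
    spread = np.sqrt(np.mean(members.var(axis=1, ddof=1)))
    return spread / rmse

calibrated = rng.normal(size=(n_cases, n_members))           # spread matches error
overconfident = 0.3 * rng.normal(size=(n_cases, n_members))  # spread too narrow
print(ssr(calibrated, truth))      # close to 1.0: well-calibrated
print(ssr(overconfident, truth))   # well below 1.0: overconfident
```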