🧠 Graph Neural Networks

Message Passing Neural Network (MPNN)
xi(k) = γ(k)( xi(k-1), ⊕j∈N(i) φ(k)(xi(k-1), xj(k-1), eji) )
Where:
  • xi(k) = embedding of node i at layer k
  • N(i) = neighbors of node i
  • φ = message function (typically MLP)
  • ⊕ = permutation-invariant aggregation (sum, mean, max)
  • γ = update function (typically MLP)
  • eji = edge features from j to i
Three Steps of Message Passing
  1. MESSAGE: Each neighbor j computes a message for node i using φ(xi, xj, eji)
  2. AGGREGATE: Combine all incoming messages using ⊕ (sum/mean). Must be permutation-invariant!
  3. UPDATE: Node i updates its representation using γ, combining its old state with aggregated messages
Why Permutation Invariance?

Graph nodes have no natural ordering. The same graph with nodes labeled [A,B,C] or [C,A,B] should produce the same output.

Sum/mean aggregation guarantees this:
sum([mA, mB, mC]) = sum([mC, mA, mB])

```python
# One message passing layer
import numpy as np

def message_passing(nodes, edges, senders, receivers, message_fn, update_fn):
    # 1. MESSAGE: one message per edge from sender state, receiver state, edge features
    messages = message_fn(nodes[senders], nodes[receivers], edges)
    # 2. AGGREGATE: sum messages at each receiver node (permutation invariant)
    aggregated = np.zeros((nodes.shape[0], messages.shape[1]))
    np.add.at(aggregated, receivers, messages)  # segment sum over receivers
    # 3. UPDATE: combine each node's old state with its aggregated messages
    return update_fn(np.concatenate([nodes, aggregated], axis=-1))
```
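To make the three steps concrete, here is a runnable toy sketch on a 3-node graph. The simple message function (sender state + edge feature) and residual update are illustrative stand-ins for the learned MLPs, and the example checks that relabeling the nodes permutes the output the same way:

```python
import numpy as np

# Toy graph: 3 nodes, 4 directed edges (j -> i), 2-dim features.
rng = np.random.default_rng(0)
nodes = rng.normal(size=(3, 2))
senders = np.array([0, 1, 2, 1])    # j for each edge
receivers = np.array([1, 2, 0, 0])  # i for each edge
edges = rng.normal(size=(4, 2))

def layer(nodes, perm=None):
    """One message passing step; `perm` relabels nodes to test invariance."""
    if perm is not None:
        nodes = nodes[perm]                 # new position k holds old node perm[k]
        inv = np.argsort(perm)              # old id -> new position
        s, r = inv[senders], inv[receivers]
    else:
        s, r = senders, receivers
    messages = nodes[s] + edges             # MESSAGE: phi(x_j, e_ji) = x_j + e_ji
    agg = np.zeros_like(nodes)
    np.add.at(agg, r, messages)             # AGGREGATE: sum at receivers
    return nodes + agg                      # UPDATE: gamma = residual add

out = layer(nodes)
perm = np.array([2, 0, 1])
out_perm = layer(nodes, perm)
# Same graph, relabeled nodes: outputs match after undoing the relabeling.
print(np.allclose(out, out_perm[np.argsort(perm)]))  # True
```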

🌀 Diffusion Models

Forward Process (Adding Noise)
Xt = √ᾱt · X0 + √(1-ᾱt) · ε
Progressive noise addition where ε ~ N(0, I).
As t → T, XT approaches pure Gaussian noise.
ᾱt = cumulative noise schedule (decreases with t)
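Because the forward process has a closed form, xt can be sampled in one shot from x0. A minimal sketch with a linear β schedule (the schedule values here are illustrative, not from any specific paper):

```python
import numpy as np

# Linear beta schedule; alpha_bar is the cumulative product, decreasing with t.
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)

def q_sample(x0, t, rng):
    """Sample xt ~ q(xt | x0) directly, without iterating t steps."""
    eps = rng.normal(size=x0.shape)
    xt = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return xt, eps

rng = np.random.default_rng(0)
x0 = rng.normal(size=(8,))
xt, eps = q_sample(x0, t=T - 1, rng=rng)
# At t near T, alpha_bar is tiny, so xt is dominated by the Gaussian noise term.
```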
Reverse Process (Denoising)
Xt-1 = ( Xt - (1-αt)/√(1-ᾱt) · εθ ) / √αt + σt·z
Neural network εθ predicts the noise.
Iteratively denoise from XT (noise) to X0 (data).
Each step removes a bit of noise.
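One reverse step can be sketched as below. The `eps_model` here is a hypothetical zero placeholder for the trained noise predictor εθ, and σt² = βt is one common variance choice:

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)

def eps_model(xt, t):
    return np.zeros_like(xt)  # placeholder for the learned noise predictor

def p_sample_step(xt, t, rng):
    # Remove the predicted noise contribution, then rescale.
    coef = (1.0 - alphas[t]) / np.sqrt(1.0 - alpha_bar[t])
    mean = (xt - coef * eps_model(xt, t)) / np.sqrt(alphas[t])
    sigma = np.sqrt(betas[t])            # common choice: sigma_t^2 = beta_t
    z = rng.normal(size=xt.shape) if t > 0 else 0.0  # no noise at the final step
    return mean + sigma * z

rng = np.random.default_rng(0)
x = rng.normal(size=(8,))                # start from pure noise x_T
for t in reversed(range(T)):
    x = p_sample_step(x, t, rng)         # iteratively denoise toward x_0
```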
Score Function
x log pt(x) = -ε / σt
The score function is the gradient of the log probability density. It points toward regions of higher density.

Key insight: Learning to predict the noise ε is equivalent to learning the score function. Moving opposite to the noise → moving toward the data distribution.
Denoising Score Matching Loss
L = 𝔼t, x0, ε[ ||εθ(xt, t) - ε||² ]
Training objective: Predict the noise that was added to the data.
Sample timestep t uniformly, add noise to get xt, train network to predict ε.
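The training loop above fits in a few lines. A sketch with a hypothetical linear "network" standing in for εθ so the loss is computable end to end:

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)

def eps_theta(xt, t, W):
    return xt @ W                        # stand-in for the noise-prediction net

def loss(x0_batch, W, rng):
    t = rng.integers(0, T, size=len(x0_batch))           # sample t uniformly
    eps = rng.normal(size=x0_batch.shape)                # target noise
    a = alpha_bar[t][:, None]
    xt = np.sqrt(a) * x0_batch + np.sqrt(1.0 - a) * eps  # forward process
    return np.mean((eps_theta(xt, t, W) - eps) ** 2)     # ||eps_hat - eps||^2

rng = np.random.default_rng(0)
x0 = rng.normal(size=(32, 8))
W = np.zeros((8, 8))
val = loss(x0, W, rng)  # with eps_theta = 0, this is roughly E[eps^2], near 1
```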

⚡ Functional Generative Networks

Conditional Layer Normalization
CLN(x, z) = γ(z) · (x - μ) / σ + β(z)
FGN's key mechanism: A 32-dimensional noise vector z conditions the normalization parameters γ and β throughout the entire network.

Same z applied globally → Forces spatial coherence. The easiest way for the model to minimize error everywhere is to learn physically consistent patterns.
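A minimal sketch of the mechanism, assuming γ and β come from small linear maps of z (the parameter names `W_g`, `b_g`, `W_b`, `b_b` are illustrative, not FGN's actual layer layout):

```python
import numpy as np

rng = np.random.default_rng(0)
d_z, d_x = 32, 16
W_g, b_g = 0.01 * rng.normal(size=(d_z, d_x)), np.ones(d_x)   # scale head
W_b, b_b = 0.01 * rng.normal(size=(d_z, d_x)), np.zeros(d_x)  # shift head

def cln(x, z, eps=1e-5):
    # Standard layer norm over features, but gamma/beta are functions of z.
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    gamma = z @ W_g + b_g
    beta = z @ W_b + b_b
    return gamma * (x - mu) / (sigma + eps) + beta

z = rng.normal(size=(d_z,))          # one shared 32-dim noise vector
x = rng.normal(size=(4, d_x))        # e.g. activations at 4 spatial locations
out = cln(x, z)                      # the same z conditions every location
```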
Why 32D → 87M dimensions works

The weather has structure! Not all 87M outputs are independent.

  • Temperature gradients are smooth
  • Pressure systems are coherent
  • Wind follows physical laws

The 32D manifold captures the modes of variation in plausible weather states.

Marginals → Joints

FGN is trained only on per-location CRPS (marginals). Yet it learns realistic joint distributions!

Why? The shared global noise z means the only way to minimize CRPS everywhere simultaneously is to output spatially coherent fields.

📊 Evaluation Metrics

CRPS (Continuous Ranked Probability Score)
CRPS = 𝔼[|X - y|] - 0.5 · 𝔼[|X - X'|]
The key metric for probabilistic forecasts.

• X, X' = independent samples from forecast distribution
• y = actual observation
• First term: expected distance to truth (accuracy)
• Second term: spread of ensemble (sharpness)
Lower is better
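The expectation can be estimated directly from an ensemble by averaging over members and member pairs. A simple sketch, with synthetic ensembles to show that an accurate, sharp forecast scores lower:

```python
import numpy as np

def crps_ensemble(members, y):
    """CRPS estimate for one observation y: E|X - y| - 0.5 * E|X - X'|."""
    members = np.asarray(members, dtype=float)
    term1 = np.mean(np.abs(members - y))                          # accuracy
    term2 = 0.5 * np.mean(np.abs(members[:, None] - members[None, :]))  # spread
    return term1 - term2

rng = np.random.default_rng(0)
good = rng.normal(loc=0.0, scale=1.0, size=200)    # centered on the truth
biased = rng.normal(loc=3.0, scale=1.0, size=200)  # systematically off
print(crps_ensemble(good, y=0.0) < crps_ensemble(biased, y=0.0))  # True
```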
RMSE (Root Mean Square Error)
RMSE = √( 𝔼[(Xpred - Xtrue)²] )
Standard metric for deterministic forecasts.

• Measures average magnitude of errors
• Penalizes large errors more than small ones
• For ensembles, typically use the ensemble mean
Lower is better
ACC (Anomaly Correlation Coefficient)
ACC = corr(Xpred - Xclim, Xtrue - Xclim)
Correlation of anomalies from climatology.
Tests if forecast captures departures from normal.
Higher is better (1.0 = perfect, 0.0 = no skill)
Brier Score (for binary events)
BS = 𝔼[(ppred - 𝕀(event))²]
For probabilistic prediction of binary events.
E.g., P(temperature > 35°C) = 0.7, actual = 1 (yes)
Lower is better (0 = perfect)
Spread-Skill Ratio
SSR = ensemble_spread / RMSE
Tests ensemble calibration.

• SSR ≈ 1.0: well-calibrated (spread matches actual error)
• SSR < 1.0: overconfident (spread too narrow)
• SSR > 1.0: underconfident (spread too wide)
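The calibration cases above can be reproduced on synthetic data. In this illustrative setup the truth carries no predictable signal, so a calibrated ensemble just needs its spread to match the error of its mean:

```python
import numpy as np

rng = np.random.default_rng(0)
n_cases, n_members = 5000, 20
truth = rng.normal(size=n_cases)

def ssr(members, truth):
    ens_mean = members.mean(axis=1)
    rmse = np.sqrt(np.mean((ens_mean - truth) ** 2))
    spread = np.sqrt(np.mean(members.var(axis=1, ddof=1)))
    return spread / rmse

calibrated = rng.normal(size=(n_cases, n_members))           # spread matches error
overconfident = 0.3 * rng.normal(size=(n_cases, n_members))  # spread too narrow
print(ssr(calibrated, truth))      # close to 1.0: well-calibrated
print(ssr(overconfident, truth))   # well below 1.0: overconfident
```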