
Lesson 015 - Temporal Fusion Transformer (TFT): Multi-Horizon Power Forecasting with Attention


Previous: Lesson 014 - LSTM Forecasting with MC Dropout | Next: Lesson 016 - Ensemble Forecasting, Ramp Detection and Model Evaluation


Date: 2026-02-26 Phase: P4 (AI Forecasting) Roadmap sections: [Phase 4 - Section 5.6 TFT Model, Section 5.7 Quantile Regression, Section 5.10 Attention] Language: English Previous lesson: Lesson 014


What You Will Learn

  • Why different forecast horizons such as 1 h, 6 h, 24 h, and 48 h require different modelling strategies
  • The four core components of the Temporal Fusion Transformer architecture: GRN, VSN, Multi-Head Attention, and Quantile Outputs
  • How the attention mechanism answers the question, "which historical time steps influenced the forecast?"
  • Why native quantile regression with pinball loss can produce P10, P50, and P90 directly, without relying on MC Dropout
  • How the Variable Selection Network provides built-in feature importance without a separate SHAP workflow

Section 1: Multi-Horizon Forecasting - Why One Model Is Not Enough

Real-World Problem

A storm front is approaching across the Baltic Sea. The transmission system operator, PSE, must make different decisions at different forecast horizons:

| Horizon | Typical decision | Dominant information source |
|---|---|---|
| 1-6 hours | Balancing-market actions | SCADA autocorrelation |
| 6-24 hours | Day-ahead market bidding | NWP synoptic forecast |
| 24-48 hours | Maintenance planning | Regime changes and weather-system evolution |

XGBoost treats each row independently, which makes it strong at short horizons when carefully engineered lag features exist. LSTM captures temporal structure well, especially at medium horizons. However, neither architecture is designed to decide explicitly which past features and which past time steps matter most for each forecast horizon. TFT addresses exactly that gap.

What the Standards and Literature Say

For uncertainty-aware wind-power forecasting, two families of methods are common in this repository:

  1. MC Dropout - used with LSTM in the previous lesson to estimate uncertainty from repeated stochastic forward passes.
  2. Quantile regression - used in XGBoost and TFT to predict P10, P50, and P90 directly through the loss function.

The key academic reference is Lim et al. (2021), Temporal Fusion Transformers for Interpretable Multi-horizon Time Series Forecasting. The practical advantage of TFT is that interpretability and probabilistic forecasting are embedded into the architecture itself, rather than added as a post-processing step.


Section 2: TFT Architecture - Four Core Building Blocks

2.1 Gated Residual Network (GRN)

The Gated Residual Network is the basic nonlinear processing block inside TFT.

eta1 = W1 x + b1
eta2 = W2 · ELU(eta1) + b2
GRN(x) = LayerNorm(x + GLU(eta2))

The two important concepts are:

  • GLU (Gated Linear Unit) controls how much information is allowed to pass forward.
  • Residual / skip connection preserves gradient flow and helps training remain stable in deeper architectures.

In engineering terms, the GRN behaves like an adaptive gate. If a particular transformed signal is not useful, the network can suppress it instead of letting noise propagate through the model.
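As an illustration, the three equations above can be sketched in NumPy. This is a simplified sketch only: the weight shapes, the doubled GLU input dimension, and the absence of dropout are my assumptions, not the repository's implementation.

```python
import numpy as np

def elu(x, alpha=1.0):
    return np.where(x > 0, x, alpha * (np.exp(x) - 1))

def glu(x):
    # Gated Linear Unit: split the last axis into a value half and a gate half.
    a, b = np.split(x, 2, axis=-1)
    return a / (1 + np.exp(-b))  # a * sigmoid(b)

def layer_norm(x, eps=1e-6):
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sigma + eps)

def grn(x, W1, b1, W2, b2):
    # eta1 = W1 x + b1 ; eta2 = W2 ELU(eta1) + b2 ; GRN = LayerNorm(x + GLU(eta2))
    eta1 = x @ W1.T + b1
    eta2 = elu(eta1) @ W2.T + b2
    return layer_norm(x + glu(eta2))

rng = np.random.default_rng(0)
d = 8
x = rng.normal(size=(4, d))
W1 = rng.normal(size=(d, d)); b1 = np.zeros(d)
W2 = rng.normal(size=(2 * d, d)); b2 = np.zeros(2 * d)  # GLU needs 2*d inputs
out = grn(x, W1, b1, W2, b2)
print(out.shape)  # (4, 8): same shape as the input, thanks to the residual path
```

Note how the output shape matches the input shape, which is what makes the residual (skip) connection possible.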

2.2 Variable Selection Network (VSN)

The Variable Selection Network learns the relative importance of each input feature.

v_j = GRN_j(xi_j)
weights = Softmax(GRN_w(xi))
VSN(xi) = Sum_j weights_j × v_j

This means the model does not merely receive all features equally. It learns which variables matter most for a given forecasting context.

For offshore wind forecasting, that often means:

  • at a 1 h horizon, recent power and recent wind-speed lags dominate,
  • at a 24 h horizon, NWP wind speed and cyclical calendar features such as hour-of-day become more important.
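The weighted-sum formulation above can be sketched as follows. The per-variable transform and the logit source are stand-ins for the GRNs in the real model; the feature count of 19 matches the architecture summary later in this lesson, but everything else here is illustrative.

```python
import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def variable_selection(xi, per_var_transform, weight_logits):
    # xi: (n_features, d) embedded inputs.
    # v_j: transformed representation of each variable (GRN_j in the real model).
    v = np.stack([per_var_transform(xi[j]) for j in range(xi.shape[0])])
    w = softmax(weight_logits)                 # (n_features,), non-negative, sums to 1
    return (w[:, None] * v).sum(axis=0), w    # weighted sum + importance scores

rng = np.random.default_rng(1)
xi = rng.normal(size=(19, 16))        # 19 features, embedding dim 16
logits = rng.normal(size=19)          # would come from GRN_w(xi) in the real model
fused, w = variable_selection(xi, np.tanh, logits)
print(fused.shape)  # (16,) — one fused embedding, plus 19 importance weights
```

Because the softmax weights are non-negative and sum to one, they can be read directly as per-feature importance scores, which is the "built-in feature importance" mentioned in the learning goals.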

2.3 Multi-Head Attention

The attention mechanism follows the transformer formulation introduced by Vaswani et al. (2017):

Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) · V

Interpretation:

  • Query: what information is currently needed,
  • Key: how each historical step is represented,
  • Value: the information stored in each historical step.

The practical benefit is interpretability. Attention weights can be extracted and visualised to show which historical periods influenced the forecast most strongly.
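The formula translates almost line for line into NumPy. The single query attending over a 72-step history mirrors the lookback length used later in this lesson, but the dimensions here are purely illustrative.

```python
import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # (n_queries, n_keys)
    weights = softmax(scores, axis=-1)  # each query's weights sum to 1
    return weights @ V, weights

rng = np.random.default_rng(2)
Q = rng.normal(size=(1, 16))    # one forecast-time query
K = rng.normal(size=(72, 16))   # 72 historical steps (the lookback window)
V = rng.normal(size=(72, 16))
ctx, w = scaled_dot_product_attention(Q, K, V)
print(ctx.shape, w.shape)  # (1, 16) context vector, (1, 72) attention weights
```

The `w` array is exactly what gets visualised for interpretability: a 72-entry importance profile over the history.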

2.4 Quantile Output Heads

The output layer contains separate heads for each requested quantile.

self.quantile_heads = nn.ModuleList([
    nn.Linear(hidden_size, 1),  # P10
    nn.Linear(hidden_size, 1),  # P50
    nn.Linear(hidden_size, 1),  # P90
])

Training uses pinball loss:

L_tau(y, y_hat) = tau × max(y-y_hat, 0) + (1-tau) × max(y_hat-y, 0)
L_total = L_0.10 + L_0.50 + L_0.90

This gives a direct probabilistic forecast rather than estimating a mean first and uncertainty later.
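A minimal NumPy version of the pinball loss makes the asymmetry concrete. At tau = 0.9, under-predicting by 2 MW costs nine times as much as over-predicting by 2 MW, which is what pushes the P90 head toward the upper tail. (Sketch only; the numbers are illustrative.)

```python
import numpy as np

def pinball_loss(y, y_hat, tau):
    # L_tau = tau * max(y - y_hat, 0) + (1 - tau) * max(y_hat - y, 0)
    diff = y - y_hat
    return np.mean(np.maximum(tau * diff, (tau - 1) * diff))

y = np.array([10.0])
print(pinball_loss(y, np.array([8.0]), 0.9))   # under-prediction: ~0.9 * 2 = 1.8
print(pinball_loss(y, np.array([12.0]), 0.9))  # over-prediction:  ~0.1 * 2 = 0.2
```

Summing this loss over tau in {0.10, 0.50, 0.90} gives the L_total used above, so all three quantile heads are trained jointly.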


Section 3: What We Built

New Files

  • backend/app/services/p4/tft_model.py - full TFT implementation with GRN, VSN, attention, training, prediction, and attention extraction
  • backend/tests/test_tft_model.py - unit and integration tests for the TFT forecasting workflow

Updated Files

  • backend/app/services/p4/__init__.py - module exports
  • backend/app/schemas/forecast.py - TFT request and response schemas
  • backend/app/routers/p4.py - new endpoints for training, inference, and attention inspection

Architecture Summary

Input (batch, lookback=72, n_features=19)
  -> Variable Selection Network
  -> LSTM encoder
  -> Multi-Head Attention
  -> Gated Residual Network
  -> Quantile heads (P10, P50, P90)

API Endpoints

| Endpoint | Purpose |
|---|---|
| POST /api/v1/forecast/train-tft | Train TFT with TimeSeriesSplit validation |
| POST /api/v1/forecast/predict-tft | Return probabilistic power forecast |
| POST /api/v1/forecast/tft-attention | Return attention weights and variable-selection scores |

Section 4: Comparing XGBoost, LSTM, and TFT

| Attribute | XGBoost | LSTM | TFT |
|---|---|---|---|
| Architecture | Gradient-boosted trees | Recurrent neural network | LSTM + transformer-style attention |
| Best forecast window | < 6 h | 6-24 h | 12-48 h |
| Uncertainty method | Quantile regression | MC Dropout | Native quantile heads |
| Explainability | SHAP | Limited | Attention + VSN |
| Training speed | Fastest | Moderate | Slowest |
| Data requirement | Lowest | Medium | Highest |

Ensemble Strategy

A practical system often combines all three:

< 6 h:   0.50 × XGBoost + 0.30 × LSTM + 0.20 × TFT
6-24 h:  0.20 × XGBoost + 0.40 × LSTM + 0.40 × TFT
24-48 h: 0.10 × XGBoost + 0.30 × LSTM + 0.60 × TFT

This reflects the reality that no single model dominates every horizon equally well.
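The horizon-dependent blending above might be wired as follows. The dictionary keys, function name, and horizon boundaries are hypothetical illustrations of the weight table, not the repository's API.

```python
# Horizon-dependent ensemble weights, taken from the table above.
ENSEMBLE_WEIGHTS = {
    "lt6h":   {"xgboost": 0.50, "lstm": 0.30, "tft": 0.20},
    "6_24h":  {"xgboost": 0.20, "lstm": 0.40, "tft": 0.40},
    "24_48h": {"xgboost": 0.10, "lstm": 0.30, "tft": 0.60},
}

def blend(preds_mw: dict, horizon_h: float) -> float:
    # Pick the weight set for the requested horizon, then form the weighted sum.
    key = "lt6h" if horizon_h < 6 else "6_24h" if horizon_h <= 24 else "24_48h"
    weights = ENSEMBLE_WEIGHTS[key]
    return sum(weights[m] * preds_mw[m] for m in weights)

preds = {"xgboost": 10.0, "lstm": 12.0, "tft": 11.0}
print(blend(preds, horizon_h=3))   # ~10.8 MW: short horizon leans on XGBoost
print(blend(preds, horizon_h=36))  # ~11.2 MW: long horizon leans on TFT
```

In practice these weights would themselves be validated per site, but the structure stays the same: one weight vector per horizon band, each summing to 1.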


Section 5: Physical Constraints

TFT outputs pass through the same physical-constraint layer used by the other forecasting models:

  1. P >= 0 MW
  2. P <= 15.0 MW
  3. wind speed < 3.0 m/s -> P = 0
  4. wind speed > 31.0 m/s -> P = 0
  5. P10 <= P50 <= P90

These rules are not optional. They preserve engineering realism regardless of model architecture.
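The five rules can be applied as a single post-processing function. This is a sketch: the plant limits follow the values above, while the function name and the sort-based fix for rule 5 are my assumptions about one reasonable implementation.

```python
import numpy as np

P_MIN, P_MAX = 0.0, 15.0      # MW limits (rules 1-2)
CUT_IN, CUT_OUT = 3.0, 31.0   # m/s cut-in / cut-out (rules 3-4)

def apply_constraints(p10, p50, p90, wind_speed):
    q = np.stack([p10, p50, p90])
    q = np.clip(q, P_MIN, P_MAX)                            # rules 1-2
    off = (wind_speed < CUT_IN) | (wind_speed > CUT_OUT)
    q[:, off] = 0.0                                         # rules 3-4
    q = np.sort(q, axis=0)                                  # rule 5: P10 <= P50 <= P90
    return q[0], q[1], q[2]

wind = np.array([2.0, 10.0, 35.0])
p10 = np.array([1.0, 6.0, 5.0])
p50 = np.array([2.0, 5.0, 6.0])    # crossed quantiles at index 1, on purpose
p90 = np.array([3.0, 9.0, 20.0])   # exceeds 15 MW at index 2
q10, q50, q90 = apply_constraints(p10, p50, p90, wind)
print(q10, q50, q90)  # indices 0 and 2 are forced to zero; index 1 is re-sorted
```

Sorting the three quantiles per time step is a blunt but safe way to enforce monotonicity; a smoother alternative is to train with a non-crossing penalty, but the hard sort guarantees the invariant at serving time.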


Section 6: Test Coverage

The TFT module is covered by unit and integration tests across the following groups:

| Test group | Focus |
|---|---|
| TestGRN | shape and residual behaviour |
| TestVariableSelection | valid weights and output shape |
| TestMultiHeadAttention | attention storage and normalisation |
| TestTFTTraining | training completion, CV folds, early stopping |
| TestTFTPrediction | monotonic quantiles and valid outputs |
| TestTFTPhysicalConstraints | cut-in and cut-out enforcement |
| TestTFTAttention | attention dimensions and feature labels |

Interview Questions

Question 1: How is TFT different from XGBoost and LSTM?

Simple answer: XGBoost is strong on tabular short-horizon forecasting, LSTM is strong on sequence learning, and TFT combines sequential learning, built-in feature selection, attention, and direct quantile prediction in one architecture.

Technical answer: TFT uses Variable Selection Networks for feature-wise gating, recurrent encoding for temporal context, Multi-Head Attention for long-range dependencies, and native quantile heads for direct P10/P50/P90 prediction. It therefore combines interpretability and probabilistic forecasting more natively than either XGBoost or a standard LSTM.

Question 2: Why use pinball loss instead of MC Dropout?

Simple answer: Pinball loss teaches the model to predict a requested quantile directly, while MC Dropout estimates uncertainty indirectly from repeated stochastic predictions.

Technical answer: Pinball loss is asymmetric and quantile-specific. For example, at tau = 0.9, under-prediction is penalised much more than over-prediction. This makes the network learn the desired quantile directly, without assuming a Gaussian form for forecast uncertainty.

Question 3: How should attention weights be interpreted?

Simple answer: They show which historical time steps the model relied on most heavily when producing a forecast.

Technical answer: Attention scores provide a time-indexed importance map. In a wind-power system, high weight on a historical block may indicate that the model is tracking the onset of a weather front, a diurnal cycle, or another persistent temporal regime.


Explain It Simply

Today we added a forecasting model that can look far back in time and decide which past signals matter most for predicting future wind-farm power. Instead of only saying, "this is the expected value," it can also give a pessimistic and optimistic range, which is exactly what operators and traders need when risk matters.

Explain It Technically

In this lesson we introduced a Temporal Fusion Transformer pipeline for multi-horizon offshore wind-power forecasting. The implementation combines variable selection, recurrent temporal encoding, attention-based interpretability, and direct quantile outputs in one model family. Compared with the previous XGBoost and LSTM implementations, TFT is the most expressive and the most computationally demanding, but it is also the most aligned with long-horizon probabilistic forecasting and feature-importance inspection.