Transformer
The Transformer is a deep learning architecture based on the Self-Attention mechanism. It has achieved breakthrough results in financial time series forecasting, sentiment analysis, and cross-asset factor modeling — making it a key component of cutting-edge quantitative strategies.
Overview
Introduced by Vaswani et al. at NeurIPS 2017, the Transformer's core innovation is Multi-Head Self-Attention:
$$\text{Attention}(Q,K,V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$
Attention weights capture dependencies between any positions in a sequence without sequential hidden state propagation (unlike LSTM), providing significant advantages on long financial time series.
Original paper: Attention Is All You Need — Vaswani et al., NeurIPS 2017
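The scaled dot-product attention formula above can be sketched directly in PyTorch; this is a minimal illustration of the equation, not the full multi-head implementation:

```python
import math
import torch

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)  # (..., L_q, L_k)
    weights = torch.softmax(scores, dim=-1)            # each row sums to 1
    return weights @ V

# Toy example: a 5-step sequence with d_k = 8
Q = K = V = torch.randn(5, 8)
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # torch.Size([5, 8])
```

The `1/sqrt(d_k)` scaling keeps the dot products from growing with dimension, which would otherwise push the softmax into regions with vanishing gradients.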
Finance-Specific Variants
| Model | Highlights | Link |
|---|---|---|
| FinBERT | Financial text sentiment analysis, BERT fine-tuned on financial corpora | ProsusAI/finbert |
| TFT (Temporal Fusion Transformer) | Multi-step time series prediction with static/dynamic features | TFT Paper |
| Informer | Optimized for long-sequence prediction, O(L log L) complexity | Informer Paper |
| PatchTST | Slices time series into patches for efficient local pattern capture | PatchTST Paper |
Applications in A-Share Quantitative Strategies
1. Financial News Sentiment Analysis (FinBERT)
FinBERT is pre-trained on large financial corpora (annual reports, research notes, news). It directly classifies A-Share announcements, exchange inquiry letters, and financial news into sentiment categories (positive/neutral/negative), outputting a sentiment score as an Alpha factor:
```python
from transformers import BertTokenizer, BertForSequenceClassification
import torch

tokenizer = BertTokenizer.from_pretrained('ProsusAI/finbert')
model = BertForSequenceClassification.from_pretrained('ProsusAI/finbert')
model.eval()

inputs = tokenizer("Revenue exceeded expectations, net profit up 30% YoY",
                   return_tensors='pt')
with torch.no_grad():
    outputs = model(**inputs)
probs = torch.softmax(outputs.logits, dim=-1)  # label order: [positive, negative, neutral]; verify via model.config.id2label
```

2. Multi-Step Time Series Forecasting (TFT)
TFT supports multivariate input (OHLCV + factors + macro variables), handling both static encoded features (industry, sector) and dynamic time series features simultaneously. It outputs quantile-interval predictions — well-suited for A-Share risk management.
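TFT's quantile-interval output is trained with the quantile (pinball) loss. A generic sketch of that loss, independent of any particular TFT library, with illustrative quantiles of 0.1/0.5/0.9:

```python
import torch

def pinball_loss(y_pred, y_true, quantiles=(0.1, 0.5, 0.9)):
    """Quantile (pinball) loss for quantile forecasters such as TFT.

    y_pred: (batch, n_quantiles), one column per predicted quantile
    y_true: (batch,) realized values
    """
    losses = []
    for i, q in enumerate(quantiles):
        err = y_true - y_pred[:, i]
        # Under-prediction is penalized by q, over-prediction by (1 - q)
        losses.append(torch.max(q * err, (q - 1) * err).mean())
    return torch.stack(losses).sum()

y_true = torch.tensor([1.0, 2.0])
y_pred = torch.tensor([[0.5, 1.0, 1.5],
                       [1.5, 2.0, 2.5]])
loss = pinball_loss(y_pred, y_true)
print(loss.item())  # 0.1
```

Minimizing this loss makes each output column converge to the corresponding conditional quantile, which is what yields the prediction intervals used for risk management.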
3. Cross-Asset Attention
Use a Transformer encoder to apply Cross-Attention across multiple stocks' features at the same timestep, capturing intra-industry co-movement effects (sector leaders driving peers). This forms the basis of graph-attention stock selection models.
4. Alpha Factor Sequence Modeling
Feed the last 60 days of multi-factor cross-sectional data as a sequence into a Transformer. Self-attention discovers how factor predictive power strengthens or decays over time, generating dynamic factor weights.
Core Concepts
| Concept | Description |
|---|---|
| Positional Encoding | Transformers have no inherent position awareness — sine/cosine encodings are added |
| Multi-Head Attention | Multiple Q/K/V groups in parallel, capturing different subspace dependencies |
| Dropout | p=0.1–0.3, prevents overfitting on limited financial data |
| Layer Norm | Per-layer normalization, stabilizes scale differences across financial variables |
| d_model | Hidden dimension, typically 64–256 is sufficient for financial sequences |
Strengths & Limitations
Strengths:
- Self-attention naturally captures long-range price and factor dependencies — outperforms LSTM/GRU
- Pre-trained models (FinBERT) require no training from scratch — low transfer learning cost
- Fully parallel training across sequence positions, so training is faster than RNN-based models
Limitations:
- Large parameter counts — high overfitting risk given limited A-Share history
- Requires a GPU for training, and inference latency is much higher than that of tree-based models
- Low interpretability; factor importance is less intuitive than in XGBoost/LightGBM
