Transformer

The Transformer is a deep learning architecture based on the Self-Attention mechanism. It has achieved breakthrough results in financial time series forecasting, sentiment analysis, and cross-asset factor modeling — making it a key component of cutting-edge quantitative strategies.


Overview

Introduced by Vaswani et al. at NeurIPS 2017, the Transformer's core innovation is Multi-Head Self-Attention:

$$\text{Attention}(Q,K,V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

Attention weights capture dependencies between any positions in a sequence without sequential hidden state propagation (unlike LSTM), providing significant advantages on long financial time series.
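The formula above can be sketched directly in NumPy. This is a minimal single-head version with no learned projections (a simplifying assumption for illustration):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V for a single head."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # pairwise similarity between positions
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V, weights

# Toy example: a sequence of 4 time steps, d_k = 8
rng = np.random.default_rng(0)
X = rng.standard_normal((4, 8))
out, w = scaled_dot_product_attention(X, X, X)      # self-attention: Q = K = V
```

Each row of `w` is a probability distribution over all positions, so any time step can attend to any other in one operation, with no recurrent state.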

Original paper: Attention Is All You Need — Vaswani et al., NeurIPS 2017


Finance-Specific Variants

| Model | Highlights | Link |
|---|---|---|
| FinBERT | Financial text sentiment analysis; BERT fine-tuned on financial corpora | ProsusAI/finbert |
| TFT (Temporal Fusion Transformer) | Multi-step time series prediction with static/dynamic features | TFT Paper |
| Informer | Optimized for long-sequence prediction, O(L log L) complexity | Informer Paper |
| PatchTST | Slices time series into patches for efficient local pattern capture | PatchTST Paper |

Applications in A-Share Quantitative Strategies

1. Financial News Sentiment Analysis (FinBERT)

FinBERT is pre-trained on large financial corpora (annual reports, research notes, news). It directly classifies A-Share announcements, exchange inquiry letters, and financial news into sentiment categories (positive/neutral/negative), outputting a sentiment score as an Alpha factor:

```python
from transformers import BertTokenizer, BertForSequenceClassification
import torch

tokenizer = BertTokenizer.from_pretrained('ProsusAI/finbert')
model     = BertForSequenceClassification.from_pretrained('ProsusAI/finbert')

inputs  = tokenizer("Revenue exceeded expectations, net profit up 30% YoY", return_tensors='pt')
outputs = model(**inputs)
probs   = torch.softmax(outputs.logits, dim=-1)  # check model.config.id2label for the class order
```
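To use the classifier output as an Alpha factor, the class probabilities must be collapsed into one scalar. A common convention is `P(positive) - P(negative)`; the sketch below uses dummy probabilities rather than real FinBERT output:

```python
def sentiment_score(p_positive, p_negative, p_neutral):
    """Scalar sentiment factor in [-1, 1]; neutral mass shrinks |score|."""
    return p_positive - p_negative

# Dummy class probabilities standing in for a softmax over FinBERT logits
score = sentiment_score(0.72, 0.08, 0.20)
```

The resulting score can then be aggregated per stock per day (e.g. averaged over all headlines) before entering the factor pipeline.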

2. Multi-Step Time Series Forecasting (TFT)

TFT supports multivariate input (OHLCV + factors + macro variables), handling both static encoded features (industry, sector) and dynamic time series features simultaneously. It outputs quantile-interval predictions — well-suited for A-Share risk management.
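The quantile-interval output is trained with the pinball (quantile) loss, one term per target quantile. A standalone sketch of that objective (not the TFT implementation itself):

```python
import numpy as np

def pinball_loss(y_true, y_pred, q):
    """Pinball loss for quantile q: asymmetric penalty on over/under-prediction."""
    diff = y_true - y_pred
    return np.mean(np.maximum(q * diff, (q - 1) * diff))

y_true = np.array([1.0, 2.0, 3.0])
# Predictions that are uniformly too high are penalized heavily at low quantiles
# and lightly at high quantiles, which is what pushes each head toward its quantile.
loss_low  = pinball_loss(y_true, y_true + 1.0, q=0.1)
loss_high = pinball_loss(y_true, y_true + 1.0, q=0.9)
```

Predicting, say, the 10%/50%/90% quantiles of next-day return yields a prediction interval that can feed position sizing and risk limits directly.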

3. Cross-Asset Attention

Use a Transformer encoder to apply Cross-Attention across multiple stocks' features at the same timestep, capturing intra-industry co-movement effects (sector leaders driving peers). This forms the basis of graph-attention stock selection models.
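A minimal sketch of this idea: at a single timestep, let every stock attend to every other stock's feature vector. The single-head, projection-free setup and the shapes are simplifying assumptions for illustration:

```python
import numpy as np

def cross_sectional_attention(X):
    """X: (n_stocks, n_features). Each stock's output mixes in features of related stocks."""
    d = X.shape[-1]
    scores = X @ X.T / np.sqrt(d)                # stock-to-stock similarity at this timestep
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)           # each row: attention weights over the universe
    return w @ X, w

rng = np.random.default_rng(1)
X = rng.standard_normal((30, 16))                # 30 stocks, 16 features each
mixed, weights = cross_sectional_attention(X)
```

High attention weights between stocks in the same industry are exactly the co-movement channel described above; a trained model learns these weights rather than computing them from raw features.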

4. Alpha Factor Sequence Modeling

Feed the last 60 days of multi-factor cross-sectional data as a sequence into a Transformer. Self-attention discovers how factor predictive power strengthens or decays over time, generating dynamic factor weights.
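One way to sketch this is attention pooling over the 60-day window followed by a normalization into weights. The query vector here is random for illustration; in a real model it (and any projections) would be learned:

```python
import numpy as np

def dynamic_factor_weights(F, query):
    """F: (60, n_factors) daily factor scores; query: (n_factors,) learned in practice."""
    scores = F @ query / np.sqrt(F.shape[-1])    # relevance of each of the 60 days
    a = np.exp(scores - scores.max())
    a /= a.sum()                                 # attention distribution over days
    pooled = a @ F                               # time-weighted factor profile
    w = np.exp(pooled) / np.exp(pooled).sum()    # softmax into positive weights summing to 1
    return w

rng = np.random.default_rng(2)
F = rng.standard_normal((60, 5))                 # 60 trading days x 5 factors
w = dynamic_factor_weights(F, rng.standard_normal(5))
```

Days where a factor's predictive power was strong receive higher attention, so the output weights adapt as factor efficacy strengthens or decays.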


Core Concepts

| Concept | Description |
|---|---|
| Positional Encoding | Transformers have no inherent position awareness — sine/cosine encodings are added |
| Multi-Head Attention | Multiple Q/K/V groups in parallel, capturing different subspace dependencies |
| Dropout | p=0.1–0.3, prevents overfitting on limited financial data |
| Layer Norm | Per-layer normalization, stabilizes scale differences across financial variables |
| d_model | Hidden dimension — 64–256 is typically sufficient for financial sequences |
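The sinusoidal positional encoding from the original paper can be written directly from its definition, PE[pos, 2i] = sin(pos / 10000^(2i/d_model)) and PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model)):

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encoding matrix of shape (seq_len, d_model); d_model even."""
    pos = np.arange(seq_len)[:, None]                  # (seq_len, 1)
    i = np.arange(0, d_model, 2)[None, :]              # even dimension indices
    angles = pos / np.power(10000.0, i / d_model)      # (seq_len, d_model/2)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = positional_encoding(60, 64)   # e.g. 60 trading days, d_model = 64
```

The encoding is simply added to the input embeddings, giving the otherwise order-blind attention layers a notion of where each day sits in the window.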

Strengths & Limitations

Strengths:

  • Self-attention naturally captures long-range price and factor dependencies — outperforms LSTM/GRU
  • Pre-trained models (FinBERT) require no training from scratch — low transfer learning cost
  • Parallel training structure is faster than RNN-based models

Limitations:

  • Large parameter counts — high overfitting risk given limited A-Share history
  • Requires GPU for training — inference latency is much higher than that of tree-based models
  • Low interpretability — factor importance is less intuitive than XGBoost/LightGBM
