ExtraTrees
Extremely Randomized Trees (ExtraTrees) is an enhanced variant of Random Forest that introduces complete randomness in split thresholds, further reducing variance and increasing training speed.
Overview
Proposed by Geurts, Ernst, and Wehenkel (2006), ExtraTrees differs from Random Forest in one critical way:
- Random Forest: Searches over candidate split thresholds for each sampled feature and selects the optimal one
- ExtraTrees: Draws one random threshold per sampled feature and keeps the best of those random candidates; no exhaustive search for optimality
This extreme randomization yields lower variance at the cost of a small increase in bias. In practice, ExtraTrees often trains faster and generalizes better on high-dimensional financial datasets.
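The difference is easiest to see side by side. Below is a minimal sketch using scikit-learn's `ExtraTreesClassifier` and `RandomForestClassifier` on synthetic data; the dataset and hyperparameters are illustrative, not tuned.

```python
# Fit ExtraTrees and Random Forest on the same synthetic classification
# task. The only conceptual difference is how each tree picks its split
# thresholds; the API is identical.
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=50, n_informative=10,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

et = ExtraTreesClassifier(n_estimators=200, n_jobs=-1, random_state=0).fit(X_tr, y_tr)
rf = RandomForestClassifier(n_estimators=200, n_jobs=-1, random_state=0).fit(X_tr, y_tr)

print(f"ExtraTrees accuracy:   {et.score(X_te, y_te):.3f}")
print(f"RandomForest accuracy: {rf.score(X_te, y_te):.3f}")
```

On a given dataset either model may score higher; the paper's claim is about variance and speed, not a uniform accuracy win.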
Original paper: Extremely randomized trees — Geurts, Ernst & Wehenkel, Machine Learning 63(1), 3–42, 2006
Applications in A-Share Quantitative Strategies
1. High-Dimensional Factor Screening
A-Share factor libraries may contain hundreds of raw factors. ExtraTrees handles these high-dimensional, small-sample settings efficiently, making it well suited to rapid feature-importance evaluation across 3,900+ stocks.
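A factor-screening pass can be sketched as follows. The factor matrix here is synthetic and the top-20 cutoff is a hypothetical choice; in practice `X` would be per-stock factor exposures and `y` next-period returns.

```python
# Rank a wide synthetic "factor" matrix by ExtraTrees feature importance.
# Only factors 0 and 1 actually drive the target here, so they should
# surface at the top of the ranking.
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

rng = np.random.default_rng(0)
n_stocks, n_factors = 3000, 200
X = rng.standard_normal((n_stocks, n_factors))            # factor exposures
y = 0.5 * X[:, 0] + 0.3 * X[:, 1] + 0.1 * rng.standard_normal(n_stocks)

model = ExtraTreesRegressor(n_estimators=200, min_samples_leaf=10,
                            n_jobs=-1, random_state=0).fit(X, y)
top20 = np.argsort(model.feature_importances_)[::-1][:20]  # top-20 factors
print("leading factors by importance:", top20[:5])
```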
2. Intraday Timing
For minute-level OHLCV feature matrices, ExtraTrees' training speed advantage is particularly valuable when models need frequent retraining throughout the trading day.
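A rolling intraday retrain might look like the sketch below. The window sizes, bar counts, and feature matrix are all made-up placeholders; the point is only that refitting ExtraTrees on each trailing window is cheap enough to repeat many times per day.

```python
# Hypothetical rolling retrain on minute-bar features: refit on the
# trailing window, predict the next block, then slide forward.
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

rng = np.random.default_rng(0)
n_bars, n_feats = 1200, 20           # e.g. five days of minute-bar features
X = rng.standard_normal((n_bars, n_feats))
y = X[:, 0] + 0.5 * rng.standard_normal(n_bars)

window, step = 480, 240              # trailing window size, retrain frequency
preds = []
for start in range(0, n_bars - window - step + 1, step):
    tr = slice(start, start + window)              # training window
    te = slice(start + window, start + window + step)  # next block to predict
    model = ExtraTreesRegressor(n_estimators=100, n_jobs=-1,
                                random_state=0).fit(X[tr], y[tr])
    preds.append(model.predict(X[te]))
preds = np.concatenate(preds)
print("out-of-sample predictions:", preds.shape)
```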
3. Ensemble with Random Forest
Averaging ExtraTrees and Random Forest predictions (equal weight) typically produces more stable IC values than either model alone — a common ensemble baseline.
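The equal-weight blend above can be sketched directly, scoring the blended forecast with a rank IC (Spearman correlation). The data is synthetic; in a real pipeline `X` would be factor exposures and `y` next-period returns.

```python
# Equal-weight ensemble baseline: average ExtraTrees and Random Forest
# return forecasts, then compute the rank IC of the blend out of sample.
import numpy as np
from scipy.stats import spearmanr
from sklearn.ensemble import ExtraTreesRegressor, RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.standard_normal((1500, 30))
y = X[:, :5].sum(axis=1) + rng.standard_normal(1500)   # signal + noise

X_tr, X_te, y_tr, y_te = X[:1000], X[1000:], y[:1000], y[1000:]
et = ExtraTreesRegressor(n_estimators=200, random_state=0).fit(X_tr, y_tr)
rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_tr, y_tr)

pred = 0.5 * et.predict(X_te) + 0.5 * rf.predict(X_te)  # equal-weight blend
ic, _ = spearmanr(pred, y_te)
print(f"ensemble rank IC: {ic:.3f}")
```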
Key Parameters (Finance-Recommended)
| Parameter | Description | Recommended |
|---|---|---|
| `n_estimators` | Number of trees | 100–500 |
| `max_features` | Features considered per split | `"sqrt"` (classification) |
| `max_depth` | Maximum tree depth | `None` (grow fully) |
| `min_samples_leaf` | Minimum samples per leaf | 5–20 |
| `bootstrap` | Bootstrap sampling | `False` (default) |
| `n_jobs` | Parallel workers | `-1` (all cores) |
Key Difference from Random Forest
ExtraTrees defaults to bootstrap=False (uses the full dataset), while Random Forest defaults to bootstrap=True. Keep the default in most financial backtesting scenarios.
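This default can be verified directly from scikit-learn without fitting anything:

```python
# Inspect the bootstrap defaults of the two ensembles.
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier

print("ExtraTrees bootstrap default:  ", ExtraTreesClassifier().bootstrap)
print("RandomForest bootstrap default:", RandomForestClassifier().bootstrap)
```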
Strengths & Limitations
Strengths:
- Faster training than Random Forest (no exhaustive threshold search)
- Lower variance — better generalization on noisy financial data
- Highly parallelizable with `n_jobs=-1`
Limitations:
- Random thresholds reduce individual tree accuracy — requires more trees to compensate
- Slightly less robust to outliers than Random Forest
