ExtraTrees

Extremely Randomized Trees (ExtraTrees) is an enhanced variant of Random Forest that introduces complete randomness in split thresholds, further reducing variance and increasing training speed.


Overview

Proposed by Geurts, Ernst, and Wehenkel (2006), ExtraTrees differs from Random Forest in one critical way:

  • Random Forest: Evaluates several candidate split thresholds per feature, selects the best
  • ExtraTrees: Selects split thresholds completely at random — no search for optimality

This extreme randomization yields lower variance at the cost of a small increase in bias. In practice, ExtraTrees often trains faster and generalizes better on high-dimensional financial datasets.
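The contrast can be seen directly with scikit-learn's implementations. A minimal sketch on synthetic classification data (the dataset shape and hyperparameters here are illustrative, not recommendations):

```python
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for a labeled factor matrix: 1000 samples, 50 features.
X, y = make_classification(
    n_samples=1000, n_features=50, n_informative=10, random_state=0
)

# Same ensemble size; only the split-selection strategy differs internally.
et = ExtraTreesClassifier(n_estimators=100, random_state=0, n_jobs=-1)
rf = RandomForestClassifier(n_estimators=100, random_state=0, n_jobs=-1)

et_score = cross_val_score(et, X, y, cv=3).mean()
rf_score = cross_val_score(rf, X, y, cv=3).mean()
print(f"ExtraTrees CV accuracy:   {et_score:.3f}")
print(f"RandomForest CV accuracy: {rf_score:.3f}")
```

On real, noisy financial data the gap between the two scores is typically small; the practical differences are training time and variance, not headline accuracy on clean benchmarks.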

Original paper: Extremely randomized trees — Geurts, Ernst & Wehenkel, Machine Learning 63(1), 3–42, 2006


Applications in A-Share Quantitative Strategies

1. High-Dimensional Factor Screening

A-Share factor libraries may contain hundreds of raw factors. ExtraTrees trains quickly and keeps variance low in high-dimensional, small-sample settings, making it well suited to rapid feature-importance screening across 3900+ stocks.
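A screening pass of this kind can be sketched with `feature_importances_` on synthetic data (the factor counts, return-generating process, and variable names below are all hypothetical):

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

rng = np.random.default_rng(42)
n_stocks, n_factors = 500, 200  # stand-ins for a cross-section and factor library
X = rng.standard_normal((n_stocks, n_factors))
# Synthetic forward returns: driven by the first 5 factors plus noise.
y = X[:, :5] @ rng.standard_normal(5) + 0.1 * rng.standard_normal(n_stocks)

model = ExtraTreesRegressor(n_estimators=300, n_jobs=-1, random_state=0)
model.fit(X, y)

# Rank factors by impurity-based importance; keep the top 10 for a finer pass.
top_factors = np.argsort(model.feature_importances_)[::-1][:10]
print("Top factor indices:", top_factors)
```

Impurity-based importances are biased toward high-cardinality features; for a final cut, cross-checking with permutation importance on held-out data is a common safeguard.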

2. Intraday Timing

For minute-level OHLCV feature matrices, ExtraTrees' training speed advantage is particularly valuable when models need frequent retraining throughout the trading day.

3. Ensemble with Random Forest

Averaging ExtraTrees and Random Forest predictions (equal weight) typically produces more stable IC values than either model alone — a common ensemble baseline.
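An equal-weight blend is a one-liner once both models are fitted. A sketch on synthetic regression data (the split sizes and hyperparameters are illustrative):

```python
from sklearn.ensemble import ExtraTreesRegressor, RandomForestRegressor
from sklearn.datasets import make_regression

# Synthetic stand-in for a factor matrix and forward returns.
X, y = make_regression(n_samples=800, n_features=30, noise=5.0, random_state=1)
X_train, X_test = X[:600], X[600:]
y_train, y_test = y[:600], y[600:]

et = ExtraTreesRegressor(n_estimators=200, random_state=1, n_jobs=-1).fit(X_train, y_train)
rf = RandomForestRegressor(n_estimators=200, random_state=1, n_jobs=-1).fit(X_train, y_train)

# Equal-weight average of the two models' predicted scores.
blend = 0.5 * et.predict(X_test) + 0.5 * rf.predict(X_test)
```

In a live factor pipeline, `blend` would be the cross-sectional score fed into the IC calculation; unequal weights can be tuned, but 50/50 is the usual baseline.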


Key Parameters

| Parameter        | Description              | Recommended             |
|------------------|--------------------------|-------------------------|
| n_estimators     | Number of trees          | 100–500                 |
| max_features     | Features per split       | "sqrt" (classification) |
| max_depth        | Tree depth               | None                    |
| min_samples_leaf | Minimum samples per leaf | 5–20                    |
| bootstrap        | Bootstrap sampling       | False (default)         |
| n_jobs           | Parallel workers         | -1                      |
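Putting the recommended settings together (the specific values chosen within each range are illustrative):

```python
from sklearn.ensemble import ExtraTreesClassifier

model = ExtraTreesClassifier(
    n_estimators=300,      # within the 100-500 range above
    max_features="sqrt",   # typical for classification
    max_depth=None,        # grow trees fully
    min_samples_leaf=10,   # within the 5-20 range, regularizes noisy labels
    bootstrap=False,       # ExtraTrees default: train each tree on the full dataset
    n_jobs=-1,             # use all available cores
)
```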

Key Difference from Random Forest

ExtraTrees defaults to bootstrap=False (uses the full dataset), while Random Forest defaults to bootstrap=True. Keep the default in most financial backtesting scenarios.
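The differing defaults can be verified directly on the estimators' constructor parameters:

```python
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier

# ExtraTrees: each tree sees the full training set by default.
print(ExtraTreesClassifier().bootstrap)    # False
# Random Forest: each tree trains on a bootstrap resample by default.
print(RandomForestClassifier().bootstrap)  # True
```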


Strengths & Limitations

Strengths:

  • Faster training than Random Forest (no threshold search)
  • Lower variance — better generalization on noisy financial data
  • Highly parallelizable with n_jobs=-1

Limitations:

  • Random thresholds reduce individual tree accuracy — requires more trees to compensate
  • Slightly less robust to outliers than Random Forest
