Random Forest

Random Forest is an ensemble learning method proposed by Leo Breiman in 2001. By combining Bootstrap sampling with random feature subsets across many decision trees, it achieves low-variance predictions — making it one of the most widely used baseline models for A-Share quantitative stock selection.


Overview

Random Forest introduces dual randomness during tree construction:

  1. Bootstrap Sampling: Each tree trains on a random sample drawn with replacement
  2. Random Feature Subsets: At each split, only $\sqrt{p}$ (classification) or $p/3$ (regression) features are considered

This design reduces inter-tree correlation, dramatically lowering variance while maintaining low bias.
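The two randomness sources above map directly onto scikit-learn constructor arguments. A minimal sketch (synthetic data; sample and feature counts are illustrative assumptions):

```python
# Sketch of Random Forest's dual randomness using scikit-learn.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=16, random_state=0)

clf = RandomForestClassifier(
    n_estimators=100,
    bootstrap=True,       # 1. each tree trains on a bootstrap sample
    max_features="sqrt",  # 2. each split considers only sqrt(p) features
    random_state=0,
)
clf.fit(X, y)

# Each fitted tree differs because of the two randomness sources above.
print(len(clf.estimators_))  # 100 individual decision trees
```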

Original paper: Random Forests — Leo Breiman, Machine Learning 45(1), 5–32, 2001


Applications in A-Share Quantitative Strategies

1. Multi-Factor Stock Selection

Use value, growth, momentum, and quality factors (50+ Alpha factors) as input features, with next-day return direction or future returns as labels. The model outputs scored rankings for stock selection.
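A hedged sketch of this workflow, using a synthetic factor matrix (the factor names, data, and cutoff of 20 stocks are illustrative assumptions, not real Alpha factors):

```python
# Multi-factor stock selection sketch: score stocks by predicted
# probability of an up move, then take the top of the ranking.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(42)
n_stocks, n_factors = 300, 8
factors = pd.DataFrame(
    rng.normal(size=(n_stocks, n_factors)),
    columns=[f"factor_{i}" for i in range(n_factors)],
)
# Label: next-day return direction (1 = up); synthetic signal here
next_day_up = (factors["factor_0"]
               + rng.normal(scale=0.5, size=n_stocks) > 0).astype(int)

clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(factors, next_day_up)

# Score = predicted probability of class 1; rank stocks by score
scores = clf.predict_proba(factors)[:, 1]
top_20 = np.argsort(scores)[::-1][:20]  # indices of the 20 best-scored stocks
```

In a real backtest the labels would come from forward returns and the fit/predict windows would be separated in time to avoid look-ahead bias.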

2. Feature Importance — Factor Screening

feature_importances_ is computed via MDI (Mean Decrease Impurity) and provides an objective ranking of each factor's predictive contribution — an alternative to traditional IC-based factor ranking.
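A minimal sketch of an MDI-based factor ranking (the factor columns and synthetic signal are illustrative assumptions; here only "momentum" carries signal):

```python
# Rank factors by MDI importance; importances sum to 1,
# higher = larger average impurity reduction across all splits.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = pd.DataFrame(
    rng.normal(size=(400, 5)),
    columns=["value", "growth", "momentum", "quality", "noise"],
)
# Synthetic label driven almost entirely by the momentum column
y = (X["momentum"] * 2 + rng.normal(scale=0.5, size=400) > 0).astype(int)

clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

ranking = pd.Series(clf.feature_importances_, index=X.columns)
ranking = ranking.sort_values(ascending=False)
print(ranking)  # momentum should dominate the ranking
```

Note that MDI can overstate the importance of high-cardinality or continuous features; permutation importance is a common cross-check.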

3. Out-of-Bag (OOB) Error Estimation

Setting oob_score=True provides a generalization error estimate without cross-validation, which suits financial time series well: each tree is evaluated only on the bootstrap samples it never saw during training.
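A short sketch of the OOB estimate (synthetic data as an assumption):

```python
# OOB generalization estimate without a separate validation split.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

clf = RandomForestClassifier(
    n_estimators=300,
    bootstrap=True,   # OOB requires bootstrap sampling
    oob_score=True,   # score each sample only on trees that never saw it
    random_state=0,
)
clf.fit(X, y)

print(round(clf.oob_score_, 3))  # accuracy estimated from out-of-bag samples
```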


| Parameter        | Description              | Recommended             |
|------------------|--------------------------|-------------------------|
| n_estimators     | Number of trees          | 100–500                 |
| max_features     | Features per split       | "sqrt" (classification) |
| max_depth        | Maximum tree depth       | None or 5–15            |
| min_samples_leaf | Minimum samples per leaf | 5–20                    |
| oob_score        | Enable OOB evaluation    | True                    |
| n_jobs           | Parallel workers         | -1 (all cores)          |
| class_weight     | Class imbalance handling | 'balanced'              |
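Assembled into a single constructor call (the specific values chosen within each recommended range, and the synthetic imbalanced dataset, are illustrative assumptions):

```python
# Recommended parameter settings combined in one estimator.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Mildly imbalanced synthetic data to motivate class_weight='balanced'
X, y = make_classification(n_samples=2000, n_features=20,
                           weights=[0.7, 0.3], random_state=0)

clf = RandomForestClassifier(
    n_estimators=300,         # within the 100-500 range
    max_features="sqrt",      # sqrt(p) features per split (classification)
    max_depth=10,             # cap depth to limit overfitting
    min_samples_leaf=10,      # within the 5-20 range
    oob_score=True,           # free generalization estimate
    n_jobs=-1,                # use all CPU cores
    class_weight="balanced",  # reweight classes by inverse frequency
    random_state=0,
)
clf.fit(X, y)
```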

Strengths & Limitations

Strengths:

  • Robust to outliers and noise — well-suited for variable A-Share data quality
  • No feature scaling required — raw factor values can be fed directly
  • OOB estimate avoids data leakage, especially suitable for time series
  • feature_importances_ directly produces a factor ranking

Limitations:

  • Slower inference than single trees (aggregates all trees)
  • High memory usage with large forests
  • Generally lower accuracy than Boosting models (higher bias)
