Random Forest
Random Forest is an ensemble learning method proposed by Leo Breiman in 2001. By combining Bootstrap sampling with random feature subsets across many decision trees, it achieves low-variance predictions — making it one of the most widely used baseline models for A-Share quantitative stock selection.
Overview
Random Forest introduces dual randomness during tree construction:
- Bootstrap Sampling: Each tree trains on a random sample drawn with replacement
- Random Feature Subsets: At each split, only $\sqrt{p}$ of the $p$ input features (classification) or $p/3$ of them (regression) are considered
This design reduces inter-tree correlation, dramatically lowering variance while maintaining low bias.
Original paper: Random Forests — Leo Breiman, Machine Learning 45(1), 5–32, 2001
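The dual randomness above can be sketched in a few lines of NumPy. This is an illustrative toy, not scikit-learn's internal implementation: it draws one tree's bootstrap sample and one split's random feature subset, with sizes and the seed chosen arbitrarily.

```python
import numpy as np

rng = np.random.default_rng(42)
n_samples, p = 500, 16                    # 500 rows, p = 16 features
X = rng.normal(size=(n_samples, p))

# Bootstrap sampling: draw n_samples rows *with replacement* for one tree
boot_idx = rng.integers(0, n_samples, size=n_samples)
X_boot = X[boot_idx]                      # same shape, ~63% unique rows

# Random feature subset at one split: sqrt(p) features for classification
k = int(np.sqrt(p))                       # sqrt(16) = 4
feat_idx = rng.choice(p, size=k, replace=False)  # 4 distinct feature indices
```

Because each tree sees a different bootstrap sample and each split sees a different feature subset, the trees make partly independent errors, which is what averaging exploits to cut variance.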
Applications in A-Share Quantitative Strategies
1. Multi-Factor Stock Selection
Use value, growth, momentum, and quality factors (often 50+ Alpha factors) as input features, with next-day return direction or future returns as labels. The model outputs scores that are ranked for stock selection.
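A minimal sketch of this workflow with scikit-learn, using a synthetic factor matrix in place of real A-Share data; the 8-factor setup, the label construction, and all hyperparameter values here are illustrative assumptions, not a recommended production configuration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
# Hypothetical factor matrix: 1000 stock-day rows x 8 factors
X = rng.normal(size=(1000, 8))
# Hypothetical label: next-day return direction (1 = up, 0 = down),
# driven by the first two factors plus noise
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=1000) > 0).astype(int)

model = RandomForestClassifier(n_estimators=200, max_features="sqrt",
                               random_state=0)
model.fit(X, y)

# Predicted up-probabilities serve as stock-selection scores to rank
scores = model.predict_proba(X)[:, 1]
```

In practice the scores would be computed on a held-out cross-section and the top-ranked names bought, but the fit/score pattern is the same.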
2. Feature Importance — Factor Screening
`feature_importances_` is computed via MDI (Mean Decrease in Impurity) and provides an objective ranking of each factor's predictive contribution, offering an alternative to traditional IC-based factor ranking.
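Reading the MDI ranking takes one attribute access after fitting. In this sketch the factor names and the data are made up, and the label is deliberately driven by the momentum column so that the importance ranking has a known answer.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)
factor_names = ["value", "growth", "momentum", "quality"]  # illustrative names
X = rng.normal(size=(800, 4))
# Synthetic label driven almost entirely by the "momentum" column
y = (X[:, 2] + 0.3 * rng.normal(size=800) > 0).astype(int)

model = RandomForestClassifier(n_estimators=300, random_state=1).fit(X, y)

# MDI importances are non-negative and sum to 1
ranking = sorted(zip(factor_names, model.feature_importances_),
                 key=lambda t: t[1], reverse=True)
```

Note that MDI is known to be biased toward high-cardinality features; scikit-learn's permutation importance is a common cross-check before dropping factors.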
3. Out-of-Bag (OOB) Error Estimation
Setting `oob_score=True` provides a generalization-error estimate without separate cross-validation: each tree is evaluated only on the bootstrap samples it never saw during training, which makes OOB evaluation a natural fit for financial time series.
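A minimal sketch of OOB evaluation, again on synthetic data (the data-generating process and sizes are assumptions). `oob_score_` becomes available after `fit` and approximates out-of-sample accuracy with no held-out set.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(2)
X = rng.normal(size=(600, 5))
y = (X[:, 0] > 0).astype(int)   # cleanly separable toy label

model = RandomForestClassifier(n_estimators=200,
                               oob_score=True,     # requires bootstrap=True (the default)
                               random_state=2).fit(X, y)

# model.oob_score_ is each sample's accuracy under the trees
# whose bootstrap draws excluded that sample
oob_accuracy = model.oob_score_
```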
Key Parameters (Finance-Recommended)
| Parameter | Description | Recommended |
|---|---|---|
| `n_estimators` | Number of trees | 100–500 |
| `max_features` | Features considered per split | `"sqrt"` (classification) |
| `max_depth` | Maximum tree depth | `None` or 5–15 |
| `min_samples_leaf` | Minimum samples per leaf | 5–20 |
| `oob_score` | Enable OOB evaluation | `True` |
| `n_jobs` | Parallel workers | `-1` (all cores) |
| `class_weight` | Class-imbalance handling | `'balanced'` |
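Putting the table together, one plausible starting configuration looks like this; the specific values chosen within each recommended range are illustrative, not tuned.

```python
from sklearn.ensemble import RandomForestClassifier

# Starting point combining the recommended settings above
model = RandomForestClassifier(
    n_estimators=300,        # within the 100-500 range
    max_features="sqrt",     # classification default
    max_depth=10,            # or None to grow trees fully
    min_samples_leaf=10,     # within the 5-20 range
    oob_score=True,
    n_jobs=-1,               # use all cores
    class_weight="balanced",
    random_state=0,          # for reproducibility
)
```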
Strengths & Limitations
Strengths:
- Robust to outliers and noise — well-suited for variable A-Share data quality
- No feature scaling required — raw factor values can be fed directly
- OOB estimate avoids data leakage, especially suitable for time series
- `feature_importances_` directly produces a factor ranking
Limitations:
- Slower inference than single trees (aggregates all trees)
- High memory usage with large forests
- Generally lower accuracy than Boosting models (higher bias)
