Random Forest

Random Forest is an ensemble learning method proposed by Leo Breiman in 2001. By combining Bootstrap sampling with random feature subsets across many decision trees, it achieves low-variance predictions — making it one of the most widely used baseline models for A-Share quantitative stock selection.


Overview

Random Forest introduces dual randomness during tree construction:

  1. Bootstrap Sampling: Each tree trains on a random sample drawn with replacement
  2. Random Feature Subsets: At each split, only $\sqrt{p}$ (classification) or $p/3$ (regression) features are considered

This design reduces inter-tree correlation, dramatically lowering variance while maintaining low bias.
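The two randomness sources above map directly onto scikit-learn constructor arguments. A minimal sketch (synthetic data; sample and feature counts are illustrative assumptions):

```python
# Sketch of Random Forest's dual randomness using scikit-learn.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=16, random_state=0)

clf = RandomForestClassifier(
    n_estimators=100,
    bootstrap=True,       # 1. each tree trains on a bootstrap sample
    max_features="sqrt",  # 2. each split considers only sqrt(p) features
    random_state=0,
)
clf.fit(X, y)

# Each fitted tree differs because of the two randomness sources above.
print(len(clf.estimators_))  # 100 individual decision trees
```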

Original paper: Random Forests — Leo Breiman, Machine Learning 45(1), 5–32, 2001


Applications in A-Share Quantitative Strategies

1. Multi-Factor Stock Selection

Use value, growth, momentum, and quality factors (50+ Alpha factors) as input features, with next-day return direction or future returns as labels. The model outputs scored rankings for stock selection.
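A hedged sketch of this workflow, using a synthetic factor matrix (the factor names, data, and cutoff of 20 stocks are illustrative assumptions, not real Alpha factors):

```python
# Multi-factor stock selection sketch: score stocks by predicted
# probability of an up move, then take the top of the ranking.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(42)
n_stocks, n_factors = 300, 8
factors = pd.DataFrame(
    rng.normal(size=(n_stocks, n_factors)),
    columns=[f"factor_{i}" for i in range(n_factors)],
)
# Label: next-day return direction (1 = up); synthetic signal here
next_day_up = (factors["factor_0"]
               + rng.normal(scale=0.5, size=n_stocks) > 0).astype(int)

clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(factors, next_day_up)

# Score = predicted probability of class 1; rank stocks by score
scores = clf.predict_proba(factors)[:, 1]
top_20 = np.argsort(scores)[::-1][:20]  # indices of the 20 best-scored stocks
```

In a real backtest the labels would come from forward returns and the fit/predict windows would be separated in time to avoid look-ahead bias.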

2. Feature Importance — Factor Screening

feature_importances_ is computed via MDI (Mean Decrease Impurity) and provides an objective ranking of each factor's predictive contribution — an alternative to traditional IC-based factor ranking.
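A minimal sketch of an MDI-based factor ranking (the factor columns and synthetic signal are illustrative assumptions; here only "momentum" carries signal):

```python
# Rank factors by MDI importance; importances sum to 1,
# higher = larger average impurity reduction across all splits.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = pd.DataFrame(
    rng.normal(size=(400, 5)),
    columns=["value", "growth", "momentum", "quality", "noise"],
)
# Synthetic label driven almost entirely by the momentum column
y = (X["momentum"] * 2 + rng.normal(scale=0.5, size=400) > 0).astype(int)

clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

ranking = pd.Series(clf.feature_importances_, index=X.columns)
ranking = ranking.sort_values(ascending=False)
print(ranking)  # momentum should dominate the ranking
```

Note that MDI can overstate the importance of high-cardinality or continuous features; permutation importance is a common cross-check.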

3. Out-of-Bag (OOB) Error Estimation

Setting oob_score=True provides a generalization error estimate without cross-validation, which suits financial time series well: each tree is evaluated only on the bootstrap samples it never saw during training.
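A short sketch of the OOB estimate (synthetic data as an assumption):

```python
# OOB generalization estimate without a separate validation split.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

clf = RandomForestClassifier(
    n_estimators=300,
    bootstrap=True,   # OOB requires bootstrap sampling
    oob_score=True,   # score each sample only on trees that never saw it
    random_state=0,
)
clf.fit(X, y)

print(round(clf.oob_score_, 3))  # accuracy estimated from out-of-bag samples
```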


| Parameter        | Description              | Recommended             |
|------------------|--------------------------|-------------------------|
| n_estimators     | Number of trees          | 100–500                 |
| max_features     | Features per split       | "sqrt" (classification) |
| max_depth        | Maximum tree depth       | None or 5–15            |
| min_samples_leaf | Minimum samples per leaf | 5–20                    |
| oob_score        | Enable OOB evaluation    | True                    |
| n_jobs           | Parallel workers         | -1 (all cores)          |
| class_weight     | Class imbalance handling | 'balanced'              |
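Assembled into a single constructor call (the specific values chosen within each recommended range, and the synthetic imbalanced dataset, are illustrative assumptions):

```python
# Recommended parameter settings combined in one estimator.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Mildly imbalanced synthetic data to motivate class_weight='balanced'
X, y = make_classification(n_samples=2000, n_features=20,
                           weights=[0.7, 0.3], random_state=0)

clf = RandomForestClassifier(
    n_estimators=300,         # within the 100-500 range
    max_features="sqrt",      # sqrt(p) features per split (classification)
    max_depth=10,             # cap depth to limit overfitting
    min_samples_leaf=10,      # within the 5-20 range
    oob_score=True,           # free generalization estimate
    n_jobs=-1,                # use all CPU cores
    class_weight="balanced",  # reweight classes by inverse frequency
    random_state=0,
)
clf.fit(X, y)
```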

Strengths & Limitations

Strengths:

  • Robust to outliers and noise — well-suited for variable A-Share data quality
  • No feature scaling required — raw factor values can be fed directly
  • OOB estimate avoids data leakage, especially suitable for time series
  • feature_importances_ directly produces a factor ranking

Limitations:

  • Slower inference than single trees (aggregates all trees)
  • High memory usage with large forests
  • Generally lower accuracy than Boosting models (higher bias)
