# AutoML — Automated Machine Learning
AutoML automates model selection, feature engineering, and hyperparameter optimization — enabling rapid construction of high-performance predictive models for A-Share quantitative strategies without manual tuning.
## Overview

AutoML automates the following pipeline stages:
- Feature Preprocessing: Normalization, imputation, encoding
- Model Selection: Searches across candidate algorithms (RF, XGBoost, LightGBM, linear models, etc.)
- Hyperparameter Optimization (HPO): Bayesian optimization, Hyperband, TPE
- Ensemble Learning: Weighted combination of multiple models (Stacking/Voting)
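The four stages can be sketched end to end with plain scikit-learn as a hand-rolled stand-in (synthetic data, fixed candidate set; a real AutoML framework automates this search):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier, VotingClassifier)
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Stage 1: feature preprocessing (imputation + normalization)
preprocess = make_pipeline(SimpleImputer(), StandardScaler())

# Stage 2: model selection across candidate algorithms
candidates = {
    "logreg": LogisticRegression(max_iter=1000),
    "rf": RandomForestClassifier(n_estimators=100, random_state=0),
    "gb": GradientBoostingClassifier(random_state=0),
}
scores = {
    name: cross_val_score(make_pipeline(preprocess, est), X, y,
                          cv=5, scoring="roc_auc").mean()
    for name, est in candidates.items()
}
best = max(scores, key=scores.get)

# Stage 3 (HPO) would refine the winner; Stage 4 combines all candidates
ensemble = VotingClassifier(
    [(n, make_pipeline(preprocess, e)) for n, e in candidates.items()],
    voting="soft",
)
ensemble.fit(X, y)
print(best, round(scores[best], 3))
```

In a real framework, stage 2 searches a far larger space and stage 3 tunes each candidate's hyperparameters under a time budget.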
Major frameworks:
| Framework | Highlights | Official Link |
|---|---|---|
| auto-sklearn | scikit-learn based, Bayesian optimization + ensemble | auto-sklearn |
| FLAML | Microsoft, resource-aware, extremely fast | FLAML |
| Optuna | HPO framework, works with any model | Optuna |
| H2O AutoML | Enterprise-grade, large-scale distributed | H2O.ai |
FLAML paper: FLAML: A Fast and Lightweight AutoML Library — Wang et al., MLSys 2021
## Applications in A-Share Quantitative Strategies

### 1. Fast Factor Baseline Evaluation
Before committing to deep research on a new factor, use AutoML to quickly assess its predictive power as an objective baseline. This avoids selection bias from experience-driven or manual tuning.
### 2. Model Search Replacing Manual Tuning

Use FLAML to search over XGBoost/LightGBM/CatBoost hyperparameter spaces with AUC or IC as the objective:

```python
import flaml

automl = flaml.AutoML()
automl.fit(
    X_train, y_train,
    task="classification",
    metric="roc_auc",
    time_budget=300,  # find the best model within 5 minutes
)
print(automl.best_estimator)
print(automl.best_config)
```

### 3. Ensemble for Stability
auto-sklearn and H2O AutoML natively support ensemble output, automatically weighting models such as LightGBM, RandomForest, and LogisticRegression. The resulting IC is typically more stable than that of any single model.
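A comparable ensemble can be sketched with scikit-learn's `StackingClassifier` (a stand-in for the frameworks' built-in ensembling; `GradientBoostingClassifier` substitutes for LightGBM here, and the data is synthetic):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier, StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=600, n_features=15, random_state=1)

# Base learners mirror a typical AutoML ensemble; a logistic meta-learner
# weights their out-of-fold predictions
stack = StackingClassifier(
    estimators=[
        ("gb", GradientBoostingClassifier(random_state=1)),  # LightGBM stand-in
        ("rf", RandomForestClassifier(n_estimators=200, random_state=1)),
        ("lr", LogisticRegression(max_iter=1000)),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,
)
auc = cross_val_score(stack, X, y, cv=3, scoring="roc_auc").mean()
print(round(auc, 3))
```

Fitting the meta-learner on out-of-fold predictions (the `cv=5` inside the stacker) is what keeps the combination from simply memorizing the base models' training errors.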
### 4. Time-Series Cross-Validation (Critical!)

Financial time-series data must not be split randomly, since random folds leak future information into training. Configure AutoML with a custom CV generator:

```python
from sklearn.model_selection import TimeSeriesSplit

cv = TimeSeriesSplit(n_splits=5)
automl.fit(X, y, eval_method="cv", split_type=cv)
```

## Framework Comparison
| Feature | auto-sklearn | FLAML | Optuna |
|---|---|---|---|
| Model selection | ✅ Auto | ✅ Auto | ❌ Manual |
| HPO strategy | Bayesian + ensemble | Resource-aware | TPE/CMA-ES |
| Speed | Slow (resource-heavy) | Fastest | Medium |
| Ensemble output | ✅ | ✅ | ❌ |
| Time-series CV | Custom needed | Custom needed | Trivial (user-defined objective) |
## Strengths & Limitations

Strengths:
- Dramatically reduces tuning time and lowers the risk of human-induced overfitting
- Systematic search typically finds better configurations than manual tuning
- FLAML can complete a search on a ~1000-stock dataset in under 5 minutes
Limitations:
- Financial time series require careful CV configuration; framework defaults are often inappropriate
- Black-box search makes overfitting harder to diagnose and control
- Ensemble models are slow to serve, making them ill-suited to real-time signal generation
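The CV caveat is easy to verify at the index level: `TimeSeriesSplit` never trains on rows that come after the test fold, while a shuffled `KFold` routinely does (scikit-learn only, toy index array):

```python
import numpy as np
from sklearn.model_selection import KFold, TimeSeriesSplit

n = 20
idx = np.arange(n)  # rows in chronological order

# TimeSeriesSplit: every training index precedes every test index
for train, test in TimeSeriesSplit(n_splits=4).split(idx):
    assert train.max() < test.min()

# Shuffled KFold: training folds routinely contain future rows
leaks = sum(train.max() > test.min()
            for train, test in KFold(n_splits=4, shuffle=True,
                                     random_state=0).split(idx))
print(leaks)  # > 0: look-ahead leakage with a random split
```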
