CatBoost
CatBoost is a gradient boosted decision tree framework developed by Yandex, distinguished by its native support for categorical features and its Ordered Boosting strategy; no manual one-hot encoding is required.
Overview
The name "CatBoost" comes from "Category" + "Boosting". It is specifically optimized for categorical features using Ordered Target Statistics: each sample's category encoding is computed only from the targets of samples that precede it in a random permutation, which avoids target leakage. Training runs on CPU or GPU, and models can be exported to ONNX/PMML for production deployment.
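The ordered target statistic can be sketched in a few lines of plain Python. This is a simplified toy, not CatBoost's internal implementation: it uses one fixed permutation and a smoothing prior `a`, whereas CatBoost averages over several random permutations.

```python
def ordered_target_stats(categories, targets, prior=0.5, a=1.0):
    """Encode each sample's category using only *earlier* samples,
    so a sample's own target never leaks into its encoding."""
    sums, counts = {}, {}
    encoded = []
    for cat, y in zip(categories, targets):
        s = sums.get(cat, 0.0)
        c = counts.get(cat, 0)
        encoded.append((s + a * prior) / (c + a))  # smoothed mean of past targets
        sums[cat] = s + y
        counts[cat] = c + 1
    return encoded

cats = ["bank", "tech", "bank", "tech", "bank"]
ys = [1, 0, 0, 1, 1]
print(ordered_target_stats(cats, ys))  # → [0.5, 0.5, 0.75, 0.25, 0.5]
```

Note that the first occurrence of each category falls back to the prior, and later occurrences blend in only the targets seen so far.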
Official docs: CatBoost About
Research papers: CatBoost Papers
Applications in A-Share Quantitative Strategies
1. Mixed-Feature Factor Modeling
A-Share data naturally contains categorical fields: industry classification, sector codes, CSRC industry categories. CatBoost accepts these directly without LabelEncoder or One-Hot encoding, reducing feature engineering overhead significantly.
2. Financial Report Feature Utilization
Categorical fields like report type (initial/amendment/correction) and audit opinion (unqualified/qualified) can be passed directly as categorical features — CatBoost encodes them automatically to capture financial quality signals.
3. Market Timing Classifier
Use macro-state labels (bull/bear/sideways) as categorical features combined with technical indicators to build a market timing model producing long/short signals.
Key Parameters (Finance-Recommended)
| Parameter | Description | Recommended |
|---|---|---|
| iterations | Number of trees | 300–1000 |
| learning_rate | Step size | 0.01–0.1 |
| depth | Tree depth | 4–8 |
| l2_leaf_reg | L2 regularization | 1–10 |
| cat_features | Categorical feature indices | Per actual columns |
| eval_metric | Evaluation metric | 'AUC' / 'NDCG' |
| task_type | Compute device | 'CPU' / 'GPU' |
| early_stopping_rounds | Early stopping | 50–100 |
Strengths & Limitations
Strengths:
- No preprocessing required for categorical features — pass industry/sector codes directly
- Ordered Boosting prevents target leakage, making the model more robust on financial time series
- Built-in SHAP values and feature importance visualization for factor attribution
- ONNX export for seamless trading system integration
Limitations:
- Slower training than LightGBM
- For purely numerical features, XGBoost/LightGBM often outperform
