CatBoost: Gradient Boosting from Yandex


What Is CatBoost and Why You Need It

CatBoost (Categorical Boosting) is a modern gradient boosting algorithm developed by Yandex’s engineering team. This machine‑learning library is specially optimized for handling categorical features and is designed to solve a wide range of tasks: from classification and regression to ranking and recommendation systems.

The core of the algorithm follows the principles of gradient boosting on decision trees, but with significant improvements for processing categorical data. CatBoost automatically handles categorical variables without the need for prior encoding, making it especially attractive for real‑world tabular data.

Key Advantages of CatBoost

Automatic Handling of Categorical Features

CatBoost changes the way categorical data is processed. Traditional algorithms require pre‑transformation of categorical variables via one‑hot encoding or label encoding. CatBoost instead computes ordered target statistics for each category (the library calls these "CTR" features, a name inherited from click‑through‑rate estimation), which lets it use categorical information directly and avoids lossy manual encoding.

High Accuracy and Resistance to Overfitting

The algorithm delivers excellent results on various data types thanks to symmetric trees and advanced regularization techniques. Built‑in mechanisms for preventing overfitting provide stable results even on small datasets.

Flexibility in Computational Resources

CatBoost supports both CPU and GPU training, allowing you to significantly speed up model building on large datasets. The library works efficiently with both small data samples and massive industrial datasets.

Integration with Popular Tools

Full compatibility with the Python data‑analysis ecosystem: Pandas, NumPy, Scikit‑learn. This ensures easy integration into existing machine‑learning pipelines.

Built‑in Analysis Tools

CatBoost offers rich capabilities for visualizing the training process, analyzing feature importance, and monitoring model quality in real time.

Technical Features of the Algorithm

Categorical Feature Processing

The main distinction of CatBoost from other boosting algorithms lies in its handling of categorical variables. Instead of traditional encoding methods, CatBoost uses target‑based statistics computed from historical data. This preserves all information about categorical features without increasing data dimensionality.
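The principle can be sketched in a few lines of plain Python. This is a simplified illustration of ordered target statistics, not CatBoost's actual implementation (the real algorithm uses several random permutations and more elaborate priors):

```python
def ordered_target_stats(cats, target, prior=0.5, prior_weight=1.0):
    """Encode a categorical column using only preceding rows (simplified sketch)."""
    sums, counts, encoded = {}, {}, []
    for cat, y in zip(cats, target):
        s, c = sums.get(cat, 0.0), counts.get(cat, 0)
        # smoothed mean of the target over rows seen so far -> no target leakage
        encoded.append((s + prior * prior_weight) / (c + prior_weight))
        sums[cat] = s + y
        counts[cat] = c + 1
    return encoded

cats = ['a', 'a', 'b', 'a', 'b']
target = [1, 0, 1, 1, 0]
print(ordered_target_stats(cats, target))  # [0.5, 0.75, 0.5, 0.5, 0.75]
```

Note that the encoding of each row depends only on rows before it, which is the key trick for avoiding target leakage.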

Symmetric Trees

CatBoost builds symmetric (oblivious) trees, in which every node at a given depth uses the same split condition. This acts as a form of regularization, speeds up prediction, and improves stability. It differs from LightGBM, which grows trees leaf‑wise, and from XGBoost, which grows them level‑wise by default.
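Because every level shares one split, a symmetric tree can be stored as a single (feature, threshold) pair per level, and the leaf index computed as a bitmask. A minimal hypothetical sketch of the idea:

```python
def oblivious_tree_predict(x, splits, leaf_values):
    """Predict with one symmetric (oblivious) tree.

    splits: list of (feature_index, threshold), one shared split per level.
    leaf_values: list of 2 ** len(splits) leaf values.
    """
    leaf = 0
    for feature, threshold in splits:
        # each level contributes one bit to the leaf index
        leaf = (leaf << 1) | (x[feature] > threshold)
    return leaf_values[leaf]

# A depth-2 tree: 2 shared splits -> 4 leaves
splits = [(0, 5.0), (1, 2.0)]
leaf_values = [0.1, 0.2, 0.3, 0.4]
print(oblivious_tree_predict([7.0, 1.0], splits, leaf_values))  # 0.3
```

This bitmask evaluation is also why oblivious trees are very fast at inference time.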

Missing Value Handling

CatBoost handles missing values in numeric features automatically, with no extra preprocessing: by default they are treated as smaller than all other values (controlled by the nan_mode parameter). Missing values in categorical features are not imputed automatically; the common practice is to fill them with a placeholder string so that they form a separate category.
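For categorical columns, the usual workaround is a placeholder fill, so the missing value becomes an ordinary category. A small pandas sketch (the placeholder name is arbitrary):

```python
import pandas as pd

df = pd.DataFrame({'region': ['north', None, 'south', None]})

# missing becomes one more category that the model can split on
df['region'] = df['region'].fillna('__missing__')
print(df['region'].tolist())  # ['north', '__missing__', 'south', '__missing__']
```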

Support for Text Features

CatBoost can work with textual data, automatically extracting features from text and using them for model training.

Installation and Configuration of CatBoost

Installation via pip

pip install catboost

Installation with GPU Support

GPU support is included in the standard pip package, so no separate installation step is needed. To train on GPU, pass task_type='GPU' when creating the model (an NVIDIA GPU with CUDA drivers is required):

model = CatBoostClassifier(task_type='GPU')

Import Core Modules

from catboost import CatBoostClassifier, CatBoostRegressor, CatBoostRanker
from catboost import Pool, cv, sum_models

Preparing Data for CatBoost

Working with Categorical Features

CatBoost can automatically detect categorical features, but for better control it is recommended to specify them explicitly:

import pandas as pd
from catboost import CatBoostClassifier

# Load data
df = pd.read_csv('data.csv')

# Define categorical features
cat_features = ['gender', 'region', 'category']
# or by index
cat_features_idx = [0, 2, 5]

# Prepare data
X = df.drop('target', axis=1)
y = df['target']

Using Pool for Optimization

Pool is a special CatBoost object for storing data, providing more efficient handling of large datasets:

from catboost import Pool

# Create training Pool
train_pool = Pool(
    data=X_train,
    label=y_train,
    cat_features=cat_features,
    feature_names=list(X_train.columns)
)

# Create evaluation Pool
eval_pool = Pool(
    data=X_eval,
    label=y_eval,
    cat_features=cat_features,
    feature_names=list(X_eval.columns)
)

Training Classification Models

Basic Example

from catboost import CatBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Create and train model
model = CatBoostClassifier(
    iterations=1000,
    learning_rate=0.1,
    depth=6,
    loss_function='Logloss',
    eval_metric='AUC',
    random_seed=42,
    verbose=100
)

# Train with validation set
model.fit(
    X_train, y_train,
    cat_features=cat_features,
    eval_set=(X_test, y_test),
    early_stopping_rounds=50,
    plot=True
)

# Predict
y_pred = model.predict(X_test)
y_pred_proba = model.predict_proba(X_test)

# Evaluate
print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print(classification_report(y_test, y_pred))

Training Regression Models

Configuration for Regression Tasks

from catboost import CatBoostRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error

# Create regression model
regressor = CatBoostRegressor(
    iterations=1000,
    learning_rate=0.05,
    depth=8,
    loss_function='RMSE',
    eval_metric='MAE',
    random_seed=42,
    verbose=100
)

# Train
regressor.fit(
    X_train, y_train,
    cat_features=cat_features,
    eval_set=(X_test, y_test),
    early_stopping_rounds=100
)

# Predict
y_pred = regressor.predict(X_test)

# Evaluate
rmse = mean_squared_error(y_test, y_pred) ** 0.5  # squared=False was removed in newer scikit-learn
mae = mean_absolute_error(y_test, y_pred)
print(f"RMSE: {rmse:.4f}")
print(f"MAE: {mae:.4f}")

Hyperparameter Tuning

Core Model Parameters

model = CatBoostClassifier(
    # Core parameters
    iterations=1000,           # Number of trees
    learning_rate=0.1,         # Learning rate
    depth=6,                   # Tree depth
    
    # Loss function and metrics
    loss_function='Logloss',
    eval_metric='AUC',
    
    # Regularization
    l2_leaf_reg=3.0,
    bagging_temperature=1.0,
    
    # Categorical handling
    one_hot_max_size=10,
    
    # Early stopping
    early_stopping_rounds=50,
    
    # Technical
    random_seed=42,
    verbose=100,
    thread_count=4,
    task_type='CPU'            # CPU or GPU
)

Automatic Hyperparameter Search

from catboost import CatBoostClassifier
from sklearn.model_selection import GridSearchCV

# Define grid
param_grid = {
    'iterations': [500, 1000, 1500],
    'learning_rate': [0.01, 0.1, 0.2],
    'depth': [4, 6, 8],
    'l2_leaf_reg': [1, 3, 5]
}

# Create model
model = CatBoostClassifier(
    random_seed=42,
    verbose=0,
    cat_features=cat_features
)

# Grid search
grid_search = GridSearchCV(
    model, param_grid,
    cv=3, scoring='roc_auc',
    n_jobs=-1, verbose=1
)

grid_search.fit(X_train, y_train)
print(f"Best parameters: {grid_search.best_params_}")
print(f"Best score: {grid_search.best_score_:.4f}")

Working with Quality Metrics

Built‑in Classification Metrics

  • Accuracy – proportion of correct predictions
  • AUC – area under the ROC curve
  • F1 – F1 score
  • Precision – precision
  • Recall – recall
  • Logloss – logarithmic loss

Built‑in Regression Metrics

  • RMSE – root mean squared error
  • MAE – mean absolute error
  • R2 – coefficient of determination
  • MAPE – mean absolute percentage error
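For reference, the regression metrics above reduce to a few lines of NumPy (standard formulas, not CatBoost's internal code):

```python
import numpy as np

def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def mae(y_true, y_pred):
    return float(np.mean(np.abs(y_true - y_pred)))

def mape(y_true, y_pred):
    # undefined when y_true contains zeros
    return float(np.mean(np.abs((y_true - y_pred) / y_true)) * 100)

y_true = np.array([100.0, 200.0, 300.0])
y_pred = np.array([110.0, 190.0, 300.0])
print(mae(y_true, y_pred), mape(y_true, y_pred))  # MAPE = 5.0 here
```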

Using Custom Metrics

A custom eval_metric must be an object implementing evaluate, is_max_optimal, and get_final_error; a plain function is not accepted:

import numpy as np

class CustomMAE:
    def is_max_optimal(self):
        return False  # lower values are better
    def evaluate(self, approxes, target, weight):
        # approxes holds one array of raw predictions per dimension
        return np.abs(np.array(target) - np.array(approxes[0])).sum(), len(target)
    def get_final_error(self, error, weight):
        return error / weight

model = CatBoostRegressor(
    eval_metric=CustomMAE(),
    verbose=100
)

Visualization and Model Analysis

Tracking Training Progress

# Training with visualization
model = CatBoostClassifier(
    iterations=1000,
    verbose=100
)

# plot=True is a fit() argument; it renders an interactive chart in Jupyter
model.fit(X_train, y_train, eval_set=(X_test, y_test), plot=True)

Feature Importance Analysis

# Get feature importance
feature_importance = model.get_feature_importance(prettified=True)
print(feature_importance)

# Plot importance
import matplotlib.pyplot as plt

features = X.columns
importance = model.get_feature_importance()

plt.figure(figsize=(10, 6))
plt.barh(features, importance)
plt.xlabel('Feature Importance')
plt.title('Feature Importance in CatBoost Model')
plt.tight_layout()
plt.show()

SHAP Values

import shap

# Create SHAP explainer
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)

# Plot
shap.summary_plot(shap_values, X_test)

Saving and Loading Models

Saving a Model

# Save in CatBoost native format
model.save_model("catboost_model.cbm")

# Save as JSON
model.save_model("catboost_model.json", format='json')

# Save as ONNX
model.save_model("catboost_model.onnx", format='onnx')

Loading a Model

# Load model
loaded_model = CatBoostClassifier()
loaded_model.load_model("catboost_model.cbm")

# Predict with loaded model
predictions = loaded_model.predict(X_test)

Integration with ML Pipelines

Using in a Scikit‑learn Pipeline

from sklearn.pipeline import Pipeline

# CatBoost needs no feature scaling, and StandardScaler would fail on
# string categorical columns, so the estimator is used directly
pipeline = Pipeline([
    ('classifier', CatBoostClassifier(
        iterations=500,
        verbose=0,
        cat_features=cat_features
    ))
])

pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)

Cross‑Validation

from catboost import Pool, cv

cv_results = cv(
    pool=Pool(X, y, cat_features=cat_features),
    params={
        'iterations': 1000,
        'learning_rate': 0.1,
        'depth': 6,
        'loss_function': 'Logloss',
        'eval_metric': 'AUC'   # required for the test-AUC-mean column below
    },
    fold_count=5,
    shuffle=True,
    stratified=True,
    seed=42,
    verbose=100
)

print(f"Mean AUC: {cv_results['test-AUC-mean'].iloc[-1]:.4f}")

Advanced CatBoost Capabilities

Time‑Series Modeling

ts_model = CatBoostRegressor(
    iterations=1000,
    learning_rate=0.05,
    depth=8,
    loss_function='RMSE',
    eval_metric='MAE',
    random_seed=42
)

ts_model.fit(
    X_train, y_train,
    cat_features=cat_features,
    eval_set=(X_test, y_test),
    use_best_model=True,
    early_stopping_rounds=100
)

Model Ensembling

from catboost import sum_models

models = []
for i in range(3):
    model = CatBoostClassifier(
        iterations=500,
        learning_rate=0.1,
        depth=6,
        random_seed=i,
        verbose=0
    )
    model.fit(X_train, y_train, cat_features=cat_features)
    models.append(model)

ensemble_model = sum_models(models, weights=[0.4, 0.3, 0.3])
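Conceptually, sum_models combines the raw (pre‑sigmoid) scores of the individual models with the given weights; probabilities then come from the sigmoid of the combined score. A simplified NumPy sketch with made‑up scores, not the library's exact mechanics:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# hypothetical raw (pre-sigmoid) scores from three models for two samples
raw_scores = np.array([
    [0.2, -1.0],
    [0.4, -0.5],
    [0.0, -2.0],
])
weights = np.array([0.4, 0.3, 0.3])

# weighted sum of raw scores, as sum_models does for the blended model...
combined = weights @ raw_scores
# ...then the sigmoid turns the combined score into a probability
proba = sigmoid(combined)
print(combined)  # [0.2, -1.15]
```

Averaging probabilities directly would give a slightly different result, since the sigmoid is nonlinear.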

TensorBoard Monitoring

model = CatBoostClassifier(
    iterations=1000,
    learning_rate=0.1,
    depth=6,
    verbose=100,
    train_dir='./catboost_logs'   # TensorBoard logs
)

model.fit(X_train, y_train, cat_features=cat_features)

Summary of Core CatBoost Methods and Functions

Model Creation
  • CatBoostClassifier() – classification model: model = CatBoostClassifier(iterations=1000)
  • CatBoostRegressor() – regression model: model = CatBoostRegressor(depth=6)
  • CatBoostRanker() – ranking model: model = CatBoostRanker(learning_rate=0.1)
  • Pool() – data container: pool = Pool(X, y, cat_features=[0, 1])

Training
  • fit() – train the model: model.fit(X_train, y_train, cat_features=cat_features)

Prediction
  • predict() – get predictions: y_pred = model.predict(X_test)
  • predict_proba() – class probabilities: proba = model.predict_proba(X_test)
  • staged_predict() – predictions after each iteration: staged_preds = model.staged_predict(X_test)
  • predict_log_proba() – log‑probabilities: log_proba = model.predict_log_proba(X_test)

Evaluation
  • score() – default model scoring: accuracy = model.score(X_test, y_test)
  • eval_metrics() – compute metrics on a Pool: metrics = model.eval_metrics(pool, ['AUC', 'Accuracy'])
  • get_best_score() – best validation result: best_score = model.get_best_score()
  • get_evals_result() – training metrics history: evals = model.get_evals_result()

Model Analysis
  • get_feature_importance() – feature importance: importance = model.get_feature_importance()
  • get_object_importance() – influence of training objects: obj_importance = model.get_object_importance(pool, train_pool)
  • calc_feature_statistics() – per‑feature statistics: stats = model.calc_feature_statistics(X, y)
  • feature_names_ – feature names: names = model.feature_names_

Save / Load
  • save_model() – save the model: model.save_model('model.cbm')
  • load_model() – load a model: model.load_model('model.cbm')
  • copy() – duplicate the model: model_copy = model.copy()

Parameters
  • get_params() – user‑set parameters: params = model.get_params()
  • set_params() – set parameters: model.set_params(iterations=2000)
  • get_all_params() – all parameters, including defaults: params = model.get_all_params()

Cross‑Validation
  • cv() – cross‑validation: cv_results = cv(pool, params, fold_count=5)

Visualization
  • plot_tree() – visualize a single tree: model.plot_tree(tree_idx=0)
  • plot_predictions() – how predictions react to feature changes: model.plot_predictions(data, features_to_change)

Utilities
  • sum_models() – blend trained models: ensemble = sum_models([model1, model2])
  • to_regressor() / to_classifier() – convert model type: regressor = to_regressor(model)
  • select_features() – iterative feature selection: summary = model.select_features(train_pool, features_for_select=list(X.columns), num_features_to_select=10)

Practical Use Cases

Customer Churn Analysis

# Prepare churn data
churn_model = CatBoostClassifier(
    iterations=1000,
    learning_rate=0.1,
    depth=6,
    loss_function='Logloss',
    eval_metric='AUC',
    class_weights=[1, 3],   # Balance classes
    random_seed=42
)

churn_model.fit(
    X_train, y_train,
    cat_features=['region', 'tariff_plan', 'payment_method'],
    eval_set=(X_test, y_test),
    early_stopping_rounds=50,
    verbose=100
)

# Results
churn_proba = churn_model.predict_proba(X_test)[:, 1]
high_risk_customers = X_test[churn_proba > 0.7]

Real‑Estate Price Forecasting

price_model = CatBoostRegressor(
    iterations=1500,
    learning_rate=0.05,
    depth=8,
    loss_function='RMSE',
    eval_metric='MAE',
    random_seed=42
)

price_model.fit(
    X_train, y_train,
    cat_features=['district', 'building_type', 'condition'],
    eval_set=(X_test, y_test),
    early_stopping_rounds=100,
    verbose=100
)

predicted_prices = price_model.predict(X_test)

Recommendation System

ranker = CatBoostRanker(
    iterations=1000,
    learning_rate=0.1,
    depth=6,
    loss_function='YetiRank',
    random_seed=42
)

ranker.fit(
    X_train, y_train,
    group_id=group_ids_train,
    cat_features=['category', 'brand', 'season'],
    eval_set=(X_test, y_test, group_ids_test),
    verbose=100
)

recommendations = ranker.predict(X_candidates)

Performance Optimization

Configuring for Large Datasets

large_data_model = CatBoostClassifier(
    iterations=500,
    learning_rate=0.2,
    depth=6,
    
    # Memory optimization
    max_ctr_complexity=1,
    simple_ctr=['Borders', 'Counter'],
    
    # Multithreading
    thread_count=8,
    
    # GPU acceleration
    task_type='GPU',
    devices='0:1',
    
    random_seed=42,
    verbose=100
)

Using GPU

gpu_model = CatBoostClassifier(
    iterations=1000,
    learning_rate=0.1,
    depth=6,
    
    task_type='GPU',
    devices='0',
    gpu_ram_part=0.8,
    
    random_seed=42
)

Debugging and Monitoring

Tracking Overfitting

model = CatBoostClassifier(
    iterations=2000,
    learning_rate=0.1,
    depth=6,
    
    early_stopping_rounds=50,
    metric_period=50,
    verbose=50,
    
    # Snapshot saving
    save_snapshot=True,
    snapshot_file='model_snapshot.cbm',
    snapshot_interval=600,   # every 10 minutes
    
    random_seed=42
)

Metric Logging

model = CatBoostClassifier(
    iterations=1000,
    learning_rate=0.1,
    depth=6,
    
    train_dir='./catboost_logs',
    logging_level='Verbose',
    
    random_seed=42
)

Frequently Asked Questions

How does CatBoost handle categorical features?
CatBoost uses ordered target statistics (the library's "CTR" features): for each category it computes target‑based statistics using only the objects that come earlier in a random permutation, which avoids target leakage. This allows effective use of categorical information without prior encoding.

Do I need to preprocess data beforehand?
CatBoost minimizes the need for preprocessing. It handles missing numeric values and categorical features automatically and does not require scaling of numeric columns. Nevertheless, basic data cleaning and outlier analysis are still recommended.

How to choose optimal hyperparameters?
Use cross‑validation combined with automated search methods (GridSearchCV, RandomizedSearchCV). Start with default values and iteratively tune key parameters such as iterations, learning_rate, and depth.

Can CatBoost be used for time‑series data?
Yes. CatBoost works well with time‑series when you create appropriate lag features and apply validation that respects temporal order.
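The lag‑feature approach can be sketched with pandas; the column name and series below are made up for illustration:

```python
import pandas as pd

# hypothetical daily series
df = pd.DataFrame({'sales': [10, 12, 13, 15, 14, 16, 18, 17]})

# lag features: the model sees past values as inputs
for lag in (1, 2, 3):
    df[f'sales_lag_{lag}'] = df['sales'].shift(lag)
df = df.dropna().reset_index(drop=True)

# temporal split: never shuffle time series
split = int(len(df) * 0.8)
train, test = df.iloc[:split], df.iloc[split:]
print(len(train), len(test))
```

The same idea scales to rolling means, calendar features, and multi‑step horizons.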

How to interpret model results?
CatBoost provides several interpretation tools: feature importance, SHAP values, and tree visualizations. Use these to understand the impact of different factors on predictions.

Conclusion

CatBoost is a powerful, modern machine‑learning tool that excels with tabular data containing categorical features. Its main strengths include automatic categorical handling, high predictive accuracy, resistance to overfitting, and ease of use.

The library fits a broad spectrum of tasks—from classification and regression to ranking and recommendation systems. In industrial settings, where high‑quality predictions are required with minimal data‑preparation effort, CatBoost is especially valuable.

Choosing CatBoost makes sense when you work with real‑world tabular datasets rich in categorical variables and need an “out‑of‑the‑box” solution that delivers top‑tier performance with minimal tuning.
