XGBoost: Gradient Boosting


Introduction to XGBoost

XGBoost (Extreme Gradient Boosting) is a high-performance implementation of gradient boosting that has become a de facto standard in machine learning. Created by Tianqi Chen and formally described in a 2016 paper by Chen and Carlos Guestrin, the library has had a major impact on applied data science and has powered many winning Kaggle solutions.

XGBoost is an optimized, distributed gradient boosting library designed for high performance, flexibility, and portability. It is widely used in classification, regression, and ranking tasks and is known for its stability and strong generalization.

What is Gradient Boosting?

Gradient boosting is a powerful ensemble machine learning method based on the sequential training of weak models (usually decision trees). Each new model in the ensemble focuses on correcting the errors made by the previous models.

How Gradient Boosting Works

The algorithm works as follows:

  • Initialization: A base model is created (usually a constant prediction).
  • Iterative Training: At each iteration, a new model is added, which predicts the residuals (errors) of the previous models.
  • Optimization: Gradient descent is used to minimize the loss function.
  • Aggregation: The final prediction is obtained by summing the predictions of all models with corresponding weights.
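
The four steps above can be illustrated with a toy implementation in plain Python using scikit-learn decision trees (a sketch of the general idea, not XGBoost itself; the function name and data here are ours):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def toy_gradient_boosting(X, y, n_rounds=50, learning_rate=0.1):
    """Minimal gradient boosting for squared loss: each new tree fits
    the residuals (negative gradients) of the current ensemble."""
    base = np.full(len(y), y.mean())             # 1. constant initialization
    pred, trees = base.copy(), []
    for _ in range(n_rounds):                    # 2. iterative training
        residuals = y - pred                     # errors of the ensemble so far
        tree = DecisionTreeRegressor(max_depth=3).fit(X, residuals)
        pred += learning_rate * tree.predict(X)  # 3.-4. weighted aggregation
        trees.append(tree)
    return base[0], trees

# Usage: fit a noisy quadratic and check the ensemble beats the plain mean
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = X[:, 0] ** 2 + rng.normal(scale=0.1, size=200)
base, trees = toy_gradient_boosting(X, y)
final_pred = base + 0.1 * sum(t.predict(X) for t in trees)
print(np.mean((y - final_pred) ** 2) < np.mean((y - y.mean()) ** 2))  # True
```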

XGBoost Architecture and Principles

XGBoost implements an advanced version of gradient boosting with a number of innovative optimizations:

Core Architectural Decisions

  • Second-Order Optimization: XGBoost uses information about the second derivatives (Hessian) of the loss function, which provides a more accurate approximation and faster convergence.
  • Regularization: Built-in support for L1 and L2 regularization prevents overfitting at the algorithm level.
  • Sparse Data: A sparsity-aware algorithm efficiently processes sparse data, automatically learning the optimal direction for missing values.
  • Parallel Computation: The tree building algorithm is parallelized, which significantly speeds up training.
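
As a small illustration of the second-order optimization point, here is the closed-form leaf weight that XGBoost's derivation leads to, w* = -G / (H + lambda), computed for the standard logistic loss (variable and function names here are ours):

```python
import numpy as np

def logistic_grad_hess(y_true, raw_pred):
    """First and second derivatives of the logistic loss with respect to
    the raw (pre-sigmoid) score, as used by the second-order expansion."""
    p = 1.0 / (1.0 + np.exp(-raw_pred))
    return p - y_true, p * (1.0 - p)

def optimal_leaf_weight(g, h, reg_lambda=1.0):
    """Closed-form leaf weight w* = -G / (H + lambda) with L2 regularization."""
    return -g.sum() / (h.sum() + reg_lambda)

y = np.array([1.0, 1.0, 0.0, 1.0])
raw = np.zeros(4)                 # start from a raw score of 0 (p = 0.5)
g, h = logistic_grad_hess(y, raw)
w = optimal_leaf_weight(g, h, reg_lambda=1.0)
print(round(w, 3))  # 0.5
```

The Hessian term in the denominator is what makes the step size adapt to the curvature of the loss, rather than relying on the gradient alone.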

Tree Building Process

XGBoost uses a unique tree building algorithm:

  • Histogram Approach: Instead of sorting all feature values, histograms are used, which speeds up the search for optimal splits.
  • Greedy Search: A greedy strategy is used to select the best node splits.
  • Approximate Algorithm: For large datasets, an approximate algorithm with quantiles is used.
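
A minimal sketch of histogram-based split finding under squared loss (simplified: the 1/2 factor and the gamma penalty of the real gain formula are omitted, and all names are ours):

```python
import numpy as np

def best_histogram_split(x, g, h, n_bins=16, reg_lambda=1.0):
    """Bucket one feature into n_bins, accumulate gradient/hessian sums
    per bin, then scan bin boundaries instead of every sorted value."""
    edges = np.quantile(x, np.linspace(0, 1, n_bins + 1)[1:-1])
    bins = np.searchsorted(edges, x)
    Gb = np.bincount(bins, weights=g, minlength=n_bins)
    Hb = np.bincount(bins, weights=h, minlength=n_bins)
    G, H = Gb.sum(), Hb.sum()

    def score(gs, hs):                    # leaf objective term G^2 / (H + lambda)
        return gs * gs / (hs + reg_lambda)

    best_gain, best_bin = 0.0, None
    GL = HL = 0.0
    for b in range(n_bins - 1):           # scan candidate boundaries
        GL, HL = GL + Gb[b], HL + Hb[b]
        gain = score(GL, HL) + score(G - GL, H - HL) - score(G, H)
        if gain > best_gain:
            best_gain, best_bin = gain, b
    return best_bin, best_gain

# Usage: a clean threshold at 0.5 should give a strongly positive gain
rng = np.random.default_rng(1)
x = rng.uniform(0, 1, 500)
y = (x > 0.5).astype(float)
g = 0.5 - y                  # squared-loss gradient at a constant prediction 0.5
h = np.ones_like(x)
b, gain = best_histogram_split(x, g, h)
print(gain > 0)  # True
```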

XGBoost Installation and Setup

Installing the Library

Standard installation via pip:

pip install xgboost

For GPU support: recent pre-built wheels from PyPI already include CUDA support on Linux, so no separate package or extra is required; the standard pip install xgboost is sufficient.

Installation from source (the old python setup.py install route is deprecated; current releases are built with CMake and installed via pip):

git clone --recursive https://github.com/dmlc/xgboost
cd xgboost
mkdir build && cd build
cmake ..
make -j$(nproc)
cd ../python-package
pip install .

Import and Basic Setup

import xgboost as xgb
from xgboost import XGBClassifier, XGBRegressor
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, mean_squared_error

Basic XGBoost Interfaces

XGBoost provides three main interfaces for working with the library:

Learning API (Native Interface)

# Creating DMatrix
dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(X_test, label=y_test)

# Model parameters (the native API sets the number of trees via
# num_boost_round, not n_estimators)
params = {
    'objective': 'binary:logistic',
    'max_depth': 3,
    'learning_rate': 0.1
}

# Training
model = xgb.train(params, dtrain, num_boost_round=100)

# Prediction
predictions = model.predict(dtest)

Scikit-learn API

# Creating model
model = XGBClassifier(
    n_estimators=100,
    max_depth=3,
    learning_rate=0.1,
    objective='binary:logistic'
)

# Training
model.fit(X_train, y_train)

# Prediction
predictions = model.predict(X_test)

Sklearn API with Additional Features

# Training with a validation set
# (since XGBoost 2.0, early_stopping_rounds is a constructor
# argument rather than a fit() argument)
model = XGBClassifier(
    n_estimators=100,
    max_depth=3,
    learning_rate=0.1,
    objective='binary:logistic',
    early_stopping_rounds=10
)
model.fit(
    X_train, y_train,
    eval_set=[(X_test, y_test)],
    verbose=True
)

Basic XGBoost Methods and Functions

  • xgb.train(): trains a model with the native API, e.g. xgb.train(params, dtrain, num_boost_round=100)
  • xgb.cv(): cross-validation, e.g. xgb.cv(params, dtrain, nfold=5, num_boost_round=100)
  • XGBClassifier(): classifier for the sklearn API, e.g. XGBClassifier(n_estimators=100, max_depth=3)
  • XGBRegressor(): regressor for the sklearn API, e.g. XGBRegressor(n_estimators=100, max_depth=3)
  • fit(): model training, e.g. model.fit(X_train, y_train)
  • predict(): prediction, e.g. model.predict(X_test)
  • predict_proba(): class probabilities, e.g. model.predict_proba(X_test)
  • save_model(): saving a model, e.g. model.save_model('model.json')
  • load_model(): loading a model, e.g. model.load_model('model.json')
  • get_params(): getting parameters, e.g. model.get_params()
  • set_params(): setting parameters, e.g. model.set_params(max_depth=5)
  • feature_importances_: feature importance, e.g. model.feature_importances_
  • plot_importance(): visualizing importance, e.g. xgb.plot_importance(model)
  • plot_tree(): visualizing a tree, e.g. xgb.plot_tree(model, num_trees=0)
  • DMatrix(): creating an optimized data structure, e.g. xgb.DMatrix(X, label=y)

Detailed Description of XGBoost Parameters

Key Training Parameters

  • n_estimators (num_boost_round): The number of trees in the ensemble. More trees can improve quality, but increase training time and the risk of overfitting.
  • max_depth: The maximum depth of the trees. Controls the complexity of the model. Typical values: 3-10.
  • learning_rate (eta): The learning rate. Smaller values require more iterations but can lead to better quality. Usually: 0.01-0.3.
  • subsample: The fraction of the sample used to train each tree. Helps prevent overfitting. Recommended: 0.5-1.0.
  • colsample_bytree: The fraction of features used for each tree. Similar to subsample, but for features.

Regularization Parameters

  • gamma (min_split_loss): The minimum reduction in the loss function to split a node. The larger it is, the more conservative the algorithm.
  • lambda (reg_lambda): L2 regularization of leaf weights. Helps smooth the weights of the final leaves.
  • alpha (reg_alpha): L1 regularization of leaf weights. Can lead to sparser models.
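
In notation close to the XGBoost paper, these parameters enter the regularization term of the objective (T is the number of leaves of a tree f, w its vector of leaf weights; the paper itself uses only the gamma and lambda terms, with the L1 term added in the implementation):

```latex
\text{Obj} = \sum_{i} l(y_i, \hat{y}_i) + \sum_{k} \Omega(f_k),
\qquad
\Omega(f) = \gamma T + \tfrac{1}{2}\lambda \lVert w \rVert^2 + \alpha \lVert w \rVert_1
```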

Parameters for Working with Data

  • missing: The value to represent missing data. Default: np.nan.
  • scale_pos_weight: Balancing classes for binary classification.
  • max_delta_step: A constraint on the change in weights. Useful for imbalanced data.
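
For scale_pos_weight, a commonly recommended starting value is the ratio of negative to positive examples; as a quick illustration (the label vector here is hypothetical):

```python
import numpy as np

# Hypothetical imbalanced label vector: 90 negatives, 10 positives
y_train = np.array([0] * 90 + [1] * 10)

# Recommended starting point: count(negative) / count(positive)
scale_pos_weight = (y_train == 0).sum() / (y_train == 1).sum()
print(scale_pos_weight)  # 9.0
```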

Practical Examples of Using XGBoost

Example 1: Classification with Parameter Tuning

from xgboost import XGBClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score, classification_report

# Loading data
data = load_breast_cancer()
X, y = data.data, data.target

# Splitting data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Creating a model (in XGBoost >= 2.0, early_stopping_rounds is a
# constructor argument rather than a fit() argument)
model = XGBClassifier(
    n_estimators=100,
    max_depth=4,
    learning_rate=0.1,
    subsample=0.8,
    colsample_bytree=0.8,
    early_stopping_rounds=10,
    random_state=42
)

# Training with early stopping
model.fit(
    X_train, y_train,
    eval_set=[(X_test, y_test)],
    verbose=False
)

# Prediction and evaluation
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.4f}")
print(classification_report(y_test, y_pred))

Example 2: Regression with Cross-Validation

from xgboost import XGBRegressor
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np

# Loading data
from sklearn.datasets import fetch_california_housing
data = fetch_california_housing()
X, y = data.data, data.target

# Splitting data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Creating a model (early stopping configured in the constructor)
model = XGBRegressor(
    n_estimators=1000,
    max_depth=5,
    learning_rate=0.05,
    subsample=0.8,
    colsample_bytree=0.8,
    early_stopping_rounds=50,
    random_state=42
)

# Training
model.fit(
    X_train, y_train,
    eval_set=[(X_test, y_test)],
    verbose=False
)

# Prediction
y_pred = model.predict(X_test)

# Quality assessment
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"MSE: {mse:.4f}")
print(f"R²: {r2:.4f}")

# Cross-validation (a fresh model without early stopping, since
# cross_val_score does not provide an eval_set)
cv_model = XGBRegressor(
    n_estimators=200, max_depth=5, learning_rate=0.05,
    subsample=0.8, colsample_bytree=0.8, random_state=42
)
cv_scores = cross_val_score(cv_model, X_train, y_train, cv=5, scoring='r2')
print(f"CV R² Score: {cv_scores.mean():.4f} (+/- {cv_scores.std() * 2:.4f})")

Example 3: Using the Native API

import numpy as np
import xgboost as xgb
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Loading data
data = load_iris()
X, y = data.data, data.target

# Splitting data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Creating DMatrix
dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(X_test, label=y_test)

# Model parameters
params = {
    'objective': 'multi:softprob',
    'num_class': 3,
    'max_depth': 4,
    'learning_rate': 0.1,
    'subsample': 0.8,
    'colsample_bytree': 0.8,
    'seed': 42
}

# Training with validation
watchlist = [(dtrain, 'train'), (dtest, 'test')]
model = xgb.train(
    params, dtrain,
    num_boost_round=100,
    evals=watchlist,
    early_stopping_rounds=10,
    verbose_eval=False
)

# Prediction
predictions = model.predict(dtest)
predicted_labels = np.argmax(predictions, axis=1)

# Accuracy assessment
accuracy = accuracy_score(y_test, predicted_labels)
print(f"Accuracy: {accuracy:.4f}")

Hyperparameter Optimization

Parameter Selection with GridSearchCV

from sklearn.model_selection import GridSearchCV

# Defining parameter grid
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [3, 4, 5, 6],
    'learning_rate': [0.01, 0.1, 0.2],
    'subsample': [0.8, 0.9, 1.0],
    'colsample_bytree': [0.8, 0.9, 1.0]
}

# Creating model
xgb_model = XGBClassifier(random_state=42)

# Finding optimal parameters
grid_search = GridSearchCV(
    xgb_model, param_grid,
    cv=5, scoring='accuracy',
    n_jobs=-1, verbose=1
)

grid_search.fit(X_train, y_train)

# Best parameters
print("Best parameters:", grid_search.best_params_)
print("Best result:", grid_search.best_score_)

Bayesian Optimization with Optuna

import optuna
from sklearn.model_selection import cross_val_score

def objective(trial):
    params = {
        'n_estimators': trial.suggest_int('n_estimators', 50, 500),
        'max_depth': trial.suggest_int('max_depth', 3, 10),
        'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.3),
        'subsample': trial.suggest_float('subsample', 0.6, 1.0),
        'colsample_bytree': trial.suggest_float('colsample_bytree', 0.6, 1.0),
        'gamma': trial.suggest_float('gamma', 0, 5),
        'reg_alpha': trial.suggest_float('reg_alpha', 0, 5),
        'reg_lambda': trial.suggest_float('reg_lambda', 0, 5)
    }
    
    model = XGBClassifier(**params, random_state=42)
    score = cross_val_score(model, X_train, y_train, cv=3, scoring='accuracy').mean()
    return score

# Creating and running optimization
study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=100)

print("Best parameters:", study.best_params)
print("Best result:", study.best_value)

Model Visualization and Interpretation

Feature Importance

import matplotlib.pyplot as plt
from xgboost import plot_importance

# Model training
model = XGBClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Visualizing feature importance
fig, ax = plt.subplots(figsize=(10, 8))
plot_importance(model, ax=ax, importance_type='weight')
plt.title('Feature Importance (by usage count)')
plt.tight_layout()
plt.show()

# Getting numerical importance values
feature_importance = model.feature_importances_
print("Feature importance:")
for i, importance in enumerate(feature_importance):
    print(f"Feature {i}: {importance:.4f}")

Tree Visualization

from xgboost import plot_tree

# Visualizing the first tree
fig, ax = plt.subplots(figsize=(20, 10))
plot_tree(model, num_trees=0, ax=ax)
plt.title('First tree structure')
plt.show()

SHAP for Interpretation

import shap

# Creating an object for explanation
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)

# Visualization
shap.summary_plot(shap_values, X_test, feature_names=data.feature_names)

Working with Different Data Types

Time Series

import numpy as np
import pandas as pd

# Creating a time series
dates = pd.date_range(start='2020-01-01', end='2023-12-31', freq='D')
np.random.seed(42)
values = np.cumsum(np.random.randn(len(dates))) + 100

df = pd.DataFrame({'date': dates, 'value': values})

# Creating features
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['day'] = df['date'].dt.day
df['weekday'] = df['date'].dt.weekday
df['lag_1'] = df['value'].shift(1)
df['lag_7'] = df['value'].shift(7)
df['rolling_mean_7'] = df['value'].rolling(window=7).mean()

# Preparing data for training
df = df.dropna()
feature_columns = ['year', 'month', 'day', 'weekday', 'lag_1', 'lag_7', 'rolling_mean_7']
X = df[feature_columns]
y = df['value']

# Splitting into training and test sets
split_date = '2023-06-01'
train_mask = df['date'] < split_date
X_train, X_test = X[train_mask], X[~train_mask]
y_train, y_test = y[train_mask], y[~train_mask]

# Training model
model = XGBRegressor(n_estimators=100, max_depth=5, random_state=42)
model.fit(X_train, y_train)

# Prediction
y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
print(f"MSE for time series: {mse:.4f}")

Processing Categorical Features

from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer

# Example data with categorical features
data = {
    'category_1': ['A', 'B', 'C', 'A', 'B'],
    'category_2': ['X', 'Y', 'Z', 'X', 'Y'],
    'numeric_1': [1, 2, 3, 4, 5],
    'numeric_2': [10, 20, 30, 40, 50],
    'target': [0, 1, 0, 1, 0]
}

df = pd.DataFrame(data)

# Method 1: Label Encoding
le = LabelEncoder()
df['category_1_encoded'] = le.fit_transform(df['category_1'])
df['category_2_encoded'] = le.fit_transform(df['category_2'])

# Method 2: One-Hot Encoding
preprocessor = ColumnTransformer(
    transformers=[
        ('num', 'passthrough', ['numeric_1', 'numeric_2']),
        ('cat', OneHotEncoder(drop='first'), ['category_1', 'category_2'])
    ]
)

X_processed = preprocessor.fit_transform(df[['numeric_1', 'numeric_2', 'category_1', 'category_2']])

Advanced Techniques and Optimizations

Using Custom Loss Functions

def custom_objective(y_pred, y_true):
    """Custom loss function"""
    grad = y_pred - y_true.get_label()
    hess = np.ones_like(grad)
    return grad, hess

def custom_eval(y_pred, y_true):
    """Custom evaluation metric"""
    labels = y_true.get_label()
    error = mean_squared_error(labels, y_pred)
    return 'custom_mse', error

# Training with custom functions
dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(X_test, label=y_test)

params = {
    'max_depth': 4,
    'learning_rate': 0.1,
    'disable_default_eval_metric': 1
}

model = xgb.train(
    params, dtrain,
    num_boost_round=100,
    obj=custom_objective,
    custom_metric=custom_eval,  # replaces the deprecated 'feval' argument
    evals=[(dtest, 'test')],
    verbose_eval=False
)

Working with Imbalanced Data

from sklearn.utils.class_weight import compute_sample_weight

# Automatic weight calculation
sample_weights = compute_sample_weight('balanced', y_train)

# Training with weights
model = XGBClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train, sample_weight=sample_weights)

# Alternative approach via scale_pos_weight
pos_weight = (y_train == 0).sum() / (y_train == 1).sum()
model_balanced = XGBClassifier(
    n_estimators=100,
    scale_pos_weight=pos_weight,
    random_state=42
)
model_balanced.fit(X_train, y_train)

Using GPU for Acceleration

# Checking GPU availability
# (XGBoost >= 2.0 selects the GPU via device='cuda';
#  tree_method='gpu_hist' is the deprecated older spelling)
try:
    model_gpu = XGBClassifier(
        n_estimators=100,
        max_depth=5,
        tree_method='hist',
        device='cuda',
        random_state=42
    )
    model_gpu.fit(X_train, y_train)
    print("GPU acceleration available")
except Exception:
    print("GPU unavailable, using CPU")

Model Monitoring and Debugging

Tracking Metrics During Training

from sklearn.metrics import log_loss

# Custom function for tracking metrics
def monitor_performance(model, X_train, y_train, X_val, y_val):
    results = {'train_loss': [], 'val_loss': [], 'val_accuracy': []}
    
    for i in range(1, model.n_estimators + 1):
        # Predictions at each iteration
        train_pred = model.predict_proba(X_train, iteration_range=(0, i))[:, 1]
        val_pred = model.predict_proba(X_val, iteration_range=(0, i))[:, 1]
        
        # Metric calculation
        train_loss = log_loss(y_train, train_pred)
        val_loss = log_loss(y_val, val_pred)
        val_accuracy = accuracy_score(y_val, (val_pred > 0.5).astype(int))
        
        results['train_loss'].append(train_loss)
        results['val_loss'].append(val_loss)
        results['val_accuracy'].append(val_accuracy)
    
    return results

# Usage
model = XGBClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
metrics = monitor_performance(model, X_train, y_train, X_test, y_test)

# Visualization
plt.figure(figsize=(12, 4))
plt.subplot(1, 2, 1)
plt.plot(metrics['train_loss'], label='Train Loss')
plt.plot(metrics['val_loss'], label='Validation Loss')
plt.xlabel('Iteration')
plt.ylabel('Log Loss')
plt.legend()
plt.title('Training Progress')

plt.subplot(1, 2, 2)
plt.plot(metrics['val_accuracy'])
plt.xlabel('Iteration')
plt.ylabel('Validation Accuracy')
plt.title('Validation Accuracy')
plt.tight_layout()
plt.show()

Integration with Other Libraries

Using with Pandas

import pandas as pd

# Creating a model that works with DataFrame
class XGBPandasWrapper:
    def __init__(self, **kwargs):
        self.model = XGBClassifier(**kwargs)
        self.feature_names = None
    
    def fit(self, X, y):
        if isinstance(X, pd.DataFrame):
            self.feature_names = X.columns.tolist()
        self.model.fit(X, y)
        return self
    
    def predict(self, X):
        return self.model.predict(X)
    
    def predict_proba(self, X):
        return self.model.predict_proba(X)
    
    def get_feature_importance(self):
        if self.feature_names:
            return pd.Series(
                self.model.feature_importances_,
                index=self.feature_names
            ).sort_values(ascending=False)
        return self.model.feature_importances_

# Usage
model_wrapper = XGBPandasWrapper(n_estimators=100, random_state=42)
df_train = pd.DataFrame(X_train, columns=[f'feature_{i}' for i in range(X_train.shape[1])])
model_wrapper.fit(df_train, y_train)
importance = model_wrapper.get_feature_importance()
print(importance.head())

Creating a Pipeline with Scikit-learn

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_classif

# Creating pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('feature_selection', SelectKBest(f_classif, k=10)),
    ('classifier', XGBClassifier(n_estimators=100, random_state=42))
])

# Training pipeline
pipeline.fit(X_train, y_train)

# Prediction
y_pred = pipeline.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Pipeline accuracy: {accuracy:.4f}")

# Getting feature importance after selection
selected_features = pipeline.named_steps['feature_selection'].get_support()
xgb_importance = pipeline.named_steps['classifier'].feature_importances_
print(f"Number of selected features: {selected_features.sum()}")

Deploying a Model in Production

Saving and Loading a Model

import pickle
import joblib

# Different ways to save a model

# 1. Built-in XGBoost methods
model.save_model('xgb_model.json')
model.save_model('xgb_model.ubj')  # More compact format

# 2. Using pickle
with open('xgb_model.pkl', 'wb') as f:
    pickle.dump(model, f)

# 3. Using joblib (convenient for sklearn pipelines; note that the
#    native save_model format is more portable across XGBoost versions)
joblib.dump(model, 'xgb_model.joblib')

# Loading model
loaded_model = XGBClassifier()
loaded_model.load_model('xgb_model.json')

# Or
loaded_model = joblib.load('xgb_model.joblib')

# Checking load correctness
test_pred_original = model.predict(X_test)
test_pred_loaded = loaded_model.predict(X_test)
print(f"Models are identical: {np.array_equal(test_pred_original, test_pred_loaded)}")

Creating a REST API with Flask

from flask import Flask, request, jsonify
import numpy as np

app = Flask(__name__)

# Loading model when application starts
model = joblib.load('xgb_model.joblib')

@app.route('/predict', methods=['POST'])
def predict():
    try:
        # Getting data from request
        data = request.json
        features = np.array(data['features']).reshape(1, -1)
        
        # Prediction
        prediction = model.predict(features)[0]
        probability = model.predict_proba(features)[0].tolist()
        
        return jsonify({
            'prediction': int(prediction),
            'probability': probability,
            'status': 'success'
        })
    
    except Exception as e:
        return jsonify({
            'error': str(e),
            'status': 'error'
        }), 400

@app.route('/health', methods=['GET'])
def health():
    return jsonify({'status': 'healthy'})

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)

Best Practices and Recommendations

Preventing Overfitting

  • Use cross-validation to assess model quality.
  • Apply early stopping if there is a validation set.
  • Adjust regularization parameters (gamma, lambda, alpha).
  • Limit model complexity via max_depth and min_child_weight.
  • Use subsampling to reduce variance.
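
The early-stopping advice boils down to tracking the validation loss with a patience counter; a minimal sketch of the mechanism (not XGBoost's internal implementation):

```python
def early_stopping_round(val_losses, patience=10):
    """Return the round at which training would stop: the first round
    where the validation loss has not improved for `patience` rounds."""
    best, best_round = float('inf'), 0
    for i, loss in enumerate(val_losses):
        if loss < best:
            best, best_round = loss, i
        elif i - best_round >= patience:
            return i  # stop here; the best model is at best_round
    return len(val_losses) - 1

# Usage: the loss improves for 5 rounds, then plateaus
losses = [1.0, 0.8, 0.6, 0.5, 0.45] + [0.46] * 20
print(early_stopping_round(losses, patience=10))  # 14
```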

Performance Optimization

# Settings for maximum performance
high_performance_params = {
    'n_estimators': 100,
    'max_depth': 6,
    'learning_rate': 0.1,
    'subsample': 0.8,
    'colsample_bytree': 0.8,
    'tree_method': 'hist',  # Fast algorithm
    'n_jobs': -1,  # Using all CPUs
    'random_state': 42
}

# For large data
large_data_params = {
    'n_estimators': 100,
    'max_depth': 4,
    'learning_rate': 0.1,
    'subsample': 0.5,
    'colsample_bytree': 0.5,
    'tree_method': 'approx',  # Approximate (quantile-sketch) algorithm
    'max_bin': 256,  # Sketch accuracy (replaces the removed sketch_eps)
    'n_jobs': -1,
    'random_state': 42
}

Working with Memory

import gc

def train_with_memory_optimization(X_train, y_train, X_val, y_val):
    # Creating DMatrix to save memory
    dtrain = xgb.DMatrix(X_train, label=y_train)
    dval = xgb.DMatrix(X_val, label=y_val)
    
    # Removing source data from memory
    del X_train, y_train, X_val, y_val
    gc.collect()
    
    params = {
        'max_depth': 4,
        'learning_rate': 0.1,
        'objective': 'binary:logistic',
        'eval_metric': 'logloss'
    }
    
    model = xgb.train(
        params, dtrain,
        num_boost_round=100,
        evals=[(dval, 'val')],
        early_stopping_rounds=10,
        verbose_eval=False
    )
    
    return model

Debugging and Troubleshooting

Diagnosing Training Issues

def diagnose_training_issues(model, X_train, y_train, X_val, y_val):
    """Diagnosing XGBoost training problems"""
    
    # Checking for overfitting
    train_pred = model.predict(X_train)
    val_pred = model.predict(X_val)
    
    train_acc = accuracy_score(y_train, train_pred)
    val_acc = accuracy_score(y_val, val_pred)
    
    print(f"Accuracy on the training set: {train_acc:.4f}")
    print(f"Accuracy on the validation set: {val_acc:.4f}")
    print(f"Difference: {train_acc - val_acc:.4f}")
    
    if train_acc - val_acc > 0.1:
        print("⚠️  Possible overfitting!")
        print("Recommendations:")
        print("- Increase regularization (gamma, lambda, alpha)")
        print("- Reduce max_depth")
        print("- Use early_stopping_rounds")
        print("- Reduce learning_rate and increase n_estimators")
    
    # Checking feature importance
    importance = model.feature_importances_
    zero_importance = (importance == 0).sum()
    
    print(f"Features with zero importance: {zero_importance}")
    if zero_importance > len(importance) * 0.5:
        print("⚠️  Many features with zero importance!")
        print("Recommendations:")
        print("- Perform feature selection")
        print("- Check data quality")
    
    # Checking class distribution
    if hasattr(model, 'classes_'):
        class_distribution = np.bincount(y_train)
        imbalance_ratio = max(class_distribution) / min(class_distribution)
        
        print(f"Class ratio: {imbalance_ratio:.2f}")
        if imbalance_ratio > 5:
            print("⚠️  Strong class imbalance!")
            print("Recommendations:")
            print("- Use scale_pos_weight")
            print("- Apply sample_weight during training")
            print("- Consider using SMOTE")

# Usage
diagnose_training_issues(model, X_train, y_train, X_test, y_test)

Handling Errors and Exceptions

def safe_xgboost_training(X_train, y_train, X_val, y_val, params):
    """Safe XGBoost training with error handling"""
    
    try:
        # Checking input data
        if X_train.shape[0] != len(y_train):
            raise ValueError("X_train and y_train sizes do not match")
        
        if X_val.shape[0] != len(y_val):
            raise ValueError("X_val and y_val sizes do not match")
        
        if X_train.shape[1] != X_val.shape[1]:
            raise ValueError("Number of features in train and val does not match")
        
        # Checking for missing values
        if np.isnan(X_train).any():
            print("⚠️  Missing values found in X_train")
        
        if np.isnan(X_val).any():
            print("⚠️  Missing values found in X_val")
        
        # Creating and training the model
        model = XGBClassifier(**params)
        model.fit(
            X_train, y_train,
            eval_set=[(X_val, y_val)],
            verbose=False
        )
        return model

    except ValueError as e:
        print(f"Data error: {e}")
        return None
    except Exception as e:
        print(f"Unexpected training error: {e}")
        return None
