Introduction to XGBoost
XGBoost (Extreme Gradient Boosting) is a high-performance implementation of gradient boosting that has become a de facto standard in machine learning. Created by Tianqi Chen (first released in 2014 and described in a 2016 paper co-authored with Carlos Guestrin), the library has had a major impact on applied data science and has powered many winning Kaggle solutions.
XGBoost is an optimized, distributed gradient boosting system designed for performance, flexibility, and portability. It is widely used for classification, regression, and ranking tasks, and is known for its robustness and strong generalization to unseen data.
What is Gradient Boosting?
Gradient boosting is a powerful ensemble machine learning method based on the sequential training of weak models (usually decision trees). Each new model in the ensemble focuses on correcting the errors made by the previous models.
How Gradient Boosting Works
The algorithm works as follows:
- Initialization: A base model is created (usually a constant prediction).
- Iterative Training: At each iteration, a new model is added, which predicts the residuals (errors) of the previous models.
- Optimization: Gradient descent is used to minimize the loss function.
- Aggregation: The final prediction is obtained by summing the predictions of all models with corresponding weights.
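The four steps above can be sketched in a few lines of plain NumPy. This is an illustrative toy (one feature, depth-1 "stumps", squared error), not XGBoost itself:

```python
import numpy as np

def fit_stump(x, residuals):
    """Find the single threshold split that best fits the residuals."""
    best = None
    for t in np.unique(x):
        left, right = residuals[x <= t], residuals[x > t]
        if len(left) == 0 or len(right) == 0:
            continue
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, t, left.mean(), right.mean())
    _, t, lval, rval = best
    return lambda z: np.where(z <= t, lval, rval)

def boost(x, y, n_rounds=50, learning_rate=0.1):
    pred = np.full_like(y, y.mean(), dtype=float)  # 1. constant initialization
    for _ in range(n_rounds):                      # 2. iterative training
        residuals = y - pred                       #    errors of the current ensemble
        stump = fit_stump(x, residuals)            # 3. weak model fitted to the errors
        pred += learning_rate * stump(x)           # 4. weighted aggregation
    return pred

x = np.linspace(0, 10, 100)
y = np.sin(x)
pred = boost(x, y)
print(np.mean((y - pred) ** 2))  # training MSE shrinks as rounds are added
```

Fitting to residuals is the squared-error special case; general gradient boosting fits each weak model to the negative gradient of the loss.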
XGBoost Architecture and Principles
XGBoost implements an advanced version of gradient boosting with a number of innovative optimizations:
Core Architectural Decisions
- Second-Order Optimization: XGBoost uses information about the second derivatives (Hessian) of the loss function, which provides a more accurate approximation and faster convergence.
- Regularization: Built-in support for L1 and L2 regularization prevents overfitting at the algorithm level.
- Sparse Data: A sparsity-aware algorithm efficiently processes sparse data, automatically learning the optimal direction for missing values.
- Parallel Computation: The tree building algorithm is parallelized, which significantly speeds up training.
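The second-order idea can be made concrete. The split gain from the XGBoost paper is computed from the sums of first derivatives (G) and second derivatives (H) of the loss in each candidate child, with lambda as L2 regularization and gamma as the split penalty. A small NumPy sketch:

```python
import numpy as np

def split_gain(g_left, h_left, g_right, h_right, lam=1.0, gamma=0.0):
    """Gain = 1/2 [G_L^2/(H_L+lam) + G_R^2/(H_R+lam) - (G_L+G_R)^2/(H_L+H_R+lam)] - gamma"""
    def score(g, h):
        return g.sum() ** 2 / (h.sum() + lam)
    return 0.5 * (score(g_left, h_left) + score(g_right, h_right)
                  - score(np.concatenate([g_left, g_right]),
                          np.concatenate([h_left, h_right]))) - gamma

# For squared error, g = pred - y and h = 1 for every sample.
y = np.array([1.0, 1.2, 5.0, 5.5])
pred = np.zeros(4)
g, h = pred - y, np.ones(4)
print(split_gain(g[:2], h[:2], g[2:], h[2:]))  # gain from separating the two clusters
```

A larger gamma makes the gain negative more often, so fewer splits are kept, which is exactly the "more conservative" behavior described above.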
Tree Building Process
XGBoost uses a unique tree building algorithm:
- Histogram Approach: Instead of sorting all feature values, histograms are used, which speeds up the search for optimal splits.
- Greedy Search: A greedy strategy is used to select the best node splits.
- Approximate Algorithm: For large datasets, an approximate algorithm with quantiles is used.
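The histogram/quantile idea can be illustrated with NumPy alone; this sketch shows only the binning step, not the full split search:

```python
import numpy as np

# Bucket a continuous feature into a fixed number of quantile bins, then
# search splits over bin edges instead of over every raw value.
rng = np.random.default_rng(0)
feature = rng.normal(size=10_000)

n_bins = 16
edges = np.quantile(feature, np.linspace(0, 1, n_bins + 1)[1:-1])
bin_ids = np.digitize(feature, edges)  # each sample mapped to one of 16 buckets

# Candidate split points drop from ~10,000 raw values to 15 bin boundaries.
print(len(np.unique(bin_ids)), "bins;", len(edges), "candidate splits")
```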
XGBoost Installation and Setup
Installing the Library
Standard installation via pip:
pip install xgboost
On Linux, the standard pip wheel already includes GPU (CUDA) support; a smaller CPU-only build is published separately:
pip install xgboost-cpu
Installation from source (the project builds with CMake; see the official build guide for platform-specific options):
git clone --recursive https://github.com/dmlc/xgboost
cd xgboost
mkdir build && cd build
cmake .. && make -j$(nproc)
cd ../python-package
pip install .
Import and Basic Setup
import xgboost as xgb
from xgboost import XGBClassifier, XGBRegressor
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, mean_squared_error
Basic XGBoost Interfaces
XGBoost provides two main Python interfaces: the native Learning API (built around DMatrix) and a scikit-learn-compatible API.
Learning API (Native Interface)
# Creating DMatrix
dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(X_test, label=y_test)
# Model parameters (the native API takes the number of trees as
# num_boost_round in xgb.train, not as a parameter here)
params = {
    'objective': 'binary:logistic',
    'max_depth': 3,
    'learning_rate': 0.1
}
# Training
model = xgb.train(params, dtrain, num_boost_round=100)
# Prediction
predictions = model.predict(dtest)
Scikit-learn API
# Creating model
model = XGBClassifier(
    n_estimators=100,
    max_depth=3,
    learning_rate=0.1,
    objective='binary:logistic'
)
# Training
model.fit(X_train, y_train)
# Prediction
predictions = model.predict(X_test)
Sklearn API with Additional Features
# Training with a validation set (since xgboost 2.0, early_stopping_rounds
# is set on the estimator rather than passed to fit())
model.set_params(early_stopping_rounds=10)
model.fit(
    X_train, y_train,
    eval_set=[(X_test, y_test)],
    verbose=True
)
Table of Basic XGBoost Methods and Functions
| Method/Function | Purpose | Example Usage |
|---|---|---|
| xgb.train() | Training a model with the native API | xgb.train(params, dtrain, num_boost_round=100) |
| xgb.cv() | Cross-validation | xgb.cv(params, dtrain, nfold=5, num_boost_round=100) |
| XGBClassifier() | Classifier for the sklearn API | XGBClassifier(n_estimators=100, max_depth=3) |
| XGBRegressor() | Regressor for the sklearn API | XGBRegressor(n_estimators=100, max_depth=3) |
| fit() | Model training | model.fit(X_train, y_train) |
| predict() | Prediction | model.predict(X_test) |
| predict_proba() | Predicting class probabilities | model.predict_proba(X_test) |
| save_model() | Saving a model | model.save_model('model.json') |
| load_model() | Loading a model | model.load_model('model.json') |
| get_params() | Getting parameters | model.get_params() |
| set_params() | Setting parameters | model.set_params(max_depth=5) |
| feature_importances_ | Feature importance | model.feature_importances_ |
| plot_importance() | Visualizing feature importance | xgb.plot_importance(model) |
| plot_tree() | Visualizing a tree | xgb.plot_tree(model, num_trees=0) |
| DMatrix() | Creating an optimized data structure | xgb.DMatrix(X, label=y) |
Detailed Description of XGBoost Parameters
Key Training Parameters
- n_estimators (num_boost_round): The number of trees in the ensemble. More trees can improve quality but increase training time and the risk of overfitting.
- max_depth: The maximum depth of the trees. Controls model complexity. Typical values: 3-10.
- learning_rate (eta): The learning rate. Smaller values require more iterations but can lead to better quality. Usually 0.01-0.3.
- subsample: The fraction of the training sample used for each tree. Helps prevent overfitting. Recommended: 0.5-1.0.
- colsample_bytree: The fraction of features used for each tree. Similar to subsample, but for features.
Regularization Parameters
- gamma (min_split_loss): The minimum reduction in the loss function required to split a node. The larger it is, the more conservative the algorithm.
- lambda (reg_lambda): L2 regularization of leaf weights. Helps smooth the weights of the final leaves.
- alpha (reg_alpha): L1 regularization of leaf weights. Can lead to sparser models.
Parameters for Working with Data
- missing: The value that represents missing data. Default: np.nan.
- scale_pos_weight: Balances classes in binary classification (typically set to the ratio of negative to positive samples).
- max_delta_step: A constraint on the update of leaf weights. Useful for imbalanced data.
Practical Examples of Using XGBoost
Example 1: Classification with Parameter Tuning
from xgboost import XGBClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score, classification_report
# Loading data
data = load_breast_cancer()
X, y = data.data, data.target
# Splitting data
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
# Creating a model (since xgboost 2.0, early_stopping_rounds is
# set in the constructor rather than in fit())
model = XGBClassifier(
    n_estimators=100,
    max_depth=4,
    learning_rate=0.1,
    subsample=0.8,
    colsample_bytree=0.8,
    early_stopping_rounds=10,
    random_state=42
)
# Training with early stopping
model.fit(
    X_train, y_train,
    eval_set=[(X_test, y_test)],
    verbose=False
)
# Prediction and evaluation
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.4f}")
print(classification_report(y_test, y_pred))
Example 2: Regression with Cross-Validation
from xgboost import XGBRegressor
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np
# Loading data (load_boston was removed from scikit-learn;
# the California housing dataset is used instead)
data = fetch_california_housing()
X, y = data.data, data.target
# Splitting data
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
# Creating model (early stopping is configured in the constructor)
model = XGBRegressor(
    n_estimators=1000,
    max_depth=5,
    learning_rate=0.05,
    subsample=0.8,
    colsample_bytree=0.8,
    early_stopping_rounds=50,
    random_state=42
)
# Training
model.fit(
    X_train, y_train,
    eval_set=[(X_test, y_test)],
    verbose=False
)
# Prediction
y_pred = model.predict(X_test)
# Quality assessment
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"MSE: {mse:.4f}")
print(f"R²: {r2:.4f}")
# Cross-validation (a separate model without early stopping,
# since cross_val_score does not supply an eval_set)
cv_model = XGBRegressor(n_estimators=200, max_depth=5, learning_rate=0.05,
                        subsample=0.8, colsample_bytree=0.8, random_state=42)
cv_scores = cross_val_score(cv_model, X_train, y_train, cv=5, scoring='r2')
print(f"CV R² Score: {cv_scores.mean():.4f} (+/- {cv_scores.std() * 2:.4f})")
Example 3: Using the Native API
import xgboost as xgb
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
# Loading data
data = load_iris()
X, y = data.data, data.target
# Splitting data
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
# Creating DMatrix
dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(X_test, label=y_test)
# Model parameters
params = {
    'objective': 'multi:softprob',
    'num_class': 3,
    'max_depth': 4,
    'learning_rate': 0.1,
    'subsample': 0.8,
    'colsample_bytree': 0.8,
    'seed': 42
}
# Training with validation
watchlist = [(dtrain, 'train'), (dtest, 'test')]
model = xgb.train(
    params, dtrain,
    num_boost_round=100,
    evals=watchlist,
    early_stopping_rounds=10,
    verbose_eval=False
)
# Prediction
predictions = model.predict(dtest)
predicted_labels = np.argmax(predictions, axis=1)
# Accuracy assessment
accuracy = accuracy_score(y_test, predicted_labels)
print(f"Accuracy: {accuracy:.4f}")
Hyperparameter Optimization
Parameter Selection with GridSearchCV
from sklearn.model_selection import GridSearchCV
# Defining parameter grid
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [3, 4, 5, 6],
    'learning_rate': [0.01, 0.1, 0.2],
    'subsample': [0.8, 0.9, 1.0],
    'colsample_bytree': [0.8, 0.9, 1.0]
}
# Creating model
xgb_model = XGBClassifier(random_state=42)
# Finding optimal parameters
grid_search = GridSearchCV(
    xgb_model, param_grid,
    cv=5, scoring='accuracy',
    n_jobs=-1, verbose=1
)
grid_search.fit(X_train, y_train)
# Best parameters
print("Best parameters:", grid_search.best_params_)
print("Best result:", grid_search.best_score_)
Bayesian Optimization with Optuna
import optuna
from sklearn.model_selection import cross_val_score
def objective(trial):
    params = {
        'n_estimators': trial.suggest_int('n_estimators', 50, 500),
        'max_depth': trial.suggest_int('max_depth', 3, 10),
        'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.3),
        'subsample': trial.suggest_float('subsample', 0.6, 1.0),
        'colsample_bytree': trial.suggest_float('colsample_bytree', 0.6, 1.0),
        'gamma': trial.suggest_float('gamma', 0, 5),
        'reg_alpha': trial.suggest_float('reg_alpha', 0, 5),
        'reg_lambda': trial.suggest_float('reg_lambda', 0, 5)
    }
    model = XGBClassifier(**params, random_state=42)
    score = cross_val_score(model, X_train, y_train, cv=3, scoring='accuracy').mean()
    return score
# Creating and running optimization
study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=100)
print("Best parameters:", study.best_params)
print("Best result:", study.best_value)
Model Visualization and Interpretation
Feature Importance
import matplotlib.pyplot as plt
from xgboost import plot_importance
# Model training
model = XGBClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
# Visualizing feature importance
fig, ax = plt.subplots(figsize=(10, 8))
plot_importance(model, ax=ax, importance_type='weight')
plt.title('Feature Importance (by usage count)')
plt.tight_layout()
plt.show()
# Getting numerical importance values
feature_importance = model.feature_importances_
print("Feature importance:")
for i, importance in enumerate(feature_importance):
    print(f"Feature {i}: {importance:.4f}")
Tree Visualization
from xgboost import plot_tree
# Visualizing the first tree
fig, ax = plt.subplots(figsize=(20, 10))
plot_tree(model, num_trees=0, ax=ax)
plt.title('First tree structure')
plt.show()
SHAP for Interpretation
import shap
# Creating an object for explanation
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)
# Visualization
shap.summary_plot(shap_values, X_test, feature_names=data.feature_names)
Working with Different Data Types
Time Series
import pandas as pd
from datetime import datetime, timedelta
# Creating a time series
dates = pd.date_range(start='2020-01-01', end='2023-12-31', freq='D')
np.random.seed(42)
values = np.cumsum(np.random.randn(len(dates))) + 100
df = pd.DataFrame({'date': dates, 'value': values})
# Creating features
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['day'] = df['date'].dt.day
df['weekday'] = df['date'].dt.weekday
df['lag_1'] = df['value'].shift(1)
df['lag_7'] = df['value'].shift(7)
df['rolling_mean_7'] = df['value'].rolling(window=7).mean()
# Preparing data for training
df = df.dropna()
feature_columns = ['year', 'month', 'day', 'weekday', 'lag_1', 'lag_7', 'rolling_mean_7']
X = df[feature_columns]
y = df['value']
# Splitting into training and test sets
split_date = '2023-06-01'
train_mask = df['date'] < split_date
X_train, X_test = X[train_mask], X[~train_mask]
y_train, y_test = y[train_mask], y[~train_mask]
# Training model
model = XGBRegressor(n_estimators=100, max_depth=5, random_state=42)
model.fit(X_train, y_train)
# Prediction
y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
print(f"MSE for time series: {mse:.4f}")
Processing Categorical Features
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer
# Example data with categorical features
data = {
    'category_1': ['A', 'B', 'C', 'A', 'B'],
    'category_2': ['X', 'Y', 'Z', 'X', 'Y'],
    'numeric_1': [1, 2, 3, 4, 5],
    'numeric_2': [10, 20, 30, 40, 50],
    'target': [0, 1, 0, 1, 0]
}
df = pd.DataFrame(data)
# Method 1: Label Encoding (use a separate encoder per column so each
# can be reused later for inverse transforms)
le1, le2 = LabelEncoder(), LabelEncoder()
df['category_1_encoded'] = le1.fit_transform(df['category_1'])
df['category_2_encoded'] = le2.fit_transform(df['category_2'])
# Method 2: One-Hot Encoding
preprocessor = ColumnTransformer(
    transformers=[
        ('num', 'passthrough', ['numeric_1', 'numeric_2']),
        ('cat', OneHotEncoder(drop='first'), ['category_1', 'category_2'])
    ]
)
X_processed = preprocessor.fit_transform(df[['numeric_1', 'numeric_2', 'category_1', 'category_2']])
Advanced Techniques and Optimizations
Using Custom Loss Functions
def custom_objective(y_pred, dtrain):
    """Custom objective: gradient and hessian of squared error"""
    grad = y_pred - dtrain.get_label()
    hess = np.ones_like(grad)
    return grad, hess

def custom_eval(y_pred, dtrain):
    """Custom evaluation metric"""
    labels = dtrain.get_label()
    error = mean_squared_error(labels, y_pred)
    return 'custom_mse', error

# Training with custom functions
dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(X_test, label=y_test)
params = {
    'max_depth': 4,
    'learning_rate': 0.1,
    'disable_default_eval_metric': 1
}
model = xgb.train(
    params, dtrain,
    num_boost_round=100,
    obj=custom_objective,
    custom_metric=custom_eval,  # named feval in xgboost < 1.6
    evals=[(dtest, 'test')],
    verbose_eval=False
)
Working with Imbalanced Data
from sklearn.utils.class_weight import compute_sample_weight
# Automatic weight calculation
sample_weights = compute_sample_weight('balanced', y_train)
# Training with weights
model = XGBClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train, sample_weight=sample_weights)
# Alternative approach via scale_pos_weight
pos_weight = (y_train == 0).sum() / (y_train == 1).sum()
model_balanced = XGBClassifier(
n_estimators=100,
scale_pos_weight=pos_weight,
random_state=42
)
model_balanced.fit(X_train, y_train)
Using GPU for Acceleration
# Checking GPU availability (xgboost >= 2.0 uses device='cuda';
# older versions used tree_method='gpu_hist')
try:
    model_gpu = XGBClassifier(
        n_estimators=100,
        max_depth=5,
        tree_method='hist',
        device='cuda',
        random_state=42
    )
    model_gpu.fit(X_train, y_train)
    print("GPU acceleration available")
except Exception:
    print("GPU unavailable, using CPU")
Model Monitoring and Debugging
Tracking Metrics During Training
from sklearn.metrics import log_loss
# Custom function for tracking metrics
def monitor_performance(model, X_train, y_train, X_val, y_val):
    results = {'train_loss': [], 'val_loss': [], 'val_accuracy': []}
    for i in range(1, model.n_estimators + 1):
        # Predictions using only the first i trees
        train_pred = model.predict_proba(X_train, iteration_range=(0, i))[:, 1]
        val_pred = model.predict_proba(X_val, iteration_range=(0, i))[:, 1]
        # Metric calculation
        train_loss = log_loss(y_train, train_pred)
        val_loss = log_loss(y_val, val_pred)
        val_accuracy = accuracy_score(y_val, (val_pred > 0.5).astype(int))
        results['train_loss'].append(train_loss)
        results['val_loss'].append(val_loss)
        results['val_accuracy'].append(val_accuracy)
    return results
# Usage
model = XGBClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
metrics = monitor_performance(model, X_train, y_train, X_test, y_test)
# Visualization
plt.figure(figsize=(12, 4))
plt.subplot(1, 2, 1)
plt.plot(metrics['train_loss'], label='Train Loss')
plt.plot(metrics['val_loss'], label='Validation Loss')
plt.xlabel('Iteration')
plt.ylabel('Log Loss')
plt.legend()
plt.title('Training Progress')
plt.subplot(1, 2, 2)
plt.plot(metrics['val_accuracy'])
plt.xlabel('Iteration')
plt.ylabel('Validation Accuracy')
plt.title('Validation Accuracy')
plt.tight_layout()
plt.show()
Integration with Other Libraries
Using with Pandas
import pandas as pd
# Creating a model that works with DataFrame
class XGBPandasWrapper:
    def __init__(self, **kwargs):
        self.model = XGBClassifier(**kwargs)
        self.feature_names = None

    def fit(self, X, y):
        if isinstance(X, pd.DataFrame):
            self.feature_names = X.columns.tolist()
        self.model.fit(X, y)
        return self

    def predict(self, X):
        return self.model.predict(X)

    def predict_proba(self, X):
        return self.model.predict_proba(X)

    def get_feature_importance(self):
        if self.feature_names:
            return pd.Series(
                self.model.feature_importances_,
                index=self.feature_names
            ).sort_values(ascending=False)
        return self.model.feature_importances_
# Usage
model_wrapper = XGBPandasWrapper(n_estimators=100, random_state=42)
df_train = pd.DataFrame(X_train, columns=[f'feature_{i}' for i in range(X_train.shape[1])])
model_wrapper.fit(df_train, y_train)
importance = model_wrapper.get_feature_importance()
print(importance.head())
Creating a Pipeline with Scikit-learn
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_classif
# Creating pipeline
pipeline = Pipeline([
('scaler', StandardScaler()),
('feature_selection', SelectKBest(f_classif, k=10)),
('classifier', XGBClassifier(n_estimators=100, random_state=42))
])
# Training pipeline
pipeline.fit(X_train, y_train)
# Prediction
y_pred = pipeline.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Pipeline accuracy: {accuracy:.4f}")
# Getting feature importance after selection
selected_features = pipeline.named_steps['feature_selection'].get_support()
xgb_importance = pipeline.named_steps['classifier'].feature_importances_
print(f"Number of selected features: {selected_features.sum()}")
Deploying a Model in Production
Saving and Loading a Model
import pickle
import joblib
# Different ways to save a model
# 1. Built-in XGBoost methods
model.save_model('xgb_model.json')
model.save_model('xgb_model.ubj') # More compact format
# 2. Using pickle
with open('xgb_model.pkl', 'wb') as f:
    pickle.dump(model, f)
# 3. Using joblib (recommended)
joblib.dump(model, 'xgb_model.joblib')
# Loading model
loaded_model = XGBClassifier()
loaded_model.load_model('xgb_model.json')
# Or
loaded_model = joblib.load('xgb_model.joblib')
# Checking load correctness
test_pred_original = model.predict(X_test)
test_pred_loaded = loaded_model.predict(X_test)
print(f"Models are identical: {np.array_equal(test_pred_original, test_pred_loaded)}")
Creating a REST API with Flask
from flask import Flask, request, jsonify
import joblib
import numpy as np

app = Flask(__name__)
# Loading model when the application starts
model = joblib.load('xgb_model.joblib')

@app.route('/predict', methods=['POST'])
def predict():
    try:
        # Getting data from the request
        data = request.json
        features = np.array(data['features']).reshape(1, -1)
        # Prediction
        prediction = model.predict(features)[0]
        probability = model.predict_proba(features)[0].tolist()
        return jsonify({
            'prediction': int(prediction),
            'probability': probability,
            'status': 'success'
        })
    except Exception as e:
        return jsonify({
            'error': str(e),
            'status': 'error'
        }), 400

@app.route('/health', methods=['GET'])
def health():
    return jsonify({'status': 'healthy'})

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)
Best Practices and Recommendations
Preventing Overfitting
- Use cross-validation to assess model quality.
- Apply early stopping if there is a validation set.
- Adjust regularization parameters (gamma, lambda, alpha).
- Limit model complexity via max_depth and min_child_weight.
- Use subsampling to reduce variance.
Performance Optimization
# Settings for maximum performance
high_performance_params = {
    'n_estimators': 100,
    'max_depth': 6,
    'learning_rate': 0.1,
    'subsample': 0.8,
    'colsample_bytree': 0.8,
    'tree_method': 'hist',  # Fast histogram algorithm
    'n_jobs': -1,           # Use all CPU cores
    'random_state': 42
}
# For large data
large_data_params = {
    'n_estimators': 100,
    'max_depth': 4,
    'learning_rate': 0.1,
    'subsample': 0.5,
    'colsample_bytree': 0.5,
    'tree_method': 'hist',  # 'approx' also exists, but 'hist' is usually faster
    'max_bin': 128,         # Fewer histogram bins trade accuracy for speed
    'n_jobs': -1,
    'random_state': 42
}
Working with Memory
import gc
def train_with_memory_optimization(X_train, y_train, X_val, y_val):
    # Creating DMatrix to save memory
    dtrain = xgb.DMatrix(X_train, label=y_train)
    dval = xgb.DMatrix(X_val, label=y_val)
    # Removing source data from memory
    del X_train, y_train, X_val, y_val
    gc.collect()
    params = {
        'max_depth': 4,
        'learning_rate': 0.1,
        'objective': 'binary:logistic',
        'eval_metric': 'logloss'
    }
    model = xgb.train(
        params, dtrain,
        num_boost_round=100,
        evals=[(dval, 'val')],
        early_stopping_rounds=10,
        verbose_eval=False
    )
    return model
Debugging and Troubleshooting
Diagnosing Training Issues
def diagnose_training_issues(model, X_train, y_train, X_val, y_val):
    """Diagnosing XGBoost training problems"""
    # Checking for overfitting
    train_pred = model.predict(X_train)
    val_pred = model.predict(X_val)
    train_acc = accuracy_score(y_train, train_pred)
    val_acc = accuracy_score(y_val, val_pred)
    print(f"Accuracy on the training set: {train_acc:.4f}")
    print(f"Accuracy on the validation set: {val_acc:.4f}")
    print(f"Difference: {train_acc - val_acc:.4f}")
    if train_acc - val_acc > 0.1:
        print("⚠️ Possible overfitting!")
        print("Recommendations:")
        print("- Increase regularization (gamma, lambda, alpha)")
        print("- Reduce max_depth")
        print("- Use early_stopping_rounds")
        print("- Reduce learning_rate and increase n_estimators")
    # Checking feature importance
    importance = model.feature_importances_
    zero_importance = (importance == 0).sum()
    print(f"Features with zero importance: {zero_importance}")
    if zero_importance > len(importance) * 0.5:
        print("⚠️ Many features with zero importance!")
        print("Recommendations:")
        print("- Perform feature selection")
        print("- Check data quality")
    # Checking class distribution
    if hasattr(model, 'classes_'):
        class_distribution = np.bincount(y_train)
        imbalance_ratio = max(class_distribution) / min(class_distribution)
        print(f"Class ratio: {imbalance_ratio:.2f}")
        if imbalance_ratio > 5:
            print("⚠️ Strong class imbalance!")
            print("Recommendations:")
            print("- Use scale_pos_weight")
            print("- Apply sample_weight during training")
            print("- Consider using SMOTE")

# Usage
diagnose_training_issues(model, X_train, y_train, X_test, y_test)
Handling Errors and Exceptions
def safe_xgboost_training(X_train, y_train, X_val, y_val, params):
    """Safe XGBoost training with error handling"""
    try:
        # Checking input data
        if X_train.shape[0] != len(y_train):
            raise ValueError("X_train and y_train sizes do not match")
        if X_val.shape[0] != len(y_val):
            raise ValueError("X_val and y_val sizes do not match")
        if X_train.shape[1] != X_val.shape[1]:
            raise ValueError("Number of features in train and val does not match")
        # Checking for missing values (XGBoost handles them, but they are worth flagging)
        if np.isnan(X_train).any():
            print("⚠️ Missing values found in X_train")
        if np.isnan(X_val).any():
            print("⚠️ Missing values found in X_val")
        # Creating and training the model
        model = XGBClassifier(**params)
        model.fit(X_train, y_train, eval_set=[(X_val, y_val)], verbose=False)
        return model
    except Exception as e:
        print(f"Training error: {e}")
        return None