Scikit-Learn: Machine Learning


What is Scikit-learn?

Scikit-learn is one of the most popular and powerful open-source Python libraries for machine learning. Started in 2007 by David Cournapeau as a Google Summer of Code project, the library has become the de facto standard for classical machine learning in the Python ecosystem.

Key Tasks and Capabilities of Scikit-learn

The library provides ready-made tools for solving a wide range of machine learning tasks:

Supervised Learning Tasks:

  • Classification — assigning objects to predefined classes
  • Regression — predicting continuous numerical values

Unsupervised Learning Tasks:

  • Clustering — grouping similar objects
  • Dimensionality Reduction — compressing data while preserving important information
  • Anomaly Detection — identifying outliers in the data
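
Most of these tasks get worked examples later in this document, but anomaly detection does not, so here is a minimal sketch using IsolationForest. The synthetic dataset and the contamination value are illustrative assumptions:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)
# 200 normal points around the origin plus 10 obvious outliers far away
X_normal = rng.normal(loc=0, scale=1, size=(200, 2))
X_outliers = rng.uniform(low=8, high=10, size=(10, 2))
X = np.vstack([X_normal, X_outliers])

# contamination is the expected share of outliers in the data
iso = IsolationForest(contamination=0.05, random_state=42)
labels = iso.fit_predict(X)  # 1 = inlier, -1 = outlier

print(f"Outliers found: {(labels == -1).sum()}")
```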

Additional Features:

  • Data Preprocessing — normalization, scaling, encoding
  • Feature Selection — selecting the most informative characteristics
  • Model Evaluation — various quality metrics
  • Cross-Validation — checking model stability
  • Hyperparameter Tuning — automatic parameter optimization

Installation and Initial Setup

Installation via pip

pip install scikit-learn

Installation with Additional Dependencies

pip install scikit-learn pandas matplotlib seaborn jupyter

Checking Installation

import sklearn
print(sklearn.__version__)

Importing Basic Components

# Core modules
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.datasets import load_iris, make_classification

Architecture and Basic Concepts

Estimator

An estimator is the common interface for all machine learning algorithms in Scikit-learn. Every predictive estimator implements two main methods:

  • fit(X, y) — training the model on data
  • predict(X) — prediction for new data

from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(X_train, y_train)
predictions = model.predict(X_test)

Transformer

A transformer is an estimator that transforms data. It implements the following methods:

  • fit(X) — learning transformation parameters
  • transform(X) — applying the transformation
  • fit_transform(X) — combining fit and transform

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

Pipeline

Pipeline allows you to combine several data processing and training steps into a single sequence:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', SVC())
])

pipeline.fit(X_train, y_train)
predictions = pipeline.predict(X_test)

Working with Data

Built-in Datasets

Scikit-learn provides many ready-made datasets for training and testing:

from sklearn.datasets import load_iris, load_wine, load_breast_cancer

# Loading the Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Dataset information
print(iris.DESCR)
print(f"Data dimensions: {X.shape}")
print(f"Classes: {iris.target_names}")

Generating Synthetic Data

from sklearn.datasets import make_classification, make_regression

# Generating data for classification
X, y = make_classification(
    n_samples=1000,
    n_features=20,
    n_informative=10,
    n_redundant=5,
    n_clusters_per_class=1,
    random_state=42
)

# Generating data for regression
X_reg, y_reg = make_regression(
    n_samples=1000,
    n_features=10,
    noise=0.1,
    random_state=42
)

Splitting Data

from sklearn.model_selection import train_test_split

# Splitting into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, 
    test_size=0.2,
    random_state=42,
    stratify=y  # Maintaining class proportions
)

# Splitting into three parts
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.4, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)

Data Preprocessing

Feature Scaling

from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

# Standardization (mean=0, std=1)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_train)

# Normalization to the range [0, 1]
min_max_scaler = MinMaxScaler()
X_minmax = min_max_scaler.fit_transform(X_train)

# Robust scaling (robust to outliers)
robust_scaler = RobustScaler()
X_robust = robust_scaler.fit_transform(X_train)

Encoding Categorical Data

from sklearn.preprocessing import LabelEncoder, OneHotEncoder
import pandas as pd

# Encoding labels
label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(y_categorical)

# One-hot encoding
encoder = OneHotEncoder(sparse_output=False)
X_encoded = encoder.fit_transform(X_categorical)

# For pandas DataFrame
df_encoded = pd.get_dummies(df, columns=['categorical_column'])

Machine Learning Algorithms

Linear Models

from sklearn.linear_model import LinearRegression, LogisticRegression, Ridge, Lasso

# Linear regression
linear_reg = LinearRegression()
linear_reg.fit(X_train, y_train)
print(f"Coefficients: {linear_reg.coef_}")
print(f"Intercept: {linear_reg.intercept_}")

# Logistic regression
logistic_reg = LogisticRegression(random_state=42)
logistic_reg.fit(X_train, y_train)

# Regularized models
ridge = Ridge(alpha=1.0)
lasso = Lasso(alpha=1.0)

Decision Trees

from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.tree import plot_tree
import matplotlib.pyplot as plt

# Decision tree for classification
tree_clf = DecisionTreeClassifier(
    max_depth=3,
    min_samples_split=5,
    random_state=42
)
tree_clf.fit(X_train, y_train)

# Tree visualization
plt.figure(figsize=(12, 8))
plot_tree(tree_clf, feature_names=iris.feature_names, class_names=iris.target_names, filled=True)
plt.show()

Ensemble Methods

from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, AdaBoostClassifier

# Random forest
rf_clf = RandomForestClassifier(
    n_estimators=100,
    max_depth=3,
    random_state=42
)
rf_clf.fit(X_train, y_train)

# Gradient boosting
gb_clf = GradientBoostingClassifier(
    n_estimators=100,
    learning_rate=0.1,
    random_state=42
)
gb_clf.fit(X_train, y_train)

# Feature importance
feature_importance = rf_clf.feature_importances_

Support Vector Machines

from sklearn.svm import SVC, SVR

# SVM for classification
svm_clf = SVC(
    kernel='rbf',
    C=1.0,
    gamma='scale',
    random_state=42
)
svm_clf.fit(X_train, y_train)

# SVM for regression
svm_reg = SVR(kernel='rbf', C=1.0, gamma='scale')

Clustering Algorithms

from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN

# K-means
kmeans = KMeans(n_clusters=3, random_state=42)
cluster_labels = kmeans.fit_predict(X)

# Hierarchical clustering
agg_clustering = AgglomerativeClustering(n_clusters=3)
agg_labels = agg_clustering.fit_predict(X)

# DBSCAN
dbscan = DBSCAN(eps=0.5, min_samples=5)
dbscan_labels = dbscan.fit_predict(X)

Model Quality Assessment

Metrics for Classification

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.metrics import confusion_matrix, classification_report

# Predictions
y_pred = model.predict(X_test)

# Basic metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, average='weighted')
recall = recall_score(y_test, y_pred, average='weighted')
f1 = f1_score(y_test, y_pred, average='weighted')

print(f"Accuracy: {accuracy:.3f}")
print(f"Precision: {precision:.3f}")
print(f"Recall: {recall:.3f}")
print(f"F1-score: {f1:.3f}")

# Detailed report
print(classification_report(y_test, y_pred))

# Confusion matrix
cm = confusion_matrix(y_test, y_pred)
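
ROC AUC, mentioned in the metrics table later in this document, needs probability scores rather than hard class labels. A self-contained sketch for a binary problem (the synthetic dataset and the choice of LogisticRegression are assumptions for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Synthetic binary problem
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
# roc_auc_score expects scores for the positive class, not predicted labels
y_proba = clf.predict_proba(X_test)[:, 1]

auc = roc_auc_score(y_test, y_proba)
print(f"ROC AUC: {auc:.3f}")
```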

Metrics for Regression

from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

y_pred_reg = model.predict(X_test)

mse = mean_squared_error(y_test, y_pred_reg)
mae = mean_absolute_error(y_test, y_pred_reg)
r2 = r2_score(y_test, y_pred_reg)

print(f"MSE: {mse:.3f}")
print(f"MAE: {mae:.3f}")
print(f"R²: {r2:.3f}")
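
RMSE is often reported alongside MSE. Newer scikit-learn releases expose it directly (as root_mean_squared_error), but a version-independent sketch simply takes the square root of MSE with NumPy; the numbers below are made up for illustration:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_hat = np.array([2.5, 0.0, 2.0, 8.0])

# RMSE is the square root of MSE, expressed in the units of the target
rmse = np.sqrt(mean_squared_error(y_true, y_hat))
print(f"RMSE: {rmse:.3f}")
```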

Cross-Validation and Hyperparameter Tuning

Cross-Validation

from sklearn.model_selection import cross_val_score, StratifiedKFold

# Simple cross-validation
scores = cross_val_score(model, X, y, cv=5)
print(f"Average accuracy: {scores.mean():.3f} (+/- {scores.std() * 2:.3f})")

# Stratified cross-validation
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=skf)

Hyperparameter Tuning

from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# Parameter grid
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [3, 5, 7, None],
    'min_samples_split': [2, 5, 10]
}

# Grid search
grid_search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=5,
    scoring='accuracy',
    n_jobs=-1
)

grid_search.fit(X_train, y_train)
print(f"Best parameters: {grid_search.best_params_}")
print(f"Best result: {grid_search.best_score_:.3f}")

# Random search samples n_iter parameter combinations instead of trying them all
random_search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    n_iter=20,
    cv=5,
    random_state=42
)
random_search.fit(X_train, y_train)
print(f"Best parameters: {random_search.best_params_}")

Saving and Loading Models

import joblib
import pickle

# Saving using joblib (recommended)
joblib.dump(model, 'model.pkl')
loaded_model = joblib.load('model.pkl')

# Saving using pickle
with open('model.pkl', 'wb') as f:
    pickle.dump(model, f)

with open('model.pkl', 'rb') as f:
    loaded_model = pickle.load(f)

Complete Table of Scikit-learn Methods and Functions

Main Library Modules

Module | Purpose | Main Classes/Functions
sklearn.preprocessing | Data preprocessing | StandardScaler, MinMaxScaler, OneHotEncoder, LabelEncoder
sklearn.model_selection | Data splitting and validation | train_test_split, cross_val_score, GridSearchCV
sklearn.linear_model | Linear models | LinearRegression, LogisticRegression, Ridge, Lasso
sklearn.tree | Decision trees | DecisionTreeClassifier, DecisionTreeRegressor
sklearn.ensemble | Ensemble methods | RandomForestClassifier, GradientBoostingClassifier
sklearn.svm | Support vector machines | SVC, SVR, LinearSVC
sklearn.neighbors | Nearest neighbor algorithms | KNeighborsClassifier, KNeighborsRegressor
sklearn.cluster | Clustering | KMeans, AgglomerativeClustering, DBSCAN
sklearn.metrics | Quality metrics | accuracy_score, precision_score, confusion_matrix
sklearn.datasets | Loading and generating data | load_iris, make_classification, make_regression

Machine Learning Algorithms by Category

Category | Algorithm | Class | Main Parameters
Classification | Logistic Regression | LogisticRegression | C, penalty, solver
Classification | Random Forest | RandomForestClassifier | n_estimators, max_depth, min_samples_split
Classification | SVM | SVC | C, kernel, gamma
Classification | K-Nearest Neighbors | KNeighborsClassifier | n_neighbors, weights, metric
Classification | Naive Bayes | GaussianNB | var_smoothing
Classification | Decision Tree | DecisionTreeClassifier | max_depth, min_samples_split, criterion
Regression | Linear Regression | LinearRegression | fit_intercept, positive
Regression | Ridge Regression | Ridge | alpha, solver
Regression | Lasso Regression | Lasso | alpha, max_iter
Regression | Random Forest | RandomForestRegressor | n_estimators, max_depth
Regression | SVM | SVR | C, kernel, epsilon
Clustering | K-means | KMeans | n_clusters, init, max_iter
Clustering | Hierarchical | AgglomerativeClustering | n_clusters, linkage
Clustering | DBSCAN | DBSCAN | eps, min_samples

Data Preprocessing Methods

Method | Class | Purpose | Parameters
Standardization | StandardScaler | Scaling to zero mean and unit variance | with_mean, with_std
Normalization | MinMaxScaler | Scaling to the range [0, 1] | feature_range
Robust Scaling | RobustScaler | Scaling robust to outliers | quantile_range
One-hot Encoding | OneHotEncoder | Encoding categorical features | sparse_output, drop
Label Encoding | LabelEncoder | Converting categories to numbers | —
Polynomial Features | PolynomialFeatures | Creating polynomial feature combinations | degree, include_bias

Quality Assessment Metrics

Task Type | Metric | Function | Description
Classification | Accuracy | accuracy_score | Fraction of correct predictions
Classification | Precision | precision_score | Fraction of positive predictions that are correct
Classification | Recall | recall_score | Fraction of actual positives that are found
Classification | F1-score | f1_score | Harmonic mean of precision and recall
Classification | ROC AUC | roc_auc_score | Area under the ROC curve
Regression | MSE | mean_squared_error | Mean squared error
Regression | MAE | mean_absolute_error | Mean absolute error
Regression | R² | r2_score | Coefficient of determination
Regression | RMSE | np.sqrt(mean_squared_error(...)) | Root mean squared error

Advantages and Limitations of Scikit-learn

Advantages

  • Ease of use — a unified API for all algorithms
  • High-quality documentation — detailed examples and explanations
  • Wide selection of algorithms — covers most ML tasks
  • Integration with the ecosystem — works great with NumPy, Pandas, Matplotlib
  • Stability — time-tested algorithms
  • Active community — regular updates and support

Limitations

  • No deep learning — TensorFlow/PyTorch are needed for neural networks
  • Not designed for big data — datasets must generally fit in memory
  • No built-in GPU support — computations run on the CPU
  • Limited work with text — basic NLP capabilities

Practical Recommendations

Algorithm Selection

  • For small data (< 10,000 samples): SVM, KNN
  • For medium data: Random Forest, Gradient Boosting
  • For large data: Linear models, SGD
  • For interpretability: Decision trees, linear models
  • For high accuracy: Ensemble methods
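
Of the small-data options above, KNN appears in the module table but has no example elsewhere in this document. A minimal sketch combining it with scaling, which matters because KNN is distance-based (the dataset and n_neighbors value are illustrative choices):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Scaling inside the pipeline keeps distances comparable across features
knn = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', KNeighborsClassifier(n_neighbors=5))
])
knn.fit(X_train, y_train)
print(f"Test accuracy: {knn.score(X_test, y_test):.3f}")
```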

Data Workflow

# Typical workflow
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

# 1. Loading and splitting data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 2. Creating a pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', RandomForestClassifier(random_state=42))
])

# 3. Tuning parameters
param_grid = {
    'classifier__n_estimators': [50, 100, 200],
    'classifier__max_depth': [3, 5, 7]
}

grid_search = GridSearchCV(pipeline, param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)

# 4. Quality assessment
y_pred = grid_search.predict(X_test)
print(classification_report(y_test, y_pred))

Conclusion

Scikit-learn remains one of the most important libraries for studying and applying machine learning in Python. Its simplicity, reliability, and completeness of functionality make it an ideal choice for most classical machine learning tasks. Despite some limitations, the library continues to actively develop and remains the industry standard for solving data analysis and machine learning problems.
