What is Scikit-learn?
Scikit-learn is one of the most popular and powerful open-source Python libraries for machine learning. Started in 2007 by David Cournapeau as a Google Summer of Code project, the library has become a de facto standard for classical machine learning in the Python ecosystem.
Key Tasks and Capabilities of Scikit-learn
The library provides ready-made tools for solving a wide range of machine learning tasks:
Supervised Learning Tasks:
- Classification — assigning objects to predefined classes
- Regression — predicting continuous numerical values
Unsupervised Learning Tasks:
- Clustering — grouping similar objects
- Dimensionality Reduction — compressing data while preserving important information
- Anomaly Detection — identifying outliers in the data
Additional Features:
- Data Preprocessing — normalization, scaling, encoding
- Feature Selection — selecting the most informative characteristics
- Model Evaluation — various quality metrics
- Cross-Validation — estimating how well a model generalizes across data splits
- Hyperparameter Tuning — automatic parameter optimization
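All of these capabilities share the same small API surface. As a quick taste, a minimal dimensionality-reduction sketch using PCA on the built-in Iris dataset:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

# Load the 4-feature Iris dataset and project it onto 2 principal components
X, _ = load_iris(return_X_y=True)
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)

print(X.shape)     # (150, 4)
print(X_2d.shape)  # (150, 2)
```

The same fit/transform pattern shown here recurs throughout the library, which is what makes its many capabilities easy to combine.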
Installation and Initial Setup
Installation via pip
pip install scikit-learn
Installation with Additional Dependencies
pip install scikit-learn pandas matplotlib seaborn jupyter
Checking Installation
import sklearn
print(sklearn.__version__)
Importing Basic Components
# Core modules
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.datasets import load_iris, make_classification
Architecture and Basic Concepts
Estimator
Estimator is the base class for all machine learning algorithms in Scikit-learn. Each estimator implements two main methods:
- fit(X, y) — training the model on data
- predict(X) — making predictions for new data
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X_train, y_train)
predictions = model.predict(X_test)
Transformer
A transformer is an estimator that transforms data. It implements the following methods:
- fit(X) — learning the transformation parameters
- transform(X) — applying the transformation
- fit_transform(X) — combining fit and transform in one call
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
Pipeline
Pipeline allows you to combine several data processing and training steps into a single sequence:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
pipeline = Pipeline([
('scaler', StandardScaler()),
('classifier', SVC())
])
pipeline.fit(X_train, y_train)
predictions = pipeline.predict(X_test)
Working with Data
Built-in Datasets
Scikit-learn provides many ready-made datasets for training and testing:
from sklearn.datasets import load_iris, load_wine, load_breast_cancer
# Loading the Iris dataset
iris = load_iris()
X, y = iris.data, iris.target
# Dataset information
print(iris.DESCR)
print(f"Data dimensions: {X.shape}")
print(f"Classes: {iris.target_names}")
Generating Synthetic Data
from sklearn.datasets import make_classification, make_regression
# Generating data for classification
X, y = make_classification(
n_samples=1000,
n_features=20,
n_informative=10,
n_redundant=5,
n_clusters_per_class=1,
random_state=42
)
# Generating data for regression
X_reg, y_reg = make_regression(
n_samples=1000,
n_features=10,
noise=0.1,
random_state=42
)
Splitting Data
from sklearn.model_selection import train_test_split
# Splitting into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
X, y,
test_size=0.2,
random_state=42,
stratify=y # Maintaining class proportions
)
# Splitting into three parts
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.4, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)
Data Preprocessing
Feature Scaling
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler
# Standardization (mean=0, std=1)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_train)
# Normalization to the range [0, 1]
min_max_scaler = MinMaxScaler()
X_minmax = min_max_scaler.fit_transform(X_train)
# Robust scaling (robust to outliers)
robust_scaler = RobustScaler()
X_robust = robust_scaler.fit_transform(X_train)
Encoding Categorical Data
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
import pandas as pd
# Encoding labels
label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(y_categorical)
# One-hot encoding
encoder = OneHotEncoder(sparse_output=False)
X_encoded = encoder.fit_transform(X_categorical)
# For pandas DataFrame
df_encoded = pd.get_dummies(df, columns=['categorical_column'])
Machine Learning Algorithms
Linear Models
from sklearn.linear_model import LinearRegression, LogisticRegression, Ridge, Lasso
# Linear regression
linear_reg = LinearRegression()
linear_reg.fit(X_train, y_train)
print(f"Coefficients: {linear_reg.coef_}")
print(f"Intercept: {linear_reg.intercept_}")
# Logistic regression
logistic_reg = LogisticRegression(random_state=42)
logistic_reg.fit(X_train, y_train)
# Regularized models
ridge = Ridge(alpha=1.0)
lasso = Lasso(alpha=1.0)
Decision Trees
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.tree import plot_tree
import matplotlib.pyplot as plt
# Decision tree for classification
tree_clf = DecisionTreeClassifier(
max_depth=3,
min_samples_split=5,
random_state=42
)
tree_clf.fit(X_train, y_train)
# Tree visualization
plt.figure(figsize=(12, 8))
plot_tree(tree_clf, feature_names=iris.feature_names, class_names=iris.target_names, filled=True)
plt.show()
Ensemble Methods
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, AdaBoostClassifier
# Random forest
rf_clf = RandomForestClassifier(
n_estimators=100,
max_depth=3,
random_state=42
)
rf_clf.fit(X_train, y_train)
# Gradient boosting
gb_clf = GradientBoostingClassifier(
n_estimators=100,
learning_rate=0.1,
random_state=42
)
gb_clf.fit(X_train, y_train)
# Feature importance
feature_importance = rf_clf.feature_importances_
Support Vector Machines
from sklearn.svm import SVC, SVR
# SVM for classification
svm_clf = SVC(
kernel='rbf',
C=1.0,
gamma='scale',
random_state=42
)
svm_clf.fit(X_train, y_train)
# SVM for regression
svm_reg = SVR(kernel='rbf', C=1.0, gamma='scale')
Clustering Algorithms
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
# K-means
kmeans = KMeans(n_clusters=3, random_state=42)
cluster_labels = kmeans.fit_predict(X)
# Hierarchical clustering
agg_clustering = AgglomerativeClustering(n_clusters=3)
agg_labels = agg_clustering.fit_predict(X)
# DBSCAN
dbscan = DBSCAN(eps=0.5, min_samples=5)
dbscan_labels = dbscan.fit_predict(X)
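Clustering has no ground-truth labels, so internal metrics such as the silhouette score are commonly used to compare partitions. A minimal sketch on synthetic blob data (the cluster count and parameters here are illustrative):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Three well-separated blobs; a silhouette near 1 indicates compact, separated clusters
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)
labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)
score = silhouette_score(X, labels)
print(f"Silhouette: {score:.3f}")
```

Running the same comparison for several values of n_clusters is a common way to pick the number of clusters.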
Model Quality Assessment
Metrics for Classification
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.metrics import confusion_matrix, classification_report
# Predictions
y_pred = model.predict(X_test)
# Basic metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, average='weighted')
recall = recall_score(y_test, y_pred, average='weighted')
f1 = f1_score(y_test, y_pred, average='weighted')
print(f"Accuracy: {accuracy:.3f}")
print(f"Precision: {precision:.3f}")
print(f"Recall: {recall:.3f}")
print(f"F1-score: {f1:.3f}")
# Detailed report
print(classification_report(y_test, y_pred))
# Confusion matrix
cm = confusion_matrix(y_test, y_pred)
Metrics for Regression
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
y_pred_reg = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred_reg)
mae = mean_absolute_error(y_test, y_pred_reg)
r2 = r2_score(y_test, y_pred_reg)
print(f"MSE: {mse:.3f}")
print(f"MAE: {mae:.3f}")
print(f"R²: {r2:.3f}")
Cross-Validation and Hyperparameter Tuning
Cross-Validation
from sklearn.model_selection import cross_val_score, StratifiedKFold
# Simple cross-validation
scores = cross_val_score(model, X, y, cv=5)
print(f"Average accuracy: {scores.mean():.3f} (+/- {scores.std() * 2:.3f})")
# Stratified cross-validation
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=skf)
Hyperparameter Tuning
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
# Parameter grid
param_grid = {
'n_estimators': [50, 100, 200],
'max_depth': [3, 5, 7, None],
'min_samples_split': [2, 5, 10]
}
# Grid search
grid_search = GridSearchCV(
RandomForestClassifier(random_state=42),
param_grid,
cv=5,
scoring='accuracy',
n_jobs=-1
)
grid_search.fit(X_train, y_train)
print(f"Best parameters: {grid_search.best_params_}")
print(f"Best result: {grid_search.best_score_:.3f}")
# Random search
random_search = RandomizedSearchCV(
RandomForestClassifier(random_state=42),
param_grid,
n_iter=20,
cv=5,
random_state=42
)
random_search.fit(X_train, y_train)
Saving and Loading Models
import joblib
import pickle
# Saving using joblib (recommended)
joblib.dump(model, 'model.pkl')
loaded_model = joblib.load('model.pkl')
# Saving using pickle
with open('model.pkl', 'wb') as f:
pickle.dump(model, f)
with open('model.pkl', 'rb') as f:
loaded_model = pickle.load(f)
Complete Table of Scikit-learn Methods and Functions
Main Library Modules
| Module | Purpose | Main Classes/Functions |
|---|---|---|
| sklearn.preprocessing | Data preprocessing | StandardScaler, MinMaxScaler, OneHotEncoder, LabelEncoder |
| sklearn.model_selection | Data splitting and validation | train_test_split, cross_val_score, GridSearchCV |
| sklearn.linear_model | Linear models | LinearRegression, LogisticRegression, Ridge, Lasso |
| sklearn.tree | Decision trees | DecisionTreeClassifier, DecisionTreeRegressor |
| sklearn.ensemble | Ensemble methods | RandomForestClassifier, GradientBoostingClassifier |
| sklearn.svm | Support vector machines | SVC, SVR, LinearSVC |
| sklearn.neighbors | Nearest neighbor algorithms | KNeighborsClassifier, KNeighborsRegressor |
| sklearn.cluster | Clustering | KMeans, AgglomerativeClustering, DBSCAN |
| sklearn.metrics | Quality metrics | accuracy_score, precision_score, confusion_matrix |
| sklearn.datasets | Loading and generating data | load_iris, make_classification, make_regression |
Machine Learning Algorithms by Category
| Category | Algorithm | Class | Main Parameters |
|---|---|---|---|
| Classification | Logistic Regression | LogisticRegression | C, penalty, solver |
| | Random Forest | RandomForestClassifier | n_estimators, max_depth, min_samples_split |
| | SVM | SVC | C, kernel, gamma |
| | K-Nearest Neighbors | KNeighborsClassifier | n_neighbors, weights, metric |
| | Naive Bayes | GaussianNB | var_smoothing |
| | Decision Tree | DecisionTreeClassifier | max_depth, min_samples_split, criterion |
| Regression | Linear Regression | LinearRegression | fit_intercept, copy_X |
| | Ridge Regression | Ridge | alpha, solver |
| | Lasso Regression | Lasso | alpha, max_iter |
| | Random Forest | RandomForestRegressor | n_estimators, max_depth |
| | SVM | SVR | C, kernel, epsilon |
| Clustering | K-means | KMeans | n_clusters, init, max_iter |
| | Hierarchical | AgglomerativeClustering | n_clusters, linkage |
| | DBSCAN | DBSCAN | eps, min_samples |
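All of the classifiers in the table share the same fit/predict interface, so swapping algorithms is a one-line change. A brief sketch comparing two of them on the Iris dataset (parameter values illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# Identical interface for both algorithms; fit() returns the model itself
for model in (KNeighborsClassifier(n_neighbors=5), GaussianNB()):
    acc = model.fit(X_train, y_train).score(X_test, y_test)
    print(f"{type(model).__name__}: {acc:.3f}")
```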
Data Preprocessing Methods
| Method | Class | Purpose | Parameters |
|---|---|---|---|
| Standardization | StandardScaler | Converting to normal distribution | with_mean, with_std |
| Normalization | MinMaxScaler | Scaling to the range [0,1] | feature_range |
| Robust Scaling | RobustScaler | Robustness to outliers | quantile_range |
| One-hot Encoding | OneHotEncoder | Encoding categorical features | sparse_output, drop |
| Label Encoding | LabelEncoder | Converting target labels to integers (intended for y, not features) | - |
| Polynomial Features | PolynomialFeatures | Creating polynomial combinations | degree, include_bias |
Quality Assessment Metrics
| Task Type | Metric | Function | Description |
|---|---|---|---|
| Classification | Accuracy | accuracy_score | Fraction of correct predictions |
| | Precision | precision_score | Precision for the positive class |
| | Recall | recall_score | Completeness (recall) for the positive class |
| | F1-measure | f1_score | Harmonic mean of precision and recall |
| | ROC AUC | roc_auc_score | Area under the ROC curve |
| Regression | MSE | mean_squared_error | Mean squared error |
| | MAE | mean_absolute_error | Mean absolute error |
| | R² | r2_score | Coefficient of determination |
| | RMSE | sqrt(mean_squared_error) | Root mean squared error |
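As the last row notes, RMSE is simply the square root of MSE and can be computed directly from mean_squared_error (the sample values below are illustrative):

```python
import numpy as np
from sklearn.metrics import mean_squared_error

y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]

mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)  # RMSE is the square root of MSE
print(f"MSE: {mse:.3f}, RMSE: {rmse:.3f}")  # MSE: 0.375, RMSE: 0.612
```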
Advantages and Limitations of Scikit-learn
Advantages
- Ease of use — a unified API for all algorithms
- High-quality documentation — detailed examples and explanations
- Wide selection of algorithms — covers most ML tasks
- Integration with the ecosystem — works great with NumPy, Pandas, Matplotlib
- Stability — time-tested algorithms
- Active community — regular updates and support
Limitations
- No deep learning — TensorFlow/PyTorch are needed for neural networks
- Not suited to very large datasets — most estimators require the whole dataset in memory
- No built-in GPU support — computations run on the CPU
- Limited work with text — basic NLP capabilities
Practical Recommendations
Algorithm Selection
- For small data (< 10000 objects): SVM, KNN
- For medium data: Random Forest, Gradient Boosting
- For large data: Linear models, SGD
- For interpretability: Decision trees, linear models
- For high accuracy: Ensemble methods
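For the large-data case, linear models trained with stochastic gradient descent are a common choice because they process samples incrementally. A minimal sketch with SGDClassifier on synthetic data (dataset shape and parameters illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# SGD is sensitive to feature scale, so pair it with a scaler in a pipeline
X, y = make_classification(n_samples=10_000, n_features=20,
                           n_informative=5, n_clusters_per_class=1,
                           random_state=42)
clf = make_pipeline(StandardScaler(), SGDClassifier(random_state=42))
clf.fit(X, y)
print(f"Training accuracy: {clf.score(X, y):.3f}")
```

For datasets too large for memory, SGDClassifier also supports incremental training via its partial_fit method.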
Data Workflow
# Typical workflow
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
# 1. Loading and splitting data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# 2. Creating a pipeline
pipeline = Pipeline([
('scaler', StandardScaler()),
('classifier', RandomForestClassifier(random_state=42))
])
# 3. Tuning parameters
param_grid = {
'classifier__n_estimators': [50, 100, 200],
'classifier__max_depth': [3, 5, 7]
}
grid_search = GridSearchCV(pipeline, param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)
# 4. Quality assessment
y_pred = grid_search.predict(X_test)
print(classification_report(y_test, y_pred))
Conclusion
Scikit-learn remains one of the most important libraries for studying and applying machine learning in Python. Its simplicity, reliability, and completeness of functionality make it an ideal choice for most classical machine learning tasks. Despite some limitations, the library continues to actively develop and remains the industry standard for solving data analysis and machine learning problems.