Statsmodels - statistical analysis


Introduction to Statsmodels

Statsmodels is a powerful Python library for statistical data analysis. It provides tools for building regression models, estimating parameters, testing statistical hypotheses, and analyzing time series. Statsmodels is widely used in econometrics, biostatistics, and the social sciences, offering deep statistical interpretation of results.

Key Features of the Statsmodels Library

Regression Analysis

Statsmodels offers extensive capabilities for regression analysis, including ordinary least squares (OLS), generalized linear models (GLM), robust regression, and panel data models. The library supports both parametric and non‑parametric estimation methods.

Time Series Analysis

The time‑series module includes ARIMA, SARIMA, exponential smoothing, vector autoregression (VAR) and many other models. Functions are available for stationarity testing, series decomposition, and forecasting.

Statistical Hypothesis Testing

The library contains a broad set of statistical tests: t‑tests, F‑tests, normality tests, variance equality tests, autocorrelation tests and many others.
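
For instance, a variance equality check with Levene's test might look like this (the two samples below are synthetic, drawn with deliberately different spread):

```python
import numpy as np
from scipy.stats import levene

# Two samples with clearly different spread
rng = np.random.default_rng(1)
a = rng.normal(scale=1.0, size=100)
b = rng.normal(scale=3.0, size=100)

stat, p_value = levene(a, b)
print(f"Levene statistic: {stat:.3f}, p-value: {p_value:.4f}")
```

A small p-value here leads to rejecting the hypothesis of equal variances.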

Analysis of Variance

One‑way and multi‑way ANOVA, ANCOVA, repeated‑measures designs, and mixed‑effects models are supported.

Non‑Parametric Methods

Kernel density estimation, non‑parametric regression and rank‑based tests are included.

Installation and Importing Statsmodels

Installing the Library

Installation is performed with the standard command:

pip install statsmodels

For additional functionality you can install with full dependencies:

pip install statsmodels[complete]

Importing Required Components

import statsmodels.api as sm
import statsmodels.formula.api as smf
import statsmodels.tsa.api as tsa
import statsmodels.stats.api as sms

Library Structure and Core Modules

Primary APIs

statsmodels.api — classic interface with direct matrix specification

statsmodels.formula.api — formula‑based interface, similar to R

statsmodels.tsa — module for time‑series analysis

statsmodels.stats — statistical tests and diagnostics

statsmodels.graphics — result visualization

Data Handling and Preparation

Statsmodels works optimally with pandas.DataFrame objects:

import pandas as pd
import numpy as np

# Load data
data = pd.read_csv("dataset.csv")

# Check for missing values
print(data.isnull().sum())

# Basic statistics
print(data.describe())

Linear Regression (OLS)

Classic Interface

import statsmodels.api as sm

# Prepare data
X = data[["x1", "x2", "x3"]]
X = sm.add_constant(X)  # add intercept
y = data["y"]

# Fit model
model = sm.OLS(y, X).fit()
print(model.summary())

# Extract coefficients
print("Coefficients:", model.params)
print("P‑values:", model.pvalues)
print("R‑squared:", model.rsquared)

Formula Interface

import statsmodels.formula.api as smf

# Use formulas
model = smf.ols("y ~ x1 + x2 + x3", data=data).fit()
print(model.summary())

# Interaction terms
model_interaction = smf.ols("y ~ x1 * x2 + x3", data=data).fit()

# Categorical variables
model_cat = smf.ols("y ~ C(category) + x1", data=data).fit()

Generalized Linear Models (GLM)

Logistic Regression

# Binary logistic regression
model = smf.logit("y ~ x1 + x2", data=data).fit()
print(model.summary())

# Predictions
predictions = model.predict(data)
print("Predicted probabilities:", predictions.head())

Poisson Regression

# For count data
model = smf.poisson("count ~ x1 + x2", data=data).fit()
print(model.summary())

Analysis of Variance (ANOVA)

One‑Way ANOVA

# Build model
model = smf.ols("score ~ C(group)", data=data).fit()

# ANOVA table
anova_table = sm.stats.anova_lm(model, typ=2)
print(anova_table)

Two‑Way ANOVA

# Two‑factor analysis
model = smf.ols("score ~ C(factor1) + C(factor2) + C(factor1):C(factor2)", data=data).fit()
anova_table = sm.stats.anova_lm(model, typ=2)
print(anova_table)

Time Series Analysis

ARIMA Models

from statsmodels.tsa.arima.model import ARIMA

# Build ARIMA model
model = ARIMA(data["series"], order=(1, 1, 1))
results = model.fit()
print(results.summary())

# Forecasting
forecast = results.forecast(steps=10)
print("Forecast:", forecast)

Seasonal Decomposition

from statsmodels.tsa.seasonal import seasonal_decompose
import matplotlib.pyplot as plt

# Decompose time series
decomposition = seasonal_decompose(data["series"], model='additive', period=12)
decomposition.plot()
plt.show()

Stationarity Testing

from statsmodels.tsa.stattools import adfuller

# Augmented Dickey‑Fuller test
result = adfuller(data["series"])
print("ADF statistic:", result[0])
print("P‑value:", result[1])

Statistical Tests

Normality Tests

from scipy.stats import shapiro, normaltest

# Shapiro‑Wilk test
stat, p_value = shapiro(data["x"])
print(f"Statistic: {stat}, P‑value: {p_value}")

# D'Agostino test
stat, p_value = normaltest(data["x"])
print(f"Statistic: {stat}, P‑value: {p_value}")

Mean Equality Tests

from scipy.stats import ttest_ind, ttest_1samp

# Two‑sample t‑test
group1 = data[data["group"] == "A"]["value"]
group2 = data[data["group"] == "B"]["value"]
stat, p_value = ttest_ind(group1, group2)
print(f"T‑statistic: {stat}, P‑value: {p_value}")

# One‑sample t‑test
stat, p_value = ttest_1samp(data["x"], popmean=0)
print(f"T‑statistic: {stat}, P‑value: {p_value}")

Model Diagnostics

Residual Analysis

import matplotlib.pyplot as plt

# Obtain residuals
residuals = model.resid

# Q‑Q plot
sm.qqplot(residuals, line="s")
plt.title("Q‑Q Plot of Residuals")
plt.show()

# Heteroskedasticity test
from statsmodels.stats.diagnostic import het_breuschpagan
stat, p_value, _, _ = het_breuschpagan(residuals, model.model.exog)
print(f"Breusch‑Pagan test: statistic={stat}, p‑value={p_value}")

Influential Observations

# Influence analysis
influence = model.get_influence()
summary_frame = influence.summary_frame()

# Cook's distance
cooks_distance = summary_frame["cooks_d"]
print("High‑influence observations:", cooks_distance[cooks_distance > 0.5])

Result Visualization

Diagnostic Plots

import matplotlib.pyplot as plt

# Residuals vs. fitted values
fig, ax = plt.subplots(figsize=(8, 6))
ax.scatter(model.fittedvalues, residuals)
ax.axhline(y=0, color='r', linestyle='--')
ax.set_xlabel('Fitted Values')
ax.set_ylabel('Residuals')
ax.set_title('Residuals vs. Fitted Values')
plt.show()

# Partial regression
sm.graphics.plot_partregress_grid(model, fig=plt.figure(figsize=(12, 8)))
plt.show()

Time‑Series Plots

# Autocorrelation
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

fig, axes = plt.subplots(2, 1, figsize=(12, 8))
plot_acf(data["series"], ax=axes[0], lags=20)
plot_pacf(data["series"], ax=axes[1], lags=20)
plt.show()

Table of Core Methods and Functions

Category Method/Function Description
Regression sm.OLS() Ordinary Least Squares
  sm.GLM() Generalized Linear Models
  sm.RLM() Robust Linear Models
  smf.ols() OLS with formula interface
  smf.logit() Logistic regression
  smf.poisson() Poisson regression
Time Series ARIMA() ARIMA model
  SARIMAX() Seasonal ARIMA with exogenous variables
  ExponentialSmoothing() Exponential smoothing
  VAR() Vector Autoregression
  seasonal_decompose() Seasonal decomposition
  adfuller() Augmented Dickey‑Fuller test
  acf() Autocorrelation function
  pacf() Partial autocorrelation function
Statistical Tests ttest_1samp() One‑sample t‑test
  ttest_ind() Two‑sample t‑test
  shapiro() Shapiro‑Wilk test
  normaltest() D'Agostino test
  levene() Levene's test
  bartlett() Bartlett's test
  chi2_contingency() Chi‑square test
Diagnostics het_breuschpagan() Heteroskedasticity test
  durbin_watson() Durbin‑Watson test
  jarque_bera() Jarque‑Bera test
  get_influence() Influence analysis
Visualization qqplot() Q‑Q plot
  plot_acf() Autocorrelation plot
  plot_pacf() Partial autocorrelation plot
  plot_partregress() Partial regression plot
  plot_regress_exog() Diagnostic plots
Utility add_constant() Add intercept
  summary() Result summary
  params Model coefficients
  pvalues P‑values
  rsquared Coefficient of determination
  predict() Predictions
  conf_int() Confidence intervals

Practical Use Cases

Marketing Campaign Effectiveness Analysis

# Multivariate regression to assess impact of different channels
model = smf.ols("sales ~ tv_spend + radio_spend + online_spend + C(season)", data=marketing_data).fit()

# Evaluate channel significance
print("Significant channels:")
for var, pval in model.pvalues.items():
    if pval < 0.05:
        print(f"{var}: p‑value = {pval:.4f}")

Financial Indicator Forecasting

# Time series model for stock prices
model = ARIMA(stock_prices, order=(2, 1, 1))
results = model.fit()

# 30‑day forecast
forecast = results.forecast(steps=30)
conf_int = results.get_forecast(steps=30).conf_int()

print("Price forecast after 30 days:", forecast.iloc[-1])

A/B Test Analysis

# Compare conversion rates between groups
from scipy.stats import chi2_contingency

contingency_table = pd.crosstab(data["group"], data["converted"])
chi2, p_value, dof, expected = chi2_contingency(contingency_table)

print(f"Chi‑square statistic: {chi2}")
print(f"P‑value: {p_value}")

# Logistic regression to control for additional factors
model = smf.logit("converted ~ C(group) + age + income", data=data).fit()
print(model.summary())

Performance Optimization

Working with Large Datasets

# Process a large file in chunks and collect per-chunk coefficients
def process_in_chunks(filepath, chunk_size=10000):
    results = []
    for chunk in pd.read_csv(filepath, chunksize=chunk_size):
        model = smf.ols("y ~ x1 + x2", data=chunk).fit()
        results.append(model.params)
    # One row of coefficients per chunk
    return pd.DataFrame(results)

Caching Results

# Save model results
import pickle

# Save
with open('model_results.pkl', 'wb') as f:
    pickle.dump(model, f)

# Load
with open('model_results.pkl', 'rb') as f:
    loaded_model = pickle.load(f)

Integration with Other Libraries

Interaction with Scikit‑learn

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# Data preprocessing
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Train‑test split
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2)

# Build Statsmodels OLS model
X_train_sm = sm.add_constant(X_train)
model = sm.OLS(y_train, X_train_sm).fit()

Visualization with Seaborn

import seaborn as sns

# Correlation matrix
correlation_matrix = data.corr()
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')
plt.title('Variable Correlation Matrix')
plt.show()

# Pairwise plots
sns.pairplot(data, hue='category')
plt.show()

Frequently Asked Questions

How does Statsmodels differ from Scikit‑learn?

Statsmodels focuses on statistical modeling and result interpretation, providing detailed statistics, p‑values, and confidence intervals. Scikit‑learn is centered on machine learning and predictive accuracy.

How to work with categorical variables?

In the formula interface use the syntax C(variable). In the classic interface, first convert variables to dummy variables with pd.get_dummies().

How to choose the appropriate time‑series model?

Start by examining data structure: test for stationarity, seasonality, and trend. Use ARIMA for non‑stationary series, SARIMA for seasonal data, and VAR for multiple series.

How to interpret the output of summary()?

Pay attention to coefficients, their standard errors, t‑statistics and p‑values. R‑squared indicates the proportion of explained variance. The F‑statistic tests overall model significance.

Can Statsmodels handle non‑linear relationships?

Yes—via polynomial terms in formulas, splines, or generalized additive models. Non‑linear least squares methods are also available.
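
For example, a quadratic term can be added in the formula interface with I(); the data below is synthetic:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic quadratic relationship
rng = np.random.default_rng(3)
x = rng.uniform(-3, 3, size=300)
y = 1 + 2 * x - 0.5 * x**2 + rng.normal(scale=0.2, size=300)
df = pd.DataFrame({"x": x, "y": y})

# I(...) tells the formula parser to evaluate the arithmetic literally
quad = smf.ols("y ~ x + I(x**2)", data=df).fit()
print(quad.params)
```

The fitted quadratic coefficient lands close to the true value of -0.5.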

How to handle missing values?

Statsmodels automatically drops observations with missing values. To control this, use the missing='drop' parameter or preprocess the data beforehand.

Best Practices

Checking Model Assumptions

# Linearity check
sm.graphics.plot_partregress_grid(model)

# Residual normality
sm.qqplot(model.resid, line='s')

# Independence check
from statsmodels.stats.stattools import durbin_watson
dw_stat = durbin_watson(model.resid)
print(f"Durbin‑Watson statistic: {dw_stat}")

Cross‑Validation

from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression

# Compare with sklearn for validation
lr = LinearRegression()
scores = cross_val_score(lr, X, y, cv=5, scoring='r2')
print(f"Average R² from cross‑validation: {scores.mean():.4f}")

Documenting the Analysis

# Create report
report = f"""
Model analysis:
- R²: {model.rsquared:.4f}
- Adjusted R²: {model.rsquared_adj:.4f}
- F‑statistic: {model.fvalue:.4f}
- F‑test p‑value: {model.f_pvalue:.4f}
- Number of observations: {model.nobs}
"""
print(report)

Conclusion: When and Why to Use Statsmodels

Statsmodels is an indispensable tool for rigorous statistical analysis in Python. The library is especially valuable in the following scenarios:

Exploratory Data Analysis: when you need to understand data structure, uncover patterns, and test statistical hypotheses.

Scientific Research: to obtain statistically grounded conclusions with significance levels and confidence intervals.

Econometric Analysis: for modeling economic processes, panel data, and time‑series analysis.

Business Analytics: to evaluate marketing campaign effectiveness, conduct A/B testing, and forecast key performance indicators.

Academic Projects: when detailed statistical interpretation and adherence to scientific standards are required.

Thanks to tight integration with the Python ecosystem (Pandas, NumPy, Matplotlib), rich functionality, and comprehensive documentation, Statsmodels is a powerful tool for a wide range of statistical analysis tasks. Choose between Statsmodels and other libraries based on project needs: when interpretability and statistical rigor are paramount, Statsmodels is the optimal choice.
