Introduction to Statsmodels
Statsmodels is a powerful Python library for statistical data analysis. It provides tools for building regression models, estimating parameters, testing statistical hypotheses, and analyzing time series. Statsmodels is widely used in econometrics, biostatistics, and the social sciences, and its output emphasizes deep statistical interpretation of results.
Key Features of the Statsmodels Library
Regression Analysis
Statsmodels offers extensive capabilities for regression analysis, including ordinary least squares (OLS), generalized linear models (GLM), robust regression, and panel data models. The library supports both parametric and non‑parametric estimation methods.
Time Series Analysis
The time‑series module includes ARIMA, SARIMA, exponential smoothing, vector autoregression (VAR) and many other models. Functions are available for stationarity testing, series decomposition, and forecasting.
Statistical Hypothesis Testing
The library contains a broad set of statistical tests: t‑tests, F‑tests, normality tests, variance equality tests, autocorrelation tests and many others.
Analysis of Variance
One‑ and multi‑factor ANOVA, ANCOVA, repeated measures and mixed‑effects models are supported.
Non‑Parametric Methods
Kernel density estimation, non‑parametric regression and rank‑based tests are included.
Installation and Importing Statsmodels
Installing the Library
Installation is performed with the standard command:
pip install statsmodels
Optional features such as plotting require extra packages, which can be installed alongside the library:
pip install statsmodels matplotlib
Importing Required Components
import statsmodels.api as sm
import statsmodels.formula.api as smf
import statsmodels.tsa.api as tsa
import statsmodels.stats.api as sms
Library Structure and Core Modules
Primary APIs
statsmodels.api — classic interface with direct matrix specification
statsmodels.formula.api — formula‑based interface, similar to R
statsmodels.tsa — module for time‑series analysis
statsmodels.stats — statistical tests and diagnostics
statsmodels.graphics — result visualisation
Data Handling and Preparation
Statsmodels works optimally with pandas.DataFrame objects:
import pandas as pd
import numpy as np
# Load data
data = pd.read_csv("dataset.csv")
# Check for missing values
print(data.isnull().sum())
# Basic statistics
print(data.describe())
Linear Regression (OLS)
Classic Interface
import statsmodels.api as sm
# Prepare data
X = data[["x1", "x2", "x3"]]
X = sm.add_constant(X) # add intercept
y = data["y"]
# Fit model
model = sm.OLS(y, X).fit()
print(model.summary())
# Extract coefficients
print("Coefficients:", model.params)
print("P‑values:", model.pvalues)
print("R‑squared:", model.rsquared)
Formula Interface
import statsmodels.formula.api as smf
# Use formulas
model = smf.ols("y ~ x1 + x2 + x3", data=data).fit()
print(model.summary())
# Interaction terms
model_interaction = smf.ols("y ~ x1 * x2 + x3", data=data).fit()
# Categorical variables
model_cat = smf.ols("y ~ C(category) + x1", data=data).fit()
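A self-contained sketch of the C() syntax on synthetic data (group labels and effect sizes are illustrative). The fitted coefficient for C(category)[T.B] recovers the built-in shift of group B relative to the reference level A:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 300
df = pd.DataFrame({
    "x1": rng.normal(size=n),
    "category": rng.choice(["A", "B"], size=n),
})
# Group B gets an intercept shift of +3 relative to group A
df["y"] = 2.0 * df["x1"] + np.where(df["category"] == "B", 3.0, 0.0) \
    + rng.normal(scale=0.1, size=n)

model_cat = smf.ols("y ~ C(category) + x1", data=df).fit()
shift_b = model_cat.params["C(category)[T.B]"]  # estimated B-vs-A shift
```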
Generalized Linear Models (GLM)
Logistic Regression
# Binary logistic regression
model = smf.logit("y ~ x1 + x2", data=data).fit()
print(model.summary())
# Predictions
predictions = model.predict(data)
print("Predicted probabilities:", predictions.head())
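A minimal runnable version of the logit workflow on simulated binary outcomes (the coefficient values are illustrative). The estimated slope recovers the value used to generate the data, and predictions are valid probabilities:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
n = 2000
x1 = rng.normal(size=n)
p = 1 / (1 + np.exp(-(0.5 + 1.2 * x1)))  # true success probability
df = pd.DataFrame({"x1": x1, "y": rng.binomial(1, p)})

model = smf.logit("y ~ x1", data=df).fit(disp=0)  # disp=0 silences iteration output
probs = model.predict(df)  # predicted probabilities in [0, 1]
```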
Poisson Regression
# For count data
model = smf.poisson("count ~ x1 + x2", data=data).fit()
print(model.summary())
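Poisson regression uses a log link, so coefficients act multiplicatively on the expected count. A sketch on simulated count data (parameter values are illustrative):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(9)
n = 1000
x1 = rng.normal(size=n)
mu = np.exp(0.3 + 0.5 * x1)  # expected count via the log link
df = pd.DataFrame({"x1": x1, "count": rng.poisson(mu)})

model = smf.poisson("count ~ x1", data=df).fit(disp=0)
```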
Analysis of Variance (ANOVA)
One‑Factor ANOVA
# Build model
model = smf.ols("score ~ C(group)", data=data).fit()
# ANOVA table
anova_table = sm.stats.anova_lm(model, typ=2)
print(anova_table)
Multi‑Factor ANOVA
# Two‑factor analysis
model = smf.ols("score ~ C(factor1) + C(factor2) + C(factor1):C(factor2)", data=data).fit()
anova_table = sm.stats.anova_lm(model, typ=2)
print(anova_table)
Time Series Analysis
ARIMA Models
from statsmodels.tsa.arima.model import ARIMA
# Build ARIMA model
model = ARIMA(data["series"], order=(1, 1, 1))
results = model.fit()
print(results.summary())
# Forecasting
forecast = results.forecast(steps=10)
print("Forecast:", forecast)
Seasonal Decomposition
from statsmodels.tsa.seasonal import seasonal_decompose
# Decompose time series
decomposition = seasonal_decompose(data["series"], model='additive', period=12)
decomposition.plot()
Stationarity Testing
from statsmodels.tsa.stattools import adfuller
# Augmented Dickey‑Fuller test
result = adfuller(data["series"])
print("ADF statistic:", result[0])
print("P‑value:", result[1])
Statistical Tests
Normality Tests
from scipy.stats import shapiro, normaltest
# Shapiro‑Wilk test
stat, p_value = shapiro(data["x"])
print(f"Statistic: {stat}, P‑value: {p_value}")
# D'Agostino test
stat, p_value = normaltest(data["x"])
print(f"Statistic: {stat}, P‑value: {p_value}")
Mean Equality Tests
from scipy.stats import ttest_ind, ttest_1samp
# Two‑sample t‑test
group1 = data[data["group"] == "A"]["value"]
group2 = data[data["group"] == "B"]["value"]
stat, p_value = ttest_ind(group1, group2)
print(f"T‑statistic: {stat}, P‑value: {p_value}")
# One‑sample t‑test
stat, p_value = ttest_1samp(data["x"], popmean=0)
print(f"T‑statistic: {stat}, P‑value: {p_value}")
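A self-contained two-sample example with a built-in mean difference of one standard deviation (the group sizes and shift are illustrative), where the test should clearly reject equal means:

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(6)
group1 = rng.normal(0.0, 1.0, 100)
group2 = rng.normal(1.0, 1.0, 100)  # mean shifted by +1

stat, p_value = ttest_ind(group1, group2)
```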
Model Diagnostics
Residual Analysis
# Obtain residuals
residuals = model.resid
# Q‑Q plot (requires matplotlib)
import matplotlib.pyplot as plt
sm.qqplot(residuals, line="s")
plt.title("Q‑Q Plot of Residuals")
plt.show()
# Heteroskedasticity test
from statsmodels.stats.diagnostic import het_breuschpagan
stat, p_value, _, _ = het_breuschpagan(residuals, model.model.exog)
print(f"Breusch‑Pagan test: statistic={stat}, p‑value={p_value}")
Influential Observations
# Influence analysis
influence = model.get_influence()
summary_frame = influence.summary_frame()
# Cook's distance; a common rule of thumb flags values above 4/n
cooks_distance = summary_frame["cooks_d"]
print("High‑influence observations:", cooks_distance[cooks_distance > 4 / model.nobs])
Result Visualisation
Diagnostic Plots
import matplotlib.pyplot as plt
# Residuals vs. fitted values
fig, ax = plt.subplots(figsize=(8, 6))
ax.scatter(model.fittedvalues, residuals)
ax.axhline(y=0, color='r', linestyle='--')
ax.set_xlabel('Fitted Values')
ax.set_ylabel('Residuals')
ax.set_title('Residuals vs. Fitted Values')
plt.show()
# Partial regression
sm.graphics.plot_partregress_grid(model, fig=plt.figure(figsize=(12, 8)))
plt.show()
Time‑Series Plots
# Autocorrelation
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
fig, axes = plt.subplots(2, 1, figsize=(12, 8))
plot_acf(data["series"], ax=axes[0], lags=20)
plot_pacf(data["series"], ax=axes[1], lags=20)
plt.show()
Table of Core Methods and Functions
| Category | Method/Function | Description |
|---|---|---|
| Regression | sm.OLS() | Ordinary Least Squares |
| | sm.GLM() | Generalized Linear Models |
| | sm.RLM() | Robust Linear Models |
| | smf.ols() | OLS with formula interface |
| | smf.logit() | Logistic regression |
| | smf.poisson() | Poisson regression |
| Time Series | ARIMA() | ARIMA model |
| | SARIMAX() | Seasonal ARIMA with exogenous variables |
| | ExponentialSmoothing() | Exponential smoothing |
| | VAR() | Vector Autoregression |
| | seasonal_decompose() | Seasonal decomposition |
| | adfuller() | Augmented Dickey‑Fuller test |
| | acf() | Autocorrelation function |
| | pacf() | Partial autocorrelation function |
| Statistical Tests | ttest_1samp() | One‑sample t‑test |
| | ttest_ind() | Two‑sample t‑test |
| | shapiro() | Shapiro‑Wilk test |
| | normaltest() | D'Agostino test |
| | levene() | Levene's test |
| | bartlett() | Bartlett's test |
| | chi2_contingency() | Chi‑square test |
| Diagnostics | het_breuschpagan() | Heteroskedasticity test |
| | durbin_watson() | Durbin‑Watson test |
| | jarque_bera() | Jarque‑Bera test |
| | get_influence() | Influence analysis |
| Visualization | qqplot() | Q‑Q plot |
| | plot_acf() | Autocorrelation plot |
| | plot_pacf() | Partial autocorrelation plot |
| | plot_partregress() | Partial regression plot |
| | plot_regress_exog() | Diagnostic plots |
| Utility | add_constant() | Add intercept |
| | summary() | Result summary |
| | params | Model coefficients |
| | pvalues | P‑values |
| | rsquared | Coefficient of determination |
| | predict() | Predictions |
| | conf_int() | Confidence intervals |
Practical Use Cases
Marketing Campaign Effectiveness Analysis
# Multivariate regression to assess impact of different channels
model = smf.ols("sales ~ tv_spend + radio_spend + online_spend + C(season)", data=marketing_data).fit()
# Evaluate channel significance
print("Significant channels:")
for var, pval in model.pvalues.items():
    if pval < 0.05:
        print(f"{var}: p‑value = {pval:.4f}")
Financial Indicator Forecasting
# Time series model for stock prices
model = ARIMA(stock_prices, order=(2, 1, 1))
results = model.fit()
# 30‑day forecast
forecast = results.forecast(steps=30)
conf_int = results.get_forecast(steps=30).conf_int()
print("Price forecast after 30 days:", forecast.iloc[-1])
A/B Test Analysis
# Compare conversion rates between groups
from scipy.stats import chi2_contingency
contingency_table = pd.crosstab(data["group"], data["converted"])
chi2, p_value, dof, expected = chi2_contingency(contingency_table)
print(f"Chi‑square statistic: {chi2}")
print(f"P‑value: {p_value}")
# Logistic regression to control for additional factors
model = smf.logit("converted ~ C(group) + age + income", data=data).fit()
print(model.summary())
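The chi-square part of this workflow can be verified on simulated A/B data with a known lift (group labels and conversion rates are illustrative); with a 12% vs. 8% conversion rate and a few thousand users, the test detects the difference:

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

rng = np.random.default_rng(7)
n = 5000
group = rng.choice(["control", "treatment"], size=n)
# Treatment converts at 12%, control at 8%
rate = np.where(group == "treatment", 0.12, 0.08)
converted = rng.binomial(1, rate)

table = pd.crosstab(pd.Series(group, name="group"),
                    pd.Series(converted, name="converted"))
chi2, p_value, dof, expected = chi2_contingency(table)
```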
Performance Optimization
Working with Large Datasets
# Process data in chunks for large files
def process_in_chunks(path, chunk_size=10000):
    """Fit a model per chunk of a large CSV file at `path`."""
    results = []
    for chunk in pd.read_csv(path, chunksize=chunk_size):
        model = smf.ols("y ~ x1 + x2", data=chunk).fit()
        results.append(model.params)
    # one row of coefficient estimates per chunk
    return pd.concat(results, axis=1).T
Caching Results
# Save model results
import pickle
# Save
with open('model_results.pkl', 'wb') as f:
    pickle.dump(model, f)
# Load
with open('model_results.pkl', 'rb') as f:
    loaded_model = pickle.load(f)
Integration with Other Libraries
Interaction with Scikit‑learn
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
# Data preprocessing
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Train‑test split
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2)
# Build Statsmodels OLS model
X_train_sm = sm.add_constant(X_train)
model = sm.OLS(y_train, X_train_sm).fit()
Visualization with Seaborn
import seaborn as sns
# Correlation matrix (numeric columns only)
correlation_matrix = data.corr(numeric_only=True)
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')
plt.title('Variable Correlation Matrix')
plt.show()
# Pairwise plots
sns.pairplot(data, hue='category')
plt.show()
Frequently Asked Questions
How does Statsmodels differ from Scikit‑learn?
Statsmodels focuses on statistical modeling and result interpretation, providing detailed statistics, p‑values, and confidence intervals. Scikit‑learn is centered on machine learning and predictive accuracy.
How to work with categorical variables?
In the formula interface use the syntax C(variable). In the classic interface, first convert variables to dummy variables with pd.get_dummies().
How to choose the appropriate time‑series model?
Start by examining the data's structure: test for stationarity, seasonality, and trend. Use ARIMA (with differencing) for trended, non‑stationary series, SARIMAX for seasonal data, and VAR for multiple interrelated series.
How to interpret the output of summary()?
Pay attention to coefficients, their standard errors, t‑statistics and p‑values. R‑squared indicates the proportion of explained variance. The F‑statistic tests overall model significance.
Can Statsmodels handle non‑linear relationships?
Yes—via polynomial terms in formulas, splines, or generalized additive models. Non‑linear least squares methods are also available.
How to handle missing values?
The formula interface drops rows with missing values by default. With the array interface, pass missing='drop' to the model constructor or preprocess the data beforehand.
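A quick check of this behavior on a tiny illustrative frame: with one NaN in a predictor, the fitted model uses only the complete rows.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({
    "x": [1.0, 2.0, np.nan, 4.0, 5.0],
    "y": [2.1, 3.9, 5.0, 8.2, 9.9],
})
# missing='drop' (the formula-interface default) excludes rows with NaN
model = smf.ols("y ~ x", data=df, missing="drop").fit()
n_used = int(model.nobs)  # 4 of the 5 rows survive
```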
Best Practices
Checking Model Assumptions
# Linearity check
sm.graphics.plot_partregress_grid(model)
# Residual normality
sm.qqplot(model.resid, line='s')
# Independence check
from statsmodels.stats.stattools import durbin_watson
dw_stat = durbin_watson(model.resid)
print(f"Durbin‑Watson statistic: {dw_stat}")
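For reference, on uncorrelated residuals the Durbin‑Watson statistic sits near 2 (values well below 2 suggest positive autocorrelation, well above 2 negative). A quick sketch on white noise (illustrative data):

```python
import numpy as np
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(8)
independent_resid = rng.normal(size=500)  # no serial correlation by construction

dw = durbin_watson(independent_resid)  # should be close to 2
```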
Cross‑Validation
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression
# Compare with sklearn for validation
lr = LinearRegression()
scores = cross_val_score(lr, X, y, cv=5, scoring='r2')
print(f"Average R² from cross‑validation: {scores.mean():.4f}")
Documenting the Analysis
# Create report
report = f"""
Model analysis:
- R²: {model.rsquared:.4f}
- Adjusted R²: {model.rsquared_adj:.4f}
- F‑statistic: {model.fvalue:.4f}
- F‑test p‑value: {model.f_pvalue:.4f}
- Number of observations: {model.nobs}
"""
print(report)
Conclusion: When and Why to Use Statsmodels
Statsmodels is an indispensable tool for rigorous statistical analysis in Python. The library is especially valuable in the following scenarios:
Exploratory Data Analysis: when you need to understand data structure, uncover patterns, and test statistical hypotheses.
Scientific Research: to obtain statistically grounded conclusions with significance levels and confidence intervals.
Econometric Analysis: for modeling economic processes, panel data, and time‑series analysis.
Business Analytics: to evaluate marketing campaign effectiveness, conduct A/B testing, and forecast key performance indicators.
Academic Projects: when detailed statistical interpretation and adherence to scientific standards are required.
Thanks to tight integration with the Python ecosystem (Pandas, NumPy, Matplotlib), rich functionality, and comprehensive documentation, Statsmodels is a powerful instrument for solving a wide range of statistical analysis tasks. Choosing between Statsmodels and other libraries should be based on project needs: if interpretability and statistical rigor are paramount, Statsmodels is the optimal choice.