Architecture and Working Principles of TSFresh
TSFresh is built on a statistically grounded feature‑selection approach. The library uses a two‑step process: it first extracts a large set of candidate features from the time series, then applies statistical hypothesis tests to determine which of them are relevant to the target variable.
The internal architecture includes several key components:
- Feature‑extraction module with more than 60 different functions
- Statistical filtering system based on p‑values
- Optimized algorithms for handling large datasets
- Integration with popular machine‑learning libraries
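To make the two‑step idea concrete, here is a toy sketch in plain pandas/NumPy — not tsfresh's actual implementation: candidate features are aggregated per series, then only those with a measurable relationship to the target are kept. Real tsfresh replaces the naive correlation cut‑off below with proper hypothesis tests and false‑discovery‑rate control.

```python
import numpy as np
import pandas as pd

# Toy long-format data: three series of four points each
df = pd.DataFrame({
    'id':    [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3],
    'time':  [1, 2, 3, 4] * 3,
    'value': [10, 12, 11, 13, 20, 22, 21, 23, 30, 32, 31, 33],
})
y = pd.Series([0.0, 1.0, 2.0], index=[1, 2, 3])  # one target per series

# Step 1: extract candidate features per series id
features = df.groupby('id')['value'].agg(['mean', 'std', 'max', 'min'])

# Step 2: keep features related to the target (here: a crude correlation filter)
relevant = []
for c in features.columns:
    col = features[c]
    if col.std() > 0 and abs(np.corrcoef(col, y)[0, 1]) > 0.5:
        relevant.append(c)
print(relevant)  # 'std' is constant across these series and is dropped
```

The same shape of pipeline — wide extraction followed by relevance filtering — is what `extract_features` and `select_features` automate at scale.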
Benefits of Using TSFresh
TSFresh offers a range of substantial advantages for working with time series:
- Automation of feature engineering – the library generates hundreds of features automatically, saving developers significant time.
- Statistical rigor – all selected features pass strict statistical significance tests.
- Scalability – parallel processing support enables efficient handling of large datasets.
- Configurable flexibility – feature‑extraction parameters can be tuned to the specific task.
- Broad function spectrum – from simple statistical measures to complex non‑linearity and chaos indicators.
- Compatibility with the Python ecosystem – seamless integration with pandas, scikit‑learn, NumPy, and other popular libraries.
Installation and Environment Setup
Working with TSFresh requires installing several components. The core installation is performed via pip:
pip install tsfresh
For full functionality, it is also recommended to install additional dependencies:
pip install scikit-learn pandas numpy matplotlib seaborn
If you work with big data, you may need extra packages for performance optimization:
pip install dask[complete] numba
Basic imports of the required modules:
from tsfresh import extract_features, select_features
from tsfresh.utilities.dataframe_functions import impute
from tsfresh.feature_extraction import ComprehensiveFCParameters, EfficientFCParameters
from tsfresh.feature_selection.relevance import calculate_relevance_table
import pandas as pd
import numpy as np
Data Structure and Format Requirements
TSFresh works with data in a special “long format” table. This format requires a specific DataFrame structure:
import pandas as pd
# Example of a correct data layout
df = pd.DataFrame({
    'id': [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3],
    'time': [1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4],
    'value': [10, 12, 11, 13, 20, 22, 21, 23, 30, 32, 31, 33],
    'sensor_type': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'C', 'C', 'C', 'C']
})
Key columns:
- id – unique identifier of the time series
- time – timestamp or sequential index
- value – measured value
- sensor_type (optional) – sensor or channel type
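Measurements often arrive in wide format instead (one row per series, one column per time step). A common way to reshape such data into the long format tsfresh expects is pandas' `melt`; the column names `t1`, `t2`, … below are illustrative:

```python
import pandas as pd

# Wide format: one row per series, one column per time step
wide = pd.DataFrame({
    'id': [1, 2, 3],
    't1': [10, 20, 30],
    't2': [12, 22, 32],
    't3': [11, 21, 31],
})

# Melt into the long format expected by tsfresh
long = wide.melt(id_vars='id', var_name='time', value_name='value')

# Convert 't1', 't2', ... into a sortable integer index
long['time'] = long['time'].str.lstrip('t').astype(int)
long = long.sort_values(['id', 'time']).reset_index(drop=True)
print(long.head())
```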
Main Feature‑Extraction Methods
Basic Feature Extraction
The core function extract_features is the central element of the library:
from tsfresh import extract_features
# Extract all available features
features = extract_features(df,
                            column_id='id',
                            column_sort='time',
                            column_value='value')
Using Pre‑defined Configurations
TSFresh provides several ready‑made parameter sets:
from tsfresh.feature_extraction import EfficientFCParameters, ComprehensiveFCParameters, MinimalFCParameters
# Fast processing with core features
features_efficient = extract_features(df,
                                      column_id='id',
                                      column_sort='time',
                                      default_fc_parameters=EfficientFCParameters())
# Full set of features
features_comprehensive = extract_features(df,
                                          column_id='id',
                                          column_sort='time',
                                          default_fc_parameters=ComprehensiveFCParameters())
Multichannel Data
When working with data from multiple sources or sensors:
# Feature extraction for multivariate data
features_multivariate = extract_features(df,
                                         column_id='id',
                                         column_sort='time',
                                         column_value='value',
                                         column_kind='sensor_type')
Filtering and Selecting Relevant Features
After extracting features, you need to select the most relevant ones for your specific task:
from tsfresh import select_features
from tsfresh.utilities.dataframe_functions import impute
# Create target variable: one class label per series id
y = pd.Series([0, 1, 0], index=[1, 2, 3])
# Impute missing values
features_imputed = impute(features)
# Select relevant features
features_filtered = select_features(features_imputed, y)
Customizing Filter Parameters
from tsfresh.feature_selection.relevance import calculate_relevance_table
# Generate relevance table
relevance_table = calculate_relevance_table(features_imputed, y)
# Filter with a custom significance level
features_filtered = select_features(features_imputed, y,
                                    fdr_level=0.05,  # false discovery rate
                                    hypotheses_independent=False)
Performance Optimization
TSFresh offers several ways to optimize performance for large datasets:
Parallel Processing
# Use all available cores
features = extract_features(df,
                            column_id='id',
                            column_sort='time',
                            n_jobs=-1,     # use all cores
                            chunksize=10)  # chunk size for processing
Disabling the Progress Bar
# For automated scripts
features = extract_features(df,
                            column_id='id',
                            column_sort='time',
                            disable_progressbar=True)
Caching Results
# Save intermediate results
features.to_pickle('extracted_features.pkl')
# Load saved features
features_loaded = pd.read_pickle('extracted_features.pkl')
Creating Custom Feature Functions
TSFresh allows you to define your own feature‑calculation functions:
from tsfresh.feature_extraction import feature_calculators
from tsfresh.feature_extraction.feature_calculators import set_property

@set_property("fctype", "simple")
def custom_mean_absolute_change(x):
    """Calculate mean absolute change"""
    return np.mean(np.abs(np.diff(x)))

@set_property("fctype", "simple")  # simple calculators may also take extra parameters
def custom_ratio_beyond_r_sigma(x, r):
    """Proportion of values beyond r sigma"""
    mean_x = np.mean(x)
    std_x = np.std(x)
    return np.mean(np.abs(x - mean_x) > r * std_x)

# Register the functions so extract_features can resolve them by name
feature_calculators.custom_mean_absolute_change = custom_mean_absolute_change
feature_calculators.custom_ratio_beyond_r_sigma = custom_ratio_beyond_r_sigma

# Reference them in a settings dict: None for parameter-free calculators,
# a list of keyword dicts for parameterized ones
custom_fc_parameters = {
    'custom_mean_absolute_change': None,
    'custom_ratio_beyond_r_sigma': [{'r': 1}, {'r': 2}],
}
Integration with Machine Learning
Classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
# Split data
X_train, X_test, y_train, y_test = train_test_split(features_filtered, y, test_size=0.3, random_state=42)
# Train model
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)
# Predict
y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.3f}")
Regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
# For regression tasks, use a continuous target and its own train/test split
y_continuous = pd.Series([1.5, 2.7, 3.2], index=[1, 2, 3])
Xr_train, Xr_test, yr_train, yr_test = train_test_split(features_filtered, y_continuous, test_size=0.3, random_state=42)
regressor = RandomForestRegressor(n_estimators=100, random_state=42)
regressor.fit(Xr_train, yr_train)
y_pred_reg = regressor.predict(Xr_test)
mse = mean_squared_error(yr_test, y_pred_reg)
r2 = r2_score(yr_test, y_pred_reg)
Working with Real‑World Data
Example with Sensor Data
from tsfresh.examples.robot_execution_failures import download_robot_execution_failures, load_robot_execution_failures
# Download example dataset
download_robot_execution_failures()
timeseries, y = load_robot_execution_failures()
# Full processing pipeline
X = extract_features(timeseries, column_id="id", column_sort="time")
X_imputed = impute(X)
X_filtered = select_features(X_imputed, y)
# Train model
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier()
clf.fit(X_filtered, y)
# Evaluate
from sklearn.model_selection import cross_val_score
scores = cross_val_score(clf, X_filtered, y, cv=5)
print(f"Cross‑validation accuracy: {scores.mean():.3f} (+/- {scores.std() * 2:.3f})")
Visualization and Result Analysis
Feature Importance Analysis
import matplotlib.pyplot as plt
import seaborn as sns
# Get feature importance
feature_importance = pd.DataFrame({
    'feature': X_filtered.columns,
    'importance': clf.feature_importances_
}).sort_values('importance', ascending=False)
# Plot top‑20 features
plt.figure(figsize=(10, 8))
sns.barplot(data=feature_importance.head(20), x='importance', y='feature')
plt.title('Top 20 Most Important Features')
plt.xlabel('Feature Importance')
plt.tight_layout()
plt.show()
Correlation Analysis
# Correlation matrix for selected features
correlation_matrix = X_filtered.corr()
plt.figure(figsize=(12, 10))
sns.heatmap(correlation_matrix, annot=False, cmap='coolwarm', center=0)
plt.title('Feature Correlation Matrix')
plt.tight_layout()
plt.show()
Complete Reference Table of TSFresh Methods and Functions
| Category | Function / Method | Description | Main Parameters |
|---|---|---|---|
| Feature Extraction | extract_features() | Main function for extracting features | column_id, column_sort, column_value, default_fc_parameters |
| | extract_relevant_features() | Extract only relevant features | X, y, ml_task, fdr_level |
| | from_columns() | Create parameters from extracted feature column names | columns, columns_to_ignore |
| Feature Filtering | select_features() | Select significant features | X, y, fdr_level, hypotheses_independent |
| | calculate_relevance_table() | Compute relevance table | X, y, ml_task, n_jobs |
| | get_feature_names() | Retrieve feature names | fc_parameters, column_id, column_sort |
| Data Handling | impute() | Fill missing values | dataframe, col_to_max, col_to_min |
| | normalize() | Normalize data | dataframe, kind |
| | restrict() | Restrict feature set | dataframe, restriction_list |
| Configurations | ComprehensiveFCParameters() | Full set of parameters | - |
| | EfficientFCParameters() | Efficient set of parameters | - |
| | MinimalFCParameters() | Minimal set of parameters | - |
| Statistical Functions | mean() | Mean value | x |
| | median() | Median | x |
| | std() | Standard deviation | x |
| | var() | Variance | x |
| | skewness() | Skewness | x |
| | kurtosis() | Kurtosis | x |
| | minimum() | Minimum value | x |
| | maximum() | Maximum value | x |
| Autocorrelation Functions | autocorrelation() | Autocorrelation | x, lag |
| | partial_autocorrelation() | Partial autocorrelation | x, lag |
| | agg_autocorrelation() | Aggregated autocorrelation | x, f_agg, maxlag |
| Spectral Functions | fft_coefficient() | FFT coefficients | x, coeff, attr |
| | fft_aggregated() | Aggregated FFT | x, aggtype |
| | power_spectral_density() | Power spectral density | x, coeff |
| Entropy Measures | sample_entropy() | Sample entropy | x, m |
| | permutation_entropy() | Permutation entropy | x, tau, dimension |
| | approximate_entropy() | Approximate entropy | x, m, r |
| Complexity Functions | lempel_ziv_complexity() | Lempel‑Ziv complexity | x, bins |
| | fourier_entropy() | Fourier entropy | x, bins |
| | svd_entropy() | SVD entropy | x, tau, de |
| Non‑Linear Functions | largest_lyapunov_exponent() | Largest Lyapunov exponent | x, tau, de, knn |
| | hurst_exponent() | Hurst exponent | x |
| | detrended_fluctuation_analysis() | DFA (detrended fluctuation analysis) | x |
| Change Functions | mean_change() | Mean change | x |
| | mean_abs_change() | Mean absolute change | x |
| | mean_second_derivative_central() | Mean second derivative (central) | x |
| | absolute_sum_of_changes() | Absolute sum of changes | x |
| Crossing Functions | number_crossing_m() | Number of level crossings | x, m |
| | number_peaks() | Number of peaks | x, n |
| | mean_n_absolute_max() | Mean of n highest values | x, number_of_maxima |
| Quantile Functions | quantile() | Quantile | x, q |
| | range_count() | Count within range | x, min, max |
| | ratio_beyond_r_sigma() | Proportion beyond r sigma | x, r |
| Distribution Functions | count_above_mean() | Count above mean | x |
| | count_below_mean() | Count below mean | x |
| | percentage_of_reoccurring_values() | Percentage of recurring values | x |
| | percentage_of_reoccurring_datapoints() | Percentage of recurring datapoints | x |
| Trend Functions | linear_trend() | Linear trend | x, attr |
| | agg_linear_trend() | Aggregated linear trend | x, attr, chunk_len, f_agg |
| | augmented_dickey_fuller() | Dickey‑Fuller test | x, attr, autolag |
| AR Model Functions | ar_coefficient() | AR model coefficients | x, coeff, k |
| | max_langevin_fixed_point() | Maximum Langevin fixed point | x, r, m |
| CWT Functions | cwt_coefficients() | CWT coefficients | x, widths, coeff, w |
| | spkt_welch_density() | Welch spectral density | x, coeff |
| Matrix Functions | matrix_profile() | Matrix profile | x, threshold, feature |
| | change_quantiles() | Change quantiles | x, ql, qh, isabs, f_agg |
| Utilities | set_property() | Decorator for custom functions | fctype, input, index_type |
| | get_feature_names_from_fc_parameters() | Get feature names from FC parameters | fc_parameters, column_id, column_sort |
| | roll_time_series() | Create rolling windows | df_or_dict, column_id, column_sort, column_kind, rolling_direction, max_timeshift |
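The roll_time_series utility listed above turns each series into a set of overlapping sub-windows, each with its own compound identifier, which is the basis for building forecasting datasets. A simplified, pure-pandas sketch of the idea (expanding windows over one series; tsfresh's own implementation also handles max_timeshift, rolling direction, and multiple series):

```python
import pandas as pd

# One series of five points
df = pd.DataFrame({
    'id': [1] * 5,
    'time': [1, 2, 3, 4, 5],
    'value': [10, 11, 12, 13, 14],
})

# Expanding windows: one sub-series per possible end point
windows = []
for end in df['time']:
    w = df[df['time'] <= end].copy()
    w['window_id'] = f"1_{end}"  # compound id: (original id, window end)
    windows.append(w)
rolled = pd.concat(windows, ignore_index=True)
print(rolled['window_id'].nunique())  # 5 windows from one series
```

Features can then be extracted per `window_id`, giving one feature row per forecast origin.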
Practical Recommendations for Using TSFresh
Selecting Configuration Parameters
Different tasks benefit from different parameter sets:
- MinimalFCParameters() – fast prototyping and small datasets
- EfficientFCParameters() – suitable for most practical applications
- ComprehensiveFCParameters() – for research‑level tasks and maximal completeness
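All three configuration classes behave like dictionaries mapping feature-calculator names to parameter lists (None for parameter-free calculators), so they can be pruned or extended like any dict. The entries below are a hand-built illustration of that settings format:

```python
# A hand-built settings dict in tsfresh's format:
# {calculator_name: None | [{kwargs}, ...]}
fc_parameters = {
    'mean': None,                                  # parameter-free calculator
    'maximum': None,
    'autocorrelation': [{'lag': 1}, {'lag': 2}],   # one feature per kwargs dict
}

# Prune a calculator you don't need
del fc_parameters['maximum']

# Add another parameterized calculator
fc_parameters['quantile'] = [{'q': 0.1}, {'q': 0.9}]

print(sorted(fc_parameters))  # pass as default_fc_parameters=fc_parameters
```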
Memory Optimization
# For large data, use iterative processing
def process_large_dataset(df, chunk_size=1000):
    unique_ids = df['id'].unique()
    all_features = []
    for i in range(0, len(unique_ids), chunk_size):
        chunk_ids = unique_ids[i:i+chunk_size]
        chunk_df = df[df['id'].isin(chunk_ids)]
        features = extract_features(chunk_df,
                                    column_id='id',
                                    column_sort='time',
                                    n_jobs=1)  # limit parallelism per chunk
        all_features.append(features)
    return pd.concat(all_features)  # keep the series id index
Validation and Testing
from sklearn.model_selection import TimeSeriesSplit
# Use time‑series cross‑validation
tscv = TimeSeriesSplit(n_splits=5)
for train_index, test_index in tscv.split(X_filtered):
    X_train, X_test = X_filtered.iloc[train_index], X_filtered.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    model = RandomForestClassifier()
    model.fit(X_train, y_train)
    score = model.score(X_test, y_test)
    print(f"Fold accuracy: {score:.3f}")
Integration with Other Libraries
Distributed Processing for Large Data
from tsfresh.utilities.distribution import MultiprocessingDistributor
# Parallel processing on a single machine
distributor = MultiprocessingDistributor(n_workers=4)
features = extract_features(df,
                            column_id='id',
                            column_sort='time',
                            distributor=distributor)
# For a Dask cluster, tsfresh also provides a ClusterDaskDistributor
Integration with MLflow
import mlflow
import mlflow.sklearn
with mlflow.start_run():
    # Feature extraction
    features = extract_features(df, column_id='id', column_sort='time')
    features = impute(features)
    features_filtered = select_features(features, y)
    # Model training
    model = RandomForestClassifier()
    model.fit(features_filtered, y)
    # Log metrics
    mlflow.log_metric("n_features_extracted", len(features.columns))
    mlflow.log_metric("n_features_selected", len(features_filtered.columns))
    mlflow.sklearn.log_model(model, "model")
Error Handling and Debugging
Common Issues and Solutions
import warnings
warnings.filterwarnings('ignore')  # silence noisy warnings while debugging; avoid in production
try:
    features = extract_features(df, column_id='id', column_sort='time')
except Exception as e:
    print(f"Feature extraction error: {e}")
    # Inspect the data format
    print("Inspecting data structure:")
    print(f"Columns: {df.columns.tolist()}")
    print(f"Data types: {df.dtypes}")
    print(f"Unique IDs: {df['id'].nunique()}")
    print(f"Time range: {df['time'].min()} - {df['time'].max()}")
Process Logging
import logging
# Configure logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
def extract_features_with_logging(df, **kwargs):
    logger.info(f"Starting feature extraction for {df['id'].nunique()} time series")
    features = extract_features(df, **kwargs)
    logger.info(f"Extracted {len(features.columns)} features")
    logger.info(f"Resulting matrix size: {features.shape}")
    return features
Application Examples in Various Domains
Financial Data Analysis
# Example with financial time series
financial_df = pd.DataFrame({
    'id': np.repeat(range(1, 101), 252),        # 100 stocks, 252 trading days
    'time': np.tile(range(1, 253), 100),
    'price': np.random.randn(25200).cumsum() + 100,
    'volume': np.random.exponential(1000, 25200)
})
# Extract features for price and volume
features = extract_features(financial_df,
                            column_id='id',
                            column_sort='time',
                            column_value='price')
IoT Data Analysis
# Example with IoT sensor data
iot_df = pd.DataFrame({
    'device_id': np.repeat(range(1, 51), 1440),  # 50 devices, 1440 minutes per day
    'timestamp': np.tile(range(1, 1441), 50),
    'temperature': 20 + 5 * np.sin(np.tile(np.linspace(0, 2*np.pi, 1440), 50)) + np.random.normal(0, 1, 72000),
    'humidity': 50 + 10 * np.cos(np.tile(np.linspace(0, 2*np.pi, 1440), 50)) + np.random.normal(0, 2, 72000)
})
# Multivariate feature extraction
features = extract_features(iot_df,
                            column_id='device_id',
                            column_sort='timestamp',
                            column_value='temperature')
Conclusion
TSFresh is a powerful and versatile tool for working with time series in machine‑learning projects. The library dramatically simplifies the feature‑engineering workflow by automating the extraction and selection of features using statistical methods. With its rich function set, flexible configuration, and seamless integration into the Python ecosystem, TSFresh finds wide application across domains—from financial analytics to industrial monitoring.
Key advantages include automation of labor‑intensive processes, statistically sound results, and scalability. TSFresh is especially valuable when the structure of time series is unknown beforehand or when rapid prototyping of temporal data analysis solutions is required.