TSFresh: Extracting Features from Time Series

Architecture and Working Principles of TSFresh

TSFresh is built on a statistically grounded feature‑selection approach. The library uses a two‑step process: it first extracts as many candidate features as possible from the time series, then applies statistical hypothesis tests to determine which of them are relevant to the target variable.

The internal architecture includes several key components:

  • Feature‑extraction module with more than 60 different functions
  • Statistical filtering system based on p‑values
  • Optimized algorithms for handling large datasets
  • Integration with popular machine‑learning libraries
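
The two‑step idea can be illustrated without tsfresh itself: aggregate each series into candidate features, then keep only those that show a relationship with the target. A toy sketch in plain pandas, where the three aggregates and the correlation threshold are simplified stand‑ins for tsfresh's feature calculators and hypothesis tests:

```python
import pandas as pd

# Three short series in long format, plus a binary target per series id
df = pd.DataFrame({
    "id": [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "value": [1.0, 2.0, 3.0, 10.0, 11.0, 12.0, 1.5, 2.5, 3.5],
})
y = pd.Series([0, 1, 0], index=[1, 2, 3])

# Step 1: extract candidate features per series
features = df.groupby("id")["value"].agg(["mean", "std", "max"])

# Step 2: keep only features that correlate with the target
# (the constant "std" column has undefined correlation and is dropped)
relevance = features.apply(lambda col: abs(col.corr(y)))
selected = features[relevance[relevance > 0.5].index]
```

Here `selected` retains `mean` and `max`, which track the target, while the uninformative `std` column is filtered out — the same extract‑then‑select flow tsfresh automates at scale.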

Benefits of Using TSFresh

TSFresh offers a range of substantial advantages for working with time series:

Automation of feature engineering – the library generates hundreds of features automatically, saving developers significant time.

Statistical rigor – all selected features pass strict statistical significance tests.

Scalability – parallel processing support enables efficient handling of large datasets.

Configurable flexibility – parameters for feature extraction can be tuned to the specific task.

Broad function spectrum – from simple statistical measures to complex non‑linearity and chaos indicators.

Compatibility with the Python ecosystem – seamless integration with pandas, scikit‑learn, NumPy, and other popular libraries.

Installation and Environment Setup

Working with TSFresh requires installing several components. The core installation is performed via pip:

pip install tsfresh

For full functionality, it is also recommended to install additional dependencies:

pip install scikit-learn pandas numpy matplotlib seaborn

If you work with big data, you may need extra packages for performance optimization:

pip install dask[complete] numba

Basic imports of the required modules:

from tsfresh import extract_features, select_features
from tsfresh.utilities.dataframe_functions import impute
from tsfresh.feature_extraction import ComprehensiveFCParameters, EfficientFCParameters
from tsfresh.feature_selection.relevance import calculate_relevance_table
import pandas as pd
import numpy as np

Data Structure and Format Requirements

TSFresh works with data in a special “long format” table. This format requires a specific DataFrame structure:

import pandas as pd

# Example of a correct data layout
df = pd.DataFrame({
    'id': [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3],
    'time': [1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4],
    'value': [10, 12, 11, 13, 20, 22, 21, 23, 30, 32, 31, 33],
    'sensor_type': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'C', 'C', 'C', 'C']
})

Key columns:

  • id – unique identifier of the time series
  • time – timestamp or sequential index
  • value – measured value
  • sensor_type (optional) – sensor or channel type
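
Real data often arrives in wide format instead — one row per series, one column per time step. pandas' melt can reshape it into the long format tsfresh expects; a small sketch with illustrative column names:

```python
import pandas as pd

# Wide format: one row per series, one column per time step
wide = pd.DataFrame({
    "id": [1, 2],
    "t1": [10, 20],
    "t2": [12, 22],
    "t3": [11, 21],
})

# Reshape into long format: one row per (id, time) observation
long = wide.melt(id_vars="id", var_name="time", value_name="value")
long["time"] = long["time"].str.lstrip("t").astype(int)
long = long.sort_values(["id", "time"]).reset_index(drop=True)
```

The result has the three required columns — id, time, value — and can be passed straight to extract_features.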

Main Feature‑Extraction Methods

Basic Feature Extraction

The core function extract_features is the central element of the library:

from tsfresh import extract_features

# Extract all available features
features = extract_features(df, 
                          column_id='id', 
                          column_sort='time',
                          column_value='value')

Using Pre‑defined Configurations

TSFresh provides several ready‑made parameter sets:

from tsfresh.feature_extraction import EfficientFCParameters, ComprehensiveFCParameters, MinimalFCParameters

# Fast processing with core features
features_efficient = extract_features(df, 
                                    column_id='id', 
                                    column_sort='time',
                                    default_fc_parameters=EfficientFCParameters())

# Full set of features
features_comprehensive = extract_features(df, 
                                        column_id='id', 
                                        column_sort='time',
                                        default_fc_parameters=ComprehensiveFCParameters())

Multichannel Data

When working with data from multiple sources or sensors:

# Feature extraction for multivariate data
features_multivariate = extract_features(df, 
                                        column_id='id', 
                                        column_sort='time',
                                        column_value='value',
                                        column_kind='sensor_type')

Filtering and Selecting Relevant Features

After extracting features, you need to select the most relevant ones for your specific task:

from tsfresh import select_features
from tsfresh.utilities.dataframe_functions import impute

# Create target variable
y = pd.Series([0, 1, 0], index=[1, 2, 3])

# Impute missing values
features_imputed = impute(features)

# Select relevant features
features_filtered = select_features(features_imputed, y)

Customizing Filter Parameters

from tsfresh.feature_selection.relevance import calculate_relevance_table

# Generate relevance table
relevance_table = calculate_relevance_table(features_imputed, y)

# Filter with a custom significance level
features_filtered = select_features(features_imputed, y, 
                                  fdr_level=0.05,  # False discovery rate
                                  hypotheses_independent=False)

Performance Optimization

TSFresh offers several ways to optimise performance for large datasets:

Parallel Processing

# Use all available cores
features = extract_features(df, 
                          column_id='id', 
                          column_sort='time',
                          n_jobs=-1,  # All cores
                          chunksize=10)  # Chunk size for processing

Disabling the Progress Bar

# For automated scripts
features = extract_features(df, 
                          column_id='id', 
                          column_sort='time',
                          disable_progressbar=True)

Caching Results

# Save intermediate results
features.to_pickle('extracted_features.pkl')

# Load saved features
features_loaded = pd.read_pickle('extracted_features.pkl')

Creating Custom Feature Functions

TSFresh allows you to define your own feature‑calculation functions:

from tsfresh.feature_extraction.feature_calculators import set_property

@set_property("fctype", "simple")
def custom_mean_absolute_change(x):
    """Calculate mean absolute change"""
    return np.mean(np.abs(np.diff(x)))

@set_property("fctype", "simple")
def custom_ratio_beyond_r_sigma(x, r):
    """Proportion of values beyond r sigma"""
    mean_x = np.mean(x)
    std_x = np.std(x)
    return np.mean(np.abs(x - mean_x) > r * std_x)

# Attach the functions to the feature_calculators module so tsfresh
# can resolve them by name, then reference them in a settings dictionary
from tsfresh.feature_extraction import feature_calculators
feature_calculators.custom_mean_absolute_change = custom_mean_absolute_change
feature_calculators.custom_ratio_beyond_r_sigma = custom_ratio_beyond_r_sigma

# Parameterless calculators map to None; parameterized ones to a list of dicts
custom_fc_parameters = {
    'custom_mean_absolute_change': None,
    'custom_ratio_beyond_r_sigma': [{'r': 1}, {'r': 2}],
}

Integration with Machine Learning

Classification

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report

# Split data
X_train, X_test, y_train, y_test = train_test_split(features_filtered, y, test_size=0.3, random_state=42)

# Train model
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)

# Predict
y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.3f}")

Regression

from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Continuous target for regression tasks
y_continuous = pd.Series([1.5, 2.7, 3.2], index=[1, 2, 3])

# Split the features and the continuous target together
X_train_reg, X_test_reg, y_train_reg, y_test_reg = train_test_split(
    features_filtered, y_continuous, test_size=0.3, random_state=42)

regressor = RandomForestRegressor(n_estimators=100, random_state=42)
regressor.fit(X_train_reg, y_train_reg)

y_pred_reg = regressor.predict(X_test_reg)
mse = mean_squared_error(y_test_reg, y_pred_reg)
r2 = r2_score(y_test_reg, y_pred_reg)

Working with Real‑World Data

Example with Sensor Data

from tsfresh.examples.robot_execution_failures import download_robot_execution_failures, load_robot_execution_failures

# Download example dataset
download_robot_execution_failures()
timeseries, y = load_robot_execution_failures()

# Full processing pipeline
X = extract_features(timeseries, column_id="id", column_sort="time")
X_imputed = impute(X)
X_filtered = select_features(X_imputed, y)

# Train model
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier()
clf.fit(X_filtered, y)

# Evaluate
from sklearn.model_selection import cross_val_score
scores = cross_val_score(clf, X_filtered, y, cv=5)
print(f"Cross-validation accuracy: {scores.mean():.3f} (+/- {scores.std() * 2:.3f})")

Visualization and Result Analysis

Feature Importance Analysis

import matplotlib.pyplot as plt
import seaborn as sns

# Get feature importance
feature_importance = pd.DataFrame({
    'feature': X_filtered.columns,
    'importance': clf.feature_importances_
}).sort_values('importance', ascending=False)

# Plot top‑20 features
plt.figure(figsize=(10, 8))
sns.barplot(data=feature_importance.head(20), x='importance', y='feature')
plt.title('Top 20 Most Important Features')
plt.xlabel('Feature Importance')
plt.tight_layout()
plt.show()

Correlation Analysis

# Correlation matrix for selected features
correlation_matrix = X_filtered.corr()

plt.figure(figsize=(12, 10))
sns.heatmap(correlation_matrix, annot=False, cmap='coolwarm', center=0)
plt.title('Feature Correlation Matrix')
plt.tight_layout()
plt.show()

Complete Reference Table of TSFresh Methods and Functions

Category Function / Method Description Main Parameters
Feature Extraction extract_features() Main function for extracting features column_id, column_sort, column_value, default_fc_parameters
  extract_relevant_features() Extract only relevant features X, y, ml_task, fdr_level
  from_columns() Create parameters from a list of functions columns, columns_to_ignore
Feature Filtering select_features() Select significant features X, y, fdr_level, hypotheses_independent
  calculate_relevance_table() Compute relevance table X, y, ml_task, n_jobs
  get_feature_names() Retrieve feature names fc_parameters, column_id, column_sort
Data Handling impute() Fill missing values dataframe, col_to_max, col_to_min
  normalize() Normalize data dataframe, kind
  restrict() Restrict feature set dataframe, restriction_list
Configurations ComprehensiveFCParameters() Full set of parameters -
  EfficientFCParameters() Efficient set of parameters -
  MinimalFCParameters() Minimal set of parameters -
Statistical Functions mean() Mean value x
  median() Median x
  std() Standard deviation x
  var() Variance x
  skewness() Skewness x
  kurtosis() Kurtosis x
  minimum() Minimum value x
  maximum() Maximum value x
Autocorrelation Functions autocorrelation() Autocorrelation x, lag
  partial_autocorrelation() Partial autocorrelation x, lag
  agg_autocorrelation() Aggregated autocorrelation x, f_agg, maxlag
Spectral Functions fft_coefficient() FFT coefficients x, coeff, attr
  fft_aggregated() Aggregated FFT x, aggtype
  power_spectral_density() Power spectral density x, coeff
Entropy Measures sample_entropy() Sample entropy x, m
  permutation_entropy() Permutation entropy x, tau, dimension
  approximate_entropy() Approximate entropy x, m, r
Complexity Functions lempel_ziv_complexity() Lempel‑Ziv complexity x, bins
  fourier_entropy() Fourier entropy x, bins
  svd_entropy() SVD entropy x, tau, de
Non‑Linear Functions largest_lyapunov_exponent() Largest Lyapunov exponent x, tau, de, knn
  hurst_exponent() Hurst exponent x
  detrended_fluctuation_analysis() DFA (detrended fluctuation analysis) x
Change Functions mean_change() Mean change x
  mean_abs_change() Mean absolute change x
  mean_second_derivative_central() Mean second derivative (central) x
  absolute_sum_of_changes() Absolute sum of changes x
Crossing Functions number_crossing_m() Number of level crossings x, m
  number_peaks() Number of peaks x, n
  mean_n_absolute_max() Mean of n highest values x, number_of_maxima
Quantile Functions quantile() Quantile x, q
  range_count() Count within range x, min, max
  ratio_beyond_r_sigma() Proportion beyond r sigma x, r
Distribution Functions count_above_mean() Count above mean x
  count_below_mean() Count below mean x
  percentage_of_reoccurring_values() Percentage of recurring values x
  percentage_of_reoccurring_datapoints() Percentage of recurring datapoints x
Trend Functions linear_trend() Linear trend x, attr
  agg_linear_trend() Aggregated linear trend x, attr, chunk_len, f_agg
  augmented_dickey_fuller() Dickey‑Fuller test x, attr, autolag
AR Model Functions ar_coefficient() AR model coefficients x, coeff, k
  max_langevin_fixed_point() Maximum Langevin fixed point x, r, m
CWT Functions cwt_coefficients() CWT coefficients x, widths, coeff, w
  spkt_welch_density() Welch spectral density x, coeff
Matrix Functions matrix_profile() Matrix profile x, threshold, feature
  change_quantiles() Change quantiles x, ql, qh, isabs, f_agg
Utilities set_property() Decorator for custom functions fctype, input, index_type
  get_feature_names_from_fc_parameters() Get feature names from FC parameters fc_parameters, column_id, column_sort
  roll_time_series() Create rolling windows df_or_dict, column_id, column_sort, column_kind, rolling_direction, max_timeshift
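
The roll_time_series utility listed above turns one series into many overlapping sub‑series so that features can be extracted per window, which is the usual setup for forecasting tasks. Its effect can be sketched in plain pandas — a simplified stand‑in for the tsfresh utility, not its actual implementation:

```python
import pandas as pd

series = pd.DataFrame({"id": 1,
                       "time": range(1, 6),
                       "value": [10, 12, 11, 13, 14]})

# Each length-3 window becomes its own sub-series with a new id
window = 3
sub_series = []
for end in range(window, len(series) + 1):
    w = series.iloc[end - window:end].copy()
    # The new id encodes the original id and the window's end time
    w["window_id"] = f"1_{series['time'].iloc[end - 1]}"
    sub_series.append(w)
rolled = pd.concat(sub_series, ignore_index=True)
```

Extracting features with column_id set to the new window identifier then yields one feature row per window, each describing the history available up to that point in time.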

Practical Recommendations for Using TSFresh

Selecting Configuration Parameters

Different tasks benefit from different parameter sets:

  • MinimalFCParameters() – fast prototyping and small datasets
  • EfficientFCParameters() – suitable for most practical applications
  • ComprehensiveFCParameters() – for research‑level tasks and maximal completeness
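
All of these configuration objects behave like dictionaries mapping calculator names to parameter lists, so a hand‑built configuration can restrict extraction to exactly the features you need. A minimal sketch, with calculator names taken from the reference table above:

```python
# Keys are feature-calculator names; values are None for parameterless
# calculators, or a list of parameter dicts (one feature per dict)
custom_settings = {
    "mean": None,
    "standard_deviation": None,
    "autocorrelation": [{"lag": 1}, {"lag": 2}, {"lag": 3}],
    "quantile": [{"q": 0.25}, {"q": 0.75}],
}

# Passed as default_fc_parameters to extract_features, this configuration
# would compute 2 + 3 + 2 = 7 features per time series
n_features = sum(1 if v is None else len(v) for v in custom_settings.values())
```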

Memory Optimization

# For large data, use iterative processing
def process_large_dataset(df, chunk_size=1000):
    unique_ids = df['id'].unique()
    all_features = []
    
    for i in range(0, len(unique_ids), chunk_size):
        chunk_ids = unique_ids[i:i+chunk_size]
        chunk_df = df[df['id'].isin(chunk_ids)]
        
        features = extract_features(chunk_df, 
                                  column_id='id', 
                                  column_sort='time',
                                  n_jobs=1)  # Limit parallelism
        all_features.append(features)
    
    # Keep the id index so features stay aligned with their series
    return pd.concat(all_features)

Validation and Testing

from sklearn.model_selection import TimeSeriesSplit

# Use time‑series cross‑validation
tscv = TimeSeriesSplit(n_splits=5)

for train_index, test_index in tscv.split(X_filtered):
    X_train, X_test = X_filtered.iloc[train_index], X_filtered.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    
    model = RandomForestClassifier()
    model.fit(X_train, y_train)
    score = model.score(X_test, y_test)
    print(f"Fold accuracy: {score:.3f}")

Integration with Other Libraries

Distributed Processing of Large Data

from tsfresh.utilities.distribution import MultiprocessingDistributor

# Spread the feature calculation across several worker processes
distributor = MultiprocessingDistributor(n_workers=4,
                                         disable_progressbar=False,
                                         progressbar_title="Feature Extraction")
features = extract_features(df, 
                          column_id='id', 
                          column_sort='time',
                          distributor=distributor)

For cluster-scale workloads, the same module also provides Dask-based distributors (LocalDaskDistributor and ClusterDaskDistributor).

Integration with MLflow

import mlflow
import mlflow.sklearn

with mlflow.start_run():
    # Feature extraction
    features = extract_features(df, column_id='id', column_sort='time')
    features = impute(features)
    features_filtered = select_features(features, y)
    
    # Model training
    model = RandomForestClassifier()
    model.fit(features_filtered, y)
    
    # Log metrics
    mlflow.log_metric("n_features_extracted", len(features.columns))
    mlflow.log_metric("n_features_selected", len(features_filtered.columns))
    mlflow.sklearn.log_model(model, "model")

Error Handling and Debugging

Common Issues and Solutions

import warnings
warnings.filterwarnings('ignore')

try:
    features = extract_features(df, column_id='id', column_sort='time')
except Exception as e:
    print(f"Feature extraction error: {e}")
    # Inspect data format
    print("Inspecting data structure:")
    print(f"Columns: {df.columns.tolist()}")
    print(f"Data types: {df.dtypes}")
    print(f"Unique IDs: {df['id'].nunique()}")
    print(f"Time range: {df['time'].min()} - {df['time'].max()}")

Process Logging

import logging

# Configure logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

def extract_features_with_logging(df, **kwargs):
    logger.info(f"Starting feature extraction for {df['id'].nunique()} time series")
    
    features = extract_features(df, **kwargs)
    
    logger.info(f"Extracted {len(features.columns)} features")
    logger.info(f"Resulting matrix size: {features.shape}")
    
    return features

Application Examples in Various Domains

Financial Data Analysis

# Example with financial time series
financial_df = pd.DataFrame({
    'id': np.repeat(range(1, 101), 252),  # 100 stocks, 252 trading days
    'time': np.tile(range(1, 253), 100),
    'price': np.random.randn(25200).cumsum() + 100,
    'volume': np.random.exponential(1000, 25200)
})

# Extract features for price and volume
features = extract_features(financial_df, 
                          column_id='id', 
                          column_sort='time',
                          column_value='price')

IoT Data Analysis

# Example with IoT sensor data
iot_df = pd.DataFrame({
    'device_id': np.repeat(range(1, 51), 1440),  # 50 devices, 1440 minutes per day
    'timestamp': np.tile(range(1, 1441), 50),
    'temperature': 20 + 5 * np.sin(np.tile(np.linspace(0, 2*np.pi, 1440), 50)) + np.random.normal(0, 1, 72000),
    'humidity': 50 + 10 * np.cos(np.tile(np.linspace(0, 2*np.pi, 1440), 50)) + np.random.normal(0, 2, 72000)
})

# Multivariate feature extraction
features = extract_features(iot_df, 
                          column_id='device_id', 
                          column_sort='timestamp',
                          column_value='temperature')

Conclusion

TSFresh is a powerful and versatile tool for working with time series in machine‑learning projects. The library dramatically simplifies the feature‑engineering workflow by automating the extraction and selection of features using statistical methods. With its rich function set, flexible configuration, and seamless integration into the Python ecosystem, TSFresh finds wide application across domains—from financial analytics to industrial monitoring.

Key advantages include automation of labor‑intensive processes, statistically sound results, and scalability. TSFresh is especially valuable when the structure of time series is unknown beforehand or when rapid prototyping of temporal data analysis solutions is required.
