Pandas - processing and analysis of data


Pandas: The Ultimate Guide to Data Analysis in Python

Pandas is a fundamental library for data analysis in Python that has become an industry standard for processing structured information. Created by Wes McKinney in 2008, it got its name from the term "Panel Data" in econometrics.

What is Pandas and Why is it So Important?

Pandas is a powerful open-source library built on NumPy, which provides high-level data structures and tools for fast and efficient data analysis. The library is especially effective when working with tabular data, time series, and any structured information.

Key Benefits of Pandas:

  • Versatility: Pandas supports working with various data formats - from CSV and Excel to JSON, SQL, HTML, and many others.
  • Performance: Thanks to optimized algorithms based on Cython and NumPy, the library provides high-speed processing of large amounts of data.
  • Intuitive syntax: the API reads close to natural language, which keeps code understandable and easy to learn.
  • Integration: Pandas integrates perfectly with other Python ecosystem libraries - NumPy, Matplotlib, Scikit-learn, SciPy.

Areas of Application of Pandas:

  • Data Analysis and Visualization - creating reports, dashboards, exploratory analysis
  • Machine Learning - data preprocessing, feature engineering, dataset preparation
  • Financial Analytics - time series analysis, calculation of financial indicators
  • Business Intelligence - report preparation, sales analysis, KPI
  • Scientific Research - processing experimental data, statistical analysis
  • Web Scraping - structuring and cleaning collected data

Installation and Import of Pandas

Installation via pip:

pip install pandas

Installation via conda:

conda install pandas

Importing the Library:

import pandas as pd
import numpy as np  # often used with pandas

The abbreviation pd is the universal convention in the Python community and appears throughout tutorials and documentation.

Core Data Structures: Series and DataFrame

Series — One-Dimensional Data with Indexes

Series is a one-dimensional array with labels (indexes) that can contain any data types: numbers, strings, boolean values, Python objects.

# Creating a Series from a list
s = pd.Series([10, 20, 30, 40], index=['a', 'b', 'c', 'd'])

# Creating a Series from a dictionary
data = {'Moscow': 12500000, 'Saint Petersburg': 5400000, 'Novosibirsk': 1600000}
population = pd.Series(data)

# Accessing elements
print(population['Moscow'])  # 12500000
print(population.iloc[0])    # first element

DataFrame — Two-Dimensional Tabular Data

DataFrame is a two-dimensional data structure with row and column labels, similar to a table in a relational database or an Excel spreadsheet.

# Creating a DataFrame from a dictionary
data = {
    'Name': ['Anna', 'Boris', 'Vika', 'Dmitry'],
    'Age': [25, 30, 22, 28],
    'City': ['Moscow', 'St. Petersburg', 'Kazan', 'Moscow'],
    'Salary': [80000, 95000, 60000, 75000]
}
df = pd.DataFrame(data)

# Creating a DataFrame with a custom index
df = pd.DataFrame(data, index=['emp1', 'emp2', 'emp3', 'emp4'])
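
Once a DataFrame is built, a few inspection calls confirm its shape and contents; a minimal sketch reusing the employee data above:

```python
import pandas as pd

data = {
    'Name': ['Anna', 'Boris', 'Vika', 'Dmitry'],
    'Age': [25, 30, 22, 28],
    'City': ['Moscow', 'St. Petersburg', 'Kazan', 'Moscow'],
    'Salary': [80000, 95000, 60000, 75000]
}
df = pd.DataFrame(data, index=['emp1', 'emp2', 'emp3', 'emp4'])

# Quick inspection of the freshly built table
print(df.shape)           # (4, 4) -- 4 rows, 4 columns
print(list(df.columns))   # ['Name', 'Age', 'City', 'Salary']
print(df.head(2))         # first two rows
```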

Loading and Saving Data

Loading from Different Formats

# CSV files
df_csv = pd.read_csv('data.csv', encoding='utf-8', sep=',')

# Excel files
df_excel = pd.read_excel('data.xlsx', sheet_name='Sheet1')

# JSON files
df_json = pd.read_json('data.json')

# SQL databases
import sqlite3
conn = sqlite3.connect('database.db')
df_sql = pd.read_sql('SELECT * FROM table_name', conn)

# HTML tables (read_html returns a list of DataFrames, one per table found)
tables = pd.read_html('https://example.com/table')
df_html = tables[0]

# Parquet files (for big data)
df_parquet = pd.read_parquet('data.parquet')

Additional Loading Parameters

# CSV with parameters
df = pd.read_csv('data.csv', 
                 sep=';',           # separator
                 header=0,          # row with headers
                 index_col=0,       # column for index
                 parse_dates=True,  # automatic date conversion
                 na_values=['N/A', 'NULL'])  # values for NaN

Saving Data

# Saving to CSV
df.to_csv('output.csv', index=False, encoding='utf-8')

# Saving to Excel
df.to_excel('output.xlsx', sheet_name='Data', index=False)

# Saving to JSON
df.to_json('output.json', orient='records', force_ascii=False)

# Saving to SQL
df.to_sql('table_name', conn, if_exists='replace', index=False)

Indexing, Filtering, and Data Selection

Column Selection

# One column
names = df['Name']

# Multiple columns
subset = df[['Name', 'Age']]

# All columns except specified
df_without_salary = df.drop('Salary', axis=1)

Row Filtering

# Logical filtering
young_employees = df[df['Age'] < 30]

# Multiple conditions
moscow_young = df[(df['City'] == 'Moscow') & (df['Age'] < 30)]

# Using the query method
high_salary = df.query('Salary > 70000')

# Filtering by a list of values
moscow_spb = df[df['City'].isin(['Moscow', 'St. Petersburg'])]

Indexing loc and iloc

# loc - by labels
df.loc[0, 'Name']                    # specific cell
df.loc[0:2, ['Name', 'Age']]     # range of rows and columns
df.loc[df['Age'] > 25, 'Salary'] = 100000  # conditional assignment

# iloc - by positions
df.iloc[0, 1]                       # first row, second column
df.iloc[1:3, 0:2]                   # slice by positions
df.iloc[-1]                         # last row
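
The practical difference shows up once the index is non-numeric; a small sketch with a custom string index (note that loc slices include the end label, while iloc follows Python's exclusive slicing convention):

```python
import pandas as pd

df = pd.DataFrame(
    {'Name': ['Anna', 'Boris'], 'Age': [25, 30]},
    index=['emp1', 'emp2']
)

print(df.loc['emp2', 'Name'])      # 'Boris' -- label-based access
print(df.iloc[1, 0])               # 'Boris' -- position-based access

print(len(df.loc['emp1':'emp2']))  # 2 -- loc slice includes the end label
print(len(df.iloc[0:1]))           # 1 -- iloc slice excludes the end position
```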

Working with Missing Values

Detecting Missing Values

# Checking for missing values
df.isnull()                         # boolean mask
df.isnull().sum()                   # number of missing values per column
df.info()                           # general information about missing values

Handling Missing Values

# Removing rows with missing values
df.dropna()                         # delete all rows with NaN
df.dropna(subset=['Name'])           # delete rows with NaN in the 'Name' column
df.dropna(thresh=2)                 # keep rows with at least 2 non-empty values

# Filling in missing values
df.fillna(0)                        # fill with zeros
df.ffill()                          # fill with the previous value (forward fill)
df.bfill()                          # fill with the next value (backward fill)
df.fillna(df.mean())                # fill each numeric column with its mean

Note: the older df.fillna(method='ffill') spelling is deprecated since pandas 2.1 in favor of df.ffill() and df.bfill().
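
Putting the fill strategies side by side on a concrete Series makes the differences visible:

```python
import pandas as pd
import numpy as np

s = pd.Series([1.0, np.nan, np.nan, 4.0])

print(s.fillna(0).tolist())         # [1.0, 0.0, 0.0, 4.0]
print(s.ffill().tolist())           # [1.0, 1.0, 1.0, 4.0]  forward fill
print(s.bfill().tolist())           # [1.0, 4.0, 4.0, 4.0]  backward fill
print(s.fillna(s.mean()).tolist())  # mean of the non-NaN values is 2.5
```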

Grouping and Aggregating Data

Grouping with groupby

# Simple grouping
city_groups = df.groupby('City')

# Aggregation by groups
average_salary_by_city = df.groupby('City')['Salary'].mean()

# Multiple grouping
df.groupby(['City', 'Age'])['Salary'].sum()

# Multiple aggregations
df.groupby('City').agg({
    'Salary': ['mean', 'max', 'min'],
    'Age': 'mean'
})

Pivot Tables

# Creating a pivot table
pivot_table = df.pivot_table(
    values='Salary',
    index='City',
    columns='Age',
    aggfunc='mean'
)

# More complex pivot table
pivot_complex = df.pivot_table(
    values=['Salary', 'Age'],
    index='City',
    aggfunc={'Salary': 'mean', 'Age': 'count'}
)

Merging, Joining, and Concatenating Tables

Joining Methods

# Vertical concatenation (concat)
df_combined = pd.concat([df1, df2], ignore_index=True)

# Horizontal concatenation
df_wide = pd.concat([df1, df2], axis=1)

# Merging by keys (merge)
df_merged = pd.merge(df1, df2, on='ID', how='inner')

# Different types of joins
left_join = pd.merge(df1, df2, on='ID', how='left')
right_join = pd.merge(df1, df2, on='ID', how='right')
outer_join = pd.merge(df1, df2, on='ID', how='outer')

# Joining by indexes
df_joined = df1.join(df2, how='left')
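
To see how the join types differ in practice, here is a small sketch with two frames that only partially share keys (the employees/salaries names are illustrative):

```python
import pandas as pd

employees = pd.DataFrame({'ID': [1, 2, 3], 'Name': ['Anna', 'Boris', 'Vika']})
salaries  = pd.DataFrame({'ID': [2, 3, 4], 'Salary': [95000, 60000, 70000]})

inner = pd.merge(employees, salaries, on='ID', how='inner')
left  = pd.merge(employees, salaries, on='ID', how='left')
outer = pd.merge(employees, salaries, on='ID', how='outer')

print(len(inner))  # 2 -- only IDs 2 and 3 appear in both frames
print(len(left))   # 3 -- all employees kept; missing salaries become NaN
print(len(outer))  # 4 -- union of the keys 1, 2, 3, 4
```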

Data Modification and Cleaning

Data Transformation

# Renaming columns
df.rename(columns={'Name': 'Full Name', 'Salary': 'Monthly Salary'}, inplace=True)

# Replacing values
df.replace('SPb', 'St. Petersburg', inplace=True)
df.replace({'City': {'St. Petersburg': 'Saint Petersburg'}}, inplace=True)

# Converting data types
df['Age'] = df['Age'].astype('int64')
df['Date'] = pd.to_datetime(df['Date'])

# Applying functions
df['Name_upper'] = df['Name'].str.upper()
df['Salary_thousands'] = df['Salary'] / 1000

Working with Strings

# String operations
df['Name'].str.len()                 # string length
df['Name'].str.contains('An')        # contains substring
df['Name'].str.extract('([A-Z])')    # extraction by regular expression
df['Name'].str.split(' ')            # string split

Working with Time Series

# Creating a time index
dates = pd.date_range('2023-01-01', periods=100, freq='D')
ts = pd.Series(np.random.randn(100), index=dates)

# Resampling (changing frequency)
monthly = ts.resample('ME').mean()  # by month ('M' is deprecated since pandas 2.2)
weekly = ts.resample('W').sum()     # by week

# Time shifts
ts.shift(1)                         # shift by one period
ts.shift(-1)                        # shift back

# Moving windows
rolling_mean = ts.rolling(window=7).mean()  # moving average

Data Visualization with Pandas

import matplotlib.pyplot as plt

# Simple graphs
df['Age'].plot(kind='hist', bins=10)
df['Salary'].plot(kind='box')
df.plot(x='Age', y='Salary', kind='scatter')

# Grouping and visualization
df.groupby('City')['Salary'].mean().plot(kind='bar')

# Time series
ts.plot()
plt.show()

Comprehensive Table of Pandas Methods and Functions

Category Function/Method Description Usage Example
Data Creation pd.Series() Create a one-dimensional array pd.Series([1,2,3])
  pd.DataFrame() Create a table pd.DataFrame({'A': [1,2,3]})
  pd.read_csv() Load CSV pd.read_csv('file.csv')
  pd.read_excel() Load Excel pd.read_excel('file.xlsx')
  pd.read_json() Load JSON pd.read_json('file.json')
  pd.read_sql() Load from SQL pd.read_sql(query, connection)
  pd.read_html() Load HTML tables pd.read_html('url')
  pd.read_parquet() Load Parquet pd.read_parquet('file.parquet')
Data Viewing df.head() First n rows df.head(10)
  df.tail() Last n rows df.tail(5)
  df.info() Data information df.info()
  df.describe() Descriptive statistics df.describe()
  df.shape Dimensionality df.shape
  df.columns Column names df.columns
  df.index Index df.index
  df.dtypes Data types df.dtypes
  df.values Values as numpy array df.values
  df.sample() Random sample df.sample(n=10)
Indexing df[col] Column selection df['name']
  df[[cols]] Column selection df[['name', 'age']]
  df.loc[] Selection by labels df.loc[0, 'name']
  df.iloc[] Selection by positions df.iloc[0, 1]
  df.at[] Scalar access by labels df.at[0, 'name']
  df.iat[] Scalar access by positions df.iat[0, 1]
  df.query() Filtering by expression df.query('age > 25')
Filtering df[condition] Boolean filtering df[df['age'] > 25]
  df.where() Conditional assignment df.where(df > 0, 0)
  df.mask() Inverse conditional filtering df.mask(df < 0)
  df.isin() Inclusion check df['city'].isin(['Moscow'])
  df.between() Range check df['age'].between(20, 30)
Structure Change df.set_index() Setting the index df.set_index('name')
  df.reset_index() Resetting the index df.reset_index()
  df.rename() Renaming df.rename(columns={'old': 'new'})
  df.reindex() Reindexing df.reindex(['a', 'b', 'c'])
  df.sort_values() Sorting by values df.sort_values('age')
  df.sort_index() Sorting by index df.sort_index()
  df.transpose() / df.T Transposition df.T
Data Modification df.drop() Deleting rows/columns df.drop('column', axis=1)
  df.drop_duplicates() Deleting duplicates df.drop_duplicates()
  df.assign() Adding columns df.assign(new_col=lambda x: x['a'] + x['b'])
  df.insert() Inserting a column df.insert(0, 'new_col', values)
  df.replace() Replacing values df.replace('old', 'new')
  df.fillna() Filling missing values df.fillna(0)
  df.dropna() Deleting missing values df.dropna()
Applying Functions df.apply() Applying a function df.apply(lambda x: x.max() - x.min())
  df.map() Applying to each element (replaces the deprecated df.applymap()) df.map(lambda x: x * 2)
  Series.map() Applying to a Series df['col'].map(lambda x: x * 2)
  df.pipe() Applying a function to DataFrame df.pipe(func)
Aggregation df.sum() Sum df.sum()
  df.mean() Average df.mean()
  df.median() Median df.median()
  df.min() / df.max() Minimum/maximum df.min()
  df.std() Standard deviation df.std()
  df.var() Variance df.var()
  df.count() Number of non-empty df.count()
  df.nunique() Number of unique df.nunique()
  df.value_counts() Value count df['col'].value_counts()
  df.agg() Multiple aggregation df.agg(['sum', 'mean'])
Grouping df.groupby() Grouping df.groupby('city')
  df.pivot() Simple pivot table df.pivot(index='A', columns='B', values='C')
  df.pivot_table() Pivot table with aggregation df.pivot_table(values='C', index='A', columns='B', aggfunc='mean')
  pd.crosstab() Cross-tabulation pd.crosstab(df['A'], df['B'])
Combination pd.concat() Concatenation pd.concat([df1, df2])
  df.merge() Merging df1.merge(df2, on='key')
  df.join() Joining df1.join(df2)
  df.append() Adding rows (removed in pandas 2.0; use pd.concat instead) pd.concat([df1, df2])
Changing Forms df.melt() Converting to long format df.melt(id_vars=['A'], value_vars=['B', 'C'])
  df.stack() Converting columns to rows df.stack()
  df.unstack() Converting rows to columns df.unstack()
  df.explode() Unfolding lists df.explode('col_with_lists')
Missing Values df.isna() / df.isnull() Checking for missing values df.isna()
  df.notna() / df.notnull() Checking for non-missing values df.notna()
  df.fillna() Filling missing values df.fillna(0)
  df.ffill() / df.bfill() Forward/backward fill df.ffill()
  df.dropna() Deleting missing values df.dropna(axis=1)
  df.interpolate() Interpolation df.interpolate()
Time Series pd.to_datetime() Converting to datetime pd.to_datetime(df['date'])
  pd.date_range() Creating a date range pd.date_range('2020-01-01', periods=10)
  df.resample() Resampling df.resample('ME').mean()
  df.shift() Shift in time df.shift(1)
  df.rolling() Moving windows df.rolling(window=3).mean()
  df.expanding() Expanding windows df.expanding().sum()
Working with Strings df.str.len() String length df['name'].str.len()
  df.str.upper() / df.str.lower() Case df['name'].str.upper()
  df.str.contains() Contains substring df['name'].str.contains('John')
  df.str.replace() String replacement df['name'].str.replace('old', 'new')
  df.str.split() String separation df['name'].str.split(' ')
  df.str.extract() Extraction by regex df['name'].str.extract('([A-Z]+)')
  df.str.strip() Deleting spaces df['name'].str.strip()
Saving df.to_csv() Saving to CSV df.to_csv('file.csv')
  df.to_excel() Saving to Excel df.to_excel('file.xlsx')
  df.to_json() Saving to JSON df.to_json('file.json')
  df.to_sql() Saving to SQL df.to_sql('table', connection)
  df.to_html() Saving to HTML df.to_html('file.html')
  df.to_parquet() Saving to Parquet df.to_parquet('file.parquet')
Visualization df.plot() Basic graph df.plot()
  df.plot.line() Line chart df.plot.line()
  df.plot.bar() Bar chart df.plot.bar()
  df.plot.hist() Histogram df.plot.hist()
  df.plot.box() Box with whiskers df.plot.box()
  df.plot.scatter() Scatterplot df.plot.scatter(x='A', y='B')
Additional df.copy() Copying df.copy()
  df.equals() Comparing DataFrame df1.equals(df2)
  df.memory_usage() Memory usage df.memory_usage()
  df.select_dtypes() Selection by data types df.select_dtypes(include=['number'])
  df.convert_dtypes() Automatic type conversion df.convert_dtypes()

Performance Optimization

Tips for Accelerating Pandas:

  • Use vectorized operations instead of loops
  • Apply the query() method for complex filters
  • Use the categorical type for string data with repeating values
  • Read data in chunks when working with large files
  • Prefer plain assignment over inplace=True: it rarely saves memory and is discouraged by the pandas developers

# Optimized reading of large files
chunk_size = 10000
chunks = []
for chunk in pd.read_csv('large_file.csv', chunksize=chunk_size):
    chunks.append(chunk)
df = pd.concat(chunks, ignore_index=True)

# Using categorical data
df['category'] = df['category'].astype('category')
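
The first tip, vectorization, can be illustrated directly: the loop and the vectorized expression produce the same result, but the latter runs as optimized NumPy code under the hood:

```python
import pandas as pd

df = pd.DataFrame({'Salary': [80000, 95000, 60000, 75000]})

# Slow: an explicit Python loop over the rows
bonuses_loop = [salary * 0.1 for salary in df['Salary']]

# Fast: one vectorized expression over the whole column
df['Bonus'] = df['Salary'] * 0.1

print(df['Bonus'].tolist())  # [8000.0, 9500.0, 6000.0, 7500.0]
assert df['Bonus'].tolist() == bonuses_loop
```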

Frequently Asked Questions

What is Pandas?

Pandas is a Python library for analyzing and manipulating structured data. It provides powerful tools for working with tables, time series, and other forms of data.

What is the Difference Between Series and DataFrame?

Series is a one-dimensional data structure (like a column in a table), while DataFrame is a two-dimensional structure (a complete table with multiple columns).
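
A quick way to see the relationship: selecting a single column from a DataFrame yields a Series.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Anna', 'Boris'], 'Age': [25, 30]})

# A single column of a DataFrame is a Series
column = df['Age']
print(type(df).__name__)      # DataFrame
print(type(column).__name__)  # Series
```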

How to Load an Excel File with Multiple Sheets?

# Loading all sheets
all_sheets = pd.read_excel('file.xlsx', sheet_name=None)

# Loading a specific sheet
df = pd.read_excel('file.xlsx', sheet_name='Sheet2')

How to Combine Multiple CSV Files?

import glob

files = glob.glob('*.csv')
dfs = [pd.read_csv(file) for file in files]
combined = pd.concat(dfs, ignore_index=True)

How to Work with Large Files That Don't Fit into Memory?

# Reading in chunks
for chunk in pd.read_csv('large_file.csv', chunksize=10000):
    # Processing each chunk
    process_chunk(chunk)

How to Speed Up Pandas?

  • Use vectorized operations
  • Apply the correct data types
  • Use query() for filtering
  • Avoid loops
  • Prefer plain assignment over inplace=True, which rarely saves memory

Conclusion: Why Pandas is Indispensable in Data Work

Pandas has become an integral part of the Python ecosystem for working with data due to its versatility, performance, and ease of use. The library provides an intuitive interface for solving a wide variety of tasks - from simply loading a CSV file to complex time series analysis.

Key Reasons for Pandas Popularity:

  • Rich functionality: more than 200 methods and functions for any data processing tasks
  • High performance: optimized algorithms for working with large amounts of data
  • Flexibility: support for many data formats and ways to process them
  • Integration: excellent compatibility with other Python libraries
  • Active community: continuous development and support

For data scientists, analysts, researchers, and developers, Pandas is a fundamental tool that significantly simplifies and speeds up working with data. Learning this library is an investment in professional development, which pays off by saving time and the ability to solve more complex tasks with less effort.
