Pandas: The Ultimate Guide to Data Analysis in Python
Pandas is a fundamental library for data analysis in Python that has become an industry standard for processing structured information. Created by Wes McKinney in 2008, it got its name from the term "Panel Data" in econometrics.
What is Pandas and Why is it So Important?
Pandas is a powerful open-source library built on top of NumPy. It provides high-level data structures and tools for fast, efficient data analysis, and is especially effective for tabular data, time series, and other structured information.
Key Benefits of Pandas:
- Versatility: Pandas supports working with various data formats - from CSV and Excel to JSON, SQL, HTML, and many others.
- Performance: Thanks to optimized algorithms based on Cython and NumPy, the library provides high-speed processing of large amounts of data.
- Intuitive syntax: The library's API reads close to natural language, which makes code easy to understand and learn.
- Integration: Pandas integrates perfectly with other Python ecosystem libraries - NumPy, Matplotlib, Scikit-learn, SciPy.
Areas of Application of Pandas:
- Data Analysis and Visualization - creating reports, dashboards, exploratory analysis
- Machine Learning - data preprocessing, feature engineering, dataset preparation
- Financial Analytics - time series analysis, calculation of financial indicators
- Business Intelligence - report preparation, sales analysis, KPI
- Scientific Research - processing experimental data, statistical analysis
- Web Scraping - structuring and cleaning collected data
Installation and Import of Pandas
Installation via pip:
pip install pandas
Installation via conda:
conda install pandas
Importing the Library:
import pandas as pd
import numpy as np # often used with pandas
The abbreviation pd has become a universal standard in the Python community and is used in all manuals and documentation.
Core Data Structures: Series and DataFrame
Series — One-Dimensional Data with Indexes
Series is a one-dimensional array with labels (indexes) that can contain any data types: numbers, strings, boolean values, Python objects.
# Creating a Series from a list
s = pd.Series([10, 20, 30, 40], index=['a', 'b', 'c', 'd'])
# Creating a Series from a dictionary
data = {'Moscow': 12500000, 'Saint Petersburg': 5400000, 'Novosibirsk': 1600000}
population = pd.Series(data)
# Accessing elements
print(population['Moscow']) # 12500000
print(population.iloc[0]) # first element
DataFrame — Two-Dimensional Tabular Data
DataFrame is a two-dimensional data structure with row and column labels, similar to a table in a relational database or an Excel spreadsheet.
# Creating a DataFrame from a dictionary
data = {
'Name': ['Anna', 'Boris', 'Vika', 'Dmitry'],
'Age': [25, 30, 22, 28],
'City': ['Moscow', 'St. Petersburg', 'Kazan', 'Moscow'],
'Salary': [80000, 95000, 60000, 75000]
}
df = pd.DataFrame(data)
# Creating a DataFrame with a custom index
df = pd.DataFrame(data, index=['emp1', 'emp2', 'emp3', 'emp4'])
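A quick way to sanity-check a freshly created DataFrame is to inspect its shape, first rows, and summary statistics. A minimal sketch reusing the employee data from above:

```python
import pandas as pd

data = {
    'Name': ['Anna', 'Boris', 'Vika', 'Dmitry'],
    'Age': [25, 30, 22, 28],
    'City': ['Moscow', 'St. Petersburg', 'Kazan', 'Moscow'],
    'Salary': [80000, 95000, 60000, 75000]
}
df = pd.DataFrame(data, index=['emp1', 'emp2', 'emp3', 'emp4'])

print(df.shape)       # (4, 4) — rows, columns
print(df.head(2))     # first two rows
print(df.describe())  # summary statistics for the numeric columns
```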
Loading and Saving Data
Loading from Different Formats
# CSV files
df_csv = pd.read_csv('data.csv', encoding='utf-8', sep=',')
# Excel files
df_excel = pd.read_excel('data.xlsx', sheet_name='Sheet1')
# JSON files
df_json = pd.read_json('data.json')
# SQL databases
import sqlite3
conn = sqlite3.connect('database.db')
df_sql = pd.read_sql('SELECT * FROM table_name', conn)
# HTML tables (read_html returns a list of DataFrames, one per table on the page)
tables = pd.read_html('https://example.com/table')
df_html = tables[0]
# Parquet files (for big data)
df_parquet = pd.read_parquet('data.parquet')
Additional Loading Parameters
# CSV with parameters
df = pd.read_csv('data.csv',
sep=';', # separator
header=0, # row with headers
index_col=0, # column for index
parse_dates=True, # automatic date conversion
na_values=['N/A', 'NULL']) # values for NaN
Saving Data
# Saving to CSV
df.to_csv('output.csv', index=False, encoding='utf-8')
# Saving to Excel
df.to_excel('output.xlsx', sheet_name='Data', index=False)
# Saving to JSON
df.to_json('output.json', orient='records', force_ascii=False)
# Saving to SQL
df.to_sql('table_name', conn, if_exists='replace', index=False)
Indexing, Filtering, and Data Selection
Column Selection
# One column
names = df['Name']
# Multiple columns
subset = df[['Name', 'Age']]
# All columns except specified
df_without_salary = df.drop('Salary', axis=1)
Row Filtering
# Logical filtering
young_employees = df[df['Age'] < 30]
# Multiple conditions
moscow_young = df[(df['City'] == 'Moscow') & (df['Age'] < 30)]
# Using the query method
high_salary = df.query('Salary > 70000')
# Filtering by a list of values
moscow_spb = df[df['City'].isin(['Moscow', 'St. Petersburg'])]
Indexing loc and iloc
# loc - by labels
df.loc[0, 'Name'] # specific cell
df.loc[0:2, ['Name', 'Age']] # range of rows and columns
df.loc[df['Age'] > 25, 'Salary'] = 100000 # conditional assignment
# iloc - by positions
df.iloc[0, 1] # first row, second column
df.iloc[1:3, 0:2] # slice by positions
df.iloc[-1] # last row
Working with Missing Values
Detecting Missing Values
# Checking for missing values
df.isnull() # boolean mask
df.isnull().sum() # number of missing values per column
df.info() # column dtypes and non-null counts
Handling Missing Values
# Removing rows with missing values
df.dropna() # delete all rows with NaN
df.dropna(subset=['Name']) # delete rows with NaN in the 'Name' column
df.dropna(thresh=2) # keep rows with at least 2 non-empty values
# Filling in missing values
df.fillna(0) # fill with zeros
df.ffill() # fill with the previous value (fillna(method='ffill') is deprecated)
df.bfill() # fill with the next value
df.fillna(df.mean(numeric_only=True)) # fill numeric columns with their mean
Grouping and Aggregating Data
Grouping with groupby
# Simple grouping
city_groups = df.groupby('City')
# Aggregation by groups
average_salary_by_city = df.groupby('City')['Salary'].mean()
# Multiple grouping
df.groupby(['City', 'Age'])['Salary'].sum()
# Multiple aggregations
df.groupby('City').agg({
'Salary': ['mean', 'max', 'min'],
'Age': 'mean'
})
Pivot Tables
# Creating a pivot table
pivot_table = df.pivot_table(
values='Salary',
index='City',
columns='Age',
aggfunc='mean'
)
# More complex pivot table
pivot_complex = df.pivot_table(
values=['Salary', 'Age'],
index='City',
aggfunc={'Salary': 'mean', 'Age': 'count'}
)
Merging, Joining, and Concatenating Tables
Joining Methods
# Vertical concatenation (concat)
df_combined = pd.concat([df1, df2], ignore_index=True)
# Horizontal concatenation
df_wide = pd.concat([df1, df2], axis=1)
# Merging by keys (merge)
df_merged = pd.merge(df1, df2, on='ID', how='inner')
# Different types of joins
left_join = pd.merge(df1, df2, on='ID', how='left')
right_join = pd.merge(df1, df2, on='ID', how='right')
outer_join = pd.merge(df1, df2, on='ID', how='outer')
# Joining by indexes
df_joined = df1.join(df2, how='left')
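The snippets above assume two frames df1 and df2 already exist. A self-contained sketch with small illustrative tables (the IDs and values are made up for demonstration) shows how the join types differ:

```python
import pandas as pd

df1 = pd.DataFrame({'ID': [1, 2, 3], 'Name': ['Anna', 'Boris', 'Vika']})
df2 = pd.DataFrame({'ID': [2, 3, 4], 'Salary': [95000, 60000, 70000]})

inner = pd.merge(df1, df2, on='ID', how='inner')  # only IDs present in both: 2, 3
left = pd.merge(df1, df2, on='ID', how='left')    # all rows of df1; NaN where no match
outer = pd.merge(df1, df2, on='ID', how='outer')  # union of keys: 1, 2, 3, 4

print(inner)
```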
Data Modification and Cleaning
Data Transformation
# Renaming columns
df.rename(columns={'Name': 'Full Name', 'Salary': 'Monthly Salary'}, inplace=True)
# Replacing values
df.replace('Moskva', 'Moscow', inplace=True) # replace a value everywhere
df.replace({'City': {'St. Petersburg': 'Saint Petersburg'}}, inplace=True) # replace within one column
# Converting data types
df['Age'] = df['Age'].astype('int64')
df['Date'] = pd.to_datetime(df['Date'])
# Applying functions
df['Name_upper'] = df['Name'].str.upper()
df['Salary_thousands'] = df['Salary'] / 1000
Working with Strings
# String operations
df['Name'].str.len() # string length
df['Name'].str.contains('An') # contains substring
df['Name'].str.extract('([A-Z])') # extraction by regular expression
df['Name'].str.split(' ') # string split
Working with Time Series
# Creating a time index
dates = pd.date_range('2023-01-01', periods=100, freq='D')
ts = pd.Series(np.random.randn(100), index=dates)
# Resampling (changing frequency)
monthly = ts.resample('ME').mean() # by month ('M' is deprecated since pandas 2.2)
weekly = ts.resample('W').sum() # by week
# Time shifts
ts.shift(1) # shift by one period
ts.shift(-1) # shift back
# Moving windows
rolling_mean = ts.rolling(window=7).mean() # moving average
Data Visualization with Pandas
import matplotlib.pyplot as plt
# Simple graphs
df['Age'].plot(kind='hist', bins=10)
df['Salary'].plot(kind='box')
df.plot(x='Age', y='Salary', kind='scatter')
# Grouping and visualization
df.groupby('City')['Salary'].mean().plot(kind='bar')
# Time series
ts.plot()
plt.show()
Comprehensive Table of Pandas Methods and Functions
| Category | Function/Method | Description | Usage Example |
|---|---|---|---|
| Data Creation | pd.Series() | Create a one-dimensional array | pd.Series([1,2,3]) |
| | pd.DataFrame() | Create a table | pd.DataFrame({'A': [1,2,3]}) |
| | pd.read_csv() | Load CSV | pd.read_csv('file.csv') |
| | pd.read_excel() | Load Excel | pd.read_excel('file.xlsx') |
| | pd.read_json() | Load JSON | pd.read_json('file.json') |
| | pd.read_sql() | Load from SQL | pd.read_sql(query, connection) |
| | pd.read_html() | Load HTML tables | pd.read_html('url') |
| | pd.read_parquet() | Load Parquet | pd.read_parquet('file.parquet') |
| Data Viewing | df.head() | First n rows | df.head(10) |
| | df.tail() | Last n rows | df.tail(5) |
| | df.info() | Dtypes and non-null counts | df.info() |
| | df.describe() | Descriptive statistics | df.describe() |
| | df.shape | Dimensions (rows, columns) | df.shape |
| | df.columns | Column names | df.columns |
| | df.index | Index | df.index |
| | df.dtypes | Data types | df.dtypes |
| | df.values | Values as a NumPy array | df.values |
| | df.sample() | Random sample | df.sample(n=10) |
| Indexing | df[col] | Single column selection | df['name'] |
| | df[[cols]] | Multiple column selection | df[['name', 'age']] |
| | df.loc[] | Selection by labels | df.loc[0, 'name'] |
| | df.iloc[] | Selection by positions | df.iloc[0, 1] |
| | df.at[] | Scalar access by labels | df.at[0, 'name'] |
| | df.iat[] | Scalar access by positions | df.iat[0, 1] |
| | df.query() | Filtering by expression | df.query('age > 25') |
| Filtering | df[condition] | Boolean filtering | df[df['age'] > 25] |
| | df.where() | Keep values where the condition holds, replace the rest | df.where(df > 0, 0) |
| | df.mask() | Replace values where the condition holds | df.mask(df < 0) |
| | Series.isin() | Membership check | df['city'].isin(['Moscow']) |
| | Series.between() | Range check | df['age'].between(20, 30) |
| Structure Change | df.set_index() | Set the index | df.set_index('name') |
| | df.reset_index() | Reset the index | df.reset_index() |
| | df.rename() | Rename rows/columns | df.rename(columns={'old': 'new'}) |
| | df.reindex() | Reindex | df.reindex(['a', 'b', 'c']) |
| | df.sort_values() | Sort by values | df.sort_values('age') |
| | df.sort_index() | Sort by index | df.sort_index() |
| | df.transpose() / df.T | Transpose | df.T |
| Data Modification | df.drop() | Delete rows/columns | df.drop('column', axis=1) |
| | df.drop_duplicates() | Remove duplicates | df.drop_duplicates() |
| | df.assign() | Add columns | df.assign(new_col=lambda x: x['a'] + x['b']) |
| | df.insert() | Insert a column at a position | df.insert(0, 'new_col', values) |
| | df.replace() | Replace values | df.replace('old', 'new') |
| | df.fillna() | Fill missing values | df.fillna(0) |
| | df.dropna() | Drop missing values | df.dropna() |
| Applying Functions | df.apply() | Apply a function along an axis | df.apply(lambda x: x.max() - x.min()) |
| | df.applymap() | Apply to every element (deprecated; use df.map() in pandas >= 2.1) | df.applymap(lambda x: x.upper()) |
| | Series.map() | Apply to each Series element | df['col'].map(lambda x: x * 2) |
| | df.pipe() | Apply a function to the whole DataFrame | df.pipe(func) |
| Aggregation | df.sum() | Sum | df.sum() |
| | df.mean() | Mean | df.mean() |
| | df.median() | Median | df.median() |
| | df.min() / df.max() | Minimum / maximum | df.min() |
| | df.std() | Standard deviation | df.std() |
| | df.var() | Variance | df.var() |
| | df.count() | Count of non-missing values | df.count() |
| | df.nunique() | Count of unique values | df.nunique() |
| | Series.value_counts() | Frequency of each value | df['col'].value_counts() |
| | df.agg() | Multiple aggregations | df.agg(['sum', 'mean']) |
| Grouping | df.groupby() | Group rows | df.groupby('city') |
| | df.pivot() | Pivot without aggregation | df.pivot(index='A', columns='B', values='C') |
| | df.pivot_table() | Pivot table with aggregation | df.pivot_table(values='C', index='A', columns='B', aggfunc='mean') |
| | pd.crosstab() | Cross-tabulation | pd.crosstab(df['A'], df['B']) |
| Combination | pd.concat() | Concatenation | pd.concat([df1, df2]) |
| | df.merge() | SQL-style merge by keys | df1.merge(df2, on='key') |
| | df.join() | Join by index | df1.join(df2) |
| | df.append() | Append rows (removed in pandas 2.0; use pd.concat()) | pd.concat([df1, df2]) |
| Reshaping | df.melt() | Wide to long format | df.melt(id_vars=['A'], value_vars=['B', 'C']) |
| | df.stack() | Columns to rows | df.stack() |
| | df.unstack() | Rows to columns | df.unstack() |
| | df.explode() | Expand list-like values into rows | df.explode('col_with_lists') |
| Missing Values | df.isna() / df.isnull() | Detect missing values | df.isna() |
| | df.notna() / df.notnull() | Detect non-missing values | df.notna() |
| | df.fillna() | Fill missing values (df.ffill() / df.bfill() for directional fill) | df.fillna(0) |
| | df.dropna() | Drop missing values | df.dropna(axis=1) |
| | df.interpolate() | Interpolate missing values | df.interpolate() |
| Time Series | pd.to_datetime() | Convert to datetime | pd.to_datetime(df['date']) |
| | pd.date_range() | Create a date range | pd.date_range('2020-01-01', periods=10) |
| | df.resample() | Change frequency | df.resample('ME').mean() |
| | df.shift() | Shift in time | df.shift(1) |
| | df.rolling() | Rolling windows | df.rolling(window=3).mean() |
| | df.expanding() | Expanding windows | df.expanding().sum() |
| Working with Strings | Series.str.len() | String length | df['name'].str.len() |
| | Series.str.upper() / .lower() | Change case | df['name'].str.upper() |
| | Series.str.contains() | Contains substring | df['name'].str.contains('John') |
| | Series.str.replace() | Replace substrings | df['name'].str.replace('old', 'new') |
| | Series.str.split() | Split strings | df['name'].str.split(' ') |
| | Series.str.extract() | Extract by regex | df['name'].str.extract('([A-Z]+)') |
| | Series.str.strip() | Trim whitespace | df['name'].str.strip() |
| Saving | df.to_csv() | Save to CSV | df.to_csv('file.csv') |
| | df.to_excel() | Save to Excel | df.to_excel('file.xlsx') |
| | df.to_json() | Save to JSON | df.to_json('file.json') |
| | df.to_sql() | Save to SQL | df.to_sql('table', connection) |
| | df.to_html() | Save to HTML | df.to_html('file.html') |
| | df.to_parquet() | Save to Parquet | df.to_parquet('file.parquet') |
| Visualization | df.plot() | Basic plot | df.plot() |
| | df.plot.line() | Line chart | df.plot.line() |
| | df.plot.bar() | Bar chart | df.plot.bar() |
| | df.plot.hist() | Histogram | df.plot.hist() |
| | df.plot.box() | Box plot | df.plot.box() |
| | df.plot.scatter() | Scatter plot | df.plot.scatter(x='A', y='B') |
| Additional | df.copy() | Copy | df.copy() |
| | df.equals() | Compare DataFrames | df1.equals(df2) |
| | df.memory_usage() | Memory usage | df.memory_usage() |
| | df.select_dtypes() | Select columns by dtype | df.select_dtypes(include=['number']) |
| | df.convert_dtypes() | Automatic dtype conversion | df.convert_dtypes() |
Performance Optimization
Tips for Accelerating Pandas:
- Use vectorized operations instead of loops
- Apply the query() method for complex filters
- Use the categorical type for string data with repeating values
- Read data in chunks when working with large files
- Prefer explicit assignment over inplace=True (despite a common belief, inplace rarely saves memory in modern pandas)
# Optimized reading of large files
chunk_size = 10000
chunks = []
for chunk in pd.read_csv('large_file.csv', chunksize=chunk_size):
chunks.append(chunk)
df = pd.concat(chunks, ignore_index=True)
# Using categorical data
df['category'] = df['category'].astype('category')
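The effect of the category dtype is easy to measure with memory_usage(). A small sketch with a hypothetical column of repeating city names:

```python
import pandas as pd

# A column with many repeated string values (hypothetical data)
cities = pd.Series(['Moscow', 'Kazan', 'Novosibirsk'] * 100_000)

as_object = cities.memory_usage(deep=True)                      # one Python string per row
as_category = cities.astype('category').memory_usage(deep=True) # 3 strings + small integer codes

print(f'object dtype:   {as_object:,} bytes')
print(f'category dtype: {as_category:,} bytes')
```

With only three distinct values across 300,000 rows, the categorical version stores each row as a small integer code, so its footprint is a fraction of the object version's.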
Frequently Asked Questions
What is Pandas?
Pandas is a Python library for analyzing and manipulating structured data. It provides powerful tools for working with tables, time series, and other forms of data.
What is the Difference Between Series and DataFrame?
Series is a one-dimensional data structure (like a column in a table), while DataFrame is a two-dimensional structure (a complete table with multiple columns).
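A short illustration of the relationship: selecting a single column from a DataFrame yields a Series, and a Series can be promoted back to a one-column DataFrame.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Anna', 'Boris'], 'Age': [25, 30]})

column = df['Age']             # a single column is a Series
print(type(column).__name__)   # Series
print(type(df).__name__)       # DataFrame
print(column.to_frame().shape) # (2, 1) — back to a one-column DataFrame
```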
How to Load an Excel File with Multiple Sheets?
# Loading all sheets
all_sheets = pd.read_excel('file.xlsx', sheet_name=None)
# Loading a specific sheet
df = pd.read_excel('file.xlsx', sheet_name='Sheet2')
How to Combine Multiple CSV Files?
import glob
files = glob.glob('*.csv')
dfs = [pd.read_csv(file) for file in files]
combined = pd.concat(dfs, ignore_index=True)
How to Work with Large Files That Don't Fit into Memory?
# Reading in chunks
for chunk in pd.read_csv('large_file.csv', chunksize=10000):
# Processing each chunk
process_chunk(chunk)
How to Speed Up Pandas?
- Use vectorized operations
- Apply the correct data types
- Use query() for filtering
- Avoid explicit loops
- Prefer explicit assignment over inplace=True
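To illustrate the first tip, compare an explicit Python loop with a vectorized expression. Both produce the same result, but the vectorized version delegates the element-wise work to optimized NumPy code:

```python
import pandas as pd

salaries = pd.Series([80000, 95000, 60000, 75000])

# Slow: a Python-level loop builds the new Series element by element
bonus_loop = pd.Series([s * 1.1 for s in salaries])

# Fast: a single vectorized expression handled by NumPy under the hood
bonus_vectorized = salaries * 1.1

print((bonus_loop == bonus_vectorized).all())  # True
```

On four rows the difference is invisible, but on millions of rows the vectorized form is typically orders of magnitude faster.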
Conclusion: Why Pandas is Indispensable in Data Work
Pandas has become an integral part of the Python ecosystem for working with data due to its versatility, performance, and ease of use. The library provides an intuitive interface for solving a wide variety of tasks - from simply loading a CSV file to complex time series analysis.
Key Reasons for Pandas Popularity:
- Rich functionality: more than 200 methods and functions for any data processing tasks
- High performance: optimized algorithms for working with large amounts of data
- Flexibility: support for many data formats and ways to process them
- Integration: excellent compatibility with other Python libraries
- Active community: continuous development and support
For data scientists, analysts, researchers, and developers, Pandas is a fundamental tool that significantly simplifies and speeds up working with data. Learning this library is an investment in professional development, which pays off by saving time and the ability to solve more complex tasks with less effort.