Working with CSV Files in Python
CSV (Comma-Separated Values) is one of the most common formats for storing tabular data. This format is used in various fields, from exporting Excel spreadsheets to exchanging information between web services and databases.
Python provides effective tools for working with CSV files. You can use the basic csv module or the powerful pandas library for data analysis. This article covers methods for reading and writing CSV files, processing data, preventing errors, and optimizing performance when working with large files.
CSV Format Basics
CSV File Structure
CSV is a simple text-based format. Each line in the file represents a single record. Values within a record are separated by special characters. The default separator is a comma, but a semicolon, tab, or pipe can also be used.
The content of a standard CSV file looks like this:
Name,Age,City
Anna,25,Moscow
Ivan,30,Saint Petersburg
Advantages of the CSV Format
The CSV format has several key advantages:
- Simple, human-readable structure
- Supported by most applications and programming languages
- Small file size compared to other formats
- Editable in any text editor
- Versatile for data exchange between different systems
Working with CSVs Using the csv Module
Basic CSV File Reading
Python's standard csv module is suitable for basic operations with CSV files. To simply read data, use the following approach:
import csv
with open("data.csv", newline='', encoding='utf-8') as f:
    reader = csv.reader(f)
    for row in reader:
        print(row)
Execution result:
['Name', 'Age', 'City']
['Anna', '25', 'Moscow']
['Ivan', '30', 'Saint Petersburg']
Reading CSVs with Headers
For convenient work with named columns, use csv.DictReader:
with open("data.csv", encoding='utf-8') as f:
    reader = csv.DictReader(f)
    for row in reader:
        print(row["Name"], row["City"])
This method automatically creates a dictionary for each row. The keys are the values from the first row of the file.
Writing Data to a CSV File
Simple Row Writing
The basic method for writing data uses csv.writer:
with open("output.csv", mode='w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(["Name", "Age"])
    writer.writerow(["Oleg", 35])
Writing Dictionary Data
To write structured data, it's convenient to use csv.DictWriter:
data = [
    {"Name": "Maria", "Age": 28},
    {"Name": "Alexey", "Age": 40},
]
with open("people.csv", "w", newline='', encoding='utf-8') as f:
    writer = csv.DictWriter(f, fieldnames=["Name", "Age"])
    writer.writeheader()
    writer.writerows(data)
Customizing Reading and Writing Parameters
The csv module allows you to customize various parameters:
- delimiter: specifies the field separator
- quotechar: the character used to quote fields containing special characters
- skipinitialspace: skips whitespace following the delimiter
- lineterminator: the character used to terminate lines
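These parameters can be combined in a single call. The following is a minimal sketch using an in-memory sample (io.StringIO stands in for a real file):

```python
import csv
import io

# Semicolon-delimited sample with a space after each delimiter and a
# quoted field that contains a comma.
sample = io.StringIO('Name; "City, Country"; Age\nAnna; "Moscow, Russia"; 25\n')

reader = csv.reader(
    sample,
    delimiter=';',           # fields are separated by semicolons
    quotechar='"',           # quotes protect embedded commas
    skipinitialspace=True,   # drop the space that follows each delimiter
)
for row in reader:
    print(row)
```

With skipinitialspace=True, the space before each quoted field is discarded, so the quoting is recognized and each row parses into clean values such as ['Anna', 'Moscow, Russia', '25'].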
Handling CSV Files with the pandas Library
Installing and Importing pandas
The pandas library provides more powerful capabilities for working with CSV files:
pip install pandas
Reading CSV Files in pandas
The primary method for reading CSV files is:
import pandas as pd
df = pd.read_csv("data.csv")
print(df.head())
Key Parameters of read_csv
The read_csv function supports numerous parameters to customize reading:
- sep: field separator (comma, semicolon, tab)
- encoding: file encoding (e.g., utf-8, cp1251, latin1)
- nrows: number of rows to read
- usecols: select specific columns
- skiprows: skip initial rows
- na_values: values to be interpreted as missing (NaN)
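Several of these parameters often appear together. A short sketch on an in-memory sample (io.StringIO stands in for a real file path; the column names are illustrative):

```python
import io
import pandas as pd

data = io.StringIO("Name;Age;City\nAnna;25;Moscow\nIvan;N/A;Kazan\nOlga;31;Tver\n")

df = pd.read_csv(
    data,
    sep=';',                  # semicolon-separated fields
    usecols=['Name', 'Age'],  # load only these two columns
    nrows=2,                  # read just the first two data rows
    na_values=['N/A'],        # treat "N/A" as missing
)
print(df)
```

The resulting DataFrame has two rows and two columns, and Ivan's "N/A" age is loaded as NaN.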
Writing Data to CSV
Saving a DataFrame to a CSV file is done with the to_csv method:
df.to_csv("new_data.csv", index=False)
The index=False parameter prevents writing row indices to the file.
Iterating Over DataFrame Columns
The items() method allows you to iterate over DataFrame columns:
for label, content in df.items():
    print(f"Column: {label}")
    print(content.head())
Handling Non-Standard Formats
Reading Files with Different Delimiters
Many CSV files use delimiters other than a comma:
# File with a semicolon delimiter
df = pd.read_csv("data_semicolon.csv", sep=";")
# File with a tab delimiter
df = pd.read_csv("data_tab.csv", sep="\t")
Handling Different Encodings
The correct encoding is critical for reading data accurately:
# UTF-8 encoding
df = pd.read_csv("data.csv", encoding='utf-8')
# Windows-1251 encoding
df = pd.read_csv("data.csv", encoding='cp1251')
# Latin-1 encoding
df = pd.read_csv("data.csv", encoding='latin1')
Solving Common Problems
Encoding Issues
An incorrect encoding leads to garbled text. The solution is to specify the correct encoding when reading the file.
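When the encoding is unknown, a common approach is to try the likely candidates in order. A minimal sketch (the helper name and the encoding order are assumptions, not a standard recipe; latin1 decodes any byte sequence, so it acts as a last resort):

```python
def read_with_fallback(path, encodings=('utf-8', 'cp1251', 'latin1')):
    """Try each encoding in turn; return the text and the encoding that worked."""
    for enc in encodings:
        try:
            with open(path, encoding=enc) as f:
                return f.read(), enc
        except UnicodeDecodeError:
            continue  # this encoding cannot decode the file; try the next one
    raise ValueError(f"none of {encodings} could decode {path}")
```

The returned encoding can then be passed to pd.read_csv(path, encoding=enc) for the actual load.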
Unwanted Indexes in the Output File
When saving a DataFrame, row indices are added automatically. To exclude them, use the index=False parameter.
Handling Blank Lines
By default, pandas skips blank lines. To keep them instead, pass skip_blank_lines=False, which turns each blank line into a row of NaN values.
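The difference is easy to see on a small in-memory sample (io.StringIO stands in for a real file):

```python
import io
import pandas as pd

raw = "Name,Age\nAnna,25\n\nIvan,30\n"  # note the blank line in the middle

skipped = pd.read_csv(io.StringIO(raw))                       # blank line dropped
kept = pd.read_csv(io.StringIO(raw), skip_blank_lines=False)  # blank line -> NaN row

print(len(skipped), len(kept))
```

The default read yields two rows; with skip_blank_lines=False there are three, the middle one entirely NaN.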
Slow Loading of Large Files
For large files, it is recommended to use the chunksize parameter so that data is processed in manageable chunks rather than loaded into memory all at once.
Optimizing Performance with Large Files
Reading Data in Chunks
Processing large CSV files in chunks prevents memory overflow:
chunk_size = 10000
for chunk in pd.read_csv("bigdata.csv", chunksize=chunk_size):
    # Process each data chunk
    processed_chunk = process_data(chunk)
    # Save the results
    save_results(processed_chunk)
Selecting Specific Columns
Loading only the necessary columns reduces memory consumption:
df = pd.read_csv("data.csv", usecols=["Name", "City", "Age"])
Optimizing Data Types
Specifying appropriate data types decreases memory usage:
dtype_dict = {
    'Age': 'int8',
    'Salary': 'float32',
    'City': 'category'
}
df = pd.read_csv("data.csv", dtype=dtype_dict)
Practical Applications
Filtering Data Based on Conditions
Pandas makes it easy to filter data:
df = pd.read_csv("orders.csv")
completed_orders = df[df["Status"] == "Completed"]
high_value_orders = df[df["Amount"] > 1000]
completed_orders.to_csv("done_orders.csv", index=False)
Merging Multiple CSV Files
Combine data from multiple files into one:
import glob
# Find all files with a specific pattern
files = glob.glob("sales_*.csv")
# Read and concatenate all files
dataframes = [pd.read_csv(file) for file in files]
combined_df = pd.concat(dataframes, ignore_index=True)
# Save the combined data
combined_df.to_csv("all_sales.csv", index=False)
Aggregating and Grouping Data
Pandas provides powerful tools for data analysis:
# Group by city and calculate aggregate statistics
city_stats = df.groupby('City').agg({
    'Age': 'mean',
    'Salary': ['mean', 'sum', 'count']
})
# Save the aggregated data
city_stats.to_csv("city_statistics.csv")
Best Practices for Working with CSV
Choosing the Right Tool
For simple reading and writing tasks, the standard csv module is sufficient. For data analysis, filtering, or aggregation, you should use pandas.
Ensuring Correct Encoding
Always specify the encoding when working with files. For compatibility with Excel, utf-8-sig is recommended.
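With utf-8-sig, pandas prepends a byte order mark (BOM) that tells Excel the file is UTF-8. A quick sketch (the file name is illustrative):

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Anna"], "City": ["Moscow"]})

# utf-8-sig writes a BOM (EF BB BF) at the start of the file,
# which Excel uses to detect UTF-8 automatically.
df.to_csv("excel_ready.csv", index=False, encoding='utf-8-sig')
```

Opening the file in binary mode confirms the three BOM bytes at the start.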
Verifying Delimiters
CSV files can use various delimiters. Check the file's structure before processing it.
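One way to check is the standard library's csv.Sniffer, which guesses the dialect from a sample of the text. A minimal sketch on an inline sample:

```python
import csv

# In practice the sample would come from the first few lines of the file,
# e.g. f.read(2048); here it is inlined for brevity.
sample = "Name;Age;City\nAnna;25;Moscow\nIvan;30;Kazan\n"

dialect = csv.Sniffer().sniff(sample)
print(dialect.delimiter)
```

The detected delimiter can then be passed on, for example as pd.read_csv(path, sep=dialect.delimiter).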
Handling Whitespace
Use the skipinitialspace=True parameter to automatically remove extra spaces after delimiters.
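The same parameter exists in pandas. A short sketch on an in-memory sample with a space after each comma:

```python
import io
import pandas as pd

raw = "Name, Age\nAnna, 25\n"  # note the space after each comma

df = pd.read_csv(io.StringIO(raw), skipinitialspace=True)
print(list(df.columns))
```

Without the parameter, the second column would be named ' Age' (with a leading space); with it, the header and values come out clean.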
Data Validation
Verify the correctness of the data after loading:
# Check for missing values
print(df.isnull().sum())
# Check data types
print(df.dtypes)
# Basic statistics
print(df.describe())
Frequently Asked Questions (FAQ)
How do you read a CSV file in Python?
There are two main methods: using the built-in csv module with the csv.reader() function or the pandas library with the pd.read_csv() function.
How do you write data to a CSV file?
For writing, you can use csv.writer() from the standard library or the DataFrame.to_csv() method in pandas.
What is the difference between pandas and the csv module?
The csv module is a lightweight part of the standard library suited to simple row-by-row reading and writing. Pandas offers far richer functionality: efficient processing of large datasets and built-in tools for filtering, aggregation, and data visualization.
How do you prevent adding indexes when writing?
When calling the to_csv() method, specify the parameter index=False.
How do you work efficiently with large CSV files?
Use the chunksize parameter to process data in stages and load only the necessary columns with the usecols parameter.
How do you handle files with non-standard delimiters?
Specify the appropriate delimiter in the sep parameter when reading the file, such as sep=";", sep="\t", or sep="|".