Working with CSV-files in Python: reading, writing, processing

онлайн тренажер по питону
Online Python Trainer for Beginners

Learn Python easily without overwhelming theory. Solve practical tasks with automatic checking, get hints in Russian, and write code directly in your browser — no installation required.

Start Course

Working with CSV Files in Python

CSV (Comma-Separated Values) is one of the most common formats for storing tabular data. This format is used in various fields, from exporting Excel spreadsheets to exchanging information between web services and databases.

Python provides effective tools for working with CSV files. You can use the basic csv module or the powerful pandas library for data analysis. This article covers methods for reading and writing CSV files, processing data, preventing errors, and optimizing performance when working with large files.

CSV Format Basics

CSV File Structure

CSV is a simple text-based format. Each line in the file represents a single record. Values within a record are separated by special characters. The default separator is a comma, but a semicolon, tab, or pipe can also be used.

The content of a standard CSV file looks like this:

Name,Age,City Anna,25,Moscow Ivan,30,Saint Petersburg

Advantages of the CSV Format

The CSV format has several key advantages:

It features a simple, human-readable structure. It is supported by most applications and programming languages. It results in small file sizes compared to other formats. It can be edited in any text editor. It is versatile for data exchange between different systems.

Working with CSVs Using the csv Module

Basic CSV File Reading

Python's standard csv module is suitable for basic operations with CSV files. To simply read data, use the following approach:

import csv

with open("data.csv", newline='', encoding='utf-8') as f:
    reader = csv.reader(f)
    for row in reader:
        print(row)

Execution result:

['Name', 'Age', 'City']
['Anna', '25', 'Moscow']
['Ivan', '30', 'Saint Petersburg']

Reading CSVs with Headers

For convenient work with named columns, use csv.DictReader:

with open("data.csv", encoding='utf-8') as f:
    reader = csv.DictReader(f)
    for row in reader:
        print(row["Name"], row["City"])

This method automatically creates a dictionary for each row. The keys are the values from the first row of the file.

Writing Data to a CSV File

Simple Row Writing

The basic method for writing data uses csv.writer:

with open("output.csv", mode='w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(["Name", "Age"])
    writer.writerow(["Oleg", 35])

Writing Dictionary Data

To write structured data, it's convenient to use csv.DictWriter:

data = [
    {"Name": "Maria", "Age": 28},
    {"Name": "Alexey", "Age": 40},
]

with open("people.csv", "w", newline='', encoding='utf-8') as f:
    writer = csv.DictWriter(f, fieldnames=["Name", "Age"])
    writer.writeheader()
    writer.writerows(data)

Customizing Reading and Writing Parameters

The csv module allows you to customize various parameters:

delimiter: specifies the field separator quotechar: the character used to quote fields containing special characters skipinitialspace: skips whitespace following the delimiter lineterminator: the character used to terminate lines

Handling CSV Files with the pandas Library

Installing and Importing pandas

The pandas library provides more powerful capabilities for working with CSV files:

pip install pandas

Reading CSV Files in pandas

The primary method for reading CSV files is:

import pandas as pd

df = pd.read_csv("data.csv")
print(df.head())

Key Parameters of read_csv

The read_csv function supports numerous parameters to customize reading:

sep: field separator (comma, semicolon, tab) encoding: file encoding (e.g., utf-8, cp1251, latin1) nrows: number of rows to read usecols: select specific columns skiprows: skip initial rows na_values: values to be interpreted as missing (NaN)

Writing Data to CSV

Saving a DataFrame to a CSV file is done with the to_csv method:

df.to_csv("new_data.csv", index=False)

The index=False parameter prevents writing row indices to the file.

Iterating Over DataFrame Columns

The items() method allows you to iterate over DataFrame columns:

for label, content in df.items():
    print(f"Column: {label}")
    print(content.head())

Handling Non-Standard Formats

Reading Files with Different Delimiters

Many CSV files use delimiters other than a comma:

# File with a semicolon delimiter
df = pd.read_csv("data_semicolon.csv", sep=";")

# File with a tab delimiter
df = pd.read_csv("data_tab.csv", sep="\t")

Handling Different Encodings

The correct encoding is critical for reading data accurately:

# UTF-8 encoding
df = pd.read_csv("data.csv", encoding='utf-8')

# Windows-1251 encoding
df = pd.read_csv("data.csv", encoding='cp1251')

# Latin-1 encoding
df = pd.read_csv("data.csv", encoding='latin1')

Solving Common Problems

Encoding Issues

An incorrect encoding leads to garbled text. The solution is to specify the correct encoding when reading the file.

Unwanted Indexes in the Output File

When saving a DataFrame, row indices are added automatically. To exclude them, use the index=False parameter.

Handling Blank Lines

By default, pandas skips blank lines. To change this behavior, use the skip_blank_lines parameter.

Slow Loading of Large Files

For large files, it is recommended to use the chunksize parameter for processing data in pieces.

Optimizing Performance with Large Files

Reading Data in Chunks

Processing large CSV files in chunks prevents memory overflow:

chunk_size = 10000
for chunk in pd.read_csv("bigdata.csv", chunksize=chunk_size):
    # Process each data chunk
    processed_chunk = process_data(chunk)
    # Save the results
    save_results(processed_chunk)

Selecting Specific Columns

Loading only the necessary columns reduces memory consumption:

df = pd.read_csv("data.csv", usecols=["Name", "City", "Age"])

Optimizing Data Types

Specifying appropriate data types decreases memory usage:

dtype_dict = {
    'Age': 'int8',
    'Salary': 'float32',
    'City': 'category'
}
df = pd.read_csv("data.csv", dtype=dtype_dict)

Practical Applications

Filtering Data Based on Conditions

Pandas makes it easy to filter data:

df = pd.read_csv("orders.csv")
completed_orders = df[df["Status"] == "Completed"]
high_value_orders = df[df["Amount"] > 1000]
completed_orders.to_csv("done_orders.csv", index=False)

Merging Multiple CSV Files

Combine data from multiple files into one:

import glob

# Find all files with a specific pattern
files = glob.glob("sales_*.csv")

# Read and concatenate all files
dataframes = [pd.read_csv(file) for file in files]
combined_df = pd.concat(dataframes, ignore_index=True)

# Save the combined data
combined_df.to_csv("all_sales.csv", index=False)

Aggregating and Grouping Data

Pandas provides powerful tools for data analysis:

# Group by city and calculate aggregate statistics
city_stats = df.groupby('City').agg({
    'Age': 'mean',
    'Salary': ['mean', 'sum', 'count']
})

# Save the aggregated data
city_stats.to_csv("city_statistics.csv")

Best Practices for Working with CSV

Choosing the Right Tool

For simple reading and writing tasks, the standard csv module is sufficient. For data analysis, filtering, or aggregation, you should use pandas.

Ensuring Correct Encoding

Always specify the encoding when working with files. For compatibility with Excel, utf-8-sig is recommended.

Verifying Delimiters

CSV files can use various delimiters. Check the file's structure before processing it.

Handling Whitespace

Use the skipinitialspace=True parameter to automatically remove extra spaces after delimiters.

Data Validation

Verify the correctness of the data after loading:

# Check for missing values
print(df.isnull().sum())

# Check data types
print(df.dtypes)

# Basic statistics
print(df.describe())

Frequently Asked Questions (FAQ)

How do you read a CSV file in Python?

There are two main methods: using the built-in csv module with the csv.reader() function or the pandas library with the pd.read_csv() function.

How do you write data to a CSV file?

For writing, you can use csv.writer() from the standard library or the DataFrame.to_csv() method in pandas.

What is the difference between pandas and the csv module?

Pandas offers richer functionality, including fast processing of large datasets, built-in functions for filtering, aggregation, and data visualization.

How do you prevent adding indexes when writing?

When calling the to_csv() method, specify the parameter index=False.

How do you work efficiently with large CSV files?

Use the chunksize parameter to process data in stages and load only the necessary columns with the usecols parameter.

How do you handle files with non-standard delimiters?

Specify the appropriate delimiter in the sep parameter when reading the file, such as sep=";", sep="\t", or sep="|".

News