PyPDF2 PDF Processing

What Is PyPDF2

PyPDF2 is a pure-Python library for working with PDF files. It began as a fork of the original pyPdf library and provides a high-level API for reading, writing, and modifying PDF documents without requiring knowledge of the internal PDF format.

Main Features of the Library

  • Reading and extracting text from PDF files
  • Merging multiple PDF documents
  • Splitting PDFs into individual pages
  • Adding watermarks and overlaying pages
  • Encrypting and decrypting PDF files
  • Working with document metadata
  • Rotating and scaling pages
  • Handling bookmarks and outlines

Installing PyPDF2

Install the library via pip:

pip install PyPDF2

To encrypt or decrypt PDFs that use AES, also install the optional crypto extra (the quotes keep shells such as zsh from interpreting the brackets):

pip install "PyPDF2[crypto]"
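
To confirm which version ended up installed, a small check like the one below may help. It is only a sketch: the helper name `installed_version` is made up for illustration, and it relies on the standard-library `importlib.metadata` module (Python 3.8+) rather than on anything PyPDF2-specific.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


pypdf2_version = installed_version("PyPDF2")
if pypdf2_version is None:
    print("PyPDF2 is not installed")
else:
    print(f"PyPDF2 {pypdf2_version} is installed")
```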

Core Classes and Components

PdfReader

The PdfReader class is designed for reading PDF files. It provides access to pages, metadata, and other document information.

PdfWriter

The PdfWriter class is used to create new PDF files or modify existing ones. It allows adding pages, applying encryption, and changing metadata.

PdfMerger

The PdfMerger class specializes in merging multiple PDF files into a single document with configurable page order.

Basic PDF Operations

Reading PDF Files

from PyPDF2 import PdfReader

# Load a PDF file
reader = PdfReader("example.pdf")

# Get the number of pages
print(f"Number of pages: {len(reader.pages)}")

# Extract text from the first page
first_page = reader.pages[0]
text = first_page.extract_text()
print(text)

Extracting Text from All Pages

from PyPDF2 import PdfReader

reader = PdfReader("example.pdf")
full_text = ""

for page_num, page in enumerate(reader.pages):
    text = page.extract_text()
    full_text += f"--- Page {page_num + 1} ---\n{text}\n\n"

print(full_text)

Checking Document Information

from PyPDF2 import PdfReader

reader = PdfReader("example.pdf")

# Check for encryption
if reader.is_encrypted:
    print("File is encrypted")
else:
    print("File is not encrypted")

# Get dimensions of the first page
page = reader.pages[0]
print(f"Page size: {page.mediabox}")
print(f"Rotation: {page.rotation} degrees")
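
The `mediabox` values are expressed in PDF points, which by default are 1/72 inch. As a rough sketch, a conversion to millimetres could look like this; the helper names `points_to_mm` and `describe_size` are illustrative and not part of PyPDF2.

```python
POINTS_PER_INCH = 72
MM_PER_INCH = 25.4


def points_to_mm(points):
    """Convert a length in PDF points to millimetres."""
    return float(points) / POINTS_PER_INCH * MM_PER_INCH


def describe_size(width_pt, height_pt):
    """Format a page size given in points as a millimetre string."""
    return f"{points_to_mm(width_pt):.1f} x {points_to_mm(height_pt):.1f} mm"


# US Letter is 612 x 792 points
print(describe_size(612, 792))  # 215.9 x 279.4 mm
```

With a real page, the inputs would typically come from `page.mediabox.width` and `page.mediabox.height`.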

Merging PDF Files

Simple Merge

from PyPDF2 import PdfMerger

merger = PdfMerger()

# Add files
merger.append("file1.pdf")
merger.append("file2.pdf")
merger.append("file3.pdf")

# Save the result
merger.write("merged_document.pdf")
merger.close()

Merge with Page Ranges

from PyPDF2 import PdfMerger

merger = PdfMerger()

# Add only specific pages
merger.append("file1.pdf", pages=(0, 3))  # First 3 pages
merger.append("file2.pdf", pages=(2, 5))  # Pages 3‑5

merger.write("selective_merge.pdf")
merger.close()
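
The `pages` tuples above follow Python's half-open convention: a 0-based start index and an exclusive stop index. If page numbers arrive as 1-based, inclusive values (the way a person reads them in a viewer), a small hypothetical helper such as `to_page_range` can do the conversion. This is a sketch, not part of the PyPDF2 API.

```python
def to_page_range(first_page, last_page):
    """Convert 1-based inclusive page numbers to a 0-based (start, stop) tuple."""
    if first_page < 1 or last_page < first_page:
        raise ValueError("expected 1 <= first_page <= last_page")
    return (first_page - 1, last_page)


print(to_page_range(1, 3))  # (0, 3)
print(to_page_range(3, 5))  # (2, 5)
```

The returned tuple can be passed directly as the `pages` argument, for example `merger.append("file2.pdf", pages=to_page_range(3, 5))`.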

Splitting PDFs into Pages

Extracting Individual Pages

from PyPDF2 import PdfReader, PdfWriter

reader = PdfReader("example.pdf")

# Extract the first page
writer = PdfWriter()
writer.add_page(reader.pages[0])

with open("page1.pdf", "wb") as output_file:
    writer.write(output_file)

Splitting into Multiple Files

from PyPDF2 import PdfReader, PdfWriter

reader = PdfReader("example.pdf")

for page_num in range(len(reader.pages)):
    writer = PdfWriter()
    writer.add_page(reader.pages[page_num])
    
    with open(f"page_{page_num + 1}.pdf", "wb") as output_file:
        writer.write(output_file)
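
When one file per page is too fine-grained, the same loop can work on groups of pages. The sketch below only computes the index ranges; `chunk_ranges` is a hypothetical helper name, and the chunk size of 3 is an arbitrary choice for the demonstration.

```python
def chunk_ranges(total_pages, chunk_size):
    """Return 0-based (start, stop) index pairs covering all pages in order."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    return [
        (start, min(start + chunk_size, total_pages))
        for start in range(0, total_pages, chunk_size)
    ]


# A 7-page document split into files of at most 3 pages each
for part, (start, stop) in enumerate(chunk_ranges(7, 3), start=1):
    print(f"part_{part}.pdf <- page indices {list(range(start, stop))}")
```

Each `(start, stop)` pair can then drive an inner loop that adds `reader.pages[i]` for `i in range(start, stop)` to a fresh `PdfWriter`.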

Working with Watermarks

Adding a Watermark

from PyPDF2 import PdfReader, PdfWriter

# Load the main document and the watermark
reader = PdfReader("document.pdf")
watermark = PdfReader("watermark.pdf")
writer = PdfWriter()

# Overlay the watermark on each page
for page in reader.pages:
    page.merge_page(watermark.pages[0])
    writer.add_page(page)

with open("watermarked.pdf", "wb") as output_file:
    writer.write(output_file)

Creating a Watermark Programmatically

from PyPDF2 import PdfReader, PdfWriter
from reportlab.pdfgen import canvas
from reportlab.lib.pagesizes import letter
import io

# Create a watermark using ReportLab
packet = io.BytesIO()
can = canvas.Canvas(packet, pagesize=letter)
can.setFillAlpha(0.3)  # Transparency
can.drawString(100, 100, "CONFIDENTIAL")
can.save()

# Use the generated watermark
packet.seek(0)
watermark = PdfReader(packet)
# Then use it as in the previous example

Encryption and Security

Encrypting a PDF File

from PyPDF2 import PdfReader, PdfWriter

reader = PdfReader("example.pdf")
writer = PdfWriter()

# Copy pages
for page in reader.pages:
    writer.add_page(page)

# Apply encryption
writer.encrypt(
    user_password="userpassword",    # Password required to open the document
    owner_password="ownerpassword",  # Password that grants full permissions
    use_128bit=True                  # Use 128-bit encryption
)

with open("protected.pdf", "wb") as output_file:
    writer.write(output_file)

Decrypting a PDF File

from PyPDF2 import PdfReader

reader = PdfReader("protected.pdf")

if reader.is_encrypted:
    success = reader.decrypt("userpassword")
    if success:
        print("File successfully decrypted")
        # Now you can work with the content
        text = reader.pages[0].extract_text()
        print(text)
    else:
        print("Incorrect password")

Working with Metadata

Extracting Metadata

from PyPDF2 import PdfReader

reader = PdfReader("example.pdf")
metadata = reader.metadata

if metadata:
    print(f"Author: {metadata.author}")
    print(f"Title: {metadata.title}")
    print(f"Subject: {metadata.subject}")
    print(f"Creator: {metadata.creator}")
    print(f"Producer: {metadata.producer}")
    print(f"Creation date: {metadata.creation_date}")
    print(f"Modification date: {metadata.modification_date}")

Adding Metadata

from PyPDF2 import PdfReader, PdfWriter

reader = PdfReader("example.pdf")
writer = PdfWriter()

# Copy pages
for page in reader.pages:
    writer.add_page(page)

# Add new metadata
writer.add_metadata({
    "/Author": "Ivan Ivanov",
    "/Title": "Updated Document",
    "/Subject": "Technical Documentation",
    "/Creator": "PyPDF2 Script"
})

with open("updated_metadata.pdf", "wb") as output_file:
    writer.write(output_file)
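
In the example above every metadata key starts with a slash, because PDF stores these keys as names. If the values come from a plain dictionary (a config file, say), a helper along these lines could normalize the keys first. `to_info_keys` is an assumed name used only for this sketch.

```python
def to_info_keys(fields):
    """Return a copy of fields whose keys all start with '/', as PDF names do."""
    return {
        key if key.startswith("/") else f"/{key}": str(value)
        for key, value in fields.items()
    }


info = to_info_keys({"Author": "Ivan Ivanov", "/Title": "Updated Document"})
print(info)  # {'/Author': 'Ivan Ivanov', '/Title': 'Updated Document'}
```

The result can be handed to `writer.add_metadata(info)` in place of the literal dictionary.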

Additional Operations

Rotating Pages

from PyPDF2 import PdfReader, PdfWriter

reader = PdfReader("example.pdf")
writer = PdfWriter()

for page in reader.pages:
    page.rotate(90)  # Rotate 90 degrees clockwise
    writer.add_page(page)

with open("rotated.pdf", "wb") as output_file:
    writer.write(output_file)
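
PDF stores page rotation in multiples of 90 degrees, so it can be useful to validate an angle before passing it to `rotate()`. The following is a minimal sketch; `normalize_rotation` is a hypothetical helper, not a PyPDF2 function.

```python
def normalize_rotation(angle):
    """Map a clockwise angle to 0, 90, 180, or 270; reject other angles."""
    if angle % 90 != 0:
        raise ValueError("rotation angle must be a multiple of 90 degrees")
    return angle % 360


print(normalize_rotation(450))  # 90
print(normalize_rotation(-90))  # 270
```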

Scaling Pages

from PyPDF2 import PdfReader, PdfWriter

reader = PdfReader("example.pdf")
writer = PdfWriter()

for page in reader.pages:
    page.scale(0.5, 0.5)  # Reduce size by half
    writer.add_page(page)

with open("scaled.pdf", "wb") as output_file:
    writer.write(output_file)

Removing Pages

from PyPDF2 import PdfReader, PdfWriter

reader = PdfReader("example.pdf")
writer = PdfWriter()

pages_to_remove = [2, 4, 6]  # Pages to delete (0‑based index)

for i, page in enumerate(reader.pages):
    if i not in pages_to_remove:
        writer.add_page(page)

with open("pages_removed.pdf", "wb") as output_file:
    writer.write(output_file)
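
Because `pages_to_remove` holds 0-based indices, the list `[2, 4, 6]` removes pages 3, 5, and 7 as a viewer would number them. If the input is given as 1-based page numbers instead, a helper like the hypothetical `indices_to_keep` below could compute which indices survive.

```python
def indices_to_keep(total_pages, pages_to_remove):
    """Return 0-based indices left after removing the given 1-based page numbers."""
    remove = {number - 1 for number in pages_to_remove}
    return [index for index in range(total_pages) if index not in remove]


# Removing pages 3, 5, and 7 from an 8-page document
print(indices_to_keep(8, [3, 5, 7]))  # [0, 1, 3, 5, 7]
```

Converting the page numbers to a set also keeps the membership test fast for long documents.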

Working with Bookmarks

Extracting Bookmarks

from PyPDF2 import PdfReader

reader = PdfReader("example.pdf")
bookmarks = reader.outline

def print_bookmarks(bookmarks, level=0):
    for bookmark in bookmarks:
        if isinstance(bookmark, list):
            print_bookmarks(bookmark, level + 1)
        else:
            print("  " * level + str(bookmark.title))

print_bookmarks(bookmarks)

Adding Bookmarks

from PyPDF2 import PdfReader, PdfWriter

reader = PdfReader("example.pdf")
writer = PdfWriter()

# Copy pages
for page in reader.pages:
    writer.add_page(page)

# Add bookmarks
writer.add_outline_item("Introduction", 0)
writer.add_outline_item("Chapter 1", 1)
writer.add_outline_item("Chapter 2", 3)

with open("bookmarked.pdf", "wb") as output_file:
    writer.write(output_file)

Reference Table of Key PyPDF2 Methods and Functions

Category        | Method / Function                             | Description
PDF Reading     | PdfReader(file)                               | Loads a PDF file for reading
                | reader.pages                                  | List of all pages in the PDF
                | reader.metadata                               | Returns document metadata
                | reader.is_encrypted                           | Checks whether the file is encrypted
                | reader.decrypt(password)                      | Decrypts the file with a password
                | reader.outline                                | Returns the bookmark outline structure
PDF Writing     | PdfWriter()                                   | Creates a new writer object
                | writer.add_page(page)                         | Adds a page to the document
                | writer.add_blank_page(width, height)          | Creates a blank page
                | writer.insert_page(page, index)               | Inserts a page at a specific position
                | writer.write(stream)                          | Saves the PDF to a stream
                | writer.encrypt(user_password, owner_password) | Encrypts the document
                | writer.add_metadata(metadata)                 | Adds metadata
                | writer.add_outline_item(title, page_number)   | Adds a bookmark
Page Operations | page.extract_text()                           | Extracts text from a page
                | page.mediabox                                 | Page dimensions (in points)
                | page.rotation                                 | Page rotation angle
                | page.rotate(angle)                            | Rotates the page clockwise
                | page.scale(sx, sy)                            | Scales the page
                | page.merge_page(page2)                        | Overlays one page onto another
PDF Merging     | PdfMerger()                                   | Class for merging PDFs
                | merger.append(file)                           | Appends a file to the end
                | merger.merge(position, file)                  | Inserts a file at a specific position
                | merger.write(filename)                        | Saves the merged PDF
                | merger.close()                                | Closes the object and releases resources

Advanced Techniques

Batch Processing of Files

import os
from PyPDF2 import PdfReader, PdfWriter

def process_pdf_directory(directory_path):
    """Process all PDF files in a directory"""
    for filename in os.listdir(directory_path):
        if filename.endswith('.pdf'):
            filepath = os.path.join(directory_path, filename)
            try:
                reader = PdfReader(filepath)
                print(f"File: {filename}")
                print(f"Pages: {len(reader.pages)}")
                print(f"Encrypted: {reader.is_encrypted}")
                print("-" * 30)
            except Exception as e:
                print(f"Error processing {filename}: {e}")

# Usage
process_pdf_directory("/path/to/pdf/files")
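
The `endswith('.pdf')` test above is case-sensitive, so a file named `SCAN.PDF` would be skipped, and `os.listdir` returns entries in arbitrary order. A possible variant of the directory scan, sketched with the standard-library `pathlib` module, is shown below; `find_pdfs` is an illustrative name.

```python
from pathlib import Path


def find_pdfs(directory_path):
    """Return PDF paths in a directory, sorted by name, ignoring extension case."""
    return sorted(
        path
        for path in Path(directory_path).iterdir()
        if path.is_file() and path.suffix.lower() == ".pdf"
    )


# List the PDF files in the current directory
for path in find_pdfs("."):
    print(path.name)
```

Each returned path can be passed to `PdfReader` in the same way as the `filepath` string in the function above.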

Extracting Images

from PyPDF2 import PdfReader
from PIL import Image

def extract_images_from_pdf(pdf_path):
    """Extract Flate-compressed RGB and grayscale images from a PDF"""
    reader = PdfReader(pdf_path)
    images = []
    
    for page in reader.pages:
        if '/Resources' not in page or '/XObject' not in page['/Resources']:
            continue
        x_objects = page['/Resources']['/XObject']
        
        for name in x_objects:
            obj = x_objects[name]
            if obj.get('/Subtype') != '/Image':
                continue
            
            # Image.frombytes needs raw pixels, so only Flate-compressed
            # 8-bit images are handled; JPEG (/DCTDecode) and others are skipped
            if obj.get('/Filter') != '/FlateDecode' or obj.get('/BitsPerComponent') != 8:
                continue
            
            color_space = obj.get('/ColorSpace')
            if color_space == '/DeviceRGB':
                mode = "RGB"
            elif color_space == '/DeviceGray':
                mode = "L"
            else:
                continue  # Indexed, CMYK, and ICC-based color spaces are skipped
            
            size = (int(obj['/Width']), int(obj['/Height']))
            data = obj.get_data()  # Decompressed stream contents
            images.append(Image.frombytes(mode, size, data))
    
    return images

Error Handling and Debugging

Common Error Handling

from PyPDF2 import PdfReader
from PyPDF2.errors import PdfReadError

def safe_pdf_processing(pdf_path):
    """Safely process a PDF with error handling"""
    try:
        reader = PdfReader(pdf_path)
        
        # Encrypted files need a password before text can be extracted
        if reader.is_encrypted:
            print("File is encrypted, password required")
            return None
        
        # Attempt text extraction
        text = ""
        for page_num, page in enumerate(reader.pages):
            try:
                page_text = page.extract_text()
                text += f"Page {page_num + 1}:\n{page_text}\n\n"
            except Exception as e:
                print(f"Error processing page {page_num + 1}: {e}")
                continue
        
        return text
    
    except PdfReadError as e:
        print(f"PDF read error: {e}")
        return None
    except Exception as e:
        print(f"Unexpected error: {e}")
        return None

# Usage
result = safe_pdf_processing("example.pdf")
if result:
    print(result)

Performance Optimization

Handling Large Files

from PyPDF2 import PdfReader, PdfWriter
import gc

def process_large_pdf(input_path, output_path):
    """Optimized processing of large PDF files"""
    reader = PdfReader(input_path)
    writer = PdfWriter()
    
    # Add pages in batches to report progress; note that the writer
    # keeps every added page in memory until write() is called
    batch_size = 10
    total_pages = len(reader.pages)
    
    for i in range(0, total_pages, batch_size):
        batch_end = min(i + batch_size, total_pages)
        
        for page_num in range(i, batch_end):
            page = reader.pages[page_num]
            writer.add_page(page)
        
        # Ask the garbage collector to free temporary objects from this batch
        gc.collect()
        
        print(f"Processed pages: {batch_end}/{total_pages}")
    
    with open(output_path, "wb") as output_file:
        writer.write(output_file)

Comparison with Other Libraries

Library Read Write Encrypt Watermark Text Extraction OCR
PyPDF2 Partial
PyMuPDF
PDFPlumber ✓ (excellent)
ReportLab Basic
pdfminer ✓ (excellent)

Best Practices

Usage Recommendations

  1. Always use context managers when working with files
  2. Check encryption before processing files
  3. Handle exceptions to improve stability
  4. Release resources after using PdfMerger
  5. Employ batch processing for large files

Optimal Code Examples

from PyPDF2 import PdfReader, PdfWriter
from contextlib import contextmanager

@contextmanager
def pdf_processor(input_path, output_path):
    """Context manager for safe PDF processing"""
    reader = PdfReader(input_path)
    writer = PdfWriter()
    
    yield reader, writer
    
    # Reached only if the with-block raised no exception,
    # so a failed run does not write a partial output file
    with open(output_path, "wb") as output_file:
        writer.write(output_file)

# Usage
with pdf_processor("input.pdf", "output.pdf") as (reader, writer):
    for page in reader.pages:
        writer.add_page(page)

Practical Application Examples

Automating Document Workflows

from PyPDF2 import PdfReader, PdfWriter, PdfMerger
import os
from datetime import datetime

def create_report_package(reports_dir, output_path):
    """Create a report package with a title page"""
    merger = PdfMerger()
    
    # Add title page
    merger.append("title_page.pdf")
    
    # Sort and add reports
    report_files = sorted([f for f in os.listdir(reports_dir) if f.endswith('.pdf')])
    
    for report_file in report_files:
        report_path = os.path.join(reports_dir, report_file)
        merger.append(report_path)
    
    # Save the result
    merger.write(output_path)
    merger.close()
    
    print(f"Report package created: {output_path}")

# Usage
create_report_package("/path/to/reports", "monthly_reports.pdf")
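
Plain `sorted()` orders file names lexicographically, so `report_10.pdf` lands before `report_2.pdf`. If the reports are numbered, a natural-sort key may give the intended order. The sketch below assumes file names with embedded integers; `natural_key` and the sample names are made up for illustration.

```python
import re


def natural_key(filename):
    """Split a name into text and integer chunks so that 2 sorts before 10."""
    chunks = re.split(r"(\d+)", filename)
    # re.split with a capturing group alternates text (even index) and digits (odd index)
    return [
        int(chunk) if index % 2 else chunk.lower()
        for index, chunk in enumerate(chunks)
    ]


names = ["report_10.pdf", "report_2.pdf", "report_1.pdf"]
print(sorted(names))                   # ['report_1.pdf', 'report_10.pdf', 'report_2.pdf']
print(sorted(names, key=natural_key))  # ['report_1.pdf', 'report_2.pdf', 'report_10.pdf']
```

Passing `key=natural_key` to the `sorted()` call in `create_report_package` would apply the same ordering there.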

Processing Invoices and Contracts

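import os
from datetime import datetime
from PyPDF2 import PdfReader, PdfWriter

def process_invoices(invoice_dir, processed_dir):
    """Process invoices: add stamps and archive"""
    stamp = PdfReader("approved_stamp.pdf")
    os.makedirs(processed_dir, exist_ok=True)  # Make sure the output folder exists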
    
    for filename in os.listdir(invoice_dir):
        if filename.endswith('.pdf'):
            input_path = os.path.join(invoice_dir, filename)
            output_path = os.path.join(processed_dir, f"processed_{filename}")
            
            reader = PdfReader(input_path)
            writer = PdfWriter()
            
            for page in reader.pages:
                # Add stamp to each page
                page.merge_page(stamp.pages[0])
                writer.add_page(page)
            
            # Add metadata
            writer.add_metadata({
                "/ProcessedDate": datetime.now().isoformat(),
                "/Status": "Approved"
            })
            
            with open(output_path, "wb") as output_file:
                writer.write(output_file)

Frequently Asked Questions and Solutions

Why does PyPDF2 extract text poorly from some PDFs?

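PyPDF2 works well with programmatically generated PDFs that contain a text layer. Scanned documents contain only page images, so there is no text to extract without OCR, and files with complex layouts (multiple columns, tables) often yield text in the wrong order. For such cases, PDFPlumber or PyMuPDF is recommended.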
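
One way to spot pages that probably lack a text layer is to look at how much text extraction returned for each page. The heuristic below is a sketch under stated assumptions: `pages_needing_ocr` is a hypothetical helper, and the 20-character threshold is an arbitrary starting point that would need tuning for real documents.

```python
def pages_needing_ocr(page_texts, min_chars=20):
    """Return 1-based numbers of pages whose extracted text is nearly empty."""
    return [
        number
        for number, text in enumerate(page_texts, start=1)
        if len((text or "").strip()) < min_chars
    ]


# Sample strings standing in for the results of page.extract_text()
texts = ["Quarterly report for 2023 with revenue details", "", "   \n"]
print(pages_needing_ocr(texts))  # [2, 3]
```

In practice the input list could be built with `[page.extract_text() for page in reader.pages]`.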

How can I handle an encrypted PDF without a password?

A password-protected PDF cannot be decrypted without the password. The decrypt() method of PdfReader takes a password and returns a result code:

  • 0: failure
  • 1: success with user password
  • 2: success with owner password
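
The result codes listed above can be turned into readable messages with a small lookup. This is only a sketch: `describe_decrypt_result` is an assumed helper name, and the code is demonstrated with plain integers rather than a real `decrypt()` call.

```python
DECRYPT_RESULTS = {
    0: "failure",
    1: "success with user password",
    2: "success with owner password",
}


def describe_decrypt_result(result):
    """Translate a decrypt() result code into a human-readable message."""
    return DECRYPT_RESULTS.get(int(result), "unknown result code")


for code in (0, 1, 2):
    print(code, "->", describe_decrypt_result(code))
```

With a real file, the argument would be the value returned by `reader.decrypt(password)`.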

Can I edit existing text in a PDF?

PyPDF2 does not support editing existing text. The library can only add new elements, merge pages, and apply transformations.

How can I improve text extraction quality?

To improve extraction quality, consider:

  • Combining PyPDF2 with PDFPlumber
  • Using OCR libraries for scanned documents
  • Pre‑processing PDFs to remove artifacts

Does PyPDF2 support PDF forms?

PyPDF2 has limited support for PDF forms: it can read field values and fill in simple fields. For full form handling, specialized libraries are recommended.

Conclusion

PyPDF2 is a powerful and versatile tool for working with PDF documents in Python. Despite some limitations in text extraction and form handling, the library excels at core PDF tasks such as merging, splitting, encrypting, and adding watermarks.

The library is especially well suited for:

  • Automating routine document operations
  • Building document workflow systems
  • Batch processing of PDF files
  • Integrating with web applications

When choosing between PyPDF2 and alternatives, consider the specific requirements of your project: for simple PDF operations such as merging, splitting, and rotating, PyPDF2 is a solid choice, while complex text extraction may benefit from specialized tools used alongside it.
