What Is PyPDF2
PyPDF2 is a Python library for working with PDF files that was developed as an improved version of the original PyPDF library. It provides a high‑level API for reading, writing, and modifying PDF documents without needing to understand the internal PDF format.
Main Features of the Library
- Reading and extracting text from PDF files
- Merging multiple PDF documents
- Splitting PDFs into individual pages
- Adding watermarks and overlaying pages
- Encrypting and decrypting PDF files
- Working with document metadata
- Rotating and scaling pages
- Handling bookmarks and outlines
Installing PyPDF2
Install the library via pip:
pip install PyPDF2
To encrypt or decrypt PDFs that use AES, also install the optional cryptography dependencies:
pip install "PyPDF2[crypto]"
Core Classes and Components
PdfReader
The PdfReader class is designed for reading PDF files. It provides access to pages, metadata, and other document information.
PdfWriter
The PdfWriter class is used to create new PDF files or modify existing ones. It allows adding pages, applying encryption, and changing metadata.
PdfMerger
The PdfMerger class specializes in merging multiple PDF files into a single document with configurable page order.
Basic PDF Operations
Reading PDF Files
from PyPDF2 import PdfReader
# Load a PDF file
reader = PdfReader("example.pdf")
# Get the number of pages
print(f"Number of pages: {len(reader.pages)}")
# Extract text from the first page
first_page = reader.pages[0]
text = first_page.extract_text()
print(text)
Extracting Text from All Pages
from PyPDF2 import PdfReader
reader = PdfReader("example.pdf")
full_text = ""
for page_num, page in enumerate(reader.pages):
    text = page.extract_text()
    full_text += f"--- Page {page_num + 1} ---\n{text}\n\n"
print(full_text)
Checking Document Information
from PyPDF2 import PdfReader
reader = PdfReader("example.pdf")
# Check for encryption
if reader.is_encrypted:
    print("File is encrypted")
else:
    print("File is not encrypted")
# Get dimensions of the first page
page = reader.pages[0]
print(f"Page size: {page.mediabox}")
print(f"Rotation: {page.rotation} degrees")
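The values in `mediabox` are expressed in PDF points, where one point is 1/72 of an inch. As a minimal sketch, the helper below converts a length in points to millimetres. `points_to_mm` is a hypothetical name and not part of PyPDF2; it could be fed values such as `float(page.mediabox.width)`.

```python
POINTS_PER_INCH = 72.0
MM_PER_INCH = 25.4


def points_to_mm(points: float) -> float:
    """Convert a length in PDF points to millimetres."""
    return points / POINTS_PER_INCH * MM_PER_INCH


# US Letter is 612 x 792 points
width_mm = points_to_mm(612)
height_mm = points_to_mm(792)
print(f"{width_mm:.1f} x {height_mm:.1f} mm")  # 215.9 x 279.4 mm
```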
Merging PDF Files
Simple Merge
from PyPDF2 import PdfMerger
merger = PdfMerger()
# Add files
merger.append("file1.pdf")
merger.append("file2.pdf")
merger.append("file3.pdf")
# Save the result
merger.write("merged_document.pdf")
merger.close()
Merge with Page Ranges
from PyPDF2 import PdfMerger
merger = PdfMerger()
# Add only specific pages
merger.append("file1.pdf", pages=(0, 3)) # First 3 pages
merger.append("file2.pdf", pages=(2, 5)) # Pages 3‑5
merger.write("selective_merge.pdf")
merger.close()
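The `pages` tuple follows Python `range` semantics: indices are zero-based and the stop value is excluded. The sketch below translates human-friendly page numbers into that form. `page_span` is a hypothetical helper, not a PyPDF2 function.

```python
from typing import Tuple


def page_span(first_page: int, last_page: int) -> Tuple[int, int]:
    """Turn 1-based, inclusive page numbers into a (start, stop) tuple.

    The result uses zero-based indices with an exclusive stop,
    the same convention as range().
    """
    if first_page < 1 or last_page < first_page:
        raise ValueError("expected 1 <= first_page <= last_page")
    return (first_page - 1, last_page)


print(page_span(1, 3))  # (0, 3): the first three pages
print(page_span(3, 5))  # (2, 5): pages 3 through 5
```

The returned tuple can then be passed along, for example as `merger.append("file2.pdf", pages=page_span(3, 5))`.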
Splitting PDFs into Pages
Extracting Individual Pages
from PyPDF2 import PdfReader, PdfWriter
reader = PdfReader("example.pdf")
# Extract the first page
writer = PdfWriter()
writer.add_page(reader.pages[0])
with open("page1.pdf", "wb") as output_file:
    writer.write(output_file)
Splitting into Multiple Files
from PyPDF2 import PdfReader, PdfWriter
reader = PdfReader("example.pdf")
for page_num in range(len(reader.pages)):
    writer = PdfWriter()
    writer.add_page(reader.pages[page_num])
    with open(f"page_{page_num + 1}.pdf", "wb") as output_file:
        writer.write(output_file)
Working with Watermarks
Adding a Watermark
from PyPDF2 import PdfReader, PdfWriter
# Load the main document and the watermark
reader = PdfReader("document.pdf")
watermark = PdfReader("watermark.pdf")
writer = PdfWriter()
# Overlay the watermark on each page
for page in reader.pages:
    page.merge_page(watermark.pages[0])
    writer.add_page(page)
with open("watermarked.pdf", "wb") as output_file:
    writer.write(output_file)
Creating a Watermark Programmatically
from PyPDF2 import PdfReader
from reportlab.pdfgen import canvas
from reportlab.lib.pagesizes import letter
import io
# Create a watermark using ReportLab
packet = io.BytesIO()
can = canvas.Canvas(packet, pagesize=letter)
can.setFillAlpha(0.3) # Transparency
can.drawString(100, 100, "CONFIDENTIAL")
can.save()
# Use the generated watermark
packet.seek(0)
watermark = PdfReader(packet)
# Then use it as in the previous example
Encryption and Security
Encrypting a PDF File
from PyPDF2 import PdfReader, PdfWriter
reader = PdfReader("example.pdf")
writer = PdfWriter()
# Copy pages
for page in reader.pages:
    writer.add_page(page)
# Apply encryption
writer.encrypt(
    user_pwd="userpassword",    # Password for viewing
    owner_pwd="ownerpassword",  # Password for editing
    use_128bit=True             # Use 128-bit encryption
)
with open("protected.pdf", "wb") as output_file:
    writer.write(output_file)
Decrypting a PDF File
from PyPDF2 import PdfReader
reader = PdfReader("protected.pdf")
if reader.is_encrypted:
    success = reader.decrypt("userpassword")
    if success:
        print("File successfully decrypted")
        # Now you can work with the content
        text = reader.pages[0].extract_text()
        print(text)
    else:
        print("Incorrect password")
Working with Metadata
Extracting Metadata
from PyPDF2 import PdfReader
reader = PdfReader("example.pdf")
metadata = reader.metadata
if metadata:
    print(f"Author: {metadata.author}")
    print(f"Title: {metadata.title}")
    print(f"Subject: {metadata.subject}")
    print(f"Creator: {metadata.creator}")
    print(f"Producer: {metadata.producer}")
    print(f"Creation date: {metadata.creation_date}")
    print(f"Modification date: {metadata.modification_date}")
Adding Metadata
from PyPDF2 import PdfReader, PdfWriter
reader = PdfReader("example.pdf")
writer = PdfWriter()
# Copy pages
for page in reader.pages:
    writer.add_page(page)
# Add new metadata
writer.add_metadata({
    "/Author": "Ivan Ivanov",
    "/Title": "Updated Document",
    "/Subject": "Technical Documentation",
    "/Creator": "PyPDF2 Script"
})
with open("updated_metadata.pdf", "wb") as output_file:
    writer.write(output_file)
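Date fields such as `/CreationDate` and `/ModDate` are stored as strings in the PDF date format `D:YYYYMMDDHHmmSS`, optionally followed by a UTC offset. The sketch below formats a `datetime` that way. `to_pdf_date` is a hypothetical helper written for illustration, not a PyPDF2 function.

```python
from datetime import datetime, timedelta, timezone


def to_pdf_date(moment: datetime) -> str:
    """Format a datetime as a PDF date string, e.g. D:20240115093000+02'00'."""
    text = moment.strftime("D:%Y%m%d%H%M%S")
    offset = moment.utcoffset()
    if offset is None:
        return text  # naive datetime: no time zone suffix
    total_minutes = int(offset.total_seconds() // 60)
    sign = "+" if total_minutes >= 0 else "-"
    hours, minutes = divmod(abs(total_minutes), 60)
    return f"{text}{sign}{hours:02d}'{minutes:02d}'"


print(to_pdf_date(datetime(2024, 1, 15, 9, 30, 0)))
# D:20240115093000
aware = datetime(2024, 1, 15, 9, 30, 0, tzinfo=timezone(timedelta(hours=2)))
print(to_pdf_date(aware))
# D:20240115093000+02'00'
```

The resulting string could be added to the metadata dictionary, for example under the `"/CreationDate"` key.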
Additional Operations
Rotating Pages
from PyPDF2 import PdfReader, PdfWriter
reader = PdfReader("example.pdf")
writer = PdfWriter()
for page in reader.pages:
    page.rotate(90)  # Rotate 90 degrees clockwise
    writer.add_page(page)
with open("rotated.pdf", "wb") as output_file:
    writer.write(output_file)
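PDF page rotation is stored in multiples of 90 degrees, so it can help to validate an angle before passing it to `page.rotate()`. A minimal sketch follows; `normalize_rotation` is a hypothetical helper, not a PyPDF2 function.

```python
def normalize_rotation(angle: int) -> int:
    """Validate a rotation angle and reduce it to 0, 90, 180 or 270."""
    if angle % 90 != 0:
        raise ValueError(f"rotation must be a multiple of 90, got {angle}")
    return angle % 360


print(normalize_rotation(450))  # 90
print(normalize_rotation(-90))  # 270
```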
Scaling Pages
from PyPDF2 import PdfReader, PdfWriter
reader = PdfReader("example.pdf")
writer = PdfWriter()
for page in reader.pages:
    page.scale(0.5, 0.5)  # Reduce width and height by half
    writer.add_page(page)
with open("scaled.pdf", "wb") as output_file:
    writer.write(output_file)
Removing Pages
from PyPDF2 import PdfReader, PdfWriter
reader = PdfReader("example.pdf")
writer = PdfWriter()
pages_to_remove = [2, 4, 6] # Pages to delete (0‑based index)
for i, page in enumerate(reader.pages):
    if i not in pages_to_remove:
        writer.add_page(page)
with open("pages_removed.pdf", "wb") as output_file:
    writer.write(output_file)
Working with Bookmarks
Extracting Bookmarks
from PyPDF2 import PdfReader
reader = PdfReader("example.pdf")
bookmarks = reader.outline
def print_bookmarks(bookmarks, level=0):
    for bookmark in bookmarks:
        if isinstance(bookmark, list):
            print_bookmarks(bookmark, level + 1)
        else:
            print("  " * level + str(bookmark.title))
print_bookmarks(bookmarks)
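When the outline is needed as data rather than printed output, the same traversal can return `(level, title)` pairs. This sketch assumes, as the function above does, that a nested list holds the children of the preceding entry and that every entry exposes a `title` attribute. `flatten_outline` is a hypothetical name, and the `SimpleNamespace` objects only stand in for real outline entries.

```python
from types import SimpleNamespace


def flatten_outline(outline, level=0):
    """Return a flat list of (level, title) tuples from a nested outline."""
    entries = []
    for item in outline:
        if isinstance(item, list):
            # A nested list contains children one level deeper
            entries.extend(flatten_outline(item, level + 1))
        else:
            entries.append((level, str(item.title)))
    return entries


# Stand-in data shaped like a nested outline
demo = [
    SimpleNamespace(title="Introduction"),
    SimpleNamespace(title="Chapter 1"),
    [SimpleNamespace(title="Section 1.1"), SimpleNamespace(title="Section 1.2")],
    SimpleNamespace(title="Chapter 2"),
]
for level, title in flatten_outline(demo):
    print("  " * level + title)
```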
Adding Bookmarks
from PyPDF2 import PdfReader, PdfWriter
reader = PdfReader("example.pdf")
writer = PdfWriter()
# Copy pages
for page in reader.pages:
    writer.add_page(page)
# Add bookmarks
writer.add_outline_item("Introduction", 0)
writer.add_outline_item("Chapter 1", 1)
writer.add_outline_item("Chapter 2", 3)
with open("bookmarked.pdf", "wb") as output_file:
    writer.write(output_file)
Complete Table of PyPDF2 Methods and Functions
| Category | Method / Function | Description |
|---|---|---|
| PDF Reading | PdfReader(file) | Loads a PDF file for reading |
| PDF Reading | reader.pages | List of all pages in the PDF |
| PDF Reading | reader.metadata | Returns document metadata |
| PDF Reading | reader.is_encrypted | Checks whether the file is encrypted |
| PDF Reading | reader.decrypt(password) | Decrypts the file with a password |
| PDF Reading | reader.outline | Returns the bookmark outline structure |
| PDF Writing | PdfWriter() | Creates a new writer object |
| PDF Writing | writer.add_page(page) | Adds a page to the document |
| PDF Writing | writer.add_blank_page(width, height) | Creates a blank page |
| PDF Writing | writer.insert_page(page, index) | Inserts a page at a specific position |
| PDF Writing | writer.write(stream) | Saves the PDF to a stream |
| PDF Writing | writer.encrypt(user_pwd, owner_pwd) | Encrypts the document |
| PDF Writing | writer.add_metadata(metadata) | Adds metadata |
| PDF Writing | writer.add_outline_item(title, pagenum) | Adds a bookmark |
| Page Operations | page.extract_text() | Extracts text from a page |
| Page Operations | page.mediabox | Page dimensions |
| Page Operations | page.rotation | Page rotation angle |
| Page Operations | page.rotate(angle) | Rotates the page |
| Page Operations | page.scale(sx, sy) | Scales the page |
| Page Operations | page.merge_page(page2) | Overlays one page onto another |
| PDF Merging | PdfMerger() | Class for merging PDFs |
| PDF Merging | merger.append(file) | Appends a file to the end |
| PDF Merging | merger.merge(position, file) | Inserts a file at a specific position |
| PDF Merging | merger.write(filename) | Saves the merged PDF |
| PDF Merging | merger.close() | Closes the object and releases resources |
Advanced Techniques
Batch Processing of Files
import os
from PyPDF2 import PdfReader, PdfWriter
def process_pdf_directory(directory_path):
    """Process all PDF files in a directory"""
    for filename in os.listdir(directory_path):
        if filename.endswith('.pdf'):
            filepath = os.path.join(directory_path, filename)
            try:
                reader = PdfReader(filepath)
                print(f"File: {filename}")
                print(f"Pages: {len(reader.pages)}")
                print(f"Encrypted: {reader.is_encrypted}")
                print("-" * 30)
            except Exception as e:
                print(f"Error processing {filename}: {e}")
# Usage
process_pdf_directory("/path/to/pdf/files")
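Note that the `filename.endswith('.pdf')` check is case-sensitive, so a file named `REPORT.PDF` is skipped. A possible variant of the file discovery step, using only the standard library, is sketched below. `find_pdfs` is a hypothetical helper name.

```python
from pathlib import Path
from typing import List


def find_pdfs(directory: str) -> List[Path]:
    """Return PDF files directly inside `directory`, sorted by path.

    The extension match ignores case, and subdirectories are not searched.
    """
    return sorted(
        path for path in Path(directory).iterdir()
        if path.is_file() and path.suffix.lower() == ".pdf"
    )
```

Each returned path can be handed to the reader as a string, for example `PdfReader(str(path))`.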
Extracting Images
from PyPDF2 import PdfReader
from PIL import Image
def extract_images_from_pdf(pdf_path):
    """Extract FlateDecode-compressed RGB and grayscale images from a PDF"""
    reader = PdfReader(pdf_path)
    images = []
    for page in reader.pages:
        if '/Resources' not in page or '/XObject' not in page['/Resources']:
            continue
        x_objects = page['/Resources']['/XObject'].get_object()
        for name in x_objects:
            obj = x_objects[name].get_object()
            if obj.get('/Subtype') != '/Image':
                continue
            # Only FlateDecode streams yield raw pixel data once decoded;
            # JPEG (DCTDecode) and other filters need different handling
            if obj.get('/Filter') != '/FlateDecode':
                continue
            color_space = obj.get('/ColorSpace')
            if color_space == '/DeviceRGB':
                mode = "RGB"
            elif color_space == '/DeviceGray':
                mode = "L"
            else:
                continue  # Indexed, CMYK, and ICC-based images are skipped
            size = (obj['/Width'], obj['/Height'])
            data = obj.get_data()  # Decompressed bytes (assumes 8 bits per component)
            images.append(Image.frombytes(mode, size, data))
    return images
Error Handling and Debugging
Common Error Handling
from PyPDF2 import PdfReader
from PyPDF2.errors import PdfReadError
def safe_pdf_processing(pdf_path):
    """Safely process a PDF with error handling"""
    try:
        reader = PdfReader(pdf_path)
        # Encrypted files need a password before their pages can be read
        if reader.is_encrypted:
            print("File is encrypted, password required")
            return None
        # Attempt text extraction
        text = ""
        for page_num, page in enumerate(reader.pages):
            try:
                page_text = page.extract_text()
                text += f"Page {page_num + 1}:\n{page_text}\n\n"
            except Exception as e:
                print(f"Error processing page {page_num + 1}: {e}")
                continue
        return text
    except PdfReadError as e:
        print(f"PDF read error: {e}")
        return None
    except Exception as e:
        print(f"Unexpected error: {e}")
        return None
# Usage
result = safe_pdf_processing("example.pdf")
if result:
    print(result)
Performance Optimization
Handling Large Files
from PyPDF2 import PdfReader, PdfWriter
import gc
def process_large_pdf(input_path, output_path):
    """Copy a large PDF in batches with progress output"""
    reader = PdfReader(input_path)
    writer = PdfWriter()
    # Add pages in batches and report progress after each one
    batch_size = 10
    total_pages = len(reader.pages)
    for i in range(0, total_pages, batch_size):
        batch_end = min(i + batch_size, total_pages)
        for page_num in range(i, batch_end):
            page = reader.pages[page_num]
            writer.add_page(page)
        # Collect garbage between batches; note that the writer still
        # references every added page until write() is called
        gc.collect()
        print(f"Processed pages: {batch_end}/{total_pages}")
    with open(output_path, "wb") as output_file:
        writer.write(output_file)
Comparison with Other Libraries
| Library | Read | Write | Encrypt | Watermark | Text Extraction | OCR |
|---|---|---|---|---|---|---|
| PyPDF2 | ✓ | ✓ | ✓ | ✓ | Partial | ✗ |
| PyMuPDF | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| PDFPlumber | ✓ | ✗ | ✗ | ✗ | ✓ (excellent) | ✗ |
| ReportLab | ✗ | ✓ | ✓ | ✓ | ✗ | ✗ |
| pdfminer | ✓ | ✗ | ✗ | ✗ | ✓ (excellent) | ✗ |
Best Practices
Usage Recommendations
- Always use context managers when working with files
- Check encryption before processing files
- Handle exceptions to improve stability
- Release resources after using PdfMerger
- Employ batch processing for large files
Optimal Code Examples
from PyPDF2 import PdfReader, PdfWriter
from contextlib import contextmanager
@contextmanager
def pdf_processor(input_path, output_path):
    """Context manager that saves the writer's pages when the block succeeds"""
    reader = PdfReader(input_path)
    writer = PdfWriter()
    yield reader, writer
    # Reached only if the with-block raised no exception,
    # so a failed run does not leave a partial output file
    with open(output_path, "wb") as output_file:
        writer.write(output_file)
# Usage
with pdf_processor("input.pdf", "output.pdf") as (reader, writer):
    for page in reader.pages:
        writer.add_page(page)
Practical Application Examples
Automating Document Workflows
from PyPDF2 import PdfReader, PdfWriter, PdfMerger
import os
from datetime import datetime
def create_report_package(reports_dir, output_path):
    """Create a report package with a title page"""
    merger = PdfMerger()
    # Add title page
    merger.append("title_page.pdf")
    # Sort and add reports
    report_files = sorted([f for f in os.listdir(reports_dir) if f.endswith('.pdf')])
    for report_file in report_files:
        report_path = os.path.join(reports_dir, report_file)
        merger.append(report_path)
    # Save the result
    merger.write(output_path)
    merger.close()
    print(f"Report package created: {output_path}")
# Usage
create_report_package("/path/to/reports", "monthly_reports.pdf")
Processing Invoices and Contracts
import os
from datetime import datetime
from PyPDF2 import PdfReader, PdfWriter
def process_invoices(invoice_dir, processed_dir):
    """Process invoices: add stamps and archive"""
    stamp = PdfReader("approved_stamp.pdf")
    os.makedirs(processed_dir, exist_ok=True)  # Make sure the output folder exists
    for filename in os.listdir(invoice_dir):
        if filename.endswith('.pdf'):
            input_path = os.path.join(invoice_dir, filename)
            output_path = os.path.join(processed_dir, f"processed_{filename}")
            reader = PdfReader(input_path)
            writer = PdfWriter()
            for page in reader.pages:
                # Add stamp to each page
                page.merge_page(stamp.pages[0])
                writer.add_page(page)
            # Add metadata
            writer.add_metadata({
                "/ProcessedDate": datetime.now().isoformat(),
                "/Status": "Approved"
            })
            with open(output_path, "wb") as output_file:
                writer.write(output_file)
Frequently Asked Questions and Solutions
Why does PyPDF2 extract text poorly from some PDFs?
PyPDF2 works well with programmatically generated PDFs but has limitations with scanned documents or files that use complex layouts. For such cases, using PDFPlumber or PyMuPDF is recommended.
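One way to decide when to switch tools is to check how much text `extract_text()` actually returned. The sketch below flags pages whose extracted text is nearly empty, which often indicates a scanned image. `find_sparse_pages` and the 20-character threshold are illustrative assumptions, not part of PyPDF2.

```python
from typing import Iterable, List, Optional


def find_sparse_pages(page_texts: Iterable[Optional[str]], min_chars: int = 20) -> List[int]:
    """Return 1-based numbers of pages with fewer than `min_chars` non-space characters."""
    sparse = []
    for number, text in enumerate(page_texts, start=1):
        # Drop all whitespace so blank lines and padding do not count
        visible = "".join((text or "").split())
        if len(visible) < min_chars:
            sparse.append(number)
    return sparse


texts = ["Quarterly report for the finance department", "", "  \n ", None]
print(find_sparse_pages(texts))  # [2, 3, 4]
```

With PyPDF2, the input could be a generator such as `(page.extract_text() for page in reader.pages)`.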
How can I handle an encrypted PDF without a password?
Decryption without a password is impossible. PyPDF2 provides the decrypt() method, which returns a result code:
- 0: failure
- 1: success with user password
- 2: success with owner password
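A small sketch that turns these result codes into readable messages. It assumes the value returned by `decrypt()` compares equal to the integers listed above; `describe_decrypt_result` is a hypothetical helper, not a PyPDF2 function.

```python
def describe_decrypt_result(result: int) -> str:
    """Map a decrypt() result code to a human-readable message."""
    messages = {
        0: "Decryption failed: wrong password",
        1: "Decrypted with the user password",
        2: "Decrypted with the owner password",
    }
    return messages.get(int(result), f"Unknown result code: {int(result)}")


for code in (0, 1, 2):
    print(code, "->", describe_decrypt_result(code))
```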
Can I edit existing text in a PDF?
PyPDF2 does not support editing existing text. The library can only add new elements, merge pages, and apply transformations.
How can I improve text extraction quality?
To improve extraction quality, consider:
- Combining PyPDF2 with PDFPlumber
- Using OCR libraries for scanned documents
- Pre‑processing PDFs to remove artifacts
Does PyPDF2 support PDF forms?
PyPDF2 has limited support for PDF forms. For full form handling, specialized libraries are recommended.
Conclusion
PyPDF2 is a powerful and versatile tool for working with PDF documents in Python. Despite some limitations in text extraction and form handling, the library excels at core PDF tasks such as merging, splitting, encrypting, and adding watermarks.
The library is especially well suited for:
- Automating routine document operations
- Building document workflow systems
- Batch processing of PDF files
- Integrating with web applications
When choosing between PyPDF2 and alternatives, consider the specific requirements of your project: for simple PDF operations, PyPDF2 is the optimal choice, while complex text extraction may benefit from specialized tools combined with PyPDF2.