How to use regular expressions in Python?

Introduction to Python Regular Expressions

Working with text is one of the most common tasks in programming, and regular expressions are a standard tool for it. Typical uses include extracting data from HTML pages, validating data formats, and processing logs.

In Python, regular expressions are handled by the built-in re module. This guide covers its most commonly used functions, including re.search(), re.findall(), and re.sub(), and shows how to apply them in practical examples.

Regular Expression Basics

Definition and Purpose

Regular expressions (RegEx) are a specialized language for describing search patterns in text. They allow you to search, validate, and replace text fragments according to specific rules.

Typical Tasks for Regular Expressions

Regular expressions in Python solve a wide range of tasks:

  • Validation of email addresses and phone numbers
  • Finding all numeric values in text
  • Replacing unwanted characters and cleaning data
  • Extracting specific words or phrases from large texts
  • Parsing structured data
  • Processing logs and system files

Getting Started

Importing the re Module

Before working with regular expressions, import the re module from the standard library:

import re

Basic Symbols and Constructs

To work with regular expressions effectively, you need to know the basic symbols and what they match:

Symbol    Purpose
.         Any character except a newline
\d        Any digit (0-9)
\D        Any non-digit character
\w        Letter, digit, or underscore
\W        Any character not matched by \w
\s        Any whitespace character (space, tab, newline)
\S        Any non-whitespace character
^         Start of string (start of each line with re.MULTILINE)
$         End of string (end of each line with re.MULTILINE)
[]        One character from the specified set
*         Zero or more repetitions
+         One or more repetitions
{n,m}     From n to m repetitions
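The symbols in the table can be tried out directly. The following is a minimal sketch; the sample strings and variable names are made up for illustration.

```python
import re

sample = "Order 42: 3 apples, 15 pears\nShipped"

# \d+ matches each run of one or more digits
digits = re.findall(r"\d+", sample)

# \w+ matches runs of letters, digits, and underscores
words = re.findall(r"\w+", sample)

# [aeiou] matches a single character from the set
vowels = re.findall(r"[aeiou]", "regex")

# {2,3} limits the repetition count to two or three digits in a row
two_or_three = re.findall(r"\d{2,3}", "7 42 123 4567")

# . does not match a newline, so .* stops at the end of the first line
first_line = re.search(r"^.*", sample).group()

print(digits)        # ['42', '3', '15']
print(vowels)        # ['e', 'e']
print(two_or_three)  # ['42', '123', '456']
print(first_line)    # Order 42: 3 apples, 15 pears
```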

Main Methods for Working with Regular Expressions

The re.search() Method for Finding the First Match

The re.search() method searches for the first occurrence of a given pattern in a string. It returns a Match object upon successful search or None if no match is found.

import re

text = "Email: example@mail.com"
match = re.search(r'\w+@\w+\.\w+', text)

if match:
    print("Found email:", match.group())

Execution result:

Found email: example@mail.com

Explanation of the Search Pattern

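Let's break the pattern down into its parts:

  • \w+ - one or more letters, digits, or underscores (the part before @)
  • @ - the at symbol (a required element of an email address)
  • \w+ - the domain name
  • \. - a literal dot (escaped with a backslash, because an unescaped dot matches any character)
  • \w+ - the top-level domain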
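Because \w does not match dots or hyphens, the pattern above only covers simple addresses. The sketch below shows what it returns for a dotted address and compares it with a somewhat broader pattern. The broader pattern is an illustrative assumption, not a complete email validator.

```python
import re

text = "Contact: john.smith@mail.example.com"

# The simple pattern starts matching only after the dot in "john.smith"
# and stops before the second dot of the domain
simple_match = re.search(r"\w+@\w+\.\w+", text).group()

# A broader (still simplified) pattern: dots, plus signs, and hyphens are
# allowed before @, and the domain may consist of several dotted labels
broader_match = re.search(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+", text).group()

print(simple_match)   # smith@mail.example
print(broader_match)  # john.smith@mail.example.com
```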

The re.findall() Method for Finding All Matches

When you need to find all pattern matches in the text, the findall() method is used. It returns a list of all found matches.

text = "Prices: 100 rubles, 250 rubles, 350 rubles"
numbers = re.findall(r'\d+', text)
print(numbers)  # ['100', '250', '350']
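The return value of findall() depends on whether the pattern contains capturing groups. The short sketch below reuses the price string from the example and also shows re.finditer(), which yields Match objects when positions are needed.

```python
import re

text = "Prices: 100 rubles, 250 rubles, 350 rubles"

# No groups: a list of whole matches
whole = re.findall(r"\d+ rubles", text)

# One group: a list containing only the group's text
amounts = re.findall(r"(\d+) rubles", text)

# Several groups: a list of tuples, one tuple per match
pairs = re.findall(r"(\d+) (\w+)", text)

# finditer() yields Match objects, so start positions are available
positions = [m.start() for m in re.finditer(r"\d+", text)]

print(whole)    # ['100 rubles', '250 rubles', '350 rubles']
print(amounts)  # ['100', '250', '350']
print(pairs)    # [('100', 'rubles'), ('250', 'rubles'), ('350', 'rubles')]
```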

The re.sub() Method for Substitution by Pattern

The re.sub() method replaces all occurrences of the specified pattern with the given text. This is a powerful tool for processing and cleaning data.

text = "Phone: 123-456-7890"
new_text = re.sub(r'\d', 'X', text)
print(new_text)

Result:

Phone: XXX-XXX-XXXX

Practical Applications of re.sub()

The re.sub() method is especially useful in the following cases:

  • Replacing personal data for anonymization
  • Cleaning text from unnecessary characters
  • Formatting data to specific standards
  • Normalization of input data
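The cases listed above could look roughly like this. The sample data and the mask_card helper are hypothetical. The sketch relies on two documented features of re.sub(): backreferences such as \1 in the replacement string, and passing a function as the replacement.

```python
import re

# Formatting: reorder DD.MM.YYYY into YYYY-MM-DD using backreferences
iso = re.sub(r"(\d{2})\.(\d{2})\.(\d{4})", r"\3-\2-\1", "Released 09.05.2024")

# Anonymization: a replacement function receives each Match object
def mask_card(match):
    """Hide all but the last four digits of a matched number."""
    digits = match.group()
    return "*" * (len(digits) - 4) + digits[-4:]

masked = re.sub(r"\d{16}", mask_card, "Card 1234567812345678 charged")

# Normalization: collapse any run of whitespace into a single space
normalized = re.sub(r"\s+", " ", "too   many \t spaces\n here").strip()

print(iso)         # Released 2024-05-09
print(masked)      # Card ************5678 charged
print(normalized)  # too many spaces here
```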

Advanced Techniques

Grouping Elements

Grouping allows you to select individual parts of the found match. Groups are created using parentheses.

text = "Date: 2024-05-09"
match = re.search(r'(\d{4})-(\d{2})-(\d{2})', text)

if match:
    year, month, day = match.groups()
    print(f"Year: {year}, Month: {month}, Day: {day}")

Result:

Year: 2024, Month: 05, Day: 09
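Groups can also be given names with the (?P<name>...) syntax, which often reads better than positional indexes. A small sketch based on the same date string (the group names are chosen here for illustration):

```python
import re

text = "Date: 2024-05-09"
match = re.search(r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})", text)

# groupdict() maps each group name to the text it captured
parts = match.groupdict() if match else {}

if match:
    print(match.group("year"))  # 2024
    print(match.group(2))       # 05 (named groups still have numbers)
    print(match.group(0))       # 2024-05-09 (group 0 is the whole match)
```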

Using Flags

Flags allow you to modify the behavior of regular expressions and make the search more flexible.

Flag                    Purpose
re.IGNORECASE (re.I)    Ignore character case
re.MULTILINE (re.M)     ^ and $ also match at the start and end of each line
re.DOTALL (re.S)        The dot also matches a newline character

text = "Python is great.\nPYTHON is powerful."
matches = re.findall(r'python', text, re.IGNORECASE)
print(matches)  # ['Python', 'PYTHON']
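The other two flags can be illustrated the same way. The sketch below uses a made-up three-line string and also combines flags with the | operator.

```python
import re

text = "first line\nsecond line\nthird line"

# Without re.MULTILINE, ^ matches only at the very start of the string
first_only = re.findall(r"^\w+", text)

# With re.MULTILINE, ^ matches at the start of every line
every_line = re.findall(r"^\w+", text, re.MULTILINE)

# Without re.DOTALL, the dot cannot cross a newline, so there is no match
no_dotall = re.search(r"first.*third", text)

# With re.DOTALL, the dot matches newlines as well
with_dotall = re.search(r"first.*third", text, re.DOTALL).group()

# Flags are combined with the | operator
combined = re.findall(r"^SECOND.*$", text, re.I | re.M)

print(first_only)  # ['first']
print(every_line)  # ['first', 'second', 'third']
print(no_dotall)   # None
print(combined)    # ['second line']
```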

Compiling Patterns with re.compile()

If the same pattern is used many times, you can compile it once with re.compile() and reuse the resulting pattern object:

pattern = re.compile(r'\d+')
result = pattern.findall("123 456 789")
print(result)  # ['123', '456', '789']

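The re module already caches recently used patterns, so the speed gain is usually modest. Compilation pays off mainly when the same pattern is applied repeatedly in a loop over large amounts of data, and it gives the pattern a reusable name.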
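A compiled pattern object has the same methods as the module-level functions (search, findall, sub, and others). A minimal sketch with a hypothetical DATE_RE pattern applied to several made-up lines:

```python
import re

# Compile once, then reuse the pattern object everywhere
DATE_RE = re.compile(r"(\d{4})-(\d{2})-(\d{2})")

lines = [
    "2024-05-09 service started",
    "no date here",
    "backup finished 2024-05-10",
]

# Keep only the lines that contain a date
dated = [line for line in lines if DATE_RE.search(line)]

# Extract the year (group 1) from each of those lines
years = [DATE_RE.search(line).group(1) for line in dated]

# The same object can also perform substitutions
replaced = DATE_RE.sub("<date>", lines[2])

print(dated)     # ['2024-05-09 service started', 'backup finished 2024-05-10']
print(years)     # ['2024', '2024']
print(replaced)  # backup finished <date>
```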

Practical Examples of Use

Checking the Correctness of an Email Address

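email = "user@example.com"
# Simplified check: \w+ does not allow dots or hyphens in the address.
# re.fullmatch() requires the whole string to match the pattern.
if re.fullmatch(r'\w+@\w+\.\w+', email):
    print("Email is correct")
else:
    print("Email is incorrect")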

Cleaning HTML Tags from Text

html = "<p>Hello, world!</p>"
# Suitable for simple fragments; use an HTML parser for real-world pages
clean_text = re.sub(r'<.*?>', '', html)
print(clean_text)  # Hello, world!

Extracting Phone Numbers

text = "Contacts: +7-912-345-6789, +7-999-888-7766"
phones = re.findall(r'\+7-\d{3}-\d{3}-\d{4}', text)
print(phones)  # ['+7-912-345-6789', '+7-999-888-7766']

Validating IP Addresses

ip_pattern = r'^(?:(?:25[0-5]|2[0-4]\d|[01]?\d\d?)\.){3}(?:25[0-5]|2[0-4]\d|[01]?\d\d?)$'
ip_address = "192.168.1.1"

if re.match(ip_pattern, ip_address):
    print("IP address is correct")
else:
    print("IP address is incorrect")

Frequently Asked Questions

Difference Between re.search() and re.match()

The main difference lies in the search area:

  • re.match() searches for a match only at the beginning of the string
  • re.search() searches for a match anywhere in the string
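The difference is easy to see on a short string. The sketch below also includes re.fullmatch(), which succeeds only if the entire string matches the pattern.

```python
import re

text = "Error code: 404"

# match() fails because the string does not start with a digit
at_start = re.match(r"\d+", text)

# search() scans the whole string and finds the digits
anywhere = re.search(r"\d+", text).group()

# match() succeeds when the pattern fits at position 0
prefix = re.match(r"Error", text).group()

# fullmatch() requires the pattern to cover the entire string
whole_ok = re.fullmatch(r"\d+", "404")
whole_fail = re.fullmatch(r"\d+", "404 not found")

print(at_start)    # None
print(anywhere)    # 404
print(prefix)      # Error
print(whole_fail)  # None
```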

Replacing Only the First Occurrence

To replace only the first match, use the count parameter:

text = "foo bar foo bar"
result = re.sub('foo', 'baz', text, count=1)
print(result)  # baz bar foo bar

Avoiding Multiple Escaping

Use raw strings with the r prefix to simplify writing:

pattern = r'\d+'  # instead of '\\d+'
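Raw strings matter most for sequences that Python itself interprets, such as \b. In a normal string literal \b is a backspace character, while in a regular expression it means a word boundary. A short sketch:

```python
import re

text = "cat concat cat"

# In a normal string '\b' is one character (backspace), so nothing matches
without_raw = re.findall('\bcat\b', text)

# In a raw string the backslash reaches the regex engine as a word boundary
with_raw = re.findall(r'\bcat\b', text)

print(len('\b'), len(r'\b'))  # 1 2
print(without_raw)            # []
print(with_raw)               # ['cat', 'cat']
print(r'\d+' == '\\d+')       # True
```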

Greedy and Non-Greedy Quantifiers

Understanding the differences between greedy and non-greedy quantifiers is critical:

  • A greedy quantifier (*, +) captures the maximum possible number of characters
  • A non-greedy quantifier (*?, +?) captures the minimum possible number of characters

text = "<b>text</b><b>more</b>"
print(re.findall(r'<b>.*?</b>', text))  # ['<b>text</b>', '<b>more</b>']
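For comparison, here is a sketch of how the greedy version behaves on the same string, together with the lazy form of {n,m} and a negated character class that is often used as an alternative to a lazy quantifier.

```python
import re

text = "<b>text</b><b>more</b>"

# Greedy: .* runs to the last </b>, producing one long match
greedy = re.findall(r"<b>.*</b>", text)

# Non-greedy: .*? stops at the first </b> each time
lazy = re.findall(r"<b>.*?</b>", text)

# A negated character class gives the same result without a lazy quantifier
negated = re.findall(r"<b>[^<]*</b>", text)

# The ? suffix works on {n,m} as well
most_digits = re.search(r"\d{2,4}", "123456").group()
fewest_digits = re.search(r"\d{2,4}?", "123456").group()

print(greedy)         # ['<b>text</b><b>more</b>']
print(negated)        # ['<b>text</b>', '<b>more</b>']
print(most_digits)    # 1234
print(fewest_digits)  # 12
```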

Working with Files

Regular expressions are effective when processing files, for example, for filtering logs:

with open('log.txt') as f:
    for line in f:
        if re.search(r'ERROR', line):
            print(line.strip())
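When log lines have a regular structure, groups can pull out individual fields instead of just filtering. The sketch below assumes a hypothetical "date time LEVEL message" format and uses io.StringIO in place of a real file so that it runs on its own.

```python
import io
import re
from collections import Counter

# Stand-in for open('log.txt'); the log format is an assumption
log_file = io.StringIO(
    "2024-05-09 12:00:01 INFO service started\n"
    "2024-05-09 12:00:05 ERROR disk quota exceeded\n"
    "2024-05-09 12:00:09 WARNING retrying upload\n"
    "2024-05-09 12:01:12 ERROR connection timed out\n"
)

# Groups: timestamp (date and time), level, message
LINE_RE = re.compile(r"^(\S+ \S+) (INFO|WARNING|ERROR) (.*)$")

counts = Counter()
errors = []
for line in log_file:
    m = LINE_RE.match(line.rstrip("\n"))
    if m is None:
        continue  # skip lines that do not follow the assumed format
    timestamp, level, message = m.groups()
    counts[level] += 1
    if level == "ERROR":
        errors.append((timestamp, message))

print(dict(counts))  # {'INFO': 1, 'ERROR': 2, 'WARNING': 1}
for timestamp, message in errors:
    print(timestamp, message)
```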

Replacing Spaces with Commas

text = "apple banana orange"
new_text = re.sub(r'\s+', ',', text)
print(new_text)  # apple,banana,orange

Performance Optimization

Recommendations for Effective Use

When working with large amounts of data, consider several important aspects:

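  • Compile frequently used patterns
  • Use specific patterns (for example, \d{4} or [^<]*) instead of general ones such as .*
  • Avoid unnecessary capturing groups; use non-capturing groups (?:...) when the captured text is not needed
  • Use anchors (^ and $) when the position of the match is known, so the engine does not have to try every position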
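Two of these recommendations also affect the results, not only the speed. The following sketch uses made-up strings to show a non-capturing group and a specific character class.

```python
import re

text = "see https://example.com and ftp://files.example.org for details"

# A capturing group makes findall() return only the group, not the whole URL
captured = re.findall(r"(https?|ftp)://\S+", text)

# A non-capturing group (?:...) groups the alternatives without storing them
urls = re.findall(r"(?:https?|ftp)://\S+", text)

# A general .* runs to the last bracket; a specific class stops at the first
loose = re.findall(r"\[.*\]", "[a] and [b]")
tight = re.findall(r"\[[^\]]*\]", "[a] and [b]")

print(captured)  # ['https', 'ftp']
print(urls)      # ['https://example.com', 'ftp://files.example.org']
print(loose)     # ['[a] and [b]']
print(tight)     # ['[a]', '[b]']
```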

Profiling Regular Expressions

To analyze the performance of various approaches, use the time module:

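import re
import time

pattern = re.compile(r'\d+')
large_text = "item 123, item 456; " * 100_000  # sample data for the measurement

# Measure execution time with a high-resolution clock
start_time = time.perf_counter()
results = pattern.findall(large_text)
end_time = time.perf_counter()

print(f"Execution time: {end_time - start_time:.4f} seconds")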
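A single measurement can be noisy. As an alternative, the standard timeit module runs a statement several times and reports the total. The sketch below uses made-up sample text, and the run count of 20 is an arbitrary choice.

```python
import re
import timeit

large_text = "item 123 and item 4567 " * 10_000
pattern = re.compile(r"\d+")

# Run findall() 20 times and take the total elapsed time
runs = 20
seconds = timeit.timeit(lambda: pattern.findall(large_text), number=runs)

match_count = len(pattern.findall(large_text))
print(f"{match_count} matches, {seconds / runs:.6f} seconds per run on average")
```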

Conclusion

Regular expressions are a powerful tool for processing text in Python. Mastering re.search(), re.findall(), and re.sub(), together with the correct use of flags, lets you handle a wide range of string processing tasks.

The key to successful application of regular expressions is constant practice. Experiment with different patterns, analyze the results, and gradually develop your skills. Understanding the principles of regular expressions opens up wide opportunities for automating the processing of textual data in your projects.
