What are regular expressions in Python
Regular expressions (regex) in Python are a powerful tool for working with text. They let you search, extract, validate, and replace text based on predefined patterns. The re module in the standard library provides a complete set of functions for working with regular expressions.
Syntax of regular expressions
Special characters
| Symbol | Description |
|---|---|
| `.` | Any character except a newline (`\n`) |
| `^` | Start of the string (or of each line with `re.MULTILINE`) |
| `$` | End of the string (or of each line with `re.MULTILINE`) |
| `*` | Zero or more repetitions of the preceding element |
| `+` | One or more repetitions of the preceding element |
| `?` | Zero or one repetition of the preceding element |
| `{n}` | Exactly n repetitions of the preceding element |
| `{n,}` | At least n repetitions of the preceding element |
| `{n,m}` | From n to m repetitions of the preceding element |
| `\` | Escapes a special character |
| `[]` | Character set |
| `\|` | Alternation: matches the expression on either side |
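The symbols in the table can be tried out directly with re.findall and re.search. The following is a minimal sketch; the sample strings and variable names are invented for illustration.

```python
import re

sample = 'a ab abb'

# Quantifiers apply to the element immediately before them (here, the letter b)
star = re.findall(r'ab*', sample)        # ['a', 'ab', 'abb']
plus = re.findall(r'ab+', sample)        # ['ab', 'abb']
optional = re.findall(r'ab?', sample)    # ['a', 'ab', 'ab']

# {n,m} is greedy: it takes up to three digits at a time
counted = re.findall(r'\d{2,3}', '1 22 333 4444')  # ['22', '333', '444']

# Character set, alternation, and escaping
vowels = re.findall(r'[aeiou]', 'regex')           # ['e', 'e']
pets = re.findall(r'cat|dog', 'cat, dog, bird')    # ['cat', 'dog']
dots = re.findall(r'\.', 'a.b.c')                  # ['.', '.']

# Anchors
starts = re.search(r'^Hello', 'Hello world') is not None  # True
ends = re.search(r'world$', 'Hello world') is not None    # True

print(star, plus, optional, counted, vowels, pets, dots, starts, ends)
```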
Special sequences
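| Sequence | Description |
|---|---|
| `\d` | Digit (0-9) |
| `\D` | Any character that is not a digit |
| `\w` | Word character: letter, digit, or underscore (a-z, A-Z, 0-9, _) |
| `\W` | Any character that is not a word character |
| `\s` | Whitespace character (space, tab, newline) |
| `\S` | Non-whitespace character |
| `\b` | Word boundary |
| `\B` | Not a word boundary |

For str patterns these classes are Unicode-aware by default; pass re.ASCII to restrict them to the ASCII ranges shown.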
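A short sketch of the special sequences in action. The sample strings are made up for illustration, and the results assume plain ASCII input.

```python
import re

sample = 'Room 42, floor_3: open 24h'

digits = re.findall(r'\d+', sample)   # ['42', '3', '24']
words = re.findall(r'\w+', sample)    # ['Room', '42', 'floor_3', 'open', '24h']

# \D removes everything that is not a digit
only_digits = re.sub(r'\D', '', '+7 (123) 456')   # '7123456'

# \s+ treats any run of spaces, tabs, or newlines as one separator
parts = re.split(r'\s+', 'a  b\tc\nd')            # ['a', 'b', 'c', 'd']

# \b matches "cat" only as a whole word; \B only inside a longer word
whole = re.findall(r'\bcat\b', 'cat concat cats cat.')  # ['cat', 'cat']
inside = re.findall(r'\Bcat', 'cat concat')             # ['cat'] (from "concat")

print(digits, words, only_digits, parts, whole, inside)
```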
Grouping and modifiers
| Element | Description |
|---|---|
| `()` | Groups part of a pattern and captures the matched text |
| `\1`, `\2`, ... | Backreferences to captured groups |
| `re.IGNORECASE` (or `re.I`) | Ignore case when matching |
| `re.MULTILINE` (or `re.M`) | Make `^` and `$` match at the start and end of each line |
| `re.DOTALL` (or `re.S`) | Make `.` match any character, including a newline |
| `re.VERBOSE` (or `re.X`) | Allow whitespace and comments inside the pattern |
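The sketch below illustrates a backreference and each of the four flags. The sample strings are invented, and the variable names are for illustration only.

```python
import re

# Backreference: \1 must repeat exactly what group 1 captured
repeated = re.findall(r'\b(\w+) \1\b', 'this is is a test test')  # ['is', 'test']

# re.IGNORECASE: case-insensitive matching
any_case = re.findall(r'python', 'Python PYTHON python', re.IGNORECASE)

# re.MULTILINE: ^ matches at the start of every line, not just the string
first_words = re.findall(r'^\w+', 'one two\nthree four', re.MULTILINE)

# re.DOTALL: . also matches a newline
without_dotall = re.search(r'a.b', 'a\nb')           # None
with_dotall = re.search(r'a.b', 'a\nb', re.DOTALL)   # Match object

# re.VERBOSE: whitespace and comments inside the pattern are ignored
date_pattern = re.compile(r'''
    (\d{4})   # year
    -
    (\d{2})   # month
''', re.VERBOSE)
year_month = date_pattern.search('Released 2023-12').groups()

print(repeated, any_case, first_words, year_month)
```

Flags can be combined with the `|` operator, for example `re.IGNORECASE | re.MULTILINE`.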
Basic methods of the re module
1. re.compile(pattern, flags=0)
Compiles a regular expression into a Pattern object for reuse.
import re
# Compiling a regular expression to search for numbers
pattern = re.compile(r'\d+')
text = 'Price: 1500 rubles'
result = pattern.search(text)
print(result.group()) # Output: 1500
2. re.search(pattern, string, flags=0)
Searches for the first regular expression match in a string.
import re
text = 'Phone: +7 (123) 456-78-90'
match = re.search(r'\+\d{1,3}', text)
if match:
    print(match.group())  # Output: +7
3. re.match(pattern, string, flags=0)
Checks for a match only at the beginning of the string; unlike re.search, it does not scan ahead.
import re
text = '2023-12-25 - date of the event'
match = re.match(r'\d{4}-\d{2}-\d{2}', text)
if match:
    print(match.group())  # Output: 2023-12-25
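The difference between re.match, re.search, and re.fullmatch is easiest to see side by side. A minimal sketch with an invented sample string:

```python
import re

text = 'ID 2023-12-25'

# re.match only tries position 0, where the string starts with "ID"
at_start = re.match(r'\d{4}', text)             # None

# re.search scans forward until it finds a match
anywhere = re.search(r'\d{4}', text).group()    # '2023'

# re.fullmatch requires the whole string to match the pattern
date_re = r'\d{4}-\d{2}-\d{2}'
exact = re.fullmatch(date_re, '2023-12-25')            # Match object
trailing = re.fullmatch(date_re, '2023-12-25 extra')   # None

print(at_start, anywhere, exact is not None, trailing)
```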
4. re.findall(pattern, string, flags=0)
Finds all non-overlapping matches and returns them as a list.
import re
text = 'Email: user@example.com, admin@site.ru'
emails = re.findall(r'\w+@\w+\.\w+', text)
print(emails)  # Output: ['user@example.com', 'admin@site.ru']
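One detail worth knowing: when the pattern contains capturing groups, re.findall returns the groups rather than the whole match. The sketch below reuses the sample text from the example above; the variable names are illustrative.

```python
import re

text = 'Email: user@example.com, admin@site.ru'

# Two capturing groups: each result is a (user, domain) tuple
pairs = re.findall(r'(\w+)@(\w+\.\w+)', text)
print(pairs)  # [('user', 'example.com'), ('admin', 'site.ru')]

# A non-capturing group (?:...) groups without changing the result shape
full = re.findall(r'(?:\w+)@\w+\.\w+', text)
print(full)   # ['user@example.com', 'admin@site.ru']
```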
5. re.finditer(pattern, string, flags=0)
Returns an iterator of Match objects for all matches.
import re
text = 'Prices: 100p, 250p, 500p'
matches = re.finditer(r'\d+', text)
for match in matches:
    print(f'Number found: {match.group()} at position {match.start()}')
6. re.sub(pattern, repl, string, count=0, flags=0)
Replaces all matches with the specified replacement.
import re
text = 'Date: 25.12.2023'
# Convert the date from DD.MM.YYYY to YYYY-MM-DD
new_text = re.sub(r'(\d{2})\.(\d{2})\.(\d{4})', r'\3-\2-\1', text)
print(new_text) # Output: Date: 2023-12-25
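Beyond a replacement string, re.sub also accepts a count limit and a function as the replacement, and re.subn reports how many substitutions were made. A brief sketch with an invented input string:

```python
import re

text = 'a1 b22 c333'

# count=1 replaces only the first match
first_only = re.sub(r'\d+', '#', text, count=1)   # 'a# b22 c333'

# A function receives each Match object and returns the replacement text
doubled = re.sub(r'\d+', lambda m: str(int(m.group()) * 2), text)  # 'a2 b44 c666'

# re.subn returns the new string together with the number of replacements
masked, replaced = re.subn(r'\d+', '#', text)     # ('a# b# c#', 3)

print(first_only, doubled, masked, replaced)
```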
7. re.split(pattern, string, maxsplit=0, flags=0)
Splits a string using a regular expression.
import re
text = 'apples,pears;oranges:tangerines'
fruits = re.split(r'[,;:]', text)
print(fruits) # Output: ['apples', 'pears', 'oranges', 'tangerines']
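Two variations on re.split that often come up: limiting the number of splits with maxsplit, and keeping the delimiters by wrapping the pattern in a capturing group. A minimal sketch with made-up input:

```python
import re

# maxsplit=2 stops after two splits; the rest of the string stays intact
limited = re.split(r'[,;:]', 'a,b;c:d', maxsplit=2)   # ['a', 'b', 'c:d']

# A capturing group keeps the delimiters in the result
with_delims = re.split(r'([,;:])', 'a,b;c')           # ['a', ',', 'b', ';', 'c']

# Whitespace around the delimiter can be absorbed by the pattern
trimmed = re.split(r'\s*,\s*', 'a , b,c')             # ['a', 'b', 'c']

print(limited, with_delims, trimmed)
```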
Practical usage examples
Validation of email addresses
import re
def validate_email(email):
    pattern = r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'
    return re.match(pattern, email) is not None
print(validate_email('user@example.com')) # True
print(validate_email('invalid-email')) # False
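One subtlety of the version above: `$` also matches just before a trailing newline, so a string such as 'user@example.com\n' passes. A possible stricter variant, sketched here with re.fullmatch, is shown below; the names EMAIL_RE and validate_email_strict are hypothetical, and the pattern is still a simplification rather than a full address validator.

```python
import re

# Same character classes as above; anchors are unnecessary with fullmatch
EMAIL_RE = re.compile(r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}')

def validate_email_strict(email):
    """Hypothetical helper: True only if the whole string is an address."""
    return EMAIL_RE.fullmatch(email) is not None

# The anchored re.match version accepts a trailing newline...
anchored = r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'
loose = re.match(anchored, 'user@example.com\n') is not None   # True

# ...while fullmatch rejects it
strict = validate_email_strict('user@example.com\n')           # False

print(loose, strict, validate_email_strict('user@example.com'))
```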
Extracting phone numbers
import re
text = '''
Contacts:
+7 (123) 456-78-90
8-800-555-35-35
+7 987 654 32 10
'''
phone_pattern = r'\+?[78][-\s]?(?:\(\d{3}\)|\d{3})[-\s]?\d{3}[-\s]?\d{2}[-\s]?\d{2}'
phones = re.findall(phone_pattern, text)
print(phones)
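Once the numbers are extracted, a common follow-up step is to normalize them to digits only. The sketch below reuses the pattern from above; the normalize helper and the sample sentence are assumptions added for illustration.

```python
import re

# Same pattern as in the extraction example
phone_pattern = r'\+?[78][-\s]?(?:\(\d{3}\)|\d{3})[-\s]?\d{3}[-\s]?\d{2}[-\s]?\d{2}'
text = 'Call +7 (123) 456-78-90 or 8-800-555-35-35'

def normalize(phone):
    """Hypothetical helper: strip everything except digits."""
    return re.sub(r'\D', '', phone)

normalized = [normalize(p) for p in re.findall(phone_pattern, text)]
print(normalized)  # ['71234567890', '88005553535']
```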
Working with HTML tags
import re
html = '<div><p>This is a paragraph</p></div>'
# Remove all HTML tags
clean_text = re.sub(r'<[^>]+>', '', html)
print(clean_text)  # Output: This is a paragraph
Search for words of a certain length
import re
text = 'Python is a powerful programming language'
# Find all words from 5 to 8 characters long
long_words = re.findall(r'\b\w{5,8}\b', text)
print(long_words)  # Output: ['Python', 'powerful', 'language']
Capture groups and named groups
import re
text = 'Date of birth: 15.03.1990'
# Using capture groups
match = re.search(r'(\d{2})\.(\d{2})\.(\d{4})', text)
if match:
    day, month, year = match.groups()
    print(f'Day: {day}, Month: {month}, Year: {year}')
# Named groups
pattern = r'(?P<day>\d{2})\.(?P<month>\d{2})\.(?P<year>\d{4})'
match = re.search(pattern, text)
if match:
    print(f'Year: {match.group("year")}')
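Named groups can also be read all at once with groupdict() and referenced in a replacement string as `\g<name>`. A minimal sketch, assuming a date written as DD.MM.YYYY:

```python
import re

text = 'Date of birth: 15.03.1990'
pattern = r'(?P<day>\d{2})\.(?P<month>\d{2})\.(?P<year>\d{4})'

# groupdict() returns every named group as a dictionary
fields = re.search(pattern, text).groupdict()
print(fields)  # {'day': '15', 'month': '03', 'year': '1990'}

# \g<name> refers to a named group inside the replacement string
iso = re.sub(pattern, r'\g<year>-\g<month>-\g<day>', text)
print(iso)     # Date of birth: 1990-03-15
```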
Optimization tips
1. Compile regular expressions for reuse
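import re
texts = ['order 15', 'no digits here', 'room 204']  # sample input
# Less efficient: the pattern is looked up in the re module's cache on every call
for text in texts:
    re.search(r'\d+', text)
# More efficient: compile once, then reuse the Pattern object
pattern = re.compile(r'\d+')
for text in texts:
    pattern.search(text)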
2. Use raw strings
# Preferred: a raw string keeps backslashes exactly as written
pattern = r'\d+\.\d+'
# Equivalent, but harder to read: a regular string needs doubled backslashes
pattern = '\\d+\\.\\d+'
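The reason raw strings matter is that some regex sequences are also Python string escapes. For example, `\b` in a regular string is the backspace character, so the pattern silently stops meaning "word boundary". A short sketch:

```python
import re

# In a regular string, '\b' is a single backspace character (0x08)
plain_len = len('\b')    # 1
raw_len = len(r'\b')     # 2: a backslash followed by the letter b

# The non-raw pattern looks for literal backspace characters and finds nothing
broken = re.findall('\bcat\b', 'a cat sat')    # []
working = re.findall(r'\bcat\b', 'a cat sat')  # ['cat']

# A doubled backslash in a regular string equals the raw form
same = '\\d+' == r'\d+'  # True

print(plain_len, raw_len, broken, working, same)
```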