What Is Scrapy and Why You Need It
Scrapy is a powerful asynchronous web-scraping framework written in Python that provides high-performance, scalable data collection from websites. It automates the extraction of structured information from web pages, making it indispensable for data analysis, price monitoring, news parsing, and many other tasks.
Key Features and Benefits of Scrapy
Asynchronous Architecture
Scrapy is built on the Twisted library, enabling asynchronous request handling and high performance when working with a large number of web pages. This allows you to process many requests concurrently without blocking the program.
Flexible Data Extraction
The framework supports various data extraction methods:
- CSS selectors for simple element selection
- XPath selectors for more complex queries
- Regular expressions for precise data extraction
- Support for JSON and XML formats
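Scrapy's own selectors (built on the parsel library) put these methods behind one API. The underlying ideas can be sketched with the standard library alone; the HTML snippet and field names below are purely illustrative:

```python
import re
import xml.etree.ElementTree as ET

html = '<html><body><h1>Widget</h1><span class="price">$19.99</span></body></html>'

# XPath-style query: ElementTree supports a limited XPath subset,
# including attribute predicates like [@class="price"]
root = ET.fromstring(html)
title = root.find('.//h1').text
price_text = root.find('.//span[@class="price"]').text

# Regular expression for precise extraction of the numeric part
price = float(re.search(r'[\d.]+', price_text).group())

print(title, price)  # Widget 19.99
```

In a real spider, the equivalent calls would be `response.xpath(...)`, `response.css(...)`, and the selector's `.re_first()` method.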
Scalability and Performance
Scrapy is designed for projects of any size—from small scripts to large‑scale industrial solutions. Built‑in caching, parallel processing, and memory‑management mechanisms ensure stable operation even when handling millions of pages.
Extensibility via Middleware
The middleware system makes it easy to extend the framework’s functionality by adding custom request/response handling logic, header management, proxy support, and more.
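As a sketch of the idea, a minimal downloader middleware might rotate the User-Agent header on every outgoing request. The class name and header values below are illustrative; in a real project the class would live in middlewares.py and be enabled via DOWNLOADER_MIDDLEWARES:

```python
import random

class RotateUserAgentMiddleware:
    """Illustrative downloader middleware: picks a User-Agent per request."""
    USER_AGENTS = [
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36',
    ]

    def process_request(self, request, spider):
        # Scrapy calls this hook for every outgoing request;
        # returning None lets processing continue normally
        request.headers['User-Agent'] = random.choice(self.USER_AGENTS)
        return None
```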
Installation and Configuration of Scrapy
Installing the Framework
pip install scrapy
Scrapy pulls in its core dependencies (Twisted, lxml, parsel) automatically, so no extra packages are required; installing inside a virtual environment keeps them isolated from other projects.
Creating a New Project
scrapy startproject myproject
cd myproject
Generating a Spider
scrapy genspider example example.com
Scrapy Project Structure
After creating a project, the following directory layout is generated:
myproject/
├── myproject/
│   ├── __init__.py
│   ├── items.py          # Data structure definitions
│   ├── middlewares.py    # Middleware for request/response processing
│   ├── pipelines.py      # Data processing and storage
│   ├── settings.py       # Project configuration
│   └── spiders/          # Directory with spiders
│       ├── __init__.py
│       └── example.py
└── scrapy.cfg            # Project configuration file
Main File Purposes
- spiders/ – contains parsing logic and data extraction
- items.py – defines the structure of extracted data
- pipelines.py – processes and stores extracted data
- middlewares.py – configures request and response handling
- settings.py – holds all project settings
Creating and Configuring a Spider
Basic Spider
import scrapy

class ExampleSpider(scrapy.Spider):
    name = "example"
    allowed_domains = ["example.com"]
    start_urls = ["https://example.com"]

    def parse(self, response):
        # Extract data
        title = response.css('title::text').get()
        # Return data
        yield {
            'title': title,
            'url': response.url,
            'status': response.status
        }
Advanced Spider with Navigation
import scrapy

class AdvancedSpider(scrapy.Spider):
    name = "advanced"
    start_urls = ["https://example.com/catalog"]

    def parse(self, response):
        # Extract product links
        product_links = response.css('a.product-link::attr(href)').getall()
        for link in product_links:
            yield response.follow(link, callback=self.parse_product)
        # Go to next page
        next_page = response.css('a.next::attr(href)').get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)

    def parse_product(self, response):
        yield {
            'name': response.css('h1::text').get(),
            'price': response.css('.price::text').get(),
            'description': response.css('.description::text').get(),
            'url': response.url
        }
Data Extraction Methods
CSS Selectors
# Get the first element
title = response.css('h1::text').get()
# Get all elements
links = response.css('a::attr(href)').getall()
# Get cleaned text (a default avoids AttributeError when nothing matches)
clean_text = response.css('p::text').get(default='').strip()
XPath Selectors
# Search by attribute
price = response.xpath('//span[@class="price"]/text()').get()
# Complex queries
items = response.xpath('//div[contains(@class, "item")]//h2/text()').getall()
# Conditional expressions
discount = response.xpath('//span[contains(text(), "Discount")]/following-sibling::span/text()').get()
Combining Methods
def parse_product(self, response):
    # Use CSS for main elements
    name = response.css('h1::text').get()
    # Use XPath for complex queries
    features = response.xpath('//div[@class="features"]//li/text()').getall()
    # Process data
    yield {
        'name': name.strip() if name else None,
        'features': [f.strip() for f in features if f.strip()],
        'url': response.url
    }
Working with Items and ItemLoader
Defining Data Structure
# items.py
import scrapy

class ProductItem(scrapy.Item):
    name = scrapy.Field()
    price = scrapy.Field()
    description = scrapy.Field()
    images = scrapy.Field()
    availability = scrapy.Field()
    category = scrapy.Field()
Using ItemLoader
from scrapy.loader import ItemLoader
from itemloaders.processors import TakeFirst, MapCompose
from myproject.items import ProductItem

class ProductLoader(ItemLoader):
    default_item_class = ProductItem
    default_output_processor = TakeFirst()
    # Strip the currency symbol and thousands separators before converting
    price_in = MapCompose(lambda x: x.replace('$', '').replace(',', ''), float)
    description_in = MapCompose(str.strip)

# Using it in a spider
def parse_product(self, response):
    loader = ProductLoader(response=response)
    loader.add_css('name', 'h1::text')
    loader.add_css('price', '.price::text')
    loader.add_css('description', '.description::text')
    return loader.load_item()
Pipeline System for Data Processing
Basic Pipeline
# pipelines.py
import json

class JsonPipeline:
    def open_spider(self, spider):
        self.file = open('items.json', 'w', encoding='utf-8')
        self.file.write('[\n')
        self.first_item = True

    def close_spider(self, spider):
        self.file.write('\n]')
        self.file.close()

    def process_item(self, item, spider):
        if not self.first_item:
            self.file.write(',\n')
        else:
            self.first_item = False
        line = json.dumps(dict(item), ensure_ascii=False, indent=2)
        self.file.write(line)
        return item
Data Cleaning Pipeline
class CleanDataPipeline:
    def process_item(self, item, spider):
        # Clean price
        if item.get('price'):
            item['price'] = item['price'].replace('$', '').replace(',', '')
            try:
                item['price'] = float(item['price'])
            except ValueError:
                item['price'] = None
        # Clean description
        if item.get('description'):
            item['description'] = item['description'].strip()
        return item
Database Storage Pipeline
import sqlite3

class DatabasePipeline:
    def open_spider(self, spider):
        self.connection = sqlite3.connect('items.db')
        self.cursor = self.connection.cursor()
        self.cursor.execute('''
            CREATE TABLE IF NOT EXISTS products (
                id INTEGER PRIMARY KEY,
                name TEXT,
                price REAL,
                description TEXT,
                url TEXT
            )
        ''')
        self.connection.commit()

    def close_spider(self, spider):
        self.connection.close()

    def process_item(self, item, spider):
        self.cursor.execute('''
            INSERT INTO products (name, price, description, url)
            VALUES (?, ?, ?, ?)
        ''', (
            item.get('name'),
            item.get('price'),
            item.get('description'),
            item.get('url')
        ))
        self.connection.commit()
        return item
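Another pipeline that commonly sits alongside these is a duplicates filter. The sketch below tracks seen URLs in memory; in a real project you would raise scrapy.exceptions.DropItem instead of the stand-in ValueError used here so the example stays self-contained:

```python
class DuplicatesPipeline:
    """Drops items whose 'url' field has already been seen."""

    def __init__(self):
        self.seen_urls = set()

    def process_item(self, item, spider):
        url = item.get('url')
        if url in self.seen_urls:
            # Stand-in for scrapy.exceptions.DropItem
            raise ValueError(f"Duplicate item: {url}")
        self.seen_urls.add(url)
        return item
```

Register it in ITEM_PIPELINES with a priority below the storage pipeline so duplicates never reach the database.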
Project Settings and Configuration
Core Settings in settings.py
# Basic settings
BOT_NAME = 'myproject'
SPIDER_MODULES = ['myproject.spiders']
NEWSPIDER_MODULE = 'myproject.spiders'
# Respect robots.txt (set to False only if you understand the implications)
ROBOTSTXT_OBEY = True
# User-Agent
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
# Default request headers
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.9',
    'Accept-Encoding': 'gzip, deflate',
}
# Performance settings
CONCURRENT_REQUESTS = 16
CONCURRENT_REQUESTS_PER_DOMAIN = 8
DOWNLOAD_DELAY = 1
# AutoThrottle
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1
AUTOTHROTTLE_MAX_DELAY = 10
AUTOTHROTTLE_TARGET_CONCURRENCY = 2.0
# Pipelines
ITEM_PIPELINES = {
    'myproject.pipelines.CleanDataPipeline': 300,
    'myproject.pipelines.DatabasePipeline': 400,
}
# Middleware
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'myproject.middlewares.CustomUserAgentMiddleware': 400,
}
Working with Forms and Authentication
Filling Forms
def parse_login(self, response):
    return scrapy.FormRequest.from_response(
        response,
        formdata={
            'username': 'myuser',
            'password': 'mypassword'
        },
        callback=self.after_login
    )

def after_login(self, response):
    # Verify successful login
    if "Welcome" in response.text:
        # Proceed to protected pages
        yield response.follow('/protected-page', callback=self.parse_protected)
Handling CSRF Tokens
def parse_form(self, response):
    csrf_token = response.css('input[name="csrf_token"]::attr(value)').get()
    return scrapy.FormRequest(
        url='https://example.com/submit',
        formdata={
            'csrf_token': csrf_token,
            'data': 'value'
        },
        callback=self.handle_response
    )
Processing JavaScript and Dynamic Content
Integration with Splash
# Install scrapy-splash
# pip install scrapy-splash
# settings.py
SPLASH_URL = 'http://localhost:8050'
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

# Use in a spider
def start_requests(self):
    for url in self.start_urls:
        yield scrapy.Request(url, self.parse, meta={
            'splash': {
                'args': {'wait': 0.5, 'png': 1, 'width': 1024, 'height': 768}
            }
        })
Integration with Selenium
from scrapy import signals
from scrapy.http import HtmlResponse
from selenium import webdriver

class SeleniumMiddleware:
    def __init__(self):
        self.driver = webdriver.Chrome()

    @classmethod
    def from_crawler(cls, crawler):
        middleware = cls()
        crawler.signals.connect(middleware.spider_closed, signals.spider_closed)
        return middleware

    def process_request(self, request, spider):
        if 'selenium' in request.meta:
            self.driver.get(request.url)
            body = self.driver.page_source
            return HtmlResponse(request.url, body=body, encoding='utf-8', request=request)
        # Returning None sends all other requests through the normal downloader

    def spider_closed(self):
        self.driver.quit()
Session and Cookie Management
Working with Cookies
def parse(self, response):
    # Get cookies
    cookies = response.headers.getlist('Set-Cookie')
    # Send request with cookies
    yield scrapy.Request(
        url='https://example.com/page',
        cookies={'session_id': 'abc123'},
        callback=self.parse_with_session
    )
Preserving Sessions
class SessionSpider(scrapy.Spider):
    name = 'session'

    def start_requests(self):
        # Initial request to obtain a session
        yield scrapy.Request(
            url='https://example.com/login',
            callback=self.parse_login,
            meta={'cookiejar': 1}
        )

    def parse_login(self, response):
        # Login while preserving the session
        return scrapy.FormRequest.from_response(
            response,
            formdata={'username': 'user', 'password': 'pass'},
            callback=self.parse_protected,
            meta={'cookiejar': response.meta['cookiejar']}
        )

    def parse_protected(self, response):
        # Use the saved session
        yield scrapy.Request(
            url='https://example.com/data',
            callback=self.parse_data,
            meta={'cookiejar': response.meta['cookiejar']}
        )
Working with APIs and JSON Data
Parsing a JSON API
import json

class ApiSpider(scrapy.Spider):
    name = 'api'

    def start_requests(self):
        headers = {
            'Content-Type': 'application/json',
            'Authorization': 'Bearer token123'
        }
        yield scrapy.Request(
            url='https://api.example.com/data',
            headers=headers,
            callback=self.parse_json
        )

    def parse_json(self, response):
        data = json.loads(response.text)
        for item in data.get('items', []):
            yield {
                'id': item.get('id'),
                'name': item.get('name'),
                'value': item.get('value')
            }
        # Pagination
        next_page = data.get('next_page')
        if next_page:
            yield response.follow(next_page, callback=self.parse_json)
POST Requests to an API
def make_api_request(self, response):
    payload = {
        'query': 'search term',
        'page': 1,
        'limit': 50
    }
    yield scrapy.Request(
        url='https://api.example.com/search',
        method='POST',
        body=json.dumps(payload),
        headers={'Content-Type': 'application/json'},
        callback=self.parse_api_response
    )
Error Handling and Logging
Logging Configuration
# settings.py
LOG_LEVEL = 'INFO'
LOG_FILE = 'scrapy.log'
# Custom logger settings
import logging
logging.getLogger('scrapy').setLevel(logging.WARNING)
Spider Error Handling
from scrapy.spidermiddlewares.httperror import HttpError

class ErrorHandlingSpider(scrapy.Spider):
    name = 'error_handling'

    def parse(self, response):
        try:
            title = response.css('title::text').get()
            if not title:
                self.logger.warning(f"No title found for {response.url}")
                return
            yield {'title': title, 'url': response.url}
        except Exception as e:
            self.logger.error(f"Error parsing {response.url}: {str(e)}")

    def errback(self, failure):
        # Attach with scrapy.Request(url, callback=self.parse, errback=self.errback)
        self.logger.error(f"Request failed: {failure.request.url}")
        if failure.check(HttpError):
            response = failure.value.response
            self.logger.error(f"HTTP error {response.status}: {response.url}")
Table of Core Scrapy Methods and Functions
| Method / Function | Description | Usage Example |
|---|---|---|
| scrapy.Spider | Base class for creating spiders | class MySpider(scrapy.Spider) |
| start_requests() | Generates initial requests | yield scrapy.Request(url, callback=self.parse) |
| parse() | Main response-handling method | def parse(self, response) |
| response.css() | Extract data using CSS selectors | response.css('h1::text').get() |
| response.xpath() | Extract data using XPath | response.xpath('//h1/text()').get() |
| response.follow() | Follow links | yield response.follow(link, callback=self.parse) |
| scrapy.Request() | Create an HTTP request | scrapy.Request(url, callback=self.parse) |
| scrapy.FormRequest() | Submit a form | scrapy.FormRequest(url, formdata={}) |
| scrapy.FormRequest.from_response() | Populate a form from a response | scrapy.FormRequest.from_response(response, formdata={}) |
| ItemLoader | Loader for processing items | loader = ItemLoader(item=MyItem()) |
| scrapy.Field() | Define a field in an Item | name = scrapy.Field() |
| yield | Return data or requests | yield {'title': title} |
| response.meta | Pass data between requests | response.meta['custom_data'] |
| response.headers | Access response headers | response.headers.get('Content-Type') |
| response.status | HTTP status code of the response | if response.status == 200: |
| response.url | URL of the current response | item['url'] = response.url |
| response.text | Textual content of the response | data = json.loads(response.text) |
| response.body | Binary content of the response | response.body |
| self.logger | Spider logger | self.logger.info('Message') |
| scrapy.signals | Signal system | crawler.signals.connect() |
| process_item() | Process items in a pipeline | def process_item(self, item, spider) |
| open_spider() | Initialize when spider starts | def open_spider(self, spider) |
| close_spider() | Clean up when spider finishes | def close_spider(self, spider) |
| process_request() | Handle a request in middleware | def process_request(self, request, spider) |
| process_response() | Handle a response in middleware | def process_response(self, request, response, spider) |
Performance and Optimization
Performance Tuning
# settings.py
# Number of concurrent requests
CONCURRENT_REQUESTS = 32
CONCURRENT_REQUESTS_PER_DOMAIN = 16
# Delay between requests
DOWNLOAD_DELAY = 0.5
RANDOMIZE_DOWNLOAD_DELAY = True  # applies 0.5x-1.5x jitter to DOWNLOAD_DELAY
# Timeouts
DOWNLOAD_TIMEOUT = 180
RETRY_TIMES = 3
# Caching
HTTPCACHE_ENABLED = True
HTTPCACHE_EXPIRATION_SECS = 3600
HTTPCACHE_DIR = 'httpcache'
# Compression
COMPRESSION_ENABLED = True
Performance Monitoring
# Enable statistics
STATS_CLASS = 'scrapy.statscollectors.MemoryStatsCollector'
# Custom metrics
class StatsSpider(scrapy.Spider):
    name = 'stats'

    def parse(self, response):
        # Increment counter
        self.crawler.stats.inc_value('pages_scraped')
        # Set a value
        self.crawler.stats.set_value('last_page_url', response.url)
        yield {'url': response.url}
Deployment and Scaling
Dockerization
# Dockerfile
FROM python:3.9-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY . .
CMD ["scrapy", "crawl", "myspider"]
Using a Task Scheduler
# schedule_spider.py
import schedule
import time
import subprocess
def run_spider():
    subprocess.run(['scrapy', 'crawl', 'myspider'])

schedule.every().day.at("08:00").do(run_spider)
schedule.every().hour.do(run_spider)

while True:
    schedule.run_pending()
    time.sleep(1)
Scaling with Scrapyd
# Install Scrapyd
pip install scrapyd
# Start the server
scrapyd
# Deploy the project
scrapyd-deploy
# Run a spider via the API
curl http://localhost:6800/schedule.json -d project=myproject -d spider=myspider
Best Practices for Using Scrapy
Code Structure
- Separate parsing logic into distinct methods
- Use Items to structure data
- Apply Pipelines for post‑processing
- Configure middleware for common tasks
Data Handling
- Always verify data presence before processing
- Use try/except blocks for error handling
- Normalize data in Pipelines
- Maintain logs for debugging
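The "always verify data presence" rule can be folded into a tiny helper reused across parse methods. This is a sketch; the function name and default are our own:

```python
def clean_text(value, default=None):
    """Return stripped text, or a default when extraction yielded nothing."""
    if value is None:
        return default
    value = value.strip()
    return value if value else default

# Usage inside a parse method:
# title = clean_text(response.css('title::text').get(), default='untitled')
```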
Ethics and Compliance
- Respect sites’ robots.txt files
- Set reasonable delays between requests
- Avoid overloading servers with excessive traffic
- Use appropriate User‑Agent headers
Security
- Never store passwords in code
- Use environment variables for sensitive data
- Employ proxies for anonymity
- Regularly update dependencies
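Credentials for FormRequest logins, for example, can come from environment variables instead of source code. The variable names below are illustrative:

```python
import os

def get_credentials():
    # Read credentials set in the shell, e.g.:
    #   export SCRAPER_USERNAME=myuser
    #   export SCRAPER_PASSWORD=mypassword
    username = os.environ.get('SCRAPER_USERNAME')
    password = os.environ.get('SCRAPER_PASSWORD')
    if not username or not password:
        raise RuntimeError('Set SCRAPER_USERNAME and SCRAPER_PASSWORD')
    return username, password
```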
Common Issues and Solutions
IP Blocking
# Rotate proxies, e.g. with the third-party scrapy-rotating-proxies package
DOWNLOADER_MIDDLEWARES = {
    'rotating_proxies.middlewares.RotatingProxyMiddleware': 610,
}
ROTATING_PROXY_LIST = [
    'http://proxy1:8000',
    'http://proxy2:8000',
    'http://proxy3:8000',
]
Bypassing Captchas
class CaptchaMiddleware:
    def process_response(self, request, response, spider):
        if 'captcha' in response.text.lower():
            # Hand off to a captcha-solving service; solve_captcha is a
            # placeholder you would implement against that service's API
            return self.solve_captcha(request, response, spider)
        return response
Handling Large Data Volumes
# Use generators
def parse_large_data(self, response):
    # Yielding items one at a time streams them through the pipelines
    # instead of accumulating a large list in memory
    for item in self.extract_items(response):
        yield item
Conclusion
Scrapy is a powerful and flexible web‑scraping tool suitable for both simple data‑extraction tasks and complex industrial projects. Thanks to its architecture, rich feature set, and active community, Scrapy remains one of the top choices for Python developers working with web data.
Mastering Scrapy opens up extensive possibilities for automating data collection, market analysis, competitor monitoring, and many other use cases. Properly leveraging the framework’s capabilities enables the creation of efficient, reliable solutions for handling web data of any complexity.