Scrapy: Web Scraping in Python


What Is Scrapy and Why You Need It

Scrapy is a powerful asynchronous web-scraping framework written in Python that provides high-performance, scalable data collection from websites. It automates the extraction of structured information from web pages, making it indispensable for data analysis, price monitoring, news parsing, and many other tasks.

Key Features and Benefits of Scrapy

Asynchronous Architecture

Scrapy is built on the Twisted library, enabling asynchronous request handling and high performance when working with a large number of web pages. This allows you to process many requests concurrently without blocking the program.

Flexible Data Extraction

The framework supports various data extraction methods:

  • CSS selectors for simple element selection
  • XPath selectors for more complex queries
  • Regular expressions for precise data extraction
  • Support for JSON and XML formats

Scalability and Performance

Scrapy is designed for projects of any size—from small scripts to large‑scale industrial solutions. Built‑in caching, parallel processing, and memory‑management mechanisms ensure stable operation even when handling millions of pages.

Extensibility via Middleware

The middleware system makes it easy to extend the framework’s functionality by adding custom request/response handling logic, header management, proxy support, and more.

Installation and Configuration of Scrapy

Installing the Framework

pip install scrapy

Scrapy exports JSON, CSV, and XML out of the box. Additional packages, such as scrapy-splash for rendering JavaScript, can be installed separately:

pip install scrapy-splash

Creating a New Project

scrapy startproject myproject
cd myproject

Generating a Spider

scrapy genspider example example.com

Scrapy Project Structure

After creating a project, the following directory layout is generated:

myproject/
├── myproject/
│   ├── __init__.py
│   ├── items.py          # Data structure definitions
│   ├── middlewares.py    # Middleware for request/response processing
│   ├── pipelines.py      # Data processing and storage
│   ├── settings.py       # Project configuration
│   └── spiders/          # Directory with spiders
│       ├── __init__.py
│       └── example.py
└── scrapy.cfg           # Project configuration file

Main File Purposes

  • spiders/ – contains parsing logic and data extraction
  • items.py – defines the structure of extracted data
  • pipelines.py – processes and stores extracted data
  • middlewares.py – configures request and response handling
  • settings.py – holds all project settings

Creating and Configuring a Spider

Basic Spider

import scrapy

class ExampleSpider(scrapy.Spider):
    name = "example"
    allowed_domains = ["example.com"]
    start_urls = ["https://example.com"]

    def parse(self, response):
        # Extract data
        title = response.css('title::text').get()
        
        # Return data
        yield {
            'title': title,
            'url': response.url,
            'status': response.status
        }

Advanced Spider with Navigation

import scrapy

class AdvancedSpider(scrapy.Spider):
    name = "advanced"
    start_urls = ["https://example.com/catalog"]

    def parse(self, response):
        # Extract product links
        product_links = response.css('a.product-link::attr(href)').getall()
        
        for link in product_links:
            yield response.follow(link, callback=self.parse_product)
        
        # Go to next page
        next_page = response.css('a.next::attr(href)').get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)

    def parse_product(self, response):
        yield {
            'name': response.css('h1::text').get(),
            'price': response.css('.price::text').get(),
            'description': response.css('.description::text').get(),
            'url': response.url
        }

Data Extraction Methods

CSS Selectors

# Get the first element
title = response.css('h1::text').get()

# Get all elements
links = response.css('a::attr(href)').getall()

# Get cleaned text (get() may return None, so supply a default)
clean_text = response.css('p::text').get(default='').strip()

XPath Selectors

# Search by attribute
price = response.xpath('//span[@class="price"]/text()').get()

# Complex queries
items = response.xpath('//div[contains(@class, "item")]//h2/text()').getall()

# Conditional expressions
discount = response.xpath('//span[contains(text(), "Discount")]/following-sibling::span/text()').get()

Combining Methods

def parse_product(self, response):
    # Use CSS for main elements
    name = response.css('h1::text').get()
    
    # Use XPath for complex queries
    features = response.xpath('//div[@class="features"]//li/text()').getall()
    
    # Process data
    yield {
        'name': name.strip() if name else None,
        'features': [f.strip() for f in features if f.strip()],
        'url': response.url
    }

Working with Items and ItemLoader

Defining Data Structure

# items.py
import scrapy

class ProductItem(scrapy.Item):
    name = scrapy.Field()
    price = scrapy.Field()
    description = scrapy.Field()
    images = scrapy.Field()
    availability = scrapy.Field()
    category = scrapy.Field()

Using ItemLoader

from scrapy.loader import ItemLoader
from itemloaders.processors import TakeFirst, MapCompose
from myproject.items import ProductItem

class ProductLoader(ItemLoader):
    default_item_class = ProductItem
    default_output_processor = TakeFirst()
    
    price_in = MapCompose(lambda x: x.replace('$', ''), float)
    description_in = MapCompose(str.strip)

# Using it in a spider
def parse_product(self, response):
    loader = ProductLoader(response=response)
    loader.add_css('name', 'h1::text')
    loader.add_css('price', '.price::text')
    loader.add_css('description', '.description::text')
    
    return loader.load_item()

Pipeline System for Data Processing

Basic Pipeline

# pipelines.py
import json

class JsonPipeline:
    def open_spider(self, spider):
        self.file = open('items.json', 'w', encoding='utf-8')
        self.file.write('[\n')
        self.first_item = True

    def close_spider(self, spider):
        self.file.write('\n]')
        self.file.close()

    def process_item(self, item, spider):
        if not self.first_item:
            self.file.write(',\n')
        else:
            self.first_item = False
        
        line = json.dumps(dict(item), ensure_ascii=False, indent=2)
        self.file.write(line)
        return item
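For straightforward exports like this one, Scrapy's built-in feed exports (Scrapy 2.1+) can replace the custom pipeline entirely with a settings.py fragment:

```python
# settings.py -- built-in feed exports write items without a custom pipeline
FEEDS = {
    'items.json': {
        'format': 'json',
        'encoding': 'utf8',
        'indent': 2,
        'overwrite': True,
    },
}
```

The same export can be requested ad hoc from the command line with scrapy crawl example -O items.json.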

Data Cleaning Pipeline

class CleanDataPipeline:
    def process_item(self, item, spider):
        # Clean price
        if item.get('price'):
            item['price'] = item['price'].replace('$', '').replace(',', '')
            try:
                item['price'] = float(item['price'])
            except ValueError:
                item['price'] = None
        
        # Clean description
        if item.get('description'):
            item['description'] = item['description'].strip()
        
        return item
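The price-cleaning steps are easier to test when factored into a small standalone helper (parse_price is a name chosen here for illustration, not a Scrapy API):

```python
def parse_price(raw):
    """Convert a raw price string such as '$1,299.00' to a float, or None."""
    if not raw:
        return None
    cleaned = raw.replace('$', '').replace(',', '').strip()
    try:
        return float(cleaned)
    except ValueError:
        return None

print(parse_price('$1,299.00'))  # 1299.0
print(parse_price('N/A'))        # None
```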

Database Storage Pipeline

import sqlite3

class DatabasePipeline:
    def open_spider(self, spider):
        self.connection = sqlite3.connect('items.db')
        self.cursor = self.connection.cursor()
        self.cursor.execute('''
            CREATE TABLE IF NOT EXISTS products (
                id INTEGER PRIMARY KEY,
                name TEXT,
                price REAL,
                description TEXT,
                url TEXT
            )
        ''')
        self.connection.commit()

    def close_spider(self, spider):
        self.connection.close()

    def process_item(self, item, spider):
        self.cursor.execute('''
            INSERT INTO products (name, price, description, url)
            VALUES (?, ?, ?, ?)
        ''', (
            item.get('name'),
            item.get('price'),
            item.get('description'),
            item.get('url')
        ))
        self.connection.commit()
        return item
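The pipeline's SQL can be verified in isolation against an in-memory database before wiring it into a crawl:

```python
import sqlite3

conn = sqlite3.connect(':memory:')
cur = conn.cursor()
cur.execute('''
    CREATE TABLE IF NOT EXISTS products (
        id INTEGER PRIMARY KEY,
        name TEXT, price REAL, description TEXT, url TEXT
    )
''')

item = {'name': 'Widget', 'price': 9.99,
        'description': 'A widget', 'url': 'https://example.com/w'}
cur.execute(
    'INSERT INTO products (name, price, description, url) VALUES (?, ?, ?, ?)',
    (item.get('name'), item.get('price'), item.get('description'), item.get('url')),
)
conn.commit()

row = cur.execute('SELECT name, price FROM products').fetchone()
print(row)  # ('Widget', 9.99)
```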

Project Settings and Configuration

Core Settings in settings.py

# Basic settings
BOT_NAME = 'myproject'
SPIDER_MODULES = ['myproject.spiders']
NEWSPIDER_MODULE = 'myproject.spiders'

# Respect robots.txt (set to False only when you have permission to ignore it)
ROBOTSTXT_OBEY = True

# User-Agent
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'

# Default request headers
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.9',
    'Accept-Encoding': 'gzip, deflate',
}

# Performance settings
CONCURRENT_REQUESTS = 16
CONCURRENT_REQUESTS_PER_DOMAIN = 8
DOWNLOAD_DELAY = 1

# AutoThrottle
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1
AUTOTHROTTLE_MAX_DELAY = 10
AUTOTHROTTLE_TARGET_CONCURRENCY = 2.0

# Pipelines
ITEM_PIPELINES = {
    'myproject.pipelines.CleanDataPipeline': 300,
    'myproject.pipelines.DatabasePipeline': 400,
}

# Middleware
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'myproject.middlewares.CustomUserAgentMiddleware': 400,
}

Working with Forms and Authentication

Filling Forms

def parse_login(self, response):
    return scrapy.FormRequest.from_response(
        response,
        formdata={
            'username': 'myuser',
            'password': 'mypassword'
        },
        callback=self.after_login
    )

def after_login(self, response):
    # Verify successful login
    if "Welcome" in response.text:
        # Proceed to protected pages
        yield response.follow('/protected-page', callback=self.parse_protected)

Handling CSRF Tokens

def parse_form(self, response):
    csrf_token = response.css('input[name="csrf_token"]::attr(value)').get()
    
    return scrapy.FormRequest(
        url='https://example.com/submit',
        formdata={
            'csrf_token': csrf_token,
            'data': 'value'
        },
        callback=self.handle_response
    )

Processing JavaScript and Dynamic Content

Integration with Splash

# Install scrapy-splash
# pip install scrapy-splash

# settings.py
SPLASH_URL = 'http://localhost:8050'
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'

# Use in a spider
def start_requests(self):
    for url in self.start_urls:
        yield scrapy.Request(url, self.parse, meta={
            'splash': {
                'args': {'wait': 0.5, 'png': 1, 'width': 1024, 'height': 768}
            }
        })

Integration with Selenium

from scrapy import signals
from selenium import webdriver
from scrapy.http import HtmlResponse

class SeleniumMiddleware:
    def __init__(self):
        self.driver = webdriver.Chrome()

    @classmethod
    def from_crawler(cls, crawler):
        middleware = cls()
        crawler.signals.connect(middleware.spider_closed, signals.spider_closed)
        return middleware

    def process_request(self, request, spider):
        if 'selenium' in request.meta:
            self.driver.get(request.url)
            body = self.driver.page_source
            return HtmlResponse(request.url, body=body, encoding='utf-8', request=request)

    def spider_closed(self):
        self.driver.quit()
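To activate this middleware, register it in settings.py and flag only the requests that need a real browser. The module path below assumes the class lives in myproject/middlewares.py:

```python
# settings.py
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.SeleniumMiddleware': 543,
}

# In a spider, opt in per request:
# yield scrapy.Request(url, callback=self.parse, meta={'selenium': True})
```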

Session and Cookie Management

Working with Cookies

def parse(self, response):
    # Get cookies
    cookies = response.headers.getlist('Set-Cookie')
    
    # Send request with cookies
    yield scrapy.Request(
        url='https://example.com/page',
        cookies={'session_id': 'abc123'},
        callback=self.parse_with_session
    )

Preserving Sessions

class SessionSpider(scrapy.Spider):
    name = 'session'
    
    def start_requests(self):
        # Initial request to obtain a session
        yield scrapy.Request(
            url='https://example.com/login',
            callback=self.parse_login,
            meta={'cookiejar': 1}
        )
    
    def parse_login(self, response):
        # Login while preserving the session
        return scrapy.FormRequest.from_response(
            response,
            formdata={'username': 'user', 'password': 'pass'},
            callback=self.parse_protected,
            meta={'cookiejar': response.meta['cookiejar']}
        )
    
    def parse_protected(self, response):
        # Use the saved session
        yield scrapy.Request(
            url='https://example.com/data',
            callback=self.parse_data,
            meta={'cookiejar': response.meta['cookiejar']}
        )

Working with APIs and JSON Data

Parsing a JSON API

import json

class ApiSpider(scrapy.Spider):
    name = 'api'
    
    def start_requests(self):
        headers = {
            'Content-Type': 'application/json',
            'Authorization': 'Bearer token123'
        }
        yield scrapy.Request(
            url='https://api.example.com/data',
            headers=headers,
            callback=self.parse_json
        )
    
    def parse_json(self, response):
        data = json.loads(response.text)
        
        for item in data.get('items', []):
            yield {
                'id': item.get('id'),
                'name': item.get('name'),
                'value': item.get('value')
            }
        
        # Pagination
        next_page = data.get('next_page')
        if next_page:
            yield response.follow(next_page, callback=self.parse_json)

POST Requests to an API

def make_api_request(self, response):
    payload = {
        'query': 'search term',
        'page': 1,
        'limit': 50
    }
    
    yield scrapy.Request(
        url='https://api.example.com/search',
        method='POST',
        body=json.dumps(payload),
        headers={'Content-Type': 'application/json'},
        callback=self.parse_api_response
    )

Error Handling and Logging

Logging Configuration

# settings.py
LOG_LEVEL = 'INFO'
LOG_FILE = 'scrapy.log'

# Custom logger settings
import logging

logging.getLogger('scrapy').setLevel(logging.WARNING)

Spider Error Handling

from scrapy.spidermiddlewares.httperror import HttpError

class ErrorHandlingSpider(scrapy.Spider):
    name = 'error_handling'
    start_urls = ['https://example.com']

    def start_requests(self):
        # An errback only fires when it is attached to the request
        for url in self.start_urls:
            yield scrapy.Request(url, callback=self.parse, errback=self.errback)

    def parse(self, response):
        try:
            title = response.css('title::text').get()
            if not title:
                self.logger.warning(f"No title found for {response.url}")
                return

            yield {'title': title, 'url': response.url}

        except Exception as e:
            self.logger.error(f"Error parsing {response.url}: {e}")

    def errback(self, failure):
        self.logger.error(f"Request failed: {failure.request.url}")
        if failure.check(HttpError):
            response = failure.value.response
            self.logger.error(f"HTTP error {response.status}: {response.url}")

Table of Core Scrapy Methods and Functions

Method / Function – Description – Usage Example

  • scrapy.Spider – Base class for creating spiders – class MySpider(scrapy.Spider)
  • start_requests() – Generates initial requests – yield scrapy.Request(url, callback=self.parse)
  • parse() – Main response-handling method – def parse(self, response)
  • response.css() – Extract data using CSS selectors – response.css('h1::text').get()
  • response.xpath() – Extract data using XPath – response.xpath('//h1/text()').get()
  • response.follow() – Follow links – yield response.follow(link, callback=self.parse)
  • scrapy.Request() – Create an HTTP request – scrapy.Request(url, callback=self.parse)
  • scrapy.FormRequest() – Submit a form – scrapy.FormRequest(url, formdata={})
  • scrapy.FormRequest.from_response() – Populate a form from a response – scrapy.FormRequest.from_response(response, formdata={})
  • ItemLoader – Loader for processing items – loader = ItemLoader(item=MyItem())
  • scrapy.Field() – Define a field in an Item – name = scrapy.Field()
  • yield – Return data or requests – yield {'title': title}
  • response.meta – Pass data between requests – response.meta['custom_data']
  • response.headers – Access response headers – response.headers.get('Content-Type')
  • response.status – HTTP status code of the response – if response.status == 200:
  • response.url – URL of the current response – item['url'] = response.url
  • response.text – Textual content of the response – data = json.loads(response.text)
  • response.body – Binary content of the response – response.body
  • self.logger – Spider logger – self.logger.info('Message')
  • scrapy.signals – Signal system – crawler.signals.connect()
  • process_item() – Process items in a pipeline – def process_item(self, item, spider)
  • open_spider() – Initialize when spider starts – def open_spider(self, spider)
  • close_spider() – Clean up when spider finishes – def close_spider(self, spider)
  • process_request() – Handle a request in middleware – def process_request(self, request, spider)
  • process_response() – Handle a response in middleware – def process_response(self, request, response, spider)

Performance and Optimization

Performance Tuning

# settings.py
# Number of concurrent requests
CONCURRENT_REQUESTS = 32
CONCURRENT_REQUESTS_PER_DOMAIN = 16

# Delay between requests (randomized to 0.5x-1.5x of DOWNLOAD_DELAY)
DOWNLOAD_DELAY = 0.5
RANDOMIZE_DOWNLOAD_DELAY = True

# Timeouts
DOWNLOAD_TIMEOUT = 180
RETRY_TIMES = 3

# Caching
HTTPCACHE_ENABLED = True
HTTPCACHE_EXPIRATION_SECS = 3600
HTTPCACHE_DIR = 'httpcache'

# Compression
COMPRESSION_ENABLED = True

Performance Monitoring

# Enable statistics
STATS_CLASS = 'scrapy.statscollectors.MemoryStatsCollector'

# Custom metrics
class StatsSpider(scrapy.Spider):
    name = 'stats'
    
    def parse(self, response):
        # Increment counter
        self.crawler.stats.inc_value('pages_scraped')
        
        # Set a value
        self.crawler.stats.set_value('last_page_url', response.url)
        
        yield {'url': response.url}

Deployment and Scaling

Dockerization

# Dockerfile
FROM python:3.9-slim

WORKDIR /app

COPY requirements.txt .
RUN pip install -r requirements.txt

COPY . .

CMD ["scrapy", "crawl", "myspider"]

Using a Task Scheduler

# schedule_spider.py
import schedule
import time
import subprocess

def run_spider():
    subprocess.run(['scrapy', 'crawl', 'myspider'])

schedule.every().day.at("08:00").do(run_spider)
schedule.every().hour.do(run_spider)

while True:
    schedule.run_pending()
    time.sleep(1)

Scaling with Scrapyd

# Install Scrapyd and the deploy client
pip install scrapyd scrapyd-client

# Start the server
scrapyd

# Deploy the project (run from the project directory)
scrapyd-deploy

# Run a spider via the API
curl http://localhost:6800/schedule.json -d project=myproject -d spider=myspider

Best Practices for Using Scrapy

Code Structure

  • Separate parsing logic into distinct methods
  • Use Items to structure data
  • Apply Pipelines for post‑processing
  • Configure middleware for common tasks

Data Handling

  • Always verify data presence before processing
  • Use try/except blocks for error handling
  • Normalize data in Pipelines
  • Maintain logs for debugging
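The first two rules can be folded into one small helper applied before an item is yielded (normalize_item is illustrative, not part of Scrapy):

```python
def normalize_item(item):
    """Strip string fields and drop empty values before the item is yielded."""
    cleaned = {}
    for key, value in item.items():
        if isinstance(value, str):
            value = value.strip() or None  # empty strings become None
        if value is not None:
            cleaned[key] = value
    return cleaned

print(normalize_item({'name': '  Widget ', 'description': '   ', 'price': 9.5}))
# {'name': 'Widget', 'price': 9.5}
```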

Ethics and Compliance

  • Respect sites’ robots.txt files
  • Set reasonable delays between requests
  • Avoid overloading servers with excessive traffic
  • Use appropriate User‑Agent headers

Security

  • Never store passwords in code
  • Use environment variables for sensitive data
  • Employ proxies for anonymity
  • Regularly update dependencies
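Reading credentials from environment variables keeps them out of the repository. A settings.py fragment (the variable names are illustrative):

```python
# settings.py
import os

# Secrets come from the environment, never from the source tree
API_TOKEN = os.environ.get('MYPROJECT_API_TOKEN', '')
HTTP_PROXY = os.environ.get('MYPROJECT_PROXY_URL', '')
```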

Common Issues and Solutions

IP Blocking

# Use rotating proxies (requires a third-party package such as
# scrapy-rotating-proxies; the setting names below follow that package)
DOWNLOADER_MIDDLEWARES = {
    'rotating_proxies.middlewares.RotatingProxyMiddleware': 610,
}

ROTATING_PROXY_LIST = [
    'http://proxy1:8000',
    'http://proxy2:8000',
    'http://proxy3:8000',
]

Bypassing Captchas

class CaptchaMiddleware:
    def process_response(self, request, response, spider):
        if 'captcha' in response.text.lower():
            # Hand off to a captcha-solving service;
            # solve_captcha is a placeholder you must implement yourself
            return self.solve_captcha(request, response, spider)
        return response

Handling Large Data Volumes

# Yield items one at a time instead of accumulating them in a list,
# so only one item is held in memory at a time
def parse_large_data(self, response):
    for item in self.extract_items(response):
        yield item

Conclusion

Scrapy is a powerful and flexible web‑scraping tool suitable for both simple data‑extraction tasks and complex industrial projects. Thanks to its architecture, rich feature set, and active community, Scrapy remains one of the top choices for Python developers working with web data.

Mastering Scrapy opens up extensive possibilities for automating data collection, market analysis, competitor monitoring, and many other use cases. Properly leveraging the framework’s capabilities enables the creation of efficient, reliable solutions for handling web data of any complexity.
