SpaCy: Advanced NLP


Key Advantages of SpaCy

Architecture and Performance

SpaCy's core is implemented in Cython, delivering exceptional speed for text processing. The library employs optimized algorithms and data structures that enable processing millions of tokens per minute on standard hardware.

Multilingual Support

The library supports tokenization for more than 70 languages and provides pretrained pipelines for around 25 of them, including Russian, English, Chinese, Japanese, and others. Each pretrained pipeline is optimized for the specific linguistic characteristics of its language.

Ready‑to‑Use Models

SpaCy provides pretrained models of various sizes (small, medium, large), allowing you to choose the optimal balance between speed and accuracy for a given task.

Installation and Setup of SpaCy

Installing the Core Library

pip install spacy

Downloading Language Models

For English:

python -m spacy download en_core_web_sm
python -m spacy download en_core_web_md
python -m spacy download en_core_web_lg

For Russian:

python -m spacy download ru_core_news_sm
python -m spacy download ru_core_news_md
python -m spacy download ru_core_news_lg
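After downloading, you can verify which model packages are available in the current environment; `spacy.util.get_installed_models` is part of spaCy 3's public API:

```python
import spacy

# List the names of all spaCy model packages installed in this environment
installed = spacy.util.get_installed_models()
print(installed)
```

If the list is empty, no models have been downloaded yet and `spacy.load` will raise an error.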

Importing and Initializing

import spacy

# Load a model
nlp = spacy.load("en_core_web_sm")

# Process text
doc = nlp("Apple is looking at buying a startup in the UK.")

Core Components of the SpaCy Pipeline

Tokenization

SpaCy automatically splits text into tokens, taking into account language‑specific rules, punctuation, and special symbols:

doc = nlp("Don't worry, it's working!")
for token in doc:
    print(token.text, token.is_alpha, token.is_punct, token.like_num)
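Tokenization itself requires no trained model: a blank pipeline created with `spacy.blank` contains only the tokenizer, which is enough to see how contractions and punctuation are split:

```python
import spacy

# A blank English pipeline: tokenizer only, no statistical components
nlp_blank = spacy.blank("en")
doc = nlp_blank("Don't worry, it's working!")
print([token.text for token in doc])
# ['Do', "n't", 'worry', ',', 'it', "'s", 'working', '!']
```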

Morphological Analysis

Each token carries rich morphological information:

for token in doc:
    print(f"Word: {token.text}")
    print(f"Lemma: {token.lemma_}")
    print(f"Part of Speech: {token.pos_}")
    print(f"Tag: {token.tag_}")
    print(f"Morphological Features: {token.morph}")

Syntactic Parsing

SpaCy builds a dependency tree for each sentence:

for token in doc:
    print(f"{token.text} <- {token.dep_} <- {token.head.text}")
    print(f"Children: {[child.text for child in token.children]}")

Lemmatization and Stemming

Lemmatization in SpaCy

SpaCy provides high‑quality lemmatization out of the box:

doc = nlp("The cats are running and jumping")
for token in doc:
    print(f"{token.text} -> {token.lemma_}")

Stemming Alternatives

Although SpaCy does not include stemming directly, you can integrate it with other libraries:

import spacy
from nltk.stem import PorterStemmer

# Register the custom attribute before assigning to it
spacy.tokens.Token.set_extension("stem", default=None)

stemmer = PorterStemmer()

def add_stemming(doc):
    for token in doc:
        token._.stem = stemmer.stem(token.text)
    return doc

Part‑of‑Speech Tagging

Universal POS Tags

SpaCy uses a universal part‑of‑speech tagging scheme:

pos_counts = {}
for token in doc:
    pos = token.pos_
    pos_counts[pos] = pos_counts.get(pos, 0) + 1

for pos, count in pos_counts.items():
    print(f"{pos}: {count}")
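The same counting can be written more compactly with `collections.Counter`. The snippet below demonstrates the idiom on raw token texts so it runs even without a trained tagger; with a loaded model you would count `token.pos_` instead:

```python
import spacy
from collections import Counter

nlp_blank = spacy.blank("en")
doc = nlp_blank("to be or not to be")

# With a trained model, replace token.text with token.pos_
counts = Counter(token.text for token in doc)
print(counts.most_common(2))
# [('to', 2), ('be', 2)]
```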

Detailed Morphological Features

for token in doc:
    if token.pos_ == "VERB":
        print(f"Verb: {token.text}")
        print(f"Tense: {token.morph.get('Tense')}")
        print(f"Number: {token.morph.get('Number')}")

Dependency Analysis

Extracting Syntactic Relations

def extract_dependencies(doc):
    dependencies = []
    for token in doc:
        if token.dep_ != "ROOT":
            dependencies.append({
                'dependent': token.text,
                'head': token.head.text,
                'relation': token.dep_
            })
    return dependencies

Finding Subjects and Objects

def find_subjects_objects(doc):
    subjects = [token for token in doc if token.dep_ == "nsubj"]
    objects = [token for token in doc if token.dep_ in ["dobj", "pobj"]]
    return subjects, objects

Named Entity Recognition

Core NER Categories

SpaCy supports a wide range of named‑entity categories:

for ent in doc.ents:
    print(f"Entity: {ent.text}")
    print(f"Label: {ent.label_}")
    print(f"Description: {spacy.explain(ent.label_)}")
    print(f"Position: {ent.start_char}-{ent.end_char}")
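`spacy.explain` works for any label, POS tag, or dependency relation name and needs no loaded model, which makes it handy for quick lookups:

```python
import spacy

# Human-readable descriptions for entity labels and dependency relations
print(spacy.explain("ORG"))    # Companies, agencies, institutions, etc.
print(spacy.explain("GPE"))    # Countries, cities, states
print(spacy.explain("nsubj"))  # nominal subject
```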

Customizing NER

from spacy.training import Example

# Create custom annotations
def create_training_data():
    training_data = [
        ("Microsoft released a new product", 
         {"entities": [(0, 9, "ORG")]})  # "Microsoft" spans characters 0-9
    ]
    return training_data
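To turn such annotations into training examples, `Example.from_dict` pairs a tokenized Doc with the gold-standard dictionary. A blank pipeline is enough to try it; note that the character offsets must align with token boundaries:

```python
import spacy
from spacy.training import Example

nlp_blank = spacy.blank("en")
text = "Microsoft released a new product"
doc = nlp_blank.make_doc(text)

# The reference side of the Example carries the gold entities
example = Example.from_dict(doc, {"entities": [(0, 9, "ORG")]})
print([(ent.text, ent.label_) for ent in example.reference.ents])
```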

Working with Sentences and Phrases

Sentence Segmentation

doc = nlp("This is the first sentence. This is the second sentence!")
for i, sent in enumerate(doc.sents):
    print(f"Sentence {i+1}: {sent.text}")
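Sentence boundaries normally come from the dependency parser, so `doc.sents` is unavailable when the parser is disabled. The lightweight rule-based `sentencizer` component is a fast alternative:

```python
import spacy

nlp_fast = spacy.blank("en")
nlp_fast.add_pipe("sentencizer")  # punctuation-based sentence splitting

doc = nlp_fast("This is the first sentence. This is the second sentence!")
print([sent.text for sent in doc.sents])
# ['This is the first sentence.', 'This is the second sentence!']
```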

Extracting Noun Phrases

def extract_noun_phrases(doc):
    noun_phrases = []
    for chunk in doc.noun_chunks:
        noun_phrases.append({
            'text': chunk.text,
            'root': chunk.root.text,
            'dep': chunk.root.dep_,
            'head': chunk.root.head.text
        })
    return noun_phrases

Patterns and Matching

Using the Matcher

from spacy.matcher import Matcher

matcher = Matcher(nlp.vocab)

# Find dates
date_pattern = [
    {"SHAPE": "dd"},
    {"LOWER": {"IN": ["january", "february", "march"]}},
    {"SHAPE": "dddd"}
]

matcher.add("DATE_PATTERN", [date_pattern])
matches = matcher(doc)
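Each match is a `(match_id, start, end)` triple of a vocabulary hash and token indices. A complete run might look like this; the sample sentence is our own, and a blank pipeline suffices because the pattern uses only the SHAPE and LOWER attributes:

```python
import spacy
from spacy.matcher import Matcher

nlp_blank = spacy.blank("en")
matcher = Matcher(nlp_blank.vocab)
matcher.add("DATE_PATTERN", [[
    {"SHAPE": "dd"},
    {"LOWER": {"IN": ["january", "february", "march"]}},
    {"SHAPE": "dddd"},
]])

doc = nlp_blank("The launch is planned for 15 March 2024.")
matches = matcher(doc)
for match_id, start, end in matches:
    # Resolve the hash back to the pattern name and slice the matched span
    print(nlp_blank.vocab.strings[match_id], doc[start:end].text)
# DATE_PATTERN 15 March 2024
```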

PhraseMatcher for Large Vocabularies

from spacy.matcher import PhraseMatcher

phrase_matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
companies = ["apple", "microsoft", "google", "amazon"]
patterns = [nlp.make_doc(text) for text in companies]
phrase_matcher.add("COMPANIES", patterns)
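Running the PhraseMatcher returns the same `(match_id, start, end)` triples; because `attr="LOWER"` was used, matching is case-insensitive. The example sentence below is our own:

```python
import spacy
from spacy.matcher import PhraseMatcher

nlp_blank = spacy.blank("en")
phrase_matcher = PhraseMatcher(nlp_blank.vocab, attr="LOWER")
companies = ["apple", "microsoft", "google", "amazon"]
phrase_matcher.add("COMPANIES", [nlp_blank.make_doc(text) for text in companies])

doc = nlp_blank("Apple and Google announced a partnership.")
matched = [doc[start:end].text for _, start, end in phrase_matcher(doc)]
print(matched)
# ['Apple', 'Google']
```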

Customizing the Pipeline

Adding Custom Components

from spacy.language import Language

# Register the Doc extension before the component first runs
spacy.tokens.Doc.set_extension("sentiment", default="neutral")

@Language.component("custom_sentiment")
def sentiment_component(doc):
    # Simple lexicon-based sentiment analysis
    positive_words = {"good", "great", "excellent", "amazing"}
    negative_words = {"bad", "terrible", "awful", "horrible"}

    pos_count = sum(1 for token in doc if token.lemma_.lower() in positive_words)
    neg_count = sum(1 for token in doc if token.lemma_.lower() in negative_words)

    if pos_count > neg_count:
        doc._.sentiment = "positive"
    elif neg_count > pos_count:
        doc._.sentiment = "negative"
    # On a tie the default "neutral" is kept
    return doc

nlp.add_pipe("custom_sentiment", last=True)

Managing Pipeline Components

# View components
print(nlp.pipe_names)

# Temporarily disable components for speed
# (select_pipes replaces the deprecated disable_pipes in spaCy 3)
with nlp.select_pipes(disable=["ner", "parser"]):
    doc = nlp("Fast processing with only tokenization and POS")

Multilingual Support

Working with Russian

# Load the Russian model
nlp_ru = spacy.load("ru_core_news_sm")

text_ru = "Москва — столица России. Красная площадь находится в центре города."  # "Moscow is the capital of Russia. Red Square is in the city centre."
doc_ru = nlp_ru(text_ru)

for ent in doc_ru.ents:
    print(f"Entity: {ent.text}, Type: {ent.label_}")

Language Detection

def detect_language(text):
    # SpaCy has no built-in language detector. For a Russian/English setup,
    # a simple character-based heuristic is often enough: count Cyrillic
    # versus Latin letters and pick the dominant script.
    lowered = text.lower()
    cyrillic = sum(1 for ch in lowered if "а" <= ch <= "я" or ch == "ё")
    latin = sum(1 for ch in lowered if "a" <= ch <= "z")
    if cyrillic == 0 and latin == 0:
        return "unknown"
    return "ru" if cyrillic > latin else "en"

Visualization and Analysis

Dependency Visualization

from spacy import displacy

# Serve the dependency visualization on a local web server (blocks until stopped)
displacy.serve(doc, style="dep", port=5000)

# Save to HTML
html = displacy.render(doc, style="dep", page=True)
with open("dependencies.html", "w", encoding="utf-8") as f:
    f.write(html)

Entity Visualization

# Color scheme for entities
colors = {
    "ORG": "#7aecec",
    "PERSON": "#aa9cfc",
    "GPE": "#feca57"
}

options = {"ents": ["ORG", "PERSON", "GPE"], "colors": colors}
displacy.serve(doc, style="ent", options=options)

Integration with Machine Learning

Integration with scikit‑learn

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
import numpy as np

def spacy_tokenizer(text):
    doc = nlp(text)
    return [token.lemma_ for token in doc if not token.is_stop and not token.is_punct]

# Create a vectorizer that uses the SpaCy tokenizer
# (token_pattern=None silences the warning that the default regex pattern is unused)
vectorizer = TfidfVectorizer(tokenizer=spacy_tokenizer, lowercase=False, token_pattern=None)

Working with Transformers

# Install spacy‑transformers
# pip install spacy-transformers

import spacy_transformers

# Load a transformer‑based model
nlp_trf = spacy.load("en_core_web_trf")

# Get embeddings
doc_trf = nlp_trf("This is a sentence with transformer embeddings.")
embeddings = doc_trf._.trf_data.tensors[0]

Table of Core SpaCy Methods and Properties

| Component | Method/Property | Description | Example Usage |
|-----------|-----------------|-------------|---------------|
| nlp | nlp(text) | Process text | doc = nlp("Hello world") |
| nlp | nlp.pipe(texts) | Batch processing | docs = list(nlp.pipe(texts)) |
| nlp | nlp.pipe_names | List of pipeline components | print(nlp.pipe_names) |
| nlp | nlp.add_pipe() | Add a component | nlp.add_pipe("custom_component") |
| Doc | doc.ents | Named entities | for ent in doc.ents: print(ent.text) |
| Doc | doc.sents | Sentences | for sent in doc.sents: print(sent.text) |
| Doc | doc.noun_chunks | Noun phrases | for chunk in doc.noun_chunks: print(chunk.text) |
| Doc | doc.vector | Vector representation | print(doc.vector.shape) |
| Doc | doc.similarity() | Semantic similarity | similarity = doc1.similarity(doc2) |
| Token | token.text | Token text | print(token.text) |
| Token | token.lemma_ | Token lemma | print(token.lemma_) |
| Token | token.pos_ | Part of speech | print(token.pos_) |
| Token | token.tag_ | Detailed tag | print(token.tag_) |
| Token | token.dep_ | Dependency relation | print(token.dep_) |
| Token | token.head | Head token | print(token.head.text) |
| Token | token.children | Child tokens | print(list(token.children)) |
| Token | token.is_alpha | Alphabetic token | if token.is_alpha: print(token.text) |
| Token | token.is_punct | Punctuation mark | if token.is_punct: print(token.text) |
| Token | token.is_stop | Stop word | if not token.is_stop: print(token.text) |
| Token | token.like_num | Looks like a number | if token.like_num: print(token.text) |
| Token | token.morph | Morphological features | print(token.morph) |
| Span | span.text | Span text | print(span.text) |
| Span | span.label_ | Entity label | print(span.label_) |
| Span | span.start | Start token index | print(span.start) |
| Span | span.end | End token index | print(span.end) |
| Matcher | matcher.add() | Add a pattern | matcher.add("PATTERN", [pattern]) |
| Matcher | matcher(doc) | Find matches | matches = matcher(doc) |
| displacy | displacy.render() | Render visualization markup | html = displacy.render(doc, style="dep") |
| displacy | displacy.serve() | Web visualization server | displacy.serve(doc, style="ent") |

Practical Usage Examples

Customer Review Analysis

def analyze_reviews(reviews):
    results = []
    for review in reviews:
        doc = nlp(review)
        
        # Extract entities
        entities = [(ent.text, ent.label_) for ent in doc.ents]
        
        # Simple sentiment analysis
        positive_words = {"good", "great", "excellent", "amazing", "wonderful"}
        negative_words = {"bad", "terrible", "awful", "horrible", "disappointing"}
        
        pos_count = sum(1 for token in doc if token.lemma_.lower() in positive_words)
        neg_count = sum(1 for token in doc if token.lemma_.lower() in negative_words)
        
        sentiment = "positive" if pos_count > neg_count else "negative" if neg_count > pos_count else "neutral"
        
        results.append({
            'review': review,
            'entities': entities,
            'sentiment': sentiment,
            'pos_score': pos_count,
            'neg_score': neg_count
        })
    
    return results

Key Phrase Extraction

from collections import Counter

def extract_key_phrases(text, min_freq=2):
    doc = nlp(text)
    
    # Extract multi-word noun phrases
    noun_phrases = [chunk.text.lower() for chunk in doc.noun_chunks 
                   if len(chunk.text.split()) > 1]
    
    # Count frequencies and filter by threshold
    phrase_counts = Counter(noun_phrases)
    key_phrases = [phrase for phrase, count in phrase_counts.items() 
                  if count >= min_freq]
    
    return key_phrases

Document Classification

def classify_documents(documents, categories):
    classified = []
    
    for doc_text in documents:
        doc = nlp(doc_text)
        
        # Feature extraction
        features = {
            'entities': [ent.label_ for ent in doc.ents],
            'pos_tags': [token.pos_ for token in doc],
            'keywords': [token.lemma_.lower() for token in doc 
                        if token.is_alpha and not token.is_stop]
        }
        
        # Simple keyword‑based classification
        scores = {}
        for category, keywords in categories.items():
            score = sum(1 for keyword in keywords 
                       if keyword in features['keywords'])
            scores[category] = score
        
        predicted_category = max(scores, key=scores.get)
        classified.append({
            'text': doc_text,
            'category': predicted_category,
            'confidence': scores[predicted_category],
            'features': features
        })
    
    return classified

Performance Optimization

Batch Processing

def process_large_dataset(texts, batch_size=1000):
    results = []
    
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i+batch_size]
        
        # Disable unnecessary components for speed (select_pipes is the spaCy 3 API)
        with nlp.select_pipes(disable=["ner", "parser"]):
            docs = list(nlp.pipe(batch))
        
        for doc in docs:
            results.append({
                'text': doc.text,
                'tokens': len(doc),
                'sentences': len(list(doc.sents))
            })
    
    return results

Caching Results

import functools

@functools.lru_cache(maxsize=1000)
def cached_nlp_processing(text):
    doc = nlp(text)
    return {
        'entities': [(ent.text, ent.label_) for ent in doc.ents],
        'tokens': [token.text for token in doc],
        'pos_tags': [token.pos_ for token in doc]
    }

Frequently Asked Questions

How to Choose the Right Model?

The choice depends on your speed versus accuracy requirements. “sm” models are faster but less accurate; “lg” models are more accurate but slower.

Can I Train My Own Model?

Yes, SpaCy provides tools for training custom models. Use the spacy train command with prepared training data.

How to Process Very Large Texts?

Use nlp.pipe() for batch processing and disable unnecessary pipeline components to speed up execution.
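A minimal sketch of that advice; a blank pipeline is used here so the snippet runs without a downloaded model, while with a full model you would also pass the names of components you don't need:

```python
import spacy

nlp_blank = spacy.blank("en")
texts = [f"Document number {i}." for i in range(1000)]

# nlp.pipe streams Doc objects in batches instead of one call per text
docs = list(nlp_blank.pipe(texts, batch_size=100))
print(len(docs))
# 1000
```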

Does SpaCy Support GPU?

Yes. Install a GPU-enabled build (for example, pip install spacy[cuda12x] for CUDA 12) and call spacy.prefer_gpu() before loading a model. GPU acceleration brings the largest gains for transformer pipelines provided via the spacy‑transformers library.

How to Integrate SpaCy with Flask/Django?

Load the model at application startup and use it inside view functions. It’s recommended to employ a process pool for parallel processing.

Comparison with Other Libraries

SpaCy vs NLTK

SpaCy is geared toward production use with high performance, whereas NLTK is more suited for educational purposes and research.

SpaCy vs CoreNLP

SpaCy is written in Python and easier to use; CoreNLP (Java) offers more language models but is more complex to set up.

SpaCy vs Transformers

SpaCy provides a full processing pipeline, while Transformers focus on modern neural models. They complement each other well.

Conclusion

SpaCy is a powerful and flexible library for natural language processing that successfully combines high performance, accuracy, and ease of use. With its rich functionality, multilingual support, and seamless integration with modern machine‑learning tools, SpaCy remains one of the top choices for building NLP applications in production environments.

The library continues to evolve rapidly, adding new language support, improving algorithms, and integrating with cutting‑edge technologies such as transformers. This makes SpaCy a reliable solution for long‑term NLP projects.
