Key Advantages of SpaCy
Architecture and Performance
SpaCy is implemented in Cython, which gives it exceptional speed for text processing. Its optimized algorithms and data structures make it possible to process millions of tokens per minute on standard hardware.
Multilingual Support
The library ships pretrained pipelines for more than 20 languages, including Russian, English, Chinese, Japanese, and Arabic, with basic tokenization support for many more. Each language comes with pretrained models optimized for its specific linguistic characteristics.
Ready‑to‑Use Models
SpaCy provides pretrained models of various sizes (small, medium, large), allowing you to choose the optimal balance between speed and accuracy for a given task.
Installation and Setup of SpaCy
Installing the Core Library
pip install spacy
Downloading Language Models
For English:
python -m spacy download en_core_web_sm
python -m spacy download en_core_web_md
python -m spacy download en_core_web_lg
For Russian:
python -m spacy download ru_core_news_sm
python -m spacy download ru_core_news_md
python -m spacy download ru_core_news_lg
Importing and Initializing
import spacy
# Load a model
nlp = spacy.load("en_core_web_sm")
# Process text
doc = nlp("Apple is looking at buying a startup in the UK.")
Core Components of the SpaCy Pipeline
Tokenization
SpaCy automatically splits text into tokens, taking into account language‑specific rules, punctuation, and special symbols:
doc = nlp("Don't worry, it's working!")
for token in doc:
    print(token.text, token.is_alpha, token.is_punct, token.like_num)
Morphological Analysis
Each token carries rich morphological information:
for token in doc:
    print(f"Word: {token.text}")
    print(f"Lemma: {token.lemma_}")
    print(f"Part of Speech: {token.pos_}")
    print(f"Tag: {token.tag_}")
    print(f"Morphological Features: {token.morph}")
Syntactic Parsing
SpaCy builds a dependency tree for each sentence:
for token in doc:
    print(f"{token.text} <- {token.dep_} <- {token.head.text}")
    print(f"Children: {[child.text for child in token.children]}")
Lemmatization and Stemming
Lemmatization in SpaCy
SpaCy provides high‑quality lemmatization out of the box:
doc = nlp("The cats are running and jumping")
for token in doc:
    print(f"{token.text} -> {token.lemma_}")
Stemming Alternatives
Although SpaCy does not include stemming directly, you can integrate it with other libraries:
import spacy
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Register the custom attribute before assigning to it
spacy.tokens.Token.set_extension("stem", default=None)

def add_stemming(doc):
    for token in doc:
        token._.stem = stemmer.stem(token.text)
    return doc
Part‑of‑Speech Tagging
Universal POS Tags
SpaCy uses a universal part‑of‑speech tagging scheme:
pos_counts = {}
for token in doc:
    pos = token.pos_
    pos_counts[pos] = pos_counts.get(pos, 0) + 1

for pos, count in pos_counts.items():
    print(f"{pos}: {count}")
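The counting loop above can also be written with the standard library's Counter. The sketch below uses a hard-coded list of tags so it stays self-contained; in practice the list would come from [token.pos_ for token in doc]:

```python
from collections import Counter

# POS tags as produced by, e.g., [token.pos_ for token in doc]
pos_tags = ["DET", "NOUN", "VERB", "DET", "NOUN"]

pos_counts = Counter(pos_tags)
print(pos_counts.most_common())  # most frequent tags first
```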
Detailed Morphological Features
for token in doc:
    if token.pos_ == "VERB":
        print(f"Verb: {token.text}")
        print(f"Tense: {token.morph.get('Tense')}")
        print(f"Number: {token.morph.get('Number')}")
Dependency Analysis
Extracting Syntactic Relations
def extract_dependencies(doc):
    dependencies = []
    for token in doc:
        if token.dep_ != "ROOT":
            dependencies.append({
                'dependent': token.text,
                'head': token.head.text,
                'relation': token.dep_
            })
    return dependencies
Finding Subjects and Objects
def find_subjects_objects(doc):
    subjects = [token for token in doc if token.dep_ == "nsubj"]
    objects = [token for token in doc if token.dep_ in ["dobj", "pobj"]]
    return subjects, objects
Named Entity Recognition
Core NER Categories
SpaCy supports a wide range of named‑entity categories:
for ent in doc.ents:
    print(f"Entity: {ent.text}")
    print(f"Label: {ent.label_}")
    print(f"Description: {spacy.explain(ent.label_)}")
    print(f"Position: {ent.start_char}-{ent.end_char}")
Customizing NER
from spacy.training import Example

# Create custom annotations ("Microsoft" spans characters 0-9)
def create_training_data():
    training_data = [
        ("Microsoft released a new product",
         {"entities": [(0, 9, "ORG")]})
    ]
    return training_data
Working with Sentences and Phrases
Sentence Segmentation
doc = nlp("This is the first sentence. This is the second sentence!")
for i, sent in enumerate(doc.sents):
    print(f"Sentence {i+1}: {sent.text}")
Extracting Noun Phrases
def extract_noun_phrases(doc):
    noun_phrases = []
    for chunk in doc.noun_chunks:
        noun_phrases.append({
            'text': chunk.text,
            'root': chunk.root.text,
            'dep': chunk.root.dep_,
            'head': chunk.root.head.text
        })
    return noun_phrases
Patterns and Matching
Using the Matcher
from spacy.matcher import Matcher
matcher = Matcher(nlp.vocab)
# Find dates
date_pattern = [
    {"SHAPE": "dd"},
    {"LOWER": {"IN": ["january", "february", "march"]}},
    {"SHAPE": "dddd"}
]
matcher.add("DATE_PATTERN", [date_pattern])
matches = matcher(doc)
PhraseMatcher for Large Vocabularies
from spacy.matcher import PhraseMatcher
phrase_matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
companies = ["apple", "microsoft", "google", "amazon"]
patterns = [nlp.make_doc(text) for text in companies]
phrase_matcher.add("COMPANIES", patterns)
Customizing the Pipeline
Adding Custom Components
from spacy.language import Language

# Register the extension before the component runs
spacy.tokens.Doc.set_extension("sentiment", default="neutral")

@Language.component("custom_sentiment")
def sentiment_component(doc):
    # Simple lexicon-based sentiment analysis
    positive_words = {"good", "great", "excellent", "amazing"}
    negative_words = {"bad", "terrible", "awful", "horrible"}
    pos_count = sum(1 for token in doc if token.lemma_.lower() in positive_words)
    neg_count = sum(1 for token in doc if token.lemma_.lower() in negative_words)
    if pos_count > neg_count:
        doc._.sentiment = "positive"
    elif neg_count > pos_count:
        doc._.sentiment = "negative"
    return doc

nlp.add_pipe("custom_sentiment", last=True)
Managing Pipeline Components
# View components
print(nlp.pipe_names)

# Temporarily disable components for speed
# (select_pipes replaces the deprecated disable_pipes in SpaCy v3)
with nlp.select_pipes(disable=["ner", "parser"]):
    doc = nlp("Fast processing with only tokenization and POS")
Multilingual Support
Working with Russian
# Load the Russian model
nlp_ru = spacy.load("ru_core_news_sm")

# "Moscow is the capital of Russia. Red Square is in the city centre."
text_ru = "Москва — столица России. Красная площадь находится в центре города."
doc_ru = nlp_ru(text_ru)

for ent in doc_ru.ents:
    print(f"Entity: {ent.text}, Type: {ent.label_}")
Language Detection
def detect_language(text):
    # Simple script-based heuristic: compare Cyrillic vs. Latin letters.
    # For production use, prefer a dedicated language-detection library.
    cyrillic = sum(1 for ch in text if "\u0400" <= ch <= "\u04ff")
    latin = sum(1 for ch in text if ch.isascii() and ch.isalpha())
    if cyrillic == 0 and latin == 0:
        return "unknown"
    return "ru" if cyrillic > latin else "en"
Visualization and Analysis
Dependency Visualization
from spacy import displacy
# Serve an interactive dependency visualization (this call blocks the process)
displacy.serve(doc, style="dep", port=5000)

# Or save to HTML
html = displacy.render(doc, style="dep", page=True)
with open("dependencies.html", "w", encoding="utf-8") as f:
    f.write(html)
Entity Visualization
# Color scheme for entities
colors = {
    "ORG": "#7aecec",
    "PERSON": "#aa9cfc",
    "GPE": "#feca57"
}
options = {"ents": ["ORG", "PERSON", "GPE"], "colors": colors}
displacy.serve(doc, style="ent", options=options)
Integration with Machine Learning
Integration with scikit‑learn
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
import numpy as np
def spacy_tokenizer(text):
    doc = nlp(text)
    return [token.lemma_ for token in doc if not token.is_stop and not token.is_punct]

# Create a vectorizer that uses the SpaCy tokenizer
# (token_pattern=None silences the warning about the unused default pattern)
vectorizer = TfidfVectorizer(tokenizer=spacy_tokenizer, lowercase=False, token_pattern=None)
Working with Transformers
# Install spacy-transformers
# pip install spacy-transformers

# Load a transformer-based model (an explicit import of spacy_transformers
# is not required once the package is installed)
nlp_trf = spacy.load("en_core_web_trf")

# Get embeddings (the transformer's output tensors)
doc_trf = nlp_trf("This is a sentence with transformer embeddings.")
embeddings = doc_trf._.trf_data.tensors[0]
Table of Core SpaCy Methods and Properties
| Component | Method/Property | Description | Example Usage |
|---|---|---|---|
| nlp | nlp(text) | Process text | doc = nlp("Hello world") |
| nlp | nlp.pipe(texts) | Batch processing | docs = list(nlp.pipe(texts)) |
| nlp | nlp.pipe_names | List of pipeline components | print(nlp.pipe_names) |
| nlp | nlp.add_pipe() | Add a component | nlp.add_pipe("custom_component") |
| Doc | doc.ents | Named entities | for ent in doc.ents: print(ent.text) |
| Doc | doc.sents | Sentences | for sent in doc.sents: print(sent.text) |
| Doc | doc.noun_chunks | Noun phrases | for chunk in doc.noun_chunks: print(chunk.text) |
| Doc | doc.vector | Vector representation | print(doc.vector.shape) |
| Token | token.text | Token text | print(token.text) |
| Token | token.lemma_ | Token lemma | print(token.lemma_) |
| Token | token.pos_ | Part of speech | print(token.pos_) |
| Token | token.tag_ | Detailed tag | print(token.tag_) |
| Token | token.dep_ | Dependency relation | print(token.dep_) |
| Token | token.head | Head token | print(token.head.text) |
| Token | token.children | Child tokens | print(list(token.children)) |
| Token | token.is_alpha | Alphabetic token | if token.is_alpha: print(token.text) |
| Token | token.is_punct | Punctuation token | if token.is_punct: print(token.text) |
| Token | token.is_stop | Stop word | if not token.is_stop: print(token.text) |
| Token | token.like_num | Looks like a number | if token.like_num: print(token.text) |
| Token | token.morph | Morphological features | print(token.morph) |
| Span | span.text | Span text | print(span.text) |
| Span | span.label_ | Entity label | print(span.label_) |
| Span | span.start | Start token index | print(span.start) |
| Span | span.end | End token index | print(span.end) |
| Matcher | matcher.add() | Add a pattern | matcher.add("PATTERN", [pattern]) |
| Matcher | matcher(doc) | Find matches | matches = matcher(doc) |
| displacy | displacy.render() | Visualization | html = displacy.render(doc, style="dep") |
| displacy | displacy.serve() | Web visualization server | displacy.serve(doc, style="ent") |
Practical Usage Examples
Customer Review Analysis
def analyze_reviews(reviews):
    results = []
    for review in reviews:
        doc = nlp(review)

        # Extract entities
        entities = [(ent.text, ent.label_) for ent in doc.ents]

        # Simple sentiment analysis
        positive_words = {"good", "great", "excellent", "amazing", "wonderful"}
        negative_words = {"bad", "terrible", "awful", "horrible", "disappointing"}
        pos_count = sum(1 for token in doc if token.lemma_.lower() in positive_words)
        neg_count = sum(1 for token in doc if token.lemma_.lower() in negative_words)
        sentiment = ("positive" if pos_count > neg_count
                     else "negative" if neg_count > pos_count
                     else "neutral")

        results.append({
            'review': review,
            'entities': entities,
            'sentiment': sentiment,
            'pos_score': pos_count,
            'neg_score': neg_count
        })
    return results
Key Phrase Extraction
from collections import Counter

def extract_key_phrases(text, min_freq=2):
    doc = nlp(text)

    # Extract multi-word noun phrases
    noun_phrases = [chunk.text.lower() for chunk in doc.noun_chunks
                    if len(chunk.text.split()) > 1]

    # Count frequencies and filter
    phrase_counts = Counter(noun_phrases)
    key_phrases = [phrase for phrase, count in phrase_counts.items()
                   if count >= min_freq]
    return key_phrases
Document Classification
def classify_documents(documents, categories):
    classified = []
    for doc_text in documents:
        doc = nlp(doc_text)

        # Feature extraction
        features = {
            'entities': [ent.label_ for ent in doc.ents],
            'pos_tags': [token.pos_ for token in doc],
            'keywords': [token.lemma_.lower() for token in doc
                         if token.is_alpha and not token.is_stop]
        }

        # Simple keyword-based classification
        scores = {}
        for category, keywords in categories.items():
            scores[category] = sum(1 for keyword in keywords
                                   if keyword in features['keywords'])
        predicted_category = max(scores, key=scores.get)

        classified.append({
            'text': doc_text,
            'category': predicted_category,
            'confidence': scores[predicted_category],
            'features': features
        })
    return classified
Performance Optimization
Batch Processing
def process_large_dataset(texts, batch_size=1000):
    results = []
    # Disable NER for speed; the parser stays enabled because doc.sents needs it
    with nlp.select_pipes(disable=["ner"]):
        for doc in nlp.pipe(texts, batch_size=batch_size):
            results.append({
                'text': doc.text,
                'tokens': len(doc),
                'sentences': len(list(doc.sents))
            })
    return results
Caching Results
import functools

@functools.lru_cache(maxsize=1000)
def cached_nlp_processing(text):
    doc = nlp(text)
    return {
        'entities': [(ent.text, ent.label_) for ent in doc.ents],
        'tokens': [token.text for token in doc],
        'pos_tags': [token.pos_ for token in doc]
    }
Frequently Asked Questions
How to Choose the Right Model?
The choice depends on your speed-versus-accuracy requirements: “sm” models are the smallest and fastest but least accurate; “md” models add word vectors; “lg” models are the most accurate but the largest and slowest.
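As a small illustration, the trade-off can be made explicit in code. The preference labels and the helper name below are invented for this example; the package names are the standard English pipelines:

```python
# Map a speed/accuracy preference to an English pipeline name.
EN_MODELS = {
    "fast": "en_core_web_sm",
    "balanced": "en_core_web_md",
    "accurate": "en_core_web_lg",
}

def pick_model(preference):
    # Fall back to the small model for unknown preferences
    return EN_MODELS.get(preference, "en_core_web_sm")

print(pick_model("accurate"))  # en_core_web_lg
```

The chosen name would then be passed to spacy.load().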
Can I Train My Own Model?
Yes, SpaCy provides tools for training custom models. Use the spacy train command with prepared training data.
How to Process Very Large Texts?
Use nlp.pipe() for batch processing and disable unnecessary pipeline components to speed up execution.
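nlp.pipe() batches internally, but when results must be flushed periodically (e.g., written to disk between batches), an explicit batching helper is useful. The sketch below is plain Python with no SpaCy dependency:

```python
def batched(items, size):
    """Yield consecutive slices of at most `size` items."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

texts = [f"document {n}" for n in range(10)]
for batch in batched(texts, 4):
    # in a real pipeline each batch would go through nlp.pipe(batch)
    # and the results would be persisted before the next batch
    print(len(batch))
```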
Does SpaCy Support GPU?
Yes. Install SpaCy with the CUDA extras matching your CUDA version and call spacy.prefer_gpu() before loading a model. The gains are largest with transformer models via the spacy-transformers library.
How to Integrate SpaCy with Flask/Django?
Load the model at application startup and use it inside view functions. It’s recommended to employ a process pool for parallel processing.
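The load-once pattern can be sketched with a cached loader; here spacy.load is stubbed out with a plain object so the sketch stays self-contained:

```python
import functools

@functools.lru_cache(maxsize=1)
def get_nlp():
    # In a real web app this would be: return spacy.load("en_core_web_sm")
    # A plain object stands in for the model to keep the sketch runnable.
    return object()

# Every request handler receives the same instance; the expensive
# model load happens only on the first call.
assert get_nlp() is get_nlp()
```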
Comparison with Other Libraries
SpaCy vs NLTK
SpaCy is geared toward production use with high performance, whereas NLTK is more suited for educational purposes and research.
SpaCy vs CoreNLP
SpaCy is written in Python and easier to use; CoreNLP (Java) offers more language models but is more complex to set up.
SpaCy vs Transformers
SpaCy provides a full processing pipeline, while Transformers focus on modern neural models. They complement each other well.
Conclusion
SpaCy is a powerful and flexible library for natural language processing that successfully combines high performance, accuracy, and ease of use. With its rich functionality, multilingual support, and seamless integration with modern machine‑learning tools, SpaCy remains one of the top choices for building NLP applications in production environments.
The library continues to evolve rapidly, adding new language support, improving algorithms, and integrating with cutting‑edge technologies such as transformers. This makes SpaCy a reliable solution for long‑term NLP projects.