Introduction to NLTK
NLTK (Natural Language Toolkit) is one of the most well‑known and widely used Python libraries for natural language processing (NLP). Developed at the University of Pennsylvania starting in 2001, it provides comprehensive tools and resources for working with text, including tokenization, morphological analysis, syntactic parsing, building frequency distributions and much more. The library is actively used both for educational purposes and in applied text‑analysis projects.
NLTK contains more than 50 corpora and lexical resources, including the Penn Treebank, Brown Corpus, WordNet and many others. This makes it an indispensable tool for researchers, students and developers working on natural language processing tasks.
Architecture and Components of NLTK
Core Library Modules
NLTK is built on a modular principle, where each module is responsible for a specific NLP area:
tokenize module – provides various algorithms for splitting text into tokens
stem module – contains stemming and lemmatization algorithms
tag module – handles part‑of‑speech tagging
parse module – includes parsers for syntactic analysis
chunk module – implements chunking algorithms
classify module – contains machine‑learning classifiers
corpus module – provides access to corpora and lexical resources
Corpora and Lexical Resources
NLTK gives access to an extensive collection of corpora, including classic literary works, news articles, annotated texts and specialized dictionaries. These resources serve as the basis for training and testing NLP algorithms.
Features and Advantages of NLTK
Support for a large number of linguistic corpora – more than 50 corpora in various languages
Variety of algorithms for tokenization, stemming, lemmatization – from simple to advanced methods
Tools for morphological and syntactic analysis – including parsers of different types
Utilities for statistical text analysis – frequency distributions, collocations, n‑grams
Flexible architecture for customizing processing pipelines – ability to create custom handlers
Active documentation and educational resources – books, tutorials, code examples
Integration with popular libraries – NumPy, Matplotlib, scikit‑learn
Installation and Initial Setup
Installing the Library
pip install nltk
For additional capabilities it is also recommended to install:
pip install numpy matplotlib
Importing and Downloading Resources
import nltk
# Interactive resource downloader
nltk.download()
# Download specific resources
nltk.download('punkt') # for tokenization (newer NLTK versions may also need 'punkt_tab')
nltk.download('wordnet') # for lemmatization
nltk.download('stopwords') # stop words
nltk.download('averaged_perceptron_tagger') # POS tags
nltk.download('maxent_ne_chunker') # NER
nltk.download('words') # English word list
Verifying Installation
import nltk
print(nltk.__version__)
nltk.data.find('tokenizers/punkt')
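Since `nltk.data.find()` raises a `LookupError` when a resource is missing, the check and the download can be combined. The helper below is a minimal sketch (the name `ensure_resource` and its parameters are ours, not part of NLTK's API):

```python
import nltk

def ensure_resource(path, package=None, auto_download=True):
    """Return True if an NLTK resource is available, downloading it if allowed.

    `path` is the lookup path (e.g. 'tokenizers/punkt'); `package` is the
    download name, defaulting to the last path component.
    """
    try:
        nltk.data.find(path)
        return True
    except LookupError:
        if not auto_download:
            return False
        # The download name usually matches the last path component
        return bool(nltk.download(package or path.split('/')[-1], quiet=True))

# Example: probing for a (made-up) resource without triggering a download
print(ensure_resource('tokenizers/nonexistent_model', auto_download=False))  # False
```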
Working with Corpora and Text Data
Loading Built‑in Corpora
NLTK contains a rich collection of pre‑processed texts:
from nltk.corpus import gutenberg, brown, reuters, inaugural
# Load Gutenberg corpus
nltk.download('gutenberg')
sample_text = gutenberg.raw('austen-emma.txt')
# Work with Brown corpus
nltk.download('brown')
brown_words = brown.words(categories='news')
# Reuters corpus for classification
nltk.download('reuters')
reuters_categories = reuters.categories()
Working with Custom Texts
# Load text from a file
with open('text.txt', 'r', encoding='utf-8') as file:
custom_text = file.read()
# Work with a text string
text = "Natural Language Processing with NLTK is powerful and flexible."
Text Tokenization
Basic Types of Tokenization
Tokenization is the process of splitting text into smaller units (tokens). NLTK provides several tokenizer types:
from nltk.tokenize import sent_tokenize, word_tokenize, line_tokenize
text = "NLTK is great! It provides many tools. Let's explore them."
# Sentence tokenization
sentences = sent_tokenize(text)
print(sentences) # ['NLTK is great!', 'It provides many tools.', "Let's explore them."]
# Word tokenization
words = word_tokenize(text)
print(words) # ['NLTK', 'is', 'great', '!', 'It', 'provides', 'many', 'tools', '.', 'Let', "'s", 'explore', 'them', '.']
Specialized Tokenizers
from nltk.tokenize import WhitespaceTokenizer, RegexpTokenizer, TreebankWordTokenizer
# Whitespace tokenizer
whitespace_tokenizer = WhitespaceTokenizer()
tokens = whitespace_tokenizer.tokenize(text)
# Regex tokenizer
regexp_tokenizer = RegexpTokenizer(r'\w+')
words_only = regexp_tokenizer.tokenize(text)
# Penn Treebank tokenizer
treebank_tokenizer = TreebankWordTokenizer()
treebank_tokens = treebank_tokenizer.tokenize(text)
Tokenization for Different Languages
from nltk.tokenize import sent_tokenize
# Tokenization for various languages
german_text = "Hallo Welt! Wie geht es dir? Ich hoffe, alles ist gut."
german_sentences = sent_tokenize(german_text, language='german')
Normalization and Pre‑processing
Stemming
Stemming is the process of removing affixes from words to obtain their base form:
from nltk.stem import PorterStemmer, LancasterStemmer, SnowballStemmer
# Porter stemmer
porter = PorterStemmer()
print(porter.stem('running')) # 'run'
print(porter.stem('happiness')) # 'happi'
# Lancaster stemmer (more aggressive)
lancaster = LancasterStemmer()
print(lancaster.stem('running')) # 'run'
print(lancaster.stem('happiness')) # 'happy'
# Snowball stemmer (multilingual)
snowball = SnowballStemmer('english')
print(snowball.stem('running')) # 'run'
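To see how the three stemmers actually differ, they can be run side by side on the same words. The helper name `compare_stemmers` is illustrative, not an NLTK function:

```python
from nltk.stem import PorterStemmer, LancasterStemmer, SnowballStemmer

def compare_stemmers(words):
    """Return {word: (porter, lancaster, snowball)} for side-by-side comparison."""
    porter = PorterStemmer()
    lancaster = LancasterStemmer()
    snowball = SnowballStemmer('english')
    return {w: (porter.stem(w), lancaster.stem(w), snowball.stem(w)) for w in words}

for word, stems in compare_stemmers(['running', 'happiness', 'generously']).items():
    print(word, stems)
```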
Lemmatization
Lemmatization is more precise than stemming because it maps each word to a dictionary lemma, taking the specified part of speech into account:
from nltk.stem import WordNetLemmatizer
nltk.download('omw-1.4')
lemmatizer = WordNetLemmatizer()
# Lemmatization with POS specification
print(lemmatizer.lemmatize('running', pos='v')) # 'run' (verb)
print(lemmatizer.lemmatize('running', pos='n')) # 'running' (noun)
print(lemmatizer.lemmatize('better', pos='a')) # 'good' (adjective)
print(lemmatizer.lemmatize('mice', pos='n')) # 'mouse' (noun)
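Note that `pos_tag()` produces Penn Treebank tags ('VBD', 'JJ', ...) while `lemmatize()` expects WordNet codes ('v', 'a', ...). A small mapping function bridges the two so whole sentences can be lemmatized with the right part of speech; the names `wordnet_pos` and `lemmatize_sentence` are our own sketch:

```python
from nltk import pos_tag, word_tokenize
from nltk.stem import WordNetLemmatizer

def wordnet_pos(treebank_tag):
    """Map a Penn Treebank tag to the WordNet POS code lemmatize() expects."""
    if treebank_tag.startswith('J'):
        return 'a'  # adjective
    if treebank_tag.startswith('V'):
        return 'v'  # verb
    if treebank_tag.startswith('R'):
        return 'r'  # adverb
    return 'n'      # default: noun

def lemmatize_sentence(sentence):
    """Tag a sentence, then lemmatize each word with its mapped POS."""
    lemmatizer = WordNetLemmatizer()
    tagged = pos_tag(word_tokenize(sentence))
    return [lemmatizer.lemmatize(word, wordnet_pos(tag)) for word, tag in tagged]
```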
Stop‑word Removal
from nltk.corpus import stopwords
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))
# Filter stop words
words = word_tokenize("This is a sample sentence with stop words.")
filtered_words = [w for w in words if w.lower() not in stop_words]
print(filtered_words) # ['sample', 'sentence', 'stop', 'words', '.']
# Add custom stop words
custom_stop_words = stop_words.union({'sample', 'sentence'})
Morphological Analysis
Part‑of‑Speech Tagging (POS‑tagging)
POS tags help determine the grammatical role of each word:
from nltk import pos_tag
text = "NLTK is a powerful natural language processing library."
tokens = word_tokenize(text)
tagged = pos_tag(tokens)
print(tagged)
# [('NLTK', 'NNP'), ('is', 'VBZ'), ('a', 'DT'), ('powerful', 'JJ'),
# ('natural', 'JJ'), ('language', 'NN'), ('processing', 'NN'), ('library', 'NN')]
# Get POS tag description
nltk.help.upenn_tagset('NNP') # Proper noun
Advanced POS Analysis
from nltk.tag import UnigramTagger, BigramTagger
# Create a custom POS tagger
from nltk.corpus import brown
# Prepare training data
brown_tagged_sents = brown.tagged_sents(categories='news')
train_sents = brown_tagged_sents[:3000]
test_sents = brown_tagged_sents[3000:3500]
# Train tagger
unigram_tagger = UnigramTagger(train_sents)
accuracy = unigram_tagger.accuracy(test_sents)  # use .evaluate() on older NLTK versions
print(f"Accuracy: {accuracy}")
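The `BigramTagger` imported above becomes useful when taggers are chained with the `backoff` parameter, so contexts a tagger has not seen fall through to a simpler one. A minimal sketch, using a tiny hand-tagged corpus standing in for the Brown sentences:

```python
from nltk.tag import DefaultTagger, UnigramTagger, BigramTagger

# Tiny hand-tagged corpus standing in for brown.tagged_sents(...)
train_sents = [
    [('the', 'DT'), ('cat', 'NN'), ('sat', 'VBD')],
    [('the', 'DT'), ('dog', 'NN'), ('ran', 'VBD')],
]

# Chain: bigram -> unigram -> default ('NN' for completely unseen words)
t0 = DefaultTagger('NN')
t1 = UnigramTagger(train_sents, backoff=t0)
t2 = BigramTagger(train_sents, backoff=t1)

print(t2.tag(['the', 'cat', 'ran']))
```

With real data, train the same chain on `brown.tagged_sents(categories='news')`; each extra backoff level typically raises accuracy on held-out sentences.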
Syntactic Analysis
Named Entity Recognition (NER)
from nltk import ne_chunk
text = "Barack Obama was born in Hawaii. He worked at Google before joining Apple."
tokens = word_tokenize(text)
tagged = pos_tag(tokens)
entities = ne_chunk(tagged)
print(entities)
# Prints a tree with identified named entities
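`ne_chunk()` returns an `nltk.Tree` whose named entities are subtrees, so a short traversal turns the tree into (text, label) pairs. The function name `extract_entities` is ours; the hand-built tree below mimics a typical `ne_chunk()` result:

```python
from nltk import Tree

def extract_entities(chunked):
    """Collect (entity text, label) pairs from an ne_chunk() result tree."""
    return [
        (' '.join(token for token, tag in subtree.leaves()), subtree.label())
        for subtree in chunked
        if isinstance(subtree, Tree)
    ]

# Hand-built example mimicking ne_chunk(pos_tag(word_tokenize(...)))
tree = Tree('S', [
    Tree('PERSON', [('Barack', 'NNP'), ('Obama', 'NNP')]),
    ('was', 'VBD'), ('born', 'VBN'), ('in', 'IN'),
    Tree('GPE', [('Hawaii', 'NNP')]),
])
print(extract_entities(tree))  # [('Barack Obama', 'PERSON'), ('Hawaii', 'GPE')]
```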
Context‑Free Grammars
from nltk import CFG, ChartParser
# Define grammar
grammar = CFG.fromstring("""
S -> NP VP
NP -> Det N | Det Adj N | 'I'
VP -> V NP | V
Det -> 'the' | 'a'
N -> 'cat' | 'dog' | 'book'
Adj -> 'big' | 'small'
V -> 'saw' | 'read' | 'walked'
""")
# Create parser
parser = ChartParser(grammar)
# Parse a sentence
sentence = ['I', 'saw', 'the', 'big', 'cat']
for tree in parser.parse(sentence):
print(tree)
tree.draw() # Visualize parse tree
Chunking
from nltk import RegexpParser
# Grammar for noun phrase chunking
grammar = "NP: {<DT>?<JJ>*<NN>}"
chunk_parser = RegexpParser(grammar)
# Apply to tagged text
text = "The quick brown fox jumps over the lazy dog"
tokens = word_tokenize(text)
tagged = pos_tag(tokens)
chunks = chunk_parser.parse(tagged)
print(chunks)
Statistical Text Analysis
Frequency Distributions
from nltk import FreqDist
import matplotlib.pyplot as plt
# Create frequency distribution
text = gutenberg.words('austen-emma.txt')
fdist = FreqDist(text)
# Analyze frequencies
print(fdist.most_common(10)) # 10 most frequent words
print(fdist['Emma']) # Frequency of a specific word
print(fdist.hapaxes()[:10]) # Words occurring once
# Visualization
fdist.plot(30, cumulative=False)
plt.show()
Conditional Frequency Distributions
from nltk import ConditionalFreqDist
# Category‑wise analysis
cfd = ConditionalFreqDist(
(genre, word.lower())
for genre in ['news', 'romance']
for word in brown.words(categories=genre)
)
# Compare word usage across categories
cfd.plot(conditions=['news', 'romance'], samples=['man', 'woman'])
Working with N‑grams
Creating and Analyzing N‑grams
from nltk import ngrams, bigrams, trigrams
from nltk.util import pad_sequence
text = "The quick brown fox jumps over the lazy dog"
tokens = word_tokenize(text.lower())
# Bigrams
bigrams_list = list(bigrams(tokens))
print(bigrams_list[:5])
# Trigrams
trigrams_list = list(trigrams(tokens))
print(trigrams_list[:5])
# Arbitrary‑length n‑grams
four_grams = list(ngrams(tokens, 4))
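The `pad_sequence` import above adds boundary symbols before building n-grams, so the resulting n-grams also capture sentence starts and ends — useful for language modeling. A minimal sketch:

```python
from nltk.util import pad_sequence, ngrams

tokens = ['the', 'quick', 'fox']
# For n-grams of order n, pad_sequence adds n-1 boundary symbols per side
padded = list(pad_sequence(tokens, n=2,
                           pad_left=True, left_pad_symbol='<s>',
                           pad_right=True, right_pad_symbol='</s>'))
print(padded)                 # ['<s>', 'the', 'quick', 'fox', '</s>']
print(list(ngrams(padded, 2)))
```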
Finding Collocations
from nltk.collocations import BigramCollocationFinder, TrigramCollocationFinder
from nltk.metrics import BigramAssocMeasures, TrigramAssocMeasures
# Find significant bigrams (collocation finding needs a large corpus;
# the short sentence above would yield nothing after frequency filtering)
bigram_finder = BigramCollocationFinder.from_words(brown.words(categories='news'))
bigram_finder.apply_freq_filter(3) # Minimum frequency
bigram_measures = BigramAssocMeasures()
# Get bigrams with high PMI
best_bigrams = bigram_finder.nbest(bigram_measures.pmi, 10)
print(best_bigrams)
Text Classification
Naive Bayes Classifier
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews
import random
# Prepare data
nltk.download('movie_reviews')
def document_features(document):
words = set(document)
features = {}
for word in word_features:
features[f'contains({word})'] = (word in words)
return features
# Create feature list
documents = [(list(movie_reviews.words(fileid)), category)
for category in movie_reviews.categories()
for fileid in movie_reviews.fileids(category)]
random.shuffle(documents)
all_words = FreqDist(w.lower() for w in movie_reviews.words())
word_features = list(all_words)[:2000]
# Build train and test sets
featuresets = [(document_features(d), c) for (d, c) in documents]
train_set, test_set = featuresets[100:], featuresets[:100]
# Train classifier
classifier = NaiveBayesClassifier.train(train_set)
# Test
accuracy = nltk.classify.accuracy(classifier, test_set)
print(f"Accuracy: {accuracy}")
# Show most informative features
classifier.show_most_informative_features(5)
Sentiment Analysis
Basic Sentiment Analysis
from nltk.sentiment import SentimentIntensityAnalyzer
nltk.download('vader_lexicon')
sia = SentimentIntensityAnalyzer()
# Sentiment analysis
text = "NLTK is really amazing for text processing!"
sentiment_scores = sia.polarity_scores(text)
print(sentiment_scores)
# {'neg': 0.0, 'neu': 0.625, 'pos': 0.375, 'compound': 0.6588}
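The compound score ranges from -1 (most negative) to +1 (most positive), and the cutoffs of ±0.05 suggested by VADER's authors are a common way to turn it into a coarse label. The helper name `sentiment_label` is ours:

```python
def sentiment_label(compound, threshold=0.05):
    """Map a VADER compound score to a coarse label.

    The +/-0.05 cutoffs follow the convention suggested by VADER's authors.
    """
    if compound >= threshold:
        return 'positive'
    if compound <= -threshold:
        return 'negative'
    return 'neutral'

print(sentiment_label(0.6588))  # 'positive'
print(sentiment_label(-0.3))    # 'negative'
print(sentiment_label(0.0))     # 'neutral'
```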
Working with WordNet
Semantic Analysis
from nltk.corpus import wordnet
# Find synonyms
synsets = wordnet.synsets('good')
for synset in synsets:
print(f"{synset.name()}: {synset.definition()}")
print(f"Examples: {synset.examples()}")
print(f"Lemmas: {[lemma.name() for lemma in synset.lemmas()]}")
print()
# Find antonyms
good_synset = wordnet.synset('good.a.01')
antonyms = []
for lemma in good_synset.lemmas():
if lemma.antonyms():
antonyms.extend([ant.name() for ant in lemma.antonyms()])
print(f"Antonyms of 'good': {antonyms}")
# Semantic similarity
dog = wordnet.synset('dog.n.01')
cat = wordnet.synset('cat.n.01')
similarity = dog.wup_similarity(cat)
print(f"Similarity between dog and cat: {similarity}")
Advanced Capabilities
Working with Regular Expressions
import re
from nltk.tokenize import regexp_tokenize
# Extract email addresses
text = "Contact us at info@example.com or support@test.org"
emails = regexp_tokenize(text, pattern=r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b')
print(emails)
# Extract URLs (the sample text above contains none, so this yields an empty list)
urls = regexp_tokenize(text, pattern=r'https?://\S+')
Creating Custom Corpora
from nltk.corpus import PlaintextCorpusReader
# Build corpus from custom files
corpus_root = '/path/to/your/corpus'
wordlists = PlaintextCorpusReader(corpus_root, '.*')
print(wordlists.fileids())
print(wordlists.words('file1.txt'))
Table of Core NLTK Methods and Functions
| Category | Function/Class | Description | Usage Example |
|---|---|---|---|
| Resource Loading | nltk.download() | Downloads corpora and models | nltk.download('punkt') |
| Resource Loading | nltk.data.find() | Checks for resource existence | nltk.data.find('tokenizers/punkt') |
| Tokenization | word_tokenize() | Splits into words | word_tokenize("Hello world!") |
| Tokenization | sent_tokenize() | Splits into sentences | sent_tokenize("Hello! How are you?") |
| Tokenization | RegexpTokenizer() | Regex‑based tokenization | RegexpTokenizer(r'\w+') |
| Tokenization | WhitespaceTokenizer() | Whitespace tokenization | WhitespaceTokenizer().tokenize(text) |
| Stemming and Lemmatization | PorterStemmer() | Porter stemming algorithm | PorterStemmer().stem('running') |
| Stemming and Lemmatization | LancasterStemmer() | Aggressive stemming | LancasterStemmer().stem('running') |
| Stemming and Lemmatization | SnowballStemmer() | Multilingual stemming | SnowballStemmer('english').stem('running') |
| Stemming and Lemmatization | WordNetLemmatizer() | Lemmatization using WordNet | WordNetLemmatizer().lemmatize('running', pos='v') |
| Stop Words | stopwords.words() | Gets list of stop words | stopwords.words('english') |
| POS Tags | pos_tag() | Part‑of‑speech tagging | pos_tag(['Hello', 'world']) |
| POS Tags | help.upenn_tagset() | POS tag description | nltk.help.upenn_tagset('NN') |
| Frequency Analysis | FreqDist() | Frequency distribution | FreqDist(tokens) |
| Frequency Analysis | .most_common() | Most frequent items | fdist.most_common(10) |
| Frequency Analysis | .hapaxes() | Items with frequency 1 | fdist.hapaxes() |
| Frequency Analysis | .plot() | Distribution plot | fdist.plot(30) |
| N‑grams | ngrams() | Create n‑grams | list(ngrams(tokens, 2)) |
| N‑grams | bigrams() | Create bigrams | list(bigrams(tokens)) |
| N‑grams | trigrams() | Create trigrams | list(trigrams(tokens)) |
| N‑grams | BigramCollocationFinder | Find collocations | BigramCollocationFinder.from_words(tokens) |
| Syntactic Analysis | ne_chunk() | Named entity recognition | ne_chunk(pos_tag(tokens)) |
| Syntactic Analysis | RegexpParser() | Regex‑based chunking | RegexpParser("NP: {<DT>?<JJ>*<NN>}") |
| Syntactic Analysis | ChartParser() | Parsing with context‑free grammars | ChartParser(grammar) |
| Classification | NaiveBayesClassifier | Naive Bayes classifier | NaiveBayesClassifier.train(train_set) |
| Classification | classify.accuracy() | Classifier accuracy | nltk.classify.accuracy(classifier, test_set) |
| Corpora | gutenberg | Gutenberg corpus | gutenberg.words('austen-emma.txt') |
| Corpora | brown | Brown corpus | brown.words(categories='news') |
| Corpora | movie_reviews | Movie reviews corpus | movie_reviews.words(categories='pos') |
| Corpora | reuters | Reuters news corpus | reuters.categories() |
| WordNet | wordnet.synsets() | Find synonyms | wordnet.synsets('good') |
| WordNet | .definition() | Word definition | synset.definition() |
| WordNet | .examples() | Usage examples | synset.examples() |
| WordNet | .wup_similarity() | Semantic similarity | synset1.wup_similarity(synset2) |
| Sentiment Analysis | SentimentIntensityAnalyzer | Sentiment scoring | SentimentIntensityAnalyzer().polarity_scores(text) |
| Visualization | Text() | Text object for analysis | Text(tokens) |
| Visualization | .concordance() | Contextual search | text.concordance('love') |
| Visualization | .dispersion_plot() | Word dispersion plot | text.dispersion_plot(['love', 'hate']) |
| Visualization | .similar() | Find similar words | text.similar('love') |
Integration with Other Libraries
Integration with pandas
import pandas as pd
from nltk.sentiment import SentimentIntensityAnalyzer
# Sentiment analysis in a DataFrame
df = pd.DataFrame({
'text': ['I love this product!', 'This is terrible.', 'Average quality.']
})
sia = SentimentIntensityAnalyzer()
df['sentiment'] = df['text'].apply(lambda x: sia.polarity_scores(x)['compound'])
Integration with scikit‑learn
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB
# Build a pipeline for classification
def preprocess_text(text):
tokens = word_tokenize(text.lower())
return ' '.join([lemmatizer.lemmatize(token) for token in tokens if token.isalpha()])
pipeline = Pipeline([
('tfidf', TfidfVectorizer()),
('classifier', MultinomialNB())
])
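The pipeline above only becomes useful once fitted on labeled texts. A minimal sketch with a tiny made-up training set (real projects need far more data):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('classifier', MultinomialNB()),
])

# Tiny illustrative training set; labels and texts are invented
texts = ['great wonderful film', 'awful boring film',
         'wonderful acting', 'boring plot']
labels = ['pos', 'neg', 'pos', 'neg']

pipeline.fit(texts, labels)
print(pipeline.predict(['great wonderful acting']))
```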
Integration with spaCy
import spacy
# Use NLTK for preprocessing, spaCy for analysis
nlp = spacy.load('en_core_web_sm')
def hybrid_processing(text):
# Preprocess with NLTK
tokens = word_tokenize(text)
filtered_tokens = [w for w in tokens if w.lower() not in stop_words]
# Analyze with spaCy
doc = nlp(' '.join(filtered_tokens))
return [(token.text, token.pos_, token.dep_) for token in doc]
Practical Use Cases
News Sentiment Analysis
def analyze_news_sentiment(news_texts):
sia = SentimentIntensityAnalyzer()
results = []
for text in news_texts:
# Preprocess
tokens = word_tokenize(text.lower())
filtered_tokens = [w for w in tokens if w.isalpha() and w not in stop_words]
processed_text = ' '.join(filtered_tokens)
# Sentiment analysis
sentiment = sia.polarity_scores(processed_text)
results.append({
'text': text[:100] + '...',
'positive': sentiment['pos'],
'negative': sentiment['neg'],
'neutral': sentiment['neu'],
'compound': sentiment['compound']
})
return results
Key Phrase Extraction
def extract_key_phrases(text, n=10):
    # Tokenize and POS tag
    tokens = word_tokenize(text.lower())
    tagged = pos_tag(tokens)
    # Extract noun groups
    grammar = "NP: {<DT>?<JJ>*<NN>+}"
    parser = RegexpParser(grammar)
    tree = parser.parse(tagged)
    # Gather phrases
    phrases = []
    for subtree in tree.subtrees():
        if subtree.label() == 'NP':
            phrase = ' '.join([word for word, pos in subtree.leaves()])
            phrases.append(phrase)
    # Frequency count
    phrase_freq = FreqDist(phrases)
    return phrase_freq.most_common(n)
Document Comparison
def compare_documents(doc1, doc2):
# Preprocess documents
def preprocess(text):
tokens = word_tokenize(text.lower())
return [lemmatizer.lemmatize(w) for w in tokens if w.isalpha() and w not in stop_words]
tokens1 = preprocess(doc1)
tokens2 = preprocess(doc2)
# Build frequency distributions
fdist1 = FreqDist(tokens1)
fdist2 = FreqDist(tokens2)
# Jaccard similarity
set1 = set(tokens1)
set2 = set(tokens2)
intersection = len(set1.intersection(set2))
union = len(set1.union(set2))
jaccard_similarity = intersection / union if union > 0 else 0
return {
'jaccard_similarity': jaccard_similarity,
'common_words': list(set1.intersection(set2)),
'doc1_unique': list(set1 - set2),
'doc2_unique': list(set2 - set1)
}
Performance Optimization
Caching Results
from functools import lru_cache
@lru_cache(maxsize=1000)
def cached_lemmatize(word, pos):
return lemmatizer.lemmatize(word, pos)
@lru_cache(maxsize=1000)
def cached_stem(word):
return stemmer.stem(word)
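`lru_cache` also exposes hit/miss statistics via `cache_info()`, which confirms that repeated words are served from the cache. A small self-contained sketch with a module-level Porter stemmer:

```python
from functools import lru_cache
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()  # created once, so the cache wraps a cheap call

@lru_cache(maxsize=1000)
def cached_stem(word):
    return stemmer.stem(word)

# Repeated words hit the cache instead of re-running the stemmer
for w in ['running', 'running', 'runs', 'running']:
    cached_stem(w)

print(cached_stem.cache_info())  # hits=2, misses=2 for the calls above
```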
Batch Processing
def batch_process_texts(texts, batch_size=100):
results = []
for i in range(0, len(texts), batch_size):
batch = texts[i:i+batch_size]
batch_results = []
for text in batch:
# Process text
processed = preprocess_text(text)
batch_results.append(processed)
results.extend(batch_results)
return results
Creating Custom Components
Custom Tokenizer
from nltk.tokenize.api import TokenizerI
class CustomTokenizer(TokenizerI):
def __init__(self, preserve_case=False):
self.preserve_case = preserve_case
def tokenize(self, text):
# Custom tokenization logic
if not self.preserve_case:
text = text.lower()
# Simple whitespace tokenization with punctuation removal
tokens = re.findall(r'\b\w+\b', text)
return tokens
Custom Classifier
from nltk.classify.api import ClassifierI
class CustomClassifier(ClassifierI):
    def __init__(self, feature_extractor):
        self.feature_extractor = feature_extractor
        self.model = None
    def labels(self):
        # ClassifierI also requires labels(); here we assume an
        # sklearn-style model exposing classes_
        return list(self.model.classes_) if self.model is not None else []
    def classify(self, featureset):
        # Custom classification logic
        return self.model.predict([featureset])[0]
    def prob_classify(self, featureset):
        # Returns class probabilities (ClassifierI expects a ProbDistI;
        # wrap the array in nltk.probability.DictionaryProbDist if needed)
        probabilities = self.model.predict_proba([featureset])[0]
        return probabilities
Processing Multiple Languages
Multilingual Support
# Settings for different languages
language_settings = {
'english': {
'stopwords': set(stopwords.words('english')),
'stemmer': SnowballStemmer('english'),
'tokenizer': 'punkt'
},
'spanish': {
'stopwords': set(stopwords.words('spanish')),
'stemmer': SnowballStemmer('spanish'),
'tokenizer': 'punkt'
},
'french': {
'stopwords': set(stopwords.words('french')),
'stemmer': SnowballStemmer('french'),
'tokenizer': 'punkt'
}
}
def process_multilingual_text(text, language='english'):
settings = language_settings[language]
# Tokenize
tokens = word_tokenize(text, language=language)
# Remove stop words
filtered_tokens = [w for w in tokens if w.lower() not in settings['stopwords']]
# Stem
stemmed_tokens = [settings['stemmer'].stem(w) for w in filtered_tokens]
return stemmed_tokens
Working with Large Datasets
Streaming Processing
def stream_process_large_corpus(file_path):
with open(file_path, 'r', encoding='utf-8') as file:
for line in file:
# Process one line at a time
processed_line = preprocess_text(line.strip())
yield processed_line
# Use generator to save memory
def analyze_large_corpus(file_path):
word_freq = FreqDist()
for processed_line in stream_process_large_corpus(file_path):
tokens = word_tokenize(processed_line)
for token in tokens:
word_freq[token] += 1
return word_freq.most_common(100)
Parallel Processing
from multiprocessing import Pool
import functools
def parallel_text_processing(texts, num_processes=4):
with Pool(processes=num_processes) as pool:
results = pool.map(preprocess_text, texts)
return results
Debugging and Testing
Component Testing
import unittest
class TestNLTKProcessing(unittest.TestCase):
def setUp(self):
self.sample_text = "This is a test sentence for NLTK processing."
def test_tokenization(self):
tokens = word_tokenize(self.sample_text)
self.assertIsInstance(tokens, list)
self.assertGreater(len(tokens), 0)
def test_pos_tagging(self):
tokens = word_tokenize(self.sample_text)
tagged = pos_tag(tokens)
self.assertEqual(len(tagged), len(tokens))
def test_lemmatization(self):
lemmatizer = WordNetLemmatizer()
result = lemmatizer.lemmatize('running', pos='v')
self.assertEqual(result, 'run')
if __name__ == '__main__':
unittest.main()
Tips for Using NLTK
Best Practices
Always download required resources before using the corresponding functions
Use appropriate POS tags for lemmatization to obtain more accurate results
Combine different preprocessing methods depending on the task
Test on diverse text types to verify the robustness of your solution
Utilize caching to improve performance when processing large volumes of data
Common Mistakes
Forgetting to download resources – always check that necessary models are loaded
Choosing the wrong tokenizer – different tasks require different tokenization approaches
Ignoring context during lemmatization – specifying the part of speech greatly improves results
Inefficient handling of big data – use generators and streaming processing
Alternatives and Comparison
Comparison with Other Libraries
spaCy: Faster and more modern alternative, optimized for production
TextBlob: Simplified interface for basic NLP tasks
Gensim: Specializes in topic modeling and vector‑space representations
Transformers: State‑of‑the‑art pretrained models for complex NLP tasks
When to Use NLTK
Educational purposes – an excellent library for learning NLP fundamentals
Research projects – rich set of algorithms and corpora
Prototyping – rapid creation of basic NLP solutions
Specific linguistic tasks – unique tools for deep language analysis
Frequently Asked Questions
What is NLTK? NLTK (Natural Language Toolkit) is a comprehensive Python library for natural language processing, providing tools for tokenization, morphological analysis, syntactic parsing and other NLP tasks.
Does NLTK support Russian? NLTK has limited support for Russian. Core functionality works best with English, but some features (e.g., tokenization) can be applied to Russian text.
How does NLTK differ from spaCy? NLTK offers an educational and research‑focused toolkit with a wide range of algorithms, whereas spaCy is optimized for production use and runs faster.
Is NLTK suitable for machine learning? NLTK includes basic classification algorithms, but for serious machine‑learning tasks it is recommended to use it together with scikit‑learn or other ML libraries.
Can NLTK handle big data? NLTK is not optimized for large‑scale data processing. For such tasks, spaCy or distributed computing frameworks are preferable.
Does NLTK have GPU support? NLTK does not have built‑in GPU support. For GPU acceleration use specialized libraries such as PyTorch or TensorFlow.
How to add support for a new language in NLTK? You can create custom tokenizers, add stop‑word lists and train your own models for the target language.
Is NLTK suitable for commercial use? Yes, NLTK is released under the Apache License 2.0, which permits use in commercial projects.
Conclusion
NLTK remains one of the most important and influential libraries in the Python natural language processing ecosystem. Thanks to its educational focus, rich set of resources and comprehensive coverage of NLP aspects, it serves as an excellent starting point for learning and understanding natural language processing principles.
While production‑grade projects may require more modern and optimized solutions, NLTK provides a solid foundation for grasping fundamental NLP concepts and remains an indispensable tool for researchers, students and developers working with textual data.
Combining NLTK with other modern libraries enables the creation of powerful and efficient solutions for a wide range of natural language processing tasks, from simple text analysis to complex language understanding systems.
* --- **💬 Response:**
Introduction to NLTK
NLTK (Natural Language Toolkit) — is one of the most well‑known and widely used Python libraries for natural language processing (NLP). Developed in 2001 at the University of Pennsylvania, it provides comprehensive tools and resources for working with text, including tokenization, morphological analysis, syntactic parsing, building frequency distributions and much more. The library is actively used both for educational purposes and in applied text‑analysis projects.
NLTK contains more than 50 corpora and lexical resources, including the Penn Treebank, Brown Corpus, WordNet and many others. This makes it an indispensable tool for researchers, students and developers working on natural language processing tasks.
Architecture and Components of NLTK
Core Library Modules
NLTK is built on a modular principle, where each module is responsible for a specific NLP area:
tokenize module – provides various algorithms for splitting text into tokens stem module – contains stemming and lemmatization algorithms tag module – handles part‑of‑speech tagging parse module – includes parsers for syntactic analysis chunk module – implements chunking algorithms classify module – contains machine‑learning classifiers corpus module – provides access to corpora and lexical resources
Corpora and Lexical Resources
NLTK gives access to an extensive collection of corpora, including classic literary works, news articles, annotated texts and specialized dictionaries. These resources serve as the basis for training and testing NLP algorithms.
Features and Advantages of NLTK
Support for a large number of linguistic corpora – more than 50 corpora in various languages
Variety of algorithms for tokenization, stemming, lemmatization – from simple to advanced methods
Tools for morphological and syntactic analysis – including parsers of different types
Utilities for statistical text analysis – frequency distributions, collocations, n‑grams
Flexible architecture for customizing processing pipelines – ability to create custom handlers
Active documentation and educational resources – books, tutorials, code examples
Integration with popular libraries – NumPy, Matplotlib, scikit‑learn
Installation and Initial Setup
Installing the Library
pip install nltk
For additional capabilities it is also recommended to install:
pip install numpy matplotlib
Importing and Downloading Resources
import nltk
# Interactive resource downloader
nltk.download()
# Download specific resources
nltk.download('punkt') # for tokenization
nltk.download('wordnet') # for lemmatization
nltk.download('stopwords') # stop words
nltk.download('averaged_perceptron_tagger') # POS tags
nltk.download('maxent_ne_chunker') # NER
nltk.download('words') # English word list
Verifying Installation
import nltk
print(nltk.__version__)
nltk.data.find('tokenizers/punkt')
Working with Corpora and Text Data
Loading Built‑in Corpora
NLTK contains a rich collection of pre‑processed texts:
from nltk.corpus import gutenberg, brown, reuters, inaugural
# Load Gutenberg corpus
nltk.download('gutenberg')
sample_text = gutenberg.raw('austen-emma.txt')
# Work with Brown corpus
nltk.download('brown')
brown_words = brown.words(categories='news')
# Reuters corpus for classification
nltk.download('reuters')
reuters_categories = reuters.categories()
Working with Custom Texts
# Load text from a file
with open('text.txt', 'r', encoding='utf-8') as file:
custom_text = file.read()
# Work with a text string
text = "Natural Language Processing with NLTK is powerful and flexible."
Text Tokenization
Basic Types of Tokenization
Tokenization is the process of splitting text into smaller units (tokens). NLTK provides several tokenizer types:
from nltk.tokenize import sent_tokenize, word_tokenize, line_tokenize
text = "NLTK is great! It provides many tools. Let's explore them."
# Sentence tokenization
sentences = sent_tokenize(text)
print(sentences) # ['NLTK is great!', 'It provides many tools.', "Let's explore them."]
# Word tokenization
words = word_tokenize(text)
print(words) # ['NLTK', 'is', 'great', '!', 'It', 'provides', 'many', 'tools', '.', 'Let', "'s", 'explore', 'them', '.']
Specialized Tokenizers
from nltk.tokenize import WhitespaceTokenizer, RegexpTokenizer, TreebankWordTokenizer
# Whitespace tokenizer
whitespace_tokenizer = WhitespaceTokenizer()
tokens = whitespace_tokenizer.tokenize(text)
# Regex tokenizer
regexp_tokenizer = RegexpTokenizer(r'\w+')
words_only = regexp_tokenizer.tokenize(text)
# Penn Treebank tokenizer
treebank_tokenizer = TreebankWordTokenizer()
treebank_tokens = treebank_tokenizer.tokenize(text)
Tokenization for Different Languages
from nltk.tokenize import sent_tokenize
# Tokenization for various languages
german_text = "Hallo Welt! Wie geht es dir? Ich hoffe, alles ist gut."
german_sentences = sent_tokenize(german_text, language='german')
Normalization and Pre‑processing
Stemming
Stemming is the process of removing affixes from words to obtain their base form:
from nltk.stem import PorterStemmer, LancasterStemmer, SnowballStemmer
# Porter stemmer
porter = PorterStemmer()
print(porter.stem('running')) # 'run'
print(porter.stem('happiness')) # 'happi'
# Lancaster stemmer (more aggressive)
lancaster = LancasterStemmer()
print(lancaster.stem('running')) # 'run'
print(lancaster.stem('happiness')) # 'happy'
# Snowball stemmer (multilingual)
snowball = SnowballStemmer('english')
print(snowball.stem('running')) # 'run'
Lemmatization
Lemmatization is more precise than stemming because it takes context and part of speech into account:
from nltk.stem import WordNetLemmatizer
nltk.download('omw-1.4')
lemmatizer = WordNetLemmatizer()
# Lemmatization with POS specification
print(lemmatizer.lemmatize('running', pos='v')) # 'run' (verb)
print(lemmatizer.lemmatize('running', pos='n')) # 'running' (noun)
print(lemmatizer.lemmatize('better', pos='a')) # 'good' (adjective)
print(lemmatizer.lemmatize('mice', pos='n')) # 'mouse' (noun)
Stop‑word Removal
from nltk.corpus import stopwords
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))
# Filter stop words
words = word_tokenize("This is a sample sentence with stop words.")
filtered_words = [w for w in words if w.lower() not in stop_words]
print(filtered_words) # ['sample', 'sentence', 'stop', 'words', '.']
# Add custom stop words
custom_stop_words = stop_words.union({'sample', 'sentence'})
Morphological Analysis
Part‑of‑Speech Tagging (POS‑tagging)
POS tags help determine the grammatical role of each word:
from nltk import pos_tag
text = "NLTK is a powerful natural language processing library."
tokens = word_tokenize(text)
tagged = pos_tag(tokens)
print(tagged)
# [('NLTK', 'NNP'), ('is', 'VBZ'), ('a', 'DT'), ('powerful', 'JJ'),
# ('natural', 'JJ'), ('language', 'NN'), ('processing', 'NN'), ('library', 'NN')]
# Get POS tag description
nltk.help.upenn_tagset('NNP') # Proper noun
Advanced POS Analysis
from nltk.tag import UnigramTagger, BigramTagger
# Create a custom POS tagger
from nltk.corpus import brown
# Prepare training data
brown_tagged_sents = brown.tagged_sents(categories='news')
train_sents = brown_tagged_sents[:3000]
test_sents = brown_tagged_sents[3000:3500]
# Train tagger
unigram_tagger = UnigramTagger(train_sents)
accuracy = unigram_tagger.accuracy(test_sents)  # evaluate() in NLTK < 3.6
print(f"Accuracy: {accuracy}")
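The BigramTagger imported above can be chained with a unigram tagger through the backoff parameter, so contexts the bigram model has never seen fall back to unigram statistics. A minimal sketch on a hand-made training set (the tiny corpus here is purely illustrative):

```python
from nltk.tag import UnigramTagger, BigramTagger

# Tiny hand-tagged training corpus (illustrative only)
train_sents = [
    [('the', 'DT'), ('cat', 'NN'), ('sleeps', 'VBZ')],
    [('the', 'DT'), ('dog', 'NN'), ('runs', 'VBZ')],
]
# The unigram tagger handles contexts the bigram tagger has never seen
unigram = UnigramTagger(train_sents)
bigram = BigramTagger(train_sents, backoff=unigram)
print(bigram.tag(['the', 'dog', 'sleeps']))
# [('the', 'DT'), ('dog', 'NN'), ('sleeps', 'VBZ')]
```

In practice the same backoff chain is trained on a real corpus such as Brown, which noticeably improves accuracy over a lone unigram tagger.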
Syntactic Analysis
Named Entity Recognition (NER)
from nltk import ne_chunk
text = "Barack Obama was born in Hawaii. He worked at Google before joining Apple."
tokens = word_tokenize(text)
tagged = pos_tag(tokens)
entities = ne_chunk(tagged)
print(entities)
# Prints a tree with identified named entities
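The result of ne_chunk is an nltk.tree.Tree; a hypothetical helper like the one below flattens it into (entity, label) pairs. It works on any chunk tree, shown here on a manually built one so the example runs without downloaded models:

```python
from nltk.tree import Tree

def extract_entities(tree):
    """Collect (entity text, label) pairs from an ne_chunk result."""
    entities = []
    for subtree in tree.subtrees():
        if subtree.label() != 'S':  # skip the root sentence node
            name = ' '.join(word for word, tag in subtree.leaves())
            entities.append((name, subtree.label()))
    return entities

# Manually built tree mimicking ne_chunk output
sample = Tree('S', [
    Tree('PERSON', [('Barack', 'NNP'), ('Obama', 'NNP')]),
    ('was', 'VBD'), ('born', 'VBN'), ('in', 'IN'),
    Tree('GPE', [('Hawaii', 'NNP')]),
])
print(extract_entities(sample))  # [('Barack Obama', 'PERSON'), ('Hawaii', 'GPE')]
```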
Context‑Free Grammars
from nltk import CFG, ChartParser
# Define grammar
grammar = CFG.fromstring("""
S -> NP VP
NP -> Det N | Det Adj N | 'I'
VP -> V NP | V
Det -> 'the' | 'a'
N -> 'cat' | 'dog' | 'book'
Adj -> 'big' | 'small'
V -> 'saw' | 'read' | 'walked'
""")
# Create parser
parser = ChartParser(grammar)
# Parse a sentence
sentence = ['I', 'saw', 'the', 'big', 'cat']
for tree in parser.parse(sentence):
print(tree)
tree.draw() # Visualize parse tree
Chunking
from nltk import RegexpParser
# Grammar for noun phrase chunking: optional determiner, any adjectives, a noun
grammar = "NP: {<DT>?<JJ>*<NN>}"
chunk_parser = RegexpParser(grammar)
# Apply to tagged text
text = "The quick brown fox jumps over the lazy dog"
tokens = word_tokenize(text)
tagged = pos_tag(tokens)
chunks = chunk_parser.parse(tagged)
print(chunks)
Statistical Text Analysis
Frequency Distributions
from nltk import FreqDist
from nltk.corpus import gutenberg
import matplotlib.pyplot as plt
nltk.download('gutenberg')
# Create frequency distribution
text = gutenberg.words('austen-emma.txt')
fdist = FreqDist(text)
# Analyze frequencies
print(fdist.most_common(10)) # 10 most frequent words
print(fdist['Emma']) # Frequency of a specific word
print(fdist.hapaxes()[:10]) # Words occurring once
# Visualization
fdist.plot(30, cumulative=False)
plt.show()
Conditional Frequency Distributions
from nltk import ConditionalFreqDist
from nltk.corpus import brown
# Category‑wise analysis
cfd = ConditionalFreqDist(
(genre, word.lower())
for genre in ['news', 'romance']
for word in brown.words(categories=genre)
)
# Compare word usage across categories
cfd.plot(conditions=['news', 'romance'], samples=['man', 'woman'])
Working with N‑grams
Creating and Analyzing N‑grams
from nltk import ngrams, bigrams, trigrams
from nltk.util import pad_sequence
text = "The quick brown fox jumps over the lazy dog"
tokens = word_tokenize(text.lower())
# Bigrams
bigrams_list = list(bigrams(tokens))
print(bigrams_list[:5])
# Trigrams
trigrams_list = list(trigrams(tokens))
print(trigrams_list[:5])
# Arbitrary‑length n‑grams
four_grams = list(ngrams(tokens, 4))
Finding Collocations
from nltk.collocations import BigramCollocationFinder, TrigramCollocationFinder
from nltk.metrics import BigramAssocMeasures, TrigramAssocMeasures
# Find significant bigrams
bigram_finder = BigramCollocationFinder.from_words(tokens)
bigram_finder.apply_freq_filter(3) # Minimum frequency (on a short text this may filter out everything)
bigram_measures = BigramAssocMeasures()
# Get bigrams with high PMI
best_bigrams = bigram_finder.nbest(bigram_measures.pmi, 10)
print(best_bigrams)
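TrigramCollocationFinder, imported above, works the same way. A sketch on a small synthetic token list, with the frequency filter lowered so the toy data survives:

```python
from nltk.collocations import TrigramCollocationFinder
from nltk.metrics import TrigramAssocMeasures

words = "new york city is big and new york city is loud".split()
finder = TrigramCollocationFinder.from_words(words)
finder.apply_freq_filter(2)  # toy data: require at least 2 occurrences
measures = TrigramAssocMeasures()
print(finder.nbest(measures.pmi, 3))
```

On this input only the trigrams that occur twice ("new york city", "york city is") pass the filter.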
Text Classification
Naive Bayes Classifier
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews
import random
# Prepare data
nltk.download('movie_reviews')
def document_features(document):
words = set(document)
features = {}
for word in word_features:
features[f'contains({word})'] = (word in words)
return features
# Create feature list
documents = [(list(movie_reviews.words(fileid)), category)
for category in movie_reviews.categories()
for fileid in movie_reviews.fileids(category)]
random.shuffle(documents)
all_words = FreqDist(w.lower() for w in movie_reviews.words())
word_features = list(all_words)[:2000]
# Build train and test sets
featuresets = [(document_features(d), c) for (d, c) in documents]
train_set, test_set = featuresets[100:], featuresets[:100]
# Train classifier
classifier = NaiveBayesClassifier.train(train_set)
# Test
accuracy = nltk.classify.accuracy(classifier, test_set)
print(f"Accuracy: {accuracy}")
# Show most informative features
classifier.show_most_informative_features(5)
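Once trained, the classifier labels unseen feature sets with classify(). The toy example below trains on hand-made feature dictionaries, so it runs without the movie_reviews corpus; the feature names mirror the contains(...) convention used above:

```python
from nltk.classify import NaiveBayesClassifier

# Hand-made training data (illustrative only)
train = [
    ({'contains(great)': True, 'contains(awful)': False}, 'pos'),
    ({'contains(great)': True, 'contains(awful)': False}, 'pos'),
    ({'contains(great)': False, 'contains(awful)': True}, 'neg'),
    ({'contains(great)': False, 'contains(awful)': True}, 'neg'),
]
clf = NaiveBayesClassifier.train(train)
# Classify a previously unseen feature set
print(clf.classify({'contains(great)': True, 'contains(awful)': False}))  # 'pos'
```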
Sentiment Analysis
Basic Sentiment Analysis
from nltk.sentiment import SentimentIntensityAnalyzer
nltk.download('vader_lexicon')
sia = SentimentIntensityAnalyzer()
# Sentiment analysis
text = "NLTK is really amazing for text processing!"
sentiment_scores = sia.polarity_scores(text)
print(sentiment_scores)
# {'neg': 0.0, 'neu': 0.625, 'pos': 0.375, 'compound': 0.6588}
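The compound value in the output above is the overall score in [-1, 1]; the VADER documentation suggests ±0.05 cut-offs for bucketing it into labels. A hypothetical helper:

```python
def label_sentiment(compound):
    # Thresholds suggested in the VADER documentation
    if compound >= 0.05:
        return 'positive'
    if compound <= -0.05:
        return 'negative'
    return 'neutral'

print(label_sentiment(0.6588))  # 'positive'
```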
Working with WordNet
Semantic Analysis
from nltk.corpus import wordnet
# Find synonyms
synsets = wordnet.synsets('good')
for synset in synsets:
print(f"{synset.name()}: {synset.definition()}")
print(f"Examples: {synset.examples()}")
print(f"Lemmas: {[lemma.name() for lemma in synset.lemmas()]}")
print()
# Find antonyms
good_synset = wordnet.synset('good.a.01')
antonyms = []
for lemma in good_synset.lemmas():
if lemma.antonyms():
antonyms.extend([ant.name() for ant in lemma.antonyms()])
print(f"Antonyms of 'good': {antonyms}")
# Semantic similarity
dog = wordnet.synset('dog.n.01')
cat = wordnet.synset('cat.n.01')
similarity = dog.wup_similarity(cat)
print(f"Similarity between dog and cat: {similarity}")
Advanced Capabilities
Working with Regular Expressions
import re
from nltk.tokenize import regexp_tokenize
# Extract email addresses
text = "Contact us at info@example.com or support@test.org"
emails = regexp_tokenize(text, pattern=r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b')
print(emails)
# Extract URLs (this sample text contains none, so the result is empty)
urls = regexp_tokenize(text, pattern=r'https?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+')
Creating Custom Corpora
from nltk.corpus import PlaintextCorpusReader
# Build corpus from custom files
corpus_root = '/path/to/your/corpus'
wordlists = PlaintextCorpusReader(corpus_root, '.*')
print(wordlists.fileids())
print(wordlists.words('file1.txt'))
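A self-contained sketch of the same idea, building a throwaway corpus in a temporary directory (the file name here is illustrative):

```python
import os
import tempfile
from nltk.corpus import PlaintextCorpusReader

with tempfile.TemporaryDirectory() as corpus_root:
    with open(os.path.join(corpus_root, 'file1.txt'), 'w', encoding='utf-8') as f:
        f.write('Hello corpus world.')
    reader = PlaintextCorpusReader(corpus_root, r'.*\.txt')
    fileids = reader.fileids()
    # Materialize the lazy corpus view before the directory vanishes
    words = list(reader.words('file1.txt'))

print(fileids)  # ['file1.txt']
print(words)    # ['Hello', 'corpus', 'world', '.']
```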
Table of Core NLTK Methods and Functions
| Category | Function/Class | Description | Usage Example |
|---|---|---|---|
| Resource Loading | nltk.download() | Downloads corpora and models | nltk.download('punkt') |
| | nltk.data.find() | Checks for resource existence | nltk.data.find('tokenizers/punkt') |
| Tokenization | word_tokenize() | Splits into words | word_tokenize("Hello world!") |
| | sent_tokenize() | Splits into sentences | sent_tokenize("Hello! How are you?") |
| | RegexpTokenizer() | Regex‑based tokenization | RegexpTokenizer(r'\w+') |
| | WhitespaceTokenizer() | Whitespace tokenization | WhitespaceTokenizer().tokenize(text) |
| Stemming and Lemmatization | PorterStemmer() | Porter stemming algorithm | PorterStemmer().stem('running') |
| | LancasterStemmer() | Aggressive stemming | LancasterStemmer().stem('running') |
| | SnowballStemmer() | Multilingual stemming | SnowballStemmer('english').stem('running') |
| | WordNetLemmatizer() | Lemmatization using WordNet | WordNetLemmatizer().lemmatize('running', pos='v') |
| Stop Words | stopwords.words() | Gets list of stop words | stopwords.words('english') |
| POS Tags | pos_tag() | Part‑of‑speech tagging | pos_tag(['Hello', 'world']) |
| | help.upenn_tagset() | POS tag description | help.upenn_tagset('NN') |
| Frequency Analysis | FreqDist() | Frequency distribution | FreqDist(tokens) |
| | .most_common() | Most frequent items | fdist.most_common(10) |
| | .hapaxes() | Items with frequency 1 | fdist.hapaxes() |
| | .plot() | Distribution plot | fdist.plot(30) |
| N‑grams | ngrams() | Create n‑grams | list(ngrams(tokens, 2)) |
| | bigrams() | Create bigrams | list(bigrams(tokens)) |
| | trigrams() | Create trigrams | list(trigrams(tokens)) |
| | BigramCollocationFinder | Find collocations | BigramCollocationFinder.from_words(tokens) |
| Syntactic Analysis | ne_chunk() | Named entity recognition | ne_chunk(pos_tag(tokens)) |
| | RegexpParser() | Regex‑based chunking | RegexpParser("NP: {<DT>?<JJ>*<NN>}") |
| | ChartParser() | Parsing with context‑free grammars | ChartParser(grammar) |
| Classification | NaiveBayesClassifier | Naive Bayes classifier | NaiveBayesClassifier.train(train_set) |
| | classify.accuracy() | Classifier accuracy | classify.accuracy(classifier, test_set) |
| Corpora | gutenberg | Gutenberg corpus | gutenberg.words('austen-emma.txt') |
| | brown | Brown corpus | brown.words(categories='news') |
| | movie_reviews | Movie reviews corpus | movie_reviews.words(categories='pos') |
| | reuters | Reuters news corpus | reuters.categories() |
| WordNet | wordnet.synsets() | Find synsets (synonyms) | wordnet.synsets('good') |
| | .definition() | Word definition | synset.definition() |
| | .examples() | Usage examples | synset.examples() |
| | .wup_similarity() | Semantic similarity | synset1.wup_similarity(synset2) |
| Sentiment Analysis | SentimentIntensityAnalyzer | Sentiment scoring | SentimentIntensityAnalyzer().polarity_scores(text) |
| Visualization | Text() | Text object for analysis | Text(tokens) |
| | .concordance() | Contextual search | text.concordance('love') |
| | .dispersion_plot() | Word dispersion plot | text.dispersion_plot(['love', 'hate']) |
| | .similar() | Find similar words | text.similar('love') |
Integration with Other Libraries
Integration with pandas
import pandas as pd
from nltk.sentiment import SentimentIntensityAnalyzer
# Sentiment analysis in a DataFrame
df = pd.DataFrame({
'text': ['I love this product!', 'This is terrible.', 'Average quality.']
})
sia = SentimentIntensityAnalyzer()
df['sentiment'] = df['text'].apply(lambda x: sia.polarity_scores(x)['compound'])
Integration with scikit‑learn
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB
# Build a pipeline for classification
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
def preprocess_text(text):
tokens = word_tokenize(text.lower())
return ' '.join([lemmatizer.lemmatize(token) for token in tokens if token.isalpha()])
pipeline = Pipeline([
('tfidf', TfidfVectorizer()),
('classifier', MultinomialNB())
])
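A hedged usage sketch of the pipeline, assuming scikit-learn is installed; the training texts and labels below are made up for illustration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('classifier', MultinomialNB()),
])
# Toy training data (illustrative only)
texts = ['great product, works well', 'love this library',
         'terrible quality, broke quickly', 'awful support, very slow']
labels = ['pos', 'pos', 'neg', 'neg']
pipeline.fit(texts, labels)
print(pipeline.predict(['great library, love it']))
```

In a real workflow preprocess_text would be applied to the texts first, so that TF-IDF operates on lemmatized tokens.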
Integration with spaCy
import spacy
# Use NLTK for preprocessing, spaCy for analysis
nlp = spacy.load('en_core_web_sm')
def hybrid_processing(text):
# Preprocess with NLTK
tokens = word_tokenize(text)
filtered_tokens = [w for w in tokens if w.lower() not in stop_words]
# Analyze with spaCy
doc = nlp(' '.join(filtered_tokens))
return [(token.text, token.pos_, token.dep_) for token in doc]
Practical Use Cases
News Sentiment Analysis
def analyze_news_sentiment(news_texts):
sia = SentimentIntensityAnalyzer()
results = []
for text in news_texts:
# Preprocess
tokens = word_tokenize(text.lower())
filtered_tokens = [w for w in tokens if w.isalpha() and w not in stop_words]
processed_text = ' '.join(filtered_tokens)
# Sentiment analysis
sentiment = sia.polarity_scores(processed_text)
results.append({
'text': text[:100] + '...',
'positive': sentiment['pos'],
'negative': sentiment['neg'],
'neutral': sentiment['neu'],
'compound': sentiment['compound']
})
return results
Key Phrase Extraction
def extract_key_phrases(text, n=10):
    # Tokenize and POS tag
    tokens = word_tokenize(text.lower())
    tagged = pos_tag(tokens)
    # Extract noun groups: optional determiner, adjectives, one or more nouns
    grammar = "NP: {<DT>?<JJ>*<NN>+}"
    parser = RegexpParser(grammar)
    tree = parser.parse(tagged)
    # Gather phrases
    phrases = []
    for subtree in tree.subtrees():
        if subtree.label() == 'NP':
            phrase = ' '.join([word for word, pos in subtree.leaves()])
            phrases.append(phrase)
    # Frequency count
    phrase_freq = FreqDist(phrases)
    return phrase_freq.most_common(n)
Document Comparison
def compare_documents(doc1, doc2):
# Preprocess documents
def preprocess(text):
tokens = word_tokenize(text.lower())
return [lemmatizer.lemmatize(w) for w in tokens if w.isalpha() and w not in stop_words]
tokens1 = preprocess(doc1)
tokens2 = preprocess(doc2)
# Build frequency distributions
fdist1 = FreqDist(tokens1)
fdist2 = FreqDist(tokens2)
# Jaccard similarity
set1 = set(tokens1)
set2 = set(tokens2)
intersection = len(set1.intersection(set2))
union = len(set1.union(set2))
jaccard_similarity = intersection / union if union > 0 else 0
return {
'jaccard_similarity': jaccard_similarity,
'common_words': list(set1.intersection(set2)),
'doc1_unique': list(set1 - set2),
'doc2_unique': list(set2 - set1)
}
Performance Optimization
Caching Results
from functools import lru_cache
@lru_cache(maxsize=1000)
def cached_lemmatize(word, pos):
return lemmatizer.lemmatize(word, pos)
@lru_cache(maxsize=1000)
def cached_stem(word):
return stemmer.stem(word)
Batch Processing
def batch_process_texts(texts, batch_size=100):
results = []
for i in range(0, len(texts), batch_size):
batch = texts[i:i+batch_size]
batch_results = []
for text in batch:
# Process text
processed = preprocess_text(text)
batch_results.append(processed)
results.extend(batch_results)
return results
Creating Custom Components
Custom Tokenizer
from nltk.tokenize.api import TokenizerI
class CustomTokenizer(TokenizerI):
def __init__(self, preserve_case=False):
self.preserve_case = preserve_case
def tokenize(self, text):
# Custom tokenization logic
if not self.preserve_case:
text = text.lower()
# Simple whitespace tokenization with punctuation removal
tokens = re.findall(r'\b\w+\b', text)
return tokens
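A usage sketch of the class above, repeated here so the example is self-contained:

```python
import re
from nltk.tokenize.api import TokenizerI

class CustomTokenizer(TokenizerI):
    def __init__(self, preserve_case=False):
        self.preserve_case = preserve_case

    def tokenize(self, text):
        if not self.preserve_case:
            text = text.lower()
        # Whitespace tokenization with punctuation removal
        return re.findall(r'\b\w+\b', text)

tokenizer = CustomTokenizer()
print(tokenizer.tokenize("Hello, NLTK World!"))  # ['hello', 'nltk', 'world']
```

Because the class implements TokenizerI, it can be dropped in wherever NLTK expects a tokenizer object, for example as the word_tokenizer of a PlaintextCorpusReader.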
Custom Classifier
from nltk.classify.api import ClassifierI
class CustomClassifier(ClassifierI):
def __init__(self, feature_extractor):
self.feature_extractor = feature_extractor
self.model = None
def classify(self, featureset):
# Custom classification logic
return self.model.predict([featureset])[0]
    def prob_classify(self, featureset):
        # Returns class probabilities (the ClassifierI contract expects a ProbDistI here)
        probabilities = self.model.predict_proba([featureset])[0]
        return probabilities
    def labels(self):
        # ClassifierI also requires labels(); assumes an sklearn-style model attribute
        return list(self.model.classes_)
Processing Multiple Languages
Multilingual Support
# Settings for different languages
language_settings = {
'english': {
'stopwords': set(stopwords.words('english')),
'stemmer': SnowballStemmer('english'),
'tokenizer': 'punkt'
},
'spanish': {
'stopwords': set(stopwords.words('spanish')),
'stemmer': SnowballStemmer('spanish'),
'tokenizer': 'punkt'
},
'french': {
'stopwords': set(stopwords.words('french')),
'stemmer': SnowballStemmer('french'),
'tokenizer': 'punkt'
}
}
def process_multilingual_text(text, language='english'):
settings = language_settings[language]
# Tokenize
tokens = word_tokenize(text, language=language)
# Remove stop words
filtered_tokens = [w for w in tokens if w.lower() not in settings['stopwords']]
# Stem
stemmed_tokens = [settings['stemmer'].stem(w) for w in filtered_tokens]
return stemmed_tokens
Working with Large Datasets
Streaming Processing
def stream_process_large_corpus(file_path):
with open(file_path, 'r', encoding='utf-8') as file:
for line in file:
# Process one line at a time
processed_line = preprocess_text(line.strip())
yield processed_line
# Use generator to save memory
def analyze_large_corpus(file_path):
word_freq = FreqDist()
for processed_line in stream_process_large_corpus(file_path):
tokens = word_tokenize(processed_line)
for token in tokens:
word_freq[token] += 1
return word_freq.most_common(100)
Parallel Processing
from multiprocessing import Pool
import functools
def parallel_text_processing(texts, num_processes=4):
with Pool(processes=num_processes) as pool:
results = pool.map(preprocess_text, texts)
return results
Debugging and Testing
Component Testing
import unittest
class TestNLTKProcessing(unittest.TestCase):
def setUp(self):
        self.sample_text = "This is a test sentence for NLTK processing."

    def test_word_tokenize(self):
        tokens = word_tokenize(self.sample_text)
        self.assertIn('NLTK', tokens)