NLTK: Natural Language Processing in Python

Introduction to NLTK

NLTK (Natural Language Toolkit) is one of the most well-known and widely used Python libraries for natural language processing (NLP). Development began in 2001 at the University of Pennsylvania. The library provides comprehensive tools and resources for working with text, including tokenization, morphological analysis, syntactic parsing and frequency distributions, and it is used both for teaching and in applied text-analysis projects.

NLTK ships with more than 50 corpora and lexical resources, including the Penn Treebank, the Brown Corpus and WordNet, which makes it a valuable tool for researchers, students and developers working on NLP tasks.

Architecture and Components of NLTK

Core Library Modules

NLTK is built on a modular principle, where each module is responsible for a specific NLP area:

tokenize module – provides various algorithms for splitting text into tokens

stem module – contains stemming and lemmatization algorithms

tag module – handles part-of-speech tagging

parse module – includes parsers for syntactic analysis

chunk module – implements chunking algorithms

classify module – contains machine-learning classifiers

corpus module – provides access to corpora and lexical resources

Corpora and Lexical Resources

NLTK gives access to an extensive collection of corpora, including classic literary works, news articles, annotated texts and specialized dictionaries. These resources serve as the basis for training and testing NLP algorithms.

Features and Advantages of NLTK

Support for a large number of linguistic corpora – more than 50 corpora in various languages

Variety of algorithms for tokenization, stemming, lemmatization – from simple to advanced methods

Tools for morphological and syntactic analysis – including parsers of different types

Utilities for statistical text analysis – frequency distributions, collocations, n‑grams

Flexible architecture for customizing processing pipelines – ability to create custom handlers

Active documentation and educational resources – books, tutorials, code examples

Integration with popular libraries – NumPy, Matplotlib, scikit‑learn

Installation and Initial Setup

Installing the Library

pip install nltk

For additional capabilities it is also recommended to install:

pip install numpy matplotlib

Importing and Downloading Resources

import nltk

# Interactive resource downloader
nltk.download()

# Download specific resources
nltk.download('punkt')      # for tokenization
nltk.download('wordnet')    # for lemmatization
nltk.download('stopwords')  # stop words
nltk.download('averaged_perceptron_tagger')  # POS tags
nltk.download('maxent_ne_chunker')  # NER
nltk.download('words')      # English word list

Verifying Installation

import nltk
print(nltk.__version__)
nltk.data.find('tokenizers/punkt')

Working with Corpora and Text Data

Loading Built‑in Corpora

NLTK contains a rich collection of pre‑processed texts:

from nltk.corpus import gutenberg, brown, reuters, inaugural

# Load Gutenberg corpus
nltk.download('gutenberg')
sample_text = gutenberg.raw('austen-emma.txt')

# Work with Brown corpus
nltk.download('brown')
brown_words = brown.words(categories='news')

# Reuters corpus for classification
nltk.download('reuters')
reuters_categories = reuters.categories()

Working with Custom Texts

# Load text from a file
with open('text.txt', 'r', encoding='utf-8') as file:
    custom_text = file.read()

# Work with a text string
text = "Natural Language Processing with NLTK is powerful and flexible."

Text Tokenization

Basic Types of Tokenization

Tokenization is the process of splitting text into smaller units (tokens). NLTK provides several tokenizer types:

from nltk.tokenize import sent_tokenize, word_tokenize

text = "NLTK is great! It provides many tools. Let's explore them."

# Sentence tokenization
sentences = sent_tokenize(text)
print(sentences)  # ['NLTK is great!', 'It provides many tools.', "Let's explore them."]

# Word tokenization
words = word_tokenize(text)
print(words)  # ['NLTK', 'is', 'great', '!', 'It', 'provides', 'many', 'tools', '.', 'Let', "'s", 'explore', 'them', '.']

Specialized Tokenizers

from nltk.tokenize import WhitespaceTokenizer, RegexpTokenizer, TreebankWordTokenizer

# Whitespace tokenizer
whitespace_tokenizer = WhitespaceTokenizer()
tokens = whitespace_tokenizer.tokenize(text)

# Regex tokenizer
regexp_tokenizer = RegexpTokenizer(r'\w+')
words_only = regexp_tokenizer.tokenize(text)

# Penn Treebank tokenizer
treebank_tokenizer = TreebankWordTokenizer()
treebank_tokens = treebank_tokenizer.tokenize(text)

Tokenization for Different Languages

from nltk.tokenize import sent_tokenize

# Tokenization for various languages
german_text = "Hallo Welt! Wie geht es dir? Ich hoffe, alles ist gut."
german_sentences = sent_tokenize(german_text, language='german')

Normalization and Pre‑processing

Stemming

Stemming is the process of removing affixes from words to obtain their base form:

from nltk.stem import PorterStemmer, LancasterStemmer, SnowballStemmer

# Porter stemmer
porter = PorterStemmer()
print(porter.stem('running'))    # 'run'
print(porter.stem('happiness'))  # 'happi'

# Lancaster stemmer (more aggressive)
lancaster = LancasterStemmer()
print(lancaster.stem('running'))    # 'run'
print(lancaster.stem('happiness'))  # 'happy'

# Snowball stemmer (multilingual)
snowball = SnowballStemmer('english')
print(snowball.stem('running'))   # 'run'

Lemmatization

Lemmatization is more precise than stemming because it takes context and part of speech into account:

from nltk.stem import WordNetLemmatizer

nltk.download('omw-1.4')
lemmatizer = WordNetLemmatizer()

# Lemmatization with POS specification
print(lemmatizer.lemmatize('running', pos='v'))    # 'run' (verb)
print(lemmatizer.lemmatize('running', pos='n'))    # 'running' (noun)
print(lemmatizer.lemmatize('better', pos='a'))     # 'good' (adjective)
print(lemmatizer.lemmatize('mice', pos='n'))       # 'mouse' (noun)

Stop‑word Removal

from nltk.corpus import stopwords

nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

# Filter stop words
words = word_tokenize("This is a sample sentence with stop words.")
filtered_words = [w for w in words if w.lower() not in stop_words]
print(filtered_words)  # ['sample', 'sentence', 'stop', 'words', '.']

# Add custom stop words
custom_stop_words = stop_words.union({'sample', 'sentence'})
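At its core, stop-word filtering is just a set-membership test. A stdlib-only sketch of the same idea, using a hypothetical mini stop list rather than NLTK's real one:

```python
# Hypothetical mini stop list (NLTK's English list has ~180 entries)
stop = {"this", "is", "a", "with"}
tokens = "this is a sample sentence with stop words".split()

kept = [w for w in tokens if w not in stop]
print(kept)  # ['sample', 'sentence', 'stop', 'words']
```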

Morphological Analysis

Part‑of‑Speech Tagging (POS‑tagging)

POS tags help determine the grammatical role of each word:

from nltk import pos_tag

text = "NLTK is a powerful natural language processing library."
tokens = word_tokenize(text)
tagged = pos_tag(tokens)
print(tagged)
# [('NLTK', 'NNP'), ('is', 'VBZ'), ('a', 'DT'), ('powerful', 'JJ'), 
#  ('natural', 'JJ'), ('language', 'NN'), ('processing', 'NN'), ('library', 'NN')]

# Get POS tag description
nltk.help.upenn_tagset('NNP')  # Proper noun

Advanced POS Analysis

from nltk.tag import UnigramTagger, BigramTagger
from nltk.corpus import brown

# Prepare training data
brown_tagged_sents = brown.tagged_sents(categories='news')
train_sents = brown_tagged_sents[:3000]
test_sents = brown_tagged_sents[3000:3500]

# Train a unigram tagger, then a bigram tagger that falls back to it
unigram_tagger = UnigramTagger(train_sents)
bigram_tagger = BigramTagger(train_sents, backoff=unigram_tagger)

# Evaluate (in NLTK < 3.6 use .evaluate() instead of .accuracy())
accuracy = bigram_tagger.accuracy(test_sents)
print(f"Accuracy: {accuracy}")

Syntactic Analysis

Named Entity Recognition (NER)

from nltk import ne_chunk

text = "Barack Obama was born in Hawaii. He worked at Google before joining Apple."
tokens = word_tokenize(text)
tagged = pos_tag(tokens)
entities = ne_chunk(tagged)

print(entities)
# Prints a tree with identified named entities

Context‑Free Grammars

from nltk import CFG, ChartParser

# Define grammar
grammar = CFG.fromstring("""
    S -> NP VP
    NP -> Det N | Det Adj N | 'I'
    VP -> V NP | V
    Det -> 'the' | 'a'
    N -> 'cat' | 'dog' | 'book'
    Adj -> 'big' | 'small'
    V -> 'saw' | 'read' | 'walked'
""")

# Create parser
parser = ChartParser(grammar)

# Parse a sentence
sentence = ['I', 'saw', 'the', 'big', 'cat']
for tree in parser.parse(sentence):
    print(tree)
    tree.draw()  # Visualize parse tree

Chunking

from nltk import RegexpParser

# Grammar for noun phrase chunking: optional determiner, any adjectives, a noun
grammar = "NP: {<DT>?<JJ>*<NN>}"
chunk_parser = RegexpParser(grammar)

# Apply to tagged text
text = "The quick brown fox jumps over the lazy dog"
tokens = word_tokenize(text)
tagged = pos_tag(tokens)
chunks = chunk_parser.parse(tagged)
print(chunks)

Statistical Text Analysis

Frequency Distributions

from nltk import FreqDist
import matplotlib.pyplot as plt

# Create frequency distribution
text = gutenberg.words('austen-emma.txt')
fdist = FreqDist(text)

# Analyze frequencies
print(fdist.most_common(10))    # 10 most frequent words
print(fdist['Emma'])            # Frequency of a specific word
print(fdist.hapaxes()[:10])     # Words occurring once

# Visualization
fdist.plot(30, cumulative=False)
plt.show()
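FreqDist behaves much like the standard library's collections.Counter, which it builds on. A stdlib-only sketch of the same frequency-analysis operations, on a tiny illustrative text rather than the Gutenberg corpus:

```python
from collections import Counter

words = "the cat and the dog saw the cat".split()
fdist = Counter(words)

print(fdist.most_common(2))  # [('the', 3), ('cat', 2)]
print(fdist['cat'])          # 2

# Hapaxes: words that occur exactly once
hapaxes = [w for w, c in fdist.items() if c == 1]
print(hapaxes)               # ['and', 'dog', 'saw']
```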

Conditional Frequency Distributions

from nltk import ConditionalFreqDist

# Category‑wise analysis
cfd = ConditionalFreqDist(
    (genre, word.lower())
    for genre in ['news', 'romance']
    for word in brown.words(categories=genre)
)

# Compare word usage across categories
cfd.plot(conditions=['news', 'romance'], samples=['man', 'woman'])

Working with N‑grams

Creating and Analyzing N‑grams

from nltk import ngrams, bigrams, trigrams

text = "The quick brown fox jumps over the lazy dog"
tokens = word_tokenize(text.lower())

# Bigrams
bigrams_list = list(bigrams(tokens))
print(bigrams_list[:5])

# Trigrams
trigrams_list = list(trigrams(tokens))
print(trigrams_list[:5])

# Arbitrary‑length n‑grams
four_grams = list(ngrams(tokens, 4))
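Under the hood, n-gram extraction is just a sliding window over the token list. A stdlib-only sketch of the idea using zip over shifted copies of the list:

```python
def my_ngrams(tokens, n):
    # Slide a window of size n by zipping the list with shifted copies of itself
    return list(zip(*[tokens[i:] for i in range(n)]))

tokens = "the quick brown fox".split()
print(my_ngrams(tokens, 2))  # [('the', 'quick'), ('quick', 'brown'), ('brown', 'fox')]
print(my_ngrams(tokens, 3))  # [('the', 'quick', 'brown'), ('quick', 'brown', 'fox')]
```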

Finding Collocations

from nltk.collocations import BigramCollocationFinder, TrigramCollocationFinder
from nltk.metrics import BigramAssocMeasures, TrigramAssocMeasures

# Find significant bigrams (a frequency filter is only meaningful on a corpus
# much larger than the single example sentence above)
bigram_finder = BigramCollocationFinder.from_words(tokens)
bigram_finder.apply_freq_filter(3)  # keep only bigrams occurring at least 3 times
bigram_measures = BigramAssocMeasures()

# Get bigrams with high PMI
best_bigrams = bigram_finder.nbest(bigram_measures.pmi, 10)
print(best_bigrams)

Text Classification

Naive Bayes Classifier

from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews
import random

# Prepare data
nltk.download('movie_reviews')

def document_features(document):
    words = set(document)
    features = {}
    for word in word_features:
        features[f'contains({word})'] = (word in words)
    return features

# Create feature list
documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]
random.shuffle(documents)

all_words = FreqDist(w.lower() for w in movie_reviews.words())
word_features = list(all_words)[:2000]

# Build train and test sets
featuresets = [(document_features(d), c) for (d, c) in documents]
train_set, test_set = featuresets[100:], featuresets[:100]

# Train classifier
classifier = NaiveBayesClassifier.train(train_set)

# Test
accuracy = nltk.classify.accuracy(classifier, test_set)
print(f"Accuracy: {accuracy}")

# Show most informative features
classifier.show_most_informative_features(5)

Sentiment Analysis

Basic Sentiment Analysis

from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download('vader_lexicon')
sia = SentimentIntensityAnalyzer()

# Sentiment analysis
text = "NLTK is really amazing for text processing!"
sentiment_scores = sia.polarity_scores(text)
print(sentiment_scores)
# {'neg': 0.0, 'neu': 0.625, 'pos': 0.375, 'compound': 0.6588}
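The compound value is a normalized sum of the per-token valence scores. In the VADER implementation the normalization is, to my understanding, s / sqrt(s² + α) with α = 15, which squashes an unbounded sum into (-1, 1). A stdlib sketch of that normalization:

```python
import math

def vader_normalize(score, alpha=15):
    # Squash an unbounded valence sum into the open interval (-1, 1)
    return score / math.sqrt(score * score + alpha)

print(vader_normalize(0))            # 0.0 (neutral)
print(round(vader_normalize(2.4), 4))   # positive sum -> positive compound
print(round(vader_normalize(-2.4), 4))  # symmetric for negative sums
```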

Working with WordNet

Semantic Analysis

from nltk.corpus import wordnet

# Find synonyms
synsets = wordnet.synsets('good')
for synset in synsets:
    print(f"{synset.name()}: {synset.definition()}")
    print(f"Examples: {synset.examples()}")
    print(f"Lemmas: {[lemma.name() for lemma in synset.lemmas()]}")
    print()

# Find antonyms
good_synset = wordnet.synset('good.a.01')
antonyms = []
for lemma in good_synset.lemmas():
    if lemma.antonyms():
        antonyms.extend([ant.name() for ant in lemma.antonyms()])
print(f"Antonyms of 'good': {antonyms}")

# Semantic similarity
dog = wordnet.synset('dog.n.01')
cat = wordnet.synset('cat.n.01')
similarity = dog.wup_similarity(cat)
print(f"Similarity between dog and cat: {similarity}")
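Wu-Palmer similarity is computed from the depths of the two synsets and of their lowest common subsumer (LCS) in the hypernym hierarchy: sim = 2 · depth(LCS) / (depth(a) + depth(b)). A sketch of just the arithmetic, with hypothetical depths (the real values come from WordNet's taxonomy):

```python
def wu_palmer(depth_a, depth_b, depth_lcs):
    # 2 * depth(LCS) / (depth(a) + depth(b))
    return 2 * depth_lcs / (depth_a + depth_b)

# Hypothetical depths for two sibling concepts with a deep common ancestor
print(wu_palmer(13, 13, 11))  # 0.8461538461538461
```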

Advanced Capabilities

Working with Regular Expressions

import re
from nltk.tokenize import regexp_tokenize

# Extract email addresses
text = "Contact us at info@example.com or support@test.org"
emails = regexp_tokenize(text, pattern=r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b')
print(emails)

# Extract URLs
urls = regexp_tokenize(text, pattern=r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+')

Creating Custom Corpora

from nltk.corpus import PlaintextCorpusReader

# Build corpus from custom files
corpus_root = '/path/to/your/corpus'
wordlists = PlaintextCorpusReader(corpus_root, '.*')
print(wordlists.fileids())
print(wordlists.words('file1.txt'))

Table of Core NLTK Methods and Functions

Category Function/Class Description Usage Example
Resource Loading nltk.download() Downloads corpora and models nltk.download('punkt')
  nltk.data.find() Checks for resource existence nltk.data.find('tokenizers/punkt')
Tokenization word_tokenize() Splits into words word_tokenize("Hello world!")
  sent_tokenize() Splits into sentences sent_tokenize("Hello! How are you?")
  RegexpTokenizer() Regex‑based tokenization RegexpTokenizer(r'\w+')
  WhitespaceTokenizer() Whitespace tokenization WhitespaceTokenizer().tokenize(text)
Stemming and Lemmatization PorterStemmer() Porter stemming algorithm PorterStemmer().stem('running')
  LancasterStemmer() Aggressive stemming LancasterStemmer().stem('running')
  SnowballStemmer() Multilingual stemming SnowballStemmer('english').stem('running')
  WordNetLemmatizer() Lemmatization using WordNet WordNetLemmatizer().lemmatize('running', pos='v')
Stop Words stopwords.words() Gets list of stop words stopwords.words('english')
POS Tags pos_tag() Part‑of‑speech tagging pos_tag(['Hello', 'world'])
  help.upenn_tagset() POS tag description help.upenn_tagset('NN')
Frequency Analysis FreqDist() Frequency distribution FreqDist(tokens)
  .most_common() Most frequent items fdist.most_common(10)
  .hapaxes() Items with frequency 1 fdist.hapaxes()
  .plot() Distribution plot fdist.plot(30)
N‑grams ngrams() Create n‑grams list(ngrams(tokens, 2))
  bigrams() Create bigrams list(bigrams(tokens))
  trigrams() Create trigrams list(trigrams(tokens))
  BigramCollocationFinder Find collocations BigramCollocationFinder.from_words(tokens)
Syntactic Analysis ne_chunk() Named entity recognition ne_chunk(pos_tag(tokens))
  RegexpParser() Regex‑based parsing RegexpParser("NP: {<DT>?<JJ>*<NN>}")
  ChartParser() Parsing with context‑free grammars ChartParser(grammar)
Classification NaiveBayesClassifier Naive Bayes classifier NaiveBayesClassifier.train(train_set)
  classify.accuracy() Classifier accuracy classify.accuracy(classifier, test_set)
Corpora gutenberg Gutenberg corpus gutenberg.words('austen-emma.txt')
  brown Brown corpus brown.words(categories='news')
  movie_reviews Movie reviews corpus movie_reviews.words('pos')
  reuters Reuters news corpus reuters.categories()
WordNet wordnet.synsets() Find synonyms wordnet.synsets('good')
  .definition() Word definition synset.definition()
  .examples() Usage examples synset.examples()
  .wup_similarity() Semantic similarity synset1.wup_similarity(synset2)
Sentiment Analysis SentimentIntensityAnalyzer Sentiment scoring SentimentIntensityAnalyzer().polarity_scores(text)
Visualization Text() Text object for analysis Text(tokens)
  .concordance() Contextual search text.concordance('love')
  .dispersion_plot() Word dispersion plot text.dispersion_plot(['love', 'hate'])
  .similar() Find similar words text.similar('love')

Integration with Other Libraries

Integration with pandas

import pandas as pd
from nltk.sentiment import SentimentIntensityAnalyzer

# Sentiment analysis in a DataFrame
df = pd.DataFrame({
    'text': ['I love this product!', 'This is terrible.', 'Average quality.']
})

sia = SentimentIntensityAnalyzer()
df['sentiment'] = df['text'].apply(lambda x: sia.polarity_scores(x)['compound'])

Integration with scikit‑learn

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB

# Build a pipeline for classification
def preprocess_text(text):
    tokens = word_tokenize(text.lower())
    return ' '.join([lemmatizer.lemmatize(token) for token in tokens if token.isalpha()])

pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('classifier', MultinomialNB())
])

Integration with spaCy

import spacy

# Use NLTK for preprocessing, spaCy for analysis
nlp = spacy.load('en_core_web_sm')

def hybrid_processing(text):
    # Preprocess with NLTK
    tokens = word_tokenize(text)
    filtered_tokens = [w for w in tokens if w.lower() not in stop_words]
    
    # Analyze with spaCy
    doc = nlp(' '.join(filtered_tokens))
    return [(token.text, token.pos_, token.dep_) for token in doc]

Practical Use Cases

News Sentiment Analysis

def analyze_news_sentiment(news_texts):
    sia = SentimentIntensityAnalyzer()
    results = []
    
    for text in news_texts:
        # Preprocess
        tokens = word_tokenize(text.lower())
        filtered_tokens = [w for w in tokens if w.isalpha() and w not in stop_words]
        processed_text = ' '.join(filtered_tokens)
        
        # Sentiment analysis
        sentiment = sia.polarity_scores(processed_text)
        results.append({
            'text': text[:100] + '...',
            'positive': sentiment['pos'],
            'negative': sentiment['neg'],
            'neutral': sentiment['neu'],
            'compound': sentiment['compound']
        })
    
    return results

Key Phrase Extraction

def extract_key_phrases(text, n=10):
    # Tokenize and POS tag
    tokens = word_tokenize(text.lower())
    tagged = pos_tag(tokens)
    
    # Extract noun groups: any adjectives followed by one or more nouns
    grammar = "NP: {<JJ>*<NN.*>+}"
    parser = RegexpParser(grammar)
    tree = parser.parse(tagged)
    
    # Gather phrases
    phrases = []
    for subtree in tree.subtrees():
        if subtree.label() == 'NP':
            phrase = ' '.join([word for word, pos in subtree.leaves()])
            phrases.append(phrase)
    
    # Frequency count
    phrase_freq = FreqDist(phrases)
    return phrase_freq.most_common(n)

Document Comparison

def compare_documents(doc1, doc2):
    # Preprocess documents
    def preprocess(text):
        tokens = word_tokenize(text.lower())
        return [lemmatizer.lemmatize(w) for w in tokens if w.isalpha() and w not in stop_words]
    
    tokens1 = preprocess(doc1)
    tokens2 = preprocess(doc2)
    
    # Build frequency distributions
    fdist1 = FreqDist(tokens1)
    fdist2 = FreqDist(tokens2)
    
    # Jaccard similarity
    set1 = set(tokens1)
    set2 = set(tokens2)
    
    intersection = len(set1.intersection(set2))
    union = len(set1.union(set2))
    
    jaccard_similarity = intersection / union if union > 0 else 0
    
    return {
        'jaccard_similarity': jaccard_similarity,
        'common_words': list(set1.intersection(set2)),
        'doc1_unique': list(set1 - set2),
        'doc2_unique': list(set2 - set1)
    }

Performance Optimization

Caching Results

from functools import lru_cache
from nltk.stem import PorterStemmer, WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
stemmer = PorterStemmer()

@lru_cache(maxsize=1000)
def cached_lemmatize(word, pos):
    return lemmatizer.lemmatize(word, pos)

@lru_cache(maxsize=1000)
def cached_stem(word):
    return stemmer.stem(word)

Batch Processing

def batch_process_texts(texts, batch_size=100):
    results = []
    
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i+batch_size]
        batch_results = []
        
        for text in batch:
            # Process text
            processed = preprocess_text(text)
            batch_results.append(processed)
        
        results.extend(batch_results)
    
    return results

Creating Custom Components

Custom Tokenizer

from nltk.tokenize.api import TokenizerI

class CustomTokenizer(TokenizerI):
    def __init__(self, preserve_case=False):
        self.preserve_case = preserve_case
    
    def tokenize(self, text):
        # Custom tokenization logic
        if not self.preserve_case:
            text = text.lower()
        
        # Simple whitespace tokenization with punctuation removal
        tokens = re.findall(r'\b\w+\b', text)
        return tokens

Custom Classifier

from nltk.classify.api import ClassifierI

class CustomClassifier(ClassifierI):
    def __init__(self, feature_extractor):
        self.feature_extractor = feature_extractor
        self.model = None  # e.g. a fitted scikit-learn estimator

    def labels(self):
        # Required by the ClassifierI interface
        return list(self.model.classes_)

    def classify(self, featureset):
        # Custom classification logic
        return self.model.predict([featureset])[0]

    def prob_classify(self, featureset):
        # Simplified: returns raw probabilities (NLTK's API expects a ProbDistI)
        probabilities = self.model.predict_proba([featureset])[0]
        return probabilities

Processing Multiple Languages

Multilingual Support

# Settings for different languages
language_settings = {
    'english': {
        'stopwords': set(stopwords.words('english')),
        'stemmer': SnowballStemmer('english'),
        'tokenizer': 'punkt'
    },
    'spanish': {
        'stopwords': set(stopwords.words('spanish')),
        'stemmer': SnowballStemmer('spanish'),
        'tokenizer': 'punkt'
    },
    'french': {
        'stopwords': set(stopwords.words('french')),
        'stemmer': SnowballStemmer('french'),
        'tokenizer': 'punkt'
    }
}

def process_multilingual_text(text, language='english'):
    settings = language_settings[language]
    
    # Tokenize
    tokens = word_tokenize(text, language=language)
    
    # Remove stop words
    filtered_tokens = [w for w in tokens if w.lower() not in settings['stopwords']]
    
    # Stem
    stemmed_tokens = [settings['stemmer'].stem(w) for w in filtered_tokens]
    
    return stemmed_tokens

Working with Large Datasets

Streaming Processing

def stream_process_large_corpus(file_path):
    with open(file_path, 'r', encoding='utf-8') as file:
        for line in file:
            # Process one line at a time
            processed_line = preprocess_text(line.strip())
            yield processed_line

# Use generator to save memory
def analyze_large_corpus(file_path):
    word_freq = FreqDist()
    
    for processed_line in stream_process_large_corpus(file_path):
        tokens = word_tokenize(processed_line)
        for token in tokens:
            word_freq[token] += 1
    
    return word_freq.most_common(100)

Parallel Processing

from multiprocessing import Pool
import functools

def parallel_text_processing(texts, num_processes=4):
    with Pool(processes=num_processes) as pool:
        results = pool.map(preprocess_text, texts)
    return results

Debugging and Testing

Component Testing

import unittest

class TestNLTKProcessing(unittest.TestCase):
    def setUp(self):
        self.sample_text = "This is a test sentence for NLTK processing."
    
    def test_tokenization(self):
        tokens = word_tokenize(self.sample_text)
        self.assertIsInstance(tokens, list)
        self.assertGreater(len(tokens), 0)
    
    def test_pos_tagging(self):
        tokens = word_tokenize(self.sample_text)
        tagged = pos_tag(tokens)
        self.assertEqual(len(tagged), len(tokens))
    
    def test_lemmatization(self):
        lemmatizer = WordNetLemmatizer()
        result = lemmatizer.lemmatize('running', pos='v')
        self.assertEqual(result, 'run')

if __name__ == '__main__':
    unittest.main()

Tips for Using NLTK

Best Practices

Always download required resources before using the corresponding functions

Use appropriate POS tags for lemmatization to obtain more accurate results

Combine different preprocessing methods depending on the task

Test on diverse text types to verify the robustness of your solution

Utilize caching to improve performance when processing large volumes of data

Common Mistakes

Forgetting to download resources – always check that necessary models are loaded

Choosing the wrong tokenizer – different tasks require different tokenization approaches

Ignoring context during lemmatization – specifying the part of speech greatly improves results

Inefficient handling of big data – use generators and streaming processing

Alternatives and Comparison

Comparison with Other Libraries

spaCy: Faster and more modern alternative, optimized for production

TextBlob: Simplified interface for basic NLP tasks

Gensim: Specializes in topic modeling and vector‑space representations

Transformers: State‑of‑the‑art pretrained models for complex NLP tasks

When to Use NLTK

Educational purposes – an excellent library for learning NLP fundamentals

Research projects – rich set of algorithms and corpora

Prototyping – rapid creation of basic NLP solutions

Specific linguistic tasks – unique tools for deep language analysis

Frequently Asked Questions

What is NLTK? NLTK (Natural Language Toolkit) is a comprehensive Python library for natural language processing, providing tools for tokenization, morphological analysis, syntactic parsing and other NLP tasks.

Does NLTK support Russian? NLTK has limited support for Russian. Core functionality works best with English, but some features (e.g., tokenization) can be applied to Russian text.

How does NLTK differ from spaCy? NLTK offers an educational and research‑focused toolkit with a wide range of algorithms, whereas spaCy is optimized for production use and runs faster.

Is NLTK suitable for machine learning? NLTK includes basic classification algorithms, but for serious machine‑learning tasks it is recommended to use it together with scikit‑learn or other ML libraries.

Can NLTK handle big data? NLTK is not optimized for large‑scale data processing. For such tasks, spaCy or distributed computing frameworks are preferable.

Does NLTK have GPU support? NLTK does not have built‑in GPU support. For GPU acceleration use specialized libraries such as PyTorch or TensorFlow.

How to add support for a new language in NLTK? You can create custom tokenizers, add stop‑word lists and train your own models for the target language.

Is NLTK suitable for commercial use? Yes, NLTK is released under the Apache License 2.0, which permits use in commercial projects.

Conclusion

NLTK remains one of the most important and influential libraries in the Python natural language processing ecosystem. Thanks to its educational focus, rich set of resources and comprehensive coverage of NLP aspects, it serves as an excellent starting point for learning and understanding natural language processing principles.

While production‑grade projects may require more modern and optimized solutions, NLTK provides a solid foundation for grasping fundamental NLP concepts and remains an indispensable tool for researchers, students and developers working with textual data.

Combining NLTK with other modern libraries enables the creation of powerful and efficient solutions for a wide range of natural language processing tasks, from simple text analysis to complex language understanding systems.

* --- **💬 Response:**

Introduction to NLTK

NLTK (Natural Language Toolkit) — is one of the most well‑known and widely used Python libraries for natural language processing (NLP). Developed in 2001 at the University of Pennsylvania, it provides comprehensive tools and resources for working with text, including tokenization, morphological analysis, syntactic parsing, building frequency distributions and much more. The library is actively used both for educational purposes and in applied text‑analysis projects.

NLTK contains more than 50 corpora and lexical resources, including the Penn Treebank, Brown Corpus, WordNet and many others. This makes it an indispensable tool for researchers, students and developers working on natural language processing tasks.

Architecture and Components of NLTK

Core Library Modules

NLTK is built on a modular principle, where each module is responsible for a specific NLP area:

tokenize module – provides various algorithms for splitting text into tokens stem module – contains stemming and lemmatization algorithms tag module – handles part‑of‑speech tagging parse module – includes parsers for syntactic analysis chunk module – implements chunking algorithms classify module – contains machine‑learning classifiers corpus module – provides access to corpora and lexical resources

Corpora and Lexical Resources

NLTK gives access to an extensive collection of corpora, including classic literary works, news articles, annotated texts and specialized dictionaries. These resources serve as the basis for training and testing NLP algorithms.

Features and Advantages of NLTK

Support for a large number of linguistic corpora – more than 50 corpora in various languages

Variety of algorithms for tokenization, stemming, lemmatization – from simple to advanced methods

Tools for morphological and syntactic analysis – including parsers of different types

Utilities for statistical text analysis – frequency distributions, collocations, n‑grams

Flexible architecture for customizing processing pipelines – ability to create custom handlers

Active documentation and educational resources – books, tutorials, code examples

Integration with popular libraries – NumPy, Matplotlib, scikit‑learn

Installation and Initial Setup

Installing the Library

pip install nltk

For additional capabilities it is also recommended to install:

pip install numpy matplotlib

Importing and Downloading Resources

import nltk

# Interactive resource downloader
nltk.download()

# Download specific resources
nltk.download('punkt')      # for tokenization
nltk.download('wordnet')    # for lemmatization
nltk.download('stopwords')  # stop words
nltk.download('averaged_perceptron_tagger')  # POS tags
nltk.download('maxent_ne_chunker')  # NER
nltk.download('words')      # English word list

Verifying Installation

import nltk
print(nltk.__version__)
nltk.data.find('tokenizers/punkt')

Working with Corpora and Text Data

Loading Built‑in Corpora

NLTK contains a rich collection of pre‑processed texts:

from nltk.corpus import gutenberg, brown, reuters, inaugural

# Load Gutenberg corpus
nltk.download('gutenberg')
sample_text = gutenberg.raw('austen-emma.txt')

# Work with Brown corpus
nltk.download('brown')
brown_words = brown.words(categories='news')

# Reuters corpus for classification
nltk.download('reuters')
reuters_categories = reuters.categories()

Working with Custom Texts

# Load text from a file
with open('text.txt', 'r', encoding='utf-8') as file:
    custom_text = file.read()

# Work with a text string
text = "Natural Language Processing with NLTK is powerful and flexible."

Text Tokenization

Basic Types of Tokenization

Tokenization is the process of splitting text into smaller units (tokens). NLTK provides several tokenizer types:

from nltk.tokenize import sent_tokenize, word_tokenize, line_tokenize

text = "NLTK is great! It provides many tools. Let's explore them."

# Sentence tokenization
sentences = sent_tokenize(text)
print(sentences)  # ['NLTK is great!', 'It provides many tools.', "Let's explore them."]

# Word tokenization
words = word_tokenize(text)
print(words)  # ['NLTK', 'is', 'great', '!', 'It', 'provides', 'many', 'tools', '.', 'Let', "'s", 'explore', 'them', '.']

Specialized Tokenizers

from nltk.tokenize import WhitespaceTokenizer, RegexpTokenizer, TreebankWordTokenizer

# Whitespace tokenizer
whitespace_tokenizer = WhitespaceTokenizer()
tokens = whitespace_tokenizer.tokenize(text)

# Regex tokenizer
regexp_tokenizer = RegexpTokenizer(r'\w+')
words_only = regexp_tokenizer.tokenize(text)

# Penn Treebank tokenizer
treebank_tokenizer = TreebankWordTokenizer()
treebank_tokens = treebank_tokenizer.tokenize(text)

Tokenization for Different Languages

from nltk.tokenize import sent_tokenize

# Tokenization for various languages
german_text = "Hallo Welt! Wie geht es dir? Ich hoffe, alles ist gut."
german_sentences = sent_tokenize(german_text, language='german')

Normalization and Pre‑processing

Stemming

Stemming is the process of stripping affixes from words to reduce them to a stem, which is not necessarily a dictionary word:

from nltk.stem import PorterStemmer, LancasterStemmer, SnowballStemmer

# Porter stemmer
porter = PorterStemmer()
print(porter.stem('running'))    # 'run'
print(porter.stem('happiness'))  # 'happi'

# Lancaster stemmer (more aggressive)
lancaster = LancasterStemmer()
print(lancaster.stem('running'))    # 'run'
print(lancaster.stem('happiness'))  # 'happy'

# Snowball stemmer (multilingual)
snowball = SnowballStemmer('english')
print(snowball.stem('running'))   # 'run'

Lemmatization

Lemmatization is more precise than stemming because it takes context and part of speech into account:

from nltk.stem import WordNetLemmatizer

nltk.download('wordnet')
nltk.download('omw-1.4')
lemmatizer = WordNetLemmatizer()

# Lemmatization with POS specification
print(lemmatizer.lemmatize('running', pos='v'))    # 'run' (verb)
print(lemmatizer.lemmatize('running', pos='n'))    # 'running' (noun)
print(lemmatizer.lemmatize('better', pos='a'))     # 'good' (adjective)
print(lemmatizer.lemmatize('mice', pos='n'))       # 'mouse' (noun)

Stop‑word Removal

from nltk.corpus import stopwords

nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

# Filter stop words
words = word_tokenize("This is a sample sentence with stop words.")
filtered_words = [w for w in words if w.lower() not in stop_words]
print(filtered_words)  # ['sample', 'sentence', 'stop', 'words', '.']

# Add custom stop words
custom_stop_words = stop_words.union({'sample', 'sentence'})

Morphological Analysis

Part‑of‑Speech Tagging (POS‑tagging)

POS tags help determine the grammatical role of each word:

from nltk import pos_tag

nltk.download('averaged_perceptron_tagger')

text = "NLTK is a powerful natural language processing library."
tokens = word_tokenize(text)
tagged = pos_tag(tokens)
print(tagged)
# [('NLTK', 'NNP'), ('is', 'VBZ'), ('a', 'DT'), ('powerful', 'JJ'), 
#  ('natural', 'JJ'), ('language', 'NN'), ('processing', 'NN'), ('library', 'NN')]

# Get POS tag description (requires nltk.download('tagsets'))
nltk.help.upenn_tagset('NNP')  # NNP: proper noun, singular

Advanced POS Analysis

from nltk.tag import UnigramTagger, BigramTagger

# Create a custom POS tagger
from nltk.corpus import brown

# Prepare training data
brown_tagged_sents = brown.tagged_sents(categories='news')
train_sents = brown_tagged_sents[:3000]
test_sents = brown_tagged_sents[3000:3500]

# Train tagger
unigram_tagger = UnigramTagger(train_sents)
accuracy = unigram_tagger.accuracy(test_sents)  # use .evaluate() on NLTK < 3.6
print(f"Accuracy: {accuracy}")

Syntactic Analysis

Named Entity Recognition (NER)

from nltk import ne_chunk

nltk.download('maxent_ne_chunker')
nltk.download('words')

text = "Barack Obama was born in Hawaii. He worked at Google before joining Apple."
tokens = word_tokenize(text)
tagged = pos_tag(tokens)
entities = ne_chunk(tagged)

print(entities)
# Prints a tree with identified named entities
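
The result of ne_chunk is an nltk.Tree, so extracting the labeled entities takes a short walk over its subtrees. The helper below is a sketch; the demo tree is hand-built to mimic typical ne_chunk output, so the snippet runs without the chunker models.

```python
from nltk import Tree

def extract_entities(tree):
    """Collect (entity_text, label) pairs from an ne_chunk result."""
    entities = []
    for subtree in tree.subtrees():
        if subtree.label() != 'S':  # skip the sentence root
            text = ' '.join(word for word, tag in subtree.leaves())
            entities.append((text, subtree.label()))
    return entities

# Hand-built tree mimicking ne_chunk output:
demo = Tree('S', [Tree('PERSON', [('Barack', 'NNP'), ('Obama', 'NNP')]),
                  ('was', 'VBD'),
                  Tree('GPE', [('Hawaii', 'NNP')])])
print(extract_entities(demo))  # [('Barack Obama', 'PERSON'), ('Hawaii', 'GPE')]
```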

Context‑Free Grammars

from nltk import CFG, ChartParser

# Define grammar
grammar = CFG.fromstring("""
    S -> NP VP
    NP -> Det N | Det Adj N | 'I'
    VP -> V NP | V
    Det -> 'the' | 'a'
    N -> 'cat' | 'dog' | 'book'
    Adj -> 'big' | 'small'
    V -> 'saw' | 'read' | 'walked'
""")

# Create parser
parser = ChartParser(grammar)

# Parse a sentence
sentence = ['I', 'saw', 'the', 'big', 'cat']
for tree in parser.parse(sentence):
    print(tree)
    tree.draw()  # Visualize parse tree

Chunking

from nltk import RegexpParser

# Grammar for noun phrase chunking
grammar = "NP: {<DT>?<JJ>*<NN>}"
chunk_parser = RegexpParser(grammar)

# Apply to tagged text
text = "The quick brown fox jumps over the lazy dog"
tokens = word_tokenize(text)
tagged = pos_tag(tokens)
chunks = chunk_parser.parse(tagged)
print(chunks)

Statistical Text Analysis

Frequency Distributions

from nltk import FreqDist
import matplotlib.pyplot as plt

# Create frequency distribution
text = gutenberg.words('austen-emma.txt')
fdist = FreqDist(text)

# Analyze frequencies
print(fdist.most_common(10))    # 10 most frequent words
print(fdist['Emma'])            # Frequency of a specific word
print(fdist.hapaxes()[:10])     # Words occurring once

# Visualization
fdist.plot(30, cumulative=False)
plt.show()

Conditional Frequency Distributions

from nltk import ConditionalFreqDist

# Category‑wise analysis
cfd = ConditionalFreqDist(
    (genre, word.lower())
    for genre in ['news', 'romance']
    for word in brown.words(categories=genre)
)

# Compare word usage across categories
cfd.plot(conditions=['news', 'romance'], samples=['man', 'woman'])

Working with N‑grams

Creating and Analyzing N‑grams

from nltk import ngrams, bigrams, trigrams
from nltk.util import pad_sequence

text = "The quick brown fox jumps over the lazy dog"
tokens = word_tokenize(text.lower())

# Bigrams
bigrams_list = list(bigrams(tokens))
print(bigrams_list[:5])

# Trigrams
trigrams_list = list(trigrams(tokens))
print(trigrams_list[:5])

# Arbitrary‑length n‑grams
four_grams = list(ngrams(tokens, 4))
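
The pad_sequence utility imported above adds boundary symbols before the n-grams are built, which matters for language modeling: it lets n-grams capture which tokens start and end a sentence. A short sketch:

```python
from nltk.util import pad_sequence
from nltk import bigrams

# Pad with n-1 boundary symbols on each side before taking bigrams
padded = list(pad_sequence(['the', 'cat'], n=2,
                           pad_left=True, pad_right=True,
                           left_pad_symbol='<s>', right_pad_symbol='</s>'))
print(padded)                 # ['<s>', 'the', 'cat', '</s>']
print(list(bigrams(padded)))  # [('<s>', 'the'), ('the', 'cat'), ('cat', '</s>')]
```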

Finding Collocations

from nltk.collocations import BigramCollocationFinder, TrigramCollocationFinder
from nltk.metrics import BigramAssocMeasures, TrigramAssocMeasures

# Find significant bigrams (run this on a large token list; in a single
# sentence no bigram repeats, so the frequency filter discards everything)
bigram_finder = BigramCollocationFinder.from_words(tokens)
bigram_finder.apply_freq_filter(3)  # Minimum frequency
bigram_measures = BigramAssocMeasures()

# Get bigrams with high PMI
best_bigrams = bigram_finder.nbest(bigram_measures.pmi, 10)
print(best_bigrams)

Text Classification

Naive Bayes Classifier

from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews
import random

# Prepare data
nltk.download('movie_reviews')

def document_features(document):
    words = set(document)
    features = {}
    for word in word_features:
        features[f'contains({word})'] = (word in words)
    return features

# Create feature list
documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]
random.shuffle(documents)

all_words = FreqDist(w.lower() for w in movie_reviews.words())
word_features = list(all_words)[:2000]

# Build train and test sets
featuresets = [(document_features(d), c) for (d, c) in documents]
train_set, test_set = featuresets[100:], featuresets[:100]

# Train classifier
classifier = NaiveBayesClassifier.train(train_set)

# Test
accuracy = nltk.classify.accuracy(classifier, test_set)
print(f"Accuracy: {accuracy}")

# Show most informative features
classifier.show_most_informative_features(5)

Sentiment Analysis

Basic Sentiment Analysis

from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download('vader_lexicon')
sia = SentimentIntensityAnalyzer()

# Sentiment analysis
text = "NLTK is really amazing for text processing!"
sentiment_scores = sia.polarity_scores(text)
print(sentiment_scores)
# {'neg': 0.0, 'neu': 0.625, 'pos': 0.375, 'compound': 0.6588}
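
In practice the compound score is usually collapsed into a coarse label. The ±0.05 thresholds below follow the convention recommended by the VADER authors; they are not part of the NLTK API.

```python
def label_sentiment(compound, pos_threshold=0.05, neg_threshold=-0.05):
    """Map a VADER compound score to a coarse sentiment label."""
    if compound >= pos_threshold:
        return 'positive'
    if compound <= neg_threshold:
        return 'negative'
    return 'neutral'

print(label_sentiment(0.6588))  # positive
print(label_sentiment(-0.42))   # negative
```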

Working with WordNet

Semantic Analysis

from nltk.corpus import wordnet

# Find synonyms
synsets = wordnet.synsets('good')
for synset in synsets:
    print(f"{synset.name()}: {synset.definition()}")
    print(f"Examples: {synset.examples()}")
    print(f"Lemmas: {[lemma.name() for lemma in synset.lemmas()]}")
    print()

# Find antonyms
good_synset = wordnet.synset('good.a.01')
antonyms = []
for lemma in good_synset.lemmas():
    if lemma.antonyms():
        antonyms.extend([ant.name() for ant in lemma.antonyms()])
print(f"Antonyms of 'good': {antonyms}")

# Semantic similarity
dog = wordnet.synset('dog.n.01')
cat = wordnet.synset('cat.n.01')
similarity = dog.wup_similarity(cat)
print(f"Similarity between dog and cat: {similarity}")

Advanced Capabilities

Working with Regular Expressions

import re
from nltk.tokenize import regexp_tokenize

# Extract email addresses
text = "Contact us at info@example.com or support@test.org"
emails = regexp_tokenize(text, pattern=r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b')
print(emails)

# Extract URLs
urls = regexp_tokenize(text, pattern=r'https?://(?:[A-Za-z0-9$\-_@.&+!*(),]|%[0-9a-fA-F]{2})+')

Creating Custom Corpora

from nltk.corpus import PlaintextCorpusReader

# Build corpus from custom files
corpus_root = '/path/to/your/corpus'
wordlists = PlaintextCorpusReader(corpus_root, '.*')
print(wordlists.fileids())
print(wordlists.words('file1.txt'))

Table of Core NLTK Methods and Functions

Category Function/Class Description Usage Example
Resource Loading nltk.download() Downloads corpora and models nltk.download('punkt')
  nltk.data.find() Checks for resource existence nltk.data.find('tokenizers/punkt')
Tokenization word_tokenize() Splits into words word_tokenize("Hello world!")
  sent_tokenize() Splits into sentences sent_tokenize("Hello! How are you?")
  RegexpTokenizer() Regex‑based tokenization RegexpTokenizer(r'\w+')
  WhitespaceTokenizer() Whitespace tokenization WhitespaceTokenizer().tokenize(text)
Stemming and Lemmatization PorterStemmer() Porter stemming algorithm PorterStemmer().stem('running')
  LancasterStemmer() Aggressive stemming LancasterStemmer().stem('running')
  SnowballStemmer() Multilingual stemming SnowballStemmer('english').stem('running')
  WordNetLemmatizer() Lemmatization using WordNet WordNetLemmatizer().lemmatize('running', pos='v')
Stop Words stopwords.words() Gets list of stop words stopwords.words('english')
POS Tags pos_tag() Part‑of‑speech tagging pos_tag(['Hello', 'world'])
  help.upenn_tagset() POS tag description help.upenn_tagset('NN')
Frequency Analysis FreqDist() Frequency distribution FreqDist(tokens)
  .most_common() Most frequent items fdist.most_common(10)
  .hapaxes() Items with frequency 1 fdist.hapaxes()
  .plot() Distribution plot fdist.plot(30)
N‑grams ngrams() Create n‑grams list(ngrams(tokens, 2))
  bigrams() Create bigrams list(bigrams(tokens))
  trigrams() Create trigrams list(trigrams(tokens))
  BigramCollocationFinder Find collocations BigramCollocationFinder.from_words(tokens)
Syntactic Analysis ne_chunk() Named entity recognition ne_chunk(pos_tag(tokens))
  RegexpParser() Regex‑based parsing RegexpParser("NP: {<DT>?<JJ>*<NN>}")
  ChartParser() Parsing with context‑free grammars ChartParser(grammar)
Classification NaiveBayesClassifier Naive Bayes classifier NaiveBayesClassifier.train(train_set)
  classify.accuracy() Classifier accuracy classify.accuracy(classifier, test_set)
Corpora gutenberg Gutenberg corpus gutenberg.words('austen-emma.txt')
  brown Brown corpus brown.words(categories='news')
  movie_reviews Movie reviews corpus movie_reviews.words(categories='pos')
  reuters Reuters news corpus reuters.categories()
WordNet wordnet.synsets() Find synonyms wordnet.synsets('good')
  .definition() Word definition synset.definition()
  .examples() Usage examples synset.examples()
  .wup_similarity() Semantic similarity synset1.wup_similarity(synset2)
Sentiment Analysis SentimentIntensityAnalyzer Sentiment scoring SentimentIntensityAnalyzer().polarity_scores(text)
Visualization Text() Text object for analysis Text(tokens)
  .concordance() Contextual search text.concordance('love')
  .dispersion_plot() Word dispersion plot text.dispersion_plot(['love', 'hate'])
  .similar() Find similar words text.similar('love')

Integration with Other Libraries

Integration with pandas

import pandas as pd
from nltk.sentiment import SentimentIntensityAnalyzer

# Sentiment analysis in a DataFrame
df = pd.DataFrame({
    'text': ['I love this product!', 'This is terrible.', 'Average quality.']
})

sia = SentimentIntensityAnalyzer()
df['sentiment'] = df['text'].apply(lambda x: sia.polarity_scores(x)['compound'])

Integration with scikit‑learn

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB

# Build a pipeline for classification
def preprocess_text(text):
    tokens = word_tokenize(text.lower())
    return ' '.join([lemmatizer.lemmatize(token) for token in tokens if token.isalpha()])

pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('classifier', MultinomialNB())
])
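
The pipeline still has to be fitted before it can classify anything. The sketch below is self-contained and uses a made-up toy dataset purely for illustration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('classifier', MultinomialNB())
])

# Toy training data (illustrative only)
train_texts = ['loved this great film', 'wonderful and moving',
               'terrible boring mess', 'awful waste of time']
train_labels = ['pos', 'pos', 'neg', 'neg']

pipeline.fit(train_texts, train_labels)
print(pipeline.predict(['a great and wonderful film']))
```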

Integration with spaCy

import spacy

# Use NLTK for preprocessing, spaCy for analysis
nlp = spacy.load('en_core_web_sm')

def hybrid_processing(text):
    # Preprocess with NLTK
    tokens = word_tokenize(text)
    filtered_tokens = [w for w in tokens if w.lower() not in stop_words]
    
    # Analyze with spaCy
    doc = nlp(' '.join(filtered_tokens))
    return [(token.text, token.pos_, token.dep_) for token in doc]

Practical Use Cases

News Sentiment Analysis

def analyze_news_sentiment(news_texts):
    sia = SentimentIntensityAnalyzer()
    results = []
    
    for text in news_texts:
        # Preprocess
        tokens = word_tokenize(text.lower())
        filtered_tokens = [w for w in tokens if w.isalpha() and w not in stop_words]
        processed_text = ' '.join(filtered_tokens)
        
        # Sentiment analysis
        sentiment = sia.polarity_scores(processed_text)
        results.append({
            'text': text[:100] + '...',
            'positive': sentiment['pos'],
            'negative': sentiment['neg'],
            'neutral': sentiment['neu'],
            'compound': sentiment['compound']
        })
    
    return results

Key Phrase Extraction

def extract_key_phrases(text, n=10):
    # Tokenize and POS tag
    tokens = word_tokenize(text.lower())
    tagged = pos_tag(tokens)
    
    # Extract noun groups
    grammar = "NP: {<DT>?<JJ>*<NN>+}"
    parser = RegexpParser(grammar)
    tree = parser.parse(tagged)
    
    # Gather phrases
    phrases = []
    for subtree in tree.subtrees():
        if subtree.label() == 'NP':
            phrase = ' '.join([word for word, pos in subtree.leaves()])
            phrases.append(phrase)
    
    # Frequency count
    phrase_freq = FreqDist(phrases)
    return phrase_freq.most_common(n)

Document Comparison

def compare_documents(doc1, doc2):
    # Preprocess documents
    def preprocess(text):
        tokens = word_tokenize(text.lower())
        return [lemmatizer.lemmatize(w) for w in tokens if w.isalpha() and w not in stop_words]
    
    tokens1 = preprocess(doc1)
    tokens2 = preprocess(doc2)
    
    # Build frequency distributions
    fdist1 = FreqDist(tokens1)
    fdist2 = FreqDist(tokens2)
    
    # Jaccard similarity
    set1 = set(tokens1)
    set2 = set(tokens2)
    
    intersection = len(set1.intersection(set2))
    union = len(set1.union(set2))
    
    jaccard_similarity = intersection / union if union > 0 else 0
    
    return {
        'jaccard_similarity': jaccard_similarity,
        'common_words': list(set1.intersection(set2)),
        'doc1_unique': list(set1 - set2),
        'doc2_unique': list(set2 - set1)
    }

Performance Optimization

Caching Results

from functools import lru_cache

@lru_cache(maxsize=1000)
def cached_lemmatize(word, pos):
    return lemmatizer.lemmatize(word, pos)

@lru_cache(maxsize=1000)
def cached_stem(word):
    return stemmer.stem(word)

Batch Processing

def batch_process_texts(texts, batch_size=100):
    results = []
    
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i+batch_size]
        batch_results = []
        
        for text in batch:
            # Process text
            processed = preprocess_text(text)
            batch_results.append(processed)
        
        results.extend(batch_results)
    
    return results

Creating Custom Components

Custom Tokenizer

import re
from nltk.tokenize.api import TokenizerI

class CustomTokenizer(TokenizerI):
    def __init__(self, preserve_case=False):
        self.preserve_case = preserve_case
    
    def tokenize(self, text):
        # Custom tokenization logic
        if not self.preserve_case:
            text = text.lower()
        
        # Simple whitespace tokenization with punctuation removal
        tokens = re.findall(r'\b\w+\b', text)
        return tokens
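
A quick check of the tokenizer (the class is repeated here so the snippet runs on its own):

```python
import re
from nltk.tokenize.api import TokenizerI

class CustomTokenizer(TokenizerI):
    def __init__(self, preserve_case=False):
        self.preserve_case = preserve_case

    def tokenize(self, text):
        # Lowercase unless the caller asked to preserve case
        if not self.preserve_case:
            text = text.lower()
        # Word characters only; punctuation is dropped
        return re.findall(r'\b\w+\b', text)

print(CustomTokenizer().tokenize("Hello, NLTK World!"))            # ['hello', 'nltk', 'world']
print(CustomTokenizer(preserve_case=True).tokenize("Hello!"))      # ['Hello']
```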

Custom Classifier

from nltk.classify.api import ClassifierI

class CustomClassifier(ClassifierI):
    def __init__(self, feature_extractor):
        self.feature_extractor = feature_extractor
        self.model = None
    
    def labels(self):
        # ClassifierI requires labels(); assumes a scikit-learn style model
        return list(self.model.classes_)
    
    def classify(self, featureset):
        # Custom classification logic
        return self.model.predict([featureset])[0]
    
    def prob_classify(self, featureset):
        # Returns class probabilities
        probabilities = self.model.predict_proba([featureset])[0]
        return probabilities

Processing Multiple Languages

Multilingual Support

# Settings for different languages
language_settings = {
    'english': {
        'stopwords': set(stopwords.words('english')),
        'stemmer': SnowballStemmer('english'),
        'tokenizer': 'punkt'
    },
    'spanish': {
        'stopwords': set(stopwords.words('spanish')),
        'stemmer': SnowballStemmer('spanish'),
        'tokenizer': 'punkt'
    },
    'french': {
        'stopwords': set(stopwords.words('french')),
        'stemmer': SnowballStemmer('french'),
        'tokenizer': 'punkt'
    }
}

def process_multilingual_text(text, language='english'):
    settings = language_settings[language]
    
    # Tokenize
    tokens = word_tokenize(text, language=language)
    
    # Remove stop words
    filtered_tokens = [w for w in tokens if w.lower() not in settings['stopwords']]
    
    # Stem
    stemmed_tokens = [settings['stemmer'].stem(w) for w in filtered_tokens]
    
    return stemmed_tokens

Working with Large Datasets

Streaming Processing

def stream_process_large_corpus(file_path):
    with open(file_path, 'r', encoding='utf-8') as file:
        for line in file:
            # Process one line at a time
            processed_line = preprocess_text(line.strip())
            yield processed_line

# Use generator to save memory
def analyze_large_corpus(file_path):
    word_freq = FreqDist()
    
    for processed_line in stream_process_large_corpus(file_path):
        tokens = word_tokenize(processed_line)
        for token in tokens:
            word_freq[token] += 1
    
    return word_freq.most_common(100)

Parallel Processing

from multiprocessing import Pool
import functools

def parallel_text_processing(texts, num_processes=4):
    with Pool(processes=num_processes) as pool:
        results = pool.map(preprocess_text, texts)
    return results

Debugging and Testing

Component Testing

import unittest

class TestNLTKProcessing(unittest.TestCase):
    def setUp(self):
        self.sample_text = "This is a test sentence for NLTK processing."

    def test_word_tokenize(self):
        tokens = word_tokenize(self.sample_text)
        self.assertIn('sentence', tokens)
