Gensim: Working with Texts and Word2Vec Models

What Is Gensim and Why Is It Needed

Gensim (Generate Similar) is a high‑level Python library for text vectorization, topic modeling, and learning distributed word representations. It was designed for efficient handling of large text datasets without loading everything into memory. The library is best known for its implementations of Word2Vec, FastText, and LDA (Latent Dirichlet Allocation).

Gensim differs from other libraries by focusing on unstructured text data and its ability to process document corpora that do not fit into RAM. This makes it an ideal tool for analyzing massive text collections such as news archives, scientific articles, or social‑media streams.

Main Capabilities of the Gensim Library

Word Vector Representations

Gensim provides full support for creating and working with word vector representations, including Word2Vec, FastText, and Doc2Vec. These models allow you to convert text into numeric vectors while preserving semantic relationships between words.

Topic Modeling

The library includes several algorithms for topic modeling:

  • LDA (Latent Dirichlet Allocation) — for discovering hidden topics in a document collection
  • LSI (Latent Semantic Indexing) — for dimensionality reduction and uncovering latent concepts
  • HDP (Hierarchical Dirichlet Process) — for automatically determining the number of topics

Corpus Handling

Gensim enables efficient processing of large text corpora through streaming, allowing analysis of data that exceeds available RAM.

Similarity Computation

The library offers various metrics for calculating similarity between documents, words, and topics, including cosine similarity and other semantic measures.

Installation and Basic Setup

Installing the Library

Install Gensim the standard way via pip:

pip install gensim

Some models may require additional dependencies:

pip install gensim[complete]

Importing Core Modules

import gensim
from gensim.models import Word2Vec, FastText, LdaModel, Doc2Vec
from gensim.corpora.dictionary import Dictionary
from gensim.corpora import MmCorpus
from gensim.similarities import Similarity
from gensim.utils import simple_preprocess

Preparing Text Data

Tokenization and Pre‑processing

Gensim expects tokenized documents as a list of lists. Proper data pre‑processing is critical for result quality:

import re
from gensim.utils import simple_preprocess

def preprocess_text(text):
    """Pre‑process text for Gensim"""
    # Remove special characters
    text = re.sub(r'[^a-zA-Zа-яА-Я\s]', '', text)
    # Tokenize and convert to lowercase
    tokens = simple_preprocess(text, deacc=True)
    return tokens

documents = [
    "Gensim is a useful library for NLP tasks.",
    "It supports topic modeling and word embeddings.",
    "Models like Word2Vec and FastText are implemented."
]

# Pre‑process documents
texts = [preprocess_text(doc) for doc in documents]

Creating a Dictionary and Corpus

# Create dictionary
dictionary = Dictionary(texts)

# Filter out rare and overly common tokens
# (note: these thresholds suit realistic corpora; on a toy corpus this
# small, no_below=2 would remove almost every token)
dictionary.filter_extremes(no_below=2, no_above=0.5, keep_n=100000)

# Create corpus (bag‑of‑words representation)
corpus = [dictionary.doc2bow(text) for text in texts]

Working with Word2Vec Models

Training a Word2Vec Model

from gensim.models import Word2Vec

# Train model
model = Word2Vec(
    sentences=texts,
    vector_size=100,    # Dimensionality of vectors
    window=5,           # Context window size
    min_count=1,        # Minimum word frequency
    workers=4,          # Number of threads
    sg=0,               # 0 = CBOW, 1 = Skip‑gram
    epochs=10           # Number of training epochs
)

Using the Trained Model

# Retrieve a word vector
vector = model.wv['word']

# Find most similar words
similar_words = model.wv.most_similar('word', topn=10)

# Compute similarity between two words
similarity = model.wv.similarity('word1', 'word2')

# Analogies (king - man + woman = queen)
analogy = model.wv.most_similar(positive=['king', 'woman'], negative=['man'])

FastText Models

FastText Features

FastText extends Word2Vec by learning vectors not only for whole words but also for their sub‑strings (character n‑grams). This enables handling of rare and out‑of‑vocabulary words.

from gensim.models import FastText

# Train FastText model
ft_model = FastText(
    sentences=texts,
    vector_size=100,
    window=3,
    min_count=1,
    min_n=3,           # Minimum n‑gram length
    max_n=6,           # Maximum n‑gram length
    sg=1,              # Skip‑gram
    epochs=10
)

# Get a vector for an out‑of‑vocabulary word (built from character n‑grams)
unknown_word_vector = ft_model.wv['unknownword']

Topic Modeling with LDA

Creating an LDA Model

from gensim.models import LdaModel

# Train LDA model
lda_model = LdaModel(
    corpus=corpus,
    id2word=dictionary,
    num_topics=5,      # Number of topics
    random_state=42,   # For reproducibility
    passes=10,         # Number of passes through the corpus
    alpha='auto',      # Document‑level concentration parameter
    per_word_topics=True
)

# Print topics
topics = lda_model.print_topics(num_words=10)
for topic in topics:
    print(topic)

Analyzing Documents with LDA

# Get topics for a document
doc_topics = lda_model.get_document_topics(corpus[0])

# Predict topics for a new document
new_doc = "machine learning artificial intelligence"
new_doc_bow = dictionary.doc2bow(preprocess_text(new_doc))
new_doc_topics = lda_model.get_document_topics(new_doc_bow)

Doc2Vec Models

Training Doc2Vec

from gensim.models import Doc2Vec
from gensim.models.doc2vec import TaggedDocument

# Prepare data for Doc2Vec
tagged_docs = [TaggedDocument(words=text, tags=[str(i)]) 
               for i, text in enumerate(texts)]

# Train model
doc2vec_model = Doc2Vec(
    documents=tagged_docs,
    vector_size=100,
    window=5,
    min_count=1,
    workers=4,
    epochs=10
)

# Infer vector for a document
doc_vector = doc2vec_model.infer_vector(texts[0])

# Find similar documents
similar_docs = doc2vec_model.dv.most_similar([doc_vector])

Model Evaluation

Metrics for Topic Models

from gensim.models import CoherenceModel

# Compute coherence
coherence_model = CoherenceModel(
    model=lda_model,
    texts=texts,
    dictionary=dictionary,
    coherence='c_v'
)

coherence_score = coherence_model.get_coherence()
print(f'Coherence Score: {coherence_score}')

# log_perplexity returns a per‑word likelihood bound (a negative number);
# the actual perplexity is 2**(-bound), and lower perplexity is better
perplexity = lda_model.log_perplexity(corpus)
print(f'Log perplexity bound: {perplexity}')

Visualizing Results

import pyLDAvis.gensim_models as gensimvis
import pyLDAvis

# Interactive LDA visualization
vis = gensimvis.prepare(lda_model, corpus, dictionary)
pyLDAvis.display(vis)

Document Similarity Computation

Creating a Similarity Index

from gensim.similarities import MatrixSimilarity
from gensim.models import TfidfModel

# Build TF‑IDF model
tfidf_model = TfidfModel(corpus)
tfidf_corpus = tfidf_model[corpus]

# Create similarity index
similarity_index = MatrixSimilarity(tfidf_corpus)

# Compute similarity for a new document
new_doc_tfidf = tfidf_model[new_doc_bow]
similarities = similarity_index[new_doc_tfidf]

Saving and Loading Models

Saving Models

# Save various models
model.save("word2vec_model.model")
lda_model.save("lda_model.model")
dictionary.save("dictionary.dict")

# Save in different formats
model.wv.save_word2vec_format("word2vec_model.txt", binary=False)
model.wv.save_word2vec_format("word2vec_model.bin", binary=True)

Loading Models

# Load models
loaded_model = Word2Vec.load("word2vec_model.model")
loaded_lda = LdaModel.load("lda_model.model")
loaded_dict = Dictionary.load("dictionary.dict")

# Load pretrained models
from gensim.models import KeyedVectors
pretrained_model = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)

Integration with Other Libraries

Working with Pandas

import pandas as pd

# Load data from a DataFrame
df = pd.read_csv('texts.csv')
texts = df['text'].apply(preprocess_text).tolist()

# Create a DataFrame with results
results_df = pd.DataFrame({
    'document': range(len(corpus)),
    'dominant_topic': [max(lda_model.get_document_topics(doc), key=lambda x: x[1])[0] for doc in corpus]
})

Integration with scikit‑learn

from sklearn.cluster import KMeans
import numpy as np

# Use Word2Vec vectors for clustering
word_vectors = np.array([model.wv[word] for word in model.wv.index_to_key])
kmeans = KMeans(n_clusters=5)
clusters = kmeans.fit_predict(word_vectors)

Table of Gensim Methods and Functions

Module/Class Method/Function Description
Word2Vec Word2Vec(sentences, vector_size, window, min_count) Create and train a Word2Vec model
  wv.most_similar(word, topn) Find most similar words
  wv.similarity(word1, word2) Compute cosine similarity
  wv.doesnt_match(words) Identify the outlier word in a list
  wv.evaluate_word_analogies(file) Assess quality on word analogies
FastText FastText(sentences, vector_size, window, min_n, max_n) Create a FastText model
  wv.most_similar(word) Find similar words (including OOV)
  wv.get_vector(word) Get vector for a word
LdaModel LdaModel(corpus, num_topics, id2word, passes) Create an LDA model
  print_topics(num_words) Display topics with top keywords
  get_document_topics(bow) Retrieve topics for a document
  get_term_topics(word_id) Get topics for a word
  log_perplexity(corpus) Calculate perplexity
Doc2Vec Doc2Vec(documents, vector_size, window, min_count) Create a Doc2Vec model
  infer_vector(words) Infer vector for a new document
  dv.most_similar(vector) Find similar documents
Dictionary Dictionary(texts) Create a dictionary from texts
  doc2bow(words) Convert to bag‑of‑words
  filter_extremes(no_below, no_above) Filter rare/common words
  filter_tokens(bad_ids) Remove specific tokens
  compactify() Compress dictionary
TfidfModel TfidfModel(corpus) Create a TF‑IDF model
  [corpus] Transform corpus to TF‑IDF
CoherenceModel CoherenceModel(model, texts, dictionary, coherence) Model for computing topic coherence
  get_coherence() Calculate coherence metric
Similarity MatrixSimilarity(corpus) Create a similarity matrix
  [vector] Compute similarity against the corpus
  Similarity(output_prefix, corpus, num_features) Similarity index for large corpora
utils simple_preprocess(text) Basic tokenization
  deaccent(text) Remove diacritic marks
  tokenize(text) Tokenize text
corpora MmCorpus.serialize(fname, corpus) Save a corpus
  MmCorpus(fname) Load a corpus
  TextCorpus(input) Work with text files

Practical Use Cases

Sentiment Analysis with Word2Vec

# Create lists of positive and negative words
positive_words = ['good', 'great', 'excellent', 'amazing']
negative_words = ['bad', 'terrible', 'awful', 'horrible']

# Function to compute sentiment score
def get_sentiment_score(text, model):
    words = preprocess_text(text)
    positive_score = 0
    negative_score = 0
    
    for word in words:
        if word in model.wv:
            # Similarity with positive words
            pos_similarities = [model.wv.similarity(word, pos_word) 
                              for pos_word in positive_words 
                              if pos_word in model.wv]
            # Similarity with negative words
            neg_similarities = [model.wv.similarity(word, neg_word) 
                              for neg_word in negative_words 
                              if neg_word in model.wv]
            
            if pos_similarities:
                positive_score += max(pos_similarities)
            if neg_similarities:
                negative_score += max(neg_similarities)
    
    return positive_score - negative_score

Content‑Based Recommendation System

from sklearn.metrics.pairwise import cosine_similarity

def recommend_documents(target_doc, corpus, model, top_n=5):
    """Recommend documents based on content similarity"""
    # Get vector for the target document
    target_vector = model.infer_vector(preprocess_text(target_doc))
    
    # Compute similarity with all documents
    similarities = []
    for i, doc in enumerate(corpus):
        doc_vector = model.infer_vector(doc)
        similarity = cosine_similarity([target_vector], [doc_vector])[0][0]
        similarities.append((i, similarity))
    
    # Sort by descending similarity
    similarities.sort(key=lambda x: x[1], reverse=True)
    
    return similarities[:top_n]

Social Media Trend Monitoring

def detect_trending_topics(documents, time_windows):
    """Detect trending topics across time windows"""
    trends = {}
    
    for window, docs in time_windows.items():
        # Build corpus for the time window
        window_texts = [preprocess_text(doc) for doc in docs]
        window_dict = Dictionary(window_texts)
        window_corpus = [window_dict.doc2bow(text) for text in window_texts]
        
        # Train LDA for the window
        lda_window = LdaModel(window_corpus, num_topics=5, id2word=window_dict)
        
        # Store top topics
        trends[window] = lda_window.print_topics(num_words=5)
    
    return trends

Performance Optimization

Parameter Tuning for Large Corpora

# Optimized parameters for big data
large_corpus_model = Word2Vec(
    sentences=texts,
    vector_size=300,
    window=10,
    min_count=5,        # Higher min_count to filter noise
    workers=8,          # Max number of threads
    sg=1,               # Skip‑gram handles rare words better; CBOW (sg=0) is faster
    epochs=5,           # Fewer epochs to save time
    batch_words=10000   # Batch size for memory efficiency
)

Working with Streaming Data

import os

class MySentences:
    """Iterator that streams tokenized sentences from files in a directory"""
    def __init__(self, dirname):
        self.dirname = dirname

    def __iter__(self):
        for fname in os.listdir(self.dirname):
            with open(os.path.join(self.dirname, fname), encoding='utf-8') as f:
                for line in f:
                    yield preprocess_text(line)

# Use the iterator (the corpus is re‑read from disk on every training pass)
sentences = MySentences('./data/')
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, workers=4)

Frequently Asked Questions

What Is Gensim and What Is It Used For?

Gensim is a specialized Python library for text analysis, topic modeling, and creating vector representations of words. It is used for natural‑language‑processing tasks, large‑scale text corpus analysis, recommendation systems, and semantic analysis.

How Does Gensim Differ from Other NLP Libraries?

Gensim focuses on topic modeling and vector representations, whereas libraries like spaCy or NLTK target general‑purpose NLP tasks. Its main advantage is the ability to work with corpora that exceed available RAM.

Can Pre‑trained Models Be Used in Gensim?

Yes, Gensim supports loading pre‑trained Word2Vec, FastText, and other models. You can use models from Google, Facebook, and other organizations, as well as models trained on domain‑specific data.

How to Choose the Optimal Number of Topics for LDA?

The number of topics can be selected using coherence metrics, perplexity, or cross‑validation. Typically, you test a range (e.g., 2 to 20‑30 topics) and pick the value that yields the best results on your chosen metric.

Does Gensim Support GPU Training?

The standard Gensim version does not support GPU, but there are optimized forks and extensions that can leverage GPU acceleration for training large models.

Conclusion

Gensim is a powerful and flexible library for working with text data, especially effective for analyzing large document corpora. Its specialization in topic modeling and vector representations makes it an indispensable tool for researchers and developers in natural language processing.

The library continues to evolve, adding support for new models and algorithms, ensuring its relevance for modern NLP tasks. Thanks to its thoughtful architecture and extensive documentation, Gensim remains one of the top choices for professional text data work in Python.
