SpeechRecognition: Speech Recognition in Python


Introduction

Modern voice assistants, transcription systems, and voice‑control applications rely on speech‑recognition technologies. Python offers a straightforward way to integrate these technologies via the SpeechRecognition library — a flexible and powerful tool for converting audio signals into text.

The library supports both processing audio files and capturing sound from a microphone, and it works with a variety of recognizers — from local engines to cloud APIs such as Google, IBM, Sphinx, and many others. It has become the de‑facto standard for speech‑recognition tasks in Python thanks to its ease of use and versatility.

What Is the SpeechRecognition Library?

SpeechRecognition is a popular open‑source Python library designed to simplify the integration of speech‑recognition technologies into applications. Created by Anthony Zhang, it provides a unified interface for multiple speech‑recognition engines, making it an ideal choice for developers of any skill level.

Key Features

The SpeechRecognition library stands out with the following capabilities:

  • Cross‑platform support: works on Windows, macOS, and Linux
  • Multiple API integrations: works with more than 7 different recognition services
  • Ease of use: minimal code for basic tasks
  • Configurable: extensive options for fine‑tuning to specific needs
  • Active community: regular updates and strong support

Library Architecture

SpeechRecognition follows a modular design, with core components including:

  • Recognizer — the central class that manages recognition
  • AudioSource — an abstraction for audio inputs (files, microphone)
  • AudioData — a container for raw audio data
  • Engines — adapters for various recognition services

Installation and Dependency Setup

Basic Installation

pip install SpeechRecognition

Additional Dependencies

To work with different audio sources, install the following optional packages:

Microphone support:

pip install pyaudio

Extended audio format support:

pip install pydub

Google Cloud Speech API:

pip install google-cloud-speech

Platform‑Specific Installation Tips

Windows

On Windows, installing pyaudio may cause issues. Use pre‑compiled wheels:

pip install pipwin
pipwin install pyaudio

macOS

brew install portaudio
pip install pyaudio

Linux (Ubuntu/Debian)

sudo apt-get install python3-pyaudio

Importing Core Components

import speech_recognition as sr

Primary Classes

  • Recognizer() — main class for speech‑recognition operations
  • AudioFile() — handles audio file input
  • Microphone() — handles microphone input

Working with Audio Files

Supported Formats

SpeechRecognition can process the following audio formats:

  • WAV — recommended for highest quality
  • AIFF — alternative to WAV
  • FLAC — lossless compression
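
When processing files in batches, it can help to pre-screen filenames before handing them to AudioFile. The helper below is an illustrative sketch, not part of the library:

```python
import os

# Extensions AudioFile can read natively (illustrative list)
SUPPORTED_EXTENSIONS = {".wav", ".aiff", ".aif", ".flac"}

def is_supported_audio(filename):
    """Return True if the file extension is one AudioFile reads directly."""
    _, ext = os.path.splitext(filename.lower())
    return ext in SUPPORTED_EXTENSIONS

print(is_supported_audio("speech.wav"))  # True
print(is_supported_audio("speech.mp3"))  # False: convert to WAV first
```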

Basic File Recognition

import speech_recognition as sr

r = sr.Recognizer()

with sr.AudioFile("speech.wav") as source:
    audio = r.record(source)

try:
    text = r.recognize_google(audio, language="ru-RU")
    print("Recognized text:", text)
except sr.UnknownValueError:
    print("Could not understand audio")
except sr.RequestError as e:
    print(f"Service error: {e}")

Handling Long Audio Files

For lengthy recordings, split the audio into manageable segments:

r = sr.Recognizer()

with sr.AudioFile("long_speech.wav") as source:
    # First 30 seconds
    audio1 = r.record(source, duration=30)
    # Next 30 seconds: record() continues from where the previous
    # call stopped, so no offset is needed here
    audio2 = r.record(source, duration=30)
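
For files of arbitrary length, the offsets can be computed up front and each chunk recognized in turn. chunk_offsets below is an illustrative helper, not part of the library:

```python
def chunk_offsets(total_seconds, chunk_seconds):
    """Split a recording into (offset, duration) pairs that cover it fully."""
    chunks = []
    offset = 0
    while offset < total_seconds:
        duration = min(chunk_seconds, total_seconds - offset)
        chunks.append((offset, duration))
        offset += duration
    return chunks

print(chunk_offsets(75, 30))  # [(0, 30), (30, 30), (60, 15)]
```

Each pair can then be passed as r.record(source, offset=off, duration=dur), reopening the AudioFile for every chunk so that offset counts from the beginning of the file.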

Converting MP3 to a Supported Format

from pydub import AudioSegment

# Convert MP3 to WAV
audio = AudioSegment.from_mp3("audio.mp3")
audio.export("audio.wav", format="wav")

Working with a Microphone

Basic Microphone Recognition

import speech_recognition as sr

r = sr.Recognizer()
mic = sr.Microphone()

with mic as source:
    r.adjust_for_ambient_noise(source)
    print("Say something:")
    audio = r.listen(source)

try:
    text = r.recognize_google(audio, language="ru-RU")
    print("You said:", text)
except sr.UnknownValueError:
    print("Speech not recognized")
except sr.RequestError as e:
    print(f"Service error: {e}")

Selecting a Specific Microphone

# List all available microphones
print(sr.Microphone.list_microphone_names())

# Use a specific microphone
mic = sr.Microphone(device_index=1)

Continuous Background Recognition

import speech_recognition as sr

r = sr.Recognizer()
m = sr.Microphone()

def callback(recognizer, audio):
    try:
        text = recognizer.recognize_google(audio, language="ru-RU")
        print(f"Recognized: {text}")
    except sr.UnknownValueError:
        pass

# Start background listening
stop_listening = r.listen_in_background(m, callback)

# Stop after 30 seconds
import time
time.sleep(30)
stop_listening(wait_for_stop=False)

Supported Recognizers Overview

Cloud Services

  • Google Web Speech — internet required; free tier (limited); high accuracy; great for beginners
  • Google Cloud Speech — internet required; paid (free trial available); very high accuracy; professional solution
  • IBM Watson — internet required; paid (free trial available); high accuracy; broad language support
  • Microsoft Azure — internet required; paid (free trial available); high accuracy; seamless Microsoft ecosystem integration
  • Amazon Transcribe — internet required; paid; high accuracy; part of the AWS suite

Local Engines

  • CMU Sphinx — no internet required; free; medium accuracy; fully offline
  • Vosk — no internet required; free; high accuracy; modern offline alternative

Specialized Services

  • Wit.ai — internet required; free; medium accuracy; suited to chat-bot development
  • Houndify — internet required; free tier (limited); high accuracy; fast command processing
  • Assembly.AI — internet required; paid; high accuracy; focused on transcription
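
When several engines are configured, a simple fallback chain can try them in order of preference, for example a cloud service first and an offline engine as backup. try_engines is an illustrative sketch; in real use each callable would wrap something like r.recognize_google(audio) or r.recognize_sphinx(audio):

```python
def try_engines(engines):
    """Try (name, callable) pairs in order; return the first successful result.

    Each callable should return recognized text, or raise an exception
    (e.g. sr.UnknownValueError, sr.RequestError) on failure.
    """
    for name, recognize in engines:
        try:
            return name, recognize()
        except Exception:
            continue
    return None, None

# Demonstration with stand-ins instead of real recognizers
def cloud_engine():
    raise ConnectionError("no network")

def offline_engine():
    return "привет мир"

engine, text = try_engines([("google", cloud_engine), ("sphinx", offline_engine)])
print(engine, text)  # sphinx привет мир
```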

Comprehensive Methods and Functions Reference

Core Recognizer Methods

  • record(source, duration=None, offset=None) — capture audio from a source; returns AudioData
  • listen(source, timeout=None, phrase_time_limit=None) — listen until a phrase is detected; returns AudioData
  • listen_in_background(source, callback) — continuous background listening; returns a function that stops the listener
  • adjust_for_ambient_noise(source, duration=1) — calibrate the energy threshold to the ambient noise level; returns None

Speech‑Recognition Methods

  • recognize_google(audio, key=None, language='en-US', show_all=False) — Google Web Speech API; no API key required
  • recognize_google_cloud(audio, credentials_json=None) — Google Cloud Speech API; also accepts language and preferred_phrases; credentials required
  • recognize_sphinx(audio, language='en-US') — CMU Sphinx (offline); also accepts keyword_entries; no key required
  • recognize_ibm(audio, username, password) — IBM Watson Speech to Text; also accepts language and customization_id; credentials required
  • recognize_bing(audio, key, language='en-US') — Microsoft Bing Voice Recognition; also accepts show_all; key required
  • recognize_azure(audio, key, region, language='en-US') — Microsoft Azure Speech; also accepts endpoint; key required
  • recognize_wit(audio, key, show_all=False) — Wit.ai Speech Recognition; key required
  • recognize_houndify(audio, client_id, client_key) — Houndify Speech Recognition; also accepts show_all; credentials required

Recognizer Properties

  • energy_threshold (int, default 300) — energy level threshold for speech detection
  • dynamic_energy_threshold (bool, default True) — automatic adjustment of the energy threshold
  • dynamic_energy_adjustment_damping (float, default 0.15) — smoothing factor for dynamic adjustment
  • dynamic_energy_ratio (float, default 1.5) — ratio of speech energy to background noise
  • pause_threshold (float, default 0.8) — silence duration that marks the end of a phrase
  • operation_timeout (float, default None) — maximum time allowed for a recognition operation, in seconds
  • phrase_threshold (float, default 0.3) — minimum length of a spoken phrase
  • non_speaking_duration (float, default 0.5) — allowed silence duration between phrases

AudioFile Class Methods

  • __init__(filename_or_fileobject) — initialize with a file path or file-like object
  • __enter__() / __exit__(exc_type, exc_value, traceback) — context-manager entry and exit

Microphone Class Methods

  • __init__(device_index=None, sample_rate=None, chunk_size=1024) — initialize microphone settings; sample_rate defaults to the device's native rate
  • list_microphone_names() — static method returning the names of available microphones
  • list_working_microphones() — static method returning the microphones that are actually producing audio

Error Handling and Exceptions

Exception Types

import speech_recognition as sr

try:
    # Recognition code
    text = r.recognize_google(audio)
except sr.UnknownValueError:
    print("Speech could not be understood")
except sr.RequestError as e:
    print(f"Recognition service error: {e}")
except sr.WaitTimeoutError:
    print("Listening timed out")
except OSError as e:
    print(f"Operating system error: {e}")

Exception Details

  • UnknownValueError — speech was unintelligible (poor audio quality, background noise, unclear speech)
  • RequestError — API request failed (network issues, invalid API key, quota exceeded)
  • WaitTimeoutError — listening exceeded the timeout (extended silence during microphone capture)
  • OSError — system-level error (microphone hardware problems, missing files)
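
Transient RequestError failures (network hiccups, quota limits) are often worth retrying with a short delay. The helper below is an illustrative sketch rather than library functionality; in real use retryable would be (sr.RequestError,):

```python
import time

def recognize_with_retry(recognize, attempts=3, delay=0.5, retryable=(OSError,)):
    """Call recognize(); on a retryable error, back off and try again."""
    for attempt in range(attempts):
        try:
            return recognize()
        except retryable:
            if attempt == attempts - 1:
                raise
            time.sleep(delay * (2 ** attempt))  # exponential backoff

# Demonstration with a stand-in that fails twice, then succeeds
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise OSError("temporary failure")
    return "recognized text"

print(recognize_with_retry(flaky, delay=0))  # recognized text
```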

Parameter Tuning for Better Accuracy

Microphone Settings Optimization

r = sr.Recognizer()

# Adjust sensitivity
r.energy_threshold = 4000  # Increase for noisy environments
r.dynamic_energy_threshold = True
r.pause_threshold = 1.0  # Longer pause before ending a phrase

# Configure timeouts
r.operation_timeout = 5  # Max operation time
r.phrase_threshold = 0.3  # Minimum phrase length

Audio Quality Recommendations

To achieve optimal recognition results:

  • Sample rate: 16 kHz or higher
  • Bit depth: 16‑bit
  • Channels: mono (1 channel)
  • Format: uncompressed WAV
  • Noise level: as low as possible
  • Microphone distance: 15‑30 cm from the mouth
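
These properties can be checked with the standard wave module before sending a file off for recognition. The sketch below writes one second of 16 kHz, 16-bit mono silence and reads its header back; wav_specs is a hypothetical helper, not part of SpeechRecognition:

```python
import wave

def wav_specs(path):
    """Read sample rate, bit depth, and channel count from a WAV header."""
    with wave.open(path, "rb") as w:
        return {
            "sample_rate": w.getframerate(),
            "bit_depth": w.getsampwidth() * 8,
            "channels": w.getnchannels(),
        }

# Write one second of 16 kHz, 16-bit, mono silence as a sample file
with wave.open("sample.wav", "wb") as w:
    w.setnchannels(1)       # mono
    w.setsampwidth(2)       # 16-bit
    w.setframerate(16000)   # 16 kHz
    w.writeframes(b"\x00\x00" * 16000)

specs = wav_specs("sample.wav")
print(specs)  # {'sample_rate': 16000, 'bit_depth': 16, 'channels': 1}
```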

Pre‑processing Audio

# Calibrate to ambient noise
with sr.Microphone() as source:
    r.adjust_for_ambient_noise(source, duration=2)
    print("Calibration complete")
    
    audio = r.listen(source, timeout=1, phrase_time_limit=5)

Multilingual Support

Supported Languages

SpeechRecognition works with dozens of languages. Common language codes include:

  • ru-RU — Russian
  • en-US — English (US)
  • en-GB — English (UK)
  • de-DE — German
  • fr-FR — French
  • es-ES — Spanish
  • zh-CN — Chinese (Simplified)
  • ja-JP — Japanese

Multilingual Recognition Example

# Try recognition in several languages
languages = ['ru-RU', 'en-US', 'de-DE']

for lang in languages:
    try:
        text = r.recognize_google(audio, language=lang)
        print(f"Recognized in {lang}: {text}")
        break
    except sr.UnknownValueError:
        continue
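
Passing show_all=True to recognize_google returns the raw response instead of a single string, including alternative transcripts. The helper below is illustrative; the response shape shown matches the Google Web Speech API's JSON, where confidence may be present only on some alternatives:

```python
def best_transcript(response):
    """Pick the alternative with the highest confidence (0 if absent)."""
    alternatives = response.get("alternative", [])
    if not alternatives:
        return None
    best = max(alternatives, key=lambda alt: alt.get("confidence", 0))
    return best["transcript"]

# Example response in the shape produced by show_all=True
response = {
    "alternative": [
        {"transcript": "привет мир", "confidence": 0.92},
        {"transcript": "привет мир точка"},
    ],
    "final": True,
}
print(best_transcript(response))  # привет мир
```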

Practical Use Cases

Building a Voice Assistant

import speech_recognition as sr
import pyttsx3

# Initialize
r = sr.Recognizer()
mic = sr.Microphone()
tts = pyttsx3.init()

def speak(text):
    tts.say(text)
    tts.runAndWait()

def listen():
    with mic as source:
        r.adjust_for_ambient_noise(source)
        audio = r.listen(source)
    
    try:
        command = r.recognize_google(audio, language="ru-RU")
        return command.lower()
    except sr.UnknownValueError:
        return None

# Main loop
while True:
    command = listen()
    if command:
        if "время" in command:
            import datetime
            now = datetime.datetime.now()
            speak(f"The time is {now.hour} hours {now.minute} minutes")
        elif "выход" in command:
            speak("Goodbye!")
            break

Audio File Transcription

import speech_recognition as sr
import os

def transcribe_audio_file(file_path):
    r = sr.Recognizer()
    
    with sr.AudioFile(file_path) as source:
        audio = r.record(source)
    
    try:
        text = r.recognize_google(audio, language="ru-RU")
        return text
    except sr.UnknownValueError:
        return "Unable to recognize speech"
    except sr.RequestError as e:
        return f"Service error: {e}"

# Batch transcription
audio_dir = "audio_files"
for filename in os.listdir(audio_dir):
    if filename.endswith(".wav"):
        file_path = os.path.join(audio_dir, filename)
        text = transcribe_audio_file(file_path)
        
        # Save result next to the source audio file
        output_file = os.path.join(audio_dir, filename.replace(".wav", ".txt"))
        with open(output_file, "w", encoding="utf-8") as f:
            f.write(text)

Voice Command System

import speech_recognition as sr
import subprocess

class VoiceCommandSystem:
    def __init__(self):
        self.r = sr.Recognizer()
        self.mic = sr.Microphone()
        self.commands = {
            "открыть браузер": self.open_browser,
            "открыть калькулятор": self.open_calculator,
            "выключить компьютер": self.shutdown_computer,
        }
    
    def listen_for_command(self):
        with self.mic as source:
            self.r.adjust_for_ambient_noise(source)
            print("Listening for command...")
            audio = self.r.listen(source, timeout=5)
        
        try:
            command = self.r.recognize_google(audio, language="ru-RU").lower()
            print(f"Recognized command: {command}")
            return command
        except sr.UnknownValueError:
            print("Command not recognized")
            return None
    
    def execute_command(self, command):
        for key, func in self.commands.items():
            if key in command:
                func()
                return True
        return False
    
    def open_browser(self):
        subprocess.run(["start", "chrome"], shell=True)
    
    def open_calculator(self):
        subprocess.run(["calc"], shell=True)
    
    def shutdown_computer(self):
        subprocess.run(["shutdown", "/s", "/t", "10"], shell=True)

# Usage
system = VoiceCommandSystem()
command = system.listen_for_command()
if command and system.execute_command(command):
    print("Command executed")
else:
    print("Command not found")

Performance Optimization

Asynchronous Processing

import asyncio
import speech_recognition as sr
from concurrent.futures import ThreadPoolExecutor

async def async_recognize(audio_data, language="ru-RU"):
    loop = asyncio.get_event_loop()
    r = sr.Recognizer()
    
    with ThreadPoolExecutor() as executor:
        text = await loop.run_in_executor(
            executor, 
            r.recognize_google, 
            audio_data, 
            None, 
            language
        )
    return text

# Example usage
async def main():
    r = sr.Recognizer()
    with sr.AudioFile("speech.wav") as source:
        audio = r.record(source)
    
    text = await async_recognize(audio)
    print(text)

asyncio.run(main())

Caching Recognition Results

import hashlib
import speech_recognition as sr

class CachedRecognizer:
    def __init__(self):
        self.r = sr.Recognizer()
        self.cache = {}
    
    def _get_audio_hash(self, audio_data):
        return hashlib.md5(audio_data.get_raw_data()).hexdigest()
    
    def recognize_cached(self, audio_data, language="ru-RU"):
        audio_hash = self._get_audio_hash(audio_data)
        
        if audio_hash in self.cache:
            return self.cache[audio_hash]
        
        try:
            text = self.r.recognize_google(audio_data, language=language)
            self.cache[audio_hash] = text
            return text
        except sr.UnknownValueError:
            return None

Frequently Asked Questions

Why Isn't Microphone Recognition Working?

Common reasons and fixes:

  • Verify pyaudio installation: pip install pyaudio
  • Make sure no other application is using the microphone
  • Check system permissions for microphone access
  • Try specifying a concrete device_index for the microphone

How Can I Improve Recognition Accuracy?

Tips for higher accuracy:

  • Use a high‑quality microphone
  • Speak clearly and at a moderate pace
  • Minimize background noise
  • Call adjust_for_ambient_noise() before recording
  • Select a recognizer that best supports your target language

Can I Use SpeechRecognition Offline?

Yes, for offline operation consider:

  • CMU Sphinx: recognize_sphinx()
  • Vosk (requires additional installation)
  • OpenAI Whisper (separate library)

How Should I Process Long Audio Recordings?

Best practices for long files:

  • Split audio into 30‑60 second chunks
  • Use the offset parameter to process sequential segments
  • Consider specialized services designed for lengthy recordings

Which Engine Is Best for a Commercial Project?

Recommendations based on use case:

  • Google Cloud Speech: top‑tier accuracy and scalability
  • Microsoft Azure: strong integration with Microsoft services
  • IBM Watson: extensive customization options
  • Amazon Transcribe: ideal for AWS‑centric solutions

How Do I Handle Noisy Audio Files?

Techniques to improve noisy recordings:

  • Apply audio pre‑processing (noise‑reduction filters)
  • Tune the energy_threshold parameter
  • Use dedicated noise‑suppression libraries (e.g., noisereduce)
  • Experiment with different recognizers to find the most robust one
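
Tuning energy_threshold is easier with a feel for what "energy" means: roughly, the RMS amplitude of incoming audio frames. The function below is an illustrative stand-in for that computation, not the library's internal code:

```python
import math

def rms_energy(samples):
    """Root-mean-square amplitude of a sequence of PCM samples."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))

quiet = [10, -12, 8, -9]           # background noise
loud = [3000, -2800, 3100, -2900]  # speech

print(rms_energy(quiet) < 300 < rms_energy(loud))  # True
```

Frames whose energy exceeds energy_threshold are treated as speech, so in a noisy room the threshold should sit above the measured noise floor but below typical speech levels.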

Alternatives and Complementary Tools

Modern Alternatives

  • OpenAI Whisper: state‑of‑the‑art neural speech recognizer
  • Wav2Vec 2.0: Facebook’s self‑supervised model
  • DeepSpeech: Mozilla’s open‑source solution
  • Vosk: lightweight offline library

Integration with Other Libraries

# Integration with pydub for advanced audio handling
import os
import speech_recognition as sr
from pydub import AudioSegment
from pydub.silence import split_on_silence

def transcribe_long_audio(file_path):
    r = sr.Recognizer()
    
    # Load the audio and split it on stretches of silence
    audio = AudioSegment.from_wav(file_path)
    chunks = split_on_silence(audio, min_silence_len=1000, silence_thresh=-40)
    
    # Recognize each chunk, skipping ones that cannot be understood
    full_text = ""
    for i, chunk in enumerate(chunks):
        chunk_path = f"temp_chunk_{i}.wav"
        chunk.export(chunk_path, format="wav")
        
        try:
            with sr.AudioFile(chunk_path) as source:
                audio_data = r.record(source)
                full_text += r.recognize_google(audio_data, language="ru-RU") + " "
        except sr.UnknownValueError:
            pass
        finally:
            os.remove(chunk_path)  # clean up the temporary file
    
    return full_text.strip()

Conclusion

SpeechRecognition remains one of the most popular and user‑friendly libraries for speech‑recognition tasks in Python. Its main advantages are simplicity, broad engine support, and an active developer community.

The library suits both rapid prototyping and production‑grade commercial applications. The variety of supported APIs lets you choose the optimal solution—from free services for personal projects to high‑accuracy paid options for enterprise deployments.

Getting started requires only a few lines of code, yet the library also offers deep customization and performance‑tuning capabilities. Combined with modern audio‑processing techniques and machine‑learning advances, SpeechRecognition can serve as the foundation for sophisticated voice‑interaction systems.
