Introduction
Modern voice assistants, transcription systems, and voice‑control applications rely on speech‑recognition technologies. Python offers a straightforward way to integrate these technologies via the SpeechRecognition library — a flexible and powerful tool for converting audio signals into text.
The library supports both processing audio files and capturing sound from a microphone, and it works with a variety of recognizers, from local engines such as CMU Sphinx to cloud APIs such as Google and IBM Watson. Thanks to its ease of use and versatility, it has become the de facto standard for speech-recognition tasks in Python.
What Is the SpeechRecognition Library?
SpeechRecognition is a popular open‑source Python library designed to simplify the integration of speech‑recognition technologies into applications. Created by Anthony Zhang, it provides a unified interface for multiple speech‑recognition engines, making it an ideal choice for developers of any skill level.
Key Features
The SpeechRecognition library stands out with the following capabilities:
- Cross‑platform support: works on Windows, macOS, and Linux
- Multiple API integrations: works with more than 7 different recognition services
- Ease of use: minimal code for basic tasks
- Configurable: extensive options for fine‑tuning to specific needs
- Active community: regular updates and strong support
Library Architecture
SpeechRecognition follows a modular design, with core components including:
- Recognizer — the central class that manages recognition
- AudioSource — an abstraction for audio inputs (files, microphone)
- AudioData — a container for raw audio data
- Engines — adapters for various recognition services
Installation and Dependency Setup
Basic Installation
pip install SpeechRecognition
Additional Dependencies
To work with different audio sources, install the following optional packages:
Microphone support:
pip install pyaudio
Extended audio format support:
pip install pydub
Google Cloud Speech API:
pip install google-cloud-speech
Platform‑Specific Installation Tips
Windows
On Windows, installing pyaudio may fail if pip tries to build it from source. Recent pip versions usually find a pre-compiled wheel automatically; if that fails, pipwin can install one:
pip install pipwin
pipwin install pyaudio
macOS
brew install portaudio
pip install pyaudio
Linux (Ubuntu/Debian)
sudo apt-get install python3-pyaudio
Importing Core Components
import speech_recognition as sr
Primary Classes
- Recognizer() — main class for speech‑recognition operations
- AudioFile() — handles audio file input
- Microphone() — handles microphone input
Working with Audio Files
Supported Formats
SpeechRecognition can process the following audio formats:
- WAV — recommended for highest quality
- AIFF — alternative to WAV
- FLAC — lossless compression
Basic File Recognition
import speech_recognition as sr

r = sr.Recognizer()

with sr.AudioFile("speech.wav") as source:
    audio = r.record(source)

try:
    text = r.recognize_google(audio, language="ru-RU")
    print("Recognized text:", text)
except sr.UnknownValueError:
    print("Could not understand audio")
except sr.RequestError as e:
    print(f"Service error: {e}")
Handling Long Audio Files
For lengthy recordings, split the audio into manageable segments:
r = sr.Recognizer()

# record() continues from the stream's current position, so two
# sequential calls capture consecutive segments
with sr.AudioFile("long_speech.wav") as source:
    # First 30 seconds
    audio1 = r.record(source, duration=30)
    # Next 30 seconds
    audio2 = r.record(source, duration=30)
Converting MP3 to a Supported Format
from pydub import AudioSegment
# Convert MP3 to WAV
audio = AudioSegment.from_mp3("audio.mp3")
audio.export("audio.wav", format="wav")
Working with a Microphone
Basic Microphone Recognition
import speech_recognition as sr

r = sr.Recognizer()
mic = sr.Microphone()

with mic as source:
    r.adjust_for_ambient_noise(source)
    print("Say something:")
    audio = r.listen(source)

try:
    text = r.recognize_google(audio, language="ru-RU")
    print("You said:", text)
except sr.UnknownValueError:
    print("Speech not recognized")
except sr.RequestError as e:
    print(f"Service error: {e}")
Selecting a Specific Microphone
# List all available microphones
print(sr.Microphone.list_microphone_names())
# Use a specific microphone
mic = sr.Microphone(device_index=1)
Continuous Background Recognition
import time
import speech_recognition as sr

r = sr.Recognizer()
m = sr.Microphone()

def callback(recognizer, audio):
    try:
        text = recognizer.recognize_google(audio, language="ru-RU")
        print(f"Recognized: {text}")
    except sr.UnknownValueError:
        pass

# Start background listening (runs in a separate thread)
stop_listening = r.listen_in_background(m, callback)

# Stop after 30 seconds
time.sleep(30)
stop_listening(wait_for_stop=False)
Supported Recognizers Overview
Cloud Services
| Recognizer | Internet Required | Free Tier | Accuracy | Notes |
|---|---|---|---|---|
| Google Web Speech | Yes | Yes (limited) | High | Great for beginners |
| Google Cloud Speech | Yes | No (trial available) | Very high | Professional solution |
| IBM Watson | Yes | No (trial available) | High | Broad language support |
| Microsoft Azure | Yes | No (trial available) | High | Seamless Microsoft ecosystem integration |
| Amazon Transcribe | Yes | No | High | Part of AWS suite |
Local Engines
| Recognizer | Internet Required | Free | Accuracy | Notes |
|---|---|---|---|---|
| CMU Sphinx | No | Yes | Medium | Fully offline |
| Vosk | No | Yes | High | Modern offline alternative |
Specialized Services
| Recognizer | Internet Required | Free | Accuracy | Notes |
|---|---|---|---|---|
| Wit.ai | Yes | Yes | Medium | Chat‑bot development |
| Houndify | Yes | Yes (limited) | High | Fast command processing |
| Assembly.AI | Yes | No | High | Focused on transcription |
Comprehensive Methods and Functions Reference
Core Recognizer Methods
| Method | Description | Parameters | Returns |
|---|---|---|---|
| record(source, duration=None, offset=None) | Capture audio from a source | source (AudioSource), duration (float), offset (float) | AudioData |
| listen(source, timeout=None, phrase_time_limit=None) | Listen until speech is detected | source (AudioSource), timeout (float), phrase_time_limit (float) | AudioData |
| listen_in_background(source, callback) | Continuous background listening | source (AudioSource), callback (function) | stop_listening function |
| adjust_for_ambient_noise(source, duration=1) | Calibrate to ambient noise level | source (AudioSource), duration (float) | None |
Speech‑Recognition Methods
| Method | Description | Main Parameters | API Key Required? |
|---|---|---|---|
| recognize_google(audio, key=None, language='en-US') | Google Web Speech API | audio, language, show_all | No |
| recognize_google_cloud(audio, credentials_json=None) | Google Cloud Speech API | audio, language, preferred_phrases | Yes |
| recognize_sphinx(audio, language='en-US') | CMU Sphinx (offline) | audio, language, keyword_entries | No |
| recognize_ibm(audio, username, password) | IBM Watson Speech-to-Text | audio, language, customization_id | Yes |
| recognize_bing(audio, key, language='en-US') | Microsoft Bing Voice Recognition | audio, language, show_all | Yes |
| recognize_azure(audio, key, region, language='en-US') | Microsoft Azure Speech | audio, language, endpoint | Yes |
| recognize_wit(audio, key, show_all=False) | Wit.ai Speech Recognition | audio, show_all | Yes |
| recognize_houndify(audio, client_id, client_key) | Houndify Speech Recognition | audio, show_all | Yes |
Recognizer Properties
| Property | Description | Type | Default |
|---|---|---|---|
| energy_threshold | Energy level threshold for speech detection | int | 300 |
| dynamic_energy_threshold | Automatic adjustment of the energy threshold | bool | True |
| dynamic_energy_adjustment_damping | Smoothing factor for dynamic adjustment | float | 0.15 |
| dynamic_energy_ratio | Ratio of speech energy to background noise | float | 1.5 |
| pause_threshold | Silence duration that marks the end of a phrase (seconds) | float | 0.8 |
| operation_timeout | Maximum time allowed for a recognition operation (seconds) | float | None |
| phrase_threshold | Minimum length of a spoken phrase (seconds) | float | 0.3 |
| non_speaking_duration | Allowed silence duration between phrases (seconds) | float | 0.5 |
AudioFile Class Methods
| Method | Description | Parameters |
|---|---|---|
| __init__(filename_or_fileobject) | Initialize with a file path or file-like object | filename_or_fileobject (str or file-like) |
| __enter__() | Enter context manager | - |
| __exit__(exc_type, exc_value, traceback) | Exit context manager | - |
Microphone Class Methods
| Method | Description | Parameters |
|---|---|---|
| __init__(device_index=None, sample_rate=None, chunk_size=1024) | Initialize microphone settings | device_index (int), sample_rate (int), chunk_size (int) |
| list_microphone_names() | Static method that lists the names of available microphones | - |
| list_working_microphones() | Static method that lists microphones currently producing audio | - |
Error Handling and Exceptions
Exception Types
import speech_recognition as sr

try:
    # Recognition code
    text = r.recognize_google(audio)
except sr.UnknownValueError:
    print("Speech could not be understood")
except sr.RequestError as e:
    print(f"Recognition service error: {e}")
except sr.WaitTimeoutError:
    print("Listening timed out")
except OSError as e:
    print(f"Operating system error: {e}")
Exception Details
| Exception | Description | Common Causes |
|---|---|---|
| UnknownValueError | Speech was unintelligible | Poor audio quality, background noise, unclear speech |
| RequestError | API request failed | Network issues, invalid API key, quota exceeded |
| WaitTimeoutError | Listening exceeded the timeout | Extended silence during microphone capture |
| OSError | System-level error | Microphone hardware problems, missing files |
Parameter Tuning for Better Accuracy
Microphone Settings Optimization
r = sr.Recognizer()
# Adjust sensitivity
r.energy_threshold = 4000 # Increase for noisy environments
r.dynamic_energy_threshold = True
r.pause_threshold = 1.0 # Longer pause before ending a phrase
# Configure timeouts
r.operation_timeout = 5 # Max operation time
r.phrase_threshold = 0.3 # Minimum phrase length
Audio Quality Recommendations
To achieve optimal recognition results:
- Sample rate: 16 kHz or higher
- Bit depth: 16‑bit
- Channels: mono (1 channel)
- Format: uncompressed WAV
- Noise level: as low as possible
- Microphone distance: 15‑30 cm from the mouth
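The format-related recommendations above can be checked programmatically with the standard-library wave module. The check_wav helper below is a hypothetical utility, not part of SpeechRecognition:

```python
import wave

def check_wav(path):
    """Return a list of warnings for WAV parameters that fall short of
    the recommendations: 16 kHz or higher, 16-bit samples, mono."""
    warnings = []
    with wave.open(path, "rb") as wf:
        if wf.getframerate() < 16000:
            warnings.append(f"sample rate {wf.getframerate()} Hz is below 16 kHz")
        if wf.getsampwidth() != 2:  # sample width is in bytes
            warnings.append(f"bit depth {wf.getsampwidth() * 8}-bit, expected 16-bit")
        if wf.getnchannels() != 1:
            warnings.append(f"{wf.getnchannels()} channels, expected mono")
    return warnings
```

An empty list means the file already matches the recommended parameters; otherwise the warnings tell you what to convert (for example with pydub, as shown earlier).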
Pre‑processing Audio
# Calibrate to ambient noise
with sr.Microphone() as source:
    r.adjust_for_ambient_noise(source, duration=2)
    print("Calibration complete")
    audio = r.listen(source, timeout=1, phrase_time_limit=5)
Multilingual Support
Supported Languages
SpeechRecognition works with dozens of languages. Common language codes include:
- ru-RU — Russian
- en-US — English (US)
- en-GB — English (UK)
- de-DE — German
- fr-FR — French
- es-ES — Spanish
- zh-CN — Chinese (Simplified)
- ja-JP — Japanese
Multilingual Recognition Example
# Try recognition in several languages
languages = ['ru-RU', 'en-US', 'de-DE']

for lang in languages:
    try:
        text = r.recognize_google(audio, language=lang)
        print(f"Recognized in {lang}: {text}")
        break
    except sr.UnknownValueError:
        continue
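The loop above can be packaged into a reusable helper that works with any recognize function. The name recognize_with_fallback and its signature are illustrative, not a library API:

```python
def recognize_with_fallback(recognize_fn, audio, languages,
                            not_understood=(Exception,)):
    """Try recognize_fn(audio, language=...) for each language in order.

    Returns (language, text) for the first successful attempt, or None
    if every language fails with one of the `not_understood` exceptions
    (with SpeechRecognition that would be sr.UnknownValueError).
    """
    for lang in languages:
        try:
            return lang, recognize_fn(audio, language=lang)
        except not_understood:
            continue  # try the next language
    return None
```

With the real library this would be called as `recognize_with_fallback(r.recognize_google, audio, ['ru-RU', 'en-US'], not_understood=(sr.UnknownValueError,))`, so that service errors (RequestError) still propagate instead of being silently swallowed.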
Practical Use Cases
Building a Voice Assistant
import datetime

import speech_recognition as sr
import pyttsx3

# Initialize
r = sr.Recognizer()
mic = sr.Microphone()
tts = pyttsx3.init()

def speak(text):
    tts.say(text)
    tts.runAndWait()

def listen():
    with mic as source:
        r.adjust_for_ambient_noise(source)
        audio = r.listen(source)
    try:
        command = r.recognize_google(audio, language="ru-RU")
        return command.lower()
    except sr.UnknownValueError:
        return None

# Main loop; commands are matched against Russian keywords:
# "время" = "time", "выход" = "exit"
while True:
    command = listen()
    if command:
        if "время" in command:
            now = datetime.datetime.now()
            speak(f"The time is {now.hour} hours {now.minute} minutes")
        elif "выход" in command:
            speak("Goodbye!")
            break
Audio File Transcription
import os
import speech_recognition as sr

def transcribe_audio_file(file_path):
    r = sr.Recognizer()
    with sr.AudioFile(file_path) as source:
        audio = r.record(source)
    try:
        text = r.recognize_google(audio, language="ru-RU")
        return text
    except sr.UnknownValueError:
        return "Unable to recognize speech"
    except sr.RequestError as e:
        return f"Service error: {e}"

# Batch transcription
audio_dir = "audio_files"

for filename in os.listdir(audio_dir):
    if filename.endswith(".wav"):
        file_path = os.path.join(audio_dir, filename)
        text = transcribe_audio_file(file_path)

        # Save result
        output_file = filename.replace(".wav", ".txt")
        with open(output_file, "w", encoding="utf-8") as f:
            f.write(text)
Voice Command System
import subprocess
import speech_recognition as sr

class VoiceCommandSystem:
    def __init__(self):
        self.r = sr.Recognizer()
        self.mic = sr.Microphone()
        # Russian voice commands mapped to actions:
        # "открыть браузер" = "open browser",
        # "открыть калькулятор" = "open calculator",
        # "выключить компьютер" = "shut down the computer"
        self.commands = {
            "открыть браузер": self.open_browser,
            "открыть калькулятор": self.open_calculator,
            "выключить компьютер": self.shutdown_computer,
        }

    def listen_for_command(self):
        with self.mic as source:
            self.r.adjust_for_ambient_noise(source)
            print("Listening for command...")
            try:
                audio = self.r.listen(source, timeout=5)
            except sr.WaitTimeoutError:
                print("No speech detected")
                return None
        try:
            command = self.r.recognize_google(audio, language="ru-RU").lower()
            print(f"Recognized command: {command}")
            return command
        except sr.UnknownValueError:
            print("Command not recognized")
            return None

    def execute_command(self, command):
        for key, func in self.commands.items():
            if key in command:
                func()
                return True
        return False

    # The actions below are Windows-specific
    def open_browser(self):
        subprocess.run("start chrome", shell=True)

    def open_calculator(self):
        subprocess.run(["calc"])

    def shutdown_computer(self):
        subprocess.run(["shutdown", "/s", "/t", "10"])

# Usage
system = VoiceCommandSystem()
command = system.listen_for_command()
if command and system.execute_command(command):
    print("Command executed")
else:
    print("Command not found")
Performance Optimization
Asynchronous Processing
import asyncio
import functools
import speech_recognition as sr
from concurrent.futures import ThreadPoolExecutor

async def async_recognize(audio_data, language="ru-RU"):
    loop = asyncio.get_running_loop()
    r = sr.Recognizer()
    with ThreadPoolExecutor() as executor:
        # run_in_executor passes only positional arguments, so bind the
        # language keyword with functools.partial
        text = await loop.run_in_executor(
            executor,
            functools.partial(r.recognize_google, audio_data, language=language),
        )
    return text

# Example usage
async def main():
    r = sr.Recognizer()
    with sr.AudioFile("speech.wav") as source:
        audio = r.record(source)
    text = await async_recognize(audio)
    print(text)

asyncio.run(main())
Caching Recognition Results
import hashlib
import speech_recognition as sr

class CachedRecognizer:
    def __init__(self):
        self.r = sr.Recognizer()
        self.cache = {}

    def _get_audio_hash(self, audio_data):
        # Hash the raw audio bytes to get a stable cache key
        return hashlib.md5(audio_data.get_raw_data()).hexdigest()

    def recognize_cached(self, audio_data, language="ru-RU"):
        audio_hash = self._get_audio_hash(audio_data)
        if audio_hash in self.cache:
            return self.cache[audio_hash]
        try:
            text = self.r.recognize_google(audio_data, language=language)
            self.cache[audio_hash] = text
            return text
        except sr.UnknownValueError:
            return None
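The in-memory cache disappears when the process exits. A small pickle-based save/load pair makes it persistent across runs; save_cache and load_cache are hypothetical helpers, not part of the library:

```python
import os
import pickle

def save_cache(cache, path="recognition_cache.pkl"):
    """Serialize the {audio_hash: text} dict to disk."""
    with open(path, "wb") as f:
        pickle.dump(cache, f)

def load_cache(path="recognition_cache.pkl"):
    """Load a previously saved cache, or return an empty dict."""
    if not os.path.exists(path):
        return {}
    with open(path, "rb") as f:
        return pickle.load(f)
```

In CachedRecognizer, one could call load_cache() in __init__ and save_cache() after each new recognition; note that pickle files should only be loaded from trusted sources.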
Frequently Asked Questions
Why Isn't Microphone Recognition Working?
Common reasons and fixes:
- Verify the pyaudio installation: pip install pyaudio
- Make sure no other application is using the microphone
- Check system permissions for microphone access
- Try specifying a concrete device_index for the microphone
How Can I Improve Recognition Accuracy?
Tips for higher accuracy:
- Use a high‑quality microphone
- Speak clearly and at a moderate pace
- Minimize background noise
- Call adjust_for_ambient_noise() before recording
- Select a recognizer that best supports your target language
Can I Use SpeechRecognition Offline?
Yes, for offline operation consider:
- CMU Sphinx: recognize_sphinx()
- Vosk (requires additional installation)
- OpenAI Whisper (separate library)
How Should I Process Long Audio Recordings?
Best practices for long files:
- Split audio into 30‑60 second chunks
- Use the offset parameter to process sequential segments
- Consider specialized services designed for lengthy recordings
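The offset/duration pairs for sequential record() calls can be computed up front. chunk_plan below is an illustrative helper, not a library function:

```python
def chunk_plan(total_seconds, chunk_seconds=30):
    """Return (offset, duration) pairs that cover total_seconds in
    consecutive chunks of at most chunk_seconds each."""
    plan = []
    offset = 0
    while offset < total_seconds:
        duration = min(chunk_seconds, total_seconds - offset)
        plan.append((offset, duration))
        offset += duration
    return plan

# Each pair can then be used with a freshly opened AudioFile:
# with sr.AudioFile("long.wav") as source:
#     audio = r.record(source, offset=off, duration=dur)
```

Opening the file anew for each pair avoids the pitfall that record() otherwise continues from the stream's current position.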
Which Engine Is Best for a Commercial Project?
Recommendations based on use case:
- Google Cloud Speech: top‑tier accuracy and scalability
- Microsoft Azure: strong integration with Microsoft services
- IBM Watson: extensive customization options
- Amazon Transcribe: ideal for AWS‑centric solutions
How Do I Handle Noisy Audio Files?
Techniques to improve noisy recordings:
- Apply audio pre‑processing (noise‑reduction filters)
- Tune the energy_threshold parameter
- Use dedicated noise-suppression libraries (e.g., noisereduce)
- Experiment with different recognizers to find the most robust one
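To build intuition for what energy_threshold controls, here is a simplified pure-Python gate over a list of sample amplitudes. This is a didactic sketch of the idea, not the library's actual implementation:

```python
def detect_speech_frames(samples, frame_size=4, threshold=300):
    """Split samples into frames and mark a frame as speech (True)
    when its peak absolute amplitude exceeds the threshold, a crude
    stand-in for the energy check performed on incoming audio."""
    flags = []
    for i in range(0, len(samples), frame_size):
        frame = samples[i:i + frame_size]
        flags.append(max(abs(s) for s in frame) > threshold)
    return flags
```

Raising the threshold makes quiet background noise fall below the speech cutoff, which is exactly why a higher energy_threshold helps in noisy environments.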
Alternatives and Complementary Tools
Modern Alternatives
- OpenAI Whisper: state‑of‑the‑art neural speech recognizer
- Wav2Vec 2.0: Facebook’s self‑supervised model
- DeepSpeech: Mozilla’s open‑source solution
- Vosk: lightweight offline library
Integration with Other Libraries
# Integration with pydub for advanced audio handling
import os
import speech_recognition as sr
from pydub import AudioSegment
from pydub.silence import split_on_silence

def transcribe_long_audio(file_path):
    # Load the file and split it wherever there is at least 1 s of silence
    audio = AudioSegment.from_wav(file_path)
    chunks = split_on_silence(audio, min_silence_len=1000, silence_thresh=-40)

    # Recognize each chunk, skipping those that cannot be understood
    r = sr.Recognizer()
    full_text = ""
    for i, chunk in enumerate(chunks):
        chunk_path = f"temp_chunk_{i}.wav"
        chunk.export(chunk_path, format="wav")
        with sr.AudioFile(chunk_path) as source:
            audio_data = r.record(source)
        try:
            full_text += r.recognize_google(audio_data, language="ru-RU") + " "
        except sr.UnknownValueError:
            pass
        finally:
            os.remove(chunk_path)  # clean up the temporary file
    return full_text.strip()
Conclusion
SpeechRecognition remains one of the most popular and user‑friendly libraries for speech‑recognition tasks in Python. Its main advantages are simplicity, broad engine support, and an active developer community.
The library suits both rapid prototyping and production‑grade commercial applications. The variety of supported APIs lets you choose the optimal solution—from free services for personal projects to high‑accuracy paid options for enterprise deployments.
Getting started requires only a few lines of code, yet the library also offers deep customization and performance‑tuning capabilities. Combined with modern audio‑processing techniques and machine‑learning advances, SpeechRecognition can serve as the foundation for sophisticated voice‑interaction systems.