How Speech Recognition Works

Updated May 2026
Speech recognition, also called automatic speech recognition (ASR), converts spoken language into written text by analyzing the acoustic patterns in audio signals. Modern systems use end-to-end neural networks that take audio spectrograms as input and produce text directly, and the best of them reach word error rates below 5% on clear English speech. Models like OpenAI's Whisper handle nearly 100 languages and power voice assistants, live captioning, call center transcription, medical dictation, and other applications that need to convert speech to text.

From Sound Waves to Spectrograms

Speech is a pressure wave traveling through air, produced by the vibration of vocal cords and shaped by the position of the tongue, lips, jaw, and nasal passages. A microphone converts this pressure wave into an electrical signal, which is digitized by sampling the amplitude at regular intervals, typically 16,000 times per second (16 kHz) for speech applications. This raw waveform is a one-dimensional sequence of amplitude values: a 10-second audio clip at 16 kHz produces 160,000 numbers.

Raw waveforms are difficult for models to work with directly because the relationship between amplitude values and linguistic content is extremely indirect. The same phoneme sounds different when spoken by different people, at different pitches, at different speeds, and in different acoustic environments. The standard preprocessing step converts the waveform into a spectrogram using the Short-Time Fourier Transform (STFT). The audio is divided into short overlapping frames (typically 25 milliseconds, with 10-millisecond steps), and a Fourier transform is applied to each frame to decompose it into its frequency components. The result is a 2D representation: time on the horizontal axis, frequency on the vertical axis, and intensity (color or brightness) showing the energy at each time-frequency point.
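The framing and transform steps can be sketched in a few lines of NumPy. This is a minimal illustration rather than a production feature extractor: the function name stft_magnitude, the Hann window, and the 440 Hz test tone are assumptions made for the example. Only the 16 kHz rate, 25 ms frames, and 10 ms steps come from the text above.

```python
import numpy as np

def stft_magnitude(waveform, sample_rate=16_000, frame_ms=25, step_ms=10):
    """Magnitude spectrogram with shape (num_frames, num_frequency_bins)."""
    frame_len = sample_rate * frame_ms // 1000   # 400 samples at 16 kHz
    step = sample_rate * step_ms // 1000         # 160 samples at 16 kHz
    if len(waveform) < frame_len:
        raise ValueError("waveform is shorter than one frame")
    num_frames = 1 + (len(waveform) - frame_len) // step
    window = np.hanning(frame_len)               # taper frame edges to reduce leakage
    frames = np.stack([
        waveform[i * step : i * step + frame_len] * window
        for i in range(num_frames)
    ])
    return np.abs(np.fft.rfft(frames, axis=1))   # one-sided FFT of every frame

# Hypothetical input: one second of a 440 Hz tone sampled at 16 kHz.
t = np.arange(16_000) / 16_000
tone = np.sin(2 * np.pi * 440 * t)
spec = stft_magnitude(tone)

# Frequency bins are spaced sample_rate / frame_len = 40 Hz apart.
peak_bin = int(spec.mean(axis=0).argmax())
peak_hz = peak_bin * 16_000 / 400
```

Each row of spec is one 25 ms frame and each column is one frequency bin, so the tone's energy concentrates in the bin nearest 440 Hz. Taking the logarithm of the magnitudes gives the kind of image usually plotted as a spectrogram.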

Mel spectrograms refine this further by mapping the frequency axis to the mel scale, which approximates how the human ear perceives pitch. Humans are much more sensitive to differences between low frequencies (the difference between 100 Hz and 200 Hz is easily heard) than high frequencies (the difference between 8,000 Hz and 8,100 Hz is barely perceptible). The mel scale compresses high frequencies and expands low frequencies, producing a representation that aligns with human perception. Mel-frequency cepstral coefficients (MFCCs), derived from mel spectrograms, were the standard acoustic features for speech recognition from the 1980s through the 2010s. Modern end-to-end systems typically work directly with log-mel spectrograms, letting the neural network learn whatever additional feature transformations are useful.
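The compression can be checked numerically. The sketch below uses one common mel formula, mel = 2595 * log10(1 + f / 700); other variants exist, and the text does not say which one any particular system uses, so treat the formula and the function names as assumptions.

```python
import numpy as np

def hz_to_mel(hz):
    """One common mel-scale formula; an assumption, other variants exist."""
    return 2595.0 * np.log10(1.0 + np.asarray(hz, dtype=float) / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (np.asarray(mel, dtype=float) / 2595.0) - 1.0)

# The same 100 Hz step covers far more of the mel axis at low frequencies.
low_gap = float(hz_to_mel(200) - hz_to_mel(100))
high_gap = float(hz_to_mel(8100) - hz_to_mel(8000))

# Centre frequencies of 10 filters spaced evenly in mel from 0 to 8000 Hz:
# densely packed at the low end, increasingly spread out at the high end.
centres_hz = mel_to_hz(np.linspace(hz_to_mel(0), hz_to_mel(8000), 10))
```

Under this formula the 100 Hz to 200 Hz step spans roughly ten times as many mels as the 8,000 Hz to 8,100 Hz step, which is why a mel filterbank spends most of its resolution on the low end of the spectrum.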

Traditional Speech Recognition Pipelines

Before end-to-end neural models, speech recognition used a pipeline of specialized components. The acoustic model mapped short audio frames to phonemes (the basic sound units of language). The pronunciation dictionary mapped sequences of phonemes to words: the word "cat" maps to the phoneme sequence /k ae t/. The language model assigned probabilities to word sequences, ensuring that acoustically ambiguous outputs resolved to grammatically and semantically plausible text. "Recognize speech" and "wreck a nice beach" are acoustically similar but the language model strongly prefers the first interpretation.
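A toy version of that resolution step, as a sketch: each candidate transcription gets an acoustic score and a language model score, and the decoder picks the best combined score. The log-probabilities below are invented purely for illustration, and best_hypothesis and lm_weight are hypothetical names.

```python
# Hypothetical log-probabilities, invented for this example only.
acoustic_logp = {"recognize speech": -12.1, "wreck a nice beach": -11.8}
lm_logp = {"recognize speech": -7.5, "wreck a nice beach": -19.0}

def best_hypothesis(acoustic, lm, lm_weight=1.0):
    """Return the hypothesis with the highest combined log-score."""
    return max(acoustic, key=lambda h: acoustic[h] + lm_weight * lm[h])

with_lm = best_hypothesis(acoustic_logp, lm_logp)
without_lm = best_hypothesis(acoustic_logp, lm_logp, lm_weight=0.0)
```

With the language model switched off (lm_weight=0.0) the slightly better acoustic match wins; with it switched on, the far more plausible word sequence wins.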

Hidden Markov Models (HMMs) were the dominant acoustic modeling approach from the 1980s through roughly 2012. An HMM models each phoneme as a sequence of states, with each state generating acoustic observations (spectral features) according to a probability distribution, typically a Gaussian mixture model (GMM). The Viterbi algorithm finds the most likely sequence of phoneme states given the observed audio, producing a phoneme-level transcription. The pronunciation dictionary and language model then convert this to words.
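A compact sketch of Viterbi decoding over a three-state left-to-right model for /k ae t/. The transition probabilities and per-frame likelihoods are made-up numbers, and the likelihoods are given as a table rather than computed from a GMM, to keep the example short.

```python
import numpy as np

def viterbi(log_init, log_trans, log_emit):
    """Most likely state path.
    log_init: (S,) start scores, log_trans: (S, S) from-state by to-state,
    log_emit: (T, S) per-frame log-likelihood of each state."""
    num_frames, num_states = log_emit.shape
    score = log_init + log_emit[0]
    back = np.zeros((num_frames, num_states), dtype=int)
    for t in range(1, num_frames):
        # cand[i, j]: best score of a path that is in state i and moves to j
        cand = score[:, None] + log_trans
        back[t] = cand.argmax(axis=0)
        score = cand.max(axis=0) + log_emit[t]
    path = [int(score.argmax())]
    for t in range(num_frames - 1, 0, -1):   # follow back-pointers to the start
        path.append(int(back[t, path[-1]]))
    return path[::-1]

phonemes = ["k", "ae", "t"]
with np.errstate(divide="ignore"):           # log(0) = -inf marks forbidden moves
    log_init = np.log(np.array([1.0, 0.0, 0.0]))
    log_trans = np.log(np.array([[0.6, 0.4, 0.0],
                                 [0.0, 0.6, 0.4],
                                 [0.0, 0.0, 1.0]]))
# Hypothetical likelihoods for six audio frames (rows) and three states (columns).
log_emit = np.log(np.array([[0.8, 0.1, 0.1],
                            [0.7, 0.2, 0.1],
                            [0.2, 0.7, 0.1],
                            [0.1, 0.8, 0.1],
                            [0.1, 0.3, 0.6],
                            [0.1, 0.1, 0.8]]))
path = viterbi(log_init, log_trans, log_emit)
decoded = [phonemes[s] for s in path]
```

The result assigns two frames to each phoneme. Working in log space turns the products of many small probabilities into sums and avoids numerical underflow on long utterances.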

This pipeline approach worked well enough for commercial deployment (Siri, Google Voice Search, and Dragon NaturallySpeaking all used HMM-based systems), but each component had to be trained separately, errors cascaded from one component to the next, and the acoustic model could not benefit from linguistic context because it operated on isolated audio frames. Word error rates for these systems typically ranged from 8% to 15% for clean speech and degraded significantly in noisy environments.
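Word error rate is conventionally computed as the word-level edit distance between the system output and a reference transcript (substitutions plus deletions plus insertions), divided by the number of reference words. A minimal sketch follows; the function name and the example sentences are illustrative, and real evaluations usually normalize case and punctuation first.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j]: edit distance between the reference prefix seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            curr[j] = min(
                prev[j] + 1,                           # deletion
                curr[j - 1] + 1,                       # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            )
        prev = curr
    return prev[-1] / len(ref)

wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")
```

Here one of six reference words is missing, so the rate is about 17%. Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.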

End-to-End Neural Speech Recognition

End-to-end models replaced the entire traditional pipeline with a single neural network that takes audio as input and produces text as output. The network learns to perform acoustic modeling, pronunciation mapping, and language modeling simultaneously, allowing each component to benefit from the others during training. Three main architectures have been used for end-to-end ASR: CTC (Connectionist Temporal Classification), attention-based encoder-decoder models, and transducer models.

CTC Models

CTC, introduced by Alex Graves in 2006, allows a neural network to map a sequence of audio frames to a sequence of characters without requiring explicit alignment between frames and characters. The model outputs a probability distribution over characters (plus a special "blank" symbol) for each audio frame, and the CTC algorithm marginalizes over all possible alignments to compute the probability of the target transcription. DeepSpeech, developed by Baidu Research in 2014, used a deep recurrent neural network with CTC to achieve competitive speech recognition without any of the traditional pipeline components.
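Two pieces of that description can be made concrete: the rule that turns a frame-level alignment into text (merge repeated symbols, then drop blanks), and the sum over alignments. The sketch below enumerates every alignment by brute force, which is only feasible for tiny inputs; practical implementations use a dynamic-programming forward pass instead. The names and the "_" blank symbol are choices made for this example.

```python
import itertools
import numpy as np

BLANK = "_"

def ctc_collapse(alignment):
    """Merge consecutive repeats, then drop blanks: '_cc_aat' becomes 'cat'."""
    out, prev = [], None
    for symbol in alignment:
        if symbol != prev and symbol != BLANK:
            out.append(symbol)
        prev = symbol
    return "".join(out)

def ctc_probability(frame_probs, symbols, target):
    """Sum the probabilities of all alignments that collapse to target.
    frame_probs: (T, len(symbols)) array, one distribution per audio frame."""
    num_frames = frame_probs.shape[0]
    total = 0.0
    for path in itertools.product(range(len(symbols)), repeat=num_frames):
        if ctc_collapse(symbols[i] for i in path) == target:
            total += float(np.prod(frame_probs[np.arange(num_frames), list(path)]))
    return total

# Two frames, uniform over {blank, 'a'}: "_a", "a_" and "aa" all collapse to "a".
uniform = np.full((2, 2), 0.5)
p_a = ctc_probability(uniform, [BLANK, "a"], "a")
```

The blank symbol is what lets the model emit a genuine double letter: "a_a" collapses to "aa", while "aa" collapses to "a".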

Attention-Based Encoder-Decoder

Attention-based models apply the encoder-decoder architecture from machine translation to speech. The encoder (a stack of convolutional and transformer layers) processes the spectrogram and produces a sequence of contextualized audio representations. The decoder generates the transcription token by token, using attention to focus on the relevant part of the audio at each step. When generating the word "morning," the decoder attends to the audio frames around 0.5 to 0.8 seconds where that word was spoken. This architecture naturally handles the variable-length alignment between audio and text and can leverage both acoustic and linguistic context.
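The attention step can be sketched as scaled dot-product attention for a single decoder query, the form used in transformer models. The tiny one-hot "encoder outputs" below are contrived so the result is easy to read; real encoder states are learned, dense vectors, and all names here are illustrative.

```python
import numpy as np

def attention(query, keys, values):
    """Scaled dot-product attention for one decoder step.
    query: (d,), keys: (T, d), values: (T, d_v). Returns (context, weights)."""
    scores = keys @ query / np.sqrt(query.shape[0])
    weights = np.exp(scores - scores.max())   # numerically stable softmax
    weights /= weights.sum()
    return weights @ values, weights

# Hypothetical encoder output: 8 audio frames, each with a distinct key.
keys = np.eye(8)
values = np.arange(16, dtype=float).reshape(8, 2)
query = 5.0 * keys[5]                         # a query that "looks for" frame 5

context, weights = attention(query, keys, values)
focus_frame = int(weights.argmax())
```

The weights form a probability distribution over audio frames, and the context vector handed to the decoder is the weighted average of the frame representations, dominated here by frame 5.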

Whisper: The Current Standard

OpenAI's Whisper, released in 2022, demonstrated that scaling a simple encoder-decoder transformer and training it on a massive, diverse dataset produces remarkably robust speech recognition. Whisper was trained on 680,000 hours of audio collected from the internet, covering English and 96 other languages, noisy conditions, multiple accents, and diverse recording qualities. The model processes 30-second audio segments: the log-mel spectrogram is encoded by a transformer encoder, and a transformer decoder generates the text transcription.
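The fixed 30-second window implies that shorter clips are padded and longer ones are cut into segments. A rough sketch of the bookkeeping, assuming the 16 kHz sampling rate and 10 ms frame step described earlier; pad_or_trim is a hypothetical helper written for this example, not a reference to any particular library.

```python
import numpy as np

SAMPLE_RATE = 16_000     # samples per second (assumed, see lead-in)
SEGMENT_SECONDS = 30     # segment length stated in the text
HOP_SAMPLES = 160        # 10 ms step at 16 kHz (assumed, see lead-in)

def pad_or_trim(waveform, length=SAMPLE_RATE * SEGMENT_SECONDS):
    """Zero-pad or cut a 1-D waveform to exactly `length` samples."""
    if len(waveform) >= length:
        return waveform[:length]
    return np.pad(waveform, (0, length - len(waveform)))

clip = np.zeros(10 * SAMPLE_RATE, dtype=np.float32)   # a 10-second clip of silence
segment = pad_or_trim(clip)
frames_per_segment = len(segment) // HOP_SAMPLES
```

Under these assumptions every segment is 480,000 samples long and yields 3,000 spectrogram frames, so the encoder always sees an input of the same shape regardless of how long the original clip was.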

Whisper achieves word error rates below 5% for clean English speech, comparable to professional human transcription. More impressively, it handles noisy audio, accented speech, and code-switching (speakers mixing languages within a sentence) far better than previous systems because the training data included these variations at scale. Whisper also performs translation: it can translate speech in any of its supported languages directly into English text, combining speech recognition and translation in a single model. The model comes in multiple sizes, from 39 million parameters (tiny) to 1.55 billion parameters (large-v3), allowing deployment on devices ranging from smartphones to server clusters.

Challenges in Speech Recognition

Background noise remains a significant challenge despite neural models' improved robustness. Conversations in restaurants, airport announcements, phone calls with traffic noise, and recordings with music or other speakers all degrade recognition accuracy. Noise-robust training, where the model is trained on audio with artificially added noise, improves performance, and Whisper's massive training set includes many naturally noisy examples. But extreme noise levels, where the signal-to-noise ratio drops below about 5 dB (the speech is only slightly louder than the noise; below 0 dB the noise is louder than the speech), still cause substantial error increases.
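Signal-to-noise ratio in decibels is 10 * log10(speech power / noise power), so 0 dB means the two have equal power. Noise-robust training needs a way to mix noise into clean speech at a chosen SNR, which the sketch below illustrates. The function names, the sine wave standing in for speech, and the Gaussian noise are all assumptions for the example.

```python
import numpy as np

def snr_db(speech, noise):
    """Signal-to-noise ratio in decibels from mean signal powers."""
    return 10.0 * np.log10(np.mean(speech ** 2) / np.mean(noise ** 2))

def mix_at_snr(speech, noise, target_db):
    """Scale noise so the mixture has the requested SNR.
    Returns (mixture, scaled_noise)."""
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    scale = np.sqrt(speech_power / (noise_power * 10.0 ** (target_db / 10.0)))
    scaled_noise = scale * noise
    return speech + scaled_noise, scaled_noise

rng = np.random.default_rng(0)
speech = np.sin(2 * np.pi * 220 * np.arange(16_000) / 16_000)  # stand-in for speech
noise = rng.normal(size=16_000)

noisy_5db, noise_5db = mix_at_snr(speech, noise, 5.0)
noisy_0db, noise_0db = mix_at_snr(speech, noise, 0.0)
```

At 5 dB the speech carries about three times the power of the noise. Training pipelines typically draw the target SNR at random for each example so the model sees a range of conditions.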

Speaker diarization, determining who spoke when in a multi-speaker recording, is a separate challenge from transcription. A meeting recording with five participants requires not just converting speech to text but labeling each segment with the correct speaker. Speaker diarization models use speaker embeddings (vector representations that capture voice characteristics) to cluster audio segments by speaker. Accuracy drops significantly when speakers talk over each other (overlapping speech), trade turns rapidly, or have similar voice characteristics. Combining diarization with transcription to produce attributed transcripts ("Speaker A: Good morning. Speaker B: Let's get started.") is a standard requirement for meeting transcription, call center analytics, and podcast processing.
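The clustering idea can be sketched with a simple greedy scheme: each segment joins the existing speaker whose running embedding is most similar by cosine similarity, or starts a new speaker if none is close enough. This is one illustrative approach, not the method of any particular diarization system. The three-dimensional embeddings, the 0.7 threshold, and the last two transcript lines are invented for the example.

```python
import numpy as np

def cluster_speakers(embeddings, threshold=0.7):
    """Greedy clustering of segment embeddings by cosine similarity.
    Returns one integer speaker label per segment."""
    sums, labels = [], []          # sums[k]: running sum of unit vectors for speaker k
    for emb in embeddings:
        emb = np.asarray(emb, dtype=float)
        emb = emb / np.linalg.norm(emb)
        sims = [float(s @ emb) / float(np.linalg.norm(s)) for s in sums]
        if sims and max(sims) >= threshold:
            k = int(np.argmax(sims))
            sums[k] = sums[k] + emb
        else:
            sums.append(emb)       # no speaker is similar enough: start a new one
            k = len(sums) - 1
        labels.append(k)
    return labels

# Hypothetical per-segment transcripts and toy speaker embeddings.
segments = ["Good morning.", "Let's get started.", "First item.", "Go ahead."]
embeddings = [[1.0, 0.1, 0.0], [0.1, 1.0, 0.0], [0.9, 0.0, 0.1], [0.0, 1.1, 0.1]]

labels = cluster_speakers(embeddings)
transcript = [f"Speaker {chr(ord('A') + k)}: {text}"
              for k, text in zip(labels, segments)]
```

Overlapping speech breaks the assumption behind this scheme, since a segment containing two voices has an embedding that matches neither speaker well.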

Accented and dialectal speech reduces accuracy for systems trained primarily on standard varieties of a language. English spoken with a strong Indian, Nigerian, or Scottish accent may produce error rates 2x to 5x higher than standard American English. This bias reflects the composition of training data: internet audio disproportionately represents certain accents and demographics. Accent-specific fine-tuning improves performance but requires collecting and annotating accent-specific data, which can be expensive for under-represented varieties. Multilingual models like Whisper often show better accent robustness than monolingual models, plausibly because exposure to many languages builds tolerance for phonetic variation.

Specialized vocabulary in domains like medicine, law, and engineering causes errors when domain-specific terms are not well represented in the training data. A doctor dictating a clinical note expects the system to correctly transcribe "acetaminophen," "tachycardia," and "bilateral pneumothorax." General-purpose ASR models may produce phonetically similar but medically incorrect alternatives. Domain adaptation through fine-tuning on domain-specific transcripts or adding custom vocabulary with pronunciation information addresses this, and specialized medical and legal dictation systems have existed for decades.
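Fine-tuning and pronunciation lexicons are beyond a short example, but a much cruder stand-in shows the idea of a custom vocabulary: post-process the transcript and snap near-miss words to a domain term list using string similarity. This is a different, simpler technique from the adaptation methods described above. The term list, the 0.85 cutoff, and the misrecognized input are assumptions for illustration.

```python
import difflib

# Hypothetical domain lexicon; a real system would use a much larger list.
MEDICAL_TERMS = ["acetaminophen", "tachycardia", "pneumothorax"]

def snap_to_vocabulary(text, vocabulary=MEDICAL_TERMS, cutoff=0.85):
    """Replace each word with its closest vocabulary term when similar enough."""
    corrected = []
    for word in text.split():
        matches = difflib.get_close_matches(word.lower(), vocabulary, n=1, cutoff=cutoff)
        corrected.append(matches[0] if matches else word)
    return " ".join(corrected)

fixed = snap_to_vocabulary("patient took acetaminofen")
```

Spelling similarity is only a proxy for sounding alike, so this approach can miss errors that are phonetically close but spelled very differently, and a cutoff set too low will corrupt ordinary words.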

Applications

Voice assistants (Siri, Alexa, Google Assistant) use speech recognition as their input modality, converting spoken commands into text for intent recognition and response generation. The ASR component often runs on-device for speed and privacy, with cloud-based models available for more complex or ambiguous inputs. On-device models have become remarkably capable: modern smartphones run ASR models that approach server accuracy for common commands and queries.

Live captioning and subtitling use real-time speech recognition to provide text captions for video calls, lectures, broadcasts, and live events. YouTube auto-captions process millions of hours of video. Zoom, Teams, and Google Meet offer real-time captions during meetings. Accessibility regulations in many countries require captioning for broadcast media, and ASR has made compliance dramatically cheaper than manual captioning, though accuracy requirements for legal compliance often still demand human review of auto-generated captions.

Call center analytics transcribes customer service calls at scale, enabling analysis of customer complaints, agent performance, compliance monitoring, and trend detection. A large call center may process 50,000 calls per day, generating transcripts that are searched, categorized, and analyzed for quality assurance, training, and business intelligence. Sentiment detection applied to transcripts identifies frustrated customers early in calls, enabling real-time coaching or supervisor escalation.

Key Takeaway

Modern speech recognition uses end-to-end neural models that convert audio spectrograms directly into text, achieving near-human accuracy for clear speech while continuing to improve on accented, noisy, and multilingual audio.