Deep Learning for Audio: How AI Processes Sound, Speech, and Music
How Computers Represent Sound
Sound is a pressure wave traveling through air. Microphones convert these pressure variations into electrical signals, which are digitized by sampling the signal at regular intervals. Audio is typically sampled at 16,000 Hz (16 kHz) for speech or 44,100 Hz (44.1 kHz) for music, meaning the computer records 16,000 or 44,100 amplitude values per second. A 10-second audio clip at 16 kHz is therefore a one-dimensional array of 160,000 numbers. This raw waveform is one input format that deep learning models can process directly.
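The sampling arithmetic can be checked with a short Python sketch. The function name `sample_sine` and the 440 Hz test tone are illustrative assumptions, not part of any particular library.

```python
import math

def sample_sine(freq_hz, duration_s, sample_rate):
    """Return amplitude samples of a sine tone taken at regular intervals."""
    n_samples = int(round(duration_s * sample_rate))
    return [math.sin(2 * math.pi * freq_hz * n / sample_rate)
            for n in range(n_samples)]

# A 10-second clip at 16 kHz: one amplitude value per sampling instant.
clip = sample_sine(freq_hz=440.0, duration_s=10.0, sample_rate=16_000)
print(len(clip))  # 160000
```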
The more common representation is the spectrogram, which shows how the frequency content of the sound changes over time. A spectrogram is computed by dividing the waveform into short overlapping windows (typically 25 ms with 10 ms shifts) and computing the Fourier transform of each window. The result is a 2D matrix where one axis is time, the other is frequency, and each value indicates the energy at that frequency during that time window. A mel spectrogram applies a perceptually motivated frequency scale (the mel scale) that matches how humans perceive pitch, compressing higher frequencies where human hearing is less sensitive.
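The windowing and Fourier steps can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions: the Hann window, the 8 kHz sample rate, the 1 kHz test tone, and the helper names are choices made for the example, and `hz_to_mel` uses one common form of the mel formula (other variants exist). A real pipeline would use an FFT and a mel filterbank rather than this direct DFT.

```python
import cmath
import math

def frame_signal(samples, frame_len, hop_len):
    """Split a waveform into overlapping frames (no padding at the edges)."""
    return [samples[s:s + frame_len]
            for s in range(0, len(samples) - frame_len + 1, hop_len)]

def power_spectrum(frame):
    """Hann-window one frame and return the power at each non-negative DFT bin."""
    n = len(frame)
    windowed = [x * (0.5 - 0.5 * math.cos(2 * math.pi * i / n))
                for i, x in enumerate(frame)]
    bins = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                  for i, x in enumerate(windowed))
        bins.append(abs(acc) ** 2)
    return bins

def hz_to_mel(f_hz):
    """One widely used mel-scale formula (an assumption; variants exist)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

sample_rate = 8000
tone = [math.sin(2 * math.pi * 1000 * n / sample_rate) for n in range(800)]  # 0.1 s
frames = frame_signal(tone, frame_len=200, hop_len=80)  # 25 ms windows, 10 ms shift
spectrogram = [power_spectrum(f) for f in frames]       # time x frequency matrix

# The strongest bin in the first frame sits at the tone's frequency.
peak_bin = max(range(len(spectrogram[0])), key=lambda k: spectrogram[0][k])
peak_hz = peak_bin * sample_rate / 200
print(len(spectrogram), len(spectrogram[0]), peak_hz)  # 8 101 1000.0
```

On the mel scale, equal steps in hertz shrink as frequency rises: `hz_to_mel(2000) - hz_to_mel(1000)` is larger than `hz_to_mel(8000) - hz_to_mel(7000)`, which is the compression of higher frequencies described above.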
Mel spectrograms are the standard input for most deep learning audio models because they compress the raw waveform into a more compact representation while preserving the information most relevant to human perception. A 10-second audio clip at 16 kHz produces 160,000 raw samples but, with a 10 ms frame shift, only about 1,000 × 80 = 80,000 spectrogram values (1,000 time frames × 80 mel frequency bins). This 2D representation can be processed by CNNs designed for images, with time on one axis and frequency on the other.
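The frame count behind this estimate can be worked out directly. The sketch below assumes 25 ms windows, a 10 ms shift, and no padding at the clip edges; with padding, libraries typically report a slightly different count, which is why the figure is "about" 1,000.

```python
def num_frames(n_samples, frame_len, hop_len):
    """Number of full frames that fit in the signal without padding."""
    if n_samples < frame_len:
        return 0
    return 1 + (n_samples - frame_len) // hop_len

sample_rate = 16_000
n_samples = 10 * sample_rate            # 10-second clip
frame_len = sample_rate * 25 // 1000    # 25 ms -> 400 samples
hop_len = sample_rate * 10 // 1000      # 10 ms -> 160 samples
n_mels = 80

frames = num_frames(n_samples, frame_len, hop_len)
print(frames, frames * n_mels)  # 998 79840
```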
Speech Recognition (ASR)
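Automatic speech recognition converts spoken audio into text. Accuracy is usually reported as word error rate (WER): the number of substituted, deleted, and inserted words divided by the number of words in the reference transcript. Modern systems achieve WER below 5% for clear English speech, approaching the 4 to 5% error rate of professional human transcriptionists. The progression has been dramatic: in 2012, a state-of-the-art system had a WER of about 25% on conversational speech. Going from about 25% to about 5% is a relative reduction of roughly 80% in a decade, driven largely by deep learning.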
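Word error rate is conventionally computed as a word-level edit distance between the reference and the system output, divided by the reference length. A minimal sketch follows; the function name and example sentences are illustrative, and real evaluations usually normalize case and punctuation first.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the reference words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            curr[j] = min(prev[j] + 1,             # deletion
                          curr[j - 1] + 1,         # insertion
                          prev[j - 1] + (r != h))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)

# One substituted word out of six reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sat on a mat")

# Relative reduction when WER falls from 25% to 5%.
relative_reduction = (0.25 - 0.05) / 0.25
```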
OpenAI's Whisper model, released in 2022, demonstrated that a single transformer model trained on 680,000 hours of multilingual audio could recognize speech in 99 languages (with accuracy that varies by language), handle accented speech, transcribe audio with background noise, and translate speech from other languages into English. Whisper processes mel spectrograms with a transformer encoder-decoder architecture, where the encoder processes the audio features and the decoder generates the text transcription token by token. The model's strength comes from the massive scale and diversity of its training data, which was collected from the web and covers a wide range of speakers, recording setups, and acoustic conditions.
Connectionist Temporal Classification (CTC), an older approach still used in some systems, aligns audio frames with text characters without requiring explicit alignment labels. The model predicts a character or a special blank symbol at each time step, and CTC handles the fact that multiple consecutive frames might correspond to the same character or to no character at all (silence, for example): repeated predictions are merged and blanks are dropped. CTC-based models are simpler and faster than encoder-decoder models but generally less accurate for complex speech patterns and long-form transcription.
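The collapsing rule CTC uses at decoding time can be illustrated with greedy decoding: take the per-frame predictions, merge consecutive repeats, then drop the blank symbol. In this sketch the frame labels are hand-written rather than produced by a model, and `"_"` is an arbitrary choice of blank marker.

```python
from itertools import groupby

BLANK = "_"  # illustrative marker for the CTC blank symbol

def ctc_greedy_decode(frame_labels, blank=BLANK):
    """Merge consecutive duplicate labels, then remove blanks."""
    collapsed = [label for label, _ in groupby(frame_labels)]
    return "".join(label for label in collapsed if label != blank)

# Thirteen frames of per-frame predictions for a five-letter word.
frame_labels = list("hh_e_ll_llo__")
print(ctc_greedy_decode(frame_labels))  # hello
```

The blank between the two runs of `l` is what lets the decoder keep a genuine double letter instead of merging it into one.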
Real-world speech recognition challenges include background noise (restaurants, cars, outdoor environments), overlapping speakers (meetings, phone calls), domain-specific vocabulary (medical dictation, legal transcription), and code-switching (speakers alternating between languages within a conversation). Systems deployed in production use additional techniques: noise robustness training with artificially corrupted audio, speaker diarization to identify who said what, and language model rescoring to correct errors using knowledge of which word sequences are probable.
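One common way to create artificially corrupted training audio is to add noise at a chosen signal-to-noise ratio (SNR). The sketch below is a minimal version under stated assumptions: a sine tone stands in for clean speech, Gaussian noise stands in for a recorded noise clip, and the helper names are hypothetical.

```python
import math
import random

def mean_power(samples):
    """Average squared amplitude of a signal."""
    return sum(x * x for x in samples) / len(samples)

def mix_at_snr(speech, noise, snr_db):
    """Add noise to speech, scaled so the speech-to-noise power ratio is snr_db."""
    target_noise_power = mean_power(speech) / (10 ** (snr_db / 10))
    scale = math.sqrt(target_noise_power / mean_power(noise))
    return [s + scale * n for s, n in zip(speech, noise)]

rng = random.Random(0)
speech = [math.sin(2 * math.pi * 220 * n / 16_000) for n in range(16_000)]
noise = [rng.gauss(0.0, 1.0) for _ in range(16_000)]
noisy = mix_at_snr(speech, noise, snr_db=10.0)

# Check: the noise that was actually added sits 10 dB below the speech.
added = [y - s for y, s in zip(noisy, speech)]
measured_snr_db = 10 * math.log10(mean_power(speech) / mean_power(added))
```

During training, the SNR is typically drawn at random per example so the model sees both lightly and heavily corrupted audio.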
Text-to-Speech (TTS)
Text-to-speech synthesis converts written text into natural-sounding audio. Early deep learning TTS systems like Tacotron 2 (2017) used an encoder-decoder architecture to convert text into mel spectrograms, followed by a neural vocoder (such as WaveNet or WaveRNN) that converted the spectrogram into an audio waveform. The resulting speech was dramatically more natural than the concatenative and parametric systems that preceded it, with convincing prosody, appropriate emphasis, and smooth transitions between sounds.
Modern TTS systems like VALL-E and Bark can clone a speaker's voice from just a few seconds of reference audio, producing speech in that voice saying arbitrary new text. The zero-shot voice cloning capability raises both exciting possibilities (personalized assistants, preserving voices of people who lose the ability to speak) and serious concerns (impersonation, fraud, deepfake audio). Real-time TTS runs on smartphones and smart speakers, powering the voices of Siri, Alexa, Google Assistant, and navigation systems.
Emotional and expressive TTS is an active research area. Systems are learning to generate speech with appropriate emotion (happy, sad, urgent, calm), speaking style (conversational, formal, whispering, shouting), and prosodic variation (emphasis, pauses, rhythm). These capabilities make AI-generated speech more natural in interactive applications and more useful for audiobook narration, dubbing, and accessibility tools.
Music and Audio Generation
Deep learning can generate music that is stylistically consistent, harmonically coherent, and emotionally expressive. Models like MusicLM, MusicGen, and Suno generate audio from text descriptions ("an upbeat jazz piece with piano and saxophone") or continue a musical passage in a consistent style. These systems are typically trained on large datasets of music with metadata descriptions and learn to generate audio that matches both the acoustic characteristics and the semantic descriptions.
The technical approaches mirror those in other generative domains. Autoregressive models generate audio token by token (where tokens represent compressed audio features). Diffusion models generate spectrograms or waveforms through iterative denoising. Some systems combine both: an autoregressive model generates a coarse structure, and a diffusion model fills in the acoustic details. The challenge unique to music is maintaining long-range coherence: a song needs consistent key, tempo, and style across minutes of audio, which requires modeling much longer temporal dependencies than speech.
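The autoregressive loop has the same shape regardless of the model behind it: score every possible next token given the tokens so far, pick one, append it, and repeat. In the sketch below `toy_scores` is a hypothetical stand-in for a trained network, and greedy selection is used for simplicity where real systems usually sample.

```python
def generate_tokens(next_token_scores, prompt, n_new):
    """Extend a token sequence one token at a time (greedy selection)."""
    tokens = list(prompt)
    for _ in range(n_new):
        scores = next_token_scores(tokens)  # one score per vocabulary entry
        tokens.append(max(range(len(scores)), key=scores.__getitem__))
    return tokens

def toy_scores(tokens, vocab_size=4):
    """Hypothetical model: always favors (last token + 1) mod vocab_size."""
    target = (tokens[-1] + 1) % vocab_size
    return [1.0 if t == target else 0.0 for t in range(vocab_size)]

print(generate_tokens(toy_scores, prompt=[0], n_new=5))  # [0, 1, 2, 3, 0, 1]
```

In an audio system, each integer would index a codebook of compressed audio features, and a separate decoder would turn the token sequence back into a waveform.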
Sound effect generation and audio-visual synthesis are growing areas. Models can generate realistic sound effects from text descriptions ("rain on a tin roof," "crowd cheering in a stadium") and synchronize generated audio with video (adding appropriate footstep sounds to a walking animation). These capabilities are valuable for film production, game development, and virtual reality experiences.
Audio Classification and Analysis
Audio classification identifies what type of sound is present in a recording. Environmental sound classification distinguishes between categories like car horns, dog barks, sirens, and music. This powers applications from wildlife monitoring (identifying bird species from recordings) to industrial predictive maintenance (detecting machinery faults from their acoustic signatures) to smart home devices (distinguishing between a doorbell, smoke alarm, and glass breaking).
Speaker identification determines who is speaking from voice characteristics. Speaker verification confirms whether a voice matches a claimed identity (used in voice biometric authentication). These systems extract speaker embeddings, dense vectors that capture the unique characteristics of a voice, and compare them using distance metrics. Modern systems achieve equal error rates below 1% on standard benchmarks, making voice biometrics reliable enough for banking and security applications.
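Comparing embeddings usually comes down to a similarity score and a threshold. The sketch below uses cosine similarity; the three-dimensional vectors and the 0.7 threshold are made-up values for illustration, whereas real embeddings have hundreds of dimensions and the threshold is tuned on held-out data.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def verify_speaker(enrolled, probe, threshold=0.7):
    """Accept the claimed identity if the embeddings are similar enough."""
    return cosine_similarity(enrolled, probe) >= threshold

enrolled = [0.9, 0.1, 0.3]         # stored when the user registered
same_speaker = [0.8, 0.2, 0.35]    # new recording, same voice
other_speaker = [-0.2, 0.9, 0.1]   # new recording, different voice

print(verify_speaker(enrolled, same_speaker))   # True
print(verify_speaker(enrolled, other_speaker))  # False
```

Raising the threshold rejects more impostors but also more genuine users; the equal error rate is the operating point where those two error rates match.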
Music information retrieval uses deep learning to analyze musical content: identifying instruments, detecting beats and tempo, extracting melody, classifying genre, and estimating mood. Shazam-like services use audio fingerprinting to identify songs from short snippets. Recommendation systems use deep audio features to suggest similar music based on acoustic similarity rather than just metadata and listening history.
Architectures for Audio
Audio processing uses largely the same architectures as vision and language, adapted for audio's specific characteristics. CNNs applied to mel spectrograms treat the spectrogram as a 2D image, using convolutional filters to detect time-frequency patterns. This approach is simple and effective for classification tasks. 1D CNNs applied to raw waveforms learn features directly from the audio signal, avoiding the information loss inherent in spectrogram computation.
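The basic operation of a 1D CNN layer on a raw waveform is a filter sliding along the samples. A minimal sketch, with a hand-picked kernel standing in for weights that a real network would learn during training:

```python
def conv1d(signal, kernel, stride=1):
    """Slide a kernel along a 1D signal (no padding) and return the responses."""
    k = len(kernel)
    return [sum(signal[s + i] * kernel[i] for i in range(k))
            for s in range(0, len(signal) - k + 1, stride)]

waveform = [0.0, 1.0, 0.0, -1.0, 0.0, 1.0, 0.0, -1.0]
slope_kernel = [-1.0, 0.0, 1.0]  # responds to rising or falling amplitude

features = conv1d(waveform, slope_kernel)
print(features)  # [0.0, -2.0, 0.0, 2.0, 0.0, -2.0]

# A stride greater than 1 downsamples while filtering.
print(len(conv1d(waveform, slope_kernel, stride=2)))  # 3
```

Stacking such layers with strides is how waveform models reduce tens of thousands of samples per second to a manageable number of feature frames.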
Transformer models have become dominant for audio as well. Audio transformers like AST (Audio Spectrogram Transformer) split the spectrogram into patches and process them with self-attention, much as a Vision Transformer processes image patches. Whisper uses a standard encoder-decoder transformer. Self-supervised models like wav2vec 2.0 and HuBERT learn audio representations from unlabeled audio by masking parts of the input and predicting targets for the masked regions, analogous to how BERT learns language representations. These pre-trained audio models can be fine-tuned for specific tasks with limited labeled data.
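The patching step can be sketched as cutting the time-frequency matrix into small tiles and flattening each tile into a vector that becomes one input token. This sketch uses non-overlapping 2 × 2 patches on a toy 6 × 4 matrix as a simplifying assumption; it does not reproduce the patch size or stride of any published model.

```python
def split_into_patches(spectrogram, patch_h, patch_w):
    """Cut a 2D matrix into non-overlapping patches, each flattened row by row."""
    n_rows, n_cols = len(spectrogram), len(spectrogram[0])
    patches = []
    for r in range(0, n_rows - patch_h + 1, patch_h):
        for c in range(0, n_cols - patch_w + 1, patch_w):
            patches.append([spectrogram[r + i][c + j]
                            for i in range(patch_h)
                            for j in range(patch_w)])
    return patches

# Toy spectrogram: 6 time frames x 4 frequency bins, value = 10 * frame + bin.
spec = [[10 * t + f for f in range(4)] for t in range(6)]
patches = split_into_patches(spec, patch_h=2, patch_w=2)

print(len(patches))  # 6
print(patches[0])    # [0, 1, 10, 11]
```

Each flattened patch would then be linearly projected and given a position embedding before entering the self-attention layers.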
Deep learning processes audio through spectrograms or raw waveforms using CNN and transformer architectures. Speech recognition has reached near-human accuracy, text-to-speech produces natural voices from text, and generative models create music and sound effects from descriptions. The same self-supervised pre-training paradigm that revolutionized text and vision is now producing powerful general-purpose audio representations.