How Do Models Learn to Hear?
Raw audio is just a sequence of numbers — amplitude samples captured thousands of times per second. A 10-second clip at 16 kHz is 160,000 floating-point values with no inherent structure beyond the waveform. How do we get from this flat stream of samples to a model that understands speech, identifies speakers, or transcribes languages it has never been explicitly taught? The answer lies in audio encoders: models that compress raw audio into compact, meaningful representations that downstream systems can reason about.
Three paradigms have emerged for training these encoders, each with a different philosophy about where the learning signal comes from:
- Supervised on labeled data (Whisper): collect hundreds of thousands of hours of audio paired with transcriptions, then train an encoder-decoder transformer to predict the text. The encoder learns representations because it must — the decoder needs them to produce accurate transcripts.
- Self-supervised contrastive learning (wav2vec 2.0): mask parts of the audio, then train the model to distinguish the true masked content from distractors. No transcriptions needed — the audio itself provides the learning signal.
- Self-supervised masked prediction (HuBERT): cluster audio frames into discrete pseudo-labels offline, then mask frames and predict the cluster IDs. Like BERT's masked language modelling, but for audio.
This progression mirrors what happened in computer vision. Vision started with supervised training on ImageNet labels, then moved to self-supervised methods like DINO and MAE that learn from images alone. Audio followed the same arc: Whisper showed that massive supervised training produces excellent speech recognition, while wav2vec 2.0 and HuBERT showed that self-supervised pre-training produces general-purpose representations that transfer to tasks no one anticipated at training time. Let's examine each approach in detail.
Whisper: Supervised at Scale
Whisper (Radford et al., 2022) is the simplest approach conceptually: take an enormous amount of audio with transcriptions and train a standard encoder-decoder transformer to predict the text. What makes Whisper remarkable is not the architecture but the scale: 680,000 hours of weakly-supervised multilingual audio scraped from the internet — roughly 77 years of continuous listening. This brute-force approach produced a model that generalises across accents, languages, and recording conditions without any domain-specific tricks.
The architecture has two halves. The encoder takes an 80-channel log-mel spectrogram (a time-frequency representation of the audio, covered in an earlier article in this track) and passes it through two 1D convolutional layers (the first with stride 1, the second with stride 2), which downsample the time axis by 2x. The result feeds into a standard transformer encoder with sinusoidal positional embeddings. For a 30-second audio clip at 100 frames per second (the mel spectrogram rate), the convolutions reduce 3,000 frames to 1,500 frames, each represented as a $d$-dimensional vector. This is the representation that downstream systems can reuse.
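The frame arithmetic above can be checked directly. A minimal sketch (the helper name is ours; the mel frame rate and the stride-2 second convolution follow the description above):

```python
def whisper_encoder_frames(clip_seconds: float, mel_fps: int = 100) -> tuple[int, int]:
    """Mel frames in, encoder frames out, for Whisper's conv front-end.

    Of the two 1D conv layers, only the second has stride 2,
    so the time axis shrinks by a factor of 2 overall.
    """
    mel_frames = int(clip_seconds * mel_fps)  # 100 mel frames per second
    encoder_frames = mel_frames // 2          # stride-2 conv halves the rate
    return mel_frames, encoder_frames

mel, enc = whisper_encoder_frames(30.0)
print(mel, enc)  # 3000 1500
```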
The decoder is an autoregressive transformer that produces text tokens one at a time, attending to the encoder output via cross-attention. What makes Whisper's decoder distinctive is its use of special tokens to handle multiple tasks through a single model. The token sequence might look like: <|startoftranscript|> <|en|> <|transcribe|> <|0.00|> Hello world <|2.40|> <|endoftext|>. The language token (<|en|>) selects the output language, the task token (<|transcribe|> or <|translate|>) chooses between transcription and translation, and the timestamp tokens (<|0.00|>) provide segment-level timing. One architecture, one set of weights, multiple tasks — all controlled by the prefix tokens in the decoder.
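Building that prefix is just string assembly. A small illustrative helper (the function name is ours; the token spellings match the examples above, and Whisper's vocabulary also includes a <|notimestamps|> token to suppress timestamp output):

```python
def whisper_prefix(language: str = "en", task: str = "transcribe",
                   timestamps: bool = True) -> list[str]:
    """Assemble the special-token prefix that steers Whisper's decoder.

    Illustrative only — not part of the Whisper codebase.
    """
    prefix = ["<|startoftranscript|>", f"<|{language}|>", f"<|{task}|>"]
    if not timestamps:
        prefix.append("<|notimestamps|>")  # suppress timestamp tokens
    return prefix

print(whisper_prefix("en", "transcribe"))
print(whisper_prefix("fr", "translate", timestamps=False))
```

Changing one token in the prefix switches the model between tasks and languages with no change to the weights.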
Whisper comes in several sizes, from Tiny (39M parameters, 4 encoder layers) to Large-v3 (1.55B parameters, 32 encoder layers). The scaling follows a familiar pattern: larger models are more accurate but slower, with the biggest jumps coming from Tiny to Small and from Medium to Large. For reference, the Large-v3 encoder alone has roughly 600M parameters, nearly twice the size of a BERT-Large model.
import json, js
# Whisper model sizes (from the paper, Table 1)
models = [
    ["Tiny", "39M", 4, 384, 6, 384, "~32x real-time"],
    ["Base", "74M", 6, 512, 8, 512, "~16x real-time"],
    ["Small", "244M", 12, 768, 12, 768, "~6x real-time"],
    ["Medium", "769M", 24, 1024, 16, 1024, "~2x real-time"],
    ["Large-v3", "1.55B", 32, 1280, 20, 1280, "~1x real-time"],
]
rows = []
for name, params, enc_layers, enc_dim, heads, dec_dim, speed in models:
    # Encoder and decoder share the same width at every model size
    rows.append([name, params, str(enc_layers), str(enc_dim), str(heads), str(dec_dim), speed])
js.window.py_table_data = json.dumps({
"headers": ["Model", "Params", "Enc Layers", "Enc Dim", "Heads", "Dec Dim", "Approx Speed"],
"rows": rows
})
print("Whisper model family: from 39M to 1.55B parameters")
print("All models share the same encoder-decoder architecture")
print("Trained on 680,000 hours of weakly-supervised multilingual audio")
Why does Whisper matter as a foundation beyond just speech recognition? Because its encoder has become the de facto audio front-end for multimodal language models. Qwen-Audio, Qwen2-Audio, and Qwen2.5-Omni all reuse a pre-trained Whisper encoder as their audio perception module, connecting its output to a language model via a projection layer. Whisper established the pipeline — mel spectrogram to convolutional downsampling to transformer encoder — as the standard way to turn audio into continuous representations that language models can consume.
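The connection pattern is simple in shape: the frozen encoder's output is multiplied through a trainable projection into the language model's embedding space. A dimensional sketch (all sizes here are assumed for illustration, not taken from any specific Qwen model):

```python
import numpy as np

rng = np.random.default_rng(0)
enc_frames, d_audio, d_llm = 1500, 1280, 4096  # assumed dims for illustration

encoder_out = rng.normal(size=(enc_frames, d_audio))  # frozen Whisper encoder output
W = rng.normal(size=(d_audio, d_llm)) * 0.01          # trainable projection layer

audio_embeds = encoder_out @ W  # now shaped like a sequence of LLM token embeddings
print(audio_embeds.shape)       # (1500, 4096)
```

During training only `W` (and the language model, depending on the recipe) receives gradients; the audio encoder stays fixed.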
wav2vec 2.0: Learning Without Labels
Whisper's approach has an obvious bottleneck: labeled data. Transcribing audio requires human annotators, which is expensive, slow, and biased toward well-resourced languages. English has hundreds of thousands of hours of transcribed speech; many of the world's 7,000+ languages have virtually none. Can we learn audio representations from raw, untranscribed audio alone? That's the question wav2vec 2.0 (Baevski et al., 2020) answers with a resounding yes.
The architecture has three components that work together in a self-supervised loop. First, a feature encoder: a 7-layer 1D convolutional neural network that processes the raw 16 kHz waveform directly (no spectrogram needed). Each layer applies a temporal convolution, a normalisation step, and a GELU activation, progressively downsampling the signal. The output rate is approximately 50 Hz — one feature vector every 20 milliseconds — which is a 320x reduction from the raw sample rate. At 16,000 samples per second, a 20ms window spans 320 samples, so each CNN output frame summarises 320 raw audio values into a single $d$-dimensional vector.
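The 320x figure falls out of the convolution strides. The stride schedule below is the one reported in the wav2vec 2.0 paper; the arithmetic is a quick sanity check:

```python
from math import prod

# wav2vec 2.0 feature-encoder strides, layer by layer (from the paper).
# Their product is the overall downsampling factor of the 7-layer CNN.
strides = (5, 2, 2, 2, 2, 2, 2)
factor = prod(strides)

print(factor)                  # 320 raw samples per output frame
print(16_000 / factor)         # 50.0 output frames per second
print(1000 * factor / 16_000)  # 20.0 ms of audio per frame
```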
Second, a context network: a standard transformer encoder that takes the CNN output sequence and applies self-attention across all frames. While the CNN sees only a local window (each frame corresponds to ~25ms of audio with some receptive-field overlap), the transformer captures long-range dependencies — a phoneme's identity can depend on what was said seconds earlier. The context network produces a contextualised representation $c_t$ for every time step $t$.
Third, a quantization module that discretises the CNN output into a finite set of targets. It uses product quantization with 2 codebooks of 320 entries each. Each CNN output frame is mapped to one entry from each codebook, and the two entries are concatenated to form the quantized representation $q_t$. With $320 \times 320 = 102{,}400$ possible combinations, the codebook is large enough to capture fine phonetic distinctions while still being a discrete, finite target space.
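The codebook lookup can be sketched with a nearest-entry search. Note this is a simplification: during training, wav2vec 2.0 actually uses a Gumbel-softmax so the selection stays differentiable. The feature dimension `d = 512` and the random codebooks below are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 512                      # CNN output dimension (assumed for illustration)
entries, codebooks = 320, 2  # wav2vec 2.0's product-quantization setup

# Two codebooks, each mapping half of the feature vector to one of 320 entries.
books = rng.normal(size=(codebooks, entries, d // codebooks))

def quantize(frame: np.ndarray) -> np.ndarray:
    """Nearest-entry lookup per codebook, then concatenate: a sketch of q_t."""
    halves = frame.reshape(codebooks, d // codebooks)
    picks = []
    for g in range(codebooks):
        dists = np.linalg.norm(books[g] - halves[g], axis=1)
        picks.append(books[g][dists.argmin()])
    return np.concatenate(picks)

q = quantize(rng.normal(size=d))
print(q.shape)                                  # (512,)
print("possible codes:", entries ** codebooks)  # 102400
```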
The self-supervised training procedure works as follows. Spans of CNN output frames are masked (typically 10 consecutive frames starting at random positions, covering roughly 200ms of audio each). The context network sees the masked input and must produce contextualised representations $c_t$ at each masked position. The training objective is then a contrastive task: given $c_t$ at a masked position, identify which quantized representation $q_t$ is the true target among a set of $K$ distractors $\tilde{q}$ sampled uniformly from other masked positions in the same utterance.
The contrastive loss for this task is the InfoNCE loss, the same objective used in CLIP for image-text alignment and in dense retrieval for query-document matching:

$$\mathcal{L}_m = -\log \frac{\exp(\text{sim}(c_t, q_t) / \kappa)}{\sum_{\tilde{q} \in Q_t} \exp(\text{sim}(c_t, \tilde{q}) / \kappa)}$$
Let's unpack what each piece does. $c_t$ is the context network's output at masked time step $t$ — this is the model's best guess about what audio was masked, informed by surrounding context. $q_t$ is the true quantized target — the discretised version of the CNN output that was actually at position $t$ before masking. $\tilde{q}$ ranges over a set $Q_t$ that includes the true target $q_t$ and $K$ distractors. $\text{sim}(\cdot, \cdot)$ is cosine similarity. And $\kappa$ is a temperature parameter that controls the sharpness of the distribution.
Let's check the boundary behaviour. When the model perfectly identifies the true target, $\text{sim}(c_t, q_t) \gg \text{sim}(c_t, \tilde{q})$ for all distractors. In that case, the numerator dominates the denominator, the fraction approaches 1, and the loss approaches $-\log(1) = 0$. When the model is completely confused and assigns equal similarity to all candidates, the fraction becomes $1/(K+1)$ (one true target among $K+1$ total candidates), giving a loss of $\log(K+1)$. For $K = 100$ distractors, this maximum loss is $\log(101) \approx 4.62$. The temperature $\kappa$ matters at the margins: a lower $\kappa$ sharpens the softmax, making the model more confident and punishing near-misses more harshly; a higher $\kappa$ smooths the distribution, making training more stable early on when the model is still learning.
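Both boundary cases are easy to verify numerically. In this sketch the `infonce` helper is ours, and it takes raw similarity scores directly rather than computing cosine similarity over real vectors:

```python
import math

def infonce(sim_true: float, sim_distractors: list[float], kappa: float = 0.1) -> float:
    """InfoNCE: negative log-softmax of the true target's (scaled) similarity."""
    logits = [sim_true / kappa] + [s / kappa for s in sim_distractors]
    log_denom = math.log(sum(math.exp(l) for l in logits))
    return -(sim_true / kappa - log_denom)

K = 100
# Confused model: equal similarity everywhere -> loss = log(K + 1)
print(infonce(0.5, [0.5] * K))   # ≈ 4.615, i.e. log(101)
# Confident model: true target far above all distractors -> loss near 0
print(infonce(1.0, [-1.0] * K))
```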
The full loss adds a diversity penalty to prevent codebook collapse — the tendency of the quantization module to use only a few codebook entries while ignoring the rest. The diversity loss maximises the entropy of the codebook usage distribution, encouraging all entries to be used equally. Without it, the model might converge to a degenerate solution where most audio frames map to the same handful of codes, destroying the information content of the quantized targets.
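The paper expresses the diversity term via the entropy (equivalently, perplexity) of the codebook distribution. As an illustration, the helper below (our own naming) computes the entropy of a usage histogram, which is maximal when all entries are used equally and small when the codebook has collapsed:

```python
import numpy as np

def codebook_entropy(usage: np.ndarray) -> float:
    """Entropy (nats) of the codebook usage distribution; maximal when uniform."""
    p = usage / usage.sum()
    p = p[p > 0]  # unused entries contribute 0 * log(0) = 0
    return float(-(p * np.log(p)).sum())

V = 320
uniform = np.ones(V)        # every entry used equally often
collapsed = np.zeros(V)
collapsed[:4] = 1           # only 4 entries ever chosen

print(codebook_entropy(uniform))    # log(320) ≈ 5.77, the maximum
print(codebook_entropy(collapsed))  # log(4) ≈ 1.39, a collapsed codebook
```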
The results from the paper are striking. Pre-training wav2vec 2.0 on 960 hours of unlabeled LibriSpeech audio and then fine-tuning on just 10 minutes of labeled data achieves a word error rate (WER) of 4.8% on the test-clean split. For comparison, the previous state of the art used 960 hours of labeled data to reach similar performance. That's a 5,760x reduction in labeled data requirements. The boundary insight is clear: more unlabeled data for pre-training consistently improves representations (the model trained on 60,000 hours of unlabeled data outperforms the one trained on 960 hours), while the amount of labeled fine-tuning data hits diminishing returns quickly (going from 10 minutes to 1 hour helps significantly, but going from 10 hours to 100 hours helps much less).
import json, js
# wav2vec 2.0 results from Table 9 in the paper
# Pre-trained on 960h unlabeled LibriSpeech, varying labeled fine-tuning data
results = [
["10 min", "4.8", "8.2"],
["1 hour", "3.4", "6.1"],
["10 hours", "2.6", "5.0"],
["100 hours","2.3", "4.6"],
["960 hours","2.1", "4.3"],
]
rows = []
for labeled, clean, other in results:
    rows.append([labeled, clean, other])
js.window.py_table_data = json.dumps({
"headers": ["Labeled Fine-tuning Data", "WER test-clean (%)", "WER test-other (%)"],
"rows": rows
})
print("wav2vec 2.0 (Large) pre-trained on 960h unlabeled LibriSpeech")
print("Fine-tuned with varying amounts of labeled data")
print()
print("Key insight: 10 min of labels + self-supervised pre-training")
print("rivals models trained on 960h of fully labeled data")
HuBERT: Predicting Hidden Units
wav2vec 2.0 proved that self-supervised learning works for audio, but its contrastive approach has moving parts: the quantization module must be trained jointly with the rest of the network, and the contrastive task requires careful negative sampling. HuBERT (Hsu et al., 2021) proposes a simpler alternative: instead of learning targets online through a contrastive objective, create targets offline using plain k-means clustering, then train the model to predict those cluster assignments at masked positions — exactly like BERT's masked language modelling (MLM), but with audio frames instead of text tokens and cluster IDs instead of vocabulary words.
The iterative training process is what makes HuBERT unique. It runs in stages, with each stage producing better targets for the next:
- Stage 0 — bootstrap features: extract 39-dimensional MFCCs (Mel-Frequency Cepstral Coefficients, a classic signal-processing feature) from the raw audio. These are crude representations — they capture spectral shape but miss the rich temporal and semantic structure that neural networks learn.
- Stage 0 — cluster: run k-means clustering on the MFCC features with $K = 100$ clusters. Every audio frame gets assigned a cluster ID from $\{0, 1, \ldots, 99\}$. These are the initial pseudo-labels.
- Stage 1 — train HuBERT: mask spans of the audio input (same masking strategy as wav2vec 2.0), feed the unmasked portions through a CNN feature encoder and a transformer, and predict the cluster ID at each masked position using a classification head. The model is trained with standard cross-entropy loss.
- Stage 1 — re-extract features: after training, use the 6th transformer layer's output (an intermediate representation) as new features for all audio frames. These learned representations are far richer than the original MFCCs.
- Stage 2 — re-cluster and retrain: run k-means again on the new features, now with $K = 500$ clusters for finer granularity. Use these improved cluster IDs as targets and train a new (or continued) HuBERT model. The cycle can repeat, but in practice two iterations suffice.
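Stage 0 is just k-means over frame features. A toy sketch with random stand-in features in place of real MFCCs (the helper is a minimal k-means of our own, not the paper's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

def kmeans_labels(feats: np.ndarray, k: int, iters: int = 20) -> np.ndarray:
    """Minimal k-means: returns a cluster ID per frame (the pseudo-labels)."""
    centers = feats[rng.choice(len(feats), k, replace=False)]
    for _ in range(iters):
        labels = np.linalg.norm(feats[:, None] - centers[None], axis=2).argmin(1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = feats[labels == j].mean(0)
    return labels

# Stage 0: pseudo-labels from crude bootstrap features (random stand-ins here).
frames = rng.normal(size=(1000, 39))  # 1000 frames of 39-dim "MFCCs"
stage0_targets = kmeans_labels(frames, k=100)
print(stage0_targets.shape)  # (1000,) — one cluster ID in [0, 100) per frame
```

Stage 1 trains the model to predict `stage0_targets` at masked positions; stage 2 repeats the clustering on the model's own layer-6 features with k = 500.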
The loss function is straightforward — cross-entropy computed only at masked positions:

$$\mathcal{L} = -\sum_{t \in \mathcal{M}} \log P(c_t \mid \tilde{x})$$
Here $\mathcal{M}$ is the set of masked time steps, $c_t$ is the cluster assignment for frame $t$ (determined offline by k-means), and $\tilde{x}$ is the masked version of the input audio. The probability $P(c_t \mid \tilde{x})$ comes from a softmax over $K$ classes at position $t$. Let's check the boundaries. When the model predicts the correct cluster with probability 1, the log term is $\log(1) = 0$, contributing zero loss — perfect prediction. When the model assigns probability $1/K$ uniformly across all clusters (random guessing), each masked position contributes $\log(K)$. For $K = 100$, that's $\log(100) \approx 4.6$ per position; for $K = 500$, it's $\log(500) \approx 6.2$. More clusters mean a harder prediction task and a higher loss ceiling, which is why the second iteration uses 500 clusters — it forces the model to make finer-grained distinctions.
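The masked-only cross-entropy and its random-guessing ceiling can be checked in a few lines. The `hubert_loss` helper below is our own sketch, not the paper's implementation:

```python
import math

def hubert_loss(logits: list[list[float]], targets: list[int],
                masked: list[bool]) -> float:
    """Cross-entropy summed over masked positions only (a sketch)."""
    total = 0.0
    for logit_row, c_t, m in zip(logits, targets, masked):
        if not m:
            continue  # unmasked frames contribute nothing to the loss
        log_denom = math.log(sum(math.exp(l) for l in logit_row))
        total += -(logit_row[c_t] - log_denom)  # -log softmax at the true cluster
    return total

K = 100
# One masked frame with uniform logits = random guessing -> log(K) ≈ 4.605
print(hubert_loss([[0.0] * K], [7], [True]))
```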
The obvious question is: why does this work when the cluster labels are noisy and semantically meaningless? A k-means cluster over MFCCs doesn't correspond to a phoneme or a word — it's an arbitrary grouping of acoustically similar frames. The key insight from the paper is that the labels don't need to be correct; they just need to be consistent. If two instances of the phoneme /k/ land in the same cluster (even if that cluster also contains some /g/ sounds), the model still learns that these frames are related. And because the model must predict masked frames from context, it is forced to learn both acoustic patterns (what does this sound sound like?) and sequential structure (what sound is likely to come next given what came before?). The re-clustering step then improves the labels using the model's own learned representations, creating a virtuous cycle: better representations produce better clusters, which produce better training targets, which produce better representations.
HuBERT's representations have proven important well beyond speech recognition. Its successor WavLM (Chen et al., 2022) extends the approach with denoising objectives and achieves state-of-the-art results on the SUPERB benchmark across tasks including speaker identification, emotion recognition, and speech separation. More importantly for the audio generation pipeline, HuBERT and WavLM representations serve as semantic tokens in neural audio codecs. Moshi's Mimi codec, for instance, distils from WavLM to create a semantic first codebook that captures linguistic content, while acoustic codebooks capture the fine-grained detail needed for high-fidelity reconstruction. We'll cover this in the next article on neural audio codecs.
Comparing the Three Approaches
Each of these three encoders makes fundamentally different choices about training signal, input representation, and output format. The table below summarises the key differences:
import json, js
rows = [
["Whisper", "Supervised (ASR)", "Log-mel spectrogram", "Text tokens",
"680K hrs labeled", "ASR, translation directly; encoder reused in multimodal LLMs"],
["wav2vec 2.0", "Self-sup. contrastive", "Raw waveform (16kHz)", "Continuous features",
"960h-60Kh unlabeled", "Fine-tune for ASR, speaker ID, emotion recognition"],
["HuBERT", "Self-sup. masked pred.", "Raw waveform (16kHz)", "Continuous features",
"960h unlabeled", "Fine-tune for ASR; semantic tokens for neural codecs"],
]
js.window.py_table_data = json.dumps({
"headers": ["Model", "Training", "Input", "Output", "Data", "Downstream Use"],
"rows": rows
})
print("Three paradigms for learning audio representations:")
print()
print("Supervised (Whisper): best when you have labeled data and want ASR directly")
print("Contrastive (wav2vec 2.0): best for low-resource languages with little labeled data")
print("Masked prediction (HuBERT): simplest self-supervised approach, semantic tokens for codecs")
The pattern that emerges is familiar from NLP and vision: supervised models excel at the specific task they were trained for (Whisper is hard to beat at multilingual ASR with its 680K hours of supervision), but self-supervised models learn more general-purpose representations that transfer to a wider range of downstream tasks. wav2vec 2.0 and HuBERT representations are useful for speaker verification, emotion detection, language identification, speech separation, and keyword spotting — tasks that Whisper's encoder was never optimised for but that benefit from the rich audio understanding these models develop through self-supervised learning.
There's also a practical difference in how these encoders get reused. Whisper's encoder is typically frozen and used as a feature extractor — its output is projected into a language model's embedding space, and only the projection layer and language model are trained. wav2vec 2.0 and HuBERT, by contrast, are more commonly fine-tuned end-to-end on downstream tasks, because their pre-training objective didn't optimise for any specific output format.
All three encoders produce continuous representations — dense floating-point vectors at each time step. But for language models that operate on discrete tokens, we often want to convert audio into a sequence of integers, just like text is tokenized into vocabulary indices. The next article in this track covers how neural audio codecs (EnCodec, SoundStream, DAC, Mimi) use residual vector quantization to turn continuous audio into discrete tokens — bridging the gap between audio encoders and the token-based world of large language models.
Quiz
Test your understanding of the three audio encoder paradigms.
Why is Whisper's encoder widely reused in multimodal language models like Qwen-Audio, rather than training a new audio encoder from scratch?
In wav2vec 2.0's contrastive loss, what happens to the loss value when the model assigns equal similarity to the true target and all $K$ distractors?
Why does HuBERT's training work even though the k-means cluster labels are noisy and semantically meaningless?
What is the key advantage of self-supervised audio encoders (wav2vec 2.0, HuBERT) over supervised ones (Whisper)?