Why Tokenize Audio?
Language models operate on discrete tokens — words, subwords, byte-pairs. A model like GPT or LLaMA never sees raw text; it sees integer IDs drawn from a fixed vocabulary. So what happens when we want a language model to understand or generate audio? Raw audio is a continuous waveform: thousands of floating-point amplitude samples per second. We need a way to convert that continuous signal into a sequence of discrete tokens that a language model can consume, predict, and generate. That conversion is audio tokenization.
There are two fundamentally different kinds of information in audio, and they require different kinds of tokens:
- Semantic tokens: capture what is said — the linguistic content, the words and meaning. These are generated by clustering the internal representations of self-supervised speech models like HuBERT or WavLM. The idea is simple: run audio through a pre-trained model, extract hidden-layer features, then cluster those features (e.g. with k-means) to get discrete labels. Two utterances of "hello" by different speakers in different rooms will map to the same semantic tokens, because the model learned to ignore speaker and acoustic details.
- Acoustic tokens: capture how it sounds — speaker identity, pitch, emotion, recording quality, room acoustics, background noise. These are generated by neural audio codecs (like EnCodec or SoundStream) that compress audio into discrete codes while preserving enough detail to reconstruct the waveform faithfully.
The hierarchy is straightforward: semantic tokens for meaning, acoustic tokens for fidelity. AudioLM (Borsos et al., 2022) established this two-stage paradigm: first generate semantic tokens (deciding what to say), then generate acoustic tokens conditioned on those semantic tokens (deciding how to say it). This separation lets the model plan content before committing to acoustic details — analogous to how a writer might outline an argument before choosing the exact words.
Vector Quantization and Codebooks
Both semantic and acoustic tokens share a common mechanism: vector quantization (VQ). The audio encoder produces a sequence of continuous feature vectors — one per frame of audio. To turn each continuous vector into a discrete token, we need a codebook: a table of $K$ prototype vectors (also called centroids or codewords), learned during training. To quantize a vector $z$, we find the nearest codebook entry:

$$k = \operatorname*{arg\,min}_{1 \le i \le K} \lVert z - e_i \rVert_2$$

where $e_k$ is the $k$-th codebook entry. The index $k$ becomes the discrete token. With $K = 1024$ entries, each token carries $\log_2(1024) = 10$ bits of information. This is identical in concept to Product Quantization used in vector search (covered in the indexing article), except here we're quantizing audio features rather than document embeddings.
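As a sketch of that lookup in pure Python — with a toy 2D codebook and made-up values; real codebooks have hundreds of dimensions and $K \ge 1024$ entries:

```python
import math

def quantize(z, codebook):
    """Return the index of the nearest codebook entry (the discrete token)."""
    dists = [math.dist(z, e) for e in codebook]
    return min(range(len(codebook)), key=dists.__getitem__)

# Toy codebook with K = 4 entries in 2D.
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0], [2.0, -1.0]]

token = quantize([1.8, -0.7], codebook)
print(token)                     # → 3: entry (2, -1) is closest
print(math.log2(len(codebook)))  # → 2.0 bits carried by each token
```

With $K = 4$ each token carries only 2 bits; scaling the same lookup to $K = 1024$ gives the 10 bits per token quoted above.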
But there's a problem. Audio is rich — a single codebook of 1024 entries can't capture all the variation in pitch, timbre, phoneme identity, and background acoustics simultaneously. Increasing $K$ to, say, $2^{20} \approx 1{,}000{,}000$ entries would give us 20 bits per token, but the codebook becomes enormous and most entries go unused during training (a problem called codebook collapse). We need a smarter way to increase capacity without blowing up the codebook size.
Residual Vector Quantization (RVQ)
The solution is Residual Vector Quantization (RVQ): instead of one large codebook, use multiple smaller codebooks in series. Each codebook quantizes the residual error left by the previous one, progressively refining the approximation:
- Step 1: Quantize $z$ with codebook 1 to get approximation $\hat{z}_1$. Compute residual $r_1 = z - \hat{z}_1$.
- Step 2: Quantize $r_1$ with codebook 2 to get $\hat{z}_2$. Compute residual $r_2 = r_1 - \hat{z}_2$.
- Step $q$: Quantize $r_{q-1}$ with codebook $q$ to get $\hat{z}_q$. Compute residual $r_q = r_{q-1} - \hat{z}_q$.
After $Q$ levels, the final approximation is the sum of all codebook contributions:

$$\hat{z} = \sum_{q=1}^{Q} \hat{z}_q$$
Each level captures finer detail — like progressively increasing the resolution of an image. Level 1 captures the coarse structure (which phoneme, rough pitch), level 2 corrects the biggest errors left by level 1, and so on.
The total bitrate of an RVQ codec is:

$$\text{bitrate} = Q \times \log_2(K) \times f_s \ \text{bits per second}$$
where $Q$ is the number of codebook levels, $K$ is the codebook size (entries per codebook), and $f_s$ is the frame rate in Hz (how many sets of tokens per second). For example, 8 codebooks $\times$ $\log_2(1024) = 10$ bits $\times$ 75 Hz = 6000 bits/s = 6 kbps.
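The formula is easy to sanity-check in code, reusing the numbers from the worked example:

```python
import math

def rvq_bitrate_bps(Q, K, frame_rate_hz):
    """Bitrate in bits/s: Q codebooks, log2(K) bits each, frame_rate_hz frames/s."""
    return Q * math.log2(K) * frame_rate_hz

bps = rvq_bitrate_bps(Q=8, K=1024, frame_rate_hz=75)
print(bps / 1000, "kbps")           # → 6.0 kbps, matching the worked example
print(8 * 75, "tokens per second")  # the LLM-facing cost is Q x frame rate
```

Note that bitrate and token count scale differently: doubling $K$ adds one bit per token but no tokens, while doubling $Q$ doubles both.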
Consider the extreme settings of $Q$. With $Q = 1$ (a single codebook), we get a very coarse approximation — enough to capture rough linguistic content (which phoneme is being spoken) but not enough to faithfully reconstruct the waveform. The audio sounds robotic and loses speaker identity. With $Q = 32$, we get extremely fine detail — near-transparent reconstruction quality, indistinguishable from the original for most listeners. But 32 codebooks at 75 Hz means 32 $\times$ 75 = 2,400 tokens per second of audio. For a 30-second clip, that's 72,000 tokens — a heavy burden for any language model. This is the fundamental trade-off: more RVQ levels means higher fidelity but more tokens for the LLM to process.
The code below demonstrates RVQ on simple 2D points using a small codebook. Watch how the residual error shrinks with each quantization level — each codebook corrects the mistakes of the previous one.
import math, random, json, js

random.seed(42)

# --- Tiny codebook: 4 entries in 2D ---
def make_codebook(n_entries, dim):
    return [[random.uniform(-2, 2) for _ in range(dim)] for _ in range(n_entries)]

def nearest(vec, codebook):
    best_k, best_dist = 0, float('inf')
    for k, entry in enumerate(codebook):
        dist = sum((v - e) ** 2 for v, e in zip(vec, entry)) ** 0.5
        if dist < best_dist:
            best_k, best_dist = k, dist
    return best_k, codebook[best_k], best_dist

def subtract(a, b):
    return [ai - bi for ai, bi in zip(a, b)]

# Create 4 codebooks (one per RVQ level)
Q = 4
K = 4
dim = 2
codebooks = [make_codebook(K, dim) for _ in range(Q)]

# Original vector to quantize
z = [1.7, -0.3]

rows = []
residual = z[:]
total_approx = [0.0] * dim
for q in range(Q):
    k, z_hat, dist = nearest(residual, codebooks[q])
    total_approx = [t + zh for t, zh in zip(total_approx, z_hat)]
    residual = subtract(residual, z_hat)
    total_err = sum((o - a) ** 2 for o, a in zip(z, total_approx)) ** 0.5
    rows.append([
        str(q + 1),
        str(k),
        f"({z_hat[0]:+.3f}, {z_hat[1]:+.3f})",
        f"({residual[0]:+.3f}, {residual[1]:+.3f})",
        f"{total_err:.4f}"
    ])

js.window.py_table_data = json.dumps({
    "headers": ["RVQ Level", "Codebook Index", "Quantized Vector", "Remaining Residual", "Total Error (L2)"],
    "rows": rows
})

print(f"Original vector: ({z[0]:.3f}, {z[1]:.3f})")
print(f"Final approximation: ({total_approx[0]:.3f}, {total_approx[1]:.3f})")
print(f"Error after {Q} levels: {rows[-1][-1]}")
print(f"Error after 1 level: {rows[0][-1]}")
print()
print("Each RVQ level reduces the remaining error by quantizing the residual.")
SoundStream and EnCodec: The Neural Codecs
With vector quantization and RVQ as our building blocks, we can now understand the neural audio codecs that power modern speech and music generation. These codecs share a common architecture: a convolutional encoder compresses raw waveform into a sequence of latent vectors, an RVQ bottleneck discretizes those vectors into tokens, and a convolutional decoder reconstructs the waveform from the quantized representation. The entire system is trained end-to-end with a combination of reconstruction loss, perceptual loss, and adversarial loss (a discriminator that tries to distinguish real audio from reconstructed audio).
SoundStream (Zeghidour et al., 2021) was the first end-to-end neural codec with RVQ. Google's key innovation was structured dropout on quantizer levels: during training, they randomly drop the last $n$ RVQ levels, forcing the earlier levels to carry enough information for reasonable reconstruction on their own. This gives variable bitrate at inference time — use 3 levels for low quality (3 kbps), 6 for medium, or all 12-18 for full quality (18 kbps) — all from a single trained model.
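Here is a minimal sketch of that idea — not SoundStream's actual training code — showing how randomly truncating the RVQ stack during training exercises every prefix length, so each prefix corresponds to a usable bitrate at inference:

```python
import random

def active_levels(num_quantizers, training=True, rng=random):
    """Structured quantizer dropout (sketch): at train time, keep only a random
    prefix of the RVQ stack so the early codebooks learn to stand alone."""
    if not training:
        return num_quantizers  # at inference, any prefix length = any bitrate
    return rng.randint(1, num_quantizers)

random.seed(0)
# Over many training steps, every prefix length 1..12 gets exercised.
seen = {active_levels(12) for _ in range(1000)}
print(sorted(seen))
```

At inference you would simply quantize with the first $n$ codebooks and sum their contributions, exactly as in the RVQ walkthrough above.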
EnCodec (Défossez et al., 2022) from Meta followed a similar architecture but added a multi-scale STFT discriminator for adversarial training — multiple discriminators operating on different time-frequency resolutions of the audio, pushing the codec to preserve both fine temporal detail and broad spectral structure. EnCodec supports 24 kHz mono and 48 kHz stereo, operates at bitrates from 1.5 to 24 kbps, and became the de facto standard codec for a generation of audio language models: VALL-E (text-to-speech), Bark (text-to-audio), and MusicGen (text-to-music) all use EnCodec tokens as their discrete audio representation.
DAC (Descript Audio Codec) (Kumar et al., 2023) pushed quality further with improved discriminators and Snake activations (periodic activation functions that help the decoder model the oscillatory nature of audio waveforms). DAC is a universal codec — it handles speech, music, and environmental sounds in a single model — and achieves higher perceptual quality than EnCodec at the same bitrate.
All these codecs operate at frame rates of 50-75 Hz, meaning they produce one set of RVQ tokens every 13-20 ms of audio. At 75 Hz with 8 codebooks, that's 600 tokens per second — already far fewer than the 16,000-48,000 raw waveform samples per second. These codecs compress audio by roughly $100\times$ while maintaining near-CD quality, and they are the bridge between continuous waveforms and the discrete token sequences that language models can process.
Mimi: Unifying Semantic and Acoustic Tokens
AudioLM's two-stage approach — generate semantic tokens first, then acoustic tokens — works well, but it requires two completely separate tokenization systems: a self-supervised model (like HuBERT) for semantic tokens and a neural codec (like EnCodec) for acoustic tokens. What if we could get both from a single codec?
Mimi, introduced as part of the Moshi system (Défossez et al., 2024), does exactly that. Its architecture combines a SeaNet convolutional encoder-decoder (the same backbone as EnCodec) with an 8-layer transformer bottleneck inserted between the encoder and the RVQ. The transformer processes the encoded features before quantization, giving the model enough capacity to disentangle semantic and acoustic information within a single codec.
The key innovation is Split RVQ: the first codebook is trained with knowledge distillation from WavLM (a self-supervised speech model), so it learns to capture semantic content — what is being said. The remaining 7 codebooks are trained normally with reconstruction loss, so they capture acoustic detail — how it sounds. One codec, two kinds of tokens, no separate HuBERT pipeline needed.
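A heavily simplified sketch of that training signal (toy stand-in vectors, invented for illustration; the real system distills continuous WavLM features with straight-through gradients through the quantizer):

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity: small when two vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (na * nb)

# Codebook 1's output is pulled toward the WavLM teacher feature (semantic),
# while codebooks 2..Q are trained only on reconstruction loss (acoustic).
teacher_feature = [0.2, 0.9, -0.1]  # stand-in for a WavLM embedding
level1_output = [0.1, 1.0, 0.0]     # stand-in for the first codebook's output

semantic_loss = cosine_distance(level1_output, teacher_feature)
print(f"distillation loss on codebook 1: {semantic_loss:.4f}")
```

The point of the split is that only level 1 receives this extra loss, so speaker and channel detail gets pushed into the residual levels rather than polluting the semantic tokens.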
Mimi operates at a frame rate of just 12.5 Hz (one frame per 80 ms), with 8 codebooks of 2048 entries each, producing a bitrate of only 1.1 kbps at 24 kHz. The entire codec is fully streaming with 80 ms algorithmic latency — critical for Moshi's real-time, full-duplex speech generation where the model listens and speaks simultaneously.
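These numbers are consistent with the bitrate formula from earlier — 2048 entries means 11 bits per token:

```python
import math

frame_rate = 12.5  # Hz (one frame per 80 ms)
Q = 8              # codebooks
K = 2048           # entries per codebook -> log2(2048) = 11 bits per token

bitrate_kbps = frame_rate * Q * math.log2(K) / 1000
tokens_per_sec = frame_rate * Q

print(f"{bitrate_kbps} kbps")        # → 1.1 kbps, as reported
print(f"{tokens_per_sec} tokens/s")  # vs 600/s for EnCodec at 75 Hz x 8
```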
The unification matters because it removes a major architectural bottleneck. In AudioLM, the semantic model and the acoustic model are trained separately with different objectives, and mismatches between them can cause artifacts. With Mimi, a single end-to-end trained codec produces both semantic and acoustic tokens in a unified, coherent representation.
Since Mimi, the trend has continued. Newer codecs in 2025 push the unified approach further: Voxtral Codec (Mistral, 2025) operates at 12.5 Hz with 1 semantic codebook and 36 FSQ (Finite Scalar Quantization) acoustic codebooks. DualCodec explicitly optimises for both speech understanding and generation. SpectroStream compresses in the spectral domain rather than the waveform domain. The field is moving fast, but the core idea remains: produce both semantic and acoustic tokens from a single, streaming-capable codec.
The Frame Rate Problem
Here's the practical challenge that drives much of modern audio codec design. A codec running at 75 Hz with 8 codebooks produces $75 \times 8 = 600$ tokens per second of audio. For a 30-second audio clip, that's 18,000 tokens. For a 5-minute podcast segment, that's 180,000 tokens. That's a lot for a language model to process — it eats into the context window, drives up compute cost (attention scales quadratically with sequence length), and slows down generation.
This is why modern codecs are pushing frame rates down aggressively. The table below compares several codecs across the two numbers that matter most for LLM integration: tokens per second and tokens for a 30-second clip.
import math, json, js

codecs = [
    ("EnCodec (2022)", 75.0, 8, 1024),
    ("DAC (2023)", 86.0, 9, 1024),
    ("SoundStream (2021)", 50.0, 12, 1024),
    ("Mimi (2024)", 12.5, 8, 2048),
    ("Voxtral (2025)", 12.5, 37, 2048),
    ("Qwen3-TTS (2025)", 12.0, 1, 8192),
]

rows = []
for name, fps, Q, K in codecs:
    bits_per_cb = math.log2(K)
    tokens_per_sec = fps * Q
    tokens_30s = tokens_per_sec * 30
    bitrate = fps * Q * bits_per_cb / 1000  # kbps
    rows.append([
        name,
        f"{fps}",
        str(Q),
        f"{int(bits_per_cb)}",
        f"{int(tokens_per_sec)}",
        f"{int(tokens_30s):,}",
        f"{bitrate:.1f}"
    ])

js.window.py_table_data = json.dumps({
    "headers": [
        "Codec", "Frame Rate (Hz)", "Codebooks (Q)",
        "Bits/Token", "Tokens/sec", "Tokens (30s)", "Bitrate (kbps)"
    ],
    "rows": rows
})

print("Key takeaway: Mimi produces 100 tokens/sec vs EnCodec's 600.")
print("For a 30-second clip, that's 3,000 vs 18,000 tokens.")
print("Qwen3-TTS goes even further: just 12 tokens/sec with a single large codebook.")
print()
print("Lower frame rate = fewer tokens = more audio fits in the LLM context window.")
The trade-off is clear: a lower frame rate means each token must encode more temporal information (80 ms of audio per frame at 12.5 Hz vs 13 ms at 75 Hz). To compensate, low-frame-rate codecs use larger codebooks ($K = 2048$ or $K = 8192$ instead of $K = 1024$) and/or more expressive encoder architectures (Mimi adds a transformer bottleneck). The information has to go somewhere — either you spread it across many small tokens over time, or you pack it into fewer, richer tokens.
Some systems sidestep the frame rate problem entirely by flattening the RVQ codes: instead of treating each codebook level as a separate stream, they interleave all $Q$ tokens from each frame into a single flat sequence and let the LLM model them autoregressively. Others use delay patterns (MusicGen) or parallel prediction of codebook levels (VALL-E, SoundStorm) to reduce the effective sequence length the LLM sees. We'll cover these generation strategies in later articles.
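To make the interleaving schemes concrete, here is a toy illustration of flattening versus a MusicGen-style delay pattern over a $Q \times T$ grid of tokens (symbolic IDs, illustrative only — real systems work with integer codes and a learned padding token):

```python
Q, T = 3, 4  # 3 codebook levels, 4 frames
grid = [[f"c{q}t{t}" for t in range(T)] for q in range(Q)]

# Flattening: interleave all Q levels of each frame into one long sequence.
# The LLM sees Q*T tokens and models them strictly autoregressively.
flat = [grid[q][t] for t in range(T) for q in range(Q)]
print(flat)  # → ['c0t0', 'c1t0', 'c2t0', 'c0t1', ...], 12 tokens

# Delay pattern: level q is shifted right by q steps, so at each position the
# model predicts all Q levels in parallel; sequence length is T + Q - 1, not Q*T.
delayed = [[grid[q][t - q] if 0 <= t - q < T else "PAD" for t in range(T + Q - 1)]
           for q in range(Q)]
for row in delayed:
    print(row)
```

The delay trick works because level $q$'s token at frame $t$ depends mostly on the coarser levels of the same frame, which (after shifting) were already generated at earlier positions.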
Quiz
Test your understanding of audio tokenization, vector quantization, and neural audio codecs.
In Residual Vector Quantization (RVQ), what does each successive codebook quantize?
A codec uses 8 codebooks, each with $K = 1024$ entries, at a frame rate of 50 Hz. What is the total bitrate?
What is the key innovation of Mimi's Split RVQ compared to standard RVQ?
Why are modern audio codecs pushing toward lower frame rates (e.g. 12.5 Hz instead of 75 Hz)?