TTS as Language Modeling

How do you turn text into a human voice? The traditional answer was a long, fragile pipeline: convert text to phonemes (pronunciation symbols), predict acoustic features (duration, pitch, energy) for each phoneme, feed those into a vocoder (a signal-processing module that synthesises a waveform), and finally produce audio. Each stage had its own model, its own training data, and its own failure modes. If the phoneme converter mispronounced a name, every downstream stage faithfully reproduced that error. If the duration model paused in the wrong place, the vocoder couldn't fix it.

The modern insight is disarmingly simple: treat TTS as a language modeling problem. We already have autoregressive transformers that predict the next text token given previous tokens. What if we fed in text tokens and asked the model to predict audio tokens instead? The entire multi-stage pipeline collapses into a single sequence-to-sequence model.

VALL-E (Wang et al., 2023) demonstrated this idea at scale. The setup: take a phoneme sequence plus a 3-second audio prompt (the voice we want to clone), and generate EnCodec tokens that a neural codec decoder converts back to a waveform. VALL-E uses two models working together:

  • AR model (coarse structure): an autoregressive transformer generates the first RVQ level token by token. This level captures the broad phonetic content, prosody, and speaker identity — the skeleton of the utterance.
  • NAR model (fine detail): a non-autoregressive transformer takes the first-level tokens and predicts all remaining RVQ levels in parallel. These levels add spectral detail, texture, and high-frequency content — the flesh on the skeleton.
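The division of labour between the two stages can be sketched in a few lines of Python. This is a toy sketch only: `ar_model` and `nar_model` are random stand-ins for VALL-E's actual transformers, and the frame count is arbitrary.

```python
import random

NUM_RVQ_LEVELS = 8   # EnCodec-style residual quantisation levels
CODEBOOK_SIZE = 1024
FRAMES = 50          # toy utterance length in codec frames

def ar_model(phonemes, prompt_tokens, history):
    """Stand-in for the AR transformer: emits one level-0 token,
    conditioned on the phonemes, the voice prompt, and prior tokens."""
    return random.randrange(CODEBOOK_SIZE)

def nar_model(phonemes, prompt_tokens, levels_so_far):
    """Stand-in for the NAR transformer: predicts an entire RVQ level
    in a single parallel pass, conditioned on all coarser levels."""
    return [random.randrange(CODEBOOK_SIZE) for _ in range(FRAMES)]

def generate(phonemes, prompt_tokens):
    # Stage 1 (AR): level 0 is generated token by token, left to right.
    level0 = []
    for _ in range(FRAMES):
        level0.append(ar_model(phonemes, prompt_tokens, level0))

    # Stage 2 (NAR): levels 1..7 are each filled in with one parallel pass.
    levels = [level0]
    for _ in range(1, NUM_RVQ_LEVELS):
        levels.append(nar_model(phonemes, prompt_tokens, levels))
    return levels  # an 8 x FRAMES token grid for the codec decoder

tokens = generate(phonemes=["h", "@", "l", "oU"], prompt_tokens=[17, 4, 256])
print(len(tokens), len(tokens[0]))  # 8 50
```

The expensive sequential loop runs only once, for the coarse skeleton; the seven remaining levels cost one forward pass each.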

The key to VALL-E's quality was scale: it was trained on 60,000 hours of English speech from the LibriLight dataset — hundreds of times more data than previous TTS systems such as Tacotron 2 (which used ~25 hours). This massive scale enabled a striking capability: 3-second voice cloning. Give VALL-E just 3 seconds of someone's voice as a prompt, and it can generate new speech in that voice with matching timbre, accent, and speaking style. No fine-tuning, no per-speaker training — just 3 seconds of audio at inference time.

💡 VALL-E 2 (Chen et al., 2024) refined the approach with repetition-aware sampling and grouped code modeling, achieving the first results that matched human speech quality on zero-shot TTS benchmarks (LibriSpeech and VCTK). The gap between synthetic and human speech had effectively closed — at least on controlled benchmarks.

Parallel Generation: SoundStorm and Beyond

Autoregressive generation has a fundamental speed problem. Neural audio codecs like EnCodec typically produce tokens at around 75 tokens per second per RVQ level, and with 8 levels that's 600 tokens per second. Generating 30 seconds of audio means producing 18,000 tokens sequentially — each one waiting for the previous one. Even on fast hardware, this makes real-time streaming difficult.
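The arithmetic behind that bottleneck is worth making concrete:

```python
tokens_per_sec_per_level = 75  # typical EnCodec frame rate
rvq_levels = 8
clip_seconds = 30

tokens_per_sec = tokens_per_sec_per_level * rvq_levels
sequential_steps = tokens_per_sec * clip_seconds
print(tokens_per_sec)     # 600
print(sequential_steps)   # 18000
```

Each of those 18,000 steps is a full forward pass through the model, and none can start before the previous one finishes.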

SoundStorm (Borsos et al., 2023) attacked this problem head-on with fully non-autoregressive, parallel generation. The architecture works in two stages:

  • Input: semantic tokens from AudioLM (a separate model that captures high-level speech content). These serve as the "meaning" scaffold.
  • Generation: SoundStorm uses a MaskGIT-style iterative decoding scheme. At each RVQ level, it predicts ALL tokens simultaneously, keeps the ones it is most confident about, masks the rest, and re-predicts. This repeats for a few iterations, then moves to the next RVQ level (coarse to fine).
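A minimal sketch of the confidence-based masking loop, with a random stand-in for SoundStorm's transformer and a linear unmasking schedule (the paper uses a cosine schedule):

```python
import random

CODEBOOK_SIZE = 1024
MASK = -1  # sentinel for "not yet decided"

def predict_all(tokens):
    """Stand-in for the parallel transformer: returns a (token, confidence)
    guess for every position. The real model conditions on the semantic
    tokens and on every already-committed position."""
    return [(random.randrange(CODEBOOK_SIZE), random.random()) for _ in tokens]

def maskgit_decode_level(seq_len=64, iterations=8):
    """Fill one RVQ level iteratively: predict everything in parallel,
    commit the most confident guesses, re-predict the rest."""
    tokens = [MASK] * seq_len
    for step in range(1, iterations + 1):
        preds = predict_all(tokens)
        # Linear schedule: after this step, `target` positions total
        # should be committed.
        target = seq_len * step // iterations
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        masked.sort(key=lambda i: preds[i][1], reverse=True)  # most confident first
        n_commit = target - (seq_len - len(masked))
        for i in masked[:n_commit]:
            tokens[i] = preds[i][0]
    return tokens

level = maskgit_decode_level()
print(len(level), all(t != MASK for t in level))  # 64 True
```

With 8 iterations per level and 8 levels, a 64-frame sequence costs 64 parallel passes instead of 512 sequential ones — and each pass batches over the whole sequence.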

The result: 100 times faster than AudioLM's autoregressive approach. SoundStorm generates 30 seconds of audio in about 0.5 seconds on a TPU-v4. The quality matches AudioLM on speech naturalness and speaker preservation benchmarks, while the speed makes real-time and even faster-than-real-time generation practical.

Around the same time, the open-source community produced Bark (Suno, 2023): a 3-stage GPT pipeline that maps text to semantic tokens, then semantic tokens to coarse acoustic tokens, then coarse to fine acoustic tokens. Bark is notable for two reasons. First, it's fully open-source (MIT license). Second, it can generate non-speech sounds alongside speech — you can write [laughs], [music], or [sighs] in the input text and Bark will produce the corresponding audio. This hinted at a future where TTS systems handle not just words but the full expressiveness of human communication.

💡 The speed difference between AR and parallel approaches comes down to sequential vs parallel computation. AR models must generate token $n$ before token $n+1$ (a chain of 18,000 dependencies). Parallel models break this chain by predicting many tokens at once and iteratively refining, turning an 18,000-step serial process into perhaps 16-64 parallel passes.

Flow Matching for Speech

Both VALL-E and SoundStorm generate discrete tokens — they quantise audio into codebook entries and predict those entries. But what if we skipped the tokenisation step entirely and generated continuous audio representations directly? This is the idea behind flow matching applied to speech synthesis.

F5-TTS (Chen et al., 2024) uses flow matching to generate mel spectrograms directly from text. The core idea: learn a vector field that transports samples from pure noise to a clean speech mel spectrogram along a straight-line path. At training time, we define the interpolation:

$$\hat{x}_t = (1 - t) \cdot \epsilon + t \cdot x_0$$

where $\epsilon \sim \mathcal{N}(0, I)$ is Gaussian noise, $x_0$ is the target mel spectrogram, and $t \in [0, 1]$ is the timestep. Let's check the boundaries. At $t = 0$: $\hat{x}_0 = (1 - 0) \cdot \epsilon + 0 \cdot x_0 = \epsilon$, which is pure noise — no speech information at all. At $t = 1$: $\hat{x}_1 = (1 - 1) \cdot \epsilon + 1 \cdot x_0 = x_0$, which is the clean mel spectrogram — the speech we want. At $t = 0.5$: $\hat{x}_{0.5} = 0.5 \cdot \epsilon + 0.5 \cdot x_0$, an equal blend of noise and speech. The model learns to push samples along this straight-line path from noise to speech.
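The boundary checks above can be verified numerically. A toy sketch in plain Python, with small vectors standing in for the noise and the mel spectrogram:

```python
eps = [2.0, -4.0, 1.0]   # toy "noise" sample epsilon
x0  = [6.0,  8.0, -3.0]  # toy "clean mel spectrogram" target

def interpolate(t):
    # x_hat_t = (1 - t) * eps + t * x0  (straight-line path)
    return [(1 - t) * e + t * x for e, x in zip(eps, x0)]

print(interpolate(0.0))  # pure noise:   [2.0, -4.0, 1.0]
print(interpolate(1.0))  # clean target: [6.0, 8.0, -3.0]
print(interpolate(0.5))  # equal blend:  [4.0, 2.0, -1.0]
```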

The neural network $v_\theta(\hat{x}_t, t)$ is trained to predict the velocity (direction and magnitude) needed to move along this path. The target velocity is constant and trivially computed:

$$v_{\text{target}} = \frac{d\hat{x}_t}{dt} = x_0 - \epsilon$$

This is just the vector from noise to data — a single direction, constant across time. The training loss is the mean squared error between predicted and target velocities:

$$\mathcal{L} = \mathbb{E}_{t, \epsilon, x_0} \left[ \| v_\theta(\hat{x}_t, t) - (x_0 - \epsilon) \|^2 \right]$$

At inference, we start from pure noise $\hat{x}_0 = \epsilon$ and integrate the learned velocity field forward using an ODE solver (e.g., Euler steps) to arrive at $\hat{x}_1 \approx x_0$, the generated mel spectrogram. Because the paths are straight lines, we need far fewer integration steps than diffusion models (typically 16-32 steps vs 50-100 for DDPM).
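Why straight paths need so few steps becomes visible in a toy sketch: with the *true* constant velocity field, Euler integration is exact regardless of step count. Here `v_true` stands in for a trained $v_\theta$ (a real network only approximates it, which is why a handful of steps are still needed in practice):

```python
eps = [2.0, -4.0, 1.0]   # starting noise sample
x0  = [6.0,  8.0, -3.0]  # clean target we hope to reach

def v_true(x, t):
    """Ideal velocity for the straight-line path: the constant x0 - eps.
    A trained network v_theta(x, t) approximates this from data."""
    return [b - a for a, b in zip(eps, x0)]

def euler_integrate(v, x, steps):
    """Integrate dx/dt = v(x, t) from t = 0 to t = 1 with Euler steps."""
    dt = 1.0 / steps
    t = 0.0
    for _ in range(steps):
        x = [xi + dt * vi for xi, vi in zip(x, v(x, t))]
        t += dt
    return x

print(euler_integrate(v_true, eps, steps=1))   # [6.0, 8.0, -3.0] — exact in one step
print(euler_integrate(v_true, eps, steps=32))  # [6.0, 8.0, -3.0] — same answer
```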

What makes F5-TTS remarkable is its simplicity. It requires no phoneme conversion (raw text input), no duration model (no explicit alignment between text and audio), and no separate text encoder. The trick: pad the input text with filler tokens to match the expected speech length, concatenate it with the noisy mel spectrogram, and let the model figure out the alignment during denoising. The result is a 0.15 real-time factor — it generates 1 second of speech in just 0.15 seconds.
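The padding trick can be sketched with a hypothetical helper (`build_model_input` and the `<pad>` filler token are illustrative names, not F5-TTS's actual implementation):

```python
FILLER = "<pad>"

def build_model_input(text_tokens, mel_frames):
    """Simplified F5-TTS-style conditioning: pad the text to the target
    speech length so it lines up frame-for-frame with the noisy mel
    spectrogram; the model learns the alignment implicitly while denoising."""
    return text_tokens + [FILLER] * (mel_frames - len(text_tokens))

text = ["h", "e", "l", "l", "o"]
print(build_model_input(text, mel_frames=8))
# ['h', 'e', 'l', 'l', 'o', '<pad>', '<pad>', '<pad>']
```

Because the text channel has the same length as the audio channel, no duration predictor or attention-based aligner is needed.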

💡 Flow matching avoids the token bottleneck entirely. There is no RVQ, no codebook, no discrete quantisation — just continuous mel spectrograms transformed from noise to speech. This sidesteps the information loss inherent in quantising audio to a finite set of codebook entries, which can manifest as subtle artifacts in token-based systems.

The 2025 TTS Explosion

By 2025, the ideas from VALL-E, SoundStorm, and F5-TTS had diffused through the field, and TTS research exploded. Dozens of systems appeared, each pushing different frontiers: lower latency, smaller models, longer outputs, more expressive control. Here are the most significant systems and what each contributed:

Orpheus TTS (Canopy AI, 2025) took the simplest possible approach: fine-tune a Llama 3B language model to predict SNAC audio tokens instead of text tokens. No architectural changes — just a language model doing what it does best, except the output vocabulary includes audio codes. With 200ms streaming latency and emotive control tags (you can write <laugh> or <sigh> in the input), Orpheus proved that the "TTS as language modeling" thesis could be executed with off-the-shelf LLMs. Released under Apache 2.0.

Sesame CSM (Sesame AI, March 2025) introduced context-aware conversational speech. Most TTS systems generate each utterance in isolation. CSM conditions on previous dialogue turns — both text and audio — so the generated speech is contextually appropriate. If the previous speaker sounded excited, CSM's response carries a matching energy. This was the first system to treat TTS as a dialogue problem rather than an isolated sentence problem.

VibeVoice (Microsoft, August 2025) combined next-token prediction with diffusion in a novel way: the language model predicts a coarse token, then a lightweight diffusion head refines it into a detailed audio frame. The key innovation was an ultra-low frame rate of 7.5 Hz — just 7.5 audio frames per second, compared to the typical 50-75. Fewer frames means fewer autoregressive steps, enabling generation of 90 minutes of multi-speaker audio (with up to 4 distinct speakers) in a single pass.

Kyutai Pocket TTS (January 2026) attacked the efficiency frontier: a 100-million parameter model that runs in real-time on a CPU — no GPU required. It uses an "Inner Monologue" mechanism where the model first generates an internal reasoning trace before producing speech tokens, improving pronunciation of difficult words and numbers without increasing the audio model's size.

Qwen3-TTS (Alibaba, December 2025) scaled up both data and architecture. Trained on over 5 million hours of speech (nearly 100 times VALL-E's training data), it uses a dual-track streaming architecture with a 12 Hz tokeniser to achieve 97ms first-audio latency. The model supports dozens of languages and can switch between them mid-sentence.

Hume TADA (Hume AI, 2026) introduced a striking architectural constraint: 1:1 text-audio alignment. Each LLM step produces exactly one text token and exactly one audio frame. This makes hallucinations (words that appear in the audio but not in the text) impossible by construction — the model cannot generate audio content that isn't anchored to a text token. This trades some prosodic flexibility for perfect reliability.

Voxtral TTS (Mistral, March 2026) brought flow matching to production scale: a 4-billion parameter model combining flow matching with a custom neural codec, achieving 70ms time-to-first-audio. Released with open weights, it demonstrated that flow-matching TTS could compete with autoregressive approaches at scale.

Several themes emerge from this explosion:

  • Voice cloning from 3 seconds is standard. Every system above supports zero-shot voice cloning from a short audio prompt. What was VALL-E's headline result in 2023 is now table stakes.
  • Streaming latency is under 100ms. Multiple systems achieve sub-100ms time-to-first-audio, making them suitable for real-time conversational agents.
  • Open-source is competitive. Orpheus, F5-TTS, and Sesame CSM are fully open-source with permissive licenses, and their quality rivals or matches commercial APIs.
  • The architecture wars are ongoing. Pure autoregressive (Orpheus), parallel (SoundStorm), flow matching (F5-TTS, Voxtral), and hybrid approaches (VibeVoice) all produce high-quality speech. There is no single winning architecture yet.

The following table summarises the key systems:

# Build the summary table shown on this page; the `js.window` assignment
# below hands the JSON to the hosting page via a browser-side
# Python-to-JavaScript bridge.
import json, js

rows = [
    ["VALL-E (2023)",       "AR + NAR",        "EnCodec RVQ",    "60K hrs",     "3s voice cloning",    "Zero-shot voice cloning at scale"],
    ["SoundStorm (2023)",   "Parallel (MaskGIT)", "SoundStream",  "—",          "0.5s for 30s audio",  "100x faster than AR"],
    ["Bark (2023)",         "3-stage GPT",     "EnCodec",         "—",          "—",                   "Non-speech sounds, open-source"],
    ["F5-TTS (2024)",       "Flow matching",   "Mel spectrogram", "—",          "0.15 RTF",            "No phonemes, no duration model"],
    ["Orpheus (2025)",      "Llama 3B AR",     "SNAC",            "—",          "200ms streaming",     "Off-the-shelf LLM, Apache 2.0"],
    ["Sesame CSM (2025)",   "Context-aware AR","Multi-codebook",  "—",          "—",                   "Dialogue-aware generation"],
    ["VibeVoice (2025)",    "AR + diffusion",  "7.5 Hz tokens",   "—",          "90 min output",       "Ultra-low frame rate"],
    ["Pocket TTS (2026)",   "AR + inner mono.","Custom codec",    "—",          "Real-time on CPU",    "100M params, no GPU needed"],
    ["Qwen3-TTS (2025)",    "Dual-track AR",   "12 Hz tokeniser", "5M+ hrs",   "97ms latency",        "Multilingual, massive scale"],
    ["Hume TADA (2026)",    "1:1 alignment",   "Custom codec",    "—",          "—",                   "Zero hallucinations by design"],
    ["Voxtral TTS (2026)",  "Flow matching",   "Custom codec",    "—",          "70ms first-audio",    "4B params, open weights"],
]

js.window.py_table_data = json.dumps({
    "headers": ["System", "Architecture", "Audio Repr.", "Training Data", "Speed", "Key Innovation"],
    "rows": rows
})

print("Modern TTS systems span multiple architecture families.")
print("Voice cloning from 3 seconds of audio is now a standard capability.")
print("Streaming latency under 100ms enables real-time conversation.")

Voice Cloning and Safety

Every system in the table above can clone a voice from roughly 3 seconds of audio. This is a remarkable engineering achievement — and a serious safety concern. If anyone can generate speech in anyone else's voice from a few seconds of publicly available audio (a podcast clip, a conference talk, a social media video), the potential for misuse is significant: deepfake audio for fraud, impersonation of public figures, fabricated evidence, and social engineering attacks that sound exactly like a trusted colleague or family member.

The research community and industry have developed several mitigations, though none is a complete solution:

  • Audio watermarking: embed inaudible signatures in generated audio that can be detected by verification tools but are imperceptible to human listeners. Google's SynthID and Meta's AudioSeal both implement this approach. The challenge: watermarks can sometimes be removed by re-encoding the audio or applying filters.
  • Speaker verification: train classifiers to distinguish real from synthetic speech. These work well on known synthesis methods but struggle with novel generators (the arms-race problem: as TTS improves, detection must improve in lockstep).
  • Access controls: many commercial TTS APIs require identity verification, limit voice cloning to pre-registered voices, or add mandatory disclosures. OpenAI initially withheld their voice cloning model entirely, citing safety concerns.
  • Regulation: the EU AI Act classifies deepfake generation as a transparency obligation (generated content must be labeled). Several US states have enacted laws specifically targeting voice cloning for fraud.

The tension is real: the same capability that enables a visually impaired person to hear any document in a familiar voice, or allows a person who has lost their voice to disease to continue speaking in their own voice, also enables impersonation attacks. Most responsible TTS releases now include both watermarking and content policies, but the technology is increasingly open-source and the genie is difficult to put back in the bottle.

💡 A positive use case worth highlighting: voice banking. Patients diagnosed with ALS or throat cancer can record their voice before losing it, and TTS systems can then generate speech in their preserved voice indefinitely. This application has transformed quality of life for thousands of patients.

Quiz

Test your understanding of modern speech synthesis.

In VALL-E's two-model architecture, what does the autoregressive (AR) model generate?

How does SoundStorm achieve 100x speedup over autoregressive audio generation?

In the flow matching interpolation $\hat{x}_t = (1 - t) \cdot \epsilon + t \cdot x_0$, what does the model produce at $t = 0$?

What key advantage does F5-TTS's flow matching approach have over token-based systems like VALL-E?