From Cascaded Pipelines to Native Multimodality
How did voice assistants work before 2024? The answer is a three-stage cascade: a speech-to-text (STT) model transcribes the user's audio into text, a large language model (LLM) reasons over that text and produces a text reply, and a text-to-speech (TTS) model synthesises the reply back into audio. Three separate models, three latency steps, and three boundaries where information is lost. The STT model strips away prosody, emotion, hesitation, laughter, and tone when it converts speech to a flat string. The LLM never sees any of that paralinguistic signal. And the TTS model must invent a plausible delivery from scratch, with no knowledge of how the user sounded or what emotional register would be appropriate. The result is functional but robotic: a system that can answer questions but cannot truly converse.
Consider a concrete example. A user says "I'm fine" in a trembling, sarcastic tone. The STT model faithfully produces the text "I'm fine". The LLM sees those two words, takes them at face value, and responds with something cheerful. The sarcasm, the tremor, the emotional subtext — all lost at the first boundary. A human listener would immediately detect the mismatch between words and delivery. A cascaded system cannot, because the text bottleneck discards everything that makes speech expressive.
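The information loss can be made concrete with a toy sketch of the cascade. All three functions below are illustrative stand-ins, not real model APIs; the point is simply that the prosody field never crosses the STT boundary, so nothing downstream can react to it.

```python
# Toy cascade: each stage is a stand-in for a real model.

def stt(audio: dict) -> str:
    # A real STT model returns only the transcript; prosody is discarded.
    return audio["words"]

def llm(text: str) -> str:
    # The LLM sees only the flat string and takes it at face value.
    return "Glad to hear it!"

def tts(reply: str) -> dict:
    # TTS must invent a delivery with no knowledge of how the user sounded.
    return {"words": reply, "prosody": "default-cheerful"}

user_audio = {"words": "I'm fine", "prosody": "trembling, sarcastic"}

transcript = stt(user_audio)      # the prosody field is lost right here
reply = tts(llm(transcript))

print(transcript)                 # "I'm fine" — the tone is gone
print(reply["prosody"])           # invented, not grounded in the user
```

An omni model, by contrast, would receive the equivalent of the whole `user_audio` dictionary as audio tokens, tone included.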
The new paradigm is the omni model (sometimes called a natively multimodal model): a single neural network that processes audio, text, and often vision as interleaved token sequences. Instead of translating between modalities at each boundary, the model operates on a unified representation where acoustic features, linguistic content, and visual information coexist in the same embedding space. The sarcastic "I'm fine" arrives as audio tokens that carry both the words and the tone, and the model reasons over both simultaneously.
Why does this matter beyond convenience? Because emotional nuance, speaker identity, background noise, code-switching between languages, and conversational dynamics like interruptions and backchanneling are all encoded in the acoustic signal. A text transcript is a lossy compression that discards most of this. Native multimodality preserves it. The key insight is that multimodal understanding requires unified representation, not sequential translation between modalities. Just as vision-language models demonstrated that projecting image features into a shared embedding space with text yields richer understanding than captioning images and feeding captions to an LLM, omni models demonstrate the same principle for audio: keeping the raw signal intact lets the model learn cross-modal correlations that no cascade can capture.
GPT-4o and GPT-5: The Commercial Frontier
When OpenAI launched GPT-4o in May 2024, it marked the first time a major commercial model natively processed text, audio, and vision within a single neural network (OpenAI, 2024). Previous iterations of ChatGPT voice used the cascaded approach: Whisper for STT, GPT-4 for reasoning, and a separate TTS model for output. GPT-4o collapsed all three into one autoregressive model that operates on interleaved token streams of text, audio, and image tokens. The most immediately noticeable difference was latency — GPT-4o averaged 320 milliseconds of response time, comparable to a human conversational pause, versus the multi-second delay of the cascaded pipeline.
OpenAI has not published the architectural details of GPT-4o, but its behaviour reveals several properties. It can sing, laugh, whisper, and modulate emotional tone in real time, which requires the model to have direct control over acoustic generation rather than delegating to a separate TTS system. It handles code-switching between languages within a single utterance. And it reasons about audio content (identifying speakers, describing sounds, transcribing music) and visual content (describing images, reading text) within the same forward pass. The "o" in GPT-4o stands for "omni", and the model lives up to the name: it accepts any combination of text, audio, and image inputs and can produce text and audio outputs.
In September 2025, OpenAI introduced gpt-realtime, a dedicated end-to-end speech-to-speech model optimised for real-time conversational use. Unlike a pipeline stitching STT and TTS around a text model, gpt-realtime is a single model for audio-in to audio-out, purpose-built for voice interaction. It scored 82.8% on Big Bench Audio, compared to 65.6% for the previous model, demonstrating substantially stronger audio understanding. The model captures non-verbal cues, code-switches mid-sentence, adapts tone to the emotional register of the conversation, and supports sessions lasting up to 60 minutes — long enough for an entire tutoring lesson or therapy session.
Then in August 2025, GPT-5 arrived as a natively multimodal model trained from scratch on text, image, audio, and video simultaneously. Where GPT-4o added audio and vision capabilities to what was fundamentally a text model, GPT-5 was designed from the ground up with all modalities as first-class citizens. The distinction matters architecturally: rather than learning text first and then aligning other modalities to the text representation (the approach used by most open-source multimodal models), GPT-5's representations are inherently cross-modal from the earliest layers. OpenAI reported that this leads to stronger cross-modal reasoning — the model can, for example, hear a question about an image while watching a video and produce a spoken answer that references all three inputs coherently.
Gemini: Google's Multimodal Approach
Google's Gemini family (Gemini Team, 2023) took a different path to native multimodality. From the original Gemini 1.0 paper, the architecture was a decoder-only transformer trained jointly on all modalities from the start. Text, image, audio, and video data were interleaved during pretraining, so the model learned cross-modal associations from its earliest training steps rather than acquiring them through later fine-tuning. Audio input was processed through features from Google's Universal Speech Model (USM), which converted raw audio into a sequence of acoustic feature tokens that could be interleaved with text and image tokens in the same context window.
The defining technical strength of Gemini has always been its context length. Gemini 1.5 Pro supported a 1 million token context window, and subsequent versions pushed even further. For audio processing, this is transformative: a million tokens can accommodate hours of audio, meaning the model can process an entire podcast, lecture, or meeting recording in a single pass without chunking. This eliminates the retrieval and stitching problems that plague shorter-context models when handling long audio.
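A rough capacity check makes the long-context claim concrete. The token rate below is an assumption (around 25 audio tokens per second is a common figure for acoustic feature tokens); Google has not published exact numbers for Gemini's audio tokenisation.

```python
# Back-of-envelope: how much audio fits in a 1M-token context window?
# Assumes ~25 audio tokens/sec — an illustrative figure, not a published one.

context_tokens = 1_000_000
tokens_per_second = 25

seconds = context_tokens / tokens_per_second
hours = seconds / 3600

print(f"{hours:.1f} hours of audio in one context window")
```

Even at this coarse estimate, a full working day of meetings fits in a single pass, which is why chunking and retrieval stitching become unnecessary.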
In 2025, Google released Gemini 2.5 with native audio output, finally closing the loop from audio-in to audio-out within a single low-latency model — no separate STT or TTS pipeline required. The native audio capabilities include multi-speaker TTS (the model can produce two distinct voices in the same response, enabling dialogue generation), a proactive audio mode that intelligently filters background speech and noise to focus on the primary speaker, and support for 24+ languages with seamless code-switching — the model can begin a sentence in English and finish it in Mandarin without any mode-switching signal.
Gemini's approach to audio thus differs from OpenAI's primarily in its emphasis on long-context processing and the depth of its joint pretraining. While GPT-4o and GPT-5 optimise heavily for real-time conversational latency, Gemini's strength lies in its ability to reason over very long audio inputs — an entire audiobook, a multi-hour meeting — and produce coherent responses that reference specific moments across the full recording.
Qwen2.5-Omni and Qwen3-Omni: Open-Source State of the Art
While GPT-4o and Gemini advanced the commercial frontier, the open-source frontier was being pushed just as aggressively by Alibaba's Qwen team. Qwen2.5-Omni (Xu et al., 2025), released in March 2025, introduced a novel Thinker-Talker architecture that cleanly separates reasoning from speech generation. Why the split? Because generating high-quality text and high-quality speech simultaneously from a single decoder creates interference: the text generation objective pulls the model toward linguistic precision, while the speech generation objective pulls toward acoustic naturalness, and optimising for both at once degrades each. The Thinker-Talker architecture lets each component focus on its strength.
The Thinker is the main LLM backbone augmented with modality-specific encoders for audio and vision. It takes in audio tokens (from a Whisper-based encoder), image tokens (from a vision encoder), and text tokens, processes them through a standard transformer, and produces text reasoning output — the "thinking" part. The Talker is a separate, smaller decoder that receives the Thinker's hidden representations as conditioning input and generates streaming audio tokens, which are then converted to waveforms by a Diffusion Transformer (DiT) vocoder. The Thinker reasons; the Talker speaks. Information flows from Thinker to Talker but not the reverse, creating a clean directional dependency.
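The one-way information flow can be sketched structurally. This is an illustrative stand-in, not the actual Qwen2.5-Omni implementation: the Talker consumes the Thinker's hidden state, and nothing flows back.

```python
# Structural sketch of the Thinker-Talker split (illustrative only).

class Thinker:
    """Stand-in for the main LLM backbone with modality encoders."""
    def forward(self, audio_tokens, image_tokens, text_tokens):
        # A real Thinker runs transformer attention; here we just pool
        # the interleaved inputs into a "hidden state" and emit text.
        hidden = {"context": audio_tokens + image_tokens + text_tokens}
        return "It sounds like you're upset.", hidden

class Talker:
    """Stand-in for the smaller streaming speech decoder."""
    def forward(self, thinker_hidden):
        # Conditioned on Thinker hidden states, emit discrete audio
        # tokens; a DiT vocoder would turn these into a waveform.
        return [f"audio_tok_{i}" for i in range(4)]

thinker, talker = Thinker(), Talker()
text, hidden = thinker.forward(["a1", "a2"], ["i1"], ["hi"])
audio_tokens = talker.forward(hidden)   # Talker depends on Thinker...
print(text)
print(audio_tokens)                     # ...but never the reverse
```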
A key technical contribution of Qwen2.5-Omni is TMRoPE (Time-aligned Multimodal Rotary Position Embedding). In standard transformers, positional encodings simply count tokens sequentially. But audio and video arrive at different temporal resolutions — audio might produce 25 tokens per second while video produces 2 frames per second. TMRoPE assigns positional encodings based on real-world timestamps rather than token position, ensuring that an audio token at $t = 3.5$ seconds and a video frame at $t = 3.5$ seconds receive similar position encodings. This temporal alignment lets the model learn precise audio-visual correspondences: lip movements synchronised with phonemes, gestures timed with speech emphasis, and environmental sounds matched to visual events.
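A toy calculation shows the core idea (not the actual TMRoPE formulation, which applies rotary embeddings across multiple axes): deriving positions from timestamps rather than token indices keeps streams with different token rates aligned.

```python
# Timestamp-derived positions: streams at different rates line up in time.

def time_aligned_positions(n_tokens, tokens_per_second, ticks_per_second=25):
    # Position = the token's real-world timestamp quantised to a shared
    # tick grid, NOT its index within its own stream.
    return [round(i / tokens_per_second * ticks_per_second)
            for i in range(n_tokens)]

audio_pos = time_aligned_positions(100, tokens_per_second=25)  # 4s of audio
video_pos = time_aligned_positions(8, tokens_per_second=2)     # 4s of video

# The audio token at t = 3.0s and the video frame at t = 3.0s
# land on the same position, despite very different stream indices:
print(audio_pos[75], video_pos[6])
```

With sequential counting, the video frame at $t = 3.0$ seconds would sit at position 6 while the co-occurring audio token sits at position 75, and the model would have to learn the alignment from scratch.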
In September 2025, the Qwen team followed up with Qwen3-Omni (Xu et al., 2025), a major architectural upgrade. Qwen3-Omni uses a Mixture-of-Experts (MoE) architecture with 30 billion total parameters but only 3 billion active per token — a 10:1 sparsity ratio that keeps inference cost manageable while providing enormous model capacity. This MoE design means that different expert sub-networks can specialise in different modalities or tasks without interfering with each other, which is particularly valuable for an omni model that must handle such diverse inputs.
Perhaps the most significant change in Qwen3-Omni is the replacement of Whisper with a new AuT (Audio Transformer) encoder trained from scratch on 20 million hours of audio data. Whisper, while excellent for transcription, was trained primarily on English-heavy data with a transcription objective. AuT was designed specifically for the omni model's needs: multilingual audio understanding (not just transcription), music and sound event recognition, speaker diarisation, and emotion detection. The results speak for themselves — Qwen3-Omni achieved state-of-the-art performance on 32 out of 36 audio benchmarks, outperforming both Gemini 2.5 Pro and GPT-4o on audio understanding tasks.
In December 2025, a lightweight Qwen3-Omni-Flash variant was released, targeting deployment scenarios where full model capacity is unnecessary. The Flash variant trades some benchmark performance for substantially lower latency and memory requirements, making omni-model capabilities accessible on more modest hardware.
Moshi: Full-Duplex Spoken Dialogue
All the models discussed so far — GPT-4o, Gemini, Qwen-Omni — share a fundamental assumption: conversation is turn-based. The user speaks, the model listens, the model responds, and then it waits for the user again. But real human conversation is nothing like this. Humans overlap, interrupt, backchannel ("uh-huh", "right", "mm-hmm"), and the listener's vocalisations are not noise to be filtered out — they are an essential part of conversational dynamics. Moshi (Defossez et al., 2024), developed by Kyutai Labs, is the first speech-text foundation model designed from the ground up for full-duplex real-time dialogue — meaning it can listen and speak simultaneously, just as humans do.
Moshi's architecture has three components that work together. The first is Helium, a 7 billion parameter text LLM that serves as the linguistic backbone. Helium was pretrained on text data and provides the world knowledge and reasoning capabilities that the system needs. The second component is Mimi, a streaming neural audio codec that compresses speech into discrete tokens at just 12.5 Hz (12.5 frames per second, each frame spanning 80ms of audio). Mimi produces tokens at two levels: semantic tokens that capture what is being said (linguistic content, speaker identity, emotion) and acoustic tokens that capture how it sounds (pitch contour, room acoustics, fine-grained timbre). This semantic-acoustic split is achieved by distilling a self-supervised speech model (like WavLM) into the first codebook while letting the remaining codebooks learn acoustic residuals.
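The token-budget arithmetic is worth spelling out. The 12.5 Hz rate and 80ms frames come from the text; the 8-codebook total (one semantic plus seven acoustic levels) matches the per-stream codebook count described for Moshi's full-duplex operation.

```python
# Token budget for a Mimi-style codec at 12.5 Hz with 8 codebooks.

frame_rate_hz = 12.5
frame_ms = 1000 / frame_rate_hz       # 80.0 ms of audio per frame
codebooks = 8                         # 1 semantic + 7 acoustic levels

seconds = 60
frames = int(frame_rate_hz * seconds)     # frames in one minute
audio_tokens = frames * codebooks         # discrete tokens in one minute

print(f"{frame_ms:.0f} ms/frame, {frames} frames/min, {audio_tokens} tokens/min")
```

Keeping the frame rate this low is what makes joint text-audio modelling tractable: a minute of speech costs thousands of tokens, not hundreds of thousands.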
The third and most novel component is the RQ-Transformer (Residual Quantization Transformer), which consists of two nested transformers. The Temporal Transformer processes the sequence of time steps — it sees one aggregated representation per 80ms frame and models the temporal dependencies across the conversation. The Depth Transformer operates within each time step, autoregressively generating the multiple codebook levels for that frame. At each time step, the Temporal Transformer produces a context vector, and the Depth Transformer uses it to generate: first a text token, then the semantic audio token, then the acoustic audio tokens across multiple codebook levels. This two-level design avoids the quadratic cost of treating all codebook tokens as a flat sequence (which would multiply the sequence length by the number of codebooks).
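A rough cost comparison illustrates why the factorisation matters, using a simple quadratic-attention cost model (illustrative arithmetic only, not measured FLOPs; the 17 sub-sequences per step are those used in full-duplex operation, described later).

```python
# Attention-cost comparison: flat sequence vs. time/depth factorisation.
# Cost model: attention over N tokens ~ N^2 (illustrative only).

T = 750    # time steps in one minute at 12.5 Hz
Q = 17     # sub-sequences per step (1 text + 8 own audio + 8 user audio)

flat_cost = (T * Q) ** 2              # one transformer over all T*Q tokens
factored_cost = T**2 + T * Q**2       # Temporal over T steps, plus a
                                      # Depth pass over Q levels per step

print(f"flat:     {flat_cost:,}")
print(f"factored: {factored_cost:,}")
print(f"ratio:    {flat_cost / factored_cost:.0f}x")
```

The exact ratio depends on the cost model, but the shape of the saving is general: the flat sequence pays for cross-codebook attention at every pair of time steps, while the factored design only pays for it within each frame.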
The breakthrough innovation in Moshi is Inner Monologue: at each time step, the model predicts a chain of text token → semantic audio token → acoustic audio tokens. The text tokens are time-aligned with the speech using word-level timestamps — when Moshi is producing audio for the word "hello", the corresponding text token "hello" is generated at the same time step, with PAD tokens filling the gaps between words where no new text token is needed. This chain serves as an "inner monologue" that dramatically improves linguistic quality. Without it, the model must learn language purely from audio tokens, which is far harder than learning from text. With it, the text channel acts as a scaffold that guides the audio generation toward coherent, well-formed language.
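A hypothetical four-frame stream (made-up token names) shows the alignment, and why sampling only the text channel recovers a transcript for free.

```python
# Four 80ms frames of a time-aligned Inner Monologue stream.
# Token names are invented for illustration.

steps = [
    {"text": "hello", "semantic": "s0", "acoustic": ["a0", "a1"]},
    {"text": "PAD",   "semantic": "s1", "acoustic": ["a2", "a3"]},  # word continues
    {"text": "there", "semantic": "s2", "acoustic": ["a4", "a5"]},
    {"text": "PAD",   "semantic": "s3", "acoustic": ["a6", "a7"]},
]

# Reading only the text channel yields a streaming transcript:
transcript = " ".join(s["text"] for s in steps if s["text"] != "PAD")
print(transcript)
```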
An elegant consequence of Inner Monologue is that it enables emergent capabilities without any additional training. If you sample only the text tokens from Moshi's output, you get streaming automatic speech recognition (ASR). If you feed text tokens ahead of audio tokens, you get streaming text-to-speech (TTS). These are not separate modes — they fall out naturally from the model's joint text-audio token generation.
The most radical design choice in Moshi is full-duplex multi-stream generation. At every time step, the model produces tokens for both its own speech and the user's speech — simultaneously. Concretely, each time step involves 17 sub-sequences: 1 text token (Inner Monologue) + 8 codebook levels for Moshi's own audio + 8 codebook levels for the user's audio. The model doesn't just generate its response — it also predicts what the user is saying, which means it maintains an internal model of both sides of the conversation at all times. There is no explicit turn-taking mechanism. If the user interrupts, the model sees the user's audio tokens change from silence to speech and can react immediately, adjusting or stopping its own generation. If the user backchannels ("uh-huh"), the model recognises it as acknowledgment rather than interruption and continues speaking.
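One generation step can be sketched as follows (hypothetical token names; the 17-slot count is the figure from the text):

```python
# One full-duplex step: the model fills all 17 slots every 80ms frame,
# including a *prediction* of the user's audio stream.

def duplex_step(frame_idx, user_is_speaking):
    return {
        "text": f"txt_{frame_idx}",                         # Inner Monologue
        "moshi_audio": [f"m{frame_idx}_{q}" for q in range(8)],
        "user_audio_pred": (
            [f"u{frame_idx}_{q}" for q in range(8)]
            if user_is_speaking
            else ["SIL"] * 8                                # predicted silence
        ),
    }

frame = duplex_step(0, user_is_speaking=False)
n_slots = 1 + len(frame["moshi_audio"]) + len(frame["user_audio_pred"])
print(n_slots)
```

When the predicted user stream flips from silence tokens to speech tokens, that change is itself the interruption signal — no separate turn-taking controller is needed.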
The result is remarkably low latency: 160ms theoretical (limited by the 80ms codec frame size and one frame of look-ahead) and roughly 200ms in practice. This is faster than human conversational reaction time (typically 200-300ms), making Moshi feel genuinely responsive in real-time dialogue.
In March 2025, MoshiVis extended Moshi with a vision encoder, adding the ability to see and discuss visual content while maintaining real-time speech capabilities. The vision tokens are interleaved with the audio and text tokens in the same sequence, and the model can reference what it sees in its spoken responses — for example, describing objects in a live camera feed while maintaining a fluid conversation.
LongCat and the Expanding Frontier
Beyond the major model families described above, 2025 saw an explosion of specialised omni models pushing the boundaries in different directions. Each addresses a specific limitation of the earlier architectures, and together they map the expanding frontier of what omni models can do.
Meituan's LongCat-Flash-Omni (November 2025) is a massive-scale omni model with 560 billion total parameters but only 27 billion active per token, using a Scalable Mixture-of-Experts (ScMoE) architecture with zero-computation experts — expert slots that exist in the routing table but require no FLOPs when selected, acting as a form of implicit regularisation. LongCat-Flash-Omni uses Multi-Head Latent Attention (MLA), the attention variant introduced by DeepSeek that compresses the KV cache by projecting keys and values into a low-dimensional latent space before caching. This is critical at LongCat's scale: with a 128K token context window supporting 8+ minutes of real-time audio-visual interaction, an uncompressed KV cache would be prohibitively large. The model also employs multi-codebook audio detokenisation at coarser temporal resolution — rather than generating audio tokens at the full 12.5 Hz used by Moshi's Mimi codec, LongCat generates tokens at a coarser rate and upsamples, trading a small amount of audio fidelity for substantially lower sequence length and thus faster generation.
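Back-of-envelope KV-cache arithmetic shows why latent compression matters at this scale. All sizes below are illustrative assumptions, not LongCat's published configuration.

```python
# KV-cache size: standard per-head K/V caching vs. an MLA-style
# shared latent vector per token. All dimensions are assumed.

context_len = 128_000     # tokens (the 128K window from the text)
n_layers = 64             # assumed
n_heads = 64              # assumed
head_dim = 128            # assumed
latent_dim = 512          # assumed compressed KV latent per token
bytes_per_val = 2         # fp16

# Standard cache: keys AND values, per head, per layer, per token.
standard = context_len * n_layers * 2 * n_heads * head_dim * bytes_per_val

# MLA-style cache: one shared latent vector per token, per layer;
# keys/values are re-derived from the latent at attention time.
latent = context_len * n_layers * latent_dim * bytes_per_val

print(f"standard: {standard / 1e9:.0f} GB")
print(f"latent:   {latent / 1e9:.1f} GB")
print(f"compression: {standard // latent}x")
```

Under these assumptions the uncompressed cache alone exceeds the memory of any single accelerator, while the latent cache fits comfortably — which is the practical argument for MLA at 128K context.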
InteractiveOmni (2025) takes a different approach, targeting the 4B-8B parameter range with a unified architecture for multi-turn audio-visual interaction. Where LongCat goes big with MoE, InteractiveOmni stays small and dense, optimising for deployment on consumer devices. The model processes audio, video, and text in interleaved turns and can maintain coherent dialogue over multiple exchanges, tracking referents across modalities ("the red car you mentioned" + pointing gesture in video + spoken clarification).
Hume AI's EVI 3 (2025) pushes the emotional intelligence frontier. EVI 3 is a speech-to-speech model specifically optimised for empathy and emotional attunement. In human evaluations, it was rated higher than GPT-4o on empathy, emotional understanding, and conversational naturalness. The model explicitly models the user's emotional state from vocal cues — speech rate, pitch variation, breathiness, micro-pauses — and adjusts its own vocal delivery accordingly. If a user sounds anxious, EVI 3 slows down, softens its tone, and uses more reassuring phrasing. If they sound excited, it matches their energy. This emotional mirroring is not scripted — it emerges from training on large-scale human conversation data with emotion annotations.
The trend across all these models is clear: omni models are simultaneously getting smaller (MoE architectures with high sparsity ratios keep active parameter counts manageable), faster (streaming architectures with sub-200ms latency), and more emotionally aware (explicit modelling of prosody, affect, and conversational dynamics). The cascaded STT-LLM-TTS pipeline that dominated voice AI for a decade is being replaced not by one architecture but by a diverse ecosystem of native multimodal models, each exploring a different point in the design space.
The Architecture Landscape
With so many models now in the omni space, it is worth stepping back and mapping the architectural landscape. The table below summarises the major omni models as of late 2025, comparing their key characteristics.
| Model | Org | Date | Params | Input Modalities | Output Modalities | Audio Approach | Latency | Open Source |
|---|---|---|---|---|---|---|---|---|
| GPT-4o | OpenAI | May 2024 | Undisclosed | Text, Audio, Image | Text, Audio | Native autoregressive | ~320ms | No |
| gpt-realtime | OpenAI | Sep 2025 | Undisclosed | Audio | Audio, Text | End-to-end speech-to-speech | ~200ms | No |
| GPT-5 | OpenAI | Aug 2025 | Undisclosed | Text, Audio, Image, Video | Text, Audio | Native joint pretraining | Undisclosed | No |
| Gemini 2.5 | Google | 2025 | Undisclosed | Text, Audio, Image, Video | Text, Audio | Joint pretraining + USM | ~300ms | No |
| Qwen2.5-Omni | Alibaba | Mar 2025 | ~7B | Text, Audio, Image, Video | Text, Audio | Thinker-Talker + DiT | ~300ms | Yes |
| Qwen3-Omni | Alibaba | Sep 2025 | 30B total / 3B active | Text, Audio, Image, Video | Text, Audio | Thinker-Talker + MoE | ~200ms | Yes |
| Moshi | Kyutai | Oct 2024 | 7B | Audio | Audio, Text | RQ-Transformer + Mimi | ~200ms | Yes |
| MoshiVis | Kyutai | Mar 2025 | ~8B | Audio, Image | Audio, Text | RQ-Transformer + vision | ~200ms | Yes |
| LongCat-Flash-Omni | Meituan | Nov 2025 | 560B total / 27B active | Text, Audio, Image, Video | Text, Audio | ScMoE + MLA | ~250ms | Partial |
| EVI 3 | Hume AI | 2025 | Undisclosed | Audio | Audio | Speech-to-speech | ~200ms | No |

Major omni models as of late 2025: ten models surveyed, of which five are open-source or partially open.
Looking across these models, three distinct architectural patterns have emerged for building omni models, each with different tradeoffs:
Pattern 1: Encoder + LLM. This is the simplest approach, used by earlier models like Qwen-Audio and Qwen2-Audio: an audio encoder (typically Whisper) projects audio features into the LLM's embedding space via a linear projection or small adapter network, and the LLM processes the audio embeddings alongside text tokens. This is essentially the same approach used by vision-language models for image inputs. The advantage is simplicity — you can take any pretrained LLM and any pretrained audio encoder and connect them with minimal new parameters. The limitation is that this architecture is input-only for audio: the model can understand speech but cannot generate it. You still need a separate TTS model for audio output, which means you're back to a partial cascade for speech-to-speech applications.
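A minimal sketch of the pattern, with a toy fixed-weight linear map standing in for the learned adapter (shapes and values are invented for illustration):

```python
# Pattern 1 in miniature: project encoder features into the LLM's
# embedding space, then interleave them with text embeddings.

def project(features, W):
    # features: list of d_audio vectors; W: d_audio x d_model matrix.
    return [[sum(f[i] * W[i][j] for i in range(len(f)))
             for j in range(len(W[0]))]
            for f in features]

d_audio, d_model = 4, 6
W = [[0.1] * d_model for _ in range(d_audio)]     # toy adapter weights
audio_features = [[1.0, 2.0, 3.0, 4.0]] * 3       # 3 encoder frames

audio_embeds = project(audio_features, W)          # now in LLM space
print(len(audio_embeds), len(audio_embeds[0]))     # 3 frames, d_model dims
```

Note what is missing: there is no path from LLM states back to audio, which is exactly why this pattern is input-only and still needs external TTS.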
Pattern 2: Thinker-Talker. Used by Qwen2.5-Omni and Qwen3-Omni, this pattern separates the model into a reasoning component (the Thinker, a full LLM that outputs text) and a speech generation component (the Talker, a smaller decoder that converts the Thinker's representations into audio tokens). The key advantage is clean separation of concerns: the Thinker can be evaluated and improved using standard text benchmarks, and the Talker can be evaluated on speech quality metrics, without the two objectives interfering. The disadvantage is that the Talker adds latency (it must wait for Thinker representations before generating speech) and the directional flow from Thinker to Talker means the model cannot easily use acoustic features to inform its reasoning — the thinking happens in text space.
Pattern 3: Unified Multi-Stream. Used by Moshi (and likely GPT-4o), this pattern treats all modalities as interleaved tokens in a single autoregressive model with no separation between understanding and generation. The model simultaneously reads input audio, produces output audio, and generates text as an "inner monologue". This is the most elegant architecture — there are no adapter layers, no separate encoders, no directional constraints — but it is also the hardest to train. The model must learn language, acoustics, conversational dynamics, and audio synthesis all at once, from a unified training objective. Getting this to work well required Moshi's careful innovations: the Mimi codec (to keep the audio token rate manageable), the RQ-Transformer (to factorise time and depth), and Inner Monologue (to scaffold linguistic quality with text tokens).
Several open questions remain. Will one pattern dominate, or will they coexist for different use cases (Thinker-Talker for high-accuracy assistants, Unified Multi-Stream for real-time conversation)? Can we achieve Moshi-quality full-duplex conversation at Qwen3-Omni's scale? Can open-source omni models match or surpass GPT-5's cross-modal reasoning? And as context windows grow to millions of tokens, will specialised audio encoders remain necessary, or will end-to-end models learn to process raw waveforms directly? The omni model space is evolving faster than any other area in AI, and the architectural question is far from settled.
Quiz
Test your understanding of omni model architectures and their design tradeoffs.
What is the primary limitation of a cascaded STT → LLM → TTS pipeline compared to a native omni model?
What is Moshi's 'Inner Monologue' mechanism?
Why does Qwen2.5-Omni use a Thinker-Talker split instead of a single unified model?
What makes Moshi's full-duplex approach fundamentally different from other omni models?