Connecting Audio to Language

We now have powerful audio encoders (Whisper, wav2vec 2.0) that turn raw waveforms into rich feature sequences, and we have large language models (LLMs) that reason over text. The obvious next question: how do we connect them so the LLM can reason about what it hears?

This is exactly the same problem that Vision-Language Models (VLMs) solved for images. The recipe is structurally identical: take a pre-trained encoder, project its output into the LLM's embedding space, and fine-tune on paired data (audio + text). The audio encoder's features live in one coordinate system, the LLM's token embeddings live in another, and we need a bridge between them. If you have read the article on multimodal fusion, the patterns here will feel very familiar.

Three architecture patterns have emerged, directly paralleling the VLM fusion strategies:

  • Encoder + projector + frozen LLM: the cheapest approach to train. The audio encoder extracts features, a lightweight projector (linear layer or small MLP) maps them into the LLM's embedding space, and the LLM itself is kept frozen. Only the projector weights are learned. This works well when the LLM is already strong and we just need it to "see" audio tokens.
  • Encoder + projector + fine-tuned LLM: same as above, but we also update the LLM's weights (often via LoRA or full fine-tuning). This produces better quality because the LLM adapts its internal representations to the new modality, but costs more to train.
  • Native audio tokens (no separate encoder): audio is tokenised directly (e.g. via a neural codec) and fed into the LLM as first-class input tokens alongside text tokens. There is no separate encoder or projection step. This is the most unified approach but requires massive training data because the model must learn audio understanding from scratch rather than inheriting it from a pre-trained encoder.
๐Ÿ’ก The progression from pattern 1 to pattern 3 trades training cost for integration depth. Frozen-LLM approaches can be trained in hours on a single node; native audio-token models need billions of tokens and hundreds of GPUs. Most 2023--2024 audio-language models use pattern 1 or 2.
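
In code, pattern 1 reduces to a single learned linear map between two embedding spaces. Here is a minimal numpy sketch with toy dimensions and random stand-ins for the encoder output and prompt embeddings (real systems use the encoder's and LLM's actual hidden sizes, e.g. 1280 and 4096):

```python
import numpy as np

rng = np.random.default_rng(0)
d_audio, d_llm = 8, 16            # toy sizes; real models are far larger

# Pattern 1: only the projector is trainable; encoder and LLM stay frozen.
W_proj = rng.normal(scale=0.02, size=(d_llm, d_audio))   # trainable
b_proj = np.zeros(d_llm)                                  # trainable

def project(audio_features):
    """Map frozen encoder features (T, d_audio) into the LLM space (T, d_llm)."""
    return audio_features @ W_proj.T + b_proj

audio_features = rng.normal(size=(25, d_audio))   # 1 s of pooled 25 Hz features
audio_tokens = project(audio_features)            # "soft" audio tokens
text_tokens = rng.normal(size=(5, d_llm))         # stand-in for embedded prompt

# The frozen LLM consumes audio and text tokens as one interleaved sequence.
llm_input = np.concatenate([audio_tokens, text_tokens], axis=0)
print(llm_input.shape)            # (30, 16)
```

During training, gradients flow through the frozen LLM back into `W_proj` and `b_proj`, but only the projector's weights are updated.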

Qwen-Audio: An Early Unified Audio-LLM

Qwen-Audio (Chu et al., 2023) was among the first models to demonstrate that a single audio-language model could handle dozens of audio tasks at once: automatic speech recognition (ASR), sound event detection, music understanding, emotion recognition, and audio question-answering, all within a unified architecture. How does it work?

The architecture follows pattern 2 above: a Whisper-large-v2 encoder (640M parameters) extracts audio features, a projection layer maps them into the LLM's space, and a Qwen-7B LLM generates text conditioned on both the audio tokens and any text prompt. The pipeline looks like this:

Audio waveform $\rightarrow$ Whisper encoder $\rightarrow$ temporal pooling (stride 2) $\rightarrow$ linear projection $\rightarrow$ LLM embedding space $\rightarrow$ Qwen-7B decoder

The Whisper encoder produces features at 50 Hz (50 feature vectors per second of audio). A stride-2 pooling layer halves this to 25 Hz by averaging pairs of consecutive frames, reducing the sequence length before it enters the LLM. Then a linear projection maps each pooled vector from Whisper's hidden dimension to the Qwen-7B embedding dimension:

$$\mathbf{h}_a^i = \mathbf{W}_\text{proj} \cdot \text{Pool}(\mathbf{g}_a^{2i}, \mathbf{g}_a^{2i+1}) + \mathbf{b}_\text{proj}$$

where $\mathbf{g}_a^j \in \mathbb{R}^{d_\text{whisper}}$ is the $j$-th frame from the Whisper encoder, $\text{Pool}$ averages two consecutive frames into one, $\mathbf{W}_\text{proj} \in \mathbb{R}^{d_\text{LLM} \times d_\text{whisper}}$ and $\mathbf{b}_\text{proj} \in \mathbb{R}^{d_\text{LLM}}$ are the learned projection parameters, and $\mathbf{h}_a^i \in \mathbb{R}^{d_\text{LLM}}$ is the resulting token that gets fed into the LLM. If the total number of Whisper frames is odd, the final frame is typically duplicated or dropped so that every pooled token covers exactly two frames.
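
The pooling-plus-projection step can be sketched in a few lines of numpy. Dimensions are toy-sized, and the duplicate-the-last-frame choice for odd lengths is one of the two options for handling an odd frame count:

```python
import numpy as np

rng = np.random.default_rng(0)
d_whisper, d_llm = 6, 10                   # toy sizes, far below the real dims

g = rng.normal(size=(7, d_whisper))        # 7 encoder frames: odd count on purpose
if len(g) % 2 == 1:                        # odd: duplicate the final frame
    g = np.concatenate([g, g[-1:]], axis=0)

pooled = 0.5 * (g[0::2] + g[1::2])         # Pool(g^{2i}, g^{2i+1}): 50 Hz -> 25 Hz

W_proj = rng.normal(size=(d_llm, d_whisper))   # learned projection matrix
b_proj = np.zeros(d_llm)
h = pooled @ W_proj.T + b_proj             # one LLM-space token per frame pair

print(pooled.shape, h.shape)               # (4, 6) (4, 10)
```

Seven input frames become four pooled frames and hence four tokens in the LLM's embedding space.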

A key innovation in Qwen-Audio is hierarchical tags during pre-training. Because the model is trained on 30+ different audio tasks simultaneously, each training sample is tagged with metadata identifying the task type, data source, and language. This prevents the model from confusing an ASR task with a sound-event detection task during multi-task training and allows a single model to handle all of them without interference.

The upgraded Qwen2-Audio (August 2024) switched to the Whisper-large-v3 encoder, replaced the rigid hierarchical tags with natural language prompts, and introduced two interaction modes: voice chat (the user speaks and the model responds to the speech content) and audio analysis (the user provides an audio clip and asks questions about it via text). This made the model far more flexible in practice.

๐Ÿ’ก Qwen-Audio's multi-task training across 30+ tasks is analogous to how early VLMs like LLaVA were trained on visual instruction-following data. The key lesson: diverse training data matters more than architectural novelty when connecting a pre-trained encoder to an LLM.

SeamlessM4T: Multilingual Speech Translation

While Qwen-Audio focuses on understanding audio and generating text responses, SeamlessM4T (Seamless Communication et al., 2023) tackles a different challenge: translating across languages and modalities. A single model handles all four translation directions: speech-to-text, speech-to-speech, text-to-speech, and text-to-text, across up to 100 languages. Traditionally, each of these directions required a separate pipeline; SeamlessM4T collapses them into one.

The architecture chains four components together:

  • w2v-BERT 2.0 speech encoder: converts raw audio into a sequence of speech representations. This encoder combines wav2vec 2.0-style self-supervised pre-training with BERT-style masked prediction, giving it strong multilingual speech understanding.
  • Transformer text decoder: generates target-language text tokens, conditioned on the encoder output via cross-attention. This handles the speech-to-text and text-to-text translation paths.
  • Unit-based speech decoder: for speech output, a second decoder generates discrete speech units (learned audio tokens that represent phonetic content) rather than raw waveform samples.
  • HiFi-GAN vocoder: converts the discrete speech units back into a continuous waveform that can be played as audio. This is the final stage for any path that produces speech output.

For a speech-to-speech translation (e.g. English audio in, French audio out), the full pipeline is: audio $\rightarrow$ w2v-BERT 2.0 $\rightarrow$ text decoder (generates French text) $\rightarrow$ speech unit decoder (generates French speech units) $\rightarrow$ HiFi-GAN (synthesises French waveform). The text decoder acts as a semantic bottleneck: translation happens in text space, and then speech is generated from the translated text.
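
The chaining logic can be made concrete with toy stand-ins for the four components. Each function below simply labels its input, so the printed string records the path a request takes; the function names are illustrative, not the real SeamlessM4T API:

```python
# Toy stand-ins: each stage labels its input so we can see the processing path.
def speech_encoder(wav):       return f"feats({wav})"        # w2v-BERT 2.0
def text_decoder(x, lang):     return f"text_{lang}({x})"    # translation decoder
def unit_decoder(text):        return f"units({text})"       # speech-unit decoder
def vocoder(units):            return f"wav({units})"        # HiFi-GAN

# The four translation directions share components; translation always
# happens in text space (the semantic bottleneck).
def s2tt(wav, lang):  return text_decoder(speech_encoder(wav), lang)
def s2st(wav, lang):  return vocoder(unit_decoder(s2tt(wav, lang)))
def t2tt(text, lang): return text_decoder(text, lang)
def t2st(text, lang): return vocoder(unit_decoder(t2tt(text, lang)))

print(s2st("english_audio", "fra"))
# wav(units(text_fra(feats(english_audio))))
```

Note how every speech-output path funnels through the same unit decoder and vocoder, and every translation path through the same text decoder.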

In December 2023, the team released SeamlessStreaming (Barrault et al., 2023), which added EMMA (Efficient Monotonic Multihead Attention) to enable simultaneous translation. Instead of waiting for the speaker to finish an entire utterance, the model begins translating while speech is still incoming. The key idea behind EMMA is that for translation, attention is roughly monotonic: the next output word typically attends to a position near or after where the previous output word attended. EMMA exploits this by restricting each attention head to a learned monotonic policy that decides when to "read" more input and when to "write" output, achieving simultaneous translation with approximately 2-second latency.
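
A toy read/write loop makes the scheduling idea concrete. This sketch uses a fixed wait-$k$ lag rather than EMMA's learned per-head policy, and assumes for simplicity one output token per input chunk, but it shows the same pattern: read source until the policy allows a write, emit a token, repeat:

```python
def simultaneous_translate(source_chunks, translate_one, k=2):
    """Emit target tokens while source is still arriving (fixed wait-k policy)."""
    read, written, outputs = 0, 0, []
    n = len(source_chunks)
    while written < n:                       # toy: one output per input chunk
        if read < min(written + k, n):
            read += 1                        # READ: consume more source
        else:
            outputs.append(translate_one(source_chunks[:read], written))
            written += 1                     # WRITE: emit the next target token
    return outputs

chunks = ["seg0", "seg1", "seg2", "seg3"]
out = simultaneous_translate(chunks, lambda ctx, i: f"tok{i}@{len(ctx)}")
print(out)   # ['tok0@2', 'tok1@3', 'tok2@4', 'tok3@4']
```

The first output token appears after only two source chunks, and later tokens track the incoming stream with a constant lag; EMMA learns this read/write decision per attention head instead of hard-coding it.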

๐Ÿ’ก SeamlessM4T demonstrates a key point about audio-language models: they are not limited to understanding audio. By chaining an encoder, a translation decoder, a speech unit decoder, and a vocoder, a single model replaces what used to be an entire multi-system translation pipeline.

The Audio Projection Problem

Connecting an audio encoder to an LLM seems straightforward in principle, but there is a core challenge that makes it harder than the analogous vision problem: audio has a much higher token rate than text.

Let's put concrete numbers on this. Conversational speech runs at roughly 150 words per minute. With a typical subword tokeniser averaging around 1.3 tokens per word, that translates to roughly 3 tokens per second:

$$R_\text{text} = \frac{150 \text{ words/min}}{60 \text{ s/min}} \times 1.3 \text{ tokens/word} \approx 3.3 \text{ tokens/s}$$

Now compare that to the audio encoder's output. Whisper produces features at 50 Hz, meaning 50 feature vectors per second of audio:

$$R_\text{audio} = 50 \text{ features/s}$$

That is roughly a 15$\times$ mismatch. Without downsampling, a 30-second audio clip becomes $50 \times 30 = 1{,}500$ tokens in the LLM context. A 5-minute podcast segment would be $50 \times 300 = 15{,}000$ tokens, consuming a large fraction of the LLM's context window before the model has even started reasoning. For comparison, the text transcript of that same 5-minute segment would be only about $3.3 \times 300 \approx 1{,}000$ tokens.

At the extremes, the problem is clear. For very short audio (a single word, ~0.5 seconds), we get $50 \times 0.5 = 25$ audio tokens. That is manageable. But for a 30-minute meeting recording, we get $50 \times 1{,}800 = 90{,}000$ tokens. No current LLM can usefully attend over 90,000 audio tokens alongside a text prompt.

Three main strategies address this mismatch:

  • Temporal pooling / striding: average or concatenate consecutive frames to reduce the frame rate. Qwen-Audio uses stride 2, halving from 50 Hz to 25 Hz. Simple and effective, but the compression ratio is limited (typically 2--4$\times$). For a 30-second clip at stride 2: $25 \times 30 = 750$ tokens instead of 1,500.
  • Q-Former / attention pooling: use a set of learned query vectors that attend to the full audio feature sequence via cross-attention and compress it into a much smaller (often fixed) number of tokens. This is the audio equivalent of the Q-Former approach used in BLIP-2 for vision. The compression can be aggressive (e.g. 100 audio features $\rightarrow$ 8 query tokens), but risks losing fine-grained temporal detail.
  • Lower frame-rate codecs: use a neural audio codec that operates at a lower frame rate from the start. For example, the Mimi codec (used by Moshi) operates at 12.5 Hz, so 30 seconds of audio produces $12.5 \times 30 = 375$ tokens, a 4$\times$ reduction compared to 50 Hz Whisper features, with no additional pooling needed.
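
As an illustration of the second strategy, here is single-head cross-attention pooling in numpy: 8 query vectors compress 100 audio frames into 8 tokens, and the output size is independent of the input length. This is a toy with random initialisation; real Q-Formers use multiple heads and stacked layers:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
features = rng.normal(size=(100, d))   # 2 s of 50 Hz audio features
queries = rng.normal(size=(8, d))      # 8 learned query vectors (toy init)

# Each query attends over all frames; the sequence is compressed to 8 tokens
# no matter how long the audio is.
scores = queries @ features.T / np.sqrt(d)              # (8, 100)
weights = np.exp(scores - scores.max(axis=1, keepdims=True))
weights /= weights.sum(axis=1, keepdims=True)           # softmax over frames
compressed = weights @ features                          # (8, d)

print(compressed.shape)   # (8, 8)
```

Feeding 1,000 frames instead of 100 would still yield exactly 8 output tokens, which is what makes this approach attractive for long audio.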

The general trade-off can be expressed as:

$$N_\text{tokens} = \frac{R_\text{encoder}}{C} \times T$$

where $R_\text{encoder}$ is the encoder's output frame rate (e.g. 50 Hz for Whisper), $C$ is the compression factor from pooling or codec design (e.g. $C = 2$ for stride-2 pooling, $C = 4$ for Mimi's 12.5 Hz rate relative to 50 Hz), and $T$ is the audio duration in seconds. When $C = 1$ (no compression), we get the raw frame count. As $C$ increases, we get fewer tokens but lose temporal resolution. The right value of $C$ depends on the task: ASR can tolerate aggressive compression because the information density of speech is low relative to the frame rate; music analysis or speaker diarisation may need higher resolution to preserve fine-grained timbral or prosodic cues.
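The trade-off can be tabulated with a small helper function (a hypothetical utility mirroring the formula above, not from any particular library):

```python
def n_audio_tokens(rate_hz, compression, duration_s):
    """N_tokens = (R_encoder / C) * T."""
    return int(rate_hz / compression * duration_s)

# A 30-second clip under the strategies discussed above:
print(n_audio_tokens(50, 1, 30))     # raw 50 Hz Whisper features: 1500
print(n_audio_tokens(50, 2, 30))     # stride-2 pooling (Qwen-Audio): 750
print(n_audio_tokens(50, 4, 30))     # 12.5 Hz codec such as Mimi: 375
print(n_audio_tokens(50, 2, 1800))   # 30-minute meeting, even pooled: 45000
```

Even with stride-2 pooling, long-form audio overwhelms a typical context window, which motivates the more aggressive compression schemes.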

๐Ÿ“Œ The audio token rate problem is the main reason most audio-language models in 2023--2024 cap input duration at 30 seconds. Longer audio either requires aggressive compression (losing detail) or eats the LLM's context window (leaving no room for the text prompt and output). This is an active area of research.

Emerging Patterns and Trends

Audio-language models are following the same trajectory as VLMs, roughly two years behind:

  • 2023: the first audio-LLMs appeared (Qwen-Audio, SALMONN, LTU). These are analogous to early VLMs like LLaVA: a pre-trained encoder, a simple projector, and a frozen or lightly fine-tuned LLM. They proved the concept works.
  • 2024: improved training recipes and multi-task capability (Qwen2-Audio, WavLLM). Analogous to LLaVA-1.5 and later VLMs: better data curation, more tasks, stronger results. Models began handling both speech and non-speech audio (music, environmental sounds) in a single system.
  • 2025: native omni-modal models that handle audio, text, and vision as first-class modalities in a single architecture (covered in the next article). The encoder + projector approach gives way to unified tokenisation of all modalities.

The key architectural question at the frontier is: encoder + projector (modular) versus native audio tokens (unified)?

The encoder approach is easier to train and can leverage years of investment in pre-trained audio models (Whisper, wav2vec 2.0, w2v-BERT). It is modular: you can swap in a better encoder without retraining the LLM. But it adds a conversion step, and the fixed encoder may discard audio information (emotion, prosody, background sounds) that the LLM would find useful if it could access the raw signal.

The native approach is end-to-end: audio tokens are treated identically to text tokens, so the model can learn to preserve whatever information matters for the task (including nuance like emotion, speaking style, and prosody that a fixed encoder might discard). But it requires massive training data because the model must simultaneously learn acoustic modelling and language modelling, rather than inheriting acoustic understanding from a pre-trained encoder.

The next article covers models that go beyond understanding audio: they generate it too, handle it natively alongside text and vision, and do it in real-time. These are the omni-modal models (GPT-4o, Gemini, Moshi) that unify all modalities into a single token stream.

Quiz

Test your understanding of audio-language model architectures and the challenges of connecting audio encoders to LLMs.

Why do audio-language models need a projection layer between the audio encoder and the LLM?

If a Whisper encoder produces features at 50 Hz and we apply stride-2 pooling, how many tokens does a 30-second audio clip produce?

What problem does SeamlessStreaming's EMMA mechanism solve?

What is the main advantage of the native audio token approach (no separate encoder) over the encoder + projector approach?