The Best of Both Worlds

The previous article made the trade-off clear: attention excels at precise recall and in-context learning, while SSMs excel at long-range efficiency and streaming. Pure Mamba models struggle on tasks that require exact copying or retrieval from context — if you ask a pure SSM to repeat a phone number it saw 50,000 tokens ago, it often fails because that specific information was compressed away into the fixed-size state. Pure attention models handle this easily (just attend to the token), but their $O(n^2)$ cost makes million-token contexts prohibitively expensive.

What if we could use attention for the layers where precise recall matters and SSMs for the layers where long-range context compression is sufficient? That's the core idea behind hybrid architectures: stack both attention and SSM layers in a single model, letting each type of layer do what it does best. The attention layers provide the "sharp" memory — direct token-to-token access for recent context or critical information. The SSM layers provide the "soft" memory — efficient compression of long-range patterns and trends.

The key design question is the ratio: how many SSM layers versus attention layers? At one extreme (all attention), we get a standard Transformer. At the other extreme (all SSM), we get pure Mamba. Between these extremes lies a spectrum of hybrids, and the optimal ratio depends on the use case. More attention layers improve recall at the cost of longer-sequence efficiency. More SSM layers improve efficiency at the cost of precise retrieval. As we'll see, production models have converged on ratios around 6:1 to 8:1 (SSM-to-attention), using just enough attention to handle recall-heavy tasks while keeping the overall cost close to linear.

💡 Think of attention layers as "reading glasses" — you only need a few of them to read fine print (precise recall), but you don't need them for everything. SSM layers handle the rest of the processing at much lower cost.
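To see why a small number of attention layers dominates the cost at long contexts, here is a toy cost model — a sketch, with the simplifying assumptions that per-layer constant factors are equal and that attention costs exactly $O(n^2)$ while an SSM layer costs exactly $O(n)$:

```python
# Toy cost model: one attention layer ~ n^2, one SSM layer ~ n.
# Per-layer constants are set to 1, which is a deliberate simplification.

def mixing_cost(n_tokens, n_ssm, n_attn):
    """Relative sequence-mixing cost of one forward pass."""
    return n_ssm * n_tokens + n_attn * n_tokens ** 2

for n in [4_096, 65_536, 262_144]:
    pure_attn = mixing_cost(n, n_ssm=0, n_attn=32)   # 32-layer Transformer
    hybrid = mixing_cost(n, n_ssm=28, n_attn=4)      # 28 SSM + 4 attention
    print(f"{n:>7} tokens: hybrid ~ {hybrid / pure_attn:.1%} of pure attention")
```

As $n$ grows, the linear SSM term becomes negligible and the ratio approaches $4/32 = 12.5\%$ — the hybrid's asymptotic cost is set almost entirely by its attention fraction.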

Jamba: Mamba + Attention + MoE

The first large-scale hybrid model to demonstrate that mixing attention and SSM layers works in practice is Jamba (Lieber et al., 2024), a 52-billion-parameter model from AI21 Labs supporting a 256K-token context window. Jamba combines three architectural ideas: Mamba SSM layers, standard attention layers, and Mixture of Experts (MoE) feed-forward layers.

Jamba's architecture is built from repeating blocks. Each block contains either a Mamba layer or an attention layer, followed by a feed-forward network (which may or may not use MoE). The blocks are arranged in a specific pattern: for every 1 attention layer, there are 7 Mamba layers — a 7:1 ratio. Concretely, the model has 4 attention layers out of a total of 32 sequence-mixing layers. The attention layers are spaced evenly (roughly every 8 layers), so information passes through several Mamba layers (compressing long-range context) before hitting an attention layer (which can perform precise recall and in-context operations).
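This layer layout can be sketched directly. The 7:1 ratio and the 32-layer total come from the description above; the exact offset of the attention layer within each group of 8 is an assumption of this illustration:

```python
def jamba_like_stack(n_groups=4, mamba_per_group=7):
    """Build a hybrid layer list: 1 attention + 7 Mamba layers per group."""
    layers = []
    for _ in range(n_groups):
        layers += ["attention"] + ["mamba"] * mamba_per_group
    return layers

stack = jamba_like_stack()
print(len(stack), "layers:", stack.count("attention"), "attention,",
      stack.count("mamba"), "mamba")
```

Changing `n_groups` or `mamba_per_group` gives the other ratios (5:1, 9:1, …) on the hybrid spectrum.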

Why does the 7:1 ratio work? Consider what each layer type contributes. The Mamba layers process the bulk of the sequence at $O(n)$ cost, maintaining and updating compressed state representations. The attention layers, appearing every 8 layers, act as "checkpoints" where the model can directly compare tokens and perform the kind of precise lookup that SSMs struggle with. Four attention layers are enough for most recall tasks because the model doesn't need to perform exact token matching at every layer — it only needs it at a few strategic points.

The MoE component addresses a different concern: model capacity. Some of the feed-forward layers use 16 experts with top-2 routing, meaning only 2 out of 16 experts are activated per token. This gives Jamba the capacity of a 52B-parameter model while using only ~12B active parameters per token. The MoE layers appear on a subset of blocks (every other block), keeping the computational overhead manageable.
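The total-vs-active arithmetic works out as follows. The 16-expert/top-2 routing and the every-other-block MoE placement come from the text; the per-expert and shared parameter counts below are illustrative values chosen to land near the headline ~52B/~12B figures, not Jamba's actual configuration:

```python
n_experts, top_k = 16, 2
n_moe_layers = 16            # MoE on every other block of a 32-block model
expert_params = 0.1786e9     # assumed parameters per expert FFN (illustrative)
shared_params = 6.28e9       # assumed attention + Mamba + embeddings (illustrative)

# All experts must be stored, but only the top-2 run for any given token.
total = shared_params + n_moe_layers * n_experts * expert_params
active = shared_params + n_moe_layers * top_k * expert_params
print(f"~{total / 1e9:.0f}B total parameters, ~{active / 1e9:.0f}B active per token")
```

The key ratio is `top_k / n_experts = 1/8`: the expert parameters sitting in memory outnumber the expert parameters doing work per token by 8 to 1.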

The results are striking. At 256K context, Jamba fits on a single 80 GB GPU — something that would be impossible for a pure-attention 52B model (the KV cache alone would exceed memory). This is because the 28 Mamba layers contribute no KV cache at all — only the 4 attention layers need KV storage. The KV cache is thus roughly $4/32 = 12.5\%$ of what a full-attention model would require. At the boundary: with 0 attention layers (pure Mamba), the KV cache is zero but recall suffers. With 32 attention layers (pure Transformer), recall is perfect but the 256K KV cache won't fit in memory.

import json, js

# KV cache comparison: Jamba (4 attn layers) vs full Transformer (32 attn layers)
n_layers_total = 32
n_attn_jamba = 4
d_model = 4096
n_kv_heads = 8   # GQA
d_head = 128
bytes_per_param = 2  # FP16

seq_lengths = [4096, 32768, 131072, 262144]
labels = ["4K", "32K", "128K", "256K"]

rows = []
for n, label in zip(seq_lengths, labels):
    # Full Transformer: all 32 layers have KV cache
    full_kv = 2 * n_layers_total * n_kv_heads * d_head * n * bytes_per_param
    full_gb = full_kv / 1e9
    # Jamba: only 4 layers have KV cache
    jamba_kv = 2 * n_attn_jamba * n_kv_heads * d_head * n * bytes_per_param
    jamba_gb = jamba_kv / 1e9
    saving = (1 - jamba_kv / full_kv) * 100
    rows.append([label, f"{full_gb:.1f} GB", f"{jamba_gb:.1f} GB", f"{saving:.0f}%"])

js.window.py_table_data = json.dumps({
    "headers": ["Seq Length", "Full Transformer KV", "Jamba KV (4 attn layers)", "KV Memory Saved"],
    "rows": rows
})

print(f"Config: {n_layers_total} layers, {n_kv_heads} KV heads, d_head={d_head}, FP16")
print(f"Jamba uses only {n_attn_jamba}/{n_layers_total} attention layers")
print(f"At 256K tokens: full Transformer needs {rows[-1][1]}, Jamba needs {rows[-1][2]}")

💡 Jamba's key insight: you don't need attention at every layer. A few attention layers spaced evenly through the network are enough to provide precise recall, while the majority of layers can be SSMs that process long sequences cheaply. The 7:1 ratio is not magic — it was found empirically, and other ratios (5:1, 9:1) also work well. The principle is: use as few attention layers as needed to pass recall benchmarks.

Zamba and StripedHyena

Jamba is not the only hybrid approach. Other teams have explored different ways to partition attention and SSM layers, each making different trade-offs.

Zamba (Glorioso et al., 2024) from Zyphra takes a more aggressive approach: it uses a single shared attention block interleaved among many Mamba layers. Instead of distributing several independent attention layers throughout the network (as Jamba does), Zamba defines one attention block and reuses its weights at multiple points in the network. The model routes through this shared block at regular intervals, so attention appears several times during the forward pass but uses only one set of attention parameters. This sharply reduces the attention parameter count while keeping the KV cache small (each invocation of the shared block still caches its own keys and values, but the number of invocations remains a small fraction of the layers) — and it still provides the recall capability that pure Mamba lacks.
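Weight sharing of this kind is easy to illustrate. In the sketch below, all class and variable names are hypothetical and the sharing interval of 4 is arbitrary; the point is that the same attention object occupies several depth slots, so the stack has more layer slots than distinct parameter sets:

```python
class Layer:
    """Stand-in for a parameterised layer; weights and compute are elided."""
    def __init__(self, name):
        self.name = name
    def __call__(self, x):
        return x

shared_attn = Layer("shared-attention")
stack = [shared_attn if i % 4 == 0 else Layer(f"mamba-{i}") for i in range(12)]

n_slots = len(stack)
n_param_sets = len({id(layer) for layer in stack})
print(f"{n_slots} layer slots, {n_param_sets} distinct parameter sets")
```

Here the attention block fills three slots but contributes only one set of weights — the same trick Zamba applies at scale.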

StripedHyena (Together AI, 2023) takes a different approach to the hybrid question. Instead of using Mamba, it alternates between Hyena layers (a convolution-based alternative to attention with sub-quadratic scaling) and standard attention layers. The "striped" pattern — alternating Hyena and attention — gives the model both the long-range efficiency of sub-quadratic sequence mixing and the precise in-context learning of attention. StripedHyena-7B matched the performance of LLaMA 2 7B on standard benchmarks while offering better scaling on longer sequences.

These models share a common lesson: the specific non-attention mechanism matters less than the principle of mixing. Whether you use Mamba, Hyena, RWKV, or another sub-quadratic layer, the hybrid approach of combining it with a few attention layers consistently outperforms either pure approach. The attention layers act as a "precision backbone" — without them, all these models show degradation on recall-heavy tasks. With even a small fraction of attention layers (1 in 6 to 1 in 8), recall performance recovers to near-Transformer levels.

💡 Zamba's shared attention layer is an extreme version of the hybrid idea: use the absolute minimum amount of attention (one shared layer) and fill the rest with SSMs. This pushes the KV cache to near zero while maintaining recall. The downside is that one attention layer may not be enough for the most demanding multi-hop reasoning tasks.

When to Use What

With three paradigms available — pure attention, pure SSM, and hybrid — how do you choose? The answer depends on your sequence length, task requirements, and deployment constraints. Here's a practical comparison:

import json, js

rows = [
    [
        "Pure Attention (GPT, LLaMA)",
        "O(n^2)",
        "Short-medium (up to ~32K)",
        "Excellent",
        "General-purpose LLMs, chat, code generation"
    ],
    [
        "Pure SSM (Mamba)",
        "O(n)",
        "Very long (256K+)",
        "Limited",
        "Genomics, long audio, streaming, on-device"
    ],
    [
        "Hybrid (Jamba, Zamba)",
        "~O(n)",
        "Long (128K-256K+)",
        "Good",
        "Production LLMs needing long context + recall"
    ],
]

js.window.py_table_data = json.dumps({
    "headers": ["Architecture", "Compute", "Best Context Range", "In-Context Recall", "Best Use Cases"],
    "rows": rows
})

print("Architecture selection guide")
print("Pure attention: best quality, worst scaling")
print("Pure SSM: best scaling, weakest recall")
print("Hybrid: practical sweet spot for most production needs")

Let's walk through the decision logic for common scenarios:

  • Chatbot with short conversations (< 8K tokens): Pure attention (Transformer). The quadratic cost is negligible at this length, and you get the best in-context learning. There's no reason to add SSM complexity.
  • Analysing a full codebase or long legal document (64K–256K tokens): Hybrid architecture. You need both long-range understanding (SSM layers handle this cheaply) and the ability to precisely reference specific sections (attention layers handle this). Jamba-style models are designed for exactly this regime.
  • Processing genomic sequences or continuous sensor data (1M+ tokens): Pure SSM or aggressive hybrid (1–2 attention layers). The sequence is too long for any meaningful amount of attention, and the task typically requires pattern recognition over long ranges rather than precise recall of individual tokens.
  • On-device or streaming inference: Pure SSM. The fixed-size state (no growing KV cache) is ideal for memory-constrained environments, and the recurrent mode naturally processes one token at a time without needing to store past tokens.
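The heuristics above can be condensed into a hypothetical helper function — the thresholds below are rough rules of thumb from this section, not hard boundaries:

```python
def pick_architecture(seq_len, needs_precise_recall=True, on_device=False):
    """Rule-of-thumb architecture choice; thresholds are illustrative."""
    if on_device:
        return "pure SSM"            # fixed-size state, no growing KV cache
    if seq_len <= 8_192:
        return "pure attention"      # quadratic cost negligible at this length
    if seq_len >= 1_000_000:
        # Too long for much attention; keep 1-2 layers only if recall matters.
        return "aggressive hybrid" if needs_precise_recall else "pure SSM"
    return "hybrid"                  # long-context + recall sweet spot

print(pick_architecture(4_096))
print(pick_architecture(200_000))
print(pick_architecture(2_000_000, needs_precise_recall=False))
```

Real deployments weigh more factors (hardware, latency targets, serving stack), but as a first cut this captures the decision logic above.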

The trend is clear: as context lengths grow and models are deployed in more diverse settings, hybrid architectures are becoming the production default. Pure attention Transformers dominated because contexts were short (2K–8K tokens) and hardware was optimised for matrix multiplications. As we push toward 256K and beyond, the quadratic cost becomes untenable, and the hybrid approach — using SSMs for the bulk of the processing and attention for the precision layers — offers the best trade-off between quality and cost.

The Mamba-2 SSD framework (previous article) suggests that this convergence may deepen further: if SSMs and attention are mathematically dual, future architectures may not even have discrete "attention layers" and "SSM layers" — they may have layers that smoothly interpolate between the two modes depending on the input and position in the sequence.

Quiz

Test your understanding of hybrid architectures combining attention and SSMs.

Why do pure SSM models (like Mamba) struggle with tasks requiring precise recall from context?

Jamba uses a 7:1 ratio of Mamba-to-attention layers (28 Mamba layers and 4 attention layers). What is the primary benefit of this ratio for long-context inference?

How does Zamba's approach to hybrid architecture differ from Jamba's?

For which scenario would a pure attention Transformer still be the best choice over a hybrid or SSM model?