Scaling Laws for Mamba Models
Scaling laws describe the quantitative relationships between model size, training compute, dataset volume, and predictive performance — and for Mamba-based architectures, these relationships diverge meaningfully from the Transformer baselines that dominated prior scaling research. This page covers how scaling laws apply specifically to state space model (SSM) architectures like Mamba, what empirical findings distinguish Mamba's scaling behavior, where the theory remains contested, and how practitioners and researchers characterize scaling regimes across different deployment contexts.
- Definition and scope
- Core mechanics or structure
- Causal relationships or drivers
- Classification boundaries
- Tradeoffs and tensions
- Common misconceptions
- Checklist or steps
- Reference table or matrix
Definition and scope
Scaling laws, as formalized for neural language models by Kaplan et al. (2020) in the OpenAI technical report "Scaling Laws for Neural Language Models", describe power-law relationships between cross-entropy loss and three primary variables: parameter count (N), training token count (D), and compute budget (C). The central finding was that loss decreases predictably as each variable increases, following approximately L ∝ N^{−α} and L ∝ D^{−β}, with empirically estimated exponents.
For Transformer architectures, Hoffmann et al. (2022) at DeepMind refined these findings in the "Chinchilla" paper, demonstrating that prior models were systematically undertrained — the compute-optimal ratio of parameters to training tokens is approximately 1:20 (one parameter per 20 training tokens at compute-optimal scale).
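The Chinchilla allocation can be made concrete with a short sketch. This is a minimal illustration, not the paper's fitting procedure: the standard C ≈ 6·N·D training-FLOP approximation and the helper name are assumptions introduced here.

```python
import math

def chinchilla_optimal(compute_flops: float, tokens_per_param: float = 20.0):
    """Split a training-compute budget into parameters (N) and tokens (D)
    using C ~= 6*N*D and the Chinchilla ratio D ~= tokens_per_param * N."""
    n_params = math.sqrt(compute_flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# Example: a ~5.76e23 FLOP budget lands near the Chinchilla-70B operating
# point (~70B parameters, ~1.4T tokens)
n, d = chinchilla_optimal(5.76e23)
print(f"params ~ {n:.2e}, tokens ~ {d:.2e}")
```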
Mamba scaling laws operate within this same conceptual framework but introduce architectural variables that Transformer-based studies did not need to parameterize. Specifically, Mamba replaces quadratic-complexity attention with a selective state space mechanism, changing the computational cost profile per token at training time. The Mamba architecture overview describes the structural foundations that drive these differences. The scope of Mamba-specific scaling research extends to: parameter efficiency per FLOP, memory bandwidth utilization, recurrent state dimensionality, and sequence length as an independent scaling axis — a dimension that attention-based models handle quadratically and therefore rarely treat as a free variable.
Core mechanics or structure
Mamba's scaling behavior is governed by four interacting structural factors.
Parameter density and recurrent state. A Mamba block's parameters are distributed across input projection, selective SSM parameters (Δ, B, C matrices), and output projection. The selective state space, described in Gu and Dao's 2023 paper "Mamba: Linear-Time Sequence Modeling with Selective State Spaces", uses a state dimension N (not to be confused with total parameter count) that is typically set to 16 or 64 in published configurations. Increasing N expands recurrent memory capacity but adds parameters sublinearly relative to the model's total capacity.
FLOP scaling with sequence length. Transformer attention costs O(L²) FLOPs per sequence of length L (the per-token cost grows linearly with context), while Mamba's recurrent scan costs O(L) per sequence. At sequence lengths above 2,048 tokens, this difference translates to measurable compute savings per training step. At inference, Mamba's linear-time scaling means the recurrent formulation processes each token in constant time regardless of context length — a structural property with direct implications for how scaling curves behave at long-context settings.
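The asymptotic gap can be illustrated with schematic sequence-level FLOP counts. The constant factors below are illustrative placeholders, not measured kernel counts.

```python
def attn_seq_flops(seq_len: int, d_model: int) -> float:
    # QK^T and attention-weighted V: two L x L x d matmuls per layer
    return 4.0 * seq_len**2 * d_model

def scan_seq_flops(seq_len: int, d_model: int,
                   n_state: int = 16, expand: int = 2) -> float:
    # Selective scan: per-token update/readout over an N-dim state
    # (the leading constant 6 is illustrative, not a kernel audit)
    d_inner = expand * d_model
    return 6.0 * seq_len * d_inner * n_state

for L in (2048, 8192, 32768):
    a, s = attn_seq_flops(L, 768), scan_seq_flops(L, 768)
    print(L, f"attention/scan ratio ~ {a / s:.1f}x")
```

Because the ratio of the two costs grows linearly in L, doubling the sequence length doubles attention's relative overhead, which is the mechanism behind the long-context savings described above.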
Hardware-aware kernels. The parallel associative scan used during training — implemented as a CUDA-level fused kernel — determines realized throughput efficiency. Because Mamba's compute is dominated by memory bandwidth rather than arithmetic intensity at smaller batch sizes, its effective FLOPs-per-second utilization differs from Transformer models of equivalent parameter count. The hardware-aware algorithms underlying Mamba affect where scaling efficiency saturates on a given GPU tier.
Depth vs. width tradeoffs. In the Transformer scaling literature, the ratio of depth (layers) to width (embedding dimension) follows empirically studied optima. Mamba's published configurations (Mamba-130M through Mamba-2.8B in the original paper) suggest similar depth-width optima to Transformers at matched parameter counts, but the interaction with state dimension N adds a third axis that Transformer models lack.
Causal relationships or drivers
Three primary factors causally drive differences in Mamba's scaling curves relative to Transformer baselines.
Compute-per-token reduction at long contexts. Because attention cost grows quadratically with context length, Transformer training on sequences longer than 8,192 tokens becomes disproportionately expensive. Mamba's linear cost means that doubling sequence length from 4,096 to 8,192 doubles — rather than quadruples — the FLOPs per batch. This shifts the compute-optimal token allocation: longer sequences become accessible at lower compute budgets, potentially altering the optimal N-to-D ratio from Chinchilla's 1:20 baseline.
Memory bandwidth as the binding constraint. At scales below 1 billion parameters, Mamba's recurrent scan is frequently memory-bandwidth-limited rather than compute-limited. This means that scaling parameter count does not linearly increase training throughput until batch sizes are large enough to saturate arithmetic pipelines. The practical effect is that Mamba's loss-vs-compute curves may show slower improvement at small scale than parameter count alone predicts.
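A back-of-the-envelope roofline check illustrates why small-scale Mamba runs sit in the bandwidth-bound regime. The GPU figures and the function name below are illustrative assumptions, not vendor-audited numbers.

```python
def is_bandwidth_bound(flops_per_byte: float,
                       peak_tflops: float, bw_tb_s: float) -> bool:
    """Roofline check: below the ridge point (peak FLOPs / peak bandwidth),
    a kernel is limited by memory bandwidth, not arithmetic throughput."""
    ridge = (peak_tflops * 1e12) / (bw_tb_s * 1e12)  # FLOPs per byte
    return flops_per_byte < ridge

# Illustrative: an A100-class GPU (~312 TFLOPS BF16, ~2 TB/s HBM) has a
# ridge point near 156 FLOPs/byte. A scan step performing only a few FLOPs
# per state element it reads stays far below that at small batch sizes.
print(is_bandwidth_bound(10.0, peak_tflops=312, bw_tb_s=2.0))   # bandwidth-bound
print(is_bandwidth_bound(200.0, peak_tflops=312, bw_tb_s=2.0))  # compute-bound
```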
Selective state space information compression. Mamba's input-dependent SSM parameters allow the model to selectively compress or propagate information across sequence positions. This architectural selectivity means Mamba can, in principle, achieve equivalent perplexity to a Transformer on structured data with fewer effective parameters — but this compression advantage does not transfer uniformly across all data domains. Selective state spaces describes the mechanism in detail.
Classification boundaries
Mamba scaling research spans three distinct regimes, each with different empirical characteristics.
Small-scale regime (< 125M parameters). At this scale, Mamba models trained on standard language modeling benchmarks show performance roughly comparable to same-size Transformer models, with minor advantages in throughput at sequence lengths above 1,024 tokens. Scaling exponents in this regime are noisy and sensitive to hyperparameter choices.
Mid-scale regime (125M–3B parameters). The Mamba paper's primary empirical results span this regime. Mamba-2.8B, trained on the Pile dataset, achieves perplexity competitive with GPT-NeoX-20B on several benchmarks — a 7× parameter reduction for matched performance, though training conditions differ. The Mamba benchmarks and performance page details the specific evaluation configurations.
Large-scale regime (> 3B parameters). Empirical scaling data for pure Mamba models above 3B parameters is sparse in published literature as of the 2023–2024 research cycle. Hybrid architectures — explored under Mamba hybrid models — dominate this regime in practice, interleaving SSM blocks with attention layers to capture both linear-cost sequence processing and the strong in-context learning properties of attention.
Tradeoffs and tensions
The primary tension in Mamba scaling concerns in-context learning versus recurrent compression. Transformers scale in-context learning capability sharply with parameter count, as documented in Brown et al. (2020), "Language Models are Few-Shot Learners". Mamba's recurrent architecture compresses context into a fixed-size state vector, which structurally limits the fidelity of in-context learning for tasks requiring retrieval of specific tokens from long histories.
A second tension involves training stability at scale. Mamba's selective parameterization introduces multiplicative interactions between input-dependent parameters (Δ, B, C), which can produce gradient instability at higher learning rates or larger model scales without careful initialization. The Mamba2 improvements paper by Dao and Gu (2024) addresses some of these stability concerns through a structured state space duality (SSD) formulation.
A third tension involves benchmark transferability of scaling laws. The Chinchilla compute-optimal ratios were derived on language modeling tasks measured by cross-entropy loss. Mamba's architectural compression of sequence information means that equivalent perplexity does not always translate to equivalent downstream task performance — particularly on tasks with high retrieval demands, such as multi-hop reasoning benchmarks. The Mamba limitations and tradeoffs page covers this boundary in depth.
Common misconceptions
Misconception: Mamba always outperforms Transformers at matched parameter count.
The published benchmarks show Mamba matching or exceeding same-size Transformers on standard language modeling perplexity at certain scales — but this does not generalize uniformly. On tasks requiring precise token-level recall from long contexts, Transformer attention's exact retrieval mechanism maintains advantages that Mamba's fixed-state recurrence cannot fully replicate, regardless of parameter scale.
Misconception: Linear compute scaling eliminates the need for Chinchilla-style compute-optimal ratios.
Mamba's O(L) cost reduces training expense per long sequence, but it does not change the fundamental relationship between parameter count and dataset size for achieving minimum loss. The compute-optimal frontier shifts in its absolute position on a cost-per-FLOP basis, but the existence of an optimal N-to-D ratio remains.
Misconception: Scaling Mamba state dimension N is equivalent to scaling model parameters.
Increasing recurrent state dimension N from 16 to 64 quadruples state memory per layer but adds parameters only linearly in N: the state-dependent matrices (A and the B/C projections) grow with N, not with the square of the embedding dimension, and remain a small fraction of each block's parameters. Practitioners sometimes conflate state capacity with parameter count, producing incorrect efficiency comparisons. The state space models explained reference covers the parameter accounting in detail.
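A quick sketch of that accounting follows. The 3·d_inner-per-state-channel approximation is an assumption (counting only the matrices that carry N), and the example width is a hypothetical ~130M-scale configuration.

```python
def recurrent_state_bytes(d_inner: int, n_state: int, bytes_per_el: int = 2) -> int:
    # Per-layer recurrent state h (d_inner x n_state) held during inference
    return d_inner * n_state * bytes_per_el

def ssm_param_delta(d_inner: int, n_old: int, n_new: int) -> int:
    # Approximation: only A and the B/C projections grow with the state
    # dimension, roughly 3 * d_inner parameters per added state channel
    return 3 * d_inner * (n_new - n_old)

d_inner = 1536  # e.g. d_model=768 with expand=2
s16 = recurrent_state_bytes(d_inner, 16)
s64 = recurrent_state_bytes(d_inner, 64)
extra = ssm_param_delta(d_inner, 16, 64)
print(s16, s64, extra)  # state memory quadruples; added parameters stay small
```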
Misconception: Mamba's scaling laws are fully characterized.
Published scaling studies for Mamba cover the 130M–2.8B parameter range on English text corpora. Generalization to multilingual data, code, or multimodal settings remains an open empirical question; the Mamba resources and tools page tracks ongoing work extending this empirical base.
Checklist or steps
The following sequence describes the standard empirical procedure for characterizing scaling behavior in a Mamba model family.
- Define the parameter sweep. Select at least 4 model sizes spanning at least one order of magnitude (e.g., 125M, 350M, 1B, 3B parameters) with matched depth-to-width ratios.
- Fix training token budget per size. Use a compute-matched or Chinchilla-optimal token count as baseline; run at least one undertrained and one overtrained configuration per size to bracket the loss curve.
- Log FLOPs per training step, not wall-clock time. Account for sequence length in FLOP estimates — Mamba's recurrent scan FLOPs must be calculated using the parallel scan formulation, not the naive recurrent formulation.
- Measure perplexity on held-out validation split from the same data distribution as training, using identical tokenization across all model sizes.
- Fit power-law curves to loss vs. parameter count (holding D fixed) and loss vs. token count (holding N fixed) using log-linear regression; report R² values and confidence intervals.
- Run downstream task evaluation on at minimum 3 tasks with differing recall and reasoning demands to separate perplexity-correlated gains from task-specific architectural effects.
- Document state dimension N, SSM expansion factor E, and number of layers as part of the model card — these variables interact with scaling behavior and must be fixed or swept explicitly in any comparison.
- Report efficiency metrics separately: training throughput (tokens/second/GPU), memory per parameter, and inference latency at 1K, 8K, and 32K token contexts. These distinguish Mamba's structural efficiency from its loss-curve position.
The Mamba model training guide provides implementation-level detail on configuring these experiments.
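The log-linear fit step in the checklist can be sketched as follows. This is a minimal ordinary-least-squares fit on synthetic data, not the fitting code from any cited paper; the function name and the synthetic exponent are assumptions.

```python
import math

def fit_power_law(n_params, losses):
    """Fit L(N) = a * N^(-alpha) by OLS in log-log space;
    returns (alpha, a, r_squared)."""
    xs = [math.log(n) for n in n_params]
    ys = [math.log(l) for l in losses]
    k = len(xs)
    mx, my = sum(xs) / k, sum(ys) / k
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return -slope, math.exp(intercept), 1.0 - ss_res / ss_tot

# Synthetic sanity check: data generated with alpha = 0.08 is recovered
ns = [1.3e8, 3.7e8, 7.9e8, 1.4e9, 2.8e9]
ls = [25.0 * n ** -0.08 for n in ns]
alpha, a, r2 = fit_power_law(ns, ls)
print(f"alpha={alpha:.3f}, r2={r2:.4f}")
```

Running the same fit against losses measured at fixed D (and again against token counts at fixed N) yields the two exponents the checklist asks you to report alongside R² values.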
Reference table or matrix
| Parameter Scale | Approximate FLOPs at 2K Tokens | Approximate FLOPs at 32K Tokens | Published Baseline Perplexity (Pile) | Primary Scaling Constraint |
|---|---|---|---|---|
| 130M | ~5.2 × 10¹⁰ | ~8.3 × 10¹² (Transformer) / ~8.3 × 10¹¹ (Mamba) | ~10.0 (Mamba, Gu & Dao 2023) | Memory bandwidth |
| 370M | ~1.5 × 10¹¹ | ~2.4 × 10¹³ (Transformer) / ~2.4 × 10¹² (Mamba) | ~8.3 (Mamba, Gu & Dao 2023) | Memory bandwidth / Arithmetic |
| 790M | ~3.2 × 10¹¹ | ~5.1 × 10¹³ (Transformer) / ~5.1 × 10¹² (Mamba) | ~7.4 (Mamba, Gu & Dao 2023) | Arithmetic throughput |
| 1.4B | ~5.6 × 10¹¹ | ~9.0 × 10¹³ (Transformer) / ~9.0 × 10¹² (Mamba) | ~6.9 (Mamba, Gu & Dao 2023) | Arithmetic throughput |
| 2.8B | ~1.1 × 10¹² | ~1.8 × 10¹⁴ (Transformer) / ~1.8 × 10¹³ (Mamba) | ~6.2 (Mamba, Gu & Dao 2023) | Arithmetic throughput |
FLOP estimates are approximate and reflect one forward pass at the stated sequence length. Transformer estimates assume standard multi-head attention with no sparse approximation. Mamba estimates reflect the parallel associative scan formulation used during training. Perplexity figures are drawn from Table 3 of Gu and Dao (2023).
The Mamba vs. Transformers comparison page provides a complementary breakdown of performance and efficiency metrics across architecture classes. For researchers exploring the broader landscape of SSM-based model design, the Mamba research papers index catalogs the primary literature on which empirical scaling findings rest. The index provides orientation across the full scope of Mamba-related reference material available on this domain.
References
- Gu, A. & Dao, T. (2023). "Mamba: Linear-Time Sequence Modeling with Selective State Spaces." arXiv:2312.00752
- Kaplan, J. et al. (2020). "Scaling Laws for Neural Language Models." OpenAI. arXiv:2001.08361
- Hoffmann, J. et al. (2022). "Training Compute-Optimal Large Language Models." DeepMind. arXiv:2203.15556
- Brown, T. et al. (2020). "Language Models are Few-Shot Learners." arXiv:2005.14165
- Dao, T. & Gu, A. (2024). "Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality." arXiv:2405.21060
- EleutherAI. "The Pile: An 800GB Dataset of Diverse Text for Language Modeling." arXiv:2101.00027