Linear-Time Scaling in Mamba Models
Linear-time scaling is the architectural property that allows Mamba models to process sequences of arbitrary length without the quadratic computational cost that constrains transformer-based systems. This page covers the definition, mechanism, practical scenarios, and decision boundaries relevant to practitioners and researchers evaluating Mamba's scaling behavior against competing architectures. The distinction between O(n) and O(n²) complexity is not academic — it determines which workloads are computationally feasible at production scale.
Definition and scope
In computational complexity terms, a model that scales linearly with sequence length expends resources proportional to n, where n is the number of tokens or time steps in the input. Transformer architectures, by contrast, impose O(n²) cost through the full self-attention mechanism: every token must attend to every other token, producing an attention matrix that grows quadratically. At a sequence length of 1,000 tokens, that matrix contains 1,000,000 entries; at 100,000 tokens, it contains 10,000,000,000 entries.
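The gap between the two growth rates can be made concrete with a few lines of arithmetic. The counts below are illustrative only; real costs also depend on model width, head count, and batch size:

```python
# Attention matrix entries grow quadratically with sequence length,
# while a recurrent SSM does a constant amount of work per token.
for n in (1_000, 10_000, 100_000):
    quadratic = n * n   # entries in an n x n attention matrix
    linear = n          # per-token steps for a fixed-size recurrence
    print(f"n={n:>7,}  attention entries={quadratic:>15,}  linear steps={linear:>7,}")
```

At n = 100,000 the attention matrix is five orders of magnitude larger than the linear step count, which is the asymmetry the rest of this page examines.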
Mamba, introduced by Albert Gu and Tri Dao in their 2023 paper Mamba: Linear-Time Sequence Modeling with Selective State Spaces (arXiv:2312.00752), achieves linear scaling by grounding sequence processing in State Space Models (SSMs) rather than attention. The scope of this property covers both training-time compute (FLOPs) and inference-time memory, which behave differently under the two paradigms.
Linear-time scaling holds in both of Mamba's computation modes. The Mamba architecture overview distinguishes between a parallel-scan mode used during training and a recurrent mode used during inference — each preserving the O(n) scaling property through a different computational strategy. (Earlier non-selective SSMs such as S4 also offered a convolutional training mode, but Mamba's input-dependent parameters break the time-invariance that convolution requires, which is why the parallel scan is needed.)
How it works
The linear-time property in Mamba derives from the selective state space mechanism, which compresses sequence history into a fixed-size hidden state vector rather than retaining explicit token-to-token relationships. The hidden state dimension is denoted N (typically 16 in published Mamba configurations); regardless of how long the input sequence becomes, the state remains bounded at size N.
The mechanism operates through three phases:
- State update: At each step t, the model updates its hidden state h(t) using structured matrices A and B, where the input x(t) modulates the transition. Because A and B are parameterized per step through selective projections, the model decides what information to retain — a property absent in classical linear SSMs, whose parameters are fixed across time steps.
- Output projection: The hidden state is projected to output y(t) through matrix C, again at O(1) cost per step.
- Parallel scan during training: Rather than executing the recurrence sequentially, Mamba uses a parallel associative scan (building on Blelloch's work at Carnegie Mellon University on parallel prefix algorithms) to compute all hidden states simultaneously, achieving O(n) work at O(log n) depth on parallel hardware.
The hardware-aware algorithm design exploits GPU memory hierarchy by keeping intermediate states in fast SRAM rather than writing to slower HBM (High Bandwidth Memory), reducing memory bandwidth demand by a factor proportional to the state expansion ratio.
For a direct architectural contrast, see Mamba vs. Transformers: the transformer requires storing a key-value cache that grows as O(n × d_model) per layer during inference, while Mamba's recurrent inference holds a fixed state independent of n.
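The contrast can be quantified with back-of-the-envelope byte counts. The configuration below (`d_model`, layer count, fp16 storage) is a hypothetical example, not a published checkpoint; only the growth shapes matter:

```python
# Illustrative memory comparison: transformer KV cache vs. Mamba recurrent state.
# Assumed (hypothetical) config: d_model=2048, 48 layers, fp16 (2 bytes/value).
d_model, n_layers, bytes_per = 2048, 48, 2
d_state, expand = 16, 2   # typical published Mamba settings

def kv_cache_bytes(n):
    # keys + values: two tensors of shape (n, d_model) per layer -> O(n)
    return 2 * n * d_model * n_layers * bytes_per

def mamba_state_bytes(_n):
    # fixed recurrent state: (expand * d_model, d_state) per layer, independent of n
    return expand * d_model * d_state * n_layers * bytes_per

for n in (1_024, 32_768, 262_144):
    print(f"n={n:>7,}  kv={kv_cache_bytes(n)/2**20:10.1f} MiB"
          f"  mamba={mamba_state_bytes(n)/2**20:10.1f} MiB")
```

The KV cache scales 256× going from 1,024 to 262,144 tokens, while the recurrent state is a constant.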
Common scenarios
Linear-time scaling becomes operationally significant in three categories of deployment:
Long-context document processing — Tasks involving inputs exceeding 8,192 tokens (legal contracts, genomic sequences, audio transcriptions) expose the transformer's quadratic cost directly. The Mamba genomics and bioinformatics application domain routinely processes sequences of 100,000 base pairs or longer, where quadratic attention is computationally prohibitive on standard GPU hardware.
Real-time inference on constrained hardware — Because Mamba inference holds a fixed recurrent state, latency per token remains constant regardless of prior context length. This contrasts with transformer key-value cache inference, where memory consumption and retrieval time grow with sequence position. For time-series forecasting applications running on edge hardware with 8–16 GB VRAM, this distinction governs feasibility.
Large-scale training runs — At a training sequence length of 32,768 tokens, a transformer's attention matrix contains 32,768² ≈ 1.07 billion entries per head, roughly a million times more than at a 32-token baseline, whereas Mamba's training cost grows only proportionally with length. Published benchmarks in the original Mamba paper report roughly 5× higher inference throughput than comparable transformer baselines at sequence length 2,048.
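Worked arithmetic for the training length cited above, assuming fp16 storage (2 bytes per entry) for a single attention head in one layer:

```python
# Attention matrix size at sequence length 32,768, one head, one layer.
n = 32_768
entries = n * n                       # 32,768^2 = 1,073,741,824 entries
print(f"{entries:,}")
print(f"{entries * 2 / 2**30:.1f} GiB")   # fp16: 2 bytes per entry -> 2.0 GiB
```

Fused-attention kernels avoid materializing this matrix in full, but the quadratic compute remains; the linear-time recurrence sidesteps it entirely.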
Decision boundaries
Linear-time scaling does not dominate in all conditions. Practitioners and researchers evaluating Mamba's limitations and tradeoffs should apply the following classification boundaries:
Mamba's linear scaling advantage is material when:
- Sequence length exceeds approximately 2,000 tokens in batch inference settings
- Inference latency per token must remain constant as context grows
- GPU memory is the binding constraint rather than model parameter count
Transformer attention retains advantages when:
- Tasks require dense, arbitrary token-to-token retrieval (retrieval-augmented generation over short passages)
- Sequence length is consistently below 1,024 tokens, where quadratic cost is negligible
- Existing fine-tuned transformer checkpoints eliminate retraining costs
Mamba hybrid models occupy an intermediate position, interleaving attention layers with SSM layers to retain targeted retrieval capability while amortizing the quadratic cost across fewer attention operations.
The Mamba scaling laws literature, building on Hoffmann et al.'s Chinchilla scaling analysis (DeepMind, 2022), investigates whether the linear-time property translates into improved parameter efficiency at fixed compute budgets — a question with direct implications for the broader Mamba model landscape indexed at the site root.
References
- Gu, A. & Dao, T. — Mamba: Linear-Time Sequence Modeling with Selective State Spaces (arXiv:2312.00752)
- Blelloch, G. — Prefix Sums and Their Applications, Carnegie Mellon University Technical Report CMU-CS-90-190
- Hoffmann, J. et al. — Training Compute-Optimal Large Language Models (DeepMind / arXiv:2203.15556)