Mamba Benchmarks: Performance Metrics and Results
Mamba's benchmark performance has attracted significant attention from the machine learning research community since the architecture's public release, with results spanning language modeling perplexity, throughput, memory consumption, and long-context recall tasks. This page covers the quantitative metrics used to evaluate Mamba models, the mechanisms that explain those results, the benchmark scenarios where Mamba demonstrates measurable advantages or shortfalls, and the decision criteria practitioners apply when interpreting benchmark data. Understanding where Mamba outperforms and where it falls short requires grounding in both the evaluation methodology and the architectural tradeoffs described in the original Mamba paper by Albert Gu and Tri Dao (2023).
Definition and scope
A benchmark in the context of Mamba refers to a standardized evaluation protocol applied to measure model performance along a specific quantitative dimension — throughput (tokens per second), memory footprint (gigabytes of GPU VRAM), perplexity on held-out text corpora, or accuracy on downstream tasks such as LAMBADA, HellaSwag, or PIQA. The scope of Mamba benchmarking spans three primary categories:
- Efficiency benchmarks — wall-clock inference speed, training throughput, and hardware memory utilization measured on standardized GPU hardware (typically NVIDIA A100 or H100 devices).
- Quality benchmarks — language modeling perplexity on corpora such as The Pile or WikiText-103, and zero-shot accuracy on reasoning and comprehension tasks drawn from the EleutherAI Language Model Evaluation Harness (EleutherAI lm-evaluation-harness).
- Scaling benchmarks — how perplexity and downstream accuracy change as parameter counts increase from 130M to 2.8B, evaluated against matched-parameter Transformer baselines.
These three categories constitute distinct measurement regimes, each requiring its own interpretive standards. Conflating efficiency gains with quality gains is a persistent source of confusion in practitioner-facing benchmark summaries.
How it works
Mamba's benchmark results emerge directly from its selective state space mechanism, which allows the model to filter which input tokens influence the hidden state at each step. This selectivity is the architectural feature that drives most of the efficiency numbers reported in published evaluations.
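The selection mechanism can be illustrated with a toy scalar-state recurrence. This is a minimal pedagogical sketch, not the fused CUDA kernel from the reference implementation; the function name and parameter layout are illustrative only. The key idea it shows is that the per-token step size Δ acts as a gate: a large Δ lets a token overwrite the hidden state, while a near-zero Δ lets the model skip it.

```python
import math

def selective_scan(x, dt, A, B, C):
    """Toy 1-D selective state space recurrence (illustrative, not the fused kernel).

    x    : input token values
    dt   : per-token step sizes Delta_t, the 'selection' signal; large Delta_t
           lets token t write strongly into the state, tiny Delta_t skips it
    A    : state decay parameter (negative scalar)
    B, C : per-token input and output projections
    """
    h, ys = 0.0, []
    for x_t, d_t, b_t, c_t in zip(x, dt, B, C):
        a_bar = math.exp(d_t * A)        # discretized decay exp(Delta_t * A)
        h = a_bar * h + d_t * b_t * x_t  # state update with input-dependent gate
        ys.append(c_t * h)               # per-token readout
    return ys

# The middle token has dt ~ 0, so it is effectively filtered out of the state
# even though its raw value (5.0) is the largest in the sequence:
ys = selective_scan(x=[1.0, 5.0, 1.0], dt=[1.0, 1e-6, 1.0],
                    A=-1.0, B=[1.0, 1.0, 1.0], C=[1.0, 1.0, 1.0])
```

In a real Mamba layer Δ, B, and C are produced by learned projections of the input, which is what makes the recurrence "selective" rather than time-invariant as in S4.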
In the Gu and Dao (2023) paper (arXiv:2312.00752), the following quantitative results were reported for the Mamba-3B model:
- Throughput: Mamba achieved approximately 5× higher inference throughput compared to a Transformer of equivalent parameter count, measured on a single A100-80GB GPU at sequence length 2048.
- Memory scaling: Memory consumption grew linearly with sequence length in Mamba, versus quadratically in standard attention-based Transformers — a structural property of Mamba's linear-time scaling design.
- Perplexity parity: At the 1.4B parameter scale on The Pile, Mamba matched or exceeded the perplexity of comparably sized Transformer baselines trained on identical token budgets.
- Long-context recall: On the Selective Copying and Induction Heads synthetic benchmarks — both testing memory over long sequences — Mamba demonstrated near-perfect accuracy at sequence lengths exceeding 1 million tokens where standard Transformers failed to generalize.
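For reading the perplexity-parity result above, it helps to recall how perplexity is derived: it is the exponential of the mean per-token negative log-likelihood, so lower is better and the value can be read as an effective branching factor. A minimal computation:

```python
import math

def perplexity(token_nlls):
    """Corpus perplexity = exp(mean per-token negative log-likelihood, in nats)."""
    return math.exp(sum(token_nlls) / len(token_nlls))

# A model that assigns every token probability 1/8 has perplexity exactly 8:
nlls = [math.log(8)] * 100
ppl = perplexity(nlls)
```

Because perplexity is exponential in the loss, small loss gaps between Mamba and a Transformer baseline translate into visible perplexity gaps, which is why matched token budgets matter when comparing the two.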
The hardware-aware algorithm design, which avoids materializing the full state expansion in HBM (high-bandwidth memory), is the implementation mechanism responsible for converting theoretical linear complexity into observed wall-clock gains. Without kernel fusion in CUDA, the throughput advantage over Transformers would not materialize at the hardware layer.
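The linear-versus-quadratic memory contrast can be made concrete with back-of-the-envelope byte counts. The sketch below uses hypothetical configuration numbers (head count, model width, state size) chosen only for illustration; it counts the materialized L×L attention score map against a recurrent state whose size is independent of sequence length.

```python
def attention_score_bytes(seq_len, n_heads, bytes_per_el=2):
    """Memory to materialize one L x L attention score map per head (fp16)."""
    return n_heads * seq_len * seq_len * bytes_per_el

def ssm_state_bytes(d_model, d_state, bytes_per_el=2):
    """Recurrent SSM inference state: fixed size, independent of sequence length."""
    return d_model * d_state * bytes_per_el

# Illustrative (hypothetical) configuration: 32 heads, d_model=2560, d_state=16.
for L in (2_048, 16_384, 131_072):
    print(f"L={L:>7}: attention scores {attention_score_bytes(L, 32):>15,} B, "
          f"SSM state {ssm_state_bytes(2560, 16):,} B")
```

Doubling the sequence length quadruples the attention score memory while leaving the SSM state untouched, which is the structural reason the long-context numbers in the paper favor Mamba.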
Common scenarios
Benchmark performance varies substantially across deployment contexts. The primary scenarios where Mamba's published results carry direct operational relevance include:
Long-context language modeling: At sequence lengths above 16,384 tokens, Mamba's memory footprint remains tractable on a single 80GB GPU where an equivalent Transformer would exceed memory capacity. This advantage is documented in the original paper and replicated in evaluations accompanying the Mamba-2 improvements, which extended the architecture to handle even longer dependencies through structured state space duality.
Genomics and bioinformatics: HyenaDNA and subsequent Mamba-based genomic models evaluated on the Genomic Benchmarks suite (Grešová et al., 2023) showed that linear-time sequence models outperform Transformers on DNA classification tasks at sequence lengths of 1,000 to 160,000 base pairs — a domain covered in detail on the Mamba genomics and bioinformatics reference page.
Audio and time-series: On the Speech Commands classification task, Mamba-based models matched Transformer accuracy while processing sequences 4× longer without additional memory overhead, as reported in ablations accompanying the S4 and Mamba lineage of state space model papers.
Comparison — Mamba vs. standard RNNs: Unlike LSTMs or GRUs, Mamba processes training sequences in parallel (not step-by-step), yielding training throughput that matches Transformer efficiency while retaining RNN-class inference behavior. This contrast is analyzed directly on the Mamba vs. RNNs reference page. The practical consequence is that Mamba's training wall-clock times on the same hardware are within 10–20% of matched Transformers, not the 3–4× penalty historically associated with recurrent architectures.
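The parallel-training property contrasted above rests on a mathematical fact worth making explicit: the linear recurrence h_t = a_t · h_{t-1} + b_t underlying SSM layers composes associatively, so prefixes can be computed with a parallel scan instead of a strictly sequential loop. The pure-Python sketch below (function names are illustrative, not from the mamba-ssm library) verifies that the scan formulation reproduces the sequential result; on a GPU the associative combine is what permits O(log L) parallel depth.

```python
from functools import reduce
from itertools import accumulate

def step(h, ab):
    """One sequential recurrence step: h_t = a_t * h_{t-1} + b_t."""
    a, b = ab
    return a * h + b

def combine(x, y):
    """Associative composition of two recurrence steps.

    Applying (a1, b1) then (a2, b2) to h gives a2*(a1*h + b1) + b2,
    i.e. the single step (a1*a2, a2*b1 + b2).
    """
    a1, b1 = x
    a2, b2 = y
    return (a1 * a2, a2 * b1 + b2)

coeffs = [(0.9, 1.0), (0.5, 2.0), (0.8, -1.0), (1.1, 0.5)]

# Sequential RNN-style evaluation: L dependent steps, no parallelism.
h_seq = reduce(step, coeffs, 0.0)

# Scan formulation: each prefix (A, B) satisfies h_t = A*h_0 + B. Because
# `combine` is associative, these prefixes admit a parallel tree reduction;
# accumulate runs it serially here, but the math is identical.
prefixes = list(accumulate(coeffs, combine))
h_scan = step(0.0, prefixes[-1])
```

This equivalence is why Mamba trains with Transformer-class throughput while still admitting constant-memory recurrent inference.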
Decision boundaries
Practitioners and researchers selecting benchmarks to report or interpret face several critical decision boundaries:
- Parameter matching vs. compute matching: A Mamba-3B model trained for 300B tokens is not equivalent to a Transformer-3B trained for 300B tokens if batch sizes, context lengths, or hardware differ. The MLCommons MLPerf Training benchmark framework provides standardized controls, though Mamba-specific submissions remain limited as of the architecture's early adoption phase.
- Task-specific vs. general claims: Mamba's throughput advantage is most pronounced at sequence lengths above 2,048. At lengths below 512, the advantage narrows to under 20% in most reported configurations, meaning efficiency claims should not be generalized across all sequence length regimes.
- Quality ceiling at large scale: Published scaling laws for Mamba suggest perplexity improvements that track Transformer scaling curves up to approximately 3B parameters, beyond which direct large-scale comparisons remain sparse in peer-reviewed literature.
- Hardware dependency: Benchmark numbers obtained on A100 hardware do not transfer directly to consumer GPUs without re-profiling; the kernel fusion optimizations underlying Mamba's speed claims require CUDA compute capability 8.0 or higher.
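A throughput comparison only respects the decision boundaries above if batch size, sequence length, and warmup are held fixed across architectures. The harness below is a hedged sketch of that discipline: `generate_fn` is a hypothetical callable standing in for the model under test, and on real GPU hardware you would additionally synchronize the device (e.g. `torch.cuda.synchronize`) before reading each timestamp so that asynchronous kernels are actually counted.

```python
import time

def measure_throughput(generate_fn, batch_size, seq_len, n_warmup=3, n_iters=10):
    """Tokens/second under a fixed batch size and sequence length.

    generate_fn(batch_size, seq_len) wraps one generation call of the model
    under test; warmup iterations absorb allocator, cache, and JIT effects
    so they do not distort the timed region.
    """
    for _ in range(n_warmup):
        generate_fn(batch_size, seq_len)
    t0 = time.perf_counter()
    for _ in range(n_iters):
        generate_fn(batch_size, seq_len)
    elapsed = time.perf_counter() - t0
    return (n_iters * batch_size * seq_len) / elapsed

# Dummy stand-in for a real model call, so the harness shape is testable:
tps = measure_throughput(lambda b, l: sum(range(b * l)),
                         batch_size=8, seq_len=2048)
```

Reporting the full tuple (hardware, batch size, sequence length, precision) alongside the tokens/second figure is what makes a Mamba-vs-Transformer throughput claim reproducible.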
The broader landscape of evaluation techniques applicable to Mamba is covered on the Mamba model evaluation techniques reference page, and the full architectural context grounding these metrics is available from the Mamba overview.
References
- Gu, A. & Dao, T. (2023). Mamba: Linear-Time Sequence Modeling with Selective State Spaces. arXiv:2312.00752
- EleutherAI Language Model Evaluation Harness
- MLCommons MLPerf Training Benchmarks
- Grešová et al. (2023). Genomic Benchmarks: A Collection of Datasets for Genomic Sequence Classification. BMC Genomic Data
- The Pile: An 800GB Dataset of Diverse Text for Language Modeling — EleutherAI