Mamba Benchmarks: Performance Metrics and Results

Mamba's benchmark performance has attracted significant attention from the machine learning research community since the architecture's public release, with results spanning language modeling perplexity, throughput, memory consumption, and long-context recall tasks. This page covers the quantitative metrics used to evaluate Mamba models, the mechanisms that explain those results, the benchmark scenarios where Mamba demonstrates measurable advantages or shortfalls, and the decision criteria practitioners apply when interpreting benchmark data. Understanding where Mamba outperforms and where it falls short requires grounding in both the evaluation methodology and the architectural tradeoffs described in the original Mamba paper by Albert Gu and Tri Dao (2023).

Definition and scope

A benchmark in the context of Mamba refers to a standardized evaluation protocol applied to measure model performance along a specific quantitative dimension — throughput (tokens per second), memory footprint (gigabytes of GPU VRAM), perplexity on held-out text corpora, or accuracy on downstream tasks such as LAMBADA, HellaSwag, or PIQA. The scope of Mamba benchmarking spans three primary categories:

  1. Efficiency benchmarks — wall-clock inference speed, training throughput, and hardware memory utilization measured on standardized GPU hardware (typically NVIDIA A100 or H100 devices).
  2. Quality benchmarks — language modeling perplexity on corpora such as The Pile or WikiText-103, and zero-shot accuracy on reasoning and comprehension tasks drawn from the EleutherAI Language Model Evaluation Harness (EleutherAI lm-evaluation-harness).
  3. Scaling benchmarks — how perplexity and downstream accuracy change as parameter counts increase from 130M to 2.8B, evaluated against matched-parameter Transformer baselines.

This page treats these three categories as distinct measurement regimes, each requiring its own interpretive standard. Conflating efficiency gains with quality gains is a persistent source of confusion in practitioner-facing benchmark summaries.
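As a concrete anchor for the quality category, perplexity is computed directly from per-token cross-entropy loss on held-out text. A minimal sketch (the token losses below are made-up illustrative values, not measurements from any Mamba run):

```python
import math

def perplexity(nlls):
    """Perplexity is the exponential of the mean per-token negative
    log-likelihood (natural log), so lower is better."""
    return math.exp(sum(nlls) / len(nlls))

# Hypothetical per-token losses from a held-out corpus slice.
losses = [2.1, 1.8, 2.4, 2.0]
print(round(perplexity(losses), 3))
```

Because perplexity is an exponential of an average, small differences in mean loss translate into visibly different perplexity gaps, which is why matched token budgets (as in the Pile comparisons below) matter when comparing architectures.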

How it works

Mamba's benchmark results emerge directly from its selective state space mechanism, which allows the model to filter which input tokens influence the hidden state at each step. This selectivity is the architectural feature that drives most of the efficiency numbers reported in published evaluations.

In the Gu and Dao (2023) paper (arXiv:2312.00752), the following quantitative results were reported across model scales:

  1. Throughput: Mamba achieved approximately 5× higher inference throughput compared to a Transformer of equivalent parameter count, measured on a single A100-80GB GPU at sequence length 2048.
  2. Memory scaling: Memory consumption grew linearly with sequence length in Mamba, versus quadratically in standard attention-based Transformers — a structural property of Mamba's linear-time scaling design.
  3. Perplexity parity: At the 1.4B parameter scale on The Pile, Mamba matched or exceeded the perplexity of comparably sized GPT-2-style Transformers trained on identical token budgets.
  4. Long-context recall: On the Selective Copying and Induction Heads synthetic benchmarks, both of which test recall over long sequences, Mamba demonstrated near-perfect accuracy, extrapolating on the Induction Heads task to sequence lengths exceeding 1 million tokens where standard Transformers failed to generalize.
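The memory-scaling contrast in point 2 can be made concrete with back-of-envelope arithmetic. The sketch below compares the quadratic attention-matrix term against a linearly growing SSM state footprint; all dimensions (16 heads, d_model = 2560, d_state = 16, expand = 2, fp16) are illustrative assumptions for this sketch, not figures taken from the paper:

```python
def attention_activation_bytes(seq_len, n_heads=16, bytes_per=2):
    # One L x L attention matrix per head (fp16): the quadratic term.
    return n_heads * seq_len * seq_len * bytes_per

def ssm_state_bytes(seq_len, d_model=2560, expand=2, d_state=16, bytes_per=2):
    # The per-step hidden state is (expand * d_model) x d_state, independent
    # of L; storing one state per token grows only linearly in L.
    return seq_len * expand * d_model * d_state * bytes_per

for L in (2_048, 16_384, 131_072):
    ratio = attention_activation_bytes(L) / ssm_state_bytes(L)
    print(f"L={L}: attention/SSM memory ratio = {ratio:.2f}")
```

With these assumed dimensions the quadratic term overtakes the linear one somewhere in the low tens of thousands of tokens, which is consistent with the qualitative crossover behavior described in the long-context scenario below.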

The hardware-aware algorithm design, which avoids materializing the full state expansion in HBM (high-bandwidth memory), is the implementation mechanism responsible for converting theoretical linear complexity into observed wall-clock gains. Without kernel fusion in CUDA, the throughput advantage over Transformers would not materialize at the hardware layer.
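To illustrate the recurrence the fused kernel evaluates, the following is a deliberately simplified, single-channel reference implementation of a selective scan in NumPy. The real Mamba kernel operates over many channels in parallel and, crucially, never writes the per-step hidden states back to HBM; the discretization shown (exponential for the state transition, Euler-style for the input) is a common simplification, not the paper's exact formulas:

```python
import numpy as np

def selective_scan(x, delta, A, B, C):
    """Sequential reference for the selective SSM recurrence:
    h_t = exp(delta_t * A) * h_{t-1} + delta_t * B_t * x_t,  y_t = C_t . h_t.
    delta, B, and C vary per token -- this input-dependence is the
    'selectivity' that lets the model filter which tokens update the state."""
    L, = x.shape
    h = np.zeros(A.shape[0])
    y = np.empty(L)
    for t in range(L):
        a_bar = np.exp(delta[t] * A)            # discretized state transition
        h = a_bar * h + delta[t] * B[t] * x[t]  # input-selective state update
        y[t] = C[t] @ h                         # readout
    return y

rng = np.random.default_rng(0)
L, N = 8, 4
y = selective_scan(rng.standard_normal(L), np.abs(rng.standard_normal(L)),
                   -np.abs(rng.standard_normal(N)),      # negative A: stable
                   rng.standard_normal((L, N)), rng.standard_normal((L, N)))
print(y.shape)  # (8,)
```

The loop above runs in O(L) time with O(d_state) working memory, which is the source of the linear memory scaling in point 2; the fused CUDA kernel keeps `h` in SRAM/registers across steps instead of materializing the full (L, d_state) tensor.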

Common scenarios

Benchmark performance varies substantially across deployment contexts. The primary scenarios where Mamba's published results carry direct operational relevance include:

Long-context language modeling: At sequence lengths above 16,384 tokens, Mamba's memory footprint remains tractable on a single 80GB GPU where an equivalent Transformer would exceed memory capacity. This advantage is documented in the original paper and replicated in the Mamba-2 improvements evaluation suite, which extended the architecture to handle even longer dependencies through structured state space duality.

Genomics and bioinformatics: HyenaDNA and subsequent Mamba-based genomic models evaluated on the Genomic Benchmarks suite (Grešová et al., 2023) showed that linear-time sequence models outperform Transformers on DNA classification tasks at sequence lengths of 1,000 to 160,000 base pairs — a domain covered in detail on the Mamba genomics and bioinformatics reference page.

Audio and time-series: On the Speech Commands classification task, Mamba-based models matched Transformer accuracy while processing sequences 4× longer without additional memory overhead, as reported in ablations accompanying the S4 and Mamba lineage of state space model papers.

Comparison — Mamba vs. standard RNNs: Unlike LSTMs or GRUs, Mamba processes training sequences in parallel (not step-by-step), yielding training throughput that matches Transformer efficiency while retaining RNN-class inference behavior. This contrast is analyzed directly on the Mamba vs. RNNs reference page. The practical consequence is that Mamba's training wall-clock times on the same hardware are within 10–20% of matched Transformers, not the 3–4× penalty historically associated with recurrent architectures.
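The parallelism claim rests on the fact that a linear recurrence, unlike an LSTM's nonlinear one, can be evaluated with scan primitives rather than a step-by-step loop. A minimal sketch, using cumulative products as a numerically naive stand-in for the hardware-aware parallel scan Mamba actually uses:

```python
import numpy as np

def scan_sequential(a, b):
    # RNN-style step-by-step evaluation of h_t = a_t * h_{t-1} + b_t.
    h, out = 0.0, []
    for at, bt in zip(a, b):
        h = at * h + bt
        out.append(h)
    return np.array(out)

def scan_parallel(a, b):
    # The same recurrence expressed via cumulative products/sums, which map
    # onto parallel-scan primitives (log-depth on a GPU, not L serial steps).
    P = np.cumprod(a)
    return P * np.cumsum(b / P)

rng = np.random.default_rng(1)
a = rng.uniform(0.5, 0.99, size=16)   # stable transition coefficients
b = rng.standard_normal(16)
assert np.allclose(scan_sequential(a, b), scan_parallel(a, b))
```

An LSTM's tanh/sigmoid gating breaks this algebraic rewriting, forcing serial training; Mamba's recurrence stays linear in the hidden state (the nonlinearity lives in how delta, B, and C are computed from the input), which is why its training throughput can approach Transformer efficiency.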

Decision boundaries

Practitioners and researchers selecting benchmarks to report or interpret face several critical decision boundaries, chief among them whether a reported number reflects efficiency, quality, or scaling behavior, and whether the hardware and sequence-length conditions of the measurement match the intended deployment.

The broader landscape of evaluation techniques applicable to Mamba is covered on the Mamba model evaluation techniques reference page, and the full architectural context grounding these metrics is available from the Mamba overview.

References