Mamba Terminology and Glossary of Key Terms
The vocabulary surrounding Mamba-based sequence modeling is precise and draws from control theory, signal processing, and deep learning in ways that differ meaningfully from transformer-centric literature. This glossary covers the foundational and advanced terms encountered across Mamba architecture documentation, research publications, and implementation frameworks. Familiarity with these definitions is a prerequisite for navigating technical architecture breakdowns, benchmark comparisons, and deployment decisions in production environments.
Definition and scope
Mamba terminology spans three overlapping domains: the mathematical foundations inherited from state space model (SSM) theory, the architectural innovations introduced in the original Mamba paper by Albert Gu and Tri Dao (2023), and the engineering vocabulary associated with hardware-aware implementation. The glossary below organizes these terms into thematic groups.
Core architectural terms:
- State Space Model (SSM): A class of sequence models derived from continuous-time linear dynamical systems. The canonical SSM maps an input sequence through latent state transitions governed by matrices A, B, C, and D. For broader context on SSMs as a model class, see State Space Models Explained.
- Discretization: The process of converting continuous-time SSM parameters into their discrete-time equivalents using methods such as the Zero-Order Hold (ZOH) or bilinear transform. Discretization produces the recurrence matrices Ā and B̄ that operate on sampled sequences.
- Selective State Space: The defining innovation of Mamba — a mechanism by which the SSM parameters (B, C, and the step size Δ) are functions of the input rather than fixed. This input-dependence enables content-based filtering of information along the sequence. See Selective State Spaces for the full technical treatment.
- Selection Mechanism: The broader term for Mamba's ability to selectively propagate or suppress information across timesteps based on input content, in contrast to time-invariant SSMs such as S4.
- Structured State Space Sequence (S4): The predecessor model to Mamba, developed at Stanford (Gu et al., 2022, "Efficiently Modeling Long Sequences with Structured State Space Models"). S4 uses fixed, structured parameterizations of A via HiPPO matrices to capture long-range dependencies.
- HiPPO Matrix: High-order Polynomial Projection Operators — a family of matrices introduced by Gu et al. that provide theoretically grounded initializations for the state transition matrix A, enabling stable compression of sequence history into fixed-size state representations.
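The discretization step above can be made concrete for a diagonal A. The sketch below is our own illustration of the textbook Zero-Order Hold formulas (Ā = exp(ΔA), B̄ = A⁻¹(exp(ΔA) − I)B), not code from any Mamba release; the function name and example values are hypothetical.

```python
import numpy as np

def zoh_discretize(A, B, delta):
    """Zero-Order Hold discretization of a diagonal continuous-time SSM.

    A:     (N,) diagonal of the state transition matrix (negative for stability)
    B:     (N,) input projection
    delta: scalar step size

    Returns the discrete-time (A_bar, B_bar) used in the recurrence
    h[t] = A_bar * h[t-1] + B_bar * x[t].
    """
    A_bar = np.exp(delta * A)              # exp(ΔA), elementwise for diagonal A
    B_bar = (A_bar - 1.0) / A * B          # A^{-1}(exp(ΔA) - I) B
    return A_bar, B_bar

A = np.array([-1.0, -0.5])
B = np.array([1.0, 1.0])
A_bar, B_bar = zoh_discretize(A, B, delta=0.1)
```

Note that as Δ → 0, B̄ approaches Δ·B, which is why some implementations substitute the simpler Euler approximation for B̄.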
How it works
Understanding Mamba's operational vocabulary requires distinguishing between the recurrent and convolutional representations of the same underlying computation.
Computational mode terms:
- Recurrent mode: Sequential computation where the hidden state h(t) is updated one timestep at a time using h(t) = Āh(t−1) + B̄x(t). This mode is efficient at inference time, requiring O(1) memory per step, but cannot be parallelized across the sequence dimension during training.
- Convolutional mode: A mathematically equivalent parallel formulation using the SSM kernel — the sequence of output responses to unit impulses — expressed as a global convolution over the input sequence. This mode enables GPU-parallel training but requires storing the full input sequence.
- Hardware-Aware Algorithm: The implementation strategy in Mamba that fuses the selective scan into a single GPU kernel, keeping intermediate states in SRAM (on-chip memory) rather than HBM (high-bandwidth memory off-chip). This design reduces memory I/O by an order of magnitude for typical sequence lengths. See Mamba Hardware-Aware Algorithms for benchmarks.
- Selective Scan: The core computational primitive of Mamba — a parallel prefix-sum-style operation over the sequence that applies input-dependent transition matrices. Because Δ, B, and C vary per token, this cannot be reduced to a simple convolution, requiring the hardware-aware kernel approach.
- Step Size (Δ): A learned, input-dependent scalar that controls the discretization of the continuous-time system at each token position. Larger Δ causes the model to "focus" on the current input; smaller Δ causes it to carry forward prior state.
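The equivalence between the recurrent and convolutional modes can be demonstrated directly for a time-invariant, diagonal SSM. The NumPy sketch below is our own illustration (function names are hypothetical), not the optimized implementation:

```python
import numpy as np

def recurrent_ssm(A_bar, B_bar, C, x):
    """Recurrent mode: h[t] = A_bar*h[t-1] + B_bar*x[t], y[t] = C h[t]."""
    h = np.zeros_like(A_bar)
    ys = []
    for xt in x:
        h = A_bar * h + B_bar * xt          # one O(N) update per timestep
        ys.append(C @ h)
    return np.array(ys)

def convolutional_ssm(A_bar, B_bar, C, x):
    """Convolutional mode: y = K * x with SSM kernel K[k] = C A_bar^k B_bar."""
    L = len(x)
    powers = A_bar[None, :] ** np.arange(L)[:, None]   # (L, N): A_bar^k
    K = (powers * (C * B_bar)[None, :]).sum(axis=1)    # (L,) kernel
    # causal convolution: y[t] = sum_j K[t-j] * x[j]
    return np.array([K[: t + 1][::-1] @ x[: t + 1] for t in range(L)])

# The two modes compute identical outputs for fixed (time-invariant) parameters:
rng = np.random.default_rng(0)
A_bar = rng.uniform(0.5, 0.95, size=4)                 # stable diagonal A_bar
B_bar, C = rng.normal(size=4), rng.normal(size=4)
x = rng.normal(size=12)
y_rec = recurrent_ssm(A_bar, B_bar, C, x)
y_conv = convolutional_ssm(A_bar, B_bar, C, x)
```

Once Δ, B, and C become input-dependent, no single kernel K exists, so this convolutional shortcut is unavailable — which is precisely why Mamba requires the fused selective scan.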
Common scenarios
Practitioners encounter distinct terminology subsets depending on application domain. The Mamba glossary reference covers domain-specific vocabulary extensions.
Vocabulary by application context:
- Long-context modeling: Terms such as state capacity, context compression, and effective receptive field describe how much sequence history is retained. Mamba's fixed-size state vector compresses arbitrarily long histories, unlike attention mechanisms that cache every prior token.
- Comparative benchmarking: Perplexity, throughput (tokens/second), memory footprint (GB), and FLOPs per token are the standard metrics used in Mamba vs. transformer comparisons. The Mamba vs. Transformers reference page covers these benchmarks in detail.
- Hybrid architectures: Mamba-Attention hybrid, interleaved layers, and cross-attention injection describe models that combine Mamba blocks with attention layers at defined depth intervals. See Mamba Hybrid Models.
- Mamba2: Terminology specific to the second-generation architecture includes State Space Duality (SSD), Structured Masked Attention (SMA), and multi-head SSM. SSD establishes a formal equivalence between certain SSM computations and a masked attention variant, enabling new theoretical analysis. See Mamba2 Improvements.
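The SSD equivalence can be sketched in miniature: the outputs of a diagonal, time-varying SSM scan equal multiplication by a lower-triangular, attention-like matrix M with M[t,j] = C_t (Ā_t ⋯ Ā_{j+1}) B̄_j. The construction below is our own illustrative sketch, not Mamba2's optimized algorithm:

```python
import numpy as np

def scan_ssm(A_bar, B_bar, C, x):
    """Sequential scan with per-token parameters (all arrays shaped (L, N))."""
    h = np.zeros(A_bar.shape[1])
    ys = []
    for t in range(len(x)):
        h = A_bar[t] * h + B_bar[t] * x[t]
        ys.append(C[t] @ h)
    return np.array(ys)

def masked_matrix_ssm(A_bar, B_bar, C, x):
    """Same computation as a lower-triangular (masked-attention-like) matrix."""
    L, N = A_bar.shape
    M = np.zeros((L, L))
    for t in range(L):
        prod = np.ones(N)                   # running product A_bar[t]...A_bar[j+1]
        for j in range(t, -1, -1):
            M[t, j] = C[t] @ (prod * B_bar[j])
            prod = prod * A_bar[j]
    return M @ x

rng = np.random.default_rng(1)
L, N = 8, 3
A_bar = rng.uniform(0.5, 0.95, size=(L, N))
B_bar = rng.normal(size=(L, N))
C = rng.normal(size=(L, N))
x = rng.normal(size=L)
```

Viewing the scan as a structured masked matrix is what lets SSD transfer attention-style analysis and algorithms to SSMs.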
Decision boundaries
Several term pairs appear synonymous but carry precise technical differences, and the distinctions are operationally significant.
S4 vs. Mamba: S4 uses time-invariant parameters — A, B, C are fixed across all timesteps for a given layer. Mamba makes B, C, and Δ functions of the input x(t), producing a time-varying system. This difference determines whether selective filtering of input content is possible.
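The time-varying system can be sketched as a sequential reference loop (our own illustrative code, assuming a diagonal A and ZOH discretization; the production implementation fuses this into a single GPU kernel):

```python
import numpy as np

def selective_scan(A, x, delta, B, C):
    """Reference selective scan with input-dependent Delta, B, C (diagonal A).

    A:      (N,) continuous-time diagonal transition, fixed per layer
    x:      (L,) input sequence
    delta:  (L,) per-token step size       -- input-dependent in Mamba
    B, C:   (L, N) per-token projections   -- input-dependent in Mamba
    """
    h = np.zeros(A.shape[0])
    ys = []
    for t in range(len(x)):
        A_bar = np.exp(delta[t] * A)         # ZOH: discretization varies per token
        B_bar = (A_bar - 1.0) / A * B[t]
        h = A_bar * h + B_bar * x[t]
        ys.append(C[t] @ h)
    return np.array(ys)

# With a negative A, a large Delta drives A_bar toward 0: the state is reset
# and the output depends only on the current token ("focus"); a small Delta
# keeps A_bar near 1, carrying prior state forward.
rng = np.random.default_rng(2)
L, N = 6, 2
A = np.array([-1.0, -0.5])
B, C = rng.normal(size=(L, N)), rng.normal(size=(L, N))
x = rng.normal(size=L)
big_delta = np.full(L, 50.0)
```

This makes the S4/Mamba boundary concrete: freeze `delta`, `B`, and `C` across tokens and the loop collapses back to a time-invariant SSM with a convolutional equivalent.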
Recurrence vs. attention: Both mechanisms aggregate sequence history, but recurrence compresses it into a fixed-size state (O(1) memory per step at inference) whereas attention retains an explicit key-value cache that grows linearly with sequence length. Mamba's linear-time scaling is a direct consequence of the recurrent formulation.
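The scaling difference is easy to make concrete with back-of-envelope arithmetic. All dimensions below are hypothetical, chosen to resemble a 7B-scale model in fp16, and ignore details such as Mamba's short convolution buffer:

```python
def kv_cache_bytes(seq_len, n_layers=32, n_heads=32, head_dim=128, bytes_per=2):
    """Attention KV cache: keys + values per layer, grows linearly in seq_len."""
    return 2 * n_layers * n_heads * head_dim * seq_len * bytes_per

def mamba_state_bytes(n_layers=32, d_inner=4096, state_dim=16, bytes_per=2):
    """Mamba SSM state: fixed size, independent of sequence length."""
    return n_layers * d_inner * state_dim * bytes_per

# ~0.5 GB of KV cache at 1K tokens (and growing) vs. ~4 MB of fixed SSM state
ratio_at_64k = kv_cache_bytes(65_536) / mamba_state_bytes()
```

Under these assumed dimensions the KV cache at 64K tokens is several thousand times larger than Mamba's state, and the gap widens linearly with context length.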
SSM vs. RNN: Structured state space models and recurrent neural networks are both recurrent, but SSMs are defined by linear state transitions with principled continuous-time derivations, whereas classical RNNs (LSTM, GRU) use nonlinear gating without continuous-time grounding. See Mamba vs. RNNs for the full comparison.
The main Mamba reference index provides the entry point for navigating the full technical and applied coverage of Mamba as a model family.
References
- Gu, A. & Dao, T. (2023). "Mamba: Linear-Time Sequence Modeling with Selective State Spaces." arXiv:2312.00752
- Gu, A. et al. (2022). "Efficiently Modeling Long Sequences with Structured State Space Models." ICLR 2022. arXiv:2111.00396
- Gu, A. et al. (2020). "HiPPO: Recurrent Memory with Optimal Polynomial Projections." NeurIPS 2020. arXiv:2008.07669
- Dao, T. & Gu, A. (2024). "Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality." arXiv:2405.21060
- EleutherAI — Open-source Mamba implementations and evaluation harness