State Space Models: The Foundation of Mamba
State space models (SSMs) form the mathematical backbone of the Mamba architecture, providing a principled framework for mapping continuous-time input sequences to outputs through a latent state representation. This page covers the formal definition, structural mechanics, causal relationships, classification boundaries, and known tensions of SSMs as deployed in modern sequence modeling systems. Understanding SSMs as a technical substrate is essential for practitioners working on the Mamba architecture, competitive benchmarking, and system design decisions across domains from genomics to audio.
- Definition and scope
- Core mechanics or structure
- Causal relationships or drivers
- Classification boundaries
- Tradeoffs and tensions
- Common misconceptions
- Checklist or steps
- Reference table or matrix
Definition and scope
A state space model, in the context of deep sequence modeling, is a system that represents the relationship between an input signal $u(t)$, a hidden state vector $h(t)$, and an output signal $y(t)$ through a pair of differential equations:
$$\dot{h}(t) = Ah(t) + Bu(t)$$
$$y(t) = Ch(t) + Du(t)$$
Here, $A$ is the state transition matrix, $B$ is the input projection matrix, $C$ is the output projection matrix, and $D$ is the direct feedthrough matrix. This formulation originates in control theory and signal processing, and was formalized in the linear-systems literature associated with Rudolf Kalman's 1960 paper on linear filtering (Kalman, 1960, Journal of Basic Engineering, ASME).
Within deep learning, SSMs operate over discrete token sequences rather than continuous time. Discretization transforms the continuous-time $A$ and $B$ matrices into discrete counterparts $\bar{A}$ and $\bar{B}$ using a step size $\Delta$, typically via the zero-order hold (ZOH) method. The scope of SSMs as a category encompasses all architectures that rely on this latent-state recurrence structure, including S4, S4D, DSS, H3, and Mamba itself. The Mamba model overview positions the architecture explicitly as an SSM with input-dependent (selective) parameterization.
Core mechanics or structure
The operational mechanics of a discrete-time SSM reduce to three computations per sequence step:
- State update: $h_t = \bar{A}h_{t-1} + \bar{B}u_t$
- Output emission: $y_t = Ch_t$
- Matrix discretization: $\bar{A} = e^{\Delta A}$, $\bar{B} = (\Delta A)^{-1}(e^{\Delta A} - I)\Delta B$
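For the diagonal $A$ used in most structured variants, all three computations reduce to elementwise operations. The NumPy sketch below, with illustrative function names (not a library API), shows ZOH discretization and a single recurrence step; it is a reference sketch, not the optimized kernel used in practice.

```python
import numpy as np

def zoh_discretize(A_diag, B, dt):
    """ZOH for diagonal A (shape (N,)): the matrix exponential is elementwise.

    A_bar = exp(dt*A);  B_bar = (dt*A)^{-1}(exp(dt*A) - I) * dt*B,
    which for diagonal A simplifies elementwise to (A_bar - 1) / A * B.
    """
    A_bar = np.exp(dt * A_diag)
    B_bar = (A_bar - 1.0) / A_diag * B
    return A_bar, B_bar

def ssm_step(h_prev, u_t, A_bar, B_bar, C):
    """One discrete step: h_t = A_bar h_{t-1} + B_bar u_t,  y_t = C h_t."""
    h_t = A_bar * h_prev + B_bar * u_t
    return h_t, C @ h_t

# A stable diagonal A (negative real parts) keeps every entry of A_bar in (0, 1).
A_diag = -np.arange(1.0, 5.0)
A_bar, B_bar = zoh_discretize(A_diag, np.ones(4), dt=0.1)
h, y = ssm_step(np.zeros(4), 1.0, A_bar, B_bar, np.ones(4))
```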
Because the recurrence is linear, the entire output sequence can also be computed as a convolution with a kernel $K = (C\bar{B}, C\bar{A}\bar{B}, C\bar{A}^2\bar{B}, \ldots)$, allowing parallelizable training. This dual-mode property — recurrent inference, convolutional training — is structurally central to SSM efficiency.
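This equivalence can be verified numerically for a diagonal LTI SSM. The sketch below (helper names are assumptions, not a library API) computes the same outputs once as an $O(L)$ recurrence and once as a causal convolution with the kernel $K$:

```python
import numpy as np

def ssm_recurrent(A_bar, B_bar, C, u):
    """O(L) recurrence over a scalar input sequence (diagonal A_bar, shape (N,))."""
    h = np.zeros_like(A_bar)
    ys = []
    for u_t in u:
        h = A_bar * h + B_bar * u_t
        ys.append(C @ h)
    return np.array(ys)

def ssm_convolutional(A_bar, B_bar, C, u):
    """Same outputs via the kernel K = (C B_bar, C A_bar B_bar, C A_bar^2 B_bar, ...)."""
    L = len(u)
    powers = A_bar[None, :] ** np.arange(L)[:, None]  # (L, N): row k holds A_bar^k
    K = (powers * B_bar[None, :]) @ C                 # (L,):  K_k = C A_bar^k B_bar
    return np.convolve(u, K)[:L]                      # causal convolution, truncated to L
```

In practice S4 evaluates this convolution with FFTs in $O(L \log L)$; the direct `np.convolve` here is only for clarity.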
The state matrix $A$ governs memory decay. In structured variants, $A$ is constrained to diagonal or normal form to reduce the cost of matrix exponentiation. The Structured State Space (S4) model, introduced by Gu, Goel, and Ré at Stanford (Gu et al., "Efficiently Modeling Long Sequences with Structured State Spaces," ICLR 2022), demonstrated that parameterizing $A$ using the HiPPO (High-order Polynomial Projection Operators) initialization substantially improves long-range dependency capture compared to random initialization.
The state dimension $N$ determines model capacity and memory cost. In standard SSM implementations, the state dimension is fixed and input-independent. Mamba's key departure is making $B$, $C$, and $\Delta$ functions of the input, enabling content-aware state updates. This is the mechanism described as selective state spaces.
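A simplified, sequential sketch of this selective parameterization follows. The projection matrices `W_B`, `W_C`, and `W_dt` are hypothetical stand-ins for learned weights, and the Python loop is a reference implementation, not the hardware-aware parallel scan:

```python
import numpy as np

def selective_ssm(x, A_diag, W_B, W_C, W_dt):
    """Sequential reference for a selective SSM over features x: (L, D).

    A_diag: (N,) fixed diagonal state matrix.
    W_B, W_C: (D, N) projections making B_t and C_t functions of x_t.
    W_dt: (D, D) projection to a per-channel step size (softplus keeps it positive).
    """
    L, D = x.shape
    h = np.zeros((D, A_diag.shape[0]))                     # one state vector per channel
    y = np.empty((L, D))
    for t in range(L):
        B_t = x[t] @ W_B                                   # input-dependent B, shared across channels
        C_t = x[t] @ W_C                                   # input-dependent C
        dt = np.log1p(np.exp(x[t] @ W_dt))                 # per-channel step size > 0
        A_bar = np.exp(dt[:, None] * A_diag[None, :])      # per-channel ZOH, (D, N)
        B_bar = (A_bar - 1.0) / A_diag[None, :] * B_t[None, :]
        h = A_bar * h + B_bar * x[t][:, None]              # content-aware state update
        y[t] = h @ C_t
    return y
```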
Causal relationships or drivers
Three structural properties of SSMs drive their adoption in sequence modeling contexts:
Linear complexity in sequence length: The SSM recurrence performs a constant amount of work per token, so total compute scales as $O(L)$ in sequence length $L$, versus $O(L^2)$ for full self-attention. This scaling difference becomes operationally significant at sequence lengths in the low thousands of tokens, where attention's quadratic memory footprint typically begins to require chunking or approximation.
Long-range memory through eigenvalue control: The eigenvalues of $A$ determine how quickly historical context decays. HiPPO-initialized matrices place eigenvalues so that, after discretization, they lie close to (but inside) the unit circle, producing the slow decay needed for online polynomial approximation of the input history, as shown in the HiPPO paper (Gu et al., NeurIPS 2020, arXiv:2008.07669).
Convolutional equivalence enabling parallel training: The same SSM that runs as an $O(L)$ recurrence at inference time can be unrolled into a length-$L$ convolution at training time, allowing full GPU parallelism. This eliminates the training-time disadvantage that plagued earlier RNN architectures. For practitioners working on Mamba hardware-aware algorithms, this duality is the key algorithmic entry point.
Classification boundaries
SSMs subdivide along four primary axes:
Structure of $A$: Dense $A$ (general SSMs), diagonal $A$ (DSS, S4D), normal $A$ (S4), and HiPPO-structured $A$. Each imposes different computational costs and expressivity trade-offs.
Input dependence: Time-invariant SSMs hold $A$, $B$, $C$ fixed across the sequence (S4, H3). Time-varying or selective SSMs allow these matrices to depend on the input token (Mamba). The distinction is structurally equivalent to the difference between a linear time-invariant (LTI) system and a linear time-varying (LTV) system in classical control theory.
Discretization method: ZOH, bilinear (Tustin), and Euler methods produce different $\bar{A}$ and $\bar{B}$ under the same continuous-time parameters. ZOH is standard in Mamba; bilinear discretization appears in S4 variants.
Layering strategy: SSM layers are typically interleaved with pointwise nonlinearities (e.g., SiLU/Swish gates) and normalization layers to form deep sequence models. The gated SSM block used in Mamba is distinct from the pure-SSM blocks in S4, and the differences affect both representational capacity and practical stability during training.
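The discretization axis above is easy to make concrete for a scalar (or diagonal) coefficient $a$. The three rules in the sketch below (function name is illustrative) agree to first order in $\Delta$ but diverge for larger step sizes:

```python
import numpy as np

def discretize(a, dt, method="zoh"):
    """Discrete transition a_bar for a scalar continuous-time coefficient a."""
    if method == "zoh":
        return np.exp(dt * a)                        # exact under piecewise-constant input
    if method == "bilinear":
        return (1 + dt * a / 2) / (1 - dt * a / 2)   # Tustin transform
    if method == "euler":
        return 1 + dt * a                            # forward Euler, can lose stability
    raise ValueError(f"unknown method: {method}")
```

For a stable coefficient ($a < 0$), ZOH and bilinear keep $|\bar{a}| < 1$ for any positive step size, while forward Euler diverges once $\Delta \cdot |a| > 2$.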
Tradeoffs and tensions
The primary operational tension in SSM design is between state size and expressivity. Larger state dimension $N$ increases the model's capacity to retain historical context but raises both parameter count and memory bandwidth requirements during inference. This is not a theoretical concern: a Mamba model with state dimension $N = 16$ uses 16 times the state memory of an equivalent model with $N = 1$, and state memory must be loaded and written every token step.
A second tension exists between input-dependence and parallelizability. Time-invariant SSMs admit pure convolutional computation during training. Selective (time-varying) SSMs break the global convolution structure, requiring the parallel selective scan algorithm described in the original Mamba paper (Gu & Dao, 2023, arXiv:2312.00752) to recover training efficiency. This algorithm relies on IO-aware kernel fusion and tiling on modern NVIDIA GPUs such as the A100, making Mamba's efficiency properties hardware-dependent in a way that S4 is not.
A third tension involves the recurrent state bottleneck. SSMs compress all historical context into a fixed-size state vector of dimension $N \times D$ (where $D$ is the model dimension). Transformers, by contrast, cache the full key-value history. For tasks requiring verbatim retrieval of tokens from thousands of positions back, SSMs show measurable degradation, as documented in the Mamba benchmarks and performance literature. This makes SSMs structurally weaker on associative recall tasks than on tasks requiring smooth temporal integration.
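The asymmetry in inference-time memory can be made concrete with back-of-envelope arithmetic. The helper names, dtype (2 bytes for fp16/bf16), and single-layer default below are assumptions for illustration:

```python
def ssm_state_bytes(n_state, d_model, n_layers=1, bytes_per_val=2):
    """Fixed-size recurrent state: N x D values per layer, independent of context length."""
    return n_state * d_model * n_layers * bytes_per_val

def kv_cache_bytes(seq_len, d_model, n_layers=1, bytes_per_val=2):
    """Transformer KV cache: keys + values for every past token, grows linearly with L."""
    return 2 * seq_len * d_model * n_layers * bytes_per_val
```

The SSM state stays constant as context grows, while the KV cache keeps expanding; that constancy is also exactly why verbatim retrieval from the distant past is harder for SSMs.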
Common misconceptions
Misconception: SSMs are a type of RNN.
SSMs share the recurrent inference structure of RNNs but differ fundamentally in their training dynamics. Standard RNNs trained with backpropagation through time (BPTT) suffer from gradient vanishing that worsens with sequence length. LTI SSMs trained via convolutional unrolling do not: gradients flow through the convolutional kernel, not through a recurrence chain. This makes SSMs more closely related to temporal convolutional networks than to LSTMs in their training behavior.
Misconception: The $D$ matrix is a core component of SSM expressivity.
In practice, $D$ is typically set to zero or to a scalar skip connection. The expressivity of an SSM comes almost entirely from $A$, $B$, and $C$. The direct feedthrough term $D$ is retained for formal completeness in the system equations but contributes negligibly to sequence modeling performance.
Misconception: Mamba eliminates the need for attention entirely.
Mamba replaces the self-attention layer with an SSM block but retains gating, normalization, and feedforward components. Hybrid architectures pairing SSM layers with sparse attention heads — such as those reviewed in Mamba hybrid models — consistently outperform pure-SSM models on tasks with strong associative recall requirements, indicating that SSMs and attention address partially overlapping but non-identical computational problems.
Misconception: HiPPO initialization is optional.
Empirical results from Gu et al. (2021) showed that random $A$ initialization fails to learn long-range dependencies on the Long Range Arena benchmark, while HiPPO initialization succeeds. The initialization is not a free hyperparameter: structured initializations derived from HiPPO (or simplified diagonal variants, as in S4D) are effectively required for SSMs to function on sequences longer than a few hundred tokens.
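The HiPPO-LegS matrix has a simple closed form. The sketch below follows the construction popularized by the S4 line of work; treat the exact scaling as an assumption to verify against the paper:

```python
import numpy as np

def hippo_legs(n_state):
    """HiPPO-LegS state matrix:
    A[n, k] = -sqrt(2n+1)*sqrt(2k+1)  for n > k,
              -(n + 1)                for n == k,
              0                       for n < k.
    """
    p = np.sqrt(1 + 2 * np.arange(n_state))
    A = p[:, None] * p[None, :]
    A = np.tril(A) - np.diag(np.arange(n_state))
    return -A
```

The matrix is lower-triangular with strictly negative diagonal, so the continuous-time system is stable and the discretized eigenvalues fall inside the unit circle.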
Checklist or steps
The following sequence describes the structural components that must be specified to fully define a discrete-time SSM layer for a sequence model:
- [ ] Specify state dimension $N$ (typical values: 16, 64, 256)
- [ ] Choose $A$ parameterization: dense, diagonal, normal, or HiPPO-structured
- [ ] Select discretization method: ZOH, bilinear, or Euler
- [ ] Set step size $\Delta$ (learnable scalar or input-dependent projection)
- [ ] Determine input dependence: time-invariant ($B$, $C$ fixed) or selective ($B$, $C$, $\Delta$ functions of input)
- [ ] Define $B$ and $C$ projection dimensions (typically $D$-to-$N$ and $N$-to-$D$)
- [ ] Select training-time computation: global convolution (for LTI) or parallel selective scan (for selective SSMs)
- [ ] Specify nonlinearity and gating structure wrapping the SSM block (e.g., SiLU gate, residual connection)
- [ ] Choose normalization placement: pre-norm or post-norm relative to the SSM block
- [ ] Confirm hardware-specific kernel availability if using selective scan (CUDA kernels required for Mamba-style selective scan)
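Putting the checklist together, a minimal LTI layer (diagonal $A$, ZOH discretization, scalar learnable step, time-invariant $B$/$C$, residual skip) might be sketched as follows. Class and attribute names are illustrative, and a trained layer would learn these parameters rather than fix them at initialization:

```python
import numpy as np

class DiagonalSSMLayer:
    """Checklist instantiation: diagonal A, ZOH discretization, learnable
    log step size, time-invariant B/C projections, residual skip."""

    def __init__(self, d_model, d_state=16, seed=0):
        rng = np.random.default_rng(seed)
        self.A = -(1.0 + np.arange(d_state))                           # stable diagonal init
        self.B = rng.standard_normal((d_model, d_state)) / np.sqrt(d_model)
        self.C = rng.standard_normal((d_state, d_model)) / np.sqrt(d_state)
        self.log_dt = np.log(0.01)                                     # learnable step size

    def __call__(self, x):
        """x: (L, d_model) -> (L, d_model), via the O(L) recurrence."""
        dt = np.exp(self.log_dt)
        A_bar = np.exp(dt * self.A)                                    # (N,)
        zoh = (A_bar - 1.0) / self.A                                   # ZOH input factor
        h = np.zeros_like(self.A)
        out = np.empty_like(x)
        for t, x_t in enumerate(x):
            h = A_bar * h + zoh * (x_t @ self.B)                       # state update
            out[t] = h @ self.C + x_t                                  # output + residual skip
        return out
```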
Reference table or matrix
| Property | S4 (LTI) | S4D (Diagonal) | Mamba (Selective) |
|---|---|---|---|
| $A$ structure | Normal (HiPPO) | Diagonal | Diagonal + input-dep. |
| Input dependence | None | None | $B$, $C$, $\Delta$ vary by token |
| Training compute | $O(L \log L)$ conv | $O(L \log L)$ conv | $O(L)$ parallel scan |
| Inference compute | $O(L)$ recurrence | $O(L)$ recurrence | $O(L)$ recurrence |
| Associative recall | Weak | Weak | Moderate |
| Long-range integration | Strong | Strong | Strong |
| Hardware dependency | Low | Low | High (CUDA kernel) |
| Introduced | ICLR 2022 | NeurIPS 2022 | arXiv 2312.00752 |
| Primary reference | Gu et al. 2021 | Gu et al. 2022 | Gu & Dao 2023 |
Additional variant comparisons covering RNN-to-SSM distinctions are available through Mamba vs RNNs and Mamba vs Transformers. Practitioners evaluating SSMs for sequence modeling applications should cross-reference these structural properties against task-specific recall and integration requirements.
References
- Kalman, R.E. (1960). "A New Approach to Linear Filtering and Prediction Problems." Journal of Basic Engineering, ASME.
- Gu, A., Johnson, I., Goel, K., et al. (2020). "HiPPO: Recurrent Memory with Optimal Polynomial Projections." NeurIPS 2020. arXiv:2008.07669
- Gu, A., Goel, K., & Ré, C. (2021). "Efficiently Modeling Long Sequences with Structured State Spaces." ICLR 2022. arXiv:2111.00396
- Gu, A., & Dao, T. (2023). "Mamba: Linear-Time Sequence Modeling with Selective State Spaces." arXiv:2312.00752
- Gu, A., Gupta, A., Goel, K., & Ré, C. (2022). "On the Parameterization and Initialization of Diagonal State Space Models." NeurIPS 2022 (S4D). arXiv:2206.11893
- Long Range Arena Benchmark — Google Research. arXiv:2011.04006