How much information can a Transformer's attention mechanism store? This question connects modern deep learning to classical associative memory theory. In this post, I present a mathematical analysis, based on relative error bounds, showing that linear attention has a capacity of \(O(d)\) while softmax attention achieves \(O(e^{d/2})\), an exponential difference that helps explain their relative performance characteristics.
Note: This analysis follows the excellent framework presented in "Understanding Transformer from the Perspective of Associative Memory" by Zhong et al. (2025). Their paper provides the complete derivation, and this post presents my own working through of these ideas using a similar approach to build intuition.
Attention as Associative Memory
The key insight is to view attention mechanisms as associative memory systems. We want to store \(N\) key-value pairs \(\{(k_1, v_1), (k_2, v_2), \ldots, (k_N, v_N)\}\) and reliably retrieve value \(v_i\) when queried with key \(k_i\).
This framing immediately raises the central question: what is the maximum number of pairs \(N\) that can be stored as a function of the hidden dimension \(d\), while maintaining reliable retrieval?
Model Assumptions
To make the analysis tractable, we assume:
- Keys \(k_i \in \mathbb{R}^d\): Random unit vectors drawn uniformly from the \(d\)-dimensional hypersphere
- Values \(v_i \in \mathbb{R}^d\): Random unit vectors, independent of keys, with zero mean (\(E[v_i] = 0\))
- Retrieval criterion: We query with \(q = k_i\) (exact match) and expect to retrieve \(\tilde{v}_i\) such that \(\|\tilde{v}_i - v_i\| < \epsilon\|v_i\| = \epsilon\)
- Error tolerance: We allow a relative error of \(\epsilon\) (e.g., \(\epsilon = 0.1\) for 10% error)
- Dimension: \(d\) refers to the head dimension (\(d_k\) in transformer notation)
These assumptions model the "typical case" behavior and allow us to derive capacity limits.
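As a quick numerical sanity check on these assumptions (my own sketch, not from the paper), the following NumPy snippet samples unit vectors uniformly from the hypersphere by normalizing Gaussian draws, and verifies the two facts used repeatedly below: dot products between independent keys have mean approximately 0 and second moment approximately \(1/d\).

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_pairs = 128, 20_000

def random_unit_vectors(n, d):
    # Normalizing i.i.d. Gaussian draws gives points uniform on the hypersphere.
    x = rng.standard_normal((n, d))
    return x / np.linalg.norm(x, axis=1, keepdims=True)

a = random_unit_vectors(n_pairs, d)
b = random_unit_vectors(n_pairs, d)
dots = np.einsum("nd,nd->n", a, b)  # one k_i^T k_j per independent pair

print(f"mean dot product : {dots.mean():+.4f}  (expected ~ 0)")
print(f"mean squared dot : {(dots**2).mean():.5f}  (expected 1/d = {1/d:.5f})")
```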
Linear Attention: Superposition and Its Limits
Linear attention stores information in an explicit state matrix built from outer products of values and keys:

\[
S = \sum_{j=1}^{N} v_j k_j^T.
\]

When we query with \(k_i\), the retrieval process gives:

\[
\tilde{v}_i = S k_i = \sum_{j=1}^{N} v_j \left(k_j^T k_i\right) = v_i + \sum_{j \neq i} v_j \left(k_j^T k_i\right).
\]
Since \(k_i\) are unit vectors, \(k_i^T k_i = 1\), so we perfectly retrieve the signal component. However, we also get crosstalk from all other stored patterns.
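Here is a minimal NumPy sketch of this store-and-retrieve scheme under the unit-vector assumptions above; the variable names (`S`, `keys`, `values`) are my own and not taken from any particular library.

```python
import numpy as np

rng = np.random.default_rng(1)
d, N = 128, 32

def random_unit_vectors(n, d):
    x = rng.standard_normal((n, d))
    return x / np.linalg.norm(x, axis=1, keepdims=True)

keys = random_unit_vectors(N, d)
values = random_unit_vectors(N, d)

# Store: accumulate the outer products v_j k_j^T into one d x d state matrix.
S = values.T @ keys  # equal to sum_j outer(values[j], keys[j])

# Retrieve with an exact-match query q = k_0.
v_tilde = S @ keys[0]  # = v_0 + sum_{j != 0} v_j (k_j^T k_0)
crosstalk = v_tilde - values[0]

print(f"relative retrieval error (one trial): {np.linalg.norm(crosstalk):.3f}")
print(f"typical error sqrt((N-1)/d)         : {np.sqrt((N - 1) / d):.3f}")
```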
Error Analysis and Capacity
To find the capacity, we analyze the magnitude of the error term. We'll make the standard assumption that keys \(k_j\) and values \(v_j\) are random, independent unit vectors in \(\mathbb{R}^d\).
We're interested in the expected squared norm of the error vector:

\[
E\!\left[\|\tilde{v}_i - v_i\|^2\right] = E\!\left[\Bigl\|\sum_{j \neq i} \left(k_j^T k_i\right) v_j\Bigr\|^2\right].
\]

Expanding this expression:

\[
E\!\left[\|\tilde{v}_i - v_i\|^2\right] = \sum_{j \neq i} \sum_{l \neq i} E\!\left[\left(k_j^T k_i\right)\left(k_l^T k_i\right) v_j^T v_l\right].
\]

Since all vectors are chosen independently, cross-terms where \(j \neq l\) have zero expectation. We only need to consider terms where \(j = l\):

\[
E\!\left[\|\tilde{v}_i - v_i\|^2\right] = \sum_{j \neq i} E\!\left[\left(k_j^T k_i\right)^2\right],
\]

using \(\|v_j\| = 1\). For two random unit vectors in high-dimensional space, the expected squared dot product is \(E[(k_j^T k_i)^2] = 1/d\). Therefore:

\[
E\!\left[\|\tilde{v}_i - v_i\|^2\right] = \frac{N - 1}{d}.
\]

For successful reconstruction, we want the expected squared error to be small relative to the signal. Setting a tolerance \(\epsilon\):

\[
\frac{N - 1}{d} < \epsilon^2.
\]

This gives us:

\[
N - 1 < \epsilon^2 d.
\]

Therefore, the maximum number of patterns we can store is:

\[
N_{\max} = O\!\left(\epsilon^2 d\right) = O(d).
\]
Linear attention's capacity scales linearly with dimension. For example, with \(\epsilon = 0.1\) (10% relative error), we can store approximately \(N = 0.01d\) patterns.
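A small Monte Carlo experiment (a sketch under the same assumptions, with the dimension and trial count chosen arbitrarily) illustrates both the \((N-1)/d\) error law and the resulting \(\epsilon^2 d\) capacity estimate.

```python
import numpy as np

rng = np.random.default_rng(2)
d, eps, trials = 256, 0.1, 200

def random_unit_vectors(n, d):
    x = rng.standard_normal((n, d))
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def mean_sq_error(N):
    errs = []
    for _ in range(trials):
        K, V = random_unit_vectors(N, d), random_unit_vectors(N, d)
        v_tilde = (V.T @ K) @ K[0]  # linear-attention retrieval of item 0
        errs.append(np.sum((v_tilde - V[0]) ** 2))
    return np.mean(errs)

for N in (2, 4, 8, 16, 32):
    print(f"N={N:3d}: measured E[error^2] ~ {mean_sq_error(N):.4f}, theory (N-1)/d = {(N - 1) / d:.4f}")

print(f"capacity estimate at eps={eps}: about eps^2 * d = {eps**2 * d:.1f} patterns")
```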
Softmax Attention: The Exponential Advantage
Softmax attention uses a different retrieval mechanism:

\[
\tilde{v} = \sum_{j=1}^{N} \frac{\exp\!\left(k_j^T q / \tau\right)}{\sum_{l=1}^{N} \exp\!\left(k_l^T q / \tau\right)}\, v_j,
\]
where \(\tau\) is a temperature parameter. The softmax creates a "winner-take-all" effect that suppresses crosstalk through its exponential nonlinearity.
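The sketch below implements this retrieval rule directly in NumPy under the unit-vector, exact-match-query model of this post; setting \(\tau = 1/d\) anticipates the capacity-optimal temperature derived later in this section.

```python
import numpy as np

rng = np.random.default_rng(3)
d, N = 64, 10_000
tau = 1.0 / d  # anticipates the capacity-optimal temperature derived below

def random_unit_vectors(n, d):
    x = rng.standard_normal((n, d))
    return x / np.linalg.norm(x, axis=1, keepdims=True)

keys = random_unit_vectors(N, d)
values = random_unit_vectors(N, d)

# Exact-match query q = k_0; softmax over similarity scores to every stored key.
q = keys[0]
scores = keys @ q / tau
alpha = np.exp(scores - scores.max())  # subtract the max for numerical stability
alpha /= alpha.sum()

v_tilde = alpha @ values  # weighted average of all stored values
rel_error = np.linalg.norm(v_tilde - values[0])

print(f"N={N}, d={d}: weight on the correct key = {alpha[0]:.4f}, relative error = {rel_error:.2e}")
```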
Error Analysis and Capacity
When we query with \(q = k_i\), the attention weights are:

\[
\alpha_{ij} = \frac{\exp\!\left(k_i^T k_j / \tau\right)}{\sum_{l=1}^{N} \exp\!\left(k_i^T k_l / \tau\right)}.
\]

The retrieved value is \(\tilde{v}_i = \sum_{j=1}^N \alpha_{ij} v_j\). To find the error, let's carefully expand:

\[
\tilde{v}_i - v_i = \sum_{j=1}^{N} \alpha_{ij} v_j - v_i = (\alpha_{ii} - 1)\, v_i + \sum_{j \neq i} \alpha_{ij} v_j = \sum_{j \neq i} \alpha_{ij} \left(v_j - v_i\right),
\]

where the last step uses \(\sum_{j=1}^{N} \alpha_{ij} = 1\).
Let's define \(s_i = \sum_{j \neq i} \alpha_{ij} = 1 - \alpha_{ii}\) as the total weight given to incorrect keys. This is the key quantity we need to control.
To bound \(s_i\), note that since the query is \(q = k_i\) and \(k_i^T k_i = 1\), we can write:

\[
s_i = 1 - \alpha_{ii} = \frac{B}{A + B},
\]

where \(A = \exp(1/\tau)\) (the "signal" term) and \(B = \sum_{j \neq i} \exp(k_i^T k_j / \tau)\) (the "noise" term). Since \(s_i = B/(A+B) \leq B/A\), we have:

\[
s_i \le \sum_{j \neq i} \exp\!\left(\frac{k_i^T k_j - 1}{\tau}\right).
\]

This bound holds exactly, with no approximation needed. Taking expectations:

\[
E[s_i] \le (N - 1)\, e^{-1/\tau}\, E\!\left[\exp\!\left(k_i^T k_j / \tau\right)\right].
\]

Now we can bound the expected squared error. Using the triangle inequality:

\[
\|\tilde{v}_i - v_i\| \le \sum_{j \neq i} \alpha_{ij} \|v_j - v_i\| \le 2 \sum_{j \neq i} \alpha_{ij} = 2 s_i,
\]

where we used that \(\|v_j - v_i\| \leq \|v_j\| + \|v_i\| = 2\) for unit vectors. Therefore:

\[
E\!\left[\|\tilde{v}_i - v_i\|^2\right] \le 4\, E[s_i^2] \le 4\, E[s_i],
\]

where the last inequality uses \(s_i \in [0,1]\) so \(s_i^2 \leq s_i\). For relative error less than \(\epsilon\), we need \(E[\|\tilde{v}_i - v_i\|^2] < \epsilon^2\), which requires:

\[
E[s_i] < \frac{\epsilon^2}{4}.
\]

Substituting our bound on \(E[s_i]\):

\[
(N - 1)\, e^{-1/\tau}\, E\!\left[\exp\!\left(k_i^T k_j / \tau\right)\right] < \frac{\epsilon^2}{4}.
\]
Now we need to evaluate \(E[\exp(k_i^T k_j / \tau)]\). For random unit vectors, the dot product \(k_i^T k_j\) has mean 0 and variance \(1/d\). In high dimensions, by the Central Limit Theorem, this dot product is approximately Gaussian: \(k_i^T k_j \sim \mathcal{N}(0, 1/d)\).
Gaussian Approximation: While \(k_i^T k_j\) is not exactly Gaussian, this approximation becomes increasingly accurate as \(d\) grows, and gives us the correct scaling behavior. For a Gaussian random variable \(X \sim \mathcal{N}(0, \sigma^2)\), the moment generating function is \(E[e^{\lambda X}] = e^{\lambda^2\sigma^2/2}\).
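As a rough empirical check of this approximation (my own sketch), the snippet below compares a Monte Carlo estimate of \(E[\exp(k_i^T k_j/\tau)]\) with the Gaussian MGF \(\exp\!\left(1/(2 d \tau^2)\right)\) at a few moderate temperatures, where the estimate is not dominated by rare tail events.

```python
import numpy as np

rng = np.random.default_rng(4)
d, n_samples = 128, 200_000

# Dot products of independent random unit vectors.
x = rng.standard_normal((n_samples, d))
y = rng.standard_normal((n_samples, d))
x /= np.linalg.norm(x, axis=1, keepdims=True)
y /= np.linalg.norm(y, axis=1, keepdims=True)
dots = np.einsum("nd,nd->n", x, y)

# Compare the empirical E[exp(dot / tau)] with the Gaussian MGF exp(1 / (2 d tau^2)).
for tau in (1.0, 0.5, 0.2):
    empirical = np.exp(dots / tau).mean()
    gaussian_mgf = np.exp(1.0 / (2 * d * tau**2))
    print(f"tau={tau:4.1f}: Monte Carlo {empirical:.4f}  vs  Gaussian MGF {gaussian_mgf:.4f}")
```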
Using the Gaussian approximation with variance \(\sigma^2 = 1/d\):

\[
E\!\left[\exp\!\left(k_i^T k_j / \tau\right)\right] \approx \exp\!\left(\frac{1}{2 d \tau^2}\right).
\]

Therefore, our condition becomes:

\[
(N - 1)\, \exp\!\left(-\frac{1}{\tau} + \frac{1}{2 d \tau^2}\right) < \frac{\epsilon^2}{4}.
\]

Simplifying:

\[
N - 1 < \frac{\epsilon^2}{4}\, \exp\!\left(\frac{1}{\tau} - \frac{1}{2 d \tau^2}\right).
\]

Taking logarithms and rearranging:

\[
\ln(N - 1) < \ln\frac{\epsilon^2}{4} + \frac{1}{\tau} - \frac{1}{2 d \tau^2}.
\]

The right-hand side is maximized when its derivative with respect to \(\tau\) equals zero:

\[
\frac{d}{d\tau}\left(\frac{1}{\tau} - \frac{1}{2 d \tau^2}\right) = -\frac{1}{\tau^2} + \frac{1}{d \tau^3} = 0.
\]

This gives the optimal temperature \(\tau_{opt} = 1/d\). Substituting back:

\[
\frac{1}{\tau_{opt}} - \frac{1}{2 d \tau_{opt}^2} = d - \frac{d}{2} = \frac{d}{2}.
\]

Therefore:

\[
N - 1 < \frac{\epsilon^2}{4}\, e^{d/2}, \qquad \text{i.e.,} \quad N_{\max} = O\!\left(e^{d/2}\right).
\]
Softmax attention's capacity scales exponentially with dimension. The exponential nonlinearity creates a "winner-take-all" dynamic where the correct key can be distinguished from a large set of distractors while maintaining relative error below \(\epsilon\).
Note on Temperature: In standard transformer implementations, the attention scores are computed as \(q k^T / \sqrt{d_k}\), which in our notation corresponds to \(\tau = \sqrt{d}\), not \(1/\sqrt{d}\). Our capacity-optimal choice \(\tau_{opt} = 1/d\) arises from the mathematical analysis under the unit-norm, exact-match query model and should not be confused with the numerical stability scaling used in practice.
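To make the gap concrete, here is a small side-by-side simulation (a sketch under the assumptions of this post, not a benchmark of real transformer layers) that measures the retrieval error of the linear and softmax mechanisms at the same \(N\) and \(d\).

```python
import numpy as np

rng = np.random.default_rng(5)
d = 64
tau = 1.0 / d  # capacity-optimal temperature from the analysis above

def random_unit_vectors(n, d):
    x = rng.standard_normal((n, d))
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def retrieval_errors(N, trials=20):
    lin, soft = [], []
    for _ in range(trials):
        K, V = random_unit_vectors(N, d), random_unit_vectors(N, d)
        q, v_true = K[0], V[0]
        # Linear attention: retrieve from the outer-product state matrix.
        v_lin = (V.T @ K) @ q
        # Softmax attention: exponentially weighted average over all stored values.
        s = K @ q / tau
        a = np.exp(s - s.max())
        v_soft = (a / a.sum()) @ V
        lin.append(np.linalg.norm(v_lin - v_true))
        soft.append(np.linalg.norm(v_soft - v_true))
    return np.mean(lin), np.mean(soft)

for N in (16, 256, 4096):
    e_lin, e_soft = retrieval_errors(N)
    print(f"N={N:5d}: linear error ~ {e_lin:.3f}, softmax error ~ {e_soft:.2e}")
```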
Understanding the Exponential Gap
The difference in capacity, linear versus exponential, stems from different retrieval mechanisms:
Feature | Linear Attention | Softmax Attention |
---|---|---|
Mechanism | Linear superposition | Non-linear competitive weighting |
Error Source | Additive crosstalk from all other memories | Residual weights from the tail of the softmax distribution |
Capacity | \(N = O(d)\) (Linear) | \(N = O(e^{d/2})\) (Exponential) |
Computational Cost | \(O(Nd)\) | \(O(N^2d)\) |
Intuitive Analogies
To better understand these mechanisms, consider these analogies:
Linear Attention is like a blurry composite photograph. You store memories by overlaying their faint images. When you recall one, you see the correct image but with blurry traces of every other image you've stored. As you add more images (\(N\)), the entire composite gets blurrier until nothing is recognizable. The capacity is limited by the size of the canvas (\(d\)) and how much blur (\(\epsilon\)) you can tolerate.
Softmax Attention is like an indexed digital search. When you store memories, you give each one a unique, high-dimensional tag (the key). When you search with a tag, the system performs a targeted lookup. Because the tags are well-separated in high-dimensional space, the search can distinguish the correct memory while suppressing others. The exponential capacity comes from the properties of high-dimensional "tagging" space.
The Role of Nonlinearity
Linear attention averages all stored patterns weighted by their similarity to the query. This creates unavoidable interference that grows with the number of stored items.
Softmax attention's exponential nonlinearity creates a "winner-take-all" dynamic. The temperature parameter \(\tau\) controls the sharpness of this selection: lower temperatures create sharper attention patterns. Note that the standard transformer scaling \(q k^T / \sqrt{d_k}\) (that is, \(\tau = \sqrt{d}\) in our notation) is chosen to keep dot-product magnitudes well-scaled in high dimensions rather than to maximize capacity.
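A toy example with hypothetical similarity scores shows how lowering the temperature sharpens the softmax weights toward the best-matching key.

```python
import numpy as np

# Hypothetical similarity scores: one matching key followed by four distractors.
scores = np.array([1.0, 0.2, 0.1, -0.1, 0.05])

for tau in (1.0, 0.1, 0.01):
    w = np.exp(scores / tau)
    w /= w.sum()
    print(f"tau={tau:5.2f}: weight on the matching key = {w[0]:.3f}, full weights = {np.round(w, 3)}")
```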
Practical Implications
Using our derived bounds with error tolerance \(\epsilon = 0.1\) (10% relative error), we can compute the theoretical capacity for different dimensions:
Dimension \(d\) | Linear Attention Capacity | Softmax Attention Capacity | Ratio |
---|---|---|---|
64 | \(N \leq 1\) | \(N < 2 \times 10^{11}\) | \(> 10^{11}\) |
128 | \(N \leq 2\) | \(N < 2 \times 10^{25}\) | \(> 10^{25}\) |
256 | \(N \leq 3\) | \(N < 1 \times 10^{53}\) | \(> 10^{53}\) |
512 | \(N \leq 6\) | \(N < 4 \times 10^{108}\) | \(> 10^{108}\) |
1024 | \(N \leq 11\) | \(N < 6 \times 10^{219}\) | \(> 10^{219}\) |
Key observations:
- Linear attention: Capacity follows \(N \leq \lfloor \epsilon^2 d \rfloor + 1 = \lfloor 0.01d \rfloor + 1\). Even with \(d = 1024\), it can only store about 11 items reliably.
- Softmax attention: Capacity follows \(N < \frac{\epsilon^2}{4} e^{d/2} + 1 \approx 0.0025 \cdot e^{d/2} + 1\). The exponential scaling yields astronomically large theoretical capacities; both formulas are evaluated in the short sketch after this list.
- Practical context: Softmax's theoretical capacity far exceeds any conceivable sequence length, which helps explain why it can handle contexts of thousands or even millions of tokens without significant degradation, whereas linear attention struggles with even modest context lengths.
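For reference, the table above can be reproduced (up to rounding) from the two formulas with a few lines of Python, assuming \(\epsilon = 0.1\):

```python
import math

eps = 0.1  # 10% relative error tolerance

for d in (64, 128, 256, 512, 1024):
    linear_cap = math.floor(eps**2 * d) + 1           # N <= floor(eps^2 d) + 1
    softmax_cap = (eps**2 / 4) * math.exp(d / 2) + 1  # N < (eps^2 / 4) e^{d/2} + 1
    print(f"d={d:5d}: linear N <= {linear_cap:3d}, softmax N < {softmax_cap:.1e}")
```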
The exponential capacity advantage means that despite the \(O(N^2 d)\) computational cost, softmax attention remains the preferred choice for applications requiring precise retrieval from large contexts.
Conclusion
The mathematical analysis reveals a trade-off in attention design:
- Linear attention offers computational efficiency but limited capacity (\(O(d)\))
- Softmax attention provides exponential capacity (\(O(e^{d/2})\)) at higher computational cost
This exponential gap helps explain why softmax attention has been successful in practice. For tasks requiring precise retrieval from large contexts, the exponential capacity provides a significant advantage.
References
- Understanding Transformer from the Perspective of Associative Memory. Shu Zhong, Mingyu Xu, Tenglong Ao, Guang Shi. arXiv preprint, 2025. arXiv:2505.19488