Lucas Sun

GammaNet - Stable Feature-Space Decay in Linear RNNs

The Problem with Coordinates

The associative memory framework established earlier shows that almost every component of a modern deep learning system performs the same fundamental operation: it looks up values by comparing an input against stored keys. What that framework does not fully address is a subtler design constraint that becomes critical once the memory is recurrent — once it accumulates and forgets associations over time.

The constraint is this. A memory state decays. In the simplest designs, it decays channel-wise: each coordinate of the state is multiplied by its own forgetting rate $\gamma_i \in [0,1]$. This is a coordinate operation — it acts on each basis direction of the state space independently, treating the standard basis vectors as the natural "features" to forget.

But there is no reason the standard basis should be the natural feature space. The model's learned representations are linear combinations of many coordinates at once. A single "entity identity" feature might be spread across dozens of dimensions; a single "syntactic role" feature might point diagonally through the embedding space. When the memory decays channel-wise, it does not forget entity identity a little — it shreds the feature into pieces and forgets each piece at an independent rate, leaving a garbled remainder that the model must then learn to reconstruct before it can do anything useful.

The right principle is: a model should be allowed to operate in directions, not coordinates. For any coordinate-sensitive operation — a channel-wise decay, an elementwise activation, a per-dimension gate — there must be enough linear mixing before and after that operation to ensure it acts on the model's actual learned features, not on whatever the coordinate axes happen to be. In a standard MLP, the weight matrix $W^{\text{in}}$ provides this mixing before the ReLU and $W^{\text{out}}$ provides it after. For a recurrent memory, the same logic demands that the decay act on learned directions in the state space, not on raw coordinates.
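The shredding effect is easy to see numerically. A minimal sketch (toy values chosen for illustration, not from the post): a unit feature spread evenly across four channels keeps its direction under a uniform decay along that feature, but channel-wise decay with unequal rates bends it away from itself.

```python
import numpy as np

feature = np.ones(4) / 2.0                  # unit-norm feature pointing diagonally
gamma = np.array([0.1, 0.3, 0.6, 0.9])      # independent per-channel rates

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

shredded = gamma * feature                  # channel-wise decay
scaled = 0.5 * feature                      # decay along the feature direction

cos_shredded = cosine(shredded, feature)    # ≈ 0.84: the direction is bent
cos_scaled = cosine(scaled, feature)        # = 1.0: only the magnitude shrinks
```

The channel-wise version no longer points along the original feature: downstream layers would have to undo that distortion before reusing the feature.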

GammaNet is what emerges when you apply this demand precisely and ask what structures are left standing.


The Gated Recurrence

Recall from the associative memory framework that linear attention collapses its key-value associations into a running matrix state $S_t$ with updates $S_t = S_{t-1} + v_t k_t^T$. Without any forgetting mechanism, this state accumulates all past associations with equal weight, regardless of how long ago they were written. [1, 2] The obvious remedy is a per-channel forgetting rate, proposed in Gated Linear Attention: [5, 6, 7]

$$S_t = S_{t-1} \operatorname{diag}(\gamma_t) + v_t k_t^T, \qquad o_t = S_t q_t \tag{GLA}$$

where $\gamma_t \in [0,1]^{d_k}$ assigns a separate decay rate to each channel. Different channels can forget at different speeds — useful if, say, short-range syntactic features should be forgotten faster than long-range entity associations. The readout $o_t = S_t q_t$ retrieves whatever the current state has stored in the direction of $q_t$.
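As a concreteness check, the (GLA) update and readout can be sketched in a few lines of numpy (shapes and helper names are illustrative, not an official implementation):

```python
import numpy as np

def gla_step(S, gamma, k, v):
    """One (GLA) update of the matrix state S (d_v x d_k):
    right-multiplication by diag(gamma) scales each key channel."""
    return S * gamma[None, :] + np.outer(v, k)

def gla_read(S, q):
    """Readout o_t = S_t q_t."""
    return S @ q

d_k, d_v = 4, 3
S = np.zeros((d_v, d_k))
k = np.array([1.0, 0.0, 0.0, 0.0])          # unit key
v = np.array([1.0, 2.0, 3.0])               # stored value
S = gla_step(S, np.ones(d_k), k, v)         # write one association
o = gla_read(S, k)                          # querying with the same key recovers v
```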

This is a reasonable first design. But the channel-wise decay has exactly the problem described above: it operates in coordinates, not directions.


Decaying in Directions: The Feature Map

The fix is to replace the coordinate-wise decay with a decay that acts along learned directions. Introduce a fixed invertible matrix $F \in \mathbb{R}^{d_k \times d_k}$ whose columns define the preferred decay directions:

$$S_t = S_{t-1} \cdot F \operatorname{diag}(\gamma_t) F^{-1} + v_t k_t^T \tag{GLA-F}$$

The operator $F\operatorname{diag}(\gamma_t)F^{-1}$ applies a change of basis into the $F$-feature space, performs the coordinate-wise decay there, and then changes back. Its effect is to decay the state along the columns of $F$ at rates $\gamma_{t,1},\ldots,\gamma_{t,d}$, rather than along the standard basis vectors. This is precisely the "linear mixing around the coordinate operation" that the introduction called for: $F^{-1}$ mixes before the decay, $\operatorname{diag}(\gamma_t)$ acts coordinate-wise in feature space, and $F$ mixes back.

The model can now learn which directions in the state space correspond to features that should be forgotten quickly and which should persist — rather than being forced to align its internal representations with the standard basis or waste capacity on the coordinate transformation.
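Concretely, $F\operatorname{diag}(\gamma)F^{-1}$ has the columns of $F$ as eigenvectors with eigenvalues $\gamma_i$, so each chosen direction decays at its own rate. A small numerical check with arbitrary values:

```python
import numpy as np

F = np.array([[1.0, 1.0],
              [0.0, 1.0]])                  # columns f1, f2: the decay directions
gamma = np.array([0.2, 0.9])                # per-direction decay rates
M = F @ np.diag(gamma) @ np.linalg.inv(F)   # change basis, decay, change back

f1, f2 = F[:, 0], F[:, 1]                   # M scales f1 by 0.2 and f2 by 0.9
```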


Folding F into the Weights

Having motivated the feature map, a natural question is whether adding $F$ to (GLA) actually gives the model new expressive power. The answer, for this basic recurrence, is no.

Folding refers to the observation that a fixed linear map sandwiched between two learnable matrices can always be absorbed into those matrices without changing the model's function class. Adding a fixed rotation before the first layer of an MLP, for example, is equivalent to simply learning a rotated first-layer weight matrix — the function class is identical.

The same applies here. Define the change of representation $\tilde{S}_t = S_t F$ and the modified projections

$$\tilde{k}_t = F^T k_t, \qquad \tilde{q}_t = F^{-1} q_t$$

Then:

$$\tilde{S}_t = \left(S_{t-1} F\operatorname{diag}(\gamma_t)F^{-1} + v_t k_t^T\right) F = \tilde{S}_{t-1}\operatorname{diag}(\gamma_t) + v_t \tilde{k}_t^T$$

and $o_t = S_t q_t = (S_t F)(F^{-1}q_t) = \tilde{S}_t \tilde{q}_t$. Model (GLA-F) is exactly equivalent to the standard (GLA), with modified projection matrices $\tilde{W}_k = F^T W_k$ and $\tilde{W}_q = F^{-1} W_q$. Since these are still arbitrary learnable matrices, $F$ vanishes into the weights and adds nothing.

This means that for the gated recurrence alone, the feature-decay motivation — while conceptually correct — is already satisfied for free. Any feature basis the model wants to operate in can be implicitly learned through the key and query projections, without ever appearing explicitly in the architecture.
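The folding argument can also be verified numerically: running (GLA-F) directly and running plain (GLA) with the folded projections produce identical outputs at every step. A sketch with random data (all values illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 4
F = rng.standard_normal((d, d))             # generic fixed invertible feature map
Finv = np.linalg.inv(F)

S = np.zeros((d, d))                        # (GLA-F) state
S_tilde = np.zeros((d, d))                  # folded (GLA) state, S_tilde = S F
for _ in range(5):
    k, q, v = rng.standard_normal((3, d))
    gamma = rng.uniform(0.0, 1.0, d)
    # (GLA-F): S <- S · F diag(gamma) F^{-1} + v k^T
    S = S @ F @ np.diag(gamma) @ Finv + np.outer(v, k)
    # (GLA) with folded projections k~ = F^T k, q~ = F^{-1} q
    S_tilde = S_tilde * gamma[None, :] + np.outer(v, F.T @ k)
    o = S @ q
    o_tilde = S_tilde @ (Finv @ q)          # identical readout at every step
```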


The Missing Piece: Surgical Replacement

The gated recurrence has a structural limitation beyond forgetting: it can only add new key-value associations on top of existing ones. When the model needs to update the value stored at a key direction it already knows about — to revise a belief, correct an entity attribute, or track a changing state — it cannot do so cleanly. The old association persists, corrupted by the new write.

The right operation is to first erase the old value before writing the new one. Suppose the old state $S_{t-1}$ has an association stored in some direction $\kappa_t$: reading it out gives $S_{t-1}\kappa_t$. To remove exactly this association while preserving everything orthogonal to $\kappa_t$, subtract the rank-one outer product $S_{t-1}\kappa_t\kappa_t^T$:

$$S_{t-1} - S_{t-1}\kappa_t\kappa_t^T = S_{t-1}(I - \kappa_t\kappa_t^T)$$

This annihilates the $\kappa_t$ component of the state and leaves all orthogonal content intact. Allowing a partial erase controlled by $\beta_t \in [0,1]$ and then writing the new value gives the delta rule: [2, 3, 4]

$$S_t = S_{t-1}(I - \beta_t \kappa_t\kappa_t^T) + \beta_t v_t k_t^T \tag{DeltaNet}$$

where $\kappa_t = k_t / \|k_t\|$ is the normalized key used as the erase direction.
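A minimal numerical illustration of the erase step (toy values): the projector removes exactly the association stored along $\kappa$ and leaves an orthogonal association untouched.

```python
import numpy as np

d = 4
kappa = np.array([1.0, 0.0, 0.0, 0.0])          # unit erase direction
other = np.array([0.0, 1.0, 0.0, 0.0])          # an orthogonal key direction
v1 = np.array([1.0, 0.0, 0.0, 0.0])             # value stored at kappa
v2 = np.array([0.0, 2.0, 0.0, 0.0])             # value stored at other

S = np.outer(v1, kappa) + np.outer(v2, other)   # two stored associations
S_erased = S @ (np.eye(d) - np.outer(kappa, kappa))  # full erase (beta = 1)

read_kappa = S_erased @ kappa                   # zero vector: association removed
read_other = S_erased @ other                   # v2: orthogonal content intact
```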

Combining temporal forgetting with surgical replacement gives the full recurrence, known as Kimi Delta Attention, that will serve as our starting point for GammaNet: [6, 7]

$$S_t = S_{t-1} \operatorname{diag}(\gamma_t)(I - \beta_t \kappa_t\kappa_t^T) + \beta_t v_t k_t^T \tag{KDA}$$

The first term decays old associations and erases the specific one about to be overwritten; the second term writes the new one.


Folding Fails for the Delta Rule

Now apply the same feature-map upgrade: replace the channel-wise decay with $F\operatorname{diag}(\gamma_t)F^{-1}$:

$$S_t = S_{t-1} \cdot F\operatorname{diag}(\gamma_t)F^{-1} \cdot (I - \beta_t\kappa_t\kappa_t^T) + \beta_t v_t k_t^T \tag{KDA-F}$$

Attempt the same folding. Define $\tilde{S}_t = S_t F$ and $\tilde{k}_t = F^T k_t$. Then:

$$\tilde{S}_t = \tilde{S}_{t-1}\operatorname{diag}(\gamma_t) \cdot \underbrace{F^{-1}(I - \beta_t\kappa_t\kappa_t^T)F}_{\text{this does not simplify}} + \beta_t v_t \tilde{k}_t^T$$

Expanding the middle factor:

$$F^{-1}(I - \beta_t\kappa_t\kappa_t^T)F = I - \beta_t(F^{-1}\kappa_t)(\kappa_t^T F)$$

This is a rank-one subtraction with different left and right factors: $F^{-1}\kappa_t$ on the left and $(F^T\kappa_t)^T$ on the right. For this to be a symmetric projector $I - \beta_t\tilde{\kappa}_t\tilde{\kappa}_t^T$ — the only form the standard model can produce — we would need $F^{-1}\kappa_t \propto F^T\kappa_t$, which requires $F^{-1} \propto F^T$, i.e., $F$ is orthogonal (up to scaling). For any non-orthogonal $F$, the erase term is a biorthogonal rank-one operator that no choice of key projection matrix can reproduce from a symmetric projector.

Why does folding break here when it worked before? In the gated recurrence, the key $k_t$ appeared only once — in the write term $v_t k_t^T$ — so the coordinate change $\tilde{k}_t = F^T k_t$ absorbed $F$ cleanly. In the KDA recurrence, the key plays two roles: the erase direction $\kappa_t = k_t/\|k_t\|$ and the write address $k_t$. Changing coordinates transforms both simultaneously, but the erase term conjugates $F$ around the projector (left-multiplying by $F^{-1}$ and right-multiplying by $F$) while the write term absorbs $F$ only on the right — structurally different transformations that leave a residual dependence on $F$ impossible to hide in projection weights. [4, 6, 7]

For the KDA recurrence, $F$ is not redundant. It genuinely changes what the model can compute, and architectural choices about $F$ matter.
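The asymmetry is easy to exhibit numerically: for a shear $F$, the conjugated erase term fails to be symmetric, whereas any $I - \beta\tilde{\kappa}\tilde{\kappa}^T$ is symmetric by construction. A small check (values arbitrary):

```python
import numpy as np

F = np.array([[1.0, 2.0],
              [0.0, 1.0]])                       # non-orthogonal (a shear)
Finv = np.linalg.inv(F)
beta = 0.8
kappa = np.array([-1.0, 1.0]) / np.sqrt(2)       # unit erase direction

E = np.eye(2) - beta * np.outer(kappa, kappa)    # symmetric partial-erase operator
conjugated = Finv @ E @ F                        # what folding would have to match

asym = np.linalg.norm(conjugated - conjugated.T) # > 0: not symmetric, so no
                                                 # symmetric projector equals it
```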


An Even More Immediate Problem: Instability

Before asking what $F$ can express, there is a more immediate concern. For general non-orthogonal $F$, the KDA recurrence can be unstable: the state grows exponentially even with no writes.

Consider the homogeneous part of (KDA-F) with $v_t = 0$:

$$S_t = S_{t-1} \cdot F\operatorname{diag}(\gamma_t)F^{-1}(I - \beta_t\kappa_t\kappa_t^T)$$

For orthogonal $F$: $\|F\operatorname{diag}(\gamma)F^T\|_2 = \|\operatorname{diag}(\gamma)\|_2 \leq 1$ (orthogonal maps preserve singular values), and $\|I - \beta\kappa\kappa^T\|_2 = 1$ for $\beta \in [0,1]$. Every step is non-expansive. For standard (KDA) with $F = I$, this gives unconditional stability. [6, 7]

For non-orthogonal $F$, this fails. Take:

$$F = \begin{bmatrix}1&2\\0&1\end{bmatrix}, \quad \operatorname{diag}(\gamma) = \begin{bmatrix}0.2&0\\0&0.8\end{bmatrix}, \quad \beta = 0.8, \quad \kappa = \tfrac{1}{\sqrt{2}}\begin{bmatrix}-1\\1\end{bmatrix}$$

Computing $A = F\operatorname{diag}(\gamma)F^{-1}(I - \beta\kappa\kappa^T)$ directly yields a matrix with spectral radius $\approx 1.05 > 1$. Repeated application makes the state grow exponentially.
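This example can be checked directly in a few lines of numpy:

```python
import numpy as np

F = np.array([[1.0, 2.0],
              [0.0, 1.0]])
D = np.diag([0.2, 0.8])
beta = 0.8
kappa = np.array([-1.0, 1.0]) / np.sqrt(2)

# Homogeneous transition: decay in the F basis, then the partial erase.
A = F @ D @ np.linalg.inv(F) @ (np.eye(2) - beta * np.outer(kappa, kappa))
rho = np.max(np.abs(np.linalg.eigvals(A)))   # spectral radius ≈ 1.0495 > 1
```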

The root cause is a metric mismatch: $F\operatorname{diag}(\gamma)F^{-1}$ is contractive only in the $F$-induced norm $\|F^{-1}(\cdot)\|_2$, while the projector $I - \beta\kappa\kappa^T$ is non-expansive in the Euclidean norm. For non-orthogonal $F$ these norms are incompatible — the projector can amplify directions that the decay was supposed to contract.

This settles the question of whether non-orthogonal $F$ is merely a reparameterization of (KDA). Standard (KDA) is unconditionally stable; (KDA-F) with non-orthogonal $F$ is not. A stable model and an unstable model cannot be reparameterizations of each other. The feature basis $F$ is a genuine architectural choice with real consequences.


Deriving the Stable Feature Bases

We want to characterize all fixed invertible $F$ for which (KDA-F) is non-expansive for every admissible $\gamma_t \in [0,1]^d$, $\beta_t \in [0,1]$, and unit-norm $\kappa_t$.

Since $\|I - \beta\kappa\kappa^T\|_2 = 1$ always, the condition reduces to:

$$\|F\operatorname{diag}(\gamma)F^{-1}\|_2 \leq 1 \quad \text{for every diagonal } \operatorname{diag}(\gamma) \text{ with entries in } [0,1]$$

Write $F$ in column form with columns $f_i$ and $F^{-1}$ in row form with rows $g_i^T$. Setting $\operatorname{diag}(\gamma) = E_i$ (the $i$-th coordinate projector) gives:

$$\|F E_i F^{-1}\|_2 = \|f_i g_i^T\|_2 = \|f_i\|_2 \, \|g_i\|_2$$

Since $g_i^T f_i = 1$, Cauchy-Schwarz forces $\|f_i\|_2 \, \|g_i\|_2 \geq 1$. The stability requirement demands this equals exactly 1, which by the Cauchy-Schwarz equality condition requires $g_i \propto f_i$. Combined with $g_i^T f_j = 0$ for $j \neq i$ (rows of $F^{-1}$ are dual to columns of $F$), this forces $f_i^T f_j = 0$ for all $i \neq j$: the columns of $F$ must be mutually orthogonal. This is the condition $F^T F = \operatorname{diag}$, which characterizes exactly:

$$F = Q\Gamma$$

for orthogonal $Q$ and positive diagonal $\Gamma$. This class is both necessary and sufficient for unconditional stability.
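The binding case of the argument, $\operatorname{diag}(\gamma) = E_i$, gives a quick numerical check of the characterization (random test matrices, for illustration only):

```python
import numpy as np

rng = np.random.default_rng(2)
d = 4
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))   # orthogonal factor
Gamma = np.diag(rng.uniform(0.5, 2.0, d))          # positive diagonal
F_good = Q @ Gamma                                 # orthogonal columns
F_bad = rng.standard_normal((d, d))                # generic: columns not orthogonal

def worst_coordinate_norm(F):
    # diag(gamma) = E_i gives ||F E_i F^{-1}||_2 = ||f_i|| * ||g_i||,
    # the binding case of the stability condition.
    Finv = np.linalg.inv(F)
    return max(np.linalg.norm(F[:, i]) * np.linalg.norm(Finv[i, :])
               for i in range(F.shape[0]))

good = worst_coordinate_norm(F_good)   # = 1 up to roundoff: non-expansive
bad = worst_coordinate_norm(F_bad)     # > 1: some gamma makes the step expansive
```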


The Γ Parameterization

With $F = Q\Gamma$, the decay operator is $Q\Gamma\operatorname{diag}(\gamma_t)\Gamma^{-1}Q^T = Q\operatorname{diag}(\gamma_t)Q^T$ (since diagonal matrices commute). The erase projector becomes:

$$F^{-1}\kappa_t\kappa_t^T F = \Gamma^{-1}Q^T\kappa_t\kappa_t^T Q\Gamma = \Gamma^{-1}a_t a_t^T\Gamma$$

where $a_t = Q^T\kappa_t$. Since $\kappa_t = \operatorname{normalize}(k_t) = \operatorname{normalize}(W_k x_t)$, and $Q$ is orthogonal (so it preserves norms):

$$a_t = Q^T\operatorname{normalize}(W_k x_t) = \operatorname{normalize}(Q^T W_k x_t) = \operatorname{normalize}(W_a x_t)$$

where $W_a := Q^T W_k$ is simply a different learned matrix. The orthogonal factor $Q$ folds into the erase projection — it is a gauge choice exactly as in the gated recurrence, and for the same reason: $Q$ acts only as a fixed linear map adjacent to a learnable weight matrix. What cannot fold is $\Gamma$, which appears asymmetrically (as $\Gamma^{-1}$ on the left and $\Gamma$ on the right of the erase projector) and therefore cannot be absorbed by a single weight matrix.

Working in the $Q$-rotated basis (absorbed into all projection matrices), the GammaNet recurrence is:

$$\boxed{S_t = S_{t-1} \cdot \operatorname{diag}(\gamma_t) \cdot \left(I - \beta_t \,\Gamma^{-1} a_t a_t^T \Gamma\right) + \beta_t\, v_t k_t^T} \tag{GammaNet}$$

with $a_t = \operatorname{normalize}(W_a x_t)$, $k_t = W_k x_t$, and $\Gamma$ a learned positive diagonal matrix that is fixed across time steps. [6, 7]

The separately learned $W_a \neq W_k$ decouples where memory is addressed (via $k_t$) from which feature direction is erased (via $a_t$). Setting $W_a = W_k$ forces the model to use the same linear map for addressing and erasing — a meaningful structural constraint. Allowing them to differ lets the model address by entity identity and erase by attribute type.
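Putting the pieces together, one GammaNet step can be sketched as follows (the toy shapes and random projections are illustrative; the post specifies the recurrence, not an implementation):

```python
import numpy as np

rng = np.random.default_rng(3)
d = 4
Gamma = np.diag(rng.uniform(0.5, 2.0, d))       # fixed positive diagonal
Gamma_inv = np.linalg.inv(Gamma)
W_a = rng.standard_normal((d, d))               # erase-direction projection
W_k = rng.standard_normal((d, d))               # addressing projection

def gammanet_step(S, x, gamma_t, beta_t, v_t):
    a = W_a @ x
    a = a / np.linalg.norm(a)                   # a_t = normalize(W_a x_t)
    k = W_k @ x                                 # k_t = W_k x_t
    erase = np.eye(d) - beta_t * (Gamma_inv @ np.outer(a, a) @ Gamma)
    return S @ np.diag(gamma_t) @ erase + beta_t * np.outer(v_t, k)

S = rng.standard_normal((d, d))                 # some existing memory state
x = rng.standard_normal(d)

# A full erase (beta = 1) with no decay (gamma = 1) and no write (v = 0)
# zeroes the state's content along the oblique erase direction Gamma^{-1} a_t:
S_next = gammanet_step(S, x, gamma_t=np.ones(d), beta_t=1.0, v_t=np.zeros(d))
a = W_a @ x
a = a / np.linalg.norm(a)
erased_read = S_next @ (Gamma_inv @ a)          # numerically zero
```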


Summary

The path from the gated recurrence to GammaNet in three steps:

  1. The gated recurrence decays in coordinates, not directions. Replacing channel-wise decay with a feature-basis decay $F\operatorname{diag}(\gamma)F^{-1}$ is the conceptually correct fix — but for the gated recurrence alone, $F$ is redundant: it folds into the key and query projections with no change in expressive power. [1, 2, 5]

  2. The delta rule breaks folding. Adding surgical replacement makes the key play two roles simultaneously. The coordinate change that absorbed $F$ in the gated case now transforms the erase and write terms differently, leaving a residual dependence on $F$ that no projection matrix can reproduce. More immediately, non-orthogonal $F$ makes the recurrence unstable — ruling out any claim that it is a reparameterization of the unconditionally stable baseline. [4, 6, 7]

  3. Stability forces $F = Q\Gamma$, and $Q$ folds. The only feature bases guaranteeing unconditional stability are those with orthogonal columns — $F = Q\Gamma$ for orthogonal $Q$ and positive diagonal $\Gamma$. The orthogonal factor $Q$ folds into the erase projection $W_a$, leaving $\Gamma$ as the irreducible fixed feature geometry. $\Gamma$ cannot fold because it appears asymmetrically in the erase projector, and it is precisely this asymmetry that lets the model erase along learned feature directions rather than raw coordinates.

References

[1] Katharopoulos, A., Vyas, A., Pappas, N., & Fleuret, F. (2020). Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention.

[2] Schlag, I., Irie, K., & Schmidhuber, J. (2021). Linear Transformers Are Secretly Fast Weight Programmers.

[3] Widrow, B., & Hoff, M. E. (1960). Adaptive Switching Circuits.

[4] Yang, S., Wang, B., Zhang, Y., Shen, Y., & Kim, Y. (2024). Parallelizing Linear Transformers with the Delta Rule over Sequence Length.

[5] Yang, S., Wang, B., Shen, Y., Panda, R., & Kim, Y. (2024). Gated Linear Attention Transformers with Hardware-Efficient Training.

[6] Yang, S., Kautz, J., & Hatamizadeh, A. (2025). Gated Delta Networks: Improving Mamba2 with Delta Rule.

[7] Kimi Team, Zhang, Y., Lin, Z., et al. (2025). Kimi Linear: An Expressive, Efficient Attention Architecture.

Cite this post
@online{gamma-net,
  author    = {Lucas Sun},
  title     = {GammaNet - Stable Feature-Space Decay in Linear RNNs},
  year      = {2026},
  month     = {05},
  day       = {02},
  url       = {https://xtimecrystal.com/posts/260502-gamma-net/},
}