Lucas Sun

GammaNet - Stable Feature-Space Decay in Linear RNNs

The Problem with Coordinates

The associative memory framework established earlier shows that almost every component of a modern deep learning system performs the same fundamental operation: it looks up values by comparing an input against stored keys. What that framework does not fully address is a subtler design constraint that becomes critical once the memory is recurrent — once it accumulates and forgets associations over time.

The constraint is this. A memory state decays. In the simplest designs, it decays channel-wise: each coordinate of the state is multiplied by its own forgetting rate $\gamma_i \in [0,1]$. This is a coordinate operation — it acts on each basis direction of the state space independently, treating the standard basis vectors as the natural "features" to forget.

But there is no reason the standard basis should be the natural feature space. The model's learned representations are linear combinations of many coordinates at once. A single "entity identity" feature might be spread across dozens of dimensions; a single "syntactic role" feature might point diagonally through the embedding space. When the memory decays channel-wise, it does not forget entity identity a little — it shreds the feature into pieces and forgets each piece at an independent rate, leaving a garbled remainder that the model must then learn to reconstruct before it can do anything useful.

The right principle is: a model should be allowed to operate in directions, not coordinates. For any coordinate-sensitive operation — a channel-wise decay, an elementwise activation, a per-dimension gate — there must be enough linear mixing before and after that operation to ensure it acts on the model's actual learned features, not on whatever the coordinate axes happen to be. In a standard MLP, the weight matrix $W^{\text{in}}$ provides this mixing before the ReLU and $W^{\text{out}}$ provides it after. For a recurrent memory, the same logic demands that the decay act on learned directions in the state space, not on raw coordinates.
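The shredding effect is easy to see numerically. A minimal sketch (toy values chosen for illustration, not from the post): a unit feature spread evenly across four channels keeps its direction under a uniform decay along that feature, but channel-wise decay with unequal rates bends it away from itself.

```python
import numpy as np

feature = np.ones(4) / 2.0                  # unit-norm feature pointing diagonally
gamma = np.array([0.1, 0.3, 0.6, 0.9])      # independent per-channel rates

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

shredded = gamma * feature                  # channel-wise decay
scaled = 0.5 * feature                      # decay along the feature direction

cos_shredded = cosine(shredded, feature)    # ≈ 0.84: the direction is bent
cos_scaled = cosine(scaled, feature)        # = 1.0: only the magnitude shrinks
```

The channel-wise version no longer points along the original feature: downstream layers would have to undo that distortion before reusing the feature.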

GammaNet is what emerges when you apply this demand precisely and ask what structures are left standing.


The Gated Recurrence

Recall from the associative memory framework that linear attention collapses its key-value associations into a running matrix state $S_t$ with updates $S_t = S_{t-1} + v_t k_t^T$. Without any forgetting mechanism, this state accumulates all past associations with equal weight, regardless of how long ago they were written. [1, 2] The obvious remedy is a per-channel forgetting rate, proposed in Gated Linear Attention: [5, 6, 7]

$$S_t = S_{t-1} \operatorname{diag}(\gamma_t) + v_t k_t^T, \qquad o_t = S_t q_t \tag{GLA}$$

where $\gamma_t \in [0,1]^{d_k}$ assigns a separate decay rate to each channel. Different channels can forget at different speeds — useful if, say, short-range syntactic features should be forgotten faster than long-range entity associations. The readout $o_t = S_t q_t$ retrieves whatever the current state has stored in the direction of $q_t$.
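As a concreteness check, the (GLA) update and readout can be sketched in a few lines of numpy (shapes and helper names are illustrative, not an official implementation):

```python
import numpy as np

def gla_step(S, gamma, k, v):
    """One (GLA) update of the matrix state S (d_v x d_k):
    right-multiplication by diag(gamma) scales each key channel."""
    return S * gamma[None, :] + np.outer(v, k)

def gla_read(S, q):
    """Readout o_t = S_t q_t."""
    return S @ q

d_k, d_v = 4, 3
S = np.zeros((d_v, d_k))
k = np.array([1.0, 0.0, 0.0, 0.0])          # unit key
v = np.array([1.0, 2.0, 3.0])               # stored value
S = gla_step(S, np.ones(d_k), k, v)         # write one association
o = gla_read(S, k)                          # querying with the same key recovers v
```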

This is a reasonable first design. But the channel-wise decay has exactly the problem described above: it operates in coordinates, not directions.


Decaying in Directions: The Feature Map

The fix is to replace the coordinate-wise decay with a decay that acts along learned directions. Introduce a fixed invertible matrix $F \in \mathbb{R}^{d_k \times d_k}$ whose columns define the preferred decay directions:

$$S_t = S_{t-1} \cdot F \operatorname{diag}(\gamma_t) F^{-1} + v_t k_t^T \tag{GLA-F}$$

The operator $F\operatorname{diag}(\gamma_t)F^{-1}$ applies a change of basis into the $F$-feature space, performs the coordinate-wise decay there, and then changes back. Its effect is to decay the state along the columns of $F$ at rates $\gamma_{t,1},\ldots,\gamma_{t,d}$, rather than along the standard basis vectors. This is precisely the "linear mixing around the coordinate operation" that the introduction called for: $F^{-1}$ mixes before the decay, $\operatorname{diag}(\gamma_t)$ acts coordinate-wise in feature space, and $F$ mixes back.

The model can now learn which directions in the state space correspond to features that should be forgotten quickly and which should persist — rather than being forced to align its internal representations with the standard basis or waste capacity on the coordinate transformation.
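Concretely, $F\operatorname{diag}(\gamma)F^{-1}$ has the columns of $F$ as eigenvectors with eigenvalues $\gamma_i$, so each chosen direction decays at its own rate. A small numerical check with arbitrary values:

```python
import numpy as np

F = np.array([[1.0, 1.0],
              [0.0, 1.0]])                  # columns f1, f2: the decay directions
gamma = np.array([0.2, 0.9])                # per-direction decay rates
M = F @ np.diag(gamma) @ np.linalg.inv(F)   # change basis, decay, change back

f1, f2 = F[:, 0], F[:, 1]                   # M scales f1 by 0.2 and f2 by 0.9
```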


Folding F into the Weights

Having motivated the feature map, a natural question is whether adding $F$ to (GLA) actually gives the model new expressive power. The answer, for this basic recurrence, is no.

Folding refers to the observation that a fixed linear map sandwiched between two learnable matrices can always be absorbed into those matrices without changing the model's function class. Adding a fixed rotation before the first layer of an MLP, for example, is equivalent to simply learning a rotated first-layer weight matrix — the function class is identical.

The same applies here. Define the change of representation $\tilde{S}_t = S_t F$ and the modified projections

$$\tilde{k}_t = F^T k_t, \qquad \tilde{q}_t = F^{-1} q_t$$

Then:

$$\tilde{S}_t = \left(S_{t-1} F\operatorname{diag}(\gamma_t)F^{-1} + v_t k_t^T\right) F = \tilde{S}_{t-1}\operatorname{diag}(\gamma_t) + v_t \tilde{k}_t^T$$

and $o_t = S_t q_t = (S_t F)(F^{-1}q_t) = \tilde{S}_t \tilde{q}_t$. Model (GLA-F) is exactly equivalent to the standard (GLA), with modified projection matrices $\tilde{W}_k = F^T W_k$ and $\tilde{W}_q = F^{-1} W_q$. Since these are still arbitrary learnable matrices, $F$ vanishes into the weights and adds nothing.

This means that for the gated recurrence alone, the feature-decay motivation — while conceptually correct — is already satisfied for free. Any feature basis the model wants to operate in can be implicitly learned through the key and query projections, without ever appearing explicitly in the architecture.
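The folding argument can also be verified numerically: running (GLA-F) directly and running plain (GLA) with the folded projections produce identical outputs at every step. A sketch with random data (all values illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 4
F = rng.standard_normal((d, d))             # generic fixed invertible feature map
Finv = np.linalg.inv(F)

S = np.zeros((d, d))                        # (GLA-F) state
S_tilde = np.zeros((d, d))                  # folded (GLA) state, S_tilde = S F
for _ in range(5):
    k, q, v = rng.standard_normal((3, d))
    gamma = rng.uniform(0.0, 1.0, d)
    # (GLA-F): S <- S · F diag(gamma) F^{-1} + v k^T
    S = S @ F @ np.diag(gamma) @ Finv + np.outer(v, k)
    # (GLA) with folded projections k~ = F^T k, q~ = F^{-1} q
    S_tilde = S_tilde * gamma[None, :] + np.outer(v, F.T @ k)
    o = S @ q
    o_tilde = S_tilde @ (Finv @ q)          # identical readout at every step
```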


The Missing Piece: Surgical Replacement

The gated recurrence has a structural limitation beyond forgetting: it can only add new key-value associations on top of existing ones. When the model needs to update the value stored at a key direction it already knows about — to revise a belief, correct an entity attribute, or track a changing state — it cannot do so cleanly. The old association persists, corrupted by the new write.

The right operation is to first erase the old value before writing the new one. Suppose the old state $S_{t-1}$ has an association stored in some direction $\kappa_t$: reading it out gives $S_{t-1}\kappa_t$. To remove exactly this association while preserving everything orthogonal to $\kappa_t$, subtract the rank-one outer product $S_{t-1}\kappa_t\kappa_t^T$:

$$S_{t-1} - S_{t-1}\kappa_t\kappa_t^T = S_{t-1}(I - \kappa_t\kappa_t^T)$$

This annihilates the $\kappa_t$ component of the state and leaves all orthogonal content intact. Allowing a partial erase controlled by $\beta_t \in [0,1]$ and then writing the new value gives the delta rule: [2, 3, 4]

$$S_t = S_{t-1}(I - \beta_t \kappa_t\kappa_t^T) + \beta_t v_t k_t^T \tag{DeltaNet}$$

where $\kappa_t = k_t / \|k_t\|$ is the normalized key used as the erase direction.
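A minimal numerical illustration of the erase step (toy values): the projector removes exactly the association stored along $\kappa$ and leaves an orthogonal association untouched.

```python
import numpy as np

d = 4
kappa = np.array([1.0, 0.0, 0.0, 0.0])          # unit erase direction
other = np.array([0.0, 1.0, 0.0, 0.0])          # an orthogonal key direction
v1 = np.array([1.0, 0.0, 0.0, 0.0])             # value stored at kappa
v2 = np.array([0.0, 2.0, 0.0, 0.0])             # value stored at other

S = np.outer(v1, kappa) + np.outer(v2, other)   # two stored associations
S_erased = S @ (np.eye(d) - np.outer(kappa, kappa))  # full erase (beta = 1)

read_kappa = S_erased @ kappa                   # zero vector: association removed
read_other = S_erased @ other                   # v2: orthogonal content intact
```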

Combining temporal forgetting with surgical replacement gives the full recurrence, known as Kimi Delta Attention, that will serve as our starting point for GammaNet: [6, 7]

$$S_t = S_{t-1} \operatorname{diag}(\gamma_t)(I - \beta_t \kappa_t\kappa_t^T) + \beta_t v_t k_t^T \tag{KDA}$$

The first term decays old associations and erases the specific one about to be overwritten; the second term writes the new one.


Folding Fails for the Delta Rule

Now apply the same feature-map upgrade: replace the channel-wise decay with $F\operatorname{diag}(\gamma_t)F^{-1}$:

$$S_t = S_{t-1} \cdot F\operatorname{diag}(\gamma_t)F^{-1} \cdot (I - \beta_t\kappa_t\kappa_t^T) + \beta_t v_t k_t^T \tag{KDA-F}$$

Attempt the same folding. Define $\tilde{S}_t = S_t F$ and $\tilde{k}_t = F^T k_t$. Then:

$$\tilde{S}_t = \tilde{S}_{t-1}\operatorname{diag}(\gamma_t) \cdot \underbrace{F^{-1}(I - \beta_t\kappa_t\kappa_t^T)F}_{\text{this does not simplify}} + \beta_t v_t \tilde{k}_t^T$$

Expanding the middle factor:

$$F^{-1}(I - \beta_t\kappa_t\kappa_t^T)F = I - \beta_t(F^{-1}\kappa_t)(\kappa_t^T F)$$

This is a rank-one subtraction with different left and right factors: $F^{-1}\kappa_t$ on the left and $(F^T\kappa_t)^T$ on the right. For this to be a symmetric projector $I - \beta_t\tilde{\kappa}_t\tilde{\kappa}_t^T$ — the only form the standard model can produce — we would need $F^{-1}\kappa_t \propto F^T\kappa_t$, which requires $F^{-1} \propto F^T$, i.e., $F$ is orthogonal (up to scaling). For any non-orthogonal $F$, the erase term is a biorthogonal rank-one operator that no choice of key projection matrix can reproduce from a symmetric projector.

Why does folding break here when it worked before? In the gated recurrence, the key $k_t$ appeared only once — in the write term $v_t k_t^T$ — so the coordinate change $\tilde{k}_t = F^T k_t$ absorbed $F$ cleanly. In the KDA recurrence, the key plays two roles: the erase direction $\kappa_t = k_t/\|k_t\|$ and the write address $k_t$. Changing coordinates transforms both simultaneously, but the erase term conjugates $F$ around the projector (left-multiplying by $F^{-1}$ and right-multiplying by $F$) while the write term absorbs $F$ only on the right — structurally different transformations that leave a residual dependence on $F$ impossible to hide in projection weights. [4, 6, 7]

For the KDA recurrence, $F$ is not redundant. It genuinely changes what the model can compute, and architectural choices about $F$ matter.
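The asymmetry is easy to exhibit numerically: for a shear $F$, the conjugated erase term fails to be symmetric, whereas any $I - \beta\tilde{\kappa}\tilde{\kappa}^T$ is symmetric by construction. A small check (values arbitrary):

```python
import numpy as np

F = np.array([[1.0, 2.0],
              [0.0, 1.0]])                       # non-orthogonal (a shear)
Finv = np.linalg.inv(F)
beta = 0.8
kappa = np.array([-1.0, 1.0]) / np.sqrt(2)       # unit erase direction

E = np.eye(2) - beta * np.outer(kappa, kappa)    # symmetric partial-erase operator
conjugated = Finv @ E @ F                        # what folding would have to match

asym = np.linalg.norm(conjugated - conjugated.T) # > 0: not symmetric, so no
                                                 # symmetric projector equals it
```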


An Even More Immediate Problem: Instability

Before asking what $F$ can express, there is a more immediate concern. For general non-orthogonal $F$, the KDA recurrence can be unstable: the state grows exponentially even with no writes.

Consider the homogeneous part of (KDA-F) with $v_t = 0$:

$$S_t = S_{t-1} \cdot F\operatorname{diag}(\gamma_t)F^{-1}(I - \beta_t\kappa_t\kappa_t^T)$$

For orthogonal $F$: $\|F\operatorname{diag}(\gamma)F^T\|_2 = \|\operatorname{diag}(\gamma)\|_2 \leq 1$ (orthogonal maps preserve singular values), and $\|I - \beta\kappa\kappa^T\|_2 = 1$ for $\beta \in [0,1]$. Every step is non-expansive. For standard (KDA) with $F = I$, this gives unconditional stability. [6, 7]

For non-orthogonal $F$, this fails. Take:

$$F = \begin{bmatrix}1&2\\0&1\end{bmatrix}, \quad \operatorname{diag}(\gamma) = \begin{bmatrix}0.2&0\\0&0.8\end{bmatrix}, \quad \beta = 0.8, \quad \kappa = \tfrac{1}{\sqrt{2}}\begin{bmatrix}-1\\1\end{bmatrix}$$

Computing $A = F\operatorname{diag}(\gamma)F^{-1}(I - \beta\kappa\kappa^T)$ directly yields a matrix with spectral radius $\approx 1.05 > 1$. Repeated application makes the state grow exponentially.
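This example can be checked directly in a few lines of numpy:

```python
import numpy as np

F = np.array([[1.0, 2.0],
              [0.0, 1.0]])
D = np.diag([0.2, 0.8])
beta = 0.8
kappa = np.array([-1.0, 1.0]) / np.sqrt(2)

# Homogeneous transition: decay in the F basis, then the partial erase.
A = F @ D @ np.linalg.inv(F) @ (np.eye(2) - beta * np.outer(kappa, kappa))
rho = np.max(np.abs(np.linalg.eigvals(A)))   # spectral radius ≈ 1.0495 > 1
```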

The root cause is a metric mismatch: $F\operatorname{diag}(\gamma)F^{-1}$ is contractive only in the $F$-induced norm $\|F^{-1}(\cdot)\|_2$, while the projector $I - \beta\kappa\kappa^T$ is non-expansive in the Euclidean norm. For non-orthogonal $F$ these norms are incompatible — the projector can amplify directions that the decay was supposed to contract.

This settles the question of whether non-orthogonal $F$ is merely a reparameterization of (KDA). Standard (KDA) is unconditionally stable; (KDA-F) with non-orthogonal $F$ is not. A stable model and an unstable model cannot be reparameterizations of each other. The feature basis $F$ is a genuine architectural choice with real consequences.


Deriving the Stable Feature Bases

We want to characterize all fixed invertible $F$ for which (KDA-F) is non-expansive for every admissible $\gamma_t \in [0,1]^d$, $\beta_t \in [0,1]$, and unit-norm $\kappa_t$.

Since $\|I - \beta\kappa\kappa^T\|_2 = 1$ always, the condition reduces to:

$$\|F\operatorname{diag}(\gamma)F^{-1}\|_2 \leq 1 \quad \text{for every diagonal } \operatorname{diag}(\gamma) \text{ with entries in } [0,1]$$

Write $F$ in column form with columns $f_i$ and $F^{-1}$ in row form with rows $g_i^T$. Setting $\operatorname{diag}(\gamma) = E_i$ (the $i$-th coordinate projector) gives:

$$\|F E_i F^{-1}\|_2 = \|f_i g_i^T\|_2 = \|f_i\|_2 \, \|g_i\|_2$$

Since $g_i^T f_i = 1$, Cauchy-Schwarz forces $\|f_i\|_2 \, \|g_i\|_2 \geq 1$. The stability requirement demands this equals exactly 1, which by the Cauchy-Schwarz equality condition requires $g_i \propto f_i$. Combined with $g_i^T f_j = 0$ for $j \neq i$ (rows of $F^{-1}$ are dual to columns of $F$), this forces $f_i^T f_j = 0$ for all $i \neq j$: the columns of $F$ must be mutually orthogonal. This is the condition $F^T F = \operatorname{diag}$, which characterizes exactly:

$$F = Q\Gamma$$

for orthogonal $Q$ and positive diagonal $\Gamma$. This class is both necessary and sufficient for unconditional stability.
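The binding case of the argument, $\operatorname{diag}(\gamma) = E_i$, gives a quick numerical check of the characterization (random test matrices, for illustration only):

```python
import numpy as np

rng = np.random.default_rng(2)
d = 4
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))   # orthogonal factor
Gamma = np.diag(rng.uniform(0.5, 2.0, d))          # positive diagonal
F_good = Q @ Gamma                                 # orthogonal columns
F_bad = rng.standard_normal((d, d))                # generic: columns not orthogonal

def worst_coordinate_norm(F):
    # diag(gamma) = E_i gives ||F E_i F^{-1}||_2 = ||f_i|| * ||g_i||,
    # the binding case of the stability condition.
    Finv = np.linalg.inv(F)
    return max(np.linalg.norm(F[:, i]) * np.linalg.norm(Finv[i, :])
               for i in range(F.shape[0]))

good = worst_coordinate_norm(F_good)   # = 1 up to roundoff: non-expansive
bad = worst_coordinate_norm(F_bad)     # > 1: some gamma makes the step expansive
```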


The Γ Parameterization

With $F = Q\Gamma$, the decay operator is $Q\Gamma\operatorname{diag}(\gamma_t)\Gamma^{-1}Q^T = Q\operatorname{diag}(\gamma_t)Q^T$ (since diagonal matrices commute). The erase projector becomes:

$$F^{-1}\kappa_t\kappa_t^T F = \Gamma^{-1}Q^T\kappa_t\kappa_t^T Q\Gamma = \Gamma^{-1}a_t a_t^T\Gamma$$

where $a_t = Q^T\kappa_t$. Since $\kappa_t = \operatorname{normalize}(k_t) = \operatorname{normalize}(W_k x_t)$, and $Q$ is orthogonal (so it preserves norms):

$$a_t = Q^T\operatorname{normalize}(W_k x_t) = \operatorname{normalize}(Q^T W_k x_t) = \operatorname{normalize}(W_a x_t)$$

where $W_a := Q^T W_k$ is simply a different learned matrix. The orthogonal factor $Q$ folds into the erase projection — it is a gauge choice exactly as in the gated recurrence, and for the same reason: $Q$ acts only as a fixed linear map adjacent to a learnable weight matrix. What cannot fold is $\Gamma$, which appears asymmetrically (as $\Gamma^{-1}$ on the left and $\Gamma$ on the right of the erase projector) and therefore cannot be absorbed by a single weight matrix.

Working in the $Q$-rotated basis (absorbed into all projection matrices), the GammaNet recurrence is:

$$\boxed{S_t = S_{t-1} \cdot \operatorname{diag}(\gamma_t) \cdot \left(I - \beta_t \,\Gamma^{-1} a_t a_t^T \Gamma\right) + \beta_t\, v_t k_t^T} \tag{GammaNet}$$

with $a_t = \operatorname{normalize}(W_a x_t)$, $k_t = W_k x_t$, and $\Gamma$ a learned positive diagonal matrix that is fixed across time steps. [6, 7]

The separately learned $W_a \neq W_k$ decouples where memory is addressed (via $k_t$) from which feature direction is erased (via $a_t$). Setting $W_a = W_k$ forces the model to use the same linear map for addressing and erasing — a meaningful structural constraint. Allowing them to differ lets the model address by entity identity and erase by attribute type.
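Putting the pieces together, one GammaNet step can be sketched as follows (the toy shapes and random projections are illustrative; the post specifies the recurrence, not an implementation):

```python
import numpy as np

rng = np.random.default_rng(3)
d = 4
Gamma = np.diag(rng.uniform(0.5, 2.0, d))       # fixed positive diagonal
Gamma_inv = np.linalg.inv(Gamma)
W_a = rng.standard_normal((d, d))               # erase-direction projection
W_k = rng.standard_normal((d, d))               # addressing projection

def gammanet_step(S, x, gamma_t, beta_t, v_t):
    a = W_a @ x
    a = a / np.linalg.norm(a)                   # a_t = normalize(W_a x_t)
    k = W_k @ x                                 # k_t = W_k x_t
    erase = np.eye(d) - beta_t * (Gamma_inv @ np.outer(a, a) @ Gamma)
    return S @ np.diag(gamma_t) @ erase + beta_t * np.outer(v_t, k)

S = rng.standard_normal((d, d))                 # some existing memory state
x = rng.standard_normal(d)

# A full erase (beta = 1) with no decay (gamma = 1) and no write (v = 0)
# zeroes the state's content along the oblique erase direction Gamma^{-1} a_t:
S_next = gammanet_step(S, x, gamma_t=np.ones(d), beta_t=1.0, v_t=np.zeros(d))
a = W_a @ x
a = a / np.linalg.norm(a)
erased_read = S_next @ (Gamma_inv @ a)          # numerically zero
```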


Summary

The path from the gated recurrence to GammaNet in three steps:

  1. The gated recurrence decays in coordinates, not directions. Replacing channel-wise decay with a feature-basis decay $F\operatorname{diag}(\gamma)F^{-1}$ is the conceptually correct fix — but for the gated recurrence alone, $F$ is redundant: it folds into the key and query projections with no change in expressive power. [1, 2, 5]

  2. The delta rule breaks folding. Adding surgical replacement makes the key play two roles simultaneously. The coordinate change that absorbed $F$ in the gated case now transforms the erase and write terms differently, leaving a residual dependence on $F$ that no projection matrix can reproduce. More immediately, non-orthogonal $F$ makes the recurrence unstable — ruling out any claim that it is a reparameterization of the unconditionally stable baseline. [4, 6, 7]

  3. Stability forces $F = Q\Gamma$, and $Q$ folds. The only feature bases guaranteeing unconditional stability are those with orthogonal columns — $F = Q\Gamma$ for orthogonal $Q$ and positive diagonal $\Gamma$. The orthogonal factor $Q$ folds into the erase projection $W_a$, leaving $\Gamma$ as the irreducible fixed feature geometry. $\Gamma$ cannot fold because it appears asymmetrically in the erase projector, and it is precisely this asymmetry that lets the model erase along learned feature directions rather than raw coordinates.

References

[1] Katharopoulos, A., Vyas, A., Pappas, N., & Fleuret, F. (2020). Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention.

[2] Schlag, I., Irie, K., & Schmidhuber, J. (2021). Linear Transformers Are Secretly Fast Weight Programmers.

[3] Widrow, B., & Hoff, M. E. (1960). Adaptive Switching Circuits.

[4] Yang, S., Wang, B., Zhang, Y., Shen, Y., & Kim, Y. (2024). Parallelizing Linear Transformers with the Delta Rule over Sequence Length.

[5] Yang, S., Wang, B., Shen, Y., Panda, R., & Kim, Y. (2024). Gated Linear Attention Transformers with Hardware-Efficient Training.

[6] Yang, S., Kautz, J., & Hatamizadeh, A. (2025). Gated Delta Networks: Improving Mamba2 with Delta Rule.

[7] Kimi Team, Zhang, Y., Lin, Z., et al. (2025). Kimi Linear: An Expressive, Efficient Attention Architecture.

Cite this post
@online{gamma-net,
  author    = {Lucas Sun},
  title     = {GammaNet - Stable Feature-Space Decay in Linear RNNs},
  year      = {2026},
  month     = {05},
  day       = {02},
  url       = {https://xtimecrystal.com/posts/260502-gamma-net/},
}