Lucas Sun

Bra-Ket (Dirac) Notation in Deep Learning (Part 1)

This series will use bra-ket notation as its default language for deep learning. Standard matrix notation is efficient for implementation, but it often hides the row/column logic that matters mathematically. Bra-ket notation makes the key-value structure explicit: bras produce coefficients by dot product, kets are the vectors being combined, and sums of ket-bras expose low-rank structure directly. This is often closer to how modern work analyzes MLPs, attention, and related architectures.

1. Introduction to Bra-Ket Notation

Let $H \cong \mathbb{R}^d$ be a finite-dimensional inner-product space. A ket $|x\rangle \in H$ is a column vector, a bra $\langle u| \in H^*$ is a row vector, the bracket $\langle u|x\rangle$ is a dot product, and the ket-bra $|v\rangle\langle u|$ is a rank-one operator.

A matrix can be written in row form or column form:

$$
W=
\begin{bmatrix}
\rule{2.2em}{0.4pt}\ \langle r_1|\ \rule{2.2em}{0.4pt}\\
\rule{2.2em}{0.4pt}\ \langle r_2|\ \rule{2.2em}{0.4pt}\\
\rule{2.2em}{0.4pt}\ \langle r_3|\ \rule{2.2em}{0.4pt}
\end{bmatrix}
=
\begin{bmatrix}
\vert & \vert & \vert & \vert\\
|c_1\rangle & |c_2\rangle & |c_3\rangle & |c_4\rangle\\
\vert & \vert & \vert & \vert
\end{bmatrix}.
$$

The row form emphasizes dot products:

$$
W|x\rangle=
\begin{bmatrix}
\langle r_1|x\rangle\\
\langle r_2|x\rangle\\
\langle r_3|x\rangle
\end{bmatrix},
$$

while the column form emphasizes linear combination, with $a_j$ the components of $|a\rangle$:

$$
W|a\rangle = \sum_{j=1}^{n} a_j\, |c_j\rangle.
$$

This is the basic key-value pattern: brackets with bras produce scalar coefficients, and those coefficients combine kets.
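As a quick numerical check, here is a small NumPy sketch (variable names are my own) of the two readings of the same matrix-vector product: the row form fills in the brackets entry by entry, while the column form combines the column kets.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((3, 4))   # a 3x4 matrix, as in the example above
x = rng.standard_normal(4)        # a ket |x> in R^4

# Row form: each output entry is a bracket <r_i|x>.
row_form = np.array([W[i] @ x for i in range(W.shape[0])])

# Column form: the output is a linear combination of the column kets |c_j>.
col_form = sum(x[j] * W[:, j] for j in range(W.shape[1]))

assert np.allclose(row_form, W @ x)
assert np.allclose(col_form, W @ x)
```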

2. Multi-Layer Perceptron

A one-hidden-layer MLP (biases omitted) has exactly this form. If its hidden dimension is $m$, then

$$
\mathrm{MLP}(|x\rangle) = \sum_{i=1}^{m} \phi\!\left(\langle w_i^{\mathrm{in}}|x\rangle\right) |w_i^{\mathrm{out}}\rangle.
$$

Here $\langle w_i^{\mathrm{in}}|$ are keys and $|w_i^{\mathrm{out}}\rangle$ are values. Each term is:

$$
|x\rangle \xrightarrow{\ \langle w_i^{\mathrm{in}}|x\rangle\ } \text{scalar} \xrightarrow{\ \phi\ } \text{scalar} \xrightarrow{\ \times\, |w_i^{\mathrm{out}}\rangle\ } \text{vector}.
$$

So an MLP is a sum of $m$ key-value terms with nonlinear coefficients.
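Here is a minimal sketch of that decomposition (assuming a ReLU nonlinearity, no biases, and placeholder names): the hidden layer is evaluated term by term, with each key bracket scaling its value ket.

```python
import numpy as np

def mlp_keyvalue(x, W_in, W_out, phi=lambda z: np.maximum(z, 0.0)):
    """One-hidden-layer MLP (biases omitted) as a sum of key-value terms.

    Row i of W_in is the key bra <w_i^in|        (W_in has shape (m, d)).
    Column i of W_out is the value ket |w_i^out| (W_out has shape (d, m)).
    """
    out = np.zeros(W_out.shape[0])
    for i in range(W_in.shape[0]):
        coeff = phi(W_in[i] @ x)      # bracket <w_i^in|x>, passed through phi
        out += coeff * W_out[:, i]    # scales the value ket |w_i^out>
    return out

rng = np.random.default_rng(0)
d, m = 8, 16
x = rng.standard_normal(d)
W_in, W_out = rng.standard_normal((m, d)), rng.standard_normal((d, m))

# Agrees with the usual matrix form W_out @ phi(W_in @ x).
assert np.allclose(mlp_keyvalue(x, W_in, W_out),
                   W_out @ np.maximum(W_in @ x, 0.0))
```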

For comparison, a linear operator has the SVD form

$$
A=\sum_{i=1}^{r} \sigma_i\, |v_i\rangle\langle u_i|,
\qquad
A|x\rangle = \sum_{i=1}^{r} \sigma_i\, \langle u_i|x\rangle\, |v_i\rangle.
$$

The MLP has the same key-value structure, except that its coefficients are nonlinear functions of the brackets.
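A small NumPy check of the SVD form (names are illustrative; note that NumPy's `U`/`Vh` letters are swapped relative to the $|v_i\rangle\langle u_i|$ convention used here):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 7))
x = rng.standard_normal(7)

# numpy returns A = U @ diag(S) @ Vh.  In this post's notation the
# output kets |v_i> are the columns of U and the input bras <u_i|
# are the rows of Vh.
U, S, Vh = np.linalg.svd(A, full_matrices=False)

# A|x> as a sum of rank-one terms: sigma_i * <u_i|x> * |v_i>.
Ax = sum(S[i] * (Vh[i] @ x) * U[:, i] for i in range(len(S)))
assert np.allclose(Ax, A @ x)
```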

3. Self-Attention

Self-attention has the same overall form, but now each coefficient depends on the full set of key-query brackets. Treating the keys as bras and the query as a ket, each query-key dot product becomes a bracket $\langle k_j|q_i\rangle$, which slots directly into the operator structure above. For token $i$ and sequence length $T$ (the usual $1/\sqrt{d_k}$ scaling is omitted for readability),

$$
\mathrm{Attn}(|q_i\rangle) = \sum_{j=1}^{T} \mathrm{softmax}_j\!\left( \langle k_1|q_i\rangle,\dots,\langle k_T|q_i\rangle \right) |v_j\rangle.
$$

Equivalently, if

$$
\phi_j(z_1,\dots,z_T) = \frac{e^{z_j}}{\sum_{\ell=1}^{T} e^{z_\ell}},
$$

then

$$
\mathrm{Attn}(|q_i\rangle) = \sum_{j=1}^{T} \phi_j\!\left( \langle k_1|q_i\rangle,\dots,\langle k_T|q_i\rangle \right) |v_j\rangle.
$$

This mirrors the MLP form closely:

$$
\begin{aligned}
\mathrm{MLP}(|x\rangle) &= \sum_{i=1}^{m} \phi\!\left(\langle w_i^{\mathrm{in}}|x\rangle\right) |w_i^{\mathrm{out}}\rangle,\\[8pt]
\mathrm{Attn}(|q_i\rangle) &= \sum_{j=1}^{T} \phi_j\!\left( \langle k_1|q_i\rangle,\dots,\langle k_T|q_i\rangle \right) |v_j\rangle.
\end{aligned}
$$

The difference is simple:

  • in an MLP, each coefficient depends on one bracket,
  • in attention, each coefficient depends on the whole set of brackets.

In both cases, brackets produce coefficients and coefficients combine values.
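Here is a minimal per-token sketch of the attention sum (single head, no $1/\sqrt{d_k}$ scaling, made-up names), with the rows of `K` playing the key bras and the rows of `V` the value kets:

```python
import numpy as np

def attend_one_token(q, K, V):
    """Attention output for a single query ket |q_i>.

    Row j of K is the key bra <k_j|    (K has shape (T, d_k)).
    Row j of V is the value ket |v_j>  (V has shape (T, d_v)).
    The usual 1/sqrt(d_k) scaling is omitted, as in the formulas above.
    """
    z = K @ q                        # brackets <k_j|q_i>, shape (T,)
    coeffs = np.exp(z - z.max())
    coeffs /= coeffs.sum()           # phi_j: softmax over all the brackets
    return coeffs @ V                # weighted sum of the value kets

rng = np.random.default_rng(0)
T, d_k, d_v = 6, 8, 5
q = rng.standard_normal(d_k)
K, V = rng.standard_normal((T, d_k)), rng.standard_normal((T, d_v))

out = attend_one_token(q, K, V)      # lies in the span of the T value kets
```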

4. Effective Rank

An informal bra-ket definition of effective rank is:

$$
M \approx \sum_{i=1}^{r_{\mathrm{eff}}} |v_i\rangle\langle k_i|.
$$

So $r_{\mathrm{eff}}$ is approximately the number of key-value pairs needed for a good approximation.
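One way to make this operational (a sketch, with an arbitrary tolerance) is to count how many rank-one ket-bra terms are needed before the truncation error becomes negligible:

```python
import numpy as np

def effective_rank(M, tol=1e-2):
    """Smallest number of rank-one ket-bra terms whose sum approximates M
    to within `tol` in relative Frobenius norm (one informal choice of r_eff)."""
    S = np.linalg.svd(M, compute_uv=False)
    total = np.linalg.norm(S)
    for r in range(len(S) + 1):
        if np.linalg.norm(S[r:]) <= tol * total:
            return r
    return len(S)

# A matrix built from 3 ket-bra terms plus small noise has r_eff close to 3.
rng = np.random.default_rng(0)
M = sum(rng.standard_normal((10, 1)) @ rng.standard_normal((1, 10)) for _ in range(3))
M += 1e-4 * rng.standard_normal((10, 10))
print(effective_rank(M))  # typically 3
```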

This makes the structural comparison clear:

  • linear map: fixed coefficients, fixed key-value pairs,
  • MLP: nonlinear coefficients, fixed key-value pairs,
  • attention: nonlinear coefficients, token-dependent key-value pairs.

The effective rank is upper bounded by the number of available pairs. For an MLP this is at most the hidden dimension $m$; for a single attention head at one token this is at most the token count $T$. In both cases the operator is built from a bounded number of ket-bra terms in a finite-dimensional Hilbert space, while the nonlinearity comes from input-dependent coefficients:

$$
\phi\!\left(\langle w_i^{\mathrm{in}}|x\rangle\right),
\qquad
\phi_j\!\left( \langle k_1|q_i\rangle,\dots,\langle k_T|q_i\rangle \right).
$$
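A small numerical illustration of the bound for attention (assumed setup: one head with fixed keys and values, queried many times): however many queries are fed in, every output is a combination of the $T$ value kets, so the outputs have rank at most $T$.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d_k, d_v, n_queries = 4, 32, 64, 200   # few tokens, larger model dims
K = rng.standard_normal((T, d_k))
V = rng.standard_normal((T, d_v))
Q = rng.standard_normal((n_queries, d_k))

# Attention outputs for many different queries against the same K, V.
Z = Q @ K.T                                 # brackets <k_j|q>
P = np.exp(Z - Z.max(axis=1, keepdims=True))
P /= P.sum(axis=1, keepdims=True)           # softmax coefficients per query
outputs = P @ V                             # shape (n_queries, d_v)

# Every output lies in the span of the T value kets.
print(np.linalg.matrix_rank(outputs))       # at most 4 here
```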

5. Conclusion

Bra-ket notation isolates four objects cleanly:

  • bras: keys,
  • kets: values and queries,
  • brackets: dot-product coefficients,
  • sums of ket-bras: low-rank operator structure.

With this notation, the common structure of linear maps, MLPs, and self-attention is immediate. All three are sums of value vectors weighted by coefficients derived from inner products; they differ only in how those coefficients are produced. This is why bra-ket notation will be the default in the rest of this series: it makes the key-value structure explicit and keeps the relevant linear-algebraic content visible.

Part 2 will use this perspective to derive linear attention, from the original formulation to modern variants such as Gated DeltaNet.

Cite this post

```bibtex
@online{bra-ket-notation-deep-learning-1,
  author = {Lucas Sun},
  title  = {Bra-Ket (Dirac) Notation in Deep Learning (Part 1)},
  year   = {2026},
  month  = {04},
  day    = {26},
  url    = {https://xtimecrystal.com/posts/260426-bra-ket-notation/},
}
```