Lucas Sun

Bra-Ket (Dirac) Notation in Deep Learning (Part 1)

This series will use bra-ket notation as its default language for deep learning. Standard matrix notation is efficient for implementation, but it often hides the row/column logic that matters mathematically. Bra-ket notation makes the key-value structure explicit: bras produce coefficients by dot product, kets are the vectors being combined, and sums of ket-bras expose low-rank structure directly. This is often closer to how modern work analyzes MLPs, attention, and related architectures.

1. Introduction to Bra-Ket Notation

Let $H \cong \mathbb{R}^d$ be a finite-dimensional inner-product space. A ket $|x\rangle \in H$ is a column vector, a bra $\langle u| \in H^*$ is a row vector, the bracket $\langle u|x\rangle$ is a dot product, and the ket-bra $|v\rangle\langle u|$ is a rank-one operator.

A matrix can be written in row form or column form:

$$
W=
\begin{bmatrix}
\rule{2.2em}{0.4pt}\ \langle r_1|\ \rule{2.2em}{0.4pt}\\
\rule{2.2em}{0.4pt}\ \langle r_2|\ \rule{2.2em}{0.4pt}\\
\rule{2.2em}{0.4pt}\ \langle r_3|\ \rule{2.2em}{0.4pt}
\end{bmatrix}
=
\begin{bmatrix}
\vert & \vert & \vert & \vert\\
|c_1\rangle & |c_2\rangle & |c_3\rangle & |c_4\rangle\\
\vert & \vert & \vert & \vert
\end{bmatrix}.
$$

The row form emphasizes dot products:

$$
W|x\rangle=
\begin{bmatrix}
\langle r_1|x\rangle\\
\langle r_2|x\rangle\\
\langle r_3|x\rangle
\end{bmatrix},
$$

while the column form emphasizes linear combination, with $a_j$ the components of $|a\rangle$:

$$
W|a\rangle = \sum_{j=1}^{n} a_j\, |c_j\rangle.
$$

This is the basic key-value pattern: brackets with bras produce scalar coefficients, and those coefficients combine kets.
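As a quick numerical check, here is a small NumPy sketch (variable names are my own) of the two readings of the same matrix-vector product: the row form fills in the brackets entry by entry, while the column form combines the column kets.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((3, 4))   # a 3x4 matrix, as in the example above
x = rng.standard_normal(4)        # a ket |x> in R^4

# Row form: each output entry is a bracket <r_i|x>.
row_form = np.array([W[i] @ x for i in range(W.shape[0])])

# Column form: the output is a linear combination of the column kets |c_j>.
col_form = sum(x[j] * W[:, j] for j in range(W.shape[1]))

assert np.allclose(row_form, W @ x)
assert np.allclose(col_form, W @ x)
```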

2. Multi-Layer Perceptron

A one-hidden-layer MLP (biases omitted) has exactly this form. If its hidden dimension is $m$, then

$$
\mathrm{MLP}(|x\rangle) = \sum_{i=1}^{m} \phi\!\left(\langle w_i^{\mathrm{in}}|x\rangle\right) |w_i^{\mathrm{out}}\rangle.
$$

Here $\langle w_i^{\mathrm{in}}|$ are keys and $|w_i^{\mathrm{out}}\rangle$ are values. Each term is:

$$
|x\rangle \xrightarrow{\ \langle w_i^{\mathrm{in}}|x\rangle\ } \text{scalar} \xrightarrow{\ \phi\ } \text{scalar} \xrightarrow{\ \times\, |w_i^{\mathrm{out}}\rangle\ } \text{vector}.
$$

So an MLP is a sum of $m$ key-value terms with nonlinear coefficients.
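Here is a minimal sketch of that decomposition (assuming a ReLU nonlinearity, no biases, and placeholder names): the hidden layer is evaluated term by term, with each key bracket scaling its value ket.

```python
import numpy as np

def mlp_keyvalue(x, W_in, W_out, phi=lambda z: np.maximum(z, 0.0)):
    """One-hidden-layer MLP (biases omitted) as a sum of key-value terms.

    Row i of W_in is the key bra <w_i^in|        (W_in has shape (m, d)).
    Column i of W_out is the value ket |w_i^out| (W_out has shape (d, m)).
    """
    out = np.zeros(W_out.shape[0])
    for i in range(W_in.shape[0]):
        coeff = phi(W_in[i] @ x)      # bracket <w_i^in|x>, passed through phi
        out += coeff * W_out[:, i]    # scales the value ket |w_i^out>
    return out

rng = np.random.default_rng(0)
d, m = 8, 16
x = rng.standard_normal(d)
W_in, W_out = rng.standard_normal((m, d)), rng.standard_normal((d, m))

# Agrees with the usual matrix form W_out @ phi(W_in @ x).
assert np.allclose(mlp_keyvalue(x, W_in, W_out),
                   W_out @ np.maximum(W_in @ x, 0.0))
```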

For comparison, a linear operator has the SVD form

$$
A=\sum_{i=1}^{r} \sigma_i\, |v_i\rangle\langle u_i|,
\qquad
A|x\rangle = \sum_{i=1}^{r} \sigma_i\, \langle u_i|x\rangle\, |v_i\rangle.
$$

The MLP has the same key-value structure, except that its coefficients are nonlinear functions of the brackets.
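A small NumPy check of the SVD form (names are illustrative; note that NumPy's `U`/`Vh` letters are swapped relative to the $|v_i\rangle\langle u_i|$ convention used here):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 7))
x = rng.standard_normal(7)

# numpy returns A = U @ diag(S) @ Vh.  In this post's notation the
# output kets |v_i> are the columns of U and the input bras <u_i|
# are the rows of Vh.
U, S, Vh = np.linalg.svd(A, full_matrices=False)

# A|x> as a sum of rank-one terms: sigma_i * <u_i|x> * |v_i>.
Ax = sum(S[i] * (Vh[i] @ x) * U[:, i] for i in range(len(S)))
assert np.allclose(Ax, A @ x)
```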

3. Self-Attention

Self-attention has the same overall form, but now each coefficient depends on the full set of key-query brackets. Treating the keys as bras and the query as a ket, each query-key dot product becomes a bracket $\langle k_j|q_i\rangle$, which slots directly into the operator structure above. For token $i$ and sequence length $T$ (the usual $1/\sqrt{d_k}$ scaling is omitted for readability),

$$
\mathrm{Attn}(|q_i\rangle) = \sum_{j=1}^{T} \mathrm{softmax}_j\!\left( \langle k_1|q_i\rangle,\dots,\langle k_T|q_i\rangle \right) |v_j\rangle.
$$

Equivalently, if

$$
\phi_j(z_1,\dots,z_T) = \frac{e^{z_j}}{\sum_{\ell=1}^{T} e^{z_\ell}},
$$

then

$$
\mathrm{Attn}(|q_i\rangle) = \sum_{j=1}^{T} \phi_j\!\left( \langle k_1|q_i\rangle,\dots,\langle k_T|q_i\rangle \right) |v_j\rangle.
$$

This mirrors the MLP form closely:

$$
\begin{aligned}
\mathrm{MLP}(|x\rangle) &= \sum_{i=1}^{m} \phi\!\left(\langle w_i^{\mathrm{in}}|x\rangle\right) |w_i^{\mathrm{out}}\rangle,\\[8pt]
\mathrm{Attn}(|q_i\rangle) &= \sum_{j=1}^{T} \phi_j\!\left( \langle k_1|q_i\rangle,\dots,\langle k_T|q_i\rangle \right) |v_j\rangle.
\end{aligned}
$$

The difference is simple:

  • in an MLP, each coefficient depends on one bracket,
  • in attention, each coefficient depends on the whole set of brackets.

In both cases, brackets produce coefficients and coefficients combine values.
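Here is a minimal per-token sketch of the attention sum (single head, no $1/\sqrt{d_k}$ scaling, made-up names), with the rows of `K` playing the key bras and the rows of `V` the value kets:

```python
import numpy as np

def attend_one_token(q, K, V):
    """Attention output for a single query ket |q_i>.

    Row j of K is the key bra <k_j|    (K has shape (T, d_k)).
    Row j of V is the value ket |v_j>  (V has shape (T, d_v)).
    The usual 1/sqrt(d_k) scaling is omitted, as in the formulas above.
    """
    z = K @ q                        # brackets <k_j|q_i>, shape (T,)
    coeffs = np.exp(z - z.max())
    coeffs /= coeffs.sum()           # phi_j: softmax over all the brackets
    return coeffs @ V                # weighted sum of the value kets

rng = np.random.default_rng(0)
T, d_k, d_v = 6, 8, 5
q = rng.standard_normal(d_k)
K, V = rng.standard_normal((T, d_k)), rng.standard_normal((T, d_v))

out = attend_one_token(q, K, V)      # lies in the span of the T value kets
```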

4. Effective Rank

An informal bra-ket definition of effective rank is:

$$
M \approx \sum_{i=1}^{r_{\mathrm{eff}}} |v_i\rangle\langle k_i|.
$$

So $r_{\mathrm{eff}}$ is approximately the number of key-value pairs needed for a good approximation.
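One way to make this operational (a sketch, with an arbitrary tolerance) is to count how many rank-one ket-bra terms are needed before the truncation error becomes negligible:

```python
import numpy as np

def effective_rank(M, tol=1e-2):
    """Smallest number of rank-one ket-bra terms whose sum approximates M
    to within `tol` in relative Frobenius norm (one informal choice of r_eff)."""
    S = np.linalg.svd(M, compute_uv=False)
    total = np.linalg.norm(S)
    for r in range(len(S) + 1):
        if np.linalg.norm(S[r:]) <= tol * total:
            return r
    return len(S)

# A matrix built from 3 ket-bra terms plus small noise has r_eff close to 3.
rng = np.random.default_rng(0)
M = sum(rng.standard_normal((10, 1)) @ rng.standard_normal((1, 10)) for _ in range(3))
M += 1e-4 * rng.standard_normal((10, 10))
print(effective_rank(M))  # typically 3
```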

This makes the structural comparison clear:

  • linear map: fixed coefficients, fixed key-value pairs,
  • MLP: nonlinear coefficients, fixed key-value pairs,
  • attention: nonlinear coefficients, token-dependent key-value pairs.

The effective rank is upper bounded by the number of available pairs. For an MLP this is at most the hidden dimension $m$; for a single attention head at one token this is at most the token count $T$. In both cases the operator is built from a bounded number of ket-bra terms in a finite-dimensional Hilbert space, while the nonlinearity comes from input-dependent coefficients:

$$
\phi\!\left(\langle w_i^{\mathrm{in}}|x\rangle\right),
\qquad
\phi_j\!\left( \langle k_1|q_i\rangle,\dots,\langle k_T|q_i\rangle \right).
$$
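A small numerical illustration of the bound for attention (assumed setup: one head with fixed keys and values, queried many times): however many queries are fed in, every output is a combination of the $T$ value kets, so the outputs have rank at most $T$.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d_k, d_v, n_queries = 4, 32, 64, 200   # few tokens, larger model dims
K = rng.standard_normal((T, d_k))
V = rng.standard_normal((T, d_v))
Q = rng.standard_normal((n_queries, d_k))

# Attention outputs for many different queries against the same K, V.
Z = Q @ K.T                                 # brackets <k_j|q>
P = np.exp(Z - Z.max(axis=1, keepdims=True))
P /= P.sum(axis=1, keepdims=True)           # softmax coefficients per query
outputs = P @ V                             # shape (n_queries, d_v)

# Every output lies in the span of the T value kets.
print(np.linalg.matrix_rank(outputs))       # at most 4 here
```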

5. Conclusion

Bra-ket notation isolates four objects cleanly:

  • bras: keys,
  • kets: values and queries,
  • brackets: dot-product coefficients,
  • sums of ket-bras: low-rank operator structure.

With this notation, the common structure of linear maps, MLPs, and self-attention is immediate. All three are sums of value vectors weighted by coefficients derived from inner products; they differ only in how those coefficients are produced. This is why bra-ket notation will be the default in the rest of this series: it makes the key-value structure explicit and keeps the relevant linear-algebraic content visible.

Part 2 will use this perspective to derive linear attention, from the original formulation to modern variants such as Gated DeltaNet.

Cite this post

```bibtex
@online{bra-ket-notation-deep-learning-1,
  author = {Lucas Sun},
  title  = {Bra-Ket (Dirac) Notation in Deep Learning (Part 1)},
  year   = {2026},
  month  = {04},
  day    = {26},
  url    = {https://xtimecrystal.com/posts/260426-bra-ket-notation/},
}
```