Hi! I'm Lucas Sun. Welcome to my blog.
I'll write here about things I find genuinely interesting — math, machine learning, and whatever else pulls my attention.
What to expect
Posts here will tend to be technical. Topics I expect to return to:

- Training dynamics and optimizer theory
- Mathematical foundations of learning
- Experiments I run and find surprising
- Occasional notes on things outside ML

No particular posting schedule. Quality over frequency.

Why write?
Writing is thinking. A half-formed idea that feels solid in my head usually falls apart the moment I try to write it down precisely. So the act of writing a post is primarily for my own benefit — to find out whether I actually understand something.
The secondary benefit is that occasionally someone finds it useful. That's a nice bonus.
"The most valuable thing I could do is to try to get ideas out of my head and into a form where other people can engage with them."
Some mathematics
Since this blog will involve a lot of math, let me make sure the rendering works. Here are a few examples.
Euler's identity
Perhaps the most famous equation in mathematics:

$$
e^{i\pi} + 1 = 0
$$

It connects the five most fundamental constants: $e$, $i$, $\pi$, $1$, and $0$.
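As a quick numerical sanity check (a throwaway Python snippet, not part of the identity itself):

import cmath

# e^(i*pi) + 1 should be zero up to floating-point error
print(abs(cmath.exp(1j * cmath.pi) + 1))  # ~1.2e-16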
The Gaussian integral
A result that appears everywhere in probability and physics:

$$
\int_{-\infty}^{\infty} e^{-x^2}\,dx = \sqrt{\pi}
$$
Proof sketch
The standard trick is to compute $I^2$ instead of $I$ directly. Let $I = \int_{-\infty}^{\infty} e^{-x^2}\,dx$. Then:

$$
I^2 = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} e^{-(x^2 + y^2)}\,dx\,dy
$$

Converting to polar coordinates $x = r\cos\theta$, $y = r\sin\theta$:

$$
I^2 = \int_0^{2\pi}\int_0^{\infty} e^{-r^2}\, r\,dr\,d\theta = 2\pi \cdot \frac{1}{2} = \pi
$$

Therefore $I = \sqrt{\pi}$.
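And a crude numerical check of the result (a throwaway NumPy sketch; the finite bounds and grid are arbitrary):

import numpy as np

# Riemann-sum approximation of the integral of exp(-x^2) over [-10, 10]
x = np.linspace(-10.0, 10.0, 2_000_001)
dx = x[1] - x[0]
print(np.sum(np.exp(-x**2)) * dx, np.sqrt(np.pi))  # both ~1.7724538509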
The Basel problem

Euler's 1734 result, which stunned the mathematical world:

$$
\sum_{n=1}^{\infty} \frac{1}{n^2} = \frac{\pi^2}{6}
$$
More generally, for even positive integers, the Riemann zeta function satisfies:

$$
\zeta(2n) = \frac{(-1)^{n+1} B_{2n}\,(2\pi)^{2n}}{2\,(2n)!}
$$

where $B_{2n}$ are the Bernoulli numbers.
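A quick check of the Basel case by partial summation (plain Python, nothing fancy):

import math

# Partial sum of 1/n^2; the tail beyond N is roughly 1/N
partial = sum(1.0 / n**2 for n in range(1, 1_000_001))
print(partial, math.pi**2 / 6)  # 1.64493306... vs 1.64493406...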
Softmax and attention
The softmax function, which appears everywhere in machine learning:

$$
\mathrm{softmax}(x)_i = \frac{e^{x_i}}{\sum_{j} e^{x_j}}
$$
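In practice it is usually computed with the maximum subtracted first for numerical stability; here is a small NumPy sketch of that trick (my own, not any particular library's API):

import numpy as np

def softmax(x, axis=-1):
    # Subtracting the max leaves the output unchanged but prevents overflow in exp
    shifted = x - np.max(x, axis=axis, keepdims=True)
    exps = np.exp(shifted)
    return exps / np.sum(exps, axis=axis, keepdims=True)

print(softmax(np.array([1.0, 2.0, 3.0])))  # [0.0900 0.2447 0.6652]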
The scaled dot-product attention mechanism used in transformers:

$$
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right) V
$$

where $Q \in \mathbb{R}^{n \times d_k}$, $K \in \mathbb{R}^{m \times d_k}$, and $V \in \mathbb{R}^{m \times d_v}$.
Code
I'll sometimes include code. Here's a minimal transformer attention block in Python:
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V, mask=None):
    """
    Q: (batch, heads, seq_q, d_k)
    K: (batch, heads, seq_k, d_k)
    V: (batch, heads, seq_k, d_v)
    """
    d_k = Q.size(-1)
    # Scaled dot-product scores: (batch, heads, seq_q, seq_k)
    scores = torch.matmul(Q, K.transpose(-2, -1)) / d_k**0.5
    if mask is not None:
        # Positions where mask == 0 are excluded from attention
        scores = scores.masked_fill(mask == 0, float('-inf'))
    weights = F.softmax(scores, dim=-1)
    return torch.matmul(weights, V), weights
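A quick shape check with random tensors (the sizes here are arbitrary):

Q = torch.randn(2, 4, 10, 64)  # (batch, heads, seq_q, d_k)
K = torch.randn(2, 4, 12, 64)  # (batch, heads, seq_k, d_k)
V = torch.randn(2, 4, 12, 32)  # (batch, heads, seq_k, d_v)
out, attn = scaled_dot_product_attention(Q, K, V)
print(out.shape, attn.shape)  # torch.Size([2, 4, 10, 32]) torch.Size([2, 4, 10, 12])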
And a simple gradient descent loop:
for step in range(num_steps):
    loss = criterion(model(x), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
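For completeness, a minimal setup that loop assumes; the model, data, and hyperparameters below are placeholders I'm picking just so the snippet runs:

import torch
import torch.nn as nn

x = torch.randn(128, 10)  # toy inputs
y = torch.randn(128, 1)   # toy regression targets
model = nn.Linear(10, 1)
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
num_steps = 100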
References
A non-exhaustive list of references
- Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I. (2017). Attention is all you need. NeurIPS 2017.
- Euler, L. (1734). De summis serierum reciprocarum. Commentarii academiae scientiarum Petropolitanae.

Final notes
This post is mostly a test of the blog infrastructure — LaTeX rendering, code highlighting, collapsible sections, table of contents, tags. Everything seems to work.
Future posts will be more substantive. If you want to follow along, my email is lucas.gx.sun@gmail.com.