DRIVIA · FORMULA LAB v6
Episode 5 · v4 · Astro build

Read the math like a sentence.

Every formula has a sound. Hear it. Say it back. Watch it animate apart. Touch the interactive widget. Type the code. No more bouncing off Greek letters.

Chapter 01 · 1 of 17

Vectors — arrows of numbers.

A vector is an ordered list of numbers. Write it as a column, add it to other vectors, scale it by a single number. Foundation of everything.

$\mathbf{v} = \begin{bmatrix} v_1 \\ v_2 \\ \vdots \\ v_n \end{bmatrix}, \quad k\mathbf{v} + \mathbf{w} = \begin{bmatrix} k v_1 + w_1 \\ \vdots \\ k v_n + w_n \end{bmatrix}$
Pronunciation
Vector v equals, open bracket, v sub one, v sub two, dot dot dot, v sub n, close bracket. K vector v plus vector w, equals, component-wise sum.
In plain English
A vector is a list of numbers, written as a column. Add two vectors by adding matching entries. Multiply a vector by a single number (a scalar) by multiplying every entry. Combining the two operations, k·v + w, gives a linear combination, the idea that most of linear algebra is built on.
Symbol glossary — click any symbol to hear it
v (vector): an ordered list of n numbers
vᵢ (i-th component): the entry at position i
n (dimension): number of components
k (scalar): a single number (not a vector)
+ (vector addition): component-wise sum
Interactive — touch it
v = [2.00, 1.00], w = [1.00, 2.00], k·v + w = [5.00, 4.00] (k = 2)
In code — type it yourself
import numpy as np
v = np.array([1, 2, 3])
w = np.array([4, 5, 6])
combo = 2 * v + w        # [6, 9, 12]
print(combo.shape)        # (3,)
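To see what the NumPy expression does entry by entry, the same linear combination can be spelled out in plain Python. This is only a sketch; `linear_combination` is a hypothetical helper name, not something the page defines.

```python
def linear_combination(k, v, w):
    """Return k*v + w, computed one component at a time."""
    if len(v) != len(w):
        raise ValueError("vectors must have the same number of components")
    return [k * vi + wi for vi, wi in zip(v, w)]

combo = linear_combination(2, [1, 2, 3], [4, 5, 6])
print(combo)  # [6, 9, 12], the same values as the NumPy version
```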
Try it — 10 practice problems
Problem 1 of 10
v = [1, 2], w = [3, 4], k = 2. Compute (k·v + w)[0].
Chapter 02 · 2 of 17

Vector Norm (Length) — how long is the arrow.

Pythagoras for n dimensions. Square each component, sum, take the square root. The geometric length of a vector.

$\lVert \mathbf{v} \rVert = \sqrt{v_1^2 + v_2^2 + \cdots + v_n^2} = \sqrt{\sum_{i=1}^{n} v_i^2}$
Pronunciation
Norm of v equals, square root of, v sub one squared, plus v sub two squared, plus dot dot dot, plus v sub n squared.
In plain English
The norm (length, magnitude) of a vector is just Pythagoras in n dimensions. For v = [3, 4]: √(9 + 16) = √25 = 5. Sign doesn't matter — we square first. Output is always ≥ 0, and it's zero only when every component is zero.
Symbol glossary — click any symbol to hear it
‖v‖ (L2 norm): geometric length of v
√ (square root): non-negative root
Σ (sum): add everything that follows
vᵢ² (squared component): always non-negative
Interactive — touch it
v = [3.00, 4.00], ‖v‖ = √(9.00 + 16.00) = 5.0000
In code — type it yourself
import numpy as np
v = np.array([3, 4])
length = np.linalg.norm(v)   # 5.0
# manual: np.sqrt((v ** 2).sum())
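The three claims in the plain-English paragraph (Pythagoras, sign does not matter, zero only for the zero vector) can be checked with a few lines of plain Python. A minimal sketch; the function name `norm` is chosen here for illustration.

```python
import math

def norm(v):
    """Euclidean (L2) length: square each entry, sum, take the square root."""
    return math.sqrt(sum(x * x for x in v))

print(norm([3, 4]))    # 5.0
print(norm([-3, 4]))   # 5.0, the sign disappears when squaring
print(norm([0, 0]))    # 0.0, only the zero vector has length zero
```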
Try it — 10 practice problems
Problem 1 of 10
v = [3, 4]. Compute $\lVert v \rVert$ (the 2-norm / length).
Chapter 03 · 3 of 17

Unit Vector (Normalization) — direction, no length.

Divide every component of v by its own length. The result has length 1 and points the same way. Strips magnitude, keeps direction.

$\hat{\mathbf{v}} = \dfrac{\mathbf{v}}{\lVert \mathbf{v} \rVert}$
Pronunciation
v hat equals, v, divided by, norm of v.
In plain English
A unit vector is a vector with length 1. To make one, divide every component by the vector's own length (this only works for a non-zero vector). The hat, as in v̂, is the standard notation for "normalized." Unit vectors strip the scale and let you compare pure directions, which is exactly what cosine similarity does in Chapter 5.
Symbol glossary — click any symbol to hear it
v̂ (unit vector): v scaled to length 1
‖v‖ (length of v): the divisor (first seen in Ch. 2)
1 (target length): ‖v̂‖ = 1 by construction
Interactive — touch it
v = [3.00, 4.00], ‖v‖ = √(9.00 + 16.00) = 5.0000
In code — type it yourself
import numpy as np
v = np.array([3, 4])
v_hat = v / np.linalg.norm(v)   # [0.6, 0.8]
np.linalg.norm(v_hat)            # 1.0
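One case the formula does not cover is the zero vector, whose length is 0 and which therefore cannot be divided by its own length. The sketch below makes that explicit; raising an error is one possible choice made here, not something the page prescribes.

```python
import math

def normalize(v):
    """Return v scaled to length 1; the zero vector has no direction, so reject it."""
    length = math.sqrt(sum(x * x for x in v))
    if length == 0:
        raise ValueError("cannot normalize the zero vector")
    return [x / length for x in v]

v_hat = normalize([3, 4])
print(v_hat)  # [0.6, 0.8]
```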
Try it — 10 practice problems
Problem 1 of 10
v = [3, 4]. Compute (v / ‖v‖)[0] to 3 decimal places.
Chapter 04 · 4 of 17

Dot Product — similarity scalar.

Multiply matching components, sum them. Output is one number. Foundation of attention, projections, cosine similarity.

$\mathbf{a} \cdot \mathbf{b} = \sum_{i=1}^{n} a_i b_i = \lVert \mathbf{a} \rVert \lVert \mathbf{b} \rVert \cos\theta$
Pronunciation
a dot b equals, the sum from i equals one to n, of a sub i times b sub i. Same as, norm of a, times norm of b, times cosine theta.
In plain English
The dot product has two equivalent definitions. Algebraically: multiply matching components, sum them. Geometrically: ‖a‖ · ‖b‖ · cos θ, where θ is the angle between a and b. Same number, two stories. Output is a SCALAR. Aligned vectors → big positive. Opposite → big negative. Perpendicular → exactly zero.
Symbol glossary — click any symbol to hear it
a·b (dot product): scalar output
Σ (sum): over i from 1 to n (first seen in Ch. 2)
θ (angle between): the angle between a and b
⊥ (perpendicular): a·b = 0 means a ⊥ b
Interactive — touch it
a = [3.00, 1.00], b = [1.00, 2.00], a · b = 5.00, θ = 45.0°, cos θ = 0.707
In code — type it yourself
import numpy as np
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
result = a @ b              # 32
# or: np.dot(a, b)
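The "two stories" can be connected in code: compute the dot product algebraically, then solve the geometric formula for θ. A sketch in plain Python; `dot` and `angle_degrees` are names made up for this example.

```python
import math

def dot(a, b):
    """Algebraic definition: multiply matching components, then sum."""
    return sum(x * y for x, y in zip(a, b))

def angle_degrees(a, b):
    """Solve a.b = |a||b|cos(theta) for theta (both vectors must be non-zero)."""
    cos_theta = dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))
    # Clamp to [-1, 1] to guard against floating-point overshoot.
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_theta))))

print(dot([1, 2, 3], [4, 5, 6]))              # 32
print(dot([1, 0], [0, 1]))                    # 0, perpendicular vectors
print(round(angle_degrees([3, 1], [1, 2]), 1))  # 45.0
```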
Try it — 10 practice problems
Problem 1 of 10
a = [1, 2, 3], b = [4, 5, 6]. Compute a · b.
Chapter 05 · 5 of 17

Cosine Similarity — the angle judge.

Dot product divided by the product of the lengths. Output sits in [−1, 1]. The standard metric behind embedding comparison, RAG retrieval, and semantic search.

$\cos\theta = \dfrac{\mathbf{a} \cdot \mathbf{b}}{\lVert \mathbf{a} \rVert \, \lVert \mathbf{b} \rVert}$
Pronunciation
Cosine of theta equals, a dot b, divided by, norm of a times norm of b.
In plain English
Take the dot product, then divide by the product of the lengths. The result is a number between −1 and 1. The magnitudes cancel, so only the angle matters. This is why vectors from embedding models (for example OpenAI, Cohere, Voyage) are usually compared with cosine similarity: pure direction, scale-invariant. The value is undefined if either vector is all zeros.
Symbol glossary — click any symbol to hear it
cos θ (cosine of θ): output ∈ [−1, 1]
θ (angle): angle between a and b (first seen in Ch. 4)
‖·‖ (norm): the divisor, vector length
Interactive — touch it
a = [3.00, 1.00], b = [1.00, 2.00], a · b = 5.00, θ = 45.0°, cos θ = 0.707
In code — type it yourself
import numpy as np
def cosine_sim(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
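The scale-invariance claim is easy to test: stretch one vector and the similarity should not move. The sketch below uses plain Python and adds a zero-vector check, which is an extra assumption of this example rather than part of the page's function.

```python
import math

def cosine_sim(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

a, b = [3, 1], [1, 2]
stretched = [10 * x for x in a]             # same direction as a, ten times longer
print(round(cosine_sim(a, b), 3))           # 0.707
print(round(cosine_sim(stretched, b), 3))   # 0.707, scaling a changes nothing
print(cosine_sim([1, 0], [-1, 0]))          # -1.0, opposite directions
```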
Try it — 10 practice problems
Problem 1 of 10
a = [1, 0], b = [1, 0]. Compute cos(θ) between them.
Chapter 06 · 6 of 17

Matrix · Vector Product — the linear transformation.

Every matrix is a function. Multiply A times x and out comes a new vector. Rotation, scaling, projection — all a single matrix away.

$(\mathbf{A}\mathbf{x})_i = \sum_{j=1}^{n} A_{ij}\, x_j$
Pronunciation
A times x, sub i, equals, the sum from j equals one to n, of, A sub i j, times x sub j.
In plain English
A matrix A acts on a vector x and produces a new vector. The i-th entry of the output is the dot product of row i of A with x. Shapes: A is m×n, x is n×1, Ax is m×1. Every linear function from ℝⁿ to ℝᵐ can be written this way.
Symbol glossary — click any symbol to hear it
A (matrix A): shape m × n
Aᵢⱼ (entry i,j): row i, column j
x (input vector): shape n × 1
Ax (output vector): shape m × 1
In code — type it yourself
import numpy as np
A = np.array([[1, 2], [3, 4]])
x = np.array([1, 1])
y = A @ x                    # [3, 7]
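The "row i dotted with x" reading translates directly into a loop. A plain-Python sketch follows; `matvec` is a hypothetical name, and the rotation matrix is an example chosen here to show a matrix acting as a function.

```python
def matvec(A, x):
    """Entry i of the output is the dot product of row i of A with x."""
    if any(len(row) != len(x) for row in A):
        raise ValueError("each row of A must have len(x) entries")
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in A]

print(matvec([[1, 2], [3, 4]], [1, 1]))    # [3, 7]
print(matvec([[0, -1], [1, 0]], [1, 0]))   # [0, 1], a 90-degree counter-clockwise rotation
```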
Try it — 10 practice problems
Problem 1 of 10
A = [[1,0],[0,1]], x = [3, 4]. Compute (A·x)[0].
Chapter 07 · 7 of 17

Matrix Multiplication — the workhorse.

How every neural net layer moves numbers. Dot every row of A with every column of B. Composition of two linear transformations.

$(\mathbf{AB})_{ij} = \sum_{k=1}^{n} A_{ik} B_{kj}$
Pronunciation
A B sub i j, equals, the sum from k equals one to n, of, A sub i k, times, B sub k j.
In plain English
The (i, j) entry of AB is the dot product of A's i-th row and B's j-th column. Shapes: A is m×n, B is n×p, AB is m×p. The shared dimension n must match — otherwise the product is undefined. Conceptually, AB means "do B first, then A" — composition of transformations.
Symbol glossary — click any symbol to hear it
A (matrix A): left operand, m × n (first seen in Ch. 6)
B (matrix B): right operand, n × p
AB (product): shape m × p
k (summation index): runs across the shared dimension n
In code — type it yourself
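import numpy as np
A = np.array([[1, 2], [3, 4]])
B = np.array([[1, 0], [0, 1]])   # identity, so C equals A
C = A @ B               # operator form
C = np.matmul(A, B)     # same product
C = np.dot(A, B)        # same result for 2-D arrays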
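Written out by hand, the summation formula becomes three nested loops over i, j and k. The plain-Python sketch below (with a hypothetical `matmul` helper and example matrices chosen here) also shows that the order of the factors matters.

```python
def matmul(A, B):
    """(AB)[i][j] = sum over k of A[i][k] * B[k][j]; A is m x n, B is n x p."""
    n = len(B)
    if any(len(row) != n for row in A):
        raise ValueError("inner dimensions must match")
    p = len(B[0])
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(p)]
            for i in range(len(A))]

A = [[1, 2], [3, 4]]
B = [[0, 1], [1, 0]]    # swaps the two coordinates
print(matmul(A, B))     # [[2, 1], [4, 3]]
print(matmul(B, A))     # [[3, 4], [1, 2]], so AB and BA differ here
```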
Try it — 10 practice problems
Problem 1 of 10
A = [[1,2],[3,4]], B = [[1,0],[0,1]]. What is (A·B)[0][0]?
Chapter 08 · 8 of 17

Determinant (2×2) — the area scaler.

For 2×2: ad − bc. Tells you how the matrix stretches area. Zero determinant → matrix collapses space, has no inverse.

$\det\!\begin{bmatrix} a & b \\ c & d \end{bmatrix} = ad - bc$
Pronunciation
Determinant of a, b, c, d, equals, a times d, minus, b times c.
In plain English
For a 2×2 matrix, the determinant is just ad − bc. Geometrically, this is the signed area of the parallelogram spanned by the matrix's columns. |det| tells you how much the matrix scales area; the sign tells you whether it flipped orientation. det = 0 means the matrix collapses 2D onto a line — it's singular and has no inverse.
Symbol glossary — click any symbol to hear it
det (determinant): signed area scaling
ad (main diagonal product): top-left × bottom-right
bc (anti-diagonal product): top-right × bottom-left
0 (zero determinant): singular, no inverse
Interactive — touch it
A = [[2.00, -1.00], [1.00, 2.00]], det(A) = 2.00·2.00 − (−1.00)·1.00 = 5.000
▶ Watch 3Blue1Brown animate the same idea: "The determinant" (YouTube)
In code — type it yourself
import numpy as np
A = np.array([[1, 2], [3, 4]])
d = np.linalg.det(A)         # approximately -2.0 (floating-point result)
# manual 2x2:
a, b, c, dd = A.flatten()
d = a*dd - b*c
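The three geometric cases from the plain-English paragraph (scaling, orientation flip, collapse) each have a small matrix that shows them. A sketch in plain Python; the example matrices and the name `det2` are chosen here for illustration.

```python
def det2(m):
    """Determinant of a 2x2 matrix [[a, b], [c, d]]: ad - bc."""
    (a, b), (c, d) = m
    return a * d - b * c

print(det2([[2, -1], [1, 2]]))   # 5, areas grow fivefold
print(det2([[0, 1], [1, 0]]))    # -1, area preserved but orientation flipped
print(det2([[1, 2], [2, 4]]))    # 0, parallel columns collapse the plane onto a line
```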
Try it — 10 practice problems
Problem 1 of 10
A = [[1,0],[0,1]]. Compute det(A).
Chapter 09 · 9 of 17

Eigenvalues & Eigenvectors — the directions that survive.

A scalar λ and a non-zero vector v such that Av = λv. The matrix only stretches v — it doesn't rotate it. The DNA of every matrix.

$\mathbf{A}\mathbf{v} = \lambda \mathbf{v}$
Pronunciation
A times v equals, lambda times v. Lambda equals one half, trace of A, plus or minus, square root of, trace of A squared minus four det A.
In plain English
An eigenvector of a matrix A is a non-zero vector v that A merely stretches — Av = λv. The scalar λ is the eigenvalue. For any 2×2 matrix, eigenvalues solve the quadratic λ² − (trace)λ + det = 0. Trace = sum of diagonal entries. Det = ad − bc. Eigenpairs are the DNA of a matrix — PCA, PageRank, and Schrödinger's equation all reduce to "find the eigenpairs."
Symbol glossary — click any symbol to hear it
λ (eigenvalue): scaling factor along v
v (eigenvector): direction A preserves (first seen in Ch. 1)
I (identity matrix): the 'do-nothing' matrix
tr (trace): sum of diagonal entries
det (determinant): ad − bc for 2×2 (first seen in Ch. 8)
Interactive — touch it
Drag x onto a yellow line: A·x stays on it (just stretched). Eigenvalues: λ₁ = 3.000, λ₂ = 1.000
In code — type it yourself
import numpy as np
A = np.array([[2, 1], [1, 2]])
lams, vecs = np.linalg.eig(A)
# lams  -> [3., 1.]
# vecs  -> columns are eigenvectors
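For a 2×2 matrix the quadratic λ² − (trace)λ + det = 0 can be solved directly, which is what the spoken formula describes. The sketch below does that in plain Python and then checks Av = λv for one eigenpair; it handles real eigenvalues only, a simplification made for this example.

```python
import math

def eigenvalues_2x2(m):
    """Real eigenvalues of [[a, b], [c, d]] from lambda^2 - tr*lambda + det = 0."""
    (a, b), (c, d) = m
    tr, det = a + d, a * d - b * c
    disc = tr * tr - 4 * det
    if disc < 0:
        raise ValueError("eigenvalues are complex (for example, for a rotation)")
    root = math.sqrt(disc)
    return (tr + root) / 2, (tr - root) / 2

print(eigenvalues_2x2([[2, 1], [1, 2]]))   # (3.0, 1.0)

# Check Av = lambda * v for lambda = 3 with eigenvector v = [1, 1].
A, v = [[2, 1], [1, 2]], [1, 1]
Av = [sum(a * x for a, x in zip(row, v)) for row in A]
print(Av)                                   # [3, 3], which is 3 * v
```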
Try it — 10 practice problems
Problem 1 of 10
A = [[2,0],[0,3]]. Find the LARGER eigenvalue of A.
Chapter 10 · 10 of 17

Sigmoid Activation — the squasher.

Takes any real number and squashes it between 0 and 1. The classic on/off neuron.

$\sigma(x) = \dfrac{1}{1 + e^{-x}}$
Pronunciation
Sigma of x equals one over, one plus, e to the negative x.
In plain English
Read it as: σ of x means "apply the sigmoid function to x". Output is one divided by (one plus e to the negative x). Big positive x gives ~1. Big negative x gives ~0. Middle x=0 gives 0.5.
Symbol glossary — click any symbol to hear it
σ (sigma): the sigmoid function
x: input, any real number (first seen in Ch. 6)
e (Euler's number): ≈ 2.71828
Interactive — touch it
σ(x) live curve: x = 0.00, output = 0.500
In code — type it yourself
import numpy as np
def sigmoid(x):
    return 1 / (1 + np.exp(-x))
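For very negative x, e^(−x) becomes enormous, so a direct translation of the formula can overflow. The plain-Python sketch below is one common rearrangement that avoids this; it is a variant written for this example, not the page's own function.

```python
import math

def sigmoid(x):
    """Logistic sigmoid, arranged so math.exp never receives a large positive argument."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)          # x < 0, so e lies in [0, 1)
    return e / (1.0 + e)

print(sigmoid(0))              # 0.5
print(round(sigmoid(-3), 3))   # 0.047
print(sigmoid(-1000))          # 0.0, with no overflow error
```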
Try it — 10 practice problems
Problem 1 of 10
Compute $\sigma(-3)$ to 3 decimal places.
Chapter 11 · 11 of 17

ReLU Activation — the gatekeeper.

If positive, pass through. If negative, zero. Dead simple, dominant in modern networks.

$\text{ReLU}(x) = \max(0, x)$
Pronunciation
Rell-you of x equals, the maximum of zero and x.
In plain English
Read: "ReLU of x" is whichever is bigger — zero, or x itself. ReLU dominates modern networks because it is fast and the gradient is either 1 or 0 — clean backprop.
Symbol glossary — click any symbol to hear it
max: pick the bigger of two values
x: input, the pre-activation value (first seen in Ch. 6)
Interactive — touch it
ReLU(x) live curve: x = 0.00, output = 0.000
In code — type it yourself
import numpy as np
def relu(x):
    return np.maximum(0, x)

def relu_grad(x):
    return (x > 0).astype(float)
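A scalar version makes the "gradient is either 1 or 0" statement concrete, including the edge case x = 0, where the slope is not defined and a convention is needed. The sketch below picks 0 at that point, matching the `x > 0` test in the snippet above.

```python
def relu(x):
    """Pass positive inputs through unchanged, clamp everything else to zero."""
    return max(0.0, x)

def relu_grad(x):
    """Slope of ReLU: 1 for x > 0, otherwise 0 (0 is used at x == 0 by convention)."""
    return 1.0 if x > 0 else 0.0

for x in (-3.0, 0.0, 2.5):
    print(x, relu(x), relu_grad(x))
```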
Try it — 10 practice problems
Problem 1 of 10
Compute $\text{ReLU}(-3)$.
Chapter 12 · 12 of 17

Softmax — the probability picker.

Turns a vector of arbitrary numbers into a probability distribution that sums to 1.

$\text{softmax}(z_i) = \dfrac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}$
Pronunciation
Softmax of z sub i, equals, e to the z sub i, divided by, the sum from j equals one to K, of e to the z sub j.
In plain English
Read: softmax takes the i-th element of vector z, divides e to that element by the sum of e to all elements. Result: probability per class, all summing to 1. Σ means "add up everything that follows".
Symbol glossary — click any symbol to hear it
zᵢ (z sub i): the i-th logit
Σ (sigma, sum): sum across all that follow (first seen in Ch. 2)
K: number of classes
Interactive — touch it
Softmax bars: logits z1 = z2 = z3 = z4 = 0.00 give probability 0.250 for each of the four classes (sum 1.000)
In code — type it yourself
import numpy as np
def softmax(z):
    z = z - np.max(z, axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)
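The same computation for a single list of logits, in plain Python, shows why subtracting the maximum is harmless: the shift cancels between numerator and denominator. A minimal sketch with example logits chosen here.

```python
import math

def softmax(z):
    """Softmax over a list of logits, shifted by max(z) for numerical stability."""
    m = max(z)
    exps = [math.exp(zi - m) for zi in z]
    total = sum(exps)
    return [e / total for e in exps]

print(softmax([0, 0, 0, 0]))                       # [0.25, 0.25, 0.25, 0.25]
print([round(p, 3) for p in softmax([1, 2, 3])])   # [0.09, 0.245, 0.665]
```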
Try it — 10 practice problems
Problem 1 of 10
Logits z = [1, 2, 3]. What is softmax(z)[2] (the probability at index 2, counting from zero, so the last class)?
Chapter 13 · 13 of 17

Mean Squared Error — the regression scorecard.

Average squared gap between prediction and truth. Punishes big mistakes more than small ones.

$L = \dfrac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$
Pronunciation
L equals, one over n, times the sum from i equals one to n, of, y sub i minus y-hat sub i, squared.
In plain English
Read: loss L equals one over n times the sum, for every sample i from 1 to n, of (true minus predicted) squared. Squaring punishes big errors disproportionately.
Symbol glossary — click any symbol to hear it
L (loss): how wrong the model is, lower is better
n: number of samples (first seen in Ch. 1)
yᵢ (y sub i): true value for sample i
ŷᵢ (y-hat sub i): model prediction for sample i
In code — type it yourself
import numpy as np
def mse(y, y_hat):
    return np.mean((y - y_hat) ** 2)
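A worked example shows how squaring lets one large error dominate the average. The sketch below is plain Python; the second data set is made up for illustration.

```python
def mse(y, y_hat):
    """Mean of squared differences between targets y and predictions y_hat."""
    if len(y) != len(y_hat):
        raise ValueError("y and y_hat must have the same length")
    return sum((t - p) ** 2 for t, p in zip(y, y_hat)) / len(y)

print(round(mse([1, 2, 3], [1.1, 2.1, 3.1]), 6))   # 0.01
print(mse([0, 0], [1, 3]))                          # 5.0, the error of 3 contributes 9 of the 10
```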
Try it — 10 practice problems
Problem 1 of 10
y = [1, 2, 3], ŷ = [1.1, 2.1, 3.1]. Compute MSE.
Chapter 14 · 14 of 17

Cross-Entropy Loss — the classification scorecard.

Punishes confident wrong predictions hard. Default loss for softmax classifiers.

$L = -\sum_{i=1}^{K} y_i \log(\hat{y}_i)$
Pronunciation
L equals, negative the sum from i equals one to K, of, y sub i, times, log of y-hat sub i.
In plain English
Read: loss equals negative sum across all K classes of (true label) times log(predicted probability). Confident-correct gives tiny loss. Confident-wrong gives huge loss. The log of a probability is never positive, so the minus sign makes the loss non-negative.
Symbol glossary — click any symbol to hear it
y: true label (one-hot)
ŷ (y-hat): predicted probability
log: natural logarithm
K: number of classes (first seen in Ch. 12)
In code — type it yourself
import numpy as np
def cross_entropy(y, y_hat):
    return -np.sum(y * np.log(y_hat + 1e-12))
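Putting numbers on "confident-correct" versus "confident-wrong" makes the asymmetry visible. A plain-Python sketch with a three-class example invented here; the small `eps` mirrors the 1e-12 guard in the snippet above.

```python
import math

def cross_entropy(y, y_hat, eps=1e-12):
    """-sum(y_i * log(y_hat_i)); eps keeps log away from zero."""
    return -sum(t * math.log(p + eps) for t, p in zip(y, y_hat))

y = [0, 1, 0]   # one-hot label: the class at index 1 is correct
print(round(cross_entropy(y, [0.05, 0.90, 0.05]), 3))   # 0.105, confident and right
print(round(cross_entropy(y, [0.90, 0.05, 0.05]), 3))   # 2.996, confident and wrong
```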
Try it — 10 practice problems
Problem 1 of 10
Single-class case: true label y=1, predicted probability ŷ = 0.9. Compute -log(ŷ).
Chapter 15 · 15 of 17

Gradient Descent — the update rule.

Take current weights, subtract a small step in the direction that reduces loss. Repeat until smart.

$\theta_{t+1} = \theta_t - \eta \nabla_\theta L(\theta_t)$
Pronunciation
Theta at time t plus one, equals, theta at time t, minus, eta times, the gradient with respect to theta, of L of theta at time t.
In plain English
Read: next weights (θ_{t+1}) equal current weights (θ_t) minus η times the gradient of the loss with respect to the weights. The gradient points UPHILL; subtracting it moves DOWNHILL. Eta (η) is the learning rate.
Symbol glossary — click any symbol to hear it
θ (theta): model parameters (weights)
η (eta): learning rate (e.g. 0.001)
∇ (nabla, gradient): vector of partial derivatives, points uphill
L: loss function
t: training step
In code — type it yourself
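# compute_gradient is a placeholder for whatever returns ∇L(θ), e.g. autodiff
grad = compute_gradient(loss, theta)
theta = theta - eta * grad    # eta is the learning rate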
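To watch the update rule converge, it helps to pick a loss whose gradient is known in closed form. The sketch below assumes L(θ) = θ², so the gradient is 2θ and the minimum sits at 0; both the loss and the helper name `gradient_descent` are choices made for this example.

```python
def gradient_descent(grad_fn, theta, eta, steps):
    """Repeat theta <- theta - eta * grad_fn(theta) and return the final theta."""
    for _ in range(steps):
        theta = theta - eta * grad_fn(theta)
    return theta

# Assumed example loss: L(theta) = theta**2, whose gradient is 2 * theta.
grad = lambda theta: 2 * theta

print(gradient_descent(grad, theta=5.0, eta=0.1, steps=1))    # 4.0
print(gradient_descent(grad, theta=5.0, eta=0.1, steps=50))   # very close to 0
```

Each step multiplies θ by (1 − 2η) = 0.8 for this loss, which is why the value shrinks steadily toward the minimum.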
Try it — 10 practice problems
Problem 1 of 10
θ = 5, ∇L(θ) = 2, η = 0.1. Compute θ_{t+1}.
Chapter 16 · 16 of 17

Backpropagation Chain Rule — the credit assignment.

How a loss at the output gets blamed all the way back to the first weight that influenced it.

$\dfrac{\partial L}{\partial w} = \dfrac{\partial L}{\partial a} \cdot \dfrac{\partial a}{\partial z} \cdot \dfrac{\partial z}{\partial w}$
Pronunciation
Partial L with respect to w, equals, partial L with respect to a, times, partial a with respect to z, times, partial z with respect to w.
In plain English
Read: derivative of loss with respect to a weight equals product of three derivatives along the path. Chain rule from calculus, run backwards through every layer. That is backpropagation.
Symbol glossary — click any symbol to hear it
∂ (partial): partial derivative
w: a single weight
a: activation
z: pre-activation
In code — type it yourself
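# PyTorch-style training step; loss and optimizer are assumed to exist already
optimizer.zero_grad()   # clear gradients left over from the previous step
loss.backward()         # backpropagation: the chain rule fills .grad for each parameter
optimizer.step()        # update the parameters using those gradients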
Try it — 10 practice problems
Problem 1 of 10
Chain rule: ∂L/∂a = 2, ∂a/∂z = 0.5, ∂z/∂w = 3. Compute ∂L/∂w.
Chapter 17 · 17 of 17

Scaled Dot-Product Attention — the transformer's heart.

The formula that makes LLMs work. Each token decides how much to listen to every other token.

$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\dfrac{Q K^T}{\sqrt{d_k}}\right) V$
Pronunciation
Attention of Q, K, V, equals, softmax of, Q times K transpose, divided by, square root of d sub k, all times V.
In plain English
Read: Q · K^T gets similarity scores. Divide by √d_k for stability. Softmax gives weights summing to 1. Multiply by V. Each token gets a weighted blend of all tokens' values.
Symbol glossary — click any symbol to hear it
Q (queries): what each token is looking for
K (keys): what each token offers
V (values): the payload each token carries
Kᵀ (K transpose): K with rows and columns swapped
dₖ (d sub k): dimension of key vectors
In code — type it yourself
import torch
import torch.nn.functional as F
def attention(Q, K, V):
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5
    weights = F.softmax(scores, dim=-1)
    return weights @ V
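The four steps (scores, scaling, softmax, weighted blend) can be traced on a tiny input without any tensor library. The plain-Python sketch below uses a made-up two-token example and, unlike the function above, also returns the attention weights so they can be inspected.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention for lists of row vectors (one row per token)."""
    d_k = len(K[0])
    out, all_weights = [], []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)               # one weight per key, summing to 1
        all_weights.append(weights)
        out.append([sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))])
    return out, all_weights

# Toy input, invented for illustration: 2 tokens, d_k = 2.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[10.0, 0.0], [0.0, 10.0]]
out, weights = attention(Q, K, V)
print(weights[0])   # the first token puts more weight on key 0 than on key 1
print(out[0])       # so its output leans toward the first value vector
```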
Try it — 10 practice problems
Problem 1 of 10
In scaled dot-product attention, we divide by √d_k. If d_k = 64, what is the scaling factor √d_k?
