Drivia Formula Lab v4 · Read · Hear · Speak · Type

Chapter 01 · 1 of 17

Vectors — arrows of numbers.

A vector is an ordered list of numbers. Stack them in a column, add other vectors to them, scale them by a single number. Foundation of everything.

▶ Watch first — go deeper before you practice

3Blue1Brown

Vectors — what even are they?

▶ youtube.com/watch?v=fNk_zzaMoSs

3Blue1Brown

Linear combinations, span, basis

▶ youtube.com/watch?v=k7RM-ot2NWY

\mathbf{v} = \begin{bmatrix} v_1 \\ v_2 \\ \vdots \\ v_n \end{bmatrix}, \quad k\mathbf{v} + \mathbf{w} = \begin{bmatrix} k v_1 + w_1 \\ \vdots \\ k v_n + w_n \end{bmatrix}

Pronunciation

Vector v equals, open bracket, v sub one, v sub two, dot dot dot, v sub n, close bracket. K vector v plus vector w, equals, component-wise sum.

In plain English

A vector is a list of numbers, written as a column. Add two vectors of the same length by adding matching entries. Multiply a vector by a single number (a scalar) by multiplying every entry. Combining the two operations, k·v + w, gives a linear combination, the idea that span, basis, and matrix multiplication are all built on.

Symbol glossary

v · vector · an ordered list of n numbers
vᵢ · i-th component · the entry at position i
n · n · dimension (number of components)
k · scalar · a single number (not a vector)
+ · vector addition · component-wise sum

Interactive example

k = 2.00
v = [2.00, 1.00]
w = [1.00, 2.00]
k·v + w = [5.00, 4.00]

In code — type it yourself

import numpy as np
v = np.array([1, 2, 3])
w = np.array([4, 5, 6])
combo = 2 * v + w        # [6, 9, 12]
print(combo.shape)        # (3,)
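
As a rough sketch of the component-wise rule in the formula, the snippet below computes k·v + w one entry at a time and compares it with NumPy's vectorized result. The helper name `linear_combination` is an illustrative assumption, not part of the lesson; the numbers are the ones shown in the interactive panel.

```python
import numpy as np

def linear_combination(k, v, w):
    """Compute k*v + w one component at a time (hypothetical helper for illustration)."""
    if len(v) != len(w):
        raise ValueError("v and w must have the same number of components")
    return [k * vi + wi for vi, wi in zip(v, w)]

k = 2.0
v = [2.0, 1.0]
w = [1.0, 2.0]

by_hand = linear_combination(k, v, w)        # entry i is k*v[i] + w[i]
vectorized = k * np.array(v) + np.array(w)   # same rule applied to all entries at once
print(by_hand, vectorized)
```

Both routes should agree; the loop only spells out what the array expression does implicitly.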

Try it: practice problem 1 of 10

v = [1, 2], w = [3, 4], k = 2. Compute (k·v + w)[0].

Chapter 02 · 2 of 17

Vector Norm (Length) — how long is the arrow.

Pythagoras for n dimensions. Square each component, sum, take the square root. The geometric length of a vector.

▶ Watch first — go deeper before you practice

3Blue1Brown

Vectors — what even are they?

▶ youtube.com/watch?v=fNk_zzaMoSs

3Blue1Brown

Dot products and duality

▶ youtube.com/watch?v=LyGKycYT2v0

\lVert \mathbf{v} \rVert = \sqrt{v_1^2 + v_2^2 + \cdots + v_n^2} = \sqrt{\sum_{i=1}^{n} v_i^2}

Pronunciation

Norm of v equals, square root of, v sub one squared, plus v sub two squared, plus dot dot dot, plus v sub n squared.

In plain English

The norm (length, magnitude) of a vector is just Pythagoras in n dimensions. For v = [3, 4]: √(9 + 16) = √25 = 5. Sign doesn't matter — we square first. Output is always ≥ 0, and it's zero only when every component is zero.

Symbol glossary

‖v‖ · L2 norm · geometric length of v
√ · square root · non-negative root
Σ · sum · add everything that follows
vᵢ² · squared component · always non-negative

Interactive example

v = [3.00, 4.00]
‖v‖ = √(9.00 + 16.00) = 5.0000

In code — type it yourself

import numpy as np
v = np.array([3, 4])
length = np.linalg.norm(v)   # 5.0
# manual: np.sqrt((v ** 2).sum())
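
To make the "square, sum, square root" recipe concrete, here is a small sketch that also checks the two claims from the plain-English summary: the sign of a component does not matter, and only the zero vector has length zero. The function name `l2_norm` is an assumption chosen for this example.

```python
import numpy as np

def l2_norm(v):
    """Square each component, sum, then take the square root."""
    return float(np.sqrt(np.sum(np.square(v))))

v = np.array([3.0, 4.0])
flipped = np.array([-3.0, 4.0])   # same magnitudes, one sign flipped
zero = np.zeros(3)

print(l2_norm(v))        # 5.0
print(l2_norm(flipped))  # 5.0, squaring removes the sign
print(l2_norm(zero))     # 0.0, only the zero vector has length zero
```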

Try it: practice problem 1 of 10

v = [3, 4]. Compute $\lVert v \rVert$ (the 2-norm / length).

Chapter 03 · 3 of 17

Unit Vector (Normalization) — direction, no length.

Divide every component of v by its own length. The result has length 1 and points the same way. Strips magnitude, keeps direction.

▶ Watch first — go deeper before you practice

3Blue1Brown

Vectors — what even are they?

▶ youtube.com/watch?v=fNk_zzaMoSs

3Blue1Brown

Dot products and duality

▶ youtube.com/watch?v=LyGKycYT2v0

\hat{\mathbf{v}} = \dfrac{\mathbf{v}}{\lVert \mathbf{v} \rVert}

Pronunciation

v hat equals, v, divided by, norm of v.

In plain English

A unit vector is a vector with length 1. To make one, divide every component by the vector's own length. The hat, v̂, is the standard notation for "normalized." The zero vector has length 0, so it cannot be normalized. Unit vectors strip the scale and let you compare pure directions, which is exactly what cosine similarity will do next.

Symbol glossary

v̂ · unit vector · v scaled to length 1
‖v‖ · length of v · the divisor (first seen in Ch.2)
1 · target length · ‖v̂‖ = 1 by construction

In code — type it yourself

import numpy as np
v = np.array([3, 4])
v_hat = v / np.linalg.norm(v)   # [0.6, 0.8]
np.linalg.norm(v_hat)            # 1.0
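
The one-liner above divides by the norm without checking it. A slightly more defensive sketch follows; the function name `normalize` and the tolerance `eps` are assumptions made for illustration, not part of the lesson.

```python
import numpy as np

def normalize(v, eps=1e-12):
    """Return v / ||v||; `eps` is an assumed tolerance for spotting a (near-)zero vector."""
    v = np.asarray(v, dtype=float)
    length = np.linalg.norm(v)
    if length < eps:
        raise ValueError("cannot normalize a vector of length zero")
    return v / length

v_hat = normalize([3, 4])
same_direction = normalize([30, 40])   # ten times longer, same direction
print(v_hat, same_direction)
```

Two vectors that differ only in scale should normalize to the same unit vector, which is the sense in which normalization keeps direction and drops magnitude.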

Try it: practice problem 1 of 10

v = [3, 4]. Compute (v / ‖v‖)[0] to 3 decimal places.

Chapter 04 · 4 of 17

Dot Product — similarity scalar.

Multiply matching components, sum them. Output is one number. Foundation of attention, projections, cosine similarity.

▶ Watch first — go deeper before you practice

3Blue1Brown

Dot products and duality

▶ youtube.com/watch?v=LyGKycYT2v0

3Blue1Brown

Vectors — what even are they?

▶ youtube.com/watch?v=fNk_zzaMoSs

\mathbf{a} \cdot \mathbf{b} = \sum_{i=1}^{n} a_i b_i = \lVert \mathbf{a} \rVert \lVert \mathbf{b} \rVert \cos\theta

Pronunciation

a dot b equals, the sum from i equals one to n, of a sub i times b sub i. Same as, norm of a, times norm of b, times cosine theta.

In plain English

The dot product has two equivalent definitions. Algebraically: multiply matching components, sum them. Geometrically: ‖a‖ · ‖b‖ · cos θ, where θ is the angle between a and b. Same number, two stories. Output is a SCALAR. Aligned vectors → positive. Opposite → negative. Perpendicular → exactly zero.

Symbol glossary

a·b · dot product · scalar output
Σ · sum · over i from 1 to n (first seen in Ch.2)
θ · angle between · the angle between a and b
⊥ · perpendicular · a·b = 0 means a ⊥ b

Interactive example

a = [3.00, 1.00]
b = [1.00, 2.00]
a · b = 5.00
θ = 45.0°
cos θ = 0.707

In code — type it yourself

import numpy as np
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
result = a @ b              # 32
# or: np.dot(a, b)
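
One way to convince yourself that the algebraic and geometric definitions give the same number is to compute the angle independently of the dot product. The sketch below does that for 2-D vectors with `np.arctan2`; the vectors are the ones from the interactive panel, and the variable names are illustrative.

```python
import numpy as np

a = np.array([3.0, 1.0])
b = np.array([1.0, 2.0])

# Algebraic definition: multiply matching components, then sum.
algebraic = float(np.sum(a * b))

# Geometric definition: measure each vector's angle from the x-axis,
# take the difference as theta, then use ||a|| * ||b|| * cos(theta).
theta = np.arctan2(b[1], b[0]) - np.arctan2(a[1], a[0])
geometric = float(np.linalg.norm(a) * np.linalg.norm(b) * np.cos(theta))

print(algebraic, geometric, np.degrees(theta))
```

The two results should match up to floating-point rounding, with θ coming out at 45 degrees for this pair.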

Try it: practice problem 1 of 10

a = [1, 2, 3], b = [4, 5, 6]. Compute a · b.

Chapter 05 · 5 of 17

Cosine Similarity — the angle judge.

Dot product divided by the lengths. Output sits in [−1, 1]. A standard metric for comparing embeddings in RAG retrieval and semantic search.

▶ Watch first — go deeper before you practice

3Blue1Brown

Dot products and duality

▶ youtube.com/watch?v=LyGKycYT2v0

Steve Brunton

SVD — singular value decomposition

▶ youtube.com/watch?v=gXbThCXjZFM

\cos\theta = \dfrac{\mathbf{a} \cdot \mathbf{b}}{\lVert \mathbf{a} \rVert \, \lVert \mathbf{b} \rVert}

Pronunciation

Cosine of theta equals, a dot b, divided by, norm of a times norm of b.

In plain English

Take the dot product, then divide by the product of the lengths. The result is a number between −1 and 1. The magnitudes cancel, so only the angle matters. This is why embedding vectors (for example from OpenAI, Cohere, or Voyage models) are commonly compared with cosine similarity: pure direction, scale-invariant. The ratio is undefined when either vector has length zero.

Symbol glossary

cos θ · cosine of θ · output ∈ [−1, 1]
θ · angle · angle between a and b (first seen in Ch.4)
‖·‖ · norm · the divisor, vector length

In code — type it yourself

import numpy as np
def cosine_sim(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
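
The claim that magnitudes cancel can be checked directly. This sketch reuses the `cosine_sim` function from above and compares a pair of vectors before and after rescaling; the specific vectors and scale factors are arbitrary choices for illustration.

```python
import numpy as np

def cosine_sim(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([3.0, 1.0])
b = np.array([1.0, 2.0])

base = cosine_sim(a, b)
rescaled = cosine_sim(10 * a, 0.5 * b)   # positive rescaling leaves the angle unchanged
opposite = cosine_sim(a, -a)             # pointing the opposite way
perpendicular = cosine_sim(np.array([1.0, 0.0]), np.array([0.0, 1.0]))
print(base, rescaled, opposite, perpendicular)
```

The first two values should be equal, the third should sit at the bottom of the range, and the fourth should be zero.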

Try it: practice problem 1 of 10

a = [1, 0], b = [1, 0]. Compute cos(θ) between them.

Chapter 06 · 6 of 17

Matrix · Vector Product — the linear transformation.

Every matrix is a function. Multiply A times x and out comes a new vector. Rotation, scaling, projection — all a single matrix away.

▶ Watch first — go deeper before you practice

3Blue1Brown

Linear transformations and matrices

▶ youtube.com/watch?v=kYB8IZa5AuE

3Blue1Brown

Linear combinations, span, basis

▶ youtube.com/watch?v=k7RM-ot2NWY

(\mathbf{A}\mathbf{x})_i = \sum_{j=1}^{n} A_{ij}\, x_j

Pronunciation

A times x, sub i, equals, the sum from j equals one to n, of, A sub i j, times x sub j.

In plain English

A matrix A acts on a vector x and produces a new vector. The i-th entry of the output is the dot product of row i of A with x. Shapes: A is m×n, x is n×1, Ax is m×1. Every linear function from ℝⁿ to ℝᵐ can be written this way.

Symbol glossary

A · matrix A · shape m × n
Aᵢⱼ · entry i,j · row i, column j
x · input vector · shape n × 1
Ax · output vector · shape m × 1

Interactive example

A = [[1.00, -0.50], [0.50, 1.00]]
x = [2.00, 1.00]
A·x = [1.50, 2.00]

In code — type it yourself

import numpy as np
A = np.array([[1, 2], [3, 4]])
x = np.array([1, 1])
y = A @ x                    # [3, 7]
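
The statement that entry i of the output is the dot product of row i with x can be written out as a loop. The following is a sketch with a hypothetical helper `matvec_by_rows`, using the matrix and vector from the interactive panel, compared against NumPy's `@` operator.

```python
import numpy as np

def matvec_by_rows(A, x):
    """Entry i of the output is the dot product of row i of A with x."""
    m, n = A.shape
    if x.shape != (n,):
        raise ValueError(f"shape mismatch: A is {m}x{n} but x has shape {x.shape}")
    return np.array([np.dot(A[i], x) for i in range(m)])

A = np.array([[1.0, -0.5],
              [0.5,  1.0]])
x = np.array([2.0, 1.0])

by_rows = matvec_by_rows(A, x)
built_in = A @ x
print(by_rows, built_in)
```

The explicit shape check mirrors the rule that the column count of A must equal the length of x.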

Try it: practice problem 1 of 10

A = [[1,0],[0,1]], x = [3, 4]. Compute (A·x)[0].

Chapter 07 · 7 of 17

Matrix Multiplication — the workhorse.

How every neural net layer moves numbers. Dot every row of A with every column of B. Composition of two linear transformations.

▶ Watch first — go deeper before you practice

3Blue1Brown

Matrix multiplication as composition

▶ youtube.com/watch?v=XkY2DOUCWMU

3Blue1Brown

Linear transformations and matrices

▶ youtube.com/watch?v=kYB8IZa5AuE

(\mathbf{AB})_{ij} = \sum_{k=1}^{n} A_{ik} B_{kj}

Pronunciation

A B sub i j, equals, the sum from k equals one to n, of, A sub i k, times, B sub k j.

In plain English

The (i, j) entry of AB is the dot product of A's i-th row and B's j-th column. Shapes: A is m×n, B is n×p, AB is m×p. The shared dimension n must match — otherwise the product is undefined. Conceptually, AB means "do B first, then A" — composition of transformations.

Symbol glossary

A · matrix A · left operand (m × n) (first seen in Ch.6)
B · matrix B · right operand (n × p)
AB · product · shape m × p
k · summation index · runs across the shared dimension (unrelated to the scalar k of Ch.1)

In code — type it yourself

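import numpy as np
A = np.array([[1, 2], [3, 4]])
B = np.array([[0, 1], [1, 0]])
C = A @ B                    # [[2, 1], [4, 3]]
# equivalent for 2-D arrays: np.matmul(A, B) or np.dot(A, B)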
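
The "do B first, then A" reading can be tested numerically. In this sketch the two matrices (a 90-degree rotation and a horizontal stretch) are illustrative choices, not taken from the lesson.

```python
import numpy as np

A = np.array([[0.0, -1.0],
              [1.0,  0.0]])   # rotate 90 degrees counter-clockwise
B = np.array([[2.0, 0.0],
              [0.0, 1.0]])    # stretch the x-axis by 2
x = np.array([1.0, 1.0])

one_step = (A @ B) @ x        # apply the combined matrix once
two_steps = A @ (B @ x)       # apply B first, then A
swapped = (B @ A) @ x         # the other order is a different transformation
print(one_step, two_steps, swapped)
```

The first two results should coincide, while the swapped order generally gives a different vector: matrix multiplication is associative but not commutative.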

Try it: practice problem 1 of 10

A = [[1,2],[3,4]], B = [[1,0],[0,1]]. What is (A·B)[0][0]?

Chapter 08 · 8 of 17

Determinant (2×2) — the area scaler.

For 2×2: ad − bc. Tells you how the matrix stretches area. Zero determinant → matrix collapses space, has no inverse.

▶ Watch first — go deeper before you practice

3Blue1Brown

The determinant

▶ youtube.com/watch?v=Ip3X9LOh2dk

3Blue1Brown

Inverse matrices, column space, null space

▶ youtube.com/watch?v=uQhTuRlWMxw

\det\!\begin{bmatrix} a & b \\ c & d \end{bmatrix} = ad - bc

Pronunciation

Determinant of a, b, c, d, equals, a times d, minus, b times c.

In plain English

For a 2×2 matrix, the determinant is just ad − bc. Geometrically, this is the signed area of the parallelogram spanned by the matrix's columns. |det| tells you how much the matrix scales area; the sign tells you whether it flipped orientation. det = 0 means the matrix collapses 2D onto a line — it's singular and has no inverse.

Symbol glossary

det · determinant · signed area scaling
ad · main diagonal product · top-left × bottom-right
bc · anti-diagonal product · top-right × bottom-left
0 · zero determinant · singular, no inverse

Interactive example

A = [[2.00, -1.00], [1.00, 2.00]]
det(A) = 2.00·2.00 − (−1.00)·1.00 = 5.000

In code — type it yourself

import numpy as np
A = np.array([[1, 2], [3, 4]])
d = np.linalg.det(A)         # about -2.0 (floating-point result, not exact)
# manual 2x2:
a, b, c, dd = A.flatten()
d = a*dd - b*c
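
The three geometric cases from the summary (area scaling, collapse, orientation flip) can each be shown with one small matrix. The helper `det2` and the example matrices below are assumptions chosen for illustration.

```python
import numpy as np

def det2(M):
    """ad - bc for a 2x2 matrix."""
    (a, b), (c, d) = M
    return a * d - b * c

stretch_rotate = np.array([[2.0, -1.0], [1.0, 2.0]])   # columns span area 5
collapse = np.array([[1.0, 2.0], [2.0, 4.0]])          # second row is twice the first
mirror = np.array([[0.0, 1.0], [1.0, 0.0]])            # swaps the axes

print(det2(stretch_rotate))  # 5.0
print(det2(collapse))        # 0.0, so no inverse exists
print(det2(mirror))          # -1.0, area kept but orientation flipped
```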

Try it: practice problem 1 of 10

A = [[1,0],[0,1]]. Compute det(A).

Chapter 09 · 9 of 17

Eigenvalues & Eigenvectors — the directions that survive.

A scalar λ and a non-zero vector v such that Av = λv. The matrix only scales v (stretching, shrinking, or flipping it); it never turns v off its own line. The DNA of every matrix.

▶ Watch first — go deeper before you practice

3Blue1Brown

Eigenvectors and eigenvalues

▶ youtube.com/watch?v=PFDu9oVAE-g

Steve Brunton

SVD — singular value decomposition

▶ youtube.com/watch?v=gXbThCXjZFM

\mathbf{A}\mathbf{v} = \lambda \mathbf{v}

Pronunciation

A times v equals, lambda times v. For a two by two matrix, lambda equals one half times the quantity, trace of A, plus or minus, the square root of, trace of A all squared, minus four times det A.

In plain English

An eigenvector of a matrix A is a non-zero vector v that A merely scales: Av = λv. The scalar λ is the eigenvalue. For any 2×2 matrix, eigenvalues solve the quadratic λ² − (trace)λ + det = 0, so λ = ½ · (trace ± √(trace² − 4·det)). Trace = sum of diagonal entries. Det = ad − bc. When trace² − 4·det is negative the eigenvalues are complex and no real direction is preserved. Eigenpairs are the DNA of a matrix: PCA, PageRank, and the time-independent Schrödinger equation all reduce to "find the eigenpairs."

Symbol glossary

λ · eigenvalue · scaling factor along v
v · eigenvector · direction A preserves (first seen in Ch.1)
I · identity matrix · the 'do-nothing' matrix
tr · trace · sum of diagonal entries
det · determinant · ad − bc for 2×2 (first seen in Ch.8)

Interactive example

A = [[2.00, 1.00], [1.00, 2.00]]
eigenvalues: λ₁ = 3.000, λ₂ = 1.000
A vector x placed on an eigenvector line stays on that line under A·x; it is only stretched.

In code — type it yourself

import numpy as np
A = np.array([[2, 1], [1, 2]])
lams, vecs = np.linalg.eig(A)
# lams  -> [3., 1.]
# vecs  -> columns are eigenvectors
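
For the 2×2 case, the trace/determinant quadratic can be solved by hand and checked against `np.linalg.eig`. This is a sketch only: the helper `eig2x2` is a hypothetical name, and it handles just the real-eigenvalue case.

```python
import numpy as np

def eig2x2(M):
    """Real eigenvalues of a 2x2 matrix from its trace and determinant (larger first)."""
    (a, b), (c, d) = M
    trace = a + d
    det = a * d - b * c
    disc = trace ** 2 - 4 * det
    if disc < 0:
        raise ValueError("complex eigenvalues: no real direction is preserved")
    root = np.sqrt(disc)
    return (trace + root) / 2, (trace - root) / 2

A = np.array([[2.0, 1.0], [1.0, 2.0]])
lam1, lam2 = eig2x2(A)

# Check the defining property A v = lambda v on NumPy's eigenpairs.
lams, vecs = np.linalg.eig(A)
residuals = [np.linalg.norm(A @ vecs[:, i] - lams[i] * vecs[:, i]) for i in range(2)]
print(lam1, lam2, residuals)
```

The residuals measure how far each returned pair is from satisfying Av = λv; they should be at the level of floating-point noise.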

Try it: practice problem 1 of 10

A = [[2,0],[0,3]]. Find the LARGER eigenvalue of A.

Chapter 10 · 10 of 17

Sigmoid Activation — the squasher.

Takes any real number and squashes it between 0 and 1. The classic on/off neuron.

▶ Watch first — go deeper before you practice

3Blue1Brown

But what is a neural network?

▶ youtube.com/watch?v=aircAruvnKk

StatQuest

Neural Networks, clearly explained

▶ youtube.com/watch?v=CqOfi41LfDw

\sigma(x) = \dfrac{1}{1 + e^{-x}}

Pronunciation

Sigma of x equals one over, one plus, e to the negative x.

In plain English

Read it as: σ of x means "apply the sigmoid function to x". Output is one divided by (one plus e to the negative x). Big positive x gives ~1. Big negative x gives ~0. Middle x=0 gives 0.5.

Symbol glossary

σ · sigma · the sigmoid function
x · x · input, any real number (first seen in Ch.6)
e · Euler's number · ≈ 2.71828

In code — type it yourself

import numpy as np
def sigmoid(x):
    return 1 / (1 + np.exp(-x))
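
To see the squashing behaviour described in the summary, the sketch below evaluates the same `sigmoid` on a handful of inputs and checks one property that follows from the formula: σ(−x) = 1 − σ(x). The sample points are arbitrary.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

xs = np.array([-10.0, -3.0, 0.0, 3.0, 10.0])
ys = sigmoid(xs)
print(np.round(ys, 3))   # values rise from near 0, through 0.5 at x = 0, to near 1

# Symmetry around the midpoint: sigmoid(-x) = 1 - sigmoid(x).
symmetric = np.allclose(sigmoid(-xs), 1 - sigmoid(xs))
print(symmetric)
```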

Try it: practice problem 1 of 10

Compute $\sigma(-3)$ to 3 decimal places.

Chapter 11 · 11 of 17

ReLU Activation — the gatekeeper.

If positive, pass through. If negative, zero. Dead simple, dominant in modern networks.

▶ Watch first — go deeper before you practice

Andrej Karpathy

Spelled-out intro to NN & backprop

▶ youtube.com/watch?v=VMj-3S1tku0

Andrej Karpathy

makemore part 2 — MLP

▶ youtube.com/watch?v=TCH_1BHY58I

\text{ReLU}(x) = \max(0, x)

Pronunciation

Rell-you of x equals, the maximum of zero and x.

In plain English

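Read: "ReLU of x" is whichever is bigger: zero, or x itself. ReLU dominates modern networks because it is fast and the gradient is either 1 (for x > 0) or 0 (for x < 0), which keeps backprop clean. At exactly x = 0 the derivative is undefined; implementations typically use 0 there.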

Symbol glossary

max · max · pick the bigger of two values
x · x · input, pre-activation value (first seen in Ch.6)

In code — type it yourself

import numpy as np

def relu(x):
    return np.maximum(0, x)

def relu_grad(x):
    # 1.0 where x > 0, else 0.0 (the value at x == 0 is a convention)
    return (np.asarray(x) > 0).astype(float)
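
A short sketch of the gatekeeper behaviour during backprop: the forward pass zeroes negative inputs, and the backward pass lets an incoming gradient through only where the input was positive. The `upstream` values are made-up numbers standing in for a gradient from a later layer.

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)

def relu_grad(x):
    return (np.asarray(x) > 0).astype(float)

pre_activations = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
upstream = np.array([10.0, 10.0, 10.0, 10.0, 10.0])   # hypothetical gradient from the next layer

outputs = relu(pre_activations)
passed_back = upstream * relu_grad(pre_activations)    # the gate blocks gradient wherever x <= 0
print(outputs, passed_back)
```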

Try it: practice problem 1 of 10

Compute $\text{ReLU}(-3)$.

Chapter 12 · 12 of 17

Softmax — the probability picker.

Turns a vector of arbitrary numbers into a probability distribution that sums to 1.

▶ Watch first — go deeper before you practice

StatQuest

Neural networks part 5 — ArgMax/SoftMax

▶ youtube.com/watch?v=KpKog-L9veg

Andrej Karpathy

Building makemore — bigram

▶ youtube.com/watch?v=PaCmpygFfXo

\text{softmax}(z_i) = \dfrac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}

Pronunciation

Softmax of z sub i, equals, e to the z sub i, divided by, the sum from j equals one to K, of e to the z sub j.

In plain English

Read: softmax takes the i-th element of vector z, divides e to that element by the sum of e to all elements. Result: probability per class, all summing to 1. Σ means "add up everything that follows".

Symbol glossary

zᵢ · z sub i · the i-th logit
Σ · sigma (sum) · sum across all that follow (first seen in Ch.2)
K · K · number of classes

Interactive example

logits: z1 = z2 = z3 = z4 = 0.00
probabilities: 0.250 for each of classes 1 to 4, summing to 1.000

In code — type it yourself

import numpy as np

def softmax(z):
    z = z - np.max(z, axis=-1, keepdims=True)   # shift for numerical stability; output unchanged
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)
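
The first line of `softmax` subtracts the maximum logit, which looks like it changes the input. The sketch below suggests why that is safe: adding any constant to every logit leaves the probabilities unchanged. The logits used are illustrative.

```python
import numpy as np

def softmax(z):
    z = z - np.max(z, axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

z = np.array([1.0, 2.0, 3.0])
p = softmax(z)
shifted = softmax(z + 1000.0)     # adding a constant to every logit changes nothing
uniform = softmax(np.zeros(4))    # equal logits give equal probabilities
print(np.round(p, 3), p.sum())
```

Without the shift, `np.exp(1003.0)` would overflow; with it, the largest exponent is always zero.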

Try it: practice problem 1 of 10

Logits z = [1, 2, 3]. What is softmax(z)[2], i.e. the probability at zero-based index 2 (the third class, with logit 3)?

Chapter 13 · 13 of 17

Mean Squared Error — the regression scorecard.

Average squared gap between prediction and truth. Punishes big mistakes more than small ones.

▶ Watch first — go deeper before you practice

StatQuest

Linear regression

▶ youtube.com/watch?v=nk2CQITm_eo

Steve Brunton

Gradient descent — calculus + code

▶ youtube.com/watch?v=f6kdp27TYZs

L = \dfrac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2

Pronunciation

L equals, one over n, times the sum from i equals one to n, of, y sub i minus y-hat sub i, squared.

In plain English

Read: loss L equals one over n times the sum, for every sample i from 1 to n, of (true minus predicted) squared. Squaring punishes big errors disproportionately.

Symbol glossary

L · L (loss) · how wrong the model is; lower is better
n · n · number of samples (first seen in Ch.1)
yᵢ · y sub i · true value for sample i
ŷᵢ · y-hat sub i · model prediction for sample i

In code — type it yourself

import numpy as np
def mse(y, y_hat):
    return np.mean((np.asarray(y) - np.asarray(y_hat)) ** 2)
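
To illustrate "punishes big mistakes more than small ones", this sketch compares two sets of predictions with the same total absolute error: four misses of size 1 versus a single miss of size 4. The data is invented for the comparison.

```python
import numpy as np

def mse(y, y_hat):
    return float(np.mean((np.asarray(y) - np.asarray(y_hat)) ** 2))

y = np.zeros(4)
spread_out = np.array([1.0, 1.0, 1.0, 1.0])     # four errors of size 1
one_big_miss = np.array([0.0, 0.0, 0.0, 4.0])   # one error of size 4, same total absolute error

print(mse(y, spread_out))    # 1.0
print(mse(y, one_big_miss))  # 4.0
```

Squaring turns the single large miss into a loss four times higher, even though the summed absolute error is identical.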

Try it: practice problem 1 of 10

y = [1, 2, 3], ŷ = [1.1, 2.1, 3.1]. Compute MSE.

Chapter 14 · 14 of 17

Cross-Entropy Loss — the classification scorecard.

Punishes confident wrong predictions hard. Default loss for softmax classifiers.

▶ Watch first — go deeper before you practice

StatQuest

Logistic regression

▶ youtube.com/watch?v=yIYKR4sgzI8

Andrej Karpathy

makemore part 4 — backprop by hand

▶ youtube.com/watch?v=q8SA3rM6ckI

L = -\sum_{i=1}^{K} y_i \log(\hat{y}_i)

Pronunciation

L equals, negative the sum from i equals one to K, of, y sub i, times, log of y-hat sub i.

In plain English

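Read: loss equals negative sum across all K classes of (true label) times log(predicted probability). Confident-correct gives tiny loss. Confident-wrong gives huge loss. The log of a probability is never positive, so the minus sign makes the loss non-negative. With a one-hot label only the true class contributes, leaving −log(ŷ) for that class.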

Symbol glossary

y · y · true label (one-hot)
ŷ · y-hat · predicted probability
log · log · natural logarithm
K · K · number of classes (first seen in Ch.12)

In code — type it yourself

import numpy as np
def cross_entropy(y, y_hat):
    # 1e-12 guards against log(0)
    return -np.sum(y * np.log(y_hat + 1e-12))
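
The contrast between confident-correct and confident-wrong predictions can be made concrete with three hand-picked probability vectors for the same one-hot label. All numbers below are illustrative.

```python
import numpy as np

def cross_entropy(y, y_hat):
    return float(-np.sum(y * np.log(y_hat + 1e-12)))

y = np.array([0.0, 1.0, 0.0])                    # one-hot: class 1 is correct
confident_right = np.array([0.05, 0.90, 0.05])
unsure = np.array([0.30, 0.40, 0.30])
confident_wrong = np.array([0.90, 0.05, 0.05])

losses = [cross_entropy(y, p) for p in (confident_right, unsure, confident_wrong)]
print(np.round(losses, 3))
```

The loss should grow from the first case to the last, with the confident-wrong prediction penalized most heavily.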

Try it: practice problem 1 of 10

Single-class case: true label y=1, predicted probability ŷ = 0.9. Compute -log(ŷ).

Chapter 15 · 15 of 17

Gradient Descent — the update rule.

Take current weights, subtract a small step in the direction that reduces loss. Repeat until smart.

▶ Watch first — go deeper before you practice

3Blue1Brown

Gradient descent, how networks learn

▶ youtube.com/watch?v=IHZwWFHWa-w

Steve Brunton

Gradient descent — calculus + code

▶ youtube.com/watch?v=f6kdp27TYZs

3Blue1Brown

Derivatives, fluid intuition

▶ youtube.com/watch?v=S0_qX4VJhMQ

\theta_{t+1} = \theta_t - \eta \nabla_\theta L(\theta_t)

Pronunciation

Theta at time t plus one, equals, theta at time t, minus, eta times, the gradient with respect to theta, of L of theta at time t.

In plain English

Read: next weights (θ_{t+1}) equal current weights (θ_t) minus η times the gradient of loss with respect to weights. Gradient points UPHILL; subtracting moves DOWNHILL. Eta is the learning rate: too large and the steps overshoot, too small and progress is slow.

Symbol glossary

θ · theta · model parameters (weights); in Ch.4 the same letter meant an angle
η · eta · learning rate (e.g. 0.001)
∇ · nabla (gradient) · vector of partial derivatives, points uphill
L · L · loss function
t · t · training step

In code — type it yourself

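theta, eta = 5.0, 0.1          # example starting weight and learning rate
grad = 2 * theta               # gradient of the example loss L(theta) = theta**2
theta = theta - eta * grad     # one update step: 5.0 - 0.1 * 10.0 = 4.0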
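
Repeating the update rule in a loop gives the full procedure. The sketch below runs it on an assumed one-dimensional loss, L(θ) = (θ − 3)², whose gradient is known in closed form; the function names and the choice of loss are illustrative, not part of the lesson.

```python
def loss(theta):
    return (theta - 3.0) ** 2          # assumed example loss with its minimum at theta = 3

def grad(theta):
    return 2.0 * (theta - 3.0)         # derivative of the loss above

def gradient_descent(theta, eta, steps):
    history = [loss(theta)]
    for _ in range(steps):
        theta = theta - eta * grad(theta)   # the update rule from the formula
        history.append(loss(theta))
    return theta, history

theta_final, history = gradient_descent(theta=0.0, eta=0.1, steps=50)
_, diverging = gradient_descent(theta=0.0, eta=1.1, steps=10)   # learning rate too large
print(theta_final, history[0], history[-1], diverging[-1])
```

With η = 0.1 the loss shrinks at every step and θ approaches 3; with η = 1.1 each step overshoots the minimum by more than the last, so the loss grows instead.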

Try it: practice problem 1 of 10

θ = 5, ∇L(θ) = 2, η = 0.1. Compute θ_{t+1}.

Chapter 16 · 16 of 17

Backpropagation Chain Rule — the credit assignment.

How a loss at the output gets blamed all the way back to the first weight that influenced it.

▶ Watch first — go deeper before you practice

3Blue1Brown

What is backpropagation really doing?

▶ youtube.com/watch?v=Ilg3gGewQ5U

3Blue1Brown

Backpropagation calculus

▶ youtube.com/watch?v=tIeHLnjs5U8

StatQuest

Backpropagation — clearly explained

▶ youtube.com/watch?v=IN2XmBhILt4

3Blue1Brown

Chain rule, product rule, intuition

▶ youtube.com/watch?v=YG15m2VwSjA

\dfrac{\partial L}{\partial w} = \dfrac{\partial L}{\partial a} \cdot \dfrac{\partial a}{\partial z} \cdot \dfrac{\partial z}{\partial w}

Pronunciation

Partial L with respect to w, equals, partial L with respect to a, times, partial a with respect to z, times, partial z with respect to w.

In plain English

Read: the derivative of the loss with respect to a weight equals the product of three derivatives along the path from that weight to the loss (w → z → a → L). This is the chain rule from calculus, applied backwards from the output through every layer. That is backpropagation.
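A small sketch of the three-factor product, assuming a single neuron with z = w·x + b, a sigmoid activation, and a squared-error loss. None of these choices come from the lesson; they are picked so that each local derivative has a simple closed form. The result is checked against a finite-difference estimate.

```python
import math

def forward(w, x=1.5, b=0.1, y=1.0):
    # Assumed single neuron: z = w*x + b, a = sigmoid(z), L = (a - y)^2
    z = w * x + b
    a = 1.0 / (1.0 + math.exp(-z))
    L = (a - y) ** 2
    return z, a, L

def grad_w(w, x=1.5, b=0.1, y=1.0):
    # Multiply the three local derivatives along the path w -> z -> a -> L
    z, a, L = forward(w, x, b, y)
    dL_da = 2.0 * (a - y)      # derivative of the squared error
    da_dz = a * (1.0 - a)      # derivative of the sigmoid
    dz_dw = x                  # derivative of w*x + b with respect to w
    return dL_da * da_dz * dz_dw

w = 0.4
analytic = grad_w(w)
h = 1e-6
numeric = (forward(w + h)[2] - forward(w - h)[2]) / (2.0 * h)
print(analytic, numeric)
```

The two printed numbers should agree to several decimal places, which is a common sanity check for a hand-derived gradient.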

Symbol glossary — click any symbol to hear it

∂ partial partial derivative

L L loss ↗ first seen Ch.15

w w a single weight

a a activation

z z pre-activation

In code — type it yourself

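optimizer.zero_grad()  # clear gradients left over from the previous step
loss.backward()        # backpropagate: fill .grad for every parameter
optimizer.step()       # update the parameters using those gradients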
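The calls above rely on PyTorch's autograd. As a rough, dependency-free sketch of what they amount to, the block below backpropagates by hand through an assumed two-layer scalar network (sigmoid activations, squared error) and then applies a plain gradient-descent update. The network, the values, and the names `loss_of` and `backward` are all hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss_of(w1, w2, x, y):
    # Assumed tiny network: x -> (w1, sigmoid) -> (w2, sigmoid) -> squared error
    a1 = sigmoid(w1 * x)
    a2 = sigmoid(w2 * a1)
    return (a2 - y) ** 2

def backward(w1, w2, x, y):
    # Forward pass, keeping the intermediate activations
    a1 = sigmoid(w1 * x)
    a2 = sigmoid(w2 * a1)
    # Backward pass: start at the loss and walk toward the first weight
    dL_dz2 = 2.0 * (a2 - y) * a2 * (1.0 - a2)
    dL_dw2 = dL_dz2 * a1
    dL_da1 = dL_dz2 * w2                 # blame passed back to layer 1
    dL_dz1 = dL_da1 * a1 * (1.0 - a1)
    dL_dw1 = dL_dz1 * x
    return dL_dw1, dL_dw2

w1, w2, x, y, lr = 0.5, -0.3, 2.0, 1.0, 0.1
before = loss_of(w1, w2, x, y)
g1, g2 = backward(w1, w2, x, y)        # roughly the role of loss.backward()
w1, w2 = w1 - lr * g1, w2 - lr * g2    # roughly the role of a plain SGD step
after = loss_of(w1, w2, x, y)
print(before, after)
```

Note how `dL_dz2` is computed once and reused for both weights. Reusing upstream derivatives this way is what makes backpropagation cheaper than differentiating each weight separately.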

Try it — 10 practice problems

Problem 1 of 10

Chain rule: ∂L/∂a = 2, ∂a/∂z = 0.5, ∂z/∂w = 3. Compute ∂L/∂w.
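Working the problem through the chain rule, multiplying the three given factors in order:

```latex
\dfrac{\partial L}{\partial w} = \dfrac{\partial L}{\partial a} \cdot \dfrac{\partial a}{\partial z} \cdot \dfrac{\partial z}{\partial w} = 2 \times 0.5 \times 3 = 3
```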

Chapter 17 · 17 of 17

Scaled Dot-Product Attention — the transformer's heart.

The formula at the core of transformer-based LLMs. Each token decides how much to listen to every other token.

▶ Watch first — go deeper before you practice

3Blue1Brown

Transformers, visually — chapter 5

▶ youtube.com/watch?v=wjZofJX0v4M

3Blue1Brown

But what is a GPT? Visual intro

▶ youtube.com/watch?v=eMlx5fFNoYc

Andrej Karpathy

Let's build GPT: from scratch

▶ youtube.com/watch?v=kCc8FmEb1nY

\text{Attention}(Q, K, V) = \text{softmax}\!\left(\dfrac{Q K^T}{\sqrt{d_k}}\right) V

Pronunciation

Attention of Q, K, V, equals, softmax of, Q times K transpose, divided by, square root of d sub k, all times V.

In plain English

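Read: Q · K^T gives a similarity score for every query-key pair. Dividing by √d_k keeps the scores from growing with the key dimension, which would otherwise saturate the softmax. Softmax turns each row of scores into weights that sum to 1. Multiplying by V gives each token a weighted blend of all tokens' values.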
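To see why the division by √d_k matters, here is a small sketch that compares softmax on unscaled and scaled scores. The three raw scores and d_k = 64 are made-up illustrative numbers, and `softmax` is a hand-written helper rather than a library call.

```python
import math

def softmax(scores):
    # Subtract the max before exponentiating for numerical safety
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

d_k = 64
raw = [16.0, 8.0, 0.0]                       # illustrative unscaled dot products
scaled = [s / math.sqrt(d_k) for s in raw]   # divide every score by sqrt(d_k)

print(softmax(raw))     # nearly one-hot: almost all weight on the first token
print(softmax(scaled))  # softer weights spread across the three tokens
```

When the softmax output is nearly one-hot, its gradients become very small, which slows learning. Scaling keeps the weights in a range where every token can still contribute.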

Symbol glossary — click any symbol to hear it

Q Q (queries) what each token is looking for

K K (keys) what each token offers

V V (values) the payload each token carries

Kᵀ K transpose K with rows and columns swapped

dₖ d sub k dimension of key vectors

In code — type it yourself

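import torch
import torch.nn.functional as F

def attention(Q, K, V):
    d_k = Q.size(-1)  # dimension of each query/key vector
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5  # scaled similarity scores
    weights = F.softmax(scores, dim=-1)  # each row sums to 1
    return weights @ V  # weighted blend of the value vectors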
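For tracing the arithmetic by hand, the same steps can be sketched without PyTorch using plain lists. The function name `attention_lists` and the tiny Q, K, V matrices are hypothetical. Unlike the function above, this version also returns the attention weights so they can be inspected.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention_lists(Q, K, V):
    # Q, K, V are lists of row vectors, one row per token
    d_k = len(K[0])
    out, all_weights = [], []
    for q in Q:
        # Similarity of this query with every key, scaled by sqrt(d_k)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Weighted blend of the value rows
        blended = [sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))]
        out.append(blended)
        all_weights.append(weights)
    return out, all_weights

Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[10.0, 0.0], [0.0, 10.0]]
out, weights = attention_lists(Q, K, V)
print(weights)
print(out)
```

Each query here matches its own key best, so each output row leans toward that token's value while still mixing in some of the other.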

Try it — 10 practice problems

Problem 1 of 10

In scaled dot-product attention, we divide by √d_k. If d_k = 64, what is the scaling factor √d_k?
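Working the problem through directly:

```latex
\sqrt{d_k} = \sqrt{64} = 8, \qquad \text{so each entry of } Q K^T \text{ is divided by } 8
```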

Read the math like a sentence.

Vectors — arrows of numbers.

Vector Norm (Length) — how long is the arrow.

Unit Vector (Normalization) — direction, no length.

Dot Product — similarity scalar.

Cosine Similarity — the angle judge.

Matrix · Vector Product — the linear transformation.

Matrix Multiplication — the workhorse.

Determinant (2×2) — the area scaler.

Eigenvalues & Eigenvectors — the directions that survive.

Sigmoid Activation — the squasher.

ReLU Activation — the gatekeeper.

Softmax — the probability picker.

Mean Squared Error — the regression scorecard.

Cross-Entropy Loss — the classification scorecard.

Gradient Descent — the update rule.

Backpropagation Chain Rule — the credit assignment.

Scaled Dot-Product Attention — the transformer's heart.

⌨ Keyboard Shortcuts