I’ve been reading Anil Ananthaswamy’s Why Machines Learn and it crystallised something I’d been circling around while building transformers from scratch. Every mechanism that matters in modern ML (embeddings, attention, classification, similarity search) reduces to the same primitive: the dot product. Multiply corresponding elements, sum the results.
The operation
Two vectors of the same length, multiplied element-wise and summed.
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))
dot([1, 2, 3], [4, 5, 6]) # 1*4 + 2*5 + 3*6 = 32
In PyTorch: torch.dot(a, b). In numpy: a @ b. Both expose it as a primitive because matrix multiplications, attention, and similarity search are built on top of it.
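A quick check that the library calls agree with the hand-rolled version above (toy vectors of my own, not from the book):
import numpy as np
import torch
a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])
a @ b                                        # 32.0 with NumPy's @
torch.dot(torch.tensor(a), torch.tensor(b))  # tensor(32., dtype=torch.float64) in PyTorch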
Geometrically, the dot product measures how much two vectors point in the same direction. If they’re aligned, the dot product is large and positive. If they’re perpendicular, it’s zero. If they point in opposite directions, it’s large and negative.
a · b = |a| |b| cos(θ)
That cosine term is doing all the work. Two vectors can have any magnitude, but the angle between them captures their relationship. This is why cosine similarity - the dot product of normalised vectors - shows up everywhere in ML.
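A tiny sketch of that picture, reusing the dot() function from above with 2-d toy vectors of my own:
import math
def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))
cosine([1, 0], [2, 1])    # ≈ 0.89: mostly aligned
cosine([1, 0], [0, 3])    # 0.0: perpendicular
cosine([1, 0], [-2, 0])   # -1.0: opposite directions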
Vectors as meaning
The first conceptual leap: words can be vectors. Not arbitrary vectors - vectors where position encodes meaning.
Word2Vec (2013) showed this was possible. Train a shallow network to predict context words, and the hidden layer weights become vector representations where semantic relationships map to geometric relationships.
king = [0.2, 0.8, -0.1, 0.5, ...] # 300 dimensions
queen = [0.2, 0.7, -0.1, 0.6, ...]
man = [0.1, 0.3, 0.4, 0.2, ...]
woman = [0.1, 0.2, 0.4, 0.3, ...]
The famous result: king - man + woman ≈ queen. Subtract the “male” direction, add the “female” direction, and you land near “queen” in vector space. This isn’t a trick. It falls out of the training process because the model learns that “king” and “queen” appear in similar contexts, offset by the same gender axis that separates “man” and “woman”.
The dot product is how you check these relationships. dot(king, queen) is high because they’re semantically similar. dot(king, bicycle) is low because they’re not.
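Here is a toy version of the analogy check, treating the four example coordinates above as a complete 4-d space. It is my own illustration: with these made-up numbers the arithmetic lands exactly on queen, whereas real 300-d embeddings only land nearby.
import numpy as np
king  = np.array([0.2, 0.8, -0.1, 0.5])
queen = np.array([0.2, 0.7, -0.1, 0.6])
man   = np.array([0.1, 0.3,  0.4, 0.2])
woman = np.array([0.1, 0.2,  0.4, 0.3])
def cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
cos(king - man + woman, queen)   # 1.0 with these toy numbers; close to 1 with real embeddings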
Why high dimensions work
In two or three dimensions, there isn’t room for many orthogonal directions. You can represent “gender” or “royalty” but not both independently. In 300 dimensions, you have 300 independent axes. In 768 dimensions (BERT) or 12,288 dimensions (GPT-3), you have a vast space where thousands of semantic features can coexist without interfering with each other.
This is counterintuitive. Our spatial intuition breaks down above three dimensions. We imagine high-dimensional spaces as “crowded” - but they’re the opposite. In high dimensions, random vectors are almost always nearly orthogonal. Pick two random vectors in 300-dimensional space and the angle between them will be close to 90 degrees. There’s so much room that unrelated concepts naturally stay out of each other’s way.
import numpy as np
# Random vectors in high dimensions are nearly orthogonal
dims = 300
a = np.random.randn(dims)
b = np.random.randn(dims)
cos_sim = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
# cos_sim ≈ 0.0 (std ≈ 1/sqrt(dims) ≈ 0.06 - nearly orthogonal almost every time)
This is why embeddings work. You don’t have to carefully design which dimension means what. You give the model hundreds of dimensions and let gradient descent figure out the geometry. The model learns to use some directions for syntax, others for semantics, others for sentiment - all orthogonal, all coexisting.
Probing experiments give us a rough picture of what different layers encode:
- Early layers - “The” is a determiner, “.” ends a sentence, “running” is a verb form
- Middle layers - in “The bank approved the loan”, “bank” is the subject of “approved”
- Deeper layers - “bank” means financial institution here, not riverbank. The sentence is about finance, not geography
- Final layers - task-specific features. For translation: which French word to emit next. For sentiment: this sentence is neutral
- Most dimensions - unknown. We can probe for features we think of, but the model allocates its dimensions however gradient descent sees fit
Ananthaswamy makes the point that this is one of the deep reasons ML works at all. The “curse of dimensionality” that plagues classical statistics becomes a blessing for representation learning. More dimensions means more room for structure.
From similarity to attention
Attention is a dot product with a story attached.
In the transformer, the query vector asks “what am I looking for?” and the key vector says “this is what I contain.” The dot product between them measures relevance.
import torch.nn.functional as F
# Attention scores: how relevant is each key to this query?
d_k = keys.shape[-1]                          # dimensionality of the key vectors
scores = query @ keys.T                       # dot product between query and every key
weights = F.softmax(scores / d_k ** 0.5, dim=-1)
output = weights @ values                     # weighted average of values
Each position in a sequence gets projected into three vectors: query, key, and value. The dot product between a query and all keys determines which values get attended to. High dot product means “these two positions are relevant to each other.” The softmax turns scores into a probability distribution. The weighted sum of values is the output.
The attention mechanism is dot products between queries and keys, softmaxed, then used to take a weighted average of values.
The scaling factor sqrt(d_k) exists because dot products grow with dimension: for random vectors with unit-variance components, the dot product has a standard deviation of sqrt(d_k), which is 8 in 64 dimensions. Without scaling, the softmax would saturate and gradients would vanish. This is the “scaled” in “scaled dot-product attention” from the original paper.
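A quick numerical check of that claim (my own sketch, not from the paper): the spread of dot products between random Gaussian vectors grows like sqrt(d), which is exactly what the sqrt(d_k) divisor cancels.
import numpy as np
for d in (16, 64, 256):
    dots = [np.random.randn(d) @ np.random.randn(d) for _ in range(2000)]
    print(d, round(float(np.std(dots)), 1))   # roughly 4, 8, 16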
Dot products as learned comparisons
A linear layer in a neural network is a matrix of dot products. Each output neuron computes the dot product of its weight vector with the input, then adds a bias.
# A linear layer: each row of W is a "template"
# The dot product measures how well the input matches each template
output = W @ input + bias
Each row of the weight matrix is a learned template. The dot product between that template and the input measures how well the input matches what the neuron is looking for. A high dot product means “this input aligns with what I’ve learned to detect.”
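A small sanity check of that claim (my own sketch with random numbers): the matrix product really is one dot product per template row.
import numpy as np
W = np.random.randn(4, 3)       # 4 output neurons, each a 3-d "template"
x = np.random.randn(3)          # input vector
bias = np.random.randn(4)
row_by_row = np.array([w_row @ x for w_row in W]) + bias   # explicit dot per row
assert np.allclose(W @ x + bias, row_by_row)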
Classification is the same thing. The final layer of a classifier has one weight vector per class. The dot product between the input representation and each class vector produces a score. The class with the highest score wins.
# Classification: which class vector is closest to the input?
logits = class_vectors @ hidden_state # dot product with each class
prediction = logits.argmax()  # index of the highest-scoring class
The model doesn’t learn rules. It learns geometry. Training pushes inputs from the same class toward the same region of vector space, and pushes the class vectors to point at those regions. At inference time, the dot product measures which region an input belongs to.
Retrieval is the same operation
Vector databases - Pinecone, pgvector, Qdrant - store embeddings and retrieve the most similar ones. The similarity metric is almost always cosine similarity or dot product.
# Embed a query
query_vec = embed("What is attention in transformers?")
# Find the most similar documents
scores = [dot(query_vec, doc_vec) for doc_vec in database]
top_results = np.argsort(scores)[::-1][:10]  # indices of the ten highest-scoring documents
RAG (retrieval-augmented generation) works by embedding a question, finding documents with high dot-product similarity, and stuffing them into the prompt. The entire retrieval step is dot products. The “semantic search” that makes RAG useful is just measuring angles in embedding space.
The geometry of understanding
What Ananthaswamy’s book helped me see clearly is that neural networks learn geometries rather than functions in the classical sense. Training warps vector space until similar things are nearby and different things are far apart, and the dot product measures similarity in that warped space.
Embeddings are points in this space, weight matrices rotate and scale it, and attention heads learn measures of relevance between points. The underlying operation is the same: multiply and sum.
The transformer didn’t invent a new kind of computation. It found a better way to wire dot products together - letting every position compare itself to every other position, in parallel, through multiple learned lenses. That architectural shape, not any new maths, is what the 2017 paper contributed.
Once the dot product makes sense, the rest of a neural network is plumbing around it.