Artificial Intelligence · Part 3 of 3

Why Your AI Search Works

Concentration of measure in machine learning and physics

Why Cosine Similarity Powers Modern AI

Cosine similarity — the dot product of normalised vectors — is exactly the quantity we have been studying.

Consider a search query embedded as a unit vector in a 768-dimensional space. Your index contains millions of documents, each also a unit vector. The vast majority — articles about cooking, sports, tax law — have nothing to do with your query. These are effectively random directions, so by the results of Articles 1 and 2, their cosine similarities with your query are approximately \mathcal{N}(0, 1/d). This is the noise floor: a tight band around zero, with standard deviation 1/\sqrt{768} \approx 0.036. Virtually all noise lives within \pm 0.1.
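A quick simulation makes the noise floor concrete. This is only a sketch — the dimension and the number of fake "documents" are chosen arbitrarily — but the measured spread should land right on 1/\sqrt{768}:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 768

# A fixed "query" direction on the unit sphere.
query = rng.standard_normal(d)
query /= np.linalg.norm(query)

# 100,000 random unit vectors standing in for unrelated documents.
docs = rng.standard_normal((100_000, d))
docs /= np.linalg.norm(docs, axis=1, keepdims=True)

# Since everything has unit length, the dot product is the cosine similarity.
sims = docs @ query

print(sims.std())          # close to 1/sqrt(768), about 0.036
print(np.abs(sims).max())  # even the most extreme outlier stays small
```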

[Interactive: Noise Floor — cosine similarity distribution of random unit vectors, with dimension slider]

Now suppose a document has cosine similarity 0.3 with your query. Is that noise, or signal? It sits at 0.3 / 0.036 \approx 8 standard deviations above the noise floor. The probability of a random vector landing there by chance is less than 10^{-15}. Among a billion unrelated documents, the expected number exceeding this threshold is essentially zero. You do not need to know anything about how “similar” documents are distributed — the noise floor is so tight that any genuine similarity towers above it.

This separation grows with dimension. A cosine similarity of s sits at s\sqrt{d} standard deviations above the noise. At d = 100, a similarity of 0.3 is 3\sigma — noticeable. At d = 768, it is 8\sigma — unambiguous. At d = 1536, it is 12\sigma. Higher dimensionality makes similarity search more discriminating, not less.
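The same arithmetic, as a few lines of Python (the similarity 0.3 is just the running example from above):

```python
import math

s = 0.3  # a genuine cosine similarity, well above the noise floor
for d in (100, 768, 1536):
    z = s * math.sqrt(d)                   # standard deviations above the noise
    p = 0.5 * math.erfc(z / math.sqrt(2))  # Gaussian tail: chance of a random hit
    print(f"d = {d:5d}   z = {z:5.1f} sigma   P(noise >= {s}) ~ {p:.1e}")
```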

The embedding model’s job is to place related items above the noise floor. Concentration of measure guarantees that floor is tight. This is why vector search works at scale: not because embeddings are “good” in some vague sense, but because concentration turns any learnable similarity — however modest — into a statistically unambiguous detection.

How Many Vectors Can a Space Hold?

Concentration does not just mean that one random pair of vectors is nearly orthogonal. It means that all pairs are, simultaneously.

In 3 dimensions, you have 3 orthogonal axes and that's it — any new direction must overlap with one of them. In 10,000 dimensions, you can fit millions of nearly perpendicular directions without any of them crowding each other.

The heatmap below shows a Gram matrix — the pairwise dot products among random unit vectors. Each cell is the cosine similarity between two vectors. Pick a number of vectors, then increase the dimension and watch the off-diagonal entries vanish.

[Interactive: Porcupine Saturation — low-D vs. high-D Gram matrices]

This is why embedding spaces work at scale. In low dimensions, the Gram matrix saturates quickly — off-diagonal similarities creep up as you add vectors, and directions start to crowd. In high dimensions, even hundreds of vectors remain nearly orthogonal. The more dimensions you have, the more room there is.
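The same experiment in code: a sketch that builds the Gram matrix for random unit vectors and reports the worst off-diagonal similarity, first in 3 dimensions and then in 10,000.

```python
import numpy as np

rng = np.random.default_rng(0)

def max_offdiag_cosine(n_vectors: int, d: int) -> float:
    """Largest |cosine similarity| among pairs of random unit vectors in R^d."""
    v = rng.standard_normal((n_vectors, d))
    v /= np.linalg.norm(v, axis=1, keepdims=True)
    gram = v @ v.T                 # pairwise cosine similarities
    np.fill_diagonal(gram, 0.0)    # ignore each vector's similarity with itself
    return float(np.abs(gram).max())

print(max_offdiag_cosine(200, 3))       # severe crowding: close to 1
print(max_offdiag_cosine(200, 10_000))  # near-orthogonal: a few hundredths
```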

Maxwell-Boltzmann as Geometry

The connection between dot products and AI search might not surprise you — both involve vectors. But the same mathematics describes something far older: the motion of gas molecules.

Consider an ideal gas of N identical particles, each of mass m, in a box with fixed total energy E. Each particle has three velocity components, so the system's state is a point in \mathbb{R}^{3N}. The constraint on total kinetic energy is:

\tfrac{1}{2}m\bigl(v_1^2 + v_2^2 + \cdots + v_{3N}^2\bigr) = E \quad\Longrightarrow\quad v_1^2 + v_2^2 + \cdots + v_{3N}^2 = \frac{2E}{m}

This is a sphere of radius R = \sqrt{2E/m} in d = 3N dimensions — the same constraint as in Article 1, coordinates whose squares must sum to a fixed total, just in velocity space instead of on the unit sphere.

Each velocity component — say, the x-velocity of particle 7 — is a single coordinate of this sphere. This is exactly the x_1 from Articles 1 and 2. The only difference is scale: on a sphere of radius R instead of 1, every coordinate is multiplied by R, so the variance becomes R^2/d instead of 1/d. Substituting R^2 = 2E/m and d = 3N:

\operatorname{Var}(v_x) = \frac{R^2}{d} = \frac{2E}{3Nm}

By the concentration result from Article 2, each velocity component is approximately Gaussian with this variance:

P(v_x) \;\propto\; \exp\!\left(-\frac{v_x^2}{2 \;\cdot\; 2E/(3Nm)}\right) \;=\; \exp\!\left(-\frac{3Nm\,v_x^2}{4E}\right)

Look at that exponent. The textbook Maxwell-Boltzmann distribution writes it as \exp(-m\,v_x^2/2kT), where T is temperature and k is Boltzmann's constant. Matching the two:

kT \;=\; \frac{R^2 \cdot m}{d} \;=\; \frac{2E}{3N}

We did not assume this — we derived it. Temperature is the sphere's radius squared divided by its dimension, times the particle mass. The Boltzmann constant is just the unit conversion that makes this geometric ratio into kelvins.

Given an ideal gas with fixed total energy, the Maxwell-Boltzmann distribution is not an additional assumption — it is a geometric consequence of the high-dimensional sphere.

The physical assumptions are the ones already stated: an ideal gas (energy is a sum of squares — no interactions, no potential energy) with fixed total energy. The one statistical assumption is that the system is equally likely to be found anywhere on the resulting sphere. Given that, everything else — Gaussian velocities, the form of the distribution, the meaning of temperature — is the mathematics of Articles 1 and 2 applied to a sphere in 3N dimensions. With N \sim 10^{23} particles, the concentration is so extreme that deviations from the Gaussian are not merely unlikely but essentially impossible. This is why thermodynamics is deterministic in practice despite being statistical in principle.
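The whole chain is easy to check numerically. Below is a sketch with arbitrary toy values for N, m, and E: it samples points uniformly on the energy sphere (a Gaussian vector normalised and scaled to radius R) and compares one velocity component's variance with the predicted 2E/(3Nm).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy ideal gas (units arbitrary): N particles of mass m, total kinetic energy E.
N, m, E = 1_000, 2.0, 500.0
d = 3 * N
R = np.sqrt(2 * E / m)  # radius of the energy sphere in velocity space

# Uniform points on the sphere: normalise Gaussian vectors, then scale to R.
v = rng.standard_normal((2_000, d))
v = R * v / np.linalg.norm(v, axis=1, keepdims=True)

vx = v[:, 0]  # a single velocity component, as in the derivation
print(vx.var())             # empirical variance of one component
print(2 * E / (3 * N * m))  # geometric prediction: 2E/(3Nm)
```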

Key Takeaways

1. Variance 1/d, by pure symmetry. The dot product of a random unit vector with any fixed direction has variance exactly 1/d — requiring no computation beyond x_1^2 + \cdots + x_d^2 = 1.

2. One exponent controls the shape. The density (1 - x^2)^{(d-3)/2} transitions from arcsine (d{=}2) through uniform (d{=}3) to Gaussian (large d) — circumference vs. slant, with the circumference winning decisively as dimension grows.

3. This is why your AI search works. Concentration of measure crushes the noise floor to a band of width 1/\sqrt{d} around zero. Any genuine cosine similarity s sits at s\sqrt{d} standard deviations above the noise — a gap that grows with dimension. The embedding model creates the signal; geometry guarantees the silence.

4. Thermodynamics as geometry. The Maxwell-Boltzmann distribution is concentration of measure on the energy shell. With 10^{23} particles, statistical mechanics is a theorem about high-dimensional spheres.

Back to That Conversation

This series started with a question about attention mechanisms: why do transformers need learned Q, K, V projections at all? Now we can see why. In 768 dimensions, random dot products are crushed into a noise floor of width 1/\sqrt{768} \approx 0.036. No signal can survive there by accident. The weight matrices exist to create alignment — to project tokens into a subspace where the dot products that matter rise above the geometric noise. That is what concentration of measure demands.