Artificial Intelligence · Part 1 of 3

Inevitably Orthogonal

Why random vectors are nearly orthogonal in high dimensions

During my Master's in AI, a group of us were discussing the attention mechanism — the Q, K, V weight matrices that make transformers work. We'd been looking at toy 2D examples, where the word embedding was doing most of the heavy lifting, and the other students couldn't see why you'd need learned projections at all. My argument was roughly: dot products in high-dimensional spaces are almost always near zero. If your vectors start out pointing in essentially random directions, there's no signal to work with — you have to project them into a subspace where similarity emerges. That's what the weight matrices do.

This surprised people. In 2D, two random vectors are quite likely to be somewhat aligned just by accident. But in 768 dimensions? There are so many ways to be orthogonal that alignment by chance essentially never happens.

That evening I ran a Monte Carlo experiment to see for myself, and the result became this article.

Written in collaboration with Claude (Opus 4.6). The ideas, direction, and editorial judgement are mine; the writing, code, and visualisations were a conversation between us.

What the Experiment Shows

Here is that experiment, running live. For each dimension d, we sample random points on the surface of the unit sphere, take the dot product with a fixed reference direction (1, 0, \ldots, 0), and histogram the results. Watch what happens as dimension grows.

Note what we are measuring: the dot product of (x_1, \ldots, x_d) with (1, 0, \ldots, 0) is simply x_1, the first coordinate. And since the sphere has no preferred direction, any fixed reference gives the same distribution — the choice of (1, 0, \ldots, 0) is without loss of generality. So “dot product with a fixed direction” and “first coordinate x_1” are the same problem.

How do you sample a random point on a sphere?

Sample each coordinate independently from a standard Gaussian: z = (z_1, z_2, \ldots, z_d) where each z_i \sim \mathcal{N}(0,1). Then normalise: x = z / \|z\|.

This works because the multivariate Gaussian is spherically symmetric — its density depends only on the distance from the origin, not on the direction. So the direction z/\|z\| is uniformly distributed on the sphere. The random length \|z\| gets thrown away by normalising.

You might think you could just pick each coordinate uniformly from [-1, 1] and normalise. This does not work: the cube has corners, so diagonal directions would be overrepresented. The Gaussian has no corners — it looks the same in every direction — which is exactly what makes the trick work.
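If you want to try the trick yourself, here is a minimal NumPy sketch. The function name `random_unit_vectors` is my own label, not from any library; it implements exactly the Gaussian-then-normalise recipe above.

```python
import numpy as np

def random_unit_vectors(n, d, rng=None):
    """Sample n points uniformly on the unit sphere in R^d.

    Each coordinate is drawn from N(0, 1); normalising the Gaussian
    vector gives a direction uniform on the sphere, because the
    Gaussian density depends only on distance from the origin.
    """
    rng = np.random.default_rng(rng)
    z = rng.standard_normal((n, d))
    return z / np.linalg.norm(z, axis=1, keepdims=True)

# The dot product with the fixed reference (1, 0, ..., 0) is just the
# first coordinate, so histogramming x[:, 0] reproduces the panel below.
x = random_unit_vectors(100_000, d=768)
print(x[:, 0].mean(), x[:, 0].std())  # near 0, near 1/sqrt(768) ≈ 0.036
```

Normalising a uniform sample from the cube instead (replace `standard_normal` with `uniform(-1, 1, ...)`) gives visibly clumped directions in low dimensions, for the corner reason described above.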

Histogram Panel — Dot product distributions for d = 2, 3, 4, 5, 10, 50, 200, 1000

What happens as we increase the dimension d is striking:

The top row shows the distribution changing shape:

  • d = 2 (the circle). U-shaped, piling up at the extremes \pm 1.
  • d = 3 (the ordinary sphere). Flat — uniform on [-1, 1]. A beautiful and non-obvious fact.
  • d = 4. A semicircular bump at zero. Concentration begins.
  • d = 5. A quadratic bump — an inverted parabola, sharper than the semicircle.

The bottom row shows the distribution collapsing: by d = 10 there is a clear spike near zero, and by d = 1000 it is a needle. The higher the dimension, the tighter the concentration.

Two things are happening at once:

  • The shape of the distribution changes — from U-shaped through flat to bell-curved.
  • The distribution narrows, collapsing toward zero.

Both deserve explanation, but the narrowing is the more consequential fact: as dimension grows, a random direction on the sphere becomes nearly orthogonal to any fixed direction. Mathematicians call this concentration of measure.

We will tackle the narrowing first — measuring it, then proving why it happens — and return to the changing shape in Article 2.

Measuring the Concentration

The histograms show the distribution narrowing, but let us be quantitative. We run another Monte Carlo experiment: for each dimension d, sample many random unit vectors, record their first coordinates, and measure the variance — how spread out the values are.

Monte Carlo Variance — Empirical variance vs. dimension, with exact 1/d overlay

The empirical dots hug the red curve — the variance appears to follow 1/d exactly, for every dimension from d = 2 onwards. (Try increasing the sample count to see the scatter tighten.) A Monte Carlo experiment cannot prove this, but the pattern is striking enough to demand an explanation. Is it really exactly 1/d? And if so, why?
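The experiment behind that plot fits in a few lines of NumPy. The sample count and the set of dimensions are my choices; any reasonably large values show the same pattern.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000  # samples per dimension

variances = {}
for d in [2, 3, 5, 10, 50, 200, 1000]:
    z = rng.standard_normal((n, d))
    x1 = z[:, 0] / np.linalg.norm(z, axis=1)  # first coordinate of a random unit vector
    variances[d] = x1.var()
    print(f"d={d:5d}  empirical var={variances[d]:.5f}  exact 1/d={1/d:.5f}")
```

The printed columns match to three or four decimal places at this sample size, which is exactly the "dots hugging the curve" behaviour in the panel.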

The Symmetry Argument

We can arrive at this result from symmetry alone. If X = (x_1, \ldots, x_d) is a random unit vector, then:

x_1^2 + x_2^2 + \cdots + x_d^2 = 1

Take the expected value of both sides:

\mathbb{E}[x_1^2] + \mathbb{E}[x_2^2] + \cdots + \mathbb{E}[x_d^2] = 1

But by rotational symmetry, the distribution of X looks the same from every coordinate's perspective: all the \mathbb{E}[x_i^2] are equal. Therefore:

d \cdot \mathbb{E}[x_1^2] = 1 \quad \Longrightarrow \quad \mathbb{E}[x_1^2] = \frac{1}{d}

Since \mathbb{E}[x_1] = 0 (by the symmetry x_1 \to -x_1), the variance is:

\operatorname{Var}(x_1) = \mathbb{E}[x_1^2] - \mathbb{E}[x_1]^2 = \frac{1}{d} - 0 = \frac{1}{d}

That is the entire proof. d equal terms summing to 1 must each be 1/d. The dots in the Monte Carlo plot sit on the 1/d curve because they have to.

The Hat-Box Theorem

Before we can explain all those shapes, we need to understand why d = 3 is flat. Look back at the histogram — the first coordinate is uniform on [-1, 1]. Every height on the sphere is equally likely. Understanding why unlocks the entire shape story.

Imagine slicing the sphere into thin horizontal bands of equal height \Delta h. Each band is a ring on the sphere's surface. How much surface area does each ring capture? Two effects compete:

  • Circumference. Bands near the poles wrap around a smaller latitude circle. Look at the cross-section: at latitude angle \phi from the equator, the radius of the latitude circle is the horizontal leg of a right triangle — R\cos\phi. So the circumference shrinks as \cos\phi.
  • Slant height. Near the poles, the sphere's surface is nearly horizontal. A horizontal slice of height \Delta h cuts diagonally across this nearly-flat surface, capturing a longer strip — \Delta h / \cos\phi. Near the equator the surface is nearly vertical, so the strip is close to \Delta h itself.
  • Why the same angle? The radius and the tangent are perpendicular (marked at P in the visualisation). So if the radius tilts at \phi from the horizontal, the tangent must tilt at \phi from the vertical — the same \phi that sets the circumference.

Drag the point along the meridian and watch both effects in action. The unrolled strip on the right shows the band’s actual shape: wide and short at the equator, narrow and tall near the pole.

The punchline: circumference shrinks as \cos\phi, slant height grows as 1/\cos\phi, and their product — the area of the band — is \cos\phi \times 1/\cos\phi = 1. Constant, regardless of position. Every equal-height band captures the same surface area. That is why the d = 3 histogram is flat.
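The equal-area claim is easy to test by Monte Carlo: sample points uniformly on the ordinary sphere, bucket them into equal-height bands, and count. (Band count and sample size here are my choices.)

```python
import numpy as np

rng = np.random.default_rng(2)
z = rng.standard_normal((1_000_000, 3))
height = z[:, 0] / np.linalg.norm(z, axis=1)  # first coordinate = height on the sphere

# Ten equal-height bands covering [-1, 1]: [-1, -0.8), ..., [0.8, 1].
counts, _ = np.histogram(height, bins=10, range=(-1, 1))
print(counts / counts.sum())  # each band holds ≈ 0.1 of the points
```

Each band captures the same fraction of points, whether it hugs the equator or the pole — the cancellation in action.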

Archimedes discovered this around 225 BCE by an independent route: he showed that a sphere and a snug cylinder wrapped around it have identical lateral surface area — the “hat-box” theorem. Slice both at the same two heights and the areas match. He considered it his greatest result, and asked for the sphere-in-cylinder figure to be carved on his tombstone.

But this perfect cancellation is specific to d = 3. In other dimensions, one effect wins — and which one, and by how much, determines all the shapes we saw in the histograms. That is what we will derive in Article 2.

Next: How a Polynomial Becomes a Bell Curve

We have measured the concentration and proved why it happens: d coordinates sharing a fixed sum forces each to be small, giving variance exactly 1/d. But look at the histograms again — they are not just narrow, they have a specific shape that changes with d. The d = 2 histogram is U-shaped, d = 3 is flat, and large d is Gaussian. In the next article, we generalise the hat-box argument to every dimension and discover a single formula that controls this entire transition.