Embedding Geometry: Skip Gram From Scratch
Embeddings feel mystical because their dimensions have no human labels. This essay builds a tiny skip gram model from scratch and shows what is real inside the space: not axis names, but geometry, neighbors, and stable relationships that survive random initialization.
I. What an Embedding Actually Is
Ask most people what an embedding is and they will point to a vector with hundreds of numbers. The next question is always the same.
What does dimension 17 mean?
The uncomfortable answer is that it does not have a name. Not because we are hiding the meaning, but because no single axis is meant to align with a human concept. The axes are free to rotate, mix, and rearrange themselves during training as long as prediction improves.
Before we go further, let’s make the word “embedding” concrete.
An embedding is a learned vector representation for each token. In this article, a token will simply be a word, because that keeps the experiment readable. In real LLMs, a word is often split into multiple tokens.
The reason is practical. If we tried to create one token for every plural, every conjugation, every declension, and every spelling variation, the vocabulary size would explode. Tokenization is the mechanism that keeps the vocabulary manageable by breaking text into reusable pieces.
So what is an embedding layer, concretely? A table of vectors.
If the vocabulary size is V and the embedding dimension is D, then the embedding matrix E has shape V × D. Each row E[i] is the embedding vector for token i.
In a tiny demo where V = 50 and D = 20, E is a 50 × 20 matrix. In a production LLM, V can be tens of thousands, sometimes around 100,000 depending on the tokenizer. That is still far smaller than the number you would get if every inflected form had its own token.
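In NumPy terms, the table is literally a 2-D array. A minimal sketch with the toy sizes from the text (the initialization scale here is an arbitrary choice, not prescribed by the article):

```python
import numpy as np

V, D = 50, 20                        # vocabulary size, embedding dimension
rng = np.random.default_rng(0)

# The embedding matrix: one row per token, initialized randomly.
E = rng.normal(0.0, 0.1, size=(V, D))

token_id = 17
vector = E[token_id]                 # looking up an embedding is just row indexing

print(E.shape)       # (50, 20)
print(vector.shape)  # (20,)
```

Everything the model "knows" about token 17 at this point is that random row; training is what gives it structure.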
That sounds abstract until you build one yourself. So I did.
In this article we will train a small skip gram model on a tiny, controlled corpus. This is not how modern LLMs are trained. In a Transformer, embeddings are learned simultaneously with attention matrices, MLP layers, and the output projection. Everything evolves together under the same loss.
Skip gram is a microscope. It isolates the embedding mechanism so we can see the geometry form without the architectural noise.
Then we will watch the space organize itself. Adjectives cluster together. Verbs find each other. Semantically related words drift closer, like plane and car.
Note: In this toy demo we use a very small embedding dimension (D = 20) so we can inspect the space and keep the experiment lightweight. In modern language models, embedding dimensions are typically much larger, often 768, 2048, 4096 or more depending on model size. A higher D does not make individual coordinates more interpretable. It increases the capacity of the space, allowing the model to represent many overlapping relationships and constraints simultaneously, with less geometric interference between tokens.
II. Axes Rotate, Geometry Survives
The punchline is subtle. If you retrain with a different random seed, the raw coordinates change. Dimension 1 can become dimension 4, or some linear combination of several dimensions. Yet the neighbor structure largely survives.
Imagine a tiny embedding space with D = 5. After training run A, you might see:

Placeholder: a table of 5-dimensional vectors for car, plane, and elephant (run A).
It looks like dimension 1 carries something like “vehicle”. But a single dimension is never the full story. What matters is overall similarity, usually measured with cosine similarity.
Now retrain from a different random initialization. You might see:

Placeholder: the same words with rearranged coordinates (run B).
Now “vehicle” is no longer sitting in dimension 1. It is distributed across dimensions 2 and 4. Change the random seed or simply shuffle the training pairs and the coordinates will change again. The axes moved. The geometry did not.
Meaning is not stored in a coordinate.
Meaning is stored in the geometry.
III. Measuring Geometry With Cosine Similarity
If embeddings are geometry, we should be able to measure that geometry. The most common metric is cosine similarity, which compares the angle between two vectors.
Using the vectors above, we get:
| Pair | Cosine similarity | Interpretation |
|---|---|---|
| cos(car, plane) | ≈ 0.999 | Almost perfectly aligned directions, strong semantic proximity |
| cos(car, elephant) | ≈ -0.11 | Not aligned, weak or opposite directional relationship |
The key point is not any single coordinate. The key point is the direction of the full vector. Even if the axes rotate across training runs, the geometry that encodes these relationships remains stable.
Meaning is stored in relative direction, not axis names.
Note: these numbers are intentionally tiny and human scale. Real models use larger dimensions, but the metric is the same.
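Cosine similarity is a few lines of NumPy. The vectors below are made up for this sketch; they will not reproduce the exact numbers from a real training run, but they show the same contrast between aligned and unrelated directions:

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity: the cosine of the angle between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Illustrative 5-dimensional vectors, invented for this example.
car      = np.array([0.90, 0.10, 0.00, 0.20, 0.10])
plane    = np.array([0.85, 0.12, 0.05, 0.22, 0.09])
elephant = np.array([-0.10, 0.80, 0.30, -0.40, 0.20])

print(cosine(car, plane))     # close to 1: nearly aligned directions
print(cosine(car, elephant))  # near zero or negative: unrelated directions
```

Note that cosine similarity ignores vector length entirely; only direction matters, which is exactly the point of the section above.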
IV. How Skip Gram Actually Works
So far we have treated embeddings as if they simply existed. But how do we obtain these vectors?
We start with an embedding matrix E of size V × D. At the beginning, its values mean nothing. They are initialized randomly.
Initialization is not entirely arbitrary. In practice, variance scaling methods such as Glorot (Xavier) initialization are used to keep training stable. In our small skip gram experiment, a simple small random initialization is sufficient.
From Sentences to Training Pairs
Let’s make this concrete with a tiny corpus of short sentences built from our controlled vocabulary, such as “the car moves fast”.
Skip gram does not look at whole sentences at once. It slides a window over each sentence and generates (center, context) pairs.
For a sentence of length T and a window size w, the number of training pairs is:

\[
\#\text{pairs} = \sum_{i=1}^{T} \big( \min(i-1,\, w) + \min(T-i,\, w) \big)
\]
Intuition: for each position i, we can pair the center word with up to w words on the left and up to w words on the right, but near the sentence boundaries there are fewer words available.
Example: the sentence “the car moves fast” has T = 4 tokens. With window size w = 2, the total number of pairs is \(2 + 3 + 3 + 2 = 10\).
Here are the 10 pairs produced from that single sentence:

(the, car), (the, moves), (car, the), (car, moves), (car, fast), (moves, the), (moves, car), (moves, fast), (fast, car), (fast, moves)
When you repeat this over every sentence in the corpus, you get a large list of pairs. Training is simply iterating over that list and applying small gradient updates each time.
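The sliding-window procedure is short enough to write out in full. A minimal sketch, assuming whole-word tokens as in the toy setup:

```python
def skip_gram_pairs(sentence, w=2):
    """Slide a window of size w over the sentence, emitting (center, context) pairs."""
    pairs = []
    for i, center in enumerate(sentence):
        # Context positions: up to w to the left and w to the right of i.
        for j in range(max(0, i - w), min(len(sentence), i + w + 1)):
            if j != i:
                pairs.append((center, sentence[j]))
    return pairs

pairs = skip_gram_pairs(["the", "car", "moves", "fast"], w=2)
print(len(pairs))  # 10
```

Near the boundaries the window is clipped, which is why positions at the edges contribute fewer pairs.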
Notice the pattern. Car and plane repeatedly appear near the same context words. Elephant appears in different contexts.
The corpus defines the pressure field that shapes the embedding space.
Two Matrices
Skip gram uses two parameter matrices:
E, of shape V × D, maps token ids to vectors. W, of shape D × V, maps vectors back to vocabulary scores.
Concretely, the center word embedding is a vector \(h \in \mathbb{R}^D\). Multiplying by \(W \in \mathbb{R}^{D \times V}\) produces one score per vocabulary word:

\[
\text{logits} = hW \in \mathbb{R}^V
\]
You can read \(\text{logit}_j\) as the model’s raw score for the hypothesis: “word \(j\) is a plausible context word for this center word”.
Each vocabulary word \(j\) corresponds to one column vector \(W_j \in \mathbb{R}^D\). The score is a dot product:

\[
\text{logit}_j = h \cdot W_j = \sum_{d=1}^{D} h_d \, W_{d,j}
\]
If \(h\) and \(W_j\) point in similar directions, the dot product is large and \(\text{logit}_j\) is high. If they are orthogonal or opposite, the score is small or negative.
W is a decoder: it converts an embedding direction into a distribution over words.
During training, every word effectively has two representations: one as a center word from E, and one as a context word from W. In practice, we often keep only E, or average both spaces. Modern Transformer models frequently tie these matrices together.
One Training Step
Every pair triggers a tiny correction. Shared contexts produce shared corrections. Over many repetitions, geometry emerges.
Words that share contexts receive similar updates.
Geometry is the compressed memory of repeated context.
V. The Math Behind The Updates
We now describe precisely what happens during one training step. Let c be the center word and t the context word observed in the corpus.
1. From Token to Scores
We look up the embedding of the center word:

\[
h = E[c] \in \mathbb{R}^D
\]
We compute a compatibility score between the center embedding and every vocabulary word considered as a possible context. The result is a vector of size V:

\[
\text{logits} = hW
\]
2. Turning Scores into Probabilities
The logits are arbitrary real numbers. We convert them into probabilities using softmax:

\[
P(j \mid c) = \frac{\exp(\text{logit}_j)}{\sum_{k=1}^{V} \exp(\text{logit}_k)}
\]
We use the exponential because it produces positive values, amplifies differences, and stays smoothly differentiable. Most importantly: the derivative of \(\exp(z)\) is \(\exp(z)\), which keeps gradients clean.
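In code, softmax is a couple of lines. Subtracting the maximum logit first is a standard numerical-stability trick that leaves the result mathematically unchanged (an addition here, not something the derivation requires):

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()   # shift for stability; softmax is shift-invariant
    e = np.exp(z)
    return e / e.sum()

p = softmax(np.array([2.0, 1.0, -1.0]))
print(p)  # positive values summing to 1; the largest logit gets the most mass
```

Without the shift, a large logit like 1000 would overflow `np.exp`; with it, the same probabilities come out safely.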
3. Why Compute a Loss and a Gradient
Training is an optimization problem. The model starts with random parameters. It makes predictions. Some predictions are wrong.
We need a numerical measure of how wrong the model is. This measure is the loss, the negative log probability of the observed context word:

\[
L = -\log P(t \mid c)
\]
Measuring error is not enough. We also need a direction for improvement. The gradient tells us how the loss changes when we slightly modify a parameter.
The loss measures how wrong we are.
The gradient tells us how to become less wrong.
Backpropagation is the systematic application of the chain rule. It computes how the loss depends on intermediate quantities and pushes responsibility backward through the computation.
The learning rate \(\eta\) controls how aggressively we correct the vectors. Too small, and learning is painfully slow. Too large, and training becomes unstable and may diverge. In practice we tune \(\eta\) empirically by watching whether the loss decreases smoothly.
4. Where the Gradient Comes From
Start from:

\[
L = -\log P(t \mid c) = -\log \frac{\exp(\text{logit}_t)}{\sum_{k=1}^{V} \exp(\text{logit}_k)}
\]
Using \(\log(a/b)=\log a - \log b\) and \(\log(\exp(x))=x\):

\[
L = -\text{logit}_t + \log \sum_{k=1}^{V} \exp(\text{logit}_k)
\]
Differentiating with respect to \(\text{logit}_j\) yields:

\[
\frac{\partial L}{\partial \text{logit}_j} = P(j \mid c) - y_j
\]
The term \(y_j\) comes from the target distribution defined by the corpus. The corpus tells us that exactly one word is correct: the observed context word \(t\).
We encode this information using a vector \(y \in \mathbb{R}^V\) defined as:

\[
y_j =
\begin{cases}
1 & \text{if } j = t \\
0 & \text{otherwise}
\end{cases}
\]
This vector contains zeros everywhere except at the index of the true word. It is called a one hot vector.
For all \(j \neq t\), \(y_j = 0\), and the term \(-\text{logit}_t\) does not depend on \(\text{logit}_j\), so those gradient components are simply \(P(j \mid c)\).
For the true word \(j = t\), we have:

\[
\frac{\partial L}{\partial \text{logit}_t} = P(t \mid c) - 1
\]
In vector form, this becomes:

\[
\nabla_{\text{logits}} L = P - y
\]
The one hot vector is not an implementation trick.
It encodes the fact that the corpus provides one observed reality.
The gradient \(P - y\) is simply the distance between belief and observation.
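The identity \(\partial L / \partial \text{logit}_j = P_j - y_j\) can be verified numerically with central finite differences. This sanity check was written for this article; it is not part of the training code:

```python
import numpy as np

def loss(logits, t):
    """Cross-entropy loss L = -log P(t | c), with a stable softmax."""
    z = logits - logits.max()
    p = np.exp(z) / np.exp(z).sum()
    return -np.log(p[t])

logits = np.array([0.5, -1.2, 2.0, 0.3])
t = 2                                   # index of the observed context word

# Analytic gradient from the derivation: P - y
z = logits - logits.max()
P = np.exp(z) / np.exp(z).sum()
y = np.zeros_like(logits)
y[t] = 1.0
analytic = P - y

# Numerical gradient via central differences
eps = 1e-6
numeric = np.zeros_like(logits)
for j in range(len(logits)):
    plus, minus = logits.copy(), logits.copy()
    plus[j] += eps
    minus[j] -= eps
    numeric[j] = (loss(plus, t) - loss(minus, t)) / (2 * eps)

print(np.max(np.abs(analytic - numeric)))  # tiny: the formula checks out
```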
5. Interpreting the Update
For the true word:

\[
\frac{\partial L}{\partial \text{logit}_t} = P_t - 1
\]
If \(P_t\) is too small, then \(P_t - 1\) is negative. In gradient descent we subtract the gradient, so a negative gradient increases \(\text{logit}_t\). This raises the probability of the correct context word.
The correct word is pushed up. All others are pushed down.
6. Updating Both Matrices
At each training step, all columns of W are updated. Only the embedding vector of the current center word is updated in E. Over the entire corpus, every word eventually becomes a center word, so all embeddings are trained.
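One full training step can be sketched end to end. The variable names (`E`, `W`, `h`, `P`, `eta`) follow the text; the sizes and the specific pair are arbitrary toy choices, and the article's actual scripts may differ:

```python
import numpy as np

rng = np.random.default_rng(0)
V, D, eta = 8, 5, 0.1                  # toy vocab size, dimension, learning rate
E = rng.normal(0.0, 0.1, size=(V, D))  # input embeddings: one row per word
W = rng.normal(0.0, 0.1, size=(D, V))  # output matrix: one column per word

c, t = 1, 3                            # a single (center, context) pair
losses = []
for _ in range(50):
    h = E[c].copy()                    # 1. look up the center embedding
    logits = h @ W                     # 2. one score per vocabulary word
    z = logits - logits.max()
    P = np.exp(z) / np.exp(z).sum()    # 3. softmax probabilities
    y = np.zeros(V)
    y[t] = 1.0
    grad_logits = P - y                # 4. gradient at the logits
    grad_h = W @ grad_logits           # backprop into the center embedding
    W -= eta * np.outer(h, grad_logits)  # every column of W is updated
    E[c] -= eta * grad_h                 # only row c of E is updated
    losses.append(-np.log(P[t]))

print(losses[0] > losses[-1])  # True: repeated corrections reduce the loss
```

Note the asymmetry the text describes: `np.outer(h, grad_logits)` touches all of W, while only `E[c]` changes on the input side.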
7. Geometric Interpretation
Since:

\[
\text{logit}_t = h \cdot W_t
\]
Increasing \(\text{logit}_t\) increases alignment between the center embedding and the output vector of the true context word. Over many steps, words that share contexts get pulled in similar directions.
Semantics emerges from repeated directional adjustments.
VI. The Experiment Setup
The setup is intentionally minimal. A vocabulary file, a corpus made only from that vocabulary, an encoder that converts words to ids, and a training script that learns an embedding matrix.
The goal: build embeddings from scratch and show why axes are arbitrary while distances and directions remain useful.
The method: full softmax skip gram with a small vocabulary. No word2vec library. No tricks.
Pipeline overview
| Step | Input / Script | Output | What It Proves |
|---|---|---|---|
| 1. Create vocabulary | vocab.txt | Vocabulary size V | We control the world the model can speak |
| 2. Index vocabulary | create_index.py | vocab_index.txt | Words become ids |
| 3. Write corpus | Sentences | Raw corpus | Training distribution is the real teacher |
| 4. Encode sentences | encode_sentence.py | corpus_encoded.txt | Model never sees words, only integers |
| 5. Pick embedding dimension | D | Vector size | Capacity changes, meaning still emerges |
| 6. Train skip gram | main_skip_gram.py | E.npy | Geometry emerges from prediction |
| 7. Explore neighbors | explore.py | Nearest words | Distances matter more than axes |
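Step 7 amounts to a nearest-neighbor search by cosine similarity over the rows of E. A sketch of what such an exploration script might do; the word list and vectors below are placeholders, not the contents of the real explore.py or E.npy:

```python
import numpy as np

rng = np.random.default_rng(0)
words = ["car", "plane", "elephant", "fast", "moves", "the"]  # placeholder vocab
E = rng.normal(size=(len(words), 20))                         # stand-in for E.npy

def neighbors(word, k=3):
    """Return the k nearest words to `word` by cosine similarity."""
    i = words.index(word)
    norms = np.linalg.norm(E, axis=1)
    sims = (E @ E[i]) / (norms * norms[i])      # cosine against every row at once
    order = np.argsort(-sims)                   # most similar first
    return [(words[j], round(float(sims[j]), 3)) for j in order if j != i][:k]

print(neighbors("car"))
```

With random vectors the neighbors are meaningless; after training on the controlled corpus, this is exactly the query that shows plane near car and elephant far away.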
Placeholder: screenshots of vocab, corpus, and encoded lines will go here.
VII. Scaling Intuition Toward Transformers
A Transformer does not abandon this story. It scales it. Token embeddings still start as learned vectors. The difference is that the Transformer builds context dependent vectors on top of them through attention.
Skip gram teaches the static geometry. Transformers add dynamic geometry, one vector per token position, shaped by the entire sentence.
Placeholder: a small diagram showing static word embedding versus contextual token embedding.
VIII. GitHub, Data, and Reproducibility
This article is meant to be runnable. I will publish the code, the vocabulary, and a sample corpus. If you change the corpus, the space changes. If you change the seed, the basis changes. If you change the dimension, capacity changes. The geometry will still emerge from prediction.
That is the whole point. Embeddings are not invented. They are trained.
GitHub: download the code, vocab, and corpus.