Embedding Geometry: Skip Gram From Scratch
Embeddings feel mystical because their dimensions have no human labels. This essay builds a tiny skip gram model from scratch and shows what is real inside the space: not axis names, but geometry, neighbors, and stable relationships that survive random initialization.
I. What an Embedding Actually Is
Ask most people what an embedding is and they will point to a vector with hundreds of numbers. The next question is always the same.
What does dimension 17 mean?
The uncomfortable answer is that it does not have a name. Not because we are hiding the meaning, but because no single axis is meant to align with a human concept. The axes are free to rotate, mix, and rearrange themselves during training as long as prediction improves.
Before we go further, let’s make the word “embedding” concrete.
An embedding is a learned vector representation for each token. In this article, a token will simply be a word, because that keeps the experiment readable. In real LLMs, a word is often split into multiple tokens.
The reason is practical. If we tried to create one token for every plural, every conjugation, every declension, and every spelling variation, the vocabulary size would explode. Tokenization is the mechanism that keeps the vocabulary manageable by breaking text into reusable pieces.
So what is an embedding layer, concretely? A table of vectors.
If the vocabulary size is V and the embedding dimension is D, then the embedding matrix E has shape V × D. Each row E[i] is the embedding vector for token i.
In a tiny demo where V = 50 and D = 20, E is a 50 × 20 matrix. In a production LLM, V can be tens of thousands, sometimes around 100,000 depending on the tokenizer. That is still far smaller than the number you would get if every inflected form had its own token.
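In NumPy terms, the table is literally a 2-D array. A minimal sketch with the toy sizes from the text (the initialization scale here is an arbitrary choice, not prescribed by the article):

```python
import numpy as np

V, D = 50, 20                        # vocabulary size, embedding dimension
rng = np.random.default_rng(0)

# The embedding matrix: one row per token, initialized randomly.
E = rng.normal(0.0, 0.1, size=(V, D))

token_id = 17
vector = E[token_id]                 # looking up an embedding is just row indexing

print(E.shape)       # (50, 20)
print(vector.shape)  # (20,)
```

Everything the model "knows" about token 17 at this point is that random row; training is what gives it structure.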
That sounds abstract until you build one yourself. So I did.
In this article we will train a small skip gram model on a tiny, controlled corpus. This is not how modern LLMs are trained. In a Transformer, embeddings are learned simultaneously with attention matrices, MLP layers, and the output projection. Everything evolves together under the same loss.
Skip gram is a microscope. It isolates the embedding mechanism so we can see the geometry form without the architectural noise.
Then we will watch the space organize itself. Adjectives cluster together. Verbs find each other. Semantically related words drift closer, like plane and car.
Note: In this toy demo we use a very small embedding dimension (D = 20) so we can inspect the space and keep the experiment lightweight. In modern language models, embedding dimensions are typically much larger, often 768, 2048, 4096 or more depending on model size. A higher D does not make individual coordinates more interpretable. It increases the capacity of the space, allowing the model to represent many overlapping relationships and constraints simultaneously, with less geometric interference between tokens.
II. Axes Rotate, Geometry Survives
The punchline is subtle. If you retrain with a different random seed, the raw coordinates change. Dimension 1 can become dimension 4, or some linear combination of several dimensions. Yet the neighbor structure largely survives.
Imagine a tiny embedding space with D = 5. After training run A, you might see:

Placeholder: a table of 5-dimensional vectors for car, plane, and elephant (run A).
It looks like dimension 1 carries something like “vehicle”. But a single dimension is never the full story. What matters is overall similarity, usually measured with cosine similarity.
Now retrain from a different random initialization. You might see:

Placeholder: the same words with rearranged coordinates (run B).
Now “vehicle” is no longer sitting in dimension 1. It is distributed across dimensions 2 and 4. Change the random seed or simply shuffle the training pairs and the coordinates will change again. The axes moved. The geometry did not.
Meaning is not stored in a coordinate.
Meaning is stored in the geometry.
III. Measuring Geometry With Cosine Similarity
If embeddings are geometry, we should be able to measure that geometry. The most common metric is cosine similarity, which compares the angle between two vectors.
Using the vectors above, we get:
| Pair | Cosine similarity | Interpretation |
|---|---|---|
| cos(car, plane) | ≈ 0.999 | Almost perfectly aligned directions, strong semantic proximity |
| cos(car, elephant) | ≈ -0.11 | Not aligned, weak or opposite directional relationship |
The key point is not any single coordinate. The key point is the direction of the full vector. Even if the axes rotate across training runs, the geometry that encodes these relationships remains stable.
Meaning is stored in relative direction, not axis names.
Note: these numbers are intentionally tiny and human scale. Real models use larger dimensions, but the metric is the same.
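Cosine similarity is a few lines of NumPy. The vectors below are made up for this sketch; they will not reproduce the exact numbers from a real training run, but they show the same contrast between aligned and unrelated directions:

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity: the cosine of the angle between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Illustrative 5-dimensional vectors, invented for this example.
car      = np.array([0.90, 0.10, 0.00, 0.20, 0.10])
plane    = np.array([0.85, 0.12, 0.05, 0.22, 0.09])
elephant = np.array([-0.10, 0.80, 0.30, -0.40, 0.20])

print(cosine(car, plane))     # close to 1: nearly aligned directions
print(cosine(car, elephant))  # near zero or negative: unrelated directions
```

Note that cosine similarity ignores vector length entirely; only direction matters, which is exactly the point of the section above.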
IV. How Skip Gram Actually Works
So far we have treated embeddings as if they simply existed. But how do we obtain these vectors?
We start with an embedding matrix E of size V × D. At the beginning, its values mean nothing. They are initialized randomly.
Initialization is not entirely arbitrary. In practice, variance scaling methods such as Glorot (Xavier) initialization are used to keep training stable. In our small skip gram experiment, a simple small random initialization is sufficient.
From Sentences to Training Pairs
Let’s make this concrete with a tiny corpus of short sentences built from our controlled vocabulary, such as “the car moves fast”.
Skip gram does not look at whole sentences at once. It slides a window over each sentence and generates (center, context) pairs.
For a sentence of length T and a window size w, the number of training pairs is:

\[
\#\text{pairs} = \sum_{i=1}^{T} \big( \min(i-1,\, w) + \min(T-i,\, w) \big)
\]
Intuition: for each position i, we can pair the center word with up to w words on the left and up to w words on the right, but near the sentence boundaries there are fewer words available.
Example: the sentence “the car moves fast” has T = 4 tokens. With window size w = 2, the total number of pairs is \(2 + 3 + 3 + 2 = 10\).
Here are the 10 pairs produced from that single sentence:

(the, car), (the, moves), (car, the), (car, moves), (car, fast), (moves, the), (moves, car), (moves, fast), (fast, car), (fast, moves)
When you repeat this over every sentence in the corpus, you get a large list of pairs. Training is simply iterating over that list and applying small gradient updates each time.
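The sliding-window procedure is short enough to write out in full. A minimal sketch, assuming whole-word tokens as in the toy setup:

```python
def skip_gram_pairs(sentence, w=2):
    """Slide a window of size w over the sentence, emitting (center, context) pairs."""
    pairs = []
    for i, center in enumerate(sentence):
        # Context positions: up to w to the left and w to the right of i.
        for j in range(max(0, i - w), min(len(sentence), i + w + 1)):
            if j != i:
                pairs.append((center, sentence[j]))
    return pairs

pairs = skip_gram_pairs(["the", "car", "moves", "fast"], w=2)
print(len(pairs))  # 10
```

Near the boundaries the window is clipped, which is why positions at the edges contribute fewer pairs.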
Notice the pattern. Car and plane repeatedly appear near the same context words. Elephant appears in different contexts.
The corpus defines the pressure field that shapes the embedding space.
Two Matrices
Skip gram uses two parameter matrices:
E, of shape V × D, maps token ids to vectors. W, of shape D × V, maps vectors back to vocabulary scores.
Concretely, the center word embedding is a vector \(h \in \mathbb{R}^D\). Multiplying by \(W \in \mathbb{R}^{D \times V}\) produces one score per vocabulary word:

\[
\text{logits} = hW \in \mathbb{R}^V
\]
You can read \(\text{logit}_j\) as the model’s raw score for the hypothesis: “word \(j\) is a plausible context word for this center word”.
Each vocabulary word \(j\) corresponds to one column vector \(W_j \in \mathbb{R}^D\). The score is a dot product:

\[
\text{logit}_j = h \cdot W_j = \sum_{d=1}^{D} h_d \, W_{d,j}
\]
If \(h\) and \(W_j\) point in similar directions, the dot product is large and \(\text{logit}_j\) is high. If they are orthogonal or opposite, the score is small or negative.
W is a decoder: it converts an embedding direction into a distribution over words.
During training, every word effectively has two representations: one as a center word from E, and one as a context word from W. In practice, we often keep only E, or average both spaces. Modern Transformer models frequently tie these matrices together.
One Training Step
Every pair triggers a tiny correction. Shared contexts produce shared corrections. Over many repetitions, geometry emerges.
Words that share contexts receive similar updates.
Geometry is the compressed memory of repeated context.
V. The Math Behind The Updates
We now describe precisely what happens during one training step. Let c be the center word and t the context word observed in the corpus.
1. From Token to Scores
We look up the embedding of the center word:

\[
h = E[c] \in \mathbb{R}^D
\]
We compute a compatibility score between the center embedding and every vocabulary word considered as a possible context. The result is a vector of size V:

\[
\text{logits} = hW
\]
2. Turning Scores into Probabilities
The logits are arbitrary real numbers. We convert them into probabilities using softmax:

\[
P(j \mid c) = \frac{\exp(\text{logit}_j)}{\sum_{k=1}^{V} \exp(\text{logit}_k)}
\]
We use the exponential because it produces positive values, amplifies differences, and stays smoothly differentiable. Most importantly: the derivative of \(\exp(z)\) is \(\exp(z)\), which keeps gradients clean.
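In code, softmax is a couple of lines. Subtracting the maximum logit first is a standard numerical-stability trick that leaves the result mathematically unchanged (an addition here, not something the derivation requires):

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()   # shift for stability; softmax is shift-invariant
    e = np.exp(z)
    return e / e.sum()

p = softmax(np.array([2.0, 1.0, -1.0]))
print(p)  # positive values summing to 1; the largest logit gets the most mass
```

Without the shift, a large logit like 1000 would overflow `np.exp`; with it, the same probabilities come out safely.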
3. Why Compute a Loss and a Gradient
Training is an optimization problem. The model starts with random parameters. It makes predictions. Some predictions are wrong.
We need a numerical measure of how wrong the model is. This measure is the loss, the negative log probability of the observed context word:

\[
L = -\log P(t \mid c)
\]
Measuring error is not enough. We also need a direction for improvement. The gradient tells us how the loss changes when we slightly modify a parameter.
The loss measures how wrong we are.
The gradient tells us how to become less wrong.
Backpropagation is the systematic application of the chain rule. It computes how the loss depends on intermediate quantities and pushes responsibility backward through the computation.
The learning rate \(\eta\) controls how aggressively we correct the vectors. Too small, and learning is painfully slow. Too large, and training becomes unstable and may diverge. In practice we tune \(\eta\) empirically by watching whether the loss decreases smoothly.
4. Where the Gradient Comes From
Start from:

\[
L = -\log P(t \mid c) = -\log \frac{\exp(\text{logit}_t)}{\sum_{k=1}^{V} \exp(\text{logit}_k)}
\]
Using \(\log(a/b)=\log a - \log b\) and \(\log(\exp(x))=x\):

\[
L = -\text{logit}_t + \log \sum_{k=1}^{V} \exp(\text{logit}_k)
\]
Differentiating with respect to \(\text{logit}_j\) yields:

\[
\frac{\partial L}{\partial \text{logit}_j} = P(j \mid c) - y_j
\]
The term \(y_j\) comes from the target distribution defined by the corpus. The corpus tells us that exactly one word is correct: the observed context word \(t\).
We encode this information using a vector \(y \in \mathbb{R}^V\) defined as:

\[
y_j =
\begin{cases}
1 & \text{if } j = t \\
0 & \text{otherwise}
\end{cases}
\]
This vector contains zeros everywhere except at the index of the true word. It is called a one hot vector.
For all \(j \neq t\), \(y_j = 0\), and the term \(-\text{logit}_t\) does not depend on \(\text{logit}_j\), so those gradient components are simply \(P(j \mid c)\).
For the true word \(j = t\), we have:

\[
\frac{\partial L}{\partial \text{logit}_t} = P(t \mid c) - 1
\]
In vector form, this becomes:

\[
\nabla_{\text{logits}} L = P - y
\]
The one hot vector is not an implementation trick.
It encodes the fact that the corpus provides one observed reality.
The gradient \(P - y\) is simply the distance between belief and observation.
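The identity \(\partial L / \partial \text{logit}_j = P_j - y_j\) can be verified numerically with central finite differences. This sanity check was written for this article; it is not part of the training code:

```python
import numpy as np

def loss(logits, t):
    """Cross-entropy loss L = -log P(t | c), with a stable softmax."""
    z = logits - logits.max()
    p = np.exp(z) / np.exp(z).sum()
    return -np.log(p[t])

logits = np.array([0.5, -1.2, 2.0, 0.3])
t = 2                                   # index of the observed context word

# Analytic gradient from the derivation: P - y
z = logits - logits.max()
P = np.exp(z) / np.exp(z).sum()
y = np.zeros_like(logits)
y[t] = 1.0
analytic = P - y

# Numerical gradient via central differences
eps = 1e-6
numeric = np.zeros_like(logits)
for j in range(len(logits)):
    plus, minus = logits.copy(), logits.copy()
    plus[j] += eps
    minus[j] -= eps
    numeric[j] = (loss(plus, t) - loss(minus, t)) / (2 * eps)

print(np.max(np.abs(analytic - numeric)))  # tiny: the formula checks out
```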
5. Interpreting the Update
For the true word:

\[
\frac{\partial L}{\partial \text{logit}_t} = P_t - 1
\]
If \(P_t\) is too small, then \(P_t - 1\) is negative. In gradient descent we subtract the gradient, so a negative gradient increases \(\text{logit}_t\). This raises the probability of the correct context word.
The correct word is pushed up. All others are pushed down.
6. Updating Both Matrices
At each training step, all columns of W are updated. Only the embedding vector of the current center word is updated in E. Over the entire corpus, every word eventually becomes a center word, so all embeddings are trained.
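One full training step can be sketched end to end. The variable names (`E`, `W`, `h`, `P`, `eta`) follow the text; the sizes and the specific pair are arbitrary toy choices, and the article's actual scripts may differ:

```python
import numpy as np

rng = np.random.default_rng(0)
V, D, eta = 8, 5, 0.1                  # toy vocab size, dimension, learning rate
E = rng.normal(0.0, 0.1, size=(V, D))  # input embeddings: one row per word
W = rng.normal(0.0, 0.1, size=(D, V))  # output matrix: one column per word

c, t = 1, 3                            # a single (center, context) pair
losses = []
for _ in range(50):
    h = E[c].copy()                    # 1. look up the center embedding
    logits = h @ W                     # 2. one score per vocabulary word
    z = logits - logits.max()
    P = np.exp(z) / np.exp(z).sum()    # 3. softmax probabilities
    y = np.zeros(V)
    y[t] = 1.0
    grad_logits = P - y                # 4. gradient at the logits
    grad_h = W @ grad_logits           # backprop into the center embedding
    W -= eta * np.outer(h, grad_logits)  # every column of W is updated
    E[c] -= eta * grad_h                 # only row c of E is updated
    losses.append(-np.log(P[t]))

print(losses[0] > losses[-1])  # True: repeated corrections reduce the loss
```

Note the asymmetry the text describes: `np.outer(h, grad_logits)` touches all of W, while only `E[c]` changes on the input side.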
7. Geometric Interpretation
Since:

\[
\text{logit}_t = h \cdot W_t
\]
Increasing \(\text{logit}_t\) increases alignment between the center embedding and the output vector of the true context word. Over many steps, words that share contexts get pulled in similar directions.
Semantics emerges from repeated directional adjustments.
VI. The Experiment Setup
The setup is intentionally minimal. A vocabulary file, a corpus made only from that vocabulary, an encoder that converts words to ids, and a training script that learns an embedding matrix.
The goal: build embeddings from scratch and show why axes are arbitrary while distances and directions remain useful.
The method: full softmax skip gram with a small vocabulary. No word2vec library. No tricks.
Pipeline overview
| Step | Input / Script | Output | What It Proves |
|---|---|---|---|
| 1. Create vocabulary | vocab.txt | Vocabulary size V | We control the world the model can speak |
| 2. Index vocabulary | create_index.py | vocab_index.txt | Words become ids |
| 3. Write corpus | Sentences | Raw corpus | Training distribution is the real teacher |
| 4. Encode sentences | encode_sentence.py | corpus_encoded.txt | Model never sees words, only integers |
| 5. Pick embedding dimension | D | Vector size | Capacity changes, meaning still emerges |
| 6. Train skip gram | main_skip_gram.py | E.npy | Geometry emerges from prediction |
| 7. Explore neighbors | explore.py | Nearest words | Distances matter more than axes |
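Step 7 amounts to a nearest-neighbor search by cosine similarity over the rows of E. A sketch of what such an exploration script might do; the word list and vectors below are placeholders, not the contents of the real explore.py or E.npy:

```python
import numpy as np

rng = np.random.default_rng(0)
words = ["car", "plane", "elephant", "fast", "moves", "the"]  # placeholder vocab
E = rng.normal(size=(len(words), 20))                         # stand-in for E.npy

def neighbors(word, k=3):
    """Return the k nearest words to `word` by cosine similarity."""
    i = words.index(word)
    norms = np.linalg.norm(E, axis=1)
    sims = (E @ E[i]) / (norms * norms[i])      # cosine against every row at once
    order = np.argsort(-sims)                   # most similar first
    return [(words[j], round(float(sims[j]), 3)) for j in order if j != i][:k]

print(neighbors("car"))
```

With random vectors the neighbors are meaningless; after training on the controlled corpus, this is exactly the query that shows plane near car and elephant far away.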
Placeholder: screenshots of vocab, corpus, and encoded lines will go here.
VII. Scaling Intuition Toward Transformers
A Transformer does not abandon this story. It scales it. Token embeddings still start as learned vectors. The difference is that the Transformer builds context dependent vectors on top of them through attention.
Skip gram teaches the static geometry. Transformers add dynamic geometry, one vector per token position, shaped by the entire sentence.
Placeholder: a small diagram showing static word embedding versus contextual token embedding.
VIII. GitHub, Data, and Reproducibility
This article is meant to be runnable. I will publish the code, the vocabulary, and a sample corpus. If you change the corpus, the space changes. If you change the seed, the basis changes. If you change the dimension, capacity changes. The geometry will still emerge from prediction.
That is the whole point. Embeddings are not invented. They are trained.
GitHub: download the code, vocab, and corpus.