# Weight Tying: Two Jobs, One Matrix

## What is it

Weight tying means using the same weight matrix for the embedding layer and the output layer. The embedding layer turns token IDs into vectors. The output layer turns vectors back into token scores (logits), which a softmax converts into probabilities. The two layers do roughly opposite jobs, so they can share one matrix.

Think of it like a bilingual dictionary. You can look up an English word to find its French translation. Or you can look up a French word to find its English translation. It is the same dictionary used in two directions. Weight tying is the same idea applied to neural networks. One matrix serves both the input side and the output side.

## Where is it used

Weight tying connects the very first layer and the very last layer of the model. The token embedding and the language model head share weights.

```
Input tokens
  → Token Embedding (matrix A of size 50257 × 768)
  → Transformer blocks
  → Final normalization
  → LM Head (shares matrix A, used as 768 × 50257)
  → Output logits
```

In code it is a single line.

```python
self.token_embedding.weight = self.lm_head.weight
```

This makes both attributes point to the same tensor in memory. Changing one changes the other because they are literally the same bytes. The assignment works without a transpose because PyTorch stores the weight of `nn.Linear(768, 50257)` as a 50257 × 768 tensor, which is the same shape as the embedding table.

## Why we use it

The most obvious reason is saving parameters. The embedding matrix has size vocabulary times embedding dimension. For GPT-2 Small that is 50257 times 768, which equals about 38.6 million numbers. The output matrix has the same size. Without tying these would be two separate matrices consuming about 77 million parameters just for input and output. With tying they become one matrix. We save 38.6 million parameters. GPT-2 Small has about 124 million parameters with tying, so the saved matrix is equivalent to roughly thirty percent of the model.

For larger models the absolute savings are even greater, although they are a smaller fraction of the total. The full GPT-3 model, with a vocabulary of 50257 and an embedding dimension of 12288, would spend over 600 million parameters on a second copy of the embedding matrix. Those parameters can be better spent on more transformer blocks.

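
The arithmetic above can be reproduced in a few lines. This is a minimal sketch using only the vocabulary and dimension figures quoted in the text; the helper name `embedding_params` is hypothetical and introduced here for illustration.

```python
# Count the parameters in one vocabulary x dimension matrix.
# `embedding_params` is a hypothetical helper, not part of any library.
def embedding_params(vocab_size: int, dim: int) -> int:
    return vocab_size * dim

# GPT-2 Small figures from the text
one_matrix = embedding_params(50257, 768)
untied_total = 2 * one_matrix   # separate embedding and output matrices
tied_total = one_matrix         # one shared matrix
saved = untied_total - tied_total

print(f"One matrix:   {one_matrix:,}")    # 38,597,376
print(f"Untied total: {untied_total:,}")  # 77,194,752
print(f"Saved:        {saved:,}")         # 38,597,376

# Full GPT-3 figures from the text
gpt3_matrix = embedding_params(50257, 12288)
print(f"GPT-3 matrix: {gpt3_matrix:,}")   # 617,558,016
```

The saved amount always equals the size of one matrix, so it grows linearly with both the vocabulary size and the embedding dimension.
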
The less obvious reason is better learning. The embedding matrix is the gateway into the model. Every token passes through it on the way in. The output matrix is the gateway out of the model. Every prediction passes through it on the way out. When the model gets a prediction wrong, the gradient flows backward through the output matrix and then all the way back to the embedding matrix. Since they are the same matrix, it receives gradient from two places: its use in the output layer and its use in the embedding lookup. Both contributions are added to the same numbers. This dual signal tends to help each embedding vector converge to a better representation.

The third reason is mathematical elegance. The embedding layer maps token IDs to vectors. The output layer maps vectors to token scores. If the model has learned good embeddings, then the same vectors that represent a token on the input side should be useful for predicting that token on the output side. Tying the weights enforces this consistency. A token's embedding vector is also the vector that the model uses to score that token as a possible next word. If the token *cat* has embedding vector v, then the model's score for predicting *cat* is the dot product of the final hidden state with v. The embedding serves double duty as a representation and as a classification weight.

## When was it invented

Weight tying was used in the original transformer paper in 2017. It was not a new idea at the time. Press and Wolf, and independently Inan et al., had proposed tying the input and output embeddings of recurrent language models shortly before. Note that word2vec, published in 2013, is not an example: it keeps separate input and output vector tables. Tying has been common practice for language models ever since. GPT-2 uses weight tying, and GPT-3 follows the GPT-2 architecture. Every model in this tutorial uses weight tying.

There are cases where weight tying is not used. Some large models, including the original LLaMA models, keep separate embedding and output matrices, which lets the output layer learn weights that differ from the input embeddings.

But for small and medium models like ours, weight tying is a sensible default. The parameter savings are too large to ignore and the dual gradient signal is helpful during training.

## How it works in practice

Let us trace what happens when we train with tied weights.

### Forward pass with tied weights

```
Step 1: Token 3797 ("cat") enters the model
Step 2: Embedding layer looks up row 3797 of matrix A
Step 3: Row 3797 is the embedding vector for "cat" [768 numbers]
Step 4: The vector flows through transformer blocks
Step 5: The hidden state reaches the output layer
Step 6: Output layer multiplies hidden state by matrix A^T
Step 7: Column 3797 of A^T is row 3797 of A
Step 8: This is the same vector that represented "cat" on input
Step 9: The dot product gives the score for predicting "cat"
```

Notice that the same vector appears twice. Once as the representation of *cat* at the input. Once as the classification weight that scores *cat* at the output. The model is forced to make these two uses consistent.

### Backward pass with tied weights

```
Step 1: The model predicts wrong (true word was "mat", but it put probability on "dog" and "cat")
Step 2: Loss is computed
Step 3: Gradient flows to the output layer
Step 4: Row 3797 of matrix A receives a gradient (because "cat" was one of the wrong predictions)
Step 5: The gradient also flows back through the model
Step 6: Eventually it reaches the embedding layer
Step 7: Row 3797 of matrix A receives a second gradient (because "cat" appeared in the input)
Step 8: Both gradients are summed and applied to the same numbers in one optimizer update
```

The embedding for *cat* receives two gradient contributions per training step. One for its role as an input token. One for its role as a potential output token. They are summed into a single update. This double signal tends to help the embedding vectors learn faster and reach better representations.

## Verifying weight tying in code

You can check that two tensors share memory in PyTorch.

```python
import torch

# Create an embedding and an output layer
embedding = torch.nn.Embedding(1000, 768)
output = torch.nn.Linear(768, 1000, bias=False)

# Tie the weights: both modules now hold the same Parameter
embedding.weight = output.weight

# Verify they share memory
print(f"Same object: {embedding.weight is output.weight}")
print(f"Same memory: {embedding.weight.data_ptr() == output.weight.data_ptr()}")

# Modify one and see the other change.
# In-place edits to a parameter that requires grad must run under no_grad().
old_value = embedding.weight[42, 0].item()
with torch.no_grad():
    output.weight[42, 0] = 99.9
new_value = embedding.weight[42, 0].item()

print("\nAfter changing output.weight[42,0] to 99.9:")
print(f"Embedding weight[42,0] changed from {old_value:.3f} to {new_value:.1f}")
print("They are the same tensor. Changing one changes both.")
```

Running this code produces output like the following. The old value comes from random initialization, so it will differ from run to run.

```
Same object: True
Same memory: True

After changing output.weight[42,0] to 99.9:
Embedding weight[42,0] changed from 0.023 to 99.9
They are the same tensor. Changing one changes both.
```

## The parameter savings by model size

The table shows the size of one vocabulary × dimension matrix, which is the number of parameters that tying saves for a model of that shape.

```
Model          Vocab        Dim      Weight Tying Savings
GPT-2 Small    50,257  ×     768  =   38.6 million params
GPT-2 Medium   50,257  ×   1,024  =   51.5 million params
GPT-2 Large    50,257  ×   1,280  =   64.3 million params
LLaMA 7B       32,000  ×   4,096  =  131.1 million params
LLaMA 70B      32,000  ×   8,192  =  262.1 million params
GPT-3 (full)   50,257  ×  12,288  =  617.6 million params
```

These savings are why weight tying is so widely used, especially in smaller models. You get the benefits of shared embeddings for the cost of zero extra parameters. In fact you get fewer parameters and, often, better training at the same time. The main cost is that the input and output representations can no longer differ, which is a small tradeoff for most models of this size.

## What you need to remember

Weight tying makes the embedding layer and the output layer share the same weight matrix. One line of code saves tens or hundreds of millions of parameters. The shared matrix gets gradient signals from both the input and output directions, which tends to produce better embeddings. Many language models use this technique, particularly smaller ones where the embedding matrix is a large share of the parameters. For those models it is close to free performance.
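
The claim that the shared matrix receives gradient from both of its roles can be checked with a small experiment. The sketch below is a toy setup, not the tutorial's model: it skips the transformer blocks entirely and feeds the embedding lookup straight into the output projection. The sizes, token IDs, and names (`emb_w`, `out_w`, `tied_w`, `loss_fn`) are all hypothetical. It compares an untied pair of matrices against a single tied matrix that starts from the same values.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
vocab, dim = 10, 4  # toy sizes, chosen only for illustration

# Three copies of the same starting values:
# two separate matrices (untied) and one shared matrix (tied).
W = torch.randn(vocab, dim)
emb_w = W.clone().requires_grad_(True)
out_w = W.clone().requires_grad_(True)
tied_w = W.clone().requires_grad_(True)

tokens = torch.tensor([1, 2, 3])   # input token IDs
targets = torch.tensor([2, 3, 4])  # next-token targets

def loss_fn(embedding_weight, output_weight):
    hidden = F.embedding(tokens, embedding_weight)  # row lookup, shape (3, dim)
    logits = hidden @ output_weight.T               # scores, shape (3, vocab)
    return F.cross_entropy(logits, targets)

loss_fn(emb_w, out_w).backward()    # untied: each matrix gets its own gradient
loss_fn(tied_w, tied_w).backward()  # tied: one matrix plays both roles

# The tied gradient is the sum of the two untied gradients.
print(torch.allclose(tied_w.grad, emb_w.grad + out_w.grad, atol=1e-6))

# Token 0 never appears in the input, so its embedding row gets no
# input-side gradient, but its output row still gets one from the softmax.
print(emb_w.grad[0].abs().sum().item() == 0.0)
print(out_w.grad[0].abs().sum().item() > 0.0)
```

The last two checks illustrate why the output-side signal matters: in the untied case, a token that does not appear in the input batch gets no update to its embedding row, while the tied matrix still updates that row through the output layer.
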