# Encoder, Decoder and Encoder-Decoder: The Three Transformer Families

## The short answer

There are three ways to build a transformer. Decoder-only models like
GPT generate text one token at a time. They can only see what came
before. Encoder-only models like BERT look at the entire input at once
in both directions. They understand text but are not built to generate it.
Encoder-decoder models like T5 read the full input with an encoder and
generate output with a decoder. They are built for tasks that transform
one piece of text into another.

This guide builds a decoder-only model. The same architecture family as
GPT and LLaMA and Mistral. This file explains why, and what the other
options do.

## Decoder-Only (GPT family)

### What it looks like

```
Input: "The cat sat on the"
  → Token embedding
    → Causal attention (can only look backward)
      → Feed forward
        → Repeat N times
          → Output projection
            → Predict next token: "mat"
```

The key feature is the causal mask. Every token can only attend to
itself and to tokens that came before it. Token 5 can see tokens 0
through 5. Token 5 cannot see token 6 because at generation time token
6 has not been written yet.
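A minimal sketch of the mask itself, in plain Python. The function name `causal_mask` is made up for this example. It follows the usual convention that a position may attend to itself as well as to earlier positions, which is why the comparison is `j <= i`.

```python
def causal_mask(seq_len):
    """Return a seq_len x seq_len grid of booleans.

    mask[i][j] is True when position i is allowed to attend to position j.
    """
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]


mask = causal_mask(4)
for row in mask:
    # "1" = visible, "." = hidden by the mask
    print(" ".join("1" if allowed else "." for allowed in row))
# 1 . . .
# 1 1 . .
# 1 1 1 .
# 1 1 1 1
```

Each row is one token. The lower triangle is what that token is allowed to look at.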

### How it's trained

Decoder-only models are trained on next token prediction. Show the
model a sequence of tokens. Ask it to predict each next token. The
model learns to guess what comes next. This is called autoregressive
training.

```
Training example:
  Input:  [The, cat, sat, on, the]
  Target: [cat, sat, on, the, mat]

  The model sees "The" and must predict "cat"
  The model sees "The cat" and must predict "sat"
  The model sees "The cat sat" and must predict "on"
  ... and so on
```
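The shift between input and target in the example above can be sketched in a few lines. The helper name `next_token_pairs` is an assumption for illustration, and words stand in for real token ids.

```python
def next_token_pairs(tokens):
    """Split one token sequence into (inputs, targets) shifted by one."""
    inputs = tokens[:-1]   # everything except the last token
    targets = tokens[1:]   # everything except the first token
    return inputs, targets


tokens = ["The", "cat", "sat", "on", "the", "mat"]
inputs, targets = next_token_pairs(tokens)

for i, target in enumerate(targets):
    context = " ".join(inputs[: i + 1])
    print(f"sees {context!r} -> must predict {target!r}")
```

One sequence of six tokens yields five training predictions, all computed in a single pass.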

Every prediction is made using only past tokens. The model never sees
the future. The causal mask enforces this during training, where the
whole sequence is fed in at once. During generation the token being
produced needs no extra masking because future tokens simply do not
exist yet.
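To make the generation side concrete, here is a toy loop. `toy_next_token` is a hypothetical stand-in for a trained model: it is just a lookup table on the last token, and `<eos>` is an assumed end marker. A real model would return a probability distribution over the vocabulary instead.

```python
def toy_next_token(context):
    """Hypothetical stand-in for a trained model: looks up the last token."""
    table = {"The": "cat", "cat": "sat", "sat": "on", "on": "the", "the": "mat"}
    return table.get(context[-1], "<eos>")


def generate(prompt, max_new_tokens=10):
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        # The "model" only ever receives tokens that already exist.
        next_token = toy_next_token(tokens)
        if next_token == "<eos>":
            break
        tokens.append(next_token)
    return tokens


print(generate(["The"]))
# ['The', 'cat', 'sat', 'on', 'the', 'mat']
```

Each new token is appended and fed back in. That feedback loop is what autoregressive means.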

### What it's good at

Text generation. Writing stories. Answering questions. Having
conversations. Completing code. Any task where you produce new text
one token at a time.

Decoder-only models are the universal tool. With enough scale and the
right training data they can do almost anything. GPT-3 showed this in
2020. The model could translate and summarize and answer questions
despite being trained only to predict the next word. It learned these
skills implicitly because the training data contained examples of
translation and summarization and question answering.

### Why we chose it

Decoder-only models are the simplest to build and train. One task.
Predict the next token. One architecture. Causal attention whose output
goes through a feed forward network. No separate encoder. No cross
attention between encoder and decoder. One stack of identical blocks.

They have also scaled well in practice. The capability jumps from
GPT-2 to GPT-3 to GPT-4 all came from models trained to predict the
next token. The simplicity of the architecture means all resources go
into making the model bigger and the data better. There is no
complexity budget spent on additional components.

### The limitation

Decoder-only models cannot look at the full input bidirectionally.
Token 5 cannot use information from token 10 because the causal mask
hides it, even when the whole input is already known. This is fine for
generation but suboptimal for understanding tasks where the entire
input is available from the start.

For tasks like classification or named entity recognition where you
have the complete input a bidirectional model can capture context from
both directions. A decoder-only model can only capture context from the
left. In practice this matters less than you might think. The last
token still sees the whole input. And with enough scale a decoder-only
model tends to compensate for the missing right context by building
rich representations that anticipate what comes next.

## Encoder-Only (BERT family)

### What it looks like

```
Input: "The cat sat on the [MASK]"
  → Token embedding
    → Bidirectional attention (can look everywhere)
      → Feed forward
        → Repeat N times
          → Output projection
            → Predict masked token: "mat"
```

The key feature is bidirectional attention. Every token can attend to
every other token regardless of position. There is no causal mask.
Token 5 can see both token 0 and token 10.
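The difference between the two attention patterns shows up in the attention weights. The sketch below is an illustration only: `attention_weights` is a made-up helper, and the scores are all set to zero as a placeholder, so every visible position gets an equal share.

```python
import math


def attention_weights(scores, mask):
    """Row-wise softmax over scores, ignoring positions where mask is False."""
    weights = []
    for score_row, mask_row in zip(scores, mask):
        # Hidden positions get -inf so exp() turns them into exactly 0.
        masked = [s if ok else -math.inf for s, ok in zip(score_row, mask_row)]
        top = max(masked)
        exps = [math.exp(s - top) for s in masked]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


n = 4
scores = [[0.0] * n for _ in range(n)]  # placeholder scores, all equal

bidirectional = [[True] * n for _ in range(n)]
causal = [[j <= i for j in range(n)] for i in range(n)]

print(attention_weights(scores, bidirectional)[1])  # [0.25, 0.25, 0.25, 0.25]
print(attention_weights(scores, causal)[1])         # [0.5, 0.5, 0.0, 0.0]
```

With the bidirectional mask, token 1 spreads its attention over all four positions. With the causal mask, the same token puts zero weight on positions 2 and 3.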

### How it's trained

Encoder-only models are trained on masked language modeling. Hide a
random subset of the input tokens (about 15% in the original BERT).
Ask the model to predict what was hidden.

```
Training example:
  Original: "The cat sat on the mat"
  Masked:   "The cat [MASK] on the [MASK]"
  Target:   "sat" and "mat"

  The model sees the whole sentence including words after the mask.
  It uses context from both directions to predict the hidden words.
```

The model sees the entire input at once. It can use information from
words before AND after the masked token. This is fundamentally
different from decoder-only training where the model is blind to the
future.
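The masking step can be sketched as follows. This is a simplified version: `mask_tokens`, `mask_prob` and `seed` are names chosen for this example, and the original BERT recipe additionally replaces some selected tokens with a random token or leaves them unchanged instead of always writing `[MASK]`.

```python
import random


def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Hide a random subset of tokens and record what was hidden.

    Returns (masked_tokens, targets) where targets maps position -> original.
    mask_prob and seed are illustrative parameters, not from the text.
    """
    rng = random.Random(seed)
    masked = list(tokens)
    targets = {}
    for i, token in enumerate(tokens):
        if rng.random() < mask_prob:
            masked[i] = "[MASK]"
            targets[i] = token  # the model is only scored on these positions
    return masked, targets


tokens = "The cat sat on the mat".split()
masked, targets = mask_tokens(tokens, mask_prob=0.3)
print(masked)
print(targets)
```

The loss is computed only at the hidden positions. The unmasked words on both sides are the context the model gets to use.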

### What it's good at

Understanding tasks. Classification. Named entity recognition. Question
answering where the answer is in the provided text. Sentiment analysis.
Any task where the input is complete and the output is a label or a
span of text rather than a generated sequence.

BERT embeddings became a standard way to represent text. For years a
leading approach for many NLP tasks was to take a pretrained BERT model
and add a small task specific head on top. Fine-tune for a few epochs.
The approach worked because BERT's bidirectional understanding captured
rich representations of word meaning in context.

### The limitation

Encoder-only models cannot generate text autoregressively. They have
no causal mask. They have no mechanism to produce one token at a time
conditioned on previous outputs. You cannot use BERT to write a story
or hold a conversation.

Encoder-only models are also limited by their training objective. Masked
language modeling teaches the model to fill in blanks. It does not teach
the model to produce coherent sequences. You can generate text by
iteratively masking and predicting but the output is typically worse
than what a decoder-only model produces.

## Encoder-Decoder (T5 family)

### What it looks like

```
Input: "Translate to French: The cat sat on the mat"
  → Encoder (bidirectional attention)
    → Hidden representation of the full input
      → Decoder (causal attention + cross attention)
        → Output: "Le chat s'est assis sur le tapis"
```

The encoder reads the entire input bidirectionally. It produces a dense
representation of the input. The decoder generates the output
autoregressively one token at a time. The decoder has both causal self
attention like a GPT and cross attention that looks at the encoder's
output.

The cross attention is the key difference from decoder-only models. At
every generation step the decoder can look back at the full encoded
input. This gives the decoder direct access to the input representation
without needing to encode it in the autoregressive state.
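A stripped-down sketch of cross attention, assuming a single head and no learned projection matrices (a real layer would project queries, keys and values first). The function names and the tiny vectors are invented for illustration.

```python
import math


def softmax(xs):
    top = max(xs)
    exps = [math.exp(x - top) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(decoder_states, encoder_states):
    """Each decoder position takes a weighted average of encoder states.

    Queries come from the decoder; keys and values come from the encoder.
    """
    dim = len(encoder_states[0])
    outputs = []
    for query in decoder_states:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in encoder_states
        ]
        weights = softmax(scores)  # one weight per encoder position, no mask
        outputs.append([
            sum(w * value[d] for w, value in zip(weights, encoder_states))
            for d in range(dim)
        ])
    return outputs


encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3 input positions
decoder_states = [[0.5, 0.5], [2.0, 0.0]]              # 2 output positions so far

out = cross_attention(decoder_states, encoder_states)
print(len(out), len(out[0]))  # 2 2
```

The output has one vector per decoder position, but each one is built entirely from encoder states. No mask is applied because the whole input is legitimately available.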

### How it's trained

Encoder-decoder models are trained on sequence to sequence tasks. Show
the model an input sequence and a target output sequence. The encoder
processes the input. The decoder generates the output one token at a
time.

```
Training example:
  Input:  "Summarize: The cat sat on the mat for three hours..."
  Target: "A cat stayed on a mat for a long time."

  The encoder reads the full input bidirectionally.
  The decoder generates "A" then "cat" then "stayed" and so on.
  At each step the decoder can cross attend to the encoder's output.
```

The training uses teacher forcing. The decoder is given the correct
previous tokens during training. The model learns to produce the next
token given the input and the correct history.
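Teacher forcing comes down to how the decoder's inputs and labels are lined up. In this sketch `teacher_forcing_pairs`, `<bos>` and `<eos>` are hypothetical names. Real models pick their own start and end markers.

```python
def teacher_forcing_pairs(target_tokens, bos="<bos>", eos="<eos>"):
    """Build decoder inputs and labels for one training example.

    bos and eos are hypothetical marker names chosen for this sketch.
    """
    decoder_inputs = [bos] + list(target_tokens)  # correct history, shifted right
    labels = list(target_tokens) + [eos]          # what to predict at each step
    return decoder_inputs, labels


target = ["A", "cat", "stayed", "on", "a", "mat"]
decoder_inputs, labels = teacher_forcing_pairs(target)

for step, label in enumerate(labels):
    history = decoder_inputs[: step + 1]
    print(history, "->", label)
```

The history at every step comes from the reference output, not from the model's own guesses. A wrong prediction at step 2 does not corrupt the input at step 3.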

### What it's good at

Sequence to sequence tasks. Translation. Summarization. Any task where
the input and output are both text but have different lengths or
structures.

Encoder-decoder models separate the concerns. The encoder focuses on
understanding the input. The decoder focuses on generating the output.
This division of labor can be more efficient than a decoder-only model
which must do both in a single stack of layers.

### The limitation

Encoder-decoder models are more complex. Two separate stacks of layers.
Cross attention between them. Often more parameters for comparable
quality on general language tasks. The architecture is specialized for
sequence to sequence tasks and less flexible for open ended generation.

The rise of decoder-only models has reduced the popularity of encoder
decoder architectures. A large enough decoder-only model can implicitly
perform the separation that an encoder-decoder model makes explicit.
GPT-3 showed this for translation and summarization. The decoder-only
model learned to understand the input and generate the output in a
single stack of layers.

## Why this guide teaches decoder-only

Decoder-only models are the foundation of most modern large language
models. ChatGPT is built on the decoder-only GPT models. LLaMA and
Mistral are decoder-only models. Claude is widely understood to be one
too. Understanding how they work means understanding the architecture
behind the most capable AI systems built so far.

The architecture is also the simplest. One stack of blocks. One
attention pattern with causal masking. One training objective. Next
token prediction. The simplicity makes it the best starting point for
learning. Once you understand the decoder-only transformer you can
understand any transformer variant.

Encoder-only models are still widely used. BERT and its variants power
search engines and classification systems and information retrieval.
But they are not built to generate text. Understanding them is useful
for specialized applications but not essential for building generative
AI.

Encoder-decoder models are becoming less common. The gap between
decoder-only and encoder-decoder performance has narrowed. For most
practical purposes a large decoder-only model matches or exceeds an
encoder-decoder model on the same task. The added complexity is harder
to justify.

## When to use each

```
Do you need to generate new text token by token?
  → Decoder-only (GPT, LLaMA, Mistral)

Do you need to understand text and produce a label or classification?
  → Encoder-only (BERT, RoBERTa, DeBERTa)

Do you need to transform text from one form to another and want the
best possible quality for a specific task?
  → Encoder-decoder (T5, BART)

Do you want one architecture that can do everything reasonably well
and is simple to understand and build?
  → Decoder-only
```

## What you need to remember

Decoder-only models generate text one token at a time using only past
context. They are trained on next token prediction. This is the GPT
family and what this entire guide teaches.

Encoder-only models understand text bidirectionally using the full
context. They are trained on masked language modeling. This is the
BERT family. They are not built to generate text.

Encoder-decoder models combine both. An encoder reads the input
bidirectionally. A decoder generates the output autoregressively with
cross attention to the encoder. This is the T5 family. They are built
for sequence to sequence tasks.

All three use the same building blocks. Attention. Feed forward
networks. Residual connections. Normalization. The main differences are
the attention mask pattern, the training objective, and the cross
attention that encoder-decoder models add. Master the decoder-only
architecture and you have mastered the foundation of all modern
language models.