Attention Is All You Need: Building the Original Transformer that Started the LLM Revolution

In June 2017, eight researchers at Google published “Attention Is All You Need”. The paper proposed replacing recurrent neural networks entirely with a mechanism called self-attention. The architecture they introduced - the transformer - is now the foundation of GPT, Claude, BERT, and essentially every large language model that exists.

I previously built a decoder-only transformer (the GPT architecture) and trained it on Poe. That model generates text. But the original paper wasn’t about text generation at all. It was about translation - English to German, English to French. And it used a different architecture: an encoder-decoder transformer with cross-attention.

So I built that too, from scratch in PyTorch. Trained on 50,000 English-French sentence pairs from Tatoeba, it has 11.5 million parameters, trains in about 40 minutes on a laptop, and produces surprisingly accurate translations.

What the paper proposed

Before 2017, sequence-to-sequence models used recurrent neural networks - LSTMs and GRUs. They processed tokens one at a time, left to right, maintaining a hidden state. This had two problems.

First, sequential processing is slow. You can’t parallelise across positions because each step depends on the previous one. Training on long sequences takes forever.

Second, long-range dependencies are hard. The hidden state has to compress the entire history of the sequence into a fixed-size vector. By the time the model reaches the end of a long sentence, information from the beginning has been diluted through dozens of nonlinear transformations.

The transformer fixes both problems with self-attention: every position can attend to every other position directly, in a single step, in parallel. There’s no recurrence and no sequential bottleneck.

The architecture

The original transformer has two halves: an encoder and a decoder.

ENCODER                          DECODER
--------                         --------
Input embedding                  Output embedding (shifted right)
+ Sinusoidal position            + Sinusoidal position

N x [                            N x [
  Self-attention (bidirectional)   Masked self-attention (causal)
  Feed-forward                     Cross-attention (to encoder)
]                                  Feed-forward
                                 ]

                                 Linear -> Softmax -> Output

The encoder reads the full input (English sentence) and produces a contextualised representation. The decoder generates the output (French sentence) one token at a time, attending both to its own previous tokens and to the encoder’s output.

Three types of attention, all using the same mechanism:

Encoder self-attention. Bidirectional - every English token can attend to every other English token. “The” can look at “table” and “table” can look at “The”. No mask.

Decoder self-attention. Causal - each French token can only attend to previous French tokens. The model can’t cheat by looking ahead at the answer. Same mask as GPT.

Cross-attention. The decoder attends to the encoder’s output. This is where translation happens. When generating “chat”, the decoder’s cross-attention focuses on “cat” in the encoder representation. The queries come from the decoder, the keys and values come from the encoder.

class DecoderLayer(nn.Module):
    def forward(self, tgt, memory, tgt_mask, memory_mask):
        # Masked self-attention (can't look ahead)
        attn_out = self.self_attn(tgt, tgt, tgt, tgt_mask)
        tgt = self.norm1(tgt + attn_out)
        # Cross-attention to encoder (the translation bridge)
        cross_out = self.cross_attn(tgt, memory, memory, memory_mask)
        tgt = self.norm2(tgt + cross_out)
        # Feed-forward
        ff_out = self.ff(tgt)
        tgt = self.norm3(tgt + ff_out)
        return tgt

That cross_attn(tgt, memory, memory) call is the entire difference between this architecture and GPT. The first argument is the query (what the decoder is looking for), the second and third are keys and values (what the encoder provides). It’s asking: “given where I am in the French output, which parts of the English input should I pay attention to?”
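
To make that routing concrete, here's a minimal sketch using PyTorch's built-in F.scaled_dot_product_attention. It skips the learned projections and the repo's own multi-head module entirely - it only shows which tensor plays which role:

import torch
import torch.nn.functional as F

d_model, src_len, tgt_len = 64, 7, 5
memory = torch.randn(1, src_len, d_model)   # encoder output for the English sentence
tgt = torch.randn(1, tgt_len, d_model)      # decoder stream for the French prefix so far

# Decoder self-attention: Q, K, V all come from the decoder stream, causal mask on.
x = F.scaled_dot_product_attention(tgt, tgt, tgt, is_causal=True)

# Cross-attention: Q from the decoder, K and V from the encoder output.
# No causal mask - every French position may look at every English position.
x = F.scaled_dot_product_attention(x, memory, memory)

print(x.shape)   # torch.Size([1, 5, 64]) - one vector per French position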

Scaled dot-product attention

All three attention types use the same core operation: scaled dot-product attention.

import math
import torch.nn.functional as F

scale = math.sqrt(head_dim)                   # head_dim = d_k, the per-head dimension
scores = (Q @ K.transpose(-2, -1)) / scale    # (.., seq_q, seq_k) similarity matrix
attn = F.softmax(scores, dim=-1)              # each row sums to 1: the attention weights
output = attn @ V                             # weighted average of the values

Q, K, V are projections of the input. The dot product Q @ K^T measures similarity between each query position and each key position. Dividing by sqrt(d_k) prevents the dot products from growing too large for softmax to handle (this is the “scaled” part - without it, gradients vanish in the saturated regions of softmax). The result is a weighted average of the values.
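
A quick sanity check of that claim, using nothing but random vectors: for components drawn from a standard normal, the dot product of two d_k-dimensional vectors has a standard deviation of roughly sqrt(d_k), so dividing by sqrt(d_k) brings the logits back to unit scale before the softmax.

import torch

d_k = 64
q = torch.randn(10000, d_k)
k = torch.randn(10000, d_k)
dots = (q * k).sum(dim=-1)

print(dots.std())               # ~8.0, i.e. sqrt(64): logits this large saturate the softmax
print((dots / d_k**0.5).std())  # ~1.0 after scaling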

Multi-head attention runs this operation multiple times in parallel with different learned projections, then concatenates the results. Eight heads means the model can attend to eight different types of relationships simultaneously - one head might focus on syntactic role, another on proximity, another on semantic similarity.
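
Here's a minimal multi-head sketch, assuming batch-first tensors. The projection names (w_q, w_k, w_v, w_o) are illustrative and the repo's implementation may differ in the details:

import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, n_heads):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.head_dim = d_model // n_heads
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)

    def forward(self, query, key, value, mask=None):
        B = query.size(0)
        # Project, then split d_model into (n_heads, head_dim)
        q = self.w_q(query).view(B, -1, self.n_heads, self.head_dim).transpose(1, 2)
        k = self.w_k(key).view(B, -1, self.n_heads, self.head_dim).transpose(1, 2)
        v = self.w_v(value).view(B, -1, self.n_heads, self.head_dim).transpose(1, 2)
        # Scaled dot-product attention, run for every head in parallel
        scores = q @ k.transpose(-2, -1) / self.head_dim ** 0.5
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float('-inf'))
        attn = scores.softmax(dim=-1)
        out = attn @ v
        # Concatenate the heads and project back to d_model
        out = out.transpose(1, 2).contiguous().view(B, -1, self.n_heads * self.head_dim)
        return self.w_o(out)

Called as attn(x, x, x, mask) it does self-attention; called as attn(dec, memory, memory) it does cross-attention. Same module, different sources for the keys and values.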

Sinusoidal positional encoding

The original paper uses fixed sinusoidal functions instead of learned position embeddings:

PE(pos, 2i)   = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))

Each position gets a unique pattern of sines and cosines at different frequencies. The intuition: low-frequency components encode rough position (beginning vs end), high-frequency components encode fine position (this token vs the next one).

The paper’s stated reason was that sinusoidal encoding might let the model generalise to longer sequences than it saw during training, since relative positions map to linear transformations of the encoding. In practice, most later models (GPT-2, BERT) switched to learned position embeddings, and modern models use rotary position embeddings (RoPE). But sinusoidal encoding works well and has zero trainable parameters.

pe = torch.zeros(max_len, d_model)
position = torch.arange(0, max_len).unsqueeze(1).float()
div_term = torch.exp(
    torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model)
)
pe[:, 0::2] = torch.sin(position * div_term)
pe[:, 1::2] = torch.cos(position * div_term)
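
For completeness, a sketch of how that tensor is typically used: stored as a non-trainable buffer and added to the token embeddings before the first layer. The wrapper class below is illustrative rather than the repo's exact code:

import torch
import torch.nn as nn

class PositionalEncoding(nn.Module):
    def __init__(self, pe):
        super().__init__()
        # A buffer moves with .to(device) and is saved in the state dict, but gets no gradients.
        self.register_buffer('pe', pe.unsqueeze(0))   # (1, max_len, d_model)

    def forward(self, x):
        # x: (batch, seq_len, d_model) token embeddings
        return x + self.pe[:, : x.size(1)]

The optimiser never sees the buffer - zero trainable parameters, as promised.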

Teacher forcing and the shifted-right trick

During training, the decoder gets the correct target sequence as input, shifted right by one position and prepended with a start-of-sequence token. So when predicting the third French word, the decoder sees the SOS token plus the first two correct French words.

Target input:  <SOS> je  suis étudiant
Target output: je    suis étudiant <EOS>

This is teacher forcing - the model always sees the ground truth during training, never its own predictions. It makes training stable and fast, but creates a train-test mismatch: at inference time, the model has to use its own (potentially wrong) predictions as input.
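
In code, the shift is just two slices of the same tensor (the token IDs below are made up, assuming <SOS> is 1 and <EOS> is 2):

import torch

# "je suis étudiant" as token IDs, wrapped in <SOS> ... <EOS>
tgt = torch.tensor([[1, 37, 52, 804, 2]])

tgt_input = tgt[:, :-1]    # <SOS> je suis étudiant   -> fed to the decoder
tgt_output = tgt[:, 1:]    # je suis étudiant <EOS>   -> compared against the predictions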

The Noam learning rate schedule

The paper introduced a specific warmup schedule:

lr = d_model^(-0.5) * min(step^(-0.5), step * warmup^(-1.5))

Learning rate increases linearly for the first warmup_steps, then decays proportionally to the inverse square root of the step number. The warmup prevents early instability when the model hasn’t learned useful representations yet and the gradients are noisy. The paper used 4000 warmup steps.
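
A sketch of wiring that schedule into PyTorch with LambdaLR. The d_model of 512 and the Adam betas/eps are the paper's base-model values rather than anything specific to this repo:

import torch

def noam_lr(step, d_model=512, warmup=4000):
    step = max(step, 1)   # LambdaLR calls this with step 0 at init; avoid 0 ** -0.5
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)

model = torch.nn.Linear(8, 8)   # stand-in for the transformer
# Base lr of 1.0 so the lambda's return value is the actual learning rate.
optimizer = torch.optim.Adam(model.parameters(), lr=1.0, betas=(0.9, 0.98), eps=1e-9)
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=noam_lr)
# Each training step: optimizer.step() then scheduler.step()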

Label smoothing

Instead of training the model to predict probability 1.0 for the correct token and 0.0 for everything else, label smoothing spreads 10% of the probability mass uniformly across all tokens. The correct token gets 0.9 instead of 1.0.

This hurts perplexity (the model can never be perfectly confident) but improves BLEU scores because it prevents the model from becoming overconfident and encourages it to maintain reasonable probabilities for plausible alternatives. The paper used label smoothing of 0.1. PyTorch has it built into CrossEntropyLoss.
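
In PyTorch that's a single argument on the loss (the ignore_index assumes a padding ID of 0 - adjust to whatever the tokenizer uses):

import torch.nn as nn

# Smoothing of 0.1 as in the paper; padding positions are excluded from the loss.
criterion = nn.CrossEntropyLoss(label_smoothing=0.1, ignore_index=0)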

Training

I trained on 50,000 English-French pairs from the Tatoeba dataset, filtered to sentences under 15 words. Word-level tokenizers for both languages. Adam optimizer with the Noam schedule.

Epoch   Train Loss   Val Loss
1       6.5911       5.1423
5       3.1535       3.2336
10      2.5310       2.8353
15      2.2847       2.7203
20      2.1448       2.6820

Validation loss keeps improving through all 20 epochs. Compare this with my Poe language model, which overfit after epoch 1. The difference is the data-to-parameter ratio. The Poe model had 4.8M parameters and 1.9M characters - more capacity than data. This model has 11.5M parameters but is trained on 50,000 sentence pairs with a diverse vocabulary. The model is still underfitting at epoch 20. More training would help.

Progression

Epoch 1 - the model has learned basic French word order but not much else. Everything is “est” (is) and <unk>:

EN: The cat is on the table.     FR: le est est <unk>.
EN: I love you.                  FR: je suis de vous.
EN: She has a beautiful house.   FR: il a est <unk>.

Epoch 5 - common phrases are already correct. “Le chat est sur la table” is perfect. But the model picks plausible-but-wrong words for harder translations:

EN: The cat is on the table.     FR: le chat est sur la table.
EN: I love you.                  FR: j'adore vous.
EN: She has a beautiful house.   FR: elle a une grande maison.
EN: We went to the beach.        FR: nous sommes en train de rentrer à la plage.

“J’adore vous” is grammatically wrong (should be “je t’aime” or “je vous adore”). “Une grande maison” means “a big house” instead of “a beautiful house”. The beach sentence is completely hallucinated - “we are in the process of returning to the beach” instead of “we went to the beach”.

Epoch 10 - nearly perfect. The model gets gender agreement right, handles past tense, and picks the correct idioms:

EN: The cat is on the table.     FR: le chat est sur la table.
EN: I love you.                  FR: je t'aime.
EN: She has a beautiful house.   FR: elle a une belle maison.
EN: We went to the beach.        FR: nous sommes allées à la plage hier.

One subtle error: “allées” uses the feminine plural past participle. It should be “allés” (masculine or mixed group). The model over-indexed on feminine forms.

Epoch 20 - that gender error is fixed. The model produces correct translations for most simple sentences:

EN: The cat is on the table.     FR: le chat est sur la table.
EN: I love you.                  FR: je t'aime.
EN: She has a beautiful house.   FR: elle a une belle maison.
EN: We went to the beach.        FR: nous sommes allés à la plage hier.

Where it fails

The model handles simple, common sentences well. It breaks on:

Idioms. “Good morning” becomes “les bonnes matins” (the good mornings) instead of “Bonjour”. The model translates literally because it’s never learned that “good morning” is an idiom.

Rare vocabulary. Words that appear fewer than twice in the training data get mapped to <unk>. The model can’t translate what it hasn’t seen.

Gender agreement on unusual nouns. “La nourriture est délicieux” should be “délicieuse” (feminine). The model knows gender rules for common words but not for “nourriture”.

Complex tenses. Simple present and passé composé work well. Conditional, subjunctive, and plus-que-parfait are spotty.

These failures all come from the same root cause: not enough data. 50,000 sentence pairs is tiny. The original paper trained on 4.5 million sentence pairs (WMT 2014 English-German) and 36 million pairs (English-French). Scale the data 100x and most of these errors disappear.

What this teaches you that decoder-only doesn’t

Building a decoder-only transformer (GPT) teaches you attention and autoregressive generation. Building the original encoder-decoder adds cross-attention, bidirectional context, and the relationship between the three architectures.

Cross-attention is the bridge. In a decoder-only model, all the information flows through a single stream of self-attention. In an encoder-decoder, the encoder and decoder are separate networks connected only by cross-attention. The encoder’s job is to build a rich representation of the input. The decoder’s job is to read that representation and produce output. Cross-attention is the only communication channel between them, and it’s where translation happens.

Bidirectional context matters. The encoder sees the entire input at once - every word can attend to every other word, no mask. “Bank” in “I sat on the bank of the river” attends to “river” and resolves its meaning. In a decoder-only model, “bank” can only attend to words before it. The encoder has strictly more information available at every position.

The three architectures are the same mechanism, three ways. Once you’ve built this, you see that BERT (encoder-only), GPT (decoder-only), and the original transformer (encoder-decoder) are all the same building blocks wired differently. The attention mechanism doesn’t change. What changes is what can attend to what, and whether there’s a second stream of attention crossing between encoder and decoder.

The code

Everything is at github.com/danieljohnmorris/attention-is-all-you-need. Six files, no dependencies beyond PyTorch. Train it on a laptop in under an hour.

python download_data.py
python train.py --epochs 20
python translate.py --interactive