Inspired by the LLM Engineer’s Handbook, I wanted to build each piece of the foundation model pipeline myself - data, tokenization, architecture, pre-training, inference. Not fine-tuning an existing model. Training one from nothing. So I wrote a small language model from scratch in PyTorch and trained it on Edgar Allan Poe.
The model has about 4.8 million parameters. It runs on a laptop. It produces text that sounds vaguely like gothic horror written by someone having a stroke. That’s the point - you learn more from a model that almost works than from one that’s too big to inspect.
The corpus
The Poe Museum has his complete works online. I scraped 131 works - 67 stories and 64 poems. The texts sit in WordPress .entry-content divs, so extraction was straightforward with BeautifulSoup.
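The per-work extraction is roughly the sketch below - the list of work URLs and any politeness delays are omitted, and the .entry-content selector is the only detail taken from the real pages; the names are illustrative, not the exact scrape.py.

import requests
from bs4 import BeautifulSoup

def extract_work(url: str) -> str:
    # Pull the text of one story or poem out of its WordPress .entry-content div
    html = requests.get(url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    content = soup.select_one(".entry-content")
    return content.get_text(separator="\n", strip=True) if content else ""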
After cleaning: 1,904,601 characters, roughly 340,000 words, 146 unique characters. Small by LLM standards. Enough to learn English structure and pick up Poe’s vocabulary.
Two tokenizers
I built both a character-level tokenizer and a byte-pair encoding (BPE) tokenizer from scratch. No libraries. This is where you really feel the difference between reading about tokenization and implementing it.
Character-level
The simplest possible approach. Each unique character maps to an integer.
class CharTokenizer:
    def train(self, text: str):
        chars = sorted(set(text))
        self.char_to_id = {ch: i for i, ch in enumerate(chars)}
        self.id_to_char = {i: ch for ch, i in self.char_to_id.items()}

    def encode(self, text: str) -> list[int]:
        return [self.char_to_id[ch] for ch in text]

    def decode(self, ids: list[int]) -> str:
        return "".join(self.id_to_char[i] for i in ids)
Poe’s corpus has 146 unique characters. That’s the vocabulary. Every character is one token.
The advantage: zero information loss and no unknown tokens. The downside: the model has to learn everything from individual characters. “The” is three tokens. The model needs to burn capacity learning that ‘t’ followed by ‘h’ followed by ‘e’ is a common pattern, before it can even start on syntax.
With 1.9M characters, you get 1.9M tokens. Long sequences, slow training.
Byte-pair encoding
BPE starts with the character vocabulary and iteratively merges the most frequent adjacent pair into a new token. After enough merges, common words become single tokens.
# Simplified core loop
from collections import Counter

for i in range(num_merges):
    # Count all adjacent pairs across the corpus
    pair_counts = Counter()
    for word, freq in word_freqs.items():
        for j in range(len(word) - 1):
            pair_counts[(word[j], word[j + 1])] += freq

    # Merge the most frequent pair
    best_pair = pair_counts.most_common(1)[0][0]
    merged = best_pair[0] + best_pair[1]
    vocab[merged] = next_id
    next_id += 1
    merges.append(best_pair)

    # Apply merge to all words
    word_freqs = apply_merge_everywhere(word_freqs, best_pair, merged)
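The apply_merge_everywhere helper is elided above. One minimal way to write it, treating each word as a tuple of symbols keyed by frequency, might look like this - a sketch, not necessarily the version in the repo:

def apply_merge_everywhere(word_freqs, pair, merged):
    # Replace every occurrence of the adjacent pair with the new merged symbol
    new_freqs = {}
    for word, freq in word_freqs.items():
        symbols, i = [], 0
        while i < len(word):
            if i < len(word) - 1 and (word[i], word[i + 1]) == pair:
                symbols.append(merged)
                i += 2
            else:
                symbols.append(word[i])
                i += 1
        key = tuple(symbols)
        new_freqs[key] = new_freqs.get(key, 0) + freq
    return new_freqs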
With 512 tokens (146 base chars + 366 merges), the same text encodes to 1,087,709 tokens. A 1.75x compression ratio. Common words like “the”, “upon”, “they” become single tokens. Less common words split at natural boundaries - “midnight” becomes ['m', 'id', 'n', 'ight'].
BPE training is slow. Each merge requires rescanning all pair frequencies. With 1.9M characters and 366 merges, it takes a few minutes. The actual encoding is fast. This matches the real-world pattern - you train the tokenizer once and encode many times.
BPE’s merges are greedy. It always picks the globally most frequent pair. This means it can produce suboptimal tokenizations for specific words. “dreary” becomes ['d', 're', 'ary'] rather than something more morphologically sensible. Production tokenizers like SentencePiece offer a unigram model that optimises the segmentation globally. But for understanding the concept, greedy BPE is clear and correct.
The model
A decoder-only transformer. The same architecture family as GPT-2, just smaller.
Token embedding (vocab_size -> 256)
+ Positional embedding (128 -> 256)
-> Dropout
-> 6x Transformer Block:
     LayerNorm -> Multi-Head Self-Attention (8 heads)
     LayerNorm -> Feed-Forward (256 -> 1024 -> 256, GELU)
-> Final LayerNorm
-> Linear head (256 -> vocab_size)
About 4.8 million parameters. For context, GPT-2 small is 124 million. GPT-3 is 175 billion.
A few design choices worth noting:
Pre-norm vs post-norm. Original transformers apply layer norm after the residual connection. GPT-2 and most modern models apply it before. Pre-norm trains more stably - the residual path is cleaner, gradients flow better. I used pre-norm.
Weight tying. The input token embedding matrix and the output projection matrix share the same weights. This reduces parameters and improves training - the model’s notion of “what does token X mean” and “should I predict token X” stay aligned.
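Concretely, a pre-norm block plus the tied output head could be sketched like this - illustrative only, using PyTorch’s built-in nn.MultiheadAttention rather than the hand-written attention in the repo, and with made-up names:

import torch.nn as nn

class PreNormBlock(nn.Module):
    def __init__(self, d_model=256, n_heads=8, dropout=0.1):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x, causal_mask):
        # Pre-norm: normalise before each sublayer so the residual path stays clean
        h = self.ln1(x)
        x = x + self.attn(h, h, h, attn_mask=causal_mask, need_weights=False)[0]
        x = x + self.ff(self.ln2(x))
        return x

# Weight tying: the output head reuses the token embedding matrix
vocab_size, d_model = 512, 256
tok_emb = nn.Embedding(vocab_size, d_model)
lm_head = nn.Linear(d_model, vocab_size, bias=False)
lm_head.weight = tok_emb.weight   # same parameters, kept aligned during training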
Causal masking. The self-attention pattern is masked so each position can only attend to previous positions. This is the “decoder-only” part. Without it you’re building BERT, not GPT.
# The causal mask - one line that defines next-token prediction
mask = torch.triu(torch.ones(T, T, device=x.device), diagonal=1).bool()
attn = attn.masked_fill(mask, float("-inf"))
That single mask is the difference between “predict missing words” (BERT-style) and “predict the next word” (GPT-style). When I first forgot to add it, the model achieved suspiciously low loss and generated exact memorised passages. It was cheating by looking ahead.
Training
AdamW optimizer, cosine learning rate schedule, gradient clipping at 1.0. Standard recipe. Trained on Apple MPS (Metal Performance Shaders) - the GPU on M-series Macs.
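In code, the step looks something like this sketch. Only the optimizer choice, the cosine schedule, the clipping value, and the MPS device come from the setup above; the learning rate, weight decay, and data plumbing (model, batches, total_steps) are illustrative:

import torch
import torch.nn.functional as F

device = "mps" if torch.backends.mps.is_available() else "cpu"
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=total_steps)

for x, y in batches:                      # x: (B, 128) token ids, y: the same window one token ahead
    x, y = x.to(device), y.to(device)
    logits = model(x)                     # (B, 128, vocab_size), assuming the model returns raw logits
    loss = F.cross_entropy(logits.view(-1, logits.size(-1)), y.view(-1))
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)   # gradient clipping at 1.0
    optimizer.step()
    scheduler.step()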
Character-level results
Char-level training took about 7.5 hours for 5 epochs on MPS. Here’s what the model produced at each stage.
Epoch 1 - the model has learned word boundaries, common words, and em dashes. Vocabulary is recognisably Poe but grammar is loose:
The death of a beautiful woman, and the phantasmagoric are not to the
heart. If there be made knowledge to confirm the green surface of the
earth and — we shall not be the affiant within the gravitation of the
box — being unfortunately reblished as much as passionately
Epoch 3 - dialogue structure appears, with proper quotation marks and exchanges:
"Cannot stay any thing?" said I.
"Go —"
And more complex narrative:
The death of a beautiful woman, not unlike those of the piazzas of
Correction with the combustibility of blows. But the universe as well
shot out, and the handsie of ocean will night.
Epoch 5 - longer coherent runs, gothic imagery, first-person voice:
The death of a beautiful woman, through the crowd, through seventy
thousand ways to shall examine the blood-room in the pitcher bere
intended to transmiss of it the very first tenef o'clock, through a
shadow and hard, rumors and his parcel, into the innermost regions
of impetuous apparatus
It never produces grammatically correct Poe. But it learns the texture: long sentences, dashes, words like “countenance” and “phantasmagoric” and “impetuous”. That’s a lot to learn from individual characters.
BPE results
With the same model config, BPE trained in 3.5 hours - less than half the time. Each epoch processes 43% fewer tokens because of compression.
BPE epoch 1 already rivals char epoch 3 for coherence. The model doesn’t waste capacity learning to spell:
I found myself in a dark chamber — for it was a condition which I could
not at any positive interruption of the singular buck — I saw nothing
of the age with which, as I have already said, existed, except by a
peditor of the "Old Charley," a paper, has been reading me to exagree.
BPE epoch 2 - longer stretches of locally fluent English, even where the content is nonsense:
the reflection of individual error with which it commenced. This
capitals are, in fact, attributed the word "the trunks of a governex
wrace and that of ungovernexate dominions"
BPE epoch 4 - proper Poe atmosphere, with phrases that could almost pass as real:
In this stage of my beeting I became aware of a dull, sullen
glow-satisfied with a fashion of great genius, pursues of the kingdom
“I became aware of a dull, sullen glow” is straight out of The Tell-Tale Heart.
BPE epoch 5 - the most accomplished output:
Once upon a midnight dreary upon the lips of the axis. It was then,
fully the musical inclined atmosphere — a small portion of the main
drift to the glittering of the night. It was easily for its centurat,
and found myself, at the falling of a funnel of the ship.
The overfitting problem
Both models overfit aggressively. Best validation loss is always epoch 1. After that, training loss keeps dropping while validation loss climbs.
Char-level:
| Epoch | Train Loss | Val Loss |
|---|---|---|
| 1 | 1.068 | 2.048 |
| 5 | 0.426 | 2.945 |
BPE:
| Epoch | Train Loss | Val Loss |
|---|---|---|
| 1 | 1.744 | 3.433 |
| 5 | 0.557 | 4.702 |
BPE has higher absolute losses for two reasons: the vocabulary is larger - 512 possible next tokens vs 146 - and each BPE token carries more information than a single character, so the model is asked to predict more per step. The per-token numbers aren’t directly comparable between tokenizers; loss per character is the fairer yardstick.
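Assuming the reported figures are mean cross-entropy in nats per token, the two epoch-1 validation losses can be put on a common per-character scale using the corpus counts from earlier:

chars = 1_904_601        # characters in the corpus = char-level tokens
bpe_tokens = 1_087_709   # the same corpus under BPE-512

char_val_per_char = 2.048                        # already per character
bpe_val_per_char = 3.433 * bpe_tokens / chars    # ~1.96 nats per character

On that scale the two models land in the same neighbourhood, which is what you’d expect from the same architecture on the same text.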
The obvious question: is the model too big?
No. 4.8 million parameters is tiny. The model isn’t too big for the task - it’s too big for the data. Poe only wrote about 1.9 million characters. The model has more parameters than training characters. It can memorise the corpus, so it does.
This is the opposite of how real LLMs work. GPT-3 has 175 billion parameters but was trained on roughly 300 billion tokens. The model is always underfitting - it could never memorise the training set even if it tried. So it’s forced to learn general patterns instead.
There are three levers:
More data fixes the problem properly. Train this same 4.8M model on all of Project Gutenberg - billions of characters across thousands of authors - and it would still be underfitting at epoch 5. It would learn grammar, not just Poe’s vocabulary.
Smaller model delays the overfitting but also limits what the model can learn. A 500K parameter model would take longer to memorise the corpus, but it would also produce worse output at every stage. You’d get less overfitting and less capability - a bad trade.
More regularisation - higher dropout, stronger weight decay, early stopping at epoch 1 - helps at the margins. But you can’t regularise your way out of a data shortage. The model needs more text to learn from, not more punishment for learning too well.
The real insight is about ratios. When parameters >> data, the model memorises. When data >> parameters, the model generalises. Every LLM training run is a bet on where you sit on that curve. With 4.8M parameters and 1.9M characters, this run sits on the memorisation side.
Despite overfitting, later epochs produce more interesting output. The model memorises patterns and fragments, not exact sequences verbatim. So generation still varies - it’s recombining memorised pieces in new ways. A kind of creative plagiarism.
Generation
At inference time, the model predicts one token at a time. It takes the sequence so far, runs it through the transformer, gets a probability distribution over the vocabulary for the next token, samples from it, appends, and repeats.
Two parameters matter most:
Temperature scales the logits before softmax. Temperature 1.0 means raw probabilities. Lower (0.3) makes the model more confident - it picks common tokens, produces repetitive but grammatical text. Higher (1.5) flattens the distribution - more creative, more chaotic. I used 0.8.
Top-k restricts sampling to the k most probable tokens. This prevents the model from occasionally picking a wildly improbable token that derails the generation. Top-k of 40 works well for this model size.
# The core generation loop
logits = logits[:, -1, :] / temperature
if top_k > 0:
    v, _ = torch.topk(logits, min(top_k, logits.size(-1)))
    logits[logits < v[:, [-1]]] = float("-inf")
probs = F.softmax(logits, dim=-1)
next_token = torch.multinomial(probs, num_samples=1)
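That snippet is the inner step. Wrapped in the full autoregressive loop it looks roughly like this - a sketch, not the repo’s generate.py, assuming ids is a (1, T) tensor of prompt token ids and the 128-token context window from the architecture section:

import torch
import torch.nn.functional as F

@torch.no_grad()
def generate(model, ids, max_new_tokens=200, temperature=0.8, top_k=40):
    for _ in range(max_new_tokens):
        logits = model(ids[:, -128:])                # crop to the 128-token context window
        logits = logits[:, -1, :] / temperature
        if top_k > 0:
            v, _ = torch.topk(logits, min(top_k, logits.size(-1)))
            logits[logits < v[:, [-1]]] = float("-inf")
        probs = F.softmax(logits, dim=-1)
        next_token = torch.multinomial(probs, num_samples=1)
        ids = torch.cat([ids, next_token], dim=1)    # append and repeat
    return ids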
Try it
Pick a seed phrase and a model. These are real outputs from the trained models - pre-generated, not running live.
What I learned
Attention is just a weighted average. Stripped of the jargon, multi-head attention computes “how relevant is each previous position to the current one” and takes a weighted average of their values. The multi-head part just means it does this with different learned projections in parallel, so it can attend to different types of relationships simultaneously.
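Stripped all the way down, one head really is a handful of lines - a minimal illustration, not the repo’s multi-head implementation; the causal mask from earlier would be applied to scores before the softmax:

import torch
import torch.nn.functional as F

def single_head_attention(q, k, v):
    # q, k, v: (T, d) tensors for one sequence
    scores = q @ k.T / (k.size(-1) ** 0.5)    # relevance of every position to every other
    weights = F.softmax(scores, dim=-1)       # each row sums to 1: a set of averaging weights
    return weights @ v                        # weighted average of the value vectors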
The residual connections are load-bearing. Remove them and the model barely trains. They let gradients flow directly from the loss back to early layers without being squashed through nonlinearities. Pre-norm makes this even more stable.
Tokenization determines the difficulty. The same model architecture struggles more with char-level than BPE, because the effective task is harder. Char-level has to simultaneously learn spelling, word boundaries, syntax, and style. BPE gets word boundaries for free and most spelling for free. The model can focus on higher-level patterns. BPE trained in less than half the time and produced better output.
Small models memorise fast. With 4.8M parameters and 1.9M characters, the model starts overfitting after one epoch. Validation loss bottoms out at epoch 1 and climbs from there while training loss keeps falling. More data would help more than any architectural trick.
The gap between “works” and “good” is enormous. This model produces text that has the shape and feel of Poe but falls apart on close reading. Going from here to GPT-2 quality requires 25x more parameters. Going from there to GPT-4 quality requires more than just scaling up the same architecture.
The code
Everything is at github.com/danieljohnmorris/tiny-poe-llm. Five files, no dependencies beyond PyTorch (plus BeautifulSoup for the scraper). You can train it on a laptop in a few hours.
# Scrape Poe's complete works
python scrape.py
# Train with char-level tokenizer
python train.py --tokenizer char --epochs 5
# Train with BPE tokenizer
python train.py --tokenizer bpe --bpe-vocab-size 512 --epochs 5
# Generate text from a trained model
python generate.py --prompt "Once upon a midnight dreary"
A few hours of debugging your own broken attention masks teaches more about transformers than a week of tutorials.