Tokenizer Lab — how an AI reads text

Before a model can predict the next token, it has to decide what a token even is. Models don't read characters or whole words; they read learned sub-word pieces. The algorithm that finds them is byte-pair encoding. Train one on your own text and watch the pieces emerge. It all runs in your browser.

1 · training text

27 base symbols → 40 merges learned

merges: 40

Byte-pair encoding starts from single characters and repeatedly fuses the most frequent adjacent pair. Drag the slider to watch a vocabulary of sub-word pieces grow from the text itself. Nothing leaves your browser.
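The training loop described above can be sketched in a few lines of Python. This is a minimal illustration, not the lab's actual code: the names `train_bpe`, `fuse`, and `END` are hypothetical, and the sketch assumes that text is split on whitespace, that each word gets a "·" end-of-word marker, and that ties between equally frequent pairs go to the pair seen first.

```python
from collections import Counter

END = "·"  # assumed end-of-word marker, mirroring the "·" in the merge list


def fuse(symbols, pair):
    """Return symbols with every adjacent occurrence of pair fused into one symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def train_bpe(text, num_merges):
    """Learn up to num_merges merges from whitespace-separated text."""
    # Each distinct word becomes a tuple of characters plus the end marker.
    words = Counter(tuple(w) + (END,) for w in text.split())
    merges = []
    for _ in range(num_merges):
        # Count every adjacent pair, weighted by how often its word occurs.
        pair_counts = Counter()
        for symbols, freq in words.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        # Most frequent pair; ties go to the pair seen first (an arbitrary choice).
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite every word with the chosen pair fused.
        new_words = Counter()
        for symbols, freq in words.items():
            new_words[fuse(symbols, best)] += freq
        words = new_words
    return merges


print(train_bpe("low low low lower lowest", 3))
# [('l', 'o'), ('lo', 'w'), ('low', '·')]
```

Raising `num_merges` plays the role of dragging the slider: each extra iteration adds one more learned piece to the vocabulary.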

2 · encode some text

28 chars → 13 tokens · 2.15 chars/token

the model learns lower and lowest

Highlighted chips are multi-character tokens the model learned. With zero merges, every character is its own token; as the number of merges grows, common pieces like “low” and “est” collapse into single tokens. Real LLM tokenizers use the same trick to fit more text into each token.
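The chars/token figure is simple arithmetic, sketched below. The helper `chars_per_token` is a hypothetical name, and the sketch assumes the lab counts characters with spaces excluded, which is what makes the sample phrase come out at 28.

```python
def chars_per_token(tokens):
    """Average number of characters carried by each token."""
    return sum(len(t) for t in tokens) / len(tokens)


# Sample phrase with spaces dropped (assumed to match the lab's character count).
text = "".join("the model learns lower and lowest".split())

# With zero merges, every character is its own token.
no_merges = list(text)
print(len(no_merges), chars_per_token(no_merges))  # 28 1.0

# The readout above reports 13 tokens for the same 28 characters.
print(round(len(text) / 13, 2))  # 2.15

# A toy split into two learned pieces, for comparison.
print(chars_per_token(["low", "est"]))  # 3.0
```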

3 · learned merges, in order

1. t + · → t
2. e + · → e
3. s + · → s
4. o + w → ow
5. t + h → th
6. l + ow → low
7. e + r → er
8. er + · → er
9. n + · → n
10. s + t → st
11. th + e → the
12. e + s → es
13. e + st → est
14. n + g → ng
15. n + e → ne
16. w + i → wi
17. m + o → mo
18. ng + · → ng
19. wi + d → wid
20. l + · → l
21. i + t → it
22. e + c → ec
23. t + e → te
24. o + m → om
25. r + e → re
26. a + r → ar
27. low + er → lower
28. low + est → lowest
29. low + i → lowi
30. lowi + ng → lowing
31. ne + w → new
32. mo + d → mod
33. mod + e → mode
34. mode + l → model
35. s + e → se
36. se + es → sees
37. p + i → pi
38. pi + ec → piec
39. piec + es → pieces
40. o + n → on

Order matters: encoding replays these merges from the top, so earlier (more frequent) pairs always win — that determinism is what lets a tokenizer and a model agree on exactly the same tokens every time.
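To make the replay concrete, here is a small sketch of encoding by applying merges in their learned order. The function names `merge_pair` and `encode` are hypothetical, and `MERGES` is only a subset of the list above with the "·" boundary marker ignored for simplicity, so it illustrates the idea rather than reproducing the lab's exact output.

```python
def merge_pair(symbols, left, right):
    """Fuse every adjacent (left, right) occurrence, scanning left to right."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and symbols[i] == left and symbols[i + 1] == right:
            out.append(left + right)
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out


def encode(word, merges):
    """Start from single characters and replay the merges from the top."""
    symbols = list(word)
    for left, right in merges:
        symbols = merge_pair(symbols, left, right)
    return symbols


# Subset of the learned merges, kept in their original relative order.
MERGES = [
    ("o", "w"), ("t", "h"), ("l", "ow"), ("e", "r"), ("s", "t"),
    ("th", "e"), ("e", "s"), ("e", "st"), ("low", "er"), ("low", "est"),
]

print(encode("lowest", MERGES))  # ['lowest']
print(encode("slow", MERGES))    # ['s', 'low']

# Same merges, but with e+s moved ahead of s+t: "est" can no longer form.
REORDERED = [
    ("o", "w"), ("t", "h"), ("l", "ow"), ("e", "r"), ("e", "s"),
    ("s", "t"), ("th", "e"), ("e", "st"), ("low", "er"), ("low", "est"),
]
print(encode("lowest", REORDERED))  # ['low', 'es', 't']
```

The reordered list contains exactly the same merges, yet "lowest" splits into three tokens instead of one, which is why the merge order has to be frozen along with the vocabulary.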

what you're seeing

GPT-style models ship a frozen tokenizer trained in essentially this way on a huge corpus, with tens of thousands of merges instead of a few dozen. The principle is identical: start from raw characters (or raw bytes, as in GPT-2's byte-level variant), fuse the most common pairs, and end up with a vocabulary where frequent chunks (“ing”, “tion”, “low”) are single tokens while rare strings stay split. That's why models count tokens, not words, and why this lab is the natural companion to Crumbs, which predicts the next one.