Tokenizer Lab — how an AI reads text
Before a model can predict the next token, it has to decide what a token even is. Models don't read characters or whole words; they read learned sub-word pieces. This is the algorithm that finds them: byte-pair encoding. Train one on your own text and watch the pieces emerge. It all runs in your browser.
Byte-pair encoding starts from single characters and repeatedly fuses the most frequent adjacent pair. Drag the slider to watch a vocabulary of sub-word pieces grow from the text itself. Nothing leaves your browser.
Highlighted chips are multi-character tokens the tokenizer learned. With zero merges every character is its own token; as the number of merges grows, common pieces like “low” and “est” collapse into single tokens. Real LLM tokenizers use the same trick to fit more text into each token.
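The training loop described above can be sketched in a few lines of Python. This is a minimal illustration, not the lab's actual code: the function names `train_bpe` and `merge_pair`, the choice to split the text on whitespace, and the use of `·` as an end-of-word marker are all assumptions made for the example.

```python
from collections import Counter

BOUNDARY = "·"  # assumed end-of-word marker, not confirmed by the lab


def merge_pair(symbols: tuple[str, ...], pair: tuple[str, str]) -> tuple[str, ...]:
    """Fuse every adjacent occurrence of `pair` in `symbols`, left to right."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def train_bpe(text: str, num_merges: int) -> list[tuple[str, str]]:
    """Learn up to `num_merges` merge rules from `text` (toy sketch)."""
    # Each word starts as single characters plus the boundary marker,
    # weighted by how often the word occurs in the text.
    words = Counter(tuple(word) + (BOUNDARY,) for word in text.split())
    merges: list[tuple[str, str]] = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair across the corpus.
        pairs: Counter = Counter()
        for symbols, freq in words.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break  # nothing left to merge
        # Most frequent pair wins; ties go to the pair counted first.
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite the corpus with the new fused symbol.
        rewritten: Counter = Counter()
        for symbols, freq in words.items():
            rewritten[merge_pair(symbols, best)] += freq
        words = rewritten
    return merges
```

Calling `train_bpe("low low low lower lowest", 3)` on this tiny corpus yields the merges `l + o`, then `lo + w`, then `low + ·`. The tie-breaking rule (first pair counted wins) is one arbitrary choice among several; a different rule would produce a different but equally valid merge order.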
- 1. t + · → t
- 2. e + · → e
- 3. s + · → s
- 4. o + w → ow
- 5. t + h → th
- 6. l + ow → low
- 7. e + r → er
- 8. er + · → er
- 9. n + · → n
- 10. s + t → st
- 11. th + e → the
- 12. e + s → es
- 13. e + st → est
- 14. n + g → ng
- 15. n + e → ne
- 16. w + i → wi
- 17. m + o → mo
- 18. ng + · → ng
- 19. wi + d → wid
- 20. l + · → l
- 21. i + t → it
- 22. e + c → ec
- 23. t + e → te
- 24. o + m → om
- 25. r + e → re
- 26. a + r → ar
- 27. low + er → lower
- 28. low + est → lowest
- 29. low + i → lowi
- 30. lowi + ng → lowing
- 31. ne + w → new
- 32. mo + d → mod
- 33. mod + e → mode
- 34. mode + l → model
- 35. s + e → se
- 36. se + es → sees
- 37. p + i → pi
- 38. pi + ec → piec
- 39. piec + es → pieces
- 40. o + n → on
Order matters: encoding replays these merges from the top, so earlier (more frequent) pairs always win — that determinism is what lets a tokenizer and a model agree on exactly the same tokens every time.
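To make the replay concrete, here is a small Python sketch of encoding a single word against an ordered merge list. The function name `encode`, the `·` end-of-word marker, and the sample merge lists are hypothetical, chosen only to illustrate the idea; the lab's own encoder may differ in detail.

```python
BOUNDARY = "·"  # assumed end-of-word marker, not confirmed by the lab


def encode(word: str, merges: list[tuple[str, str]]) -> list[str]:
    """Tokenize one word by replaying `merges` in their learned order."""
    symbols = list(word) + [BOUNDARY]
    for left, right in merges:  # earlier merges are applied first
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == left and symbols[i + 1] == right:
                out.append(left + right)  # fuse the matching pair
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols


# Hypothetical merge list, in the order it was learned.
merges = [("l", "o"), ("lo", "w"), ("e", "s"), ("es", "t")]
tokens = encode("lowest", merges)  # ["low", "est", "·"]

# Same two rules, different order, different tokens.
first = encode("est", [("e", "s"), ("s", "t")])   # ["es", "t", "·"]
second = encode("est", [("s", "t"), ("e", "s")])  # ["e", "st", "·"]
```

The last two calls show why the order is part of the tokenizer: once `e + s` has fired, the `s` is no longer available for `s + t`, and vice versa.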
GPT-style models ship a frozen tokenizer trained in essentially this way on a huge corpus, with tens of thousands of merges instead of a few dozen. The principle is identical: start from the smallest units (raw bytes in GPT-style tokenizers, single characters here), fuse the most common pairs, and end up with a vocabulary where frequent chunks (“ing”, “tion”, “low”) are single tokens while rare strings stay split. It's why models count tokens, not words, and why this lab is the natural companion to Crumbs, which predicts the next one.