
Tokenizer Lab — how an AI reads text

Before a model can predict the next token, it has to decide what a token even is. Models don't read characters or whole words; they read learned sub-word pieces. This is the algorithm that finds them: byte-pair encoding. Train one on your own text and watch the pieces emerge. It all runs in your browser.

1 · training text
27 base symbols → 40 merges learned

Byte-pair encoding starts from single characters and repeatedly fuses the most frequent adjacent pair. Drag the slider to watch a vocabulary of sub-word pieces grow from the text itself. Nothing leaves your browser.
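The training loop described above can be sketched in a few lines of Python. This is a minimal illustration, not the lab's actual code: the function names `learn_merges` and `merge_pair` are made up for this example, and it assumes that text is split on whitespace, that `·` marks the end of each word, and that ties between equally frequent pairs go to the pair seen first.

```python
from collections import Counter

END = "·"  # assumption: "·" marks the end of a word, as in the merge list


def merge_pair(symbols, pair):
    """Fuse every adjacent occurrence of `pair` in a tuple of symbols."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_merges(text, num_merges):
    """Return up to `num_merges` merges, most frequent adjacent pair first."""
    # Start from single characters: each word is a tuple of 1-char symbols.
    words = Counter(tuple(w) + (END,) for w in text.split())
    merges = []
    for _ in range(num_merges):
        # Count adjacent pairs, weighted by how often each word occurs.
        pairs = Counter()
        for symbols, freq in words.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is already a single symbol
        # Ties go to the first pair seen (dicts keep insertion order).
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite every word with the chosen pair fused into one symbol.
        fused = Counter()
        for symbols, freq in words.items():
            fused[merge_pair(symbols, best)] += freq
        words = fused
    return merges


if __name__ == "__main__":
    for left, right in learn_merges("low low low lower lowest", 3):
        print(f"{left} + {right} → {left + right}")
```

Raising `num_merges` plays the role of the slider: each extra step adds one more fused piece to the vocabulary.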

2 · encode some text
28 chars → 13 tokens · 2.15 chars/token
themodellearnslowerandlowest

Highlighted chips are multi-character tokens the model learned. With zero merges every character is its own token; as merges grow, common pieces like low and est collapse into single tokens — the same trick real LLM tokenizers use to fit more meaning per token.
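The chars/token figure is plain division, and a tiny sketch makes the two extremes concrete. The helper name `chars_per_token` and the `["low", "est"]` split are illustrative assumptions, not output copied from the lab.

```python
def chars_per_token(text, tokens):
    """Average number of characters covered by each token."""
    if not tokens:
        raise ValueError("need at least one token")
    return len(text) / len(tokens)


text = "lowest"

# Zero merges: every character is its own token, so the ratio is exactly 1.
no_merges = list(text)

# Hypothetical segmentation once "low" and "est" have been learned.
with_merges = ["low", "est"]

print(chars_per_token(text, no_merges))    # 6 chars over 6 tokens
print(chars_per_token(text, with_merges))  # 6 chars over 2 tokens

# The figure reported above: 28 chars over 13 tokens.
print(round(28 / 13, 2))
```

A higher ratio means the same text costs fewer tokens.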

3 · learned merges, in order
  1. t + · → t·
  2. e + · → e·
  3. s + · → s·
  4. o + w → ow
  5. t + h → th
  6. l + ow → low
  7. e + r → er
  8. er + · → er·
  9. n + · → n·
  10. s + t → st
  11. th + e → the
  12. e + s → es
  13. e + st → est
  14. n + g → ng
  15. n + e → ne
  16. w + i → wi
  17. m + o → mo
  18. ng + · → ng·
  19. wi + d → wid
  20. l + · → l·
  21. i + t → it
  22. e + c → ec
  23. t + e → te
  24. o + m → om
  25. r + e → re
  26. a + r → ar
  27. low + er → lower
  28. low + est → lowest
  29. low + i → lowi
  30. lowi + ng → lowing
  31. ne + w → new
  32. mo + d → mod
  33. mod + e → mode
  34. mode + l → model
  35. s + e → se
  36. se + es → sees
  37. p + i → pi
  38. pi + ec → piec
  39. piec + es → pieces
  40. o + n → on

Order matters: encoding replays these merges from the top, so earlier merges (the pairs that were most frequent when they were learned) always take priority over later ones. That determinism is what lets a tokenizer and a model agree on exactly the same tokens every time.
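To see why order matters, here is a small sketch of replay-style encoding. It is an illustration under stated assumptions rather than the lab's implementation: the `encode` function is hypothetical, it handles a single word, and it leaves out the `·` marker to keep the example short.

```python
def encode(word, merges):
    """Replay `merges` in order over a word, starting from single characters."""
    symbols = list(word)
    for left, right in merges:  # earlier merges are applied first
        merged, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == left and symbols[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols


# The same two merges in a different order give different tokens.
print(encode("lowest", [("e", "s"), ("s", "t")]))
print(encode("lowest", [("s", "t"), ("e", "s")]))

# A subset of the merges listed above (4, 6, 10, 13, 28), kept in learned order.
subset = [("o", "w"), ("l", "ow"), ("s", "t"), ("e", "st"), ("low", "est")]
print(encode("lowest", subset))
```

In the first call `e + s` fires before `s + t` can, leaving `t` on its own; in the second, `s + t` wins and `e` is left over. With a fixed order, the same input always produces the same tokens.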

what you're seeing

GPT-style models ship a frozen tokenizer trained essentially this way on a huge corpus, with tens of thousands of merges instead of a few dozen. The principle is identical: start from raw characters (or raw bytes, as in GPT-2's byte-level variant), fuse the most common pairs, and end up with a vocabulary where frequent chunks (“ing”, “tion”, “low”) are single tokens while rare strings stay split. It's why models count tokens, not words, and why this lab is the natural companion to Crumbs, which predicts the next one.