Sediment — the information in text

Most of what you type is predictable, and predictable means free — you could have guessed it. The information is only the part that surprises. This measures it: feed in any text, and watch how much is signal and how much is sediment you can wash away. Everything runs in your browser.

1 · text474 chars measured

A character-level model learns which characters tend to follow which, then asks of each character: how surprised was I to see this? Surprise, measured in bits, is information. No surprise is redundancy.

0.17

bits / char

vs 4.75 naive

96%

redundancy

predictable

compresses to

27.8× smaller

10 B

actual info

of 282 B raw

2 · where the information is

Bright characters surprised the model — they carry the signal. Dim characters were predictable sediment.

you are reading this one character at a time, but you do not need every character. yur brain fills the predictable ones back in without being asked. that is redundancy: the part of the text that says nothing you could not already guess. strip it away and what remains is the information. shale is rock made of pressure and patience, fine layers of sediment pressed until only the structure is left. language is the same. most keystrokes are sediment. a few carry the signal.

3 · strip the sedimentkeep top 35%

Drag to remove the most predictable characters. Watch how much you can take away before meaning actually breaks — the redundancy is the slack.

what you're seeing

This is the principle behind every compressor — gzip, PNG, the codec that streamed this page. Predict the next symbol; spend bits only on the surprises. Claude Shannon estimated written English at barely more than one bit per character, which means roughly three quarters of what you type is already implied by what came before. The model here is tiny and learns only from the text you give it, so its numbers are an in-sample estimate, not a universal one — but the shape of the result is real. Crumbs predicts the next token; Tokenizer Lab shows how text becomes tokens; Sediment measures what those tokens are actually worth.