Common questions about tokens, tokenisers, and how Token Compare arrives at its numbers.

What is a token?

Tokens are the chunks of text that large language models actually read and write: sometimes a whole word ("hello"), sometimes a fragment ("ization"), sometimes a single character or punctuation mark. Model providers bill by the token, measure context windows in tokens, and report training-corpus sizes in tokens.

Why aren't tokens just words?

Tokenisers are trained to compress text efficiently, so common words become a single token and rarer or longer ones split into pieces. "The" is one token; "antidisestablishmentarianism" is six. As a rough rule, English prose averages around 1.3 tokens per word.
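
The 1.3 figure can be turned into a quick back-of-envelope estimate. The Python sketch below is only an illustration: the function name estimate_tokens is made up for this example, words are counted by splitting on whitespace, and the result is an approximation, not what a real tokeniser such as o200k_base would return.

```python
def estimate_tokens(text: str, tokens_per_word: float = 1.3) -> int:
    """Rough token estimate for English prose from a whitespace word count.

    The 1.3 default is the rule of thumb quoted above. It is an assumption,
    not a measurement of any particular tokeniser.
    """
    word_count = len(text.split())  # crude word count: split on whitespace
    return round(word_count * tokens_per_word)


if __name__ == "__main__":
    sample = "Tokenisers are trained to compress text efficiently"
    print(estimate_tokens(sample))
```

For anything that matters (billing, context-window limits), run the text through the actual tokeniser instead of relying on the ratio.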

Why do GPT, Claude, and Gemini count the same text differently?

Each model family ships its own tokeniser, trained on a different mix of data, so the same paragraph produces a different count on each model. Counts typically differ by around 5-10% for English prose, and by more for code or non-English text. Token Compare uses OpenAI's o200k_base (the tokeniser used by GPT-5, GPT-4.1, GPT-4o and GPT-4o mini, and the o-series: o1, o3, o4-mini) as a single reference point. Expect Claude and Gemini to land in roughly the same ballpark, not on the same number.
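
One way to read an o200k_base count for another model is as the centre of a band, not as an exact figure. The sketch below assumes the 5-10% spread quoted above and applies only to English prose. The function name cross_model_range and its spread parameter are hypothetical, chosen for this example.

```python
def cross_model_range(reference_tokens: int, spread: float = 0.10) -> tuple[int, int]:
    """Return a (low, high) band around a reference token count.

    spread is the assumed fractional difference between tokenisers,
    e.g. 0.10 for the upper end of the 5-10% range for English prose.
    """
    if reference_tokens < 0:
        raise ValueError("reference_tokens must be non-negative")
    if not 0 <= spread < 1:
        raise ValueError("spread must be in [0, 1)")
    low = round(reference_tokens * (1 - spread))
    high = round(reference_tokens * (1 + spread))
    return low, high


if __name__ == "__main__":
    # A text that counts 1,000 tokens under o200k_base
    print(cross_model_range(1000, spread=0.05))
    print(cross_model_range(1000, spread=0.10))
```

For code or non-English text the band should be widened, since the difference between tokenisers tends to be larger there.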

Why isn't there a Claude or Gemini tokeniser option?

Neither Anthropic nor Google publishes the production tokeniser for their current models. The only accurate way to count Claude or Gemini tokens is to call their official count-tokens APIs. That requires a server, API keys, and sending your text over the network, which would break the everything-runs-in-your-browser promise. Other tokeniser sites that offer a "Claude" or "Gemini" option are usually shipping the old open-source Claude 1/2 tokeniser or Google's Gemma tokeniser as a stand-in; the numbers are close but not exact, and the substitution is rarely flagged. Sticking to one accurate, fully client-side tokeniser felt more honest than offering pseudo-accurate alternatives.

Why do some languages use more tokens than English?

Most production tokenisers were trained on data heavily skewed toward English, so many other languages compress less well. German runs around 1.4 tokens per word against about 1.3 for English, largely because of compounds and umlauts. Japanese and Chinese can need 2-3× as many tokens per character as English. Transliterated proper nouns from non-Latin scripts (like "Mahershalalhashbaz" or "Tchaikovsky") often split into many tokens each, which is also why the KJV Bible's token count runs higher than its English word count would predict. Languages with Latin script and lots of training data (Spanish, French) come closer to English ratios.
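
To put the German and English figures side by side, the sketch below works out how many extra tokens the same number of words would cost. It uses only the two approximate ratios quoted above. The table and function names are hypothetical, and real ratios vary with the text and the tokeniser.

```python
# Approximate tokens-per-word ratios taken from the paragraph above.
# These are rough figures, not measurements.
TOKENS_PER_WORD = {"english": 1.3, "german": 1.4}


def overhead_vs_english(language: str) -> float:
    """Fractional extra tokens per word compared with English."""
    return TOKENS_PER_WORD[language] / TOKENS_PER_WORD["english"] - 1


if __name__ == "__main__":
    extra = overhead_vs_english("german")
    print(f"German needs about {extra:.1%} more tokens per word than English")
```

On these figures the German overhead works out to a little under 8%, which is small next to the 2-3× cited for Japanese and Chinese.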

How accurate are the reference text counts?

Where a text is in the public domain (Pride and Prejudice, Moby-Dick, the Magna Carta), counts are exact, produced by running the actual full text through o200k_base. Where a text is still in copyright (Harry Potter, 1984, Lord of the Rings) or is a moving target (Wikipedia, training corpora), counts are estimates derived from word counts and an empirical tokens-per-word ratio. Click the ⓘ next to any reference for the source and method.

Is my text sent anywhere?

No. Tokenisation, comparison, and rendering all happen in your browser. The only network traffic on this page is the one-time download of the tokeniser library from a CDN and an anonymous pageview ping to Plausible. Your text never leaves your machine.
