Where glitch tokens hide: common patterns in tokenizer vocabularies

Contents
Abstract and 1. Introduction
-
Methods
2.1 Tokenizer analysis
2.2 Indicators for detecting under-trained tokens and 2.3 Verification of candidate tokens
-
Results
3.1 Indicator effectiveness and verification
3.2 Common observations
3.3 Model-specific observations
-
Closed-source models
-
Discussion, acknowledgements, and references
A. Verification details
B. A short primer on UTF-8 encoding
C. Outputs for API-based verification
3.2 Common observations
Although many of our results depend on model-specific details, such as tokenizer training and configuration, model architecture, and training data, a number of common patterns appear across many different model families.
3.2.1 Single-byte tokens
Tokens representing a single byte are a common source of under-trained tokens. The most common instances are the "fallback" bytes 0xF5–0xFF, which never occur in UTF-8 encoded text[2] and are a convenient source of untrained reference tokens for the indicators that require them. In addition, many tokenizers, including the Gemma, Llama2, and Mistral families, include every byte as a token, but also assign a duplicate token to many characters in the normal ASCII range 0x00–0x7F. For example, in the Gemma models 'A' is both token 282, an unused byte fallback token, and token 235280, a text token. These issues are not universal, and we also find models whose vocabulary includes precisely the 243 bytes used in UTF-8 as tokens, including the EleutherAI models [14].
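The count of 243 valid bytes can be checked directly: encoding every Unicode code point and collecting the bytes that appear leaves exactly 13 values that can never occur in UTF-8 text (0xC0, 0xC1, and the fallback range 0xF5–0xFF). A minimal sketch, independent of the paper's code and using only the Python standard library:

```python
# Derive which byte values can ever occur in valid UTF-8 by encoding
# every Unicode code point and collecting the bytes that appear.
used = set()
for cp in range(0x110000):
    if 0xD800 <= cp <= 0xDFFF:
        continue  # surrogates are not encodable in UTF-8
    used.update(chr(cp).encode("utf-8"))

unused = sorted(set(range(256)) - used)
print(f"{len(used)} byte values occur in UTF-8, {len(unused)} never do:")
print([hex(b) for b in unused])  # expected: 0xc0, 0xc1, 0xf5 ... 0xff
```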
Untrained single-byte tokens are typically classified as "partial UTF-8 sequences" or "unreachable", and our indicators are effective in revealing which of them are never or rarely seen in training. We publish detailed tables showing the status of every single-byte token for each analyzed model in our repository.
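Duplicate byte representations of the kind described above can be located by comparing, for each ASCII character, the byte-fallback spelling of that byte with its plain text token. The following is a minimal sketch, not the paper's tooling, assuming a Hugging Face tokenizer with SentencePiece-style byte fallback (where byte tokens are spelled "<0xNN>"); the model name is only an illustrative choice:

```python
from transformers import AutoTokenizer

# Any SentencePiece-based tokenizer with byte fallback should work here;
# "google/gemma-2b" is an example (its weights require accepting a license).
tok = AutoTokenizer.from_pretrained("google/gemma-2b")
vocab = tok.get_vocab()  # maps token string -> token id

# Look for ASCII characters present both as a byte-fallback token and
# as a plain text token.
for b in range(0x00, 0x80):
    byte_form = f"<0x{b:02X}>"  # SentencePiece byte-token spelling
    text_form = chr(b)
    if byte_form in vocab and text_form in vocab:
        print(f"{text_form!r}: byte token {vocab[byte_form]}, "
              f"text token {vocab[text_form]}")
```

On the Gemma tokenizers this should surface the 'A' duplicate mentioned above, among others.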
3.2.2 Fragments of merged tokens
3.2.3 Special tokens
Many models include under-trained special tokens, such as
[2] See Appendix B for a primer on UTF-8 encoding.
[3] When mentioning fragments of more complete tokens, tokens in (parentheses) were not detected or verified as under-trained, unless explicitly mentioned.