Where glitch tokens hide: common patterns in tokenizer vocabularies

Contents
Abstract and 1. Introduction
-
Methods
2.1 Tokenizer analysis
2.2 Indicators for detecting under-trained tokens and 2.3 Verification of candidate tokens
-
Results
3.1 Indicator effectiveness and verification
3.2 Common observations
3.3 Model-specific observations
-
Closed-source models
-
Discussion, acknowledgements, and references
A. Verification details
B. A short primer on UTF-8 encoding
C. Outputs for API-based verification
3.2 Common observations
Although many of our results depend on model-specific details, such as tokenizer training and configuration, model architecture, and training data, a number of common patterns appear across many different model families.
3.2.1 Single-byte tokens
Tokens representing a single byte are a common source of under-trained tokens. The most common instances are the "fallback" bytes 0xF5–0xFF, which never occur in UTF-8 encoded text[2] and are a convenient source of untrained reference tokens for the indicators that require them. In addition, many tokenizers, including the Gemma, Llama2, and Mistral families, include every byte as a token, but also assign a duplicate token to many characters in the normal ASCII range 0x00–0x7F. For example, in the Gemma models 'A' is both token 282, an unused byte fallback token, and token 235280, a text token. These issues are not universal, and we also find models whose vocabulary includes precisely the 243 bytes used in UTF-8 as tokens, including the EleutherAI models [14].
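The count of 243 valid bytes can be checked directly: encoding every Unicode code point and collecting the bytes that appear leaves exactly 13 values that can never occur in UTF-8 text (0xC0, 0xC1, and the fallback range 0xF5–0xFF). A minimal sketch, independent of the paper's code and using only the Python standard library:

```python
# Derive which byte values can ever occur in valid UTF-8 by encoding
# every Unicode code point and collecting the bytes that appear.
used = set()
for cp in range(0x110000):
    if 0xD800 <= cp <= 0xDFFF:
        continue  # surrogates are not encodable in UTF-8
    used.update(chr(cp).encode("utf-8"))

unused = sorted(set(range(256)) - used)
print(f"{len(used)} byte values occur in UTF-8, {len(unused)} never do:")
print([hex(b) for b in unused])  # expected: 0xc0, 0xc1, 0xf5 ... 0xff
```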
Untrained single-byte tokens are typically classified as "partial UTF-8 sequences" or "unreachable", and our indicators are effective in revealing which of them are never or rarely seen in training. We publish detailed tables showing the status of every single-byte token for each analyzed model in our repository.
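Duplicate byte representations of the kind described above can be located by comparing, for each ASCII character, the byte-fallback spelling of that byte with its plain text token. The following is a minimal sketch, not the paper's tooling, assuming a Hugging Face tokenizer with SentencePiece-style byte fallback (where byte tokens are spelled "<0xNN>"); the model name is only an illustrative choice:

```python
from transformers import AutoTokenizer

# Any SentencePiece-based tokenizer with byte fallback should work here;
# "google/gemma-2b" is an example (its weights require accepting a license).
tok = AutoTokenizer.from_pretrained("google/gemma-2b")
vocab = tok.get_vocab()  # maps token string -> token id

# Look for ASCII characters present both as a byte-fallback token and
# as a plain text token.
for b in range(0x00, 0x80):
    byte_form = f"<0x{b:02X}>"  # SentencePiece byte-token spelling
    text_form = chr(b)
    if byte_form in vocab and text_form in vocab:
        print(f"{text_form!r}: byte token {vocab[byte_form]}, "
              f"text token {vocab[text_form]}")
```

On the Gemma tokenizers this should surface the 'A' duplicate mentioned above, among others.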
3.2.2 Fragments of merged tokens
3.2.3 Special tokens
Many models include under-trained special tokens, such as
[2] See Appendix B for a primer on UTF-8 encoding.
[3] When mentioning fragments of more complete tokens, tokens in (parentheses) were not detected or verified as under-trained, unless explicitly mentioned.