
The Weighted Perplexity Benchmark: Tokenizer-Normalized Evaluation for Language Model Comparison

Published: 7/7/2025

Abstract

Perplexity remains the primary intrinsic metric for evaluating language models, yet direct comparison across models with different tokenization schemes presents methodological challenges. We introduce a tokenizer-normalized perplexity metric that enables consistent comparison of language models regardless of their tokenization approaches. Through empirical analysis of 19 language models across five families on WikiText-2, we demonstrate how tokenization differences can affect traditional perplexity measurements by up to 21.6%. Our normalization method reveals the magnitude of tokenization effects on model comparisons and enables analysis of architectural efficiency patterns. We identify Llama Scout as performing substantially worse than models with far fewer active parameters. The Weighted Perplexity Benchmark provides researchers with a principled approach to evaluate language models while controlling for tokenization strategies.

Authors: Jessica Taylor, Vie McCoy

1. Introduction

As language model architectures, training approaches, and tokenization strategies diversify, comparing their fundamental text prediction capabilities has become increasingly difficult. Perplexity, defined as the exponentiated average negative log-likelihood of a sequence, remains the standard intrinsic evaluation metric. However, different tokenization schemes create systematic differences in perplexity comparisons that complicate cross-model evaluation.

This post addresses the need for tokenizer-independent evaluation metrics through the Weighted Perplexity Benchmark (WPB). Our contributions include:

  1. A mathematical framework for normalizing perplexity scores across different tokenization schemes
  2. Empirical evaluation of 19 language models demonstrating significant tokenization effects
  3. Analysis of how tokenization differences affect model comparisons
  4. Identification of architectural efficiency patterns when tokenization effects are controlled

2. Background and Related Work

2.1 Perplexity and Tokenization Dependencies

For a tokenized sequence $X = (x_1, x_2, \ldots, x_t)$ and model $M$, perplexity is defined as:

$$\text{PPL}_M(X) = \exp\left(-\frac{1}{t}\sum_{i=1}^{t} \ln p_M(x_i|x_{<i})\right)$$

The critical issue is that $t$ (the number of tokens) varies significantly across tokenizers for identical text. As noted by Mielke et al. (2019), this creates systematic differences in perplexity scores that complicate fair comparison.
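
To make the dependence on $t$ concrete, here is a minimal sketch that computes perplexity from per-token log-probabilities. The two token sequences and their log-probabilities are hypothetical placeholders, chosen so that both tokenizations carry the same total NLL for the same text.

```python
import math

def perplexity(token_logprobs):
    """exp of the mean negative log-likelihood over the token sequence."""
    t = len(token_logprobs)
    return math.exp(-sum(token_logprobs) / t)

# Hypothetical log-probs for the same text under two tokenizers.
# Both carry a total NLL of 6.2, but spread over different token counts.
coarse = [-2.1, -1.7, -2.4]              # 3 tokens
fine = [-1.4, -1.2, -1.3, -1.1, -1.2]    # 5 tokens

print(perplexity(coarse))  # ~7.90: same total NLL over fewer tokens
print(perplexity(fine))    # ~3.46: lower per-token PPL, same total NLL
```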

2.2 Prior Normalization Approaches

Several approaches have been proposed to address tokenization differences in language model evaluation:

Bits-per-character (BPC): Widely used in character-level language modeling, BPC measures the average number of bits needed to encode each character (Graves, 2013). As shown by Bauwens (2024), BPC can be derived from token-level perplexity through normalization by character count.

Per-byte perplexity: The llama.cpp project has explored normalizing perplexity by the number of bytes in the original text rather than tokens, enabling comparison across models with different vocabulary sizes.

Character-level evaluation: Direct evaluation at the character level avoids tokenization differences entirely but requires models specifically designed for character-level prediction (Mielke et al., 2019).

Marginal tokenization: Cao & Rimell (2021) propose marginalizing over possible tokenizations, though this requires significant computational overhead and a more complex implementation.

Some of these approaches would require changes to LLM inference toolchains. Our work provides a simple normalization that can be applied to any token-level language model. In particular, we use llama.cpp for perplexity evaluation and then apply the normalization to its output.

3. Methodology

3.1 Tokenizer-Normalized Perplexity

We introduce a normalization that adjusts perplexity scores to a common tokenization baseline. Given two models $A$ and $B$ with tokenizers producing $n_A$ and $n_B$ tokens respectively for the same text:

The total negative log-likelihood for model $B$ is:

$$\text{Total NLL}_B = n_B \times \ln(\text{PPL}_B)$$

The normalized perplexity of model $B$ relative to model $A$'s tokenization is:

$$\text{Normalized PPL}_B = \exp\left(\frac{\text{Total NLL}_B}{n_A}\right) = \text{PPL}_B^{\,n_B/n_A}$$

This adjustment redistributes the total prediction loss across the reference number of tokens. When $n_B > n_A$, model $B$ makes more predictions for the same text, and its perplexity is adjusted upward. When $n_B < n_A$, the adjustment is downward.
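
A minimal sketch of the adjustment, assuming you already have each model's raw perplexity and token count on the shared text (the numbers below are hypothetical placeholders):

```python
def normalize_ppl(ppl_b: float, n_b: int, n_a: int) -> float:
    """Rescale model B's perplexity onto reference model A's token count.

    Model B's total NLL is n_b * ln(ppl_b); re-exponentiating that total
    over the n_a reference tokens gives ppl_b ** (n_b / n_a).
    """
    return ppl_b ** (n_b / n_a)

# Hypothetical model B uses ~4% more tokens than the reference, so its
# raw perplexity of 8.0 is adjusted upward.
print(normalize_ppl(ppl_b=8.0, n_b=300_000, n_a=288_768))  # ~8.68
```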

As a quick check, consider a hypothetical LLM that predicts text exactly as well as Llama 70B but whose tokenizer produces twice as many tokens. We would expect its total NLL to equal Llama 70B's total NLL. Let model $A$ be Llama 70B and model $B$ be this hypothetical model. We can write:

$$n_A \times \ln(\text{PPL}_A) = \text{Total NLL}_A = \text{Total NLL}_B = 2 n_A \times \ln(\text{PPL}_B)$$

$$\ln(\text{PPL}_A) = 2 \ln(\text{PPL}_B)$$

$$\text{PPL}_A = \text{PPL}_B^2$$

$$\text{Normalized PPL}_B = \text{PPL}_B^{\,n_B/n_A} = \text{PPL}_B^{\,2} = \text{PPL}_A = \text{Normalized PPL}_A$$

This quick check confirms that our formula for normalized PPL evaluates the two models equally, despite tokenization differences.

Method assumptions: This approach assumes that the total information content of the text remains constant across tokenizations, redistributing the prediction difficulty across fewer or more tokens. While tokenization schemes may create systematically easier or harder prediction tasks, this provides a principled baseline for comparison.

3.2 Evaluation Protocol

We tested a number of base models on WikiText-2, using llama.cpp and FP8 quantization, on a Mac Studio (512 GB unified memory). We obtained GGUF-quantized models from mradermacher on Hugging Face, or produced them with the llama.cpp-based GGUF My Repo tool, also hosted on Hugging Face.

We evaluated 18 models across five families on WikiText-2: Llama (5 models), Gemma (4 models), Qwen (6 models), Mixtral (2 models), and DeepSeek (1 model). Models range from 0.5B to 236B parameters and include both dense and Mixture-of-Experts architectures. We use Llama 3's tokenizer as the reference baseline, chosen for its temporal priority and widespread adoption.

4. Results

4.1 Empirical Tokenization Differences

Our analysis of WikiText-2 across major model families reveals substantial tokenization variations:

| Family | Token Count | % Difference from Llama 3 |
|---|---|---|
| Llama 3 | 288,768 | 0% (reference) |
| Llama 4 | 288,252 | -0.2% |
| Gemma | 294,912 | +2.1% |
| DeepSeek | 305,152 | +5.7% |
| Qwen | 299,008 | +3.5% |
| Mixtral | 328,704 | +13.8% |

These differences directly impact perplexity calculations, making cross-family comparisons challenging to interpret without normalization.
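
These counts can be reproduced, at least approximately (llama.cpp applies its own chunking and special-token handling), by running each family's tokenizer over the same raw text. Below is a sketch using Hugging Face `transformers`; the checkpoint names are assumptions, and any checkpoint from the same family shares its tokenizer.

```python
from transformers import AutoTokenizer

# Illustrative checkpoints; some of these repositories are gated.
TOKENIZERS = {
    "Llama 3": "meta-llama/Meta-Llama-3-8B",
    "Gemma": "google/gemma-3-4b-it",
    "Qwen": "Qwen/Qwen2.5-7B",
    "Mixtral": "mistralai/Mixtral-8x7B-v0.1",
}

def count_tokens(text):
    counts = {}
    for family, checkpoint in TOKENIZERS.items():
        tokenizer = AutoTokenizer.from_pretrained(checkpoint)
        counts[family] = len(tokenizer.encode(text, add_special_tokens=False))
    return counts

# text = open("wikitext-2-raw/wiki.test.raw").read()
# print(count_tokens(text))
```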

4.2 Tokenization Impact on Model Comparisons

This table shows original versus normalized perplexity scores:

| Model | PPL | Tokens | Normalized PPL | Change |
|---|---|---|---|---|
| Llama 3.2 1B | 10.195 | 288,768 | 10.195 | 0% |
| Llama 3.2 3B | 8.082 | 288,768 | 8.082 | 0% |
| Llama 3.1 8B | 6.404 | 288,768 | 6.404 | 0% |
| Llama 3.1 70B | 2.824 | 288,768 | 2.824 | 0% |
| Llama 4 Scout | 8.840 | 288,252 | 8.805 | -0.39% |
| Gemma 3 1B | 10.801 | 294,912 | 11.362 | +5.19% |
| Gemma 3 4B | 7.438 | 294,912 | 7.762 | +4.36% |
| Gemma 3 12B | 5.776 | 294,912 | 5.996 | +3.80% |
| Gemma 3 27B | 4.740 | 294,912 | 4.899 | +3.37% |
| Qwen 2.5 0.5B | 13.908 | 299,008 | 15.269 | +9.78% |
| Qwen 2.5 1.5B | 9.802 | 299,008 | 10.628 | +8.43% |
| Qwen 2.5 3B | 8.424 | 299,008 | 9.085 | +7.85% |
| Qwen 3 4B | 8.151 | 299,008 | 8.780 | +7.72% |
| Qwen 3 8B | 7.224 | 299,008 | 7.749 | +7.26% |
| Qwen 3 30B-A3B | 6.256 | 299,008 | 6.676 | +6.72% |
| Mixtral 8×7B | 4.104 | 328,704 | 4.989 | +21.56% |
| Mixtral 8×22B | 2.973 | 328,704 | 3.457 | +16.26% |
| DeepSeek V2 | 3.980 | 305,152 | 4.304 | +8.15% |

Key findings:

  1. Mixtral models experience the largest adjustment (up to 21.6% increase in perplexity), corresponding to their 13.8% higher token count; a worked example follows this list.
  2. Magnitude of tokenization effects: Changes range from 0% (reference family) to over 20%.
  3. Systematic patterns: Models using tokenizers that segment text more finely have lower raw perplexity scores, as expected from the mathematical relationship between token count and per-token prediction difficulty.
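
As a concrete check of finding 1, plugging Mixtral 8×7B's raw perplexity and the two token counts into the normalization gives:

$$\text{Normalized PPL} = 4.104^{\,328{,}704/288{,}768} = 4.104^{\,1.138} \approx 4.99$$

which is a 21.6% increase over the raw score, matching the table.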

4.3 Architectural Analysis

The normalization enables analysis of architectural efficiency patterns:

Dense Model Scaling: Among dense models, we expect a relationship between parameter count and normalized perplexity. The plot below shows parameter count versus normalized PPL for the 13 dense models:

[Figure: Normalized PPL scatter plot]

To fit a curve, we take the natural log of both parameter count and normalized PPL, and then apply linear regression:

[Figure: Normalized PPL line fit in log space]

Our approximation is:

$$\ln(\text{Normalized PPL}) \approx -0.296 \cdot \ln\left(\frac{\text{Params}}{1\text{B}}\right) + 2.48$$

The fit gives $R^2 = 0.943$, a reasonably strong relationship. To put this in perspective, doubling the parameter count reduces normalized PPL by about 18.5%, and a 10× increase in parameter count reduces it by about 49.4%. Or, more succinctly: 10× the parameters means roughly a halving of normalized perplexity.
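
A minimal sketch of this fit, assuming a table of (nominal parameters in billions, normalized PPL) pairs for the dense models; the two rows shown are taken from the results table above, and the full fit would use all 13:

```python
import numpy as np

# (nominal parameters in billions, normalized PPL); extend with the
# remaining dense models from the results table for the full fit.
data = [(1.0, 10.195), (70.0, 2.824)]

params_b, norm_ppl = np.array(data).T
slope, intercept = np.polyfit(np.log(params_b), np.log(norm_ppl), deg=1)

# Scaling parameter count by a factor k multiplies normalized PPL by k**slope.
for k in (2, 10):
    print(f"{k}x parameters -> normalized PPL x {k ** slope:.2f}")
```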

The limited number of models and variation across model families constrains statistical confidence in these relationships.

Mixture-of-Experts Performance: Mixtral models maintain competitive performance after normalization. However, architectural conclusions should be drawn cautiously given the limited sample size.

4.4 Llama Scout: An Architectural Outlier

Llama Scout emerges as a striking outlier in our analysis. Despite using approximately 17B active parameters per forward pass (out of ~109B total parameters), it achieves a normalized perplexity of 8.81, worse than Llama 3.2 3B (8.08), which has 5.7× fewer active parameters. This suggests significant inefficiencies in this particular Mixture-of-Experts implementation, though generalizations about MoE architectures should be made cautiously from a single example.

To check whether quantization was affecting performance, we tested an FP16 version of Llama Scout, which yielded an (un-normalized) PPL of 8.8486, hardly different from the FP8 version's 8.8396. Hence, FP8 quantization does not explain Llama Scout's poor performance.

4.5 Implications for Model Evaluation

Our results demonstrate that tokenization differences substantially affect model comparisons:

  1. Magnitude of effects: Up to 21.6% difference in apparent performance associated with tokenization differences
  2. Evaluation consistency: Normalization provides more consistent comparison baselines across model families
  3. Architectural insights: Controlling for tokenization enables cleaner analysis of design trade-offs

5. Discussion

5.1 Methodological Implications

The Weighted Perplexity Benchmark addresses tokenization inconsistencies in current evaluation practice. The framework enables:

  1. Consistent cross-family comparison of language models
  2. Controlled assessment of architectural innovations
  3. Cleaner analysis of scaling relationships

5.2 Relationship to Prior Work

Our approach builds on established normalization concepts from BPC and character-level evaluation but applies them in a computationally simple way to existing token-level models. This approach works easily with existing toolchains such as llama.cpp.

5.3 Limitations and Future Work

Several limitations constrain our analysis:

  1. Single dataset: Results are demonstrated only on WikiText-2; generalization across domains and languages requires validation. Particular benchmarks, especially ones with public data, can be gamed over time.
  2. Statistical validation: A limited set of tested models limits statistical confidence.
  3. Limited architectural diversity: Conclusions about MoE efficiency rest on limited examples.

Future work should extend validation to multiple datasets and languages, and test a broader set of models. Additionally, the method should be validated by independent implementations. The relevance of perplexity can also be assessed generally by finding its relationship with other benchmarks.

6. Conclusion

We have introduced the Weighted Perplexity Benchmark, a tokenizer-normalized evaluation framework that enables more consistent comparison of language models across different tokenization schemes. Our analysis of 19 models reveals substantial effects of tokenization differences on perplexity evaluations, with changes of up to 21.6%.

The framework provides researchers with a principled approach to language model evaluation that controls for tokenization differences, while working with existing toolchains such as llama.cpp. This enables more consistent assessment of architectural innovations and cleaner analysis of scaling behaviors. The method is simple to implement and can be applied to any tokenized evaluation dataset.

Our findings highlight the importance of methodological rigor in language model evaluation. While this approach addresses tokenization inconsistencies, broader validation across datasets, reference choices, and evaluation metrics will strengthen its applicability to the field.

References

Bauwens, T. (2024). Bits-per-character and its relation to perplexity. Personal Blog. https://bauwenst.github.io/posts/explainers/2024-07-29-Bits-per-character/

Cao, S., & Rimell, L. (2021). You should evaluate your language model on marginal likelihood over tokenisations. ACL.

Graves, A. (2013). Generating sequences with recurrent neural networks. arXiv:1308.0850.

Mielke, S. J., Cotterell, R., Gorman, K., Roark, B., & Eisner, J. (2019). What kind of language is hard to language-model? ACL.