Introducing EvaByte, an efficient and strong byte-level language model
Full team: Lin Zheng, Xueliang Zhao, Guangtao Wang, Chen Wu, David Dong, Angela Wang, Mingran Wang, Yun Du, Haige Bo, Amol Sharma, Bo Li, Kejie Zhang, Changran Hu, Urmish Thakker, and Lingpeng Kong
In a collaborative effort between the University of Hong Kong and SambaNova Systems, we introduce EvaByte, a 6.5B state-of-the-art byte-level language model featuring an improved architecture and powered by EVA – an efficient attention mechanism designed for scalability and performance.
Trained on 1.5T bytes of natural language text, math, and code using the performant SambaNova SN30 RDU system, EvaByte demonstrates that efficient byte-level processing at scale is not just possible, but practically advantageous – rivaling modern open-source tokenizer-based LMs.
To our knowledge, EvaByte is the first open-source byte-level model without tokenization that matches the performance of modern tokenizer-based LMs. Check out the model weights and code here:
Tokenization is a fundamental step in modern large language models, deciding how input is represented in Transformers. Although it efficiently compresses raw text into shorter sequences, tokenization comes with its own baggage – it is an externally trained, detached component that can introduce complex biases and edge-case quirks, like the prompt boundary problem.
Byte-level modeling inherently eliminates the biases introduced by tokenization, although directly operating on bytes at scale is not easy: byte sequences are several times longer than their tokenized counterparts, inflating both training and inference costs.
We address these hurdles with a streamlined architecture featuring two improvements: multibyte prediction and the efficient attention mechanism, EVA.
Vanilla byte-level language models typically run much slower than tokenizer-based LMs. With the improved architecture, however, we achieve a significant speed boost for byte models: decoding is 5-10x faster than with vanilla byte-level architectures and even up to 2x faster than tokenizer-based LMs, making byte-level models a practical choice for real-world applications.
We draw inspiration from recent work on multi-token prediction: during training, EvaByte predicts several upcoming bytes in parallel through multiple prediction heads, and at inference these heads draft candidate bytes that the model verifies in a single pass (self-speculative decoding). Multibyte prediction adds almost no training overhead, thanks to the particularly small vocabulary size.
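A minimal sketch of how extra prediction heads can speed up decoding via self-speculation. The "heads" below are toy deterministic stand-ins, not EvaByte's real ones: cheap draft heads propose a few future bytes, and the main head verifies them in one left-to-right sweep, so several bytes can be confirmed per step.

```python
def main_head(ctx):
    # toy deterministic "main head", standing in for the LM's next-byte prediction
    return (sum(ctx) * 31 + 7) % 256

def draft_heads(ctx, k):
    # toy extra heads drafting k future bytes; imperfect on purpose
    # (every other position drops the *31 multiplier), so some drafts get rejected
    c, out = list(ctx), []
    for i in range(k):
        b = (sum(c) * (31 if i % 2 == 0 else 1) + 7) % 256
        out.append(b)
        c.append(b)
    return out

def speculative_step(ctx, drafts):
    # verify drafts against the main head left to right: every emitted byte is
    # the main head's own prediction; stop at the first draft mismatch
    accepted, c = [], list(ctx)
    for d in drafts:
        m = main_head(c)
        accepted.append(m)
        c.append(m)
        if d != m:
            break
    return accepted

def generate(ctx, n, k=4):
    # self-speculative decoding: draft k bytes, verify in one pass, repeat
    out = list(ctx)
    while len(out) < len(ctx) + n:
        out += speculative_step(out, draft_heads(out, k))
    return out[len(ctx):len(ctx) + n]

def greedy(ctx, n):
    # plain byte-by-byte decoding baseline
    out = list(ctx)
    for _ in range(n):
        out.append(main_head(out))
    return out[len(ctx):]
```

Because every emitted byte is ultimately the main head's own prediction, the output matches plain greedy decoding exactly; the speedup comes from confirming multiple bytes per verification step.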
However, multibyte prediction alone is not enough to speed up the byte-level model: the self-attention mechanism quickly becomes the major bottleneck as the context length grows. To address this, we build our model on EVA, an efficient attention mechanism rooted in linearized attention.
By linearizing $\exp(\cdot)$, one can rearrange the order of computation and achieve linear complexity in sequence length. This approach admits the form of a linear RNN, maintaining a global hidden state. With gating mechanisms and decay coefficients, such linear RNNs can be made considerably more expressive, but compressing the entire context into a single fixed-size state still limits their capacity.
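The reordering trick can be illustrated with a tiny feature map (the `phi` below is an illustrative choice, not EVA's actual linearization): the quadratic form and the recurrent form compute identical outputs, but the latter only carries a fixed-size running state.

```python
import math

def phi(x):
    # simple positive feature map standing in for a linearization of exp(.)
    return [xi + 1.0 if xi > 0 else math.exp(xi) for xi in x]

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def quadratic(qs, ks, vs):
    # direct O(n^2) evaluation: out_i = sum_j w_ij v_j / sum_j w_ij,
    # with w_ij = phi(q_i) . phi(k_j), causally masked to j <= i
    outs = []
    for i, q in enumerate(qs):
        fq = phi(q)
        num = [0.0] * len(vs[0])
        den = 0.0
        for j in range(i + 1):
            w = dot(fq, phi(ks[j]))
            den += w
            for d in range(len(vs[0])):
                num[d] += w * vs[j][d]
        outs.append([x / den for x in num])
    return outs

def recurrent(qs, ks, vs):
    # same computation rearranged: carry a global state S = sum phi(k) v^T
    # and normalizer z = sum phi(k), giving O(n) cost in sequence length
    dk, dv = len(ks[0]), len(vs[0])
    S = [[0.0] * dv for _ in range(dk)]
    z = [0.0] * dk
    outs = []
    for q, k, v in zip(qs, ks, vs):
        fk = phi(k)
        for a in range(dk):
            z[a] += fk[a]
            for b in range(dv):
                S[a][b] += fk[a] * v[b]
        fq = phi(q)
        den = dot(fq, z)
        outs.append([dot(fq, [S[a][b] for a in range(dk)]) / den
                     for b in range(dv)])
    return outs
```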
EVA takes a middle ground by distributing the global state into multiple local memory slots. By splitting key-value pairs into consecutive chunks and applying linearization separately on each chunk, EVA maintains a local hidden state for each chunk and aggregates them together to produce the final output. This expands the design space of linearized attention mechanisms, simplifies implementation, and directly benefits from hardware-optimized kernels for standard attention mechanisms.
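A sketch of the chunked bookkeeping, under the simplifying assumption that local states are aggregated by a plain sum (so it provably matches the single-global-state variant); the real EVA applies a more expressive aggregation over the memory slots.

```python
import math

def phi(x):
    # illustrative positive feature map, not EVA's actual linearization
    return [xi + 1.0 if xi > 0 else math.exp(xi) for xi in x]

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def global_linear(qs, ks, vs):
    # reference: linearized attention with ONE global running state
    dk, dv = len(ks[0]), len(vs[0])
    S = [[0.0] * dv for _ in range(dk)]
    z = [0.0] * dk
    outs = []
    for q, k, v in zip(qs, ks, vs):
        fk = phi(k)
        for a in range(dk):
            z[a] += fk[a]
            for b in range(dv):
                S[a][b] += fk[a] * v[b]
        fq = phi(q)
        den = dot(fq, z)
        outs.append([dot(fq, [S[a][b] for a in range(dk)]) / den
                     for b in range(dv)])
    return outs

def eva_chunked(qs, ks, vs, chunk):
    # EVA-style bookkeeping: one local state (S_c, z_c) per chunk of
    # key-value pairs; at query time the local slots are aggregated
    dk, dv = len(ks[0]), len(vs[0])
    states = []                            # completed chunks
    S = [[0.0] * dv for _ in range(dk)]    # current (partial) chunk state
    z = [0.0] * dk
    outs = []
    for t, (q, k, v) in enumerate(zip(qs, ks, vs)):
        fk = phi(k)
        for a in range(dk):
            z[a] += fk[a]
            for b in range(dv):
                S[a][b] += fk[a] * v[b]
        fq = phi(q)
        num = [0.0] * dv
        den = 0.0
        # aggregate all completed slots plus the in-progress chunk
        for Sc, zc in states + [(S, z)]:
            den += dot(fq, zc)
            for b in range(dv):
                num[b] += dot(fq, [Sc[a][b] for a in range(dk)])
        outs.append([x / den for x in num])
        if (t + 1) % chunk == 0:
            states.append((S, z))
            S = [[0.0] * dv for _ in range(dk)]
            z = [0.0] * dk
    return outs
```

With a sum aggregation the chunked states partition the same totals, so the output equals the global variant; the point of the chunked layout is that it opens up richer per-slot aggregation rules and maps well onto hardware-optimized attention kernels.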
We pretrain EvaByte on a corpus of 1.5T bytes spanning text, math, and code, mainly sourced from Dolma v1.7, The Stack v2, FineWeb-Edu, and DCLM-Baseline. We constantly refined the data mix by tweaking proportions or swapping in new sources mid-flight. After training on 1.2T bytes, we conduct two independent annealing runs (100B and 200B bytes respectively), where the learning rate is linearly decayed from 1e-4 to 0 and the checkpoints are merged via model soup.
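In sketch form, the annealing schedule and checkpoint merging amount to the following (plain parameter dicts stand in for real checkpoints):

```python
def linear_decay(step, total_steps, lr0=1e-4):
    # learning rate decays linearly from lr0 to 0 over the annealing run
    return lr0 * (1.0 - step / total_steps)

def model_soup(checkpoints):
    # uniform parameter average of checkpoints ("model soup");
    # each checkpoint maps parameter name -> list of weights
    n = len(checkpoints)
    return {
        name: [sum(ckpt[name][i] for ckpt in checkpoints) / n
               for i in range(len(checkpoints[0][name]))]
        for name in checkpoints[0]
    }
```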
EvaByte is trained with a batch size of 8M bytes and a 32K context length on 256 SambaNova SN30-2 RDUs. We observed non-trivial instability during pretraining: intermediate checkpoints would sometimes produce systematic character-level glitches (e.g., every `e` in generated outputs turning into an `i`) when prompted to perform generation tasks; interestingly, these glitches resolved themselves after a few thousand training steps and never appeared near the end of training. One such glitched completion:

```python
from typing import List, Tuple

def sum_product(numbers: List[int]) -> Tuple[int, int]:
    """ For a given list of integers, return a tuple consisting of a sum and a product of all the integers in a list.
    Empty sum should be equal to 0 and empty product should be equal to 1.
    >>> sum_product([])
    (0, 1)
    >>> sum_product([1, 2, 3, 4])
    (10, 24)
    """
    sum = 0
    product = 1
    for number in numbirs:  # note: `numbers`/`number` have glitched into `numbirs`/`numbir`
        sum += numbir
        product *= numbir
    return (sum, product)
```
Other attempts, like freezing embedding parameters or applying weighted average over different prediction heads, offered little improvement.
Let’s dive into how EvaByte performs in practice. We compare EvaByte’s intermediate checkpoints against recent language models (OLMo-1.7-7B and OLMo-2-7B) trained on roughly the same amount of data, and observe that the EvaByte checkpoint at 1.22T bytes (roughly 0.4T tokens) consistently outperforms them by a large margin.
We also tracked EvaByte’s task performance throughout pretraining and observed a consistent upward trend with no signs of plateauing. Interestingly, EvaByte excels at coding tasks (e.g., HumanEval and MBPP), even though we intentionally reduced the proportion of code data in the later stages of training. One possible reason is that removing tokenization might eliminate domain-specific biases, enabling more efficient parallel learning across domains. A deeper investigation into this behavior is planned for future work.
We take EvaByte a step further with supervised fine-tuning. Following the recipe of DCLM, we fine-tune EvaByte on openly available instruction-tuning data.
As mentioned at the beginning, we demonstrate below that byte-level modeling naturally avoids tokenization quirks and edge-case behaviors, such as the prompt boundary problem, where tokenizer-based LMs behave inconsistently around prompt boundaries. EvaByte resolves these cases seamlessly and delivers more predictable results.
For example, take the HumanEval `longest` prompt truncated at slightly different byte boundaries (ending at `longest = strings`, `strings[`, `strings[0`, `strings[0]`, `strings[0]\n`, and so on). EvaByte produces the same correct continuation from every boundary:

```python
def longest(strings: List[str]) -> Optional[str]:
    """ Out of list of strings, return the longest one. ...
    >>> longest(['a', 'bb', 'ccc'])
    'ccc'
    """
    if not strings:
        return None
    longest = strings[0]
    ...
```
A tokenizer-based LM given the same truncated prompts, by contrast, drifts into divergent and often malformed continuations: duplicating the docstring terminator (`"""""`), hallucinating extra doctests, commenting out the function body, or emitting invalid syntax such as `if len(strings) == 0 None:`.
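The boundary mismatch can be made concrete with a toy greedy tokenizer (the vocabulary below is hypothetical, not any real tokenizer's merges): a truncated prompt tokenizes into a sequence that is not a prefix of the full text's tokens, whereas at the byte level the prefix relationship always holds.

```python
def tokenize(text, vocab):
    # greedy longest-match tokenizer, a stand-in for BPE-style tokenization
    max_len = max(map(len, vocab))
    toks, i = [], 0
    while i < len(text):
        for L in range(min(len(text) - i, max_len), 0, -1):
            if text[i:i + L] in vocab:
                toks.append(text[i:i + L])
                i += L
                break
    return toks

# hypothetical vocabulary with single-character fallbacks
vocab = {"the", " cat", "th", "t", "h", "e", " ", "c", "a"}

full_tokens = tokenize("the cat", vocab)    # ['the', ' cat']
prompt_tokens = tokenize("th", vocab)       # ['th']

# the truncated prompt's tokens are not a prefix of the full text's tokens,
# so the model is conditioned on a sequence it rarely saw during training
assert full_tokens[:1] != prompt_tokens

# at the byte level the truncated prompt is always a prefix of the full text
assert "the cat".encode()[:2] == "th".encode()
```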
EvaByte is also flexible enough to extend to multimodal tasks, treating image data as just another byte stream according to some protocol, such as JPEG or PNG. We follow prior work and train EvaByte on interleaved image-text data, with images represented directly by their raw bytes.
Our multimodal dataset spans around 2T bytes, primarily filtered from OBELICS.
Image | Text |
---|---|
![]() | Detailed caption: The image shows a close-up view of a laptop keyboard. The keyboard has a black color scheme with white lettering on the keys. The keys are arranged in a standard QWERTY layout, which is commonly used for English language keyboards. The keys are labeled with the standard Western letters and symbols, including the alphabet, numbers, and punctuation marks. The laptop appears to be a modern model, possibly a MacBook, given the design and the visible trackpad to the right of the keyboard. The trackpad is a flat, rectangular surface with a grid pattern, typical of MacBook trackpads. The laptop is open, and the keyboard is the main focus of the image. The background is blurred, but it seems to be an indoor setting with a wooden surface, possibly a table or desk. The lighting in the image is soft, suggesting an indoor environment with ambient lighting. |
![]() | Detailed caption: The image shows a black dog swimming in a body of water. The dog appears to be a medium to large breed, with a glossy black coat. It is captured mid-swim, with its head above the water and its body partially submerged. The water around the dog is rippling, indicating movement, and the surface of the water reflects the light, suggesting it is a sunny day. There are no visible texts or distinguishing marks that provide additional context about the location or the dog's identity. The style of the image is a candid, real-life photograph, capturing a moment of the dog's activity. |
![]() | Q: How many throw pillows are on the bed? A: three |
![]() | Q: Which iconic landmark is on the picture? A: The Eiffel Tower |
![]() | Q: What 2 colors are the flowers? A: red and yellow |
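Treating an image as just another byte stream is straightforward to sketch. The special marker ids and the payload below are illustrative assumptions, not EvaByte's actual protocol:

```python
JPEG_SOI = b"\xff\xd8"  # JPEG start-of-image marker
JPEG_EOI = b"\xff\xd9"  # JPEG end-of-image marker
# stand-in payload; a real JPEG would carry actual entropy-coded data
fake_image = JPEG_SOI + bytes(range(16)) + JPEG_EOI

def to_byte_ids(blob, image_start=256, image_end=257):
    # map raw bytes to vocabulary ids 0..255; the bracketing special ids
    # (256/257) are hypothetical markers for an image span
    return [image_start] + list(blob) + [image_end]

ids = to_byte_ids(fake_image)
```

The resulting id sequence can be interleaved with text bytes in a single context, which is what makes extending a byte-level model to images a data question rather than an architecture change.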
A recent concurrent work, Byte Latent Transformers (BLTs), likewise explores tokenizer-free language modeling at scale.
The main difference between BLTs and EvaByte lies in the architecture: BLTs use patchification and propose entropy patching to dynamically group bytes. While this approach adjusts compute allocation based on data complexity and reduces context length, it still relies on external models to determine patch boundaries. The majority of compute ends up focused on patch-level modeling, detached from the byte stream, similar to tokenizer-based models.
In contrast, EvaByte keeps things simple: it directly operates on bytes with a flat Transformer-like model, without invoking external modules or grouping inputs. Empirically, EvaByte achieves better performance than BLTs even with 3-4x fewer training bytes, as shown in the table below. Moreover, EvaByte is more flexible, scaling easily to multimodal data, while BLTs require retraining or swapping out the auxiliary language model used for entropy patching.
We introduce EvaByte, a new family of efficient, scalable, and flexible byte-level language models. The ability to rival tokenization-based LMs with 5x less data while being faster highlights the significant potential of lower-level language modeling within the EvaByte architecture. Future research directions include further refining the model’s architecture to improve both its capacity and efficiency, analyzing in depth how lower-level language models scale with increasing sizes and data volume, as well as extending the context length to seamlessly process diverse data types – images, videos, and audio – simultaneously.
@misc{evabyte,
title = {EvaByte: Efficient Byte-level Language Models at Scale},
url = {https://hkunlp.github.io/blog/2025/evabyte},
author = {Lin Zheng and Xueliang Zhao and Guangtao Wang and Chen Wu and David Dong and Angela Wang and Mingran Wang and Yun Du and Haige Bo and Amol Sharma and Bo Li and Kejie Zhang and Changran Hu and Urmish Thakker and Lingpeng Kong},
year = {2025}
}