Introducing EvaByte, an efficient and strong byte-level language model
Full team: Lin Zheng, Xueliang Zhao, Guangtao Wang, Chen Wu, David Dong, Angela Wang, Mingran Wang, Yun Du, Haige Bo, Amol Sharma, Bo Li, Kejie Zhang, Changran Hu, Urmish Thakker, and Lingpeng Kong
In a collaborative effort between the University of Hong Kong and SambaNova Systems, we introduce EvaByte, a 6.5B state-of-the-art byte-level language model featuring an improved architecture and powered by EVA – an efficient attention mechanism designed for scalability and performance.
Trained on 1.5T bytes of natural language text, math, and code using the performant SambaNova SN30 RDU system, EvaByte demonstrates that efficient byte-level processing at scale is not just possible, but practically advantageous – rivaling modern open-source tokenizer-based LMs.
To our knowledge, EvaByte is the first open-source byte-level model without tokenization that matches the performance of modern tokenizer-based LMs. Check out the model weights and code here:
Tokenization is a fundamental step in modern large language models, deciding how input is represented in Transformers. Although it efficiently compresses raw text into shorter sequences, tokenization comes with its own baggage – it is an externally trained, detached component that can introduce complex biases and edge-case quirks, like the prompt boundary problem.
Byte-level modeling inherently eliminates the biases introduced by tokenization, although directly operating on bytes at scale is not easy: byte sequences are several times longer than their tokenized counterparts, inflating both training and inference costs.
We address these hurdles with a streamlined architecture featuring two improvements: multibyte prediction and the efficient attention mechanism, EVA.
Vanilla byte-level language models typically run much slower than tokenizer-based LMs. With the improved architecture, however, we achieve a significant speed boost for byte models: decoding is 5-10x faster than with vanilla byte-level architectures and even up to 2x faster than tokenizer-based LMs, making byte-level models a practical choice for real-world applications.
We draw inspiration from recent work on multi-token prediction: during training, EvaByte predicts several upcoming bytes in parallel through multiple prediction heads, and at inference these heads draft candidate bytes that the model verifies in a single pass (self-speculative decoding). Multibyte prediction adds almost no training overhead, thanks to the particularly small vocabulary size.
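A minimal sketch of how extra prediction heads can speed up decoding via self-speculation. The "heads" below are toy deterministic stand-ins, not EvaByte's real ones: cheap draft heads propose a few future bytes, and the main head verifies them in one left-to-right sweep, so several bytes can be confirmed per step.

```python
def main_head(ctx):
    # toy deterministic "main head", standing in for the LM's next-byte prediction
    return (sum(ctx) * 31 + 7) % 256

def draft_heads(ctx, k):
    # toy extra heads drafting k future bytes; imperfect on purpose
    # (every other position drops the *31 multiplier), so some drafts get rejected
    c, out = list(ctx), []
    for i in range(k):
        b = (sum(c) * (31 if i % 2 == 0 else 1) + 7) % 256
        out.append(b)
        c.append(b)
    return out

def speculative_step(ctx, drafts):
    # verify drafts against the main head left to right: every emitted byte is
    # the main head's own prediction; stop at the first draft mismatch
    accepted, c = [], list(ctx)
    for d in drafts:
        m = main_head(c)
        accepted.append(m)
        c.append(m)
        if d != m:
            break
    return accepted

def generate(ctx, n, k=4):
    # self-speculative decoding: draft k bytes, verify in one pass, repeat
    out = list(ctx)
    while len(out) < len(ctx) + n:
        out += speculative_step(out, draft_heads(out, k))
    return out[len(ctx):len(ctx) + n]

def greedy(ctx, n):
    # plain byte-by-byte decoding baseline
    out = list(ctx)
    for _ in range(n):
        out.append(main_head(out))
    return out[len(ctx):]
```

Because every emitted byte is ultimately the main head's own prediction, the output matches plain greedy decoding exactly; the speedup comes from confirming multiple bytes per verification step.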
However, multibyte prediction alone is not enough to speed up the byte-level model: the self-attention mechanism quickly becomes the major bottleneck as the context length grows. To address this, we build our model on EVA, an efficient attention mechanism rooted in linearized attention.
By linearizing $\exp(\cdot)$, one can rearrange the order of computation and achieve linear complexity in sequence length. This approach admits the form of a linear RNN, maintaining a global hidden state. With gating mechanisms and decay coefficients, such linear RNNs can be made considerably more expressive, but compressing the entire context into a single fixed-size state still limits their capacity.
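The reordering trick can be illustrated with a tiny feature map (the `phi` below is an illustrative choice, not EVA's actual linearization): the quadratic form and the recurrent form compute identical outputs, but the latter only carries a fixed-size running state.

```python
import math

def phi(x):
    # simple positive feature map standing in for a linearization of exp(.)
    return [xi + 1.0 if xi > 0 else math.exp(xi) for xi in x]

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def quadratic(qs, ks, vs):
    # direct O(n^2) evaluation: out_i = sum_j w_ij v_j / sum_j w_ij,
    # with w_ij = phi(q_i) . phi(k_j), causally masked to j <= i
    outs = []
    for i, q in enumerate(qs):
        fq = phi(q)
        num = [0.0] * len(vs[0])
        den = 0.0
        for j in range(i + 1):
            w = dot(fq, phi(ks[j]))
            den += w
            for d in range(len(vs[0])):
                num[d] += w * vs[j][d]
        outs.append([x / den for x in num])
    return outs

def recurrent(qs, ks, vs):
    # same computation rearranged: carry a global state S = sum phi(k) v^T
    # and normalizer z = sum phi(k), giving O(n) cost in sequence length
    dk, dv = len(ks[0]), len(vs[0])
    S = [[0.0] * dv for _ in range(dk)]
    z = [0.0] * dk
    outs = []
    for q, k, v in zip(qs, ks, vs):
        fk = phi(k)
        for a in range(dk):
            z[a] += fk[a]
            for b in range(dv):
                S[a][b] += fk[a] * v[b]
        fq = phi(q)
        den = dot(fq, z)
        outs.append([dot(fq, [S[a][b] for a in range(dk)]) / den
                     for b in range(dv)])
    return outs
```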
EVA takes a middle ground by distributing the global state into multiple local memory slots. By splitting key-value pairs into consecutive chunks and applying linearization separately on each chunk, EVA maintains a local hidden state for each chunk and aggregates them together to produce the final output. This expands the design space of linearized attention mechanisms, simplifies implementation, and directly benefits from hardware-optimized kernels for standard attention mechanisms.
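A sketch of the chunked bookkeeping, under the simplifying assumption that local states are aggregated by a plain sum (so it provably matches the single-global-state variant); the real EVA applies a more expressive aggregation over the memory slots.

```python
import math

def phi(x):
    # illustrative positive feature map, not EVA's actual linearization
    return [xi + 1.0 if xi > 0 else math.exp(xi) for xi in x]

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def global_linear(qs, ks, vs):
    # reference: linearized attention with ONE global running state
    dk, dv = len(ks[0]), len(vs[0])
    S = [[0.0] * dv for _ in range(dk)]
    z = [0.0] * dk
    outs = []
    for q, k, v in zip(qs, ks, vs):
        fk = phi(k)
        for a in range(dk):
            z[a] += fk[a]
            for b in range(dv):
                S[a][b] += fk[a] * v[b]
        fq = phi(q)
        den = dot(fq, z)
        outs.append([dot(fq, [S[a][b] for a in range(dk)]) / den
                     for b in range(dv)])
    return outs

def eva_chunked(qs, ks, vs, chunk):
    # EVA-style bookkeeping: one local state (S_c, z_c) per chunk of
    # key-value pairs; at query time the local slots are aggregated
    dk, dv = len(ks[0]), len(vs[0])
    states = []                            # completed chunks
    S = [[0.0] * dv for _ in range(dk)]    # current (partial) chunk state
    z = [0.0] * dk
    outs = []
    for t, (q, k, v) in enumerate(zip(qs, ks, vs)):
        fk = phi(k)
        for a in range(dk):
            z[a] += fk[a]
            for b in range(dv):
                S[a][b] += fk[a] * v[b]
        fq = phi(q)
        num = [0.0] * dv
        den = 0.0
        # aggregate all completed slots plus the in-progress chunk
        for Sc, zc in states + [(S, z)]:
            den += dot(fq, zc)
            for b in range(dv):
                num[b] += dot(fq, [Sc[a][b] for a in range(dk)])
        outs.append([x / den for x in num])
        if (t + 1) % chunk == 0:
            states.append((S, z))
            S = [[0.0] * dv for _ in range(dk)]
            z = [0.0] * dk
    return outs
```

With a sum aggregation the chunked states partition the same totals, so the output equals the global variant; the point of the chunked layout is that it opens up richer per-slot aggregation rules and maps well onto hardware-optimized attention kernels.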
We pretrain EvaByte on a corpus of 1.5T bytes spanning text, math, and code, mainly sourced from Dolma v1.7, The Stack v2, FineWeb-Edu, and DCLM-Baseline. We constantly refined the data mix by tweaking proportions or swapping in new sources mid-flight. After training on 1.2T bytes, we conduct two independent annealing runs (100B and 200B bytes respectively), where the learning rate is linearly decayed from 1e-4 to 0 and the checkpoints are merged via model soup.
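In sketch form, the annealing schedule and checkpoint merging amount to the following (plain parameter dicts stand in for real checkpoints):

```python
def linear_decay(step, total_steps, lr0=1e-4):
    # learning rate decays linearly from lr0 to 0 over the annealing run
    return lr0 * (1.0 - step / total_steps)

def model_soup(checkpoints):
    # uniform parameter average of checkpoints ("model soup");
    # each checkpoint maps parameter name -> list of weights
    n = len(checkpoints)
    return {
        name: [sum(ckpt[name][i] for ckpt in checkpoints) / n
               for i in range(len(checkpoints[0][name]))]
        for name in checkpoints[0]
    }
```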
EvaByte is trained with a batch size of 8M bytes and a 32K context length on 256 SambaNova SN30-2 RDUs. We observed non-trivial instability during pretraining: intermediate checkpoints would sometimes produce systematic character-level glitches (e.g., every `e` in generated outputs turning into an `i`) when prompted to perform generation tasks; interestingly, these glitches resolved themselves after a few thousand training steps and never appeared near the end of training. One such glitched completion:

```python
from typing import List, Tuple

def sum_product(numbers: List[int]) -> Tuple[int, int]:
    """ For a given list of integers, return a tuple consisting of a sum and a product of all the integers in a list.
    Empty sum should be equal to 0 and empty product should be equal to 1.
    >>> sum_product([])
    (0, 1)
    >>> sum_product([1, 2, 3, 4])
    (10, 24)
    """
    sum = 0
    product = 1
    for number in numbirs:  # note: `numbers`/`number` have glitched into `numbirs`/`numbir`
        sum += numbir
        product *= numbir
    return (sum, product)
```
Other attempts, like freezing embedding parameters or applying weighted average over different prediction heads, offered little improvement.
Let’s dive into how EvaByte performs in practice. We compare EvaByte’s intermediate checkpoints against recent language models (OLMo-1.7-7B and OLMo-2-7B) trained on roughly the same amount of data, and observe that the EvaByte checkpoint at 1.22T bytes (roughly 0.4T tokens) consistently outperforms them by a large margin.
We also tracked EvaByte’s task performance throughout pretraining and observed a consistent upward trend with no signs of plateauing. Interestingly, EvaByte excels at coding tasks (e.g., HumanEval and MBPP), even though we intentionally reduced the proportion of code data in the later stages of training. One possible reason is that removing tokenization might eliminate domain-specific biases, enabling more efficient parallel learning across domains. A deeper investigation into this behavior is planned for future work.
We take EvaByte a step further with supervised fine-tuning. Following the recipe of DCLM, we fine-tune EvaByte on openly available instruction-tuning data.
As mentioned at the beginning, we demonstrate below that byte-level modeling naturally avoids tokenization quirks and edge-case behaviors, such as the prompt boundary problem, where tokenizer-based LMs behave inconsistently around prompt boundaries. EvaByte resolves these cases seamlessly and delivers more predictable results.
For example, take the HumanEval `longest` prompt truncated at slightly different byte boundaries (ending at `longest = strings`, `strings[`, `strings[0`, `strings[0]`, `strings[0]\n`, and so on). EvaByte produces the same correct continuation from every boundary:

```python
def longest(strings: List[str]) -> Optional[str]:
    """ Out of list of strings, return the longest one. ...
    >>> longest(['a', 'bb', 'ccc'])
    'ccc'
    """
    if not strings:
        return None
    longest = strings[0]
    ...
```
A tokenizer-based LM given the same truncated prompts, by contrast, drifts into divergent and often malformed continuations: duplicating the docstring terminator (`"""""`), hallucinating extra doctests, commenting out the function body, or emitting invalid syntax such as `if len(strings) == 0 None:`.
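The boundary mismatch can be made concrete with a toy greedy tokenizer (the vocabulary below is hypothetical, not any real tokenizer's merges): a truncated prompt tokenizes into a sequence that is not a prefix of the full text's tokens, whereas at the byte level the prefix relationship always holds.

```python
def tokenize(text, vocab):
    # greedy longest-match tokenizer, a stand-in for BPE-style tokenization
    max_len = max(map(len, vocab))
    toks, i = [], 0
    while i < len(text):
        for L in range(min(len(text) - i, max_len), 0, -1):
            if text[i:i + L] in vocab:
                toks.append(text[i:i + L])
                i += L
                break
    return toks

# hypothetical vocabulary with single-character fallbacks
vocab = {"the", " cat", "th", "t", "h", "e", " ", "c", "a"}

full_tokens = tokenize("the cat", vocab)    # ['the', ' cat']
prompt_tokens = tokenize("th", vocab)       # ['th']

# the truncated prompt's tokens are not a prefix of the full text's tokens,
# so the model is conditioned on a sequence it rarely saw during training
assert full_tokens[:1] != prompt_tokens

# at the byte level the truncated prompt is always a prefix of the full text
assert "the cat".encode()[:2] == "th".encode()
```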
EvaByte is also flexible enough to extend to multimodal tasks, treating image data as just another byte stream according to some protocol, such as JPEG or PNG. We follow prior work and train EvaByte on interleaved image-text data, with images represented directly by their raw bytes.
Our multimodal dataset spans around 2T bytes, primarily filtered from OBELICS.
Image | Text |
---|---|
![]() | Detailed caption: The image shows a close-up view of a laptop keyboard. The keyboard has a black color scheme with white lettering on the keys. The keys are arranged in a standard QWERTY layout, which is commonly used for English language keyboards. The keys are labeled with the standard Western letters and symbols, including the alphabet, numbers, and punctuation marks. The laptop appears to be a modern model, possibly a MacBook, given the design and the visible trackpad to the right of the keyboard. The trackpad is a flat, rectangular surface with a grid pattern, typical of MacBook trackpads. The laptop is open, and the keyboard is the main focus of the image. The background is blurred, but it seems to be an indoor setting with a wooden surface, possibly a table or desk. The lighting in the image is soft, suggesting an indoor environment with ambient lighting. |
![]() | Detailed caption: The image shows a black dog swimming in a body of water. The dog appears to be a medium to large breed, with a glossy black coat. It is captured mid-swim, with its head above the water and its body partially submerged. The water around the dog is rippling, indicating movement, and the surface of the water reflects the light, suggesting it is a sunny day. There are no visible texts or distinguishing marks that provide additional context about the location or the dog's identity. The style of the image is a candid, real-life photograph, capturing a moment of the dog's activity. |
![]() | Q: How many throw pillows are on the bed? A: three |
![]() | Q: Which iconic landmark is on the picture? A: The Eiffel Tower |
![]() | Q: What 2 colors are the flowers? A: red and yellow |
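Treating an image as just another byte stream is straightforward to sketch. The special marker ids and the payload below are illustrative assumptions, not EvaByte's actual protocol:

```python
JPEG_SOI = b"\xff\xd8"  # JPEG start-of-image marker
JPEG_EOI = b"\xff\xd9"  # JPEG end-of-image marker
# stand-in payload; a real JPEG would carry actual entropy-coded data
fake_image = JPEG_SOI + bytes(range(16)) + JPEG_EOI

def to_byte_ids(blob, image_start=256, image_end=257):
    # map raw bytes to vocabulary ids 0..255; the bracketing special ids
    # (256/257) are hypothetical markers for an image span
    return [image_start] + list(blob) + [image_end]

ids = to_byte_ids(fake_image)
```

The resulting id sequence can be interleaved with text bytes in a single context, which is what makes extending a byte-level model to images a data question rather than an architecture change.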
A recent concurrent work, Byte Latent Transformers (BLTs), likewise explores tokenizer-free language modeling at scale.
The main difference between BLTs and EvaByte lies in the architecture: BLTs use patchification and propose entropy patching to dynamically group bytes. While this approach adjusts compute allocation based on data complexity and reduces context length, it still relies on external models to determine patch boundaries. The majority of compute ends up focused on patch-level modeling, detached from the byte stream, similar to tokenizer-based models.
In contrast, EvaByte keeps things simple: it directly operates on bytes with a flat Transformer-like model, without invoking external modules or grouping inputs. Empirically, EvaByte achieves better performance than BLTs even with 3-4x fewer training bytes, as shown in the table below. Moreover, EvaByte is more flexible, scaling easily to multimodal data, while BLTs require retraining or swapping out the auxiliary language model used for entropy patching.
We introduce EvaByte, a new family of efficient, scalable, and flexible byte-level language models. The ability to rival tokenization-based LMs with 5x less data while being faster highlights the significant potential of lower-level language modeling within the EvaByte architecture. Future research directions include further refining the model’s architecture to improve both its capacity and efficiency, analyzing in depth how lower-level language models scale with increasing sizes and data volume, as well as extending the context length to seamlessly process diverse data types – images, videos, and audio – simultaneously.
@misc{evabyte,
title = {EvaByte: Efficient Byte-level Language Models at Scale},
url = {https://hkunlp.github.io/blog/2025/evabyte},
author = {Lin Zheng and Xueliang Zhao and Guangtao Wang and Chen Wu and David Dong and Angela Wang and Mingran Wang and Yun Du and Haige Bo and Amol Sharma and Bo Li and Kejie Zhang and Changran Hu and Urmish Thakker and Lingpeng Kong},
year = {2025}
}