blt/README.md

# Byte Latent Transformer

This repository contains code for our paper: "Byte Latent Transformer: Patches Scale Better Than Tokens"

- [Paper Link](https://dl.fbaipublicfiles.com/blt/BLT__Patches_Scale_Better_Than_Tokens.pdf)

## Abstract

We introduce the Byte Latent Transformer architecture (BLTs), a new byte-level LLM architecture that
for the first time, matches tokenization-based LLM performance at scale, with significant improvements
in inference efficiency and robustness. BLT encodes bytes into dynamically sized patches, which serve
as the primary units of computation. Patches are segmented dynamically based on the entropy of the
next byte, allocating more compute and model capacity where there is more data complexity. The BLT
architecture includes new attention mechanisms to maximize the information flow between byte and
patch hidden representations and a new type of byte-sequence memory. We present the first scaling
study of byte-level models up to 8B parameters and 8T training bytes, showing for the first time
that we can train a model end-to-end at scale from bytes with no tokenization or other preprocessing.
Scaling trends reveal training and inference efficiency benefits from dynamically selecting very long
patches on average, along with qualitative improvements with reasoning and long tail generalization
from modeling byte-sequences.

![BLT Architecture Diagram](blt-figure.jpg)

## Development Status

We are actively updating the blt code to make it easier to reproduce our results.
Please file an issue and/or be patient while we make more of our code public!

## Quick start

The following commands launch a SLURM job that creates an environment for Meta Lingua.
The env creation should take around 5 minutes without counting downloads.

```bash
git clone https://github.com/facebookresearch/blt
cd blt

bash setup/create_env.sh
# or if you have access to a SLURM cluster
sbatch setup/create_env.sh
```

Once that is done your can activate the environment

```bash
conda activate blt_<date>
```

use the provided script to download and prepare data from huggingface (among `fineweb_edu`, `fineweb_edu_10bt`, or `dclm_baseline_1.0`).
This command will download the `fineweb_edu` and prepare it for training in the `./data` directory, specifying the amount of memory `terashuf` (the tool used to shuffle samples) will be allocated. By default, the number of chunks (`nchunks`) is 32. If you are running on fewer than 32 GPUs, it is recommended to set `nchunks` to 1 or to match `nchunks` with the number of GPUs (`nchunks` = NGPUs). See [here](https://github.com/facebookresearch/lingua/issues/55#issuecomment-2483643076) for more details.

```bash
python setup/download_prepare_hf_data.py fineweb_edu <MEMORY> --data_dir ./data --seed 42 --nchunks <NCHUNKS>
```

to download tokenizer (here llama3), use the following script:

```bash
python setup/download_tokenizer.py llama3 <SAVE_PATH> --api_key <HUGGINGFACE_TOKEN>
```

Now launch a debug job to check if everything works. **The provided configurations are templates, you need to adapt them for them to work (change `dump_dir`, `data.root_dir`, `data.tokenizer.path`, etc ...)**

```bash
# stool stands for SLURM tool !
python -m bytelatent.stool script=bytelatent.train config=bytelatent/configs/debug.yaml nodes=1 partition=<partition>
# if you want to launch locally you can use torchrun
torchrun --nproc-per-node 8 -m bytelatent.train config=bytelatent/configs/debug.yaml
# or you can also launch on 1 GPU
python -m bytelatent.train  config=bytelatent/configs/debug.yaml
```

When using `stool`, if a job crashes, it can be relaunched using sbatch:

```bash
sbatch path/to/dump_dir/submit.slurm
```

## Linting

To lint, run the following command

```
bash dev/lint.sh
```

## Citation

The BLT is partially based on Meta Lingua, so consider citing it in addition to our BLT paper if you re-use our work.

BLT Paper Citation

```
@article{pagnoni2024byte,
  title={Byte latent transformer: Patches scale better than tokens},
  author={Pagnoni, Artidoro and Pasunuru, Ram and Rodriguez, Pedro and Nguyen, John and Muller, Benjamin and Li, Margaret and Zhou, Chunting and Yu, Lili and Weston, Jason and Zettlemoyer, Luke and others},
  journal={arXiv preprint arXiv:2412.09871},
  year={2024}
}
```

Lingua Code

```
@misc{meta_lingua,
  author = {Mathurin Videau, Badr Youbi Idrissi, Daniel Haziza, Luca Wehrstedt, Jade Copet, Olivier Teytaud, David Lopez-Paz},
  title = {{Meta Lingua}: A minimal {PyTorch LLM} training library},
  url = {https://github.com/facebookresearch/lingua},
  year = {2024}
}
```

## License

The BLT code is partially based on Meta Lingua.

Meta Lingua is licensed under BSD-3-Clause license. Refer to the LICENSE file in the top level directory.
Initial commit 2024-12-12 23:32:30 +00:00			`# Byte Latent Transformer`

			`This repository contains code for our paper: "Byte Latent Transformer: Patches Scale Better Than Tokens"`

			`- [Paper Link](https://dl.fbaipublicfiles.com/blt/BLT__Patches_Scale_Better_Than_Tokens.pdf)`

			`## Abstract`

			`We introduce the Byte Latent Transformer architecture (BLTs), a new byte-level LLM architecture that`
			`for the first time, matches tokenization-based LLM performance at scale, with significant improvements`
			`in inference efficiency and robustness. BLT encodes bytes into dynamically sized patches, which serve`
			`as the primary units of computation. Patches are segmented dynamically based on the entropy of the`
			`next byte, allocating more compute and model capacity where there is more data complexity. The BLT`
			`architecture includes new attention mechanisms to maximize the information flow between byte and`
			`patch hidden representations and a new type of byte-sequence memory. We present the first scaling`
			`study of byte-level models up to 8B parameters and 8T training bytes, showing for the first time`
			`that we can train a model end-to-end at scale from bytes with no tokenization or other preprocessing.`
			`Scaling trends reveal training and inference efficiency benefits from dynamically selecting very long`
			`patches on average, along with qualitative improvements with reasoning and long tail generalization`
			`from modeling byte-sequences.`

			`![BLT Architecture Diagram](blt-figure.jpg)`

			`## Development Status`

			`We are actively updating the blt code to make it easier to reproduce our results.`
			`Please file an issue and/or be patient while we make more of our code public!`

			`## Quick start`

			`The following commands launch a SLURM job that creates an environment for Meta Lingua.`
			`The env creation should take around 5 minutes without counting downloads.`

			```bash
			`git clone https://github.com/facebookresearch/blt`
			`cd blt`

			`bash setup/create_env.sh`
			`# or if you have access to a SLURM cluster`
			`sbatch setup/create_env.sh`
			```

			`Once that is done your can activate the environment`

			```bash
			`conda activate blt_<date>`
			```

			use the provided script to download and prepare data from huggingface (among `fineweb_edu`, `fineweb_edu_10bt`, or `dclm_baseline_1.0`).
			This command will download the `fineweb_edu` and prepare it for training in the `./data` directory, specifying the amount of memory `terashuf` (the tool used to shuffle samples) will be allocated. By default, the number of chunks (`nchunks`) is 32. If you are running on fewer than 32 GPUs, it is recommended to set `nchunks` to 1 or to match `nchunks` with the number of GPUs (`nchunks` = NGPUs). See [here](https://github.com/facebookresearch/lingua/issues/55#issuecomment-2483643076) for more details.

			```bash
			`python setup/download_prepare_hf_data.py fineweb_edu <MEMORY> --data_dir ./data --seed 42 --nchunks <NCHUNKS>`
			```

docs: update README.md (#1) folowing -> following 2025-01-03 20:08:00 +00:00			`to download tokenizer (here llama3), use the following script:`
Initial commit 2024-12-12 23:32:30 +00:00
			```bash
			`python setup/download_tokenizer.py llama3 <SAVE_PATH> --api_key <HUGGINGFACE_TOKEN>`
			```

			Now launch a debug job to check if everything works. The provided configurations are templates, you need to adapt them for them to work (change `dump_dir`, `data.root_dir`, `data.tokenizer.path`, etc ...)

			```bash
			`# stool stands for SLURM tool !`
Update README.md (#58) 2025-02-14 19:16:49 +00:00			`python -m bytelatent.stool script=bytelatent.train config=bytelatent/configs/debug.yaml nodes=1 partition=<partition>`
Initial commit 2024-12-12 23:32:30 +00:00			`# if you want to launch locally you can use torchrun`
Update README.md (#58) 2025-02-14 19:16:49 +00:00			`torchrun --nproc-per-node 8 -m bytelatent.train config=bytelatent/configs/debug.yaml`
Initial commit 2024-12-12 23:32:30 +00:00			`# or you can also launch on 1 GPU`
Update README.md (#58) 2025-02-14 19:16:49 +00:00			`python -m bytelatent.train config=bytelatent/configs/debug.yaml`
Initial commit 2024-12-12 23:32:30 +00:00			```

			When using `stool`, if a job crashes, it can be relaunched using sbatch:

			```bash
			`sbatch path/to/dump_dir/submit.slurm`
			```

			`## Linting`

			`To lint, run the following command`

			```
			`bash dev/lint.sh`
			```

			`## Citation`

			`The BLT is partially based on Meta Lingua, so consider citing it in addition to our BLT paper if you re-use our work.`

Update README.md with arxiv citation 2025-02-15 19:50:41 +00:00			`BLT Paper Citation`
Initial commit 2024-12-12 23:32:30 +00:00
			```
Update README.md with arxiv citation 2025-02-15 19:50:41 +00:00			`@article{pagnoni2024byte,`
			`title={Byte latent transformer: Patches scale better than tokens},`
			`author={Pagnoni, Artidoro and Pasunuru, Ram and Rodriguez, Pedro and Nguyen, John and Muller, Benjamin and Li, Margaret and Zhou, Chunting and Yu, Lili and Weston, Jason and Zettlemoyer, Luke and others},`
			`journal={arXiv preprint arXiv:2412.09871},`
			`year={2024}`
Initial commit 2024-12-12 23:32:30 +00:00			`}`
			```

			`Lingua Code`

			```
			`@misc{meta_lingua,`
			`author = {Mathurin Videau, Badr Youbi Idrissi, Daniel Haziza, Luca Wehrstedt, Jade Copet, Olivier Teytaud, David Lopez-Paz},`
			`title = {{Meta Lingua}: A minimal {PyTorch LLM} training library},`
			`url = {https://github.com/facebookresearch/lingua},`
			`year = {2024}`
			`}`
			```

			`## License`

Update README.md (#13) Fixed typo on Meta Lingua 2025-01-03 20:06:47 +00:00			`The BLT code is partially based on Meta Lingua.`
Initial commit 2024-12-12 23:32:30 +00:00
			`Meta Lingua is licensed under BSD-3-Clause license. Refer to the LICENSE file in the top level directory.`