Commit graph

52 commits

Author SHA1 Message Date
Pedro Rodriguez
1e78a49bf0
update ()
Summary:

Test Plan:
2025-04-02 09:40:08 -07:00
Pedro Rodriguez
b79eb3ef11
Get generation working for BLT ()
Summary:

Create a script for simple generation from BLT

Test Plan:

```
python -m bytelatent.generate_blt config=../internal-blt/configs/eval_blt.yaml
```
2025-04-01 16:07:55 -07:00
Hanna
2dcf48bdd9
Fix in-place addition of patch_embds () 2025-03-20 16:46:32 -07:00
Srinivasan Iyer
fc946a1918
Some fixes for entropy model predictions ()
Co-authored-by: Srini Iyer <sviyer@meta.com>
2025-03-13 10:28:42 -07:00
Pedro Rodriguez
083656ce55
Update ppl evals to work with blt model, in addition to entropy model ()
Summary:

Test Plan:

Run
```
python -m bytelatent.eval config=../internal-blt/configs/eval_blt.yaml validation.max_n_docs=null
python -m bytelatent.eval config=../internal-blt/configs/eval_entropy.yaml validation.max_n_docs=null
```
2025-03-13 10:23:31 -07:00
Pedro Rodriguez
f84ee635bd
Update iterate_data ()
Summary:

Test Plan:
2025-03-13 10:14:41 -07:00
Srinivasan Iyer
c110f6be2a
Add way to call consolidate ()
* Add way to call consolidate

* black

* isort

---------

Co-authored-by: Srini Iyer <sviyer@meta.com>
2025-03-11 16:53:33 -07:00
Srinivasan Iyer
a5ceaaa226
When merging configs, do not merge data sources ()
* When merging configs, do not merge data sources

* Add todo

---------

Co-authored-by: Srini Iyer <sviyer@meta.com>
2025-03-11 11:03:24 -07:00
Pedro Rodriguez
7517ac2a9f
Get evals working again. ()
- PPL/validation: works now and uses multi-GPU. For some reason single-GPU results differ from multi-GPU; this can be debugged in a follow-up PR
- Generation evals likely work but are very slow, so they are disabled for now


Test Plan:
```
torchrun --nproc-per-node 8 -m bytelatent.eval config=../internal-blt/configs/eval.yaml
```
2025-03-11 09:57:19 -07:00
Pedro Rodriguez
63913e4dba
Reduce per file resources arrow uses ()
Summary:

Test Plan:
2025-03-05 15:03:42 -08:00
Pedro Rodriguez
8f2cf8899d
Let process start before yielding preloaded prefetch buffer, avoid needlessly losing buffer in edge cases ()
Summary:

Test Plan:
2025-03-05 15:02:57 -08:00
Pedro Rodriguez
ea1fc75862
Add approximate state persistence ()
Summary:

Test Plan:

***
More verbose multiprocess logging, fix get_state_and_recycle

Summary:

Test Plan:
2025-03-05 15:01:45 -08:00
Pedro Rodriguez
9bd51df961
Fix rsync to not preserve original permissions, instead use destination ()
Summary:

Test Plan:
2025-03-05 11:49:41 -08:00
Pedro Rodriguez
c727844e9d
Correctly reset batch iterator at each arrow create_iter call. ()
Summary:

Test Plan:
2025-03-03 16:59:02 -08:00
Pedro Rodriguez
08b8c7cd05
Pass mask in packing_iterator, correctly handle last batch, fix masking ()
This commit does/fixes the following:

1. Adds unit tests for byte and patch packing to ensure it works correctly
2. Fixes a bug where, for batches that end up with fewer than max_length bytes (e.g., short patches), the mask included elements whose value was pad_id. The mask is now set to != pad_id when it is not specified explicitly.
3. Correctly handles the last batch, where previously it would crash. This didn't affect training since we had enough data and/or looped iterators, but for evaluation perplexity it comes up when we validate on an entire file.
4. Correctly forwards the mask, if it exists, for byte packing

Test Plan:

```
pytest bytelatent/
```

Testing these changes more thoroughly in a stacked PR that fixes evals
2025-02-27 11:41:47 -08:00
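For illustration, a minimal sketch of the default-mask behavior described in point 2 above (names hypothetical; not the actual bytelatent packing code):

```
import torch

def build_mask(tokens: torch.Tensor, pad_id: int, mask: torch.Tensor | None = None) -> torch.Tensor:
    # When no mask is supplied, derive one from the pad token so that padded
    # positions in short batches are excluded instead of being scored.
    if mask is None:
        return tokens != pad_id
    return mask
```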
Srinivasan Iyer
0da051f4f9
Initialize rope embeddings properly for the entropy model ()
Co-authored-by: Srini Iyer <sviyer@meta.com>
2025-02-25 15:35:25 -08:00
Pedro Rodriguez
aeb95f12a1
Remove byte tokenizer and add config args to switch between byte/patch packing ()
Summary:

Test Plan:

```
python -m bytelatent.train config=../internal-blt/configs/entropy_model.yaml logging.wandb=null checkpoint.dump.every=1000 checkpoint.eval.every=100000 eval=null

pytest bytelatent/
```
2025-02-25 11:10:59 -08:00
Pedro Rodriguez
ff36aa8642
Add vocab and seq len abstract fields () 2025-02-24 14:41:58 -08:00
Bocheng Li
a6ed14f689
Fix: Correct model_args usage in parallelize_model call () 2025-02-24 14:40:38 -08:00
Pedro Rodriguez
fc3399ef40
Update iterator inheritance, pass file format args, limit iterator ()
- Create a common class to use in all inheritance for states
- Add a limit iterator that we can use in evals
- Modify ArrowFileIterator behavior to not do arrow path inference if file_format='json'
- Make EvalArgs valid
- Move testing iterators to a common directory to allow usage in multiple test files
- Make it so that SequenceIterator can take a None rng_state, to disable all rng ops (for eval mainly)

Test Plan:

- `pytest bytelatent`
- `python -m bytelatent.train config=../internal-blt/configs/entropy_model.yaml logging.wandb=null eval=null`
2025-02-21 16:21:07 -08:00
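As an illustration of the limit iterator mentioned above, a hedged sketch (hypothetical names, not necessarily the bytelatent implementation):

```
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")

def limit_iterator(source: Iterable[T], max_items: int | None) -> Iterator[T]:
    # Yield at most max_items examples; None disables the limit, mirroring
    # how a null document limit is passed in the eval configs.
    for i, item in enumerate(source):
        if max_items is not None and i >= max_items:
            return
        yield item
```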
Pedro Rodriguez
b0956bde99
Make apex logs less noisy ()
Summary:

Test Plan:
2025-02-18 10:45:56 -08:00
Pedro Rodriguez
82ab5930ec
Make it possible to specify multiple config files ()
Summary:

Make it possible to specify multiple config files.
Parsing the CLI is no longer a special case; it uses the same config inheritance method.

Test Plan:

Test via unit tests that this interpolates in the right order.

Sample usage loads the internal config, which references bytelatent/configs/entropy_model.yaml. The precedence order is:

- Default pydantic args
- Included configs, e.g. `config`
- CLI args

```
python -m bytelatent.print_config config=internal/configs/entropy_model.yaml eval=null

```


2025-02-18 10:42:44 -08:00
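To make the precedence order above concrete, a minimal sketch of override-style merging (illustrative only; not the actual bytelatent config code):

```
from typing import Any

def deep_merge(base: dict[str, Any], override: dict[str, Any]) -> dict[str, Any]:
    # Later sources win: values from `override` replace values in `base`,
    # recursing into nested dicts so unrelated keys are preserved.
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged

# Lowest to highest precedence: pydantic defaults, included config files
# (the `config=` argument, possibly several), then CLI overrides.
defaults = {"eval": {"every": 1000}, "logging": {"wandb": "blt"}}
included = {"logging": {"wandb": None}}
cli = {"eval": None}
print(deep_merge(deep_merge(defaults, included), cli))
```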
CharlesCNorton
9f29e0de18
fix(README): correct typo in quickstart instructions ()
Changed "your can activate the environment" to "you can activate the environment" for clarity.
2025-02-18 09:47:58 -08:00
Srinivasan Iyer
f3e8125f74
using apex rmsnorm ()
* using apex rmsnorm

* added message for missing apex

* black

* missed a print

---------

Co-authored-by: Srini Iyer <sviyer@meta.com>
2025-02-14 11:22:03 -08:00
Srinivasan Iyer
c49e25171e
Update README.md () 2025-02-14 11:16:49 -08:00
Pedro Rodriguez
8c61ab5e67
Fix multiprocessing dataloader checkpointing and use it in the train script () 2025-02-13 11:58:23 -08:00
Pedro Rodriguez
85c2f28f26
Test first batch matches ()
Summary:

Test Plan:
2025-02-13 10:05:08 -08:00
Srinivasan Iyer
9d907fed1c
disable reshard after forward ()
Co-authored-by: Srini Iyer <sviyer@meta.com>
2025-02-12 18:33:53 -08:00
Srinivasan Iyer
48e4ad0bd2
make sure max_encoder_seq_length matches ()
* make sure max_encoder_seq_length matches

* black and assert comment

---------

Co-authored-by: Srini Iyer <sviyer@meta.com>
2025-02-12 18:27:22 -08:00
Srinivasan Iyer
22c7fe1d1c
fix save and reload model state ()
Co-authored-by: Srini Iyer <sviyer@meta.com>
2025-02-07 14:27:47 -08:00
Pedro Rodriguez
fe45f69fbf
Add bpb and n_bytes to metric logging ()
Summary:

Test Plan:
2025-02-07 13:14:30 -08:00
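For reference, bits-per-byte is conventionally computed from the summed negative log-likelihood (in nats) and the byte count; a hedged sketch of the formula, not necessarily the exact metric code:

```
import math

def bits_per_byte(total_nll_nats: float, n_bytes: int) -> float:
    # Convert the summed negative log-likelihood from nats to bits,
    # then normalize by the number of bytes in the evaluated text.
    return total_nll_nats / (math.log(2) * n_bytes)
```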
Srinivasan Iyer
aebdc481a8
Fix init and repro ()
* Fix init and repro

* comment + black

---------

Co-authored-by: Srini Iyer <sviyer@meta.com>
2025-02-06 14:18:02 -08:00
Pedro Rodriguez
936d9437be
Allow ArrowIterator to read from json ()
Summary:

Currently, the arrow iterator can only read arrow files. However, the pyarrow library can read other formats, including jsonlines. This change allows the same ArrowIterator to read from jsonlines, so we can read from the original source data and simply omit the entropy column when doing so.

Test Plan:

Run train script until dataloader starts
2025-02-06 09:57:22 -08:00
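For context, pyarrow's json module reads newline-delimited JSON into an Arrow table, which is the capability this relies on; a minimal sketch (file path hypothetical):

```
import pyarrow.json as paj

# read_json parses jsonlines into an Arrow table, so downstream iteration code
# can consume it the same way as an arrow file; the entropy column is simply
# absent when reading the original source data.
table = paj.read_json("sample.jsonl")
for batch in table.to_batches():
    print(batch.num_rows)
```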
Pedro Rodriguez
afedb16598
Update checkpointing to use fsspec ()
Summary:

- Make the data/checkpoint code fsspec compatible
- S3 saves still will not work, because `torch.distributed.checkpoint.save` does not work out of the box with `fsspec`. This will be implemented in a follow-up PR


Test Plan:

Run unit tests and the commands below

```
python -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100
```

```
torchrun --nproc-per-node 8 -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100
```

These currently won't work due to the torch distributed save, but they should be tested at a later date

```
python -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100 dump_dir=s3://blt/scratch/checkpoint-test/
```

```
torchrun --nproc-per-node 8 -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100 dump_dir=s3://blt/scratch/checkpoint-test/
```
2025-02-06 09:41:58 -08:00
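A hedged sketch of the fsspec-compatible save/load pattern described above, for plain state dicts only (the `torch.distributed.checkpoint` path is explicitly out of scope per the summary):

```
import fsspec
import torch

def save_state(state: dict, path: str) -> None:
    # fsspec.open resolves the protocol from the path (local, s3://, ...),
    # so the same call works for local disks and object stores.
    with fsspec.open(path, "wb") as f:
        torch.save(state, f)

def load_state(path: str) -> dict:
    with fsspec.open(path, "rb") as f:
        return torch.load(f)
```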
Srinivasan Iyer
739dc71a0a
Add rope fp32 ()
* Log model

* Add flag for rope outer in fp32

---------

Co-authored-by: Srini Iyer <sviyer@meta.com>
2025-02-05 17:19:37 -08:00
Srinivasan Iyer
6fbaf7266f
fix stool ()
Co-authored-by: Srini Iyer <sviyer@meta.com>
2025-02-05 17:18:40 -08:00
Srinivasan Iyer
7cf8fab49b
Fix wandb logging ()
Co-authored-by: Srini Iyer <sviyer@meta.com>
2025-02-05 16:24:39 -08:00
Pedro Rodriguez
c79b1fdbd0
Fix distributed all reduce grad norm ()
Summary:

With more than 1 GPU but only 1 node, all-reduces fail when inputs are not bf16. This uses a modified copy of torch's grad norm to avoid these failures.

Test Plan:

- Run unit tests:
- Run single gpu training: `python -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100`
- Run 1 node, multi-gpu training `torchrun --nproc-per-node 8 -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100`
2025-02-04 16:53:50 -08:00
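Not the actual modified copy of torch's grad-norm code, but a hedged sketch of the bf16 all-reduce workaround described in the summary:

```
import torch
import torch.distributed as dist

def global_grad_norm(parameters) -> torch.Tensor:
    # Sum of squared gradient norms on this rank.
    grads = [p.grad for p in parameters if p.grad is not None]
    local_sq = torch.stack([g.float().pow(2).sum() for g in grads]).sum()
    # Cast the reduction input to bf16 before the all-reduce, since in this
    # setup collectives on non-bf16 tensors were failing.
    total_sq = local_sq.to(torch.bfloat16)
    dist.all_reduce(total_sq, op=dist.ReduceOp.SUM)
    return total_sq.float().sqrt()
```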
Pedro Rodriguez
7044771a12
This includes fixes that make checkpointing and reloading work correctly. ()
It also includes a first batch of changes toward fixing the eval code

Summary:

Test Plan:
2025-01-27 16:56:42 -08:00
Pedro Rodriguez
7622d28b74
Initial codes and scripts for training entropy model ()
Summary:

Test Plan:
2025-01-27 09:46:44 -08:00
Pedro Rodriguez
a809259e71
Use load_async flag to not start MP iterator ()
Summary:

Test Plan:
2025-01-24 10:57:20 -08:00
Pedro Rodriguez
bc42cebd7d
Update file check script to check sizes ()
Summary:

Test Plan:
2025-01-22 13:06:46 -08:00
Ink
392117bff2
Fix realtime entropy patching ()
* allow loading of the entropy model directly

* remove unused argument

* remove spammy warning

* allow patch_batch_size to be adjusted in the forward() method

* revert to original patcher style, fix warning

* allow grads when calculating entropies

* fix grad flow

* return preds from calculate_entropies()

* remove legacy arg

* fix an error with monotonicity and small sequence lengths

* ensure patcher is serializable

* revert patcher to original

* remove unused import
2025-01-21 16:34:23 -08:00
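For context on what the realtime patcher computes, a hedged sketch of entropy-based patch boundaries (threshold and names hypothetical; not the exact bytelatent patcher):

```
import torch
import torch.nn.functional as F

def entropy_patch_starts(logits: torch.Tensor, threshold: float) -> torch.Tensor:
    # logits: [seq_len, vocab_size] next-byte predictions from the entropy model.
    log_probs = F.log_softmax(logits, dim=-1)
    entropies = -(log_probs.exp() * log_probs).sum(dim=-1)  # [seq_len]
    # Start a new patch wherever next-byte entropy exceeds the threshold;
    # position 0 always starts a patch.
    starts = entropies > threshold
    starts[0] = True
    return starts
```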
Pedro Rodriguez
6ffeb66b53
Changes for training entropy model and correcting attention in local models ()
Summary:

- Refactor local model configs to be separate and clearer
- Add attention arguments and correct which attention is used in local models
- Preparation for being able to have an entropy train script
- Fix failing unit tests

Test Plan:
2025-01-17 14:23:01 -08:00
Ink
caec8d2621
allow flex-attention to be disabled ()
* allow flex-attention to silently fail

* allow flex-attn to be disabled via an env var
2025-01-14 09:32:07 -08:00
Pedro Rodriguez
1da3dd9315
Update preprocess_entropies script to blt inference + add fsspec support ()
Summary:

Test Plan:
2025-01-13 15:28:14 -08:00
Pedro Rodriguez
b0120da72f
Replace regular filesystem calls with fsspec + add s3 support ()
Summary:

For compatibility with either local/nfs or S3 datasets, swap to fsspec.

Add a tool to compare local and remote filesystems

Test Plan:

- Ran regular train script
- Ran with config with data in S3
2025-01-10 11:04:41 -08:00
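A minimal sketch of what the local-vs-remote comparison tool could do with fsspec (paths and bucket names hypothetical):

```
import fsspec

def list_sizes(root: str) -> dict[str, int]:
    # url_to_fs picks the right filesystem (local, s3, ...) from the URL.
    fs, path = fsspec.core.url_to_fs(root)
    return {p[len(path):]: fs.info(p)["size"] for p in fs.find(path)}

local = list_sizes("/data/blt/shard_00")
remote = list_sizes("s3://blt/shard_00")
print("missing remotely:", sorted(set(local) - set(remote)))
print("size mismatches:", sorted(k for k in local.keys() & remote.keys() if local[k] != remote[k]))
```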
Pedro Rodriguez
d4ddb95322
Add plotting code from paper ()
Summary:

Test Plan:
2025-01-09 12:11:50 -08:00
Ink
2fdc6f3cc9
Package bytelatent as a module ()
* make installable via pip

* fix missing xformers deps

* remove non-core dependencies

* fix linting

* fix isort
2025-01-06 16:44:50 -08:00
Ikko Eltociear Ashimine
9065bb1cce
docs: update README.md ()
folowing -> following
2025-01-03 12:08:00 -08:00