Commit graph

32 commits

Author SHA1 Message Date
Pedro Rodriguez 655eca670d Minimal working eval
Summary:

Test Plan:
2025-02-14 23:45:29 +00:00
Pedro Rodriguez a3e0647d03 Make apex logs less noisy
Summary:

Test Plan:
2025-02-14 23:45:28 +00:00
Pedro Rodriguez f94babc94e Make it possible to specify multiple config files
Summary:

Make it possible to specify multiple config files.
Parsing the CLI is no longer a special case; it just uses the same config inheritance method.

Test Plan:

Test that this interpolates in the right order via unit tests.

Sample usage below loads the internal config, which references bytelatent/configs/entropy_model.yaml. The precedence order (lowest to highest) is:

- Default pydantic args
- Included configs, e.g. `config`
- CLI args

```
python -m bytelatent.print_config config=internal/configs/entropy_model.yaml eval=null
```
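
As a rough sketch of how this precedence could be implemented with OmegaConf-style merging (the `load_and_merge` helper and its handling of the `config` key are illustrative assumptions, not the repo's actual code):

```python
# Minimal sketch: defaults < included config files < CLI overrides.
# `load_and_merge` is a hypothetical helper, not the repo's API.
from omegaconf import OmegaConf


def load_and_merge(cli_overrides: list[str]):
    cli_cfg = OmegaConf.from_cli(cli_overrides)
    config_paths = cli_cfg.pop("config", None)
    if isinstance(config_paths, str):
        config_paths = [config_paths]
    merged = OmegaConf.create({})  # stand-in for the pydantic defaults
    for path in config_paths or []:
        # middle precedence: each included config, merged in order
        merged = OmegaConf.merge(merged, OmegaConf.load(path))
    # highest precedence: remaining CLI overrides win over everything else
    return OmegaConf.merge(merged, cli_cfg)


# e.g. load_and_merge(["config=internal/configs/entropy_model.yaml", "eval=null"])
```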


2025-02-14 22:50:57 +00:00
Srinivasan Iyer f3e8125f74
using apex rmsnorm (#57)
* using apex rmsnorm

* added message for missing apex

* black

* missed a print

---------

Co-authored-by: Srini Iyer <sviyer@meta.com>
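
A minimal sketch of the "message for missing apex" pattern, assuming apex's FusedRMSNorm as the fused implementation; the fallback class and warning text are illustrative, not the repo's exact code:

```python
import logging

import torch
from torch import nn

logger = logging.getLogger(__name__)

try:
    # fused kernel from NVIDIA apex, used when available
    from apex.normalization import FusedRMSNorm as RMSNorm
except ImportError:
    logger.warning("apex is not installed; falling back to a pure PyTorch RMSNorm")

    class RMSNorm(nn.Module):
        def __init__(self, dim: int, eps: float = 1e-5):
            super().__init__()
            self.eps = eps
            self.weight = nn.Parameter(torch.ones(dim))

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            normed = x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)
            return normed * self.weight
```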
2025-02-14 11:22:03 -08:00
Srinivasan Iyer c49e25171e
Update README.md (#58) 2025-02-14 11:16:49 -08:00
Pedro Rodriguez 8c61ab5e67
Fix multiprocessing dataloader checkpointing and use it in the train script (#50)
2025-02-13 11:58:23 -08:00
Pedro Rodriguez 85c2f28f26
Test first batch matches (#53)
Summary:

Test Plan:
2025-02-13 10:05:08 -08:00
Srinivasan Iyer 9d907fed1c
disable reshard after forward (#56)
Co-authored-by: Srini Iyer <sviyer@meta.com>
2025-02-12 18:33:53 -08:00
Srinivasan Iyer 48e4ad0bd2
make sure max_encoder_seq_length matches (#55)
* make sure max_encoder_seq_length matches

* black and assert comment

---------

Co-authored-by: Srini Iyer <sviyer@meta.com>
2025-02-12 18:27:22 -08:00
Srinivasan Iyer 22c7fe1d1c
fix save and reload model state (#49)
Co-authored-by: Srini Iyer <sviyer@meta.com>
2025-02-07 14:27:47 -08:00
Pedro Rodriguez fe45f69fbf
Add bpb and n_bytes to metric logging (#41)
Summary:

Test Plan:
2025-02-07 13:14:30 -08:00
Srinivasan Iyer aebdc481a8
Fix init and repro (#48)
* Fix init and repro

* comment + black

---------

Co-authored-by: Srini Iyer <sviyer@meta.com>
2025-02-06 14:18:02 -08:00
Pedro Rodriguez 936d9437be
Allow ArrowIterator to read from json (#45)
Summary:

Currently, the arrow iterator can only read Arrow files. However, the pyarrow library can read other formats, including jsonlines. This allows the same ArrowIterator to read from jsonlines, so we can read from the original source data and simply omit the entropy column when doing so.

Test Plan:

Run the train script until the dataloader starts.
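
A sketch of the underlying idea, dispatching on file type with pyarrow (the `open_table` helper and `file_format` switch are hypothetical, not the ArrowIterator API):

```python
import pyarrow as pa
import pyarrow.ipc
import pyarrow.json


def open_table(path: str, file_format: str) -> pa.Table:
    if file_format == "arrow":
        # Arrow IPC file, e.g. produced by the entropy preprocessing step
        return pa.ipc.open_file(path).read_all()
    if file_format == "json":
        # jsonlines source data; there is no entropy column, so downstream code omits it
        return pyarrow.json.read_json(path)
    raise ValueError(f"Unknown file format: {file_format}")
```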
2025-02-06 09:57:22 -08:00
Pedro Rodriguez afedb16598
Update checkpointing to use fsspec (#39)
Summary:

- Make the data/checkpoint code fsspec compatible
- This still will not work with S3 saves, because `torch.distributed.checkpoint.save` does not work out of the box with `fsspec`. That will be implemented in a follow-up PR.


Test Plan:

Run unit tests and the commands below

```
python -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100
```

```
torchrun --nproc-per-node 8 -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100
```

These currently won't work due to the torch distributed save, but these should be tested at a later date:

```
python -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100 dump_dir=s3://blt/scratch/checkpoint-test/
```

```
torchrun --nproc-per-node 8 -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100 dump_dir=s3://blt/scratch/checkpoint-test/
```
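
A minimal sketch of the fsspec-compatible pattern for the non-distributed pieces (function names here are illustrative; as noted above, `torch.distributed.checkpoint.save` is the part that still does not go through fsspec):

```python
import fsspec
import torch


def save_state(state: dict, path: str) -> None:
    # fsspec.open handles local paths and s3:// URIs alike (S3 requires s3fs)
    with fsspec.open(path, "wb") as f:
        torch.save(state, f)


def load_state(path: str) -> dict:
    with fsspec.open(path, "rb") as f:
        return torch.load(f)
```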
2025-02-06 09:41:58 -08:00
Srinivasan Iyer 739dc71a0a
Add rope fp32 (#43)
* Log model

* Add flag for rope outer in fp32

---------

Co-authored-by: Srini Iyer <sviyer@meta.com>
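
An illustrative sketch of a "rope outer product in fp32" flag; the function and flag names here are assumptions, not the repo's exact implementation:

```python
import torch


def rope_cos_sin(dim: int, end: int, theta: float = 10000.0,
                 fp32_outer: bool = True, dtype: torch.dtype = torch.bfloat16):
    inv_freq = 1.0 / (theta ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))
    t = torch.arange(end, dtype=torch.float32)
    if not fp32_outer:
        # low-precision path: downcast before taking the outer product
        inv_freq, t = inv_freq.to(dtype), t.to(dtype)
    freqs = torch.outer(t, inv_freq)  # (end, dim // 2) rotation angles
    return torch.cos(freqs), torch.sin(freqs)
```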
2025-02-05 17:19:37 -08:00
Srinivasan Iyer 6fbaf7266f
fix stool (#44)
Co-authored-by: Srini Iyer <sviyer@meta.com>
2025-02-05 17:18:40 -08:00
Srinivasan Iyer 7cf8fab49b
Fix wandb logging (#42)
Co-authored-by: Srini Iyer <sviyer@meta.com>
2025-02-05 16:24:39 -08:00
Pedro Rodriguez c79b1fdbd0
Fix distributed all reduce grad norm (#40)
Summary:

With >1 GPU but only 1 node, all-reduces fail when inputs are not bf16. This uses a modified copy of torch's grad norm code to avoid those failures.

Test Plan:

- Run unit tests
- Run single-GPU training: `python -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100`
- Run 1-node, multi-GPU training: `torchrun --nproc-per-node 8 -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100`
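
A hedged sketch of the general workaround: accumulate the local squared gradient norms in fp32 and all-reduce that scalar, so every rank agrees on the dtype. This assumes sharded gradients (e.g. FSDP) and is not the repo's modified copy of torch's utility:

```python
import torch
import torch.distributed as dist


def global_grad_norm(parameters) -> torch.Tensor:
    grads = [p.grad for p in parameters if p.grad is not None]
    if not grads:
        return torch.tensor(0.0)
    # cast each shard's squared norm to fp32 before reducing
    local_sq = torch.stack([g.detach().float().pow(2).sum() for g in grads]).sum()
    if dist.is_available() and dist.is_initialized():
        dist.all_reduce(local_sq, op=dist.ReduceOp.SUM)
    return local_sq.sqrt()
```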
2025-02-04 16:53:50 -08:00
Pedro Rodriguez 7044771a12
This includes fixes that make checkpointing and reloading work correctly. (#35)
It also batches in a first set of changes for fixing the eval code.

Summary:

Test Plan:
2025-01-27 16:56:42 -08:00
Pedro Rodriguez 7622d28b74
Initial codes and scripts for training entropy model (#34)
Summary:

Test Plan:
2025-01-27 09:46:44 -08:00
Pedro Rodriguez a809259e71
Use load_async flag to not start MP iterator (#33)
Summary:

Test Plan:
2025-01-24 10:57:20 -08:00
Pedro Rodriguez bc42cebd7d
Update file check script to check sizes (#32)
Summary:

Test Plan:
2025-01-22 13:06:46 -08:00
Ink 392117bff2
Fix realtime entropy patching (#26)
* allow loading of the entropy model directly

* remove unused argument

* remove spammy warning

* allow patch_batch_size to be adjusted in the forward() method

* revert to original patcher style, fix warning

* allow grads when calculating entropies

* fix grad flow

* return preds from calculate_entropies()

* remove legacy arg

* fix an error with monotonicity and small sequence lengths

* ensure patcher is serializable

* revert patcher to original

* remove unused import
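
A rough sketch of the shape of `calculate_entropies()` after these changes, keeping gradients optional and returning the raw predictions alongside the entropies; the signature is an assumption, not the actual one:

```python
import torch
import torch.nn.functional as F


def calculate_entropies(tokens: torch.Tensor, entropy_model, enable_grad: bool = False):
    ctx = torch.enable_grad() if enable_grad else torch.no_grad()
    with ctx:
        preds = entropy_model(tokens)  # (batch, seq, vocab) next-byte logits
        log_probs = F.log_softmax(preds, dim=-1)
        entropies = -(log_probs.exp() * log_probs).sum(dim=-1)  # per-position entropy
    return entropies, preds
```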
2025-01-21 16:34:23 -08:00
Pedro Rodriguez 6ffeb66b53
Changes for training entropy model and correcting attention in local models (#25)
Summary:

- Refactor local model configs to be separate and clearer
- Add attention arguments and correct which attention is used in local models
- Preparation for being able to have an entropy train script
- Fix failing unit tests

Test Plan:
2025-01-17 14:23:01 -08:00
Ink caec8d2621
allow flex-attention to be disabled (#19)
* allow flex-attention to silently fail

* allow flex-attn to be disabled via an env var
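
A sketch of gating flex-attention behind an environment variable; the variable name `DISABLE_FLEX_ATTENTION` is a hypothetical placeholder, not the repo's actual flag:

```python
import os

import torch


def flex_attention_enabled() -> bool:
    # explicit opt-out via env var
    if os.environ.get("DISABLE_FLEX_ATTENTION", "0") == "1":
        return False
    try:
        # silently fall back if the kernel is unavailable in this torch build
        from torch.nn.attention.flex_attention import flex_attention  # noqa: F401
    except ImportError:
        return False
    return torch.cuda.is_available()
```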
2025-01-14 09:32:07 -08:00
Pedro Rodriguez 1da3dd9315
Update preprocess_entropies script to blt inference + add fsspec support (#23)
Summary:

Test Plan:
2025-01-13 15:28:14 -08:00
Pedro Rodriguez b0120da72f
Replace regular filesystem calls with fsspec + add s3 support (#18)
Summary:

For compatibility with either local/NFS or S3 datasets, swap to fsspec.

Also add a tool to compare local and remote filesystems.

Test Plan:

- Ran the regular train script
- Ran with a config whose data is in S3
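
An illustrative sketch of the fsspec pattern plus a simple local-vs-remote comparison; `compare_file_lists` is a hypothetical name for such a tool:

```python
import fsspec


def list_relative_files(root: str) -> set[str]:
    fs, path = fsspec.core.url_to_fs(root)
    # relative paths so a local tree and an s3:// prefix can be compared directly
    return {p[len(path):].lstrip("/") for p in fs.find(path)}


def compare_file_lists(local_root: str, remote_root: str) -> set[str]:
    # files present in one location but missing from the other
    return list_relative_files(local_root) ^ list_relative_files(remote_root)
```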
2025-01-10 11:04:41 -08:00
Pedro Rodriguez d4ddb95322
Add plotting code from paper (#17)
Summary:

Test Plan:
2025-01-09 12:11:50 -08:00
Ink 2fdc6f3cc9
Package bytelatent as a module (#7)
* make installable via pip

* fix missing xformers deps

* remove non-core dependencies

* fix linting

* fix isort
2025-01-06 16:44:50 -08:00
Ikko Eltociear Ashimine 9065bb1cce
docs: update README.md (#1)
folowing -> following
2025-01-03 12:08:00 -08:00
Daniele Sartiano 898671b66b
Update README.md (#13)
Fixed typo on Meta Lingua
2025-01-03 12:06:47 -08:00
Pedro Rodriguez bcc039bb75 Initial commit 2024-12-12 15:32:30 -08:00