Commit graph

15 commits

Author SHA1 Message Date
Pedro Rodriguez
7517ac2a9f
Get evals working again. (#46)
- PPL/validation: works now and uses multi-GPU. For some reason, 1-GPU results differ from multi-GPU; this can be debugged in a follow-up PR
- Generation evals likely work, but are very slow, so they are disabled for now


Test Plan:
```
torchrun --nproc-per-node 8 -m bytelatent.eval config=../internal-blt/configs/eval.yaml
```
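For context, a minimal sketch of how multi-GPU perplexity is typically aggregated (not necessarily this repo's implementation): reduce the raw loss and token sums across ranks before exponentiating, so the result does not depend on how data is sharded.

```python
import math

import torch
import torch.distributed as dist

def aggregate_ppl(local_loss_sum: torch.Tensor, local_n_tokens: torch.Tensor) -> float:
    # Reduce raw sums, not per-rank averages, so sharding does not skew the result.
    dist.all_reduce(local_loss_sum, op=dist.ReduceOp.SUM)
    dist.all_reduce(local_n_tokens, op=dist.ReduceOp.SUM)
    return math.exp((local_loss_sum / local_n_tokens).item())
```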
2025-03-11 09:57:19 -07:00
Pedro Rodriguez
63913e4dba
Reduce per-file resources Arrow uses (#77)
Summary:

Test Plan:
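The summary is empty; purely as an illustration of one common way to cut per-file Arrow overhead (whether this matches the actual change is an assumption), IPC files can be memory-mapped instead of read through private buffers:

```python
import pyarrow as pa
import pyarrow.ipc as ipc

def open_arrow_file(path: str) -> ipc.RecordBatchFileReader:
    # Memory-mapping avoids a per-file read buffer, so many files can
    # stay open cheaply (illustrative only; the change in #77 may differ).
    return ipc.open_file(pa.memory_map(path, "r"))
```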
2025-03-05 15:03:42 -08:00
Pedro Rodriguez
ea1fc75862
Add approximate state persistence (#73)
Summary:

Test Plan:

***
More verbose multiprocess logging; fix `get_state_and_recycle`

Summary:

Test Plan:
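A minimal sketch of what "approximate" state persistence can mean for a streaming reader (names are hypothetical, not this repo's API): snapshot a coarse position such as shard index and row offset, accepting that a few rows may be replayed or skipped on resume.

```python
from dataclasses import dataclass

@dataclass
class ApproxState:
    file_idx: int    # which shard was being read
    row_offset: int  # rows already consumed in that shard

class ShardedReader:
    def __init__(self, files: list[str]):
        self.files = files
        self.file_idx = 0
        self.row_offset = 0

    def get_state(self) -> ApproxState:
        # Approximate: rows buffered downstream are not accounted for.
        return ApproxState(self.file_idx, self.row_offset)

    def load_state(self, state: ApproxState) -> None:
        self.file_idx, self.row_offset = state.file_idx, state.row_offset
```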
2025-03-05 15:01:45 -08:00
Bocheng Li
a6ed14f689
Fix: Correct model_args usage in parallelize_model call (#69)
2025-02-24 14:40:38 -08:00
Pedro Rodriguez
82ab5930ec
Make it possible to specify multiple config files (#54)
Summary:

Make it possible to specify multiple config files.
Parsing the CLI is no longer a special case; it uses the same config inheritance method.

Test Plan:

Test that this interpolates in the right order via unit tests.

Sample usage: loads the internal config, which references `bytelatent/configs/entropy_model.yaml`. The precedence order is:

- Default pydantic args
- Included configs, e.g. `config`
- CLI args

```
python -m bytelatent.print_config config=internal/configs/entropy_model.yaml eval=null
```
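A sketch of the described precedence, assuming OmegaConf-style merging (the exact merge code in the repo may differ): later arguments to `merge()` win, so CLI args override included configs, and pydantic defaults fill in whatever remains when the merged dict is validated.

```python
from omegaconf import OmegaConf

def load_config(cli_args: list[str]):
    cli = OmegaConf.from_cli(cli_args)
    config_paths = str(cli.pop("config", "") or "").split(",")
    included = [OmegaConf.load(p) for p in config_paths if p]
    # Later arguments win: included configs < CLI args.
    return OmegaConf.merge(*included, cli)
```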


2025-02-18 10:42:44 -08:00
Pedro Rodriguez
8c61ab5e67
Fix multiprocessing dataloader checkpointing and use it in the train script (#50)
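The commit body is empty; purely as an illustration of the technique named in the title (names hypothetical), a dataloader worker can report its iterator state over a queue so the trainer can include it in checkpoints:

```python
import multiprocessing as mp

def worker(data_q: mp.Queue, ctrl_q: mp.Queue, state_q: mp.Queue) -> None:
    step = 0  # stand-in for real iterator state
    while True:
        if not ctrl_q.empty():
            msg = ctrl_q.get()
            if msg == "get_state":
                state_q.put({"step": step})  # trainer checkpoints this snapshot
            elif msg == "stop":
                return
        data_q.put(step)  # produce the next item (here: just a counter)
        step += 1
```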
2025-02-13 11:58:23 -08:00
Srinivasan Iyer
48e4ad0bd2
make sure max_encoder_seq_length matches (#55)
* make sure max_encoder_seq_length matches

* black and assert comment

---------

Co-authored-by: Srini Iyer <sviyer@meta.com>
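The added check presumably looks something like the following sketch (the attribute paths are hypothetical):

```python
# Hypothetical shape of the check; the real config paths may differ.
assert model_args.max_encoder_seq_length == data_args.max_encoder_seq_length, (
    "max_encoder_seq_length must match between the encoder and the data pipeline"
)
```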
2025-02-12 18:27:22 -08:00
Pedro Rodriguez
fe45f69fbf
Add bpb and n_bytes to metric logging (#41)
Summary:

Test Plan:
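For reference, bits-per-byte normalizes cross-entropy by byte count rather than token count, making models with different tokenizations comparable; a minimal sketch:

```python
import math

def bits_per_byte(total_loss_nats: float, n_bytes: int) -> float:
    # Loss is accumulated in nats; dividing by ln(2) converts to bits,
    # and dividing by n_bytes normalizes across tokenizers.
    return total_loss_nats / (n_bytes * math.log(2))
```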
2025-02-07 13:14:30 -08:00
Pedro Rodriguez
afedb16598
Update checkpointing to use fsspec (#39)
Summary:

- Make the data/checkpoint code fsspec compatible
- This still will not work with S3 saves, because `torch.distributed.checkpoint.save` does not work out of the box with `fsspec`. That will be implemented in a follow-up PR


Test Plan:

Run unit tests and the commands below

```
python -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100
```

```
torchrun --nproc-per-node 8 -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100
```

These currently won't work due to the torch distributed save, but these should be tested at a later date:

```
python -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100 dump_dir=s3://blt/scratch/checkpoint-test/
```

```
torchrun --nproc-per-node 8 -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100 dump_dir=s3://blt/scratch/checkpoint-test/
```
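A sketch of the fsspec pattern (not the repo's exact checkpoint code): the filesystem is resolved from the URL scheme, so the same call handles local paths and `s3://` URLs, subject to the `torch.distributed.checkpoint.save` caveat above.

```python
import fsspec
import torch

def save_checkpoint(state: dict, url: str) -> None:
    # fsspec picks the filesystem from the scheme (file://, s3://, ...).
    fs, path = fsspec.core.url_to_fs(url)
    fs.makedirs(path.rsplit("/", 1)[0], exist_ok=True)
    with fs.open(path, "wb") as f:
        torch.save(state, f)
```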
2025-02-06 09:41:58 -08:00
Srinivasan Iyer
739dc71a0a
Add rope fp32 (#43)
* Log model

* Add flag for rope outer in fp32

---------

Co-authored-by: Srini Iyer <sviyer@meta.com>
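A sketch of what "rope outer in fp32" plausibly means (the flag name below is hypothetical): compute the position-by-frequency outer product in float32 to avoid precision loss at long contexts.

```python
import torch

def rope_freqs(seq_len: int, head_dim: int, theta: float = 10000.0,
               outer_fp32: bool = True) -> torch.Tensor:
    inv_freq = 1.0 / theta ** (torch.arange(0, head_dim, 2).float() / head_dim)
    t = torch.arange(seq_len)
    if outer_fp32:
        # fp32 outer product: small angle errors compound at long contexts.
        freqs = torch.outer(t.float(), inv_freq.float())
    else:
        freqs = torch.outer(t.to(inv_freq.dtype), inv_freq)
    return torch.polar(torch.ones_like(freqs), freqs)  # complex rotations
```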
2025-02-05 17:19:37 -08:00
Pedro Rodriguez
c79b1fdbd0
Fix distributed all reduce grad norm (#40)
Summary:

With >1 GPU but only 1 node, all-reduces fail when inputs are not bf16. This uses a modified copy of torch's grad-norm computation to avoid the failures.

Test Plan:

- Run unit tests
- Run single-GPU training: `python -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100`
- Run 1-node, multi-GPU training: `torchrun --nproc-per-node 8 -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100`
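A minimal sketch of the workaround as described, assuming the fix reduces the squared norm in bf16 (the repo's modified copy of torch's grad-norm code is more involved):

```python
import torch
import torch.distributed as dist

def dist_grad_norm(params, reduce_dtype=torch.bfloat16) -> torch.Tensor:
    grads = [p.grad for p in params if p.grad is not None]
    local_sq = torch.stack([g.detach().norm(2) for g in grads]).pow(2).sum()
    # Cast to a dtype the backend accepts before the collective (bf16 here),
    # mirroring the failure mode described above.
    local_sq = local_sq.to(reduce_dtype)
    dist.all_reduce(local_sq, op=dist.ReduceOp.SUM)
    return local_sq.float().sqrt()
```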
2025-02-04 16:53:50 -08:00
Pedro Rodriguez
7044771a12
This includes fixes that make checkpointing and reloading work correctly. (#35)
It also batches in a first set of changes for fixing the eval code.

Summary:

Test Plan:
2025-01-27 16:56:42 -08:00
Pedro Rodriguez
7622d28b74
Initial code and scripts for training the entropy model (#34)
Summary:

Test Plan:
2025-01-27 09:46:44 -08:00
Pedro Rodriguez
6ffeb66b53
Changes for training the entropy model and correcting attention in local models (#25)
Summary:

- Refactor local model configs to be separate and clearer
- Add attention arguments and correct which attention is used in local models
- Preparation for being able to add an entropy training script
- Fix failing unit tests

Test Plan:
2025-01-17 14:23:01 -08:00
Pedro Rodriguez
bcc039bb75
Initial commit
2024-12-12 15:32:30 -08:00