Commit graph

112 commits

Author SHA1 Message Date
Pedro Rodriguez 15d9c40abe
Merge 3e3193c1d4 into sapling-pr-archive-EntilZha 2025-02-12 10:24:54 -08:00
Pedro Rodriguez c0c5bdba91
Merge c54c9f0517 into sapling-pr-archive-EntilZha 2025-02-12 10:24:45 -08:00
Pedro Rodriguez 3e3193c1d4 Fix multiprocessing dataloader checkpointing and use it in the train script
Summary:

Test Plan:
2025-02-12 18:24:40 +00:00
Pedro Rodriguez c54c9f0517 Test first batch matches
Summary:

Test Plan:
2025-02-12 18:24:39 +00:00
Pedro Rodriguez ec59c13d81
Merge bd3cf61bb9 into sapling-pr-archive-EntilZha 2025-02-12 10:11:44 -08:00
Pedro Rodriguez 9613e0ea5f merge commit for archive created by Sapling 2025-02-12 18:09:31 +00:00
Pedro Rodriguez bd3cf61bb9 Fix multiprocessing dataloader checkpointing and use it in the train script
Summary:

Test Plan:
2025-02-12 18:09:26 +00:00
Pedro Rodriguez 4cee32ea8c Test first batch matches
Summary:

Test Plan:
2025-02-12 18:09:26 +00:00
Pedro Rodriguez b61a612bbb
Merge 92af9b3f56 into sapling-pr-archive-EntilZha 2025-02-12 10:07:50 -08:00
Pedro Rodriguez 92af9b3f56 Test first batch matches
Summary:

Test Plan:
2025-02-12 18:07:22 +00:00
Pedro Rodriguez c6cbacc8c1 merge commit for archive created by Sapling
2025-02-11 22:56:32 +00:00
Pedro Rodriguez 38cc67a953 Fix multiprocessing dataloader checkpointing and use it in the train script
Summary:

Test Plan:
2025-02-11 22:56:26 +00:00
Pedro Rodriguez c4b7a01b2b
Merge 5c8fb4f1b3 into sapling-pr-archive-EntilZha
2025-02-07 15:27:12 -08:00
Pedro Rodriguez 5c8fb4f1b3 Fix multiprocessing dataloader checkpointing and use it in the train script
Summary:

Test Plan:
2025-02-07 23:27:05 +00:00
Srinivasan Iyer 22c7fe1d1c
fix save and reload model state (#49)
Co-authored-by: Srini Iyer <sviyer@meta.com>
2025-02-07 14:27:47 -08:00
Pedro Rodriguez fe45f69fbf
Add bpb and n_bytes to metric logging (#41)
Summary:

Test Plan:
2025-02-07 13:14:30 -08:00
Pedro Rodriguez b35206d756 merge commit for archive created by Sapling
2025-02-07 21:13:42 +00:00
Pedro Rodriguez 8d7338308e Add bpb and n_bytes to metric logging
Summary:

Test Plan:
2025-02-07 21:13:36 +00:00
Pedro Rodriguez f783846574 merge commit for archive created by Sapling
2025-02-07 00:26:06 +00:00
Pedro Rodriguez b6396eb0f4 Add bpb and n_bytes to metric logging
Summary:

Test Plan:
2025-02-07 00:26:01 +00:00
Srinivasan Iyer aebdc481a8
Fix init and repro (#48)
* Fix init and repro

* comment + black

---------

Co-authored-by: Srini Iyer <sviyer@meta.com>
2025-02-06 14:18:02 -08:00
Pedro Rodriguez ab594996a9
Merge 4e2ed0aa05 into sapling-pr-archive-EntilZha
2025-02-06 10:24:36 -08:00
Pedro Rodriguez 4e2ed0aa05 Add bpb and n_bytes to metric logging
Summary:

Test Plan:
2025-02-06 18:08:01 +00:00
Pedro Rodriguez 936d9437be
Allow ArrowIterator to read from json (#45)
Summary:

Currently, arrow iterator can only read arrow files. However, the pyarrow library can read
other formats, including jsonlines. This allows the same ArrowIterator to read from jsonlines,
so we can read from the original source data, and simply omit the entropy column when doing so.
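
As a rough illustration of the idea above (not the commit's actual code), the sketch below reads jsonlines records and tolerates a missing entropy column. It uses only the standard-library `json` module as a stand-in for pyarrow's reader, and the function name `iter_jsonl_rows` and the column names `text` and `entropies` are assumptions made for this example.

```python
import io
import json
from typing import Iterator, TextIO


def iter_jsonl_rows(
    handle: TextIO,
    text_key: str = "text",          # hypothetical column name
    entropy_key: str = "entropies",  # hypothetical column name
) -> Iterator[dict]:
    """Yield one row per non-empty line; entropies is None when the column is absent."""
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        record = json.loads(line)
        yield {
            "text": record[text_key],
            # Original source data has no entropy column, so fall back to None.
            "entropies": record.get(entropy_key),
        }


# Two records: one from raw source data, one that already carries entropies.
sample = io.StringIO(
    '{"text": "hello"}\n\n{"text": "world", "entropies": [0.1, 0.2]}\n'
)
rows = list(iter_jsonl_rows(sample))
```

Downstream code can then treat `entropies is None` as "not preprocessed yet", so the same iterator interface serves both file kinds.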

Test Plan:

Run train script until dataloader starts
2025-02-06 09:57:22 -08:00
Pedro Rodriguez 2950b63cf2
Merge 9c3c997cae into sapling-pr-archive-EntilZha 2025-02-06 09:44:41 -08:00
Pedro Rodriguez 9c3c997cae Allow ArrowIterator to read from json
Summary:

Currently, arrow iterator can only read arrow files. However, the pyarrow library can read
other formats, including jsonlines. This allows the same ArrowIterator to read from jsonlines,
so we can read from the original source data, and simply omit the entropy column when doing so.

Test Plan:

Run train script until dataloader starts
2025-02-06 17:44:36 +00:00
Pedro Rodriguez fff80b86b5
Merge 0e9421af07 into sapling-pr-archive-EntilZha 2025-02-06 09:43:15 -08:00
Pedro Rodriguez 0e9421af07 Allow ArrowIterator to read from json
Summary:

Test Plan:
2025-02-06 17:43:11 +00:00
Pedro Rodriguez 18ae0ba444
Merge 8d26140970 into sapling-pr-archive-EntilZha 2025-02-06 09:42:51 -08:00
Pedro Rodriguez 8d26140970 Allow ArrowIterator to read from json
Summary:

Test Plan:
2025-02-06 17:42:42 +00:00
Pedro Rodriguez afedb16598
Update checkpointing to use fsspec (#39)
Summary:

- Make the data/checkpoint code fsspec compatible
- Still will not work with s3 saves, due to `torch.distributed.checkpoint.save` not being out of the box workable with `fsspec`. Will implement in followup PR


Test Plan:

Run unit tests and the commands below

```
python -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100
```

```
torchrun --nproc-per-node 8 -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100
```

These currently won't work due to the torch distributed save, but these should be tested at a later date

```
python -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100 dump_dir=s3://blt/scratch/checkpoint-test/
```

```
torchrun --nproc-per-node 8 -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100 dump_dir=s3://blt/scratch/checkpoint-test/
```
2025-02-06 09:41:58 -08:00
Pedro Rodriguez d44902da97
Merge f058373889 into sapling-pr-archive-EntilZha 2025-02-06 09:37:27 -08:00
Pedro Rodriguez f058373889 Update checkpointing to use fsspec
Summary:

- Make the data/checkpoint code fsspec compatible
- Still will not work with s3 saves, due to `torch.distributed.checkpoint.save` not being out of the box workable with `fsspec`. Will implement in followup PR


Test Plan:

Run unit tests and the commands below

```
python -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100
```

```
torchrun --nproc-per-node 8 -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100
```

These currently won't work due to the torch distributed save, but these should be tested at a later date

```
python -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100 dump_dir=s3://blt/scratch/checkpoint-test/
```

```
torchrun --nproc-per-node 8 -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100 dump_dir=s3://blt/scratch/checkpoint-test/
```
2025-02-06 17:37:21 +00:00
Pedro Rodriguez e13495c351
Merge 341264685a into sapling-pr-archive-EntilZha 2025-02-06 09:34:50 -08:00
Pedro Rodriguez 341264685a Update checkpointing to use fsspec
Summary:

- Make the data/checkpoint code fsspec compatible
- Still will not work with s3 saves, due to `torch.distributed.checkpoint.save` not being out of the box workable with `fsspec`. Will implement in followup PR


Test Plan:

Run unit tests and the commands below

```
python -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100
```

```
torchrun --nproc-per-node 8 -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100
```

These currently won't work due to the torch distributed save, but these should be tested at a later date

```
python -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100 dump_dir=s3://blt/scratch/checkpoint-test/
```

```
torchrun --nproc-per-node 8 -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100 dump_dir=s3://blt/scratch/checkpoint-test/
```
2025-02-06 17:34:44 +00:00
Pedro Rodriguez 2d1c766050
Merge 45bfe94c1e into sapling-pr-archive-EntilZha
2025-02-05 17:27:35 -08:00
Pedro Rodriguez 45bfe94c1e Broken train reproducing bf16 error
Summary:

Test Plan:
2025-02-06 01:27:23 +00:00
Srinivasan Iyer 739dc71a0a
Add rope fp32 (#43)
* Log model

* Add flag for rope outer in fp32

---------

Co-authored-by: Srini Iyer <sviyer@meta.com>
2025-02-05 17:19:37 -08:00
Srinivasan Iyer 6fbaf7266f
fix stool (#44)
Co-authored-by: Srini Iyer <sviyer@meta.com>
2025-02-05 17:18:40 -08:00
Srinivasan Iyer 7cf8fab49b
Fix wandb logging (#42)
Co-authored-by: Srini Iyer <sviyer@meta.com>
2025-02-05 16:24:39 -08:00
Pedro Rodriguez 8f1a9a858e Minimal working eval
Summary:

Test Plan:
2025-02-05 22:47:01 +00:00
Pedro Rodriguez 48cf4dfee1 Allow ArrowIterator to read from json
Summary:

Test Plan:
2025-02-05 22:47:01 +00:00
Pedro Rodriguez 1377fcb010
Merge 2f42633b07 into sapling-pr-archive-EntilZha 2025-02-05 14:32:03 -08:00
Pedro Rodriguez 2f42633b07 Add bpb and n_bytes to metric logging
Summary:

Test Plan:
2025-02-05 22:27:15 +00:00
Pedro Rodriguez b2f2a6a76e merge commit for archive created by Sapling 2025-02-05 22:10:37 +00:00
Pedro Rodriguez 1450464031 Update checkpointing to use fsspec
Summary:

- Make the data/checkpoint code fsspec compatible


Test Plan:

Run unit tests and the commands below

```
python -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100
```

```
torchrun --nproc-per-node 8 -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100
```

```
python -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100 dump_dir=s3://blt/scratch/checkpoint-test/
```

```
torchrun --nproc-per-node 8 -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100 dump_dir=s3://blt/scratch/checkpoint-test/
```
2025-02-05 19:09:14 +00:00
Pedro Rodriguez c3d7f720f0
Merge b6e53f1d4c into sapling-pr-archive-EntilZha
2025-02-04 16:55:37 -08:00
Pedro Rodriguez b6e53f1d4c Update checkpointing to use fsspec
Summary:

- Make the data/checkpoint code fsspec compatible


Test Plan:

Run unit tests and the commands below

```
python -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100
```

```
torchrun --nproc-per-node 8 -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100
```

```
python -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100 dump_dir=s3://blt/scratch/checkpoint-test/
```

```
torchrun --nproc-per-node 8 -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100 dump_dir=s3://blt/scratch/checkpoint-test/
```
2025-02-05 00:55:32 +00:00
Pedro Rodriguez c79b1fdbd0
Fix distributed all reduce grad norm (#40)
Summary:

With >1 GPU, but only 1 node, all reduces fail when inputs are not bf16. This uses a modified copy of torch's grad norm to avoid failures
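
The modified function itself is not shown in this log. For orientation only, here is a pure-Python sketch of the general norm-then-clip computation that gradient-norm clipping performs: take the L2 norm of each gradient, combine those into one total norm, and scale all gradients when the total exceeds a threshold. The names `total_grad_norm` and `clip_grads` are hypothetical, plain lists stand in for tensors, and the sketch deliberately leaves out the dtype handling and the all-reduce that the commit is about.

```python
import math
from typing import List, Sequence, Tuple


def total_grad_norm(grads: Sequence[Sequence[float]]) -> float:
    """L2 norm over all gradients, computed as the norm of per-gradient norms."""
    per_grad = [math.sqrt(sum(g * g for g in grad)) for grad in grads]
    return math.sqrt(sum(n * n for n in per_grad))


def clip_grads(
    grads: Sequence[Sequence[float]],
    max_norm: float,
    eps: float = 1e-6,
) -> Tuple[List[List[float]], float]:
    """Scale gradients so their total norm is at most max_norm.

    Returns the scaled gradients and the total norm before scaling.
    """
    total = total_grad_norm(grads)
    # Never scale up: the coefficient is capped at 1.0.
    coef = min(1.0, max_norm / (total + eps))
    return [[g * coef for g in grad] for grad in grads], total


# Two one-element "parameters" with gradients 3 and 4 give a total norm of 5.
clipped, total = clip_grads([[3.0], [4.0]], max_norm=1.0)
```

Because the total is built from squared per-gradient norms, a multi-process variant only needs to sum one scalar across workers before taking the square root; how that sum is communicated is outside this sketch.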

Test Plan:

- Run unit tests
- Run single-GPU training: `python -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100`
- Run 1-node, multi-GPU training: `torchrun --nproc-per-node 8 -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100`
2025-02-04 16:53:50 -08:00
Pedro Rodriguez e1cd15ec30 merge commit for archive created by Sapling 2025-02-05 00:53:00 +00:00