Commit graph

144 commits

Author SHA1 Message Date
Pedro Rodriguez f783846574 merge commit for archive created by Sapling
2025-02-07 00:26:06 +00:00
Pedro Rodriguez b6396eb0f4 Add bpb and n_bytes to metric logging
Summary:

Test Plan:
2025-02-07 00:26:01 +00:00
Srinivasan Iyer aebdc481a8
Fix init and repro (#48)
* Fix init and repro

* comment + black

---------

Co-authored-by: Srini Iyer <sviyer@meta.com>
2025-02-06 14:18:02 -08:00
Pedro Rodriguez ab594996a9
Merge 4e2ed0aa05 into sapling-pr-archive-EntilZha
2025-02-06 10:24:36 -08:00
Pedro Rodriguez 4e2ed0aa05 Add bpb and n_bytes to metric logging
Summary:

Test Plan:
2025-02-06 18:08:01 +00:00
Pedro Rodriguez 936d9437be
Allow ArrowIterator to read from json (#45)
Summary:

Currently, the arrow iterator can only read Arrow files. However, the pyarrow library can read
other formats, including jsonlines. This change allows the same ArrowIterator to read from jsonlines,
so we can read from the original source data and simply omit the entropy column when doing so.
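
A minimal sketch of the pyarrow capability this relies on; the dispatch helper and path handling below are hypothetical, not the repo's actual ArrowIterator code:

```
# Sketch: pyarrow can load newline-delimited JSON into a Table, so the
# same downstream code can consume either format. Hypothetical helper.
import pyarrow as pa
import pyarrow.ipc as pa_ipc
import pyarrow.json as pa_json


def read_table(path: str) -> pa.Table:
    if path.endswith((".jsonl", ".json")):
        # Line-delimited JSON from the original source data; there is no
        # precomputed entropy column here, so downstream code skips it.
        return pa_json.read_json(path)
    # Otherwise assume an Arrow IPC file that includes the entropy column.
    return pa_ipc.open_file(path).read_all()
```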

Test Plan:

Run train script until dataloader starts
2025-02-06 09:57:22 -08:00
Pedro Rodriguez 2950b63cf2
Merge 9c3c997cae into sapling-pr-archive-EntilZha 2025-02-06 09:44:41 -08:00
Pedro Rodriguez 9c3c997cae Allow ArrowIterator to read from json
Summary:

Currently, the arrow iterator can only read Arrow files. However, the pyarrow library can read
other formats, including jsonlines. This change allows the same ArrowIterator to read from jsonlines,
so we can read from the original source data and simply omit the entropy column when doing so.

Test Plan:

Run train script until dataloader starts
2025-02-06 17:44:36 +00:00
Pedro Rodriguez fff80b86b5
Merge 0e9421af07 into sapling-pr-archive-EntilZha 2025-02-06 09:43:15 -08:00
Pedro Rodriguez 0e9421af07 Allow ArrowIterator to read from json
Summary:

Test Plan:
2025-02-06 17:43:11 +00:00
Pedro Rodriguez 18ae0ba444
Merge 8d26140970 into sapling-pr-archive-EntilZha 2025-02-06 09:42:51 -08:00
Pedro Rodriguez 8d26140970 Allow ArrowIterator to read from json
Summary:

Test Plan:
2025-02-06 17:42:42 +00:00
Pedro Rodriguez afedb16598
Update checkpointing to use fsspec (#39)
Summary:

- Make the data/checkpoint code fsspec compatible
- Still will not work with S3 saves, because `torch.distributed.checkpoint.save` does not work out of the box with `fsspec`. Will implement in a follow-up PR
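
A minimal sketch of what fsspec-compatible I/O looks like for the non-tensor part of a checkpoint; the file name and helpers are hypothetical, not the repo's checkpoint API:

```
# Sketch: fsspec resolves a filesystem from the URL, so the same code can
# write to a local dump_dir or to s3://... once the distributed save is
# also fsspec-aware. Names below are hypothetical.
import json

import fsspec


def save_train_state(dump_dir: str, step: int, state: dict) -> None:
    fs, root = fsspec.core.url_to_fs(dump_dir)
    fs.makedirs(root, exist_ok=True)
    with fsspec.open(f"{dump_dir}/train_state_{step:08d}.json", "w") as f:
        json.dump(state, f)


def load_train_state(dump_dir: str, step: int) -> dict:
    with fsspec.open(f"{dump_dir}/train_state_{step:08d}.json", "r") as f:
        return json.load(f)
```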


Test Plan:

Run unit tests and the commands below

```
python -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100
```

```
torchrun --nproc-per-node 8 -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100
```

These currently won't work due to the torch distributed save, but these should be tested at a later date

```
python -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100 dump_dir=s3://blt/scratch/checkpoint-test/
```

```
torchrun --nproc-per-node 8 -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100 dump_dir=s3://blt/scratch/checkpoint-test/
```
2025-02-06 09:41:58 -08:00
Pedro Rodriguez d44902da97
Merge f058373889 into sapling-pr-archive-EntilZha 2025-02-06 09:37:27 -08:00
Pedro Rodriguez f058373889 Update checkpointing to use fsspec
Summary:

- Make the data/checkpoint code fsspec compatible
- Still will not work with S3 saves, because `torch.distributed.checkpoint.save` does not work out of the box with `fsspec`. Will implement in a follow-up PR


Test Plan:

Run unit tests and the commands below

```
python -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100
```

```
torchrun --nproc-per-node 8 -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100
```

These currently won't work due to the torch distributed save, but these should be tested at a later date

```
python -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100 dump_dir=s3://blt/scratch/checkpoint-test/
```

```
torchrun --nproc-per-node 8 -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100 dump_dir=s3://blt/scratch/checkpoint-test/
```
2025-02-06 17:37:21 +00:00
Pedro Rodriguez e13495c351
Merge 341264685a into sapling-pr-archive-EntilZha 2025-02-06 09:34:50 -08:00
Pedro Rodriguez 341264685a Update checkpointing to use fsspec
Summary:

- Make the data/checkpoint code fsspec compatible
- Still will not work with S3 saves, because `torch.distributed.checkpoint.save` does not work out of the box with `fsspec`. Will implement in a follow-up PR


Test Plan:

Run unit tests and the commands below

```
python -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100
```

```
torchrun --nproc-per-node 8 -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100
```

These currently won't work due to the torch distributed save, but these should be tested at a later date

```
python -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100 dump_dir=s3://blt/scratch/checkpoint-test/
```

```
torchrun --nproc-per-node 8 -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100 dump_dir=s3://blt/scratch/checkpoint-test/
```
2025-02-06 17:34:44 +00:00
Pedro Rodriguez 2d1c766050
Merge 45bfe94c1e into sapling-pr-archive-EntilZha
2025-02-05 17:27:35 -08:00
Pedro Rodriguez 45bfe94c1e Broken train reproducing bf16 error
Summary:

Test Plan:
2025-02-06 01:27:23 +00:00
Srinivasan Iyer 739dc71a0a
Add rope fp32 (#43)
* Log model

* Add flag for rope outer in fp32

---------

Co-authored-by: Srini Iyer <sviyer@meta.com>
2025-02-05 17:19:37 -08:00
Srinivasan Iyer 6fbaf7266f
fix stool (#44)
Co-authored-by: Srini Iyer <sviyer@meta.com>
2025-02-05 17:18:40 -08:00
Srinivasan Iyer 7cf8fab49b
Fix wandb logging (#42)
Co-authored-by: Srini Iyer <sviyer@meta.com>
2025-02-05 16:24:39 -08:00
Pedro Rodriguez 8f1a9a858e Minimal working eval
Summary:

Test Plan:
2025-02-05 22:47:01 +00:00
Pedro Rodriguez 48cf4dfee1 Allow ArrowIterator to read from json
Summary:

Test Plan:
2025-02-05 22:47:01 +00:00
Pedro Rodriguez 1377fcb010
Merge 2f42633b07 into sapling-pr-archive-EntilZha 2025-02-05 14:32:03 -08:00
Pedro Rodriguez 2f42633b07 Add bpb and n_bytes to metric logging
Summary:

Test Plan:
2025-02-05 22:27:15 +00:00
Pedro Rodriguez b2f2a6a76e merge commit for archive created by Sapling 2025-02-05 22:10:37 +00:00
Pedro Rodriguez 1450464031 Update checkpointing to use fsspec
Summary:

- Make the data/checkpoint code fsspec compatible


Test Plan:

Run unit tests and the commands below

```
python -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100
```

```
torchrun --nproc-per-node 8 -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100
```

```
python -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100 dump_dir=s3://blt/scratch/checkpoint-test/
```

```
torchrun --nproc-per-node 8 -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100 dump_dir=s3://blt/scratch/checkpoint-test/
```
2025-02-05 19:09:14 +00:00
Pedro Rodriguez c3d7f720f0
Merge b6e53f1d4c into sapling-pr-archive-EntilZha
2025-02-04 16:55:37 -08:00
Pedro Rodriguez b6e53f1d4c Update checkpointing to use fsspec
Summary:

- Make the data/checkpoint code fsspec compatible


Test Plan:

Run unit tests and the commands below

```
python -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100
```

```
torchrun --nproc-per-node 8 -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100
```

```
python -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100 dump_dir=s3://blt/scratch/checkpoint-test/
```

```
torchrun --nproc-per-node 8 -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100 dump_dir=s3://blt/scratch/checkpoint-test/
```
2025-02-05 00:55:32 +00:00
Pedro Rodriguez c79b1fdbd0
Fix distributed all reduce grad norm (#40)
Summary:

With >1 GPU but only 1 node, all-reduces fail when inputs are not bf16. This uses a modified copy of torch's grad norm computation to avoid the failures.
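
A minimal sketch of the idea, assuming gradients are sharded across ranks (e.g. FSDP): accumulate the squared norm in fp32 and all-reduce that scalar, so the collective never sees bf16 inputs. Illustrative only, not the modified torch code used here:

```
# Sketch: reduce the squared gradient norm in fp32 so the all-reduce does
# not depend on the parameters' (bf16) dtype. Illustrative only.
import torch
import torch.distributed as dist


def global_grad_norm(parameters) -> torch.Tensor:
    grads = [p.grad for p in parameters if p.grad is not None]
    total_sq = torch.zeros((), dtype=torch.float32, device=grads[0].device)
    for g in grads:
        total_sq += g.detach().float().pow(2).sum()
    if dist.is_available() and dist.is_initialized():
        # Sum the fp32 scalar across ranks instead of reducing bf16 tensors.
        dist.all_reduce(total_sq, op=dist.ReduceOp.SUM)
    return total_sq.sqrt()
```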

Test Plan:

- Run unit tests
- Run single-GPU training: `python -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100`
- Run 1-node, multi-GPU training: `torchrun --nproc-per-node 8 -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100`
2025-02-04 16:53:50 -08:00
Pedro Rodriguez e1cd15ec30 merge commit for archive created by Sapling 2025-02-05 00:53:00 +00:00
Pedro Rodriguez ac257bac19 Fix distributed all reduce grad norm
Summary:

With >1 GPU but only 1 node, all-reduces fail when inputs are not bf16. This uses a modified copy of torch's grad norm computation to avoid the failures.

Test Plan:

- Run unit tests
- Run single-GPU training: `python -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100`
- Run 1-node, multi-GPU training: `torchrun --nproc-per-node 8 -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100`
2025-02-05 00:52:52 +00:00
Pedro Rodriguez 2d68e5126d merge commit for archive created by Sapling 2025-02-05 00:51:52 +00:00
Pedro Rodriguez 9cf7847e26 Fix distributed all reduce grad norm
Summary:

With >1 GPU but only 1 node, all-reduces fail when inputs are not bf16. This uses a modified copy of torch's grad norm computation to avoid the failures.

Test Plan:

- Run unit tests
- Run single-GPU training: `python -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100`
- Run 1-node, multi-GPU training: `torchrun --nproc-per-node 8 -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100`
2025-02-05 00:51:27 +00:00
Pedro Rodriguez 8db01ac392
Merge 4ad4889405 into sapling-pr-archive-EntilZha 2025-02-04 16:30:42 -08:00
Pedro Rodriguez 4ad4889405 Update checkpointing to use fsspec
Summary:

- Make the data/checkpoint code fsspec compatible


Test Plan:

Run unit tests and the commands below

```
python -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100
```

```
torchrun --nproc-per-node 8 -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100
```

```
python -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100 dump_dir=s3://blt/scratch/checkpoint-test/
```

```
torchrun --nproc-per-node 8 -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100 dump_dir=s3://blt/scratch/checkpoint-test/
```
2025-02-05 00:30:38 +00:00
Pedro Rodriguez 97e3bc0427
Merge b2058fb0f6 into sapling-pr-archive-EntilZha 2025-02-04 16:30:13 -08:00
Pedro Rodriguez b2058fb0f6 Update checkpointing to use fsspec
Summary:

- Make the arrow iterator able to read from jsonl files; the entropies are omitted in this case
- Make the data/checkpoint code fsspec compatible
- Fix issues with all-reduce on non-bf16 inputs in dist_sum and the norm computation
- Minimal fixes to get eval to run; it is slow currently
- Add bpb numbers during training
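
For the bpb metric in the last bullet: bits-per-byte is the summed cross-entropy in nats converted to bits and normalized by the raw byte count instead of the token count. A small sketch with hypothetical variable names:

```
import math


def bits_per_byte(total_nll_nats: float, n_bytes: int) -> float:
    # Convert the summed negative log-likelihood from nats to bits, then
    # normalize by the number of raw bytes rather than the token count.
    return total_nll_nats / (math.log(2) * n_bytes)


# Example: a mean loss of 1.2 nats/token over 1000 tokens covering 4200 bytes
# gives bits_per_byte(1.2 * 1000, 4200) ≈ 0.41.
```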


Test Plan:

Run unit tests and the commands below

```
python -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100
```

```
torchrun --nproc-per-node 8 -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100
```

```
python -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100 dump_dir=s3://blt/scratch/checkpoint-test/
```

```
torchrun --nproc-per-node 8 -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100 dump_dir=s3://blt/scratch/checkpoint-test/
```
2025-02-05 00:29:37 +00:00
Pedro Rodriguez 740d76cd69
Merge e742218d65 into sapling-pr-archive-EntilZha 2025-02-04 16:21:14 -08:00
Pedro Rodriguez e742218d65 Update checkpointing to use fsspec
Summary:

- Make the arrow iterator able to read from jsonl files; the entropies are omitted in this case
- Make the data/checkpoint code fsspec compatible
- Fix issues with all-reduce on non-bf16 inputs in dist_sum and the norm computation
- Minimal fixes to get eval to run; it is slow currently
- Add bpb numbers during training


Test Plan:

Run

```
torchrun --nproc-per-node 8 -m bytelatent.train config=internal/configs/entropy_model.yaml eval=null max_steps=10100
```

```
python -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null
```

```
torchrun --nproc-per-node 8 -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null
```
2025-02-05 00:20:58 +00:00
Pedro Rodriguez 9c4cca558b
Merge bc39591032 into sapling-pr-archive-EntilZha
2025-02-04 10:19:56 -08:00
Pedro Rodriguez bc39591032 Several changes to enable entropy model training/eval
Summary:

- Make the arrow iterator able to read from jsonl files; the entropies are omitted in this case
- Make the data/checkpoint code fsspec compatible
- Fix issues with all-reduce on non-bf16 inputs in dist_sum and the norm computation
- Minimal fixes to get eval to run; it is slow currently
- Add bpb numbers during training


Test Plan:

Run

```
torchrun --nproc-per-node 8 -m bytelatent.train config=internal/configs/entropy_model.yaml eval=null max_steps=10100
```

```
python -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null
```

```
torchrun --nproc-per-node 8 -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null
```
2025-02-04 18:19:49 +00:00
Pedro Rodriguez f73b9e1a41 merge commit for archive created by Sapling 2025-02-04 18:05:21 +00:00
Pedro Rodriguez ab399e981d Several changes to enable entropy model training/eval
Summary:

- Make the arrow iterator able to read from jsonl files; the entropies are omitted in this case
- Make the data/checkpoint code fsspec compatible
- Fix issues with all-reduce on non-bf16 inputs in dist_sum and the norm computation
- Minimal fixes to get eval to run; it is slow currently
- Add bpb numbers during training


Test Plan:

Run

```
torchrun --nproc-per-node 8 -m bytelatent.train config=internal/configs/entropy_model.yaml eval=null max_steps=10100
```
2025-02-04 18:05:16 +00:00
Pedro Rodriguez 48df9ce785
Merge c6ef4285e2 into sapling-pr-archive-EntilZha 2025-02-04 10:03:26 -08:00
Pedro Rodriguez c6ef4285e2 Several changes to enable entropy model training/eval
Summary:

- Make the arrow iterator able to read from jsonl files; the entropies are omitted in this case
- Make the data/checkpoint code fsspec compatible
- Fix issues with all-reduce on non-bf16 inputs in dist_sum and the norm computation
- Minimal fixes to get eval to run; it is slow currently
- Add bpb numbers during training


Test Plan:
2025-02-04 18:03:19 +00:00
Pedro Rodriguez 4ff8341738
Merge 11cad6c84d into sapling-pr-archive-EntilZha
2025-02-03 18:29:37 -08:00
Pedro Rodriguez 11cad6c84d WIP parallel copy script
Summary:

Test Plan:
2025-01-28 00:57:06 +00:00
Pedro Rodriguez 7044771a12
This includes fixes that make checkpointing and reloading work correctly. (#35)
It also includes a first set of changes toward fixing the eval code

Summary:

Test Plan:
2025-01-27 16:56:42 -08:00