Commit graph

81 commits

Author SHA1 Message Date
Pedro Rodriguez d44902da97
Merge f058373889 into sapling-pr-archive-EntilZha 2025-02-06 09:37:27 -08:00
Pedro Rodriguez f058373889 Update checkpointing to use fsspec
Summary:

- Make the data/checkpoint code fsspec compatible
- Still will not work with S3 saves, since `torch.distributed.checkpoint.save` does not work with `fsspec` out of the box. Will implement in a follow-up PR
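
For reference, a minimal illustrative sketch of fsspec-style path handling for checkpoints (the function and names here are assumptions, not the repo's actual code):

```
# Illustrative sketch only: resolve dump_dir with fsspec so the same code path
# works for a local directory or an s3:// URI. Function name is hypothetical.
import fsspec


def open_checkpoint_file(dump_dir: str, name: str, mode: str = "wb"):
    fs, _, (root,) = fsspec.get_fs_token_paths(dump_dir)
    fs.makedirs(root, exist_ok=True)
    # fs.open behaves the same for file://, s3://, gs://, etc.
    return fs.open(f"{root}/{name}", mode)
```

The caveat above presumably stems from `torch.distributed.checkpoint.save` writing through its own storage layer rather than plain file handles, hence the follow-up PR.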


Test Plan:

Run unit tests and the commands below

```
python -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100
```

```
torchrun --nproc-per-node 8 -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100
```

These currently won't work due to the torch distributed save, but these should be tested at a later date

```
python -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100 dump_dir=s3://blt/scratch/checkpoint-test/
```

```
torchrun --nproc-per-node 8 -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100 dump_dir=s3://blt/scratch/checkpoint-test/
```
2025-02-06 17:37:21 +00:00
Pedro Rodriguez e13495c351
Merge 341264685a into sapling-pr-archive-EntilZha 2025-02-06 09:34:50 -08:00
Pedro Rodriguez 341264685a Update checkpointing to use fsspec
Summary:

- Make the data/checkpoint code fsspec compatible
- Still will not work with S3 saves, since `torch.distributed.checkpoint.save` does not work with `fsspec` out of the box. Will implement in a follow-up PR


Test Plan:

Run unit tests and the commands below

```
python -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100
```

```
torchrun --nproc-per-node 8 -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100
```

These currently won't work due to the torch distributed save, but these should be tested at a later date

```
python -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100 dump_dir=s3://blt/scratch/checkpoint-test/
```

```
torchrun --nproc-per-node 8 -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100 dump_dir=s3://blt/scratch/checkpoint-test/
```
2025-02-06 17:34:44 +00:00
Pedro Rodriguez 2d1c766050
Merge 45bfe94c1e into sapling-pr-archive-EntilZha
2025-02-05 17:27:35 -08:00
Pedro Rodriguez 45bfe94c1e Broken train reproducing bf16 error
Summary:

Test Plan:
2025-02-06 01:27:23 +00:00
Srinivasan Iyer 739dc71a0a
Add rope fp32 (#43)
* Log model

* Add flag for rope outer in fp32
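
As an illustrative sketch of what computing the RoPE outer product in fp32 might look like (assumed names and structure, not this PR's code):

```
# Assumed sketch: build the RoPE angle outer product in float32 when the flag
# is set, to avoid precision loss under low-precision training dtypes.
import torch


def rope_angles(seq_len: int, head_dim: int, theta: float = 10000.0,
                rope_outer_fp32: bool = True) -> torch.Tensor:
    dtype = torch.float32 if rope_outer_fp32 else torch.get_default_dtype()
    inv_freq = 1.0 / (theta ** (torch.arange(0, head_dim, 2, dtype=dtype) / head_dim))
    positions = torch.arange(seq_len, dtype=dtype)
    return torch.outer(positions, inv_freq)  # (seq_len, head_dim // 2) angles
```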

---------

Co-authored-by: Srini Iyer <sviyer@meta.com>
2025-02-05 17:19:37 -08:00
Srinivasan Iyer 6fbaf7266f
fix stool (#44)
Co-authored-by: Srini Iyer <sviyer@meta.com>
2025-02-05 17:18:40 -08:00
Srinivasan Iyer 7cf8fab49b
Fix wandb logging (#42)
Co-authored-by: Srini Iyer <sviyer@meta.com>
2025-02-05 16:24:39 -08:00
Pedro Rodriguez 8f1a9a858e Minimal working eval
Summary:

Test Plan:
2025-02-05 22:47:01 +00:00
Pedro Rodriguez 48cf4dfee1 Allow ArrowIterator to read from json
Summary:

Test Plan:
2025-02-05 22:47:01 +00:00
Pedro Rodriguez 1377fcb010
Merge 2f42633b07 into sapling-pr-archive-EntilZha 2025-02-05 14:32:03 -08:00
Pedro Rodriguez 2f42633b07 Add bpb and n_bytes to metric logging
Summary:

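For context, bits-per-byte is conventionally the summed cross-entropy loss (in nats) converted to bits and normalized by the raw byte count; a minimal sketch of that definition, not necessarily the repo's exact metric code:

```
# Illustrative definition of the bpb metric: nats -> bits, normalized by the
# number of raw bytes covered by the batch. The repo's code may differ.
import math


def bits_per_byte(total_loss_nats: float, n_bytes: int) -> float:
    return total_loss_nats / (n_bytes * math.log(2))
```
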
Test Plan:
2025-02-05 22:27:15 +00:00
Pedro Rodriguez b2f2a6a76e merge commit for archive created by Sapling 2025-02-05 22:10:37 +00:00
Pedro Rodriguez 1450464031 Update checkpointing to use fsspec
Summary:

- Make the data/checkpoint code fsspec compatible


Test Plan:

Run unit tests and the commands below

```
python -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100
```

```
torchrun --nproc-per-node 8 -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100
```

```
python -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100 dump_dir=s3://blt/scratch/checkpoint-test/
```

```
torchrun --nproc-per-node 8 -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100 dump_dir=s3://blt/scratch/checkpoint-test/
```
2025-02-05 19:09:14 +00:00
Pedro Rodriguez c3d7f720f0
Merge b6e53f1d4c into sapling-pr-archive-EntilZha
2025-02-04 16:55:37 -08:00
Pedro Rodriguez b6e53f1d4c Update checkpointing to use fsspec
Summary:

- Make the data/checkpoint code fsspec compatible


Test Plan:

Run unit tests and the commands below

```
python -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100
```

```
torchrun --nproc-per-node 8 -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100
```

```
python -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100 dump_dir=s3://blt/scratch/checkpoint-test/
```

```
torchrun --nproc-per-node 8 -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100 dump_dir=s3://blt/scratch/checkpoint-test/
```
2025-02-05 00:55:32 +00:00
Pedro Rodriguez c79b1fdbd0
Fix distributed all reduce grad norm (#40)
Summary:

With >1 GPU but only 1 node, all-reduces fail when inputs are not bf16. This change uses a modified copy of torch's grad norm computation to avoid those failures.

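A minimal sketch of the general shape of such a fix, assuming the usual pattern of accumulating in fp32 before reducing (illustrative only, not the modified copy used in this PR):

```
# Assumed sketch: accumulate squared grad norms in float32 so the all-reduce
# does not depend on bf16 support, then take the square root.
import torch
import torch.distributed as dist


def distributed_grad_norm(parameters) -> torch.Tensor:
    grads = [p.grad for p in parameters if p.grad is not None]
    if not grads:
        return torch.tensor(0.0)
    total_sq = torch.stack([g.detach().float().pow(2).sum() for g in grads]).sum()
    if dist.is_available() and dist.is_initialized():
        dist.all_reduce(total_sq, op=dist.ReduceOp.SUM)
    return total_sq.sqrt()
```
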
Test Plan:

- Run unit tests:
- Run single gpu training: `python -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100`
- Run 1 node, multi-gpu training `torchrun --nproc-per-node 8 -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100`
2025-02-04 16:53:50 -08:00
Pedro Rodriguez e1cd15ec30 merge commit for archive created by Sapling 2025-02-05 00:53:00 +00:00
Pedro Rodriguez ac257bac19 Fix distributed all reduce grad norm
Summary:

With >1 GPU but only 1 node, all-reduces fail when inputs are not bf16. This change uses a modified copy of torch's grad norm computation to avoid those failures.

Test Plan:

- Run unit tests:
- Run single gpu training: `python -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100`
- Run 1 node, multi-gpu training `torchrun --nproc-per-node 8 -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100`
2025-02-05 00:52:52 +00:00
Pedro Rodriguez 2d68e5126d merge commit for archive created by Sapling 2025-02-05 00:51:52 +00:00
Pedro Rodriguez 9cf7847e26 Fix distributed all reduce grad norm
Summary:

With >1 GPU but only 1 node, all-reduces fail when inputs are not bf16. This change uses a modified copy of torch's grad norm computation to avoid those failures.

Test Plan:

- Run unit tests:
- Run single gpu training: `python -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100`
- Run 1 node, multi-gpu training `torchrun --nproc-per-node 8 -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100`
2025-02-05 00:51:27 +00:00
Pedro Rodriguez 8db01ac392
Merge 4ad4889405 into sapling-pr-archive-EntilZha 2025-02-04 16:30:42 -08:00
Pedro Rodriguez 4ad4889405 Update checkpointing to use fsspec
Summary:

- Make the data/checkpoint code fsspec compatible


Test Plan:

Run unit tests and the commands below

```
python -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100
```

```
torchrun --nproc-per-node 8 -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100
```

```
python -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100 dump_dir=s3://blt/scratch/checkpoint-test/
```

```
torchrun --nproc-per-node 8 -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100 dump_dir=s3://blt/scratch/checkpoint-test/
```
2025-02-05 00:30:38 +00:00
Pedro Rodriguez 97e3bc0427
Merge b2058fb0f6 into sapling-pr-archive-EntilZha 2025-02-04 16:30:13 -08:00
Pedro Rodriguez b2058fb0f6 Update checkpointing to use fsspec
Summary:

- Make the arrow iterator able to read from jsonl files; entropies are omitted in this case (see the sketch after this list)
- Make the data/checkpoint code fsspec compatible
- Fix issues with all-reduce on non-bf16 inputs in dist_sum and norm computation
- Minimal fixes to get eval to run; it is currently slow
- Add bpb numbers during training
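
A rough sketch of what jsonl reading without precomputed entropies can look like (the `text` field and function name are assumptions for illustration, not the repo's ArrowIterator):

```
# Hypothetical sketch: yield documents from a .jsonl file; no entropy column is
# available in this mode, so entropies are returned as None.
import json


def iter_jsonl(path: str):
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue
            row = json.loads(line)
            yield {"text": row["text"], "entropies": None}
```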


Test Plan:

Run unit tests and the commands below

```
python -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100
```

```
torchrun --nproc-per-node 8 -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100
```

```
python -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100 dump_dir=s3://blt/scratch/checkpoint-test/
```

```
torchrun --nproc-per-node 8 -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100 dump_dir=s3://blt/scratch/checkpoint-test/
```
2025-02-05 00:29:37 +00:00
Pedro Rodriguez 740d76cd69
Merge e742218d65 into sapling-pr-archive-EntilZha 2025-02-04 16:21:14 -08:00
Pedro Rodriguez e742218d65 Update checkpointing to use fsspec
Summary:

- Make the arrow iterator able to read from jsonl files; entropies are omitted in this case
- Make the data/checkpoint code fsspec compatible
- Fix issues with all-reduce on non-bf16 inputs in dist_sum and norm computation
- Minimal fixes to get eval to run; it is currently slow
- Add bpb numbers during training


Test Plan:

Run

```
torchrun --nproc-per-node 8 -m bytelatent.train config=internal/configs/entropy_model.yaml eval=null max_steps=10100
```

```
python -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null
```

```
torchrun --nproc-per-node 8 -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null
```
2025-02-05 00:20:58 +00:00
Pedro Rodriguez 9c4cca558b
Merge bc39591032 into sapling-pr-archive-EntilZha
2025-02-04 10:19:56 -08:00
Pedro Rodriguez bc39591032 Several changes to enable entropy model training/eval
Summary:

- Make the arrow iterator able to read from jsonl files; entropies are omitted in this case
- Make the data/checkpoint code fsspec compatible
- Fix issues with all-reduce on non-bf16 inputs in dist_sum and norm computation
- Minimal fixes to get eval to run; it is currently slow
- Add bpb numbers during training


Test Plan:

Run

```
torchrun --nproc-per-node 8 -m bytelatent.train config=internal/configs/entropy_model.yaml eval=null max_steps=10100
```

```
python -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null
```

```
torchrun --nproc-per-node 8 -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null
```
2025-02-04 18:19:49 +00:00
Pedro Rodriguez f73b9e1a41 merge commit for archive created by Sapling 2025-02-04 18:05:21 +00:00
Pedro Rodriguez ab399e981d Several changes to enable entropy model training/eval
Summary:

- Make the arrow iterator able to read from jsonl files; entropies are omitted in this case
- Make the data/checkpoint code fsspec compatible
- Fix issues with all-reduce on non-bf16 inputs in dist_sum and norm computation
- Minimal fixes to get eval to run; it is currently slow
- Add bpb numbers during training


Test Plan:

Run

```
torchrun --nproc-per-node 8 -m bytelatent.train config=internal/configs/entropy_model.yaml eval=null max_steps=10100
```
2025-02-04 18:05:16 +00:00
Pedro Rodriguez 48df9ce785
Merge c6ef4285e2 into sapling-pr-archive-EntilZha 2025-02-04 10:03:26 -08:00
Pedro Rodriguez c6ef4285e2 Several changes to enable entropy model training/eval
Summary:

- Make the arrow iterator able to read from jsonl files; entropies are omitted in this case
- Make the data/checkpoint code fsspec compatible
- Fix issues with all-reduce on non-bf16 inputs in dist_sum and norm computation
- Minimal fixes to get eval to run; it is currently slow
- Add bpb numbers during training


Test Plan:
2025-02-04 18:03:19 +00:00
Pedro Rodriguez 4ff8341738
Merge 11cad6c84d into sapling-pr-archive-EntilZha
2025-02-03 18:29:37 -08:00
Pedro Rodriguez 11cad6c84d WIP parallel copy script
Summary:

Test Plan:
2025-01-28 00:57:06 +00:00
Pedro Rodriguez 7044771a12
This includes fixes that make checkpointing and reloading work correctly. (#35)
It also bundles in a first set of changes toward fixing the eval code

Summary:

Test Plan:
2025-01-27 16:56:42 -08:00
Pedro Rodriguez 4db801a532
Merge caf82b924e into sapling-pr-archive-EntilZha
2025-01-27 16:54:52 -08:00
Pedro Rodriguez caf82b924e This includes fixes that make checkpointing and reloading work correctly.
It also bundles in a first set of changes toward fixing the eval code

Summary:

Test Plan:
2025-01-28 00:54:47 +00:00
Pedro Rodriguez c2f1e4845e
Merge e02ba763b0 into sapling-pr-archive-EntilZha 2025-01-27 16:38:54 -08:00
Pedro Rodriguez e02ba763b0 This includes fixes that make checkpointing and reloading work correctly.
It also bundles in a first set of changes toward fixing the eval code

Summary:

Test Plan:
2025-01-28 00:38:46 +00:00
Pedro Rodriguez 7622d28b74
Initial codes and scripts for training entropy model (#34)
Summary:

Test Plan:
2025-01-27 09:46:44 -08:00
Pedro Rodriguez b1c12dd275
Merge 34ca1f7d4b into sapling-pr-archive-EntilZha
2025-01-24 13:59:47 -08:00
Pedro Rodriguez 34ca1f7d4b Initial codes and scripts for training entropy model
Summary:

Test Plan:
2025-01-24 21:59:42 +00:00
Pedro Rodriguez f1a2589266
Merge fb09022e5e into sapling-pr-archive-EntilZha 2025-01-24 13:55:48 -08:00
Pedro Rodriguez fb09022e5e Initial codes and scripts for training entropy model
Summary:

Test Plan:
2025-01-24 21:55:41 +00:00
Pedro Rodriguez a809259e71
Use load_async flag to not start MP iterator (#33)
Summary:

Test Plan:
2025-01-24 10:57:20 -08:00
Pedro Rodriguez 17b727465f
Merge bd461af91a into sapling-pr-archive-EntilZha
2025-01-24 10:56:34 -08:00
Pedro Rodriguez bd461af91a Use load_async flag to not start MP iterator
Summary:

Test Plan:
2025-01-24 18:56:28 +00:00
Pedro Rodriguez bc42cebd7d
Update file check script to check sizes (#32)
Summary:

Test Plan:
2025-01-22 13:06:46 -08:00