Pedro Rodriguez
45bfe94c1e
Broken train reproducing bf16 error
Summary:
Test Plan:
2025-02-06 01:27:23 +00:00
Pedro Rodriguez
2f42633b07
Add bpb and n_bytes to metric logging
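For readers unfamiliar with the metric: bits-per-byte (bpb) is commonly defined as the total negative log-likelihood, converted from nats to bits, divided by the number of bytes scored. The sketch below shows that standard definition only; the function name and arguments are hypothetical, and the commit's actual logging code is not shown here.

```python
import math


def bits_per_byte(total_nll_nats: float, n_bytes: int) -> float:
    """Convert a summed negative log-likelihood (in nats) to bits per byte.

    Hypothetical helper for illustration; not taken from the repository.
    """
    if n_bytes <= 0:
        raise ValueError("n_bytes must be positive")
    # Dividing by ln(2) converts nats to bits; dividing by n_bytes normalizes per byte.
    return total_nll_nats / (math.log(2) * n_bytes)
```

Normalizing by bytes rather than tokens keeps the number comparable across models that segment the same text differently.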
Summary:
Test Plan:
2025-02-05 22:27:15 +00:00
Pedro Rodriguez
1450464031
Update checkpointing to use fsspec
Summary:
- Make the data/checkpoint code fsspec compatible
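As a rough illustration of what fsspec compatibility buys, the sketch below writes and reads a small state file through `fsspec` so that the same code path works for a local directory or a remote URL such as `s3://...` (the latter assumes the `s3fs` backend is installed). The helper names `save_state` and `load_state` and the JSON payload are assumptions made for the example, not the repository's checkpoint API.

```python
import json

from fsspec.core import url_to_fs


def save_state(dump_dir: str, name: str, state: dict) -> str:
    """Write `state` as JSON under `dump_dir`, which may be a local path or a URL."""
    # url_to_fs resolves the protocol (file, s3, memory, ...) to a filesystem object
    # and returns the path with the protocol prefix stripped.
    fs, root = url_to_fs(dump_dir)
    fs.makedirs(root, exist_ok=True)
    path = f"{root.rstrip('/')}/{name}"
    with fs.open(path, "wb") as f:
        f.write(json.dumps(state).encode("utf-8"))
    return path


def load_state(dump_dir: str, name: str) -> dict:
    """Read back a JSON state file written by save_state."""
    fs, root = url_to_fs(dump_dir)
    with fs.open(f"{root.rstrip('/')}/{name}", "rb") as f:
        return json.loads(f.read().decode("utf-8"))
```

Because the filesystem is chosen from the URL, switching `dump_dir` from a local path to an `s3://` location needs no change in the calling code.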
Test Plan:
Run unit tests and the commands below
```
python -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100
```
```
torchrun --nproc-per-node 8 -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100
```
```
python -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100 dump_dir=s3://blt/scratch/checkpoint-test/
```
```
torchrun --nproc-per-node 8 -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100 dump_dir=s3://blt/scratch/checkpoint-test/
```
2025-02-05 19:09:14 +00:00
Pedro Rodriguez
c79b1fdbd0
Fix distributed all reduce grad norm ( #40 )
Summary:
With more than one GPU on a single node, all-reduces fail when the inputs are not bf16. This change uses a modified copy of torch's grad norm function to avoid the failures.
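The modified function itself is not shown in this log. As a minimal sketch of the general idea, computing the norm by hand gives explicit control over the dtype of the tensor handed to the collective. The function name, the `reduce_dtype` parameter, and the choice to sum squared norms across ranks (which assumes gradients are sharded across ranks rather than replicated) are all assumptions for illustration.

```python
import torch
import torch.distributed as dist


def global_grad_norm(params, reduce_dtype=torch.bfloat16):
    """Hypothetical sketch: L2 grad norm with an explicit dtype for the all-reduce."""
    params = list(params)
    grads = [p.grad for p in params if p.grad is not None]
    device = grads[0].device if grads else torch.device("cpu")
    # Accumulate the local sum of squares in float32 for accuracy.
    total_sq = torch.zeros((), dtype=torch.float32, device=device)
    for g in grads:
        total_sq += g.detach().float().pow(2).sum()
    if dist.is_available() and dist.is_initialized():
        # Cast to the dtype the collective is expected to accept, then reduce.
        buf = total_sq.to(reduce_dtype)
        dist.all_reduce(buf, op=dist.ReduceOp.SUM)
        total_sq = buf.float()
    return total_sq.sqrt()
```

In a single process with no process group initialized, the collective is skipped and the function returns the ordinary local L2 norm.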
Test Plan:
- Run unit tests
- Run single-GPU training: `python -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100`
- Run 1-node, multi-GPU training: `torchrun --nproc-per-node 8 -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100`
2025-02-04 16:53:50 -08:00
Pedro Rodriguez
7044771a12
This includes fixes that make checkpointing and reloading work correctly. ( #35 )
It also batches in a first set of changes for fixing the eval code.
Summary:
Test Plan:
2025-01-27 16:56:42 -08:00
Pedro Rodriguez
7622d28b74
Initial codes and scripts for training entropy model ( #34 )
Summary:
Test Plan:
2025-01-27 09:46:44 -08:00
Pedro Rodriguez
6ffeb66b53
Changes for training entropy model and correcting attention in local models ( #25 )
Summary:
- Refactor local model configs to be separate and clearer
- Add attention arguments and correct which attention is used in local models
- Preparation for being able to have an entropy train script
- Fix failing unit tests
Test Plan:
2025-01-17 14:23:01 -08:00
Pedro Rodriguez
bcc039bb75
Initial commit
2024-12-12 15:32:30 -08:00