blt/bytelatent

Latest commit afedb16598 by Pedro Rodriguez, 2025-02-06 09:41:58 -08:00

Update checkpointing to use fsspec (#39)

Summary:

- Make the data/checkpoint code fsspec-compatible (see the sketch below)
- Saving to s3 still will not work, because `torch.distributed.checkpoint.save` does not work with `fsspec` out of the box; this will be implemented in a follow-up PR
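
A minimal sketch of what fsspec-compatible I/O can look like. The helper names below are hypothetical illustrations, not the actual functions in `bytelatent/checkpoint.py`; the point is that resolving the URI through fsspec lets the same code path handle local directories and object stores such as `s3://`:

```python
# Hypothetical helpers illustrating fsspec-based path handling: the same code
# serves local paths and object-store URIs such as s3://bucket/prefix.
import json
import posixpath

import fsspec


def write_json(data: dict, path: str) -> None:
    # url_to_fs resolves the URI to the right backend (local, s3, gcs, ...)
    fs, stripped = fsspec.core.url_to_fs(path)
    fs.makedirs(posixpath.dirname(stripped), exist_ok=True)
    with fs.open(stripped, "w") as f:
        json.dump(data, f)


def read_json(path: str) -> dict:
    # fsspec.open infers the filesystem from the URI scheme
    with fsspec.open(path, "r") as f:
        return json.load(f)
```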


Test Plan:

Run unit tests and the commands below

```
python -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100
```

```
torchrun --nproc-per-node 8 -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100
```

These currently won't work because of the torch distributed save, but they should be tested at a later date (see the sketch after these commands for one possible direction):

```
python -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100 dump_dir=s3://blt/scratch/checkpoint-test/
```

```
torchrun --nproc-per-node 8 -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100 dump_dir=s3://blt/scratch/checkpoint-test/
```
| Name | Last commit | Last updated |
| --- | --- | --- |
| `configs` | This includes fixes that make checkpointing and reloading work correctly. (#35) | 2025-01-27 16:56:42 -08:00 |
| `data` | This includes fixes that make checkpointing and reloading work correctly. (#35) | 2025-01-27 16:56:42 -08:00 |
| `model` | Add rope fp32 (#43) | 2025-02-05 17:19:37 -08:00 |
| `plotting` | Add plotting code from paper (#17) | 2025-01-09 12:11:50 -08:00 |
| `preprocess` | Fix realtime entropy patching (#26) | 2025-01-21 16:34:23 -08:00 |
| `tokenizers` | Initial commit | 2024-12-12 15:32:30 -08:00 |
| `.DS_Store` | Initial commit | 2024-12-12 15:32:30 -08:00 |
| `__init__.py` | Initial commit | 2024-12-12 15:32:30 -08:00 |
| `args.py` | Update checkpointing to use fsspec (#39) | 2025-02-06 09:41:58 -08:00 |
| `base_transformer.py` | Add rope fp32 (#43) | 2025-02-05 17:19:37 -08:00 |
| `checkpoint.py` | Update checkpointing to use fsspec (#39) | 2025-02-06 09:41:58 -08:00 |
| `constants.py` | Initial commit | 2024-12-12 15:32:30 -08:00 |
| `distributed.py` | Changes for training entropy model and correcting attention in local models (#25) | 2025-01-17 14:23:01 -08:00 |
| `entropy_model.py` | Changes for training entropy model and correcting attention in local models (#25) | 2025-01-17 14:23:01 -08:00 |
| `eval.py` | This includes fixes that make checkpointing and reloading work correctly. (#35) | 2025-01-27 16:56:42 -08:00 |
| `float8.py` | Initial commit | 2024-12-12 15:32:30 -08:00 |
| `generate.py` | This includes fixes that make checkpointing and reloading work correctly. (#35) | 2025-01-27 16:56:42 -08:00 |
| `logger.py` | Update checkpointing to use fsspec (#39) | 2025-02-06 09:41:58 -08:00 |
| `metrics.py` | Update checkpointing to use fsspec (#39) | 2025-02-06 09:41:58 -08:00 |
| `norms.py` | Fix distributed all reduce grad norm (#40) | 2025-02-04 16:53:50 -08:00 |
| `optim.py` | Initial commit | 2024-12-12 15:32:30 -08:00 |
| `probe.py` | Initial commit | 2024-12-12 15:32:30 -08:00 |
| `profiling.py` | Initial commit | 2024-12-12 15:32:30 -08:00 |
| `stool.py` | fix stool (#44) | 2025-02-05 17:18:40 -08:00 |
| `test_blt.py` | Initial codes and scripts for training entropy model (#34) | 2025-01-27 09:46:44 -08:00 |
| `test_entropy_model.py` | Changes for training entropy model and correcting attention in local models (#25) | 2025-01-17 14:23:01 -08:00 |
| `train.py` | Update checkpointing to use fsspec (#39) | 2025-02-06 09:41:58 -08:00 |
| `transformer.py` | Changes for training entropy model and correcting attention in local models (#25) | 2025-01-17 14:23:01 -08:00 |