blt/bytelatent

Latest commit afedb16598 by Pedro Rodriguez, 2025-02-06 09:41:58 -08:00

Update checkpointing to use fsspec (#39)

Summary:

- Make the data/checkpoint code fsspec-compatible (see the sketch below)
- Saving to s3 still will not work, because `torch.distributed.checkpoint.save` does not work with `fsspec` out of the box; this will be implemented in a follow-up PR
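
A minimal sketch of what fsspec-compatible I/O can look like. The helper names below are hypothetical illustrations, not the actual functions in `bytelatent/checkpoint.py`; the point is that resolving the URI through fsspec lets the same code path handle local directories and object stores such as `s3://`:

```python
# Hypothetical helpers illustrating fsspec-based path handling: the same code
# serves local paths and object-store URIs such as s3://bucket/prefix.
import json
import posixpath

import fsspec


def write_json(data: dict, path: str) -> None:
    # url_to_fs resolves the URI to the right backend (local, s3, gcs, ...)
    fs, stripped = fsspec.core.url_to_fs(path)
    fs.makedirs(posixpath.dirname(stripped), exist_ok=True)
    with fs.open(stripped, "w") as f:
        json.dump(data, f)


def read_json(path: str) -> dict:
    # fsspec.open infers the filesystem from the URI scheme
    with fsspec.open(path, "r") as f:
        return json.load(f)
```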


Test Plan:

Run unit tests and the commands below

```
python -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100
```

```
torchrun --nproc-per-node 8 -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100
```

These currently won't work because of the torch distributed save, but they should be tested at a later date (see the sketch after these commands for one possible direction):

```
python -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100 dump_dir=s3://blt/scratch/checkpoint-test/
```

```
torchrun --nproc-per-node 8 -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100 dump_dir=s3://blt/scratch/checkpoint-test/
```
| Name | Last commit | Last updated |
| --- | --- | --- |
| `configs` | This includes fixes that make checkpointing and reloading work correctly. (#35) | 2025-01-27 16:56:42 -08:00 |
| `data` | This includes fixes that make checkpointing and reloading work correctly. (#35) | 2025-01-27 16:56:42 -08:00 |
| `model` | Add rope fp32 (#43) | 2025-02-05 17:19:37 -08:00 |
| `plotting` | Add plotting code from paper (#17) | 2025-01-09 12:11:50 -08:00 |
| `preprocess` | Fix realtime entropy patching (#26) | 2025-01-21 16:34:23 -08:00 |
| `tokenizers` | Initial commit | 2024-12-12 15:32:30 -08:00 |
| `.DS_Store` | Initial commit | 2024-12-12 15:32:30 -08:00 |
| `__init__.py` | Initial commit | 2024-12-12 15:32:30 -08:00 |
| `args.py` | Update checkpointing to use fsspec (#39) | 2025-02-06 09:41:58 -08:00 |
| `base_transformer.py` | Add rope fp32 (#43) | 2025-02-05 17:19:37 -08:00 |
| `checkpoint.py` | Update checkpointing to use fsspec (#39) | 2025-02-06 09:41:58 -08:00 |
| `constants.py` | Initial commit | 2024-12-12 15:32:30 -08:00 |
| `distributed.py` | Changes for training entropy model and correcting attention in local models (#25) | 2025-01-17 14:23:01 -08:00 |
| `entropy_model.py` | Changes for training entropy model and correcting attention in local models (#25) | 2025-01-17 14:23:01 -08:00 |
| `eval.py` | This includes fixes that make checkpointing and reloading work correctly. (#35) | 2025-01-27 16:56:42 -08:00 |
| `float8.py` | Initial commit | 2024-12-12 15:32:30 -08:00 |
| `generate.py` | This includes fixes that make checkpointing and reloading work correctly. (#35) | 2025-01-27 16:56:42 -08:00 |
| `logger.py` | Update checkpointing to use fsspec (#39) | 2025-02-06 09:41:58 -08:00 |
| `metrics.py` | Update checkpointing to use fsspec (#39) | 2025-02-06 09:41:58 -08:00 |
| `norms.py` | Fix distributed all reduce grad norm (#40) | 2025-02-04 16:53:50 -08:00 |
| `optim.py` | Initial commit | 2024-12-12 15:32:30 -08:00 |
| `probe.py` | Initial commit | 2024-12-12 15:32:30 -08:00 |
| `profiling.py` | Initial commit | 2024-12-12 15:32:30 -08:00 |
| `stool.py` | fix stool (#44) | 2025-02-05 17:18:40 -08:00 |
| `test_blt.py` | Initial codes and scripts for training entropy model (#34) | 2025-01-27 09:46:44 -08:00 |
| `test_entropy_model.py` | Changes for training entropy model and correcting attention in local models (#25) | 2025-01-17 14:23:01 -08:00 |
| `train.py` | Update checkpointing to use fsspec (#39) | 2025-02-06 09:41:58 -08:00 |
| `transformer.py` | Changes for training entropy model and correcting attention in local models (#25) | 2025-01-17 14:23:01 -08:00 |