- PPL/validation: works now and uses multiple GPUs. For an unknown reason, single-GPU results differ from multi-GPU results; this can be debugged in a follow-up PR (one plausible cause is sketched after this list)
- Generation evals likely work, but are very slow, so they are disabled for now
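One common cause of this kind of single- vs multi-GPU mismatch (a hypothetical sketch, not the actual eval code): if each rank averages its own loss and ranks see different token counts, the mean of per-rank means differs from the global token-weighted mean. A minimal sketch of aggregation that stays consistent with a single-GPU run; `distributed_ppl` and its arguments are illustrative names, not part of this codebase:
```
import torch
import torch.distributed as dist

def distributed_ppl(total_nll: torch.Tensor, total_tokens: torch.Tensor) -> torch.Tensor:
    # total_nll is this rank's *summed* (not averaged) negative log-likelihood,
    # total_tokens its token count. Summing both across ranks before dividing
    # matches the single-GPU result even when ranks see different numbers of
    # tokens; averaging per-rank mean losses generally does not.
    if dist.is_initialized():
        dist.all_reduce(total_nll, op=dist.ReduceOp.SUM)
        dist.all_reduce(total_tokens, op=dist.ReduceOp.SUM)
    return torch.exp(total_nll / total_tokens)
```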
Test Plan:
```
torchrun --nproc-per-node 8 -m bytelatent.eval config=../internal-blt/configs/eval.yaml
```
Summary:
- Make the data/checkpoint code fsspec-compatible (see the sketch after this list)
- s3 saves still will not work, because `torch.distributed.checkpoint.save` does not work with `fsspec` out of the box. This will be implemented in a follow-up PR
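For context on the fsspec change: fsspec resolves the filesystem backend from the URI scheme, so the same code path covers local paths and `s3://` URIs. A minimal sketch under that assumption (the data shard path is a placeholder, not a real config value):
```
import fsspec

# fsspec picks the backend from the URI scheme (s3:// routes to s3fs),
# so local paths and object stores go through one interface.
fs, _, (root,) = fsspec.get_fs_token_paths("s3://blt/scratch/checkpoint-test/")
fs.makedirs(root, exist_ok=True)

# Placeholder path: reading data shards works the same way.
with fsspec.open("s3://blt/scratch/data/shard_000.jsonl", "rt") as f:
    first_line = f.readline()

# The remaining blocker: torch.distributed.checkpoint.save writes through
# its default FileSystemWriter, which assumes a local filesystem, so an
# s3:// dump_dir fails until an fsspec-backed storage writer is wired in.
```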
Test Plan:
Run unit tests and the commands below
```
python -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100
```
```
torchrun --nproc-per-node 8 -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100
```
These currently won't work due to the torch distributed save issue noted above, but they should be tested at a later date (a possible follow-up approach is sketched after the commands)
```
python -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100 dump_dir=s3://blt/scratch/checkpoint-test/
```
```
torchrun --nproc-per-node 8 -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100 dump_dir=s3://blt/scratch/checkpoint-test/
```
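One possible shape for the follow-up (an assumption about the pinned torch version, not a committed design): recent PyTorch releases carry an fsspec-backed writer/reader pair in a private module, which can be passed to `torch.distributed.checkpoint.save`/`load` roughly like this:
```
import torch.distributed.checkpoint as dcp

# Hypothetical sketch: FsspecWriter/FsspecReader live in a private module in
# recent torch releases, so this depends on the pinned version and may need
# a custom StorageWriter implementation instead.
from torch.distributed.checkpoint._fsspec_filesystem import (
    FsspecReader,
    FsspecWriter,
)

def save_checkpoint(state_dict, path):
    # path may be an s3:// URI; fsspec resolves the backend.
    dcp.save(state_dict, storage_writer=FsspecWriter(path))

def load_checkpoint(state_dict, path):
    dcp.load(state_dict, storage_reader=FsspecReader(path))
```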