Commit graph

15 commits

Author SHA1 Message Date
Pedro Rodriguez
7517ac2a9f
Get evals working again. (#46)
- PPL/validation: works now and uses multi-GPU. For some reason, 1-GPU results differ from multi-GPU; this can be debugged in a follow-up PR
- Generation evals likely work, but are very slow, so they are disabled for now


Test Plan:
```
torchrun --nproc-per-node 8 -m bytelatent.eval config=../internal-blt/configs/eval.yaml
```
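For context, a minimal sketch of how multi-GPU perplexity is typically aggregated (not necessarily this repo's implementation): reduce the raw loss and token sums across ranks before exponentiating, so the result does not depend on how data is sharded.

```python
import math

import torch
import torch.distributed as dist

def aggregate_ppl(local_loss_sum: torch.Tensor, local_n_tokens: torch.Tensor) -> float:
    # Reduce raw sums, not per-rank averages, so sharding does not skew the result.
    dist.all_reduce(local_loss_sum, op=dist.ReduceOp.SUM)
    dist.all_reduce(local_n_tokens, op=dist.ReduceOp.SUM)
    return math.exp((local_loss_sum / local_n_tokens).item())
```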
2025-03-11 09:57:19 -07:00
Pedro Rodriguez
63913e4dba
Reduce per-file resources Arrow uses (#77)
Summary:

Test Plan:
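The summary is empty; purely as an illustration of one common way to cut per-file Arrow overhead (whether this matches the actual change is an assumption), IPC files can be memory-mapped instead of read through private buffers:

```python
import pyarrow as pa
import pyarrow.ipc as ipc

def open_arrow_file(path: str) -> ipc.RecordBatchFileReader:
    # Memory-mapping avoids a per-file read buffer, so many files can
    # stay open cheaply (illustrative only; the change in #77 may differ).
    return ipc.open_file(pa.memory_map(path, "r"))
```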
2025-03-05 15:03:42 -08:00
Pedro Rodriguez
ea1fc75862
Add approximate state persistence (#73)
Summary:

Test Plan:

***
More verbose multiprocess logging; fix `get_state_and_recycle`

Summary:

Test Plan:
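A minimal sketch of what "approximate" state persistence can mean for a streaming reader (names are hypothetical, not this repo's API): snapshot a coarse position such as shard index and row offset, accepting that a few rows may be replayed or skipped on resume.

```python
from dataclasses import dataclass

@dataclass
class ApproxState:
    file_idx: int    # which shard was being read
    row_offset: int  # rows already consumed in that shard

class ShardedReader:
    def __init__(self, files: list[str]):
        self.files = files
        self.file_idx = 0
        self.row_offset = 0

    def get_state(self) -> ApproxState:
        # Approximate: rows buffered downstream are not accounted for.
        return ApproxState(self.file_idx, self.row_offset)

    def load_state(self, state: ApproxState) -> None:
        self.file_idx, self.row_offset = state.file_idx, state.row_offset
```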
2025-03-05 15:01:45 -08:00
Bocheng Li
a6ed14f689
Fix: Correct model_args usage in parallelize_model call (#69)
2025-02-24 14:40:38 -08:00
Pedro Rodriguez
82ab5930ec
Make it possible to specify multiple config files (#54)
Summary:

Make it possible to specify multiple config files.
Parsing the CLI is no longer a special case; it uses the same config inheritance method.

Test Plan:

Test that this interpolates in the right order via unit tests.

Sample usage: loads the internal config, which references `bytelatent/configs/entropy_model.yaml`. The precedence order is:

- Default pydantic args
- Included configs, e.g. `config`
- CLI args

```
python -m bytelatent.print_config config=internal/configs/entropy_model.yaml eval=null
```
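A sketch of the described precedence, assuming OmegaConf-style merging (the exact merge code in the repo may differ): later arguments to `merge()` win, so CLI args override included configs, and pydantic defaults fill in whatever remains when the merged dict is validated.

```python
from omegaconf import OmegaConf

def load_config(cli_args: list[str]):
    cli = OmegaConf.from_cli(cli_args)
    config_paths = str(cli.pop("config", "") or "").split(",")
    included = [OmegaConf.load(p) for p in config_paths if p]
    # Later arguments win: included configs < CLI args.
    return OmegaConf.merge(*included, cli)
```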


2025-02-18 10:42:44 -08:00
Pedro Rodriguez
8c61ab5e67
Fix multiprocessing dataloader checkpointing and use it in the train script (#50)
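The commit body is empty; purely as an illustration of the technique named in the title (names hypothetical), a dataloader worker can report its iterator state over a queue so the trainer can include it in checkpoints:

```python
import multiprocessing as mp

def worker(data_q: mp.Queue, ctrl_q: mp.Queue, state_q: mp.Queue) -> None:
    step = 0  # stand-in for real iterator state
    while True:
        if not ctrl_q.empty():
            msg = ctrl_q.get()
            if msg == "get_state":
                state_q.put({"step": step})  # trainer checkpoints this snapshot
            elif msg == "stop":
                return
        data_q.put(step)  # produce the next item (here: just a counter)
        step += 1
```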
2025-02-13 11:58:23 -08:00
Srinivasan Iyer
48e4ad0bd2
make sure max_encoder_seq_length matches (#55)
* make sure max_encoder_seq_length matches

* black and assert comment

---------

Co-authored-by: Srini Iyer <sviyer@meta.com>
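The added check presumably looks something like the following sketch (the attribute paths are hypothetical):

```python
# Hypothetical shape of the check; the real config paths may differ.
assert model_args.max_encoder_seq_length == data_args.max_encoder_seq_length, (
    "max_encoder_seq_length must match between the encoder and the data pipeline"
)
```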
2025-02-12 18:27:22 -08:00
Pedro Rodriguez
fe45f69fbf
Add bpb and n_bytes to metric logging (#41)
Summary:

Test Plan:
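For reference, bits-per-byte normalizes cross-entropy by byte count rather than token count, making models with different tokenizations comparable; a minimal sketch:

```python
import math

def bits_per_byte(total_loss_nats: float, n_bytes: int) -> float:
    # Loss is accumulated in nats; dividing by ln(2) converts to bits,
    # and dividing by n_bytes normalizes across tokenizers.
    return total_loss_nats / (n_bytes * math.log(2))
```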
2025-02-07 13:14:30 -08:00
Pedro Rodriguez
afedb16598
Update checkpointing to use fsspec (#39)
Summary:

- Make the data/checkpoint code fsspec compatible
- This still will not work with S3 saves, because `torch.distributed.checkpoint.save` does not work out of the box with `fsspec`. That will be implemented in a follow-up PR


Test Plan:

Run unit tests and the commands below

```
python -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100
```

```
torchrun --nproc-per-node 8 -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100
```

These currently won't work due to the torch distributed save, but these should be tested at a later date:

```
python -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100 dump_dir=s3://blt/scratch/checkpoint-test/
```

```
torchrun --nproc-per-node 8 -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100 dump_dir=s3://blt/scratch/checkpoint-test/
```
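A sketch of the fsspec pattern (not the repo's exact checkpoint code): the filesystem is resolved from the URL scheme, so the same call handles local paths and `s3://` URLs, subject to the `torch.distributed.checkpoint.save` caveat above.

```python
import fsspec
import torch

def save_checkpoint(state: dict, url: str) -> None:
    # fsspec picks the filesystem from the scheme (file://, s3://, ...).
    fs, path = fsspec.core.url_to_fs(url)
    fs.makedirs(path.rsplit("/", 1)[0], exist_ok=True)
    with fs.open(path, "wb") as f:
        torch.save(state, f)
```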
2025-02-06 09:41:58 -08:00
Srinivasan Iyer
739dc71a0a
Add rope fp32 (#43)
* Log model

* Add flag for rope outer in fp32

---------

Co-authored-by: Srini Iyer <sviyer@meta.com>
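A sketch of what "rope outer in fp32" plausibly means (the flag name below is hypothetical): compute the position-by-frequency outer product in float32 to avoid precision loss at long contexts.

```python
import torch

def rope_freqs(seq_len: int, head_dim: int, theta: float = 10000.0,
               outer_fp32: bool = True) -> torch.Tensor:
    inv_freq = 1.0 / theta ** (torch.arange(0, head_dim, 2).float() / head_dim)
    t = torch.arange(seq_len)
    if outer_fp32:
        # fp32 outer product: small angle errors compound at long contexts.
        freqs = torch.outer(t.float(), inv_freq.float())
    else:
        freqs = torch.outer(t.to(inv_freq.dtype), inv_freq)
    return torch.polar(torch.ones_like(freqs), freqs)  # complex rotations
```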
2025-02-05 17:19:37 -08:00
Pedro Rodriguez
c79b1fdbd0
Fix distributed all reduce grad norm (#40)
Summary:

With >1 GPU but only 1 node, all-reduces fail when inputs are not bf16. This uses a modified copy of torch's grad-norm computation to avoid the failures.

Test Plan:

- Run unit tests
- Run single-GPU training: `python -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100`
- Run 1-node, multi-GPU training: `torchrun --nproc-per-node 8 -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100`
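A minimal sketch of the workaround as described, assuming the fix reduces the squared norm in bf16 (the repo's modified copy of torch's grad-norm code is more involved):

```python
import torch
import torch.distributed as dist

def dist_grad_norm(params, reduce_dtype=torch.bfloat16) -> torch.Tensor:
    grads = [p.grad for p in params if p.grad is not None]
    local_sq = torch.stack([g.detach().norm(2) for g in grads]).pow(2).sum()
    # Cast to a dtype the backend accepts before the collective (bf16 here),
    # mirroring the failure mode described above.
    local_sq = local_sq.to(reduce_dtype)
    dist.all_reduce(local_sq, op=dist.ReduceOp.SUM)
    return local_sq.float().sqrt()
```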
2025-02-04 16:53:50 -08:00
Pedro Rodriguez
7044771a12
This includes fixes that make checkpointing and reloading work correctly. (#35)
It also batches in a first set of changes for fixing the eval code.

Summary:

Test Plan:
2025-01-27 16:56:42 -08:00
Pedro Rodriguez
7622d28b74
Initial code and scripts for training the entropy model (#34)
Summary:

Test Plan:
2025-01-27 09:46:44 -08:00
Pedro Rodriguez
6ffeb66b53
Changes for training the entropy model and correcting attention in local models (#25)
Summary:

- Refactor local model configs to be separate and clearer
- Add attention arguments and correct which attention is used in local models
- Preparation for being able to add an entropy training script
- Fix failing unit tests

Test Plan:
2025-01-17 14:23:01 -08:00
Pedro Rodriguez
bcc039bb75
Initial commit
2024-12-12 15:32:30 -08:00