Summary:
Currently, the ArrowIterator can only read Arrow files. However, the pyarrow library can read
other formats as well, including jsonlines. This change lets the same ArrowIterator read jsonlines
files, so we can read directly from the original source data and simply omit the entropy column when doing so.
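For reference, a minimal sketch of how pyarrow handles both formats; the path handling, the `"arrow"`/`"json"` labels, and the generator shape are illustrative, not the actual ArrowIterator code:
```
import pyarrow as pa
import pyarrow.ipc   # Arrow IPC file reader
import pyarrow.json  # newline-delimited JSON reader

def iter_batches(path: str, file_format: str):
    if file_format == "arrow":
        # Arrow IPC files already carry the precomputed entropy column.
        reader = pa.ipc.open_file(path)
        for i in range(reader.num_record_batches):
            yield reader.get_batch(i)
    elif file_format == "json":
        # read_json parses jsonlines into a Table; the entropy column
        # simply is not present in the original source data.
        yield from pa.json.read_json(path).to_batches()
```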
Test Plan:
Run train script until dataloader starts
Summary:
- Make the data/checkpoint code fsspec compatible (a minimal sketch of the pattern follows this list)
- This still will not work with S3 saves, because `torch.distributed.checkpoint.save` does not work with `fsspec` out of the box; that will be implemented in a followup PR
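A minimal sketch of the fsspec pattern, with illustrative names; the point is that the filesystem is resolved from the URL scheme, so the same call sites handle local, NFS, and `s3://` paths:
```
import fsspec

def write_bytes(path: str, payload: bytes) -> None:
    # get_fs_token_paths resolves "s3://..." to an S3 filesystem and a bare
    # local path to the local filesystem; no branching at the call site.
    fs, _, (resolved,) = fsspec.get_fs_token_paths(path)
    fs.makedirs(resolved.rsplit("/", 1)[0], exist_ok=True)
    with fs.open(resolved, "wb") as f:
        f.write(payload)
```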
Test Plan:
Run unit tests and the commands below
```
python -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100
```
```
torchrun --nproc-per-node 8 -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100
```
These commands currently fail because of the `torch.distributed.checkpoint.save` issue above, but they should be tested at a later date:
```
python -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100 dump_dir=s3://blt/scratch/checkpoint-test/
```
```
torchrun --nproc-per-node 8 -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100 dump_dir=s3://blt/scratch/checkpoint-test/
```
Summary:
- Refactor the local model configs to be separate and clearer (see the sketch after this list)
- Add attention arguments and correct which attention implementation is used in the local models
- Lay the groundwork for an entropy training script
- Fix failing unit tests
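Purely illustrative sketch of the shape of the refactor; the field names and defaults below are assumptions, not the repo's actual config:
```
from dataclasses import dataclass

@dataclass
class LocalModelArgs:
    # Separate, explicit config for the local models.
    dim: int = 512
    n_layers: int = 8
    n_heads: int = 8
    # Attention is an explicit argument rather than an implicit default,
    # so the local models can select the correct implementation.
    attn_impl: str = "sdpa"
    attn_bias_type: str = "local_block_causal"
```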
Test Plan:
Summary:
For compatibility with both local/NFS and S3 datasets, swap to fsspec.
Add a tool to compare local and remote filesystems (sketched below).
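A rough sketch of what such a comparison can look like with fsspec, assuming files correspond 1:1 under a common prefix; the real tool's interface may differ:
```
import fsspec

def compare_trees(local_root: str, remote_root: str):
    local_fs = fsspec.filesystem("file")
    remote_fs, _, (remote_root,) = fsspec.get_fs_token_paths(remote_root)
    # Map relative path -> size on each side.
    local = {p[len(local_root):]: local_fs.size(p) for p in local_fs.find(local_root)}
    remote = {p[len(remote_root):]: remote_fs.size(p) for p in remote_fs.find(remote_root)}
    missing = sorted(set(local) - set(remote))
    mismatched = sorted(k for k in local.keys() & remote.keys() if local[k] != remote[k])
    return missing, mismatched
```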
Test Plan:
- Ran the regular train script
- Ran with a config pointing at data in S3