vikarti.anatra/blt

mirror of https://github.com/facebookresearch/blt.git synced 2025-02-23 13:32:14 +00:00

Author	SHA1	Message	Date
Pedro Rodriguez	341264685a	Update checkpointing to use fsspec Summary: - Make the data/checkpoint code fsspec compatible - Still will not work with s3 saves, because `torch.distributed.checkpoint.save` does not work with `fsspec` out of the box. Will implement in a followup PR Test Plan: Run unit tests and the commands below: `python -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100` `torchrun --nproc-per-node 8 -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100` These currently won't work due to the torch distributed save, but these should be tested at a later date: `python -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100 dump_dir=s3://blt/scratch/checkpoint-test/` `torchrun --nproc-per-node 8 -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100 dump_dir=s3://blt/scratch/checkpoint-test/`	2025-02-06 17:34:44 +00:00
Pedro Rodriguez	c79b1fdbd0	Fix distributed all reduce grad norm (#40) Summary: With >1 GPU, but only 1 node, all reduces fail when inputs are not bf16. This uses a modified copy of torch's grad norm to avoid failures Test Plan: - Run unit tests - Run single gpu training: `python -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100` - Run 1 node, multi-gpu training: `torchrun --nproc-per-node 8 -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100`	2025-02-04 16:53:50 -08:00
Pedro Rodriguez	7044771a12	This includes fixes that make checkpointing and reloading work correctly. (#35) It also batches in a first set of changes for fixing eval code Summary: Test Plan:	2025-01-27 16:56:42 -08:00
Pedro Rodriguez	7622d28b74	Initial codes and scripts for training entropy model (#34) Summary: Test Plan:	2025-01-27 09:46:44 -08:00
Pedro Rodriguez	a809259e71	Use load_async flag to not start MP iterator (#33) Summary: Test Plan:	2025-01-24 10:57:20 -08:00
Pedro Rodriguez	bc42cebd7d	Update file check script to check sizes (#32) Summary: Test Plan:	2025-01-22 13:06:46 -08:00
Ink	392117bff2	Fix realtime entropy patching (#26) * allow loading of the entropy model directly * remove unused argument * remove spammy warning * allow patch_batch_size to be adjusted in the forward() method * revert to original patcher style, fix warning * allow grads when calculating entropies * fix grad flow * return preds from calculate_entropies() * remove legacy arg * fix an error with monotonicity and small sequence lengths * ensure patcher is serializable * revert patcher to original * remove unused import	2025-01-21 16:34:23 -08:00
Pedro Rodriguez	6ffeb66b53	Changes for training entropy model and correcting attention in local models (#25) Summary: - Refactor local model configs to be separate and clearer - Add attention arguments and correct which attention is used in local models - Preparation for being able to have an entropy train script - Fix failing unit tests Test Plan:	2025-01-17 14:23:01 -08:00
Ink	caec8d2621	allow flex-attention to be disabled (#19) * allow flex-attention to silently fail * allow flex-attn to be disabled via an env var	2025-01-14 09:32:07 -08:00
Pedro Rodriguez	1da3dd9315	Update preprocess_entropies script to blt inference + add fsspec support (#23) Summary: Test Plan:	2025-01-13 15:28:14 -08:00
Pedro Rodriguez	b0120da72f	Replace regular filesystem calls with fsspec + add s3 support (#18) Summary: For compatibility with either local/nfs or S3 datasets, swap to fsspec. Add a tool to compare local and remote filesystems Test Plan: - Ran regular train script - Ran with config with data in S3	2025-01-10 11:04:41 -08:00
Pedro Rodriguez	d4ddb95322	Add plotting code from paper (#17) Summary: Test Plan:	2025-01-09 12:11:50 -08:00
Ink	2fdc6f3cc9	Package `bytelatent` as a module (#7) * make installable via pip * fix missing xformers deps * remove non-core dependencies * fix linting * fix isort	2025-01-06 16:44:50 -08:00
Ikko Eltociear Ashimine	9065bb1cce	docs: update README.md (#1) folowing -> following	2025-01-03 12:08:00 -08:00
Daniele Sartiano	898671b66b	Update README.md (#13) Fixed typo on Meta Lingua	2025-01-03 12:06:47 -08:00
Pedro Rodriguez	bcc039bb75	Initial commit	2024-12-12 15:32:30 -08:00

16 commits