Commit graph

144 commits

Author SHA1 Message Date
Pedro Rodriguez f783846574 merge commit for archive created by Sapling
2025-02-07 00:26:06 +00:00
Pedro Rodriguez b6396eb0f4 Add bpb and n_bytes to metric logging
Summary:

Test Plan:
2025-02-07 00:26:01 +00:00
Srinivasan Iyer aebdc481a8
Fix init and repro (#48)
* Fix init and repro

* comment + black

---------

Co-authored-by: Srini Iyer <sviyer@meta.com>
2025-02-06 14:18:02 -08:00
Pedro Rodriguez ab594996a9
Merge 4e2ed0aa05 into sapling-pr-archive-EntilZha
2025-02-06 10:24:36 -08:00
Pedro Rodriguez 4e2ed0aa05 Add bpb and n_bytes to metric logging
Summary:

Test Plan:
2025-02-06 18:08:01 +00:00
Pedro Rodriguez 936d9437be
Allow ArrowIterator to read from json (#45)
Summary:

Currently, the arrow iterator can only read Arrow files. However, the pyarrow library can read
other formats, including jsonlines. This change allows the same ArrowIterator to read from jsonlines,
so we can read from the original source data and simply omit the entropy column when doing so.
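
A minimal sketch of the pyarrow capability this relies on; the dispatch helper and path handling below are hypothetical, not the repo's actual ArrowIterator code:

```
# Sketch: pyarrow can load newline-delimited JSON into a Table, so the
# same downstream code can consume either format. Hypothetical helper.
import pyarrow as pa
import pyarrow.ipc as pa_ipc
import pyarrow.json as pa_json


def read_table(path: str) -> pa.Table:
    if path.endswith((".jsonl", ".json")):
        # Line-delimited JSON from the original source data; there is no
        # precomputed entropy column here, so downstream code skips it.
        return pa_json.read_json(path)
    # Otherwise assume an Arrow IPC file that includes the entropy column.
    return pa_ipc.open_file(path).read_all()
```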

Test Plan:

Run train script until dataloader starts
2025-02-06 09:57:22 -08:00
Pedro Rodriguez 2950b63cf2
Merge 9c3c997cae into sapling-pr-archive-EntilZha 2025-02-06 09:44:41 -08:00
Pedro Rodriguez 9c3c997cae Allow ArrowIterator to read from json
Summary:

Currently, the arrow iterator can only read Arrow files. However, the pyarrow library can read
other formats, including jsonlines. This change allows the same ArrowIterator to read from jsonlines,
so we can read from the original source data and simply omit the entropy column when doing so.

Test Plan:

Run train script until dataloader starts
2025-02-06 17:44:36 +00:00
Pedro Rodriguez fff80b86b5
Merge 0e9421af07 into sapling-pr-archive-EntilZha 2025-02-06 09:43:15 -08:00
Pedro Rodriguez 0e9421af07 Allow ArrowIterator to read from json
Summary:

Test Plan:
2025-02-06 17:43:11 +00:00
Pedro Rodriguez 18ae0ba444
Merge 8d26140970 into sapling-pr-archive-EntilZha 2025-02-06 09:42:51 -08:00
Pedro Rodriguez 8d26140970 Allow ArrowIterator to read from json
Summary:

Test Plan:
2025-02-06 17:42:42 +00:00
Pedro Rodriguez afedb16598
Update checkpointing to use fsspec (#39)
Summary:

- Make the data/checkpoint code fsspec compatible
- Still will not work with S3 saves, because `torch.distributed.checkpoint.save` does not work out of the box with `fsspec`. Will implement in a follow-up PR
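
A minimal sketch of what fsspec-compatible I/O looks like for the non-tensor part of a checkpoint; the file name and helpers are hypothetical, not the repo's checkpoint API:

```
# Sketch: fsspec resolves a filesystem from the URL, so the same code can
# write to a local dump_dir or to s3://... once the distributed save is
# also fsspec-aware. Names below are hypothetical.
import json

import fsspec


def save_train_state(dump_dir: str, step: int, state: dict) -> None:
    fs, root = fsspec.core.url_to_fs(dump_dir)
    fs.makedirs(root, exist_ok=True)
    with fsspec.open(f"{dump_dir}/train_state_{step:08d}.json", "w") as f:
        json.dump(state, f)


def load_train_state(dump_dir: str, step: int) -> dict:
    with fsspec.open(f"{dump_dir}/train_state_{step:08d}.json", "r") as f:
        return json.load(f)
```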


Test Plan:

Run unit tests and the commands below

```
python -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100
```

```
torchrun --nproc-per-node 8 -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100
```

These currently won't work due to the torch distributed save, but these should be tested at a later date

```
python -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100 dump_dir=s3://blt/scratch/checkpoint-test/
```

```
torchrun --nproc-per-node 8 -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100 dump_dir=s3://blt/scratch/checkpoint-test/
```
2025-02-06 09:41:58 -08:00
Pedro Rodriguez d44902da97
Merge f058373889 into sapling-pr-archive-EntilZha 2025-02-06 09:37:27 -08:00
Pedro Rodriguez f058373889 Update checkpointing to use fsspec
Summary:

- Make the data/checkpoint code fsspec compatible
- Still will not work with S3 saves, because `torch.distributed.checkpoint.save` does not work out of the box with `fsspec`. Will implement in a follow-up PR


Test Plan:

Run unit tests and the commands below

```
python -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100
```

```
torchrun --nproc-per-node 8 -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100
```

These currently won't work due to the torch distributed save, but these should be tested at a later date

```
python -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100 dump_dir=s3://blt/scratch/checkpoint-test/
```

```
torchrun --nproc-per-node 8 -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100 dump_dir=s3://blt/scratch/checkpoint-test/
```
2025-02-06 17:37:21 +00:00
Pedro Rodriguez e13495c351
Merge 341264685a into sapling-pr-archive-EntilZha 2025-02-06 09:34:50 -08:00
Pedro Rodriguez 341264685a Update checkpointing to use fsspec
Summary:

- Make the data/checkpoint code fsspec compatible
- Still will not work with S3 saves, because `torch.distributed.checkpoint.save` does not work out of the box with `fsspec`. Will implement in a follow-up PR


Test Plan:

Run unit tests and the commands below

```
python -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100
```

```
torchrun --nproc-per-node 8 -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100
```

These currently won't work due to the torch distributed save, but these should be tested at a later date

```
python -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100 dump_dir=s3://blt/scratch/checkpoint-test/
```

```
torchrun --nproc-per-node 8 -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100 dump_dir=s3://blt/scratch/checkpoint-test/
```
2025-02-06 17:34:44 +00:00
Pedro Rodriguez 2d1c766050
Merge 45bfe94c1e into sapling-pr-archive-EntilZha
2025-02-05 17:27:35 -08:00
Pedro Rodriguez 45bfe94c1e Broken train reproducing bf16 error
Summary:

Test Plan:
2025-02-06 01:27:23 +00:00
Srinivasan Iyer 739dc71a0a
Add rope fp32 (#43)
* Log model

* Add flag for rope outer in fp32

---------

Co-authored-by: Srini Iyer <sviyer@meta.com>
2025-02-05 17:19:37 -08:00
Srinivasan Iyer 6fbaf7266f
fix stool (#44)
Co-authored-by: Srini Iyer <sviyer@meta.com>
2025-02-05 17:18:40 -08:00
Srinivasan Iyer 7cf8fab49b
Fix wandb logging (#42)
Co-authored-by: Srini Iyer <sviyer@meta.com>
2025-02-05 16:24:39 -08:00
Pedro Rodriguez 8f1a9a858e Minimal working eval
Summary:

Test Plan:
2025-02-05 22:47:01 +00:00
Pedro Rodriguez 48cf4dfee1 Allow ArrowIterator to read from json
Summary:

Test Plan:
2025-02-05 22:47:01 +00:00
Pedro Rodriguez 1377fcb010
Merge 2f42633b07 into sapling-pr-archive-EntilZha 2025-02-05 14:32:03 -08:00
Pedro Rodriguez 2f42633b07 Add bpb and n_bytes to metric logging
Summary:

Test Plan:
2025-02-05 22:27:15 +00:00
Pedro Rodriguez b2f2a6a76e merge commit for archive created by Sapling 2025-02-05 22:10:37 +00:00
Pedro Rodriguez 1450464031 Update checkpointing to use fsspec
Summary:

- Make the data/checkpoint code fsspec compatible


Test Plan:

Run unit tests and the commands below

```
python -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100
```

```
torchrun --nproc-per-node 8 -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100
```

```
python -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100 dump_dir=s3://blt/scratch/checkpoint-test/
```

```
torchrun --nproc-per-node 8 -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100 dump_dir=s3://blt/scratch/checkpoint-test/
```
2025-02-05 19:09:14 +00:00
Pedro Rodriguez c3d7f720f0
Merge b6e53f1d4c into sapling-pr-archive-EntilZha
2025-02-04 16:55:37 -08:00
Pedro Rodriguez b6e53f1d4c Update checkpointing to use fsspec
Summary:

- Make the data/checkpoint code fsspec compatible


Test Plan:

Run unit tests and the commands below

```
python -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100
```

```
torchrun --nproc-per-node 8 -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100
```

```
python -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100 dump_dir=s3://blt/scratch/checkpoint-test/
```

```
torchrun --nproc-per-node 8 -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100 dump_dir=s3://blt/scratch/checkpoint-test/
```
2025-02-05 00:55:32 +00:00
Pedro Rodriguez c79b1fdbd0
Fix distributed all reduce grad norm (#40)
Summary:

With >1 GPU but only 1 node, all-reduces fail when inputs are not bf16. This uses a modified copy of torch's grad norm computation to avoid the failures.
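
A minimal sketch of the idea, assuming gradients are sharded across ranks (e.g. FSDP): accumulate the squared norm in fp32 and all-reduce that scalar, so the collective never sees bf16 inputs. Illustrative only, not the modified torch code used here:

```
# Sketch: reduce the squared gradient norm in fp32 so the all-reduce does
# not depend on the parameters' (bf16) dtype. Illustrative only.
import torch
import torch.distributed as dist


def global_grad_norm(parameters) -> torch.Tensor:
    grads = [p.grad for p in parameters if p.grad is not None]
    total_sq = torch.zeros((), dtype=torch.float32, device=grads[0].device)
    for g in grads:
        total_sq += g.detach().float().pow(2).sum()
    if dist.is_available() and dist.is_initialized():
        # Sum the fp32 scalar across ranks instead of reducing bf16 tensors.
        dist.all_reduce(total_sq, op=dist.ReduceOp.SUM)
    return total_sq.sqrt()
```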

Test Plan:

- Run unit tests
- Run single-GPU training: `python -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100`
- Run 1-node, multi-GPU training: `torchrun --nproc-per-node 8 -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100`
2025-02-04 16:53:50 -08:00
Pedro Rodriguez e1cd15ec30 merge commit for archive created by Sapling 2025-02-05 00:53:00 +00:00
Pedro Rodriguez ac257bac19 Fix distributed all reduce grad norm
Summary:

With >1 GPU but only 1 node, all-reduces fail when inputs are not bf16. This uses a modified copy of torch's grad norm computation to avoid the failures.

Test Plan:

- Run unit tests
- Run single-GPU training: `python -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100`
- Run 1-node, multi-GPU training: `torchrun --nproc-per-node 8 -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100`
2025-02-05 00:52:52 +00:00
Pedro Rodriguez 2d68e5126d merge commit for archive created by Sapling 2025-02-05 00:51:52 +00:00
Pedro Rodriguez 9cf7847e26 Fix distributed all reduce grad norm
Summary:

With >1 GPU but only 1 node, all-reduces fail when inputs are not bf16. This uses a modified copy of torch's grad norm computation to avoid the failures.

Test Plan:

- Run unit tests
- Run single-GPU training: `python -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100`
- Run 1-node, multi-GPU training: `torchrun --nproc-per-node 8 -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100`
2025-02-05 00:51:27 +00:00
Pedro Rodriguez 8db01ac392
Merge 4ad4889405 into sapling-pr-archive-EntilZha 2025-02-04 16:30:42 -08:00
Pedro Rodriguez 4ad4889405 Update checkpointing to use fsspec
Summary:

- Make the data/checkpoint code fsspec compatible


Test Plan:

Run unit tests and the commands below

```
python -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100
```

```
torchrun --nproc-per-node 8 -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100
```

```
python -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100 dump_dir=s3://blt/scratch/checkpoint-test/
```

```
torchrun --nproc-per-node 8 -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100 dump_dir=s3://blt/scratch/checkpoint-test/
```
2025-02-05 00:30:38 +00:00
Pedro Rodriguez 97e3bc0427
Merge b2058fb0f6 into sapling-pr-archive-EntilZha 2025-02-04 16:30:13 -08:00
Pedro Rodriguez b2058fb0f6 Update checkpointing to use fsspec
Summary:

- Make the arrow iterator able to read from jsonl files; the entropies are omitted in this case
- Make the data/checkpoint code fsspec compatible
- Fix issues with all-reduce on non-bf16 inputs in dist_sum and the norm computation
- Minimal fixes to get eval to run; it is slow currently
- Add bpb numbers during training
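
For the bpb metric in the last bullet: bits-per-byte is the summed cross-entropy in nats converted to bits and normalized by the raw byte count instead of the token count. A small sketch with hypothetical variable names:

```
import math


def bits_per_byte(total_nll_nats: float, n_bytes: int) -> float:
    # Convert the summed negative log-likelihood from nats to bits, then
    # normalize by the number of raw bytes rather than the token count.
    return total_nll_nats / (math.log(2) * n_bytes)


# Example: a mean loss of 1.2 nats/token over 1000 tokens covering 4200 bytes
# gives bits_per_byte(1.2 * 1000, 4200) ≈ 0.41.
```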


Test Plan:

Run unit tests and the commands below

```
python -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100
```

```
torchrun --nproc-per-node 8 -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100
```

```
python -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100 dump_dir=s3://blt/scratch/checkpoint-test/
```

```
torchrun --nproc-per-node 8 -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100 dump_dir=s3://blt/scratch/checkpoint-test/
```
2025-02-05 00:29:37 +00:00
Pedro Rodriguez 740d76cd69
Merge e742218d65 into sapling-pr-archive-EntilZha 2025-02-04 16:21:14 -08:00
Pedro Rodriguez e742218d65 Update checkpointing to use fsspec
Summary:

- Make the arrow iterator able to read from jsonl files; the entropies are omitted in this case
- Make the data/checkpoint code fsspec compatible
- Fix issues with all-reduce on non-bf16 inputs in dist_sum and the norm computation
- Minimal fixes to get eval to run; it is slow currently
- Add bpb numbers during training


Test Plan:

Run

```
torchrun --nproc-per-node 8 -m bytelatent.train config=internal/configs/entropy_model.yaml eval=null max_steps=10100
```

```
python -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null
```

```
torchrun --nproc-per-node 8 -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null
```
2025-02-05 00:20:58 +00:00
Pedro Rodriguez 9c4cca558b
Merge bc39591032 into sapling-pr-archive-EntilZha
2025-02-04 10:19:56 -08:00
Pedro Rodriguez bc39591032 Several changes to enable entropy model training/eval
Summary:

- Make the arrow iterator able to read from jsonl files; the entropies are omitted in this case
- Make the data/checkpoint code fsspec compatible
- Fix issues with all-reduce on non-bf16 inputs in dist_sum and the norm computation
- Minimal fixes to get eval to run; it is slow currently
- Add bpb numbers during training


Test Plan:

Run

```
torchrun --nproc-per-node 8 -m bytelatent.train config=internal/configs/entropy_model.yaml eval=null max_steps=10100
```

```
python -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null
```

```
torchrun --nproc-per-node 8 -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null
```
2025-02-04 18:19:49 +00:00
Pedro Rodriguez f73b9e1a41 merge commit for archive created by Sapling 2025-02-04 18:05:21 +00:00
Pedro Rodriguez ab399e981d Several changes to enable entropy model training/eval
Summary:

- Make the arrow iterator able to read from jsonl files; the entropies are omitted in this case
- Make the data/checkpoint code fsspec compatible
- Fix issues with all-reduce on non-bf16 inputs in dist_sum and the norm computation
- Minimal fixes to get eval to run; it is slow currently
- Add bpb numbers during training


Test Plan:

Run

```
torchrun --nproc-per-node 8 -m bytelatent.train config=internal/configs/entropy_model.yaml eval=null max_steps=10100
```
2025-02-04 18:05:16 +00:00
Pedro Rodriguez 48df9ce785
Merge c6ef4285e2 into sapling-pr-archive-EntilZha 2025-02-04 10:03:26 -08:00
Pedro Rodriguez c6ef4285e2 Several changes to enable entropy model training/eval
Summary:

- Make the arrow iterator able to read from jsonl files; the entropies are omitted in this case
- Make the data/checkpoint code fsspec compatible
- Fix issues with all-reduce on non-bf16 inputs in dist_sum and the norm computation
- Minimal fixes to get eval to run; it is slow currently
- Add bpb numbers during training


Test Plan:
2025-02-04 18:03:19 +00:00
Pedro Rodriguez 4ff8341738
Merge 11cad6c84d into sapling-pr-archive-EntilZha
2025-02-03 18:29:37 -08:00
Pedro Rodriguez 11cad6c84d WIP parallel copy script
Summary:

Test Plan:
2025-01-28 00:57:06 +00:00
Pedro Rodriguez 7044771a12
This includes fixes that make checkpointing and reloading work correctly. (#35)
It also includes a first set of changes toward fixing the eval code

Summary:

Test Plan:
2025-01-27 16:56:42 -08:00