vikarti.anatra/blt

mirror of https://github.com/facebookresearch/blt.git synced 2025-02-23 13:32:14 +00:00

Author	SHA1	Message	Date
Pedro Rodriguez	3e3193c1d4	Fix multiprocessing dataloader checkpointing and use it in the train script Some checks are pending Lint with Black / lint (push) Waiting to run Details Lint with isort / lint (push) Waiting to run Details Summary: Test Plan:	2025-02-12 18:24:40 +00:00
Pedro Rodriguez	c54c9f0517	Test first batch matches Summary: Test Plan:	2025-02-12 18:24:39 +00:00
Pedro Rodriguez	ec59c13d81	Merge `bd3cf61bb9` into sapling-pr-archive-EntilZha	2025-02-12 10:11:44 -08:00
Pedro Rodriguez	9613e0ea5f	merge commit for archive created by Sapling	2025-02-12 18:09:31 +00:00
Pedro Rodriguez	bd3cf61bb9	Fix multiprocessing dataloader checkpointing and use it in the train script Summary: Test Plan:	2025-02-12 18:09:26 +00:00
Pedro Rodriguez	4cee32ea8c	Test first batch matches Summary: Test Plan:	2025-02-12 18:09:26 +00:00
Pedro Rodriguez	b61a612bbb	Merge `92af9b3f56` into sapling-pr-archive-EntilZha	2025-02-12 10:07:50 -08:00
Pedro Rodriguez	92af9b3f56	Test first batch matches Summary: Test Plan:	2025-02-12 18:07:22 +00:00
Pedro Rodriguez	c6cbacc8c1	merge commit for archive created by Sapling Some checks are pending Lint with Black / lint (push) Waiting to run Details Lint with isort / lint (push) Waiting to run Details	2025-02-11 22:56:32 +00:00
Pedro Rodriguez	38cc67a953	Fix multiprocessing dataloader checkpointing and use it in the train script Some checks are pending Lint with Black / lint (push) Waiting to run Details Lint with isort / lint (push) Waiting to run Details Summary: Test Plan:	2025-02-11 22:56:26 +00:00
Pedro Rodriguez	c4b7a01b2b	Merge `5c8fb4f1b3` into sapling-pr-archive-EntilZha Some checks failed Lint with Black / lint (push) Has been cancelled Details Lint with isort / lint (push) Has been cancelled Details	2025-02-07 15:27:12 -08:00
Pedro Rodriguez	5c8fb4f1b3	Fix multiprocessing dataloader checkpointing and use it in the train script Summary: Test Plan:	2025-02-07 23:27:05 +00:00
Srinivasan Iyer	22c7fe1d1c	fix save and reload model state (#49 ) Some checks failed Lint with Black / lint (push) Has been cancelled Details Lint with isort / lint (push) Has been cancelled Details Co-authored-by: Srini Iyer <sviyer@meta.com>	2025-02-07 14:27:47 -08:00
Pedro Rodriguez	fe45f69fbf	Add bpb and n_bytes to metric logging (#41 ) Some checks are pending Lint with Black / lint (push) Waiting to run Details Lint with isort / lint (push) Waiting to run Details Summary: Test Plan:	2025-02-07 13:14:30 -08:00
Pedro Rodriguez	b35206d756	merge commit for archive created by Sapling Some checks are pending Lint with Black / lint (push) Waiting to run Details Lint with isort / lint (push) Waiting to run Details	2025-02-07 21:13:42 +00:00
Pedro Rodriguez	8d7338308e	Add bpb and n_bytes to metric logging Some checks failed Lint with Black / lint (push) Has been cancelled Details Lint with isort / lint (push) Has been cancelled Details Summary: Test Plan:	2025-02-07 21:13:36 +00:00
Pedro Rodriguez	f783846574	merge commit for archive created by Sapling Some checks are pending Lint with Black / lint (push) Waiting to run Details Lint with isort / lint (push) Waiting to run Details	2025-02-07 00:26:06 +00:00
Pedro Rodriguez	b6396eb0f4	Add bpb and n_bytes to metric logging Some checks are pending Lint with Black / lint (push) Waiting to run Details Lint with isort / lint (push) Waiting to run Details Summary: Test Plan:	2025-02-07 00:26:01 +00:00
Srinivasan Iyer	aebdc481a8	Fix init and repro (#48 ) Some checks are pending Lint with Black / lint (push) Waiting to run Details Lint with isort / lint (push) Waiting to run Details * Fix init and repro * comment + black --------- Co-authored-by: Srini Iyer <sviyer@meta.com>	2025-02-06 14:18:02 -08:00
Pedro Rodriguez	ab594996a9	Merge `4e2ed0aa05` into sapling-pr-archive-EntilZha Some checks are pending Lint with Black / lint (push) Waiting to run Details Lint with isort / lint (push) Waiting to run Details	2025-02-06 10:24:36 -08:00
Pedro Rodriguez	4e2ed0aa05	Add bpb and n_bytes to metric logging Some checks are pending Lint with Black / lint (push) Waiting to run Details Lint with isort / lint (push) Waiting to run Details Summary: Test Plan:	2025-02-06 18:08:01 +00:00
Pedro Rodriguez	936d9437be	Allow ArrowIterator to read from json (#45 ) Some checks are pending Lint with Black / lint (push) Waiting to run Details Lint with isort / lint (push) Waiting to run Details Summary: Currently, arrow iterator can only read arrow files. However, the pyarrow library can read other formats, including jsonlines. This allows the same ArrowIterator to read from jsonlines, so we can read from the original source data, and simply omit the entropy column when doing so Test Plan: Run train script until dataloader starts	2025-02-06 09:57:22 -08:00
Pedro Rodriguez	2950b63cf2	Merge `9c3c997cae` into sapling-pr-archive-EntilZha	2025-02-06 09:44:41 -08:00
Pedro Rodriguez	9c3c997cae	Allow ArrowIterator to read from json Some checks failed Lint with Black / lint (push) Has been cancelled Details Lint with isort / lint (push) Has been cancelled Details Summary: Currently, arrow iterator can only read arrow files. However, the pyarrow library can read other formats, including jsonlines. This allows the same ArrowIterator to read from jsonlines, so we can read from the original source data, and simply omit the entropy column when doing so Test Plan: Run train script until dataloader starts	2025-02-06 17:44:36 +00:00
Pedro Rodriguez	fff80b86b5	Merge `0e9421af07` into sapling-pr-archive-EntilZha	2025-02-06 09:43:15 -08:00
Pedro Rodriguez	0e9421af07	Allow ArrowIterator to read from json Summary: Test Plan:	2025-02-06 17:43:11 +00:00
Pedro Rodriguez	18ae0ba444	Merge `8d26140970` into sapling-pr-archive-EntilZha	2025-02-06 09:42:51 -08:00
Pedro Rodriguez	8d26140970	Allow ArrowIterator to read from json Summary: Test Plan:	2025-02-06 17:42:42 +00:00
Pedro Rodriguez	afedb16598	Update checkpointing to use fsspec (#39 ) Summary: - Make the data/checkpoint code fsspec compatible - Still will not work with s3 saves, due to `torch.distributed.checkpoint.save` not being out of the box workable with `fsspec`. Will implement in followup PR Test Plan: Run unit tests and the commands below ``` python -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100 ``` ``` torchrun --nproc-per-node 8 -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100 ``` These currently won't work due to the torch distributed save, but theses hould be tested at a later date ``` python -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100 dump_dir=s3://blt/scratch/checkpoint-test/ ``` ``` torchrun --nproc-per-node 8 -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100 dump_dir=s3://blt/scratch/checkpoint-test/ ```	2025-02-06 09:41:58 -08:00
Pedro Rodriguez	d44902da97	Merge `f058373889` into sapling-pr-archive-EntilZha	2025-02-06 09:37:27 -08:00
Pedro Rodriguez	f058373889	Update checkpointing to use fsspec Some checks failed Lint with Black / lint (push) Has been cancelled Details Lint with isort / lint (push) Has been cancelled Details Summary: - Make the data/checkpoint code fsspec compatible - Still will not work with s3 saves, due to `torch.distributed.checkpoint.save` not being out of the box workable with `fsspec`. Will implement in followup PR Test Plan: Run unit tests and the commands below ``` python -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100 ``` ``` torchrun --nproc-per-node 8 -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100 ``` These currently won't work due to the torch distributed save, but theses hould be tested at a later date ``` python -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100 dump_dir=s3://blt/scratch/checkpoint-test/ ``` ``` torchrun --nproc-per-node 8 -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100 dump_dir=s3://blt/scratch/checkpoint-test/ ```	2025-02-06 17:37:21 +00:00
Pedro Rodriguez	e13495c351	Merge `341264685a` into sapling-pr-archive-EntilZha	2025-02-06 09:34:50 -08:00
Pedro Rodriguez	341264685a	Update checkpointing to use fsspec Summary: - Make the data/checkpoint code fsspec compatible - Still will not work with s3 saves, due to `torch.distributed.checkpoint.save` not being out of the box workable with `fsspec`. Will implement in followup PR Test Plan: Run unit tests and the commands below ``` python -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100 ``` ``` torchrun --nproc-per-node 8 -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100 ``` These currently won't work due to the torch distributed save, but theses hould be tested at a later date ``` python -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100 dump_dir=s3://blt/scratch/checkpoint-test/ ``` ``` torchrun --nproc-per-node 8 -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100 dump_dir=s3://blt/scratch/checkpoint-test/ ```	2025-02-06 17:34:44 +00:00
Pedro Rodriguez	2d1c766050	Merge `45bfe94c1e` into sapling-pr-archive-EntilZha Some checks are pending Lint with Black / lint (push) Waiting to run Details Lint with isort / lint (push) Waiting to run Details	2025-02-05 17:27:35 -08:00
Pedro Rodriguez	45bfe94c1e	Broken train reproducing bf16 error Summary: Test Plan:	2025-02-06 01:27:23 +00:00
Srinivasan Iyer	739dc71a0a	Add rope fp32 (#43 ) Some checks are pending Lint with Black / lint (push) Waiting to run Details Lint with isort / lint (push) Waiting to run Details * Log model * Add flag for rope outer in fp32 --------- Co-authored-by: Srini Iyer <sviyer@meta.com>	2025-02-05 17:19:37 -08:00
Srinivasan Iyer	6fbaf7266f	fix stool (#44 ) Co-authored-by: Srini Iyer <sviyer@meta.com>	2025-02-05 17:18:40 -08:00
Srinivasan Iyer	7cf8fab49b	Fix wandb logging (#42 ) Co-authored-by: Srini Iyer <sviyer@meta.com>	2025-02-05 16:24:39 -08:00
Pedro Rodriguez	8f1a9a858e	Minimal working eval Summary: Test Plan:	2025-02-05 22:47:01 +00:00
Pedro Rodriguez	48cf4dfee1	Allow ArrowIterator to read from json Summary: Test Plan:	2025-02-05 22:47:01 +00:00
Pedro Rodriguez	1377fcb010	Merge `2f42633b07` into sapling-pr-archive-EntilZha	2025-02-05 14:32:03 -08:00
Pedro Rodriguez	2f42633b07	Add bpb and n_bytes to metric logging Summary: Test Plan:	2025-02-05 22:27:15 +00:00
Pedro Rodriguez	b2f2a6a76e	merge commit for archive created by Sapling	2025-02-05 22:10:37 +00:00
Pedro Rodriguez	1450464031	Update checkpointing to use fsspec Some checks are pending Lint with Black / lint (push) Waiting to run Details Lint with isort / lint (push) Waiting to run Details Summary: - Make the data/checkpoint code fsspec compatible Test Plan: Run unit tests and the commands below ``` python -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100 ``` ``` torchrun --nproc-per-node 8 -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100 ``` ``` python -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100 dump_dir=s3://blt/scratch/checkpoint-test/ ``` ``` torchrun --nproc-per-node 8 -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100 dump_dir=s3://blt/scratch/checkpoint-test/ ```	2025-02-05 19:09:14 +00:00
Pedro Rodriguez	c3d7f720f0	Merge `b6e53f1d4c` into sapling-pr-archive-EntilZha Some checks are pending Lint with Black / lint (push) Waiting to run Details Lint with isort / lint (push) Waiting to run Details	2025-02-04 16:55:37 -08:00
Pedro Rodriguez	b6e53f1d4c	Update checkpointing to use fsspec Some checks are pending Lint with Black / lint (push) Waiting to run Details Lint with isort / lint (push) Waiting to run Details Summary: - Make the data/checkpoint code fsspec compatible Test Plan: Run unit tests and the commands below ``` python -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100 ``` ``` torchrun --nproc-per-node 8 -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100 ``` ``` python -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100 dump_dir=s3://blt/scratch/checkpoint-test/ ``` ``` torchrun --nproc-per-node 8 -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100 dump_dir=s3://blt/scratch/checkpoint-test/ ```	2025-02-05 00:55:32 +00:00
Pedro Rodriguez	c79b1fdbd0	Fix distributed all reduce grad norm (#40 ) Some checks are pending Lint with Black / lint (push) Waiting to run Details Lint with isort / lint (push) Waiting to run Details Summary: With >1 GPU, but only 1 node, all reduces fail when inputs are not bf16. This uses a modified copy of torch's grad norm to avoid failures Test Plan: - Run unit tests: - Run single gpu training: `python -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100` - Run 1 node, multi-gpu training `torchrun --nproc-per-node 8 -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100`	2025-02-04 16:53:50 -08:00
Pedro Rodriguez	e1cd15ec30	merge commit for archive created by Sapling	2025-02-05 00:53:00 +00:00
Pedro Rodriguez	ac257bac19	Fix distributed all reduce grad norm Summary: With >1 GPU, but only 1 node, all reduces fail when inputs are not bf16. This uses a modified copy of torch's grad norm to avoid failures Test Plan: - Run unit tests: - Run single gpu training: `python -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100` - Run 1 node, multi-gpu training `torchrun --nproc-per-node 8 -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100`	2025-02-05 00:52:52 +00:00
Pedro Rodriguez	2d68e5126d	merge commit for archive created by Sapling	2025-02-05 00:51:52 +00:00

1 2 3 4

160 commits