vikarti.anatra/blt

mirror of https://github.com/facebookresearch/blt.git synced 2025-02-23 13:32:14 +00:00

Author	SHA1	Message	Date
Pedro Rodriguez	d44902da97	Merge `f058373889` into sapling-pr-archive-EntilZha	2025-02-06 09:37:27 -08:00
Pedro Rodriguez	f058373889	Update checkpointing to use fsspec Summary: - Make the data/checkpoint code fsspec compatible - Still will not work with s3 saves, due to `torch.distributed.checkpoint.save` not being out of the box workable with `fsspec`. Will implement in followup PR Test Plan: Run unit tests and the commands below ``` python -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100 ``` ``` torchrun --nproc-per-node 8 -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100 ``` These currently won't work due to the torch distributed save, but these should be tested at a later date ``` python -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100 dump_dir=s3://blt/scratch/checkpoint-test/ ``` ``` torchrun --nproc-per-node 8 -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100 dump_dir=s3://blt/scratch/checkpoint-test/ ```	2025-02-06 17:37:21 +00:00
Pedro Rodriguez	e13495c351	Merge `341264685a` into sapling-pr-archive-EntilZha	2025-02-06 09:34:50 -08:00
Pedro Rodriguez	341264685a	Update checkpointing to use fsspec Summary: - Make the data/checkpoint code fsspec compatible - Still will not work with s3 saves, due to `torch.distributed.checkpoint.save` not being out of the box workable with `fsspec`. Will implement in followup PR Test Plan: Run unit tests and the commands below ``` python -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100 ``` ``` torchrun --nproc-per-node 8 -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100 ``` These currently won't work due to the torch distributed save, but these should be tested at a later date ``` python -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100 dump_dir=s3://blt/scratch/checkpoint-test/ ``` ``` torchrun --nproc-per-node 8 -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100 dump_dir=s3://blt/scratch/checkpoint-test/ ```	2025-02-06 17:34:44 +00:00
Pedro Rodriguez	2d1c766050	Merge `45bfe94c1e` into sapling-pr-archive-EntilZha	2025-02-05 17:27:35 -08:00
Pedro Rodriguez	45bfe94c1e	Broken train reproducing bf16 error Summary: Test Plan:	2025-02-06 01:27:23 +00:00
Srinivasan Iyer	739dc71a0a	Add rope fp32 (#43) * Log model * Add flag for rope outer in fp32 --------- Co-authored-by: Srini Iyer <sviyer@meta.com>	2025-02-05 17:19:37 -08:00
Srinivasan Iyer	6fbaf7266f	fix stool (#44) Co-authored-by: Srini Iyer <sviyer@meta.com>	2025-02-05 17:18:40 -08:00
Srinivasan Iyer	7cf8fab49b	Fix wandb logging (#42) Co-authored-by: Srini Iyer <sviyer@meta.com>	2025-02-05 16:24:39 -08:00
Pedro Rodriguez	8f1a9a858e	Minimal working eval Summary: Test Plan:	2025-02-05 22:47:01 +00:00
Pedro Rodriguez	48cf4dfee1	Allow ArrowIterator to read from json Summary: Test Plan:	2025-02-05 22:47:01 +00:00
Pedro Rodriguez	1377fcb010	Merge `2f42633b07` into sapling-pr-archive-EntilZha	2025-02-05 14:32:03 -08:00
Pedro Rodriguez	2f42633b07	Add bpb and n_bytes to metric logging Summary: Test Plan:	2025-02-05 22:27:15 +00:00
Pedro Rodriguez	b2f2a6a76e	merge commit for archive created by Sapling	2025-02-05 22:10:37 +00:00
Pedro Rodriguez	1450464031	Update checkpointing to use fsspec Summary: - Make the data/checkpoint code fsspec compatible Test Plan: Run unit tests and the commands below ``` python -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100 ``` ``` torchrun --nproc-per-node 8 -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100 ``` ``` python -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100 dump_dir=s3://blt/scratch/checkpoint-test/ ``` ``` torchrun --nproc-per-node 8 -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100 dump_dir=s3://blt/scratch/checkpoint-test/ ```	2025-02-05 19:09:14 +00:00
Pedro Rodriguez	c3d7f720f0	Merge `b6e53f1d4c` into sapling-pr-archive-EntilZha	2025-02-04 16:55:37 -08:00
Pedro Rodriguez	b6e53f1d4c	Update checkpointing to use fsspec Summary: - Make the data/checkpoint code fsspec compatible Test Plan: Run unit tests and the commands below ``` python -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100 ``` ``` torchrun --nproc-per-node 8 -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100 ``` ``` python -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100 dump_dir=s3://blt/scratch/checkpoint-test/ ``` ``` torchrun --nproc-per-node 8 -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100 dump_dir=s3://blt/scratch/checkpoint-test/ ```	2025-02-05 00:55:32 +00:00
Pedro Rodriguez	c79b1fdbd0	Fix distributed all reduce grad norm (#40) Summary: With >1 GPU, but only 1 node, all reduces fail when inputs are not bf16. This uses a modified copy of torch's grad norm to avoid failures Test Plan: - Run unit tests: - Run single gpu training: `python -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100` - Run 1 node, multi-gpu training `torchrun --nproc-per-node 8 -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100`	2025-02-04 16:53:50 -08:00
Pedro Rodriguez	e1cd15ec30	merge commit for archive created by Sapling	2025-02-05 00:53:00 +00:00
Pedro Rodriguez	ac257bac19	Fix distributed all reduce grad norm Summary: With >1 GPU, but only 1 node, all reduces fail when inputs are not bf16. This uses a modified copy of torch's grad norm to avoid failures Test Plan: - Run unit tests: - Run single gpu training: `python -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100` - Run 1 node, multi-gpu training `torchrun --nproc-per-node 8 -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100`	2025-02-05 00:52:52 +00:00
Pedro Rodriguez	2d68e5126d	merge commit for archive created by Sapling	2025-02-05 00:51:52 +00:00
Pedro Rodriguez	9cf7847e26	Fix distributed all reduce grad norm Summary: With >1 GPU, but only 1 node, all reduces fail when inputs are not bf16. This uses a modified copy of torch's grad norm to avoid failures Test Plan: - Run unit tests: - Run single gpu training: `python -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100` - Run 1 node, multi-gpu training `torchrun --nproc-per-node 8 -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100`	2025-02-05 00:51:27 +00:00
Pedro Rodriguez	8db01ac392	Merge `4ad4889405` into sapling-pr-archive-EntilZha	2025-02-04 16:30:42 -08:00
Pedro Rodriguez	4ad4889405	Update checkpointing to use fsspec Summary: - Make the data/checkpoint code fsspec compatible Test Plan: Run unit tests and the commands below ``` python -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100 ``` ``` torchrun --nproc-per-node 8 -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100 ``` ``` python -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100 dump_dir=s3://blt/scratch/checkpoint-test/ ``` ``` torchrun --nproc-per-node 8 -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100 dump_dir=s3://blt/scratch/checkpoint-test/ ```	2025-02-05 00:30:38 +00:00
Pedro Rodriguez	97e3bc0427	Merge `b2058fb0f6` into sapling-pr-archive-EntilZha	2025-02-04 16:30:13 -08:00
Pedro Rodriguez	b2058fb0f6	Update checkpointing to use fsspec Summary: - Make arrow iterator able to read from jsonl files, the entropies are omitted in this case - Make the data/checkpoint code fsspec compatible - Fix issues with all reduce with non-bf16 in dist_sum and norm computation. - Minimal fixes to get eval to run, it is slow currently - Add bpb numbers during training Test Plan: Run unit tests and the commands below ``` python -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100 ``` ``` torchrun --nproc-per-node 8 -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100 ``` ``` python -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100 dump_dir=s3://blt/scratch/checkpoint-test/ ``` ``` torchrun --nproc-per-node 8 -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100 dump_dir=s3://blt/scratch/checkpoint-test/ ```	2025-02-05 00:29:37 +00:00
Pedro Rodriguez	740d76cd69	Merge `e742218d65` into sapling-pr-archive-EntilZha	2025-02-04 16:21:14 -08:00
Pedro Rodriguez	e742218d65	Update checkpointing to use fsspec Summary: - Make arrow iterator able to read from jsonl files, the entropies are omitted in this case - Make the data/checkpoint code fsspec compatible - Fix issues with all reduce with non-bf16 in dist_sum and norm computation. - Minimal fixes to get eval to run, it is slow currently - Add bpb numbers during training Test Plan: Run ``` torchrun --nproc-per-node 8 -m bytelatent.train config=internal/configs/entropy_model.yaml eval=null max_steps=10100 ``` ``` python -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null ``` ``` torchrun --nproc-per-node 8 -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null ```	2025-02-05 00:20:58 +00:00
Pedro Rodriguez	9c4cca558b	Merge `bc39591032` into sapling-pr-archive-EntilZha	2025-02-04 10:19:56 -08:00
Pedro Rodriguez	bc39591032	Several changes to enable entropy model training/eval Summary: - Make arrow iterator able to read from jsonl files, the entropies are omitted in this case - Make the data/checkpoint code fsspec compatible - Fix issues with all reduce with non-bf16 in dist_sum and norm computation. - Minimal fixes to get eval to run, it is slow currently - Add bpb numbers during training Test Plan: Run ``` torchrun --nproc-per-node 8 -m bytelatent.train config=internal/configs/entropy_model.yaml eval=null max_steps=10100 ``` ``` python -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null ``` ``` torchrun --nproc-per-node 8 -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null ```	2025-02-04 18:19:49 +00:00
Pedro Rodriguez	f73b9e1a41	merge commit for archive created by Sapling	2025-02-04 18:05:21 +00:00
Pedro Rodriguez	ab399e981d	Several changes to enable entropy model training/eval Summary: - Make arrow iterator able to read from jsonl files, the entropies are omitted in this case - Make the data/checkpoint code fsspec compatible - Fix issues with all reduce with non-bf16 in dist_sum and norm computation. - Minimal fixes to get eval to run, it is slow currently - Add bpb numbers during training Test Plan: Run ``` torchrun --nproc-per-node 8 -m bytelatent.train config=internal/configs/entropy_model.yaml eval=null max_steps=10100 ```	2025-02-04 18:05:16 +00:00
Pedro Rodriguez	48df9ce785	Merge `c6ef4285e2` into sapling-pr-archive-EntilZha	2025-02-04 10:03:26 -08:00
Pedro Rodriguez	c6ef4285e2	Several changes to enable entropy model training/eval Summary: - Make arrow iterator able to read from jsonl files, the entropies are omitted in this case - Make the data/checkpoint code fsspec compatible - Fix issues with all reduce with non-bf16 in dist_sum and norm computation. - Minimal fixes to get eval to run, it is slow currently - Add bpb numbers during training Test Plan:	2025-02-04 18:03:19 +00:00
Pedro Rodriguez	4ff8341738	Merge `11cad6c84d` into sapling-pr-archive-EntilZha	2025-02-03 18:29:37 -08:00
Pedro Rodriguez	11cad6c84d	WIP parallel copy script Summary: Test Plan:	2025-01-28 00:57:06 +00:00
Pedro Rodriguez	7044771a12	This includes fixes that make checkpointing and reloading work correctly. (#35) It also batches in a first set of changes for fixing eval code Summary: Test Plan:	2025-01-27 16:56:42 -08:00
Pedro Rodriguez	4db801a532	Merge `caf82b924e` into sapling-pr-archive-EntilZha	2025-01-27 16:54:52 -08:00
Pedro Rodriguez	caf82b924e	This includes fixes that make checkpointing and reloading work correctly. It also batches in a first set of changes for fixing eval code Summary: Test Plan:	2025-01-28 00:54:47 +00:00
Pedro Rodriguez	c2f1e4845e	Merge `e02ba763b0` into sapling-pr-archive-EntilZha	2025-01-27 16:38:54 -08:00
Pedro Rodriguez	e02ba763b0	This includes fixes that make checkpointing and reloading work correctly. It also batches in a first set of changes for fixing eval code Summary: Test Plan:	2025-01-28 00:38:46 +00:00
Pedro Rodriguez	7622d28b74	Initial codes and scripts for training entropy model (#34) Summary: Test Plan:	2025-01-27 09:46:44 -08:00
Pedro Rodriguez	b1c12dd275	Merge `34ca1f7d4b` into sapling-pr-archive-EntilZha	2025-01-24 13:59:47 -08:00
Pedro Rodriguez	34ca1f7d4b	Initial codes and scripts for training entropy model Summary: Test Plan:	2025-01-24 21:59:42 +00:00
Pedro Rodriguez	f1a2589266	Merge `fb09022e5e` into sapling-pr-archive-EntilZha	2025-01-24 13:55:48 -08:00
Pedro Rodriguez	fb09022e5e	Initial codes and scripts for training entropy model Summary: Test Plan:	2025-01-24 21:55:41 +00:00
Pedro Rodriguez	a809259e71	Use load_async flag to not start MP iterator (#33) Summary: Test Plan:	2025-01-24 10:57:20 -08:00
Pedro Rodriguez	17b727465f	Merge `bd461af91a` into sapling-pr-archive-EntilZha	2025-01-24 10:56:34 -08:00
Pedro Rodriguez	bd461af91a	Use load_async flag to not start MP iterator Summary: Test Plan:	2025-01-24 18:56:28 +00:00
Pedro Rodriguez	bc42cebd7d	Update file check script to check sizes (#32) Summary: Test Plan:	2025-01-22 13:06:46 -08:00

81 commits