vikarti.anatra/blt

mirror of https://github.com/facebookresearch/blt.git synced 2025-02-23 05:22:16 +00:00

Author	SHA1	Message	Date
Pedro Rodriguez	de774bd98b	Merge `203bff3696` into sapling-pr-archive-EntilZha	2025-02-21 17:27:19 -08:00
Pedro Rodriguez	203bff3696	Pass mask in packing_iterator, correctly handle last batch, fix masking This commit does/fixes the following: 1. Adds unit tests for byte and patch packing to ensure it works correctly 2. Fixes a bug where, for batches that end up with <max_length number of bytes (e.g., short patches), the mask was including elements that had value pad_id. This fixes the mask by setting it to `!= pad_id` if it is not specified. 3. Correctly handles the last batch, where previously it would crash. This didn't affect training since we had enough data and/or looped iterators, but for evaluation perplexity it comes up if we validate on an entire file. 4. Correctly forwards the mask if it exists for byte packing Test Plan: ``` pytest bytelatent/ ``` Testing these changes more thoroughly in a stacked PR that fixes evals	2025-02-22 01:27:13 +00:00
Pedro Rodriguez	a0fa496aa2	merge commit for archive created by Sapling	2025-02-22 01:22:31 +00:00
Pedro Rodriguez	1ede87e1ae	Pass mask in packing_iterator, correctly handle last batch	2025-02-22 01:22:25 +00:00
Pedro Rodriguez	c233487b95	Merge `2655e4cf82` into sapling-pr-archive-EntilZha	2025-02-21 17:13:18 -08:00
Pedro Rodriguez	2655e4cf82	Remove byte tokenizer and add config args to switch between byte/patch packing Summary: Test Plan: ``` python -m bytelatent.train config=../internal-blt/configs/entropy_model.yaml logging.wandb=null checkpoint.dump.every=1000 checkpoint.eval.every=100000 eval=null pytest bytelatent/ ```	2025-02-22 01:13:13 +00:00
Pedro Rodriguez	44b1e5eaa1	Merge `edf86f6689` into sapling-pr-archive-EntilZha	2025-02-21 17:12:00 -08:00
Pedro Rodriguez	edf86f6689	Remove byte tokenizer and add config args to switch between byte/patch packing Summary: Test Plan:	2025-02-22 01:05:59 +00:00
Pedro Rodriguez	62a3ff55bf	merge commit for archive created by Sapling	2025-02-22 00:46:36 +00:00
Pedro Rodriguez	eac7a3fdbe	Pass mask in packing_iterator, correctly handle last batch	2025-02-22 00:46:29 +00:00
Pedro Rodriguez	fc3399ef40	Update iterator inheritance, pass file format args, limit iterator (#63) - Create a common class to use in all inheritance for states - Add a limit iterator that we can use in evals - Modify ArrowFileIterator behavior to not do arrow path inference if file_format='json' - Make EvalArgs valid - Move testing iterators to a common directory to allow usage in multiple test files - Make it so that SequenceIterator can take a None rng_state, to disable all rng ops (for eval mainly) Test Plan: - `pytest bytelatent` - `python -m bytelatent.train config=../internal-blt/configs/entropy_model.yaml logging.wandb=null eval=null`	2025-02-21 16:21:07 -08:00
Pedro Rodriguez	92b9a75391	merge commit for archive created by Sapling	2025-02-21 19:26:36 +00:00
Pedro Rodriguez	3e9de62763	Pass mask in packing_iterator, correctly handle last batch	2025-02-21 19:26:29 +00:00
Pedro Rodriguez	06a17a0ddc	Merge `45456fa6d8` into sapling-pr-archive-EntilZha	2025-02-20 12:16:25 -08:00
Pedro Rodriguez	86abff94d0	Merge `55ddb0f84b` into sapling-pr-archive-EntilZha	2025-02-20 12:16:14 -08:00
Pedro Rodriguez	45456fa6d8	Add vocab and seq len abstract fields	2025-02-20 20:15:46 +00:00
Pedro Rodriguez	55ddb0f84b	Pass mask in packing_iterator, correctly handle last batch	2025-02-20 20:15:46 +00:00
Pedro Rodriguez	8baeef13a1	merge commit for archive created by Sapling	2025-02-20 00:57:24 +00:00
Pedro Rodriguez	0ffe2ab685	Update iterator inheritance, pass file format args, limit iterator - Create a common class to use in all inheritance for states - Add a limit iterator that we can use in evals - Modify ArrowFileIterator behavior to not do arrow path inference if file_format='json' - Make EvalArgs valid - Move testing iterators to a common directory to allow usage in multiple test files - Make it so that SequenceIterator can take a None rng_state, to disable all rng ops (for eval mainly) Test Plan: - `pytest bytelatent` - `python -m bytelatent.train config=../internal-blt/configs/entropy_model.yaml logging.wandb=null eval=null`	2025-02-20 00:57:17 +00:00
Pedro Rodriguez	3c1c247809	Merge `2a717d6b40` into sapling-pr-archive-EntilZha	2025-02-19 16:38:06 -08:00
Pedro Rodriguez	2a717d6b40	Update iterators	2025-02-20 00:35:04 +00:00
Pedro Rodriguez	b0956bde99	Make apex logs less noisy (#60) Summary: Test Plan:	2025-02-18 10:45:56 -08:00
Pedro Rodriguez	4b57d05c3b	Merge `2f247263b9` into sapling-pr-archive-EntilZha	2025-02-18 10:43:12 -08:00
Pedro Rodriguez	2f247263b9	Make apex logs less noisy Summary: Test Plan:	2025-02-18 18:43:06 +00:00
Pedro Rodriguez	82ab5930ec	Make it possible to specify multiple config files (#54) Summary: Make it possible to specify multiple config files. Parsing CLI is not a special case anymore, just uses the same config inheritance method. Test Plan: Test that this interpolates in the right order via unit tests Sample usage, loads the internal config, which references bytelatent/configs/entropy_model.yaml. The precedence order is: - Default pydantic args - Included configs, e.g. `config` - CLI args ``` python -m bytelatent.print_config config=internal/configs/entropy_model.yaml eval=null ``` Summary: Test Plan:	2025-02-18 10:42:44 -08:00
Pedro Rodriguez	75fd18716e	merge commit for archive created by Sapling	2025-02-18 18:41:21 +00:00
Pedro Rodriguez	3117ac1f1f	Make it possible to specify multiple config files Summary: Make it possible to specify multiple config files. Parsing CLI is not a special case anymore, just uses the same config inheritance method. Test Plan: Test that this interpolates in the right order via unit tests Sample usage, loads the internal config, which references bytelatent/configs/entropy_model.yaml. The precedence order is: - Default pydantic args - Included configs, e.g. `config` - CLI args ``` python -m bytelatent.print_config config=internal/configs/entropy_model.yaml eval=null ``` Summary: Test Plan:	2025-02-18 18:41:02 +00:00
CharlesCNorton	9f29e0de18	fix(README): correct typo in quickstart instructions (#62) Changed "your can activate the environment" to "you can activate the environment" for clarity.	2025-02-18 09:47:58 -08:00
Pedro Rodriguez	f912535cb7	Merge `655eca670d` into sapling-pr-archive-EntilZha	2025-02-14 15:46:06 -08:00
Pedro Rodriguez	88dedaa2ec	Merge `a3e0647d03` into sapling-pr-archive-EntilZha	2025-02-14 15:45:43 -08:00
Pedro Rodriguez	655eca670d	Minimal working eval Summary: Test Plan:	2025-02-14 23:45:29 +00:00
Pedro Rodriguez	a3e0647d03	Make apex logs less noisy Summary: Test Plan:	2025-02-14 23:45:28 +00:00
Pedro Rodriguez	52590842e0	merge commit for archive created by Sapling	2025-02-14 22:51:24 +00:00
Pedro Rodriguez	f94babc94e	Make it possible to specify multiple config files Summary: Make it possible to specify multiple config files. Parsing CLI is not a special case anymore, just uses the same config inheritance method. Test Plan: Test that this interpolates in the right order via unit tests Sample usage, loads the internal config, which references bytelatent/configs/entropy_model.yaml. The precedence order is: - Default pydantic args - Included configs, e.g. `config` - CLI args ``` python -m bytelatent.print_config config=internal/configs/entropy_model.yaml eval=null ``` Summary: Test Plan:	2025-02-14 22:50:57 +00:00
Pedro Rodriguez	018bf98798	Merge `aa78c96ea4` into sapling-pr-archive-EntilZha	2025-02-14 13:06:55 -08:00
Pedro Rodriguez	aa78c96ea4	Make it possible to specify multiple config files Summary: Make it possible to specify multiple config files. Parsing CLI is not a special case anymore, just uses the same config inheritance method. Test Plan: Test that this interpolates in the right order via unit tests Sample usage, loads the internal config, which references bytelatent/configs/entropy_model.yaml. The precedence order is: - Default pydantic args - Included configs, e.g. `config` - CLI args ``` python -m bytelatent.print_config config=internal/configs/entropy_model.yaml eval=null ```	2025-02-14 21:06:50 +00:00
Pedro Rodriguez	ed6300375f	Merge `bec0164820` into sapling-pr-archive-EntilZha	2025-02-14 13:04:04 -08:00
Pedro Rodriguez	bec0164820	Make it possible to specify multiple config files Summary: Test Plan: Test that this interpolates in the right order, config -> configs -> cli args ``` # All three sources python -m bytelatent.print_config config=bytelatent/configs/debug.yaml configs=[internal/configs/s3_debug.yaml] eval=null # What worked before python -m bytelatent.print_config config=internal/configs/s3_debug.yaml eval=null ```	2025-02-14 21:03:57 +00:00
Pedro Rodriguez	1c7031b4c4	Merge `be3ff12cfe` into sapling-pr-archive-EntilZha	2025-02-14 13:03:38 -08:00
Pedro Rodriguez	be3ff12cfe	Make it possible to specify multiple config files Summary: Test Plan: Test that this interpolates in the right order, config -> configs -> cli args ``` # All three sources python -m bytelatent.print_config config=bytelatent/configs/debug.yaml configs=[internal/configs/s3_debug.yaml] eval=null # What worked before python -m bytelatent.print_config config=internal/configs/s3_debug.yaml eval=null ```	2025-02-14 21:03:26 +00:00
Srinivasan Iyer	f3e8125f74	using apex rmsnorm (#57) * using apex rmsnorm * added message for missing apex * black * missed a print --------- Co-authored-by: Srini Iyer <sviyer@meta.com>	2025-02-14 11:22:03 -08:00
Srinivasan Iyer	c49e25171e	Update README.md (#58)	2025-02-14 11:16:49 -08:00
Pedro Rodriguez	8c61ab5e67	Fix multiprocessing dataloader checkpointing and use it in the train script (#50)	2025-02-13 11:58:23 -08:00
Pedro Rodriguez	84afa0f121	merge commit for archive created by Sapling	2025-02-13 19:01:55 +00:00
Pedro Rodriguez	53529dcc78	Fix multiprocessing dataloader checkpointing and use it in the train script Summary: Test Plan:	2025-02-13 19:01:49 +00:00
Pedro Rodriguez	76e7b001bb	Merge `0c6cb995a0` into sapling-pr-archive-EntilZha	2025-02-13 10:39:03 -08:00
Pedro Rodriguez	0c6cb995a0	Fix multiprocessing dataloader checkpointing and use it in the train script Summary: Test Plan:	2025-02-13 18:38:58 +00:00
Pedro Rodriguez	85c2f28f26	Test first batch matches (#53) Summary: Test Plan:	2025-02-13 10:05:08 -08:00
Pedro Rodriguez	45d52b7ae3	Merge `ab8f8a4412` into sapling-pr-archive-EntilZha	2025-02-13 10:04:43 -08:00
Pedro Rodriguez	ab8f8a4412	Test first batch matches Summary: Test Plan:	2025-02-13 18:04:30 +00:00
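
Note on commit 203bff3696: the masking fix described there (default the mask to `!= pad_id` when none is supplied, otherwise forward the given mask) can be illustrated with a minimal sketch. This is not the repository's code; `resolve_mask`, `PAD_ID`, and the list-of-lists batch layout are hypothetical names chosen for illustration, and plain Python lists stand in for whatever array type the packing iterator actually uses.

```python
from typing import List, Optional

PAD_ID = 0  # hypothetical pad value, chosen only for this sketch


def resolve_mask(
    tokens: List[List[int]],
    mask: Optional[List[List[bool]]] = None,
    pad_id: int = PAD_ID,
) -> List[List[bool]]:
    """Return the caller's mask if given, else mark every non-pad position."""
    if mask is not None:
        # Forward an explicit mask unchanged (copied to avoid aliasing).
        return [list(row) for row in mask]
    # Default: a position is valid only if it does not hold the pad value.
    return [[tok != pad_id for tok in row] for row in tokens]


# A short final row is padded up to the batch width; its pad slots are masked out.
batch = [[5, 6, 7, 8], [9, 10, PAD_ID, PAD_ID]]
print(resolve_mask(batch))
```

With this default, a batch holding fewer than max_length real bytes no longer counts its padding as valid positions.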

166 commits
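
Note on commit 82ab5930ec: the precedence order stated there (default pydantic args, then included configs, then CLI args) amounts to layering each source over the previous one. The sketch below shows that ordering with plain dictionaries; it is an assumption-laden illustration, not the project's implementation, and `deep_merge`, `resolve_config`, and the sample keys are hypothetical.

```python
from typing import Any, Dict, List


def deep_merge(base: Dict[str, Any], override: Dict[str, Any]) -> Dict[str, Any]:
    """Recursively overlay `override` on `base` without mutating either."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            # Scalars, lists, and None replace the earlier value outright.
            merged[key] = value
    return merged


def resolve_config(
    defaults: Dict[str, Any],
    included: List[Dict[str, Any]],
    cli: Dict[str, Any],
) -> Dict[str, Any]:
    """Apply layers in order: defaults, each included config, then CLI args."""
    resolved = defaults
    for layer in [*included, cli]:
        resolved = deep_merge(resolved, layer)
    return resolved


defaults = {"logging": {"wandb": "default-project", "freq": 10}, "eval": {"enabled": True}}
included = [{"logging": {"freq": 100}}]
cli = {"eval": None}  # mirrors passing eval=null on the command line
print(resolve_config(defaults, included, cli))
```

Because later layers win, a CLI value such as `eval=null` overrides anything set by an included config file, which in turn overrides the defaults.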