Commit graph

5 commits

Pedro Rodriguez 203bff3696 Pass mask in packing_iterator, correctly handle last batch, fix masking
This commit does/fixes the following:

1. Adds unit tests for byte and patch packing to ensure they work correctly
2. Fixes a bug where, for batches that end up with fewer than max_length bytes (e.g., short patches), the mask included elements whose value was pad_id. The fix sets the mask to != pad_id when it is not specified (see the sketch after this list).
3. Correctly handles the last batch, which previously caused a crash. This didn't affect training, since we had enough data and/or looped iterators, but it matters for evaluation perplexity when we validate on an entire file.
4. Correctly forwards the mask, if it exists, for byte packing
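
Below is a minimal sketch of the default-mask behavior from point 2, assuming NumPy-style token arrays; `mask_for` and `PAD_ID` are illustrative names, not the repository's API:

```
import numpy as np
from typing import Optional

PAD_ID = 0  # assumed pad id, for illustration only


def mask_for(tokens: np.ndarray, mask: Optional[np.ndarray] = None) -> np.ndarray:
    """Use the provided mask; otherwise derive one that excludes pad positions."""
    if mask is None:
        mask = tokens != PAD_ID
    return mask


batch = np.array([[65, 66, 67, PAD_ID, PAD_ID]])
print(mask_for(batch))  # [[ True  True  True False False]]
```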

Test Plan:

```
pytest bytelatent/
```

Testing these changes more thoroughly in a stacked PR that fixes evals
2025-02-22 01:27:13 +00:00
Pedro Rodriguez 2655e4cf82 Remove byte tokenizer and add config args to switch between byte/patch packing
Summary:

Test Plan:

```
python -m bytelatent.train config=../internal-blt/configs/entropy_model.yaml logging.wandb=null checkpoint.dump.every=1000 checkpoint.eval.every=100000 eval=null

pytest bytelatent/
```
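
As a rough illustration of the byte/patch switch, here is a hedged sketch; `packing_mode`, `pack_bytes`, and `pack_patches` are hypothetical names for illustration, not the actual bytelatent config schema or APIs:

```
from dataclasses import dataclass
from typing import Iterable, Iterator, List


def pack_bytes(byte_ids: Iterable[int], max_length: int) -> Iterator[List[int]]:
    """Greedily pack a flat stream of byte ids into fixed-length sequences."""
    buf: List[int] = []
    for b in byte_ids:
        buf.append(b)
        if len(buf) == max_length:
            yield buf
            buf = []
    if buf:
        yield buf  # last, possibly short, batch


def pack_patches(patches: Iterable[List[int]], max_length: int) -> Iterator[List[int]]:
    """Pack whole patches into sequences, never splitting a patch."""
    buf: List[int] = []
    for patch in patches:
        if buf and len(buf) + len(patch) > max_length:
            yield buf
            buf = []
        buf.extend(patch)
    if buf:
        yield buf


@dataclass
class PackingArgs:
    packing_mode: str = "bytes"  # assumed values: "bytes" or "patches"
    max_length: int = 8


def build_packing_iterator(args: PackingArgs, source):
    if args.packing_mode == "bytes":
        return pack_bytes(source, args.max_length)
    if args.packing_mode == "patches":
        return pack_patches(source, args.max_length)
    raise ValueError(f"unknown packing_mode: {args.packing_mode}")


print(list(build_packing_iterator(PackingArgs(), range(20))))
```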
2025-02-22 01:13:13 +00:00
Pedro Rodriguez fc3399ef40 Update iterator inheritance, pass file format args, limit iterator (#63)
- Create a common base class for all iterator states
- Add a limit iterator that we can use in evals (see the sketch after this list)
- Modify ArrowFileIterator so it skips arrow path inference when file_format='json'
- Make EvalArgs valid
- Move testing iterators to a common directory so they can be used in multiple test files
- Allow SequenceIterator to take a None rng_state, which disables all RNG ops (mainly for eval)
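
A minimal, self-contained sketch of a limit iterator for capping eval streams, as mentioned above; `limit_iterator` is an illustrative name and the real implementation is presumably a stateful iterator class:

```
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def limit_iterator(source: Iterable[T], limit: int) -> Iterator[T]:
    """Yield at most `limit` items from `source`; a negative limit disables the cap."""
    if limit < 0:
        yield from source
        return
    for i, item in enumerate(source):
        if i >= limit:
            return
        yield item


# Example: cap an eval stream at 3 examples.
print(list(limit_iterator(range(10), 3)))  # [0, 1, 2]
```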

Test Plan:

- `pytest bytelatent`
- `python -m bytelatent.train config=../internal-blt/configs/entropy_model.yaml logging.wandb=null eval=null`
2025-02-21 16:21:07 -08:00
Pedro Rodriguez 7622d28b74 Initial codes and scripts for training entropy model (#34)
Summary:

Test Plan:
2025-01-27 09:46:44 -08:00
Pedro Rodriguez bcc039bb75 Initial commit 2024-12-12 15:32:30 -08:00