blt/bytelatent/data/iterators
Pedro Rodriguez 203bff3696
Pass mask in packing_iterator, correctly handle last batch, fix masking
This commit does/fixes the following:

1. Adds unit tests for byte and patch packing to ensure it works correctly
2. Fixes a bug where, for batches that end up with fewer than max_length bytes (e.g., short patches), the mask included elements whose value was pad_id. The mask now defaults to `!= pad_id` if it's not specified.
3. Correctly handles the last batch, where previously it would crash. This didn't affect training since we had enough data and/or looped iterators, but it comes up when evaluating perplexity if we validate on an entire file.
4. Correctly forwards the mask, if it exists, for byte packing.
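
To illustrate items 2 through 4, here is a minimal sketch of the intended behavior. It is not the code in packing_iterator.py: the function `pack_batches`, its parameters, and the `PAD_ID` value are hypothetical names chosen for this example, and it uses plain NumPy arrays instead of the repository's batch types.

```python
import numpy as np

PAD_ID = 0  # hypothetical pad id, chosen only for this sketch


def pack_batches(sequences, batch_size, max_length, pad_id=PAD_ID, masks=None):
    """Yield (tokens, mask) pairs; the final batch may have fewer rows."""
    for start in range(0, len(sequences), batch_size):
        # Slicing past the end is safe, so a short last batch does not crash.
        chunk = sequences[start:start + batch_size]
        tokens = np.full((len(chunk), max_length), pad_id, dtype=np.int64)
        for row, seq in enumerate(chunk):
            seq = seq[:max_length]
            tokens[row, :len(seq)] = seq

        if masks is None:
            # No mask supplied: default to "everything that is not padding".
            mask = tokens != pad_id
        else:
            # Mask supplied: forward it, treating padded positions as False.
            mask = np.zeros((len(chunk), max_length), dtype=bool)
            for row, m in enumerate(masks[start:start + batch_size]):
                m = m[:max_length]
                mask[row, :len(m)] = m
        yield tokens, mask
```

With three sequences and `batch_size=2`, the second batch has a single row rather than raising an error. Note that defaulting to `tokens != pad_id` assumes pad_id never appears as a real token value.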

Test Plan:

```
pytest bytelatent/
```

Testing these changes more thoroughly in a stacked PR that fixes evals
2025-02-22 01:27:13 +00:00
__init__.py Initial commit 2024-12-12 15:32:30 -08:00
abstract_iterator.py Update iterator inheritance, pass file format args, limit iterator (#63) 2025-02-21 16:21:07 -08:00
arrow_iterator.py Update iterator inheritance, pass file format args, limit iterator (#63) 2025-02-21 16:21:07 -08:00
dev_iterators.py Update iterator inheritance, pass file format args, limit iterator (#63) 2025-02-21 16:21:07 -08:00
limit_iterator.py Update iterator inheritance, pass file format args, limit iterator (#63) 2025-02-21 16:21:07 -08:00
looping_iterator.py Update iterator inheritance, pass file format args, limit iterator (#63) 2025-02-21 16:21:07 -08:00
multiprocess_iterator.py Update iterator inheritance, pass file format args, limit iterator (#63) 2025-02-21 16:21:07 -08:00
packing_iterator.py Pass mask in packing_iterator, correctly handle last batch, fix masking 2025-02-22 01:27:13 +00:00
preprocess_iterator.py Update iterator inheritance, pass file format args, limit iterator (#63) 2025-02-21 16:21:07 -08:00
sampling_iterator.py Update iterator inheritance, pass file format args, limit iterator (#63) 2025-02-21 16:21:07 -08:00
sequence_iterator.py Update iterator inheritance, pass file format args, limit iterator (#63) 2025-02-21 16:21:07 -08:00
test_arrow_iterator.py Update iterator inheritance, pass file format args, limit iterator (#63) 2025-02-21 16:21:07 -08:00
test_iters.py Update iterator inheritance, pass file format args, limit iterator (#63) 2025-02-21 16:21:07 -08:00
test_limit_iterator.py Update iterator inheritance, pass file format args, limit iterator (#63) 2025-02-21 16:21:07 -08:00
test_packing_iterator.py Pass mask in packing_iterator, correctly handle last batch, fix masking 2025-02-22 01:27:13 +00:00