Commit graph

178 commits

Author SHA1 Message Date
Pedro Rodriguez
62cb8936ee merge commit for archive created by Sapling
Some checks are pending
Lint with Black / lint (push) Waiting to run
Lint with isort / lint (push) Waiting to run
2025-02-25 01:38:44 +00:00
Pedro Rodriguez
52d5603b4f Pass mask in packing_iterator, correctly handle last batch, fix masking
Some checks are pending
Lint with Black / lint (push) Waiting to run
Lint with isort / lint (push) Waiting to run
This commit does/fixes the following:

1. Adds unit tests for byte and patch packing to ensure it works correctly
2. Fixes a bug where, for batches that end up with fewer than max_length bytes (e.g., short patches), the mask included elements that had value pad_id. This fixes the mask by setting it to != pad_id if it's not specified.
3. Correctly handles the last batch, which previously crashed. This didn't affect training since we had enough data and/or looped iterators, but it comes up when evaluating perplexity if we run validation on an entire file.
4. Correctly forward the mask if it exists for byte packing

Test Plan:

```
pytest bytelatent/
```

Testing these changes more thoroughly in a stacked PR that fixes evals
2025-02-25 01:38:38 +00:00
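The masking and last-batch fixes described in this commit can be illustrated with a minimal sketch. This is not the repository's `packing_iterator`; the names `pack_batches`, `default_mask`, and `PAD_ID` are hypothetical, and plain Python lists stand in for tensors. It only shows the two ideas from the message: derive the mask as `!= pad_id` when none is given, and emit a final partial batch instead of failing on it.

```python
from typing import Iterable, Iterator, List, Optional, Tuple

PAD_ID = 0  # hypothetical pad id, chosen only for this sketch

Batch = List[List[int]]
Mask = List[List[bool]]


def default_mask(tokens: Batch, pad_id: int, mask: Optional[Mask] = None) -> Mask:
    """Forward an explicit mask if given, otherwise mark non-pad positions."""
    if mask is not None:
        return mask
    return [[tok != pad_id for tok in row] for row in tokens]


def pack_batches(
    sequences: Iterable[List[int]],
    batch_size: int,
    max_length: int,
    pad_id: int = PAD_ID,
) -> Iterator[Tuple[Batch, Mask]]:
    """Pad each sequence to max_length and group the rows into batches."""
    batch: Batch = []
    for seq in sequences:
        row = list(seq[:max_length])
        row += [pad_id] * (max_length - len(row))  # right-pad short rows
        batch.append(row)
        if len(batch) == batch_size:
            yield batch, default_mask(batch, pad_id)
            batch = []
    # Last batch: it may hold fewer than batch_size rows, so emit it as-is.
    if batch:
        yield batch, default_mask(batch, pad_id)


batches = list(pack_batches([[5, 6, 7], [8], [9, 9]], batch_size=2, max_length=3))
```

With three input sequences and `batch_size=2`, the second yielded batch contains a single row, which is the case that matters when validating on an entire file.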
Pedro Rodriguez
f48ad82d96
Merge 6147207155 into sapling-pr-archive-EntilZha 2025-02-24 17:35:42 -08:00
Pedro Rodriguez
6147207155 Pass mask in packing_iterator, correctly handle last batch, fix masking
This commit does/fixes the following:

1. Adds unit tests for byte and patch packing to ensure it works correctly
2. Fixes a bug where, for batches that end up with fewer than max_length bytes (e.g., short patches), the mask included elements that had value pad_id. This fixes the mask by setting it to != pad_id if it's not specified.
3. Correctly handles the last batch, which previously crashed. This didn't affect training since we had enough data and/or looped iterators, but it comes up when evaluating perplexity if we run validation on an entire file.
4. Correctly forward the mask if it exists for byte packing

Test Plan:

```
pytest bytelatent/
```

Testing these changes more thoroughly in a stacked PR that fixes evals
2025-02-25 01:35:38 +00:00
Pedro Rodriguez
2a04df1130
Merge 3aaeb8ac14 into sapling-pr-archive-EntilZha 2025-02-24 17:34:18 -08:00
Pedro Rodriguez
3aaeb8ac14 Pass mask in packing_iterator, correctly handle last batch, fix masking
This commit does/fixes the following:

1. Adds unit tests for byte and patch packing to ensure it works correctly
2. Fixes a bug where, for batches that end up with fewer than max_length bytes (e.g., short patches), the mask included elements that had value pad_id. This fixes the mask by setting it to != pad_id if it's not specified.
3. Correctly handles the last batch, which previously crashed. This didn't affect training since we had enough data and/or looped iterators, but it comes up when evaluating perplexity if we run validation on an entire file.
4. Correctly forward the mask if it exists for byte packing

Test Plan:

```
pytest bytelatent/
```

Testing these changes more thoroughly in a stacked PR that fixes evals
2025-02-25 01:34:13 +00:00
Pedro Rodriguez
f3781cc0ca merge commit for archive created by Sapling 2025-02-25 00:04:43 +00:00
Pedro Rodriguez
edccc0873d Remove byte tokenizer and add config args to switch between byte/patch packing
Some checks failed
Lint with Black / lint (push) Has been cancelled
Lint with isort / lint (push) Has been cancelled
Summary:

Test Plan:

```
python -m bytelatent.train config=../internal-blt/configs/entropy_model.yaml logging.wandb=null checkpoint.dump.every=1000 checkpoint.eval.every=100000 eval=null

pytest bytelatent/
```
2025-02-24 23:56:43 +00:00
Pedro Rodriguez
ff36aa8642
Add vocab and seq len abstract fields (#66)
Some checks are pending
Lint with Black / lint (push) Waiting to run
Lint with isort / lint (push) Waiting to run
2025-02-24 14:41:58 -08:00
Pedro Rodriguez
bbd1edd90d
Merge 4c6ee1aef0 into sapling-pr-archive-EntilZha 2025-02-24 14:41:33 -08:00
Pedro Rodriguez
4c6ee1aef0 Add vocab and seq len abstract fields
Some checks failed
Lint with Black / lint (push) Has been cancelled
Lint with isort / lint (push) Has been cancelled
2025-02-24 22:41:01 +00:00
Bocheng Li
a6ed14f689
Fix: Correct model_args usage in parallelize_model call (#69) 2025-02-24 14:40:38 -08:00
Pedro Rodriguez
de774bd98b
Merge 203bff3696 into sapling-pr-archive-EntilZha
Some checks failed
Lint with Black / lint (push) Has been cancelled
Lint with isort / lint (push) Has been cancelled
2025-02-21 17:27:19 -08:00
Pedro Rodriguez
203bff3696 Pass mask in packing_iterator, correctly handle last batch, fix masking
Some checks failed
Lint with Black / lint (push) Has been cancelled
Lint with isort / lint (push) Has been cancelled
This commit does/fixes the following:

1. Adds unit tests for byte and patch packing to ensure it works correctly
2. Fixes a bug where, for batches that end up with fewer than max_length bytes (e.g., short patches), the mask included elements that had value pad_id. This fixes the mask by setting it to != pad_id if it's not specified.
3. Correctly handles the last batch, which previously crashed. This didn't affect training since we had enough data and/or looped iterators, but it comes up when evaluating perplexity if we run validation on an entire file.
4. Correctly forward the mask if it exists for byte packing

Test Plan:

```
pytest bytelatent/
```

Testing these changes more thoroughly in a stacked PR that fixes evals
2025-02-22 01:27:13 +00:00
Pedro Rodriguez
a0fa496aa2 merge commit for archive created by Sapling 2025-02-22 01:22:31 +00:00
Pedro Rodriguez
1ede87e1ae Pass mask in packing_iterator, correctly handle last batch 2025-02-22 01:22:25 +00:00
Pedro Rodriguez
c233487b95
Merge 2655e4cf82 into sapling-pr-archive-EntilZha 2025-02-21 17:13:18 -08:00
Pedro Rodriguez
2655e4cf82 Remove byte tokenizer and add config args to switch between byte/patch packing
Summary:

Test Plan:

```
python -m bytelatent.train config=../internal-blt/configs/entropy_model.yaml logging.wandb=null checkpoint.dump.every=1000 checkpoint.eval.every=100000 eval=null

pytest bytelatent/
```
2025-02-22 01:13:13 +00:00
Pedro Rodriguez
44b1e5eaa1
Merge edf86f6689 into sapling-pr-archive-EntilZha 2025-02-21 17:12:00 -08:00
Pedro Rodriguez
edf86f6689 Remove byte tokenizer and add config args to switch between byte/patch packing
Summary:

Test Plan:
2025-02-22 01:05:59 +00:00
Pedro Rodriguez
62a3ff55bf merge commit for archive created by Sapling 2025-02-22 00:46:36 +00:00
Pedro Rodriguez
eac7a3fdbe Pass mask in packing_iterator, correctly handle last batch 2025-02-22 00:46:29 +00:00
Pedro Rodriguez
fc3399ef40
Update iterator inheritance, pass file format args, limit iterator (#63)
Some checks failed
Lint with Black / lint (push) Has been cancelled
Lint with isort / lint (push) Has been cancelled
- Create a common class to use in all inheritance for states
- Add a limit iterator that we can use in evals
- Modify ArrowFileIterator behavior to not do arrow path inference if file_format='json'
- Make EvalArgs valid
- Move testing iterators to a common directory to allow usage in multiple test files
- Make it so that SequenceIterator can take a None rng_state, to disable all rng ops (for eval mainly)

Test Plan:

- `pytest bytelatent`
- `python -m bytelatent.train config=../internal-blt/configs/entropy_model.yaml logging.wandb=null eval=null`
2025-02-21 16:21:07 -08:00
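Two of the changes listed in this commit, the limit iterator and the `None` rng_state, can be sketched roughly as follows. The function names are hypothetical and the repository's iterator classes are not reproduced here. In particular, an integer seed stands in for `rng_state`, which is an assumption made only for this example.

```python
import itertools
import random
from typing import Iterable, Iterator, List, Optional, TypeVar

T = TypeVar("T")


def limit_iterator(items: Iterable[T], limit: int) -> Iterator[T]:
    """Yield at most `limit` items, e.g. to cap the data consumed by an eval."""
    return itertools.islice(items, limit)


def maybe_shuffle(items: List[T], rng_state: Optional[int]) -> List[T]:
    """Shuffle with a seeded RNG, or skip all RNG work when rng_state is None."""
    out = list(items)
    if rng_state is not None:
        random.Random(rng_state).shuffle(out)
    return out


first_three = list(limit_iterator(range(10), 3))
eval_order = maybe_shuffle([1, 2, 3, 4], rng_state=None)  # order is preserved
```

Skipping the RNG entirely when `rng_state` is `None` keeps evaluation order deterministic without having to pick a seed.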
Pedro Rodriguez
92b9a75391 merge commit for archive created by Sapling
Some checks are pending
Lint with Black / lint (push) Waiting to run
Lint with isort / lint (push) Waiting to run
2025-02-21 19:26:36 +00:00
Pedro Rodriguez
3e9de62763 Pass mask in packing_iterator, correctly handle last batch
Some checks are pending
Lint with Black / lint (push) Waiting to run
Lint with isort / lint (push) Waiting to run
2025-02-21 19:26:29 +00:00
Pedro Rodriguez
06a17a0ddc
Merge 45456fa6d8 into sapling-pr-archive-EntilZha
Some checks are pending
Lint with Black / lint (push) Waiting to run
Lint with isort / lint (push) Waiting to run
2025-02-20 12:16:25 -08:00
Pedro Rodriguez
86abff94d0
Merge 55ddb0f84b into sapling-pr-archive-EntilZha 2025-02-20 12:16:14 -08:00
Pedro Rodriguez
45456fa6d8 Add vocab and seq len abstract fields 2025-02-20 20:15:46 +00:00
Pedro Rodriguez
55ddb0f84b Pass mask in packing_iterator, correctly handle last batch 2025-02-20 20:15:46 +00:00
Pedro Rodriguez
8baeef13a1 merge commit for archive created by Sapling
Some checks are pending
Lint with Black / lint (push) Waiting to run
Lint with isort / lint (push) Waiting to run
2025-02-20 00:57:24 +00:00
Pedro Rodriguez
0ffe2ab685 Update iterator inheritance, pass file format args, limit iterator
- Create a common class to use in all inheritance for states
- Add a limit iterator that we can use in evals
- Modify ArrowFileIterator behavior to not do arrow path inference if file_format='json'
- Make EvalArgs valid
- Move testing iterators to a common directory to allow usage in multiple test files
- Make it so that SequenceIterator can take a None rng_state, to disable all rng ops (for eval mainly)

Test Plan:

- `pytest bytelatent`
- `python -m bytelatent.train config=../internal-blt/configs/entropy_model.yaml logging.wandb=null eval=null`
2025-02-20 00:57:17 +00:00
Pedro Rodriguez
3c1c247809
Merge 2a717d6b40 into sapling-pr-archive-EntilZha 2025-02-19 16:38:06 -08:00
Pedro Rodriguez
2a717d6b40 Update iterators 2025-02-20 00:35:04 +00:00
Pedro Rodriguez
b0956bde99
Make apex logs less noisy (#60)
Some checks failed
Lint with Black / lint (push) Has been cancelled
Lint with isort / lint (push) Has been cancelled
Summary:

Test Plan:
2025-02-18 10:45:56 -08:00
Pedro Rodriguez
4b57d05c3b
Merge 2f247263b9 into sapling-pr-archive-EntilZha
Some checks failed
Lint with Black / lint (push) Has been cancelled
Lint with isort / lint (push) Has been cancelled
2025-02-18 10:43:12 -08:00
Pedro Rodriguez
2f247263b9 Make apex logs less noisy
Some checks failed
Lint with Black / lint (push) Has been cancelled
Lint with isort / lint (push) Has been cancelled
Summary:

Test Plan:
2025-02-18 18:43:06 +00:00
Pedro Rodriguez
82ab5930ec
Make it possible to specify multiple config files (#54)
Summary:

Make it possible to specify multiple config files.
Parsing CLI is not a special case anymore, just uses the same config inheritance method.

Test Plan:

Test that this interpolates in the right order via unit tests

Sample usage: loads the internal config, which references bytelatent/configs/entropy_model.yaml. The precedence order is:

- Default pydantic args
- Included configs, e.g. `config`
- CLI args

```
python -m bytelatent.print_config config=internal/configs/entropy_model.yaml eval=null

```


2025-02-18 10:42:44 -08:00
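The precedence order in this commit message can be sketched as a layered merge. This is an illustration only, not the repository's config loader: plain dictionaries stand in for the pydantic defaults and the YAML files, the helper names `deep_merge`, `parse_cli`, and `resolve` are hypothetical, and the sketch assumes the list runs from lowest to highest precedence, so that later layers override earlier ones.

```python
from typing import Any, Dict, List


def deep_merge(base: Dict[str, Any], override: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of base with override applied; nested dicts merge recursively."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


def parse_cli(args: List[str]) -> Dict[str, Any]:
    """Turn dotted key=value arguments into a nested dict; 'null' becomes None."""
    out: Dict[str, Any] = {}
    for arg in args:
        dotted, _, raw = arg.partition("=")
        *parents, leaf = dotted.split(".")
        node = out
        for part in parents:
            node = node.setdefault(part, {})
        node[leaf] = None if raw == "null" else raw
    return out


def resolve(
    defaults: Dict[str, Any], included: List[Dict[str, Any]], cli_args: List[str]
) -> Dict[str, Any]:
    """Apply layers in order: defaults, then included configs, then CLI args."""
    cfg = defaults
    for layer in included:
        cfg = deep_merge(cfg, layer)
    return deep_merge(cfg, parse_cli(cli_args))


defaults = {"logging": {"wandb": "on", "freq": 10}, "eval": {"every": 100}}
included = [{"logging": {"freq": 50}}]
cfg = resolve(defaults, included, ["logging.wandb=null", "eval=null"])
```

Treating the CLI as one more layer in the same merge mirrors the message's point that CLI parsing is no longer a special case.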
Pedro Rodriguez
75fd18716e merge commit for archive created by Sapling 2025-02-18 18:41:21 +00:00
Pedro Rodriguez
3117ac1f1f Make it possible to specify multiple config files
Some checks failed
Lint with Black / lint (push) Has been cancelled
Lint with isort / lint (push) Has been cancelled
Summary:

Make it possible to specify multiple config files.
Parsing CLI is not a special case anymore, just uses the same config inheritance method.

Test Plan:

Test that this interpolates in the right order via unit tests

Sample usage: loads the internal config, which references bytelatent/configs/entropy_model.yaml. The precedence order is:

- Default pydantic args
- Included configs, e.g. `config`
- CLI args

```
python -m bytelatent.print_config config=internal/configs/entropy_model.yaml eval=null

```


2025-02-18 18:41:02 +00:00
CharlesCNorton
9f29e0de18
fix(README): correct typo in quickstart instructions (#62)
Changed "your can activate the environment" to "you can activate the environment" for clarity.
2025-02-18 09:47:58 -08:00
Pedro Rodriguez
f912535cb7
Merge 655eca670d into sapling-pr-archive-EntilZha
Some checks failed
Lint with Black / lint (push) Has been cancelled
Lint with isort / lint (push) Has been cancelled
2025-02-14 15:46:06 -08:00
Pedro Rodriguez
88dedaa2ec
Merge a3e0647d03 into sapling-pr-archive-EntilZha 2025-02-14 15:45:43 -08:00
Pedro Rodriguez
655eca670d Minimal working eval
Some checks failed
Lint with Black / lint (push) Has been cancelled
Lint with isort / lint (push) Has been cancelled
Summary:

Test Plan:
2025-02-14 23:45:29 +00:00
Pedro Rodriguez
a3e0647d03 Make apex logs less noisy
Summary:

Test Plan:
2025-02-14 23:45:28 +00:00
Pedro Rodriguez
52590842e0 merge commit for archive created by Sapling 2025-02-14 22:51:24 +00:00
Pedro Rodriguez
f94babc94e Make it possible to specify multiple config files
Some checks failed
Lint with Black / lint (push) Has been cancelled
Lint with isort / lint (push) Has been cancelled
Summary:

Make it possible to specify multiple config files.
Parsing CLI is not a special case anymore, just uses the same config inheritance method.

Test Plan:

Test that this interpolates in the right order via unit tests

Sample usage: loads the internal config, which references bytelatent/configs/entropy_model.yaml. The precedence order is:

- Default pydantic args
- Included configs, e.g. `config`
- CLI args

```
python -m bytelatent.print_config config=internal/configs/entropy_model.yaml eval=null

```


2025-02-14 22:50:57 +00:00
Pedro Rodriguez
018bf98798
Merge aa78c96ea4 into sapling-pr-archive-EntilZha 2025-02-14 13:06:55 -08:00
Pedro Rodriguez
aa78c96ea4 Make it possible to specify multiple config files
Summary:

Make it possible to specify multiple config files.
Parsing CLI is not a special case anymore, just uses the same config inheritance method.

Test Plan:

Test that this interpolates in the right order via unit tests

Sample usage: loads the internal config, which references bytelatent/configs/entropy_model.yaml. The precedence order is:

- Default pydantic args
- Included configs, e.g. `config`
- CLI args

```
python -m bytelatent.print_config config=internal/configs/entropy_model.yaml eval=null

```
2025-02-14 21:06:50 +00:00
Pedro Rodriguez
ed6300375f
Merge bec0164820 into sapling-pr-archive-EntilZha 2025-02-14 13:04:04 -08:00
Pedro Rodriguez
bec0164820 Make it possible to specify multiple config files
Summary:

Test Plan:

Test that this interpolates in the right order: config -> configs -> CLI args

```
# All three sources
python -m bytelatent.print_config config=bytelatent/configs/debug.yaml configs=[internal/configs/s3_debug.yaml] eval=null

# What worked before
python -m bytelatent.print_config config=internal/configs/s3_debug.yaml eval=null
```
2025-02-14 21:03:57 +00:00