Summary:
Make it possible to specify multiple config files.
Parsing the CLI is no longer a special case; it uses the same config inheritance mechanism.
Test Plan:
Test that configs interpolate in the right order via unit tests.
Sample usage below loads the internal config, which references bytelatent/configs/entropy_model.yaml; a sketch of the merge order follows the command. The precedence order is:
- Default pydantic args
- Included configs, e.g. `config`
- CLI args
```
python -m bytelatent.print_config config=internal/configs/entropy_model.yaml eval=null
```
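To illustrate that precedence, here is a minimal sketch assuming a plain dict merge over a pydantic model; the `TrainArgs` fields below are placeholders, not the repo's actual schema.
```
# Sketch only: pydantic defaults < included configs < CLI args.
from typing import Optional
from pydantic import BaseModel

class TrainArgs(BaseModel):
    eval: Optional[str] = "default-eval"   # pydantic default
    dump_dir: str = "/tmp/dump"

def merge(*sources: dict) -> dict:
    merged: dict = {}
    for src in sources:   # later sources take precedence
        merged.update(src)
    return merged

included = {"eval": "from-included-config", "dump_dir": "/checkpoints"}
cli = {"eval": None}                        # e.g. eval=null on the command line
args = TrainArgs(**merge(included, cli))
print(args.eval, args.dump_dir)             # None /checkpoints
```
Later sources win, so `eval=null` on the CLI overrides anything an included config sets.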
Summary:
Currently, the arrow iterator can only read arrow files. However, the pyarrow library can read
other formats, including jsonlines. This change allows the same ArrowIterator to read from jsonlines,
so we can read from the original source data and simply omit the entropy column when doing so.
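As a rough illustration of the pattern (not the PR's actual code), pyarrow exposes readers for both Arrow IPC files and jsonlines, so one iterator can dispatch on the file format; the `entropies` column name and the `file_format` argument are assumptions.
```
import pyarrow as pa
import pyarrow.ipc as pa_ipc
import pyarrow.json as pa_json

def read_table(path: str, file_format: str) -> pa.Table:
    if file_format == "arrow":
        with pa.memory_map(path, "r") as source:
            return pa_ipc.open_file(source).read_all()
    elif file_format == "json":
        # pyarrow.json.read_json reads newline-delimited JSON by default
        return pa_json.read_json(path)
    raise ValueError(f"unknown format: {file_format}")

def iter_rows(path: str, file_format: str):
    table = read_table(path, file_format)
    has_entropy = "entropies" in table.column_names
    for batch in table.to_batches():
        for row in batch.to_pylist():
            if not has_entropy:
                row["entropies"] = None  # raw source data has no entropy column
            yield row
```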
Test Plan:
Run the train script until the dataloader starts.
Summary:
- Make the data/checkpoint code fsspec compatible (a rough sketch of the pattern follows below)
- Still will not work with S3 saves, since `torch.distributed.checkpoint.save` does not work with `fsspec` out of the box. Will implement in a followup PR.
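A minimal sketch of the fsspec-style access this aims for, assuming the standard fsspec API; the paths and bucket are placeholders.
```
# Sketch only: the same calls work for local paths and s3:// URLs via fsspec.
import fsspec

def save_bytes(path: str, payload: bytes) -> None:
    # fsspec.open resolves the right filesystem from the URL scheme.
    with fsspec.open(path, "wb") as f:
        f.write(payload)

def exists(path: str) -> bool:
    fs, _, (resolved,) = fsspec.get_fs_token_paths(path)
    return fs.exists(resolved)

save_bytes("s3://my-bucket/checkpoints/step_100/meta.json", b"{}")  # placeholder bucket
```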
Test Plan:
Run unit tests and the commands below
```
python -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100
```
```
torchrun --nproc-per-node 8 -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100
```
These currently won't work due to the torch distributed save, but these should be tested at a later date.
```
python -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100 dump_dir=s3://blt/scratch/checkpoint-test/
```
```
torchrun --nproc-per-node 8 -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100 dump_dir=s3://blt/scratch/checkpoint-test/
```
Summary:
With >1 GPU but only 1 node, all-reduces fail when inputs are not bf16. This uses a modified copy of torch's grad norm computation to avoid those failures.
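A hedged sketch of the idea (not torch's or the repo's exact code): compute the local squared grad norm, cast to bf16 before the collective, then reduce across ranks.
```
# Sketch of the workaround described above (illustrative only).
import torch
import torch.distributed as dist

def grad_norm(parameters) -> torch.Tensor:
    grads = [p.grad for p in parameters if p.grad is not None]
    # Local sum of squared gradients.
    local_sq = torch.stack([g.detach().float().pow(2).sum() for g in grads]).sum()
    if dist.is_available() and dist.is_initialized():
        # Cast before the all-reduce; non-bf16 inputs are what triggered the failures.
        local_sq = local_sq.to(torch.bfloat16)
        dist.all_reduce(local_sq, op=dist.ReduceOp.SUM)
        local_sq = local_sq.float()
    return local_sq.sqrt()
```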
Test Plan:
- Run unit tests
- Run single-gpu training: `python -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100`
- Run single-node, multi-gpu training: `torchrun --nproc-per-node 8 -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100`
* allow loading of the entropy model directly
* remove unused argument
* remove spammy warning
* allow patch_batch_size to be adjusted in the forward() method
* revert to original patcher style, fix warning
* allow grads when calculating entropies
* fix grad flow
* return preds from calculate_entropies()
* remove legacy arg
* fix an error with monotonicity and small sequence lengths
* ensure patcher is serializable
* revert patcher to original
* remove unused import
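To make the grad-flow and `calculate_entropies()` commits above concrete, here is a heavily hedged sketch; the signature and argument names are illustrative, not the repo's actual API.
```
# Illustrative only: entropies computed with optional grad flow, preds returned too.
import torch
import torch.nn.functional as F

def calculate_entropies(model: torch.nn.Module, tokens: torch.Tensor,
                        enable_grad: bool = False):
    ctx = torch.enable_grad() if enable_grad else torch.no_grad()
    with ctx:
        preds = model(tokens)                      # [batch, seq, vocab] logits
        log_probs = F.log_softmax(preds, dim=-1)
        entropies = -(log_probs.exp() * log_probs).sum(dim=-1)
    return entropies, preds                        # return preds alongside entropies
```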
Summary:
- Refactor local model configs to be separate and clearer (illustrative sketch below)
- Add attention arguments and correct which attention is used in local models
- Preparation for adding an entropy train script
- Fix failing unit tests
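A hypothetical sketch of the kind of separate, clearer local-model config this refactor points toward; the class and field names (including the attention arguments) are placeholders, not the repo's actual schema.
```
# Illustrative only: separate config objects with explicit attention arguments.
from pydantic import BaseModel

class LocalModelArgs(BaseModel):
    dim: int = 512
    n_layers: int = 8
    n_heads: int = 8
    attn_impl: str = "sdpa"         # which attention implementation the local model uses
    attn_bias_type: str = "causal"  # attention mask/bias choice

class GlobalModelArgs(BaseModel):
    dim: int = 1024
    n_layers: int = 24
    n_heads: int = 16
```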
Test Plan:
Summary:
For compatibility with either local/NFS or S3 datasets, swap to fsspec.
Add a tool to compare local and remote filesystems.
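As an illustration of what such a comparison could look like (assumed fsspec usage; the paths and bucket are placeholders, not the actual tool):
```
# Sketch: list relative paths on each filesystem and diff the two sets.
import fsspec

def list_relative(url: str) -> set[str]:
    fs, _, (root,) = fsspec.get_fs_token_paths(url)
    return {p[len(root):].lstrip("/") for p in fs.find(root)}

local_only = list_relative("/data/blt") - list_relative("s3://my-bucket/blt")
remote_only = list_relative("s3://my-bucket/blt") - list_relative("/data/blt")
print(sorted(local_only), sorted(remote_only))
```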
Test Plan:
- Ran the regular train script
- Ran with a config pointing at data in S3