vikarti.anatra/blt

mirror of https://github.com/facebookresearch/blt.git synced 2025-02-23 21:42:14 +00:00

Commit graph

Author	SHA1	Message	Date
Pedro Rodriguez	ac257bac19	Fix distributed all reduce grad norm Summary: With >1 GPU, but only 1 node, all reduces fail when inputs are not bf16. This uses a modified copy of torch's grad norm to avoid failures Test Plan: - Run unit tests: - Run single gpu training: `python -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100` - Run 1 node, multi-gpu training `torchrun --nproc-per-node 8 -m bytelatent.train config=internal/configs/s3_debug.yaml eval=null checkpoint.dump.every=100`	2025-02-05 00:52:52 +00:00
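
The commit message above refers to all-reducing a gradient norm across GPUs. As a rough illustration of that idea only (not the commit's code, and not torch's implementation), the sketch below assumes gradients are sharded across ranks: each rank sums the squares of its own shard, the sums are all-reduced, and the square root gives the global L2 norm. The function names are hypothetical, the all-reduce is simulated with a plain Python list, and the dtype handling the commit describes is left out.

```python
import math


def local_sq_norm(grads):
    """Sum of squared gradient entries held by one rank (hypothetical helper)."""
    return sum(g * g for g in grads)


def simulated_all_reduce_sum(values):
    """Stand-in for a distributed all-reduce (sum): every rank receives the total."""
    total = sum(values)
    return [total for _ in values]


def global_grad_norm(per_rank_grads):
    """Return the global L2 gradient norm as seen by each rank."""
    # Each rank reduces its own shard to a single scalar first.
    squared = [local_sq_norm(g) for g in per_rank_grads]
    # Only the scalars are exchanged, so the reduction stays cheap.
    reduced = simulated_all_reduce_sum(squared)
    return [math.sqrt(t) for t in reduced]


# Two simulated ranks, one gradient entry each.
print(global_grad_norm([[3.0], [4.0]]))
```

Every rank ends up with the same norm, which is what lets gradient clipping apply a consistent scale factor everywhere.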

1 commit