CommFuse: Hiding Tail Latency via Communication Decomposition and Fusion for Distributed LLM Training
Abstract
arXiv:2604.24013v2 Announce Type: cross

The rapid growth in the size of large language models has necessitated the partitioning of computational workloads across accelerators such as GPUs, TPUs, and NPUs. However, these parallelization strategies incur substantial data communication overhead…
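The abstract is truncated here, but the title names the core idea: decomposing large collectives and fusing/overlapping them with computation so that communication tail latency is hidden. Below is a minimal sketch of that general pattern, not the paper's actual implementation: it splits one large gradient all-reduce into chunks, launches each chunk asynchronously, and interleaves independent compute while the chunk is in flight. The function name `chunked_allreduce_overlap`, the `num_chunks` parameter, and the `compute_fns` callbacks are all hypothetical; it assumes a `torch.distributed` process group has already been initialized.

```python
# A hedged sketch of communication decomposition + overlap, assuming an
# initialized torch.distributed process group. NOT the paper's method.
import torch
import torch.distributed as dist

def chunked_allreduce_overlap(grad: torch.Tensor, compute_fns, num_chunks: int = 4):
    """Reduce `grad` across ranks in chunks, overlapping each chunk's
    communication with an independent computation callback."""
    # Views into the flattened gradient; all-reducing a view updates `grad` in place.
    chunks = list(torch.chunk(grad.view(-1), num_chunks))
    handles = []
    for chunk, fn in zip(chunks, compute_fns):
        # Launch this chunk's all-reduce without blocking the caller.
        handles.append(dist.all_reduce(chunk, op=dist.ReduceOp.SUM, async_op=True))
        # Overlap: run computation that does not depend on this chunk.
        fn()
    # Drain all outstanding communication before the reduced gradient is used.
    for h in handles:
        h.wait()
    return grad
```

The design point this illustrates is the one the title gestures at: a single monolithic collective serializes behind the slowest rank, whereas smaller decomposed chunks expose more opportunities to hide each piece's latency behind compute.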