ScaleAcross: Designing Multi-Data-Center Infrastructure for Geo-Distributed AI Training
Naved Inam, Aryan Alpesh Bhavsar, Masabattula Teja Nikhil, Sidharth Sharma
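AI summary: This paper proposes a scalable EVPN-VXLAN-based emulation framework for studying synchronization-intensive communication and cross-site data exchange in geo-distributed AI training, and uses ECMP, BFD, and a queue-pair-aware traffic distribution mechanism to improve performance.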
The rapid growth of AI models and increasing data sovereignty requirements are driving the transition toward geo-distributed AI training across multiple data centers. Such deployments introduce system-level challenges arising from synchronization-intensive communication, cross-site data exchange, and wide-area latency constraints. This paper investigates EVPN-VXLAN as an infrastructure foundation for geo-distributed AI training environments and presents a scalable emulation framework for systematically studying distributed AI workloads under realistic wide-area conditions. The proposed framework combines VXLAN overlays with EVPN-based inter-data-center connectivity and is implemented using ContainerLab and FRRouting (FRR). The framework further incorporates Equal-Cost Multi-Path (ECMP) routing, Bidirectional Forwarding Detection (BFD), and a queue-pair-aware traffic distribution mechanism designed to improve communication behavior for synchronization-intensive AI workloads while preserving compatibility with commodity infrastructure. Using realistic WAN emulation, we characterize communication and system behavior under distributed training workloads employing AllReduce and Parameter Server communication patterns. Results provide insights into traffic distribution, resilience, and infrastructure behavior in geo-distributed AI environments, highlighting the potential of reproducible multi-data-center infrastructure frameworks for scalable distributed AI training.
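To make the idea of combining ECMP with queue-pair awareness concrete, the following is a minimal sketch and not the paper's mechanism, which the abstract does not specify. It assumes RDMA-style traffic over UDP (RoCEv2, destination port 4791) and shows how giving each queue pair its own UDP source port lets a 5-tuple ECMP hash spread one host pair's traffic over several equal-cost paths. The function names, the port mapping in `qp_source_port`, the CRC32 hash, and the IP addresses are all hypothetical choices made for illustration.

```python
import zlib
from collections import Counter

ROCE_V2_DST_PORT = 4791  # IANA-assigned UDP destination port for RoCEv2
UDP_PROTO = 17           # IP protocol number for UDP


def ecmp_path(src_ip, dst_ip, src_port, dst_port, proto, n_paths):
    """Pick an equal-cost path index by hashing the 5-tuple.

    CRC32 is an illustrative stand-in; real switches use
    vendor-specific hash functions.
    """
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{proto}".encode()
    return zlib.crc32(key) % n_paths


def qp_source_port(qp_number):
    """Hypothetical mapping from a queue-pair number to a UDP source
    port in the dynamic range 49152-65535 (16384 ports)."""
    return 49152 + (qp_number % 16384)


def distribute(qp_numbers, src_ip, dst_ip, n_paths):
    """Count how many queue pairs land on each equal-cost path."""
    counts = Counter()
    for qp in qp_numbers:
        path = ecmp_path(src_ip, dst_ip, qp_source_port(qp),
                         ROCE_V2_DST_PORT, UDP_PROTO, n_paths)
        counts[path] += 1
    return counts


if __name__ == "__main__":
    # 64 queue pairs between one pair of hosts, 4 equal-cost paths.
    per_path = distribute(range(64), "10.0.1.10", "10.0.2.10", 4)
    for path in sorted(per_path):
        print(f"path {path}: {per_path[path]} queue pairs")
```

Without per-queue-pair entropy, all traffic between the two hosts would share one 5-tuple and therefore one path. Varying the source port per queue pair gives the hash something to spread, while every packet of a given queue pair still follows a single path, which avoids reordering within that queue pair.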