Evolution of Scale-Out Networks in Intelligent Computing Centers and the Practice of GSE

Published: 2025-05-15 Authors: 程伟强, 李新双, 白艳, 吕勇

 

Abstract: This paper examines the technical challenges facing intelligent computing center networks in the era of large artificial intelligence (AI) models. It analyzes the limitations of traditional Internet Protocol (IP) networks in load balancing and burst traffic handling, and compares two technical routes: optimization based on remote direct memory access over converged Ethernet (RoCE), and reconstruction of the network architecture. The study focuses on global scheduling Ethernet (GSE), a technology developed independently in China, and describes its two core mechanisms: load balancing based on the packet container (PKTC) and end-to-end congestion control based on the dynamic global scheduling queue (DGSQ). Together, these mechanisms effectively address traffic polarization and congestion-induced packet loss in intelligent computing networks. The paper also systematically analyzes the architecture of GSE network equipment in interface design, forwarding engines, and queue management, and demonstrates the technical advantages of GSE in building high-bandwidth, low-latency, non-blocking networks, providing a reference for the evolution of intelligent computing center networks.

Keywords: large AI models; intelligent computing center networks; scale-out; GSE; RoCE; load balancing; congestion avoidance