Shandong Mobile and ZTE Harness Multi-Agent Collaboration for End-to-End Fault Management

Release Date: 2025-09-22 By Zhou Bo, Song Peishuang

China Mobile serves over one billion subscribers and operates the world’s largest 5G network. As the leader in the modern mobile industry chain, it spearheads national tech innovation and drives coordinated ecosystem growth. Autonomous network (AN) is the first of 10 key sub-chains launched by the group, aimed at lifting O&M efficiency through AI and accelerating the journey to high-level autonomy.

In 2024, China Mobile issued joint research propositions for the AN sub-chain to its provincial branches and industry partners, calling on the industry to focus on cutting-edge AI technologies and join efforts to overcome bottlenecks in AN.

The rapidly increasing network scale and business complexity, coupled with scattered O&M tools and know-how, are throttling fault response. Faster detection, accurate root cause localization, streamlined dispatch, and closed-loop handling, all while easing engineers' workloads, have become mission-critical.

In response, Shandong Mobile and ZTE have partnered under China Mobile’s AN sub-chain, forming a joint team that harnesses AI O&M large models to drive breakthroughs in fault monitoring.

Integrated AI Computing Appliance: A Foundation for Large-Model Innovation

GPU shortages pose a major risk to large-model deployment. To address this, the project team investigated networking conditions in depth and conducted site surveys. After extensive discussion of candidate solutions, a full-stack intelligent computing solution was developed, taking only 45 days from demand initiation to implementation and laying a solid foundation for the innovative application of large models.

ZTE provides a full-stack intelligent computing solution covering computing power, network, capabilities, intelligence, and applications, meeting differentiated performance, cost, and service requirements across diverse scenarios. The appliance integrates high-performance GPUs, user-friendly training and inference platforms, and mainstream large models, solving the "last mile" problem in the commercial implementation of large models.

Through hardware-software collaborative optimization, GPU performance is maximized while development complexity is reduced. End-to-end toolchains covering data preparation, model training, and inference significantly lower the technical barriers for enterprises to develop AI applications, enabling operators to go from idea to production within hours.

The appliance supports ZTE’s Nebula Telecom Large Model and mainstream open-source large models, offers one-click RAG deployment, and secures the entire pipeline with full-link encryption and zero-trust identity controls—delivering fast, secure, and stable inference while ensuring data security and privacy for operators.

Exploring "AI+" Fault Monitoring Innovation

Powered by the Nebula Telecom Large Model engine, ZTE has introduced large–small model collaboration and multi-agent collaboration to empower "AI+" fault monitoring scenarios. This enables more accurate fault identification and root-cause analysis, more efficient process integration for faster fault closure, and smarter intent-driven O&M, reducing the workload of O&M engineers.

Multi-Agent Collaboration Drives Efficient Closed Loops

As shown in Fig. 1, the solution adopts a two-level intent routing strategy. The master control agent performs intent recognition, routing, and process control, assigning tasks to specialized business agents such as identification, analysis, scheduling, and evaluation. These business agents complete their respective tasks and return results to the master control agent, which then determines whether to proceed to the next stage. Through such collaboration, the fault handling process is driven towards a closed loop.
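The routing and process control described above can be illustrated with a minimal sketch. It is not ZTE's implementation: the class and function names, the keyword-based intent check, and the fixed stage order are all assumptions made for illustration, and each business agent is a plain Python callable standing in for a model-backed agent.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical stage order, taken from the business agents named in the text.
STAGES = ["identification", "analysis", "scheduling", "evaluation"]


@dataclass
class StageResult:
    stage: str
    ok: bool      # whether the master agent may proceed to the next stage
    detail: str


class MasterControlAgent:
    """Illustrative master agent: recognizes intent, routes, controls the flow."""

    def __init__(self, agents: Dict[str, Callable[[dict], StageResult]]):
        self.agents = agents

    def recognize_intent(self, request: str) -> str:
        # Level 1 (assumed): a keyword check stands in for model-based intent recognition.
        text = request.lower()
        return "fault_handling" if ("fault" in text or "alarm" in text) else "other"

    def handle(self, request: str, context: dict) -> List[StageResult]:
        if self.recognize_intent(request) != "fault_handling":
            return []
        results: List[StageResult] = []
        # Level 2 (assumed): dispatch to each business agent in turn.
        for stage in STAGES:
            result = self.agents[stage](context)
            results.append(result)
            if not result.ok:
                # Stop the loop; a real system might retry or escalate here.
                break
        return results


def make_agent(stage: str, ok: bool = True) -> Callable[[dict], StageResult]:
    """Build a stub business agent that records its stage in the shared context."""
    def agent(context: dict) -> StageResult:
        context.setdefault("trace", []).append(stage)
        return StageResult(stage, ok, f"{stage} finished")
    return agent


if __name__ == "__main__":
    master = MasterControlAgent({s: make_agent(s) for s in STAGES})
    for r in master.handle("Link fault alarm on router A", {}):
        print(r.stage, r.ok, r.detail)
```

In this sketch, every business agent returns its result to the master agent, which alone decides whether the next stage runs. This mirrors the closed-loop control described in the text.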

 

To minimize error accumulation in agent collaboration, the interaction paradigm is shifting from the application programming interface (API) to the language programming interface (LPI). Having agents interact through LPI improves the accuracy of multi-agent collaboration.

Large–Small Model Collaboration Enhances Event Detection  

In the fault detection phase, an identification agent is built on large–small model collaboration. The small model handles dynamic data aggregation, while the large model generates event summaries, providing concise yet comprehensive event conclusions and enabling intelligent event generation within one minute.
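The division of labor between the two models can be sketched as follows. This is an illustration under stated assumptions, not the project's pipeline: a fixed time-window grouping stands in for the small model's dynamic aggregation, the `summarize` callable stands in for a large-model call, and the `Alarm` fields and function names are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Alarm:
    timestamp: int  # seconds since some reference time (assumed unit)
    site: str
    code: str


def aggregate_alarms(alarms: List[Alarm], window_s: int = 60) -> Dict[Tuple[str, int], List[Alarm]]:
    """Small-model stand-in: group alarms by site and fixed time window."""
    groups: Dict[Tuple[str, int], List[Alarm]] = defaultdict(list)
    for alarm in sorted(alarms, key=lambda a: a.timestamp):
        groups[(alarm.site, alarm.timestamp // window_s)].append(alarm)
    return dict(groups)


def build_summary_prompt(site: str, group: List[Alarm]) -> str:
    """Turn one aggregated group into a compact prompt for the large model."""
    codes = ", ".join(sorted({a.code for a in group}))
    return f"Summarize the event at site {site}: {len(group)} alarms with codes {codes}."


def generate_events(alarms: List[Alarm], summarize: Callable[[str], str], window_s: int = 60) -> List[str]:
    """Aggregate first, then ask the (stubbed) large model for one summary per group."""
    groups = aggregate_alarms(alarms, window_s)
    return [summarize(build_summary_prompt(site, group)) for (site, _), group in groups.items()]


if __name__ == "__main__":
    demo = [Alarm(0, "A", "LOS"), Alarm(30, "A", "LINK_DOWN"), Alarm(130, "B", "POWER")]
    # An echo function stands in for the large model in this demo.
    for event in generate_events(demo, summarize=lambda prompt: prompt):
        print(event)
```

Because aggregation runs first, the large model sees one compact prompt per event instead of every raw alarm. This is one plausible reason such a split helps with latency and cost.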

CoT Reasoning Reshapes Fault Localization Capabilities

In the fault analysis phase, fault knowledge is fused with chain-of-thought (CoT) causal reasoning. The approach reasons over both the fault case database and live alarm data, raising the accuracy of multi-factor, cross-domain fault analysis to over 91%.
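One way to picture how a fault case database and alarm data might feed CoT reasoning is the sketch below. It is a hypothetical illustration only: Jaccard similarity over alarm codes is a swapped-in retrieval method, the prompt wording is invented, and the article does not describe the project's actual retrieval or prompting.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Set


@dataclass(frozen=True)
class FaultCase:
    alarm_codes: FrozenSet[str]
    root_cause: str


def retrieve_cases(cases: List[FaultCase], observed: Set[str], top_k: int = 2) -> List[FaultCase]:
    """Rank historical cases by Jaccard similarity of alarm-code sets (assumed metric)."""
    def score(case: FaultCase) -> float:
        union = case.alarm_codes | observed
        return len(case.alarm_codes & observed) / len(union) if union else 0.0

    ranked = sorted(cases, key=score, reverse=True)
    # Keep only cases that share at least one alarm code with the observation.
    return [c for c in ranked[:top_k] if score(c) > 0]


def build_cot_prompt(observed: Set[str], cases: List[FaultCase]) -> str:
    """Assemble a step-by-step reasoning prompt from alarms and retrieved cases."""
    lines = [
        "Observed alarms: " + ", ".join(sorted(observed)),
        "Similar historical cases:",
    ]
    for i, case in enumerate(cases, 1):
        lines.append(f"{i}. alarms={sorted(case.alarm_codes)} -> root cause: {case.root_cause}")
    lines += [
        "Reason step by step:",
        "1. Which alarms are likely symptoms rather than causes?",
        "2. Which domain (IP, transmission, power and environment) failed first?",
        "3. State the most likely root cause and the supporting evidence.",
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    db = [
        FaultCase(frozenset({"POWER_FAIL", "LOS"}), "mains power outage"),
        FaultCase(frozenset({"BGP_DOWN"}), "routing misconfiguration"),
    ]
    seen = {"POWER_FAIL", "LOS", "BGP_DOWN"}
    print(build_cot_prompt(seen, retrieve_cases(db, seen)))
```

The resulting prompt would be passed to a large model. Grounding the reasoning steps in retrieved cases is the general idea behind combining a knowledge base with CoT.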

Shifting Backend Capabilities to Mobile

A scheduling agent and a knowledge copilot are developed based on large models, bringing fault knowledge, operational data, and atomic API capabilities to the mobile app. This enhances on-site engineers' ability to solve problems independently and improves the interaction efficiency between frontline and back-office teams.

Decoupled Capabilities Accelerate Value Creation

By opening up capabilities and embedding AI into the existing network fault management system, scheduling system, and mobile app, the team has carried out pilot verification in cross-domain fault scenarios spanning IP networks, transmission networks, and power and environment systems. The pilots achieved 1-minute intelligent fault detection and 91% root cause analysis accuracy while minimizing manual effort.

The project has been recognized by TM Forum’s GenAI IG1345, the China Communications Standards Association (CCSA), and ICT China 2024 Excellent Cases, and has received the 2024 BRICS Industrial Innovation Contest Excellent Project Award, providing a valuable reference for large-model applications in the telecommunications industry.

In the future, ZTE and Shandong Mobile will continue to deepen their cooperation, further expand value-driven fault monitoring scenarios, accelerate the integration of AI innovation into O&M workflows, reduce the workload of O&M personnel, enhance operational efficiency, and achieve a closed loop of value creation and results.