Traditional fault diagnosis suffers from lengthy procedures, inefficient fault location, dependence on individual experts, heavy manpower investment, and high O&M costs. Because its diagnosis rules and procedures are hard-coded, it cannot adapt quickly or flexibly to the variety of fault diagnosis scenarios.
ZTE has launched an intelligent fault location solution based on knowledge graph and machine learning. The solution uses machine learning to make fault diagnosis more intelligent and to shorten diagnosis time. It also combines knowledge graph and data mining with manual confirmation and summary to form a fault knowledge graph. Together, these help accumulate and pass on O&M knowledge and experience while reducing O&M costs.
Intelligent Fault Diagnosis Framework
The overall framework of the intelligent fault diagnosis solution is illustrated in Fig. 1. It comprises a fault location module, a fault knowledge graph generation module, an online and offline model training module, a fault labeling module, a fault knowledge graph module, and an AI model.
—Fault location module:
It is the core module, responsible for locating faults based on the trained AI model and the generated fault knowledge graph.
—Fault knowledge graph generation module:
It uses multi-dimensional data mining technology to mine the labeled fault data such as alarms, abnormal performance, abnormal logs and abnormal configurations, and generates a complete fault knowledge graph for fault diagnosis in combination with manual confirmation and summary.
—Online and offline model training module:
It is responsible for cleaning the labeled fault data, processing its features, generating training sample sets, and training AI models. In practice, labeled fault data from the existing network can be collected regularly and gathered in a data lake for offline training. Online training can also be performed directly on the labeled fault data.
—Fault labeling module:
It labels the restored faults in the existing network to form the labeled fault data for follow-up model training and system self-learning.
—Fault knowledge graph module:
It represents and stores fault knowledge using the knowledge graph technology, which can be used for fault location, recognition, classification, recovery and stop-loss.
—AI model:
It is used for fault location. Any suitable machine learning algorithm can be selected to build the AI model. This solution selects a Bayesian network and builds a Bayesian AI model.
Intelligent Fault Location Solution
Before faults can be located, a fault knowledge graph must be generated. With the help of knowledge graph and graph database technologies, multi-dimensional data mining is applied to labeled fault data such as alarms, abnormal performance, abnormal logs and abnormal configurations, and the results are combined with manual confirmation and summary to produce a complete fault knowledge graph. The graph covers fault modes, symptoms and propagation relationships, objects, diagnosis, impacts, root causes, stop-loss, recovery, and other related knowledge that can be used for intelligent fault diagnosis. An AI model must also be selected and trained offline on the training data. In this solution, a Bayesian network is selected as the AI model.
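The kind of knowledge listed above can be pictured as typed relations between entities. The following is only a minimal in-memory sketch: the class, the relation names, and the example fault are hypothetical, and a real deployment would store the graph in a graph database as the article describes.

```python
from collections import defaultdict


class FaultKnowledgeGraph:
    """Tiny in-memory triple store (illustrative stand-in for a graph database)."""

    def __init__(self):
        # subject -> list of (relation, object) pairs
        self._out = defaultdict(list)

    def add(self, subject, relation, obj):
        self._out[subject].append((relation, obj))

    def objects(self, subject, relation):
        """Return all objects linked to `subject` by `relation`."""
        return [o for r, o in self._out.get(subject, []) if r == relation]


kg = FaultKnowledgeGraph()
# Hypothetical entries: one root cause with its symptom, propagation,
# stop-loss action, and recovery action.
kg.add("fiber_break", "causes", "loss_of_signal")
kg.add("loss_of_signal", "propagates_to", "service_down")
kg.add("fiber_break", "stop_loss_by", "switch_to_protection_path")
kg.add("fiber_break", "recovered_by", "replace_fiber")

print(kg.objects("fiber_break", "causes"))
```

Keeping every kind of knowledge as a relation type means new knowledge (for example, impacts or diagnosis rules) can be added without changing the storage layout.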
The Bayesian network model is generated automatically by a generation algorithm from the fault modes, symptoms and propagation relationships, and root causes in the fault knowledge graph. The collected labeled fault data is then cleaned and feature-processed to produce the training sample sets, which are used to complete offline training of the generated model.
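As a rough illustration of this step, the sketch below derives a network structure from cause-to-symptom edges and estimates each conditional probability table by smoothed counting. The edge list, sample format, and smoothing parameter `alpha` are assumptions made for the example; the article does not describe ZTE's actual generation algorithm or training procedure.

```python
from collections import Counter

# Hypothetical (root cause -> symptom) edges taken from a fault knowledge graph.
KG_EDGES = [
    ("fiber_break", "loss_of_signal"),
    ("laser_failure", "loss_of_signal"),
    ("board_fault", "high_bit_error_rate"),
]


def build_structure(edges):
    """Return {symptom: sorted list of parent root causes}."""
    parents = {}
    for cause, symptom in edges:
        parents.setdefault(symptom, set()).add(cause)
    return {s: sorted(p) for s, p in parents.items()}


def fit(structure, samples, alpha=1.0):
    """Estimate P(symptom = 1 | parent values) with Laplace smoothing.

    samples: list of dicts mapping every variable name to 0 or 1.
    Only parent configurations that occur in the samples get an entry.
    """
    cpts = {}
    for symptom, parents in structure.items():
        seen, present = Counter(), Counter()
        for row in samples:
            key = tuple(row[p] for p in parents)
            seen[key] += 1
            present[key] += row[symptom]
        cpts[symptom] = {
            key: (present[key] + alpha) / (seen[key] + 2 * alpha) for key in seen
        }
    return cpts


structure = build_structure(KG_EDGES)
samples = [
    {"fiber_break": 1, "laser_failure": 0, "board_fault": 0,
     "loss_of_signal": 1, "high_bit_error_rate": 0},
    {"fiber_break": 1, "laser_failure": 0, "board_fault": 0,
     "loss_of_signal": 1, "high_bit_error_rate": 0},
    {"fiber_break": 0, "laser_failure": 0, "board_fault": 1,
     "loss_of_signal": 0, "high_bit_error_rate": 1},
]
cpts = fit(structure, samples)
```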
The primary goal of fault location is to determine where the fault is. For a transport network, this means finding the faulty NE. Given the characteristics of a transport network, the faulty service path is first identified from existing configuration data in order to narrow the scope of the search. The fault is then located using the fault knowledge graph: its knowledge of fault modes, symptoms and propagation relationships, and root causes is used to find the network nodes that show fault symptoms, which are treated as suspected fault nodes.
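The narrowing step can be sketched as a filter over the NEs on the service path. The function name, data shapes, and NE and symptom names below are hypothetical, chosen only to illustrate the idea.

```python
def suspected_fault_nodes(service_path, symptoms_by_ne, known_symptoms):
    """Return NEs on the service path that show symptoms known to the graph.

    service_path: ordered list of NE names taken from configuration data.
    symptoms_by_ne: {ne_name: set of currently observed symptom names}.
    known_symptoms: set of symptom names defined in the fault knowledge graph.
    """
    suspects = []
    for ne in service_path:
        matched = symptoms_by_ne.get(ne, set()) & known_symptoms
        if matched:
            suspects.append((ne, sorted(matched)))
    return suspects


# Hypothetical data: NE-9 raises an alarm but is not on the faulty service path,
# and "fan_speed_notice" is not a symptom defined in the knowledge graph.
path = ["NE-1", "NE-2", "NE-3"]
observed = {
    "NE-2": {"loss_of_signal", "fan_speed_notice"},
    "NE-9": {"loss_of_signal"},
}
known = {"loss_of_signal", "high_bit_error_rate"}
suspects = suspected_fault_nodes(path, observed, known)
```

Restricting the search to the service path first keeps unrelated alarms elsewhere in the network from being mistaken for suspected fault nodes.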
Another goal of fault location is to find the root cause. All suspected root causes can be found by running a graph search algorithm over the fault modes, symptoms and propagation relationships, and root causes in the fault knowledge graph. Given the fault symptom data, the trained Bayesian network model can then be used to infer the probability of each suspected root cause.
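A minimal sketch of these two steps follows: a breadth-first search backwards along propagation edges to collect suspected root causes, then Bayes' rule to rank them. The edges, priors, and conditional probabilities are made-up illustrative values, and the model assumes a single root cause with conditionally independent symptoms, which is a simplification of a general Bayesian network.

```python
from collections import deque

# Hypothetical propagation edges: cause -> list of effects.
EDGES = {
    "fiber_break": ["loss_of_signal"],
    "laser_failure": ["loss_of_signal"],
    "board_fault": ["high_bit_error_rate"],
    "loss_of_signal": ["service_down"],
    "high_bit_error_rate": ["service_down"],
}
ROOT_CAUSES = {"fiber_break", "laser_failure", "board_fault"}

# Illustrative parameters, as if produced by training.
PRIOR = {"fiber_break": 0.5, "laser_failure": 0.2, "board_fault": 0.3}
CPT = {  # P(symptom present | root cause)
    "fiber_break": {"loss_of_signal": 0.9, "high_bit_error_rate": 0.1},
    "laser_failure": {"loss_of_signal": 0.8, "high_bit_error_rate": 0.2},
    "board_fault": {"loss_of_signal": 0.1, "high_bit_error_rate": 0.9},
}
SYMPTOMS = ["loss_of_signal", "high_bit_error_rate"]


def suspected_root_causes(observed):
    """Walk propagation edges backwards from the observed symptoms."""
    reverse = {}
    for src, dsts in EDGES.items():
        for dst in dsts:
            reverse.setdefault(dst, []).append(src)
    seen, queue, found = set(observed), deque(observed), set()
    while queue:
        node = queue.popleft()
        if node in ROOT_CAUSES:
            found.add(node)
        for parent in reverse.get(node, []):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return found


def posterior(candidates, observed):
    """Normalized P(cause | symptoms) over the candidate root causes."""
    scores = {}
    for cause in candidates:
        p = PRIOR[cause]
        for s in SYMPTOMS:
            p *= CPT[cause][s] if s in observed else 1.0 - CPT[cause][s]
        scores[cause] = p
    total = sum(scores.values())
    return {cause: p / total for cause, p in scores.items()}


observed = {"loss_of_signal"}
candidates = suspected_root_causes(observed)
ranking = posterior(candidates, observed)
```

The graph search keeps the candidate set small, so the probabilistic model only has to rank causes that the knowledge graph already considers plausible.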
The last step is automatic fault diagnosis. Because the previous steps yield only suspected fault nodes with their suspected root causes and probabilities, each root cause must still be diagnosed automatically, using the diagnosis rules associated with it in the fault knowledge graph, to produce the final diagnosis result. For root causes that cannot be diagnosed automatically or that require human diagnosis, processing suggestions and probabilities are given directly.
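One way to picture this step is a lookup of per-root-cause checks, with a fallback for causes that have no automatic rule. The rule functions, power thresholds, and status strings below are hypothetical and serve only as an illustration.

```python
def diagnose(suspects, rules, telemetry):
    """Check each suspected root cause, most probable first.

    suspects: {root_cause: probability} from the inference step.
    rules: {root_cause: function(telemetry) -> bool}, hypothetical diagnosis rules.
    telemetry: measurements the rules may inspect.
    """
    report = []
    ranked = sorted(suspects.items(), key=lambda kv: kv[1], reverse=True)
    for cause, prob in ranked:
        rule = rules.get(cause)
        if rule is None:
            # No automatic rule: hand the cause and its probability to a person.
            status = "needs_manual_check"
        elif rule(telemetry):
            status = "confirmed"
        else:
            status = "ruled_out"
        report.append({"cause": cause, "probability": prob, "status": status})
    return report


# Hypothetical rules with made-up optical power thresholds.
RULES = {
    "fiber_break": lambda t: t["rx_power_dbm"] < -35.0,
    "laser_failure": lambda t: t["tx_power_dbm"] < -10.0,
}
telemetry = {"rx_power_dbm": -40.0, "tx_power_dbm": 1.5}
report = diagnose(
    {"fiber_break": 0.7, "laser_failure": 0.2, "connector_dirty": 0.1},
    RULES,
    telemetry,
)
```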
After fault recovery, fault labeling is carried out: the results of intelligent fault diagnosis are corrected where necessary and tagged with the correct fault labels, and the fault data is automatically saved to the fault database for subsequent online training. The purpose of fault labeling is to keep building training sample sets for subsequent model training and system self-learning.
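A labeled record might look like the sketch below, which keeps both the predicted and the confirmed root cause so that wrong predictions remain visible. The field names are hypothetical, and a plain list stands in for the fault database.

```python
from dataclasses import asdict, dataclass


@dataclass
class LabeledFault:
    fault_id: str
    symptoms: list
    predicted_cause: str   # what the intelligent diagnosis reported
    confirmed_cause: str   # the label confirmed after recovery

    @property
    def prediction_correct(self):
        return self.predicted_cause == self.confirmed_cause


def label_fault(diagnosis, confirmed_cause, fault_db):
    """Attach the confirmed label to a diagnosis result and store the record."""
    record = LabeledFault(
        fault_id=diagnosis["fault_id"],
        symptoms=sorted(diagnosis["symptoms"]),
        predicted_cause=diagnosis["predicted_cause"],
        confirmed_cause=confirmed_cause,
    )
    fault_db.append(asdict(record))
    return record


fault_db = []  # stand-in for the fault database
diagnosis = {
    "fault_id": "F-001",
    "symptoms": {"loss_of_signal"},
    "predicted_cause": "laser_failure",
}
record = label_fault(diagnosis, "fiber_break", fault_db)
```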
To improve the accuracy of fault location, the Bayesian network must also be trained online at regular intervals on the labeled fault data. Online training works much like offline training: the labeled fault data is first cleaned and processed to generate standard training sample sets, and the Bayesian network is then trained on them. Online model training gives the system the ability to learn by itself, so its fault location accuracy improves over time.
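The cleaning and retraining loop can be sketched as incremental count updates, so that each new batch of labeled data refines the probability estimates without retraining from scratch. The class, record fields, and cleaning rules here are assumptions for illustration, not the solution's actual pipeline.

```python
from collections import Counter, defaultdict


class OnlineCounts:
    """Running counts for P(symptom | root cause), updated batch by batch."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # Laplace smoothing strength (assumed value)
        self.cause_counts = Counter()
        self.symptom_counts = defaultdict(Counter)

    @staticmethod
    def clean(record):
        """Normalize names; return None for records without a confirmed label."""
        cause = (record.get("confirmed_cause") or "").strip().lower()
        if not cause:
            return None
        symptoms = {
            s.strip().lower() for s in record.get("symptoms", []) if s and s.strip()
        }
        return cause, symptoms

    def update(self, records):
        """Fold a batch of labeled records into the counts; return how many were used."""
        used = 0
        for record in records:
            sample = self.clean(record)
            if sample is None:
                continue
            cause, symptoms = sample
            self.cause_counts[cause] += 1
            for s in symptoms:
                self.symptom_counts[cause][s] += 1
            used += 1
        return used

    def p_symptom_given_cause(self, symptom, cause):
        num = self.symptom_counts[cause][symptom] + self.alpha
        return num / (self.cause_counts[cause] + 2 * self.alpha)


model = OnlineCounts()
first_batch = [
    {"confirmed_cause": " Fiber_Break ", "symptoms": ["LOS", ""]},
    {"confirmed_cause": None, "symptoms": ["LOS"]},  # unlabeled, skipped
]
used = model.update(first_batch)
```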
ZTE's intelligent fault diagnosis solution, based on a fault knowledge graph and an AI model, provides automatic fault diagnosis and accumulates and passes on O&M knowledge and experience. The solution significantly improves the efficiency of fault diagnosis, reduces O&M costs, and helps operators build quality transport networks.