Abstract: In recent years, microservice architecture has steadily gained popularity among internet companies. However, because microservice systems are complex and change dynamically, failure detection has become more challenging. Traditional root cause analysis methods mostly rely on a single data modality, which cannot capture all failure information. Existing multimodal methods require high-quality labeled samples and often struggle to classify unknown failure categories. To address these challenges, this paper proposes a root cause analysis framework based on a masked graph autoencoder (GAE). The framework consists of three stages: feature extraction, GAE-based feature dimensionality reduction, and online clustering combined with expert input. The method is evaluated on two public datasets against two baseline methods and shows significant advantages when 16% of the samples are labeled.
Keywords: root cause analysis; multimodal data; self-supervised learning; online clustering