Parallel Spectral Clustering Based on MapReduce

Release Date:2013-07-24 Author:Qiwei Zhong, Yunlong Lin, Junyang Zou, Kuangyan Zhu, Qiao Wang, and Lei Hu Click:

[Abstract] Clustering is one of the most widely used techniques for exploratory data analysis. Spectral clustering algorithm, a popular modern clustering algorithm, has been shown to be more effective in detecting clusters than many traditional algorithms. It has applications ranging from computer vision and information retrieval to social science and biology. With the size of databases soaring, clustering algorithms have scaling computational time and memory use. In this paper, we propose a parallel spectral clustering implementation based on MapReduce. Both the computation and data storage are distributed, which solves the scalability problems for most existing algorithms. We empirically analyze the proposed implementation on both benchmark networks and a real social network dataset of about two million vertices and two billion edges crawled from Sina Weibo. It is shown that the proposed implementation scales well, speeds up the clustering without sacrificing quality, and processes massive datasets efficiently on commodity machine clusters.

[Keywords] Clustering is one of the most widely used techniques for exploratory data analysis. Spectral clustering algorithm, a popular modern clustering algorithm, has been shown to be more effective in detecting clusters than many traditional algorithms. It has applications ranging from computer vision and information retrieval to social science and biology. With the size of databases soaring, clustering algorithms have scaling computational time and memory use. In this paper, we propose a parallel spectral clustering implementation based on MapReduce. Both the computation and data storage are distributed, which solves the scalability problems for most existing algorithms. We empirically analyze the proposed implementation on both benchmark networks and a real social network dataset of about two million vertices and two billion edges crawled from Sina Weibo. It is shown that the proposed implementation scales well, speeds up the clustering without sacrificing quality, and processes massive datasets efficiently on commodity machine clusters.