基于云计算的大数据挖掘平台

发布时间:2013-08-12 作者:何清,庄福振 阅读量:

[摘要] 开发了一个基于云计算的并行分布式大数据挖掘平台——PDMiner。PDMiner实现了各种并行数据挖掘算法,如数据预处理、关联规则分析以及分类、聚类等算法。实验结果表明,并行分布式数据挖掘平台PDMiner中实现的并行算法,能够处理大规模数据集,达到太字节级;具有很好的加速比性能;实现的并行算法可以在商用机器构建的并行平台上稳定运行,整合了已有的计算资源,提高了计算资源的利用效率;可以有效地应用到实际海量数据挖掘中。在PDMiner中还开发了工作流子系统,提供友好统一的接口界面方便用户定义数据挖掘任务。

[关键词] 云计算;分布式并行数据挖掘;海量数据

[Abstract] In this paper, we develop a parallel and distributed data mining toolkit platform called PDMiner. This platform is based on cloud computing. PDMiner is used to preprocess data, analyze association rules, and parallel classification and clustering. Our experimental results show that the parallel algorithms in PDMiner can tackle data sets up to one terabyte. They are very efficient because they have good speedup, and they are easily extended so that they can be executed in a cluster of commodity machines. This means that full use is made of computing resources. The algorithms are also efficient for practical data mining. We also develop a knowledge flow subsystem that helps the user define a data mining task in PDMiner.

[Keywords] cloud computing; parallel and distributed data mining; big data