Research on Data Object Cache Optimization for the Spark Computing Engine

Published: 2016-03-17    Authors: Chen Kang, Wang Bin, Feng Lin

[Abstract] This paper studies how a Spark parallel computing cluster uses memory. The memory behavior of the computing engine is modeled and analyzed, and caching decisions are automated so that the scheduler can identify valuable resilient distributed datasets (RDDs) on its own and keep them in the cache. A new cache replacement algorithm is also proposed in place of the default least recently used (LRU) policy. With the improved caching method, jobs run more efficiently when resources are limited, and their efficiency is more stable across different cluster environments, improving both the performance and the reliability of the Spark computing engine.

[Keywords] parallel computing; cache; Spark; resilient distributed dataset (RDD)
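For context, the sketch below shows the behavior the paper sets out to automate: in stock Spark, the user must decide manually which RDDs to persist, and when executor memory runs short, cached blocks are evicted with the default LRU policy. This is only a minimal illustration under assumed settings; the input path "data.txt" and the application structure are placeholders, not taken from the paper.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

// Minimal sketch: manual RDD caching in stock Spark, whose default eviction
// policy (LRU) is what the paper proposes to replace.
object CacheSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("rdd-cache-sketch")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    // An RDD reused by several downstream actions; without an explicit
    // persist() it would be recomputed from the source each time.
    val shared = sc.textFile("data.txt")          // hypothetical input path
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
      .persist(StorageLevel.MEMORY_ONLY)          // manual caching decision

    // Two actions that both benefit from the cached RDD.
    println(shared.count())
    println(shared.filter(_._2 > 10).count())

    // When memory is insufficient, Spark drops cached blocks in LRU order;
    // the paper's scheduler instead weighs how valuable each RDD is to
    // future stages before deciding what to cache or evict.
    spark.stop()
  }
}
```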
