Research on Data Object Cache Optimization for the Spark Computing Engine

Published: 2016-03-17    Authors: Chen Kang, Wang Bin, Feng Lin

[Abstract] This paper studies how a Spark parallel computing cluster uses memory. The memory behavior of the computing engine is modeled and analyzed, and caching decisions are automated so that the scheduler can identify valuable resilient distributed datasets (RDDs) on its own and keep them in the cache. A new cache replacement algorithm is also proposed in place of the default least recently used (LRU) policy. With the improved caching method, jobs run more efficiently when resources are limited, and their efficiency is more stable across different cluster environments, improving both the performance and the reliability of the Spark computing engine.

[Keywords] parallel computing; cache; Spark; resilient distributed dataset (RDD)
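For context, the sketch below shows the behavior the paper sets out to automate: in stock Spark, the user must decide manually which RDDs to persist, and when executor memory runs short, cached blocks are evicted with the default LRU policy. This is only a minimal illustration under assumed settings; the input path "data.txt" and the application structure are placeholders, not taken from the paper.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

// Minimal sketch: manual RDD caching in stock Spark, whose default eviction
// policy (LRU) is what the paper proposes to replace.
object CacheSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("rdd-cache-sketch")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    // An RDD reused by several downstream actions; without an explicit
    // persist() it would be recomputed from the source each time.
    val shared = sc.textFile("data.txt")          // hypothetical input path
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
      .persist(StorageLevel.MEMORY_ONLY)          // manual caching decision

    // Two actions that both benefit from the cached RDD.
    println(shared.count())
    println(shared.filter(_._2 > 10).count())

    // When memory is insufficient, Spark drops cached blocks in LRU order;
    // the paper's scheduler instead weighs how valuable each RDD is to
    // future stages before deciding what to cache or evict.
    spark.stop()
  }
}
```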
