SPBD: Streamlining Big-Data Processing in Cloud Environments

Release Date: 2013-07-24 Author: Tung Nguyen, Jingwen Zhang, and Weisong Shi

[Abstract] Many applications, such as those in genomics, are designed to run on a single machine. This is not a problem when the input data set is small enough to fit into the memory of one powerful machine. However, because such an application cannot run in parallel, it and its algorithms are limited by the capacity and performance of that machine, and a single machine cannot handle very large data sets. Recent research has combined cloud computing and MapReduce to store and process big data. Handling data in the cloud involves three main steps: 1) the user uploads the data, 2) the data is processed, and 3) the results are returned. When the data reaches a certain scale, transmission time becomes the dominant factor; however, most research to date has focused only on reducing processing time. It is also generally assumed that the data is already stored in the cloud. This assumption does not hold because many organizations now store their data locally. In this paper, we propose SPBD (pronounced “speed”) to minimize the overall user wait time. We formulate the overall processing time as an optimization problem and derive the optimal solution. Evaluated on our private cloud platform, SPBD reduces user wait time by up to 34% for a traditional WordCount application and by up to 31% for a metagenomic application.

[Keywords] big data; genomics; NGS; MapReduce; cloud