Task duplication has been widely adopted to mitigate the impact of stragglers, which run much longer than normal tasks. However, task duplication in the data pipelining case generates excessive traffic over datacenter networks. In this paper, we study how to minimize the traffic cost of task replication under data pipelining and design a controller that keeps the data generated by the first replica to finish and discards the data generated later by the other replicas of the same task. Each task replica communicates with the controller whenever it finishes processing a data block, which causes additional network overhead. Hence, we seek to reduce this overhead and strike a trade-off between the delay of data blocks and the network overhead. Finally, extensive simulation results demonstrate that our proposal can minimize network traffic cost in the data pipelining case.
cloud computing; big data; shuffling; task duplication; traffic