Title: Harnessing input redundancy in a MapReduce framework
Abstract: The proliferation of data-parallel programming on large clusters has opened a new research avenue: accommodating numerous types of data-intensive applications with a feasible plan. Across the many research efforts, we observe that a nontrivial amount of redundant I/O occurs in the execution of data-intensive applications. Even the locality-aware scheduling policy in a MapReduce framework is not effective in a cluster environment where storage nodes cannot provide a computation service. In this paper, we introduce SplitCache to improve the performance of data-intensive OLAP-style applications by reducing redundant I/O in a MapReduce framework. The key strategy for achieving this goal is to cut down the I/O redundancy of reading common input data among applications. SplitCache caches the first input stream in the computing nodes and reuses it for future demand. In executing the TPC-H benchmark, we achieved 65.5% faster execution and an 87% reduction in network traffic on average.
Publication Year: 2010
Publication Date: 2010-03-22
Language: en
Type: article
Indexed In: ['crossref']
Access and Citation
Cited By Count: 10
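The caching idea described in the abstract (keep the first read of an input split on the computing node, then serve later requests for the same split locally) can be illustrated with a minimal sketch. This is not the paper's implementation: the class name `SplitCacheSketch`, the `(path, split index)` key, the in-memory dictionary, and the `fake_storage_read` stand-in for a remote storage node are all assumptions made for illustration, and eviction and capacity limits are omitted.

```python
from typing import Callable, Dict, Tuple

# Hypothetical key for an input split: (input path, split index).
SplitId = Tuple[str, int]


class SplitCacheSketch:
    """Illustrative per-node cache of input splits (not the paper's code)."""

    def __init__(self, fetch_remote: Callable[[SplitId], bytes]) -> None:
        self._fetch_remote = fetch_remote      # assumed remote read function
        self._cache: Dict[SplitId, bytes] = {}
        self.remote_reads = 0                  # reads that crossed the network
        self.cache_hits = 0                    # reads served from the local cache

    def read_split(self, split_id: SplitId) -> bytes:
        # Reuse the locally cached copy when this split was read before.
        if split_id in self._cache:
            self.cache_hits += 1
            return self._cache[split_id]
        # First request: fetch from the storage node and keep a copy.
        data = self._fetch_remote(split_id)
        self.remote_reads += 1
        self._cache[split_id] = data
        return data


def fake_storage_read(split_id: SplitId) -> bytes:
    # Stand-in for a network read from a storage node (assumption).
    path, index = split_id
    return f"{path}#{index}".encode()


def run_demo() -> Tuple[int, int]:
    cache = SplitCacheSketch(fake_storage_read)
    # Two jobs scan the same four splits of one input file.
    splits = [("lineitem.tbl", i) for i in range(4)]
    for _job in range(2):
        for split in splits:
            cache.read_split(split)
    return cache.remote_reads, cache.cache_hits
```

In this sketch, `run_demo()` has two jobs scan the same four splits, so only the first job triggers remote reads and the second is served entirely from the cache. The file name `lineitem.tbl` is only an example input.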