Title: A grid-based middleware for scalable processing of remote data
Abstract: As scientific simulations generate large amounts of data, analyzing this data to gain insight into scientific phenomena is increasingly becoming a challenge. With the emergence of grid computing, analysis of large, geographically distributed scientific datasets, also referred to as distributed data-intensive science, has become an important area in recent years. Despite many advances in supporting data grids, we believe an important challenge remains unaddressed: developing efficient and scalable applications that can transparently access and process data from remote servers. It is our belief that a middleware supporting remote data mining would make the development of remote data analysis applications more efficient and less time consuming, allowing the programmer to concentrate on specifying the processing to be performed on the data, rather than on the efficiency of data retrieval or on scalability.
In this thesis, we present the design and evaluation of a middleware that targets mining of data resident on remote repositories, and supports a high-level interface for developing data mining and scientific data processing applications. Our middleware, referred to as FREERIDE-G (FRamework for Rapid Implementation of Datamining Engines in Grids), is based on a precursor system, FREERIDE, created to provide run-time parallelization support for performing generalized reduction computations on locally stored data. We have created two implementations of FREERIDE-G, with the main difference between them being the mechanisms for storing and accessing remote data.
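To make the computational model concrete, the generalized reduction structure that FREERIDE-style systems parallelize can be sketched as follows. This is a minimal illustration, not the middleware's actual API: the function names (`local_reduction`, `global_combine`) and the binning computation are hypothetical stand-ins for an application-supplied reduction.

```python
# Sketch of a generalized reduction: each node accumulates a local
# reduction object over its partition of the data, and the per-node
# objects are then merged into a global result. Names are illustrative.

def local_reduction(chunk, reduction_obj):
    """Process one partition of the data, accumulating into the reduction object."""
    for element in chunk:
        key = element % 4  # illustrative: bin elements into 4 groups
        reduction_obj[key] = reduction_obj.get(key, 0) + element
    return reduction_obj

def global_combine(objs):
    """Merge per-node reduction objects into the final result."""
    merged = {}
    for obj in objs:
        for key, val in obj.items():
            merged[key] = merged.get(key, 0) + val
    return merged

# Two "nodes" each process one partition of the dataset.
part1 = local_reduction(range(0, 8), {})
part2 = local_reduction(range(8, 16), {})
result = global_combine([part1, part2])
```

The key property the middleware exploits is that the local reduction is associative and commutative, so partitions can be processed independently and combined in any order.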
Our first implementation supports mining data resident on ADR-based servers, and uses the Active Data Repository for data retrieval and TCP/IP sockets for data delivery to the processing site. Our second implementation supports mining data resident on SRB-based servers, and uses the Storage Resource Broker (a de facto standard for remote data access) for both data retrieval and delivery to the processing site. Both implementations were evaluated using five data processing applications developed for our middleware. We have also conducted an in-depth study of how the performance of the SRB-based implementation is affected by the size of the unit of the remote I/O request, the degree of I/O concurrency, and the network bandwidth available for data transfer.
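The interaction between the first two parameters can be illustrated with a simple sketch: the dataset is split into request-sized units, and several requests are issued concurrently. The `fetch_chunk` function below is a hypothetical stand-in for a remote read call, not the SRB client API.

```python
# Illustrative model of the two I/O parameters studied: the size of each
# remote read request (chunk_size) and the number of concurrent requests
# (concurrency). fetch_chunk is a placeholder for a real remote read.
from concurrent.futures import ThreadPoolExecutor

def fetch_chunk(dataset, offset, chunk_size):
    """Simulated remote read of one chunk (a real client would issue an SRB read here)."""
    return dataset[offset:offset + chunk_size]

def retrieve(dataset, chunk_size, concurrency):
    """Split the dataset into chunk_size units and fetch them concurrently, in order."""
    offsets = range(0, len(dataset), chunk_size)
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        chunks = pool.map(lambda off: fetch_chunk(dataset, off, chunk_size), offsets)
    return [item for chunk in chunks for item in chunk]
```

Larger chunks amortize per-request overhead, while higher concurrency can hide network latency; the study referenced above measures how these choices trade off in practice.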
In order to make our middleware compliant with grid computing standards, we have also integrated the compute node client component of our SRB-based implementation with the Globus Toolkit and MPICH-G2. Besides compliance with grid standards, the benefits of using the Globus Toolkit for middleware deployment include increased service portability and the ability to overcome potential heterogeneity during grid service startup and management. As part of this work, we evaluated the overhead of using the pre-WS components of the Globus Toolkit for middleware deployment, and found this overhead to be quite modest.
Clearly, if alternative computing resources and dataset replicas are available, the performance of our middleware depends on selecting the right ones. To facilitate the selection of dataset replicas and computing resources, an accurate performance prediction framework was also developed as part of our middleware. Our approach to modeling performance breaks application execution time down into data retrieval, data communication, and data processing components, and leverages our knowledge of the structure of computation supported by FREERIDE-G.
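The three-component breakdown can be sketched as a simple cost model. This is a hedged illustration of the idea, assuming each component scales linearly with data size; the rates below are hypothetical parameters, not measured FREERIDE-G values.

```python
# Minimal sketch of the execution-time breakdown used for prediction:
# total time = data retrieval + data communication + data processing.
# All rates (MB/s) are hypothetical inputs a predictor would calibrate.

def predict_time(data_mb, disk_mb_per_s, net_mb_per_s, proc_mb_per_s, n_nodes):
    """Estimate execution time for one (replica, compute resource) choice."""
    t_retrieval = data_mb / disk_mb_per_s               # read data at the repository
    t_communication = data_mb / net_mb_per_s            # ship data to compute nodes
    t_processing = data_mb / (proc_mb_per_s * n_nodes)  # reduction parallelizes across nodes
    return t_retrieval + t_communication + t_processing

def best_choice(candidates):
    """Pick the (replica, resource) candidate with the lowest predicted time."""
    return min(candidates, key=lambda c: predict_time(*c))
```

Given such a model, replica and resource selection reduces to evaluating the predictor over the available candidates and choosing the minimum.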
In addition, based on where the data to be processed has been generated, or how it is shared, interesting load balancing and scheduling considerations arise. Across geographically distributed sources, data may be partitioned in a number of ways, including horizontally, where data instances are partitioned across different repositories, or vertically, where different attributes are partitioned across repositories. Our middleware supports efficient processing of such data through a load-balancing resource allocation and scheduling algorithm that minimizes the total time spent processing the data. To solve this scheduling problem, we minimize a weighted sum of two factors: a load balancing term and a term that captures the time processing nodes spend waiting for data. The algorithm also supports data integration when the data is vertically partitioned.
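The shape of this objective can be sketched as follows. This is a simplified illustration under stated assumptions, not the thesis's actual algorithm: the weights, the greedy assignment heuristic, and the `fetch_time` helper are all hypothetical.

```python
# Hedged sketch of the scheduling objective: score a candidate assignment
# of data chunks to compute nodes as a weighted sum of (1) a load
# imbalance term and (2) the time nodes spend waiting for data.
# Weights and helper names are illustrative assumptions.

def score(assignment, proc_rate, fetch_time, alpha=0.5, beta=0.5):
    """assignment: {node: [chunk sizes]}; lower score is better."""
    loads = [sum(chunks) / proc_rate for chunks in assignment.values()]
    imbalance = max(loads) - min(loads)  # load balancing factor
    # waiting time: fetching a chunk takes longer than processing the previous one
    wait = sum(max(0.0, fetch_time(c) - c / proc_rate)
               for chunks in assignment.values() for c in chunks)
    return alpha * imbalance + beta * wait

def schedule(chunks, nodes, proc_rate, fetch_time):
    """Greedy heuristic: assign each chunk (largest first) to the least-loaded node."""
    assignment = {n: [] for n in nodes}
    for c in sorted(chunks, reverse=True):
        target = min(assignment, key=lambda n: sum(assignment[n]))
        assignment[target].append(c)
    return assignment
```

A real scheduler would search over assignments to minimize the weighted score directly; the greedy pass above merely illustrates how balancing processing load and data-wait time interact.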
Publication Year: 2008
Publication Date: 2008-01-01
Language: en
Type: article