Boosting hive efficiency: a novel dual-process architecture for asynchronous and parallel data loading

Rahman, Mohammad Ashekur

Date

2024-10

Publisher

Brac University

Abstract

Data loading in Hive, a critical component of modern big data ecosystems, is often hindered by sequential bottlenecks that limit overall performance. This thesis proposes a novel asynchronous and parallel data loading architecture that addresses these bottlenecks and improves Hive’s data ingestion. The architecture comprises two distinct processes: the Landing Batch Process, which loads data into the Hadoop Distributed File System (HDFS), and the Staging Batch Process, which loads data into Hive tables. Because the two processes run asynchronously and in parallel, data handling is accelerated. Experimental evaluations compared the proposed architecture, both without parallelism and with two parallel processes, against the traditional sequential approach. Three datasets (NOAA weather data, Threat data, and Stock market data) were used to assess the scalability and robustness of the solution. Total processing time fell for all three: by 42% for the NOAA dataset, 42.5% for the Threat dataset, and 43.42% for the Stock dataset, the largest improvement. Parallel processing reduced the landing time from 451 seconds to 402.66 seconds for the NOAA dataset, from 861 seconds to 763 seconds for the Threat dataset, and from 2,643 seconds to 2,342 seconds for the Stock dataset. The average landing iteration time was also reduced across the datasets. These findings indicate that the proposed architecture applies broadly and mitigates the traditional limitations of Hive’s data loading in high-volume environments.
This thesis concludes that the asynchronous and parallel approach significantly improves data loading efficiency and is a viable solution for high-volume data environments. Future research will explore further optimization of the staging process, scalability analysis with additional parallel processes, and integration with real-time data frameworks, with the aim of establishing a robust and scalable architecture for big data applications in Hive and beyond.
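
The dual-process idea described in the abstract can be illustrated with a minimal Python sketch. This is not the thesis's implementation: threads stand in for the Landing and Staging Batch Processes, an in-memory queue stands in for files landed in HDFS, and a dictionary stands in for a Hive table. All function and variable names are hypothetical.

```python
import queue
import threading
from concurrent.futures import ThreadPoolExecutor

SENTINEL = None  # placed on the queue once every batch has been landed


def land_batch(batch_id, rows, landed):
    """Landing step: a stand-in for copying one batch file into HDFS."""
    landed.put((batch_id, list(rows)))


def landing_process(batches, landed, workers=2):
    """Land all batches with `workers` parallel workers, then signal completion."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for batch_id, rows in batches.items():
            pool.submit(land_batch, batch_id, rows, landed)
    # Leaving the `with` block waits for every landing task to finish,
    # so the sentinel is always the last item on the queue.
    landed.put(SENTINEL)


def staging_process(landed, table):
    """Staging step: a stand-in for loading each landed batch into a Hive table."""
    while True:
        item = landed.get()  # blocks until a landed batch is available
        if item is SENTINEL:
            return
        batch_id, rows = item
        table[batch_id] = rows


def run_pipeline(batches):
    """Run landing and staging concurrently and return the staged 'table'."""
    landed = queue.Queue()
    table = {}
    stager = threading.Thread(target=staging_process, args=(landed, table))
    lander = threading.Thread(target=landing_process, args=(batches, landed))
    stager.start()  # staging begins immediately and consumes batches as they land
    lander.start()
    lander.join()
    stager.join()
    return table
```

In this sketch, staging does not wait for the whole landing phase to finish: each batch is staged as soon as it appears on the queue, which is the asynchronous behaviour the abstract contrasts with the sequential approach. A real deployment would replace the queue and dictionary with HDFS and Hive operations.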

Keywords

Parallel processing; Asynchronous processing; Multiprocessing; NOAA dataset; Hive efficiency

LC Subject Headings

Parallel programming (Computer science); Data warehousing

Description

This thesis is submitted in partial fulfillment of the requirements for the degree of Master of Science in Computer Science and Engineering, 2024.

Cataloged from the PDF version of the thesis.

Includes bibliographical references (pages 77-78).

Department

Department of Computer Science and Engineering, Brac University

Type

Thesis

Collections

Thesis & Report, MSc (Computer Science and Engineering)