Boosting Hive efficiency: a novel dual-process architecture for asynchronous and parallel data loading
Date: 2024-10
Publisher: Brac University
Author: Rahman, Mohammad Ashekur
Abstract
The efficiency of data loading processes in Hive, a critical component of modern
big data ecosystems, is often hindered by sequential bottlenecks that limit overall
performance. This thesis proposes a novel asynchronous and parallel data loading
architecture designed to address these challenges, enhancing Hive’s data ingestion
capabilities. The architecture comprises two distinct processes: the Landing
Batch Process, which manages data loading into the Hadoop Distributed File System
(HDFS), and the Staging Batch Process, responsible for loading data into Hive
tables. By operating these processes asynchronously and in parallel, the proposed
design significantly accelerates data handling.
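The two-process design described above can be illustrated with a minimal sketch. This is not the thesis's implementation: the function names `run_pipeline`, `land`, and `stage` are hypothetical, threads stand in for the batch processes, and the HDFS and Hive steps are replaced by placeholder functions. It only shows the hand-off pattern, in which landing workers run in parallel while a staging worker consumes landed items as they become available.

```python
import queue
import threading


def run_pipeline(files, land, stage, landing_workers=2):
    """Land files with parallel workers while a staging worker consumes them."""
    pending = queue.Queue()  # files waiting to be landed
    landed = queue.Queue()   # hand-off between landing and staging
    staged = []
    done = object()          # sentinel telling the staging worker to stop

    for name in files:
        pending.put(name)

    def landing_worker():
        # Pull files until none remain, landing each one.
        while True:
            try:
                name = pending.get_nowait()
            except queue.Empty:
                return
            landed.put(land(name))

    def staging_worker():
        # Stage landed items as they arrive, without waiting for all landing to finish.
        while True:
            item = landed.get()
            if item is done:
                return
            staged.append(stage(item))

    stager = threading.Thread(target=staging_worker)
    stager.start()
    landers = [threading.Thread(target=landing_worker) for _ in range(landing_workers)]
    for t in landers:
        t.start()
    for t in landers:
        t.join()
    landed.put(done)  # all landing finished; let the staging worker drain and exit
    stager.join()
    return staged


def land(name):
    # Hypothetical stand-in for copying a file into HDFS.
    return f"/landing/{name}"


def stage(path):
    # Hypothetical stand-in for loading a landed file into a Hive table.
    return f"staged:{path}"


result = run_pipeline(["a.csv", "b.csv", "c.csv"], land, stage)
```

Because staging starts as soon as the first file is landed, neither phase waits for the other to complete the whole batch. The order of `result` depends on thread scheduling and is not guaranteed.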
Experimental evaluations compared the proposed architecture, run both without
parallelism and with two parallel processes, against the traditional sequential
approach. Three diverse datasets (NOAA weather data, Threat data, and Stock
market data) were tested to assess the scalability and robustness of
the solution. The results revealed substantial performance improvements across all
datasets. The NOAA dataset exhibited a reduction in total processing time of 42%,
the Threat dataset achieved a 42.5% reduction, and the Stock dataset showed the
greatest improvement, with a 43.42% decrease in total processing time. Notably,
parallel processing reduced the landing time from 451 seconds to 402.66 seconds
for the NOAA dataset, from 861 seconds to 763 seconds for Threat data, and from
2,643 seconds to 2,342 seconds for Stock data. Additionally, the average landing
iteration time was significantly reduced across the datasets, further underscoring
the efficiency gains of parallel execution.
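The landing times quoted above can be turned into relative reductions with simple arithmetic. The sketch below uses only the figures stated in the abstract; the helper name `percent_reduction` is illustrative. These values cover the landing phase alone and are separate from the 42% to 43.42% reductions reported for total processing time.

```python
# Landing time in seconds, (before, after) parallel processing, as quoted in the abstract.
landing_seconds = {
    "NOAA": (451.0, 402.66),
    "Threat": (861.0, 763.0),
    "Stock": (2643.0, 2342.0),
}


def percent_reduction(before, after):
    """Relative reduction from before to after, as a percentage."""
    return (before - after) / before * 100.0


reductions = {
    name: round(percent_reduction(before, after), 2)
    for name, (before, after) in landing_seconds.items()
}

for name, value in reductions.items():
    print(f"{name}: {value}% shorter landing time")
```

On these figures the landing phase is roughly 11% shorter for each dataset, which suggests that most of the reported total improvement comes from elsewhere in the pipeline, for example from overlapping landing with staging.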
These findings demonstrate the broad applicability of the proposed architecture
to overcoming the traditional limitations of Hive’s data loading processes.
This thesis concludes that the asynchronous and parallel approach offers a
significant advancement in data loading efficiency, making it a viable solution
for high-volume data environments.
Future research will explore further optimization of the staging process, scalability
analysis with additional parallel processes, and integration with real-time data
frameworks, aiming to establish a robust and scalable architecture for big data applications
in Hive and beyond.