Boosting Hive efficiency: a novel dual-process architecture for asynchronous and parallel data loading

Rahman, Mohammad Ashekur

dc.contributor.advisor	Sadeque, Farig Yousuf
dc.contributor.author	Rahman, Mohammad Ashekur
dc.date.accessioned	2024-11-28T05:42:52Z
dc.date.available	2024-11-28T05:42:52Z
dc.date.copyright	©2024
dc.date.issued	2024-10
dc.identifier.other	ID 21166025
dc.identifier.uri	http://hdl.handle.net/10361/24838
dc.description	This thesis is submitted in partial fulfillment of the requirements for the degree of Master of Science in Computer Science and Engineering, 2024.	en_US
dc.description	Cataloged from the PDF version of the thesis.
dc.description	Includes bibliographical references (pages 77-78).
dc.description.abstract	Data loading in Hive, a critical component of modern big data ecosystems, is often hindered by sequential bottlenecks that limit overall performance. This thesis proposes a novel asynchronous and parallel data loading architecture that addresses these bottlenecks and improves Hive’s data ingestion. The architecture comprises two distinct processes: the Landing Batch Process, which loads data into the Hadoop Distributed File System (HDFS), and the Staging Batch Process, which loads data into Hive tables. Running the two processes asynchronously and in parallel accelerates data loading. Experiments compared the traditional sequential approach against the proposed architecture in two configurations: without parallelism and with two parallel processes. Three diverse datasets (NOAA weather data, Threat data, and Stock market data) were tested to assess the scalability and robustness of the solution. Total processing time fell for every dataset: by 42% for NOAA, by 42.5% for Threat, and by 43.42% for Stock, the largest improvement. Parallel processing reduced the landing time from 451 seconds to 402.66 seconds for the NOAA dataset, from 861 seconds to 763 seconds for Threat data, and from 2,643 seconds to 2,342 seconds for Stock data. The average landing iteration time also fell across the datasets. These findings indicate that the proposed architecture applies across varied datasets and can overcome the traditional limitations of Hive’s data loading in high-volume environments.
This thesis concludes that the asynchronous and parallel approach offers a significant advancement in data loading efficiency, making it a viable solution for high-volume data environments. Future research will explore further optimization of the staging process, scalability analysis with additional parallel processes, and integration with real-time data frameworks, aiming to establish a robust and scalable architecture for big data applications in Hive and beyond.	en_US
dc.description.statementofresponsibility	Mohammad Ashekur Rahman
dc.format.extent	92 pages
dc.language.iso	en	en_US
dc.publisher	Brac University	en_US
dc.rights	Brac University theses are protected by copyright. They may be viewed from this source for any purpose, but reproduction or distribution in any format is prohibited without written permission.
dc.subject	Parallel processing	en_US
dc.subject	Asynchronous processing	en_US
dc.subject	Multiprocessing	en_US
dc.subject	NOAA dataset	en_US
dc.subject	Hive efficiency	en_US
dc.subject.lcsh	Parallel programming (Computer science).
dc.subject.lcsh	Data warehousing.
dc.title	Boosting Hive efficiency: a novel dual-process architecture for asynchronous and parallel data loading	en_US
dc.type	Thesis	en_US
dc.contributor.department	Department of Computer Science and Engineering, Brac University
dc.description.degree	M.Sc. in Computer Science and Engineering
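
The dual-process design described in the abstract can be illustrated with a minimal Python sketch. This is not the thesis's implementation: threads stand in for the two batch processes, a dict stands in for HDFS, a list stands in for a Hive table, and every name used here (`landing_worker`, `staging_worker`, `run_pipeline`, `n_landers`) is hypothetical. It only shows the general pattern of landing files in parallel while a separate staging step consumes them asynchronously.

```python
import queue
import threading

SENTINEL = None  # marks the end of the landing stream


def landing_worker(files, landed, hdfs):
    """Landing batch: copy each source file into the simulated HDFS dict."""
    for name, rows in files:
        hdfs[name] = rows          # stand-in for writing a file to HDFS
        landed.put(name)           # announce that the file is ready to stage


def staging_worker(landed, hdfs, table):
    """Staging batch: load each landed file into the simulated Hive table."""
    while True:
        name = landed.get()
        if name is SENTINEL:
            break
        table.extend(hdfs[name])   # stand-in for loading a file into a Hive table


def run_pipeline(files, n_landers=2):
    """Run parallel landing workers alongside one asynchronous staging worker."""
    hdfs, table, landed = {}, [], queue.Queue()
    # Split the input round-robin across the parallel landing workers.
    chunks = [files[i::n_landers] for i in range(n_landers)]
    landers = [
        threading.Thread(target=landing_worker, args=(chunk, landed, hdfs))
        for chunk in chunks
    ]
    stager = threading.Thread(target=staging_worker, args=(landed, hdfs, table))
    stager.start()                 # staging runs while landing is still in progress
    for t in landers:
        t.start()
    for t in landers:
        t.join()
    landed.put(SENTINEL)           # no more files will arrive
    stager.join()
    return table
```

The queue decouples the two stages: staging starts on the first landed file instead of waiting for the whole landing batch to finish, which is the sequential bottleneck the abstract refers to. Row order in the returned table depends on thread scheduling, so callers should not rely on it.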

Files in this item

Name: 21166025_CSE.pdf
Size: 574.4Kb
Format: PDF

This item appears in the following Collection(s)

Thesis & Report, MSc (Computer Science and Engineering) [87]
