Show simple item record

dc.contributor.advisor | Sadeque, Farig Yousuf
dc.contributor.author | Rahman, Mohammad Ashekur
dc.date.accessioned | 2024-11-28T05:42:52Z
dc.date.available | 2024-11-28T05:42:52Z
dc.date.copyright | ©2024
dc.date.issued | 2024-10
dc.identifier.other | ID 21166025
dc.identifier.uri | http://hdl.handle.net/10361/24838
dc.description | This thesis is submitted in partial fulfillment of the requirements for the degree of Master of Science in Computer Science and Engineering, 2024. | en_US
dc.description | Cataloged from the PDF version of the thesis.
dc.description | Includes bibliographical references (pages 77-78).
dc.description.abstract | The efficiency of data loading processes in Hive, a critical component of modern big data ecosystems, is often hindered by sequential bottlenecks that limit overall performance. This thesis proposes a novel asynchronous and parallel data loading architecture designed to address these challenges, enhancing Hive’s data ingestion capabilities. The architecture comprises two distinct processes: the Landing Batch Process, which manages data loading into the Hadoop Distributed File System (HDFS), and the Staging Batch Process, responsible for loading data into Hive tables. By operating these processes asynchronously and in parallel, the proposed design significantly accelerates data handling. Experimental evaluations compared the performance of the proposed architecture in scenarios without parallelism and with two parallel processes against the traditional sequential approach. Three diverse datasets (NOAA weather data, Threat data, and Stock market data) were tested to assess the scalability and robustness of the solution. The results revealed substantial performance improvements across all datasets. The NOAA dataset exhibited a reduction in total processing time of 42%, the Threat dataset achieved a 42.5% reduction, and the Stock dataset showed the greatest improvement, with a 43.42% decrease in total processing time. Notably, parallel processing reduced the landing time from 451 seconds to 402.66 seconds for the NOAA dataset, from 861 seconds to 763 seconds for Threat data, and from 2,643 seconds to 2,342 seconds for Stock data. Additionally, the average landing iteration time was significantly reduced across the datasets, further underscoring the efficiency gains of parallel execution. These findings demonstrate the broad applicability and efficiency of the proposed architecture, making it a powerful tool for overcoming the traditional limitations of Hive’s data loading processes in high-volume environments.
This thesis concludes that the asynchronous and parallel approach offers a significant advancement in data loading efficiency, making it a viable solution for high-volume data environments. Future research will explore further optimization of the staging process, scalability analysis with additional parallel processes, and integration with real-time data frameworks, aiming to establish a robust and scalable architecture for big data applications in Hive and beyond. | en_US
dc.description.statementofresponsibility | Mohammad Ashekur Rahman
dc.format.extent | 92 pages
dc.language.iso | en | en_US
dc.publisher | Brac University | en_US
dc.rights | Brac University theses are protected by copyright. They may be viewed from this source for any purpose, but reproduction or distribution in any format is prohibited without written permission.
dc.subject | Parallel processing | en_US
dc.subject | Asynchronous processing | en_US
dc.subject | Multiprocessing | en_US
dc.subject | NOAA dataset | en_US
dc.subject | Hive efficiency | en_US
dc.subject.lcsh | Parallel programming (Computer science).
dc.subject.lcsh | Data warehousing.
dc.title | Boosting Hive efficiency: a novel dual-process architecture for asynchronous and parallel data loading | en_US
dc.type | Thesis | en_US
dc.contributor.department | Department of Computer Science and Engineering, Brac University
dc.description.degree | M.Sc. in Computer Science and Engineering
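
The dual-process design summarized in the abstract can be pictured as a producer-consumer pipeline: landing workers hand finished batches to a staging worker through a queue, so staging starts before all landing has finished. The following is only a minimal sketch under stated assumptions, not the thesis's implementation. The functions `land_batch` and `stage_batch`, the path layout, and the use of threads with an in-memory queue are all hypothetical stand-ins for the real HDFS and Hive operations, which are not described in this record.

```python
import queue
import threading

SENTINEL = None  # marks the end of the landed-batch stream


def land_batch(batch_id):
    """Hypothetical stand-in for copying one batch of raw files into HDFS."""
    return f"/landing/batch_{batch_id}"


def stage_batch(landed_path):
    """Hypothetical stand-in for loading one landed batch into a Hive table."""
    return landed_path.replace("/landing/", "staged:")


def landing_worker(batch_ids, landed):
    # Landing side: publish each landed path as soon as it is ready.
    for batch_id in batch_ids:
        landed.put(land_batch(batch_id))


def staging_worker(landed, staged):
    # Staging side: consume landed paths without waiting for all landing to finish.
    while True:
        path = landed.get()
        if path is SENTINEL:
            break
        staged.append(stage_batch(path))


def run_pipeline(batch_ids, landing_parallelism=2):
    landed = queue.Queue()
    staged = []
    # Split the batches round-robin across the parallel landing workers.
    chunks = [batch_ids[i::landing_parallelism] for i in range(landing_parallelism)]
    landers = [
        threading.Thread(target=landing_worker, args=(chunk, landed))
        for chunk in chunks
    ]
    stager = threading.Thread(target=staging_worker, args=(landed, staged))
    stager.start()
    for t in landers:
        t.start()
    for t in landers:
        t.join()
    landed.put(SENTINEL)  # all landing is done; let the stager drain and exit
    stager.join()
    return staged
```

Calling `run_pipeline(list(range(6)))` returns one staged entry per batch. The order of the entries depends on thread scheduling, which reflects the asynchronous hand-off between the two sides.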

