
    Detecting document similarity in large document collection using MapReduce and the Hadoop framework

    View/Open
    Detecting Document Similarity in Large Document Collection using MapReduce and the Hadoop Framework.pfg.pdf (2.445Mb)
    Date
    2012-12
    Publisher
    BRAC University
    Author
    Momtaz, Anik
    Amreen, Sadika
    URI
    http://hdl.handle.net/10361/2379
    Abstract
    The need to process data grows more challenging as the data itself grows exponentially. Volumes now reach exabytes, zettabytes and even yottabytes, generally referred to as Big Data. Conventional processing methods are no longer adequate at this scale, since it is not feasible to analyze data of such volume on a single machine. This is where Hadoop comes in. Using the Hadoop Distributed File System (HDFS), a very large body of data can be divided into smaller pieces and distributed among multiple machines, referred to as nodes, which process the pieces in parallel using a technique called MapReduce. The potential of this approach is vast. For our thesis, we used HDFS to identify similarities between multiple documents. The initial idea was to develop an algorithm to detect full or partial plagiarism in documents, since countless materials of interest are readily available on the internet. After successfully implementing an algorithm for English, we found no record of any prior work on document similarity detection for the Bangla language. Therefore, with some modifications to our existing algorithm (as Bangla differs substantially from English in its construction), we developed an algorithm to detect document similarities on a broad scale using the Ferret model.
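    As a rough illustration of the idea in the abstract, the sketch below simulates a map step and a reduce step in plain Python and scores document pairs by the overlap of their word trigrams, which is how the Ferret model is usually described. It is not the thesis's algorithm, which this record does not show. The function names, the whitespace tokenization, and the sample documents are all assumptions made for the example, and no Hadoop API is used.

```python
from collections import defaultdict
from itertools import combinations


def trigrams(text):
    """Return the set of word trigrams in text (lowercased, whitespace-split)."""
    words = text.lower().split()  # assumption: simple whitespace tokenization
    return {tuple(words[i:i + 3]) for i in range(len(words) - 2)}


def map_phase(doc_id, text):
    """Map step: emit one (trigram, doc_id) pair per distinct trigram."""
    for gram in trigrams(text):
        yield gram, doc_id


def reduce_phase(pairs):
    """Reduce step: group doc ids by trigram, then count shared trigrams per document pair."""
    postings = defaultdict(set)
    for gram, doc_id in pairs:
        postings[gram].add(doc_id)
    shared = defaultdict(int)
    for doc_ids in postings.values():
        for a, b in combinations(sorted(doc_ids), 2):
            shared[(a, b)] += 1
    return shared


def resemblance(docs):
    """Jaccard resemblance of trigram sets for every pair sharing at least one trigram."""
    pairs = [kv for doc_id, text in docs.items() for kv in map_phase(doc_id, text)]
    shared = reduce_phase(pairs)
    sizes = {doc_id: len(trigrams(text)) for doc_id, text in docs.items()}
    return {
        (a, b): n / (sizes[a] + sizes[b] - n)  # |A ∩ B| / |A ∪ B|
        for (a, b), n in shared.items()
    }


# Hypothetical sample documents, for illustration only.
docs = {
    "a": "the quick brown fox jumps over the lazy dog",
    "b": "the quick brown fox sleeps under the lazy dog",
    "c": "an entirely unrelated sentence about distributed file systems",
}
scores = resemblance(docs)
```

    In a real Hadoop job the map and reduce steps would run on separate nodes over blocks stored in HDFS. Here they run in a single process only to show how the data flows between the two steps.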
    Keywords
    Computer science and engineering
    Description
    This thesis report is submitted in partial fulfillment of the requirements for the degree of Bachelor of Science in Computer Science and Engineering, 2012.
     
    Cataloged from PDF version of thesis report.
     
    Includes bibliographical references (page 46).
    Department
    Department of Computer Science and Engineering, BRAC University
    Type
    Thesis
    Collections
    • Thesis & Report, BSc (Computer Science and Engineering)

    Copyright © 2008-2023 Ayesha Abed Library, Brac University 