
dc.contributor.advisor: Mostakim, Moin
dc.contributor.author: Kader, Kazi Samiul
dc.contributor.author: Nawar, Sagufa
dc.contributor.author: Ananna, Nusrat Sharmin
dc.contributor.author: Khan, Sarah
dc.date.accessioned: 2015-09-03T10:27:20Z
dc.date.available: 2015-09-03T10:27:20Z
dc.date.copyright: 2015
dc.date.issued: 2015
dc.identifier.other: ID 11101011
dc.identifier.other: ID 11101007
dc.identifier.other: ID 11101031
dc.identifier.other: ID 11201022
dc.identifier.uri: http://hdl.handle.net/10361/4381
dc.description: This thesis report is submitted in partial fulfillment of the requirements for the degree of Bachelor of Science in Computer Science and Engineering, 2015. [en_US]
dc.description: Cataloged from PDF version of thesis report.
dc.description: Includes bibliographical references (pages 56-58).
dc.description.abstract: Web page clustering is an important part of modern web technology. By grouping similar web pages together, we can find related information, suggest similar choices, and so on. All modern search engines depend on web page clustering. The topic is interesting to work on because it presents both a novel academic challenge and practical applications. In this thesis we clustered web pages using their HTML tag structure. We represented each web page as a vector of tag percentages and clustered the vectors using the k-means and DBSCAN clustering algorithms. We selected k-means and DBSCAN because they are well-known clustering algorithms and because they have not previously been applied together and compared in the field of web page clustering as we did in this thesis. After clustering three different categories of five websites in three stages, both algorithms achieved an accuracy of over 88% compared to the original clusters. In this process we used the Weka data mining software because it is well tested in terms of accuracy and efficiency. It is also open source. [en_US]
dc.description.statementofresponsibility: Kazi Samiul Kader
dc.description.statementofresponsibility: Sagufa Nawar
dc.description.statementofresponsibility: Nusrat Sharmin Ananna
dc.description.statementofresponsibility: Sarah Khan
dc.format.extent: 58 pages
dc.language.iso: en [en_US]
dc.publisher: BRAC University [en_US]
dc.rights: BRAC University theses are protected by copyright. They may be viewed from this source for any purpose, but reproduction or distribution in any format is prohibited without written permission.
dc.subject: Computer science and engineering [en_US]
dc.subject: Web page clustering [en_US]
dc.title: Clustering web pages based on doc type structure in a distributed manner [en_US]
dc.type: Thesis [en_US]
dc.contributor.department: Department of Computer Science and Engineering, BRAC University
dc.description.degree: B. Computer Science and Engineering
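
The abstract above describes representing each web page as a vector of tag percentages before clustering. The following is a minimal sketch of that representation only, not the thesis's actual implementation (the record states that clustering was done in Weka). The names TagCounter, tag_percentages, and vocabulary are hypothetical, and counting only opening tags is an assumption, since the record does not say how tags were counted.

```python
from collections import Counter
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Counts opening tags seen while parsing an HTML document (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names before calling this method.
        self.counts[tag] += 1


def tag_percentages(html, vocabulary):
    """Return the share (in percent) of each tag in `vocabulary` among all opening tags.

    Assumption: percentages are relative to the total number of opening tags
    on the page, including tags that are not in `vocabulary`.
    """
    parser = TagCounter()
    parser.feed(html)
    parser.close()
    total = sum(parser.counts.values())
    if total == 0:
        return [0.0] * len(vocabulary)
    return [100.0 * parser.counts[tag] / total for tag in vocabulary]


if __name__ == "__main__":
    page = "<html><body><div><p>a</p><p>b</p></div></body></html>"
    print(tag_percentages(page, ["div", "p", "a"]))
```

Vectors built this way over a fixed tag vocabulary could then be exported (for example as CSV or ARFF) and passed to a k-means or DBSCAN implementation such as the ones in Weka.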

