Clustering web pages based on doc type structure in a distributed manner
Abstract
Web page clustering is an important part of modern web technology. By
grouping similar web pages together we can find related information, suggest
similar choices, and so on. Modern search engines rely on web page
clustering. The topic is interesting because it presents both a novel academic
challenge and a practical application. In this thesis we clustered
web pages using their HTML tag structure. We represented
each web page as a vector of tag percentages and clustered these vectors using
the k-means and DBSCAN clustering algorithms. We selected
k-means and DBSCAN because they are well-known clustering
algorithms and, to our knowledge, have not previously been applied together
and compared in the field of web page clustering. After
clustering three different categories of five websites in three stages, both
algorithms achieved a minimum accuracy of over 88% compared
to the original clusters. We used the Weka data mining software
because it is well tested in terms of accuracy and efficiency and is
open source.
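
To make the page representation concrete, the following is a minimal sketch of how a vector of tag percentages could be computed. It is an illustration only and not the implementation used in the thesis: the names TagCounter and tag_percentages, the fixed tag vocabulary, and the choice to divide by the count of all opening tags on the page are assumptions.

```python
from collections import Counter
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Counts opening tags seen while parsing an HTML document."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names in lower case.
        self.counts[tag] += 1


def tag_percentages(html, vocabulary):
    """Return the share (in percent) of each vocabulary tag among all
    opening tags of the page. Assumption: the denominator is the total
    number of opening tags, including tags outside the vocabulary."""
    parser = TagCounter()
    parser.feed(html)
    parser.close()
    total = sum(parser.counts.values())
    if total == 0:
        # A page without any tags maps to the zero vector.
        return [0.0] * len(vocabulary)
    return [100.0 * parser.counts[tag] / total for tag in vocabulary]


page = "<html><body><p>a</p><p>b</p></body></html>"
print(tag_percentages(page, ["html", "body", "p", "div"]))
# prints [25.0, 25.0, 50.0, 0.0]
```

Vectors of this form, one per page, could then be passed to a clustering tool such as k-means or DBSCAN.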