DBSCAN Algorithm for Document Clustering

Document clustering is the problem of automatically grouping similar documents into categories based on a similarity metric. Most available data, particularly on the web, is unclassified, so clustering algorithms that can handle such data are needed. Search engines, for example, return a list of pages relevant to the user query, and this list must be generated quickly and as accurately as possible. In this paper we present a clustering algorithm called DBSCAN (Density-Based Spatial Clustering of Applications with Noise) and its limitations when clustering documents (or web pages). Documents are represented using the “bag-of-words” representation (word occurrence frequency), a representation on which many algorithms fail. We use Information Gain as the feature selection method and evaluate the DBSCAN algorithm by its capacity to integrate all the samples from the dataset into clusters.
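As an illustration only, the following minimal sketch shows how DBSCAN can be applied to bag-of-words vectors and how the share of samples absorbed into clusters can be measured. It is not the implementation used in the paper: the cosine distance, the values of `eps` and `min_pts`, the toy documents, and all function names are assumptions chosen for the example, and the Information Gain feature selection step is omitted.

```python
import math
from collections import Counter

NOISE = -1  # label for samples that end up in no cluster


def bag_of_words(text):
    """Represent a document by its word occurrence frequencies."""
    return Counter(text.lower().split())


def cosine_distance(a, b):
    """1 - cosine similarity of two sparse frequency vectors (assumed metric)."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def dbscan(vectors, eps, min_pts):
    """Plain O(n^2) DBSCAN; returns one cluster label per vector."""
    n = len(vectors)
    labels = [None] * n  # None means "not visited yet"

    def neighbors(i):
        # The eps-neighborhood includes the point itself.
        return [j for j in range(n)
                if cosine_distance(vectors[i], vectors[j]) <= eps]

    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins but does not expand
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbors(j)
            if len(reachable) >= min_pts:  # core point: keep expanding
                queue.extend(reachable)
    return labels


# Hypothetical toy corpus, used only to exercise the sketch.
docs = [
    "stock market shares rise",
    "stock market shares fall",
    "market shares stock trading",
    "football match goal score",
    "football match goal win",
    "goal score football league",
    "recipe for chocolate cake",
]
vectors = [bag_of_words(d) for d in docs]
labels = dbscan(vectors, eps=0.5, min_pts=2)

# Share of samples integrated into some cluster (i.e. not left as noise).
coverage = sum(1 for label in labels if label != NOISE) / len(labels)
print(labels, coverage)
```

In this sketch, a document that shares no vocabulary with any other document has no neighbors within `eps` and is left as noise, which lowers the coverage value. This is the kind of effect the evaluation criterion in the abstract is meant to capture.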

eISSN: 2067-354X
Language: English

Publication timeframe: 2 times per year
Journal Subjects: Computer Sciences, other, Business and Economics, Mathematics and Statistics for Economists, Mathematics, Engineering, Electrical Engineering, Fundamentals of Electrical Engineering, General Mathematics

Published Online: Mar 20, 2020

Page range: 58 - 66

DOI: https://doi.org/10.2478/ijasitels-2019-0007

Keywords
Document Classification, Information Gain, Naive Bayes, Weka framework

© 2019 Radu G. Creţulescu et al., published by Sciendo

This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 3.0 License.
