Albanian Text Classification: Bag of Words Model and Word Analogies

Abstract

Background: Text classification is an important task in information retrieval. Its objective is to classify new text documents into a set of predefined classes using supervised algorithms. Objectives: We focus on text classification of Albanian news articles using two approaches. Methods/Approach: In the first approach, the words in a collection are treated as independent components, and each is assigned a corresponding vector in the vector space. Here we used nine classifiers from the scikit-learn package, training them on part of the news articles (80%) and testing accuracy on the remaining part. In the second approach, classification treats words according to their semantic and syntactic similarities, assuming that a word is composed of character n-grams. In this case, we used fastText, a hierarchical classifier that considers local word order as well as subword information. We measured the accuracy of each classifier separately and also analyzed the training and testing time. Results: Our results show that the bag-of-words model performs better than fastText when the classification is tested on a dataset that is not large. FastText shows better performance when classifying multi-label text. Conclusions: News articles can serve as a benchmark for testing classification algorithms for Albanian texts. The best results are achieved with the bag-of-words model, with an accuracy of 94%.
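
As an illustration of the first approach, the sketch below builds a bag-of-words pipeline with scikit-learn, trains a classifier on an 80/20 split, and reports accuracy on the held-out articles. It is a minimal sketch under assumed inputs: the file name "albanian_news.csv", its column names, TF-IDF weighting, and the choice of a linear SVM are illustrative stand-ins, not the paper's exact setup or its full set of nine classifiers.

```python
# Minimal bag-of-words sketch with scikit-learn (illustrative only).
# Assumptions: a CSV file "albanian_news.csv" with columns "text" and
# "category"; a linear SVM stands in for the nine classifiers compared
# in the paper.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score

data = pd.read_csv("albanian_news.csv")          # hypothetical dataset
X_train, X_test, y_train, y_test = train_test_split(
    data["text"], data["category"], test_size=0.2, random_state=42
)

model = Pipeline([
    ("vectorizer", TfidfVectorizer()),           # each word becomes a vector component
    ("classifier", LinearSVC()),                 # one of several possible classifiers
])

model.fit(X_train, y_train)                      # train on 80% of the articles
print("Accuracy:", accuracy_score(y_test, model.predict(X_test)))
```

For the second approach, fastText's supervised mode expects one example per line, prefixed with a label (e.g. "__label__sport article text ..."). The sketch below enables character n-grams (subword information) and word bigrams (local word order); the file names and hyperparameter values are assumptions for illustration, not the settings used in the paper.

```python
# Minimal fastText sketch (illustrative only).
# Assumes train/test files in fastText format, one labelled article per line.
import fasttext

model = fasttext.train_supervised(
    input="news.train",    # hypothetical training file
    minn=3, maxn=6,        # character n-grams (subword information)
    wordNgrams=2,          # word bigrams capture local word order
    epoch=25,
)

n, precision, recall = model.test("news.test")   # hypothetical test file
print(f"Tested {n} articles, precision@1 = {precision:.2f}")
```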
