The Support Vector Machine has been one of the most intensively used text classifiers since its development. However, its performance depends not only on its features but also on data preprocessing and model tuning. The main purpose of this paper is to compare the efficiency of several Support Vector Machine models using the TF-IDF approach as well as Word2Vec and Doc2Vec neural networks for text data representation. Beyond the data vectorization process, I try to enhance the models’ efficiency by identifying which kind of kernel fits the data best, or whether it is preferable to opt for the linear case. My results show that, on the “Reuters 21578” dataset, nonlinear Support Vector Machines are more efficient when text data are converted into numerical attributes with Word2Vec models rather than with TF-IDF or Doc2Vec representations. When the data are assumed to meet the linear separability requirements, the TF-IDF representation outperforms all other options. Surprisingly, Doc2Vec models have the lowest performance; only in terms of computational cost do they provide satisfactory results. This paper shows that while Word2Vec models are genuinely efficient for text data representation, Doc2Vec neural networks are unable to surpass even the TF-IDF index representation. This evidence contradicts the common view that Doc2Vec models should provide better insight into the training-data domain than Word2Vec models, and certainly better than the TF-IDF index.
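
To make the experimental comparison concrete, the following is a minimal sketch of the three vectorization pipelines feeding linear and RBF-kernel SVMs, assuming scikit-learn and gensim; the toy corpus, labels, and hyperparameters are illustrative placeholders, not the paper’s actual Reuters-21578 setup.

```python
# Illustrative sketch only: toy corpus and labels stand in for Reuters-21578.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
from gensim.models import Word2Vec
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

docs = ["oil prices rise on supply fears",
        "central bank raises interest rates",
        "crude futures climb after report",
        "rates held steady by the bank"]
labels = [0, 1, 0, 1]  # hypothetical topic labels
tokens = [d.split() for d in docs]

# Representation 1: TF-IDF term weighting.
X_tfidf = TfidfVectorizer().fit_transform(docs)

# Representation 2: average of Word2Vec word vectors per document.
w2v = Word2Vec(sentences=tokens, vector_size=50, min_count=1, seed=1)
X_w2v = np.array([np.mean([w2v.wv[t] for t in doc], axis=0) for doc in tokens])

# Representation 3: Doc2Vec paragraph vectors, one per document.
tagged = [TaggedDocument(words=t, tags=[i]) for i, t in enumerate(tokens)]
d2v = Doc2Vec(tagged, vector_size=50, min_count=1, seed=1)
X_d2v = np.array([d2v.dv[i] for i in range(len(docs))])

# Fit linear and nonlinear (RBF-kernel) SVMs on each representation.
for name, X in [("TF-IDF", X_tfidf), ("Word2Vec", X_w2v), ("Doc2Vec", X_d2v)]:
    for kernel in ("linear", "rbf"):
        clf = SVC(kernel=kernel).fit(X, labels)
        print(name, kernel, clf.score(X, labels))
```

In a realistic setting, each representation-kernel pair would be scored on a held-out split (and with tuned hyperparameters) rather than on the training data as in this sketch.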