The Support Vector Machine has been one of the most intensively used text classifiers since its development. However, its performance depends not only on its features but also on data preprocessing and model tuning. The main purpose of this paper is to compare the efficiency of several Support Vector Machine models using the TF-IDF approach as well as Word2Vec and Doc2Vec neural networks for text data representation. Beyond the data vectorization process, I try to enhance the models’ efficiency by identifying which kind of kernel fits the data best, or whether it is preferable to opt for the linear case. My results show that, on the “Reuters 21578” dataset, nonlinear Support Vector Machines are more efficient when text data are converted into numerical attributes with Word2Vec models rather than with TF-IDF or Doc2Vec representations. When the data are assumed to meet the linear separability requirements, the TF-IDF representation outperforms all other options. Surprisingly, Doc2Vec models have the lowest performance; only in terms of computational cost do they provide satisfactory results. This paper shows that while Word2Vec models are genuinely efficient for text data representation, Doc2Vec neural networks are unable to surpass even the TF-IDF index representation. This evidence contradicts the common view that Doc2Vec models should provide better insight into the training-data domain than Word2Vec models, and certainly better than the TF-IDF index.
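
To make the experimental comparison concrete, the following is a minimal sketch of the three vectorization pipelines feeding linear and RBF-kernel SVMs, assuming scikit-learn and gensim; the toy corpus, labels, and hyperparameters are illustrative placeholders, not the paper’s actual Reuters-21578 setup.

```python
# Illustrative sketch only: toy corpus and labels stand in for Reuters-21578.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
from gensim.models import Word2Vec
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

docs = ["oil prices rise on supply fears",
        "central bank raises interest rates",
        "crude futures climb after report",
        "rates held steady by the bank"]
labels = [0, 1, 0, 1]  # hypothetical topic labels
tokens = [d.split() for d in docs]

# Representation 1: TF-IDF term weighting.
X_tfidf = TfidfVectorizer().fit_transform(docs)

# Representation 2: average of Word2Vec word vectors per document.
w2v = Word2Vec(sentences=tokens, vector_size=50, min_count=1, seed=1)
X_w2v = np.array([np.mean([w2v.wv[t] for t in doc], axis=0) for doc in tokens])

# Representation 3: Doc2Vec paragraph vectors, one per document.
tagged = [TaggedDocument(words=t, tags=[i]) for i, t in enumerate(tokens)]
d2v = Doc2Vec(tagged, vector_size=50, min_count=1, seed=1)
X_d2v = np.array([d2v.dv[i] for i in range(len(docs))])

# Fit linear and nonlinear (RBF-kernel) SVMs on each representation.
for name, X in [("TF-IDF", X_tfidf), ("Word2Vec", X_w2v), ("Doc2Vec", X_d2v)]:
    for kernel in ("linear", "rbf"):
        clf = SVC(kernel=kernel).fit(X, labels)
        print(name, kernel, clf.score(X, labels))
```

In a realistic setting, each representation-kernel pair would be scored on a held-out split (and with tuned hyperparameters) rather than on the training data as in this sketch.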