Background: Text classification is an important task in information retrieval. Its objective is to assign new text documents to a set of predefined classes using supervised algorithms. Objectives: We focus on text classification for Albanian news articles using two approaches. Methods/Approach: In the first approach, the words in a collection are treated as independent components, and each is assigned a corresponding vector in the vector space. Here we used nine classifiers from the scikit-learn package, training them on 80% of the news articles and testing accuracy on the remaining 20%. In the second approach, text classification treats words according to their semantic and syntactic similarities, representing each word by the character n-grams it is composed of. In this case we used fastText, a hierarchical classifier that takes into account local word order as well as sub-word information. We measured the accuracy of each classifier separately and also analyzed training and testing times. Results: Our results show that the bag-of-words model outperforms fastText when the text dataset is not large, while fastText performs better when classifying multi-label text. Conclusions: News articles can serve as a benchmark for testing classification algorithms on Albanian texts. The best results are achieved with a bag-of-words model, with an accuracy of 94%.
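The first approach described above (independent words, one vector component per word) can be sketched in a few lines. The documents and the `bag_of_words` helper below are hypothetical illustrations, not the paper's Albanian corpus or its scikit-learn pipeline:

```python
from collections import Counter

def bag_of_words(docs):
    """Map each document to a vector of word counts over a shared vocabulary."""
    # vocabulary: every distinct word in the collection, in a fixed order
    vocab = sorted({w for d in docs for w in d.lower().split()})
    vectors = []
    for d in docs:
        counts = Counter(d.lower().split())
        vectors.append([counts.get(w, 0) for w in vocab])
    return vocab, vectors

docs = ["the match ended in a draw", "the government passed the law"]
vocab, vecs = bag_of_words(docs)
# each document is now a point in a len(vocab)-dimensional vector space
```

In the paper's setting, count vectors like these, built from 80% of the articles, would be fed to the nine scikit-learn classifiers for training, with the remaining 20% held out for accuracy measurement.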
Despite the rapid growth of other types of social media, Internet discussion forums remain a highly popular communication channel and a useful source of text data for analyzing user interests and sentiments. Being suited to richer, deeper, and longer discussions than microblogging services, they particularly well reflect topics of long-term, persisting involvement and areas of specialized knowledge or experience. Discovering and characterizing such topics and areas with text mining algorithms is therefore an interesting and useful research direction. This work presents a case study in which selected classification algorithms are applied to posts from a Polish discussion forum devoted to psychoactive substances obtained from home-grown plants, such as hashish or marijuana. The utility of two different vector text representations is examined: the simple bag-of-words representation and the more refined global-vectors (GloVe) embedding. While the former is found to work well for the multinomial naive Bayes algorithm, the latter turns out to be more useful for the other classification algorithms: logistic regression, SVMs, and random forests. The obtained results suggest that post classification can be applied to measure the publication intensity of particular topics and, in the case of forums related to psychoactive substances, to monitor the risk of drug-related crime.
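A common way to turn per-word global-vector embeddings into a post-level representation for classifiers such as logistic regression or SVMs is to average the word vectors. The abstract does not state how the study combines its embeddings, so this is a sketch under that assumption, with made-up two-dimensional vectors standing in for real pretrained ones:

```python
def embed_document(text, embeddings):
    """Represent a text as the mean of its known words' embedding vectors."""
    vecs = [embeddings[w] for w in text.lower().split() if w in embeddings]
    if not vecs:
        return None  # no known words: no representation
    dim = len(vecs[0])
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]

# hypothetical 2-d embeddings standing in for real pretrained global vectors
embeddings = {"grow": [1.0, 0.0], "plant": [0.0, 1.0]}
doc_vec = embed_document("grow a plant", embeddings)  # → [0.5, 0.5]
```

Unlike a bag-of-words vector, the result has a fixed, small dimensionality regardless of vocabulary size, which tends to suit margin- and distance-based classifiers.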
the CrowdFlower crowdsourcing platform and by examining their selection of term dependence. We produce the algorithmic assessments using four state-of-the-art term dependence ranking models (Lioma et al., 2015; Metzler & Croft, 2005). Given a query, both user and algorithmic approaches decide if the query contains heavily dependent terms that should be treated as a fixed phrase instead of a bag-of-words. We compare retrieval performance between user and algorithmic methods of deciding term dependence, and also against a bag-of-words (no term dependence) baseline
Document clustering is the problem of automatically grouping similar documents into categories based on some similarity metric. Almost all available data, especially on the web, are unclassified, so we need powerful clustering algorithms that work with these types of data. All common search engines return a list of pages relevant to the user's query. This list needs to be generated quickly and as accurately as possible, and because the web pages are unclassified, powerful clustering algorithms are again required. In this paper we present a clustering algorithm called DBSCAN (Density-Based Spatial Clustering of Applications with Noise) and its limitations for clustering documents (or web pages). Documents are represented using the "bag-of-words" representation (word occurrence frequencies), a representation on which many algorithms usually fail. In this paper we use Information Gain as the feature selection method and evaluate the DBSCAN algorithm by its capacity to integrate all the samples from the dataset into clusters.
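The density-based clustering just described can be illustrated with a compact, self-contained DBSCAN run over toy term-frequency vectors. The vectors and parameters below are hypothetical, and the sketch omits the paper's Information Gain feature selection step:

```python
import math

def dbscan(points, eps, min_pts):
    """Minimal DBSCAN: labels of -1 mark noise, labels >= 0 mark clusters."""
    labels = [None] * len(points)  # None = not yet visited

    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    def neighbors(i):
        return [j for j in range(len(points)) if dist(points[i], points[j]) <= eps]

    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        nbrs = neighbors(i)
        if len(nbrs) < min_pts:
            labels[i] = -1  # too sparse: provisionally noise
            continue
        labels[i] = cluster
        seeds = list(nbrs)
        while seeds:                  # expand the cluster density-reachably
            j = seeds.pop()
            if labels[j] == -1:
                labels[j] = cluster   # noise absorbed as a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            jn = neighbors(j)
            if len(jn) >= min_pts:    # j is a core point: keep expanding
                seeds.extend(jn)
        cluster += 1
    return labels

tf_vectors = [
    [3, 0, 0], [2, 1, 0], [3, 1, 0],  # documents dominated by term 1
    [0, 0, 5], [0, 1, 4],             # documents dominated by term 3
]
labels = dbscan(tf_vectors, eps=2.0, min_pts=2)  # → [0, 0, 0, 1, 1]
```

The `eps` and `min_pts` parameters are the crux of DBSCAN's difficulty on bag-of-words data: in high-dimensional, sparse count spaces a single `eps` rarely fits all clusters, which is one reason many samples end up labeled as noise instead of being integrated into clusters.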
In this paper we present a novel approach to minimally supervised synonym extraction. The approach is based on word embeddings and aims to provide a synonym extraction method that is extensible to various languages.
We report experiments with word vectors trained using both the continuous bag-of-words model (CBoW) and the skip-gram model (SG), investigating the effects of different settings with respect to the contextual window size, the number of dimensions, and the type of word vectors. We analyze the word categories that are (cosine-)similar in the vector space, showing that cosine similarity on its own is a poor indicator of whether two words are synonymous. In this context, we propose a new measure, relative cosine similarity, which calculates similarity relative to other cosine-similar words in the corpus. We show that calculating similarity relative to other words boosts the precision of the extraction. We also experiment with combining similarity scores from differently trained vectors and explore the advantages of using a part-of-speech tagger as a way of introducing light supervision, thus aiding extraction.
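The relative cosine similarity measure can be sketched as follows: the cosine of a candidate pair is normalised by the summed cosines of the word's top-n most similar words, the intuition being that a genuine synonym should claim a disproportionate share of that similarity mass. The tiny vocabulary below is a made-up illustration, not the trained CBoW/SG vectors:

```python
import math

def cos(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(x * x for x in b)))

def relative_cosine_similarity(w1, w2, vectors, n=10):
    """cos(w1, w2) divided by the summed cosines of w1's top-n neighbours."""
    sims = sorted((cos(vectors[w1], vectors[w]) for w in vectors if w != w1),
                  reverse=True)
    return cos(vectors[w1], vectors[w2]) / sum(sims[:n])

# hypothetical 2-d vectors: "car"/"automobile" close, "banana" unrelated
vectors = {"car": [1.0, 0.0], "automobile": [0.9, 0.1],
           "banana": [0.0, 1.0], "fruit": [0.1, 0.9]}
rcs_syn = relative_cosine_similarity("car", "automobile", vectors, n=2)
rcs_other = relative_cosine_similarity("car", "banana", vectors, n=2)
```

Because the denominator is shared for a given word, a pair scoring well above the average share 1/n of the top-n similarity mass stands out as a synonym candidate in a way that a raw cosine value, whose typical magnitude varies from word to word, does not.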
We perform both intrinsic and extrinsic evaluation on our final system: intrinsic evaluation is carried out manually by two human evaluators and we use the output of our system in a machine translation task for extrinsic evaluation, showing that the extracted synonyms improve the evaluation metric.
Recent studies have shown that machine learning can identify individuals with mental illnesses by analyzing their social media posts. Topics and words related to mental health are among the top predictors. These findings have implications for the early detection of mental illnesses. However, they also raise numerous privacy concerns. To fully evaluate the implications for privacy, we analyze the performance of different machine learning models in the absence of tweets that talk about mental illnesses. Our results show that machine learning can be used to make predictions even if the users do not actively talk about their mental illness. To fully understand the implications of these findings, we analyze the features that make these predictions possible. We analyze bag-of-words, word cluster, part-of-speech n-gram, and topic model features to understand the machine learning model and to discover language patterns that differentiate individuals with mental illnesses from a control group. This analysis confirmed some of the known language patterns and uncovered several new ones. We then discuss the possible applications of machine learning for identifying mental illnesses, the feasibility of such applications, the associated privacy implications, and the feasibility of potential mitigations.