Genetic analysis of cabbages and related cultivated plants using the bag-of-words model

Open access


In this study, we introduce the bag-of-words model, an analytical method mainly used for the analysis of natural languages (document classification, authorship attribution and so on; e.g. [1, 2]). Quantitative linguistic methods similar to bag-of-words (e.g. the Damerau–Levenshtein distance in the paper by Serva and Petroni [3]) have been used to map language evolution within the field of glottochronology. We apply the method to biological taxonomy, specifically to the Brassicaceae (Cruciferae) family. The subjects of our interest are well-known cultivated crops that differ markedly in morphology and are culturally perceived as serving different purposes (e.g. oil from oilseed rape, turnip as animal feed and cabbage as a side dish). Despite their phenotypic divergence, these crops are very closely related, which is not obvious from their morphology. For this reason, we consider Brassicaceae crops appropriate illustrative examples for introducing the method. For the analysis, we use genetic markers (internal transcribed spacer [ITS] and maturase K [matK]). Until now, the bag-of-words model has not been used for biological taxonomisation purposes; therefore, the results of the bag-of-words analysis are compared with the existing, very well-developed Brassica taxonomy. Our goal is to present a method that is suitable for reconstructing language development and may also be usable for biological taxonomy.
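To illustrate how a bag-of-words representation might carry over from text to marker sequences, the following is a minimal sketch only. It assumes that the "words" are overlapping k-mers and that sequences are compared by cosine distance; the abstract specifies neither the tokenisation nor the distance measure, so both choices, the function names `kmer_bag` and `cosine_distance`, and the toy sequences are illustrative assumptions rather than the authors' procedure.

```python
from collections import Counter
from math import sqrt


def kmer_bag(sequence: str, k: int = 3) -> Counter:
    """Count overlapping k-mers; their order within the sequence is discarded."""
    seq = sequence.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def cosine_distance(a: Counter, b: Counter) -> float:
    """Return 1 - cosine similarity between two k-mer count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # an empty bag shares nothing with any other bag
    return 1.0 - dot / (norm_a * norm_b)


# Toy sequences invented for illustration only (NOT real ITS or matK data).
toy = {
    "taxon_A": "ACGTACGTGGCA",
    "taxon_B": "ACGTACGTGGCT",
    "taxon_C": "TTGACCATGACC",
}

bags = {name: kmer_bag(seq) for name, seq in toy.items()}
names = sorted(bags)

# Pairwise distances between all taxa; smaller means more similar k-mer content.
distances = {
    (x, y): cosine_distance(bags[x], bags[y])
    for i, x in enumerate(names)
    for y in names[i + 1:]
}

for pair, d in distances.items():
    print(pair, round(d, 3))
```

In this toy setting, the two sequences that differ in a single base end up closer to each other than either is to the unrelated third sequence. A distance matrix of this kind could then be passed to a clustering method and the resulting grouping compared with an established taxonomy.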


  • [1] Soumya G. K. Shibily J. 2014. Text classification by augmenting bag of words (BOW) representation with co-occurrence feature. IOSR Journal of Computer Engineering (IOSR-JCE) 16(1) 34–38.

  • [2] Boukhaled M. A. Ganascia J.-G. 2015. Using Function Words for Authorship Attribution: Bag-Of-Words vs. Sequential Rules. The 11th International Workshop on Natural Language Processing and Cognitive Science Oct 2014 Venice Italy. De Gruyter Natural Language Processing and Cognitive Science Proceedings 2014 115–122.

  • [3] Serva M. Petroni I. F. 2008. Indo-European Languages Tree by Levenshtein Distance. EPL (Europhysics Letters) 81 680–685.

  • [4] Swadesh M. 1952. Lexico-statistic dating of prehistoric ethnic contacts. Proceedings of American Philosophical Society 96 452–463.

  • [5] Swadesh M. 1955. Towards greater accuracy in lexicostatistic dating. International Journal of American Linguistics 21 121–137.

  • [6] Embleton S. 2000. Lexicostatistics/Glottochronology: from Swadesh to Sankoff to Starostin to future horizons. In: C. Renfrew A. McMahon and L. Trask (eds.) Time Depth in Historical Linguistics 1. Cambridge: McDonald Institute for Archaeological Research pp. 143–165.

  • [7] Toldo R. Castellani U. Fusiello A. 2009. A bag of words approach for 3D object categorization. In: Gagalowicz A. Philips W. (eds.) Computer vision/computer graphics Collaboration techniques. MIRAGE 2009. Lecture Notes in Computer Science Vol. 5496. Berlin: Springer.

  • [8] Zhang Y. Jin R. Zhou Z. H. 2010. Understanding bag-of-words model: a statistical framework. International Journal of Machine Learning and Cybernetics 1(1–4) 43–52.

  • [9] Bolshoy A. Volkovich Z. Kirzhner V. et al. 2010. Genome clustering from linguistic models to classification of genetic texts. Berlin: Springer.

  • [10] Lovato P. 2015. Bag of words approaches for Bioinformatics. Ph.D. thesis Dept. of Computer Science University of Verona series TD-03-15.

  • [11] Harris Z. 1954. Distributional structure. Word 10(2/3) 146–162.

  • [12] Huang C. H. Sun R. Hu Y. et al. 2016. Resolution of Brassicaceae phylogeny using nuclear genes uncovers nested radiations and supports convergent morphological evolution. Molecular Biology and Evolution 33(2) 394–412.

  • [13] Francisco-Ortega J. Fuertes-Aguilar J. Gómez Campo C. et al. 1999. Internal transcribed spacer sequence phylogeny of Crambe L. (Brassicaceae): molecular data revealed two old world disjunctions. Molecular Phylogenetics and Evolution 11 361–380.

  • [14] Koch M. Haubold B. Mitchell-Olds T. 2001. Molecular systematics of the Brassicaceae: evidence from coding plastidic matK and nuclear Chs sequences. American Journal of Botany 88 534–544.

  • [15] Koch M. Sharma A. K. Sharma A. 2003. Molecular phylogenetics evolution and population biology in Brassicaceae. Plant Genome: Biodiversity and Evolution 1 1–35.

  • [16] Warwick S. I. Sauder C. 2005. Phylogeny of tribe Brassiceae (Brassicaceae) based on chloroplast restriction site polymorphisms and nuclear ribosomal internal transcribed spacer and chloroplast trnL intron sequences. Canadian Journal of Botany 83 467–483.

  • [17] Warwick S. I. Francis A. Al-Shehbaz I. A. 2006. Brassicaceae: Species checklist and database on CDROM. Plant Systematics and Evolution 259 249–258.

  • [18] Mummenhoff K. Al-Shehbaz I. A. Bakker F. T. et al. 2005. Phylogeny morphological evolution and speciation of endemic Brassicaceae genera in the Cape flora of southern Africa. Annals of the Missouri Botanical Garden 92 400–424.

  • [19] Couvreur T. Franzke A. Al-Shehbaz I. A. et al. 2010. Molecular phylogenetics temporal diversification and principles of evolution in the mustard family (Brassicaceae). Molecular Biology and Evolution 27 55–71.

  • [20] Franzke A. Lysak M. A. Al-Shehbaz I. A. et al. 2011. Cabbage family affairs: the evolutionary history of Brassicaceae. Trends in Plant Science 16(2) 108–116.

  • [21] Al-Shehbaz I. A. Beilstein M. A. Kellogg E. A. 2006. Systematics and phylogeny of the Brassicaceae (Cruciferae): an overview. Plant Systematics and Evolution 259 89–120.

  • [22] Hayek A. 1911. Entwurf eines Cruciferensystems auf phylogenetischer Grundlage. Beihefte zum Botanischen Centralblatt 27 127–335.

  • [23] Nagaharu U. 1935. Genome analysis in Brassica with special reference to the experimental formation of B. napus and peculiar mode of fertilization. Journal of Japanese Botany 7 389–452.

  • [24] Sadowski J. Kole C. 2011. Genetics genomics and breeding of vegetable Brassicas. Enfield NH USA: Science Publishers.

  • [25] Schulz O. E. Engler A. Harms H. 1936. Cruciferae Die natürlichen Pflanzenfamilien. Leipzig Germany Verlag Von Wilhelm Engelmann 227–658.

  • [26] Bailey C. D. Koch M. A. Mayer M. et al. 2006. Toward a global phylogeny of the Brassicaceae. Molecular Biology and Evolution 23 2142–2160.

  • [27] Liu L. Zhao B. Tan D. Wang J. 2012. Phylogenetic relationships of Brassicaceae species based on matK sequences. Pakistan Journal of Botany 44(2) 619–626.

  • [28] Maggioni L. 2015. Domestication of Brassica oleracea L. Doctoral Thesis No. 2015:74 Faculty of Landscape Architecture Horticulture and Crop Production Science.

  • [29] Juniper B. E. Watkins R. Harris S. A. 1998. The origin of the apple. Acta Horticulturae 484 27–33.

  • [30] Crespo M. B. Lledo M. D. Fay M. F. et al. 2000. Subtribe Vellinae (Brassiceae Brassicaceae): a combined analysis of ITS nrDNA sequences and morphological data. Annals of Botany 86 53–62.

  • [31] German D. A. Friesen N. Neuffer B. et al. 2009. Contribution to ITS phylogeny of the Brassicaceae with special reference to some Asian taxa. Plant Systematics and Evolution 283 33–56.
