Managing large collections of documents is an important problem in many areas of science, industry, and culture, and probabilistic topic modeling offers a promising solution. Topic models are unsupervised machine learning methods, and their evaluation is an interesting problem in its own right. In recent years, topic interpretability measures have been developed as a more natural option for evaluating topic quality: they emulate human judgments of coherence with word-set correlation scores. In this paper, we present experimental evidence that topic coherence improves when the training corpus is restricted to the relevant portions of each document, extracted by Entity Recognition. Experimenting with job advertisement data, we find that this approach improves topic interpretability by about 40 percentage points on average. Our analysis also reveals that, when models are trained on the extracted text chunks, some redundant topics are merged while others are split into more skill-specific topics, while the fine-grained topics observed in models trained on the full text are preserved.
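As an illustrative sketch (not the paper's exact pipeline), coherence scores of the kind used for topic evaluation can be computed from document co-occurrence statistics alone. The snippet below implements the UMass coherence measure over a toy corpus in which each "document" stands in for the skill-relevant chunks an entity recognizer might extract from a job ad; the word lists and corpus are invented for illustration.

```python
from itertools import combinations
from math import log

# Toy "documents": each set stands in for the skill-relevant text
# chunks extracted from one job advertisement (illustrative data).
docs = [
    {"python", "sql", "machine", "learning"},
    {"python", "machine", "learning", "statistics"},
    {"sql", "database", "administration"},
    {"communication", "teamwork", "leadership"},
]

def doc_freq(word):
    """Number of documents containing `word`."""
    return sum(1 for d in docs if word in d)

def co_doc_freq(w1, w2):
    """Number of documents containing both words."""
    return sum(1 for d in docs if w1 in d and w2 in d)

def umass_coherence(topic_words):
    """UMass coherence: sum over word pairs of log((D(wi, wj) + 1) / D(wj)).

    Higher scores mean the topic's top words co-occur more often,
    which correlates with human judgments of interpretability.
    """
    score = 0.0
    for wi, wj in combinations(topic_words, 2):
        score += log((co_doc_freq(wi, wj) + 1) / doc_freq(wj))
    return score

coherent = umass_coherence(["python", "machine", "learning"])
mixed = umass_coherence(["python", "teamwork", "database"])
print(coherent > mixed)  # the semantically tight topic scores higher
```

In practice such scores are computed over a large reference corpus and averaged across topics; the point of the sketch is only that coherence is a function of how often a topic's top words appear together in documents.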