Three Methods for Occupation Coding Based on Statistical Learning

Open access

Abstract

Occupation coding, an important task in official statistics, refers to coding a respondent’s text answer into one of many hundreds of occupation codes. To date, occupation coding is still at least partially conducted manually, at great expense. We propose three methods for automatic coding: a combination of separate models for the detailed occupation codes and for aggregate occupation codes, a hybrid method that combines a duplicate-based approach with a statistical learning algorithm, and a modified nearest neighbor approach. Using data from the German General Social Survey (ALLBUS), we show that the proposed methods improve on both the coding accuracy of the underlying statistical learning algorithm and the coding accuracy of duplicates where duplicates exist. Further, we find that defining duplicates based on n-gram variables (a concept from text mining) is preferable to defining them based on exact string matches.
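To make the hybrid idea and the two duplicate definitions concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation: "n-gram variables" are approximated by the set of lowercased word unigrams, the fallback learner is a hypothetical stand-in function, and the answers and codes ("A", "B") are made-up placeholders rather than ALLBUS data or real occupation codes.

```python
import re
from collections import Counter, defaultdict


def exact_key(text):
    """Duplicate definition 1: the raw answer string must match exactly."""
    return text


def unigram_key(text):
    """Duplicate definition 2 (assumed): same set of lowercased word unigrams."""
    return frozenset(re.findall(r"\w+", text.lower()))


def build_lookup(answers, codes, key_fn):
    """Map each duplicate key to a count of the codes seen in training."""
    table = defaultdict(Counter)
    for answer, code in zip(answers, codes):
        table[key_fn(answer)][code] += 1
    return table


def hybrid_predict(answer, table, key_fn, fallback):
    """Use the most frequent code among duplicates; otherwise ask the learner."""
    counts = table.get(key_fn(answer))
    if counts:
        return counts.most_common(1)[0][0]
    return fallback(answer)


def fallback_model(answer):
    """Hypothetical placeholder for a trained statistical learning algorithm."""
    return "FALLBACK"


# Made-up training data with placeholder codes.
train_answers = ["Lehrer Gymnasium", "Krankenschwester", "Lehrer, Gymnasium"]
train_codes = ["A", "B", "A"]

exact_table = build_lookup(train_answers, train_codes, exact_key)
ngram_table = build_lookup(train_answers, train_codes, unigram_key)

new_answer = "gymnasium lehrer"
by_exact = hybrid_predict(new_answer, exact_table, exact_key, fallback_model)
by_ngram = hybrid_predict(new_answer, ngram_table, unigram_key, fallback_model)
```

In this toy setting the new answer has no exact string match in the training data, so the exact-match variant hands it to the fallback learner, while the unigram-based key treats it as a duplicate of the two "Lehrer Gymnasium" entries despite differences in case, punctuation, and word order.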
