Three Methods for Occupation Coding Based on Statistical Learning

Open access

Abstract

Occupation coding, an important task in official statistics, assigns a respondent’s free-text answer to one of many hundreds of occupation codes. To date, occupation coding is still at least partially conducted manually, at great expense. We propose three methods for automatic coding: (1) combining separate models for the detailed occupation codes and for aggregate occupation codes, (2) a hybrid method that combines a duplicate-based approach with a statistical learning algorithm, and (3) a modified nearest neighbor approach. Using data from the German General Social Survey (ALLBUS), we show that the proposed methods improve on both the coding accuracy of the underlying statistical learning algorithm and, where duplicates exist, the coding accuracy of duplicates. Further, we find that defining duplicates based on n-gram variables (a concept from text mining) is preferable to defining them based on exact string matches.
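The difference between the two duplicate definitions, and the idea of a hybrid that falls back to a learned model, can be illustrated with a minimal Python sketch. It is not the authors' implementation: the word-unigram key, the function names (`exact_key`, `ngram_key`, `build_lookup`, `hybrid_code`), the toy answers, the placeholder codes "A" and "B", and the stand-in `fallback` model are all assumptions made for illustration.

```python
import re
from collections import Counter, defaultdict


def exact_key(answer):
    """Duplicate definition 1: the raw answer string must match exactly."""
    return answer


def ngram_key(answer):
    """Duplicate definition 2 (assumed form): the set of lowercased word
    unigrams, so case, punctuation, and word order are ignored."""
    return frozenset(re.findall(r"\w+", answer.lower()))


def build_lookup(training, key_fn):
    """Map each duplicate key to a count of the codes seen with it."""
    table = defaultdict(Counter)
    for answer, code in training:
        table[key_fn(answer)][code] += 1
    return table


def hybrid_code(answer, table, key_fn, fallback):
    """Use the most frequent code among duplicates if any exist;
    otherwise defer to the statistical learning model (fallback)."""
    counts = table.get(key_fn(answer))
    if counts:
        return counts.most_common(1)[0][0], "duplicate"
    return fallback(answer), "model"


# Toy training data with placeholder codes (not real occupation codes).
training = [
    ("Nurse, hospital", "A"),
    ("hospital nurse", "A"),
    ("Car mechanic", "B"),
]


def fallback(answer):
    """Hypothetical stand-in for a trained classifier."""
    return "UNKNOWN"


exact_table = build_lookup(training, exact_key)
ngram_table = build_lookup(training, ngram_key)

new_answer = "Hospital nurse."
result_exact = hybrid_code(new_answer, exact_table, exact_key, fallback)
result_ngram = hybrid_code(new_answer, ngram_table, ngram_key, fallback)
print(result_exact)
print(result_ngram)
```

In this toy case the new answer has no exact-string duplicate and is passed to the fallback model, whereas the n-gram key matches two training answers and returns their code directly. This shows only why a looser duplicate definition can cover more answers; it says nothing about accuracy.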


Journal of Official Statistics

The Journal of Statistics Sweden
