Three Methods for Occupation Coding Based on Statistical Learning

Occupation coding, an important task in official statistics, refers to coding a respondent’s text answer into one of many hundreds of occupation codes. To date, occupation coding is still at least partially conducted manually, at great expense. We propose three methods for automatic coding: combining separate models for the detailed occupation codes and for aggregate occupation codes; a hybrid method that combines a duplicate-based approach with a statistical learning algorithm; and a modified nearest neighbor approach. Using data from the German General Social Survey (ALLBUS), we show that the proposed methods improve on both the coding accuracy of the underlying statistical learning algorithm and the coding accuracy of duplicates where duplicates exist. Further, we find that defining duplicates based on n-gram variables (a concept from text mining) is preferable to defining them based on exact string matches.
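To make the hybrid idea concrete, the following minimal Python sketch shows one possible reading of it; it is not the authors' implementation. It assumes that two answers count as duplicates when they contain the same set of lowercase word n-grams, that a duplicate receives the most frequent code among its matching training answers, and that all other answers go to a fallback learner. The names `ngram_key`, `HybridCoder` and `make_nearest_neighbor`, the Jaccard-based fallback, and the placeholder codes "A" and "B" are hypothetical and chosen only for illustration.

```python
from collections import Counter, defaultdict


def ngram_key(text, n=1):
    """Return the set of lowercase word n-grams in a text as a hashable key."""
    words = text.lower().split()
    return frozenset(tuple(words[i:i + n]) for i in range(len(words) - n + 1))


def make_nearest_neighbor(texts, codes, n=1):
    """Stand-in fallback learner (an assumption): 1-nearest neighbor by Jaccard similarity."""
    keys = [ngram_key(t, n) for t in texts]

    def predict(text):
        query = ngram_key(text, n)

        def jaccard(key):
            union = query | key
            return len(query & key) / len(union) if union else 0.0

        # Pick the training answer whose n-gram set overlaps most with the query.
        best = max(range(len(keys)), key=lambda i: jaccard(keys[i]))
        return codes[best]

    return predict


class HybridCoder:
    """Use the duplicate-based code when a duplicate exists, else the fallback."""

    def __init__(self, fallback, n=1):
        self.fallback = fallback
        self.n = n
        self.codes_by_key = defaultdict(Counter)

    def fit(self, texts, codes):
        # Count how often each code was assigned to each n-gram set.
        for text, code in zip(texts, codes):
            self.codes_by_key[ngram_key(text, self.n)][code] += 1
        return self

    def predict(self, text):
        key = ngram_key(text, self.n)
        if key in self.codes_by_key:
            # Duplicate found: return the most frequent code among duplicates.
            return self.codes_by_key[key].most_common(1)[0][0]
        return self.fallback(text)


# Toy data with placeholder codes (not real ISCO-88 codes).
train_texts = ["school teacher", "teacher school", "car mechanic"]
train_codes = ["A", "A", "B"]

coder = HybridCoder(make_nearest_neighbor(train_texts, train_codes))
coder.fit(train_texts, train_codes)

duplicate_result = coder.predict("Teacher School")            # n-gram duplicate
fallback_result = coder.predict("mechanic for car engines")   # no duplicate
```

Under this assumed definition, "Teacher School" matches "school teacher" even though the strings differ, which illustrates how an n-gram-based duplicate can be broader than an exact string match.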

eISSN: 2001-7367
Language: English

Publication timeframe: 4 times per year
Journal Subjects: Mathematics, Probability and Statistics

Published Online: Feb 21, 2017

Page range: 101 - 122

Received: Mar 01, 2016

Accepted: Oct 01, 2016

DOI: https://doi.org/10.1515/jos-2017-0006

Keywords
Automated coding, Machine learning, ISCO-88, ALLBUS

© by Hyukjun Gweon

This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License.
