This paper introduces OccCANINE, a new tool for automatically transforming occupational descriptions into the HISCO classification system. The manual work involved in processing and classifying occupational descriptions is error-prone, tedious, and time-consuming. The authors fine-tune a pre-existing language model (CANINE) to do this automatically, thereby performing in seconds or minutes what previously took days or weeks.
The model is trained on 14 million pairs of occupational descriptions and HISCO codes in 13 different languages contributed by 22 different sources. The authors show that OccCANINE has an overall accuracy of 93.5 percent, with precision, recall, and F1-score above 90 percent. The tool breaks the metaphorical HISCO barrier and makes this data readily available for analysis of occupational structures with broad applicability in economics, economic history, and various related disciplines.
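To make the reported metrics concrete, the sketch below shows one common way to score a classifier that may assign more than one code to a description: compare the predicted and true code sets per description, then average. This is an illustrative assumption, not necessarily the exact evaluation procedure used by the authors. The function names, the toy data, and the code strings are hypothetical placeholders and have not been checked against the HISCO catalogue.

```python
# Minimal sketch: per-description accuracy, precision, recall, and F1 for a
# classifier that outputs a set of codes per description.
# Assumption: metrics are computed per description and then averaged; the
# paper's exact procedure may differ.

def set_metrics(true_codes, pred_codes):
    """Return (exact_match, precision, recall, f1) for one description."""
    true_set, pred_set = set(true_codes), set(pred_codes)
    overlap = len(true_set & pred_set)
    precision = overlap / len(pred_set) if pred_set else 0.0
    recall = overlap / len(true_set) if true_set else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return float(true_set == pred_set), precision, recall, f1


def average_metrics(pairs):
    """Average the four per-description metrics over (true, predicted) pairs."""
    rows = [set_metrics(true, pred) for true, pred in pairs]
    return tuple(sum(column) / len(rows) for column in zip(*rows))


# Toy data: (true codes, predicted codes). Code strings are placeholders.
pairs = [
    (["61110"], ["61110"]),           # exact match
    (["61110", "95410"], ["61110"]),  # one of two codes recovered
    (["95410"], ["54010"]),           # wrong code
    (["54010"], ["54010"]),           # exact match
]

accuracy, precision, recall, f1 = average_metrics(pairs)
print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
```

Under this convention, accuracy only rewards exact matches of the full code set, while precision and recall give partial credit when some but not all codes are recovered.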
The authors provide a detailed evaluation of the model's performance, including out-of-distribution testing, analysis of performance by label frequency, and examination of potential biases related to socioeconomic status. The results demonstrate the model's robustness, adaptability, and lack of systematic biases. The authors also provide recommendations on how to use OccCANINE and highlight its potential for broader applications beyond occupational data.
Source: arxiv.org