Automating Occupational Standardization: OccCANINE Breaks the HISCO Barrier


Core Concepts
OccCANINE, a transformer language model, can automatically and accurately classify occupational descriptions into the HISCO classification system, significantly reducing the time and effort required for manual coding.
Summary

This paper introduces OccCANINE, a new tool for automatically transforming occupational descriptions into the HISCO classification system. Processing and classifying occupational descriptions by hand is error-prone, tedious, and time-consuming. The authors fine-tune a pre-existing language model (CANINE) to do this automatically, completing in seconds or minutes what previously took days or weeks.

The model is trained on 14 million pairs of occupational descriptions and HISCO codes in 13 different languages contributed by 22 different sources. The authors show that OccCANINE has an overall accuracy of 93.5 percent, with precision, recall, and F1-score above 90 percent. The tool breaks the metaphorical HISCO barrier and makes this data readily available for analysis of occupational structures with broad applicability in economics, economic history, and various related disciplines.
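How the four headline metrics relate can be shown with a small sketch. It assumes a single-label setting with macro-averaging across classes; this summary does not state which averaging scheme the authors use, and the labels `A`, `B`, `C` are placeholders, not HISCO codes.

```python
from collections import Counter


def evaluate(y_true, y_pred):
    """Accuracy plus macro-averaged precision, recall and F1 (single-label case)."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted label was wrong
            fn[truth] += 1  # true label was missed

    def safe_div(a, b):
        return a / b if b else 0.0

    precision = [safe_div(tp[l], tp[l] + fp[l]) for l in labels]
    recall = [safe_div(tp[l], tp[l] + fn[l]) for l in labels]
    f1 = [safe_div(2 * p * r, p + r) for p, r in zip(precision, recall)]
    n = len(labels)
    return {
        "accuracy": safe_div(sum(tp.values()), len(y_true)),
        "precision": sum(precision) / n,
        "recall": sum(recall) / n,
        "f1": sum(f1) / n,
    }


# Toy data for illustration only.
scores = evaluate(["A", "A", "B", "C"], ["A", "B", "B", "C"])
print(scores)
```

Macro-averaging weights every class equally, so rare codes count as much as common ones; a frequency-weighted average would instead track overall accuracy more closely.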

The authors provide a detailed evaluation of the model's performance, including out-of-distribution testing, analysis of performance by label frequency, and examination of potential biases related to socioeconomic status. The results demonstrate the model's robustness, adaptability, and lack of systematic biases. The authors also provide recommendations on how to use OccCANINE and highlight its potential for broader applications beyond occupational data.

Statistics
The model was trained on 14 million pairs of occupational descriptions and HISCO codes in 13 different languages contributed by 22 different research projects.
Quotes
"Even a highly experienced researcher might spend 10 seconds recognizing and typing the correct HISCO code for any given occupational description and even for 10,000 unique occupational descriptions this would mean that the researcher would spend in the order of 28 hours coding everything, or 280 hours (11 days - no breaks) for 100,000 observations."

"By significantly reducing the time and effort required for HISCO coding, our tool democratizes access to historical occupational data analysis, enabling researchers to conduct more extensive and diverse studies and dedicate more time to data quality."
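The arithmetic behind the first quote can be checked with a few lines. The 10-second figure comes from the quote itself; the function name is only illustrative.

```python
SECONDS_PER_DESCRIPTION = 10  # per-description effort stated in the quote


def manual_coding_hours(n_descriptions, seconds_each=SECONDS_PER_DESCRIPTION):
    """Hours needed to code n unique descriptions by hand, without breaks."""
    return n_descriptions * seconds_each / 3600


for n in (10_000, 100_000):
    hours = manual_coding_hours(n)
    print(f"{n:>7,} descriptions: {hours:6.1f} hours ({hours / 24:4.1f} days)")
```

The results come to just under 28 and just under 280 hours, which matches the order-of-magnitude figures in the quote.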

Key insights extracted from

by Chri... at arxiv.org 04-03-2024

https://arxiv.org/pdf/2402.13604.pdf
Breaking the HISCO Barrier

Deeper Inquiries

How can OccCANINE be further extended or adapted to handle occupational data in other classification systems beyond HISCO?

OccCANINE can be extended to handle occupational data in other classification systems by following a similar approach to the one used for HISCO. Researchers can collect training data with occupational descriptions and their corresponding codes in the new classification system. This data can then be used to fine-tune the OccCANINE model to recognize and classify occupations based on the new system. By adjusting the classification head of the model to match the number of categories in the new system and providing appropriate training data, OccCANINE can be adapted to work with various classification systems.
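The data-preparation side of this adaptation can be sketched as follows: the size of the new classification head follows from the set of distinct codes in the training pairs. This is a minimal illustration, not the authors' pipeline; the function names are hypothetical and the codes `X01` to `X03` are made-up placeholders rather than codes from any real classification system.

```python
def build_label_index(pairs):
    """Map each distinct code in (description, code) pairs to a class index."""
    codes = sorted({code for _, code in pairs})
    return {code: idx for idx, code in enumerate(codes)}


def encode_targets(pairs, code_to_idx):
    """Replace each code with its integer class index for training."""
    return [(text, code_to_idx[code]) for text, code in pairs]


# Toy training pairs for a hypothetical target classification system.
pairs = [
    ("farm labourer", "X01"),
    ("blacksmith", "X02"),
    ("smith", "X02"),
    ("school teacher", "X03"),
]

code_to_idx = build_label_index(pairs)
num_classes = len(code_to_idx)  # output width of the new classification head
encoded = encode_targets(pairs, code_to_idx)
print(num_classes, encoded)
```

With the label index fixed, the fine-tuning step would replace the final layer with one that has `num_classes` outputs and train on the encoded pairs.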

What are the potential limitations or biases of using machine learning models like OccCANINE for occupational data analysis, and how can researchers mitigate these issues?

One potential limitation of using machine learning models like OccCANINE is their reliance on the quality and representativeness of the training data. Biases present in the training data can lead to biased predictions by the model. Researchers can mitigate these issues by ensuring diverse and representative training data, regularly updating the model with new data, and conducting thorough validation and testing to identify and address any biases. Another limitation is the interpretability of the model's decisions. Large neural models such as transformers largely operate as black boxes, making it challenging to understand how they arrive at specific predictions. Researchers can address this by applying explainable AI techniques to gain insight into the model's decision-making process.

Given the model's performance on rare occupations, how can the insights from this study inform our understanding of historical occupational structures and social mobility?

The study's insights on the model's performance on rare occupations can provide valuable information on historical occupational structures and social mobility. By analyzing the model's accuracy, precision, recall, and F1 score for rare occupations, researchers can identify patterns and trends in the distribution of these occupations over time. This information can shed light on the social status, economic activities, and mobility of individuals in different historical contexts. Additionally, understanding how the model performs on rare occupations can help researchers uncover hidden or overlooked occupations that may have played significant roles in historical societies. By leveraging OccCANINE's capabilities to analyze and classify rare occupations, researchers can gain a more comprehensive understanding of historical occupational structures and their implications for social mobility.
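One way to examine performance on rare occupations is to group test accuracy by how often each true label appears in the training data. The sketch below assumes single-label predictions; the bin edges and the placeholder labels are arbitrary choices for illustration and are not taken from the paper.

```python
from collections import Counter, defaultdict


def accuracy_by_label_frequency(train_labels, y_true, y_pred, bin_edges=(10, 100)):
    """Accuracy on test items, bucketed by training frequency of the true label."""
    freq = Counter(train_labels)

    def bucket(label):
        count = freq[label]  # 0 for labels never seen in training
        for edge in bin_edges:
            if count < edge:
                return f"<{edge}"
        return f">={bin_edges[-1]}"

    hits, totals = defaultdict(int), defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        b = bucket(truth)
        totals[b] += 1
        hits[b] += int(truth == pred)
    return {b: hits[b] / totals[b] for b in totals}


# Toy data: "A" is common, "B" is moderately frequent, "C" is rare.
train_labels = ["A"] * 150 + ["B"] * 20 + ["C"] * 3
result = accuracy_by_label_frequency(
    train_labels,
    y_true=["A", "A", "B", "B", "C", "C"],
    y_pred=["A", "A", "B", "A", "C", "B"],
)
print(result)
```

A gap between the rare and common buckets indicates where manual review of the automatic codes is most worthwhile.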