Core Concepts
Explainable machine learning approaches can provide valuable insights for geolinguistic authorship profiling in forensic linguistics, complementing traditional qualitative methods.
Abstract
This paper explores the use of explainable machine learning approaches for geolinguistic authorship profiling in forensic linguistics, focusing on dialect classification of German-language social media posts as the profiling task.
The key highlights and insights are:
The authors fine-tuned BERT-based language models (XLM-RoBERTa and German BERT) on the dialect classification task, achieving accuracies of up to 75.31% in the 3-class setting (a minimal fine-tuning sketch follows this list).
To understand the inner workings of the classifiers, the authors employed a leave-one-word-out (LOO) approach, which extracts the lexical features most relevant to the classification decision (see the LOO sketch below).
The extracted lexical features reflect regional linguistic variation, including dialectal items, textualizations of regional pronunciation, and place names, which makes them directly useful for geolinguistic profiling.
While a significant proportion of the extracted features are place names, the authors note that the models also rely on other dialectal lexical items, which can complement the expertise of forensic linguists.
The explainable nature of the approach allows for verification of the model's decisions and can help introduce the method to legal settings, even if the classifiers themselves may not meet court admissibility standards.
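For readers who want to see what the first step looks like in practice, here is a minimal fine-tuning sketch using the Hugging Face Transformers library. Only the model name (xlm-roberta-base) comes from the paper; the toy data, label set, and hyperparameters are illustrative assumptions, not the authors' exact setup.

```python
# Minimal sketch: fine-tuning a BERT-based model for 3-class dialect
# classification. Data and hyperparameters are toy placeholders.
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification,
                          AutoTokenizer, Trainer, TrainingArguments)

# Hypothetical toy data: social media posts labeled with a dialect region.
data = Dataset.from_dict({
    "text": ["servus, i mog di", "moin, wat is dat denn", "grüezi mitenand"],
    "label": [0, 1, 2],  # e.g. 0 = Bavarian, 1 = Low German, 2 = Swiss German
})

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "xlm-roberta-base", num_labels=3)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True,
                     padding="max_length", max_length=64)

train = data.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="dialect-clf", num_train_epochs=3,
                           per_device_train_batch_size=8),
    train_dataset=train,
)
trainer.train()
```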
Overall, the authors demonstrate that explainable machine learning approaches can complement traditional qualitative methods in forensic geolinguistic authorship profiling.
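The leave-one-word-out attribution mentioned in the highlights could look roughly like the sketch below, reusing the model and tokenizer from the fine-tuning sketch: each word is deleted in turn, and the drop in the predicted class probability is taken as that word's relevance. The scoring details here are an assumption; the authors' exact LOO procedure may differ.

```python
# Sketch of leave-one-word-out (LOO) attribution: rank words by how much
# removing each one lowers the predicted dialect-class probability.
import torch

def loo_attributions(text, model, tokenizer):
    model.eval()
    words = text.split()

    def probs_for(t):
        enc = tokenizer(t, return_tensors="pt", truncation=True)
        with torch.no_grad():
            return model(**enc).logits.softmax(dim=-1)[0]

    full = probs_for(text)
    pred = int(full.argmax())   # predicted dialect class
    base = full[pred].item()    # its probability on the full text

    # Relevance of each word = probability drop when that word is removed.
    scores = []
    for i, word in enumerate(words):
        reduced = " ".join(words[:i] + words[i + 1:])
        scores.append((word, base - probs_for(reduced)[pred].item()))
    return pred, sorted(scores, key=lambda s: s[1], reverse=True)

# Example, using the model/tokenizer from the fine-tuning sketch:
# pred, ranked = loo_attributions("servus, i mog di", model, tokenizer)
```

Words with the largest probability drops surface as the most classification-relevant lexical features, which is where dialectal items, textualized pronunciations, and place names would be expected to appear.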
Stats
The corpus consists of approximately 240 million tokens from about 8,500 locations, 388 of which exceed 10,000 tokens.
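As a hypothetical illustration of the threshold behind the 388-location figure, one could aggregate token counts per location and keep only locations above 10,000 tokens; the corpus representation below is invented for the example.

```python
# Toy stand-in for the real corpus: (location, tokenized post) pairs.
from collections import Counter

corpus = [
    ("Wien", ["servus", "wie", "gehts"]),
    ("Hamburg", ["moin", "moin"]),
    ("Wien", ["oida"]),
]

token_counts = Counter()
for location, tokens in corpus:
    token_counts[location] += len(tokens)

# Keep only locations whose total token count exceeds 10,000.
kept = {loc: n for loc, n in token_counts.items() if n > 10_000}
print(f"{len(kept)} locations above the 10,000-token threshold")
```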
Quotes
"Even though research in forensic linguistics works more and more with statistical and computational approaches, authorship profiling often remains a manual task. This is at times credited to the black-box approaches in current NLP research, meaning that the lack of explainability precludes these approaches from being used in legal settings (see Nini, 2023)."
"While the approach does not fully explain the inner workings of the model, experts can use the extracted features to a) verify that the model indeed reached a sound decision, for example by evaluating the features against previous dialectological findings, and b) use the explanations to introduce the method to law enforcement or jurisprudence."