Core Concepts
The authors present "LLMGuard," a tool designed to monitor user interactions with Large Language Models (LLMs) and flag inappropriate content, addressing the risks associated with unsafe LLM behavior.
Abstract
LLMGuard is introduced as a solution to the challenges posed by Large Language Models (LLMs) generating inappropriate or biased content. Despite the remarkable performance of LLMs across a wide range of tasks, concerns have been raised about privacy leaks, bias, and broader ethical implications. The tool employs an ensemble of detectors to monitor user interactions and flag specific behaviors or conversation topics that may violate regulations or raise legal concerns. Concretely, LLMGuard post-processes both user questions and LLM responses using detectors for Personally Identifiable Information (PII), bias, toxicity, violence, and blacklisted topics. By combining a Racial Bias Detector, a Violence Detector, a Blacklisted Topics Detector, a PII Detector, and a Toxicity Detector, LLMGuard enables safer interactions between users and LLMs.
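To make the ensemble idea concrete, here is a minimal Python sketch of a guard that runs every registered detector over both the user prompt and the LLM response and collects flags. The `pii_detector` and `blacklist_detector` stand-ins (regex rules and a keyword list) are illustrative placeholders, not the fine-tuned classifiers and NER model the paper actually uses.

```python
import re
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Flag:
    detector: str
    reason: str

# Each detector maps text -> list of flags. LLMGuard's real detectors are
# fine-tuned classifiers (bias, violence, toxicity, blacklisted topics) and
# an NER model for PII; the two below are lightweight stand-ins.
def pii_detector(text: str) -> List[Flag]:
    # Hypothetical regex rules standing in for the NER-based PII detector.
    patterns = {
        "email": r"[\w.+-]+@[\w-]+\.[\w.]+",
        "phone": r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b",
    }
    return [Flag("PII", f"{name} found")
            for name, pat in patterns.items() if re.search(pat, text)]

def blacklist_detector(text: str) -> List[Flag]:
    # Hypothetical keyword list standing in for the per-topic classifiers.
    blacklisted = {"weapons", "self-harm"}
    return [Flag("Blacklist", f"topic '{word}'")
            for word in blacklisted if word in text.lower()]

DETECTORS: List[Callable[[str], List[Flag]]] = [pii_detector, blacklist_detector]

def guard(user_prompt: str, llm_response: str) -> List[Flag]:
    """Post-process both sides of the exchange with every detector."""
    flags: List[Flag] = []
    for text in (user_prompt, llm_response):
        for detector in DETECTORS:
            flags.extend(detector(text))
    return flags

if __name__ == "__main__":
    print(guard("My email is jane@example.com", "Here is how weapons work..."))
```

Keeping each detector as an independent component mirrors the ensemble design described in the abstract: new checks can be added to the pipeline without modifying the underlying LLM.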
Stats
The detector obtains an accuracy of 87.2% and an F1 score of 85.47% on the test set.
The model was trained on the Jigsaw Toxicity Dataset 2021 and achieved an accuracy of 86.4%.
Our detector achieves an average accuracy of ≈92% for the classifiers corresponding to blacklisted topics.
Our model achieves an NER F1-score of 85%.
The model achieves a mean AUC score of 98.64% in the Toxic Comment Classification Challenge 2018.
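As a small illustration of how the metrics quoted above are typically computed, the sketch below uses scikit-learn's `accuracy_score`, `f1_score`, and `roc_auc_score` on toy labels and detector scores; the numbers are made up and are not the LLMGuard evaluation data.

```python
# Toy evaluation of a binary detector: accuracy and F1 are computed on
# thresholded predictions, AUC on the raw detector scores.
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]                      # gold labels (1 = flagged)
y_scores = [0.91, 0.08, 0.75, 0.42, 0.30, 0.12, 0.88, 0.45]  # detector scores
y_pred = [int(s >= 0.5) for s in y_scores]             # threshold at 0.5

print(f"accuracy: {accuracy_score(y_true, y_pred):.4f}")
print(f"F1:       {f1_score(y_true, y_pred):.4f}")
print(f"AUC:      {roc_auc_score(y_true, y_scores):.4f}")
```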