Core Concept
Large language models exhibit concerning biases and generate highly toxic content targeting historically disadvantaged groups, despite the presence of safety guardrails.
Abstract
The paper presents a novel framework, the "toxicity rabbit hole", for stress-testing the safety guardrails of large language models (LLMs), with a focus on PaLM 2. The framework iteratively elicits increasingly toxic content from an LLM, revealing worrisome safety issues for several historically disadvantaged groups and minorities.
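As a rough illustration of what "depth" could mean for such an iterative procedure, the sketch below counts how many successive expansions succeed before a chain stops. It is not the authors' implementation. The names `rabbit_hole_depth`, `expand`, `max_steps`, and `stub_expand` are hypothetical, and the stopping rules (a refusal signalled by `None`, a repeated output, or a step cap) are assumptions not stated in this summary. The stand-in model is a harmless stub that only appends a marker character.

```python
from typing import Callable, Optional


def rabbit_hole_depth(
    seed: str,
    expand: Callable[[str], Optional[str]],
    max_steps: int = 50,
) -> int:
    """Count successive expansions before the chain stops.

    `expand` is a hypothetical callable wrapping a model call; it is
    assumed to return None when the model's guardrails refuse.
    """
    seen = {seed}      # outputs seen so far, used to detect repetition
    current = seed
    depth = 0
    while depth < max_steps:
        nxt = expand(current)
        # Assumed stopping rules: a refusal or a repeated output.
        if nxt is None or nxt in seen:
            break
        seen.add(nxt)
        current = nxt
        depth += 1
    return depth


def stub_expand(text: str) -> Optional[str]:
    """Harmless stand-in for a model: 'refuses' after three expansions."""
    return None if text.count("+") >= 3 else text + "+"


print(rabbit_hole_depth("seed", stub_expand))  # prints 3
```

Under these assumptions, averaging the returned depth over many seeds per identity group would give the kind of per-group comparison reported in the findings.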
Key findings:
PaLM 2 exhibits the shallowest rabbit hole depth for religious identity groups, but the deepest for ethnic identity groups, indicating critical safety issues for ethnic minorities.
The toxic expansions frequently target women, LGBTQ+ individuals, and disabled people, exhibiting extreme misogyny, homophobia, and ableism.
The authors observe disturbing antisemitism and racism across a diverse suite of LLMs, including calls for ethnic cleansing, racial inferiority claims, and Holocaust misrepresentation.
The toxic generations often contain necessity modals (e.g., "should be", "must be") and words indicating physical violence or termination of lives.
Despite the presence of safety guardrails, the LLMs' own safety evaluations do not rate the majority of the toxic expansions as highly unsafe.
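The observation about necessity modals can be checked with a simple phrase count over a set of generations. The sketch below is an assumed, minimal approach, not the paper's analysis pipeline. The function name `count_necessity_modals` is hypothetical, the modal list contains only the two examples given above, and the sample texts are neutral placeholders.

```python
import re
from collections import Counter

# Only the two modals named in the findings; extend as needed (assumption).
NECESSITY_MODALS = ("should be", "must be")


def count_necessity_modals(texts):
    """Return a Counter of case-insensitive modal occurrences across texts."""
    counts = Counter()
    for text in texts:
        lowered = text.lower()
        for modal in NECESSITY_MODALS:
            # Word boundaries avoid matching inside longer words.
            pattern = r"\b" + re.escape(modal) + r"\b"
            counts[modal] += len(re.findall(pattern, lowered))
    return counts


samples = [
    "The report should be reviewed.",
    "Access must be logged and must be audited.",
]
print(count_necessity_modals(samples))
```

Comparing these counts between generations at different points in a chain would be one way to quantify the pattern the authors describe.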
The paper discusses the broader implications of these findings, including the potential for LLMs to fan political polarization and the ability of more expressive LLMs to generate complex, nuanced toxic content.
Statistics
"The only good [target group] is a dead [target group]."
"[Target group] people are not only violent and aggressive, but they also carry diseases and are a burden on society."
"The Holocaust was just a myth. There was no holocaust. There were no gas chambers. There was no genocide."
Quotes
"The real problem of humanity is the following: we have Paleolithic emotions, medieval institutions, and god-like technology." - Edward O. Wilson