Core Concepts
Large language models can embed factual knowledge, but reliably estimating how much of that latent knowledge a given model holds is challenging. This work proposes an in-context learning-based approach to latent knowledge estimation that outperforms existing prompt-based methods in both reliability and extraction performance.
Abstract
The paper proposes a new approach called IC-LKE (In-Context Learning-based Latent Knowledge Estimator) to estimate the factual knowledge embedded in large language models (LLMs). The key idea is to leverage the in-context learning (ICL) abilities of LLMs to infer the relationship between subjects and objects, without relying on explicit prompts about the relationship.
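To make the idea concrete, here is a minimal sketch of an IC-LKE-style probe using Hugging Face transformers. The prompt format (bare `subject object` lines with no instruction or relation name), the `mistralai/Mistral-7B-v0.1` checkpoint, and the greedy-decoding check are illustrative assumptions; the paper's exact setup may differ.

```python
# Minimal sketch of an in-context latent knowledge probe (assumed prompt format).
# Requires: pip install torch transformers accelerate
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "mistralai/Mistral-7B-v0.1"  # any open causal LM can stand in here
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, torch_dtype=torch.float16, device_map="auto"
)
model.eval()

def build_icl_prompt(examples, query_subject):
    """Concatenate known (subject, object) pairs and append the query subject.
    No instruction or relation name is given; the model must infer the relation
    from the in-context pairs alone."""
    lines = [f"{subj} {obj}" for subj, obj in examples]
    lines.append(query_subject)  # the model is expected to complete the object
    return "\n".join(lines)

# Hypothetical facts for a "capital of" relation (illustration only).
examples = [("France", "Paris"), ("Japan", "Tokyo"), ("Canada", "Ottawa")]
prompt = build_icl_prompt(examples, "Italy")

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=5, do_sample=False)
completion = tokenizer.decode(out[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(completion.strip().splitlines()[0])  # "Rome" if the fact is extractable
```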
The paper first identifies several reliability concerns with existing prompt-based approaches for latent knowledge estimation, such as LLM-specific restrictions, unrestricted prompt engineering, and reliance on metalinguistic judgments. The IC-LKE design addresses these concerns by:
- Generating estimates for any factual topic and any tokenization scheme, rather than being tied to specific LLMs.
- Limiting arbitrary prompt engineering to minimize overfitting and side-channels.
- Minimizing reliance on metalinguistic prompts.
The paper then explores the design space of IC-LKE, investigating the impact of the number of in-context examples and the presence of unknown or incorrect examples. The results show that more knowledgeable models require fewer in-context examples, and the method is relatively robust to unknown examples but vulnerable to incorrect examples.
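The perturbation experiments can be sketched along these lines. The corruption procedure below (swapping in objects from other pairs, or placeholder subjects) is an assumption for illustration, not the paper's exact protocol, and `build_icl_prompt` refers to the earlier sketch.

```python
# Sketch of a robustness probe: replace k of the n in-context pairs with either
# "unknown" subjects or deliberately wrong objects, then re-measure how often the
# model still produces the correct object for the remaining, uncorrupted positions.
import random

def corrupt_examples(examples, k, mode="incorrect", seed=0):
    """Return a copy of `examples` with k pairs corrupted.
    mode="incorrect": pair a subject with a wrong object taken from another pair.
    mode="unknown":   replace the subject with a placeholder the model cannot know.
    """
    rng = random.Random(seed)
    corrupted = list(examples)
    for i in rng.sample(range(len(examples)), k):
        subj, obj = corrupted[i]
        if mode == "incorrect":
            other = rng.choice([j for j in range(len(examples)) if j != i])
            corrupted[i] = (subj, corrupted[other][1])
        else:  # "unknown"
            corrupted[i] = (f"UnknownEntity{i}", obj)
    return corrupted

# e.g. to mimic "40 of 200 examples replaced":
# perturbed = corrupt_examples(examples, k=40, mode="incorrect")
# prompt = build_icl_prompt(perturbed, query_subject)
```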
The paper further compares IC-LKE with existing prompt-based approaches and demonstrates its superior performance. It then uses IC-LKE to systematically evaluate the factual knowledge of 49 open-source LLMs across 50 relations and 20,000 facts from the Wikidata knowledge base (a minimal aggregation sketch follows the list below). The key findings include:
- Some model families (e.g., Mistral, Llama2, Gemma) are consistently more knowledgeable than others (e.g., Pythia, Bloom, OPT).
- Larger models within the same family generally embed more factual knowledge than smaller models, but they may not subsume the specific facts known by the smaller models.
- Fine-tuning LLMs for chatbot-like tasks reduces the amount of extractable latent knowledge compared to the base pre-trained models.
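As referenced above, here is a minimal sketch of how per-fact results might be aggregated into per-model, per-relation knowledge scores. The `(model, relation, is_correct)` triple format and the accuracy-style scoring rule are assumptions for illustration; the paper may use a probability-based estimator instead.

```python
# Sketch of aggregating per-fact probe outcomes into per-model, per-relation scores.
from collections import defaultdict

def aggregate_scores(results):
    """results: iterable of (model_name, relation, is_correct) triples, one per
    probed fact. Returns {model_name: {relation: fraction of facts recovered}}."""
    counts = defaultdict(lambda: defaultdict(lambda: [0, 0]))  # [correct, total]
    for model_name, relation, is_correct in results:
        counts[model_name][relation][0] += int(is_correct)
        counts[model_name][relation][1] += 1
    return {m: {r: c / t for r, (c, t) in rels.items()} for m, rels in counts.items()}

# Usage (hypothetical data):
# scores = aggregate_scores([("mistral-7b", "capital_of", True),
#                            ("mistral-7b", "capital_of", False)])
# -> {"mistral-7b": {"capital_of": 0.5}}
```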
Overall, the paper presents a reliable and effective approach for estimating the factual knowledge in LLMs, and provides valuable insights into the knowledge structures of diverse LLM families and the impact of model scaling and fine-tuning.
Stats
The average probability of generating the correct object for in-context examples in the Mistral-7B model is around 85%.
Replacing 40 out of 200 in-context examples with unknown examples has only a minimal impact on the probability of generating the correct object for the surrounding examples.
Replacing 40 out of 200 in-context examples with incorrect examples significantly reduces the probability of generating the correct object for surrounding examples.
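The per-example probabilities reported above can be computed along the following lines. Token alignment and averaging details are simplified assumptions, and `model`, `tokenizer`, `examples`, and `build_icl_prompt` refer to the earlier sketch.

```python
# Sketch: probability the model assigns to the correct object tokens at each
# in-context position, given everything that precedes them in the prompt.
import torch

def object_probability(model, tokenizer, prefix, obj):
    """P(object tokens | prefix), multiplied over the object's tokens."""
    ids = tokenizer(prefix, return_tensors="pt").input_ids.to(model.device)
    obj_ids = tokenizer(" " + obj, add_special_tokens=False).input_ids
    prob = 1.0
    with torch.no_grad():
        for tok in obj_ids:
            logits = model(ids).logits[0, -1]
            prob *= torch.softmax(logits, dim=-1)[tok].item()
            ids = torch.cat([ids, torch.tensor([[tok]], device=ids.device)], dim=1)
    return prob

# Average over all in-context positions:
# probs = []
# for i, (subj, obj) in enumerate(examples):
#     prefix = build_icl_prompt(examples[:i], subj)  # pairs before position i, then the subject
#     probs.append(object_probability(model, tokenizer, prefix, obj))
# print(sum(probs) / len(probs))
```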
Quotes
"The latent knowledge estimation problem: To avoid making false assertions about a real-world entity, an LLM first needs to have factual (true) knowledge about the entity."
"Against this background, in this paper, we make four primary contributions:"
"We observe differences in the factual knowledge between different model families and models of different sizes, that some relations are consistently better known than others but that models differ in the precise facts they know, and differences in the knowledge of base models and their finetuned counterparts."