Core Concepts
Across different prompting techniques and language models, non-native language prompts outperform native language prompts in eliciting desired outputs for a variety of social media and news-related NLP tasks.
Summary
This study investigates the impact of different prompt structures (native, non-native, and mixed) on the performance of large language models (LLMs) across 11 NLP tasks associated with 12 Arabic datasets. The authors conducted 197 experiments involving 3 LLMs (GPT-4o, Llama-3.1-8b, and Jais-13b-chat), the 3 prompt setups, and 2 prompting techniques (zero-shot and few-shot).
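To make the three prompt setups concrete, the sketch below assembles native (Arabic), non-native (English), and mixed instructions for a hypothetical Arabic sentiment-classification task, in both zero-shot and few-shot form. The instruction wording, task, labels, and examples are illustrative assumptions, not the paper's actual templates.

```python
# Minimal sketch of the three prompt setups (native = Arabic instructions,
# non-native = English instructions, mixed = English instructions with Arabic
# labels). The sentiment task and wording below are assumptions for
# illustration, not taken from the paper.

NATIVE_INSTRUCTION = "صنّف مشاعر النص التالي إلى: إيجابي، سلبي، أو محايد."
NON_NATIVE_INSTRUCTION = (
    "Classify the sentiment of the following Arabic text as "
    "positive, negative, or neutral."
)
MIXED_INSTRUCTION = (
    "Classify the sentiment of the following Arabic text. "
    "Answer with one of: إيجابي، سلبي، محايد."
)

def build_prompt(instruction: str, text: str, examples=None) -> str:
    """Assemble a zero-shot prompt (examples=None) or a few-shot prompt."""
    parts = [instruction]
    # Few-shot: prepend labeled demonstrations before the target text.
    for ex_text, ex_label in (examples or []):
        parts.append(f"Text: {ex_text}\nLabel: {ex_label}")
    parts.append(f"Text: {text}\nLabel:")
    return "\n\n".join(parts)

if __name__ == "__main__":
    few_shot_examples = [
        ("الخدمة كانت ممتازة", "إيجابي"),   # "The service was excellent"
        ("تجربة سيئة للغاية", "سلبي"),      # "A very bad experience"
    ]
    sample = "المنتج وصل متأخراً لكنه يعمل جيداً"
    print(build_prompt(NON_NATIVE_INSTRUCTION, sample))                 # zero-shot, non-native
    print(build_prompt(MIXED_INSTRUCTION, sample, few_shot_examples))   # few-shot, mixed
```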
The key findings are:
- Few-shot prompting shows improved performance compared to zero-shot, corroborating previous findings.
- Across different prompt setups, the non-native prompt outperforms the native and mixed prompts; for Llama 3.1, non-native prompts perform 7% and 8% better than mixed and native prompts, respectively.
- For new tasks with no training data, the zero-shot setup with non-native prompts performs the best across all models.
- GPT-4o outperforms the other models in all prompt setups.
The authors also provide an error analysis, highlighting common issues with the Jais-13b model, such as misclassifying few-shot samples, hallucinating irrelevant responses, and returning only one class for the majority of samples.
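As one hedged illustration of how such failure modes could be surfaced when scoring model outputs, the helper below maps free-form responses to a fixed label set and flags off-topic responses and single-class collapse. The label set, threshold, and function names are assumptions for illustration and are not taken from the paper.

```python
# Hypothetical post-processing for the error types the summary attributes to
# Jais-13b: irrelevant (hallucinated) responses and collapsing to one class.
# Label set and threshold are assumptions, not the paper's evaluation code.
from collections import Counter

VALID_LABELS = {"positive", "negative", "neutral"}  # assumed label set

def normalize_prediction(raw_response: str) -> str | None:
    """Map a free-form model response to a known label, or None if off-topic."""
    lowered = raw_response.strip().lower()
    for label in VALID_LABELS:
        if label in lowered:
            return label
    return None  # hallucinated / irrelevant response

def diagnose(predictions: list[str | None]) -> dict:
    """Flag the two failure modes highlighted in the error analysis."""
    counts = Counter(p for p in predictions if p is not None)
    majority_share = max(counts.values()) / len(predictions) if counts else 0.0
    return {
        "irrelevant_responses": predictions.count(None),
        "single_class_collapse": majority_share > 0.9,  # arbitrary cutoff
    }
```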
Statistics
The dataset contains a total of 164,498 training instances and 48,772 original test instances across the 12 datasets.
Quotes
"Our findings suggest that, on average, the non-native prompt performs the best, followed by mixed and native prompts."
"For a new task where no training data is available, the zero-shot setup is the ideal solution, and based on our findings, non-native prompts perform better across all models."
"GPT-4o outperforms all models in all prompt setups."