GPT-4 outperforms other models in generating code, showcasing potential as a reliable programming assistant.
The author proposes the LIEDER dataset to assess language models' knowledge of the semantic properties underlying discourse entity recognition. Large language models handle the EXISTENCE, UNIQUENESS, and PLURALITY requirements but show deficiencies in understanding the NOVELTY requirement.
A new method is proposed for evaluating the discourse entity (DE) recognition ability of language models.
Large language models outperform few-shot approaches in text classification, highlighting the importance of prompt design and model size.
Large language models struggle to translate formal specifications accurately, limiting their utility in system design.
A principled methodology called TEL'M (Test and Evaluation of Language Models) is proposed to rigorously assess the capabilities of current and future language models across high-value commercial, government and national security applications.
Prometheus 2 is a state-of-the-art open-source language model that closely mirrors human and proprietary language model judgments for evaluating the quality of responses from various language models.