Core Concepts
Contemporary large language models struggle with the complex conditional and compositional reasoning needed to match detailed user preferences against available flight options.
Abstract
This paper introduces GroundCocoa, a benchmark for evaluating the compositional and conditional reasoning capabilities of language models in the context of a flight booking task. The key highlights are:
GroundCocoa consists of natural language user requirements that vary in their conditional and compositional complexity. These requirements are grounded to a set of flight options, creating a multiple-choice task.
Experiments on several state-of-the-art language models, including GPT-4 Turbo, reveal a significant performance gap, with even the best model achieving only 67% accuracy. This underscores the challenges posed by conditional and compositional reasoning.
The authors analyze the impact of increasing complexity, measured by factors like reasoning width and conditional dependencies. They find that model performance degrades rapidly as complexity increases.
The authors also assess model robustness to unconventional user requirements, observing a notable drop in performance for the top-performing models, indicating a pretraining bias towards more typical needs.
The authors introduce entropy as a metric to quantify the confusion caused by conditional constraints in the user query, providing insights into why models may succeed or fail on certain samples.
Overall, the GroundCocoa benchmark highlights the limitations of current language models in handling the nuanced reasoning required for real-world task grounding, and calls for further advancements in compositional and conditional reasoning capabilities.
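To make the task format concrete, here is a minimal sketch of how a conditional, compositional requirement might be checked against a set of flight options to produce a multiple-choice answer. The `Flight` fields, the example requirement, and the four options are all hypothetical, chosen for illustration; they are not taken from the benchmark's data or schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Flight:
    # Hypothetical fields for illustration; not the benchmark's actual schema.
    label: str
    price_usd: float
    layovers: int
    departs_hour: int  # local departure hour, 0-23


def implies(condition: bool, consequence: bool) -> bool:
    """An if-then rule is violated only when the condition holds and the consequence fails."""
    return (not condition) or consequence


def satisfies(flight: Flight) -> bool:
    # Illustrative requirement (an assumption, not a benchmark sample):
    # "If the flight costs more than $500, it must have no layovers.
    #  It must also depart before noon."
    conditional_rule = implies(flight.price_usd > 500, flight.layovers == 0)
    simple_rule = flight.departs_hour < 12
    # Compositional step: the sub-requirements are combined with a logical AND.
    return conditional_rule and simple_rule


def matching_options(options: list[Flight]) -> list[str]:
    """Return the labels of every option that meets the full requirement."""
    return [f.label for f in options if satisfies(f)]


options = [
    Flight("A", 650.0, 1, 9),   # expensive with a layover: breaks the if-then rule
    Flight("B", 420.0, 2, 8),   # cheap, so the if-then rule does not apply; departs in the morning
    Flight("C", 700.0, 0, 15),  # satisfies the if-then rule but departs in the afternoon
    Flight("D", 480.0, 0, 13),  # departs in the afternoon
]

print(matching_options(options))
```

Note that option B is acceptable even though it has two layovers: the layover restriction only applies when the price condition is met. This kind of vacuously satisfied conditional is one plausible source of the confusion the summary describes.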
Stats
The price of the flight should be more than $1800.
The carbon emission of the flight should be above the average for that route.
The number of layovers on the route should be greater than 2.
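The three requirements listed above can be expressed as simple predicates over a flight record. The sketch below is illustrative only: the flight records and field names are invented, and the route average for carbon emission is assumed here to be the mean over the candidate options, which may not match how the benchmark defines it.

```python
from statistics import mean

# Hypothetical flight options on one route; values are made up for illustration.
flights = [
    {"id": "F1", "price_usd": 1950, "co2_kg": 900, "layovers": 3},
    {"id": "F2", "price_usd": 1200, "co2_kg": 700, "layovers": 1},
    {"id": "F3", "price_usd": 2100, "co2_kg": 500, "layovers": 3},
    {"id": "F4", "price_usd": 1850, "co2_kg": 950, "layovers": 2},
]


def meets_all(flight: dict, route_avg_co2: float) -> bool:
    """Check the three atypical requirements jointly."""
    return (
        flight["price_usd"] > 1800            # price more than $1800
        and flight["co2_kg"] > route_avg_co2  # emission above the route average
        and flight["layovers"] > 2            # more than 2 layovers
    )


# Assumption: the route average is the mean emission of the listed options.
route_avg = mean(f["co2_kg"] for f in flights)
selected = [f["id"] for f in flights if meets_all(f, route_avg)]
print(route_avg, selected)
```

Each requirement inverts what a typical traveler would ask for (cheaper, cleaner, fewer stops), which is what makes such queries a useful probe of whether a model follows the stated constraint or falls back on the usual preference.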
Quotes
"Conditional reasoning involves the comprehension and application of logical rules that are typically structured in 'if-then' formats. It is critical to personal decision-making in everyday life through an evaluation of potential scenarios and anticipation of consequences."
"Compositional reasoning, on the other hand, is the ability to combine solutions to simpler sub-problems, and integrate them in a structured manner to solve a more complex task. This cognitive process entails understanding the interplay between different sub-problems."