Core Concepts
BizBench introduces a benchmark to evaluate models' ability to reason about realistic financial problems through quantitative reasoning tasks, focusing on program synthesis and financial domain knowledge.
Abstract
BizBench is a benchmark that evaluates models' abilities to reason about financial problems through quantitative reasoning tasks. It comprises eight tasks covering program synthesis, quantity extraction, and domain knowledge in finance. The benchmark is intended to drive improvement in models' understanding of business and finance concepts by posing challenging tasks that require a transparent reasoning process.
The study describes the difficulty large language models (LLMs) have in reasoning about quantities and numbers in business and finance, and introduces BizBench as a way to evaluate model performance in this domain. The benchmark groups its tasks into program synthesis, quantity extraction, and domain knowledge, which respectively assess a model's capacity to solve problems with code, its ability to parse financial documents, and its financial background knowledge.
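To make the quantity extraction idea concrete, here is a minimal sketch of how a predicted numeric span could be compared against a gold span. The example sentence, the function names, and the normalize-then-compare scoring rule are all assumptions made for illustration; the summary does not describe how BizBench actually scores extracted spans.

```python
import re

# Hypothetical passage and gold span, invented for illustration only.
PASSAGE = "Total revenue for fiscal 2022 was $4,512 million, compared with $4,101 million in fiscal 2021."
GOLD_SPAN = "4,512"


def normalize_quantity(span: str) -> float:
    """Strip currency symbols, thousands separators and whitespace, then parse."""
    cleaned = re.sub(r"[,$\s]", "", span)
    return float(cleaned)


def span_matches(predicted: str, gold: str) -> bool:
    """Assumed scoring rule: spans match if they denote the same number."""
    try:
        return normalize_quantity(predicted) == normalize_quantity(gold)
    except ValueError:
        # A prediction that is not a parseable number counts as a miss.
        return False
```

Under this assumed rule, a prediction of "$4,512" would count as correct for the gold span "4,512", while "4,101" (the prior-year figure from the same sentence) would not.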
BizBench includes tasks such as FinCode (code generation from professional exam questions), SEC-Num (numerical span identification in SEC filings), and FormulaEval (knowledge of financial formulas). The evaluation of open-source and commercial LLMs shows that models' financial understanding needs to improve before they can be relied on in real-world applications.
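The program synthesis setting can be sketched as follows: the model writes a short program for a question, the program is executed, and its result is compared with the reference answer. The question, the "generated" program, the `answer` variable convention, and the tolerance are hypothetical and are not taken from FinCode or from the benchmark's evaluation harness.

```python
import math

# Hypothetical question: "Revenue is 500 and cost of goods sold is 320.
# What is the gross margin in percent?"
GENERATED_PROGRAM = """
revenue = 500.0
cogs = 320.0
answer = (revenue - cogs) / revenue * 100
"""
EXPECTED_ANSWER = 36.0


def run_generated_program(source: str) -> float:
    """Execute model-written code and read back its `answer` variable.

    Note: plain exec() is not a sandbox; a real harness would isolate
    untrusted code. This convention is an assumption for illustration.
    """
    namespace: dict = {}
    exec(source, namespace)
    return float(namespace["answer"])


def is_correct(predicted: float, expected: float, rel_tol: float = 1e-4) -> bool:
    """Compare numerically with a small relative tolerance (assumed value)."""
    return math.isclose(predicted, expected, rel_tol=rel_tol)
```

Because the intermediate variables are visible in the program, a wrong answer can be traced to a specific step, which is the kind of transparent reasoning the benchmark asks for.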
The study also includes few-shot experiments with state-of-the-art models, including Falcon, MPT, StarCoder, Llama-2, Mistral/Mixtral, and GPT variants, to evaluate their performance on BizBench tasks. The analysis shows that model size, instruction tuning, and code-specific pretraining all significantly affect task performance.
Overall, BizBench aims to advance quantitative reasoning in finance by providing a challenging benchmark that measures model performance on program synthesis, quantity extraction, and financial domain knowledge.
Stats
Task
Program Synthesis      FinCode          121      16       ✓ ✓ ✓
                       CodeFinQA        844      4,669    ✓ ✓ ✓
                       CodeTAT-QA       392      2,864    ✓ ✓ ✓
Quantity Extraction    ConvFinQA (E)    916      -        ✓ ✓
                       TAT-QA (E)       248      -        ✓ ✓
                       SEC-Num          2,000    6,845    ✓ ✓ ✓
Domain Knowledge       FinKnow          877      -        ✓
                       FormulaEval      50       -        ✓ ✓ ✓
Quotes
"Large language models show strong performance on question-answering but struggle with reasoning about quantities."
"BizBench focuses on evaluating financial quantitative reasoning through program synthesis tasks."