Key Concepts
While advanced language models demonstrate strong potential in software engineering tasks, traditional prompt engineering techniques become less effective when applied to them, particularly to reasoning models.
Statistics
The average CoT length is 3.52 steps in code generation, 4.35 steps in code translation, and 1.38 steps in code summarization.
For problems where o1-mini's CoT has 5 or more steps, o1-mini performs 16.67% better than GPT-4o.
For problems where o1-mini's CoT has fewer than 5 steps, o1-mini performs 2.89% better than GPT-4o.
When the CoT has fewer than 3 steps, o1-mini underperforms GPT-4o in 24% of cases.
Nearly 40% of o1-mini's incorrect answers under zero-shot prompting are due to improper output formats, compared to 0% for GPT-4o.
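The CoT-length comparisons above amount to splitting problems into buckets by step count and comparing the two models within each bucket. A minimal sketch of that bucketing follows. The `Result` record, the function name, and the toy data are hypothetical and not taken from the source. The sketch reports the gap in percentage points, which is an assumption, since the summary does not say whether its figures are absolute or relative differences.

```python
from dataclasses import dataclass


@dataclass
class Result:
    """One benchmark problem (hypothetical record layout)."""
    cot_steps: int           # number of CoT steps the reasoning model used
    reasoning_correct: bool  # whether the reasoning model solved the problem
    baseline_correct: bool   # whether the non-reasoning model solved it


def accuracy_gap_by_cot_length(results, threshold=5):
    """Return (gap_long, gap_short) in percentage points.

    gap_long covers problems with cot_steps >= threshold, gap_short the rest.
    A positive gap means the reasoning model is more accurate in that bucket.
    An empty bucket yields None.
    """
    def gap(bucket):
        if not bucket:
            return None
        n = len(bucket)
        reasoning_acc = sum(r.reasoning_correct for r in bucket) / n
        baseline_acc = sum(r.baseline_correct for r in bucket) / n
        return round((reasoning_acc - baseline_acc) * 100, 2)

    long_bucket = [r for r in results if r.cot_steps >= threshold]
    short_bucket = [r for r in results if r.cot_steps < threshold]
    return gap(long_bucket), gap(short_bucket)


# Toy data for illustration only; these are not results from the study.
toy = [
    Result(cot_steps=6, reasoning_correct=True, baseline_correct=False),
    Result(cot_steps=5, reasoning_correct=True, baseline_correct=True),
    Result(cot_steps=2, reasoning_correct=True, baseline_correct=True),
    Result(cot_steps=1, reasoning_correct=False, baseline_correct=True),
]
gap_long, gap_short = accuracy_gap_by_cot_length(toy)
print(gap_long, gap_short)
```

Changing `threshold` to 3 gives the split behind the "fewer than 3 steps" figure, although that figure counts individual cases where the reasoning model loses rather than an accuracy gap.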