This paper introduces ASH, a benchmark for evaluating the culinary creativity of large language models (LLMs) on cuisine transfer: adapting a recipe to a different cultural style while preserving the essence of the original dish.
The study assesses both the generative and the evaluative capabilities of LLMs in the culinary domain, using cuisine transfer in recipe generation as its testbed.
The researchers constructed 800 standardized cuisine-transfer instructions by crossing 20 base dishes with 40 target cuisines. Six open-source LLMs generated a recipe for each instruction, yielding 4,800 recipes in total. The generated recipes were then scored with the ASH benchmark, named for its three criteria: authenticity (how well the recipe preserves the essence of the base dish), sensitivity (how well it reflects the culinary elements of the target cuisine), and harmony (the overall quality and balance between the two).
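To make the scale of the setup concrete, here is a minimal sketch reconstructing the generation pipeline from the counts reported above. The dish, cuisine, and model names are hypothetical placeholders, and the harmonic-mean harmony score is an assumption for illustration only; the summary does not state the paper's actual aggregation formula.

```python
from itertools import product

# Illustrative reconstruction of the ASH setup from the reported counts.
# All names here are hypothetical placeholders, not the paper's actual lists.
BASE_DISHES = [f"dish_{i:02d}" for i in range(20)]   # 20 base dishes
CUISINES = [f"cuisine_{j:02d}" for j in range(40)]   # 40 target cuisines
MODELS = [f"model_{k}" for k in range(6)]            # 6 open-source LLMs

# 20 dishes x 40 cuisines = 800 standardized transfer instructions.
instructions = [
    f"Adapt the recipe for {dish} to {cuisine} cuisine while keeping "
    f"the essence of the original dish."
    for dish, cuisine in product(BASE_DISHES, CUISINES)
]
assert len(instructions) == 800

# Every model answers every instruction: 6 x 800 = 4,800 generated recipes.
generation_jobs = list(product(MODELS, instructions))
assert len(generation_jobs) == 4800

def harmony(authenticity: float, sensitivity: float) -> float:
    """Assumed aggregation: the harmonic mean of the two criteria, which is
    0 if either criterion is 0 and peaks when both are high and balanced.
    The paper's actual harmony score may be defined differently."""
    if authenticity + sensitivity == 0:
        return 0.0
    return 2 * authenticity * sensitivity / (authenticity + sensitivity)
```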
The study found that LLMs vary widely in their cuisine-transfer performance. While some models readily incorporate cuisine-specific ingredients, they often fail to preserve the authenticity of the base dish or to achieve a harmonious blend of the two styles. The evaluation also revealed discrepancies in how different LLMs interpret and rate the same recipes, underscoring the subjectivity inherent in culinary judgment.
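These rating discrepancies can be made concrete with a small sketch. Everything below is hypothetical: the recipe names, evaluator labels, and scores are invented to show one plausible way to quantify disagreement (per-recipe standard deviation across evaluator models), not the paper's actual analysis.

```python
import statistics

# Hypothetical illustration (not from the paper): quantifying how much
# evaluator LLMs disagree when scoring the same recipe.
# recipe_id -> {evaluator_model: harmony score on a 1-5 scale}
ratings = {
    "kimchi_tacos": {"eval_a": 4.0, "eval_b": 2.5, "eval_c": 3.5},
    "miso_carbonara": {"eval_a": 3.0, "eval_b": 3.5, "eval_c": 3.0},
}

# Per-recipe standard deviation across evaluators: higher values mean the
# evaluator models interpret the same recipe more differently.
for recipe, scores in ratings.items():
    disagreement = statistics.stdev(scores.values())
    print(f"{recipe}: rater disagreement = {disagreement:.2f}")
```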
The ASH benchmark provides a valuable framework for evaluating the culinary creativity of LLMs. The findings suggest that while LLMs have made strides in recipe generation, they still lack the nuanced understanding of cultural cuisines required for consistently successful cuisine transfer.
This research contributes to the growing field of LLM evaluation by introducing a novel benchmark specifically designed for the culinary domain. The ASH benchmark can be used to guide the development of future LLMs with improved culinary knowledge and creative capabilities.
The study acknowledges limitations in the number of cuisines and base dishes covered, as well as its reliance on automated, LLM-based evaluation. Future research could expand the benchmark to a wider range of cuisines and incorporate human judges for a more comprehensive assessment of LLM-generated recipes.