Key Idea
Customizing large language models can enhance the educational experience by providing more accurate and contextually relevant responses, but requires careful consideration of different approaches and their associated complexities.
Summary
The content discusses the process of customizing large language models (LLMs) for use in higher education institutions, drawing on insights and experience from ETH Zurich, a technical university in Switzerland.
The key points are:
- LLMs have had a significant impact on higher education, but general models may not always be suitable for specialized tasks. Customization can lead to more accurate and contextually relevant responses.
- There are three main approaches to customizing LLMs: training from scratch, fine-tuning, and augmentation. Each approach has different levels of complexity and resource requirements.
- Training from scratch is the most complex and resource-intensive, requiring enormous training datasets and computational power. This is typically only feasible for large corporations.
- Fine-tuning an existing pre-trained model is more manageable, but still requires considerable effort and can lead to potential issues like "forgetting" or "hallucination."
- Augmentation, such as using Retrieval Augmented Generation (RAG), is a relatively easy and cost-effective approach that leverages existing LLMs and custom document databases (see the sketch after this list).
- Regardless of the customization approach, the issue of where to run the inference for the customized models needs to be addressed, as it can involve significant costs and infrastructure requirements.
- The author provides a summary of the advantages and drawbacks of each customization method, as well as practical insights and experiences from the university's efforts to develop a customized educational model.
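
To make the augmentation path more concrete, the following is a minimal sketch of one RAG answer step: embed the student's question, rank pre-embedded course-material chunks by cosine similarity, and pass the best matches to a chat model as grounding context. `embed_text` and `chat_completion` are placeholders for whichever provider APIs an institution uses (the article mentions OpenAI embeddings and Azure AI Services); this is an illustration under those assumptions, not the authors' implementation.

```python
import numpy as np

def embed_text(text: str) -> np.ndarray:
    """Placeholder for a provider-specific embedding API call; returns a vector."""
    raise NotImplementedError

def chat_completion(prompt: str) -> str:
    """Placeholder for a provider-specific chat-model API call."""
    raise NotImplementedError

def answer_with_rag(question: str, chunks: list[str],
                    chunk_vectors: np.ndarray, top_k: int = 3) -> str:
    """Retrieve the most relevant course-material chunks and ground the answer in them."""
    q = embed_text(question)
    # Cosine similarity between the question and every pre-embedded chunk.
    sims = chunk_vectors @ q / (np.linalg.norm(chunk_vectors, axis=1) * np.linalg.norm(q))
    best = np.argsort(sims)[::-1][:top_k]
    context = "\n\n".join(chunks[i] for i in best)
    prompt = (
        "Answer the student's question using only the course material below.\n\n"
        f"Course material:\n{context}\n\nQuestion: {question}"
    )
    return chat_completion(prompt)
```

Keeping retrieval separate from generation is what makes this approach comparatively cheap: only a short prompt with a handful of relevant chunks is sent to the LLM at inference time, not the whole document database.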
Statistics
"Besides having to decide on crucial architectural features, today's models have billions of weight parameters, which require an enormous amount of training materials to properly adjust and optimize; the training corpuses of today's general-purpose models encompass trillions of tokens."
"For the education model in particular, there are publicly available datasets for "basic training," that is, content for typical high school and introductory college curricula, but it turned out to be an uphill battle to obtain specialized and institution-specific course materials from faculty."
"For the semantic search, these documents need to be embedded, which can take a few seconds for single documents to minutes for course scripts and hours for extensive databases (this includes artificial wait times, since API-access to the embeddings is usually subject to token-rate restrictions; we found it practical to submit ten chunks at a time with a two-second wait in-between)."
"In practice, for our RAG-based bot, it turned out that inference costs $7.50 per student per course per semester for Azure AI Services."
Quotes
"Common lore is that powerful commercial systems like GPT-4 have essentially ingested the "whole internet;" while that may be a hyperbole, curating the training corpus is extremely work-intensive, followed by computationally intensive training over months, followed by supervised and unsupervised tuning and detoxing."
"The standard method for determining relevance is converting these text chunks into token vectors, using so-called embeddings, for example OpenAI's ada-embeddings [24]. Embedding is also charged by token, but it turned out that the cost is negligible, even for large document sets."