Core Concepts
This paper provides a comprehensive catalog and classification of the most popular Transformer models, along with an introduction to the key aspects and innovations in Transformer architectures.
Abstract
The paper starts by introducing the Transformer architecture: the encoder-decoder structure, the attention mechanism, and the distinction between foundation models and their fine-tuned derivatives. It then presents a detailed catalog of the most prominent Transformer models, categorizing each by its pretraining architecture, pretraining or fine-tuning task, and application.
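Since the attention mechanism is the core building block the paper introduces, a minimal sketch may help make it concrete. The NumPy implementation below of scaled dot-product attention is illustrative only; the function name, shapes, and variable names are assumptions for this sketch, not taken from the paper:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V for a single attention head.

    Q, K: arrays of shape (seq_len, d_k); V: array of shape (seq_len, d_v).
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # (seq_len, seq_len) similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                               # weighted sum of value vectors

# Tiny usage example with random inputs (seq_len=4, d_k=d_v=8).
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
out = scaled_dot_product_attention(Q, K, V)          # shape (4, 8)
```

Multi-head attention runs several of these operations in parallel over linearly projected Q, K, and V and concatenates the results.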
The key highlights of the catalog include:
Pretraining Architecture: Encoder, Decoder, or Encoder-Decoder models
Pretraining Tasks: Language Modeling, Masked Language Modeling, Denoising Autoencoder, etc. (see the sketch after this list)
Applications: Natural language processing, image/video processing, protein folding, and more
Model details: Release date, number of parameters, training corpus, licensing, and the research lab behind each model
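To make the pretraining-task distinction concrete, here is a minimal sketch of how the two most common objectives turn a token sequence into (input, target) pairs. The token IDs, the MASK_ID constant, and the -100 "ignore" label are illustrative assumptions (the -100 convention follows common deep-learning libraries), not details from the paper:

```python
import random

MASK_ID = 0  # hypothetical ID reserved for the [MASK] token

def causal_lm_targets(tokens):
    """Language Modeling (decoder-style): predict each token from its prefix."""
    return tokens[:-1], tokens[1:]   # inputs, next-token targets

def masked_lm_targets(tokens, mask_prob=0.15):
    """Masked Language Modeling (encoder-style): recover randomly masked tokens."""
    inputs, targets = [], []
    for t in tokens:
        if random.random() < mask_prob:
            inputs.append(MASK_ID)   # hide the token from the model...
            targets.append(t)        # ...but ask the model to predict it
        else:
            inputs.append(t)
            targets.append(-100)     # "ignore" label: no loss on unmasked positions
    return inputs, targets

tokens = [5, 17, 42, 8, 99]
print(causal_lm_targets(tokens))     # ([5, 17, 42, 8], [17, 42, 8, 99])
print(masked_lm_targets(tokens))
```

Denoising autoencoders generalize the masked objective: spans of the input are corrupted (masked, deleted, or shuffled) and the model is trained to reconstruct the original sequence.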
The paper also provides visualizations, including a family tree and a chronological timeline, to help readers understand the relationships between different Transformer models. Finally, it discusses the significant impact of Transformer models, particularly their role in the recent surge of large language models such as ChatGPT.