Core Concepts
Developing transformers with very large context windows to effectively understand long video and language sequences.
Abstract
This work tackles the challenge of training models on million-token video and language sequences using techniques such as Blockwise RingAttention. It covers the two-stage training of long-context language and vision-language models, architectural modifications for vision input, training procedures, evaluation results, further training details, related work, and future directions.
Introduction
Current language models are limited in understanding aspects of the world that are difficult to capture in text alone.
The importance of jointly modeling language and video for broader AI capabilities.
Stage I: Learning Long-Context Language Models
Extending the context window with Blockwise RingAttention, which shards long sequences across a ring of devices (a minimal sketch follows this list).
Growing the context size progressively across training stages, from tens of thousands to a million tokens (see the schedule sketch below).
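A minimal single-host sketch of the blockwise streaming-softmax computation behind Blockwise RingAttention, written as an illustrative reconstruction rather than the paper's implementation: in the distributed version, each device in a ring holds one query block while key/value blocks rotate around the ring, overlapping communication with compute.

```python
# Minimal single-host sketch of blockwise ring attention (illustrative; the
# real algorithm distributes the blocks across a ring of devices and overlaps
# KV-block communication with compute). A streaming log-sum-exp merge means
# the full seq_len x seq_len attention matrix is never materialized.
import numpy as np

def blockwise_ring_attention(q, k, v, block_size):
    n, d = q.shape
    nb = n // block_size
    qb = q.reshape(nb, block_size, d)
    kb = k.reshape(nb, block_size, d)
    vb = v.reshape(nb, block_size, d)

    out = np.zeros_like(qb)                    # unnormalized outputs
    m = np.full((nb, block_size), -np.inf)     # running row max
    l = np.zeros((nb, block_size))             # running softmax denominator

    for step in range(nb):                     # one ring rotation per step
        for i in range(nb):                    # "device" i holds query block i
            j = (i + step) % nb                # KV block currently at device i
            s = qb[i] @ kb[j].T / np.sqrt(d)
            m_new = np.maximum(m[i], s.max(axis=-1))
            p = np.exp(s - m_new[:, None])
            scale = np.exp(m[i] - m_new)
            l[i] = l[i] * scale + p.sum(axis=-1)
            out[i] = out[i] * scale[:, None] + p @ vb[j]
            m[i] = m_new

    return (out / l[..., None]).reshape(n, d)

# Sanity check against naive full attention on a toy size.
rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(16, 8)) for _ in range(3))
s = q @ k.T / np.sqrt(8)
w = np.exp(s - s.max(axis=-1, keepdims=True))
ref = (w / w.sum(axis=-1, keepdims=True)) @ v
assert np.allclose(blockwise_ring_attention(q, k, v, block_size=4), ref)
```

The running log-sum-exp merge keeps per-device memory proportional to the block size, so context length scales with the number of devices rather than with any single device's memory.

A staged context-growth recipe in the spirit of Stage I, shown as illustrative configuration: the 32K-to-1M progression mirrors the paper, while the RoPE theta values and the training stub are placeholder assumptions, not the exact LWM settings.

```python
# Hypothetical staged context extension: each stage resumes from the previous
# checkpoint, grows the context length, and raises the RoPE base frequency
# (theta) so positional embeddings stay well-resolved at longer range.
# All theta values are illustrative placeholders.
STAGES = [
    {"context_len": 32_768,    "rope_theta": 1e6},
    {"context_len": 131_072,   "rope_theta": 1e7},
    {"context_len": 262_144,   "rope_theta": 2.5e7},
    {"context_len": 524_288,   "rope_theta": 5e7},
    {"context_len": 1_048_576, "rope_theta": 2.5e8},
]

def train_stage(params, cfg):
    """Placeholder for fine-tuning `params` on cfg['context_len']-token sequences."""
    print(f"stage: context={cfg['context_len']:,}, theta={cfg['rope_theta']:.1e}")
    return params

params = None  # would be the base model's weights
for cfg in STAGES:
    params = train_stage(params, cfg)
```

Growing the context gradually keeps most of the training budget at cheaper, shorter lengths while still reaching the 1M-token target.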
Stage II: Learning Long-Context Vision-Language Models
Architectural modifications for incorporating vision input: frames are converted to discrete tokens that share one autoregressive stream with text.
Training steps for joint training on text-image and text-video data (a tokenization and packing sketch follows this list).
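A sketch of how vision enters the token stream, assuming a VQGAN-style tokenizer that maps each frame to a grid of discrete codes; the vocabulary sizes, delimiter layout, and helper names below are illustrative assumptions, not the exact LWM interface.

```python
# Illustrative vision-token integration: extend the text vocabulary with the
# image codebook plus delimiter tokens, then pack captions and frames into a
# single autoregressive sequence. Sizes and token layout are assumptions.
TEXT_VOCAB = 32_000          # e.g. a LLaMA-style text tokenizer
IMAGE_CODEBOOK = 8_192       # VQGAN-style codebook size
VISION_START = TEXT_VOCAB + IMAGE_CODEBOOK    # "<vision>" delimiter id
VISION_END = VISION_START + 1                 # "</vision>" delimiter id
VOCAB_SIZE = TEXT_VOCAB + IMAGE_CODEBOOK + 2  # embedding table is widened

def frame_to_tokens(codes):
    """codes: flat list of codebook indices for one tokenized frame."""
    return [TEXT_VOCAB + c for c in codes]  # shift codes past the text ids

def pack_example(text_ids, frames):
    """One training sequence: caption tokens, then frames in temporal order
    wrapped in vision delimiters (a single image is a one-frame video)."""
    seq = list(text_ids) + [VISION_START]
    for codes in frames:
        seq += frame_to_tokens(codes)
    return seq + [VISION_END]

# e.g. each frame yields a fixed-size grid of codes; a clip is several frames
example = pack_example([101, 2057, 4093], [[7, 42, 511] for _ in range(5)])
```

Because frames become ordinary tokens, one next-token objective covers text, images, and video without a separate vision head.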
Further Details
Model FLOPs utilization (MFU) during the training stages (an estimate sketch follows this list).
Training loss curves.
Scaling inference to million-token sequences (see the KV-cache sizing sketch after this list).
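For reference, MFU is the model FLOPs achieved per second divided by the hardware's peak. A common per-token estimate for transformer training is roughly 6N FLOPs for an N-parameter model, plus an attention term that dominates at million-token context; the function and example numbers below are an illustrative sketch, not LWM's measured figures.

```python
# Model FLOPs Utilization: achieved model FLOPs per second over hardware peak.
# Per-token training FLOPs ~ 6 * n_params, plus an attention term
# 12 * n_layers * d_model * seq_len that dominates at million-token context.
def estimate_mfu(n_params, n_layers, d_model, seq_len,
                 tokens_per_sec, peak_flops_per_sec):
    flops_per_token = 6 * n_params + 12 * n_layers * d_model * seq_len
    return tokens_per_sec * flops_per_token / peak_flops_per_sec

# Illustrative 7B-class numbers (not measured LWM figures):
print(estimate_mfu(n_params=7e9, n_layers=32, d_model=4096,
                   seq_len=1_048_576, tokens_per_sec=5e3,
                   peak_flops_per_sec=64 * 275e12))  # ~0.48
```

To see why million-length inference needs multi-device sharding, here is a back-of-envelope KV-cache size for a 7B-class model, assuming LLaMA-7B-like shapes (32 layers, 32 KV heads, head dim 128, bf16); these shapes are assumptions, not necessarily LWM's exact configuration.

```python
# KV-cache memory for autoregressive decoding: K and V tensors per layer,
# each of shape [seq_len, kv_heads, head_dim]. Shapes are LLaMA-7B-like
# assumptions, not necessarily LWM's exact configuration.
def kv_cache_bytes(seq_len, n_layers=32, kv_heads=32, head_dim=128, dtype_bytes=2):
    return 2 * n_layers * seq_len * kv_heads * head_dim * dtype_bytes

print(kv_cache_bytes(1_048_576) / 2**30)  # ~512 GiB at bf16
```

At roughly half a terabyte for the cache alone, the sequence dimension must be sharded across many accelerators, mirroring the ring-style sequence sharding used in training.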
Conclusion
Addressing challenges in understanding the world by combining language and video.
Effectively utilizing large autoregressive models with a 1M-token context.
Stats
"We train one of the largest context size transformers to date on video and text sequences."
"Fully open-sourced a family of 7B parameter models capable of processing long text documents (LWM-Text, LWM-Text-Chat) and videos (LWM, LWM-Chat) of over 1M tokens."
Quotes
"We curate a large dataset of diverse videos and books to train on millions-length multimodal sequences."
"Our work paves the way for advancing AI models with reliable reasoning and a grounded understanding of the world."