Training a large-context transformer model on long video and language sequences to achieve advanced AI capabilities.