Developing large-context transformers that effectively understand long video and language sequences.