2.5 Years in Class: A Multimodal Textbook for Vision-Language Pretraining - Explained Simply