Beyond Language Modeling: An Exploration of Multimodal Pretraining - Explained Simply | ArXiv Explained