VL-JEPA: Joint Embedding Predictive Architecture for Vision-language - Explained Simply