Visual Generation Unlocks Human-Like Reasoning through Multimodal World Models - Explained Simply | ArXiv Explained