Can Vision-Language Models Solve the Shell Game? - Explained Simply | ArXiv Explained