Learning to Reason in 4D: Dynamic Spatial Understanding for Vision Language Models - Explained Simply