Learning Procedural-aware Video Representations through State-Grounded Hierarchy Unfolding - Explained Simply | ArXiv Explained