HuMo: Human-Centric Video Generation via Collaborative Multi-Modal Conditioning - Explained Simply | ArXiv Explained