Seeing, Listening, Remembering, and Reasoning: A Multimodal Agent with Long-Term Memory - Explained Simply | ArXiv Explained