Perception Encoder: The best visual embeddings are not at the output of the network - Explained Simply | ArXiv Explained