D2E: Scaling Vision-Action Pretraining on Desktop Data for Transfer to Embodied AI - Explained Simply | ArXiv Explained