Biological organisms learn to solve complex tasks by moving around in their environment. This behavior can be mimicked by pre-training a Siamese Convolutional Neural Network to predict egomotion (i.e. to estimate, from one frame to the next, how the camera moved) and then transferring the learnt features to other tasks. The egomotion dataset can be built artificially, by applying known transformations (e.g. translations or rotations) to still images, or collected from a moving camera, in which case the images need no manual labels.
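To make the setup more concrete, here is a minimal pure-Python sketch of the two ingredients: a synthetic egomotion pair generator and a Siamese forward pass with shared weights. It is an illustration under simplifying assumptions, not the original authors' implementation. The motion is translation-only, the label is the pixel shift applied to the image (standing in for camera motion), and a single fully connected layer replaces the convolutional towers. Training is omitted. All names (`translate`, `make_egomotion_pair`, `SiameseEgomotionNet`) are hypothetical.

```python
import random


def translate(image, dx, dy):
    """Shift a 2D image (list of rows) by (dx, dy) pixels, filling uncovered pixels with 0."""
    h, w = len(image), len(image[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            src_y, src_x = y - dy, x - dx
            if 0 <= src_y < h and 0 <= src_x < w:
                out[y][x] = image[src_y][src_x]
    return out


def make_egomotion_pair(image, max_shift=3, rng=random):
    """Return (frame_a, frame_b, (dx, dy)).

    Assumption: the label is the translation applied to the image, used here
    as a stand-in for the camera motion between the two frames.
    """
    dx = rng.randint(-max_shift, max_shift)
    dy = rng.randint(-max_shift, max_shift)
    return image, translate(image, dx, dy), (dx, dy)


def flatten(image):
    return [p for row in image for p in row]


def linear(weights, bias, vec):
    """Dense layer: one output per weight row."""
    return [sum(w * v for w, v in zip(row, vec)) + b for row, b in zip(weights, bias)]


def relu(vec):
    return [max(0.0, v) for v in vec]


class SiameseEgomotionNet:
    """Two towers that share one set of weights, followed by a joint head.

    Hypothetical simplification: each tower is one dense layer + ReLU instead
    of a CNN, and the head regresses (dx, dy) directly.
    """

    def __init__(self, in_dim, feat_dim, n_outputs, seed=0):
        rng = random.Random(seed)
        self.w_base = [[rng.gauss(0, 0.1) for _ in range(in_dim)] for _ in range(feat_dim)]
        self.b_base = [0.0] * feat_dim
        self.w_head = [[rng.gauss(0, 0.1) for _ in range(2 * feat_dim)] for _ in range(n_outputs)]
        self.b_head = [0.0] * n_outputs

    def features(self, image):
        # Both frames go through the SAME weights; this sharing makes the net Siamese.
        # After pre-training, this is the part that would be transferred to other tasks.
        return relu(linear(self.w_base, self.b_base, flatten(image)))

    def forward(self, frame_a, frame_b):
        # Concatenate the two feature vectors and predict the motion between frames.
        joint = self.features(frame_a) + self.features(frame_b)
        return linear(self.w_head, self.b_head, joint)


if __name__ == "__main__":
    rng = random.Random(42)
    img = [[rng.random() for _ in range(8)] for _ in range(8)]
    a, b, label = make_egomotion_pair(img, rng=rng)
    net = SiameseEgomotionNet(in_dim=64, feat_dim=16, n_outputs=2)
    print("true shift:", label, "untrained prediction:", net.forward(a, b))
```

The labels come for free from the transformation itself, which is the point of the approach: no human annotation is needed. Whether the motion is regressed or discretised into classes is a design choice this sketch does not settle; a real setup would also use convolutional towers and a training loop.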

This approach challenges the common computer vision assumption that feature learning requires pre-training a neural network on a large, labelled image dataset. Moreover, the results suggest that features learnt through class-label supervision are not optimal for every visual task, and that egomotion-based supervision is a viable alternative.