Very nice article, Angus, thank you for sharing.
I'm just getting started on these topics and have a couple of questions about your article; I'd appreciate your input on them:
1. You said:
“deep learning vision systems require thousands of labelled images for each thing you want them to recognise. Compare this with humans, who can learn to recognise objects and people before we can even talk — well before we could be said to be given any labelled data”
I think this is actually inaccurate and might be misleading, because we do in fact receive strong supervision when we're learning. We separate object entities in an unsupervised fashion, but we perform object recognition in a supervised manner. Differentiating mom from dad, for example, happens with supervision coming from our other senses and emotions: we know that “mama” is the one with the soft voice and the delicious food, who smells a certain way and, of course, looks a certain way. Vision provides the image, but the other senses provide a kind of supervision, or labels, associated with that image. Would you agree?
2. Later you said:
“We feed our network a single character at a time, transforming the input in subtle ways across a few input frames to simulate video input, or the way a human might see something move”
I think there is an implicit assumption here that seeing movie-like inputs has something to do with the superiority of human vision. Is that so? If so, why do you think that?
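For concreteness, here is roughly how I picture the "subtle transforms across a few frames" you describe. This is only my sketch of the idea, not your actual pipeline; the function name, frame count, and the use of simple pixel translations via NumPy are my assumptions:

```python
import numpy as np

def simulate_video(image, n_frames=4, max_shift=1, seed=0):
    """Turn one still image into a short 'video' by applying a
    subtle random translation per frame. A crude stand-in for
    'transforming the input in subtle ways across a few frames'.
    All parameter choices here are illustrative assumptions."""
    rng = np.random.default_rng(seed)
    frames = []
    for _ in range(n_frames):
        # Small random shift in each direction, at most max_shift pixels
        dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)
        # np.roll gives a cheap translation (with wraparound at the edges)
        frames.append(np.roll(image, shift=(dy, dx), axis=(0, 1)))
    return np.stack(frames)

# A crude 28x28 'character': a single vertical stroke
char = np.zeros((28, 28), dtype=np.float32)
char[10:18, 12:16] = 1.0

video = simulate_video(char)
print(video.shape)  # (4, 28, 28): four subtly shifted views of one input
```

If something like this is what you meant, my question above still stands: is the claim that the temporal coherence between such frames is itself part of what makes human vision strong?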
3. From the images, it seems that reconstructing rotated inputs is harder and less accurate; the reconstructed manifolds look a little blurry. Why do you think that is?
Source: Deep Learning on Medium