Original article can be found here (source): Deep Learning on Medium
Action Modifiers: Learning from Adverbs in Instructional Videos
The narrations are used → as a signal to train the model → this is very cool and amazing. (they formulated this as an embedding problem → adverbs can be added to or removed from the action embedding)
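A minimal sketch of that "add/remove the adverb" idea, assuming the paper's modifiers behave like invertible linear maps on an action embedding (the names `W_quickly`, `action`, and the dimensions here are hypothetical toy values, not the paper's actual parameters):

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 4

action = rng.normal(size=dim)  # embedding of a base action, e.g. "chop"

# Hypothetical learned modifier for the adverb "quickly": a near-identity
# linear transformation in the embedding space.
W_quickly = np.eye(dim) + 0.1 * rng.normal(size=(dim, dim))

# "Add" the adverb: chop -> chop quickly
chop_quickly = W_quickly @ action

# "Remove" the adverb: invert the transformation to recover the base action
recovered = np.linalg.solve(W_quickly, chop_quickly)

assert np.allclose(recovered, action)
```

The point of modeling adverbs as transformations rather than fixed points is that the same modifier can be applied to any action embedding.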
Now we are combining both visual cues as well as language cues → instructional videos are a popular type of video → we can learn a lot from them.
Following certain instructions → step by step, in order → this is (roughly) what the authors are optimizing for. (the narration signals → are very noisy → since there can be a mismatch between what is on screen and the audio).
So they had to reformulate this problem → into different embedding spaces. (there are a couple of related works here)
Now, using online videos as resources → we are able to train huge models → and they just might cover a lot of distributions. (captioning, visual information retrieval → these are related works → and they use LSTMs, GRUs, or attention) → but they do not use the speech audio.
Even parts of speech can be integrated → object attributes in images → they are very important. (embeddings learned under weak supervision → these use losses such as the triplet loss).
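For reference, the triplet loss mentioned above can be sketched in a few lines — this is the standard formulation, not the paper's exact multi-headed variant, and the toy embeddings are made up:

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet loss: push the anchor-positive distance to be
    smaller than the anchor-negative distance by at least `margin`."""
    d_pos = np.sum((anchor - positive) ** 2)
    d_neg = np.sum((anchor - negative) ** 2)
    return max(0.0, d_pos - d_neg + margin)

anchor   = np.array([0.0, 0.0])
positive = np.array([0.1, 0.0])  # same action/adverb -> should be close
negative = np.array([2.0, 0.0])  # different adverb   -> should be far

loss = triplet_loss(anchor, positive, negative)
# d_pos = 0.01, d_neg = 4.0 -> loss = max(0, 0.01 - 4.0 + 1.0) = 0.0
```

Under weak supervision the tricky part is choosing which samples count as positives and negatives, since the narration only loosely labels the frames.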
Wow, this is some crazy setup → moving into certain embedding spaces while moving out of others → pretty awesome.
They would need to disentangle some portions of the data first → and they also introduce a multi-headed weakly supervised loss function. (the visual representation is highly dependent on the action)
If the action changes → the visual motion changes → this is a signal. (they use translations or other learned transformations in the embedding space → these are pretty cool).
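One common way to handle the "we don't know which frames show the action" problem is attention-weighted pooling over candidate segments. This is a generic sketch of that idea, not the paper's exact architecture; `query`, `segments`, and the 2-D embeddings are hypothetical:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(query, segments):
    """Weakly supervised pooling (sketch): since we don't know which
    segment shows the narrated action, score every candidate segment
    against a query embedding and pool them by attention weight."""
    scores = segments @ query        # one score per segment
    weights = softmax(scores)        # normalized attention weights
    return weights @ segments        # attention-weighted average

segments = np.array([[1.0, 0.0],    # segment that matches the action
                     [0.0, 1.0],    # irrelevant segment
                     [0.9, 0.1]])   # another matching segment
query = np.array([1.0, 0.0])        # embedding of the narrated action
pooled = attend(query, segments)
```

The matching segments dominate the pooled representation, so the loss can be applied to `pooled` without frame-level labels.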
A complicated data setup → seems to be required → to get only the frames where a certain action is performed. For every action in the video → that's quite a lot.
So a large contribution → comes from developing the training procedure → and showing how to train this type of model → since it is quite a complicated training procedure.
Parts of speech → are all tagged in the HowTo100M dataset.
Sometimes → the actions are not captured in the video → while other frames we do have → give pretty tricky results.
The embedding space is 300-dimensional → Woof Woof. (the batch size is 512 → those are some powerful GPUs)
They had to create the baselines themselves → since they were the first to do this.
When compared to other models → the authors' method does much better. (GloVe features are much better to use → compared to an SVM).
The attention of the model → the intensity of the color shows → where the model is looking. (basically the authors' method is SOTA). (even the video length was analyzed → very interesting).
Video-to-adverb retrieval → a very interesting problem.
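Once everything lives in one embedding space, retrieval reduces to nearest-neighbor search. A toy sketch of ranking adverbs against a video embedding by cosine similarity — the function name, the adverb vectors, and the 2-D space are all made up for illustration:

```python
import numpy as np

def retrieve_adverbs(video_emb, adverb_embs):
    """Rank candidate adverbs by cosine similarity to a video embedding
    (hypothetical names; a plain nearest-neighbor retrieval sketch)."""
    names = list(adverb_embs)
    mat = np.stack([adverb_embs[n] for n in names])
    mat = mat / np.linalg.norm(mat, axis=1, keepdims=True)  # unit-normalize
    v = video_emb / np.linalg.norm(video_emb)
    sims = mat @ v                       # cosine similarity per adverb
    order = np.argsort(-sims)            # best match first
    return [names[i] for i in order]

adverbs = {
    "quickly": np.array([1.0, 0.2]),
    "slowly":  np.array([-1.0, 0.1]),
    "gently":  np.array([0.1, 1.0]),
}
video = np.array([0.9, 0.3])
ranking = retrieve_adverbs(video, adverbs)
# "quickly" ranks first for this toy video embedding
```

The same machinery works in the other direction (adverb-to-video retrieval) by swapping the roles of query and database.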