Source: Deep Learning on Medium
This blog post is also featured in the third issue of Embodied AI, the definitive AI avatar newsletter.
Make sure to subscribe to the bi-weekly digest of the latest news, technology, and trends behind AI avatars, virtual beings, and digital humans.
Alexa: An amazing virtual assistant that lacks common sense and vision
In the last issue of Embodied AI, we argued in favor of transforming audio-based virtual assistants, such as Alexa, into AI-powered avatars for ease of skill discovery and more humanlike interactivity. In short, start by equipping Alexa and Siri with eyes on a screen.
We were therefore delighted to find out that both Boris Katz, a principal researcher at MIT who helped invent virtual assistants, and Rohit Prasad, head scientist of Alexa, share similar opinions about the current limitations of virtual assistants — namely common sense, situational awareness, and the important role of eyes for virtual assistants.
“Incredible progress…incredibly stupid”
That is quite harsh, but it captures how Katz described Alexa, Siri, and other virtual assistants in his interview with Technology Review’s Will Knight: a conflicted mix of pride and embarrassment. On the one hand, Katz is proud of the progress on and the adoption of virtual assistants. On the other hand, he thinks these programs are “incredibly stupid”.
To be fair, Alexa and its kind are not stupid: they are a feat of software engineering with tremendous potential for improvement. But Katz’s candid opinions yield three important takeaways. First, Katz is dubious that training models on huge amounts of data will solve language understanding. Second, language understanding should not be isolated from other modalities like visual, tactile, and other sensory inputs. Third, common sense and intuitive physics are essential for virtual assistants.
Alexa Needs Eyes
Prasad addressed a pointed question at EmTech Digital: “Alexa, why aren’t you smarter?” Given that users have little patience for dumb virtual assistants, Alexa’s popularity demonstrates how effective software hacks have become in the absence of true machine intelligence.
But while Alexa can quickly access an encyclopedia-like knowledge base to respond to simple commands, such hacks can only go so far. Prasad’s opinion is that “[the] only way to make smart assistants really smart is to give it eyes and let it explore the world.”
Recent news suggests that Amazon has already created versions of Alexa with a camera and is betting on home robotics for a “mobile Alexa”. This is highly exciting news. However, the adjacent possible, our favorite framework, suggests that home robotics may take many more years before delivering concrete value to users.
How to Smarten Up AI Assistants
So how do we make AI assistants smarter? Here are our suggestions at TwentyBN: deep learning and common-sense AI.
It’s all about computation, baby
In a recent blog post titled “The Bitter Lesson”, renowned AI scientist Rich Sutton reflects on the recent advancements in speech recognition, computer vision, chess, and Go, observing the same pattern again and again: AI researchers tended to start off pursuing methods that leveraged human knowledge, but what triumphed in the end were “brute force” methods that leverage computation.
Sutton offers two takeaways from the bitter lesson. First, general-purpose methods that continue to scale with increased computation, such as search and learning, are the most powerful and effective approach to AI. Second, we should stop trying to find simple ways to think about the contents of minds, as their complexity is endless. After all, our goal is to have AI agents that discover like we do, not ones that contain what we have discovered.
We agree with Katz that virtual assistants must become smarter. But instead of modeling artificial intelligence on human intelligence, we share Sutton’s view that deep learning, leveraging the massive computational power now readily available, is the right way to make AI assistants smarter and product-ready.
Common Sense for AI
Illustrating the difficulty of true language understanding for virtual assistants, Katz mentions a Winograd schema example: “This book would not fit in the red box because it is too small.” Obviously, humans have no trouble understanding that “it” refers to the box. But what is intuitive to us can often elude the “smartest” AI.
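To see why such sentences are hard for machines, consider a toy sketch (our own illustration, not any system mentioned in this article): a naive coreference heuristic that resolves a pronoun to the nearest preceding noun returns the same referent for both variants of the schema, even though swapping “small” for “big” flips the correct answer for a human reader.

```python
# Toy illustration of why Winograd schemas defeat surface-level heuristics.
# A "nearest preceding noun" rule ignores the physical common sense that
# actually determines the referent.

def nearest_noun_heuristic(sentence, candidate_nouns):
    """Resolve 'it' to whichever candidate noun appears closest before it."""
    pronoun_pos = sentence.index(" it ")
    best_noun, best_pos = None, -1
    for noun in candidate_nouns:
        pos = sentence.rfind(noun, 0, pronoun_pos)
        if pos > best_pos:
            best_noun, best_pos = noun, pos
    return best_noun

nouns = ["book", "box"]
small = "The book would not fit in the red box because it is too small."
big = "The book would not fit in the red box because it is too big."

# The heuristic picks "box" for both sentences, because "box" is nearest...
print(nearest_noun_heuristic(small, nouns))  # box
print(nearest_noun_heuristic(big, nouns))    # box
# ...but humans know "too small" points to the box and "too big" to the book:
# the answer depends on intuitive physics, not word order.
```

The point of the sketch: no amount of positional or statistical surface cues distinguishes the two sentences; resolving the pronoun requires knowing how objects fit inside containers.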
Roland Memisevic, CEO of TwentyBN, has long argued that true language understanding must be grounded in vision. This is why TwentyBN continues to collect millions of videos for our datasets to teach AI this physical common sense.
As it turns out, AI systems trained on TwentyBN’s video datasets have learned a lot. MIT’s CSAIL, leveraging our Something-Something and Jester data, has trained AI that can track how objects change over time. Take a look at the visual explanations for action recognition illustrated by our AI researcher, Raghav Goyal: