Source: Deep Learning on Medium
Last Friday, Embodied AI was enjoying a glass of wine at some friends’ apartment when he caught a glimpse of a newly purchased Google Home sitting in the corner. His curiosity for all things embodied AI took over, and he asked them what they use their new tech for. “Besides playing music,” they answered, “we ask it for the time, the weather, and…to turn the lights on and off.”
Like many consumers, Embodied AI’s friends have encountered the skill discovery problem in voice-based smart technology. While a smart speaker can do more than report the weather, turn on the lights, and order food — 4,200 things for Google Assistant alone — it cannot effectively communicate the countless ways in which it can assist us. Since skill discovery is a crucial element in making virtual assistants more effective and humanlike, we will explore the evolution of skill discovery, covering its challenges, current progress, and how it will continue to develop in the future.
Skill Discovery: 2 challenges and 8 recommendations
Benedict Evans (Andreessen Horowitz) calls skill discovery in smart speakers a fundamental UX puzzle: Alexa’s audio-only interface is convenient, for example, right up until you expect it to recite its 80,000 skills to a user one by one.
There are 2 factors that make skill discovery especially challenging:
- Availability: virtual assistants’ skillsets are rapidly expanding. Voicebot reports that since 2018, Google Assistant’s capabilities increased by 2.5 times to 4,253 actions and Alexa’s increased by 2.2 times to almost 80,000.
- Affordances: users are unsure what their virtual assistants are capable of, which leads to misaligned expectations; few ever look up the full breadth of their assistant’s skills online.
8 ways for virtual speakers to help people discover all they can do
- Be proactive: instead of reactive, user-initiated interaction, enable virtual assistants to proactively engage users with their skills.
- Timing is everything: present users with skill suggestions during the moment-of-need to ensure that these capabilities are more likely to be remembered in the future.
- Use contextual and personal signals: leverage a combination of the user’s contextual and personal signals, including long-term habits and patterns.
- Examine additional signals: contextual and personal information that is not yet accessible, such as human activities observable only with vision, represents an untapped opportunity for personalized recommendations.
- Consider privacy and utility: offer the right help at the right moment, and use recommendation explanations to proactively show which permissioned data a suggestion draws on.
- Permit multiple recommendations: suggest multiple skills when the recommendation model’s confidence is below a certain threshold at which a definitive skill would typically be suggested.
- Leverage companion devices: allow access to additional screens via WiFi or Bluetooth connectivity, such as smartphones, tablets, or desktop PCs, to enrich context and help provide more relevant skill suggestions.
- Support continuous learning: suggest new skills based on previous patterns of activity.
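The multiple-recommendations idea above (recommendation 6) can be sketched in a few lines: when the recommender’s top confidence score clears a threshold, suggest a single definitive skill; otherwise, offer several candidates. The skill names, scores, and the 0.8 threshold below are illustrative assumptions, not values used by any actual assistant.

```python
# Hypothetical sketch: choose between one definitive skill suggestion
# and multiple candidates based on the recommender's confidence.
CONFIDENCE_THRESHOLD = 0.8  # above this, suggest a single skill (assumed value)
MAX_SUGGESTIONS = 3         # otherwise, offer up to this many

def suggest_skills(scored_skills):
    """scored_skills: list of (skill_name, confidence) pairs."""
    ranked = sorted(scored_skills, key=lambda s: s[1], reverse=True)
    top_skill, top_score = ranked[0]
    if top_score >= CONFIDENCE_THRESHOLD:
        return [top_skill]  # definitive suggestion
    return [name for name, _ in ranked[:MAX_SUGGESTIONS]]

# An ambiguous request yields several suggestions...
ambiguous = suggest_skills([("order pizza", 0.45),
                            ("find recipes", 0.40),
                            ("call restaurant", 0.10)])
# ...while a confident match yields exactly one.
confident = suggest_skills([("play jazz", 0.92), ("play podcast", 0.05)])
```

In practice the threshold itself could be tuned with the continuous-learning signal from recommendation 8, but the split between "one confident answer" and "a short menu of options" is the core of the pattern.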
Progress on improving skill discovery
Most recently, Amazon introduced Alexa Conversations, a Deep Learning approach that allows developers to more effectively improve skill discovery with less effort, fewer lines of code, and less training data. While it is still in “preview”, Alexa Conversations has already generated considerable excitement among developers who build skills for the smart speaker.
Essentially, Alexa Conversations aims to establish a more natural and fluid interaction between Alexa and its users within a single skill. In future releases, the software is expected to bring multiple skills into a single conversation. It also claims to be able to handle ambiguous references, such as, “Are there any Italian restaurants nearby?” (near where?), as well as context preservation when transitioning from one skill to another, such as remembering the location of a certain movie theater when suggesting nearby restaurants.
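The context-preservation behavior described above can be illustrated with a minimal sketch: a shared dialogue context lets one skill reuse a slot (such as a location) that another skill resolved earlier. The class, slot names, and location below are invented for illustration; this is not the Alexa Conversations API.

```python
# Hypothetical sketch of cross-skill context preservation: a movie skill
# resolves a concrete location, and a later restaurant query reuses it
# to resolve the ambiguous reference "nearby".
class DialogueContext:
    def __init__(self):
        self.slots = {}  # slots shared across skills in one conversation

    def remember(self, slot, value):
        self.slots[slot] = value

    def resolve(self, slot, default=None):
        return self.slots.get(slot, default)

ctx = DialogueContext()

# The movie-ticket skill resolves "nearby" to a concrete theater location...
ctx.remember("location", "Alexanderplatz, Berlin")

# ...so the restaurant skill ("Are there any Italian restaurants nearby?")
# can resolve the same reference from context instead of re-asking the user.
query_location = ctx.resolve("location", default="ask the user")
```

Without the earlier skill's contribution, `resolve` falls back to its default, which is the point where a real assistant would have to ask a clarifying question.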
At Amazon’s re:MARS AI and ML conference in June, Rohit Prasad, VP and head scientist at Alexa, mentioned that Alexa Conversations’ machine learning capabilities can help it predict a customer’s true intention and goal from the direction of the dialogue, thus proactively enabling flow across multiple skills during conversation. If these promises are met, the command-query interaction with Alexa will surely begin to feel more like a natural human interaction.
The Future: Seeing and embodied virtual assistants
The progress made by the Alexa team is surely exciting, but conversational AI is not the only area with room for improvement. At Embodied AI we endorse integrating conversational AI and video understanding into an anthropomorphically embodied assistant, created by adding a camera and a screen to the existing speaker interface.
As our understanding of both natural language processing and computer vision continues to advance, there is little reason to limit virtual assistants to audio. The recent release of Amazon Echo Show 5, Facebook Portal and Google’s Nest Hub Max, all of which come with a camera and a screen, already foreshadow the industry’s movement towards virtual assistants that one day can see and be seen. One could reasonably speculate that the big tech companies are working on visually-enabled and embodied virtual assistants to replace their smart speakers in the near future. It’s a natural extension of their existing product lines.
Benefits of virtual assistants with a camera, screen, and anthropomorphic embodiment include:
- Multimodal I/O: instead of being restricted to audio, virtual assistants equipped with both speech I/O and video I/O gain greater intelligence and a more engaging graphical user interface.
- Improved skill discovery experience: leveraging computer vision captures contextual and personal signals currently untapped by audio-only devices, allowing the transition from user-initiated interaction to proactive assistance.
- Companion instead of a servant: with digital, human-like bodies, virtual assistants will no longer be perceived as servants, but rather as companions. While this does not directly improve skill discovery, it enriches the overall virtual assistant experience.
Roland Memisevic, TwentyBN’s CEO, envisions a future where our conversations with virtual assistants will, unlike with the current smart speakers, not feel like phone calls:
“Embodied avatars will not necessarily need wake words but can consistently be there, see, and listen, especially when they are edge-powered and free of privacy concerns. Using computer vision to unlock context awareness for virtual assistants, we will shift the assistant paradigm from query-response to memory-infused companionship. Asking our future companions about what skills they have will feel as ludicrous as asking your best friend if they breathe oxygen.”
“Hey Google, let’s wrap this up!”
Perhaps on another Friday evening, sometime in the near future, Embodied AI will revisit his friends’ rooftop flat and discover a new virtual assistant, one that is not only well-versed in conversation, but also equipped with eyes for understanding context and identifying needs not captured by words. Perhaps it might even have a digitized human body, becoming a virtual friend, who shares and adds to the lively atmosphere of a mid-summer’s night in Berlin.
Embodied AI reaches to take a sip from his glass of wine, and upon discovering that it is empty, hears his friends’ Google Assistant call from the corner, “I think we’re ready for another bottle of the white wine!”
Embodied AI is a bi-weekly newsletter of the latest news, technology, and trends behind AI avatars and virtual beings. Subscribe below: