Google AI Helps Sign Language ‘Take the Floor’ in Video Conferences

Original article was published by Synced on Artificial Intelligence on Medium

We can often see sign language interpreters at press conferences or other public address scenarios, standing on stage and conveying speakers’ messages using hand gestures and body movements. But sign language has no such formal presence in the hectic environment of contemporary video conferencing, where the platforms generally use audio cues to spotlight the person speaking at a given moment. To enable signers to “take the floor” in such video meetings, a team of researchers from Google, Bar-Ilan University, and the University of Zurich recently developed a sign language detection model for video conferencing applications that can perform real-time identification of a person signing as an active speaker.

There are already effective ML-based sign language recognition systems for video that can identify and interpret the form and meaning of signs. The proposed browser-based model tackles a narrower task: it is designed to detect when a person is signing, not what is being signed.

The use of real-time video conferencing applications has greatly increased this year due to remote working arrangements. The team set out to develop a system that was easy to use and lightweight, to avoid confusing users or compromising call quality while enabling efficient and continuous video frame monitoring and detection. The model learns to isolate information from video that concerns physical actions in specified human body landmarks, such as joints.

The researchers developed a simple optical-flow representation of the observed motion of these human body landmarks based on pose estimation, which considerably reduces input size compared to using an entire HD image. Poses are extracted from each video frame, and the optical flow between consecutive frames is continuously calculated from the landmarks. These optical-flow representations are then fed to a long short-term memory (LSTM) network, a temporally sensitive neural architecture, which classifies whether a person is signing or not.
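The landmark-based optical flow described above can be sketched in a few lines of Python. This is an illustrative sketch, not the researchers' code: it assumes flow is the Euclidean distance each landmark travels between two consecutive frames, scaled by the frame rate so that clips recorded at different frame rates are comparable. The function names and the two-landmark example are hypothetical, and the normalization used in the paper may differ.

```python
import math

def pose_optical_flow(prev_pose, curr_pose, fps):
    """Per-landmark motion magnitude between two consecutive frames.

    prev_pose, curr_pose: lists of (x, y) landmark coordinates, same order.
    fps: frame rate; scaling by it expresses motion per second (assumption).
    """
    if len(prev_pose) != len(curr_pose):
        raise ValueError("poses must have the same number of landmarks")
    return [
        math.hypot(cx - px, cy - py) * fps
        for (px, py), (cx, cy) in zip(prev_pose, curr_pose)
    ]

def flow_sequence(poses, fps):
    """Optical-flow features for a clip: one vector per pair of frames."""
    return [pose_optical_flow(a, b, fps) for a, b in zip(poses, poses[1:])]

# Three frames, two hypothetical landmarks (say, a wrist and a shoulder).
poses = [
    [(0.0, 0.0), (1.0, 1.0)],
    [(3.0, 4.0), (1.0, 1.0)],  # wrist moves 5 units, shoulder stays still
    [(3.0, 4.0), (1.0, 1.0)],  # nothing moves
]
flows = flow_sequence(poses, fps=30)
print(flows)  # [[150.0, 0.0], [0.0, 0.0]]
```

Each frame pair is reduced to one number per landmark, which is why the input is so much smaller than a full HD frame.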

The researchers tested their approach on the German Sign Language corpus (DGS), comprising 301 videos of people signing and gloss annotations showing in which frames people are signing. Using a single-layer LSTM followed by a linear layer to predict when a person is signing using optical flow data, the model achieved up to 91.5 percent accuracy, with just 3.5 ms (0.0035 seconds) of processing time per frame.
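To make the "single-layer LSTM followed by a linear layer" concrete, here is a minimal forward-pass sketch in plain Python. It is not the released model: the weights are random and untrained, and the class name, hidden size, and input size are hypothetical choices made only for illustration. A real implementation would use a deep learning framework and trained parameters.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def matvec(matrix, vec):
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]

class TinyLSTMClassifier:
    """Single-layer LSTM followed by a linear layer (untrained, illustrative)."""

    GATES = "ifog"  # input, forget, output, candidate

    def __init__(self, n_in, n_hidden, seed=0):
        rng = random.Random(seed)

        def rand_matrix(rows, cols):
            return [[rng.uniform(-0.1, 0.1) for _ in range(cols)]
                    for _ in range(rows)]

        # One weight matrix per gate, applied to the concatenated [input, hidden].
        self.w = {k: rand_matrix(n_hidden, n_in + n_hidden) for k in self.GATES}
        self.b = {k: [0.0] * n_hidden for k in self.GATES}
        # Linear layer: hidden state -> two logits (not signing, signing).
        self.out_w = rand_matrix(2, n_hidden)
        self.out_b = [0.0, 0.0]
        self.n_hidden = n_hidden

    def step(self, x, h, c):
        z = x + h  # concatenate input and previous hidden state
        pre = {k: [a + b for a, b in zip(matvec(self.w[k], z), self.b[k])]
               for k in self.GATES}
        i = [sigmoid(v) for v in pre["i"]]
        f = [sigmoid(v) for v in pre["f"]]
        o = [sigmoid(v) for v in pre["o"]]
        g = [math.tanh(v) for v in pre["g"]]
        c = [fv * cv + iv * gv for fv, cv, iv, gv in zip(f, c, i, g)]
        h = [ov * math.tanh(cv) for ov, cv in zip(o, c)]
        return h, c

    def predict(self, frames):
        """Return one label per frame: 1 for signing, 0 for not signing."""
        h = [0.0] * self.n_hidden
        c = [0.0] * self.n_hidden
        labels = []
        for x in frames:
            h, c = self.step(list(x), h, c)
            logits = [a + b for a, b in zip(matvec(self.out_w, h), self.out_b)]
            labels.append(1 if logits[1] > logits[0] else 0)
        return labels

model = TinyLSTMClassifier(n_in=2, n_hidden=4)
labels = model.predict([[150.0, 0.0], [0.0, 0.0], [12.0, 3.0]])
```

Because the hidden and cell states carry over from frame to frame, each prediction depends on the motion history and not only on the current frame, which is the property that makes an LSTM a natural fit for this task.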

When a video conference participant is detected using sign language, an ultrasonic 20 kHz audio tone is sent through a virtual audio cable to trigger the active speaker function in video conferencing applications. It “manages to fool Google Meet, Zoom and Slack into thinking the user is speaking, while still being inaudible,” the paper explains.
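As a rough illustration of the audio side, the sketch below synthesizes a short 20 kHz sine tone using only the Python standard library. It covers signal generation only; how the tone is routed into a conferencing application is outside its scope. The function name, the 48 kHz sample rate, the duration, and the amplitude are assumptions chosen for the example. The sample rate must exceed twice the tone frequency (the Nyquist limit), which is why 44.1 kHz or 48 kHz would work but 32 kHz would not.

```python
import io
import math
import struct
import wave

def ultrasonic_tone_wav(freq_hz=20_000, duration_s=0.1,
                        sample_rate=48_000, amplitude=0.5):
    """Return WAV file bytes for a mono, 16-bit sine tone."""
    if freq_hz >= sample_rate / 2:
        raise ValueError("frequency must be below sample_rate / 2 (Nyquist)")
    n_samples = round(duration_s * sample_rate)
    peak = int(amplitude * 32767)  # scale to the signed 16-bit range
    frames = b"".join(
        struct.pack("<h", int(peak * math.sin(2 * math.pi * freq_hz * n / sample_rate)))
        for n in range(n_samples)
    )
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit audio
        wav.setframerate(sample_rate)
        wav.writeframes(frames)
    return buf.getvalue()

tone = ultrasonic_tone_wav()
```

In a live system the tone would be played only while the detector reports signing, so the conferencing application sees "speech" that starts and stops with the signer.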

Google AI has open-sourced the training code and the models for the web demo on GitHub. The paper Real-Time Sign Language Detection using Human Pose Estimation is available on the Google Research website. The model will be presented at SLRTP 2020 and demoed at ECCV 2020.