Original article was published by Jamshed Khan on Deep Learning on Medium
Let’s dig a little deeper into the product recognition pipeline’s internal design structure.
According to this system diagram from the Google AI blog shown above, the design can be broken down into nine major components:
Frame-Selector — The selector component comes into play when the user points the camera towards a product, acting as a filter. Its purpose is to select frames from the mobile vision stream that satisfy certain quality criteria (such as white balance).
Detector — The selected frames then go on to an ML model whose function is to zero in on the individual regions of the image that are of interest. In other words, this step draws the familiar bounding boxes on the image frames. In the Lookout app, this is an MnasNet model.
Cache — The system keeps a dedicated cache of the stream images, which the other system components can access as needed. To save memory, the cache has protocols to avoid storing duplicate or redundant frames.
Object tracker — Once the frames have had bounding boxes drawn on them, the system tracks the detected regions across the live stream in real time using MediaPipe Box tracking. The object tracker maintains an object map in which each object is assigned a unique ID, which makes it easier to tell objects apart and avoids keeping duplicates of the same object in memory. If an object is repeated in the stream, the object tracker simply updates the entry for its ID based on the bounding box.
Embedder — The embedder is a neural network trained from a large classification model spanning tens of thousands of classes. It has a special “embedding” layer that projects the input image into an “embedding space”. The idea is to tune the network so that two points being close within this space means that the images they represent are visually similar (for example, different images of the same product).
Here’s where the researchers get creative: since the model is far too large to be used on-device, the vectors resulting from this “embedding space” are used to train a smaller version of the model. They refer to the original model as the “teacher model” and the smaller, mobile-friendly version as the “student model”. To further reduce the memory footprint, principal component analysis is applied to reduce the dimensionality of the embedding vectors.
Index searcher — This component is tasked with looking up the relevant results for the image patterns. It performs a KNN search using the features as a query and returns the highest-ranked index containing the matching metadata, such as brand name or package size. Low latency is achieved by clustering the indexes with the k-means algorithm, so that a query only has to be compared against the entries in the closest clusters rather than the whole index.
OCR — There is also an OCR component in the bundle. However, while traditional systems rely on OCR as the primary means of index searching, its purpose here is to help the system refine results. OCR extracts additional information from the frames (for example, package size or product flavor variant). This extracted text is passed to the scorer component, where it contributes to the score of each candidate and improves precision.
Scorer — The scorer component assigns scores to the results obtained from the index searcher, assisted by the OCR output to achieve more accurate results. The result with the highest score is used as the final product recognition displayed to the user.
Result presenter — This is a UI component whose job is to present the final result to the user. This can be done via the app GUI or a speech service.
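To make the index searcher and scorer steps above more concrete, here is a minimal Python sketch of a k-means-clustered index, a KNN lookup restricted to the closest cluster, and an OCR-assisted re-scoring pass. It is an illustration under stated assumptions, not Lookout's actual implementation: the function names, the toy 2-D embeddings, the product metadata, and the scoring formula (negative distance plus a weighted OCR token overlap) are all hypothetical.

```python
import math

def dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(vectors, k, iters=10):
    """Plain Lloyd's k-means, seeded with the first k vectors (assumption)."""
    centroids = [list(v) for v in vectors[:k]]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for v in vectors:
            nearest = min(range(k), key=lambda c: dist(v, centroids[c]))
            clusters[nearest].append(v)
        for c, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster goes empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids

def build_index(products, k):
    """Group (embedding, metadata) pairs into k buckets, one per centroid."""
    centroids = kmeans([emb for emb, _ in products], k)
    buckets = [[] for _ in range(k)]
    for emb, meta in products:
        nearest = min(range(k), key=lambda c: dist(emb, centroids[c]))
        buckets[nearest].append((emb, meta))
    return centroids, buckets

def search(index, query, top_k=3):
    """KNN search limited to the bucket whose centroid is closest to the query."""
    centroids, buckets = index
    nearest = min(range(len(centroids)), key=lambda c: dist(query, centroids[c]))
    ranked = sorted(buckets[nearest], key=lambda item: dist(query, item[0]))
    return [(dist(query, emb), meta) for emb, meta in ranked[:top_k]]

def rescore(candidates, ocr_text, ocr_weight=1.0):
    """Return the metadata of the candidate with the best combined score.

    Hypothetical scoring: visual closeness (negative distance) plus a bonus
    for every OCR token that also appears in the candidate's metadata.
    """
    ocr_tokens = set(ocr_text.lower().split())

    def score(candidate):
        distance, meta = candidate
        meta_tokens = set(" ".join(meta.values()).lower().split())
        return -distance + ocr_weight * len(ocr_tokens & meta_tokens)

    return max(candidates, key=score)[1]

# Toy data: two visually similar cola packages and two soap packages.
products = [
    ([0.0, 0.0], {"brand": "Acme", "name": "cola", "size": "330ml"}),
    ([0.1, 0.0], {"brand": "Acme", "name": "cola", "size": "1l"}),
    ([5.0, 5.0], {"brand": "Zest", "name": "soap", "size": "100g"}),
    ([5.1, 5.0], {"brand": "Zest", "name": "soap", "size": "200g"}),
]
index = build_index(products, k=2)
candidates = search(index, [0.04, 0.0])      # query embedding from the camera frame
best = rescore(candidates, "ACME COLA 1L")   # text read off the package by OCR
print(best["size"])
```

In this toy run the visually nearest neighbor is the 330ml package, but the OCR text mentions "1L", so the re-scoring step promotes the 1l variant. That is the kind of refinement the OCR and scorer components are described as providing; a real system would use a far larger index and a more careful scoring function.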
While this product recognition system was originally implemented to help users with visual impairments that make it difficult to identify packaged products on display, it can also be a useful tool during the ongoing COVID-19 pandemic.
It can help replace the need to physically touch a product on display, allowing customers and store employees to examine packaging information by simply using their smartphone instead. Since the computer vision tasks required for this activity are performed completely on-device, hurdles such as internet connectivity or latency issues do not come into play. The on-device functionality in a product recognition app such as this can be used to usher in various in-store experiences, such as displaying more detailed facts about products (nutritional information, allergen warnings, etc.).
This is just one of many steps that Google is taking to further on-device machine learning. Google’s Pixel 4 smartphone, released in 2019, was certainly a milestone in the field, featuring the Pixel Neural Core with an instantiation of the Edge TPU architecture, Google’s machine learning accelerator for edge computing devices. Google has also developed next-generation models such as MobileNetV3 and MobileNetEdgeTPU to advance on-device computer vision.