By Srividhya Thirumalairajan
“A healthful eating pattern and regular physical activity are key components of diabetes management.” — American Diabetes Association
As people are becoming increasingly aware of the importance of a healthy diet, the need for automatic food and drink recognition systems has risen. Recent advances in artificial intelligence and deep learning have enabled companies and start-ups to develop solutions that automatically classify food and drink images. These image recognition services can also estimate nutritional values, aiding dietary assessment and planning. That can help users who want to prevent nutrition-related conditions.
This summer I had the opportunity to collaborate with Sandya Madhavan, a Data Analyst intern at Zeta Metrics and a student in Syracuse University’s Experiential Learning program. Driven by curiosity, we started researching some of the leading image analysis services.
We chose to benchmark the Instagaze food image recognition API against three general-purpose machine learning services: Google Cloud Vision, Amazon Rekognition, and Microsoft Computer Vision.
We analyzed which of these four services produces the most accurate and acceptable food labels for the corresponding images.
Experiment & Procedure
To begin our study, we collected 517 images from various internet sources. We chose images from different cuisines to avoid bias and maintained the dataset in a Google Sheet. We only procured images that came with a correct food label.
To keep accuracy and processing time comparable across services, each image was first resized to 640×480 and saved in JPEG format. The resized images were then uploaded to Google Cloud Storage, AWS S3, and Azure Storage. Each image’s resource identifier was passed to the corresponding image analysis service. Each service returned a set of labels with confidence scores. These were stored in separate datasets along with the original image URL and the correct label.
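The storage step can be sketched as follows. This is a minimal Python sketch. It assumes each service’s response has already been reduced to (label, confidence) pairs. The column names, the hypothetical `to_rows` and `write_dataset` helpers, and the one-row-per-label CSV layout are illustrative assumptions, not the study’s actual schema. The resizing and upload steps are not shown.

```python
import csv
import io
from typing import Iterable

# Hypothetical column layout: one row per (image, generated label) pair.
FIELDS = ["service", "image_url", "correct_label", "label", "confidence"]


def to_rows(service: str, image_url: str, correct_label: str,
            labels: Iterable[tuple[str, float]]) -> list[dict]:
    """Flatten one service's (label, confidence) pairs for one image into rows."""
    return [
        {
            "service": service,
            "image_url": image_url,
            "correct_label": correct_label,
            "label": label,
            "confidence": confidence,
        }
        for label, confidence in labels
    ]


def write_dataset(rows: list[dict]) -> str:
    """Serialize rows as CSV text (the study kept one dataset per service)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Placeholder service name, URL, and confidence values, for illustration only.
rows = to_rows("example-service", "https://example.com/dumplings.jpg",
               "Pan fried pork dumplings", [("Dish", 0.95), ("Food", 0.93)])
print(write_dataset(rows))
```

Keeping one row per generated label makes it easy to count labels per image and per service later. The study compares these counts across services.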
Exact Label Classification
We ran some exploratory analysis on the labels that Instagaze, Google Cloud Vision, Amazon Rekognition, and Microsoft Computer Vision generated for all images, and it revealed some challenges. All services struggled to produce the exact label for an image because of the general nature of food and drink images. Food is deformable, so its structure is hard to define. In Figure 4, none of the labels generated for “Pan fried pork dumplings” matches that exact label.
Our team used a heuristic to estimate how close each generated label is to the correct label. For this we used Levenshtein distance, a string metric that counts the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into another. Normalizing this distance gave us a rough similarity score between each generated label and the actual label.
For example, the normalized Levenshtein score between “Pan fried pork dumplings” and “Dish” is 0.14, indicating very little similarity between the two.
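The article does not say how the distance was normalized. The sketch below is a plain Python implementation of Levenshtein distance with a configurable substitution cost. One normalization reproduces the 0.14 above. It is the indel-based ratio behind the python-Levenshtein package’s `ratio`: weight a substitution as 2, the cost of a deletion plus an insertion, then compute `1 - distance / (len(a) + len(b))`. Treat this as an assumption about the study’s method, not a description of it. The comparison is case-sensitive, so lowercasing both strings first would give a different score.

```python
def levenshtein(a: str, b: str, sub_cost: int = 1) -> int:
    """Minimum edit cost to turn a into b.

    Insertions and deletions cost 1; substitutions cost `sub_cost`
    (1 for classic Levenshtein, 2 for an insert/delete-only "indel" distance).
    """
    # prev[j] holds the cost between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,                                  # delete ca
                curr[j - 1] + 1,                              # insert cb
                prev[j - 1] + (0 if ca == cb else sub_cost),  # match or substitute
            ))
        prev = curr
    return prev[-1]


def normalized_similarity(a: str, b: str) -> float:
    """Score in [0, 1]: 1 - indel distance / total length (assumed normalization)."""
    total = len(a) + len(b)
    if total == 0:
        return 1.0
    return 1 - levenshtein(a, b, sub_cost=2) / total


print(levenshtein("Pan fried pork dumplings", "Dish"))                      # 23
print(round(normalized_similarity("Pan fried pork dumplings", "Dish"), 2))  # 0.14
```

The classic distance of 23 only says the strings are far apart. Normalizing by length turns it into a score that can be compared across labels of different lengths.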
Acceptable Label Classification
Acceptable label classification was another challenge, because one image can have multiple acceptable labels. For instance, in Figure 6, “Jiaozi”, the Chinese name for this kind of dumpling, is an acceptable label for “Pan fried pork dumplings”.
To solve this issue, trained data analysts manually identified all acceptable labels for each image, as shown in Figure 6 below.
This process let us count how many generated labels were acceptable and quantify the volume of unacceptable labels.
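Once reviewers have produced a set of acceptable labels for each image, counting a service’s acceptable and unacceptable labels is a set-membership check. The minimal sketch below makes a few assumptions. The curated labels sit in a Python dict keyed by a hypothetical image ID. Labels are compared case-insensitively after collapsing whitespace, which the article does not specify. The example labels are made up for illustration.

```python
from collections import Counter


def normalize(label: str) -> str:
    """Case-insensitive, whitespace-collapsed form used for matching (assumption)."""
    return " ".join(label.lower().split())


def count_acceptable(generated: dict[str, list[str]],
                     acceptable: dict[str, set[str]]) -> Counter:
    """Count acceptable vs. not acceptable generated labels for one service.

    generated:  image ID -> labels the service returned for that image
    acceptable: image ID -> labels reviewers accepted for that image
    """
    counts = Counter(acceptable=0, not_acceptable=0)
    for image_id, labels in generated.items():
        accepted = {normalize(a) for a in acceptable.get(image_id, set())}
        for label in labels:
            key = "acceptable" if normalize(label) in accepted else "not_acceptable"
            counts[key] += 1
    return counts


# Hypothetical reviewer decisions and service output for the dumplings image.
acceptable = {"img-001": {"Pan fried pork dumplings", "Jiaozi", "Dumpling"}}
generated = {"img-001": ["Dish", "Jiaozi", "dumpling", "Food"]}
print(count_acceptable(generated, acceptable))
```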
Heuristic classification using Levenshtein distance produced the following results across all machine learning services.
The normalized Levenshtein scores do not measure overall accuracy. They do serve as a good barometer of how each service would perform if we were to manually review all of its labels. With a threshold of 0.9, Instagaze outperformed the other image analysis services.
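One way to apply the 0.9 threshold is to count how many generated labels score at or above it against the correct label. That reading is an assumption, since the article does not say exactly what was counted. The sketch also assumes an indel-based normalization, 1 - (insertions + deletions) / (len(a) + len(b)), which equals 2 × LCS / (len(a) + len(b)). Here LCS is the length of the longest common subsequence. Labels are compared case-sensitively, and the label pairs are placeholders, not study data.

```python
def indel_similarity(a: str, b: str) -> float:
    """1 - indel distance / (len(a) + len(b)), computed as 2 * LCS / total length."""
    if not a and not b:
        return 1.0
    # Longest common subsequence length via a rolling DP row.
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if ca == cb else max(prev[j], curr[j - 1]))
        prev = curr
    return 2 * prev[-1] / (len(a) + len(b))


def share_above_threshold(pairs: list[tuple[str, str]], threshold: float = 0.9) -> float:
    """Fraction of (correct, generated) label pairs scoring at or above threshold."""
    if not pairs:
        return 0.0
    hits = sum(indel_similarity(correct, label) >= threshold for correct, label in pairs)
    return hits / len(pairs)


# Placeholder (correct label, generated label) pairs.
pairs = [
    ("Pan fried pork dumplings", "Pan-fried pork dumplings"),  # scores about 0.96
    ("Pan fried pork dumplings", "Dish"),                      # scores about 0.14
]
print(share_above_threshold(pairs))  # 0.5
```

A high threshold such as 0.9 only credits near-exact spellings of the correct label. Synonyms like “Jiaozi” score low here, which is why the study also reviewed labels manually.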
After manually reviewing all the generated labels, we found that each service generated a different number of labels per image. For example, Google Cloud Vision generated the most labels across all images, while Instagaze generated the fewest.
We concluded that Instagaze is the most precise at generating acceptable labels for food and drink images. Instagaze generated 911 acceptable labels and was the only service to generate more acceptable than unacceptable labels, as shown in Figure 10.
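The article does not define label precision. One natural reading is the share of a service’s generated labels that reviewers judged acceptable. Under that reading, “more acceptable than unacceptable labels” means a label precision above 50%. The sketch below uses that assumed definition. The service names and counts are placeholders, not figures from the study.

```python
from dataclasses import dataclass


@dataclass
class LabelCounts:
    acceptable: int
    not_acceptable: int

    @property
    def total(self) -> int:
        return self.acceptable + self.not_acceptable

    @property
    def label_precision(self) -> float:
        """Share of generated labels judged acceptable (assumed definition)."""
        return self.acceptable / self.total if self.total else 0.0

    @property
    def mostly_acceptable(self) -> bool:
        """True when acceptable labels outnumber unacceptable ones."""
        return self.acceptable > self.not_acceptable


# Placeholder counts for two hypothetical services, not results from the study.
services = {
    "service_a": LabelCounts(acceptable=60, not_acceptable=40),
    "service_b": LabelCounts(acceptable=30, not_acceptable=70),
}
for name, counts in services.items():
    print(f"{name}: precision={counts.label_precision:.0%}, "
          f"mostly acceptable={counts.mostly_acceptable}")
```

A service that returns fewer labels per image can score higher on this metric even with a similar number of acceptable labels. That is worth remembering given how much label volume varied between services.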
To fulfill the purpose of our study, we also examined how many images received at least one acceptable label from each image analysis service, as shown in Figure 12. Across the four image recognition services benchmarked, Instagaze had the highest number of acceptable labels, with an image precision rate greater than 80%.
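Image precision can be read as the share of images for which a service returned at least one acceptable label. This is an assumption based on the description above. The minimal sketch below uses hypothetical image IDs and labels and compares labels case-insensitively. An image that received no labels should still appear in `generated` with an empty list, so it counts against the service.

```python
def image_precision(generated: dict[str, list[str]],
                    acceptable: dict[str, set[str]]) -> float:
    """Fraction of images with at least one acceptable generated label."""
    if not generated:
        return 0.0
    hits = 0
    for image_id, labels in generated.items():
        accepted = {a.casefold() for a in acceptable.get(image_id, set())}
        if any(label.casefold() in accepted for label in labels):
            hits += 1
    return hits / len(generated)


# Hypothetical image IDs, reviewer decisions, and service output.
acceptable = {
    "img-001": {"Pan fried pork dumplings", "Jiaozi"},
    "img-002": {"Latte", "Coffee"},
}
generated = {
    "img-001": ["Dish", "Jiaozi"],
    "img-002": ["Cup", "Tableware"],
}
print(image_precision(generated, acceptable))  # 0.5
```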
We found that domain-agnostic image analysis services tend not to perform well at food and drink detection because food images are inherently fuzzy. Instagaze’s ML layers on top of a domain-agnostic CNN provided up to seven times more precise results than the other leading image recognition services. Instagaze outperformed Google Cloud Vision, Amazon Rekognition, and Microsoft Computer Vision in label precision, image precision, and acceptable label volume. From our study, we concluded that artificial intelligence, with the help of deep learning, can provide better image recognition technology in the near future to help us live healthier lifestyles.
A link to our complete study can be found here.
Source: Deep Learning on Medium