Original article was published on Artificial Intelligence on Medium
Artificial Eyes: An AI-Enabled Mobile Application to Help Visually Challenged People with Safe Navigation and Currency Detection
Arumugam Shanmugavelayutham, Karthikeyan Nagarajan, Kaarthikraaj Ramanan, Roopa SD, Rajeev Samalkha (Mentor), Dr. D Narayana
Over the last few decades, the advent of new technology has positively influenced human life. In particular, advances in AI and ML technologies over the last 4 to 5 years are opening up new frontiers in developing life-enabling solutions. With the intent to apply our learning to a social cause, we decided to develop a solution for visually challenged people. Our project, “Artificial Eyes”, aims to enhance the quality of life of visually challenged people by helping them navigate independently and identify common objects (e.g. currency notes). The solution leverages image classification models that are amenable to mobile deployment and provide fast response times.
Keywords — Social cause, Image classification, Mobile interface
Introduction & Problem Definition
All of us are familiar with the challenges visually challenged people face when they move around or need to identify an object. They rely either on familiarity and feel (e.g. their house, known objects) or on aids (e.g. a walking stick) to identify obstacles and objects, and this limits their ability to handle unfamiliar scenarios independently. Any aid that can provide voice support would be of immense help. To be effective, the aid or tool should address the following three key aspects:
- Image capture & high accuracy in classification
- Faster response in real time
- Easy deployment & maintenance
The proposed solution is a mobile app that helps visually challenged people to:
- Identify obstacles ahead and alert them with a spoken warning
- Identify currency notes
The mobile app will use an API to get the live video stream or images from the phone camera and use the built-in image classification model to alert the user to any obstacles, or to identify and classify objects (currency notes) in front of them, in real time.
The solution we propose aims to increase the speed and accuracy of image classification, as it is meant for real-time human use, while also keeping the model “lite”, since it needs to be deployed on mobile phones. The workflow of the solution is as follows:
High level Solution Map
The app will cover the following functions:
- Capture a snapshot of the scene using the mobile camera and pass the snapshot to the app
- Classify the image as “Go/No Go” in the Android app, or identify the note in the Currency Detector (both apps have a built-in classification model)
- Give a voice/text alert to the user as “Go/No Go”, or announce the identified currency note, in real time
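The last two functions can be sketched as a small decision step that turns the classifier's output into the message spoken to the user. This is only an illustration, not the app's actual code: the label order, the 0.6 confidence threshold and the wording of the messages are all assumptions.

```python
# Hypothetical label order; the real app may order its classes differently.
OBSTACLE_LABELS = ["Go", "No Go"]
CURRENCY_LABELS = ["Rs 10", "Rs 20", "Rs 50", "Rs 100", "Rs 500"]


def top_prediction(probabilities, labels):
    """Return (label, probability) for the highest-scoring class."""
    if len(probabilities) != len(labels):
        raise ValueError("expected one probability per label")
    best = max(range(len(labels)), key=lambda i: probabilities[i])
    return labels[best], probabilities[best]


def alert_message(probabilities, labels, threshold=0.6):
    """Build the text that a text-to-speech engine would read aloud.

    The threshold is an assumed value: below it the app asks for a retry
    instead of announcing a low-confidence guess.
    """
    label, probability = top_prediction(probabilities, labels)
    if probability < threshold:
        return "Not sure, please hold the camera steady and try again"
    if label == "Go":
        return "Path is clear, go"
    if label == "No Go":
        return "Obstacle ahead, stop"
    return f"This is a {label} note"
```

For example, `alert_message([0.1, 0.9], OBSTACLE_LABELS)` yields the obstacle warning, while a near 50/50 output falls back to the retry message. For a safety-related alert, asking for a retry is arguably better than announcing a guess.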
Neural networks, specifically Convolutional Neural Networks (CNNs), are the typical approach to identifying objects within images. CNNs have been shown to perform much better on image and video recognition tasks than most other neural network architectures available today.
There are several leading CNN architectures that are currently researched, benchmarked and available for public and commercial use. We reviewed several of them to decide which CNN to use for our problem.
Model Hyperparameter Tuning
We tuned the network hyperparameters so that the model chosen for the production environment reaches an adequate level of performance on our evaluation metrics. The hyperparameters we looked at modifying to fine-tune the performance of the model are:
- Number of model layers
- Pretrained model weights
- Activation functions in each layer of the network
- Alpha (width multiplier of the layers): if Alpha < 1, the number of filters in each layer decreases proportionately
- Optimization algorithm
- Number of epochs for training
- Dropout layers
- Batch size
We ran several tests with different values of these hyperparameters and chose the best-performing model across the runs.
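To make the Alpha entry in the list above concrete, here is a minimal sketch of how a width multiplier shrinks the number of filters per layer. It follows the rounding convention used in public MobileNet implementations (channel counts are kept at multiples of 8). The list of base widths is taken from the MobileNetV2 paper and is used here for illustration only, not as a description of the exact network we trained.

```python
def make_divisible(value, divisor=8, min_value=None):
    """Round a channel count to the nearest multiple of `divisor`.

    The result is never allowed to fall more than 10% below the requested value.
    """
    if min_value is None:
        min_value = divisor
    rounded = max(min_value, int(value + divisor / 2) // divisor * divisor)
    if rounded < 0.9 * value:
        rounded += divisor
    return rounded


def scaled_filters(base_filters, alpha):
    """Apply the width multiplier `alpha` to every layer width."""
    return [make_divisible(filters * alpha) for filters in base_filters]


# Stage widths of the standard (alpha = 1.0) MobileNetV2, used for illustration.
BASE_WIDTHS = [32, 16, 24, 32, 64, 96, 160, 320]

print(scaled_filters(BASE_WIDTHS, 0.5))  # [16, 8, 16, 16, 32, 48, 80, 160]
```

Halving the width of a convolutional layer roughly quarters its weights, since both its input and output channel counts shrink. This is why a smaller Alpha is attractive for mobile deployment.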
Since we are building models for real-life use cases and context-specific scenarios, no data sets are readily available. The data set used for this project is based on sample videos and image snapshots collated by the team members. For the purpose of the PoC, we used data sets that were manually labeled as Go and No Go.
One common view is that “deep learning is only relevant when you have a huge amount of data”. Certainly, deep learning requires the ability to learn features automatically from the data, which is generally only possible when a lot of training data is available. However, convolutional neural networks are by design one of the best models available for most “perceptual” problems (such as image classification), even with very little data to learn from. To make the most of our few training examples, we can “augment” them via a number of random transformations, so that the model never sees the exact same picture twice. This helps prevent overfitting and helps the model generalize better. Keras provides the ImageDataGenerator class in its image preprocessing module, which allows random transformations and normalization operations to be configured and applied to the images during training.
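A minimal sketch of such an augmentation setup with Keras is shown below. The transformation values and the `image_dir` folder layout (one sub-folder per class) are assumptions made for illustration; the values actually used in our experiments are not listed in this article. Note that recent TensorFlow releases mark ImageDataGenerator as deprecated in favour of preprocessing layers, so treat this as a sketch for the Keras versions that still ship it.

```python
# Assumed augmentation settings, chosen only to illustrate the idea.
AUGMENTATION = {
    "rescale": 1.0 / 255,        # normalize pixel values to [0, 1]
    "rotation_range": 15,        # random rotation, in degrees
    "width_shift_range": 0.1,    # random horizontal shift (fraction of width)
    "height_shift_range": 0.1,   # random vertical shift (fraction of height)
    "zoom_range": 0.2,           # random zoom in or out
    "horizontal_flip": True,     # random mirror image
    "fill_mode": "nearest",      # how to fill pixels exposed by a transform
}


def make_train_iterator(image_dir, batch_size=32, target_size=(224, 224)):
    """Return an iterator yielding randomly transformed RGB batches.

    `image_dir` is a hypothetical folder with one sub-folder per class.
    """
    # Imported lazily so the settings above can be inspected without TensorFlow.
    from tensorflow.keras.preprocessing.image import ImageDataGenerator

    generator = ImageDataGenerator(**AUGMENTATION)
    return generator.flow_from_directory(
        image_dir,
        target_size=target_size,
        color_mode="rgb",
        class_mode="categorical",
        batch_size=batch_size,
    )
```

Which transformations make sense depends on the use case. Mirror flips are plausible for Go/No Go scenes but probably not for currency notes, since a mirrored note never appears in practice.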
We also considered the impact of the color encoding of the images on the learning ability of the model. For both the obstacle identification use case, where the objective is to classify the scene as Go/No Go, and the currency detection use case, we decided to use RGB images.
For the Obstacle identification (Go/No Go) use case, we used a balanced data set, split as follows:
- Training set — contains 28 images
- Cross Validation set — contains 8 images
- Test set — Contains 10 images
For the Currency Note identification use case, we used a tool to generate images of 5 different rupee notes, with 400 images per class (1500 images).
Exploratory Data Analytics
Training and validation data set observations:
Obstacle identification (Go/No Go) use case: all images are RGB and a standard 224 x 224 in size. Each image was labelled manually as “Go” or “No Go”.
Currency Detection use case: RGB images of size 224 x 224, labelled under 5 classes (Rs 10, 20, 50, 100, 500).
Challenges with the data set include:
- Glare & distortion in the images
- Varying angle and view of the images
- Subjective assessment when labelling images as Go or No Go (it depends on the distance and position of the obstacles)
Building the Image classification models
For image classification problems, one can build the model from scratch or take advantage of trained models that are already available. Transfer learning is a popular method, as it allows accurate models to be built in a time-saving way. For repurposing a pre-trained model, research papers suggest three strategies:
- Train the entire model: needs a large data set and a lot of computational power
- Train some layers and leave others frozen: as lower layers capture general (problem-independent) features and higher layers capture specific (problem-dependent) features, we can exploit this dichotomy by choosing how much of the network's weights to adjust. For a small data set and a model with a large number of parameters, it is suggested to leave more layers frozen to avoid overfitting
- Freeze the convolutional base: this is the extreme case of the train/freeze trade-off. The intent is to keep the convolutional base in its original form and use its outputs to feed the classifier. The pre-trained model then acts as a fixed feature extractor, which can be useful for use cases short on computational power.
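The three strategies differ only in how many layers of the pre-trained base stay frozen, which the sketch below expresses with a single parameter. The classifier head (global average pooling plus one softmax layer) and the function names are assumptions for illustration, not a record of the exact models we built.

```python
from types import SimpleNamespace


def freeze_first_layers(model, n_frozen):
    """Freeze the first `n_frozen` layers and leave the rest trainable.

    Works on any object exposing `.layers` whose items have a `.trainable`
    attribute, which is how Keras models behave. Returns the trainable count.
    """
    for index, layer in enumerate(model.layers):
        layer.trainable = index >= n_frozen
    return sum(1 for layer in model.layers if layer.trainable)


def build_mobilenet_classifier(num_classes, n_frozen, alpha=1.0):
    """Assumed setup: ImageNet-pretrained MobileNetV2 base plus a softmax head."""
    import tensorflow as tf  # lazy import: only needed when a real model is built

    base = tf.keras.applications.MobileNetV2(
        input_shape=(224, 224, 3), alpha=alpha, include_top=False, weights="imagenet"
    )
    freeze_first_layers(base, n_frozen)
    model = tf.keras.Sequential([
        base,
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="categorical_crossentropy",
                  metrics=["categorical_accuracy"])
    return model


# Stand-in model so the freezing logic can be checked without TensorFlow.
toy_model = SimpleNamespace(layers=[SimpleNamespace(trainable=True) for _ in range(5)])
print(freeze_first_layers(toy_model, 2))  # 3 layers remain trainable
```

With `n_frozen=0` the entire model is trained, an intermediate value trains only the upper layers, and `n_frozen=len(base.layers)` freezes the whole convolutional base.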
A practical guide summarizing the strategy, based on a size/similarity matrix, is presented below (Reference: https://towardsdatascience.com/transfer-learning-from-pre-trained-models-f2393f124751).
We chose “categorical accuracy” as the evaluation criterion. Because each class has the same number of instances, there is no class imbalance, so accuracy gives a good idea of the performance of the algorithm.
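For reference, categorical accuracy is simply the fraction of samples whose highest-scoring predicted class matches the one-hot label. The plain-Python sketch below mirrors that definition. The sample labels and predictions are made up for illustration and are not results from our experiments.

```python
def argmax(values):
    """Index of the largest value (first one wins on ties)."""
    return max(range(len(values)), key=values.__getitem__)


def categorical_accuracy(y_true, y_pred):
    """Fraction of samples where the predicted class equals the one-hot label."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and the same length")
    hits = sum(argmax(t) == argmax(p) for t, p in zip(y_true, y_pred))
    return hits / len(y_true)


# Made-up example with classes ordered as [Go, No Go].
labels = [[1, 0], [0, 1], [1, 0], [0, 1]]
predictions = [[0.8, 0.2], [0.3, 0.7], [0.4, 0.6], [0.1, 0.9]]
print(categorical_accuracy(labels, predictions))  # 0.75
```

With very small validation sets the metric moves in coarse steps. With 8 validation images, for instance, each misclassified image changes the accuracy by 12.5 percentage points.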
Evaluation of Best Model
Once we had a few CNN architectures in place, we needed to train the networks and find the combination of network architecture and hyperparameters that gives the best performance for the data we had and the classification task at hand. We did this by designing a series of experiments: the best model from one experiment was modified in the next experiment and its performance measured again.
Obstacle Detection (Go/No Go) use case: a summary of the different experiments is listed below.
The outcome of the experiments indicated that MobileNetV2 gave better results for the computing power required (using the number of parameters as the indicator) and was therefore more suited to the use case. We then set about tuning MobileNetV2 to see if we could get higher prediction accuracy by fine-tuning the hyperparameters. The results of the experiments are as follows:
Common design parameters:
- Image size — 224×224
- Classes — 2
- Training images — 28 ; Validation images — 8
- Epochs — 20
- Batch size 32
- Optimizer — Adam
- Learning rate — Default
Model Performance across various experiments
Our observation is that although pre-trained models such as ResNet and InceptionNet gave better accuracy, and the ground-up architecture with 15 layers interestingly gave the best accuracy of all, we chose MobileNetV2 as the final model for the solution because it is a proven and tested model across many use cases.
Currency Detection use case — Summary of Experiment
Final Model Output
- Obstacle Identification ( Go/No Go) use case
The best validation accuracy of 100 % was seen in the “fully trained model of MobileNetV2 with Alpha as 0.5”. We plan to validate the reliability of the accuracy levels with increased sample size.
Final Model Parameters –
- Model — MobileNetv2 with All layers trainable ( with Image Preprocessing )
- Training images — 28 ; Validation images — 8
- Epochs — 20
- Batch size 32
- Optimizer — Adam
- Alpha — 0.5
Final Model Performance
Final Model Parameters (Currency Detection use case):
- Model — MobileNetv2 with First 20 layers frozen and rest are trainable ( with Image Preprocessing )
- Training images — 750 ; Validation images — 750
- Epochs — 20
- Batch size 32
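A quick arithmetic check on what these settings mean for the two use cases: the number of weight updates in a run is the number of batches per epoch times the number of epochs. The sketch assumes that the last, partially filled batch is kept, which is the usual Keras behaviour.

```python
import math


def updates_per_run(n_train, batch_size, epochs):
    """Return (steps per epoch, total weight updates) for a training run."""
    steps_per_epoch = math.ceil(n_train / batch_size)
    return steps_per_epoch, steps_per_epoch * epochs


print(updates_per_run(28, 32, 20))   # (1, 20)   Go/No Go use case
print(updates_per_run(750, 32, 20))  # (24, 480) Currency detection use case
```

With 28 training images and a batch size of 32, every epoch is a single full-batch update, so the Go/No Go model sees only 20 weight updates in total. This is one more reason to re-validate its accuracy on a larger sample.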
Model Mobile Deployment — TensorFlow Lite with MobileNets
As the model will be deployed on a mobile device, we used the TensorFlow Lite framework to convert the trained TensorFlow model into a compressed FlatBuffer with the TensorFlow Lite converter.
TensorFlow Lite consists of a runtime on which you can run pre-existing models, and a suite of tools that you can use to prepare your models for use on mobile and embedded devices. Training is done on a high-powered machine, and the model is then converted to the .tflite format, from which it is loaded into a mobile interpreter.
Android App Architecture
The high-level architecture of the Android application is represented below:
Taking the trained Keras model as input, the TensorFlow Lite converter converts TensorFlow models into an optimized FlatBuffer format so that the TensorFlow Lite interpreter can consume them. (Note: FlatBuffers is an efficient open-source, cross-platform serialization library. It is similar to protocol buffers but smaller in terms of code footprint.)
The TensorFlow Lite converter generates a TensorFlow Lite Flat Buffer file (.tflite) from a TensorFlow model. The TensorFlow Lite FlatBuffer file is then deployed to a client device, and the TensorFlow Lite interpreter uses the compressed model for on-device inference.
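The conversion step described above can be sketched in a few lines using the TensorFlow 2.x converter API. The function names and the optional `quantize` switch are assumptions added for illustration. Post-training quantization in particular is not something this article reports using.

```python
from pathlib import Path


def convert_to_tflite(keras_model, quantize=False):
    """Convert a trained Keras model into TensorFlow Lite FlatBuffer bytes."""
    import tensorflow as tf  # lazy import; requires TensorFlow 2.x

    converter = tf.lite.TFLiteConverter.from_keras_model(keras_model)
    if quantize:
        # Optional post-training quantization (an assumption, see the note above).
        converter.optimizations = [tf.lite.Optimize.DEFAULT]
    return converter.convert()


def save_flatbuffer(model_bytes, path):
    """Write the converted model to disk and return its size in bytes."""
    path = Path(path)
    if path.suffix != ".tflite":
        raise ValueError("TensorFlow Lite models are conventionally saved as .tflite")
    path.write_bytes(model_bytes)
    return path.stat().st_size
```

A typical call would be `save_flatbuffer(convert_to_tflite(model), "model.tflite")`, where `model` is the trained Keras model and the file name is a placeholder. The resulting file is what gets bundled with the Android app and loaded by the interpreter.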
Android Neural Networks API (NNAPI)
We have used “NNAPI” for mobile deployment. The Android Neural Networks API (NNAPI) is an Android C API designed for running computationally intensive operations for machine learning on Android devices.
NNAPI is designed to provide a base layer of functionality for higher-level machine learning frameworks, such as TensorFlow Lite and Caffe2, which build and train neural networks. The API is available on all Android devices running Android 8.1 (API level 27) or higher. NNAPI supports inferencing by applying data from Android devices to previously trained, developer-defined models. Examples of inferencing include classifying images, predicting user behavior, and selecting appropriate responses to a search query.
Why On-Device Inferencing Is Important
- Latency: Requests need not be sent over a network connection and wait for a response. This is vital for this project, as the app needs to process successive frames coming from the camera
- Availability: The application runs even when outside of network coverage
- Speed: New hardware that is specific to neural network processing provides significantly faster computation than a general-purpose CPU alone.
- Privacy: The data does not leave the Android device.
- Cost: No server farm is needed when all the computations are performed on the Android device.
Tools & Hardware used to create the model
We used Keras, a high-level deep learning API, to create the model, with TensorFlow as the backend. This made it possible to run on both CPUs and GPUs without any change to the code. The technology stack used to build the model is as follows:
Learning from model building
Key learnings based on the model building experience
- Model building in deep learning is a highly experimental and iterative process. Small changes to the model can dramatically affect its performance
- Not all use cases are suitable for transfer learning: we found that training the entire model delivered better results than using the pre-trained model
- The image data plays a key role in the performance of the model. The kinds of images available for a class had a direct impact on the accuracy of the model as well. We need as diverse a set of images as possible for each class to keep the model accurate for all classes.
- The initial performance gains to a deep learning model are easy to achieve. Fine-tuning for higher performance is extremely difficult and we need to budget a significant amount of time to fine tune performance.
- Even with high training and validation performance, there is a relatively high chance that the model will not perform well enough in the real world. There is also a significant chance of performance degradation over time due to changes in features or new, unseen data entering the dataset. We need to constantly monitor the real-world performance of the model to ensure it is working well.
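As one possible way to act on the last point, the sketch below tracks accuracy over a sliding window of recent predictions and flags when it drops below a threshold. It assumes that true labels become available for at least some predictions (for example through user feedback), which is not part of the current app. The class name, window size and threshold are hypothetical.

```python
from collections import deque


class RollingAccuracy:
    """Track accuracy over the most recent `window` labelled predictions."""

    def __init__(self, window=100, alert_below=0.9):
        self.results = deque(maxlen=window)  # oldest results drop out automatically
        self.alert_below = alert_below

    def record(self, predicted, actual):
        """Store whether one prediction matched its confirmed label."""
        self.results.append(predicted == actual)

    @property
    def accuracy(self):
        """Accuracy over the window, or None before any result is recorded."""
        if not self.results:
            return None
        return sum(self.results) / len(self.results)

    def needs_attention(self):
        """True once the window is full and accuracy is below the threshold."""
        window_full = len(self.results) == self.results.maxlen
        return window_full and self.accuracy < self.alert_below
```

Waiting for a full window before raising the flag avoids alarms triggered by the first one or two mistakes.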
Business Value & Social Value
As the name suggests, our project “Artificial Eyes” is aimed at the social cause of enabling visually challenged people to handle some day-to-day tasks more confidently and independently. Performance metrics from the two proofs of concept, the obstacle detector and the currency identifier, look very promising, with high accuracy levels. A mobile-ready framework like TensorFlow Lite has eased deployment and adoption, and this should help the solution reach the masses at minimal cost.
We also see the possibility of extending the concept to business functions like product testing and validation.
We intend to continue to develop this model to improve the reliability and accuracy of the app to cover multiple real life scenarios. Our next steps are as follows:
- Increase the datasets available for this model to learn from, starting with an indoor image data set for the obstacle detection use case
- Increase the number of classes and class types that the model trains with so it can apply to a wide range of clients
- Learn the context of the classes (type of object, position of the object etc. ) so we can understand the context of images better and apply natural language captioning and search for the images.
- Android app: extend it with an audio interface capability
- Create a learning pipeline for new classes and images so that the model can continue to perform well with new data that comes in.
We also owe an immense amount of gratitude to all the Deep learning researchers and bloggers who have generously published their work for us to learn from. It would have been impossible for us to move forward with the pace we did without having any access to the wealth of information that they have shared with the community.
We would like to express our sincere gratitude to our mentor Rajeev for his guidance and for his valuable inputs and suggestions throughout the whole process.