Basics of Computer Vision and some of its applications


Computer vision is basically about making computers see and understand the content of images and videos. Simple.

The internet is flooded with images and videos; if we want to index and search them, computers need to know what they contain.

It is a field of computer science that works on enabling computers to see, identify, and process images the same way humans do. It is a field of AI.

You may be wondering how computer vision is different from image processing.

Well, image processing is the process of creating a new image from an existing one, perhaps by enhancing the content in some way. It is NOT concerned with understanding the content of the image.

Computer vision, on the other hand, aims to understand the content of the image. It may require image processing to be applied to the raw image first, e.g. cropping, noise removal, normalization, or brightness adjustment.

An object can be seen from any orientation and under any lighting condition. A true computer vision system should be able to look at any number of such images and still extract something meaningful. This is the main challenge of computer vision.

The moment you upload a picture to Facebook, a feature called auto-tag suggests the names of the people who appear in it. This works entirely because of computer vision. The idea is to automate tasks that the human visual system can do.

Having said that, let's understand how a computer reads an image. A human can easily tell that there is a car in the picture below. But can a computer really see it? The answer is no.

Computers see an image as a matrix of numbers between 0 and 255. For colour images there are three channels, Red, Green, and Blue, i.e. RGB, with a matrix associated with each channel. Each element of a matrix represents the brightness intensity of one pixel. The channels are stacked on top of each other to create a 3D matrix, so the computer interprets the image as a 3D matrix, as shown in the image below.
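You can verify this yourself with a minimal sketch, assuming an image file named 'car.jpg' (the file name is a placeholder):

import cv2

# Load a colour image; OpenCV returns a NumPy array of unsigned 8-bit integers
img = cv2.imread('car.jpg', 1)

print(img.shape)  # (height, width, 3) -- three stacked colour channels
print(img.dtype)  # uint8 -- every pixel value lies between 0 and 255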

There are three main processing components of computer vision.

  1. Image Collection: It can be done using a webcam, DSLR, or mobile camera.
  2. Image Processing: edge detection, segmentation, classification, etc.
  3. Image Analysis: object recognition, object tracking, etc.

While I was studying Business Analytics at ISB, Hyderabad, I did a three-month capstone project with three of my friends on building smart glasses for the visually impaired. This project gave us a lot of opportunity to explore and get hands-on experience with computer vision and deep learning. I will try to share some of that learning with simple examples. But before that, I would like to show some basics of image processing using one of the most famous computer vision libraries, OpenCV.

OpenCV: the open-source computer vision library. It supports the deep learning frameworks PyTorch, TensorFlow, and Caffe. If you want to apply facial recognition to video, do image classification, object detection, etc., you will need to learn OpenCV sooner or later. Let's get started with some of the very basic functions:

cv2.imread(): used to read an image. It takes two arguments: the path of the image and the way we want the image to be read. In the example below, '1' means the image is loaded in full colour. You may also notice that I convert the image to RGB: OpenCV uses BGR as its default colour order, while matplotlib uses RGB, so an image loaded with OpenCV and displayed with matplotlib has its channels back to front. The easiest fix is to have OpenCV explicitly convert it back to RGB. I have also explained below why I am using matplotlib's imshow() rather than OpenCV's imshow().
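A minimal sketch of what this looks like ('car.jpg' is a placeholder path):

import cv2
import matplotlib.pyplot as plt

img = cv2.imread('car.jpg', 1)                  # 1 == cv2.IMREAD_COLOR
img_rgb = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)  # swap BGR -> RGB for matplotlib

plt.imshow(img_rgb)
plt.axis('off')
plt.show()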

cv2.IMREAD_COLOR: loads a colour image. Mostly used for 8-bit images that don't have an alpha channel. You can pass '1' as the argument, as in the example above; it means the same thing.

cv2.IMREAD_GRAYSCALE: loads the image in grayscale. You can pass '0' to imread(); it means the same thing.

cv2.VideoCapture(): used to read a video. If you specify a path, it will read the video file; if 0 is specified, it will take the webcam as the input video.

video = cv2.VideoCapture('Test.avi')

cv2.imshow(): used to view an image or a video frame. Please note that cv2.imshow() does not work in a Jupyter notebook; your notebook will stop responding if you run it. That is why I used matplotlib's imshow() in the example above.

cv2.waitKey(0): waits until any key is pressed.

cv2.destroyAllWindows(): destroys all the temporary windows that were opened.
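Putting these together, a minimal script that plays a video frame by frame might look like this (run it as a script, not in a notebook, since it relies on cv2.imshow()):

import cv2

video = cv2.VideoCapture('Test.avi')  # or cv2.VideoCapture(0) for the webcam

while True:
    ret, frame = video.read()  # ret is False once the video ends
    if not ret:
        break
    cv2.imshow('frame', frame)
    if cv2.waitKey(25) & 0xFF == ord('q'):  # press 'q' to quit early
        break

video.release()
cv2.destroyAllWindows()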

and so on. Go ahead and explore features like resizing, changing colour spaces, image rotation, translation, and filtering if you want to get into the details; a few of these are sketched below.
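A quick sketch of some of those operations, again with 'car.jpg' as a placeholder:

import cv2

img = cv2.imread('car.jpg', 1)

small = cv2.resize(img, None, fx=0.5, fy=0.5)       # resize to half size
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)        # change colour space
rotated = cv2.rotate(img, cv2.ROTATE_90_CLOCKWISE)  # rotate 90 degrees
blurred = cv2.GaussianBlur(img, (5, 5), 0)          # simple filtering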

Now, I will explain some of the applications of computer vision.

Image classification: assigning a label to an entire image. It is also called object classification or image recognition. There is a large number of categories into which images can be classified. Let's say you want to assign a label to all the images that contain a car; it doesn't matter which car.
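As one possible illustration (not code from this project), a pre-trained classifier from torchvision can label an image in a few lines; 'car.jpg' is again a placeholder:

import torch
from torchvision import models, transforms
from PIL import Image

# Pre-trained ResNet-18 with ImageNet weights
model = models.resnet18(pretrained=True)
model.eval()

# Standard ImageNet preprocessing
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

img = Image.open('car.jpg').convert('RGB')
batch = preprocess(img).unsqueeze(0)  # add a batch dimension

with torch.no_grad():
    logits = model(batch)
print(logits.argmax().item())  # index of the predicted ImageNet class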

Object Detection: locating objects in an image with bounding boxes and identifying each object. It can be used for face detection, detecting objects on the road, detecting types of vehicles, pedestrian counting, etc. In our project we used an object detection algorithm to detect various objects on the road. There are various pre-trained models trained on famous public image datasets. The problem with a pre-trained model is that it is typically trained on roads in the US or other developed countries, so it does not give good accuracy on Indian roads: we have rickshaws, autos, and various other kinds of objects that the model does not detect. The best approach is to collect lots of videos/images of Indian roads, annotate them, and then train the model on that data. Here is an example of object detection done on an Indian road using a pre-trained YOLOv3 model. Don't worry about the numbers written on the bounding boxes; they are used to track the objects across multiple frames of a video.

Object Detection

YOLO (You Only Look Once) is the most famous object detection algorithm, and YOLOv3 is the latest version, updated to make the algorithm more accurate, especially for detecting small objects. It only needs to look at the image once to detect all the objects in it, which is why YOLO is a very fast model. This is very basic information about YOLO, and there are various other models too. My intention is just to give you a start on these topics; you can take it further if you want in-depth knowledge of object detection algorithms.
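To make this concrete, here is a hedged sketch of running a pre-trained YOLOv3 model through OpenCV's dnn module. The file names ('yolov3.cfg', 'yolov3.weights', 'coco.names', 'road.jpg') are placeholders for the standard YOLO release files and a test image:

import cv2
import numpy as np

net = cv2.dnn.readNetFromDarknet('yolov3.cfg', 'yolov3.weights')
classes = open('coco.names').read().strip().split('\n')

img = cv2.imread('road.jpg', 1)
h, w = img.shape[:2]

# YOLOv3 expects a 416x416 blob with pixel values scaled to [0, 1]
blob = cv2.dnn.blobFromImage(img, 1 / 255.0, (416, 416), swapRB=True, crop=False)
net.setInput(blob)
outputs = net.forward(net.getUnconnectedOutLayersNames())

for output in outputs:
    for detection in output:
        scores = detection[5:]
        class_id = np.argmax(scores)
        if scores[class_id] > 0.5:
            # Box centre and size are relative to the image dimensions
            cx, cy, bw, bh = detection[:4] * np.array([w, h, w, h])
            x, y = int(cx - bw / 2), int(cy - bh / 2)
            cv2.rectangle(img, (x, y), (x + int(bw), y + int(bh)), (0, 255, 0), 2)
            cv2.putText(img, classes[class_id], (x, y - 5),
                        cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0), 1)

A real pipeline would also apply non-maximum suppression (cv2.dnn.NMSBoxes) to drop overlapping boxes for the same object.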

Semantic Segmentation: basically, it is the task of assigning a label or class to each pixel of an image.

Now, you may be wondering how this is different from image classification.

Well, semantic segmentation classifies each pixel of the image into one of the classes or labels, whereas image classification assigns a single class to the entire image.

Semantic Segmentation

If you look at the example above, each pixel in the image belongs to a particular class, i.e. car, person, tree, road, etc., and all pixels belonging to the same class are assigned the same colour. Because different instances of the same class get the same colour, person 1 and person 2 cannot be told apart. Semantic segmentation also does not predict any bounding boxes around the objects.
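If you want to try semantic segmentation yourself, here is an illustrative sketch (not the code from our project) using a pre-trained DeepLabV3 model from torchvision; 'road.jpg' is a placeholder street-scene image:

import torch
from torchvision import models, transforms
from PIL import Image

model = models.segmentation.deeplabv3_resnet50(pretrained=True)
model.eval()

preprocess = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

img = Image.open('road.jpg').convert('RGB')
batch = preprocess(img).unsqueeze(0)

with torch.no_grad():
    out = model(batch)['out'][0]  # (num_classes, H, W) score map
mask = out.argmax(0)              # per-pixel class label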

When we started exploring for our project, we came across semantic segmentation. But since our intention was to track each object on the road, we could not proceed with it: this method does not identify different instances of the same class, so if there are multiple cars on the road, it treats all of them as one class. It also gives neither bounding boxes nor per-object predictions, which made it even less relevant for our use case. We then looked for an approach that could give a mask for each object and also number those objects, and we ended up with instance segmentation.

One application of semantic segmentation is self-driving cars: it helps provide information about free space on the road and detect lane markings and traffic signals. Here is a beautiful video I found on YouTube on semantic segmentation.

Instance Segmentation:

Instance Segmentation

If you look at the example above, different instances of the same class are segmented individually, which makes this usable for object tracking. We tried a pre-trained Mask R-CNN model in our project to track the objects on the road; however, the performance was not that impressive, as our use case involved real-time video.
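For reference, loading a pre-trained Mask R-CNN from torchvision looks roughly like this (an illustrative sketch, not our project code; 'road.jpg' is a placeholder):

import torch
from torchvision import models, transforms
from PIL import Image

model = models.detection.maskrcnn_resnet50_fpn(pretrained=True)
model.eval()

img = Image.open('road.jpg').convert('RGB')
tensor = transforms.ToTensor()(img)

with torch.no_grad():
    pred = model([tensor])[0]

# Each detected instance gets its own box, label, score, and pixel mask
print(pred['boxes'].shape, pred['masks'].shape)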

Object Tracking: one of the main objectives of our project was to track the objects on the road and describe the scene, e.g. "There are 3 cars on the road, out of which 2 are moving and 1 is stationary." We tried various tracking algorithms to track the bounding boxes of the objects. Here is a summary of the pre-trained models we tried and the issues we faced.

If you want more details, you can explore each of these models and try them on your machine.
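As a simple starting point (an illustration, not one of the models from our project), OpenCV ships single-object trackers in the opencv-contrib package that can follow a bounding box across frames:

import cv2

video = cv2.VideoCapture('Test.avi')
ok, frame = video.read()

# Initial bounding box (x, y, width, height); the values are placeholders
bbox = (100, 100, 80, 60)
tracker = cv2.TrackerKCF_create()  # requires opencv-contrib-python
tracker.init(frame, bbox)

while True:
    ok, frame = video.read()
    if not ok:
        break
    ok, bbox = tracker.update(frame)  # new box position for this frame
    if ok:
        x, y, w, h = map(int, bbox)
        cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
    cv2.imshow('tracking', frame)
    if cv2.waitKey(25) & 0xFF == ord('q'):
        break

video.release()
cv2.destroyAllWindows()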

Distance calculation using disparity maps: another feature of our project was to take an image as input and calculate the distance of various objects from the camera. For distance calculation we required the image and its corresponding depth map, and computing a depth map requires stereo images (images of the same scene taken from two cameras). We used a ZED camera to capture the required data, and went to various streets of Delhi to capture images from various angles so that we had variety in the data.
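A minimal sketch of computing a disparity map with OpenCV's semi-global block matcher, assuming a rectified stereo pair ('left.png' and 'right.png' are placeholders):

import cv2

left = cv2.imread('left.png', 0)    # grayscale left image
right = cv2.imread('right.png', 0)  # grayscale right image

# StereoSGBM computes disparity; numDisparities must be a multiple of 16
stereo = cv2.StereoSGBM_create(minDisparity=0, numDisparities=64, blockSize=9)
disparity = stereo.compute(left, right).astype('float32') / 16.0

# For a calibrated rig: depth = focal_length * baseline / disparity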

I will try to cover more details about the distance calculation model in my next blog.

My intention was to cover the basics of computer vision and highlight its capabilities. If you really want to get into the details, I would encourage you to run some of the application code available on GitHub so that you get the context.

Feel free to reach out to me if you have any questions.