Computer Vision: A Technical Overview



What is Computer Vision?

Computer vision is an interdisciplinary field concerned with how computer systems process visual data from the physical environment and/or digital visual input (images, videos), with the ultimate aim of automating a diverse range of tasks, e.g.

// inspections in the manufacturing industry, speeding up quality control processes. Similarly, in the agricultural industry, a process called ‘optical sorting’ is used to remove defective fruit and vegetables based on analysis of colour, size, shape, structural integrity and chemical composition; this kind of application constitutes a subfield of computer vision called ‘machine vision’.

// visual detection: crime-related surveillance

// complex or multi-faceted identification tasks: Google uses its image data from Maps to identify street names, businesses and other named locations, whilst computer vision assists Facebook in automatically tagging images.

// operation of autonomous, or semi-autonomous, vehicles: at the semi-autonomous level, computer vision is used in obstacle-warning systems and in the autonomous landing of aircraft. At the fully autonomous level, vehicles use computer vision for navigation, generating a map of the environment, as well as for detecting obstacles.

// image enhancement (to reduce undesirable effects such as noise), typically of medical images (in areas such as tumour detection, blood flow analysis and the measurement of organ dimensions); in recent years, certain social media platforms and various photo-editing apps have taken advantage of similar processes to offer users more tailored photo enhancement.

(The above describes the field ‘in industry’; the academic side of the discipline is more concerned with the largely mathematical theoretical basis of artificial systems that extract visual data. Nonetheless, the theoretical and the practical are interlinked.)

Typically, a visual processing system will carry out the following tasks (or at least rely on a parent system to carry out some of the initial tasks, whose output then feeds into the system in question):

// pre-processing: noise reduction, so that false information does not hamper results; contrast enhancement, so that relevant information can actually be detected; and scale-space representation, to account for size differences in image data (see the pre-processing sketch after this list)

// feature extraction: desired features, e.g. lines, edges, ridges and corners, are extracted from the image. Less generic (and/or more complex) features relating to texture, shape and motion can also be extracted. ‘Feature detectors’, built from multi-stage algorithms that draw on mathematical concepts such as gradient calculation, vectors, differentiation and integration, and recursion, are applied to the image data. Different feature detectors correspond to different features, e.g. ‘Canny’ and ‘Sobel’ are two of the most common for edge detection; subtler differences in how well they suit different types of images may dictate which one is used (see the edge-detection sketch after this list).

// segmentation: deciding which regions of the image are of greater interest and therefore require further processing; this may be a specific set of foreground objects, one particular object, or one aspect of the background. As part of this task, the image is partitioned into a set of regions or ‘views’ (see the segmentation sketch after this list).

// higher level processing: a range of tasks fit under this category, including image recognition (identifying the category into which a detected object falls) and image registration (analysing and combining two different views of the same object; see the registration sketch after this list).
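
To make the pre-processing step concrete, here is a minimal sketch, assuming OpenCV in Python (the post itself does not name a library, and ‘input.png’ is a placeholder filename): noise reduction via Gaussian blur, contrast enhancement via histogram equalisation, and a simple scale-space built as a Gaussian pyramid.

```python
# Minimal pre-processing sketch (OpenCV assumed; 'input.png' is a placeholder).
import cv2

# Load the image in grayscale.
img = cv2.imread("input.png", cv2.IMREAD_GRAYSCALE)

# Noise reduction: Gaussian blur suppresses high-frequency noise.
denoised = cv2.GaussianBlur(img, (5, 5), 1.0)

# Contrast enhancement: histogram equalisation spreads intensity values
# so that relevant structures are easier to detect.
enhanced = cv2.equalizeHist(denoised)

# Scale-space representation: a Gaussian pyramid accounts for objects
# appearing at different sizes in the image data.
pyramid = [enhanced]
for _ in range(3):
    pyramid.append(cv2.pyrDown(pyramid[-1]))
```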
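
For feature extraction, the following sketch (again assuming OpenCV, with a placeholder filename) applies the two edge detectors named above, Canny and Sobel, to the same grayscale image.

```python
# Edge detection with two common feature detectors (OpenCV assumed).
import cv2

img = cv2.imread("input.png", cv2.IMREAD_GRAYSCALE)  # placeholder filename

# Canny: multi-stage detector (gradient calculation, non-maximum suppression,
# hysteresis thresholding); the two values are the hysteresis thresholds.
canny_edges = cv2.Canny(img, 100, 200)

# Sobel: first-order derivative filters in the x and y directions;
# the gradient magnitude highlights edges.
grad_x = cv2.Sobel(img, cv2.CV_64F, 1, 0, ksize=3)
grad_y = cv2.Sobel(img, cv2.CV_64F, 0, 1, ksize=3)
sobel_edges = cv2.magnitude(grad_x, grad_y)
```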
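
For segmentation, a very simple approach is global thresholding; the sketch below (OpenCV assumed, placeholder filename) uses Otsu's method to separate foreground from background and then labels the resulting regions.

```python
# Simple segmentation sketch: Otsu thresholding plus region labelling (OpenCV assumed).
import cv2

img = cv2.imread("input.png", cv2.IMREAD_GRAYSCALE)  # placeholder filename

# Otsu's method picks a global threshold automatically, partitioning the
# image into foreground and background.
_, mask = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

# Connected-component labelling gives one region (one 'view') per object.
num_regions, labels = cv2.connectedComponents(mask)
```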
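
Finally, as one example of higher level processing, the sketch below (OpenCV assumed, placeholder filenames) registers two views of the same object by matching ORB keypoints and estimating a homography; image recognition is illustrated later in the CNN section.

```python
# Image registration sketch: align two views of the same object by matching
# ORB keypoints and estimating a homography (OpenCV assumed; filenames are placeholders).
import cv2
import numpy as np

view1 = cv2.imread("view1.png", cv2.IMREAD_GRAYSCALE)
view2 = cv2.imread("view2.png", cv2.IMREAD_GRAYSCALE)

# Detect keypoints and compute descriptors in both views.
orb = cv2.ORB_create()
kp1, des1 = orb.detectAndCompute(view1, None)
kp2, des2 = orb.detectAndCompute(view2, None)

# Match descriptors and keep the strongest correspondences.
matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = sorted(matcher.match(des1, des2), key=lambda m: m.distance)[:50]

# Estimate the transformation that maps view1 onto view2.
src = np.float32([kp1[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
dst = np.float32([kp2[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)

# Warp view1 into the coordinate frame of view2, combining the two views.
registered = cv2.warpPerspective(view1, H, (view2.shape[1], view2.shape[0]))
```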

Computer Vision & Machine Learning

Machine learning techniques, mainly Convolutional Neural Networks (CNNs), have been overwhelmingly responsible for advances in computer vision, particularly with respect to accuracy and precision in the identification and categorisation of visual data. CNNs share the basic architecture of standard neural networks, but add a few extra steps that enhance feature extraction by producing versions of the input data that are most informative to the model/classifier being trained.

In practical terms, this means applying an appropriate feature detector (a convolutional filter) to the image; introducing non-linearity (see the Rectified Linear Unit, whose function simply converts all negative values to 0; alternative activation functions such as ‘tanh’ and ‘softmax’ exist, but ReLU is the most commonly used); and then sliding a window of a set width across the image and retaining the maximum, average or sum of the pixels in each window to represent the corresponding section of the image (a process known as ‘pooling’). At the end of these steps, a ‘feature map’, ready to be passed into a standard neural network setup, is obtained.
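
As a concrete illustration of the convolution, ReLU and pooling steps, here is a minimal sketch assuming PyTorch (the post does not prescribe a framework); the layer sizes are arbitrary placeholders.

```python
# Minimal sketch of the convolution -> ReLU -> pooling pipeline (PyTorch assumed).
import torch
import torch.nn as nn

# A dummy batch containing one 3-channel, 64x64 image.
image = torch.randn(1, 3, 64, 64)

features = nn.Sequential(
    # Convolution: learned feature detectors slide over the image.
    nn.Conv2d(in_channels=3, out_channels=8, kernel_size=3, padding=1),
    # ReLU: introduces non-linearity by setting negative values to 0.
    nn.ReLU(),
    # Max pooling: each 2x2 window is represented by its maximum value.
    nn.MaxPool2d(kernel_size=2),
)

feature_map = features(image)       # shape: (1, 8, 32, 32)
flattened = feature_map.flatten(1)  # ready for a standard fully connected network
```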

The classifier, or neural network, is trained on a given dataset (a set of feature maps) over several passes, according to the chosen mode of learning (supervised, semi-supervised or unsupervised), after which it is quality-tested on a new set of data. (This describes the generic machine learning process, not one specific to computer vision.)
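
A generic version of this train-then-test cycle might look like the sketch below, again assuming PyTorch and the supervised case; the model, data loaders and number of epochs are placeholders.

```python
# Generic supervised training/evaluation loop (PyTorch assumed; the model and
# data loaders are placeholders standing in for any classifier and dataset).
import torch
import torch.nn as nn

def train_and_evaluate(model, train_loader, test_loader, epochs=5):
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    # Several passes over the training data.
    for _ in range(epochs):
        model.train()
        for inputs, labels in train_loader:
            optimizer.zero_grad()
            loss = criterion(model(inputs), labels)
            loss.backward()
            optimizer.step()

    # Quality-test on data the model has not seen during training.
    model.eval()
    correct = total = 0
    with torch.no_grad():
        for inputs, labels in test_loader:
            predictions = model(inputs).argmax(dim=1)
            correct += (predictions == labels).sum().item()
            total += labels.size(0)
    return correct / total
```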

A highly useful and almost universal dataset on which many classifiers are trained is ImageNet, which contains 1.2 million high-resolution training images.
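
For example, torchvision (an assumption; the post does not name a library) ships classifiers whose weights were learned on ImageNet, and these can be loaded and reused in a few lines; the exact weights argument depends on the installed version.

```python
# Loading an ImageNet-pretrained classifier (torchvision assumed; the weights
# string works on recent torchvision versions).
import torch
from torchvision import models

model = models.resnet18(weights="IMAGENET1K_V1")  # weights learned on ImageNet
model.eval()

# Classify a dummy 224x224 image (a real image would need the matching
# resize/normalisation transforms).
dummy = torch.randn(1, 3, 224, 224)
with torch.no_grad():
    class_index = model(dummy).argmax(dim=1).item()
```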

A comprehensive look into how CNNs and their variants facilitate computer vision can be found here.

Sources: