Computer Vision is a field of Artificial Intelligence (AI) that seeks to mimic the innate ability of the human eye to recognize patterns.
In recent years it has become an important enabling technology for large companies and, in some cases, essential to their core business.
This is the case, for example, of Tesla, which was projected to close 2020 with a gain of more than $2 billion from the sale of its autonomous driving software (source: Trefis).
Thanks to important developments in Neural Networks and in frameworks dedicated to Deep Learning, Computer Vision also opens new perspectives for small and medium-sized companies. There are in fact several solutions that make it possible to use systems that have already been developed.
The main challenge of Computer Vision is the development of algorithms able to recognize shapes and objects in single images or, alternatively, in sequences of "frames" (e.g. video recordings). The development of such systems is mainly concentrated in two phases:
- The collection/preparation of data;
- The training of the algorithm.
The first phase of development consists of collecting and preparing what is called a dataset: in our case, a large number of images. Computer Vision algorithms, like any other digital system, operate on sequences of bits (zeros and ones).
Images are saved as matrices of pixels, where each cell of the matrix corresponds to a different pixel. The value from 0 to 255 stored in the cell indicates the shade assigned to that pixel. This representation is sufficient for grayscale images. For color images, three such matrices are needed for each image: one for red, one for green and one for blue (RGB).
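To make the representation concrete, here is a minimal sketch in plain Python. The pixel values and the helper names `shape` and `pixel_at` are made up for illustration; in practice a numerical library would normally hold these matrices.

```python
# A 2x3 grayscale image: one matrix, one value (0-255) per pixel.
gray = [
    [0, 128, 255],
    [64, 192, 32],
]

# A color image of the same size: three matrices, one per channel.
red = [[255, 0, 0], [10, 20, 30]]
green = [[0, 255, 0], [10, 20, 30]]
blue = [[0, 0, 255], [10, 20, 30]]


def shape(matrix):
    """Return (rows, columns) of a pixel matrix (hypothetical helper)."""
    return (len(matrix), len(matrix[0]))


def pixel_at(r, g, b, row, col):
    """Return the (R, G, B) triple of one pixel (hypothetical helper)."""
    return (r[row][col], g[row][col], b[row][col])


top_left = pixel_at(red, green, blue, 0, 0)   # a pure red pixel
top_right = pixel_at(red, green, blue, 0, 2)  # a pure blue pixel
```

The color image simply stores three values per pixel position instead of one.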
Particular transformations are often applied to the images, both to reduce their size and to facilitate the subsequent training process.
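Two common examples of such transformations are downsampling (to shrink the image) and normalization (to rescale pixel values). The sketch below is only an illustration: the function names and the choice of 2x2 averaging are assumptions, not a description of any specific pipeline.

```python
def downsample_2x(image):
    """Halve width and height by averaging each 2x2 block of pixels.

    A trailing odd row or column, if any, is dropped.
    """
    rows, cols = len(image), len(image[0])
    return [
        [
            (image[r][c] + image[r][c + 1]
             + image[r + 1][c] + image[r + 1][c + 1]) / 4
            for c in range(0, cols - 1, 2)
        ]
        for r in range(0, rows - 1, 2)
    ]


def normalize(image):
    """Rescale values from the 0-255 range to the 0.0-1.0 range."""
    return [[value / 255 for value in row] for row in image]


image = [
    [0, 0, 255, 255],
    [0, 0, 255, 255],
    [100, 100, 50, 50],
    [100, 100, 50, 50],
]
small = downsample_2x(image)   # 4x4 -> 2x2
scaled = normalize(small)      # values now between 0.0 and 1.0
```

The smaller image has a quarter of the pixels, which reduces both storage and the work done during training.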
After the acquisition, the training of the algorithm takes place. The most widely used architectures to date are Convolutional Neural Networks.
During this phase the network "learns" to recognize recurring patterns inside the images. Different objects have distinctive features that are used to recognize them. In principle this mechanism is similar to that of the human eye, which can distinguish various shapes by focusing on particular details.
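A convolutional layer looks for such patterns by sliding a small filter over the image and computing a weighted sum at each position. The sketch below uses a hand-picked filter for illustration; in a real network the filter values are learned during training, and the function name `convolve2d` is just a label chosen here.

```python
def convolve2d(image, kernel):
    """Slide the kernel over the image (no padding, stride 1).

    Returns the matrix of weighted sums. The kernel is not flipped,
    which is the usual convention in convolutional networks.
    """
    k_rows, k_cols = len(kernel), len(kernel[0])
    out_rows = len(image) - k_rows + 1
    out_cols = len(image[0]) - k_cols + 1
    return [
        [
            sum(
                image[r + i][c + j] * kernel[i][j]
                for i in range(k_rows)
                for j in range(k_cols)
            )
            for c in range(out_cols)
        ]
        for r in range(out_rows)
    ]


# Dark left half, bright right half: a vertical edge in the middle.
image = [
    [0, 0, 255, 255],
    [0, 0, 255, 255],
    [0, 0, 255, 255],
]
# Hand-picked filter that responds to a dark-to-bright change
# going from left to right.
vertical_edge = [
    [-1, 1],
    [-1, 1],
]
feature_map = convolve2d(image, vertical_edge)
```

The resulting feature map is zero over the flat regions and large exactly where the edge sits, which is the kind of "particular detail" a network can build on.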
The network is fed batches of different images, which it uses to progressively refine its recognition capabilities. This iterative process can take a long time and require enormous computing resources.
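The iterative, batch-by-batch refinement can be sketched with a deliberately tiny model. The example below assumes each image has been reduced to a single number (its mean brightness) and trains a one-feature logistic model with mini-batch gradient descent. The data, the hyperparameters and the function names are invented for illustration, and a real convolutional network has millions of parameters rather than two.

```python
import math


def sigmoid(z):
    return 1 / (1 + math.exp(-z))


def train(samples, labels, epochs=500, batch_size=2, lr=0.5):
    """Mini-batch gradient descent for a one-feature logistic model."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for start in range(0, len(samples), batch_size):
            xs = samples[start:start + batch_size]
            ys = labels[start:start + batch_size]
            # Prediction errors for the current batch.
            errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
            # Average gradient of the log-loss over the batch.
            grad_w = sum(e * x for e, x in zip(errors, xs)) / len(xs)
            grad_b = sum(errors) / len(xs)
            # Small step against the gradient: the "refinement".
            w -= lr * grad_w
            b -= lr * grad_b
    return w, b


# Toy data: mean brightness of each image (0.0-1.0); label 1 = "bright".
brightness = [0.1, 0.9, 0.2, 0.8, 0.15, 0.85]
labels = [0, 1, 0, 1, 0, 1]

w, b = train(brightness, labels)
predictions = [1 if sigmoid(w * x + b) >= 0.5 else 0 for x in brightness]
```

Each pass over a batch nudges the parameters slightly; the cost of repeating this over large datasets and large networks is what makes training so demanding.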
After training, heat maps make it possible to get an idea of what the algorithm "is looking at" when it recognizes an image. It is essential to verify the capabilities of the algorithm using different metrics: good results on the images used for training do not guarantee that it will be able to generalize and reach the same performance in the real world. This failure to generalize is known as overfitting.
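One simple way to spot this problem is to compute the same metrics on the training images and on images held out from training, then compare them. The sketch below uses made-up labels and predictions; accuracy, precision and recall are just three common choices of metric.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def precision_recall(y_true, y_pred, positive=1):
    """Precision and recall for the class marked as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Made-up labels and predictions, for illustration only.
train_true = [1, 0, 1, 0, 1, 0, 1, 0]
train_pred = [1, 0, 1, 0, 1, 0, 1, 0]  # perfect on training images
test_true = [1, 0, 1, 0, 1, 0, 1, 0]
test_pred = [1, 1, 0, 0, 1, 1, 1, 0]   # mistakes on unseen images

train_acc = accuracy(train_true, train_pred)
test_acc = accuracy(test_true, test_pred)
gap = train_acc - test_acc  # a large gap is a warning sign
test_precision, test_recall = precision_recall(test_true, test_pred)
```

A model that scores perfectly on its training images but clearly worse on held-out images is likely overfitting.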
The applications of Computer Vision are countless. Medical diagnostic analysis, predictive maintenance, production process control and augmented reality are just some of the possibilities. The algorithms fall into three main classes according to the type of problem they try to solve: classification, segmentation and object detection.
In classification, given a set of images, each belonging to a single category, the goal is to predict the category of images never seen before. These algorithms are also called classifiers. A classic example is an algorithm that recognizes whether the input image represents a dog or a cat.
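The final step of a classifier can be sketched as follows. The example assumes the network ends by producing one raw score per class and that a softmax function converts those scores into probabilities; the scores themselves are invented for illustration.

```python
import math


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


classes = ["dog", "cat"]
raw_scores = [2.0, 0.5]  # hypothetical network outputs for one image

probabilities = softmax(raw_scores)
# The predicted category is the one with the highest probability.
predicted = classes[probabilities.index(max(probabilities))]
```

The classifier's answer is the most probable class, while the probability itself gives a rough sense of how confident the model is.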
While in classification the output is the "probability of belonging to a class", in segmentation the objective is to group all the pixels that correspond to a particular figure. A typical application is the recognition of buildings and streets from aerial/satellite images.
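The output of segmentation is a label per pixel rather than a label per image. The sketch below stands in for a trained model with a simple brightness threshold, which is an assumption made only to keep the example short; the function names are likewise illustrative.

```python
def segment(image, threshold):
    """Return a mask: 1 where the pixel belongs to the figure, else 0.

    A brightness threshold is used here as a stand-in for a trained
    segmentation model.
    """
    return [[1 if value >= threshold else 0 for value in row]
            for row in image]


def pixels_of_figure(mask):
    """Group the (row, column) coordinates of all figure pixels."""
    return [
        (r, c)
        for r, row in enumerate(mask)
        for c, value in enumerate(row)
        if value == 1
    ]


image = [
    [10, 200, 220],
    [15, 210, 30],
    [12, 14, 16],
]
mask = segment(image, threshold=128)
figure = pixels_of_figure(mask)
```

The mask has the same size as the input image, so each pixel can be traced back to the figure it was assigned to.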
In object detection, the algorithm tries to "capture" all the objects it recognizes inside an image by framing each of them inside a rectangle labeled with the name of the class to which the object belongs. The question moves from "what is the subject" of the image to "where is it".
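A labeled rectangle can be represented by a class name plus the coordinates of two corners. The sketch below also computes Intersection over Union (IoU), a common way to measure how well a predicted rectangle matches an annotated one. The `Detection` class, its field names and the coordinates are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    """A labeled rectangle: class name plus two corner coordinates."""
    label: str
    x_min: float
    y_min: float
    x_max: float
    y_max: float

    def area(self):
        width = max(0.0, self.x_max - self.x_min)
        height = max(0.0, self.y_max - self.y_min)
        return width * height


def box_iou(a, b):
    """Intersection over Union of two axis-aligned rectangles."""
    overlap_w = min(a.x_max, b.x_max) - max(a.x_min, b.x_min)
    overlap_h = min(a.y_max, b.y_max) - max(a.y_min, b.y_min)
    if overlap_w <= 0 or overlap_h <= 0:
        return 0.0  # the rectangles do not overlap
    intersection = overlap_w * overlap_h
    return intersection / (a.area() + b.area() - intersection)


predicted = Detection("dog", 10, 10, 50, 50)
annotated = Detection("dog", 30, 10, 70, 50)
elsewhere = Detection("cat", 100, 100, 120, 120)

overlap = box_iou(predicted, annotated)
no_overlap = box_iou(predicted, elsewhere)
```

An IoU of 1 means the two rectangles coincide, while 0 means they do not overlap at all.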
A glance at the future
In the future we will see rapid growth of systems that combine different fields of artificial intelligence and enable more complex applications. One example is Visual Question Answering, which combines Computer Vision with natural language processing (NLP).
Despite the justified enthusiasm of insiders, there is still much work to be done especially in understanding and interpreting the results provided by these systems.
The biggest challenge for the future of Computer Vision and Deep Learning is undoubtedly to demystify the Black Box. Despite the unprecedented results, our ability to understand how these systems come to provide certain answers is still limited.
Article by Sergio Placanica of VGen Hub Polimi