Source: Deep Learning on Medium
Why should we care about recognizing low-resolution face images?
Recently there have been many advances in face recognition, and governments across the globe have started deploying it for various applications. One major hurdle in deploying this technology in real time is the high misclassification rate on low-resolution facial images, which form the majority of the data in applications such as surveillance.
Has any work been done on this already?
There has been a lot of study on recognizing high-resolution facial images, but most algorithms fail on low-resolution faces (below roughly 32 × 32 pixels). Some studies suggest upscaling the input image before training, and others propose models that are invariant to the resolution of the image, but the results are less than satisfactory given the accuracy required for a task as crucial as this.
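To make the "upscale first" idea concrete, here is a minimal pure-Python sketch (a toy intensity grid, not real face data) of the naive preprocessing those earlier methods rely on: downsample below the 32 × 32 threshold, then resize back up before feeding a recognizer. Nearest-neighbour resizing stands in for any real super-resolution model.

```python
def resize_nearest(img, out_h, out_w):
    """Nearest-neighbour resize of a 2-D list-of-lists image."""
    in_h, in_w = len(img), len(img[0])
    return [[img[r * in_h // out_h][c * in_w // out_w]
             for c in range(out_w)]
            for r in range(out_h)]

# Toy 64x64 "face" downsampled below the 32x32 threshold, then upscaled back.
hi = [[(r + c) % 256 for c in range(64)] for r in range(64)]
lo = resize_nearest(hi, 16, 16)   # surveillance-style low resolution
up = resize_nearest(lo, 64, 64)   # naive upscaling before recognition
```

The round trip restores the original size but not the lost detail, which is why upscaling alone gives unsatisfactory accuracy.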
How have the researchers proposed to solve this problem?
They have proposed a Complement-Super-Resolution and Identity (CSRI) joint deep learning method with a unified end-to-end network architecture.
Don’t worry if you haven’t understood a single word in the above sentence. I’ll simplify it for you.
The network architecture consists of two major components:
- Super-Resolution (SR) Network: which upscales the image.
- Face-Recognition (FR) Network: which, obviously, recognizes faces :P
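Conceptually, the unified network is just a composition of these two modules: the SR network maps a low-resolution input to a higher-resolution one, and the FR network maps that to an identity. A stick-figure sketch, where the functions, the 4x scale factor, and the identity label are all placeholders rather than the paper's actual layers:

```python
def super_resolution(lr_shape):
    """Placeholder SR module: upscales an (h, w) image shape by 4x."""
    h, w = lr_shape
    return (h * 4, w * 4)

def face_recognition(hr_shape):
    """Placeholder FR module: maps an image shape to an identity label."""
    return "person_42" if hr_shape[0] >= 32 else "unknown"

# End-to-end: a 16x16 probe passes through SR before recognition.
identity = face_recognition(super_resolution((16, 16)))
```

The point of the unified architecture is exactly this composition: the recognizer never sees the raw low-resolution input directly.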
So you might be wondering: what is so different about this network compared to other approaches? The two points below answer that question.
- Joint Learning of Super-Resolution and Face-Recognition: Instead of training them as two separate networks, a joint training approach is taken where both networks are trained simultaneously, so that the super-resolution network adapts to the face-recognition task.
- Complement-Super-Resolution Learning: The SR-FR joint network further consists of two sub-networks. As shown in the network image below, the upper half of the network is trained on synthetically down-sampled images (auxiliary) along with the corresponding ground-truth high-resolution images. The network in the lower half is trained on low-resolution images obtained from videos rather than artificially down-sampled ones (native), so no corresponding high-resolution ground-truth images exist for it. The two sub-networks share parameters, and both branches are optimized jointly.
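One way to summarize the joint objective: the auxiliary branch contributes a pixel-level SR loss (its HR ground truth exists) plus an identity loss, while the native branch, lacking HR targets, contributes only an identity loss. A toy numeric sketch with made-up loss values and a hypothetical trade-off weight `lam` (not a value from the paper):

```python
def joint_loss(sr_loss_aux, id_loss_aux, id_loss_native, lam=1.0):
    """Combined objective over the two weight-sharing branches.

    Only the auxiliary branch has a super-resolution (pixel) loss,
    since only it has HR ground truth; the native branch contributes
    an identity loss alone.  lam is a hypothetical weighting.
    """
    return sr_loss_aux + lam * (id_loss_aux + id_loss_native)

# Made-up loss values, purely for illustration.
total = joint_loss(sr_loss_aux=0.8, id_loss_aux=1.5, id_loss_native=2.1)
```

Because both branches share parameters, gradients from the native identity loss also shape the shared SR features, which is how the network adapts to real surveillance imagery it has no HR targets for.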
How is the model trained?
The network can be trained with the standard Stochastic Gradient Descent algorithm in an end-to-end manner. To improve convergence stability, the authors propose the following training steps:
- First, pre-train the synthetic-LR SR-FR branch (the upper half) on a large auxiliary face dataset.
- Then, train the whole CSRI network on both the auxiliary and native LR data.
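The two-step schedule above might look like this as heavily simplified, framework-free Python; the datasets, the `train` update, and the batch counts are placeholders, not the paper's actual procedure:

```python
def train(model, batches, stage):
    """Placeholder one-pass update: records how many batches each stage saw."""
    model[stage] = model.get(stage, 0) + len(batches)
    return model

auxiliary = ["aux_batch"] * 3    # synthetic LR + HR ground-truth pairs
native = ["native_batch"] * 2    # real surveillance LR, no HR targets

model = {}
# Step 1: pre-train the synthetic-LR SR-FR branch on auxiliary data only.
model = train(model, auxiliary, stage="pretrain")
# Step 2: train the full CSRI network on auxiliary + native data jointly.
model = train(model, auxiliary + native, stage="joint")
```

Pre-training on the auxiliary branch first gives the shared parameters a sensible starting point before the harder native data, which has no HR supervision, enters the objective.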