Original article was published by manonmani pl on Deep Learning on Medium
Face Detection in Manga Pages
An important first step towards recognising a character is detecting its face. Detecting human faces is comparatively simple, since many algorithms work well for that purpose, but very few exist for manga. From my search, I found that taking a state-of-the-art algorithm and modifying it a bit can reach 98% accuracy. Detecting faces matters because the amount of digital manga read per day is growing very fast, and almost all conventional black-and-white manga series are now publishing digitized versions. Searching for manga on the web currently relies only on meta tags; to overcome this drawback, I personally find this state-of-the-art paper helpful.
Let’s get into the algorithm part! First, the selective search algorithm is applied to the manga page, and the resulting regions of interest are given to a CNN, which classifies each region as face or not face. Source code is available on GitHub.
Wondering what the selective search algorithm is? Let me explain. In short, it is a combination of exhaustive search and segmentation. Exhaustive search simply moves a fixed-size window over every part of the image and looks for the object, whereas segmentation separates the foreground from the background. So the aim of selective search is to identify all the objects in the image and draw a bounding box around each of them (in our case, we crop the region).
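The exhaustive-search half of that description can be sketched in a few lines. This is only an illustration: the function name, the window size and the stride are values I made up, not something taken from the paper.

```python
def sliding_windows(image_width, image_height, window=64, stride=32):
    """Yield (x, y, w, h) boxes for a fixed-size window moved across the image.

    `window` and `stride` are illustrative values, not from the paper.
    """
    for y in range(0, image_height - window + 1, stride):
        for x in range(0, image_width - window + 1, stride):
            yield (x, y, window, window)


# Every box would then be handed to a classifier, which is what makes
# exhaustive search expensive compared with selective search.
boxes = list(sliding_windows(128, 96, window=64, stride=32))
```

Even this tiny 128 x 96 example produces several candidate boxes for a single window size; covering multiple scales multiplies that number, which is the cost selective search tries to avoid.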
Now let's dig deeper into the selective search algorithm. To create the initial regions, the authors use the graph-based segmentation method of Felzenszwalb & Huttenlocher. They then iteratively group the most similar neighbouring regions together, and this step is repeated until the whole image becomes a single region. Similarity is computed from four measures based on colour, texture, size and fill.
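The grouping loop described above can be sketched as a greedy merge. This is a toy version under several assumptions: regions are plain sets of pixel ids, the initial segmentation is given by hand instead of coming from the Felzenszwalb & Huttenlocher method, and `prefer_small` is a made-up stand-in for the real similarity measure.

```python
def hierarchical_grouping(regions, neighbours, similarity):
    """Greedily merge the most similar neighbouring regions.

    regions:    dict mapping region id -> frozenset of pixel ids
    neighbours: set of frozenset({id_a, id_b}) pairs that touch each other
    similarity: symmetric function taking two regions and returning a score
    Returns every region seen along the way (initial ones plus all merges).
    """
    regions = dict(regions)        # work on copies, leave the inputs untouched
    neighbours = set(neighbours)
    proposals = list(regions.values())
    next_id = max(regions) + 1

    def score(pair):
        a, b = tuple(pair)
        return similarity(regions[a], regions[b])

    while neighbours:
        best = max(neighbours, key=score)
        a, b = tuple(best)
        merged = regions.pop(a) | regions.pop(b)
        # Re-point every pair that touched a or b to the merged region.
        updated = set()
        for pair in neighbours:
            if pair == best:
                continue
            if a in pair or b in pair:
                (other,) = pair - best
                updated.add(frozenset({next_id, other}))
            else:
                updated.add(pair)
        neighbours = updated
        regions[next_id] = merged
        proposals.append(merged)
        next_id += 1
    return proposals


# Four regions in a row, with 1, 2, 3 and 4 pixels.
initial = {
    0: frozenset({0}),
    1: frozenset({1, 2}),
    2: frozenset({3, 4, 5}),
    3: frozenset({6, 7, 8, 9}),
}
adjacent = {frozenset({0, 1}), frozenset({1, 2}), frozenset({2, 3})}
prefer_small = lambda r1, r2: -(len(r1) + len(r2))  # hypothetical similarity
proposals = hierarchical_grouping(initial, adjacent, prefer_small)
```

Each entry of `proposals` is a candidate region; in the real algorithm the bounding box of each one becomes a region of interest for the CNN.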
Colour similarity: For each region we generate a one-dimensional colour histogram with 25 bins per channel. With 3 channels (R, G, B), that gives 75 bins in total, combined into a single vector. The similarity of two regions is measured by histogram intersection, that is, by summing over all 75 bins the smaller of the two regions' normalised bin values.
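Here is a minimal sketch of the colour similarity in plain Python. The function names are my own, and I assume 8-bit pixel values and an L1-normalised histogram, as in the Selective Search paper.

```python
def colour_histogram(pixels, bins=25):
    """Build a 3 * bins colour histogram for one region.

    pixels: iterable of (r, g, b) tuples with values in 0..255 (assumed).
    The result is L1-normalised, so its entries sum to 1.
    """
    hist = [0.0] * (3 * bins)
    count = 0
    for pixel in pixels:
        for channel, value in enumerate(pixel):
            index = min(value * bins // 256, bins - 1)
            hist[channel * bins + index] += 1
        count += 1
    total = 3 * count
    return [h / total for h in hist] if total else hist


def colour_similarity(hist_a, hist_b):
    """Histogram intersection: sum of the bin-wise minima."""
    return sum(min(a, b) for a, b in zip(hist_a, hist_b))


black_region = colour_histogram([(0, 0, 0)] * 4)
white_region = colour_histogram([(255, 255, 255)] * 4)
same = colour_similarity(black_region, black_region)
different = colour_similarity(black_region, white_region)
```

Two identical regions score (approximately) 1, and two regions with no colour bins in common score 0.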
Texture similarity: Texture is computed using Gaussian derivatives in 8 orientations for each colour channel, and we extract a 10-bin histogram for each orientation of each colour channel, giving a 10 x 8 x 3 = 240-dimensional vector for each region.
Size similarity: The idea of size similarity is to encourage small regions to merge early. If this similarity is not taken into account, a single large region can keep swallowing its neighbours one by one, and multi-scale proposals would then be generated only at that location.
Fill similarity measures how well two regions ri and rj fit into each other. If ri is contained in rj, or the two fit closely together, they should be merged to fill in the gaps; if they hardly touch each other, they should not be merged.
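The size and fill measures can be sketched as follows. The formulas follow the definitions in the Selective Search paper as I understand them; they are not quoted in this article, and the box format (x_min, y_min, x_max, y_max) is my own choice.

```python
def size_similarity(size_i, size_j, image_size):
    """Close to 1 when both regions are small relative to the image."""
    return 1.0 - (size_i + size_j) / image_size


def fill_similarity(size_i, box_i, size_j, box_j, image_size):
    """Close to 1 when the two regions leave little empty space
    inside the tight bounding box around both of them.

    Boxes are (x_min, y_min, x_max, y_max) with exclusive maxima (assumed).
    """
    x0 = min(box_i[0], box_j[0])
    y0 = min(box_i[1], box_j[1])
    x1 = max(box_i[2], box_j[2])
    y1 = max(box_i[3], box_j[3])
    joint_box_size = (x1 - x0) * (y1 - y0)
    return 1.0 - (joint_box_size - size_i - size_j) / image_size


image_size = 100 * 100
# Two 10 x 10 regions side by side, and one far away in the opposite corner.
left, right, far = (0, 0, 10, 10), (10, 0, 20, 10), (90, 90, 100, 100)
s_size = size_similarity(100, 100, image_size)
fill_touching = fill_similarity(100, left, 100, right, image_size)
fill_apart = fill_similarity(100, left, 100, far, image_size)
```

The two touching regions fill their joint box completely, so they score much higher than the two regions in opposite corners.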
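The final similarity is a linear combination of the four similarities above: s(ri, rj) = a1·s_colour(ri, rj) + a2·s_texture(ri, rj) + a3·s_size(ri, rj) + a4·s_fill(ri, rj), where each ai is either 0 or 1 depending on whether that measure is used.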
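As a small illustration of that combination, the function below adds up the four scores with on/off weights. The function name and the example scores are made up for this sketch.

```python
def combined_similarity(s_colour, s_texture, s_size, s_fill, weights=(1, 1, 1, 1)):
    """Weighted sum of the four measures; each weight is 0 (off) or 1 (on)."""
    a1, a2, a3, a4 = weights
    return a1 * s_colour + a2 * s_texture + a3 * s_size + a4 * s_fill


# Hypothetical scores for one pair of neighbouring regions.
all_measures = combined_similarity(0.5, 0.25, 0.98, 1.0)
colour_and_fill = combined_similarity(0.5, 0.25, 0.98, 1.0, weights=(1, 0, 0, 1))
```

Switching measures off gives a different grouping order, which is one way to get a more varied set of region proposals.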
So at this stage, after applying selective search, we have an image with sampled sub-images. For visualisation, I have drawn bounding boxes on it.
Now coming to the dataset: the paper uses the Manga109 dataset, which is the largest dataset available. The dataset comes with annotations, so extracting positive and negative samples won’t take much effort. Here faces are considered positive samples, and non-faces (bodies, trees, chairs, letters, etc.) are considered negative samples.
The CNN architecture consists of 5 convolutional layers; the output of the 5th layer is the input to two branches, top and bottom.
The top branch is the classification branch, consisting of two fully connected layers. Softmax is used as the activation function of the last layer. Its output is a binary classification: face or not face (1 or 0). The loss function for the top branch is set as
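To make the softmax output concrete, here is a small sketch of how two raw scores from the last layer turn into a face / not-face decision. The order of the scores ([not face, face]) and the function names are assumptions of mine, and the sketch does not reproduce the paper's loss function.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    largest = max(logits)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(value - largest) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(logits):
    """logits: [not_face_score, face_score] (assumed order).

    Returns (label, probability) with label 1 for face and 0 for not face.
    """
    probabilities = softmax(logits)
    label = 1 if probabilities[1] > probabilities[0] else 0
    return label, probabilities[label]


face_label, face_probability = classify([0.0, 2.0])
other_label, _ = classify([3.0, -1.0])
```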
The bottom branch also consists of two fully connected linear layers, which are used to find the spatial displacement of a given training sample relative to its corresponding ground truth.
(x¹, y¹, w¹, h¹) is the region of the given training sample and (x, y, w, h) is the ground truth. The horizontal and vertical displacements are calculated as Δx = (x¹ - x)/w and Δy = (y¹ - y)/h, and the width and height differences as Δw = log(w¹/w) and Δh = log(h¹/h). The loss function for the bottom branch is set as
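The displacement targets are easy to compute directly. A minimal sketch, assuming boxes are given as (x, y, w, h) tuples; whether (x, y) is the top-left corner or the centre is not specified here, and the function name is mine.

```python
import math


def displacement_targets(sample, truth):
    """Return (dx, dy, dw, dh) between a training sample and its ground truth.

    sample: (x1, y1, w1, h1) region of the training sample
    truth:  (x, y, w, h) ground-truth face region
    """
    x1, y1, w1, h1 = sample
    x, y, w, h = truth
    dx = (x1 - x) / w
    dy = (y1 - y) / h
    dw = math.log(w1 / w)
    dh = math.log(h1 / h)
    return dx, dy, dw, dh


# A sample that is shifted slightly and twice as wide as the ground truth.
dx, dy, dw, dh = displacement_targets((12, 8, 40, 20), (10, 10, 20, 20))
```

A sample that matches its ground truth exactly gives all zeros, so the bottom branch learns to predict how far off a region is.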
In general, facenet is trained with the top and bottom branches together, by considering the integrated loss of
For training, we use 13999 face-region images and 10000 non-face-region images. We adopt a batch size of 100 and train for 139 epochs, validating with 3000 face and non-face regions.
Taaadaa! After training, we reach 98% accuracy.