Original article was published by Lamothe Thibaud on Artificial Intelligence on Medium
Disclaimer N°1: The predictive rate is not as good as with siamese networks and we had to explore other solutions. But the idea is very interesting and deserves to be shared and known.
Disclaimer N°2: As in many data science projects, the data preparation was the toughest part. To process tails as signals, the quality of the signals needs to be very good. In this post, we’ll take time to understand all the steps that come before the signal processing.
Let’s deep dive 🐳
Exploring our dataset, analyzing pictures
As mentioned in the introduction, we were given a few thousand pictures, which is a lot to look at. At first glance, a whale is a whale, and all these pictures looked like a blue background (sky and sea) with a gray spot in the center (tail).
After a first exploration though, we started to tell individual sperm whales apart, essentially from the shape of their tails, and were convinced that shape would be decisive for our algorithms. What about color? Is there any interesting information in the distribution of the pixels?
Using the Bokeh visualization library, we quickly found that the colors in the images were highly correlated. So we concentrated on the contours, trying to detect them through color variations.
Tail extraction based on color filters
The first step to detect the contour of the tails is to extract them from the sky and water. And actually, it was the most difficult part of the process.
First, we used contour detection algorithms. But because the sunlight changes from one shot to the next, the contrast varies a lot and the results were far from satisfactory. By the way, it was fun to look at the pictures where the algorithms failed the most, because most of the time the distinction between the tail and the sea was obvious to a human.
That being said, let’s deep dive into color analysis and contour extraction automation.
Using colors to extract tails
Let’s plot a grayscale picture of each channel’s intensity (red, green, blue).
As you can see above, and this is true for the majority of the pictures, there is less color in the middle of the picture, which makes it possible to filter by pixel intensity. As the tails are often grey, they have almost the same quantity of each color (R = G = B). The sea and the sky, however, tend to be blue, which makes this color an ideal candidate for filtering.
Let’s see what happens when we look only at the blue channel and keep only the pixels where blue_value < SELECTED_THRESHOLD. The maximum SELECTED_THRESHOLD is 255, as that is the maximum value of a pixel intensity.
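As a minimal sketch of this filter, assuming the picture is loaded as a NumPy array in RGB channel order (the function name `blue_mask` and the toy image are illustrative, not taken from the original project):

```python
import numpy as np

def blue_mask(image, threshold):
    """Return a binary mask: 1 where the blue value is below the threshold, else 0.

    Assumes `image` is an H x W x 3 uint8 array in RGB order, so that
    channel index 2 is blue (OpenCV's default BGR order would use index 0).
    """
    blue = image[:, :, 2]
    return (blue < threshold).astype(np.uint8)

# Toy 2x2 picture: grey "tail" pixels on the left, blue "sea" pixels on the right.
image = np.array(
    [[[90, 90, 90], [20, 60, 200]],
     [[30, 30, 30], [10, 80, 160]]],
    dtype=np.uint8,
)
mask = blue_mask(image, threshold=100)  # keeps only the two grey pixels
```

The grey pixels survive because their blue value is low, while the sea pixels are dropped because their blue value is above the threshold.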
With this series of pictures, we might believe that tail extraction is a breeze. But how do we choose the filtering threshold?
Below is an example of the results on a single picture, using every value from 10 to 170 (in steps of 10) as the threshold.
Here are some interesting takeaways:
- With a very small threshold (around 10), the sea disappears, but so does the tail
- With a small threshold (around 20), parts of the tail disappear
- With a slightly higher threshold (around 40), the extraction seems perfect. All of the tail is less blue than the threshold, and all of the sea is bluer than the threshold.
- With an intermediate threshold (around 80), the tail remains intact, but we start to keep part of the sea
- With a mid-range threshold (around 110), it is hard to distinguish the sea from the tail
- With a fairly high threshold (140 and more), the tail is completely lost: even the sea is no longer blue enough to be removed by the filter.
So here we are, and it seems clear that we should take SELECTED_THRESHOLD = 40 and apply the filter blue_value < 40.
As you can guess, it is not that easy. 40 is the right value for that picture, given its light intensity, but it changes from shot to shot. When we plotted the results for all these thresholds on random pictures, the right threshold turned out to vary from 10 to 130. So how do we choose the right value?
Using bounding boxes to select the threshold
Looking at the previous pictures, something came to mind: the right picture with the right threshold is the one with the emptiest zone outside the tail and the fullest zone inside it. Luckily, some neural networks trained on ImageNet can localize whales in a picture. We decided to use MobileNet, which is trained on the ImageNet classes, including:
n02066245 grey whale, gray whale, devilfish, Eschrichtius gibbosus, Eschrichtius robustus
And it turned out to be a great idea. As shown below, we could identify the position of the tails in the pictures with really high precision. We could then separate the “tail part” (inside) from the “sea part” (outside) in almost all the pictures.
To get a better sense of that separation, for each picture of the training set we summed the blue value of every pixel inside the bounding box, and did the same for the pixels outside the box.
Then we plotted each picture on the following graph, with the inside sum on the X-axis and the outside sum on the Y-axis. The blue line represents X = Y. What this graph tells us is the following: the farther a picture is from the line, the easier the separation between tail and sea will be.
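The two sums can be sketched as follows. This is only an illustration: the `(top, left, bottom, right)` box format and the name `blue_sums` are assumptions, since the article does not say how the bounding boxes were stored.

```python
import numpy as np

def blue_sums(image, box):
    """Sum the blue channel inside and outside a bounding box.

    `image` is assumed to be an H x W x 3 array in RGB order.
    `box` is a hypothetical (top, left, bottom, right) tuple in pixels,
    with bottom and right exclusive.
    """
    # Cast to a wide integer type so that summing uint8 values cannot overflow.
    blue = image[:, :, 2].astype(np.int64)
    top, left, bottom, right = box
    inside = int(blue[top:bottom, left:right].sum())
    outside = int(blue.sum()) - inside
    return inside, outside
```

Each picture then becomes one `(inside, outside)` point on the scatter plot described above.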
We tried to set the filter threshold according to the distance to the line, but this didn’t lead to any result. After a few attempts based only on the picture’s color distribution, we gave up on that approach and decided to go the hard way. Instead of looking at each picture and deciding on the threshold, we’d apply 15 filters per picture, analyze them, and automate the selection of the best one for further processing.
So for a given picture, we applied the 15 filters, with 15 different threshold values. For each filter, we counted the pixels kept inside the bounding box and the pixels kept outside it (after filtering, each pixel value is either 0 or 1, so there is no need to sum intensities anymore). Then we normalized the results so that the numbers would be independent of the picture’s size, and plotted the results on a graph.
For each picture, we got a curve similar to the one above, which is the mathematical translation of the statements we made earlier about the evolution of the threshold.
- When the threshold is very small, both tail and sea disappear. No pixel is kept inside the tail, nor outside
- As the threshold grows, the tail appears and the values on the X-axis rise.
- At some point the threshold starts to let parts of the sea through, and the outside value starts to grow.
Using linear regression or derivatives, it is now easy to detect the right threshold: it is the one at the intersection of the two lines of the plot.
NB: the orange line is y = y_of_the_selected_threshold.
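Here is a rough sketch of how such a curve could be built. Two things in it are assumptions rather than the method described above: the counts are normalized by dividing by the number of pixels in each region, and `pick_threshold` uses a simple stand-in rule (most of the box filled while almost nothing is kept outside) instead of the regression or derivative approach. The names, the box format, and the `max_outside` parameter are all hypothetical.

```python
import numpy as np

def kept_fractions(image, box, thresholds):
    """For each threshold, return (threshold, inside fraction, outside fraction).

    `box` is a hypothetical (top, left, bottom, right) tuple, bottom/right exclusive.
    Fractions are pixel counts divided by the region size (an assumed normalization).
    """
    blue = image[:, :, 2]
    top, left, bottom, right = box
    in_box = np.zeros(blue.shape, dtype=bool)
    in_box[top:bottom, left:right] = True
    n_inside = max(int(in_box.sum()), 1)
    n_outside = max(blue.size - int(in_box.sum()), 1)
    curve = []
    for t in thresholds:
        kept = blue < t
        inside = float((kept & in_box).sum()) / n_inside
        outside = float((kept & ~in_box).sum()) / n_outside
        curve.append((t, inside, outside))
    return curve

def pick_threshold(curve, max_outside=0.05):
    """Stand-in rule: among thresholds that keep at most `max_outside` of the
    sea, return the one that keeps the most of the box (lowest one on ties)."""
    candidates = [point for point in curve if point[2] <= max_outside]
    if not candidates:
        return None
    return max(candidates, key=lambda point: point[1])[0]

# The 15 thresholds mentioned in the text: 10, 20, ..., 150.
THRESHOLDS = list(range(10, 160, 10))
```

Plotting the inside fraction against the outside fraction for the 15 thresholds gives a curve of the kind described above.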
The last tip for tail extraction
Finally, to get the most out of our pictures during extraction, once we had found the best threshold (among 10, 20, 30, 40, …, 120, 130, 140, 150), let’s say 80, we also applied the filter at the -5 and +5 values. So we had three pictures: blue < 75, blue < 80, and blue < 85. We then summed these three binary grids (of 0s and 1s) and kept only the pixels where the resulting value was equal to 2. This acted as a final filter, removing the noise around the tail. It led to better extraction overall, so we applied it to all pictures.
As a summary, here are the assumptions we made until now:
- We can separate a tail from the sea using a filter on the blue pixel’s intensity
- There is a threshold to find for each picture before filtering
- Using bounding boxes is a promising method to find this threshold
And after a few (lots of) hours of work, we ended up with a very good tail extractor, working well across different luminosity, weather, sea colors, and tail colors, and able to get through the toughest pictures.
Now that the tails are located in the pictures, we move on to contour detection. To treat tails as time series, we need a signal.
At this step, we could have used a contour detection algorithm from OpenCV, but it turned out to be faster to use the following two steps:
Step 1: using entropy to remove noise around the tail
Step 2: keeping the topmost light pixel of each column of the picture
This step was pretty straightforward, with no specific complication. Perhaps the only one that was 😉
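Step 2 can be sketched in a few lines, assuming the extracted tail is available as a binary mask (non-zero = tail). The function name and the choice to mark empty columns with NaN are illustrative assumptions, and the entropy-based cleaning of step 1 is not reproduced here.

```python
import numpy as np

def upper_edge(mask):
    """For each column, return the row index of the topmost non-zero pixel.

    Columns that contain no tail pixel are returned as NaN (an assumption;
    they could just as well be dropped).
    """
    mask = np.asarray(mask) > 0
    has_pixel = mask.any(axis=0)
    # argmax on a boolean column returns the index of the first True value.
    first_row = mask.argmax(axis=0).astype(float)
    first_row[~has_pixel] = np.nan
    return first_row
```

Note that image rows are counted from the top, so a smaller value means a higher point on the tail; the signal may need to be flipped depending on the convention used afterwards.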
By extracting the tail from the sea and taking the topmost pixel of each column, we got the trailing edge of the tail as a signal. Now that we have this, we have to deal with normalization. Not all pictures have the same size or number of pixels. Moreover, the distance to the sperm whale is not always the same, and the orientation might change between shots.
We had to normalize along both axes. First, we decided to work with 300 points per tail for the signal comparison, so we interpolated the shorter signals and downsampled the longer ones. Second, we normalized all the values between 0 and 1. This made the signals directly superimposable, as seen in the following picture.
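Both normalizations can be sketched with NumPy. Using linear interpolation (`np.interp`) for both upsampling and downsampling is an assumption made for brevity; the article does not say which resampling method was used.

```python
import numpy as np

def normalize_signal(signal, n_points=300):
    """Resample a 1-D signal to `n_points` and rescale its values to [0, 1]."""
    signal = np.asarray(signal, dtype=float)
    # Axis 1: map the original samples onto a common [0, 1] abscissa.
    old_x = np.linspace(0.0, 1.0, num=len(signal))
    new_x = np.linspace(0.0, 1.0, num=n_points)
    resampled = np.interp(new_x, old_x, signal)
    # Axis 2: min-max scaling of the values.
    low, high = resampled.min(), resampled.max()
    if high == low:
        return np.zeros(n_points)  # flat signal: avoid dividing by zero
    return (resampled - low) / (high - low)
```

After this step, every trailing edge is a vector of 300 values between 0 and 1, whatever the size of the original picture.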
To tackle the orientation problem, we used an integral curvature measure, which transforms the signal into another one by evaluating it locally.
As mentioned in the original paper: “It captures local shape information at each point along the trailing edge. For a given point that lies on the trailing edge, we place a circle of radius r at the point and find all points on the trailing edge that lie within this circle.”
Then, at each step, we straighten the edges of the signal in the circle, so that it is inscribed in a square.
Finally, we define the curvature as follows: it is the ratio of the area under the curve to the total area of the square, which implies that the curvature value for a straight line is c = 0.5.
We thus obtained standardized signals, independent of the distance between the whale and the photographer, independent of the angle between the whale and the photographer, and independent of the inclination between the whale and the sea.
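To make the area ratio concrete, here is a deliberately simplified sketch, not the implementation used in the project. It assumes an evenly sampled edge (one value per column, in pixel units), uses an axis-aligned square of side 2r centred on each point, and skips the straightening step described above. The function name is hypothetical.

```python
import numpy as np

def integral_curvature(y, radius):
    """Simplified integral curvature of a 1-D edge, for one radius.

    For each point, a square of side 2 * radius is centred on it and the
    curvature is the area under the curve inside that square divided by the
    area of the square. Points closer than `radius` to either end are NaN.
    """
    y = np.asarray(y, dtype=float)
    n = len(y)
    side = 2 * radius
    curvature = np.full(n, np.nan)
    for i in range(radius, n - radius):
        window = y[i - radius:i + radius + 1]
        # Height of the curve above the bottom of the square, clipped to the square.
        height = np.clip(window - (y[i] - radius), 0.0, side)
        # Trapezoidal rule with unit spacing between samples.
        area = np.sum((height[:-1] + height[1:]) / 2.0)
        curvature[i] = area / (side * side)
    return curvature
```

On a straight line this returns 0.5 everywhere away from the ends, which matches the property stated above; bumps and hollows in the edge push the value above or below 0.5.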
For each picture of the training set, we then created those signals with a radius of 5, 10, and 15 pixels during the IC phase. We stored them and used them for the final step: the comparison between time series.
I’ll skip the implementation of this algorithm in this article. Once it works, we can apply it to our trailing edges and extract a signal free from the details of the environment. For a single tail, it looks like the signals below.
Now, let’s go through signal comparison!
Dynamic Time Warping
Dynamic Time Warping (DTW) is an algorithm able to find the optimal alignment between two time-series. It is often used to determine time series similarity, classification, and to find corresponding regions between two time-series.
The DTW distance, as opposed to the Euclidean distance (which compares two curves point by point), allows distinct portions of the curves to be linked. Here is how the algorithm works:
- With 2 curves, we build the distance matrix between the two series, starting from the bottom-left corner and moving to the top-right corner, computing the cumulative distance between points Ai (from series A) and Bj (from series B) as follows:
D(Ai, Bj) = |Ai - Bj| + min(D[i-1, j-1], D[i-1, j], D[i, j-1]).
- Once the distance matrix is filled, we trace the lowest-cost path from the top-right corner back to the bottom-left corner. To do so, at each step we move to the neighboring cell with the smallest value.
- Finally, the selected path (green in the next figure) indicates which data point from series A corresponds to which data point from series B.
The implementation of such a basic computation is quite easy. As an example, here is a function to create a distance matrix from two series:
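The original snippet is not reproduced here, so what follows is a minimal sketch of the recurrence given above, not the authors' code. The function names are illustrative, and the extra row and column of infinities are just a common way to handle the borders of the matrix.

```python
import numpy as np

def dtw_matrix(a, b):
    """Cumulative DTW cost matrix between two 1-D series.

    Implements D[i, j] = |a[i] - b[j]| + min(D[i-1, j-1], D[i-1, j], D[i, j-1]).
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n, m = len(a), len(b)
    # Pad with an extra row and column of infinities so that border cells
    # can use the same recurrence as inner cells.
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j - 1], D[i - 1, j], D[i, j - 1])
    return D[1:, 1:]

def dtw_distance(a, b):
    """DTW distance: the cumulative cost in the last cell of the matrix."""
    return float(dtw_matrix(a, b)[-1, -1])
```

For example, `dtw_distance([1, 2, 3], [1, 2, 2, 3])` is 0, because the repeated 2 can be aligned with the single 2 of the first series. The double loop is quadratic in the signal length, which is consistent with the comparison being slow on a whole dataset of 300-point signals.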
This being said, let’s get back to our sperm whales! With each tail of our dataset transformed into an integral curvature signal, we computed the distance between all tails to discover which ones were the closest.
After that, when receiving a new picture, we have to run it through the whole preparation pipeline: tail extraction with a blue filter, contour detection with the entropy method, and contour transformation with IC. This gives us a 300x1 tensor, and we finally have to compute its distance to every signal in the dataset, which is, by the way, quite time-consuming.
Verdict: results were respectable! When we had two pictures of the same whale, in most cases the two were among each other’s 40 closest, which is great among 2000. However, as mentioned in the introduction, siamese networks outperformed this approach (matching pictures were often in the 5 closest). Given the time available for the competition, we had to choose among our investigations, and we went no further with this methodology.