Source: Deep Learning on Medium
I am currently a Data Scientist at Balad, which is an Iranian smart map and route planner platform, a fast-growing product of the well-known Iranian tech giant, CafeBazaar. We face all sorts of challenges in Balad on a daily basis; data aggregation, smart information retrieval, behavior monitoring, optimal graph solving, etc. are just a handful of our data hurdles. Balad offers a whole bunch of cool features, to go over them namely, I’m giving a shout-out to my superb colleagues at Navigation, Search, Public Transport, and PoI — Points of Interest, which I’m a member of. By “Points of Interest” we basically mean any place a general user is possibly interested in; it might be a specific clinic, a local locksmith, a supermarket, or an established restaurant. According to our numbers, services like Google Maps or Waze which are fairly popular in Iran, combined together, have only managed to collect a small fraction of almost one-fifth of actual PoIs of, for a case in point say, a prominent street in Tehran. As ambitious as it may sound, our ultimate mission is to leave no Point of Interest behind — OK, that was a bit remote from practice, but you’d be surprised what great lengths we’ve gone to be able to say this! Further, I will concisely explain how are we to achieve this and what I have to do with all this.
We rely on two major cornerstones in PoI, our image capturing cars — which we refer to as Vanettes — and our liveware operators. We basically make an optimal traversal of the graph of the city map, have the Vanettes drive on these routes and capture images with a full 360° view at a high frame rate, and ask the operators to label any appearing Place of Interest with a preset category and insert additional details. My job, as the sole Data Scientist of PoI division, was to automatically detect these places of interest to accelerate the image tagging process for the operators. The key result is to provide the operator with exactly one image to enter the info of a single point of interest.
We can safely assume that store signs and billboards are ubiquitous for all sorts of retail stores and establishments. By detecting these signs we should be able to identify the corresponding point of interest. So the first thing we did was to ask our operators to draw a bounding box around any billboard or signage appearing in a large number of images — these also include traffic signs, street name signs, etc. — to prepare a dataset for training an object detection model. Obtaining a ready detection module per se gives the operators a boost of 3x in terms of speed — due to the fact that only one-third of the pictures contain any point of interest — but there’s more to be done. Regarding our relatively high frame rate, a single sign may appear in a number of successive images and the detection network identifies it in every single one of them. Accordingly, if we somehow track these PoIs and group them together, another boost of 3x is reachable, and also we avoid the issue of duplicate info.
The assignment of coming by a topnotch detection module was first undertaken. There are a great many detection algorithms out there, almost all of which relying on deep learning; YOLO and R-CNN are two main themes in this paradigm. I’m not going to give a technical pros-and-cons of these systems here, although I will be shedding some light on the network we ended up using. As one may infer from our super noisy data — either of the street images of Tehran or the unsettled distinctive boxing strategy of the operators — Faster R-CNN outperforms YOLO, in terms of both the accuracy of distinguishing the signs from background class and the box precision. I have to mention that we used a refined version of Faster R-CNN, by removing the mask heads of a Mask R-CNN model. This way you’re still able to enjoy the merit of Feature Pyramid Networks and RoI Align. We later migrated from this Tensorflow and Keras implementation to PyTorch for performance issues — I’ve been in love with PyTorch ever since 😍.
Now that we have an awesome object detection util at hand, the next step is to tackle the tracking problem. Our first plan was to crop out the detected regions, extract their visual features by running them through the feature extractor part of our detection network, and see if identical PoIs maintain closeness when projected to the feature space. To experiment this, we looked at the T-SNE representation of the features. This gave us the false hope of adequacy of exclusive deployment of this similarity notion for the task of tracking. To officially test this, we basically cropped out the detected regions of any two following images and tried to match the first and second image detection pairs with the highest similarity metric. The results were poor, due to the fact that features were only able to maintain a relative, instead of an absolute global closeness. So a detection in an image might find a fairly similar, but not the identical detection in the next image, closest to itself — this is partly due to the shortcomings of the feature extractor, for example, it’s insensitivity to the textual content of the cropped region. We could consider training a Siamese Network to be able to further increase the sensitivity of the feature extractor network to more relevant features but we ended up doing something a smarter, and easier.
A valuable source of information that we haven’t yet taken into account, is the positional attributes of the boxes. The following algorithm is premised on the fact that, if a pair of boxes preserves relative positioning and absolute sizes in two successive images, regardless of the content of the boxes, a high matching probability is assumed for them. For instance, a group of neighboring boxes always maintain relative positioning; this is the central idea. The algorithm also accounts for depth variance of the signs, i.e. the distance from the camera — notice that closer objects undergo bigger displacements and vice versa. Our final algorithm complements the content similarities from the previous section with these new probabilities and in the end, we solve for maximum weighted matching of a bipartite graph. I will try to explain the crux of our tracking system in concrete math in the following section.
Suppose detected boxes of indices 𝕚₁ and 𝕛₁ in the first image with the total detected region count of 𝔻₁, and 𝕚₂ and 𝕛₂ in the second image with the total detected region count of 𝔻₂. We first create a matrix of size (𝔻₁, 𝔻₂, 2), which consists of vectors of size 2 — x and y coordinates in the picture — to displace the center of the box 𝕚₁ in the first image, to the center of the box 𝕚₂ in the second image, at [𝕚₁, 𝕚₂, :]. By iterating over these displacements, and taking a displacement vector of size 2 at a time — 𝔻₁×𝔻₂ in total — we shift all the boxes in the first image by this transition vector. Notice that at iteration 𝕚₁×𝕚₂, boxes 𝕚₁ and 𝕚₂ have coinciding centers. For the resulted set of boxes at each iteration, we can create another matrix of size (𝔻₁, 𝔻₂), with each element determining the Intersection over Union (IoU) of the displaced box of 𝕛₁ and the box of 𝕛₂, at [𝕛₁, 𝕛₂]. We can squash all this information into a single matrix, Ψ, of size (𝔻₁, 𝔻₂, 𝔻₁, 𝔻₂), with each element at index [𝕚₁, 𝕚₂, 𝕛₁, 𝕛₂] representing the IoU of the boxes 𝕛₁ and 𝕛₂, when 𝕛₁ is moved by the displacement vector of 𝕚₁ → 𝕚₂. OK, that was painful, but just bear with me for a bit longer.
We can now reformulate the aforementioned premise of the algorithm as follows. If by displacement of 𝕚₁ → 𝕚₂, boxes 𝕛₁ and 𝕛 ₂ have a high IoU, and by displacement of 𝕛₁ → 𝕛₂, boxes 𝕚₁ and 𝕚₂ also have a high IoU, then 𝕚₁ matches 𝕚₂, and 𝕛₁ matches 𝕛₂. This means that the product of Ψ[𝕚₁, 𝕚₂, 𝕛₁, 𝕛₂] and Ψ[𝕛₁, 𝕛₂, 𝕚₁, 𝕚₂] should be high. We then “pair transpose” Ψ, meaning to change the order of the first and the second axes, by its third and fourth, and name the resulting matrix by Ψᵀ. Calculating an elementwise multiplication of Ψ and Ψᵀ results in a new matrix, Φ, with the same size as Ψ. To satisfy the match condition, we plainly require the entity at Φ[𝕚₁, 𝕚₂, 𝕛₁, 𝕛₂] to be higher than a certain threshold — note that this matrix is also “pair symmetric.” To get a one-to-one similarity measure from regions in image one to two, we can take a maximum over the last — or first, remember it’s pair symmetric — two axes of Φ. Additionally, we combine the outcome of size (𝔻₁, 𝔻₂) with the content similarity matrix from the previous section, and at last, make use of the Hungarian Algorithm to solve for a one-to-one optimal matching of the probabilities.
And pretty much that’s it. We then set up a service of Automatic Boxing of PoIs and connected it to the backend project. We use a local cluster of GPUs as our boxing engine. This system uninterruptedly takes in boxing assignments, detects all PoIs statically in every image via the Faster R-CNN network, runs the tracking algorithm for every pair of successive images in the data to bundle up reappearing PoIs, and writes the results on the database. Operators then have access to all images of every PoI, but it suffices to insert info for only one of images. And yay! Mission accomplished!
I tried to give you some sense of what it’s like to work at CafeBazaar, what kind of challenges we face, and what kind of solutions we consider for applying. For future work, we’re planning to expand our detection systems, autodetect categories, and develop and deploy Persian OCR systems. It’s an invaluable experience to help a product take its first steps and accompany it to its maturity. Thanks! ✌️