Source: Deep Learning on Medium
Finding your perfect dog using vector representations for fast nearest neighbour searches
Here’s the problem:
We love Poodles, but when we were trying to find the right one, we didn’t realise just how many variations there were, or how each Poodle derivative differed (physical attributes, temperament, family friendliness etc.).
So perhaps a combination of deep learning and machine learning techniques could help us sort through all these Poodles and match us to the right pooch.
Word vectorization might be one way of sifting through the mire quickly, saving us time, money, and potential heartache.
The aim of word vectorization is to extract information from a text corpus and associate a “vector” representation with each word. The vector’s values are computed by an algorithm that captures the contextual relationships between words.
For example: Vector 1 might be similar to Vector 2, and Vectors 1 and 2 might both be similar to Vector 3.
So if we look at the features and attributes, the algorithm calculates similarities using techniques like cosine similarity, extreme classifiers for very large label sets, and so on.
By feeding the word vectorization algorithm a very large corpus (we are talking here about millions of words or more), we obtain a vector mapping in which close values imply that the words appear in the same context and more generally have some kind of similarity, either syntactic or semantic.
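That “closeness” between vectors is usually measured with cosine similarity. As a minimal pure-Python sketch (the three-dimensional toy vectors below are made up purely for illustration, not learned from any corpus):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy embeddings: words that appear in similar contexts should end up
# with vectors pointing in similar directions.
poodle = [0.9, 0.8, 0.1]
cockapoo = [0.8, 0.9, 0.2]
tractor = [0.1, 0.0, 0.9]

print(cosine_similarity(poodle, cockapoo))  # high (close to 1)
print(cosine_similarity(poodle, tractor))   # low
```

Note that cosine similarity only compares directions, not magnitudes, which is why it suits embeddings whose lengths vary.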
So how do these Vectors help me to find the right Poodle?
For a comprehensive Poodle model, we might include the countries and climates suited to certain types of Poodle, as well as behavioural patterns, vet care, extensive articles on dog health and lifespans, the history of each breed, and so on.
Then we determine similarities.
For example, combinations of features:
- Barking Range
- Kid Rating
Cockapoo = Poodle + Cocker Spaniel + barking range low + kid rating great = vector (1,0,0,0,0,0,0)
Maltipoo = Poodle + Maltese + barking range medium + kid rating bad = vector (0,1,0,0,0,0,0)
Labradoodle = Poodle + Labrador + barking range high + kid rating ok = vector (0,0,1,0,0,0,0)
Goldendoodle = Poodle + Golden Retriever + barking range low + kid rating great = vector (0,0,0,1,0,0,0)
Schnoodle = Poodle + Schnauzer + barking range high + kid rating ok = vector (0,0,0,0,1,0,0)
Peekapoo = Poodle + Pekingese + barking range high + kid rating bad = vector (0,0,0,0,0,1,0)
Yorkipoo = Poodle + Yorkshire Terrier + barking range medium + kid rating bad = vector (0,0,0,0,0,0,1)
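One concrete way to work with the table above: rather than pure one-hot breed IDs, the shared features can be encoded numerically so that breeds with the same traits land on the same point. A toy sketch (the numeric scales for barking and kid rating are our own assumption):

```python
# Made-up ordinal scales for the two shared features.
BARKING = {"low": 0, "medium": 1, "high": 2}
KID_RATING = {"bad": 0, "ok": 1, "great": 2}

breeds = {
    "Cockapoo":     ("low",    "great"),
    "Maltipoo":     ("medium", "bad"),
    "Labradoodle":  ("high",   "ok"),
    "Goldendoodle": ("low",    "great"),
    "Schnoodle":    ("high",   "ok"),
    "Peekapoo":     ("high",   "bad"),
    "Yorkipoo":     ("medium", "bad"),
}

def encode(barking, kids):
    """Map the categorical features onto a small numeric vector."""
    return (BARKING[barking], KID_RATING[kids])

vectors = {name: encode(b, k) for name, (b, k) in breeds.items()}
print(vectors["Cockapoo"])      # (0, 2)
print(vectors["Goldendoodle"])  # (0, 2) -- identical features, identical vector
```

On these two features a Cockapoo and a Goldendoodle are indistinguishable, which is exactly why the article goes on to add more vectors for deeper relationships.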
If we add yet another vector, deeper relationships are determined and assigned to the vector representation, using a larger label set.
Word vectorization refers to a set of techniques that aims at extracting information from a text corpus and associating a vector to each one of its words. For example, we could associate the vector (1, 2, -3, 1) to the word Yorkipoo. This value is computed thanks to an algorithm that takes into account the word’s context. For example, if we consider a context of size 1, the information we can extract from the following sentence:
The Yorkipoo is a combination of a Poodle and a Yorkshire Terrier
is a set of adjacent-word pairs, such as:
(The, Yorkipoo), (a, combination), (Poodle, and), (Yorkshire, Terrier)
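Extracting those context pairs is easy to sketch in Python (tokenisation is simplified here; real implementations also lowercase, strip punctuation, and subsample frequent words):

```python
def context_pairs(sentence, window=1):
    """Extract (word, context) pairs within the given window size."""
    words = sentence.split()
    pairs = []
    for i, word in enumerate(words):
        for j in range(max(0, i - window), min(len(words), i + window + 1)):
            if j != i:
                pairs.append((word, words[j]))
    return pairs

sentence = "The Yorkipoo is a combination of a Poodle and a Yorkshire Terrier"
pairs = context_pairs(sentence)
print(pairs[:3])  # [('The', 'Yorkipoo'), ('Yorkipoo', 'The'), ('Yorkipoo', 'is')]
```

These (word, context) pairs are the raw training signal: algorithms such as word2vec learn vectors that make a word good at predicting its context words (or vice versa).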
By feeding the word vectorization algorithm a large corpus (many words/dog attributes), we obtain a vector mapping in which close values imply that the words appear in the same context and, more generally, have some kind of similarity, either syntactic or semantic.
Ok, I get the concept, but why is it interesting?
This technique goes further than grouping words: it also enables arithmetical operations between them. What this means is that you can do the following:
Poodle + Yorkshire Terrier
and the result would be:
Yorkipoo
In other words, the word vectorization could have associated the following arbitrary values with the words below:
Yorkshire = (0, 1)
Poodle = (1, 2)
Terrier = (2, 1)
And we would have the equality: Poodle + Yorkshire + Terrier = (3, 4) = Yorkipoo.
If the learning was good enough, the same is possible for other relationships between poodle breeds.
We can play with concepts by adding and subtracting them and get meaningful results from it, which is awesome!
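Using the toy 2-D values above (Yorkshire = (0, 1), Poodle = (1, 2), Terrier = (2, 1)), the arithmetic is just component-wise addition:

```python
def add(*vectors):
    """Component-wise sum of any number of equal-length vectors."""
    return tuple(sum(components) for components in zip(*vectors))

yorkshire = (0, 1)
poodle = (1, 2)
terrier = (2, 1)

result = add(poodle, yorkshire, terrier)
print(result)  # (3, 4) -- the point we'd expect the Yorkipoo vector to sit at
```

In a trained embedding the sum would rarely land exactly on another word's vector; in practice you take the result and look up its nearest neighbour.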
The applications are varied:
- You can visualize the result
- You can use these vectors to feed another more ambitious machine learning algorithm (for example, a neural network).
- The ultimate goal is to allow machines to understand human language, not by learning it by heart but by having a structured representation of it.
Sounds feasible! Where do I start with our dogs?
We encode our features and then output them in a vector format.
Then, utilising deep learning, we find the nearest neighbours among our vectors (via cosine similarity).
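A brute-force nearest-neighbour search over a handful of breed vectors can be sketched in a few lines (the vectors below are invented for illustration; at real scale you would switch to an approximate index):

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

# Hypothetical breed vectors (values made up for illustration).
breed_vectors = {
    "Cockapoo":     (0.9, 0.1, 0.8),
    "Goldendoodle": (0.8, 0.2, 0.9),
    "Peekapoo":     (0.1, 0.9, 0.2),
}

def nearest_neighbours(query, vectors, k=2):
    """Rank breeds by cosine similarity to the query vector, highest first."""
    ranked = sorted(vectors.items(), key=lambda kv: cosine(query, kv[1]), reverse=True)
    return ranked[:k]

query = (0.9, 0.1, 0.8)  # our ideal dog, expressed as a feature vector
for name, vec in nearest_neighbours(query, breed_vectors):
    print(name)
```

Brute force is O(n) per query, which is fine for a few thousand dogs; beyond that, libraries built on approximate nearest-neighbour indexes take over.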
Dogs and features:
- Weight: 3 to 14 pounds
- Life Span: 10 to 15 years
- Height: 7 inches to 1 foot, 3 inches tall
- Temperament Rating: Calm
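One hedged way (the scaling choices here are our own, not the article's) to turn those raw feature ranges into a fixed-length numeric vector is min-max normalisation to the 0..1 interval:

```python
# Assumed temperament scale -- the article only mentions "Calm".
TEMPERAMENT = {"Calm": 0.0, "Moderate": 0.5, "Energetic": 1.0}

def dog_vector(weight_lb, lifespan_yr, height_in, temperament):
    """Min-max normalise each feature against the ranges quoted above."""
    return (
        (weight_lb - 3) / (14 - 3),      # weight: 3 to 14 pounds
        (lifespan_yr - 10) / (15 - 10),  # life span: 10 to 15 years
        (height_in - 7) / (15 - 7),      # height: 7 to 15 inches
        TEMPERAMENT[temperament],
    )

print(dog_vector(8, 12, 10, "Calm"))
```

Normalising keeps any one feature (e.g. weight in pounds) from dominating the distance calculation purely because of its units.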
Constructing a better dog-vector representation using a method such as cosine similarity:
The example query (in this case against Elasticsearch) retrieves a candidate range using Principal Component Analysis (PCA). It then executes a re-ranking using a function score (inline vector scoring) that calls an associated vector library’s cosine-similarity function.
The vectors are filtered by range, and within that range cosine similarity is computed to reveal our similarities (Poodles to Poodles).
The results are presented in scoring order (you determine how many similar Poodles you would like returned).
For the more technically minded see below an example of the cosine similarity/range vector query:
"vector”: [ 0.0, 0.0716, 0.1761, 0.0, 0.0779, 0.0, 0.1382, 0.3729 ]
Want to explore more about dense vectors?
Vector Space Model Software
The following software packages may be of interest to you if you want to experiment with vector models and implement search services based upon them.
Open Source Software Links
- Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java
- Elasticsearch is another high-performance, full-featured text search engine using Lucene
- Gensim is a Python+NumPy framework for Vector Space modelling. It contains incremental (memory-efficient) algorithms for Tf–idf, Latent Semantic Indexing, Random Projections and Latent Dirichlet Allocation
- Weka is a popular data mining package for Java including WordVectors and Bag Of Words models
- Word2vec uses vector spaces for word embeddings
Other important references to help get your head around these concepts: