Source: Deep Learning on Medium
Understanding context with Google’s Universal Sentence Encoder and Neo4j
Sentence embeddings and graph connections
One of the challenges in analyzing text quantitatively is categorizing strings. With the advent of embedding models like Word2Vec, GloVe, ELMo and, more recently, BERT, it has become possible to solve complex NLP tasks with ease.
In this article, we will look at the advantages of sentence embeddings over word embeddings for multi-word strings. We'll also see how these embeddings can be used to analyze “concept transfer” within similar strings in a graph database, Neo4j, to get something like this:
Say we want to study the occupational profile of our existing customers so that we can send them targeted campaigns based on their professional profile. Our customers come from a variety of backgrounds. When a person creates a profile, there are fields that they fill in as free text, and occupation is one of them. Since these are free-text strings, people with a similar occupation (similar industry or job) may enter theirs differently.
We wanted to solve a problem based on the above scenario: create different content for targeted, cross-selling campaigns based on the occupation customers entered when they registered. We had around 10,000 distinct occupation strings, which we wanted to cluster into 30–40 industry-specific groups. We found a lot of themes related to engineers, doctors, artists, defense, education, designers, labor, guards, food, transportation, business owners etc. The challenge was to find these themes in an unsupervised manner.
Shout out to word embeddings!
Models like Word2Vec, GloVe and ELMo have certainly given “context” to our lives and made them easier. These models have been trained on large corpora coming from a variety of sources. The intuition behind these models is:
Words that occur and are used in the same context are semantically similar to each other and have similar meanings.
These models enable us to use transfer learning for our tasks. We can use their embeddings directly or fine-tune a model on our corpus. This allows us to find meaningful clusters with even a small amount of data.
Check out this amazing article to get an understanding of embeddings in general: Deep Transfer Learning for Natural Language Processing by @d
Challenges with Data
If we take a look at the figure above, we can straight off see some of the challenges in data:
- Spelling mistakes — Doctor written as Docter
- More than one word — Orthopedic Surgeon
- Abbreviations — Doctor written as Dr
- Various types of doctors — neurosurgeon, dentist, medical executive etc.
So, we need a way to bring all these strings together in a group somehow. One way to achieve this is using word-embedding generators like Word2Vec. Word2Vec is a neural network trained on a large corpus which, for each word, spits out a fixed-size vector of floating-point numbers.
But our data can have multi-word strings, and we would like to use the models for inference instead of fine-tuning. Word2Vec returns a vector for a single word, so one workaround is to take the average of the embeddings of each word in the multi-word string. Even though this strategy may work with our occupation data, it would not give great results for longer sentences: since we are averaging over each word, we lose the context of the sentence as a whole.
This is where sentence encoders come in. Google’s Universal Sentence Encoder embeds any variable-length text into a 512-dimensional vector. There are two variations of the model available on TF-Hub: one is based on a Transformer network, the second on a Deep Averaging Network (DAN). To understand how these work, check out this paper from Google Research.
Let us take some concrete examples to understand the advantage of sentence embeddings over word embeddings for multi-word strings. Say we have two occupations: “Specialist Dentist” and “Healthcare Consultant”. Since both occupations belong to the medical field, we can expect them to have similar embeddings and hence a high cosine similarity. Let us compare the cosine similarities returned by the two approaches below.
Approach 1 — Taking mean of Word2Vec embeddings
Here, we first split the strings into words and get the word embeddings using the Word2Vec model in Gensim (GoogleNews-vectors-negative300). We take the mean of the word-embedding vectors for each occupation, then calculate the cosine similarity between the two mean vectors.
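A minimal sketch of this approach looks like the following. To keep it self-contained, a tiny made-up vocabulary stands in for the real GoogleNews vectors (the Gensim loading call is shown in the comment; the toy vectors and their values are purely illustrative):

```python
import numpy as np

# In the real pipeline the vectors come from the pretrained model:
#   from gensim.models import KeyedVectors
#   w2v = KeyedVectors.load_word2vec_format(
#       "GoogleNews-vectors-negative300.bin", binary=True)
# Here a toy vocabulary stands in so the sketch is self-contained.
toy_w2v = {
    "specialist": np.array([0.9, 0.1, 0.0]),
    "dentist":    np.array([0.2, 0.8, 0.1]),
    "healthcare": np.array([0.1, 0.9, 0.2]),
    "consultant": np.array([0.8, 0.2, 0.1]),
}

def mean_embedding(phrase, vectors):
    """Average the word vectors of every in-vocabulary word in the phrase."""
    words = [w for w in phrase.lower().split() if w in vectors]
    return np.mean([vectors[w] for w in words], axis=0)

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

v1 = mean_embedding("Specialist Dentist", toy_w2v)
v2 = mean_embedding("Healthcare Consultant", toy_w2v)
print(round(cosine_similarity(v1, v2), 3))
```

Note how the averaging step treats “Specialist” and “Dentist” independently: each word’s vector is fixed regardless of its neighbors.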
Approach 2 — Using Google Sentence Encoder
Here, we get the sentence embeddings using the DAN model from TensorFlow Hub, then simply take the cosine similarity between the returned vectors.
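The shape of the computation is sketched below. The TF-Hub loading call is shown in a comment (it downloads the model, so here random arrays of the same shape stand in for the encoder output); the pairwise-cosine step is exactly what we use later to compare occupation strings:

```python
import numpy as np

# Loading the DAN encoder from TF-Hub (assumes tensorflow and
# tensorflow_hub are installed):
#   import tensorflow_hub as hub
#   embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")
#   embeddings = embed(occupations).numpy()   # shape (n, 512)
# Random vectors of the same shape stand in for a self-contained run.
occupations = ["Specialist Dentist", "Healthcare Consultant"]
rng = np.random.default_rng(42)
embeddings = rng.normal(size=(len(occupations), 512))

# Pairwise cosine similarity: normalize rows, then take the dot product.
norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
unit = embeddings / norms
similarity = unit @ unit.T   # similarity[i, j] lies in [-1, 1]
print(similarity.shape)
```

Unlike the word-averaging approach, the encoder sees the whole string at once, so “Healthcare” and “Consultant” influence each other’s contribution to the final vector.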
We can see that the second approach gave us better results. Some more examples:
For all the occupation pairs, we can see that the sentence encoder outperforms word embeddings. This is quite understandable, as a “Specialist” can be a specialist in anything. Therefore, the embedding returned by Word2Vec for “Specialist” is generic and does not depend on the word “Dentist”. Similarly, the “Consultant” in the second occupation can be a consultant in anything. The sentence encoder returns an embedding in which the words “Healthcare” and “Consultant” are interdependent. Hence, we get embedding vectors with a much higher cosine similarity.
Also, note the high cosine similarity returned by the sentence encoder for “HSBC Employee” and “Bank Manager”. The algorithm knows HSBC is a bank! We wouldn’t have been able to achieve this with vanilla vectorizers and tf-idf approaches on the small amount of data we had!
Visualizing the Embeddings
Once we have the embeddings for our strings, we use t-SNE to reduce the dimensionality of our data from 512 to 2. We also generate multiple clusters in an unsupervised manner.
We plot the results on a scatter plot using Plotly Express, a high-level wrapper around Plotly graph objects.
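A sketch of the reduce-cluster-plot pipeline, assuming scikit-learn for t-SNE and using k-means as one possible clustering choice (the exact clustering algorithm, point count, and cluster count below are illustrative, and random vectors stand in for the real encoder output):

```python
import numpy as np
from sklearn.manifold import TSNE
from sklearn.cluster import KMeans

# Stand-in for the (n, 512) sentence-embedding matrix from the encoder.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(60, 512))

# Reduce the 512 dimensions down to 2 for plotting.
coords = TSNE(n_components=2, perplexity=5, random_state=0).fit_transform(embeddings)

# Group the strings into clusters (k-means is one unsupervised option).
labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(embeddings)

# Scatter plot with Plotly Express (assumes plotly is installed):
#   import plotly.express as px
#   px.scatter(x=coords[:, 0], y=coords[:, 1], color=labels.astype(str)).show()
print(coords.shape, np.unique(labels).size)
```

Clustering on the full 512-dimensional embeddings (rather than on the 2D t-SNE output) keeps the groups faithful to the original similarities; t-SNE is used only for the picture.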
As we can see above, similar professions have been clustered together. For example, all textile-related professions like “tailors”, “women’s wear shop”, “saree whole seller”, “roohi garments” have come close to each other. We can tag all these clusters under a “Textile” category. Similarly, all “Education”-related professions have clustered together. The green smudge on the left side is a category of hardcore spelling mistakes which were totally different from the other clusters and hence similar to each other 🙂
Tracking concepts in a graph with Neo4j
The t-SNE plot gave us a static 2D representation of our data. Similarly, a correlation plot of the embeddings would give us first-degree relationships among the occupation strings. But what if we want to track second- or greater-degree relationships?
To achieve this, we can create a graph with each occupation connected to the other with a correlation cutoff. Something like this:
Here, “lawyer” has a second-degree connection to “supreme court judge”. Note that we only connect nodes with a correlation ≥ 0.75. This ensures that only highly related data is connected in the graph.
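The cutoff-and-traverse idea can be illustrated in a few lines of plain Python (the occupation names and similarity values below are made up for the example; in practice the nodes and relationships live in Neo4j, where each thresholded pair would be merged as a relationship between occupation nodes):

```python
import numpy as np
from itertools import combinations
from collections import deque

# Toy similarity matrix for four occupation strings (values illustrative).
occupations = ["lawyer", "advocate", "high court lawyer", "supreme court judge"]
similarity = np.array([
    [1.00, 0.90, 0.80, 0.60],
    [0.90, 1.00, 0.85, 0.70],
    [0.80, 0.85, 1.00, 0.78],
    [0.60, 0.70, 0.78, 1.00],
])

# Connect only those pairs whose similarity clears the 0.75 cutoff.
CUTOFF = 0.75
edges = {occ: set() for occ in occupations}
for i, j in combinations(range(len(occupations)), 2):
    if similarity[i, j] >= CUTOFF:
        edges[occupations[i]].add(occupations[j])
        edges[occupations[j]].add(occupations[i])

def degree_of_connection(start, target):
    """Breadth-first search: length of the shortest path between two nodes."""
    seen, queue = {start}, deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        if node == target:
            return depth
        for nxt in edges[node] - seen:
            seen.add(nxt)
            queue.append((nxt, depth + 1))
    return None  # no path under the cutoff

print(degree_of_connection("lawyer", "supreme court judge"))  # → 2
```

“lawyer” and “supreme court judge” fall below the cutoff (0.60), so they share no direct edge, yet the traversal still reaches the judge in two hops via “high court lawyer”. In Neo4j this shortest-path question becomes a single graph query instead of hand-rolled BFS.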
The above schema was applied in Neo4j, one of the most widely used graph databases. Note how we move from one “concept” to another.
(Note: if you’re on a browser, we highly suggest zooming in for better viewing)
Theme 1 — Pilot to Hospitality
Pilot -> Aviation -> Airlines Professional -> Flight Attendant -> Cabin Attendant -> Hotel Job -> Other Hotel Stuff
Theme 2 — Pilot to Defense
Pilot -> Airforce -> Army -> Navy and Merchant Navy
Theme 1 — Photographer to Architect
Photographer -> Fashion Photographer -> Fashion Designer -> Interior Designer
Theme 2 — Photographer to Film stuff
Photographer -> vfx artist -> assistant cinematographer -> video director -> film maker -> more film stuff
Theme — Writer to News Reporter
writer -> document writer -> editor -> journalist -> reporter -> tv news reporter
Theme — Mechanic to Construction worker and Structural Engineer
mechanic -> ac mechanic -> ac technician -> electrician -> welder -> steel worker -> construction
Theme — Farmer to Milk Suppliers
Farmer -> agriculture -> dairy farm -> milk and dairy business -> milk suppliers
Word and sentence embeddings have certainly made it possible to solve complex NLP problems with ease. There have been major advancements in the field recently; for instance, OpenAI released the GPT-2 model.
Finding similarities among strings becomes a daunting task when we have large amounts of data. Facebook released a tool for this purpose, with algorithms optimized to run on GPUs.
“It’s a library that allows developers to quickly search for embeddings of multimedia documents that are similar to each other”
We provided a simple example of how graphs can be used to understand and track context. Here, we were able to figure out why “pilot” leads to both “airforce” and “hotels”. But the approach also has applications for bigger documents like research papers, official documents etc.