Original article can be found here (source): Deep Learning on Medium
Need for a Search Engine
Stack Overflow has millions of questions that have already been answered and verified by users. If a user runs into a problem, there is a high chance that it has already been answered, so instead of posting a new question and waiting for someone else to respond, he/she can simply look for relevant questions through a search engine. For the site to stay useful, it is very important to surface questions relevant to user queries and to return accurate, relevant results.
So an optimal search engine is necessary. Stack Overflow currently has its own search engine, but it has a few flaws. Let’s take an example where the Stack Overflow search engine fails.
If we consider their semantic meaning, the two queries below should return similar results. But because of a single spelling mistake, the first query returns no results at all, and the second query does not return appropriate results.
Query 1: how to do group by and follwed by the sum in pandas
Query 2: how to do group by followed by sum in pandas
We will tackle this problem using text similarity with Natural Language Processing (NLP) techniques.
Our goal is to understand the content of what the user is trying to search for and return the most similar results based on that.
The idea is to represent questions as vectors of features and to compare questions by measuring the distance between these vectors. Stack Overflow provides topics (tags) for each question, so we will also use tag information to retrieve similar results: we predict the tags of the user query and weight the candidate questions based on those tags.
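The core comparison step can be sketched with toy data: average each question’s word vectors and score the pair with cosine similarity. The word vectors below are made up purely for illustration; in the actual pipeline they would come from the trained Word2Vec model.

```python
import numpy as np

# Toy 4-dimensional word vectors (illustrative values only; a real
# pipeline would look these up in a trained Word2Vec model).
word_vectors = {
    "group":  np.array([0.9, 0.1, 0.0, 0.2]),
    "by":     np.array([0.8, 0.2, 0.1, 0.1]),
    "sum":    np.array([0.1, 0.9, 0.3, 0.0]),
    "pandas": np.array([0.2, 0.3, 0.9, 0.1]),
    "plot":   np.array([0.0, 0.1, 0.2, 0.9]),
}

def question_vector(tokens):
    """Represent a question as the average of its known word vectors."""
    vecs = [word_vectors[t] for t in tokens if t in word_vectors]
    return np.mean(vecs, axis=0)

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors, in [-1, 1]."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

q1 = question_vector(["group", "by", "sum", "pandas"])
q2 = question_vector(["group", "by", "plot", "pandas"])
sim = cosine_similarity(q1, q2)
print(f"similarity: {sim:.3f}")
```

The same comparison applies whether the vectors come from averaged Word2Vec, TF-IDF-weighted Word2Vec, or sentence encoders; only the way the question vector is built changes.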
The overview of the solution is as follows:
- Collecting data from the Google Cloud public dataset (Stack Overflow) using BigQuery
- Aggregating and preprocessing the data
- Exploratory data analysis on tags and filtering the most common tags
- Feature engineering
- Training Word2Vec Model
- Training a deep learning model to predict tags
- Retrieving similar results using the following techniques:
1. Average Word2Vec with Cosine Similarity
2. TF-IDF and Word2Vec with Cosine Similarity
3. Smooth Inverse Frequency and Word2Vec with Cosine Similarity
4. Universal Sentence Encoder with Cosine Similarity
5. BERT embeddings with Cosine Similarity
6. Latent Dirichlet Allocation with Jensen-Shannon distance
- Comparing the results
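As a minimal sketch of the retrieval step, here is a plain TF-IDF baseline with scikit-learn: a tiny stand-in corpus of question titles is ranked against a query by cosine similarity. The corpus and query are illustrative only; the techniques listed above replace these TF-IDF document vectors with the various embedding schemes.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# A tiny stand-in corpus of question titles (illustrative only;
# the real corpus comes from the BigQuery Stack Overflow dataset).
corpus = [
    "how to do group by followed by sum in pandas",
    "group by and aggregate sum in a pandas dataframe",
    "how to plot a histogram in matplotlib",
]

vectorizer = TfidfVectorizer()
doc_matrix = vectorizer.fit_transform(corpus)

# Vectorize the user query with the same vocabulary,
# then rank the corpus by cosine similarity to it.
query_vec = vectorizer.transform(["pandas group by sum"])
scores = cosine_similarity(query_vec, doc_matrix).ravel()
ranking = scores.argsort()[::-1]

for i in ranking:
    print(f"{scores[i]:.3f}  {corpus[i]}")
```

The two pandas questions score above the matplotlib one because they share query terms; the embedding-based variants aim to preserve this ranking even when the query uses different words (or contains a typo) for the same intent.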