Source: Deep Learning on Medium
Objective: Recommend lifestyle videos based on a dialogue with the user
The following is the high-level design.
Step 1: Data Collection
I wrote a web scraper in Python to collect user questions and their labels from a website hosting 10 million questions. The data looks like this.
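A scraper of this kind can be sketched with the standard library's `html.parser`; the page structure below (a `div` per question, a `span` per tag) is an assumption for illustration, not the real site's markup.

```python
from html.parser import HTMLParser

class QuestionParser(HTMLParser):
    """Collects (question text, [tags]) pairs from a hypothetical page layout."""

    def __init__(self):
        super().__init__()
        self.questions = []        # list of (text, [tags]) pairs
        self._in_question = False
        self._in_tag = False
        self._current_text = ""
        self._current_tags = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div" and attrs.get("class") == "question":
            self._in_question = True
            self._current_text = ""
            self._current_tags = []
        elif tag == "span" and attrs.get("class") == "tag":
            self._in_tag = True

    def handle_data(self, data):
        if self._in_tag:
            self._current_tags.append(data.strip())
        elif self._in_question and self._current_text == "":
            self._current_text = data.strip()

    def handle_endtag(self, tag):
        if tag == "span" and self._in_tag:
            self._in_tag = False
        elif tag == "div" and self._in_question:
            self.questions.append((self._current_text, self._current_tags))
            self._in_question = False

html = '<div class="question">What causes migraines?<span class="tag">neurology</span></div>'
parser = QuestionParser()
parser.feed(html)
print(parser.questions)  # [('What causes migraines?', ['neurology'])]
```

In practice the parser runs over paginated listing pages, with the extracted pairs appended to a flat file for the next step.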
The questions were labelled with multiple tags, as shown below. I decided early in the design to eliminate questions with multiple tags. This reduced the data to 1.8 million questions, an almost 80% loss.
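The single-tag filter is a one-liner; the record shape here (a dict with `text` and `tags` keys) is an assumption about how the scraped rows are stored.

```python
# Keep only questions carrying exactly one tag; multi-tag rows are dropped.
questions = [
    {"text": "What helps with insomnia?", "tags": ["sleep"]},
    {"text": "Diet and exercise for diabetes?", "tags": ["diet", "diabetes"]},
    {"text": "Is running daily safe?", "tags": ["fitness"]},
]

single_tag = [q for q in questions if len(q["tags"]) == 1]
print(len(single_tag))  # 2
```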
Step 2: Preprocessing Pipeline
After exploring the data, I built the following preprocessing pipeline to clean it and make it usable. It further reduced the data size to under 1 million questions.
I ran into some interesting problems. Spelling errors were the biggest source of lost information: a simple sentence with a misspelled symptom and medicine name would be filtered out because of too many 'NA's. Fixing this was the most time-consuming activity in the whole process; I iteratively designed hard-coded logic to correct spelling mistakes in medical terms.
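A hard-coded correction pass of this kind can be sketched as a regex substitution over a hand-built table; the entries below are illustrative examples, not the actual table, which was grown iteratively from the data.

```python
import re

# Illustrative hand-built correction table for frequent medical misspellings.
CORRECTIONS = {
    "diabetis": "diabetes",
    "migrane": "migraine",
    "paracetmol": "paracetamol",
}

def fix_medical_spelling(text):
    """Replace known misspellings of symptoms and medicine names."""
    pattern = re.compile(r"\b(" + "|".join(CORRECTIONS) + r")\b", re.IGNORECASE)
    return pattern.sub(lambda m: CORRECTIONS[m.group(0).lower()], text)

print(fix_medical_spelling("Can paracetmol help a migrane?"))
# Can paracetamol help a migraine?
```

Each iteration adds the misspellings uncovered by inspecting the rows that the 'NA' filter would otherwise discard.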
Step 3: Model Architecture and Training
The model architecture follows the design below. I used convolution filters and then gated them to choose which information flows toward the right diagnosis. The problem is supervised.
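The gating over convolution outputs can be sketched in NumPy as a gated linear unit: a sigmoid branch element-wise gates a linear branch. The shapes and filter sizes here are illustrative, not those of the trained model.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1d(x, w):
    """Valid 1-D convolution: x is (seq_len, d_in), w is (k, d_in, d_out)."""
    k, d_in, d_out = w.shape
    out = np.empty((x.shape[0] - k + 1, d_out))
    for t in range(out.shape[0]):
        out[t] = np.tensordot(x[t:t + k], w, axes=([0, 1], [0, 1]))
    return out

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gated_conv_block(x, w_a, w_b):
    """The sigmoid branch gates the linear branch, controlling how much
    of each convolutional feature flows to the next layer."""
    return conv1d(x, w_a) * sigmoid(conv1d(x, w_b))

x = rng.standard_normal((10, 8))       # 10 tokens, 8-dim embeddings
w_a = rng.standard_normal((3, 8, 16))  # linear-branch filters
w_b = rng.standard_normal((3, 8, 16))  # gate-branch filters
h = gated_conv_block(x, w_a, w_b)
print(h.shape)  # (8, 16)
```

Stacking several such blocks and ending with a softmax over the tag vocabulary gives the supervised classifier.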
The Python code for the Gated CNN can be found in the pratyush-kumar-sinha/Keras repository on GitHub.
I used a modification that, puzzlingly, gives better results on the test set. Instead of training the model with one batch in each iteration, I trained with two batches in each iteration, so the weights are shared between the two batches.
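One reading of this modification, sketched here on a toy logistic-regression objective rather than the real model, is a single weight update computed from two batches at once, with the same weights serving both batches:

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grad(w, X, y):
    """Gradient of mean logistic loss for weights w on batch (X, y)."""
    p = sigmoid(X @ w)
    return X.T @ (p - y) / len(y)

# Toy data: two classes separated along the first feature.
X = rng.standard_normal((64, 5))
y = (X[:, 0] > 0).astype(float)

w = np.zeros(5)
lr = 0.5
for step in range(200):
    # Draw two batches per iteration; both gradients use the same weights w.
    i = rng.choice(len(X), 16)
    j = rng.choice(len(X), 16)
    g = grad(w, X[i], y[i]) + grad(w, X[j], y[j])
    w -= lr * g / 2.0

accuracy = np.mean((sigmoid(X @ w) > 0.5) == y)
print(accuracy)
```

Under this reading the update is equivalent to a doubled effective batch size, which may explain the smoother test-set behaviour.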
Step 4: Prediction
The prediction output looks like this.
Query is the input and QID is the query ID in the test set. tokens_y is the tokenized data. score 1 is the softmax score for the top prediction, and score 2 is the softmax score for the second prediction. f, na and m are the raw scores for predicting whether the query came from a female, carries not enough information, or came from a male.
Also, I did not train the gender prediction on all questions: I only trained it on a subsample where 'male' or 'female' was mentioned in the query, but I used it on every query at inference time. Some interesting cases follow:
The overall accuracy was 86% when I combined the top 2 softmax predictions.
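Combining the top 2 predictions means counting a query as correct when either of the two highest-softmax labels matches the truth. Sketched with hypothetical scores:

```python
import numpy as np

# Hypothetical softmax scores for 4 queries over 5 labels.
scores = np.array([
    [0.10, 0.60, 0.20, 0.05, 0.05],
    [0.30, 0.25, 0.20, 0.15, 0.10],
    [0.05, 0.05, 0.10, 0.70, 0.10],
    [0.40, 0.35, 0.10, 0.10, 0.05],
])
truth = np.array([2, 0, 3, 1])

# Indices of the two highest scores per row.
top2 = np.argsort(scores, axis=1)[:, -2:]
hits = np.any(top2 == truth[:, None], axis=1)
print(hits.mean())  # 1.0: every true label is in the top 2 here
```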
Step 5: Unsupervised matching of video titles
Pushing a list of video titles through the learnt model and comparing the distances between their feature maps and the question's gives the recommended list of relevant videos.
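Assuming the model exposes a feature vector per input, this matching step can be sketched as ranking titles by cosine similarity against the query's features; the featurizer here is a random stand-in, with the query built to resemble one title so the ranking is predictable.

```python
import numpy as np

def cosine_similarity(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def recommend(query_features, title_features, titles, k=2):
    """Rank video titles by feature-map similarity to the query."""
    sims = [cosine_similarity(query_features, f) for f in title_features]
    order = np.argsort(sims)[::-1][:k]
    return [titles[i] for i in order]

rng = np.random.default_rng(2)
titles = ["Yoga for back pain", "Managing diabetes with diet", "Sleep hygiene tips"]
title_features = rng.standard_normal((3, 32))  # stand-in for model feature maps

# Stand-in for the model's feature map of the user's question: we make the
# query resemble the second title's features so the top match is predictable.
query_features = title_features[1] + 0.1 * rng.standard_normal(32)

print(recommend(query_features, title_features, titles))
```

Because no video labels are needed, this step stays unsupervised: the supervised model is reused purely as a feature extractor.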