Original article was published on Deep Learning on Medium
Out of the 30-odd categories, each question is on average labelled with 3–5 categories. This imbalance makes it difficult to evaluate the accuracy of the system. For example, say there are 30 categories, and the system labels a question 1 for each category it belongs to and 0 otherwise. Given a question that belongs to only 5 categories, if the system correctly labels 4 of those 5 categories and also correctly labels the remaining 25 categories as 0, how accurate is this classification? Some possible ways to see this are:
1. 29 out of 30 labels are correct (96.7%),
2. out of the 5 that should have been labelled 1, 4 were correct (80%). This is also known as Recall,
3. out of the 4 labels that were labelled 1, all of them were correct (100%). This is also known as Precision.
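The three views above can be computed directly on this worked example. A minimal sketch in plain Python (the label vectors are just the hypothetical question from the text):

```python
# Worked example: a question belongs to 5 of 30 categories.
# The system predicts 4 of those 5 correctly and all 25 negatives correctly.
y_true = [1] * 5 + [0] * 25          # actual labels (5 positives)
y_pred = [1] * 4 + [0] * 26          # predictions (one positive missed)

correct = sum(t == p for t, p in zip(y_true, y_pred))
label_accuracy = correct / len(y_true)                  # 29/30 ≈ 96.7%

true_pos = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall = true_pos / sum(y_true)                         # 4/5 = 80%
precision = true_pos / sum(y_pred)                      # 4/4 = 100%
print(round(label_accuracy, 3), recall, precision)      # 0.967 0.8 1.0
```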
But if you use the conventional Accuracy metric, which demands an exact match across all 30 labels, this would be counted as a completely incorrect classification.
The first way mentioned above (the one with 96.7%) corresponds to the Hamming Loss, which looks at the fraction of labels that are wrong across the board (the 96.7% is its complement, the fraction that is correct). However, this metric is misleading when only 3–5 categories are labelled 1 on average. It means that even if we predicted all labels as 0, we would still score roughly 25/30 (83%) or higher. That may look fantastic, but it is not truly representative of what we want to evaluate, which is how good the system is at predicting the right categories to label as 1.
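The all-zeros failure mode is easy to demonstrate. A small sketch, using the same 5-of-30 label distribution as before:

```python
# An all-zero prediction still scores well on per-label accuracy
# when only 5 of 30 labels are 1 — which is why this metric misleads.
y_true = [1] * 5 + [0] * 25
y_pred = [0] * 30                      # predict "not in category" everywhere

hamming_loss = sum(t != p for t, p in zip(y_true, y_pred)) / len(y_true)
per_label_accuracy = 1 - hamming_loss  # 25/30 ≈ 83%
print(round(per_label_accuracy, 3))    # 0.833
```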
Precision and Recall resonate more with how we want to evaluate our system. Recall combats the temptation to label every category as 0: it looks at the categories that should actually be labelled 1 and checks how many of them were predicted as 1. Hence, if we predicted all 0's, the Recall would evaluate to 0%. On the other hand, Precision combats the flip side of labelling too many categories as 1: it looks at all labels predicted as 1 and checks how many of them should actually be 1. Hence, if we labelled all 30 categories as 1 when only 5 actually are, the Precision would evaluate to only 5/30 (16.7%). Thankfully for us, there is a metric called the F1-Score that makes use of both Recall and Precision.
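The two degenerate predictions described above each collapse exactly one of the metrics. A sketch of both cases:

```python
# Degenerate predictions expose what each metric punishes.
y_true = [1] * 5 + [0] * 25

def recall(y_true, y_pred):
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    return tp / sum(y_true)

def precision(y_true, y_pred):
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    predicted_pos = sum(y_pred)
    return tp / predicted_pos if predicted_pos else 0.0

all_zeros = [0] * 30                     # never label anything 1
all_ones = [1] * 30                      # label everything 1
print(recall(y_true, all_zeros))         # 0.0
print(round(precision(y_true, all_ones), 3))  # 0.167
```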
The F1-Score is the harmonic mean of Precision and Recall: F1 = 2 × (Precision × Recall) / (Precision + Recall). Only when both of these values are 100% is the F1-Score 100% as well. The converse is true too: if either Precision or Recall falls short, the F1-Score is compromised. As mentioned in the introduction, our system managed to achieve an F1-Score of 90%!
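The harmonic-mean behaviour is easy to verify, since a single weak input drags the whole score down:

```python
def f1_score(precision, recall):
    # Harmonic mean of the two: drops toward 0 whenever either input does.
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(f1_score(1.0, 1.0))              # 1.0 — only perfect P and R give a perfect F1
print(round(f1_score(1.0, 0.8), 3))    # 0.889 — one weaker value lowers the score
```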
Ease of Deployment
Programming is not a skillset that everyone has. Therefore, packaging the system in a way that makes it easier to use and understand seemed like an obvious step in our project. This required quite a bit of code cleaning and organising to fit everything into various forms of 'packaging'. I honestly enjoyed this process, as it felt pretty therapeutic. Anyway, here are 3 iterations of the 'packaging' for this auto-tagging system.
In this image, you can see how raw the code is in general; if you have no experience in programming, it probably looks mind-boggling. Even if you do know how to code, cleaning it up and organising it better would definitely help.
In this next image, the working code is consolidated into a single package/module, as you can see on the left. This package/module can be executed to tag questions using just a few lines of code, as seen on the right. Everything is so much cleaner and easier to use in general!
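As a rough sketch of what such a packaged interface might look like, here is a hypothetical wrapper. All names (`AutoTagger`, `predict`) are assumptions for illustration, not the article's actual code, and the keyword-matching logic is a stand-in for the real trained classifier:

```python
# Hypothetical sketch of a packaged auto-tagger interface.
# The matching logic below is a stand-in, NOT the article's real model.
class AutoTagger:
    """Wraps a multi-label classifier behind a simple, friendly API."""

    def __init__(self, categories):
        self.categories = categories

    def predict(self, question):
        # Stand-in logic: naive keyword matching instead of the real model.
        return [c for c in self.categories if c.lower() in question.lower()]

# A few lines of code are enough to tag a question.
tagger = AutoTagger(["Python", "Deployment", "Metrics"])
print(tagger.predict("How do I measure metrics in Python?"))
# ['Python', 'Metrics']
```

The point of this packaging step is exactly what the article describes: the caller never touches the raw code, only a small, readable interface.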