Original article was published by Himanshu Chandra on Artificial Intelligence on Medium
Quite often, we machine learning practitioners get swept away in the rush of applying different models to solve a certain problem. Statistical tests either take a back seat or come into play only when presenting end results to the client.
While certain metrics like ROC, PR curves & MCC help you fine-tune your results at the tail end of your project (read: How NOT to use ROC, Precision-Recall curves & MCC), certain other metrics might help you win new projects for your organization if used right at the start.
Today, I present a case study of how we typically bag new projects and how metrics like p-values, confidence intervals and concepts like binomial series help us gain customers’ trust.
The Case In Point
This particular project was one of computer vision for the manufacturing industry. The client produces medium-sized metallic parts which are visually inspected for cracks, dents, rust and a few other categories of defects at the end of the production line. An image-based automated solution was required to replace the current manual process, since their daily production volume was on the rise, currently around 1,500 parts per day.
Bids from multiple software vendors were invited, and we were one of them. As often happens, the client had a great record and expertise in the manufacturing domain, but not quite so much in the AI/ML field. The typical approach they planned to take was to review past projects, company size and commercial quotations, and try to quantify those into a general sense of confidence in a particular service provider.
We were not particularly happy about this approach, being a growing but still small company. We were, however, confident of our skills, and hence I proposed this to the client:
Why don’t we test our initial models, one month from now, on 1,000 parts and amongst all the vendors, see whose model performs the best?
This would become a PoC (Proof of Concept) and also help us estimate accurately for the longer engagement. By the time I proposed this, they had already narrowed the list down to us and one other competitor. They were, however, on board with the PoC idea, since it seemed more quantifiable than their current methods, and asked us and the competitor to get started on the initial classification model.
How do you confidently say one model is better than another based on just one classification run over 1,000 samples? After all, this model would be run on 350,000+ parts per year. We were questioned:
Is such a ‘sample run’ reliable enough for comparison?
Also, when we say ‘better’, what metric are we considering here? Is it accuracy, precision, recall, specificity, AUC or something else?
The second question is easier to answer, so let’s start with that. Our client was very clear that precision is the metric they track, so there was nothing further to discuss there. For a quick review of the confusion matrix and the associated metrics, read –
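As a quick refresher, the metrics mentioned above all fall out of the four cells of the confusion matrix. A minimal sketch, using made-up counts from a hypothetical 1,000-part inspection run (defective = positive class):

```python
# Hypothetical counts for illustration only -- not from the actual PoC.
tp, fp, fn, tn = 80, 20, 10, 890

precision = tp / (tp + fp)    # of parts flagged defective, how many truly were
recall = tp / (tp + fn)       # of truly defective parts, how many we caught
specificity = tn / (tn + fp)  # of good parts, how many we correctly passed

print(round(precision, 3), round(recall, 3), round(specificity, 3))
# 0.8 0.889 0.978
```

With these counts, a precision of 0.8 means one in five parts flagged as defective is actually fine, which is exactly the kind of number the client cared about.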
To answer the first question, we first set an expectation: a confidence level, 95% in this case. We told them that whatever the result of comparing the two models might be, we would be 95% confident that it was not erroneous. We could easily have chosen 90%, 99% or any other number, but they were fine with 95% at the PoC stage.
Now, to compare classification models there are a few statistical methods; McNemar’s paired test is a commonly used one. However, the test only tells you whether the models are different. Also, it compares the mismatch in predictions, and does not work with precision/recall/specificity per se. We needed something more intuitive, which the client could trust too when explained in day-to-day terms.
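For reference, the exact form of McNemar’s test looks only at the discordant pairs, i.e. the parts where the two models disagree. A minimal from-scratch sketch (counts are hypothetical, chosen for illustration):

```python
from math import comb

def mcnemar_exact(b, c):
    """Exact McNemar's test on the discordant pairs:
    b = samples model A got right and model B got wrong,
    c = samples model B got right and model A got wrong.
    Under the null hypothesis (no difference between models),
    disagreements split 50-50, so b ~ Binomial(b + c, 0.5).
    Returns a two-sided p-value; a small p suggests the models differ."""
    n = b + c
    k = min(b, c)
    p = 2 * sum(comb(n, i) for i in range(k + 1)) / 2**n
    return min(1.0, p)

# Hypothetical disagreement counts from a 1,000-part run:
print(round(mcnemar_exact(9, 1), 4))  # 0.0215 -> models likely differ
```

Note that even a significant result here says nothing about *which* model has better precision, which is why we looked elsewhere.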
Binomial Series To The Rescue
Imagine this: Suppose you flipped a coin and got a head. You flipped again and got a head again. Then a tail. Then a tail…
After 20 flips, you saw that you had 7 heads and 13 tails. If I asked you to be 95% confident in your statement, would you say that the coin is a fair one, or is it biased towards tails? After all, one knows from experience that more often than not, you do not get an exact 50–50 split of heads-tails in 20 flips. That does not mean the coin is biased.
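The intuition above can be checked with an exact binomial computation, sketched here with Python’s standard library: what is the probability of seeing 13 or more tails in 20 flips of a genuinely fair coin?

```python
from math import comb

# One-sided binomial test: P(at least 13 tails in 20 fair flips)
n, tails = 20, 13
p_value = sum(comb(n, k) for k in range(tails, n + 1)) / 2**n
print(round(p_value, 4))  # 0.1316
```

A fair coin produces 13+ tails about 13% of the time, well above the 5% threshold implied by our 95% confidence level, so 7–13 is no evidence of bias.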
Alternatively put, how many tails would you want out of 20 flips, before saying that the coin seems to be favouring tails?
Or that the probability of tail performing better than head is significantly higher?
The answer to the above question is this:
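One way to compute it, assuming a one-sided test at our 95% confidence level, is to find the smallest tail count whose binomial p-value drops below 5%. A minimal sketch:

```python
from math import comb

def min_tails_for_bias(n=20, alpha=0.05):
    """Smallest number of tails out of n fair flips whose one-sided
    binomial p-value is at or below alpha, with that p-value."""
    for k in range(n + 1):
        p = sum(comb(n, i) for i in range(k, n + 1)) / 2**n
        if p <= alpha:
            return k, p

print(min_tails_for_bias())  # (15, ~0.0207)
```

At 14 tails the p-value is still about 0.058, so with 95% confidence you would want at least 15 tails out of 20 before calling the coin biased.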