Machine Learning (ML) Models: From Science to Industry Quality Product Engineering: Part 1 — Motivation
The science and engineering of turning the “cool new AI machines” into robust, customer-facing, industry quality products
In an interview with the Wall Street Journal (Oct 2017), Intel’s CEO Brian Krzanich likened the impact of Artificial Intelligence (AI) based systems to the disruptive changes brought about by the internet during the 90s. Back then, experts observed that if you were not an internet company, you would not be around for long. The same is becoming true for AI or Machine Learning (ML) based products today, because ML applies to every business and is therefore a horizontal technology. For this reason, Prof Andrew Ng rightly termed AI the new electricity. In contemporary research, Machine Learning (ML), and more importantly Deep Learning (DL), is the primary approach to building intelligent systems. Hence in this article we use the terms Artificial Intelligence (AI), Machine Learning (ML) and Deep Learning (DL) interchangeably, without loss of generality for the kind of problem we discuss below.
Today, data scientists and the ML research community focus on developing new architectures and techniques aimed at two broad goals:
(a) Match and, where possible, exceed human-level performance on cognitive tasks such as image recognition, natural language understanding and so on. For instance, results from the ImageNet competitions indicate that deep architectures such as Microsoft’s ResNet, with over 150 layers, outperform humans on that benchmark. Experiments on question answering systems based on Stanford’s SQuAD dataset also report higher than human accuracy levels on that dataset.
(b) Come up with architectures, design patterns, tools and techniques that enable novel applications where machine learning was not applied before. For example, an application that involves multimodal fusion (Computer Vision and Natural Language Processing) is image captioning, where given an image we generate a text description of it. The recently published papers on Capsule Networks from Geoff Hinton’s research team are aimed at building more scalable and performant systems, especially for Computer Vision applications.
Today, ML based approaches are finding their way into industry across verticals. Applications such as fraud detection systems, chat agents, sentiment analyzers and so on have become well adopted. However, are we ready yet to deploy these techniques for mission critical applications? When can I enjoy driving through the Bangalore traffic in my self-driving car? Trusting an ML classifier to perform mission critical or industry quality tasks requires a quality assurance procedure that verifies the product quality across all applicable dimensions. We need clear criteria for which errors are acceptable and which are “show stoppers”.
“If we have a face recognition application that has a 99% accuracy but classifies the image of Narendra Modi as that of Donald Trump, will we sign off on the classifier as ship ready?”
It depends on the mission criticality of the application as governed by our quality plan. For a social networking application such a misclassification might be acceptable, because statistically the classifier is still doing pretty well despite a few errors that look ugly when examined case by case. But for more serious applications certain kinds of errors are simply unacceptable. See the image below (courtesy Los Angeles Times), which shows a snapshot of a self-driving car crash, with the manufacturer referring to the “owner’s manual” on when to run the car in self-driving mode!
The key question, however, is this: given the rapid advances and excitement in machine learning research, how can the software development community “engineer” these into robust, industry grade products? More specifically, we may ask: how do the traditional software development processes apply to the development of an ML based product?
Building industry grade/scale products that leverage ML is of great interest to the product management community. Product management can look at this technology through the eyes of the customer and define the right business problems to focus on. There is an excellent series of blog posts on this subject by Yael Gavish on Medium. My objective in this series of articles is to go deeper into the engineering aspects of product development.
In this post (with a few more follow-up articles), I intend to discuss the systemic aspects of building robust products that meet the quality levels required by the industry, viewing this through the prism of software engineers and engineering managers. The scope of this article is the critical aspects of setting and verifying product quality goals, without regard to the specific machine learning technique used to realize them.
Last week I was reviewing the demo of a newly developed classifier for a company. The developers, who were building a classifier for the first time, did a great job architecting the model and tweaking the hyper-parameters, and were rightfully very happy with what they had done. The validation accuracy was around 93%, which was quite remarkable for the kind of problem the classifier was targeting. To keep the discussion simple, I am using the term “accuracy” as a single indicator of the performance of the system; in the project we review many other critical metrics (e.g. Precision/Recall/F1). Naturally, we wanted to announce the product availability to the product managers and the senior management and tell them that we were done and ready for release.
But are we?
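One reason for caution is that a single accuracy number can hide a great deal. The sketch below is a minimal illustration, not the project's actual evaluation code: the function name `binary_metrics` and the made-up data (93 negatives, 7 positives, and a model that always predicts the negative class) are assumptions chosen only to show how a 93% accuracy can coexist with zero recall.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Return accuracy, precision, recall and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)

    accuracy = correct / len(pairs)
    # Conventions assumed here: a metric is reported as 0.0 when its
    # denominator is zero (no predicted or no actual positives).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Made-up, imbalanced data: the "model" never predicts the positive class.
y_true = [0] * 93 + [1] * 7
y_pred = [0] * 100

metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

Here the accuracy looks respectable while the classifier never finds a single positive example, which is why the release decision should not rest on accuracy alone.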
In traditional software product development we always run our unit tests, system tests, regression tests and so on, and ensure that there are no high severity bugs before we qualify a release. There may be additional goals for code coverage and other white box testing metrics. For UI intensive products, we run usability tests and make sure that the UI is intuitive and that performance and throughput are acceptable. We set the quality goals before development starts and hit those goals before we declare the product released. In a similar manner, we need a formal procedure that we can execute to verify ML based products.
Defining such a procedure is not that straightforward though. For instance, what is a “bug” in the context of a machine learning model? Would we treat every incorrect prediction by the classifier as a “bug” and file a bug report, or should we assume that the statistical accuracy numbers are adequate to guarantee robustness? How do we assign the “severity” of a bug? What is the equivalent of white box testing, where we test the internal workings of an ML model as opposed to looking only at the inputs and outputs? We can extrapolate this line of thinking and keep asking questions around the development process.
As I was working on this consulting assignment, I proposed a simple framework for the developers to follow in order to sign off and certify the product release. Some aspects of my work are project specific, so I am sharing only the generic aspects of the approach in this post. While the software development process includes topics such as the choice of development life cycle, project planning/scheduling, configuration management and so on, our focus in this post is on setting and verifying product quality goals for an ML based product.
As stated earlier, product quality is multidimensional. It is important to measure the quality of the features, and we also need to be concerned about usability, performance and so on. The framework I used to define, set and verify the product quality goals is based on FURPS, originally proposed and practiced by Hewlett-Packard. The acronym stands for Functionality, Usability, Reliability, Performance and Supportability. Later this model was extended to include Localization, with the acronym changed to FLURPS.
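One possible way to make the “severity” question concrete is to let the quality plan tag specific confusions as more serious than others, and to block a release on those regardless of the overall accuracy. The following is a minimal sketch under that assumption; the severity table, the class labels and the function names `triage` and `ship_ready` are all hypothetical and only illustrate the idea.

```python
from collections import Counter

# Hypothetical severity table taken from a quality plan:
# (true label, predicted label) -> severity of that confusion.
SEVERITY = {
    ("pedestrian", "background"): "critical",
    ("person_a", "person_b"): "major",
}
DEFAULT_SEVERITY = "minor"  # assumed severity for any unlisted confusion


def triage(y_true, y_pred):
    """Count misclassifications per severity level."""
    counts = Counter()
    for t, p in zip(y_true, y_pred):
        if t != p:
            counts[SEVERITY.get((t, p), DEFAULT_SEVERITY)] += 1
    return counts


def ship_ready(y_true, y_pred, min_accuracy, blocking=("critical",)):
    """Pass only if accuracy meets the goal and no blocking error occurred."""
    errors = triage(y_true, y_pred)
    accuracy = 1 - sum(errors.values()) / len(y_true)
    has_blocker = any(errors[level] > 0 for level in blocking)
    return accuracy >= min_accuracy and not has_blocker


# 99% accurate, but the single error is a "show stopper".
truth_a = ["car"] * 99 + ["pedestrian"]
preds_a = ["car"] * 99 + ["background"]
result_a = ship_ready(truth_a, preds_a, min_accuracy=0.95)

# 98% accurate, with two errors that the plan treats as minor.
truth_b = ["car"] * 98 + ["truck", "truck"]
preds_b = ["car"] * 100
result_b = ship_ready(truth_b, preds_b, min_accuracy=0.95)

print(result_a, result_b)
```

In this sketch the first classifier is rejected despite its higher accuracy, while the second one passes, which mirrors the distinction between acceptable errors and show stoppers.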
Are there any other dimensions of product quality that are special to a Machine Learning product and are not addressed by the FURPS model? The answer turns out to be yes. Let us illustrate one such case. Consider a situation where our customer needs to understand how a classifier arrived at a decision. This calls for a quality goal that can be termed “Explainability”. A high explainability goal requires the product to be able to explain every decision it makes. Such a goal often constrains the kind of ML technique that can be used to build the product. It also means that the classifier can’t be treated as a black box, and the system may need to be instrumented adequately in order to reveal its internal workings. The explainability aspects must be testable too, and hence this dimension is related to testability/maintainability.
In summary, we need to acknowledge the need for an industrial approach when building applications based on ML. Such an approach defines the quality goals upfront. The definition and choice of quality goals is a non-trivial exercise. The quality dimensions and the verification procedure for ML based products need to be thought through differently, primarily because ML models learn from data, and hence the quality of the classifier is strongly affected by the quality of the dataset it learns from.
In the next article we will look in detail at how to map these quality dimensions from the traditional software product development universe to the domain of Machine Learning. I will also describe how to augment FURPS with other dimensions that are specific to ML.
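To show what setting goals upfront and verifying them could look like, here is a small sketch that records a few quality goals, including an explainability goal, and checks measured values against them. The class `QualityGoal`, the metric names, the targets and the measurements are all hypothetical placeholders, not figures from the project described here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QualityGoal:
    dimension: str                 # e.g. a FURPS dimension or "Explainability"
    metric: str                    # hypothetical metric name
    target: float                  # hypothetical target value
    higher_is_better: bool = True

    def is_met(self, measured: float) -> bool:
        if self.higher_is_better:
            return measured >= self.target
        return measured <= self.target


# Illustrative goals only; a real quality plan would define its own.
GOALS = [
    QualityGoal("Functionality", "validation_accuracy", 0.90),
    QualityGoal("Performance", "p95_latency_ms", 200.0, higher_is_better=False),
    QualityGoal("Explainability", "fraction_of_decisions_explained", 1.0),
]


def unmet_goals(goals, measurements):
    """Return the metrics that are missing or fall short of their target."""
    return [
        g.metric
        for g in goals
        if g.metric not in measurements or not g.is_met(measurements[g.metric])
    ]


# Made-up measurements from a candidate release.
measurements = {
    "validation_accuracy": 0.93,
    "p95_latency_ms": 150.0,
    "fraction_of_decisions_explained": 0.8,
}

failed = unmet_goals(GOALS, measurements)
print(failed)
```

With these made-up numbers the accuracy and latency goals are met, but the release would still be held back because the explainability goal is not.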
Thanks much for reading the article, looking forward to your comments and encouragement. I will write one more soon!
- “Companies Must Use AI — or Else”, WSJ Article, https://www.wsj.com/articles/companies-must-use-aior-else-1508811121
- “Artificial Intelligence is the new electricity”, Medium Blog by Andrew Ng, https://medium.com/@Synced/artificial-intelligence-is-the-new-electricity-andrew-ng-cc132ea6264
- “Deep Residual Learning for Image Recognition”, Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun, https://arxiv.org/abs/1512.03385
- “SQUAD — The Stanford Question Answering Dataset”, https://rajpurkar.github.io/SQuAD-explorer/
- “Dynamic Routing Between Capsules”, Sara Sabour, Nicholas Frosst, Geoffrey E Hinton, https://arxiv.org/abs/1710.09829
- The Step-By-Step PM Guide to Building Machine Learning Based Products, Yael Gavish, https://medium.com/@yaelg/product-manager-pm-step-by-step-tutorial-building-machine-learning-products-ffa7817aa8ab
- FURPS, Wikipedia article, https://en.wikipedia.org/wiki/FURPS
Source: Deep Learning on Medium