Empowering Citizen Data Scientists with Watson AutoAI

Original article was published on Artificial Intelligence on Medium

Leveraging AI to automate machine learning activities and accelerate model lifecycle management

Why do we need to care about AI automation?

One of the most significant developments in data science in recent years has been the rise of automated AI and AutoML solutions, which empower business analysts and IT professionals to perform complex machine learning activities with little or no data science coding experience. AutoAI also empowers data scientists to work on projects faster and more efficiently by using AI automation to accomplish key machine learning tasks. The data science community has adopted the concept of

Recent statistics show that, despite the explosion in demand for machine learning and data science roles, 50% of data scientist respondents on Kaggle said they had less than 2 years of experience with ML methods. The same holds for coding experience.

How long have you used machine learning methods?

By delivering automatic feature engineering, model validation, model tuning, model selection and deployment, machine learning interpretability, time-series support, and automatic pipeline generation for model scoring, AutoAI provides companies with an extensible, customizable data science platform that addresses the needs of a variety of use cases across industries.

Feature engineering is one of the strongest tools that advanced data scientists use to extract the most accurate results from algorithms. AutoAI employs a suite of algorithms and feature transformations to automatically engineer new, high-value features for a given dataset.

Andrew Ng has introduced the idea of automated AI in feature engineering

“Through 2020, the number of citizen data scientists will grow five times faster than the number of expert data scientists. Organizations can use citizen data scientists to fill the data science and machine learning talent gap caused by the shortage and high cost of data scientists.”

Source: Gartner: Top 10 strategic technology trends for 2019, Oct. 2018

The AutoAI tool in Watson Studio automatically analyzes your data and generates candidate model pipelines customized for your predictive modeling problem. These model pipelines are created iteratively as AutoAI analyzes your dataset and discovers data transformations, algorithms, and parameter settings that work best for your problem setting. Results are displayed on a leaderboard, showing the automatically generated model pipelines ranked according to your problem optimization objective.

AI designing AI

AI automation can change the way business processes work. Neural networks and machine learning algorithms are arguably the most powerful tools currently available to data scientists. However, only a small proportion of data scientists have the skills and experience needed to create a high-performance neural network from scratch, and demand for them far exceeds supply.

As a result, most enterprises struggle to quickly and effectively get to a new neural network that is architecturally custom-designed to meet the needs of their particular applications, even at the proof-of-concept stage. Thus, technologies that bridge this skills gap by automatically designing the architecture of neural networks for a given data set are increasingly gaining importance.

Revolution of depth in Neural Networks

AI optimizing AI

Using AI for the design and optimal performance of AI models brings a new and much-needed capability to the development of AI technologies. For example, manual parameter tuning of complex networks can be time-consuming and error-prone, and it may not scale with the time and resources available. Neural networks continue to grow in size and complexity, so it is imperative to automate optimal parameter selection to ensure that the machine learning process generates accurate predictive results.

Neural Network Hyperparameter optimization with AutoAI

AI governing AI

60% of companies see regulatory constraints as a barrier to implementing efficient, automated AI policies. Without expensive data science resources hand-holding multiple AI models in a production application, organizations face several problems:

1. No way to validate whether AI models are compliant with regulations and will achieve expected business outcomes before deploying

2. Difficult to track and measure indicators of business success in production

3. Resource-intensive and unreliable processes for ongoing business monitoring and compliance

4. Impossible for business users to feed subtle domain knowledge back into the model lifecycle

AutoAI Functionalities

1. Automated data preparation

Most data sets contain different data formats and missing values, yet most standard machine learning algorithms cannot work with missing values. AutoAI applies various algorithms, or estimators, to analyze, clean, and prepare raw data for machine learning. It automatically detects and categorizes features based on data type, such as categorical or numerical. One of the most important requirements is variable scaling, which puts the variables on a comparable scale and helps reduce machine learning bias. Depending on the categorization, AutoAI uses hyperparameter optimization to determine the best combination of strategies for missing value imputation, feature encoding, and feature scaling for your data.
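To make the three preparation steps concrete, here is a minimal pure-Python sketch of type detection, missing value imputation, feature scaling, and feature encoding. It is an illustration only and not AutoAI's implementation: the function names, the toy table, and the fixed choices of median imputation, min-max scaling, and one-hot encoding are assumptions made for this example, whereas AutoAI searches over such strategies automatically.

```python
from statistics import median

def impute_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]

def min_max_scale(values):
    """Rescale numeric values to the [0, 1] range."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def one_hot(values):
    """Encode a categorical column as one 0/1 indicator list per category."""
    categories = sorted(set(values))
    return {c: [1 if v == c else 0 for v in values] for c in categories}

def prepare(table):
    """Detect each column's type, then impute and scale or encode it."""
    prepared = {}
    for name, column in table.items():
        observed = [v for v in column if v is not None]
        if all(isinstance(v, (int, float)) for v in observed):
            # Numerical feature: fill gaps, then put it on a common scale.
            prepared[name] = min_max_scale(impute_median(column))
        else:
            # Categorical feature: a missing value becomes its own category.
            filled = ["missing" if v is None else v for v in column]
            for category, indicator in one_hot(filled).items():
                prepared[f"{name}={category}"] = indicator
    return prepared

# Hypothetical toy table with one numerical and one categorical column.
raw = {
    "age": [22, None, 38, 30],
    "embarked": ["S", "C", None, "S"],
}
clean = prepare(raw)
print(clean["age"])         # [0.0, 0.5, 1.0, 0.5]
print(clean["embarked=S"])  # [1, 0, 0, 1]
```

After this step every column is numeric and free of missing values, which is what most standard algorithms expect as input.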

2. Automated feature engineering

Feature engineering attempts to transform the raw data into the combination of features that best represents the problem to achieve the most accurate prediction. AutoAI uses a unique approach that explores various feature construction choices in a structured, non-exhaustive manner, while progressively maximizing model accuracy using reinforcement learning. This results in an optimized sequence of transformations for the data that best match the algorithms of the model selection step.

AutoML Demo in IBM Watson Studio. Source: IBM

3. Hyperparameter optimization

The hyperparameter optimization step refines the best-performing model pipelines. AutoAI uses a novel hyperparameter optimization algorithm designed for costly function evaluations, such as the model training and scoring that are typical in machine learning. This approach enables fast convergence to a good solution despite the long evaluation time of each iteration.
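The idea of searching hyperparameters under a limited number of expensive evaluations can be sketched as follows. This example uses plain random search with a fixed evaluation budget, which is a much simpler technique than the algorithm AutoAI uses and is shown only to illustrate the budgeted search loop. The scoring function, the parameter names, and the search ranges are hypothetical stand-ins for real model training.

```python
import random

def expensive_score(params):
    """Stand-in for training and scoring a model; higher is better."""
    lr, depth = params["learning_rate"], params["max_depth"]
    return -100 * (lr - 0.1) ** 2 - 0.01 * (depth - 6) ** 2

def random_search(score_fn, space, budget, seed=0):
    """Evaluate `budget` random configurations and keep the best one."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(budget):
        params = {
            "learning_rate": rng.uniform(*space["learning_rate"]),
            "max_depth": rng.randint(*space["max_depth"]),
        }
        score = score_fn(params)  # the costly step we want to call rarely
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Hypothetical search space: (low, high) bounds per hyperparameter.
space = {"learning_rate": (0.01, 0.5), "max_depth": (2, 12)}
best_params, best_score = random_search(expensive_score, space, budget=30)
print(best_params, best_score)
```

Methods built for costly evaluations typically replace the random sampling step with a model that predicts which configuration is worth trying next, so that fewer training runs are needed to reach a good solution.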

4. Automated model selection

The next step is automated model selection that matches your data. AutoAI uses a novel approach that enables testing and ranking candidate algorithms against small subsets of the data, gradually increasing the size of the subset for the most promising algorithms to arrive at the best match. This approach saves time without sacrificing performance. It enables ranking a large number of candidate algorithms and selecting the best match for the data.
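The subset-based ranking described above can be sketched in a few lines. The following is a simplified illustration under stated assumptions, not AutoAI's actual algorithm: three toy regressors are scored on a small training subset, the weaker half is dropped, and the subset is doubled for the survivors. The candidate models, the synthetic data, and the halving schedule are all hypothetical choices for this example.

```python
import random

def fit_mean(xs, ys):
    """Baseline: always predict the mean target."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y

def fit_linear(xs, ys):
    """Ordinary least squares for a single feature."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    return lambda x: mean_y + slope * (x - mean_x)

def fit_nearest(xs, ys):
    """1-nearest-neighbour regression."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]

def rmse(model, xs, ys):
    return (sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)) ** 0.5

def select_model(candidates, train, holdout, start_size=20):
    """Rank candidates on growing subsets, keeping the better half each round."""
    xs, ys = train
    survivors = dict(candidates)
    size = start_size
    while True:
        size = min(size, len(xs))
        scores = {name: rmse(fit(xs[:size], ys[:size]), *holdout)
                  for name, fit in survivors.items()}
        ranked = sorted(scores, key=scores.get)  # lowest RMSE first
        if len(ranked) == 1 or size == len(xs):
            return ranked[0], scores
        keep = (len(ranked) + 1) // 2
        survivors = {name: survivors[name] for name in ranked[:keep]}
        size *= 2  # promising candidates get more data

# Synthetic data: a noisy linear relationship (assumption for the demo).
rng = random.Random(0)
xs = [rng.uniform(0, 10) for _ in range(300)]
ys = [2 * x + 1 + rng.gauss(0, 1) for x in xs]
train, holdout = (xs[:200], ys[:200]), (xs[200:], ys[200:])
candidates = {"mean": fit_mean, "linear": fit_linear, "nearest": fit_nearest}
winner, final_scores = select_model(candidates, train, holdout)
print(winner, final_scores)
```

Weak candidates are discarded after seeing only a few rows, so the full training cost is paid only for the algorithms that are most likely to win.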

Model Pipeline Comparison on the Titanic Dataset. Source: IBM

IBM’s Strategy for AI Automation of AI Development

1. Transfer Learning

Transfer learning is an important piece of many deep learning applications now and in the future. This is predominantly due to the scale of training production deep learning systems; they’re huge and require significant resources.

A recent paper from the University of Massachusetts Amherst found that in a production deep learning application focused on natural language processing, more than 200 million weights had to be trained with the available training data. Training a network of this size on a cluster of graphics processing units (GPUs) emitted about as much CO2 as five average U.S. vehicles over their individual lifetimes. Many pre-trained models are available across a variety of platforms and tasks, such as MobileNet and YOLO for TensorFlow.

2. Neural Network Architecture Search

Neural architecture search (NAS) is only one component of the automation pipeline; it aims to find suitable architectures for training a deep learning model. This search is itself a computationally intensive task and has received overwhelming interest from the deep learning community.

Consequently, there has been an upsurge in the development of neural architecture search methods leaving the field with lots of competing choices with little to no consolidation of the developed methods and lack of guidelines to help a practitioner with choosing appropriate methods. We address this gap in our survey with an extremely thorough analysis of the existing landscape. We provide a formalism which unifies the vast pool of existing methods and critically examine the different approaches. This clearly highlights the benefits of different components that contribute to the design and success of neural architecture search and along the way also sheds light on some misconceptions in the current trends of architecture search.

Neural Architecture Search in Reinforcement Learning. Source: IBM

3. Model Pipeline Optimization & Deployment

While AutoAI is generating the models, you can follow the progress of the pipelines through two different views: the progress map and the relationship map, as seen in the following images. In this example, AutoAI has chosen XGB, Random Forest, and Decision Tree classifiers as the top-performing algorithms for the use case. After data preprocessing, AutoAI identifies the top three performing algorithms, and for each of them it generates the following pipelines: automated model selection (Pipeline 1), hyperparameter optimization (Pipeline 2), automated feature engineering (Pipeline 3), and hyperparameter optimization (Pipeline 4).

The relationship map, showing the relations between each of these pipelines. Source: IBM

Each model pipeline is scored on a variety of metrics and then ranked. The default ranking metric is the area under the ROC curve for binary classification models, accuracy for multi-class classification models, and the root mean-squared error (RMSE) for regression models. The highest-ranked pipelines are displayed in a leaderboard, so you can view more information about them. The leaderboard also provides the option to save selected model pipelines after reviewing them.
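The three default metrics and the ranking step can be sketched in pure Python as follows. This is a minimal illustration, not Watson Studio code: the function names, the `DEFAULT_METRIC` table, and the example pipeline outputs are assumptions made for this sketch. Note that ROC AUC and accuracy are ranked from highest to lowest, while RMSE is ranked from lowest to highest.

```python
def roc_auc(y_true, scores):
    """Chance that a random positive is scored above a random negative."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean-squared error between targets and predictions."""
    return (sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)) ** 0.5

# Default metric per problem type, plus whether a higher value is better.
DEFAULT_METRIC = {
    "binary": (roc_auc, True),
    "multiclass": (accuracy, True),
    "regression": (rmse, False),
}

def leaderboard(problem_type, y_true, pipeline_outputs):
    """Score every pipeline with the default metric and sort best-first."""
    metric, higher_is_better = DEFAULT_METRIC[problem_type]
    scored = [(name, metric(y_true, out)) for name, out in pipeline_outputs.items()]
    return sorted(scored, key=lambda item: item[1], reverse=higher_is_better)

# Hypothetical predicted scores from two pipelines on four labelled rows.
y_true = [0, 0, 1, 1]
outputs = {
    "Pipeline 1": [0.1, 0.4, 0.35, 0.8],
    "Pipeline 2": [0.2, 0.3, 0.6, 0.9],
}
print(leaderboard("binary", y_true, outputs))
# [('Pipeline 2', 1.0), ('Pipeline 1', 0.75)]
```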

Pipeline Leaderboard. Source: IBM

Disclaimer: Part of the views expressed here are those of the article’s author(s) and may or may not represent the views of IBM Corporation. Part of the content on the blog is copyright and all rights are reserved — but, unless otherwise noted-under IBM Corporation (e.g. photos, images).