Original article was published on Artificial Intelligence on Medium
The Machine Learning Life Cycle
While there are many variations of the machine learning life cycle, they generally share four broad phases: planning, data, modeling, and production.
The first phase is planning. Before you start any machine learning project, there are a number of tasks to work through. By completing them, you’ll develop a better understanding of the problem that you’re trying to solve and can make a more informed decision on whether to proceed with the project or not.
Planning includes the following tasks:
- State the problem that you are trying to solve. This may seem like an easy step, but you’d be surprised at how often people try to come up with a solution to a problem that doesn’t exist or a problem that isn’t really a problem.
- Define the business objective that you are trying to achieve in order to solve the problem. The objective should be measurable. “Being the best company in the world” is not a measurable objective, but something like “Decrease fraudulent transactions” is, because you can count fraudulent transactions before and after.
- Determine the target variable if applicable and potential feature variables that you may want to look at. For example, if the objective is to decrease the number of fraudulent transactions, you’ll most likely want labelled data of both fraudulent and non-fraudulent transactions. You may also require features like the time of the transaction, the account ID, and the user’s ID.
- Consider any limitations, contingencies, and risks. This includes, but is not limited to, resource limitations (lack of capital, employees, or time), infrastructure limitations (e.g. lack of computing power to train a complex neural network), and data limitations (unstructured data, lack of data points, uninterpretable data, etc.).
- Establish your success metrics. How will you know that you’ve been successful in achieving your objective? Is it a success if your machine learning model is 90% accurate? What about 85%? Is accuracy the most suitable metric for your business problem? Check out my article on several metrics that data scientists use to evaluate their models.
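To see why accuracy alone may not be the most suitable metric, consider the fraud example. The sketch below is only an illustration: the labels are made up (2 fraudulent transactions out of 100), and the "model" is a hypothetical one that never flags fraud. The helper names are assumptions chosen for this example.

```python
# Made-up labels for illustration only: 1 = fraudulent, 0 = legitimate.
y_true = [1] * 2 + [0] * 98
# A hypothetical "model" that never flags a transaction as fraud.
y_pred = [0] * 100


def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def accuracy(y_true, y_pred):
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    return (tp + tn) / (tp + fp + fn + tn)


def recall(y_true, y_pred):
    """Share of actual fraud cases that the model caught."""
    tp, _, fn, _ = confusion_counts(y_true, y_pred)
    return tp / (tp + fn) if (tp + fn) else 0.0


print(accuracy(y_true, y_pred))  # 0.98
print(recall(y_true, y_pred))    # 0.0
```

The model is 98% accurate yet catches none of the fraud, which is why a metric such as recall may match the business objective better here.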
If you complete this phase and are confident in the project, you can move on to the next one.
The second phase, data, is focused on acquiring, exploring, and cleaning your data. More specifically, it includes the following tasks:
- Collect and consolidate the data that you specified in the planning phase. If you’re obtaining data from multiple sources, you’ll need to merge the data into a single table.
- Wrangle your data. This entails cleaning and converting your data to make it more suitable for EDA and modeling. Some things that you’ll want to check include missing values, duplicate data, and noise.
- Conduct exploratory data analysis (EDA). Also known as data exploration, this step is done essentially so that you can better understand your dataset. If you’d like to learn more about EDA, you can read my guide on conducting exploratory data analysis.
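The three tasks above can be sketched in a few lines. This is a dependency-free illustration using only the standard library; in practice a library such as pandas is commonly used instead. The two sources, their column names, and all values are made up for the example, and filling missing amounts with the median is just one possible choice.

```python
from statistics import mean, median

# Two hypothetical sources (made-up values), both keyed by transaction_id.
transactions = [
    {"transaction_id": 1, "amount": 25.0},
    {"transaction_id": 2, "amount": None},   # missing value
    {"transaction_id": 3, "amount": 310.5},
    {"transaction_id": 3, "amount": 310.5},  # duplicate row
    {"transaction_id": 4, "amount": 12.0},
]
labels = [
    {"transaction_id": 1, "is_fraud": 0},
    {"transaction_id": 2, "is_fraud": 0},
    {"transaction_id": 3, "is_fraud": 1},
    {"transaction_id": 4, "is_fraud": 0},
]

# 1. Collect and consolidate: join the two sources into a single table.
label_by_id = {row["transaction_id"]: row["is_fraud"] for row in labels}
merged = [
    {**row, "is_fraud": label_by_id[row["transaction_id"]]}
    for row in transactions
    if row["transaction_id"] in label_by_id
]

# 2. Wrangle: drop exact duplicate rows, keeping the first occurrence.
seen = set()
deduped = []
for row in merged:
    key = tuple(sorted(row.items()))
    if key not in seen:
        seen.add(key)
        deduped.append(row)

# Fill missing amounts with the median of the known amounts.
known = [r["amount"] for r in deduped if r["amount"] is not None]
fill_value = median(known)
clean = [
    {**r, "amount": fill_value if r["amount"] is None else r["amount"]}
    for r in deduped
]

# 3. Explore: a few summary numbers to get a feel for the dataset.
amounts = [r["amount"] for r in clean]
fraud_rate = mean(r["is_fraud"] for r in clean)
print(len(clean), min(amounts), max(amounts), fraud_rate)
```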
Once your data is ready to go, you can move on to building your model. There are three main steps to this:
- Select your model: The model that you choose ultimately depends on the problem that you are trying to solve. For example, regression and classification problems call for different modeling methods. If you’d like to learn about the various machine learning models, check out my article ‘All Machine Learning Models Explained in 6 Minutes.’
- Train your model: Once you’ve selected your model and split your dataset into training and testing sets, you can train your model with your training data.
- Evaluate your model: When you feel that your model is complete, you can evaluate it on the testing data using the success metrics that you established during planning.
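As a minimal sketch of these three steps, the example below splits a tiny dataset, trains a model on the training portion, and evaluates it on the held-out portion. Everything here is an assumption for illustration: the data is made up and deliberately easy to separate, and the "model" is a simple one-feature threshold rule standing in for whichever model you actually select.

```python
import random

# Made-up, deliberately separable data: (amount, is_fraud).
legit = [(a, 0) for a in (10, 15, 20, 25, 30, 40, 50, 60, 75, 90)]
fraud = [(a, 1) for a in (400, 420, 450, 480, 500, 520, 550, 600, 650, 700)]
data = legit + fraud

# Split before training so that the testing data stays unseen.
rng = random.Random(42)  # fixed seed for a reproducible shuffle
rng.shuffle(data)
split = int(0.75 * len(data))
train, test = data[:split], data[split:]


def predict(threshold, amount):
    """Flag a transaction as fraud when its amount reaches the threshold."""
    return 1 if amount >= threshold else 0


def accuracy(threshold, rows):
    correct = sum(1 for amount, label in rows if predict(threshold, amount) == label)
    return correct / len(rows)


def fit_threshold(rows):
    """Train: pick the midpoint threshold with the best training accuracy."""
    values = sorted({amount for amount, _ in rows})
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]
    return max(candidates, key=lambda t: accuracy(t, rows))


threshold = fit_threshold(train)          # train on training data only
test_accuracy = accuracy(threshold, test)  # evaluate on testing data only
print(threshold, test_accuracy)
```

Real data is rarely this clean, so the score here says nothing about how a real fraud model would perform; the point is only the order of operations: split, train, then evaluate on data the model has not seen.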
The last phase is to productionize your model. This phase is not talked about as much in courses and online, but it is essential, especially for enterprises. Without it, you may not be able to get the full value out of the models that you build. There are two main things to consider:
- Model Deployment: Deploying a machine learning model simply means integrating it into an existing production environment where it can take in an input and return an output.
- Model Monitoring: Model monitoring is an operational stage in the machine learning life cycle that comes after model deployment. It entails watching your ML models for things like errors, crashes, and latency and, most importantly, ensuring that your model maintains a predetermined level of performance.
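Both ideas can be sketched together. In the hedged example below, "deployment" is reduced to a function that takes a JSON request in and returns a JSON response out, and "monitoring" is a small class that counts errors, records latency, and tracks accuracy over a rolling window. All names (`handle_request`, `Monitor`, `THRESHOLD`, `MIN_ACCURACY`) and values are hypothetical; a real production setup would usually sit behind a web framework and a dedicated monitoring tool.

```python
import json
import time
from collections import deque

THRESHOLD = 250.0    # hypothetical parameter learned during training
MIN_ACCURACY = 0.9   # hypothetical predetermined level of performance


def predict(amount):
    return 1 if amount >= THRESHOLD else 0


class Monitor:
    def __init__(self, window=100):
        self.errors = 0
        self.latencies = deque(maxlen=window)  # seconds per request
        self.outcomes = deque(maxlen=window)   # 1 = correct, 0 = wrong

    def record_feedback(self, predicted, actual):
        """Call once the true label of a past prediction becomes known."""
        self.outcomes.append(1 if predicted == actual else 0)

    def rolling_accuracy(self):
        if not self.outcomes:
            return None
        return sum(self.outcomes) / len(self.outcomes)

    def needs_attention(self):
        acc = self.rolling_accuracy()
        return acc is not None and acc < MIN_ACCURACY


monitor = Monitor()


def handle_request(body):
    """Deployment in miniature: JSON input in, JSON prediction out."""
    start = time.perf_counter()
    try:
        amount = float(json.loads(body)["amount"])
        response = {"is_fraud": predict(amount)}
    except (ValueError, KeyError, TypeError):
        monitor.errors += 1
        response = {"error": "invalid request"}
    monitor.latencies.append(time.perf_counter() - start)
    return json.dumps(response)


print(handle_request('{"amount": 900}'))  # {"is_fraud": 1}
print(handle_request("not json"))         # {"error": "invalid request"}

# Made-up feedback: two correct and two wrong predictions.
for predicted, actual in [(1, 1), (0, 0), (0, 1), (1, 0)]:
    monitor.record_feedback(predicted, actual)
print(monitor.rolling_accuracy(), monitor.needs_attention())  # 0.5 True
```

When `needs_attention()` returns true, the model has dropped below the agreed level of performance, which is typically the cue to investigate and possibly retrain.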