Machine learning project checklist

Original article was published by Adrià Serra on Deep Learning on Medium


Structuring Machine Learning Projects

Any type of project has checks that it should pass, and machine learning projects are no different.

Until now we have been explaining the mathematics (statistics, probability, linear algebra, calculus) that will allow us to understand how machine learning models work. But machine learning is more than an algorithm: training a model by itself is easy, the difficult part is making it useful!

A basic structure for your machine learning projects

Machine learning workflow, self-generated.

Define the problem

The first thing to do is to detect a problem and try to answer these questions:

  • What objective do we have?
  • Who is going to use the solution?
  • What’s the current workaround for this problem?
  • What type of algorithms will work (supervised, unsupervised, etc.)?
  • Select a performance metric/validation metric.
  • Is the problem similar to other problems already solved?
  • Define the assumptions that we will consider.
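To make the "select a performance metric" item concrete, here is a minimal sketch that computes a few common classification metrics by hand. The labels are made up for illustration, and which metric matters most depends on the problem you defined.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical ground truth and predictions, for illustration only.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

Accuracy is often enough when classes are balanced; precision, recall or F1 tend to be more informative when one class is rare.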

Obtain all the possible data

  • Select all data sources available.
  • How much data do we have (volume)?
  • Create a workspace.
  • Get access to all the data.
  • Get the data.
  • Get all the data together in a dataset.
  • Check the format of the data.
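The last steps of this list (get the data together in one dataset and check its format) could look like the sketch below. It assumes pandas is available; the two sources, the column names and the values are hypothetical stand-ins for whatever your real sources are.

```python
import io

import pandas as pd

# Hypothetical source 1: a CSV export (an in-memory string stands in for a file).
csv_source = io.StringIO(
    "customer_id,age,plan\n"
    "1,34,basic\n"
    "2,45,premium\n"
)

# Hypothetical source 2: records returned by some other system.
records_source = [
    {"customer_id": 3, "age": 29, "plan": "basic"},
    {"customer_id": 4, "age": None, "plan": "premium"},
]

df_csv = pd.read_csv(csv_source)
df_records = pd.DataFrame(records_source)

# Put everything together in a single dataset.
dataset = pd.concat([df_csv, df_records], ignore_index=True)

# Check the format: size, column types and a first look at the rows.
print(dataset.shape)
print(dataset.dtypes)
print(dataset.head())
```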

Exploratory data analysis / Clean data, feature selection

  • If the data volume is big, select a manageable subset.
  • Create a Jupyter notebook and explore each attribute of your data.
  • Check for missing values.
  • Check the distributions, scales, and noise of the data.
  • Visualize the data.
  • Study correlations.
  • Look for data transformations that could be useful.
  • Feature selection.
  • Normalize/Scale data.
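A few of these checks can be sketched with pandas and NumPy as follows. The data is synthetic and the column names are invented for the example; filling missing values with the median and standardizing are just one possible choice of transformations, not the only one.

```python
import numpy as np
import pandas as pd

# Synthetic data, for illustration only.
rng = np.random.default_rng(0)
n = 200
height_cm = rng.normal(170, 10, n)
weight_kg = 0.9 * height_cm - 85 + rng.normal(0, 5, n)
df = pd.DataFrame({
    "height_cm": height_cm,
    "weight_kg": weight_kg,
    "noise": rng.normal(0, 1, n),
})
# Inject some missing values so there is something to find.
df.loc[df.index % 25 == 0, "weight_kg"] = np.nan

# Check for missing values.
missing_per_column = df.isna().sum()
print(missing_per_column)

# Check distributions and scales.
print(df.describe())

# Study correlations between attributes.
correlations = df.corr()
print(correlations)

# One possible transformation: fill missing values with the median,
# then standardize each column to zero mean and unit standard deviation.
df_filled = df.fillna(df.median())
df_scaled = (df_filled - df_filled.mean()) / df_filled.std()
```

In a notebook you would usually add plots here as well (for example histograms per attribute) to cover the "visualize the data" item.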

Sample Data

  • Go back to all the data, apply the transformations previously defined.
  • Divide the data into train/dev/test sets.
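One way to do the split is to call scikit-learn's train_test_split twice, as in this sketch. The 60/20/20 proportions and the synthetic dataset are assumptions made for the example; choose proportions that fit your data volume.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the full, already transformed dataset.
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

# First hold out 20% of the data as the test set.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Then split the remaining 80% into train (75%) and dev (25%),
# which gives 60% / 20% / 20% of the original data.
X_train, X_dev, y_train, y_dev = train_test_split(
    X_rest, y_rest, test_size=0.25, stratify=y_rest, random_state=0
)

print(len(X_train), len(X_dev), len(X_test))
```

Passing stratify keeps the class proportions similar across the three sets, which helps when classes are imbalanced.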

Create ML models/Evaluate Models

  • Based on what you’ve seen in the data, try the most commonly used models on the train/test sets, using cross-validation.
  • Check the validation metric for each model on the dev set and select the best one.
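The two steps above can be sketched with scikit-learn like this. The three candidate models, the accuracy metric and the synthetic data are assumptions chosen for the example, not a recommendation for every problem.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for the real train and dev sets.
X, y = make_classification(
    n_samples=600, n_features=10, n_informative=5, random_state=0
)
X_train, X_dev, y_train, y_dev = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Hypothetical shortlist of commonly used models.
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

cv_scores = {}
dev_scores = {}
for name, model in candidates.items():
    # 5-fold cross-validation on the training data.
    cv_scores[name] = cross_val_score(
        model, X_train, y_train, cv=5, scoring="accuracy"
    ).mean()
    # Fit on the full training set and check the metric on the dev set.
    dev_scores[name] = model.fit(X_train, y_train).score(X_dev, y_dev)

best_name = max(dev_scores, key=dev_scores.get)
for name in candidates:
    print(f"{name}: cv={cv_scores[name]:.3f} dev={dev_scores[name]:.3f}")
print("selected:", best_name)
```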

Now you will have more insights into the problem. If the original question is well formulated, continue; if not, reformulate it and restart the process.

Parameter tuning of the best models

  • Fine-tune the best models previously selected.
  • Perform hyperparameter tuning.
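A grid search with cross-validation is one common way to tune hyperparameters. The sketch below uses scikit-learn's GridSearchCV; the model, the small parameter grid and the synthetic data are assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the training set.
X_train, y_train = make_classification(
    n_samples=300, n_features=10, n_informative=5, random_state=0
)

# Hypothetical grid: every combination is evaluated with 3-fold cross-validation.
param_grid = {
    "n_estimators": [25, 50],
    "max_depth": [3, None],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=3,
    scoring="accuracy",
)
search.fit(X_train, y_train)

print(search.best_params_)
print(round(search.best_score_, 3))
```

After fitting, search.best_estimator_ holds the model refitted on the whole training set with the best combination found.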

Deploy the model

This is the most variable part of the project, since it depends on many factors. Some options are:

  • Create an API that the business can call to request predictions from the model.
  • Add the machine learning pipeline to the business code.
  • Let the programmers of the business add your model to the business code.
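As a rough idea of the first option, here is a minimal sketch of a prediction API using only Python's standard library. The "model" is a placeholder with made-up weights, and the JSON request format ({"features": [...]}) is an assumption; a real deployment would load your trained model and most likely use a proper web framework.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Placeholder for a trained model: hypothetical fixed weights, illustration only.
WEIGHTS = [0.4, -0.2, 0.1]
BIAS = 0.05


def predict(features):
    """Return class 1 if the linear score is positive, else class 0."""
    score = BIAS + sum(w * x for w, x in zip(WEIGHTS, features))
    return 1 if score > 0 else 0


def handle_prediction_request(body):
    """Turn a JSON request body into (status code, JSON response bytes)."""
    try:
        payload = json.loads(body)
        features = [float(v) for v in payload["features"]]
        if len(features) != len(WEIGHTS):
            raise ValueError(f"expected {len(WEIGHTS)} features")
    except (ValueError, KeyError, TypeError) as exc:
        return 400, json.dumps({"error": str(exc)}).encode()
    return 200, json.dumps({"prediction": predict(features)}).encode()


class PredictionHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        status, response = handle_prediction_request(self.rfile.read(length))
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(response)


def serve(port=8000):
    """Start the server locally; blocks until interrupted."""
    HTTPServer(("127.0.0.1", port), PredictionHandler).serve_forever()
```

Calling serve() starts the API on localhost, and a client can then POST a body such as {"features": [1.0, 0.0, 0.0]} to get a prediction back as JSON.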

Conclusion

Don’t be scared: we will explain every part of this workflow in future posts. The next ones will start with machine learning models that you can try in the “Create ML models” step.