Original article was published on Artificial Intelligence on Medium
Cleaning and optimizing data is one of the biggest challenges that data scientists encounter. The ongoing concern about the amount of time that goes into such work is embodied by the 80/20 Rule of Data Science. In this case, the 80 represents the 80% of the time that data scientists expend getting data ready for use and the 20 refers to the mere 20% of their time that goes into actual analysis and reporting.
Much like many other 80/20 rules inspired by the Pareto principle, it’s far from an ironclad law. This leaves room for data scientists to overcome the rules, and one of the tools they’re using to do it is AI. Let’s take a look at why this is an important opportunity and how it might change your process when you’re working with data.
The Scale of the Problem
At its core, the problem is that no one wants to be paying data scientists to prep data anymore than is necessary. Likewise, most folks who went into data science did so because deriving insights from data can be an exciting process. As important as diligence is to mathematical and scientific processes, anything that allows you to do more diligence and to get the job done faster is always a win.
IBM published a report in 2017 that outlined the job market challenges that companies are facing when hiring data scientists. Growth in a whole host of data science, machine learning, testing, and visualization fields was in the double digits year-over-year. Further, it cited a McKinsey report that shows that, if current trends continue, the demand for data scientists will outstrip the job market’s supply sometime in the coming years.
In other words, the world is close to arriving at the point where simply hiring more data scientists isn’t going to get the job done. Fortunately, data science provides us with a very useful tool to address the problem without depleting our supply of human capital.
Is AI the Solution?
It’s reasonable to say that AI represents a solution, not The Solution. With that in mind, though, chipping away at the alleged 80% of the time that goes into prepping data for use is always going to be a win so long as standards for diligence are maintained.
Data waiting to be prepped often follow patterns that can be detected. The logic is fairly straightforward, and it goes as follows:
- Have individuals prepare a representative portion of a data set using programming tools and direct inspections.
- Build a training model from the prepared data.
- Execute and refine the training model until it reaches an acceptable performance threshold.
- Apply the training model and continue working on refinements and defect detection.
- Profit! (Profit here meaning to take back the time you were spending on preparing data.)
There are a few factors worth considering. First, depending on the size of the task and its overall value, it has to be large enough that a representative sample can be extracted from the larger dataset. Preferably, you don’t want it to be 50% of the overall dataset, otherwise, you might be better off just powering through with a human/programmatic solution.
Second, some evidence needs to exist that shows the issues with each dataset lend themselves to AI training. While the power of AI can certainly surprise data scientists in terms of improving processes such as cleaning data as well as finding patterns, you don’t want to be on it without knowing that upfront. Otherwise, you may spend more time working with the AI than you gain for doing analysis.
The human and programming elements of cleaning and optimizing data will never go away completely. Both are essential to maintaining appropriate levels of diligence. Moving the needle away from 80% and toward or below 50%, however, is critical to fostering continued growth in the industry.
Without a massive influx of data scientists into the field in the coming decade, something that does not appear to be on the horizon, AI is one of the best hopes for turning back the time spent on preparing datasets for analysis. That makes it an option that all projects that rely on data scientists should be looking at closely.