A Simple Way to Explain the Pathway to Data Science

The insights you get from data science can sometimes feel like eureka moments. But they don’t come from just sleeping on the problem or bulldozing through the options presented to you. In reality, there are a lot of moving parts, and you have to plan and connect the dots to gain those valuable insights. In my opinion, we should imagine a data science project as walking down a pathway, where each step gets you closer to the goal you have in mind.

Plan your pathway …

You first need to define your goals. What is it that you are trying to find out or accomplish? That way, you will know when you are on target and when you need to pivot a little.

You need to organise your resources. That can include things as simple as getting the right computers and software, securing access to the data, and making sure people and their time are available.

You need to coordinate the work of those people, because data science is a team effort. Not everybody is going to be doing the same thing; some things have to happen first, and some happen later.

You need to schedule the project, so it does not expand to fill up an enormous amount of time. Timeboxing, or committing to accomplish a task within a set amount of time, is especially useful when you are working on a tight timeframe, when you have a budget, or when you are working with a client.

The walking starts right here …

After planning, the next step is wrangling, or preparing, the data. That means you first need to get the data. You may be gathering new data, you may be using open data sources, or you may be using public APIs. The point here is that you actually have to get the raw materials together.
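To make that concrete, here is a minimal sketch of gathering raw material in Python with pandas and requests; the CSV address and API endpoint are hypothetical placeholders for whatever sources your project actually uses.

```python
# A minimal sketch of gathering raw data, assuming a hypothetical open CSV
# file and a hypothetical public JSON API; swap in your real sources.
import pandas as pd
import requests

# Open data source: read a CSV straight from a URL (illustrative address).
sales = pd.read_csv("https://example.com/open-data/sales.csv")

# Public API: fetch JSON and normalise it into a flat table.
response = requests.get("https://api.example.com/v1/customers", timeout=30)
response.raise_for_status()
customers = pd.json_normalize(response.json())

print(sales.shape, customers.shape)  # confirm the raw materials arrived
```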

The next step is cleaning the data, which is an enormous task within data science. It is about getting the data into a shape that the programs and applications you are using can actually process, so you can extract the insight you need.
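As an illustration, a typical cleaning pass in pandas might look something like this; the column names and messy values are made up purely to show the kinds of fixes involved.

```python
# A sketch of common cleaning steps: dropping duplicates, fixing types,
# standardising categories, and handling missing values.
import pandas as pd

sales = pd.DataFrame({
    "order_date": ["2021-01-03", "2021-01-03", None, "2021-02-10"],
    "amount": ["120.5", "120.5", "98.0", "not available"],
    "region": ["north", "north", "south", "South"],
})

sales = sales.drop_duplicates()                                    # remove repeated rows
sales["order_date"] = pd.to_datetime(sales["order_date"])          # real dates, not strings
sales["amount"] = pd.to_numeric(sales["amount"], errors="coerce")  # bad values become NaN
sales["region"] = sales["region"].str.lower()                      # consistent categories
sales = sales.dropna(subset=["order_date", "amount"])              # drop rows we cannot use

print(sales)
```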

Once you have prepared the data and it is on your computer, you need to explore it. This may entail making visualisations and computing numerical summaries, a way of getting a feel for what kinds of insights the data may hold.
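A quick exploration pass could look like the following sketch, assuming a small cleaned table; it prints numerical summaries and draws a simple histogram with matplotlib.

```python
# Exploration sketch: numerical summaries plus one simple visualisation.
import pandas as pd
import matplotlib.pyplot as plt

sales = pd.DataFrame({
    "amount": [120.5, 98.0, 210.0, 45.5, 310.2, 87.9],
    "region": ["north", "south", "north", "west", "south", "west"],
})

print(sales.describe())                           # count, mean, spread, quartiles
print(sales.groupby("region")["amount"].mean())   # a first feel for group differences

sales["amount"].hist(bins=5)                      # distribution of order values
plt.xlabel("order amount")
plt.ylabel("frequency")
plt.show()
```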

And then, based on your exploration, you may need to refine the data. You may need to re-categorise cases, or combine variables into new scores; anything that helps prepare the data for the insights you are after.
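For example, a refinement step in pandas might re-categorise a continuous variable into bands and combine two variables into a new score; the columns and the scoring formula here are purely illustrative.

```python
# Refinement sketch: re-categorising cases and combining variables into a score.
import pandas as pd

customers = pd.DataFrame({
    "age": [23, 41, 67, 35],
    "visits": [2, 15, 4, 9],
    "spend": [40.0, 310.0, 95.0, 180.0],
})

# Re-categorise a continuous variable into bands.
customers["age_band"] = pd.cut(customers["age"],
                               bins=[0, 30, 50, 120],
                               labels=["young", "middle", "senior"])

# Combine two variables into a single engagement score (illustrative weights).
customers["engagement"] = customers["visits"] * 0.3 + customers["spend"] * 0.01

print(customers)
```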

Evaluate your pathway …

The third phase in your data science pathway is modelling. This is where you create a statistical model such as a linear regression, a decision tree, or a deep neural network.
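As a sketch, fitting such models with scikit-learn on synthetic data might look like this; swap in your own features and whichever model family fits the problem.

```python
# Modelling sketch: the same fit/predict pattern covers a linear regression,
# a decision tree, or (with another library) a neural network.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                  # three illustrative features
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

linear = LinearRegression().fit(X, y)
tree = DecisionTreeRegressor(max_depth=3).fit(X, y)

print("linear coefficients:", linear.coef_)
print("tree depth:", tree.get_depth())
```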

Then, you need to validate the model. This is an important step. How well will the model generalise from the current data set to other data sets? You don’t want to end up with conclusions that fall apart when you take them to new places.
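One common way to check this, shown in the sketch below, is to hold out data the model never saw, or to cross-validate; the synthetic data here just stands in for your own.

```python
# Validation sketch: estimate how well the model generalises beyond the
# data it was trained on, via a held-out split and 5-fold cross-validation.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)
model = LinearRegression().fit(X_train, y_train)

print("held-out R^2:", model.score(X_test, y_test))
print("5-fold CV R^2:", cross_val_score(LinearRegression(), X, y, cv=5).mean())
```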

The next step is evaluating the model. How well does it fit the data? What’s the return on investment for it? How usable is it going to be?
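A small evaluation sketch might compute a couple of fit metrics and a back-of-envelope return-on-investment figure; the predictions and the money numbers below are purely illustrative.

```python
# Evaluation sketch: fit metrics plus a rough, illustrative ROI calculation.
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score

y_true = np.array([100.0, 150.0, 90.0, 210.0])
y_pred = np.array([110.0, 140.0, 95.0, 200.0])

print("MAE:", mean_absolute_error(y_true, y_pred))
print("R^2:", r2_score(y_true, y_pred))

# Illustrative ROI: expected uplift from acting on the model vs project cost.
expected_uplift, project_cost = 50_000, 20_000
print("ROI:", (expected_uplift - project_cost) / project_cost)
```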

Based on that evaluation, you may need to refine the model. You may need to process the data a different way, adjust the parameters in your neural network, or find additional variables to include in your linear regression.
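One way to refine systematically, rather than tuning by hand, is to search over model settings; the grid search over decision-tree depth below is just an illustrative stand-in for adjusting whatever parameters your own model has.

```python
# Refinement sketch: a cross-validated search over candidate model settings.
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 3))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.5, size=300)

search = GridSearchCV(DecisionTreeRegressor(random_state=0),
                      param_grid={"max_depth": [2, 3, 5, 8]},
                      cv=5)
search.fit(X, y)

print("best settings:", search.best_params_, "CV score:", round(search.best_score_, 3))
```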

And then finally, the last part of the data science pathway is applying the model, and that includes presenting the model: showing what you learned to other people, to the decision-makers, to the interested parties, to your client, so that they know what it is you have discovered.

Repeat the Journey (from explicit to implicit) …

Then you deploy the model. For instance, if you created a recommendation engine, you can put it online so that it starts providing recommendations to clients, or you can embed it in a dashboard so that it provides recommendations to your decision-makers. You will eventually need to revisit the model and see how well it is performing, especially when you have new data or a new context in which it is operating.
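As a sketch of that deployment step, you might persist the trained model and reload it wherever the recommendations are served; the joblib file name and the tiny scoring function below are illustrative assumptions, not a prescribed setup.

```python
# Deployment sketch: persist the trained model, then reload it wherever the
# recommendations are served (a web service, a dashboard, a batch job).
import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.5, -0.5, 0.2]) + rng.normal(scale=0.1, size=100)

# Training side: fit once and save the model to disk.
joblib.dump(LinearRegression().fit(X, y), "recommender.joblib")

# Serving side: load the saved model and score incoming cases.
model = joblib.load("recommender.joblib")

def recommend_score(features):
    """Score one client so the dashboard or API can rank recommendations."""
    return float(model.predict(np.asarray(features).reshape(1, -1))[0])

print(recommend_score([0.2, -1.0, 0.5]))
```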

Also, you may need to revise the model and run the process over again. And then finally, once you have done all of the above, you need to archive the assets and clean up after yourself. This is a critical discipline in data science. It includes documenting where the data came from and how you processed it, commenting the code you used for the analysis, and making things future-proof. All of this together makes the project more successful, easier to manage, and easier to justify when you calculate the return on investment.

Taken together, those steps on the pathway get you to your goal. It could be an amazing view at the end of your hike, or it could be a fantastic insight into your business model, which was your purpose all along.