“Amateur” Data Science vs. “Professional” Data Science



There has never been a better time to learn new tools and techniques in data science, for free. Many publications on Medium alone aggregate well-written tutorials on a variety of modeling techniques, and new tools are constantly coming out that have great documentation and are easy to pick up if you know a bit of Python: just last week, in the Metis Slack channel, three instructors were discussing the merits of three tools for network analysis, all of which meet these criteria.

Three tools for graph/network analysis, all cool and easy to learn.
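To get a sense of how approachable such tools are, here is a minimal sketch using networkx (the three tools above are not named in the post, so networkx is just one plausible, widely used example).

```python
# A minimal sketch, assuming networkx as one example of an easy-to-learn
# network analysis library (not necessarily one of the three tools mentioned).
import networkx as nx

# Build a tiny friendship graph
G = nx.Graph()
G.add_edges_from([
    ("alice", "bob"),
    ("bob", "carol"),
    ("carol", "alice"),
    ("carol", "dave"),
])

# A few lines already give useful structural summaries
print(nx.degree_centrality(G))               # who is most connected
print(nx.shortest_path(G, "alice", "dave"))  # how two people are linked
```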

But beyond just learning things because they are cool, data scientists want to have an impact on the world. That means learning the tools and techniques most likely to make it into production and have an impact at the organization you’re working for — or want to work for.

Many data scientists get enamored with learning lots of libraries and many algorithms. What is underappreciated, even among more senior data scientists, is that much of the concrete value data scientists add isn’t from knowing advanced tools and algorithms; it is usually:

  • Problem structuring: taking complex situations that organizations have and formulating their problems in such a way that clever applications of simple techniques can solve them.
  • Engineering, broadly speaking: writing clean, easy-to-follow code — most of which, as is well-known, will simply be for cleaning data — and working with engineers to ensure this code can work with the organization’s production systems.

These I would broadly call “professional data science”. By contrast, I would argue that the tools and techniques that many aspiring data scientists focus on fall under the umbrella of “amateur data science”.

“Amateur”? How dare you!

I do not mean amateur in a bad way! The word “amateur” comes from French and means “one who loves, lover”. By amateur data science, I mean the things we learn “for their own sake” because they are intrinsically beautiful, fascinating, or even awe-inspiring, rather than because they can have a direct impact on an important real-world problem. Indeed, even DeepMind’s monumental AlphaGo Zero program would fall under this category, since it wasn’t built to solve a particular problem that Google or even other companies within Alphabet were facing.

David Silver and Demis Hassabis of DeepMind. Amateur data scientists, just like me.

In addition, loving to learn this stuff does not make you as an individual an “amateur” data scientist. Or, if it does, we are all amateur data scientists, myself included. The point of this blog post is not to denigrate any part of data science — rather, the point is to shine light on another side of data science which is written about disproportionately little and is very different than “amateur” data science: that of “professional” data science.

A profile of “professional data science”

My friend Nathaniel Tucker and I recently did a podcast interview with Josh Bloom that was intended to be a profile in professional data science.

Nate and I interviewing Josh Bloom at the GE Digital office in a WeWork in San Francisco.

Josh discussed applying machine learning in a variety of real world contexts, from dealing with natural language data coming out of Salesforce or Zendesk at his first company, Wise.IO, to dealing with (almost literally) messy industrial image data from pipelines at GE Digital.

  • At Wise.IO, the team was dealing with language data around customer service. You might think that their solution involved using word embeddings or some other cutting-edge approach, given that their product worked well and the company was eventually acquired by GE. However, for their use case — identifying whether a ticket could be addressed with an automatic response vs. warranting a more extensive human look — one of the simplest possible techniques, logistic regression on top of bag of words, worked well enough, in addition to having the benefit of being easy to understand. The key for a data scientist on that team would have been coming up with creative and clever ways to apply these simple models, including identifying when the model was more vs. less confident in its predictions (a rough sketch of this kind of approach follows below).
  • In addition, Josh discusses how at GE, the most challenging part of analyzing images of oil pipelines involved preprocessing and cleaning the data properly. Once that was done, the goal of the model wasn’t to achieve “high accuracy”, but to give GE analysts insight into which of the many images could be something significant — a problem that requires cleverness in, for example, determining the criteria to select images as worthy of further attention, but not necessarily knowledge of the latest deep learning architecture published at NIPS.
Read about the work GE Digital is doing in the Industrial Internet of Things here.
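To make the Wise.IO-style example concrete, here is a minimal sketch of bag-of-words features plus logistic regression, using predicted probabilities as a confidence signal. The tickets, labels, and threshold are made up for illustration; this is not Wise.IO’s actual system.

```python
# A minimal sketch: bag-of-words + logistic regression for ticket triage,
# with predicted probability used as a confidence signal.
# The data and the 0.8 threshold are hypothetical.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

tickets = [
    "how do I reset my password",
    "reset password link not working",
    "my invoice is wrong, please fix the billing amount",
    "I was charged twice this month",
]
labels = [1, 1, 0, 0]  # 1 = can be answered automatically, 0 = needs a human

model = make_pipeline(CountVectorizer(), LogisticRegression())
model.fit(tickets, labels)

new_ticket = ["forgot my password, how do I reset it?"]
confidence = model.predict_proba(new_ticket)[0, 1]

# Only auto-respond when the model is confident; otherwise route to a person.
THRESHOLD = 0.8
print("auto-respond" if confidence >= THRESHOLD else "send to a human", confidence)
```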

Amateur vs. Professional Data Science

  • Reading lists and/or trying to learn the “Top 10 algorithms” without learning why one would use one instead of another. As cool as algorithms are, it could be the case that knowing just two or three of them very well could solve 99% of the problems you face. I heard a manager at a leading tech company say that he encourages his employees to stick with just two models for supervised learning: logistic regression and gradient boosted trees (though he implied that recently he was open to them using shallow neural networks). A sketch comparing these two models follows this list.
  • Dealing with problems where the question is well defined and the data is clean. In the real world, once this has been done, building the model is relatively easy! Again, the value that data scientists add is in framing the problem so that all the modeling techniques you’ve learned about can be used to answer the real-world question you care about.
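Here is a rough, hypothetical sketch of that “two go-to models” idea: fit logistic regression and gradient boosted trees on the same problem and compare them. The dataset is synthetic and purely illustrative.

```python
# A sketch comparing the two "go-to" supervised models on a toy problem.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

models = [
    ("logistic regression", LogisticRegression(max_iter=1000)),
    ("gradient boosted trees", GradientBoostingClassifier()),
]
for name, model in models:
    scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
    print(f"{name}: mean AUC = {scores.mean():.3f}")
```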

Caveats

To be fair: it is important to learn things that get you excited and that you think are cool. Deep Learning, for example, is by far the hottest subject among amateur data scientists — I myself follow the latest developments and think it is an endlessly cool subject (the math behind it in particular is really simple and elegant when you work out the details…but I digress).

And, tools that are at first amateurish do indeed become professional: PyTorch is now moving toward its first release geared toward use in production rather than just for research; and more generally, Deep Learning is being applied in production, first in narrow applications where it is far and away the best approach, such as computer vision, but slowly expanding beyond this.

Still, taking Deep Learning as an example: since Deep Learning models are so complicated, before using them to solve a real world problem, you should ask if a simpler approach could work. For example, thanks to the diligence of Uri Shalit (and thanks to DeepMind for publishing a comprehensive appendix) we know that when DeepMind applied Deep Neural Networks to real world health records data to predict inpatient mortality, logistic regression with regularization, a much more widely-used technique within companies, did only a tiny bit worse than using Deep Learning, and resulted in a simpler, easier-to-interpret model.
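For reference, here is a minimal sketch of what “logistic regression with regularization” looks like in practice, with the regularization strength chosen by cross-validation. This is obviously not the DeepMind health-records study; it just illustrates the baseline technique on a standard toy dataset.

```python
# A sketch of L2-regularized logistic regression with the regularization
# strength (C) picked by cross-validation. Dataset is a scikit-learn toy set.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

model = LogisticRegressionCV(Cs=10, cv=5, penalty="l2", max_iter=5000)
model.fit(X_train, y_train)
print("chosen regularization strength C:", model.C_[0])
print("test accuracy:", model.score(X_test, y_test))
```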

So, the key takeaway from this article is not to stop learning the flashy, new stuff, but to also pay attention to which tools are most commonly used by organizations all around the world. The tools that professionals put into production are popular for a reason: they are often much simpler than techniques that deliver similar performance, and this has been proven on a wide variety of datasets, not just on flashy benchmarks. Learning about the details of Deep Learning models without a firm grasp of logistic regression with regularization is, well, amateurish.

How should I learn “Professional” Data Science?

Here are some questions professional data scientists think about all the time:

Framing the problem

  • What is the real world implication of the outcome of this modeling question?
  • On this problem, does the difference between extremely high accuracy and merely OK accuracy matter?
  • Is making a correct prediction for every case important, or is it enough to identify the cases where the model is confident in its predictions and flag the rest for human review?

Modeling

When choosing a model, here are things to consider (a quick sketch of checking a few of these follows the list):

  • How long does this model take to train?
  • How long does this model take to make predictions (do inference)?
  • How easy is the model to interpret?
  • How easy is the code that generates the model to maintain?
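Here is a quick, hypothetical sketch of checking a few of these questions (training time, inference time, and interpretability) for a simple model on synthetic data.

```python
# A sketch of measuring training time, single-prediction latency, and
# inspecting coefficients for interpretability. Data is synthetic.
import time

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
model = LogisticRegression(max_iter=1000)

start = time.perf_counter()
model.fit(X, y)
print(f"training time: {time.perf_counter() - start:.3f} s")

start = time.perf_counter()
model.predict(X[:1])
print(f"single-row inference: {(time.perf_counter() - start) * 1000:.2f} ms")

# Interpretability: the largest coefficients point to the most influential features.
top_features = np.argsort(np.abs(model.coef_[0]))[::-1][:5]
print("most influential feature indices:", top_features)
```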

Coding/engineering

  • Is the code written in a way that is easy to maintain?
  • Have I written what I have done in a modular way, so that parts of it are re-usable? (A small sketch of what this can look like follows below.)
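As a small, hypothetical sketch of what “modular and re-usable” can look like: each step is a named function (or pipeline step) that can be tested and reused on its own. The function names, columns, and cleaning steps are illustrative, not a prescription.

```python
# A sketch of modular code: cleaning, model construction, and training are
# separate, re-usable pieces. Names and cleaning steps are hypothetical.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Drop rows with missing values and normalize column names."""
    df = df.dropna()
    df.columns = [c.strip().lower() for c in df.columns]
    return df


def build_model() -> Pipeline:
    """Return an untrained preprocessing + model pipeline."""
    return Pipeline([("scale", StandardScaler()), ("clf", LogisticRegression())])


def train(df: pd.DataFrame, target: str) -> Pipeline:
    """Clean the data, then fit the pipeline on features vs. the target column."""
    df = clean(df)
    X, y = df.drop(columns=[target]), df[target]
    model = build_model()
    model.fit(X, y)
    return model
```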

Deployment

  • Are you familiar with what deploying your model might involve?
  • Have you learned Python libraries such as Flask that let you deploy models as HTTP servers? (A minimal Flask sketch follows this list.)
  • Have you explored tools such as AWS SageMaker and Algorithmia?
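Here is a minimal Flask sketch of the kind of HTTP server mentioned above. The model file name, expected JSON format, and route are all hypothetical.

```python
# A sketch of serving a trained scikit-learn model over HTTP with Flask.
# "model.pkl" and the JSON format are placeholders.
import joblib
from flask import Flask, jsonify, request

app = Flask(__name__)
model = joblib.load("model.pkl")  # a previously trained model


@app.route("/predict", methods=["POST"])
def predict():
    # Expect JSON like {"features": [0.1, 3.2, ...]}
    features = request.get_json()["features"]
    prediction = model.predict([features])[0]
    return jsonify({"prediction": int(prediction)})


if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```

You could then test it with something like `curl -X POST -H "Content-Type: application/json" -d '{"features": [0.1, 3.2]}' http://localhost:5000/predict`.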

Example

Let’s say you do a Kaggle competition and get extremely high accuracy. In the blog post you write about what you learned, don’t just talk about how high your accuracy was: talk about how simple or complicated it was to find the optimal model, how easy it would be to break down and explain what you did to somebody else, how fast predictions are (e.g. in milliseconds), etc.

Where to learn more

For a couple of great blog posts covering topics at the core of “professional data science”, check out these two posts from Josh on the Wise.IO blog:

Leave other posts you think are relevant in the comments below!

Conclusion

No matter what your seniority, it will help you if you consistently think about the things you are reading from a “professional data science” angle, consistently asking yourself the questions above. This will help you learn the right things so you can either get the job you want or have a greater impact at the job you have.

Follow me on Twitter: @SethHWeidman
