How to avoid the worst mistake every Data Scientist can make — using these 2 crucial steps

Source: Deep Learning on Medium

Can you tell us about your professional background?

Before Patreon, I was at Google on a data analysis team. I worked specifically on spam and abuse, applying data to reduce the amount of spam, abuse and fraud in Google’s communications product. Before that, I studied applied math and music at Harvard; my interests lie at the intersection of technology, arts, and computer science. I joined Patreon about 4 years ago as the second Data Scientist, and I’ve been with the company since then.

What were your day-to-day responsibilities when you first started at Patreon? How has your data role evolved throughout the years?

When I first started, Patreon’s main data setup consisted of writing all of our key metrics to a Google Sheet that refreshed every day through the Google Sheets API, which was effective at the time. That was the environment I was coming into: we didn’t have any data infrastructure set up, and we were beginning to learn the foundational data questions we wanted to ask about how creators, patrons and businesses like Patreon grow.

A lot of our first year was spent 1) doing data engineering work, such as setting up our first ETL, 2) completing a big migration from MySQL to Redshift, 3) defining our core metrics and what we wanted to measure, and 4) building our first key dashboards.

Now my day looks very different. I manage 5 Data Scientists. I have a lot more influence on strategy (where should our data tell us to go?), and I place a lot of focus on product analytics (how is the product performing? How do you run experiments well?).

Do all Data Scientists at Patreon work on product-related questions? If not, can you tell me how the data teams are structured?

We are one core data science team (we’re very much a centralized team). We support 4 major functions in the business:

  1. Product Analytics: measuring performance of a product.
  2. Business Analytics: all data and metrics related to go-to-market business teams (sales, marketing, creator success, finance, legal, etc.).
  3. Core research: foundational, deep questions about creators and patrons that are going to drive the entire business.
  4. Business Intelligence/Data Education/Data Accessibility: improving data onboarding for new employees and building resources that make data more accessible and interpretable for the company.

70% of what we do on a quarterly basis falls into the Product Analytics bucket. Some folks are geared more towards the Business Analytics side, some towards data education; it depends on the person, the quarter, and what we are trying to do at the time.

Can you elaborate on your main mission: To make “creators, patrons, and teammates at Patreon have the data they need to make excellent decisions”?

I think about our data science team’s work as a spectrum, with foundational data infrastructure on one side and data products that power the creative economy on the other. Today, we are about 60–70% of the way to the other end. My hope is that in the future we are building data products, APIs, or models that are embedded within Patreon.com and enable patrons and creators to make better decisions.

For example, we can help creators understand the churn characteristics of patrons and their membership [churn model and analytics]. This then helps creators take better action to retain those members and to grow their membership at Patreon. Or we can tell patrons about the most popular post or benefit that they haven’t yet seen through something the data science team has built [content recommendation engine].

Right now we only have one model in production, the fraud model, which is mostly helping Patreon not charge fraudulent pledges. My hope is that in the long term, we are building more data products that are powering the site.

What are traits, qualities, or backgrounds that you seek in prospective Data Scientists at Patreon?

There are a few things that are really important to me, especially because Patreon is a startup, and being a Data Scientist at a startup can be different from being a Data Scientist at a bigger company with a hundred Data Scientists. A few things that are key:

  1. Technical bar: we work every day in SQL, specifically in Postgres, and expect candidates to know Python or have fluency in some sort of statistical language. We also look for someone who is really comfortable querying really large datasets.
  2. Communication: we’re in roles where a lot of our day-to-day is spent getting great insights or building models and communicating results of that to stakeholders, whether that’s product managers, marketing folks or finance. It’s super key that data science candidates have good communication skills.
  3. Grit, tenacity and willingness to solve hard problems: Patreon is a new product in a new market. The things we are trying to learn and the problems we are trying to solve are generally hard. My hope is that anyone who joins the data science team is excited about hard problems and about bumping up against hard challenges.
  4. Passion for the arts and passion for the mission: This is not the most important trait, but it is great to have.

Conversely, what are the traits you don’t look for? In other words, what are common pitfalls you’ve seen in data science applicants?

One common pitfall I see frequently is related to the idea of Maslow’s hammer. It’s this idea that for someone with a hammer everything looks like a nail. In data science, this is the desire to apply a methodology a candidate might feel really comfortable with to any problem regardless of the actual problem.

Example: a candidate might get into TensorFlow and learn Keras and deep learning, and for any problem they run into, they would say “Oh, let me use TensorFlow! Let me apply deep learning to this”. This is a super important pitfall to avoid. Some problems you’re going to be faced with might require only a simple Excel spreadsheet. Or maybe the way to solve a problem is actually getting folks in a room and talking about it. So failing to think about the right methodology for a problem is a very important pitfall to avoid.

So that would mean zooming out from the questions they are trying to solve and asking “why” about the techniques and models they are using?

Exactly. Asking why we are trying to solve this problem and what value we are going to add to the business is always a great place to start, rather than jumping in with the methodology that you know.

For those who are not familiar with business-value thinking, how can one improve?

Two ways I recommend:

  1. There are great books on strategic thinking. If you Google Harvard Business Review’s books on strategic thinking, that would help. It’s great for Data Scientists to have the skill of thinking strategically: asking longer-term questions and framing work in terms of why it is being done.
  2. It can be very valuable for candidates to think through case studies of similar products. For instance, come up with 10 feature ideas for Pinterest, Airbnb or Lyft, which are all very familiar software products. Before analyzing a feature, practice stepping back and asking why you would build that feature to begin with. This can help build the muscle of asking why and starting with that rather than diving right in.

Focusing specifically on Patreon, what are some major data challenges that the company faces?

One data challenge we face is that creators run their memberships in many different ways. For example: you have a musician who opens up a Patreon page to get the support of their fans. They just want to bring their fans along for the ride for any of the music that they make. Another example: a podcaster who’s creating a membership to offer exclusive content.

There are so many business models behind why people use Patreon, how they use it, and how they structure their pricing and benefits, and that makes things really challenging on the data side. We do have foundational clustering to show what these different models look like, but the clusters are not canonical. This unstructured business model problem in the data is really challenging for us, because it means we have to rely on other elements and characteristics that we see in payment behavior and pricing behavior to try to figure out what’s going to work for a given creator.

The second major challenge is that Patreon has been around for 6 years, and throughout that time we have had different messaging, marketing and branding to go with the strategy at the time. Right now, we are focused on membership: Patreon is a place where you can build a membership and get paid by your biggest fans. But that’s a very different model from that of our earliest creators. So our historical data might not be the most valuable thing for us to look at when we are trying to look into this new market of membership.

This is an inherent challenge: which data do you use for which problem, and how much do we rely on older data vs. data from creators who might have launched in the past year?
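One generic way to picture this trade-off is to down-weight older observations instead of discarding them. The sketch below is purely illustrative and is not described in the interview as Patreon's method: the function names, the 12-month half-life, and the toy numbers are all assumptions.

```python
def recency_weight(age_months, half_life_months=12.0):
    """Weight for an observation; it halves every `half_life_months`.

    The 12-month default is an arbitrary illustrative choice.
    """
    return 0.5 ** (age_months / half_life_months)


def weighted_mean(values, ages_months, half_life_months=12.0):
    """Average `values`, giving newer observations more influence."""
    weights = [recency_weight(a, half_life_months) for a in ages_months]
    total = sum(weights)
    if total == 0:
        raise ValueError("all weights are zero")
    return sum(w * v for w, v in zip(weights, values)) / total


# Toy inputs (made up): a metric observed for a cohort launched this
# month and for a cohort launched 12 months ago.
estimate = weighted_mean([1.0, 0.0], [0, 12])
```

A shorter half-life leans harder on recent creators, while a longer one keeps more of the historical signal; picking that value is the judgment call described above.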

For this cold-start problem, what strategies have you applied that have been successful?

That’s a good question. One important step we took, which is more of a business approach, was to pick canonical, case-study examples of what we are looking for. That means finding a creator who’s doing membership really well and diving deep into understanding what they are doing. How are they setting up their page? How are they delivering value to their members? Then we use these questions to try to find other creators.

To give a specific example, we know that podcast creators on Patreon have really good retention because they are releasing serialized content every week. If you know that there are brand new episodes coming out next week, you’re likely to stick around. We’ve taken this as an example to say, “Ok. How can we encourage other creators to make serialized content? And how can we take this insight from a very specific creator and apply it more broadly to all our creators?” This is what we call the canonical-creator approach, which has been helpful for us given the lack of data.
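A comparison like the podcast one could be checked with a simple per-group retention rate. This is a minimal sketch under stated assumptions: the record format, the group labels, and the toy rows are hypothetical and are not Patreon data.

```python
from collections import defaultdict


def retention_by_group(records):
    """Return the share of retained patrons per group.

    `records` is an iterable of (group_label, retained) pairs, where
    `retained` is True if the patron was still pledging at the end of
    the period. This record format is an assumption for illustration.
    """
    counts = defaultdict(lambda: [0, 0])  # group -> [retained, total]
    for group, retained in records:
        counts[group][1] += 1
        if retained:
            counts[group][0] += 1
    return {group: kept / total for group, (kept, total) in counts.items()}


# Toy rows (made up) contrasting serialized and one-off content.
toy_records = [
    ("serialized", True),
    ("serialized", True),
    ("serialized", False),
    ("one_off", True),
    ("one_off", False),
]
rates = retention_by_group(toy_records)
```

A gap between groups in a table like this is only a starting point; it does not by itself show that serialized content causes the better retention.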

For the rest of the interview, please watch the YouTube video, where Maura delves deeper into data science projects that have worked at Patreon, and shares other important tips and resources for current and aspiring Data Scientists.