Setting Up Your Data Science & Machine Learning Capability in Python

Original article was published on Artificial Intelligence on Medium


Why Python?

Python is the clear winning programming language in data science & machine learning (DSML). With its rich and dynamic open source software ecosystem, Python stands unmatched in how adoptable, reliable, and functional it is. If you disagree with this premise, then please take a quick detour here.

Python has over 8 million users (SlashData)

The Purpose of Your Data Science & Machine Learning Capability

Your goal as a lead of a DSML team is to deliver the best return on investment to the business. The business invests in the DSML capability with a budget for staff and resources, while your job is to deliver the maximum business impact you can.

Your business impact can be measured in many ways. The most high-level objectives are cost optimization, risk optimization, and revenue growth. You may focus on a variety of specific metrics within each objective, such as customer acquisition cost optimization, churn prediction, fraud detection, patient health outcomes, or personalized product recommendations.

Anything diverting goal-setting, budget and execution from this purpose drives down the ROI your team can deliver. Where the attention goes, the energy flows, to quote a self-improvement guru.

Renting vs. Owning

This is a reframing of the classic Buy vs. Build discussion, in the context of many DSML platforms now offering “pay as you go” pricing, much like Amazon Web Services. I feel it’s necessary to rephrase the discussion because, unlike “Buying,” where you pay a fixed cost whether or not you use the product, “Renting” implies that you pay only when you use it. This is much more convenient for the end user.

As you begin to set up your DSML platform in Python, you can own the internal architecture or you can rent it from a vendor. I’ll use Saturn Cloud as the example vendor because, as you might expect, I am biased.

The Hidden Cost of Owning

Owning a DSML capability carries inherent “scope creep” issues that are not in plain view from the outset. It is all too easy to assume that owning the capability simply means integrating your favorite open source tools: Jupyter, Snowflake, Dask or PySpark, Prefect or Airflow, Kubernetes, NVIDIA RAPIDS, Bokeh, Plotly, Streamlit, etc.

Here is a short list of “scope creep” dealbreakers we hear from our customers who have previously tried to own a DSML capability:

  • Setting up and managing cloud hosting and support for AWS, Azure, GCP, or on-premise
  • Ensuring enterprise-grade security of code and data; even more burdensome if you are in a highly regulated industry
  • Configuration: executing work on the proper infrastructure which exposes the appropriate resources and libraries for the task at hand
  • Monitoring, e.g., ensuring minimal downtime
  • User management: managing employee access to systems and information
  • Access control: controlling what users can do and see within an application
  • Managing existing OSS package versioning and integrating new OSS packages
  • Support for end-users; managing consultations with OSS experts

Each of these bullets has a list of further burdens that may not be attractive. In fact, some of it is so painful that our Saturn Cloud co-founder and CTO, Hugo Shi, wrote an article on Kubernetes just to vent.

The Obvious Cost of Owning

Here are the cost components of ownership that you need to consider as you build your DSML capability.

Example 1: Owning Results in Higher Total Cost

Your team is tasked with developing a customer churn model. If you could predict churn, sales could take proactive measures to retain more accounts. Your company generates $100M in annual sales, and there’s an opportunity to reduce churn from 10% to 5%, or by $5M annually. To keep it simple, we’ll assume you’re a SaaS company with 100% gross margins.

Figure 1: Renting = Automated DevOps

Assumes FTE cost of $150K

Given the cost savings from automating DevOps, the renting scenario generates a higher ROI because total spend is lower.
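To make the arithmetic behind Example 1 concrete, here is a minimal Python sketch. The sales figure, churn rates, and FTE cost come from the text above. The team sizes and the platform fee are hypothetical placeholders chosen only for illustration, so swap in your own numbers.

```python
# Sketch of the Example 1 ROI arithmetic.
# Values marked "hypothetical" are illustrative placeholders, not article data.

FTE_COST = 150_000            # from the article: assumed cost per FTE
ANNUAL_SALES = 100_000_000    # from the article: $100M in annual sales
CHURN_BEFORE_PCT = 10         # from the article: current churn
CHURN_AFTER_PCT = 5           # from the article: target churn

OWN_DS_FTES = 6               # hypothetical: data scientists when owning
OWN_DEVOPS_FTES = 3           # hypothetical: DevOps staff when owning
RENT_DS_FTES = 6              # hypothetical: data scientists when renting
RENT_PLATFORM_FEE = 200_000   # hypothetical: annual vendor fee


def roi(gain: float, cost: float) -> float:
    """Return on investment as a ratio: net gain divided by cost."""
    return (gain - cost) / cost


# Sales retained by cutting churn from 10% to 5%. With 100% gross margins,
# retained sales equal the gain to the business.
churn_gain = ANNUAL_SALES * (CHURN_BEFORE_PCT - CHURN_AFTER_PCT) // 100

own_cost = (OWN_DS_FTES + OWN_DEVOPS_FTES) * FTE_COST
rent_cost = RENT_DS_FTES * FTE_COST + RENT_PLATFORM_FEE

own_roi = roi(churn_gain, own_cost)
rent_roi = roi(churn_gain, rent_cost)

print(f"Owning:  cost ${own_cost:,}  ROI {own_roi:.0%}")
print(f"Renting: cost ${rent_cost:,}  ROI {rent_roi:.0%}")
```

Under these assumptions the gain is identical in both scenarios, so the ROI gap comes entirely from the difference in spend. The conclusion reverses if the platform fee exceeds the cost of the DevOps staff it replaces.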

Example 2: Owning Carries High Opportunity Cost

Now let’s assume in both scenarios your team is 9 FTEs, but in the renting scenario all 9 are dedicated to Data Science & ML. A team of 9 FTEs can produce 50% more output than a team of 6 FTEs, so with the spare capacity you take on a second project around customer personalization. Let’s assume this project could result in 5% higher software sales in year 1.

Figure 2: Renting = Force Multiplier

Assumes FTE cost of $150K

Notice that in the renting scenario, you’re actually spending more money, but with the same team size you can generate a higher ROI. By shifting labor spend from DevOps to Data Science & ML, your team is more efficient and can tackle more positive-ROI projects in the same time period. The owning scenario carries an inherent opportunity cost, which the renting scenario avoids.
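The opportunity-cost argument in Example 2 can be sketched the same way. The team size, FTE cost, and project gains come from the text. The platform fee is a hypothetical placeholder, and the assumption that the owning team delivers only the churn project is inferred from the 9-versus-6 FTE comparison.

```python
# Sketch of the Example 2 comparison: same headcount, different allocation.
# Values marked "hypothetical" are illustrative placeholders, not article data.

FTE_COST = 150_000           # from the article: assumed cost per FTE
ANNUAL_SALES = 100_000_000   # from the article: $100M in annual sales
TEAM_SIZE = 9                # from the article: 9 FTEs in both scenarios
PLATFORM_FEE = 200_000       # hypothetical: annual vendor fee when renting

# Project gains (100% gross margins, so added or retained sales equal gain).
CHURN_GAIN = ANNUAL_SALES * 5 // 100            # churn cut from 10% to 5%
PERSONALIZATION_GAIN = ANNUAL_SALES * 5 // 100  # 5% higher sales in year 1


def scenario(total_cost: int, gains: list[int]) -> dict:
    """Summarize one scenario: total cost, total gain, and ROI as a ratio."""
    total_gain = sum(gains)
    return {
        "cost": total_cost,
        "gain": total_gain,
        "roi": (total_gain - total_cost) / total_cost,
    }


# Owning: part of the team supports the platform, so only the churn
# project gets delivered (inferred from the article's 9 vs. 6 comparison).
owning = scenario(TEAM_SIZE * FTE_COST, [CHURN_GAIN])

# Renting: all 9 FTEs do data science, so both projects get delivered,
# at the extra cost of the platform fee.
renting = scenario(
    TEAM_SIZE * FTE_COST + PLATFORM_FEE,
    [CHURN_GAIN, PERSONALIZATION_GAIN],
)

# Gain forgone by keeping part of the team on platform support.
opportunity_cost = renting["gain"] - owning["gain"]

for name, result in (("Owning", owning), ("Renting", renting)):
    print(f"{name}: cost ${result['cost']:,}  ROI {result['roi']:.0%}")
print(f"Opportunity cost of owning: ${opportunity_cost:,}")
```

In this sketch the renting scenario spends more in absolute terms yet returns more per dollar, because the extra spend is small relative to the gain from the second project.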

In both examples, the ROI of renting outperforms that of owning a DSML capability. It is also worth noting that cloud computing pricing has dropped significantly over the past decade, whereas labor costs for data science, machine learning and DevOps have increased significantly.

A Cautionary Tale

Not every organization needs to rent DSML architecture. But it is much easier and less risky to rent before you own.

“Rent before you own”

I have spoken with hundreds of DSML leaders in the past couple of years. A good portion of them led their teams into owning DSML architecture without renting first, and without assessing the obvious and hidden costs of owning. All too often, they turn back halfway, realizing that renting is cheaper, easier, and more flexible, and allows them to stay focused. Furthermore, many developers on these teams expected to be involved only in building the architecture up front, but later had to serve in full-time support roles, spending much less time on the interesting scientific projects they joined the company for!

It’s Somebody Else’s Problem Now

…is what you’ll be saying when you rent the architecture. Yes, all the integration of open source tools, open source version management, building state-of-the-art security around data and code, building enterprise administration architecture, cloud hosting, support services, and open source expert consultations: say it with me, 👏 somebody 👏 else’s 👏 problem!

Not only is that offloaded, but you get some pretty great benefits from a dedicated team working on it.

  • Greater Performance: Saturn’s tooling offers up to 100x faster runtime than Apache Spark, Pandas and other data processing tools
  • Instant Delivery: You subscribe, you have it immediately in your virtual private cloud
  • Expert Support: Leading committers of Python OSS available to support you
  • Smooth Experience: Immediate integration and updating of open source tools
  • Native Integrations: Amazon Web Services, Snowflake, and other cloud services
  • Seamless Teamwork Tools: Interactive and Collaborative DSML Capabilities
  • Automation: Data Pipelines and Workflow Orchestration with Prefect
  • Beautiful: Intuitive, State-of-the-art User Interface
  • Flexibility: Pay As You Go and Cancel Whenever

Concluding: Your Pythonic DSML Capability

Ownership Model: Team and budget are divided between using the DSML capability to create value and supporting that capability.

Rent Model: Entire team and budget are streamlined towards using rented DSML capability to create value.

The purpose of your DSML capability is to maximize its ROI. You want as much of your budget as possible going towards that target, whether the endpoint is faster stock market trading decisions, recommending new marketing investments, running more drug discovery models, or something else.

My advice is:

  • Choose Python for its unmatched ecosystem
  • Choose to rent before you own

Good luck, and if you are curious about Saturn Cloud, please check us out here.