Building The AI Stack

Source: Deep Learning on Medium

As the use of machine learning, and specifically of compute-intensive deep learning, booms across research and industry, the market for building the machine learning stack has exploded. This is perfectly illustrated by technology giants such as Google, Amazon and Microsoft releasing cloud products aimed at making ML (Machine Learning) technology easier to develop and, ultimately, at driving cloud sales. The trend is also visible in the large number of startups building infrastructure and tools, from data preparation (image labelling), to training (model optimization), to deployment.

Five pain points emerge across the pitches of the startups I see at Elaia and of the big cloud providers:

  • Inefficient ML Teams: machine learning scientists and engineers spend too much time on plumbing and low-value tasks, such as setting up their infrastructure and tooling, hacking around data pipelines, and building basic automation, which delays R&D projects.
  • Low Hardware Utilization: compute infrastructure is underutilized, leading to suboptimal costs and/or to ML teams battling over scarce resources.
  • Technology Uncertainty: as new hardware products (Google TPUs, the latest NVIDIA chips) and software products (Google AutoML) roll out, it becomes hard for decision makers to make long-term decisions about ML infrastructure.
  • Heterogeneity of Skills and Stacks: traditional data scientists and IT managers are slowly migrating to new technologies (advanced ML and deep learning), and need to work alongside the new generation of deep-learning-native scientists.
  • Regulation and Privacy: data is often critical and requires stringent security, and emerging regulation creates additional challenges (HIPAA, GDPR, etc.).

The market is still immature, although the technology fundamentals are there to build a comprehensive offering that will ultimately unlock rapid ML development and wide real-world impact.

As in any post of this type, it is hard to use the right vocabulary. AI, ML, and Deep Learning have all been misused. In some regards, the AI stack has existed ever since the first computer was built. Here I will focus on the products that enable developing the current generation of machine learning models, which are typically compute- and data-intensive. I call that, abusively, the AI stack.

Arguing that Google’s strategy and products will deeply influence the market, and drawing inspiration from what happened with a previous generation of technology, namely the MapReduce paradigm and the Hadoop ecosystem, I will propose two scenarios for what the stack may look like in the future.

1) Google’s Strategy is Driving the AI Infra & Tools Market

Google’s strategy

Google’s business model is overreliant on advertising revenue, a fact that investors have pointed out many times. As new platforms emerge, and as interfaces such as voice (e.g. virtual assistants) are widely adopted, search in the format we know today will slowly decrease in volume. To compensate, Google has launched a number of other products, which until quite recently had limited impact on the bottom line.

It seems that two strategic opportunities have emerged in the AI space:

Build revenue through AI-added services: the rise of AI has created a massive market opportunity for Google, which has significant expertise in the field, the culture, and the internal infrastructure to rapidly develop and launch AI-driven products.

Conquer the cloud market: although AWS is still the market leader in cloud infrastructure services, with north of 30% market share, the cloudification of the world is not over, leaving room for a new generation of cloud products. In the rest of this article, we will focus on the strategy Google has deployed to tap into this opportunity.

Cloud Provider Growth vs Market Share

Google has built products across the AI stack

Although it is easy to argue that Google is impacting every market, this is specifically the case in the AI infrastructure and tools market. Indeed, Google has released products across the various layers of the stack. The machine learning framework TensorFlow is by far the most popular. Google Cloud, historically dwarfed by AWS in terms of revenue, is the favourite cloud of machine learning scientists.

Google’s Products Cover the Stack. Some are open source software (Kubernetes, TensorFlow), some are hosted services (ML Engine), and some are hardware (TPU)

To win the cloud war, Google has played with two levers:

  • Product Differentiation: value-added AI infrastructure products and services may win over both AI developers with a burning need for a solid stack (bottom-up adoption) and leadership looking for support in building their AI strategy (top-down adoption), where Google’s brand is a significant asset. Free, open source frameworks such as TensorFlow have spread fast, easing the marketing of Google ML Engine, a high-priced machine learning cloud platform. Google also released Kubeflow, an open source machine learning platform that runs on top of Kubernetes and might provide an alternative to Google ML Engine for on-premise users.
  • Lower Switching Costs: better products may not be enough to convince customers, in particular in the enterprise production world, where migrating workloads is costly and can require significant software redesign. The containerization of applications has created an opportunity to lower switching costs. Google won that war with Kubernetes, imposing its container workload orchestration system. In doing so, it has also significantly reduced switching costs by making workloads more portable, easing the move both between cloud services and from on-premise to cloud offerings.

This matters to those seeking to understand the future of the AI stack: Google is building products for various layers of the stack. TensorFlow has established itself as the leading (but not the only) machine learning framework. Google ML Engine is acclaimed for its user-friendliness, although its high price point may be a blocker for heavy users. Kubeflow, the open source platform, is slowly gaining traction although it is not yet mature.

And at the very bottom of the stack, Kubernetes has gained a definite edge in the container workload orchestration world, one that extends well beyond AI. Kubernetes adoption for production workloads creates suitable ground for building an entire Kubernetes-native AI ecosystem.

Google Open Source Products Adoption


One of the consequences of Google’s strategy is that each layer of the AI stack is served, with varied but growing success, by a Google-maintained open source product (although Kubernetes is now maintained by the Cloud Native Computing Foundation).

The question I want to answer is whether and where there will be room for proprietary machine learning offerings.

To understand the dynamics, I will use the analogy of the Hadoop ecosystem (Part 2) and try to infer two scenarios.

2) A (hopefully legitimate) Analogy with the Hadoop Ecosystem

The Hadoop stack can be broken down into infrastructure, platform, and tools quite similarly to the AI stack.

There are at least a few limitations to that analogy:

  • Hadoop was purpose-built for the MapReduce paradigm and for processing very large datasets. Kubernetes, on the other hand, was developed for general-purpose workload orchestration. Beyond the technical implications, this means Kubernetes adoption will happen in a much larger market. As a side effect, a company may adopt Kubernetes to orchestrate the containers of its main website, and then favor ML products that can make use of its existing Kubernetes installation.
  • Deep learning models are fundamentally more complex to develop and train than ETL operations (the original use case for the Hadoop ecosystem). They often make use of GPU compute, which creates another technological difference and calls for a more complex stack.
  • Hadoop adoption started when the market existed but was quite small: it began with a few customers, including Yahoo, maintaining large Hadoop clusters and gradually buying services to help manage them. For the AI stack, by contrast, the need for a machine learning platform existed at a large scale before any solution was available. The current offering is still relatively immature, but the market has exploded.

What can we learn from this analogy?

  • The Hadoop ecosystem has centered around a deep stack of open source products and tools.
  • The business model of the Hadoop stack is mostly driven by services and enterprise support and tools, à la Red Hat.

In the long run, will this happen for AI too? Will the market widely adopt Kubernetes+Kubeflow as the infrastructure and machine learning platform, in which case other players will switch to building enterprise versions, tools and services for Kubeflow? Or will there be a market for proprietary enterprise platforms?

3) Two scenarios

Before diving into projections, let me come back to the market. It can roughly be split into three personas:

  • Small players and individual developers favor simplicity and have simple IT requirements. They are typical customers for existing cloud offerings such as Google ML Engine or AWS SageMaker.
  • Enterprise cloud users are similarly good targets for existing offerings, although the specific needs of advanced players also call for customisations that cloud providers cannot address.
  • Enterprise on-premise targets are currently massively underserved in terms of platform and tooling. The main uncertainty lies there and justifies the two scenarios below.

The AI Stack Broken Down by Segment: the major uncertainty lies in the on-premise segment

Scenario 1: Competing Open Source and Enterprise products co-exist

As the need for enterprise products already exists, and the open source offering is far from mature, a few companies have started building up closed source offerings.

This is the case for companies such as Clusterone, RiseML, and several others.

If open source offerings do not mature fast, the demand for on-premise, enterprise-grade offerings will drive the adoption of these closed source products. Open source may catch up later, in a second wave of products.

Business Model: Proprietary Enterprise software

Remaining Opportunities: The stack will, as a consequence, remain quite fragmented. This may make it harder to build dominant tool offerings on top of it. Those market conditions would probably also favor end-to-end, verticalised machine learning platforms.

Scenario 2: Open Source Dominates and an Enterprise Ecosystem Emerges

Pre-existing Kubernetes adoption and a faster maturation of Kubeflow may trigger wide adoption and a virtuous cycle of contributions to the open source codebase.

This would open up opportunities for Kubeflow-based Enterprise-grade offerings, services and support models, quite similar to the Hadoop ecosystem offerings of Cloudera and Hortonworks (now Cloudera-Hortonworks) and MapR.

Business Model: Services, Support, and Enterprise-Kubeflow

Remaining Opportunities: This standardisation of the stack around Kubeflow would favor companies building tools on top of the stack, and an ecosystem model of tools and applications around Kubeflow.


Predicting how the stack will evolve is a tricky exercise. Regardless, keeping both scenarios in mind can help decision makers identify opportunities and hedge against changing market conditions.

Although some standardisation of the stack will be beneficial to those building applications and tools, a mature and standard offering will take time to emerge. There might be a third way whereby open source and closed source players collaborate on defining standards, maximizing both their chances of success and end-user value.

Thanks to Louisa Mesnard for her kind help in writing this article.

Disclaimer: Views are my own only. I was a co-founder at Clusterone and hold stock in the company.