At Work-Bench, we host quarterly peer roundtables that connect corporate executives who are solving some of the largest technology pain points facing their organizations. At our Machine Intelligence Roundtable a few weeks back, it became clear that ML and AI were the most significant initiatives this year and that most of the attendees were actively building supporting infrastructure. Priorities included navigating build vs. buy scenarios, hiring data science talent, avoiding vendor lock-in, and setting cloud strategy. So far, investments in ML and AI have typically supported BI reporting, but some folks in our network are exploring infrastructure that drives a flywheel of value to customers using data they have already collected. The teams we’ve backed at UpLevel Security, Merlon Intelligence, and Socure are disrupting entire industries because they’ve built this infrastructure as well as a moat around high-value data. Following the roundtable, I’ve been focused on uncovering ways in which enterprises can also leverage data they already have to create systems of intelligence that improve their business.
A Brief History
A lot has changed over the last decade in the AI Infrastructure landscape. Since Hadoop’s introduction in 2006, advancements in cloud computing, containers, GPUs and data science have revolutionized the way insights can be harvested from data. New and emerging technologies suggest that innovation is happening, particularly in areas tackling engineering problems like data access management, pipelining, model development and inference. However you slice it, the AI infrastructure market is huge: IDC forecasts that spending on AI and ML will grow from $12B in 2017 to $57.6B by 2021.
Using Crunchbase, I took a look at over 500 investments made in the last 3 years and included those that exhibit the three qualities I look for in potential portfolio companies: a stellar team, a superior product, and promising market traction. Our AI Infrastructure landscape features the most promising startups in the space and is shaped by the emerging strategies for bringing data-driven insights into businesses through ML and AI. I’ve broken out each segment and described a forward-looking trend for each, while highlighting a few notable companies in each area. If you have any feedback, please share it on Twitter.
Data Access Management
After data has been stored, there are business challenges in providing access for end users. Aside from ownership and political issues between groups in an organization, legal/privacy concerns exist around the storage, analysis, and usage of customer data (with GDPR enforcement just around the corner).
Trend: Make data discoverable, versioned, and secure with policies that are dynamically applied to the data through a data catalog, standard package, or data marketplace. Lineage and data governance are longstanding problems within this category that are actively being solved.
- Tamr: Tamr’s patented software fuses the power of machine learning with your knowledge of your data to automate the rapid unification of data silos at scale.
- Dremio: Dremio makes all your data self-service. Run SQL from any tool against data from any source.
- Immuta & Cerebro: Allow teams to securely access and work with high-value data, without having to worry about data access and usage policies.
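As a rough illustration of what "policies that are dynamically applied to the data" can mean, here is a minimal Python sketch in which masking happens at read time instead of being baked into the stored rows. The roles, column names, and the `apply_policy` helper are all hypothetical and are not taken from any of the products listed above.

```python
import hashlib

# Hypothetical policy table: for each role, which columns are returned as-is
# and which must be masked before the row leaves the data layer.
POLICIES = {
    "analyst": {"visible": {"country", "plan"}, "masked": {"email"}},
    "admin": {"visible": {"country", "plan", "email"}, "masked": set()},
}


def mask(value: str) -> str:
    """Replace a sensitive value with a short, stable hash token."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()[:12]


def apply_policy(row: dict, role: str) -> dict:
    """Return a copy of the row filtered and masked for the given role."""
    policy = POLICIES.get(role)
    if policy is None:
        raise PermissionError(f"no policy defined for role {role!r}")
    result = {}
    for column, value in row.items():
        if column in policy["visible"]:
            result[column] = value
        elif column in policy["masked"]:
            result[column] = mask(str(value))
        # Columns named in neither set are dropped entirely.
    return result


row = {"email": "jane@example.com", "country": "US", "plan": "pro", "ssn": "000-00-0000"}
analyst_view = apply_policy(row, "analyst")
admin_view = apply_policy(row, "admin")
```

Because the policy lives outside the data, changing what a role may see means editing one table, not rewriting or copying the underlying records.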
Intelligent ETL / Data Pipelines
Dirty data is still the number one challenge for data scientists, and traditional GUI-based ETL tools lack the flexibility, ease of use, and robustness needed for unstructured data workloads. Code is the ultimate abstraction for working with data. While data scientists spend 80% of their time cleaning data, they also spend time asking for data to be made available, explained, and moved, so there is more work to be done to enable data consumption.
Trend: Analysts and developers need tooling that interplays well, and building and managing data pipelines is core to this. Alteryx managed to consumerize ETL for Tableau, and today ambitious startups are using AI to automate ETL and data preparation tasks.
- Datalogue: Datalogue’s smart data pipelines and classifiers empower enterprise teams to move, prepare, and transform data easily for analytics and data science, shifting data preparation from a per source paradigm to an ontology-based workflow.
- Astronomer: Built by developers, for developers, Astronomer offers an “Airflow as a service” product for engineering complex ETL workflows.
- Nexla: Helps create inter-company data feeds to receive data or send data to partners with security features like encryption and permissions.
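To make the idea of pipelines written as code a little more concrete, here is a minimal extract/clean/load sketch using only the Python standard library. The sample data, the function names, and the in-memory "warehouse" list are assumptions made for illustration; this is not how Airflow or any of the products above is implemented.

```python
import csv
import io
from datetime import datetime

# Hypothetical raw export with typical dirt: stray whitespace,
# a missing revenue value, and an unparseable date.
RAW_CSV = """name,signup_date,revenue
 Alice ,2018-01-05,120.50
bob,2018-01-07,
Carol,not-a-date,75
"""


def extract(text):
    """Parse raw CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def clean(rows):
    """Normalize names, default missing revenue to 0.0, drop rows with bad dates."""
    cleaned = []
    for row in rows:
        try:
            signup = datetime.strptime(row["signup_date"], "%Y-%m-%d").date()
        except ValueError:
            continue  # skip rows whose date cannot be parsed
        cleaned.append({
            "name": row["name"].strip().title(),
            "signup_date": signup,
            "revenue": float(row["revenue"] or 0.0),
        })
    return cleaned


def load(rows, target):
    """Append cleaned rows to a list standing in for a warehouse table."""
    target.extend(rows)
    return len(rows)


def run_pipeline(text, target):
    """Chain the three steps; each one stays testable on its own."""
    return load(clean(extract(text)), target)


warehouse = []
loaded = run_pipeline(RAW_CSV, warehouse)
```

Keeping each step as a plain function is one way to get the flexibility that GUI-based tools tend to lack: steps can be unit tested, versioned, and recombined like any other code.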
Feature Engineering
Feature engineering increases the predictive power of learning algorithms by creating features from raw data that facilitate the learning process. This requires a combination of business expertise and data science know-how. While some data science platforms have rudimentary feature engineering capabilities, it’s a contentious topic for elite data scientists who believe it’s their special sauce.
Trend: While Intelligent ETL vendors may build into this area, feature engineering tools could be a bigger hit with business analysts who collaborate with data scientists.
- Feature Labs: Feature Labs builds tools and APIs for automated feature engineering.
- ScaleAPI: Scale allows developers to access humans on-demand. Their growing suite of APIs currently handles a wide array of use cases such as audio transcription, image recognition, and categorization.
- MightyAI: Originally launched as Spare5, MightyAI builds a training-data-as-a-service platform, mostly for images and autonomous vehicles today.
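For readers less familiar with the term, the following sketch shows what "creating features from raw data" might look like in the simplest case: turning a transaction log into per-customer aggregates that a model could consume. The data, feature names, and the `build_features` helper are hypothetical, and real automated feature engineering tools go well beyond this.

```python
from collections import defaultdict
from datetime import date

# Hypothetical raw transaction log: (customer_id, purchase_date, amount).
TRANSACTIONS = [
    ("c1", date(2018, 1, 3), 20.0),
    ("c1", date(2018, 2, 10), 40.0),
    ("c2", date(2018, 1, 15), 15.0),
    ("c1", date(2018, 3, 1), 30.0),
]


def build_features(transactions, as_of):
    """Aggregate raw transactions into one feature dict per customer."""
    grouped = defaultdict(list)
    for customer_id, purchase_date, amount in transactions:
        grouped[customer_id].append((purchase_date, amount))

    features = {}
    for customer_id, purchases in grouped.items():
        amounts = [amount for _, amount in purchases]
        last_purchase = max(purchase_date for purchase_date, _ in purchases)
        features[customer_id] = {
            "purchase_count": len(purchases),
            "total_spend": sum(amounts),
            "avg_spend": sum(amounts) / len(amounts),
            # Recency is a classic example of business knowledge
            # encoded as a feature.
            "days_since_last": (as_of - last_purchase).days,
        }
    return features


features = build_features(TRANSACTIONS, as_of=date(2018, 3, 31))
```

Which aggregates are worth computing (recency, frequency, spend) is exactly where business expertise enters, which is why this step tends to involve analysts as well as data scientists.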
Modeling & Training
Clear winners have emerged in this category: working in tools like JupyterLab, practitioners have made open source frameworks (TensorFlow, MXNet, PyTorch, CNTK, and Keras) the consensus choice. While there are end-to-end data science platforms that add UI and collaboration features, the community has coalesced around these open source projects.
Trend: For data science platforms to be successful, incorporating business users into the mix is the key feature most enterprises are looking for.
- TensorFlow: TensorFlow is a tool for machine learning. While it contains a wide range of functionality, TensorFlow is mainly designed for building deep neural network models.
- Dataiku: Builds a visual and interactive workspace that’s accessible to both Data Scientists and Business Analysts.
- SigOpt: SigOpt’s API tunes your model’s parameters through state-of-the-art Bayesian optimization.
Deployment / DevOps
As data science models are manually deployed into production, efficiency, scaling, monitoring, and auditing become cumbersome and expensive. Automating model deployment and hardware management becomes important because of the need for specialized hardware (TPUs, GPUs), the spiky compute demands of running models, and model governance.
Trend: Managed solutions automate away the infrastructure engineering challenges required to train, deploy, and run models at scale.
- Algorithmia: Automates DevOps for AI, allows for customer auth & permissioning, model inventory & discovery, and gives enterprises the ability to run on any cloud. Their current cloud solution supports 68k developers and over 5k models in production.
- Paperspace: A PaaS that makes it easy to train and deploy deep learning models using GPUs managed by Paperspace.
- PipelineAI: Gives data scientists and engineers the freedom to quickly deploy, test, and roll back their models directly in production.
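The deploy, test, and roll back loop mentioned above can be sketched as a toy in-memory model registry. The `ModelRegistry` class and the lambda "models" are hypothetical stand-ins chosen for illustration; none of this reflects the actual APIs of the managed platforms listed here.

```python
class ModelRegistry:
    """Toy in-memory registry that tracks deployed model versions."""

    def __init__(self):
        # Stack of (version, predict_fn); the last item is the live model.
        self._history = []

    def deploy(self, version, predict_fn):
        """Make a new version live while keeping earlier ones for rollback."""
        self._history.append((version, predict_fn))

    def rollback(self):
        """Discard the live version and return the version now serving."""
        if len(self._history) < 2:
            raise RuntimeError("no earlier version to roll back to")
        self._history.pop()
        return self.live_version

    @property
    def live_version(self):
        if not self._history:
            raise RuntimeError("no model deployed")
        return self._history[-1][0]

    def predict(self, features):
        """Route a prediction request to whichever version is live."""
        if not self._history:
            raise RuntimeError("no model deployed")
        _, predict_fn = self._history[-1]
        return predict_fn(features)


registry = ModelRegistry()
registry.deploy("v1", lambda x: sum(x) > 1.0)  # stand-in for a trained model
registry.deploy("v2", lambda x: sum(x) > 5.0)  # a newer, stricter model
prediction_v2 = registry.predict([2.0, 1.0])
rolled_back_to = registry.rollback()
prediction_v1 = registry.predict([2.0, 1.0])
```

A production system would add persistence, traffic splitting, monitoring, and audit logs on top of this idea, which is the infrastructure work that managed solutions aim to take off a team's plate.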
Experiments lead to a data driven future
The engineering challenges in ETL and Deployment are actively being solved in the AI Infrastructure landscape, and there will be winners in both of these categories over the next several years. I think there’s an opportunity for experimentation platforms that bring business stakeholders and analysts together with engineers and data scientists to evaluate the effectiveness of models as they are deployed as product features in production. This is more workflow than it is technology, and I am excited about the companies that will build products in this area for the enterprise. If you’re working on something like this or have thoughts on the landscape, please reach out!
Note: Algorithmia, Datalogue and Tamr are Work-Bench portfolio companies.
Special thanks to our Machine Intelligence Roundtable for inspiring this post, as well as Drew Conway, Shinji Kim, Jared Lander, Lauren Ottaviano, David Boast, Kapil Chhibber, and the entire Work-Bench team.
Source: Deep Learning on Medium