There is no AI without a solid data strategy

Original article was published on Artificial Intelligence on Medium


Sclable Insights — Interview with Charles Dietz, Lead Data Scientist

When talking to Charles, make sure to come with an open mind. Any conversation with the physicist and former CERN researcher can quickly turn into an insightful discussion about AI. In this interview, learn more about our Data Science Team Lead, why the best minds in data science are in such high demand, and why every company needs a proper data strategy, whether or not they will actually engage with AI.

Hello Charles. Let’s shed some light on your background. Please introduce yourself.

I’m one of the many fundamental-research physicists who left academia to work in industry. I did my undergraduate and master’s studies at EPFL in Lausanne, Switzerland, and completed my master’s thesis research at Tsinghua University in Beijing in the field of experimental high-energy physics (or particle physics). As I enjoyed my stay in Asia, I decided to do my PhD in the region, and that’s how I ended up at the National Taiwan University in Taipei. Yet, as we were members of an international collaboration, I spent about three years based at CERN in Geneva working on one of the biggest particle detectors, the CMS detector, built around the Large Hadron Collider (LHC). That was around the exciting time of the Higgs boson discovery in 2012.

After graduating in 2015, I relocated from Taiwan to Europe and settled in Vienna. There I first worked as a bioinformatician at the CeMM Research Center for Molecular Medicine, where I supported biologists and medical doctors in the analysis of large genomic datasets. Then I truly left academic research to work for an AI startup, where I became familiar with deep learning and recommender systems. Last year, I decided to join Sclable as the company was forming a new AI / Data Science Team to extend its offering to industry clients. That finally allowed me to work with the manufacturing, logistics and banking sectors all at once.

Data scientists rank among the most sought-after people in the DACH market (Germany, Austria, Switzerland). Why are specialists in the field of data science in such high demand?

As you might know, in the last few decades there has been a dramatic worldwide shift towards the democratization of data systems and the usage of the internet. As a result, a lot of data was suddenly not only recorded but also made much more widely available by the growing IT infrastructure. That’s precisely why finance and internet companies were the first to jump on the big data and AI trains, hiring large numbers of computer scientists, mathematicians and physicists to manage and analyse these large amounts of data. By the way, there’s a now completely outdated joke that goes: “Do you know what statisticians are called in Silicon Valley? Data Scientists.” Eventually, more and more companies started to realize that they were accumulating extremely valuable data about their own processes and their customers, so they needed people to gain insights from these raw data and to build new services and new products around them.

The other noticeable phenomenon is that data storage and processing capabilities kept growing exponentially while prices constantly dropped. The rise of GPU computing, in particular, allowed the explosion of deep learning applications, which until recently was essentially a purely academic research topic. In return, this also stimulated research. That had a huge impact and definitely contributed to feeding the hype around artificial intelligence. So, suddenly, there was a considerable demand across industries for people competent in the fields of data science and applied machine learning; until then, these people were hiding in academia or in a few very specific industrial sectors.

You are leading the Data Science and AI Team here at Sclable. Tell me a little bit more about your role and who you are working with.

Being the Lead Data Scientist makes me responsible for organizing the work of my team within the company and acting as the interface between them and the management, along with the business developers. In that sense, I also contribute to defining the strategy of the company, as well as ensuring that our team is properly integrated with other teams and our general processes. Nowadays, one of the major pitfalls in the industry is having data scientists completely isolated from the rest of the company; it can be a very harmful situation. This leading role also makes me the spokesperson of the team internally: I am the one raising concerns to the management, aligning our agenda and eventually helping to resolve conflicts.

As for the data science part, I’m pretty much at the same level as my other team members. We are still a rather small team covering a large number of techniques and applying them across various industries, for which a lot of domain knowledge is required. That is, nobody knows it all, and we take a very collegial approach where everyone brings and shares their own expertise. That’s a serious lesson in humility, and my experience in fundamental research has been very helpful in that matter. It’s about managing sensitivities while maximizing the exchange of knowledge within the team. For that purpose, we also have our own study and research agenda with dedicated time to further educate ourselves on state-of-the-art statistics and machine learning methods.

Finally, I represent Sclable, whether by engaging with clients on technical matters or by attending public events such as industry fairs, data science conferences or local meetups.

Let’s get back to basics: there are a lot of misconceptions and confusion around AI, and there has been a big hype around artificial intelligence. What is it about, and why is there so much confusion?

There is definitely too much hype around artificial intelligence, mostly fed by the mainstream media and sometimes even by respected people from the field themselves. On the one hand, it’s not all bad, as it has gotten us a lot of attention and increased general awareness of the topic, whether about actual AI capabilities or about ethical concerns over real or hypothetical uses of AI. On the other hand, it has created some serious misunderstandings among the general public as well as among company executives. We have also witnessed misuses of the term by certain companies in their communication. This overpromising around AI solutions could well bring about a so-called “AI winter”; some argue it has already started.

So what’s behind AI? AI is an umbrella term that essentially encompasses machine learning (identifying and learning patterns in data) as well as rule-based and heuristic approaches. The general misconceptions are the following:

  • There is no such thing as a “General AI” that can compete with human intelligence on a large variety of tasks simultaneously, and there might never be one. Existing AI systems are as good as, and sometimes better than, humans only at very specific and well-defined tasks like recognizing a person, classifying a chunk of text or identifying a limited visual, textual or audio context.
  • Despite having been very successful in many applications, deep learning, which is the usage of multi-layered artificial neural networks, does not solve everything. There are many classes of problems in certain domains (like healthcare) for which it is not suitable or not preferred, as it can seriously lack robustness and explainability.
  • In practice, no AI is possible without a well-defined data strategy in the first place: since most machine-learning algorithms consist of identifying patterns in datasets, nothing can be done without (quality) data and a data collection pipeline.

Thinking more philosophically, AI is just a technological means, not a goal. In a sense, it will be shaped after us. But that’s a never-ending fascinating debate which has already been partly covered by science-fiction literature and by philosophers of course.

“There is no AI strategy without a solid data strategy”. This sentence from one of our first conversations still keeps me thinking. What do you mean by that?

In principle, it is very simple: as just mentioned, in its current state, AI is made of a large number of statistical techniques that aim at recognizing patterns in data, whether these data are structured or not. If you have no data or low-quality data, there will be no valuable model and no AI. So, any AI strategy must be preceded by a thorough data strategy.

In practice, it’s not that simple: it can mean a deep paradigm shift for companies that are not used to working with data or do not even have experience with IT infrastructure or software development. It implies defining new positions, new units, new processes, new infrastructure, new products, new business models, new sales strategies, etc. There are many ways to fail. Doing this right requires a carefully coordinated effort whose impulse must come from the top.

As with everything in business, a data strategy must serve some business goals. In my opinion, the most important components of any data strategy are:

  • an understanding of the value of data by all stakeholders, the value lying both in data ownership and in the insights and patterns you can extract from the data;
  • a carefully defined data collection strategy, which means determining which data must be collected and at what quality, along with the necessary infrastructure, all based on the selected use cases;
  • clear KPIs for everyone involved.

Additionally, in the EU, a lot of effort has to be made to ensure compliance with the GDPR, which depends heavily on the kind of applications you work on. User privacy, model explainability, trustworthiness and several other ethical questions must be addressed early on, as they can affect the whole data collection and processing pipeline. Getting it wrong can quickly lead to extremely costly consequences, in legal terms but also for the company’s image.

Why go through all this pain? Because a lot of value can be created along the way, not to mention an increase in competitiveness. AI applications are fueling an entirely new wave of automation and optimization that will affect virtually all industries and define the future winners. This is a very strategic topic, and every CEO should be concerned with it. But the journey must start with a clearly defined data strategy.

So, the data strategy sets the foundation for any AI initiative. What are the steps any company needs to take to build their data strategy?

For that, there is already a well-established and openly available methodology called CRISP-DM (Cross-Industry Standard Process for Data Mining). It consists of a few well-defined steps: Business Understanding, Data Understanding, Data Preparation, Modelling, Evaluation and Deployment. It is important to understand that the process is iterative and that it is likely going to take a few rounds, with some PoC(s) involved, until satisfactory results are obtained. And that implies a direct communication channel between the data science team and the various stakeholders, which means finding a common language.

The particularity of AI projects is that they must start with a clear understanding of what the potential AI use cases are for a given industry. There are generic ones, but some can be very industry-specific. Again, domain knowledge is crucial. On top of this, the outcome of the process is a lot less certain than for usual IT or software projects. In particular, one can never guarantee a certain accuracy for a model until the full dataset of satisfactory quality is made available and the model is trained. Stakeholders must be willing to deal with this higher uncertainty and admit that even if the PoC fails, one can learn a lot along the way. But by breaking the problem down into testable hypotheses, it is possible to greatly reduce the risks. I actually like the term “hypothesis testing” because it resonates both with statistics and with business.
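The “testable hypotheses” framing can be made concrete with a minimal statistical check. As a hedged sketch (the sample counts and the baseline accuracy below are invented for illustration), a one-sided binomial test tells you whether a PoC model’s holdout accuracy genuinely beats a naive baseline or could plausibly be a fluke:

```python
from math import comb

def binomial_p_value(successes: int, trials: int, baseline: float) -> float:
    """One-sided p-value: probability of observing at least `successes`
    correct predictions out of `trials` if the true accuracy were only
    `baseline` (the null hypothesis)."""
    return sum(
        comb(trials, k) * baseline**k * (1 - baseline) ** (trials - k)
        for k in range(successes, trials + 1)
    )

# Hypothetical PoC result: the model classified 130 of 200 holdout
# samples correctly; the majority-class baseline sits at 55% accuracy.
p = binomial_p_value(130, 200, 0.55)
print(f"p-value = {p:.4f}")  # a small p-value supports keeping the PoC
```

A check like this is cheap to run after every PoC round and gives both the data scientists and the business stakeholders a shared, quantified go/no-go signal.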

In your personal opinion: what are the areas where companies could benefit the most from AI or specifically machine learning? I know there are myriads of potential applications but let’s focus on some examples.

I’d say that the limits are our own imagination and the budget companies are willing to put into such projects. AI will not solve everything but there is still plenty to do. As I’ve already mentioned, AI supports the new wave of automation and optimization across industries which is great news: it means we’ll be able to create more value with less energy, less waste and make certain tasks a lot less harmful for humans.

Let’s focus on two existing generic use cases that are crucial for the manufacturing industries and to some extent to logistics and infrastructure:

  • Visual quality inspection: computer vision systems can perform quality checks at every step of a manufacturing production line, at a very high rate and around the clock.
  • Predictive maintenance: time-series forecasting allows you to predict the remaining lifetime of a machine or identify anomalous patterns during operation in order to intervene preemptively.

These two use cases alone can apply to a very large number of configurations.
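To make the predictive-maintenance idea tangible, here is a minimal sketch of anomaly detection on a univariate sensor stream using a rolling z-score; the readings, window size and threshold are invented for illustration, and a production system would use far richer forecasting models:

```python
from statistics import mean, stdev

def rolling_zscore_anomalies(series, window=20, threshold=3.0):
    """Flag indices whose value deviates from the trailing window's mean
    by more than `threshold` standard deviations."""
    anomalies = []
    for i in range(window, len(series)):
        history = series[i - window : i]
        mu, sigma = mean(history), stdev(history)
        if sigma > 0 and abs(series[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies

# Hypothetical vibration readings: a stable repeating pattern with one
# spike (5.0) that a maintenance team would want to catch early.
readings = [1.0, 1.1, 0.9, 1.0, 1.05, 0.95] * 5 + [5.0] + [1.0] * 5
print(rolling_zscore_anomalies(readings, window=10))
```

Even this naive detector illustrates the shape of the use case: a stream of machine data, a statistical model of “normal”, and an alert when operation drifts away from it.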

Generally, there is a good rule of thumb expressed by the well-known applied-AI researcher Andrew Ng: “If a repetitive task can be performed by a human within one second, then it is probable that an AI system could do it”. So, ask yourself how many of these tasks still exist in your daily life or in your processes.

What are you currently working on with your team? Is there a specific challenge you can talk about?

We have quickly realized that there is a lot of demand from our customers for comprehensive consultancy and education around the topic of AI applications for the industry. So we are putting a lot of effort into developing quality materials and workshops to help them understand what AI is all about and which AI application use-cases are relevant for them. It’s a lot of education but also some advanced business strategy.

Another element of our work is maintaining a high technical and conceptual level within our team. That means continuing to study reference books and keeping track of state-of-the-art research. In practice, it’s a lot of group reading and discussion on all the fundamental topics involved, from mathematics to computing paradigms. We also like to measure our skills against other teams in online machine-learning competitions. The level gets higher every year, so it’s a great motivation to push our limits and learn more.

One last question I always ask, as we foster the habit of lifelong learning here at Sclable: What are you currently learning? What are the next things you will look into?

I like exploring more exotic topics far from mainstream data science. Recently I wrote a blog article on how the Ancient Babylonians were already following best practices of data science; for me, it’s a way to tackle the hype and place ourselves in our historical context. Besides that, as a physicist, I am looking back into statistical physics, from which many concepts of machine learning and information theory have emerged. I’m also in love with complex numbers, so I’m exploring the academic literature on building neural networks with complex or quaternionic weights. Finally, I’m loosely following the development of quantum computing, as there is a good chance that we will encounter it in our careers and there seems to be great potential for the development of quantum machine-learning algorithms. But there is never enough time to study all of this thoroughly.