Why machine learning algorithms are not like Lego

Source: Deep Learning on Medium


This is a story of a rush on data science (DS) and machine learning (ML) by businesses that believe they can quickly (and cheaply) capitalize on this apparent panacea. You know, when a company decides to “really get in on this ML stuff” by slapping together some free, existing solutions (maybe even some code on GitHub) because “someone must have had the same problem before.” It’s a story that keeps recurring, kinda like bankruptcy filings.

A cautionary tale of cutting corners in machine learning

So, listen up, because it comes with a moral that may just help you steer clear of becoming the hero of our cautionary tale. I will show you that DS/ML solutions cannot be directly transferred between seemingly similar cases and it’s actually cost-effective to go with a tailored solution. There’s a plethora of reasons, but our story will focus on three of them.

  1. Seemingly similar problems may not work with the same ML solution
  2. Every case has a unique dataset and you better respect that
  3. Data preprocessing is a must — don’t just throw the whole database at the algorithm

Onto our story. This really could be about most anyone doing business online, but for the sake of our tale, you have an internet shop. You’re doing ok, but ok is not what you were aiming for. So, you keep your eyes open for anything that can make your business really take off. You’ve been reading and hearing a lot about how artificial intelligence and machine learning can really make a difference, and you wouldn’t mind a bit of that for yourself.

The Netflix recommender model isn’t the answer for product recommendations in your store

So, one day, when rifling through Netflix you have a moment of clarity. Of course, Netflix had this huge contest way back when (Netflix Prize, 2009) to come up with its now famous recommender. And when you go to your competitors’ online shops, you see product recommendations. Hmm… What if you could use that amazing recommender to accurately predict what your clients need? That would surely put you well on your way to Amazon glory.

Not exactly. The Netflix system is based on collaborative filtering. In simple terms, it predicts a user's score for an item based on similarities between users and between items. Users are similar if they rate the same movies similarly. So, basically, if you loved Pulp Fiction and Fight Club but hated Titanic and anything Bridget Jones, the recommender will find others with the same great taste and show you other films or shows they scored highly. Why is that of no use to you?
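To make the idea concrete, here's a minimal user-based collaborative filtering sketch. Everything in it is invented for illustration: the toy ratings matrix, the convention that 0 means "not rated", and the tiny smoothing constant in the similarity. Real recommenders work on far larger, sparser matrices with more robust similarity measures.

```python
import numpy as np

def predict_score(ratings, user, item):
    """Predict `user`'s rating of `item` from the ratings of similar users.

    ratings: 2D array, rows = users, cols = items, 0 = not rated.
    """
    rated = ratings[:, item] > 0          # users who actually rated this item
    rated[user] = False                   # exclude the target user themselves
    if not rated.any():
        return None                       # nobody rated it: cold-start problem

    # Cosine similarity between our user and every other user.
    u = ratings[user]
    sims = ratings @ u / (np.linalg.norm(ratings, axis=1) * np.linalg.norm(u) + 1e-9)

    weights = sims[rated]
    if weights.sum() <= 0:
        return None
    # Weighted average of similar users' ratings for the item.
    return float(weights @ ratings[rated, item] / weights.sum())

# Toy matrix: 4 users x 4 movies; user 2 hasn't rated movie 2.
R = np.array([
    [5, 4, 1, 1],
    [5, 5, 2, 1],
    [4, 5, 0, 2],   # user 2: movie 2 unrated
    [1, 1, 5, 4],
])
print(predict_score(R, 2, 2))
```

User 2 rates like users 0 and 1, who both disliked movie 2, so the predicted score comes out low. Note that the function returns `None` whenever an item has no ratings at all, which is exactly the gap the next section runs into.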

Collaborative filtering (score prediction) vs basket analysis (products often bought together)

You have product ratings in your shop, but many customers fail to leave one, which means that some products aren't rated. The result: many users for whom you have nothing to show and many products that will never be shown. OOPS! But honestly, you had a snowman's chance in a forest fire. Why? You went at it the wrong way when the solution was in sight. But that's because you didn't think things through and didn't analyze the data to come up with the right solution.

In this situation, the most complete and useful data you have is your customers’ purchase history, and that means that you can use market basket analysis (a type of affinity analysis). When a customer adds an item to the basket, it is likely that they will be interested in another item if those items are often bought together.
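A bare-bones version of this idea is easy to sketch: count how often each pair of products appears in the same order, then recommend the items that co-occur most with what's in the basket. The order data below is invented for illustration; production systems typically layer association-rule measures such as support, confidence and lift on top of these raw counts.

```python
from collections import Counter
from itertools import combinations

def cooccurrence(orders):
    """Count how often each pair of products appears in the same order."""
    pairs = Counter()
    for order in orders:
        # sorted() gives each unordered pair one canonical key
        for a, b in combinations(sorted(set(order)), 2):
            pairs[(a, b)] += 1
    return pairs

def recommend(pairs, item, k=3):
    """Return the products most often bought together with `item`."""
    scores = Counter()
    for (a, b), n in pairs.items():
        if a == item:
            scores[b] += n
        elif b == item:
            scores[a] += n
    return [product for product, _ in scores.most_common(k)]

# Toy purchase history.
orders = [
    ["bread", "butter", "jam"],
    ["bread", "butter"],
    ["bread", "jam"],
    ["butter", "milk"],
]
pairs = cooccurrence(orders)
print(recommend(pairs, "bread"))   # butter and jam co-occur with bread most often
```

Unlike the rating-based recommender, this needs nothing but the purchase history you already have, which is why it fits the situation above.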

Check out our case which highlights this kind of solution.

A client asked us to develop a proof-of-concept recommender (MVP) for their online store that connects healthy food suppliers with consumers. The shop doesn't carry its own assortment; what's available depends on the user's location, with suppliers and consumers assigned to location-based groups (one supplier can belong to many groups, but each consumer belongs to only one). The only data available in the context of recommendations is the shopping history for each user. There are no ratings, and product categories are very general.

We used market basket analysis to develop an MVP system of recommendations that was able to suggest popular products to a new user, and after adding a single product to the basket, it recommended other products bought along with it by other shoppers. We aimed to further expand the system with product similarities, basket item co-occurrence and personal shopping tendencies. However, after discussing the best solutions going forward with the client, we decided to go with a platform for marketing automation that also had the capabilities of recommendation engines.

Each dataset is unique and needs an individual approach

Now, you’ve learned your lesson but still haven’t lost that fire that keeps you up at night thinking of how to promote your business. Which brings us to Facebook, because we all know that Facebook advertising is where your buck delivers a real bang. That’s why you’re here, with the rest of the world’s sellers. Until now, you’ve done Facebook the ordinary way — targeted everyone ages 18–65, regardless of gender, and sometimes used re-targeting — but the performance has been poor. You have heard there are better ways to target your clients using machine learning.

But, you’re no fool — at least not twice! So this time, you make a real effort to do proper research. And it must be your lucky day because you’ve just found an open source solution that meets all your criteria. It seems that it is already used by some of your competitors, so it must be perfect for you. Of course, the secret sauce in such a solution is the machine learning model which naturally is not provided because it’s… secret. Also, a model trained for a different shop, products and customer base would not reflect your exact situation anyway. Well, whether you knew that or not, it can’t be that big of a deal, right?!

To really deliver on their potential, datasets need data experts

All you’ve got to do is train your own model with your own data. Simple: just match the data you have to the attributes the original code requires. That’s where things start getting complicated. You find out that the model needs the Facebook reach prediction for each of your past ads. But who’d ever think to save data like that? You have ideas about how to estimate this value, but then it turns out there are quite a few attributes missing from your data, and the task becomes too daunting. Also, it bugs you that some attributes that you know from experience are important aren’t included in the model. It may not have been such a perfect fit after all.

Where did you go wrong this time? Each business’s dataset, even for a seemingly identical problem, will be different. Your data is unique and may have something valuable no other business has. Analyzing it along with your exact needs is a must. Machine learning problems are kinda like snowflakes: they may look and feel very much alike, but it is highly unlikely that they are identical.

Check out these next two cases from our backyard, which seem to be the same but really differ in the approach and the data required.

In the first case, we were working on a solution to help Shopify sellers run more successful Facebook ad campaigns. We had historical data on past Facebook ads along with the sales information from Shopify. Based on this, we developed an AI-powered engine, which first learned what worked from the data we fed it, and then was able to suggest campaign characteristics and targeting that would deliver the best Relevance Score, CTR (click-through rate), CPM (cost per thousand impressions), or ROAS (return on ad spend). This gave the sellers greater insight into which creative designs would deliver the best conversions.

The second case also involves Facebook ads, and here we aimed to improve the ad content displayed to targeted audiences to better reflect their needs. In this project, we had the results of a sociological study connecting personal characteristics and aesthetic preferences with “Likes” on Facebook. The solution was a tool that used keywords to identify profiles of Facebook users along with their characteristics and preferences. Based on this information, the ad content could be better adjusted to a given target group, doubling the conversion rate of Facebook ads.

You don’t stick a live chicken in a pot, so why would you skip preprocessing your data?

Fast forward a little bit. The company is slowly growing, and now invoices are giving you a hard time. Most of them are not directly computer-readable as they are scans, and the accompanying metadata is not complete enough to be useful. Processing them takes a lot of time and effort, and you know there must be a better way! You want to digitize all of them and then group them into categories, and you have read about OCR (optical character recognition) software that could “read” a document scan and spew out a blob of text. You’ve heard some even handle tables and captions.

You may be tempted to throw the data you have at an OCR software and see what comes out. This may even work, except not very well. Data, be it images, text, labels or numbers, must be prepared for use in machine learning. Images must be unified: rotated, cropped, and the colors equalized (or preferably reduced to grayscale). With documents, a lot of information may be extracted from the position, size and neighborhood of elements. Text, depending on its further use, may be cleaned of unusual symbols and stopwords, stemmed or lemmatized, and often turned into vectors. Labels or categories must be encoded, as they are not comparable (is red > green?). Numbers have to be normalized, cleaned of outliers and sometimes transformed. Seems like a whole bunch of stuff you had no clue about, and I hope you see where I’m going with this.
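A few of these steps can be sketched in plain Python. The stopword list, labels and numbers below are toy examples of my own; in practice you'd reach for libraries like scikit-learn, NLTK or OpenCV, but the sketch shows what each step actually does to the data.

```python
import re
import statistics

STOPWORDS = {"the", "a", "of", "and"}

def clean_text(text):
    """Lowercase the text, strip unusual symbols, drop stopwords."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if w not in STOPWORDS]

def encode_labels(labels):
    """Map categories to integer codes: 'red' and 'green' aren't comparable as strings."""
    codes = {label: i for i, label in enumerate(sorted(set(labels)))}
    return [codes[label] for label in labels]

def normalize(values):
    """Standardize numbers to zero mean and unit variance."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values) or 1.0   # guard against constant columns
    return [(v - mean) / std for v in values]

print(clean_text("Invoice #123: the TOTAL amount of $99"))  # ['invoice', 'total', 'amount']
print(encode_labels(["red", "green", "red", "blue"]))       # [2, 1, 2, 0]
print([round(x, 2) for x in normalize([10, 20, 30])])       # [-1.22, 0.0, 1.22]
```

Each function encodes a choice an engineer has to make: which stopwords to drop, how to code categories, which normalization to apply. None of those choices transfer automatically to a different dataset.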

You need AI pros to preprocess your data, because all data was not created equal

What I’ve just described is called preprocessing, and it takes up a lot of time in machine learning projects. What’s a lot? To put it in perspective, when Vicki Boykis asked data scientists this year what they spend more than 60 percent of their time on, 67 percent answered “cleaning data/moving data.” Without preprocessing, it would be hard for an algorithm to understand the data. Preprocessing is often the foundation of great results, as the algorithm can spend its complexity and resources on producing results instead of just making sense of the data. But preprocessing requires an ML engineer to get to know the data and make the right choices on how to process it, based on preliminary tests when those are possible. The whole procedure is later automated to prepare new data while the algorithm is in use. It is not transferable between datasets, and even changes within one dataset may invalidate the preprocessing procedure over time. For example, your business grew: invoices with amounts over $10,000 were once unheard of, but are now a regular occurrence.
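Here's a toy illustration of that kind of drift, with invoice amounts invented for the example: a min-max scaler fitted on historical data starts emitting values far outside the [0, 1] range the model was trained on once bigger invoices become routine.

```python
def fit_minmax(values):
    """Learn a min-max scaler from historical data."""
    lo, hi = min(values), max(values)
    return lambda v: (v - lo) / (hi - lo)

# Scaler fitted back when invoices topped out around $9,000.
old_invoices = [120, 450, 2_300, 9_000]
scale = fit_minmax(old_invoices)

# The business grew; $10,000+ invoices are now a regular occurrence.
new_invoice = 25_000
print(scale(new_invoice))   # far above 1.0, outside the range the model expects
```

The scaler itself doesn't crash, which is what makes this failure mode sneaky: the model quietly receives inputs unlike anything it saw during training, and the preprocessing has to be revisited.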

Here’s a quite similar case that shows how we dealt with the problem of efficiently handling documents for one of our clients.

In this case, the client, a tax platform, contacted us to remedy the problem of rekeying data from paper tax forms into their system, a process that was time-consuming and prone to human error. There were several challenges: the tax forms had different formats, and sometimes we had scans, sometimes poor-quality photos. Of course, high accuracy was very important; after all, this was financial data.

Our solution featured algorithms combining OCR technology with document segmentation methods. The result: in the first 6 months, we processed over 10k documents while reducing the processing time by 80% and maintaining over 99% accuracy.

Machine learning doesn’t work on a wing and a prayer, it takes a tailored solution

So, now you have a decision to make. You can continue going with the wing-and-a-prayer variety like you have so far, but if my story served its purpose (and you’re not into self-humiliation), it should be painfully clear that you need to get some experts involved. Of course, once in a while, someone gets lucky and finds that free magic bullet that changes their business. But if you’re not banking your success on luck and are into repeatable results, you must understand that up-to-date knowledge, experience and tailored solutions are the only way.

As long as general artificial intelligence is not available, there is no one-size-fits-all solution in machine learning. For each problem, there are many solutions, and each dataset may demand a completely different approach. Even the best algorithm will not produce the desired results without proper data preparation. These are lessons you can learn one of two ways, and I’ll tell you, the hard way is not just hard, it’s also really expensive.


Originally published at dlabs.pl.