Original article was published by Catherine Breslin on Artificial Intelligence on Medium
Where to find Data for Machine Learning?
High-quality data is key to building useful machine learning models
Machine learning models learn their behaviour from data. So, finding the right data is a big part of the work to build machine learning into your products.
Exactly how much data you need depends on what you’re doing and your starting point. There are techniques like transfer learning to reduce the amount of data you need. Or, for some tasks, pre-trained models are available. Still, if you want to build something more than a proof-of-concept, you’ll eventually need data of your own to do so.
That data has to be representative of the machine learning task, and its collection is one of the places where bias creeps in. Building a dataset that’s balanced on multiple dimensions requires care and attention. Data for training a speech recognition system has to represent aspects like different noisy environments, multiple speakers, accents, microphones, topics of conversation, styles of conversation, and more. Some of these aspects, like background noise, affect most users equally. But some aspects, like accent, have an outsized impact on particular groups of users. Sometimes, though, bias is built deeper into the data than in the composition of the dataset. Text scraped from the web, for example, results in a dataset that embeds many of society’s stereotypes because those are present in text from the web and can’t be scrubbed.
For building successful machine learning models, sourcing data is a critical part of designing and building the overall system. As well as finding data that’s effective for the task, you have to weigh up cost, time to market and data handling processes that have to be put into place. Each source of data has its own pros and cons, and ultimately you might use some combination of data from the sources below.
Publicly available datasets
The first and easiest place to look is at publicly available datasets. There are many different datasets out there, created for many different tasks, indexed and described in places like Kaggle, the UCI Machine Learning Repository, and Google Dataset Search. For speech technology, OpenSLR and the LDC have lists of available data. Often, publicly available datasets have a non-commercial license, or you need to buy the dataset before you can use it commercially.
ImageNet is one well-known computer vision dataset (with a non-commercial license). It’s a large set of millions of images, tagged with the objects that the images depict. This set is famous for being the turning point for deep learning in computer vision. Until ImageNet was available, datasets had been much smaller. The use of a large shared dataset across researchers meant that the effectiveness of deep learning could be clearly shown.
Synthetic data
There are multiple ways to create synthetic data, depending on the task you’re trying to solve. To classify people’s spoken requests to a smart speaker, we could create a short grammar which describes how people might ask for the weather:
(can you)? (tell|give) me the (weather in|forecast for) CITY (please)?
Here the ‘?’ denotes optional words, and the ‘|’ denotes a choice. With a list of cities to fill the CITY marker, we can quickly and easily generate many different examples of how users might ask for the weather, such as:
tell me the weather in London please
can you give me the forecast for Cambridge
give me the weather in Bristol please
can you tell me the weather in Southampton please
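A grammar like this is straightforward to expand in code. The sketch below is illustrative (the function name and city list are invented for this article): it enumerates every utterance the grammar above allows.

```python
# Illustrative sketch: expand the weather grammar into utterances.
# In the grammar, '?' marks optional words and '|' marks a choice.
CITIES = ["London", "Cambridge", "Bristol", "Southampton"]

def expand_grammar(cities):
    """Generate every utterance the grammar allows."""
    utterances = []
    for prefix in ["", "can you "]:                        # (can you)?
        for verb in ["tell", "give"]:                      # (tell|give)
            for phrase in ["weather in", "forecast for"]:  # (weather in|forecast for)
                for city in cities:                        # CITY
                    for suffix in ["", " please"]:         # (please)?
                        utterances.append(f"{prefix}{verb} me the {phrase} {city}{suffix}")
    return utterances

sentences = expand_grammar(CITIES)
print(len(sentences))  # 2 * 2 * 2 * 4 * 2 = 64 generated examples
```

With only four cities the grammar already yields 64 distinct labelled examples, and every one comes with its ground-truth intent for free.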
Once you have a model for how to generate synthetic data, it’s cheap and easy to produce large quantities of data for which you have the ground truth. And that’s beneficial for machine learning models which need large quantities of data.
Synthetic data has its downsides. It’s not normally as realistic as real-world data, and it might not match the behaviour that real users exhibit. Using the grammar above, we would add the optional ‘can you’ to half of our synthetically generated examples, but perhaps in real life none of our users actually say ‘can you’.
Data augmentation
Data augmentation is a way of increasing the amount of data you have by copying and transforming it in different ways.
One successful way to build a speech recognition system that’s robust to noisy environments is to collect a dataset of clean audio, then augment it by adding different kinds of noise to give a dataset of noisy audio. There’s no need to re-transcribe the noisy versions of the audio, because we know the transcription is the same as the clean version. It’s possible to add multiple types of noise, like babble noise or car noise, to create multiple different copies. Other data augmentation techniques for audio include VTLP (vocal tract length perturbation), which mimics making the vocal tract longer or shorter, giving lower or higher frequencies in the audio, and altering the playback speed.
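The noise-mixing step above can be sketched in a few lines of numpy. This is a minimal illustration, not a production recipe: the function scales a noise signal so the mixture hits a chosen signal-to-noise ratio, and the transcription of the result stays identical to the clean original.

```python
import numpy as np

def add_noise(clean, noise, snr_db):
    """Mix noise into clean audio at a target SNR (in dB).
    Both inputs are 1-D float arrays of the same length; the label
    (transcription) of the result is unchanged, so no re-labelling is needed."""
    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2)
    # Scale the noise so clean_power / scaled_noise_power == 10**(snr_db / 10)
    scale = np.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10)))
    return clean + scale * noise

# Usage: one clean utterance becomes several noisy training copies.
rng = np.random.default_rng(0)
clean = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)  # stand-in for real audio
babble = rng.normal(size=16000)                              # stand-in for babble noise
noisy_10db = add_noise(clean, babble, snr_db=10.0)
```

Re-running this with car noise, street noise, or different SNR levels multiplies one transcribed recording into many training examples.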
Data augmentation is also widely used in image processing. There, the types of transformations are different — geometric transforms, colour shifts, additive noise, mixing multiple images etc. The effect is the same though — to create additional training data that is different from the original, but doesn’t need re-labelling.
Targeted data collection
Another option is to run a targeted data collection effort, and hire people to create the data you need. Perhaps you have crowdworkers talk to your dialogue system, or you engage expert translators to translate text as the basis of a machine translation system. Data collection efforts can be expensive, but (provided the participants are motivated to participate properly and not cheat) can result in realistic and reasonably good quality data.
Care has to be taken in the design of your data collection effort to ensure you have a wide range of participants and that you don’t influence the data you get by the instructions you give. Furthermore, people who are paid to interact with your system aren’t going to act the same as your final customers. A crowdworker talking to your ticket booking system and being paid to do so isn’t going to necessarily speak to the system in the same way as a customer who really does want to book a ticket.
Data you already have
Your organisation might already own suitable data for training a machine learning model. Perhaps you have large collections of internal text documents that could be used directly for language modelling. Or maybe investing some effort in labelling those documents means they can then be used as a dataset for NLP tasks like Named Entity Recognition (NER), which extracts entities (such as names, locations, organisations etc.) from text. Labelling data you already own is likely cheaper than gathering new data.
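Labelling for NER often means attaching a tag to each token. As an illustrative sketch (the function, tokens, and span format here are invented for this article), documents might be converted into BIO-tagged training pairs like this:

```python
# Illustrative sketch: turn labelled spans into BIO-tagged NER training data.
def to_bio(tokens, spans):
    """spans: list of (start, end_exclusive, entity_type) over token indices.
    B- marks the first token of an entity, I- the rest, O everything else."""
    labels = ["O"] * len(tokens)
    for start, end, ent in spans:
        labels[start] = f"B-{ent}"
        for i in range(start + 1, end):
            labels[i] = f"I-{ent}"
    return list(zip(tokens, labels))

tokens = ["Ada", "Lovelace", "lived", "in", "London"]
spans = [(0, 2, "PER"), (4, 5, "LOC")]
bio = to_bio(tokens, spans)
# [('Ada', 'B-PER'), ('Lovelace', 'I-PER'), ('lived', 'O'),
#  ('in', 'O'), ('London', 'B-LOC')]
```

Once internal documents are labelled in a format like this, they become a reusable NER dataset.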
Production data
Once your model is deployed, perhaps the best-matched source of data is what you get from customers interacting with it. This data is the most realistic, and doesn’t have the artifacts that come from synthetic sources. If your product is widely used, you can see a broader range of users than with targeted data collections.
However, the world changes fast. By the time you are able to use production data, it might already be a bit out of date. This is especially the case in fast-moving fields, like subtitling of live news broadcasts. New topics and events come and go quicker than you can incorporate data about them into your models. If you learn about the language of last week’s news, you might not necessarily be good at modelling the language of next week’s.
Overfitting to the past is one risk of over-relying on production data. Another risk is overfitting to your current set of users and features. If you’ve just launched your product and your users are early adopters and familiar with the technology, then how they use it might differ from the ways that later adopters will. Biasing your models too much towards understanding how early adopters interact may actually degrade the performance for the later adopters.
Production data is also the source most likely to come directly from customers, and thus to be subject to data handling laws such as Europe’s GDPR.
While collecting and labelling production data may be expensive, it typically pays dividends in improving the system performance. For this reason, it’s often the primary source of data for a production machine learning model, though perhaps used in combination with some of the other sources described above to provide additional robustness.