Original article was published on Artificial Intelligence on Medium
Since I began working in the AI sector, I have found that the million-dollar problem is finding data. You can be as good as you want and have plenty of brilliant ideas to change the world, but without data you have nothing in your hands. Because data has become such a precious commodity, you need to know in detail how to search for it.
There are three ways you can mine data from the internet:
- API
- Web Scraping
- Open Source datasets
Googling for information won’t get you far
One thing you see all the time in companies is people spending hours searching websites and copying what they find into Excel spreadsheets by hand. It is a waste of precious time, both for the employee and the employer.
In the last few years, Google has become one of the most frustrating tools for finding information. Search results prioritize ads, and targeted searches are no longer reliable. Collecting any real volume of information manually, especially from several different websites, is simply not practical.
What are your options?
1. API
Simply speaking, an API is a set of documented endpoints and rules that lets your code connect to a website's database and download information.
For example, say I want to download a list of tweets that contain the keyword “#sustainability”. I need the Twitter API for that. The same applies if I want to connect to the stock market, an online shopping site, a chess site, a game...
Note that the website must publish a public API for you to connect to its database. The website will also limit how much information you are allowed to download. Only a few websites offer information without requiring you to pay, but if you are lucky you can still download some data for free.
How to search for an API
For example, say I want to download chess games from my favorite chess website: lichess.org. You can google (lol) for the lichess API and, if you are lucky, find the API documentation published by lichess.org.
In fact, https://lichess.org/api documents the API and gives instructions for downloading chess games.
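To make this concrete, here is a minimal sketch of what downloading a player's games could look like using only the Python standard library. It assumes the `games/user/{username}` endpoint, the `max` query parameter, and the `application/x-ndjson` response format described in the lichess documentation; check https://lichess.org/api before relying on any of them. The function names are my own, and `some_username` is a placeholder.

```python
import json
import urllib.parse
import urllib.request

# Endpoint as described in the lichess API documentation (assumption: verify there).
LICHESS_GAMES_URL = "https://lichess.org/api/games/user/{username}"


def build_games_request(username, max_games=10):
    """Build a GET request for a user's games, asking for NDJSON output."""
    query = urllib.parse.urlencode({"max": max_games})
    url = LICHESS_GAMES_URL.format(username=urllib.parse.quote(username)) + "?" + query
    # The Accept header asks for one JSON object per line instead of PGN text.
    return urllib.request.Request(url, headers={"Accept": "application/x-ndjson"})


def parse_ndjson(text):
    """Turn newline-delimited JSON into a list of dictionaries, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def download_games(username, max_games=10):
    """Send the request and return the games as a list of dictionaries."""
    request = build_games_request(username, max_games)
    with urllib.request.urlopen(request, timeout=30) as response:
        return parse_ndjson(response.read().decode("utf-8"))


# Example usage (performs a real network call, so it is left commented out):
# for game in download_games("some_username", max_games=5):
#     print(game.get("id"))
```

Keeping the request building and the parsing in separate functions means you can check both without touching the network, which helps when the code "is not always stable".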
Do all websites offer APIs?
Unfortunately not. Facebook, for example, has restricted access to its data, so you are not allowed to download information from it (not even posts). I will talk about alternatives to APIs later, but for Facebook you cannot download any information without their written consent.
If a website offers an API, what limitations will I likely encounter?
If you do not know how to code, that is the first issue. Every website needs its own tailored approach, and it is not as easy as it looks.
The most common response format is JSON, which keeps the data compact, but there are other formats. The data you download needs to be standardized, understood, and then stored the way you want (I would guess a .csv file). It is time-consuming, and the code is not always stable.
Sometimes you will be lucky enough to find websites that offer information for free. Most of the time, though, you cannot even download free information without a subscription plan, so have a backup plan ready.
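As an illustration of the "standardize, then store" step, here is a small sketch that flattens nested JSON records and writes them to CSV with the standard library. The record layout (`id`, `user.name`, `tags`) is invented for the example; real API responses will have their own fields.

```python
import csv
import io


def flatten(record, parent_key="", sep="."):
    """Flatten nested dictionaries: {"user": {"name": "ann"}} -> {"user.name": "ann"}."""
    flat = {}
    for key, value in record.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat


def records_to_csv(records, out):
    """Write a list of (possibly nested) dictionaries to a file-like object as CSV."""
    rows = [flatten(record) for record in records]
    # Collect every column that appears in any row, in first-seen order.
    fieldnames = []
    for row in rows:
        for key in row:
            if key not in fieldnames:
                fieldnames.append(key)
    # restval fills in an empty cell when a row lacks a column.
    writer = csv.DictWriter(out, fieldnames=fieldnames, restval="")
    writer.writeheader()
    writer.writerows(rows)


# Hypothetical records, shaped like a typical JSON API response.
sample = [{"id": 1, "user": {"name": "ann"}}, {"id": 2, "tags": "x"}]
buffer = io.StringIO()
records_to_csv(sample, buffer)
print(buffer.getvalue())
```

To write to disk instead of a string buffer, pass a file opened with `open("data.csv", "w", newline="")`.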
You cannot just download gigabytes of data from a database at full speed around the clock. Heavy traffic would slow down the servers, so websites are careful and cap how many requests you can make. In practice you perform a GET request (the HTTP request used to retrieve information from the server) at most once every n seconds. Of course, the entire process can be automated.
Most websites that offer APIs, unless the data is entirely open, do so for profit (now you understand what selling data means). They will ask you to pay if you want to download more than a certain amount of data.
Another way to limit downloads is not by size but by the number of requests. For example, downloading historical stock prices with Alpha Vantage is limited to 500 requests per day.
These numbers (like a limit of 100k tweets per day) may not seem like a big limitation, but if you are running a company with 500 employees and aiming to build gigantic predictive AI models, 100k tweets is a laughable amount for what you want to build.
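Automating "one GET request every n seconds" can be sketched as a simple loop that pauses between calls. This is only an illustration: the function names and the one-second default are my own choices, and the right delay depends on each website's published limits.

```python
import time
import urllib.request


def fetch(url):
    """Perform a single GET request and return the raw response body."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return response.read()


def fetch_all(urls, delay_seconds=1.0, fetch_fn=fetch, sleep_fn=time.sleep):
    """Fetch each URL in turn, waiting delay_seconds between consecutive requests."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            # Pause before every request except the first one.
            sleep_fn(delay_seconds)
        results.append(fetch_fn(url))
    return results


# Example usage (performs real network calls, so it is left commented out):
# pages = fetch_all(["https://example.com/a", "https://example.com/b"], delay_seconds=2.0)
```

Passing `fetch_fn` and `sleep_fn` as parameters makes it easy to try the loop with fake functions before pointing it at a real server.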
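A quick back-of-the-envelope calculation shows how fast a daily cap becomes a bottleneck. The 500 requests per day figure is the one quoted above; the workload of 5,000 tickers at one request each is a made-up number for illustration.

```python
import math


def days_needed(total_requests, daily_limit):
    """Number of whole days required to send total_requests under a daily cap."""
    if daily_limit <= 0:
        raise ValueError("daily_limit must be positive")
    return math.ceil(total_requests / daily_limit)


# Hypothetical workload: one request per ticker for 5,000 tickers.
print(days_needed(5_000, 500))  # 10
```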
2. Web Scraping
Web scraping is becoming my favorite way of downloading data. After all, dealing with APIs is never fun (ask around if you do not believe me).
Some websites have a list of information you can see directly on their webpage. One of the examples I want to use is Xtrawine.