Easy Access to the World’s Largest Data Source

Original article was published by Soner Yıldırım on Artificial Intelligence on Medium


Wikipedia API for Python

In data science, the importance of data comes well before building state-of-the-art algorithms. Without a large amount of proper data, we cannot train models well enough to get satisfying results.

Wikipedia, the world's largest encyclopedia, can serve as a great data source for many projects. There are many web scraping tools and frameworks for getting data from Wikipedia, but the Wikipedia API for Python might be the simplest one to use.

In this post, we will see how to use the Wikipedia API to:

  • Access the content of a particular page
  • Search for pages

You can easily install and import it. I will be using Google Colab; here is how it is done there:

pip install wikipedia
import wikipedia

A page can be fetched with the page function, which takes the title of the page as an argument. The following code returns the Support Vector Machine page as a WikipediaPage object.

page_svm = wikipedia.page("Support vector machine")
type(page_svm)
wikipedia.wikipedia.WikipediaPage
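
Note that page lookups can fail: an ambiguous title raises a DisambiguationError and a title with no match raises a PageError. Here is a minimal sketch of handling both ("Mercury" is just an illustrative ambiguous query):

try:
    page = wikipedia.page("Mercury")
except wikipedia.DisambiguationError as e:
    # e.options lists the candidate page titles
    print(e.options[:5])
except wikipedia.PageError:
    print("No page matched the given title.")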

This object holds the URL of the page, which can be accessed with the url attribute.

page_svm.url
https://en.wikipedia.org/wiki/Support_vector_machine
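
A WikipediaPage object exposes several other attributes as well. For example (a brief sketch; attribute names are from the wikipedia package):

page_svm.title      # the canonical title of the page
page_svm.summary    # plain-text summary of the page's lead section
page_svm.links      # titles of the Wikipedia pages linked from this page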

We can access the content of the page with the content attribute.

svm_content = page_svm.content
type(svm_content)
str

The content is returned as a string. We can preview the first 1000 characters of the svm_content string with a simple slice:
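
print(svm_content[:1000])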

The returned content is a plain string, which is not the optimal format for analysis, but we can process this raw string to extract meaningful results. There are efficient natural language processing (NLP) libraries for working with textual data, such as NLTK, as well as pretrained language models such as BERT.

We will not go into detail on NLP tasks here, but let's do a simple operation. We can split the content string into a Python list that contains the words as separate elements, and then count the occurrences of a specific word.

content_lst = svm_content.split(" ")
len(content_lst)
57779
content_lst.count("supervised")
4
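
Splitting on spaces is a crude tokenization: punctuation stays attached to the neighboring words, so "supervised," and "supervised" are counted separately. Here is a minimal refinement, as a sketch (the regex pattern is an illustrative choice, not part of the wikipedia package):

import re

# Extract lowercase alphabetic tokens so case and punctuation
# do not fragment the counts.
words = re.findall(r"[a-z]+", svm_content.lower())
words.count("supervised")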