Data collection: scraping Glassdoor with Selenium

Original article was published on Artificial Intelligence on Medium

Web scraping

According to Wikipedia, web scraping is data scraping used for extracting data from websites. Web scraping software may access the World Wide Web directly using the Hypertext Transfer Protocol or through a web browser. While web scraping can be done manually by a software user, the term typically refers to automated processes implemented using a bot or web crawler. It is a form of copying, in which specific data is gathered and copied from the web, typically into a central local database or spreadsheet, for later retrieval or analysis.

Web scraping pipeline

In simpler terms, web scraping is about downloading structured data from the web, selecting some of that data, and passing what you selected along to another process.
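To make the three stages concrete, here is a minimal sketch using only the Python standard library. The HTML snippet, the "job" class name, and the helper names are made up for illustration, and the download step is replaced by a hard-coded string so the example runs offline.

```python
from html.parser import HTMLParser

# Stage 1 (download): a hard-coded stand-in for a page fetched over HTTP.
SAMPLE_HTML = """
<ul>
  <li class="job">Data Scientist</li>
  <li class="ad">Sponsored link</li>
  <li class="job">Machine Learning Engineer</li>
</ul>
"""


class JobTitleParser(HTMLParser):
    """Collects the text of every <li class="job"> element."""

    def __init__(self):
        super().__init__()
        self.in_job = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) tuples
        if tag == "li" and ("class", "job") in attrs:
            self.in_job = True

    def handle_data(self, data):
        if self.in_job and data.strip():
            self.titles.append(data.strip())

    def handle_endtag(self, tag):
        if tag == "li":
            self.in_job = False


def select_titles(html):
    """Stage 2 (select): keep only the job titles, drop everything else."""
    parser = JobTitleParser()
    parser.feed(html)
    return parser.titles


def pass_along(titles):
    """Stage 3 (pass along): shape the selection into records for a later step."""
    return [{"job_title": title} for title in titles]


records = pass_along(select_titles(SAMPLE_HTML))
print(records)
```

The "Sponsored link" item is skipped because its class is not "job", which is the selection step at work.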

Methods of Web Scraping

Using software

Web scraping software automatically loads and extracts data from multiple pages of a website based on your requirements. It is either custom-built for a specific website or configurable to work with any website. With the click of a button, you can save the data available on the website to a file on your computer.

Writing code

This is what we are about to do. The approach is inspired by a Medium post.

Glassdoor

Glassdoor is a website where current and former employees anonymously review companies. Glassdoor also allows users to anonymously submit and view salaries as well as search and apply for jobs on its platform. In 2018, the company was acquired by the Japanese firm, Recruit Holdings, for US$1.2 billion.

Glassdoor does not have a public API for jobs, so you have to scrape the site if you want data about job postings. It does not have an API for reviews either, which might also be of interest to you.

Selenium

Selenium is a portable framework for testing web applications. It provides a playback tool for authoring functional tests without the need to learn a test scripting language. Glassdoor renders its content with JavaScript, which means that a simple GET request to a Glassdoor page returns only the HTML the server sends, without the content that JavaScript adds once the page runs in a browser. Selenium lets you write a Python script that drives a real browser and acts just like a human user.
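As a rough illustration of that idea, the sketch below opens a page in a browser, gives its JavaScript a moment to run, and returns the rendered HTML. The function name fetch_rendered_html and its parameters are my own and do not come from the original post. It assumes Selenium and a Chrome driver are installed, and the fixed sleep is a crude stand-in for a proper wait.

```python
import time


def fetch_rendered_html(url, driver=None, settle_seconds=2.0):
    """Return the page HTML after the browser has run the page's JavaScript.

    Hypothetical helper: pass an existing Selenium driver, or leave it as
    None to start (and later close) a Chrome session.
    """
    own_driver = driver is None
    if own_driver:
        # Imported here so the rest of the module loads without Selenium.
        from selenium import webdriver

        driver = webdriver.Chrome()
    try:
        driver.get(url)             # the browser loads the page like a user would
        time.sleep(settle_seconds)  # crude pause so scripts can fill in content
        return driver.page_source   # HTML as rendered by the browser
    finally:
        # Only close the browser if this function opened it.
        if own_driver:
            driver.quit()
```

Calling fetch_rendered_html with only a URL starts a new Chrome window for that one page. Passing your own driver lets you reuse a single browser session across many pages.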

Code:

Import the required packages from Selenium.

Define a function called get_jobs to collect our data, following the same approach as the Medium post mentioned above.

The full function is available in the original Medium story.
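The real function is not reproduced here, so what follows is only a simplified sketch of the general pattern: find the job cards on the loaded page, then read a few fields from each one. The name get_jobs_sketch, its parameters, and any selectors you pass in are hypothetical. Glassdoor's actual markup is not shown in this post and changes over time, so the selectors have to come from inspecting the page yourself.

```python
# Same string value as selenium.webdriver.common.by.By.CSS_SELECTOR
CSS = "css selector"


def get_jobs_sketch(driver, num_jobs, card_selector, field_selectors):
    """Collect up to num_jobs records from the page the driver has loaded.

    card_selector:   CSS selector matching one job card (hypothetical).
    field_selectors: dict mapping a column name to a CSS selector that is
                     looked up inside each card (hypothetical).
    """
    jobs = []
    cards = driver.find_elements(CSS, card_selector)
    for card in cards[:num_jobs]:
        record = {}
        for field, selector in field_selectors.items():
            matches = card.find_elements(CSS, selector)
            # Store None when a card does not contain the field
            record[field] = matches[0].text if matches else None
        jobs.append(record)
    return jobs
```

With a Selenium driver that has already opened a search results page, a call would look like get_jobs_sketch(driver, 20, ".some-card-class", {"job_title": ".some-title-class"}), where both class names are placeholders.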

Now it is time to run the function:

At the end you get a file like this:

Download it to your desktop using the following code:
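The original snippet is not shown here, so as one possible way to do it, this sketch writes the collected records to a CSV file with the standard library. The function name save_jobs_csv and the example columns are assumptions, and your own code may well use pandas or a notebook download helper instead.

```python
import csv
from pathlib import Path


def save_jobs_csv(jobs, path):
    """Write a list of dicts to a CSV file and return the file path.

    Hypothetical helper: column names are taken from the first record.
    """
    path = Path(path)
    fieldnames = list(jobs[0].keys()) if jobs else []
    # newline="" lets the csv module control line endings itself
    with path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(jobs)
    return path


# Example call; the Desktop folder location is an assumption about your machine:
# save_jobs_csv(jobs, Path.home() / "Desktop" / "glassdoor_jobs.csv")
```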

Happy Learning

Conclusion

I know there is a full tutorial online, but I decided to share this anyway: I write first of all to learn new things, and this is one of the things I’ve learned today.

You can contact me on GitHub or LinkedIn: Zahra Elhamraoui