How to Scrape Multiple Pages of a Website Using a Python Web Scraper

Original article can be found here (source): Artificial Intelligence on Medium

Time to Code

As mentioned in the first article, I recommend following along in an online coding environment if you don’t already have an IDE.

I’ll also be writing out this guide as if we were starting fresh, minus all the first guide’s explanations, so you aren’t required to copy and paste the first article’s code beforehand.

You can compare the first article’s code with this article’s final code to see how it all worked — you’ll notice a few slight changes.

Alternatively, you can go straight to the code here.

Now, let’s begin!

Import tools

Let’s import our previous tools and our new tools — time and random.
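The import block itself is not reproduced here, so the following is only a minimal sketch of the two new tools, both from Python's standard library. The helper name polite_delay and the 2 to 10 second range are assumptions made for illustration, as is the idea that the tools are used to pause between page requests; the scraping libraries from the first article would be imported alongside them.

```python
# Standard-library tools that are new in this guide.
from time import sleep      # pauses the program for a number of seconds
from random import randint  # picks a random whole number in a range


def polite_delay(low=2, high=10):
    """Return a random number of seconds to wait between page requests.

    The function name and the 2-10 second range are illustrative assumptions.
    """
    return randint(low, high)


# Inside a scraping loop, the pause would look like: sleep(polite_delay())
```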

Initialize your storage

As before, we’ll use empty lists as storage for all the data we scrape:
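A minimal sketch of that storage might look like the following. The list names, the store_movie helper, and the sample values are hypothetical placeholders; the actual fields depend on what the first article scraped.

```python
# Empty lists that will collect one value per movie (names are illustrative).
titles = []
years = []
ratings = []


def store_movie(title, year, rating):
    """Append one movie's fields to the matching storage lists."""
    titles.append(title)
    years.append(year)
    ratings.append(rating)


# Placeholder values, not scraped data.
store_movie("Example Movie", 1999, 8.1)
print(titles, years, ratings)
```

Keeping one list per field means every list has the same length, so the lists can later be combined row by row.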

English movie titles

After we initialize our storage, we add the code that makes sure we get English-translated titles for all the movies we scrape:
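One common way to do this, sketched here on the assumption that pages are fetched with the third-party requests library, is to send the standard Accept-Language HTTP header with every request. The exact header value below is an assumption rather than a confirmed copy of the original code.

```python
# Ask the server for English content via the standard Accept-Language header.
# "en-US" is preferred; any other English variant is accepted at lower weight.
headers = {"Accept-Language": "en-US, en;q=0.5"}

# With the requests library, the dictionary is passed on each request:
#   response = requests.get(url, headers=headers)
```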

Analyzing our URL

Let’s go to the URL of the page we’re scraping.

Now, let’s click on the next page and see what page 2’s URL looks like:

And then page 3’s URL:

What do we notice about the URL from page 2 to page 3?

We notice that &start=51 is appended to the URL when we go to page 2, and that 51 becomes 101 on page 3.

This makes sense because there are 50 movies on each page. Page 1 is 1-50, page 2 is 51-100, page 3 is 101-150, and so on.

Why is this important? This information will help us tell our loop how to go to the next page to scrape.
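The pattern can be written down as a small sketch. BASE_URL is a hypothetical placeholder rather than the real address, the choice of four pages is arbitrary, and requesting start=1 for the first page is an assumption, since the first page's URL may not carry the parameter at all.

```python
PAGE_SIZE = 50  # movies per page, as observed in the URLs


def start_for_page(page_number):
    """Return the value of the start parameter for a 1-based page number."""
    return (page_number - 1) * PAGE_SIZE + 1


# The same values produced directly with range(): 1, 51, 101, 151
starts = list(range(1, 4 * PAGE_SIZE + 1, PAGE_SIZE))

# BASE_URL is a hypothetical placeholder, not the real address being scraped.
BASE_URL = "https://example.com/search?groups=top"
page_urls = [f"{BASE_URL}&start={start}" for start in starts]

for url in page_urls:
    print(url)
```

Stepping by the page size means the loop never has to know the page number, only the index of the first movie on each page.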

Refresher on ‘for’ loops

Just like the loop we used to loop through each movie on the first page, we’ll use a for loop to iterate through each page on the list.

To refresh, this is how a for loop works:

for <variable> in <iterable>:
    <statement(s)>

<iterable> is a collection of objects—e.g. a list or tuple. The <statement(s)> are executed once for each item in <iterable>. The loop <variable> takes on the value of the next element in <iterable> each time through the loop.
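As a concrete illustration, here is a short loop over a list of page offsets. The variable names are chosen for this example only.

```python
# A list is an iterable; the loop variable takes each element in turn.
page_starts = [1, 51, 101]

visited = []
for start in page_starts:
    # These statements run once per item in page_starts.
    visited.append(f"page starting at movie {start}")

print(visited)
```

After the loop finishes, visited holds one entry per element of page_starts, in the same order.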