A handwritten introduction to PageRank



The Damping Factor

The previous implementation worked perfectly well on a simple example. But we should not expect every network to be so well-behaved: with more examples and a few special cases, we quickly stumble upon unexpected results. Let's take the following graph, Delta, as an example:

Delta.

The network is split into two sub-networks, which we will call (i) and (ii), connected by a single one-way link. Therefore, if the random surfers start their quest in (i), the probability that they end up in (ii) is not zero.

However, once they reach (ii), they cannot go back to (i). If we build the link matrix and compute its main eigenvector, we get the following result:

[-3.58205484e-01, 0.00000000e+00, 7.61186653e-01, 2.23878427e-02, 4.06650495e+14, -8.13300990e+14, 1.21995149e+15, -8.13300990e+14]
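To make this concrete, here is a minimal sketch of how such a computation could be set up with NumPy. The edge list is only a placeholder for a Delta-like topology (two sub-networks joined by a single one-way link), and the helper names are mine, not the article's:

import numpy as np

def link_matrix(edges, n):
    # Column-stochastic link matrix: entry (i, j) = 1/outdegree(j) if page j links to page i.
    L = np.zeros((n, n))
    for src, dst in edges:
        L[dst, src] = 1.0
    out_degree = L.sum(axis=0)
    out_degree[out_degree == 0] = 1.0  # avoid dividing by zero for dangling pages
    return L / out_degree

def main_eigenvector(L):
    # Return the (real part of the) eigenvector whose eigenvalue is closest to 1.
    values, vectors = np.linalg.eig(L)
    idx = np.argmin(np.abs(values - 1.0))
    return np.real(vectors[:, idx])

# Placeholder edge list: pages 0-2 form sub-network (i), pages 3-7 form (ii),
# and the single one-way link 2 -> 3 connects them.
edges = [(0, 1), (1, 2), (2, 0), (2, 3), (3, 4), (4, 5), (5, 6), (6, 7), (7, 4)]
print(main_eigenvector(link_matrix(edges, n=8)))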

As we would expect, the pages in (ii) have a much higher rank than those in (i). A common example of such a sub-network is Wikipedia: many pages across the web link to it, but Wikipedia pages only link internally (their external links are "no-follow" links, and therefore are not taken into account by search engines).

The problem with this implementation of PageRank is that it assumes that once you have landed on a Wikipedia page, you keep browsing its pages forever, without ever going back or teleporting to another website by typing its URL. Obviously, this does not reflect the real behaviour of a web surfer, and the results it produces are not meaningful.
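This is exactly what the damping factor, which gives this section its name, is meant to fix: with probability d the surfer follows a link, and with probability 1 − d they teleport to a page chosen uniformly at random. As a rough sketch (the function below is my own illustration, using the conventional d ≈ 0.85, and it assumes a column-stochastic link matrix L):

import numpy as np

def pagerank(L, d=0.85, tol=1e-10, max_iter=1000):
    # Power iteration with a damping factor:
    # with probability d the surfer follows a link, otherwise they jump to a random page.
    n = L.shape[0]
    r = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        r_next = d * (L @ r) + (1.0 - d) / n
        if np.abs(r_next - r).sum() < tol:
            return r_next
        r = r_next
    return r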

A more conspicuous example of this problem arises when the network consists of several completely disconnected graphs, as in the example below.

Epsilon.

Since there are no links between the two sub-networks that make up Epsilon, it is impossible to establish a single, global popularity ranking.

If we try to use our algorithm to compute the PageRank of this network, here is the r vector after each of the first 10 iterations:

Initial [0.2 0.2 0.2 0.2 0.2]
1 [0. 0.2 0.4 0.2 0.2]
2 [0. 0.4 0.2 0.2 0.2]
3 [0. 0.2 0.4 0.2 0.2]
4 [0. 0.4 0.2 0.2 0.2]
5 [0. 0.2 0.4 0.2 0.2]
6 [0. 0.4 0.2 0.2 0.2]
7 [0. 0.2 0.4 0.2 0.2]
8 [0. 0.4 0.2 0.2 0.2]
9 [0. 0.2 0.4 0.2 0.2]
10 [0. 0.4 0.2 0.2 0.2]
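For reference, here is a small sketch that reproduces this run. Epsilon's exact edges are not reproduced here, so the link matrix below assumes a structure consistent with these numbers (A → C and B ↔ C in one component, D ↔ E in the other):

import numpy as np

# Column-stochastic link matrix for the assumed Epsilon-like graph.
# Columns are the source pages A..E, rows the destination pages.
L = np.array([
    [0, 0, 0, 0, 0],   # nothing links to A
    [0, 0, 1, 0, 0],   # C links to B
    [1, 1, 0, 0, 0],   # A and B link to C
    [0, 0, 0, 0, 1],   # E links to D
    [0, 0, 0, 1, 0],   # D links to E
], dtype=float)

r = np.full(5, 0.2)
print("Initial", r)
for i in range(1, 11):
    r = L @ r
    print(i, r)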

The ranks of pages B and C oscillate between 0.2 and 0.4 without ever settling. If we let it run longer, it keeps going for more than 1 million iterations without converging.

We will use NumPy's eigenvector solver to try to understand the reason behind this behaviour.

eigenvalues: [ 1. -1.  1. -1.  0.]
first eigenvector of eigenvalue 1: [0. 0. 0. 0.70710678 0.70710678]
second eigenvector of eigenvalue 1: [0. 0.70710678 0.70710678 0. 0. ]

The problem with this network is that its link matrix has multiple eigenvectors with eigenvalue 1: the first one gives the ranks within the sub-network {D, E} while ignoring the other sub-network, and the second one gives the ranks within {A, B, C}.
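This also explains why the iteration cannot settle: if v1 and v2 both satisfy L·v = v, then so does any blend of them, so there is no single ranking vector to converge to. Here is a quick check on the same assumed link matrix as above (variable names are mine):

import numpy as np

# Same assumed link matrix as in the oscillation sketch above.
L = np.array([
    [0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [1, 1, 0, 0, 0],
    [0, 0, 0, 0, 1],
    [0, 0, 0, 1, 0],
], dtype=float)

values, vectors = np.linalg.eig(L)
print("eigenvalues:", np.real(values))

# Collect every eigenvector whose eigenvalue is (numerically) 1.
fixed = [np.real(vectors[:, i]) for i, v in enumerate(values) if np.isclose(v, 1.0)]
v1, v2 = fixed

# Any combination of the two is also a fixed point of L,
# so the network has no unique rank vector.
blend = 0.3 * v1 + 0.7 * v2
print(np.allclose(L @ blend, blend))  # prints True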