Vectorisation: How to speed up your Machine Learning algorithm by 78x

Given an equation, we will see, step by step, how to make our code more efficient: not only 78x faster, but also just three lines of code instead of 11! Let’s dive into it…

Intro

Because Python is an interpreted language, its for loops are inherently slower than their C counterparts. This is a big bottleneck for the famously popular language, since Deep Learning and Machine Learning algorithms depend heavily on matrix operations, which would otherwise be executed via for loops.

This is why packages like numpy exist: they offer vectorised operations on numpy arrays, pushing the for loops that would normally run in Python down to the C level, which is much faster.
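As a rough, purely illustrative comparison, here is a minimal sketch timing a pure-Python dot product against its numpy equivalent; the exact speed-up depends on your machine and array sizes:

import time
import numpy as np

a = np.random.rand(1_000_000)
b = np.random.rand(1_000_000)

# Pure-Python loop: every iteration is interpreted
t0 = time.time()
dot_loop = 0.0
for i in range(len(a)):
    dot_loop += a[i] * b[i]
print("python loop:", time.time() - t0)

# Vectorised: the same loop runs in compiled C inside numpy
t0 = time.time()
dot_vec = a @ b
print("numpy:      ", time.time() - t0)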

Python + C level speed = Paradise

The Problem

(you can skip the explanation parts if you are comfortable with the EM algorithm)

We want to use the Expectation Maximization (EM) algorithm for an unsupervised learning task (say, for example, recognition of handwritten digits in the MNIST dataset) and our data are binary (e.g. binarised images). A natural way is to model our data as a Bernoulli mixture model: a weighted sum of Bernoulli distributions, where each distribution has its own scalar weight π and its own mean vector μ, and represents one cluster of the data (e.g. if our data are images of the digits 2, 3 & 4 and we use 3 Bernoullis to model them, one Bernoulli will model the digit 2, another the digit 4, etc.). Overall, that makes the weights a vector and the means a matrix.

p(x | π, μ) = Σ_k π_k · p(x | μ_k)
Bernoulli mixture model (1)

p(x | μ_k) = ∏_{i=1}^{D} μ_{ki}^{x_i} · (1 − μ_{ki})^{(1 − x_i)}
Distribution of one observation x given the cluster k (2)

Let N = number of observations, D = dimensionality of one observation and K = number of clusters. Because it is important for our problem, the variables that we have are:
X: our data, an NxD matrix
(N is the number of images and D their dimensionality → 5 images of 28*28 pixels make a 5x784 matrix X)
π: a vector of length K, with one scalar weight per distribution.
(e.g. three Bernoullis can have the weight vector π = [0.2, 0.75, 0.05])
μ: the KxD matrix of means, one row per cluster.
(one image has dimensionality D = 28*28 = 784, where each entry is a pixel value. Taking the mean of every pixel over the images that belong to the same cluster, say digit 2, we get a mean vector of length 784. Thus, μ is a KxD matrix)
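To make these shapes concrete, here is a minimal sketch with made-up numbers (5 binarised 28x28 images and K = 3 clusters; all values below are purely illustrative):

import numpy as np

N, D, K = 5, 28 * 28, 3                          # observations, pixels per image, clusters

X = np.random.randint(0, 2, size=(N, D))         # binary data, NxD
pi = np.array([0.2, 0.75, 0.05])                 # mixture weights, length K
mu = np.random.uniform(0.25, 0.75, size=(K, D))  # per-cluster pixel means, KxD

print(X.shape, pi.shape, mu.shape)               # (5, 784) (3,) (3, 784)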

At the E-step, we are particularly interested in the expected value of the posterior of the latent variables, the so-called responsibilities:

γ(z_nk) = π_k · p(x_n | μ_k) / Σ_j π_j · p(x_n | μ_j)
E-step of EM algorithm (3)

γ(z_nk) is the expected value of observation (image) n belonging to cluster k.
γ is an NxK matrix; for each observation, we get a probability of belonging to each cluster. The cluster with the maximum probability is the one we assign the observation to.
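Once γ is computed, the hard cluster assignment is just a row-wise argmax; a minimal sketch (gamma below is a made-up 2x3 responsibility matrix):

import numpy as np

gamma = np.array([[0.1, 0.7, 0.2],      # responsibilities of observation 0
                  [0.6, 0.3, 0.1]])     # responsibilities of observation 1

assignments = np.argmax(gamma, axis=1)  # most probable cluster per observation
print(assignments)                      # [1 0]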

Why am I telling you all this?

“The most important thing in vectorisation is to understand the dimensions of the variables.”

The calculation of the responsibilities is the one that we are going to vectorise.

To sum up:
X : NxD matrix
π : 1xK vector
μ : KxD matrix
γ : NxK matrix

Pipeline

We will create a function E_step to calculate the above expression and test it with the following code:

import time
import numpy as np

# bin_train_data: the binarised MNIST training images (an NxD array),
# assumed to have been loaded earlier.
observations = [5, 10, 20, 50, 100, 200, 500, 1000]
for n in observations:
    X_test = bin_train_data[:n]
    D_test, K_test = X_test.shape[1], 10
    mu_test = np.random.uniform(low=.25, high=.75, size=(K_test, D_test))
    pi_test = np.ones(K_test) / K_test
    t0 = time.time()
    gamma_test = E_step(X_test, mu_test, pi_test)   # swap in whichever attempt you want to time
    runtime = time.time() - t0
    assert gamma_test.shape == (n, K_test)
    print(f"N = {n}: {runtime:.4f} s")

Feel free to give it a try by yourself first!

Attempt №1

On our first try, we will write everything with for loops; no vector/matrix operations, only scalars.

Looking at the equations, we can see that there are three loops: one over every example N, one over every cluster K, and one over every dimension D of every observation, and we are going to loop in this order. So we are going to fill the matrix γ one element at a time.

def E_step(X, mu, pi):
    N, D = X.shape
    K = pi.shape[0]
    gamma = np.zeros((N, K))
    for n in range(N):
        for k in range(K):
            m = 1
            for i in range(D):
                m *= mu[k][i]**X[n][i] * (1 - mu[k][i])**(1 - X[n][i])
            gamma[n][k] = m * pi[k]
        gamma[n] /= gamma[n].sum()
    return gamma

And our results are visible in the plot below.

We can surely do better than that!

Attempt №2

It’s good to start from the innermost loop and work your way out to the outer ones. And this is exactly what we are going to do!

We want to get rid of the for loop over D. Thus, every term that depends on D should now become a vector. Inside this for loop, we have two variables: μ and x (see eq. (2)). Thus x and μ → vectors. Problem: it’s μ**x, a vector raised to the power of another vector, which is awkward to compute directly. If only we could get around this…

There is a function that turns a power operation into a multiplication. That’s right, it’s the logarithm! So let’s take the logarithm of our expression and then exponentiate the result!

Operations on logarithm probabilities are preferred, as they offer numerical stability

Even though in our case it doesn’t have any effect, every time you use logs, add a small constant epsilon inside the expression for numerical stability (so the argument does not go to zero, since log(0) = -inf).
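A minimal sketch of what that looks like (eps and the toy vectors below are just illustrative):

import numpy as np

eps = 1e-10                        # tiny constant, keeps log() away from log(0) = -inf
mu_k = np.array([0.0, 0.5, 1.0])   # example cluster means, including the dangerous 0 and 1
x_n = np.array([1, 0, 1])          # one binary observation

log_term = x_n * np.log(mu_k + eps) + (1 - x_n) * np.log(1 - mu_k + eps)
print(log_term)                    # finite values instead of -inf / nan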

Thus, we will have two element-wise vector multiplications. Easy 😉

def E_step(X, mu, pi):
    N, D = X.shape
    K = pi.shape[0]
    gamma = np.zeros((N, K))
    for n in range(N):
        for k in range(K):
            log_gamma = np.log(pi[k]) + (X[n] * np.log(mu[k])
                                         + (1 - X[n]) * np.log(1 - mu[k])).sum()
            gamma[n][k] = np.exp(log_gamma)
        gamma[n] /= gamma[n].sum()
    return gamma

And our results are…

That is a HUGE win! The new curve almost looks like the x axis compared to Algorithm 1’s! But we can do better 😉

Attempt №3

One loop at a time: it’s K’s turn!

In the process of vectorisation, we are moving as follows:

scalar → vector → matrix

As we replace more and more loops with numpy array operations, more and more of the code will run in C → faster & cleaner code.

We take our previous implementation and want to remove the K for loop. Thus every scalar that depends on K turns into a vector and every vector into a matrix. That means X[n] stays the same, μ is now used as a whole KxD matrix, and π and the row γ[n] are used as vectors. Pay attention to the last one: as γ is filled row by row, the outcome of our expression now has to be a vector! So the operation between μ and X[n] has to result in a 1xK vector, and quick indicators are (i) it has to be added to the vector log π, which is also 1xK, and (ii) the outcome is a row of the matrix γ, also a 1xK vector.
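Before writing the function, a quick shape check of the matrix-vector products we are about to use (the sizes below are just illustrative):

import numpy as np

K, D = 3, 784
mu = np.random.uniform(0.25, 0.75, size=(K, D))   # KxD matrix of means
x_n = np.random.randint(0, 2, size=D)             # one binary observation, length D

out = np.log(mu) @ x_n + np.log(1 - mu) @ (1 - x_n)
print(out.shape)                                  # (3,) -> a length-K vector, one entry per cluster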

Coding it out, we get:

def E_step(X, mu, pi):
    N, D = X.shape
    K = pi.shape[0]
    gamma = np.zeros((N, K))
    for n in range(N):
        log_gamma = np.log(pi) + np.log(mu) @ X[n] + np.log(1 - mu) @ (1 - X[n])
        gamma[n] = np.exp(log_gamma)
        gamma[n] /= gamma[n].sum()
    return gamma
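As a quick sanity check on random binary data (names and sizes below are just illustrative), every row of the returned γ should be a valid probability distribution over the clusters:

import numpy as np

N_chk, D_chk, K_chk = 100, 784, 10
X_chk = np.random.randint(0, 2, size=(N_chk, D_chk))
mu_chk = np.random.uniform(0.25, 0.75, size=(K_chk, D_chk))
pi_chk = np.ones(K_chk) / K_chk

gamma_chk = E_step(X_chk, mu_chk, pi_chk)
assert gamma_chk.shape == (N_chk, K_chk)
assert np.allclose(gamma_chk.sum(axis=1), 1.0)    # each row sums to 1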

And the results are: