Source: Deep Learning on Medium

# Vectorisation: How to speed up your Machine Learning algorithm x78 times

Given an equation, we will see how, step by step, we can achieve more efficient code: not only x78 faster, but using only three lines of code instead of 11! Let’s dive into it…

# Intro

As an **interpreted language**, Python’s `for` loops are inherently slower than their C counterparts. This is a big bottleneck for the famously popular language, as Deep Learning and Machine Learning algorithms depend heavily on matrix operations, which are executed via `for` loops.

This is why developers created packages like `numpy`, which offer **vectorised** operations on `numpy` arrays. They push the `for` loop that would usually run in Python down to the **C level**, which is much faster.

Python + C level speed = Paradise
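As a quick illustration (this snippet is mine, not from the article), compare a hand-written dot product with numpy’s vectorised one:

```python
import time
import numpy as np

n = 1_000_000
a = np.random.rand(n)
b = np.random.rand(n)

# pure-Python loop: every iteration goes through the interpreter
t0 = time.time()
total = 0.0
for i in range(n):
    total += a[i] * b[i]
loop_time = time.time() - t0

# vectorised: the same loop runs at C level inside numpy
t0 = time.time()
total_vec = a @ b
vec_time = time.time() - t0
```

On a typical machine the vectorised version is orders of magnitude faster, while producing the same result.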

# The Problem

*(you can skip the explanation parts if you are comfortable with the EM algorithm)*

We want to use the **Expectation Maximization** (EM) algorithm for an unsupervised learning task (say, recognition of handwritten digits in the MNIST dataset) and our data are binary (e.g. binary images). A natural way is to model our data with a **Bernoulli mixture model**: a weighted sum of Bernoulli distributions, where each distribution has its own scalar weight **π** and its own mean vector **μ**, and represents a cluster of the data (e.g. if our data are images of the digits 2, 3 & 4 and we use 3 Bernoullis to model them, one Bernoulli will cover the digits 2, another the digits 4, etc.). Overall, that makes the weights a vector and the means a matrix.

Let N = the number of observations, D = the dimensionality of one observation and K = the number of clusters. Because it is important for our problem, the types of random variables that we have are:

• **X**; our data, an NxD matrix

(N the number of images, D their dimensionality → 5 images that are 28×28 will make a 5×784 matrix **X**)

• **π**; a vector of K scalars, one for every distribution, representing its weight.

(e.g. three Bernoullis can have the weight vector π=[0.2, 0.75, 0.05])

• **μ**; the KxD matrix of means, one row per cluster.

(one image has dimensionality D=28×28=784, where each value represents a pixel. Taking the mean of every pixel over the images that belong to the same cluster, say digit 2, we get a mean vector of 784. Thus **μ** is a KxD matrix)

At the **E-step**, we are particularly interested in the expected value of the posterior of the latent variables, the so-called responsibilities `γ`. `γ[n][k]` gives the probability that observation (image) `n` belongs to cluster `k`. `γ` is an **NxK matrix**; for each observation, we assign a probability of belonging to every cluster. The cluster with the maximum probability is the one we assign the observation to.
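The equation images did not survive here; reconstructed from the implementations that follow (in standard Bernoulli-mixture notation), the responsibilities are:

```latex
\gamma(z_{nk}) =
\frac{\pi_k \prod_{i=1}^{D} \mu_{ki}^{x_{ni}} (1 - \mu_{ki})^{1 - x_{ni}}}
     {\sum_{j=1}^{K} \pi_j \prod_{i=1}^{D} \mu_{ji}^{x_{ni}} (1 - \mu_{ji})^{1 - x_{ni}}}
```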

Why am I telling you all this?

“The most important thing in vectorisation is to understand the dimensions of the variables.”

The calculation of the responsibilities is the one we are going to vectorise.

To sum up:

• **X**: NxD matrix

• **π**: 1xK vector

• **μ**: KxD matrix

• **γ**: NxK matrix

# Pipeline

We will create a function `E_step` to calculate the above expression and test it with the following code:

```python
import time
import numpy as np

observations = [5, 10, 20, 50, 100, 200, 500, 1000]

for n in observations:
    X_test = bin_train_data[:n]
    D_test, K_test = X_test.shape[1], 10
    mu_test = np.random.uniform(low=.25, high=.75,
                                size=(K_test, D_test))
    pi_test = np.ones(K_test) / K_test
    t0 = time.time()
    gamma_test = E_step_1(X_test, mu_test, pi_test)
    runtime = time.time() - t0
    assert gamma_test.shape == (n, K_test)
```

Feel free to give it a try by yourself first!

# Attempt №1

On our first attempt, we will write everything with `for` loops; in the vector/matrix operations, we use only scalars.

By looking at the equations, we can see that there are three loops: one over every example `N`, one over every cluster `K` and one over every dimension of every object `D`, and we are going to loop in that order, filling the matrix `γ` one element at a time.

```python
def E_step(X, mu, pi):
    N, D = X.shape
    K = pi.shape[0]
    gamma = np.zeros((N, K))
    for n in range(N):
        for k in range(K):
            m = 1
            for i in range(D):
                m *= mu[k][i]**X[n][i] * (1 - mu[k][i])**(1 - X[n][i])
            gamma[n][k] = m * pi[k]
        gamma[n] /= gamma[n].sum()
    return gamma
```

And our results are visible in the plot below.

We can surely do better than that!

# Attempt №2

It’s good to start from the innermost loop and work your way to the outer ones, and this is exactly what we are going to do!

We want to get rid of the `D` for loop. Thus, every term that depends on `D` should now become a vector. Inside this loop, we have two variables: `μ` and `x` (see eq. `(2)`). Thus `x` and `μ` → vectors. Problem: we have `μ**x`, a vector raised to the power of another vector, which is difficult to calculate. If only we could get around this…

There is a function that turns a power operation into multiplication. That’s right, it is the **logarithm**! So let’s take the logarithm of our expression and then exponentiate the result!

Operations on logarithms of probabilities are preferred, as they offer numerical stability.

Even though in our case it doesn’t have any effect, every time you are using logs, add a small constant `epsilon` inside the expression for stability (so the argument does not go to zero, which gives `-inf`).
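A minimal sketch of the epsilon trick (the value `1e-10` here is an arbitrary choice of mine, not from the article):

```python
import numpy as np

eps = 1e-10  # small constant; the exact value is an arbitrary choice

mu = np.array([0.0, 0.5, 1.0])  # extreme probabilities would give log(0) = -inf
x = np.array([1, 0, 1])

# shifting the argument away from zero keeps every log finite
log_term = x * np.log(mu + eps) + (1 - x) * np.log(1 - mu + eps)
```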

Thus, we will have two element-wise vector multiplications. Easy 😉

```python
def E_step(X, mu, pi):
    N, D = X.shape
    K = pi.shape[0]
    gamma = np.zeros((N, K))
    for n in range(N):
        for k in range(K):
            log_gamma = np.log(pi[k]) + (X[n] * np.log(mu[k])
                                         + (1 - X[n]) * np.log(1 - mu[k])).sum()
            gamma[n][k] = np.exp(log_gamma)
        gamma[n] /= gamma[n].sum()
    return gamma
```

And our results are…

That is a HUGE win! Its curve looks like the x axis compared to `Algor. 1`’s! But we can do better 😉

# Attempt №3

One loop at a time: it’s `K`’s turn!

In the process of vectorisation, we are moving as follows:

scalar → vector → matrix

As we replace more and more loops with `numpy` array operations, more and more code will run at the **C level → faster & cleaner code.**

We take our previous implementation and want to remove the `K` for loop. Thus, every scalar that depends on `K` will turn into a `vector` and every `vector` into a `matrix`. That means `X` will remain the same, `μ` will turn into a matrix, and `π` and `γ` into vectors. Pay attention to the last one; as `γ` is filled row by row, the outcome of our expression now has to be a vector! So the operation between `μ` and `X` has to result in a `1xK` vector, and quick indicators are (i) it has to sum with the vector `π`, which is also `1xK`, and (ii) the outcome is a row of the matrix `γ`, also a `1xK` vector.
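To convince yourself of the shape argument, here is a tiny sketch (shapes chosen arbitrarily by me) showing that a `K`-loop of dot products collapses into one matrix-vector product:

```python
import numpy as np

K, D = 3, 5
rng = np.random.default_rng(0)
mu = rng.uniform(0.25, 0.75, size=(K, D))   # KxD matrix of means
x_n = rng.integers(0, 2, size=D)            # one binary observation

# K-loop version: one dot product per cluster -> K scalars
loop_out = np.array([np.log(mu[k]) @ x_n for k in range(K)])

# vectorised version: a single (K, D) @ (D,) product -> one 1xK vector
vec_out = np.log(mu) @ x_n
```

Both give the same `1xK` vector, ready to be added to `np.log(pi)`.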

Coding it out, we arrive at:

```python
def E_step(X, mu, pi):
    N, D = X.shape
    K = pi.shape[0]
    gamma = np.zeros((N, K))
    for n in range(N):
        log_gamma = np.log(pi) + np.log(mu) @ X[n] \
                    + np.log(1 - mu) @ (1 - X[n])
        gamma[n] = np.exp(log_gamma)
        gamma[n] /= gamma[n].sum()
    return gamma
```

And the results are: