Source: Deep Learning on Medium

## The detailed explanation behind the math equation to build the practical math foundations for your machine learning or deep learning journey

A big gap between a software engineer and a machine learning engineer is the ability to convert math equations into real code. Sometimes we need to implement basic concepts from scratch to understand the magic behind the scenes, rather than just importing a library without understanding it.

So I decided to write some articles explaining how to convert math equations into real code. This is part 1, in which I give a classification example using logistic regression for a **linearly separable problem**. I will try to keep the explanation as simple as possible.

The content is structured as follows. It looks a little long:

- Look at the data
- Linear separable problem
- The vector representation
- Standardization
- Add bias
- Sigmoid function
- Likelihood function
- Update parameter θ
- Plot the line
- Summary

### 1 Look at the data

Here is the data, `linear_data.csv`:

```
x1,x2,y
153,432,0
220,262,0
118,214,0
474,384,1
485,411,1
233,430,0
396,321,1
484,349,1
429,259,1
286,220,1
399,433,0
403,300,1
252,34,1
497,372,1
379,416,0
76,163,0
263,112,1
26,193,0
61,473,0
420,253,1
```

First, we need to plot this data to see what it looks like. We create a Python file and name it `logistic_regression.py`.

```python
import numpy as np
import matplotlib.pyplot as plt

# read data
data = np.loadtxt("linear_data.csv", delimiter=',', skiprows=1)
train_x = data[:, 0:2]
train_y = data[:, 2]

# plot
plt.plot(train_x[train_y == 1, 0], train_x[train_y == 1, 1], 'o')
plt.plot(train_x[train_y == 0, 0], train_x[train_y == 0, 1], 'x')
plt.show()
```

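After running the script above, you should see a scatter plot with the class 1 samples drawn as circles (O) and the class 0 samples drawn as crosses (X).

It looks like a straight line could separate the X and O points very well. A problem like this is called linearly separable.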

### 2 Linear separable problem

We need to find a model for such a problem. The simplest choice is a linear function.

$$f_\theta(x) = \theta_0 + \theta_1 x$$

We use θ to represent the parameters. The θ subscript on the left-hand side means that the function f(x) is parameterized by θ. The θ terms on the right-hand side show that there are two parameters, θ0 and θ1.

We can write it as code:

```python
import numpy as np
import matplotlib.pyplot as plt

# read data
data = np.loadtxt("linear_data.csv", delimiter=',', skiprows=1)
train_x = data[:, 0:2]
train_y = data[:, 2]

# initialize parameter
theta = np.random.randn(2)

# linear function
def f(x):
    return theta[0] + theta[1] * x
```

### 3 The vector representation

We can also rewrite the linear function in a simpler form, using vectors.

$$f_\theta(x) = \theta^T x$$

Here θ and x are both column vectors.

We take the transpose of θ so that the expression becomes a matrix multiplication: a row vector times a column vector gives a single number.

We can write the code below:

```python
import numpy as np
import matplotlib.pyplot as plt

# read data
data = np.loadtxt("linear_data.csv", delimiter=',', skiprows=1)
train_x = data[:, 0:2]
train_y = data[:, 2]

# initialize parameter
theta = np.random.randn(2)

# dot product
def f(x):
    return np.dot(theta, x)
```

You might wonder why we don't write `np.dot(theta.T, x)`. Because the doc says **If both vectors are 1-D arrays, it is inner product of vectors (without complex conjugation)**. Transposing a 1-D array also leaves it unchanged, so `np.dot(theta, x)` does the same thing as `np.dot(theta.T, x)`.
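To see this concretely, here is a minimal sketch with made-up numbers (the values of `theta` and `x` are illustrative assumptions, not taken from the data). It checks that transposing a 1-D array changes nothing and that both calls return the same inner product.

```python
import numpy as np

# Hypothetical values chosen only for illustration.
theta = np.array([0.5, -1.5])
x = np.array([2.0, 4.0])

# Transposing a 1-D array is a no-op: the shape stays (2,).
print(theta.shape, theta.T.shape)

# Both calls compute the same inner product: 0.5 * 2.0 + (-1.5) * 4.0.
plain = np.dot(theta, x)
transposed = np.dot(theta.T, x)
print(plain, transposed)
```

The transpose only starts to matter once the arrays are 2-D, such as shapes `(2, 1)` or `(1, 2)`.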

### 4 Standardization

To make training converge faster, we standardize the data, which is also known as computing the **z-score**. We do it column-wise.

$$x_{\text{std}} = \frac{x - \mu}{\sigma}$$

- 𝜇 is the mean of each column
- 𝜎 is the standard deviation of each column

```python
import numpy as np
import matplotlib.pyplot as plt

# read data
data = np.loadtxt("linear_data.csv", delimiter=',', skiprows=1)
train_x = data[:, 0:2]
train_y = data[:, 2]

# initialize parameter
theta = np.random.randn(2)

# standardization
mu = train_x.mean(axis=0)
sigma = train_x.std(axis=0)

def standardizer(x):
    return (x - mu) / sigma

std_x = standardizer(train_x)

# dot product
def f(x):
    return np.dot(theta, x)
```
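A quick way to check that column-wise standardization does what we expect is to look at the mean and standard deviation of the result. The sketch below uses only the first five rows of `linear_data.csv`, inlined so it runs without the file; the variable names `sample_x` and `std_sample` are my own.

```python
import numpy as np

# First five rows of linear_data.csv (x1 and x2 columns only).
sample_x = np.array([
    [153.0, 432.0],
    [220.0, 262.0],
    [118.0, 214.0],
    [474.0, 384.0],
    [485.0, 411.0],
])

mu = sample_x.mean(axis=0)     # per-column mean, shape (2,)
sigma = sample_x.std(axis=0)   # per-column (population) standard deviation, shape (2,)

# Broadcasting subtracts mu and divides by sigma in every row.
std_sample = (sample_x - mu) / sigma

print(std_sample.mean(axis=0))  # each column is now centered near 0
print(std_sample.std(axis=0))   # each column now has a spread near 1
```

Note that `mu` and `sigma` are computed once from the training data and then reused, which is why the article keeps them as separate variables outside `standardizer`.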

### 5 Add bias

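We need to add a bias term to our function so that the model fits better: without it, the line would be forced to pass through the origin. So we increase the number of parameters from 2 to 3, and add a constant x0 = 1 so that the vector representation still lines up.

$$\theta = \begin{bmatrix} \theta_0 \\ \theta_1 \\ \theta_2 \end{bmatrix}, \quad x = \begin{bmatrix} x_0 \\ x_1 \\ x_2 \end{bmatrix}, \quad x_0 = 1$$

To make the calculation simpler, we stack all the samples into a matrix, one row per sample.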

```python
import numpy as np
import matplotlib.pyplot as plt

# read data
data = np.loadtxt("linear_data.csv", delimiter=',', skiprows=1)
train_x = data[:, 0:2]
train_y = data[:, 2]

# initialize parameter
theta = np.random.randn(3)

# standardization
mu = train_x.mean(axis=0)
sigma = train_x.std(axis=0)

def standardizer(x):
    return (x - mu) / sigma

std_x = standardizer(train_x)

# get matrix
def to_matrix(std_x):
    return np.array([[1, x1, x2] for x1, x2 in std_x])

mat_x = to_matrix(std_x)

# dot product
def f(x):
    return np.dot(x, theta)
```

The shape of `std_x` is `(20, 2)`. After `to_matrix(std_x)`, the shape of `mat_x` is `(20, 3)`. As for the dot product, notice that we swapped the positions of x and theta here. The shape of theta is `(3,)`, so the dot product gives `(20, 3) x (3,) -> (20,)`, which is a 1-D array containing the predictions for the 20 samples.
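The shapes are easy to verify on stand-in data. This sketch uses random numbers in place of the standardized samples (an assumption made so it runs without the CSV file) and builds the matrix with `np.column_stack`, which is one alternative to the list comprehension in `to_matrix`.

```python
import numpy as np

rng = np.random.default_rng(0)
std_x = rng.standard_normal((20, 2))   # stand-in for the standardized data
theta = rng.standard_normal(3)

# Prepend a column of ones for the bias term x0 = 1.
mat_x = np.column_stack([np.ones(len(std_x)), std_x])

# (20, 3) x (3,) -> (20,): one prediction per sample.
pred = np.dot(mat_x, theta)
print(std_x.shape, mat_x.shape, theta.shape, pred.shape)

# Each entry equals theta0 * 1 + theta1 * x1 + theta2 * x2 for that row.
row0 = theta[0] + theta[1] * std_x[0, 0] + theta[2] * std_x[0, 1]
print(pred[0], row0)
```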

### 6 Sigmoid function

Now that you have the idea of the linear function, we will build a more powerful prediction function on top of it: the sigmoid function.

$$f_\theta(x) = \frac{1}{1 + \exp(-z)}, \quad z = \theta^T x$$

We use z to represent the linear function and pass it to the sigmoid function. The sigmoid function gives a probability for each data sample. We have two classes in our data: one is `1` and the other is `0`.

The model's prediction is determined by the linear part: the output is at least 0.5 exactly when z ≥ 0, and below 0.5 otherwise.

We can write the code below:

```python
import numpy as np
import matplotlib.pyplot as plt

# read data
data = np.loadtxt("linear_data.csv", delimiter=',', skiprows=1)
train_x = data[:, 0:2]
train_y = data[:, 2]

# initialize parameter
theta = np.random.randn(3)

# standardization
mu = train_x.mean(axis=0)
sigma = train_x.std(axis=0)

def standardizer(x):
    return (x - mu) / sigma

std_x = standardizer(train_x)

# get matrix
def to_matrix(std_x):
    return np.array([[1, x1, x2] for x1, x2 in std_x])

mat_x = to_matrix(std_x)

# sigmoid function
def f(x):
    return 1 / (1 + np.exp(-np.dot(x, theta)))
```
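To get a feel for how the sigmoid turns the linear output into a class, here is a small sketch on hand-picked values of z. The helper name `sigmoid` and the 0.5 threshold are my assumptions for illustration; the article's own code keeps everything inside `f`.

```python
import numpy as np

def sigmoid(z):
    """Map any real number to the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical linear outputs theta^T x for five samples.
z = np.array([-4.0, -1.0, 0.0, 1.0, 4.0])

prob = sigmoid(z)                   # read as the probability of class 1
label = (prob >= 0.5).astype(int)   # predict class 1 when the probability reaches 0.5

print(prob)
print(label)
```

The labels flip exactly where z changes sign, which is why the line z = 0 acts as the boundary between the two classes.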

### 7 Likelihood function

You can jump to the end of step 7 if you are not interested in the explanation of the equation.

Alright, we have prepared our data and our model (sigmoid). What else do we need? Yes, a goal function. **A goal function guides us on how to update the parameters in the right way.** For logistic regression, we usually use the log-likelihood.
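The five-line derivation that the steps below walk through is not reproduced in this text. A reconstruction consistent with those steps, assuming that $f_\theta(x)$ models $P(y=1 \mid x)$ and that the samples are independent, would read as follows (the numbers on the right match the step labels):

$$
\begin{aligned}
\log L(\theta) &= \log \prod_{i=1}^{n} P(y=1 \mid x^{(i)})^{y^{(i)}} \, P(y=0 \mid x^{(i)})^{1-y^{(i)}} && (1) \\
&= \sum_{i=1}^{n} \left( \log P(y=1 \mid x^{(i)})^{y^{(i)}} + \log P(y=0 \mid x^{(i)})^{1-y^{(i)}} \right) && (2) \\
&= \sum_{i=1}^{n} \left( y^{(i)} \log P(y=1 \mid x^{(i)}) + (1-y^{(i)}) \log P(y=0 \mid x^{(i)}) \right) && (3) \\
&= \sum_{i=1}^{n} \left( y^{(i)} \log P(y=1 \mid x^{(i)}) + (1-y^{(i)}) \log \left(1 - P(y=1 \mid x^{(i)})\right) \right) && (4) \\
&= \sum_{i=1}^{n} \left( y^{(i)} \log f_\theta(x^{(i)}) + (1-y^{(i)}) \log \left(1 - f_\theta(x^{(i)})\right) \right) && (5)
\end{aligned}
$$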

Wait, wait…what the hell about these things!

**Don’t panic. Calm down.**

Let’s take it apart.

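- 1->2 (how to get from line 1 to line 2): `log(ab) = log a + log b`
- 2->3: `log(a^b) = b * log a`
- 3->4: Because we only have two classes, y=0 and y=1, we can use the equation `P(y=0|x) = 1 - P(y=1|x)`.
- 4->5: We use the substitution `P(y=1|x) = f_θ(x)` to make the equation more readable.

So we get the final part.

$$\log L(\theta) = \sum_{i=1}^{n} \left( y^{(i)} \log f_\theta(x^{(i)}) + (1 - y^{(i)}) \log\left(1 - f_\theta(x^{(i)})\right) \right)$$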
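The final log-likelihood, the sum over samples of y·log f + (1 − y)·log(1 − f), is short enough to evaluate directly. The sketch below is only an illustration: the function name `log_likelihood` and the two sets of predicted probabilities are made up, chosen so that one set agrees with the labels and the other does not.

```python
import numpy as np

def log_likelihood(prob, y):
    """Sum over samples of y * log(f) + (1 - y) * log(1 - f)."""
    return np.sum(y * np.log(prob) + (1 - y) * np.log(1 - prob))

y = np.array([1.0, 0.0, 1.0, 0.0])      # true labels

good = np.array([0.9, 0.1, 0.8, 0.2])   # confident and correct predictions
bad = np.array([0.4, 0.6, 0.3, 0.7])    # predictions leaning the wrong way

ll_good = log_likelihood(good, y)
ll_bad = log_likelihood(bad, y)
print(ll_good, ll_bad)
```

Both values are negative, because each term is the log of a probability, and the better predictions give the value closer to zero. That is the sense in which a larger log-likelihood means a better model.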

Don't forget why we started this. **A goal function guides us on how to update the parameters in the right way.**

We need to use this to calculate the loss and update the parameters. More specifically, we need the **derivative** of the log-likelihood function. Here I will give the final update equation directly.

$$\theta_j := \theta_j - \eta \sum_{i=1}^{n} \left( f_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$$

**In step 7, the most important equation is this one. If you cannot follow how to derive it, that is totally OK. All we need to do is write it as real code.**

But if you are interested, this video should be helpful.
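If you would rather trust the derivative without following the derivation, a numerical check is one option. The update rule θj := θj − η Σ (f − y)·xj moves θ along the derivative of the log-likelihood, which is Σ (y − f)·xj. The sketch below compares that analytic expression with a finite-difference estimate on toy data; the data, the names `sigmoid` and `log_likelihood`, and the step size `eps` are all assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_likelihood(theta, X, y):
    p = sigmoid(X @ theta)
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

rng = np.random.default_rng(0)
# Toy inputs: a bias column of ones plus two random features.
X = np.column_stack([np.ones(6), rng.standard_normal((6, 2))])
y = np.array([0.0, 1.0, 1.0, 0.0, 1.0, 0.0])
theta = rng.standard_normal(3)

# Analytic derivative of the log-likelihood: sum_i (y_i - f_i) * x_i.
analytic = (y - sigmoid(X @ theta)) @ X

# Central finite differences, one parameter at a time.
eps = 1e-6
numeric = np.zeros(3)
for j in range(3):
    step = np.zeros(3)
    step[j] = eps
    numeric[j] = (log_likelihood(theta + step, X, y)
                  - log_likelihood(theta - step, X, y)) / (2 * eps)

print(analytic)
print(numeric)
```

The two vectors should agree to several decimal places. Adding η times this derivative to θ is the same as subtracting η·Σ (f − y)·x, which is the form used in the update.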

### 8 Update parameter θ

Step 8 is a little longer, but it is very important. **Don't panic**. We will crack it.

- θj is the j-th parameter.
- η is the learning rate; we set it to 0.001 (1e-3).
- n is the number of data samples; in our case, we have 20.
- i is the index of the i-th data sample.

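Because we have three parameters, we can write it as three equations.

$$
\begin{aligned}
\theta_0 &:= \theta_0 - \eta \sum_{i=1}^{n} \left( f_\theta(x^{(i)}) - y^{(i)} \right) x_0^{(i)} \\
\theta_1 &:= \theta_1 - \eta \sum_{i=1}^{n} \left( f_\theta(x^{(i)}) - y^{(i)} \right) x_1^{(i)} \\
\theta_2 &:= \theta_2 - \eta \sum_{i=1}^{n} \left( f_\theta(x^{(i)}) - y^{(i)} \right) x_2^{(i)}
\end{aligned}
$$

The `:=` notation works like `=` (assignment) in code. You can find the explanation here.

The most difficult part is the Σ (summation symbol), so I expand the Σ for better understanding.

$$
\begin{aligned}
\theta_0 &:= \theta_0 - \eta \left[ \left( f_\theta(x^{(1)}) - y^{(1)} \right) x_0^{(1)} + \cdots + \left( f_\theta(x^{(20)}) - y^{(20)} \right) x_0^{(20)} \right] \\
\theta_1 &:= \theta_1 - \eta \left[ \left( f_\theta(x^{(1)}) - y^{(1)} \right) x_1^{(1)} + \cdots + \left( f_\theta(x^{(20)}) - y^{(20)} \right) x_1^{(20)} \right] \\
\theta_2 &:= \theta_2 - \eta \left[ \left( f_\theta(x^{(1)}) - y^{(1)} \right) x_2^{(1)} + \cdots + \left( f_\theta(x^{(20)}) - y^{(20)} \right) x_2^{(20)} \right]
\end{aligned}
$$

Look carefully.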

I colored three parts of the equation because **we can represent them as matrices**. Look at the red and blue parts in the first row, where we update θ0.

We write the red part and the blue part as column vectors.

Because we have 20 data samples, the shape of `f` is `(20, 1)`. The shape of `x0` is also `(20, 1)`. We can write the sum as a matrix multiplication with a transpose.

So the shapes are `(1, 20) x (20, 1) -> (1, 1)`. We get one scalar to update θ0.

`x1` and `x2` are also column vectors, and together with `x0` we can write them as a matrix **X**.

And θ is a row vector.

Back to the equation.

We can write it as a matrix product for each parameter.

Then we write it as one equation.

A NumPy array-like version might be easier to understand.
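The single equation this passage builds toward is not reproduced in this text. As a hedged reconstruction, take $f$ and $y$ as $(20, 1)$ column vectors holding the predictions and the labels, $X$ as the $(20, 3)$ matrix whose columns are $x_0$, $x_1$ and $x_2$, and $\theta$ as a $(1, 3)$ row vector. The three per-parameter updates then collapse into:

$$
\theta := \theta - \eta \, (f - y)^T X,
\qquad
f = \begin{bmatrix} f_\theta(x^{(1)}) \\ \vdots \\ f_\theta(x^{(20)}) \end{bmatrix},
\quad
y = \begin{bmatrix} y^{(1)} \\ \vdots \\ y^{(20)} \end{bmatrix},
\quad
X = \begin{bmatrix} x_0^{(1)} & x_1^{(1)} & x_2^{(1)} \\ \vdots & \vdots & \vdots \\ x_0^{(20)} & x_1^{(20)} & x_2^{(20)} \end{bmatrix}
$$

In the shape check, the `f^T` entry plays the role of $(f - y)^T$ here: subtracting the labels does not change the shape.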

Let's do a little calculation to make sure the shapes are right.

- θ: (1, 3)
- f^T: (1, 20)
- x: (20, 3)
- dot product: (1, 20) x (20, 3) -> (1, 3)
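One way to convince yourself that the matrix form really is the expanded sum is to compute both and compare. The sketch below uses random stand-ins for the predictions, labels and inputs (assumptions, so that it runs without the CSV file); `grad_loop` and `grad_vec` are names I made up for the two versions.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20
# Columns x0 (all ones), x1, x2.
X = np.column_stack([np.ones(n), rng.standard_normal((n, 2))])
y = rng.integers(0, 2, size=n).astype(float)               # stand-in labels
f = 1.0 / (1.0 + np.exp(-X @ rng.standard_normal(3)))      # stand-in predictions, shape (20,)

# Expanded sum: one parameter j at a time, one sample i at a time.
grad_loop = np.zeros(3)
for j in range(3):
    for i in range(n):
        grad_loop[j] += (f[i] - y[i]) * X[i, j]

# The same quantity as a single product: (20,) x (20, 3) -> (3,).
grad_vec = np.dot(f - y, X)

print(grad_loop)
print(grad_vec)
```

Multiplying either result by the learning rate and subtracting it from θ gives the update for all three parameters at once.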

Everything seems right. Let's write the code. Actually, it is just two lines.

```python
import numpy as np
import matplotlib.pyplot as plt

# read data
data = np.loadtxt("linear_data.csv", delimiter=',', skiprows=1)
train_x = data[:, 0:2]
train_y = data[:, 2]

# initialize parameter
theta = np.random.randn(3)

# standardization
mu = train_x.mean(axis=0)
sigma = train_x.std(axis=0)

def standardizer(x):
    return (x - mu) / sigma

std_x = standardizer(train_x)

# get matrix
def to_matrix(std_x):
    return np.array([[1, x1, x2] for x1, x2 in std_x])

mat_x = to_matrix(std_x)

# sigmoid function applied to the dot product of x and theta
def f(x):
    return 1 / (1 + np.exp(-np.dot(x, theta)))

# update times
epoch = 2000

# learning rate
ETA = 1e-3

# update parameter
for _ in range(epoch):
    # f(mat_x) - train_y: (20,)
    # mat_x: (20, 3)
    # theta: (3,)
    # dot product: (20,) x (20, 3) -> (3,)
    theta = theta - ETA * np.dot(f(mat_x) - train_y, mat_x)
```
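To check that these updates actually move the parameters in the right direction, we can track the log-likelihood before and after training. The sketch below is a variation on the article's code, not a replacement for it: the data is inlined so it runs without the CSV file, `f` takes `theta` as an explicit argument, the random seed is fixed for repeatability, and `log_likelihood` and the 0.5 threshold are my own additions.

```python
import numpy as np

# The 20 rows of linear_data.csv, inlined so the sketch runs without the file.
data = np.array([
    [153, 432, 0], [220, 262, 0], [118, 214, 0], [474, 384, 1], [485, 411, 1],
    [233, 430, 0], [396, 321, 1], [484, 349, 1], [429, 259, 1], [286, 220, 1],
    [399, 433, 0], [403, 300, 1], [252, 34, 1], [497, 372, 1], [379, 416, 0],
    [76, 163, 0], [263, 112, 1], [26, 193, 0], [61, 473, 0], [420, 253, 1],
], dtype=float)
train_x, train_y = data[:, 0:2], data[:, 2]

# standardize column-wise, then prepend the bias column x0 = 1
std_x = (train_x - train_x.mean(axis=0)) / train_x.std(axis=0)
mat_x = np.column_stack([np.ones(len(std_x)), std_x])

def f(x, theta):
    return 1 / (1 + np.exp(-np.dot(x, theta)))

def log_likelihood(theta):
    p = f(mat_x, theta)
    return np.sum(train_y * np.log(p) + (1 - train_y) * np.log(1 - p))

theta = np.random.default_rng(0).standard_normal(3)  # fixed seed (assumption)
ETA, epoch = 1e-3, 2000

ll_start = log_likelihood(theta)
for _ in range(epoch):
    # (20,) x (20, 3) -> (3,)
    theta = theta - ETA * np.dot(f(mat_x, theta) - train_y, mat_x)
ll_end = log_likelihood(theta)

# fraction of training samples on the correct side of the 0.5 threshold
accuracy = np.mean((f(mat_x, theta) >= 0.5) == (train_y == 1))
print(ll_start, ll_end, accuracy)
```

The log-likelihood after training should be higher (closer to zero) than at the start, which is what the goal function asks for.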

Something strange? Remember what we wrote before the code?

`dot product: (1, 20) x (20, 3) -> (1, 3)`

The shape changes make sense there.

But why, when we write the code, do we use `(20,) x (20, 3) -> (3,)`?

Actually, this is not real math notation; it is NumPy notation. If you are using TensorFlow or PyTorch, you should be familiar with it.

`(20,)` means a 1-D array with 20 numbers. It can be read as either a row vector or a column vector because it only has one dimension. If we make it a 2-D array, like `(20, 1)` or `(1, 20)`, we can easily tell that `(20, 1)` is a column vector and `(1, 20)` is a row vector.

**But why not explicitly set the dimension to eliminate ambiguity?**

Well, believe me, I had the same question when I first saw this. But after some coding practice, I think I know the reason.

**Because it saves us time!**

Take `(20,) x (20, 3) -> (3,)` as an example. If we wanted the explicit `(1, 20) x (20, 3) -> (1, 3)` instead, what would we need to do?

- Convert (20,) to (1, 20)
- Calculate (1, 20) x (20, 3) -> (1, 3)
- Because (1, 3) is a 2-D row vector, we need to convert it back to a 1-D array: (1, 3) -> (3,)

Honestly, it is frustrating. Why can't we complete all of this in just one step?

Yes, that is why we can write `(20,) x (20, 3) -> (3,)`.

OK, let's take a look at what the `numpy.dot()` doc says.

numpy.dot(): If `a` is an N-D array and `b` is a 1-D array, it is a sum product over the last axis of `a` and `b`.

Hmm, actually I cannot get the point. But the `np.matmul()` doc describes a similar calculation, in which the 1-D array is reshaped to (20, 1) or (1, 20) to perform a standard 2-D matrix product. Maybe we can get some inspiration from it.

np.matmul(): If the first argument is 1-D, it is promoted to a matrix by prepending a 1 to its dimensions. After matrix multiplication the prepended 1 is removed.

Ha, this is the missing part! So in our case, `(20,)` becomes `(1, 20)`, which lines up with the first dimension of `(20, 3)`, which is 20. Then `(1, 20) x (20, 3) -> (1, 3)`. Finally the prepended 1 is removed, so we get `(3,)`. One step for all.
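The three-step route and the one-step route can be compared side by side. In this sketch, `residual` stands in for `f(mat_x) - train_y` and `mat_x` is filled with random numbers; both are assumptions so that the example runs on its own.

```python
import numpy as np

rng = np.random.default_rng(0)
residual = rng.standard_normal(20)      # stands in for f(mat_x) - train_y, shape (20,)
mat_x = rng.standard_normal((20, 3))    # stands in for the input matrix, shape (20, 3)

# Explicit route: (20,) -> (1, 20), multiply, then (1, 3) -> (3,).
row = residual.reshape(1, 20)
product_2d = np.dot(row, mat_x)         # shape (1, 3)
explicit = product_2d.reshape(3)        # shape (3,)

# One-step route on the 1-D array.
one_step = np.dot(residual, mat_x)      # shape (3,)
with_matmul = np.matmul(residual, mat_x)

print(row.shape, product_2d.shape, explicit.shape)
print(one_step.shape, with_matmul.shape)
```

All three results hold the same numbers; the 1-D version just skips the two reshapes.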

### 9 Plot the line

After updating the parameters 2000 times, we should plot the result to see how our model performs.

We will generate some data points as x1 and calculate x2 from the parameters we learned. The line we draw is the decision boundary θ0 + θ1·x1 + θ2·x2 = 0, so x2 = -(θ0 + θ1·x1) / θ2.

```python
# plot line
x1 = np.linspace(-2, 2, 100)
x2 = -(theta[0] + x1 * theta[1]) / theta[2]

plt.plot(std_x[train_y == 1, 0], std_x[train_y == 1, 1], 'o')  # train data of class 1
plt.plot(std_x[train_y == 0, 0], std_x[train_y == 0, 1], 'x')  # train data of class 0
plt.plot(x1, x2, linestyle='dashed')  # plot the line we learned
plt.show()
```
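Why is this the right line to draw? Every point on it makes the linear part θ0 + θ1·x1 + θ2·x2 equal to zero, so the sigmoid gives exactly 0.5 there. The sketch below checks this with made-up parameter values (an assumption; your learned `theta` will differ, and `theta[2]` must be non-zero for the division to work).

```python
import numpy as np

theta = np.array([0.3, 2.0, -1.5])   # hypothetical learned parameters

x1 = np.linspace(-2, 2, 5)
x2 = -(theta[0] + x1 * theta[1]) / theta[2]

# Points on the line satisfy theta0 + theta1 * x1 + theta2 * x2 = 0 ...
z = theta[0] + theta[1] * x1 + theta[2] * x2

# ... so the sigmoid output there is 0.5, the border between the two classes.
prob = 1 / (1 + np.exp(-z))

print(z)
print(prob)
```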

### 10 Summary

Congratulations! I am glad you made it. I hope my article is helpful for you. You can find the whole code below. Leave a comment to let me know whether my article is easy to understand. Stay tuned for my next article about the non-linearly separable problem.