Source: Deep Learning on Medium

## The detailed explanation behind the math equation to build the practical math foundations for your machine learning or deep learning journey

Hi, everyone! This is “Equation-to-Code” walk-through part 2. In part 1, we talked about how to use linear regression to solve the **linear separable problem**. We learned vector representation, standardization, adding bias, sigmoid function, log-likelihood function, and updating parameters.

This time we are gonna tackle a **non-linear separable problem**. If you haven’t seen part 1, it is totally alright. Part 2 is self-contained. But if you want to get a better understanding of part 2, it would be better to read part 1 first.

**An “Equation-to-Code” Machine Learning Project Walk-Through in Python — Part 1 Linear Separable Problem** (towardsdatascience.com)

The content is structured as follows. `*` means you can skip the step if you have already read part 1.

- Look at the data
- Non-linear separable problem
- Standardization*
- Add bias and polynomial term
- Sigmoid function*
- Likelihood function*
- Update parameter θ*
- Plot the line
- Accuracy
- Summary

### 1 Look at the data

Here is the data, non_linear_data.csv

```
x1,x2,y
0.54508775,2.34541183,0
0.32769134,13.43066561,0
4.42748117,14.74150395,0
2.98189041,-1.81818172,1
4.02286274,8.90695686,1
2.26722613,-6.61287392,1
-2.66447221,5.05453871,1
-1.03482441,-1.95643469,1
4.06331548,1.70892541,1
2.89053966,6.07174283,0
2.26929206,10.59789814,0
4.68096051,13.01153161,1
1.27884366,-9.83826738,1
-0.1485496,12.99605136,0
-0.65113893,10.59417745,0
3.69145079,3.25209182,1
-0.63429623,11.6135625,0
0.17589959,5.84139826,0
0.98204409,-9.41271559,1
-0.11094911,6.27900499,0
```

First, we need to plot the data to see what it looks like. We create a Python file and name it non_logistic_regression.py.

```python
import numpy as np
import matplotlib.pyplot as plt

# read data
data = np.loadtxt("non_linear_data.csv", delimiter=',', skiprows=1)
train_x = data[:, 0:2]
train_y = data[:, 2]

# plot data points
plt.plot(train_x[train_y == 1, 0], train_x[train_y == 1, 1], 'o')
plt.plot(train_x[train_y == 0, 0], train_x[train_y == 0, 1], 'x')
plt.show()
```

After running the script above, you should see the figure below.

It seems we cannot separate the X and O points with one straight line. We call such a problem a non-linear separable problem, where the data is not linearly separable.

### 2 Non-linear separable problem

In part 1, we used a linear function to solve the linear separable problem.

But for the non-linear separable problem, a linear function is too simple. So we introduce polynomial logistic regression, which adds a polynomial term to logistic regression.

We use θ to represent the parameters. The θ subscript on the left side means the function f(x) is parameterized by θ. The θs on the right side are the individual parameters. The last term is the polynomial term, which lets the model generalize to non-linearly separable data.

Notice that we have two features, x1 and x2, in non_linear_data.csv. We pick the square of x1 as the polynomial term, so the function takes the form below.
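Concretely, with a bias, the two features, and x1² as the polynomial term, the linear part of the model can be written as (reconstructed in the standard form that matches the four-parameter code below):

```latex
f_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_1^2
```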

We initialize the 4 parameters:

```python
import numpy as np
import matplotlib.pyplot as plt

# read data
data = np.loadtxt("non_linear_data.csv", delimiter=',', skiprows=1)
train_x = data[:, 0:2]
train_y = data[:, 2]

# initialize parameter
theta = np.random.randn(4)
```

### 3 Standardization

In order to make training converge quickly, we use standardization, also called the **z-score**. We do it column-wise.

- 𝜇 is the mean of each column
- 𝜎 is the standard deviation of each column
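As a quick sketch of what the standardizer does (using made-up toy numbers instead of the real data), every column ends up with mean 0 and standard deviation 1:

```python
import numpy as np

# Toy two-column data standing in for train_x (made-up values).
x = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0]])

mu = x.mean(axis=0)     # column-wise mean
sigma = x.std(axis=0)   # column-wise standard deviation

std_x = (x - mu) / sigma

# After standardization every column has mean 0 and standard deviation 1.
print(np.allclose(std_x.mean(axis=0), 0))  # True
print(np.allclose(std_x.std(axis=0), 1))   # True
```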

```python
import numpy as np
import matplotlib.pyplot as plt

# read data
data = np.loadtxt("non_linear_data.csv", delimiter=',', skiprows=1)
train_x = data[:, 0:2]
train_y = data[:, 2]

# initialize parameter
theta = np.random.randn(4)

# standardization
mu = train_x.mean(axis=0)
sigma = train_x.std(axis=0)

def standardizer(x):
    return (x - mu) / sigma

std_x = standardizer(train_x)
```

### 4 Add bias and polynomial term

We need to add a bias and the polynomial term to construct the data matrix. We add a constant x0 = 1 to align the vector representation.

You can find more detail on the vector representation in part 1: 3 The vector representation.

To make the calculation simpler, we convert x to a matrix.

```python
import numpy as np
import matplotlib.pyplot as plt

# read data
data = np.loadtxt("non_linear_data.csv", delimiter=',', skiprows=1)
train_x = data[:, 0:2]
train_y = data[:, 2]

# initialize parameter
theta = np.random.randn(4)

# standardization
mu = train_x.mean(axis=0)
sigma = train_x.std(axis=0)

def standardizer(x):
    return (x - mu) / sigma

std_x = standardizer(train_x)

# add x0 and x1^2 to get matrix
def to_matrix(x):
    x0 = np.ones([x.shape[0], 1])
    x3 = x[:, 0, np.newaxis] ** 2
    return np.hstack([x0, x, x3])

mat_x = to_matrix(std_x)

# dot product
def f(x):
    return np.dot(x, theta)
```

We use x3 to represent `x1*x1`.

The dimension of `std_x` is `(20, 2)`. After `to_matrix(std_x)`, the dimension of `mat_x` is `(20, 4)`. As for the dot product part, the dimension of `theta` is `(4,)`, so the result of the dot product is `(20, 4) x (4,) -> (20,)`, a 1-D array containing predictions for the 20 samples.
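A minimal shape check, using random stand-in data in place of the CSV, confirms these dimensions:

```python
import numpy as np

# Random stand-in for the 20 standardized samples with 2 features.
rng = np.random.default_rng(0)
std_x = rng.standard_normal((20, 2))
theta = rng.standard_normal(4)

def to_matrix(x):
    x0 = np.ones([x.shape[0], 1])   # bias column, x0 = 1
    x3 = x[:, 0, np.newaxis] ** 2   # polynomial term, x1^2
    return np.hstack([x0, x, x3])

mat_x = to_matrix(std_x)
pred = np.dot(mat_x, theta)

print(mat_x.shape)  # (20, 4)
print(pred.shape)   # (20,)
```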

### 5 Sigmoid function

Below is the vector representation.

Then we build a more powerful prediction function based on it: the sigmoid function.

We use z to represent the linear function and pass it to the sigmoid function. The sigmoid function gives a probability for each data sample. We have two classes in our data: one is `1` and the other is `0`.

We can see the model predicts a sample based on the linear function part.

We can write the code below

```python
import numpy as np
import matplotlib.pyplot as plt

# read data
data = np.loadtxt("non_linear_data.csv", delimiter=',', skiprows=1)
train_x = data[:, 0:2]
train_y = data[:, 2]

# initialize parameter
theta = np.random.randn(4)

# standardization
mu = train_x.mean(axis=0)
sigma = train_x.std(axis=0)

def standardizer(x):
    return (x - mu) / sigma

std_x = standardizer(train_x)

# add x0 and x1^2 to get matrix
def to_matrix(x):
    x0 = np.ones([x.shape[0], 1])
    x3 = x[:, 0, np.newaxis] ** 2
    return np.hstack([x0, x, x3])

mat_x = to_matrix(std_x)

# change the dot product to the sigmoid function
def f(x):
    return 1 / (1 + np.exp(-np.dot(x, theta)))
```
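A small sketch (with toy z values, not the real data) shows the sigmoid squashing and why a 0.5 threshold on the probability corresponds exactly to z ≥ 0 in the linear part:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Toy z values: negative, zero, positive.
z = np.array([-2.0, 0.0, 2.0])
p = sigmoid(z)

# Probabilities stay strictly between 0 and 1,
# and cross 0.5 exactly where z crosses 0.
print(p >= 0.5)  # [False  True  True]
```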

### 6 Likelihood function

You can skip this step if you are not interested in the equation explanation or if you already read it in part 1.

Alright, we have prepared our data and model (sigmoid). What else do we need? Yes, a goal function. **A goal function can guide us on how to update the parameters in the right way.** For the sigmoid (logistic regression), we usually use the log-likelihood as the goal function.

Wait, wait… what the hell are these things!

**Don’t panic. Calm down.**

Let’s take it apart.

- 1->2 (how to get from line 1 to line 2):
`log(ab) = log a + log b`
- 2->3:
`log(a^b) = b * log a`
- 3->4: Because we only have two classes, y=0 and y=1, we can use the equation below:
- 4->5: we use the transformation below to make the equation more readable

So we get the final part.
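In its final form, the goal function is the standard log-likelihood for logistic regression (reconstructed here with the same f, y, and sample index i used throughout):

```latex
\log L(\theta) = \sum_{i=1}^{n} \Big( y^{(i)} \log f_\theta\big(x^{(i)}\big) + \big(1 - y^{(i)}\big) \log\big(1 - f_\theta\big(x^{(i)}\big)\big) \Big)
```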

Don’t forget why we started this. **A goal function can guide us on how to update the parameters in the right way.**

We need to use it to calculate the loss and update the parameters. More specifically, we need to calculate the **derivative** of the log-likelihood function. Here I will directly give the final update equation. (If you are interested in how to derive it, this video should be helpful.)
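Written out, the final update equation is the standard gradient-descent step for logistic regression, which is what the training code in step 7 implements:

```latex
\theta_j := \theta_j - \eta \sum_{i=1}^{n} \Big( f_\theta\big(x^{(i)}\big) - y^{(i)} \Big)\, x_j^{(i)}
```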

**In step 6, the most important equation is this one. If you cannot understand how to derive it, that is totally OK. All we need to do is write it as real code.**

### 7 Update parameter θ

You can skip this step if you already read it in part 1

This step is very important. **Don’t panic**. We will crack it.

θj is the j-th parameter.

- η is the learning rate; we set it to 0.001 (1e-3).
- n is the number of data samples; in our case, we have 20.
- i is the i-th data sample

Because we have four parameters, we can write the update as four equations. We use x3 to represent `x1*x1`.

The `:=` notation is just like `=`. You can find the explanation here.

The most difficult part is the Σ (summation symbol), so I expand the Σ for better understanding.

Look carefully.

I colored the three parts in the equation because **we can represent them as matrices**. Look at the red and blue parts in the first row, where we update theta 0.

We write the red part and the blue part as column vectors.

Because we have 20 data samples, the dimension of `f` is `(20, 1)`. The dimension of `x0` is `(20, 1)`. We can write the matrix multiplication with a transpose.

So the dimension is `(1, 20) x (20, 1) -> (1, 1)`. We get one scalar to update theta 0.

`x1` and `x2` are also column vectors, and we can write them as a matrix **X**.

And theta is a row vector.

Back to the equation.

We can write it as

Write it as one equation.

A NumPy array-like version might be easier to understand.

Let’s do a little calculation to make sure the dimensions are right.

θ: (1, 4)
f^T: (1, 20)
x: (20, 4)
dot product: (1, 20) x (20, 4) -> (1, 4)

Everything seems right. Let’s write the code. Actually, it is just two lines.

```python
import numpy as np
import matplotlib.pyplot as plt

# read data
data = np.loadtxt("non_linear_data.csv", delimiter=',', skiprows=1)
train_x = data[:, 0:2]
train_y = data[:, 2]

# initialize parameter
theta = np.random.randn(4)

# standardization
mu = train_x.mean(axis=0)
sigma = train_x.std(axis=0)

def standardizer(x):
    return (x - mu) / sigma

std_x = standardizer(train_x)

# add x0 and x1^2 to get matrix
def to_matrix(x):
    x0 = np.ones([x.shape[0], 1])
    x3 = x[:, 0, np.newaxis] ** 2
    return np.hstack([x0, x, x3])

mat_x = to_matrix(std_x)

# sigmoid function
def f(x):
    return 1 / (1 + np.exp(-np.dot(x, theta)))

# update times
epoch = 2000

# learning rate
ETA = 1e-3

# update parameter
for _ in range(epoch):
    """
    f(mat_x) - train_y: (20,)
    mat_x: (20, 4)
    theta: (4,)
    dot product: (20,) x (20, 4) -> (4,)
    """
    theta = theta - ETA * np.dot(f(mat_x) - train_y, mat_x)
```
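To convince ourselves that the one-line vectorized update matches the expanded per-parameter equations, here is a sketch with random stand-in data of the same shapes:

```python
import numpy as np

# Random stand-in data with the same shapes as the real problem.
rng = np.random.default_rng(0)
mat_x = rng.standard_normal((20, 4))
train_y = rng.integers(0, 2, 20)
theta = rng.standard_normal(4)
ETA = 1e-3

def f(x, theta):
    return 1 / (1 + np.exp(-np.dot(x, theta)))

# Vectorized update -- the single line in the training loop.
vec = theta - ETA * np.dot(f(mat_x, theta) - train_y, mat_x)

# Expanded update -- one sum per parameter, as in the four equations.
err = f(mat_x, theta) - train_y
loop = np.array([theta[j] - ETA * np.sum(err * mat_x[:, j]) for j in range(4)])

print(np.allclose(vec, loop))  # True
```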

Something strange? Remember what we wrote before the code?

dot product: (1, 20) x (20, 4) -> (1, 4)

The dimension changes make sense there. But why, when we write the code, do we use `(20,) x (20, 4) -> (4,)`?

Actually, this is not real math notation; this is NumPy notation. And if you are using TensorFlow or PyTorch, you should be familiar with it.

`(20,)` means this is a 1-D array with 20 numbers. It can be a row vector or a column vector, because it has only 1 dimension. If we set this as a 2-D array, like `(20, 1)` or `(1, 20)`, we can easily determine that `(20, 1)` is a column vector and `(1, 20)` is a row vector.

**But why not explicitly set the dimension to eliminate ambiguity?**

Well, believe me, I had the same question when I first saw this. But after some coding practice, I think I know the reason.

**Because it can save us time!**

Take `(20,) x (20, 4) -> (4,)` as an example. If we want to get `(1, 20) x (20, 4) -> (1, 4)`, what do we need to do with `(20,) x (20, 4) -> (4,)`?

- Convert (20,) to (1, 20)
- Calculate (1, 20) x (20, 4) -> (1, 4)
- Because (1, 4) is a 2-D array, we need to convert it back to a 1-D array: (1, 4) -> (4,)

Honestly, it is frustrating. Why can’t we complete all of this in just one step?

Yes, that’s why we can write `(20,) x (20, 4) -> (4,)`.

Ok, let’s take a look at what the numpy.dot() doc says.

> numpy.dot(): If a is an N-D array and b is a 1-D array, it is a sum product over the last axis of a and b.

Hmm, actually I cannot quite get the point from that alone. But np.matmul() describes a similar calculation. Maybe we can get some inspiration from it.

> np.matmul(): If the first argument is 1-D, it is promoted to a matrix by prepending a 1 to its dimensions. After matrix multiplication the prepended 1 is removed.

Ha, this is the missing part! So in our case, `(20,)` is promoted to `(1, 20)` to line up with the first dimension of `(20, 4)`, and `(1, 20) x (20, 4) -> (1, 4)`. Then the prepended 1 is removed, so we get `(4,)`. One step for all.
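A quick check of this one-step behavior, using arbitrary arrays of the right shapes:

```python
import numpy as np

a = np.arange(20.0)                 # shape (20,)
b = np.arange(80.0).reshape(20, 4)  # shape (20, 4)

one_step = np.dot(a, b)             # (20,) x (20, 4) -> (4,)

# The same result, done explicitly in three steps.
three_steps = np.dot(a.reshape(1, 20), b).reshape(4)

print(one_step.shape)                      # (4,)
print(np.allclose(one_step, three_steps))  # True
```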

### 8 Plot the line

After updating the parameters 2000 times, we plot the result to see how our model performs.

We make some data points as x1 and calculate x2 based on the parameters we learned.
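The dashed line is the decision boundary: setting the linear part of the model to zero and solving for x2 gives

```latex
\theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_1^2 = 0
\quad\Longrightarrow\quad
x_2 = -\frac{\theta_0 + \theta_1 x_1 + \theta_3 x_1^2}{\theta_2}
```

which is exactly what the `x2 = ...` line in the code computes.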

```python
# plot line
x1 = np.linspace(-2, 2, 100)
x2 = -(theta[0] + x1 * theta[1] + theta[3] * x1 ** 2) / theta[2]

plt.plot(std_x[train_y == 1, 0], std_x[train_y == 1, 1], 'o')  # train data of class 1
plt.plot(std_x[train_y == 0, 0], std_x[train_y == 0, 1], 'x')  # train data of class 0
plt.plot(x1, x2, linestyle='dashed')  # plot the line we learned
plt.show()
```

### 9 Accuracy

In part 2, we use accuracy to evaluate how our model performs.

```python
import numpy as np
import matplotlib.pyplot as plt

# read data
data = np.loadtxt("non_linear_data.csv", delimiter=',', skiprows=1)
train_x = data[:, 0:2]
train_y = data[:, 2]

# initialize parameter
theta = np.random.randn(4)

# standardization
mu = train_x.mean(axis=0)
sigma = train_x.std(axis=0)

def standardizer(x):
    return (x - mu) / sigma

std_x = standardizer(train_x)

# add x0 and x1^2 to get matrix
def to_matrix(x):
    x0 = np.ones([x.shape[0], 1])
    x3 = x[:, 0, np.newaxis] ** 2
    return np.hstack([x0, x, x3])

mat_x = to_matrix(std_x)

# sigmoid function
def f(x):
    return 1 / (1 + np.exp(-np.dot(x, theta)))

# classify a sample as 0 or 1
def classify(x):
    return (f(x) >= 0.5).astype(int)

# update times
epoch = 2000

# learning rate
ETA = 1e-3

# accuracy log
accuracies = []

# update parameter
for _ in range(epoch):
    theta = theta - ETA * np.dot(f(mat_x) - train_y, mat_x)
    result = classify(mat_x) == train_y
    accuracy = sum(result) / len(result)
    accuracies.append(accuracy)

# plot accuracy line
x = np.arange(len(accuracies))
plt.plot(x, accuracies)
plt.show()
```

- `classify(x)`: if the probability is bigger than or equal to 0.5, we classify the sample as 1.
- `result`: contains the element-wise comparison between predictions and labels, e.g. [True, False, …].
- `accuracy = sum(result) / len(result)`: calculates the proportion of correct predictions in the current epoch.
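With toy predictions and labels (made-up values, not the real training output), the accuracy computation looks like this:

```python
import numpy as np

# Toy predictions and labels (made-up values).
predicted = np.array([1, 0, 1, 1, 0])
labels = np.array([1, 0, 0, 1, 0])

result = predicted == labels          # [True, True, False, True, True]
accuracy = np.sum(result) / len(result)

print(accuracy)  # 0.8
```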

Finally, we plot the accuracy line.

We can see the line becomes steady after about 1000 epochs.

### 10 Summary

If you have already read part 1, you should find part 2 pretty easy to understand. You can find the whole code below. Leave a comment to let me know whether my article is easy to understand. Stay tuned for my next article about regularization.