# Let’s Calculate Manually: Deep Dive Into Logistic Regression

Original article was published on Artificial Intelligence on Medium

# Motivation

The Logistic Regression algorithm has been well explained by many machine learning experts across platforms such as blogs, YouTube videos, and online courses. However, these explanations rarely include worked calculation examples. Imagine that one day scikit-learn (a free machine learning library for Python) were taken away from your computing environment: could you still carry out the modelling process by manual calculation, at least on a simple dataset, and obtain a result similar to what the scikit-learn API would give? If not, this story is definitely for you.

# Brief Introduction to Logistic Regression

## What is Logistic Regression?

According to Ousley and Hefner (2005) and DiGangi and Hefner (2013), Logistic Regression is a statistical approach similar to Linear Regression. Logistic Regression looks for the best equation to produce an output for a binary variable (Y) from one or more inputs (X). Linear Regression can handle continuous inputs only, whereas Logistic Regression can handle both continuous and categorical inputs. The mathematical concepts of the log odds ratio and iterative maximum likelihood are used to find the best fit for group-membership predictions. Logistic Regression assumes that all the inputs are independent of each other; in practice, however, this assumption rarely holds [1, 2].

## What is Multi-Class Logistic Regression?

Multi-Class Logistic Regression, also known as Softmax Regression, handles the modelling process on training datasets that contain more than 2 class labels.

# Challenge

Given a set of 9 training examples with 2-dimensional inputs and their corresponding class labels as follows.

Can you:

1. Convert the class labels into One-hot Representation?
2. Fit a Multi-Class Logistic Regression model to the training data using the Gradient Descent algorithm? The learning rate is set to 0.05, the number of training epochs is set to 1, and the initial model parameters are set as follows.

# Solution

## Mathematical Alphanumeric Symbols

To simplify the display of mathematical equations, mathematicians use letters to represent variables. The symbols used in this story include:

• N = total number of training data (In this story, N=9)
• K = total number of class labels (In this story, K=3)
• D = dimensions of training data (In this story, D=3)
Special Note: D0 for all the inputs (X) normally will be assigned as the default value, which is 1.
• α = learning rate
• W = model weight
• W(new) = model weight after update
• W(old) = model weight before update
• X = inputs
• Y = actual labels
• ^y = predicted labels
• J = Total loss
• L = loss for a particular training example

## One-hot Encoding

First and foremost, the class labels, which are integers, are encoded into One-hot Representation.
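As a sketch of this step, NumPy can build the One-hot Representation directly by indexing an identity matrix. The label values below are illustrative stand-ins; the actual labels come from the article's training data table.

```python
import numpy as np

# Hypothetical integer class labels for the 9 training examples;
# the real values are shown in the challenge's data table.
y = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])

K = 3                 # total number of class labels
Y = np.eye(K)[y]      # one-hot representation, shape (9, 3)
```

Row n of `Y` contains a single 1 in the column of example n's class and 0 everywhere else.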

## Inputs, Actual Labels and Weights Conversion Into Matrix Representation

Next, the inputs (X), actual labels (Y) and initial weight parameters (W) are converted into matrices. For this story, the sizes of the inputs, actual labels and weights should be 9×3, 9×3 and 3×3 respectively.
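The matrix setup can be sketched as follows. The input values here are random placeholders, since the real numbers come from the challenge's table; only the shapes and the bias column X0 = 1 match the article.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, K = 9, 3, 3   # examples, input dimensions (incl. bias), classes

# X: bias column of ones (X0 = 1) plus 2-dimensional inputs -> (9, 3)
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, 2))])

# Y: one-hot labels for illustrative classes [0,0,0,1,1,1,2,2,2] -> (9, 3)
Y = np.eye(K)[np.repeat(np.arange(K), 3)]

# W: initial weights (zeros as a placeholder) -> (3, 3)
W = np.zeros((D, K))
```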

## Feed-Forward Propagation

Referring to Equation 1 in the following image, the prediction matrix for the entire training dataset will be of size N×1, where N is the total number of training examples. Referring to Equation 2, the prediction output for each training example is the class label with the highest probability. The probability can be found using Equation 3.

To compute the function (f), the inner product between X and W for each k should be obtained first. Then, the exponential of each inner product is calculated. For this story, the matrix of inner products between X and W has size 9×3.

The denominator of the function (f) is obtained by summing the exponentials of the inner products between X and W over all k.

As we all know, division between matrices cannot be performed directly. However, there is a simple trick that works in this story: perform the arithmetic division element-wise, using coordinates as a reference. For instance, if matrix A contains [a1, a2, ……, a9] and matrix B contains [b1, b2, ……, b9], then matrix C is obtained by computing c1 = a1 divided by b1, c2 = a2 divided by b2, and so on. The size of the matrix after this trick remains 9×1.
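This element-wise (Hadamard) division is exactly what NumPy's `/` operator does on arrays of matching shape, so the trick needs no loop; a tiny sketch with made-up numbers:

```python
import numpy as np

A = np.array([2.0, 6.0, 8.0])   # numerators  a1..a3
B = np.array([2.0, 3.0, 4.0])   # denominators b1..b3
C = A / B                       # element-wise: [a1/b1, a2/b2, a3/b3]
```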

The resulting matrices for the different k are combined, and the output of the function (f), which is a matrix, is obtained. Because the Softmax function is applied, the sum of the predicted probabilities over all class labels for each training example always equals 1. The prediction output for each training example is the class label with the highest probability.
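The whole feed-forward pass above can be sketched in a few NumPy lines. The data and weights below are illustrative placeholders with the article's shapes (9×3 inputs, 3×3 weights); the operations are the ones just described: inner products, exponentials, element-wise division, then arg-max.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, K = 9, 3, 3
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, 2))])  # placeholder inputs
W = rng.normal(size=(D, K)) * 0.1                          # placeholder weights

Z = X @ W                               # inner products, shape (9, 3)
E = np.exp(Z)                           # element-wise exponential
P = E / E.sum(axis=1, keepdims=True)    # softmax: divide each row by its sum
y_pred = P.argmax(axis=1)               # predicted class = highest probability
```

Each row of `P` sums to 1, as noted above.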

## Back Propagation

The feed-forward propagation is temporarily done, as the predicted classes for the training data have been found for the first training epoch. How do we compute the loss? The loss can be computed using Multi-Category Cross-Entropy. If the actual label differs from the predicted label, the loss value for that particular training example will be very large.

After the loss is computed, the weight parameters are updated using Gradient Descent so that the model fits the training data better. The weights are updated only once per training epoch, after all the training data have propagated through the model.
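For softmax regression with cross-entropy loss, the standard batch gradient works out to Xᵀ(P − Y), accumulated over all N examples, so the single per-epoch update can be sketched as below. The data are placeholders with the article's shapes; the learning rate 0.05 is the one given in the challenge.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, K = 9, 3, 3
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, 2))])  # placeholder inputs
Y = np.eye(K)[np.repeat(np.arange(K), 3)]                  # placeholder one-hot labels
W = np.zeros((D, K))                                       # placeholder initial weights
alpha = 0.05                                               # learning rate from the challenge

# Feed-forward over the whole batch
Z = X @ W
P = np.exp(Z) / np.exp(Z).sum(axis=1, keepdims=True)

# One Gradient Descent update after all N examples have been seen
grad = X.T @ (P - Y)          # gradient of total cross-entropy w.r.t. W
W_new = W - alpha * grad      # W(new) = W(old) - alpha * gradient
```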

After the weights are updated, the 1st training epoch ends. You can continue the experiment by performing the manual calculation for the next few training epochs and observing how the values of the total loss and the weight parameters change across epochs. The total loss should decrease over training epochs if the model is learning.

# References:

[1] Ousley, S. D., & Hefner, J. T. (2005). The statistical determination of ancestry. In Proceedings of the 57th annual meeting of the American Academy of Forensic Sciences (pp. 21–26).

[2] DiGangi, E. A., & Hefner, J. T. (2013). Ancestry estimation. In Research methods in human skeletal biology (pp. 117–149). Academic Press.

[3] Raschka, S. (2019). Softmax Regression. mlxtend.