Source: Deep Learning on Medium
Logistic Regression in depth
We use logistic regression for basic as well as complex classification tasks; in other words, we use this algorithm to assign observations in our data to a discrete set of classes.
This algorithm uses the logistic sigmoid function to return a probability value.
(but wait what’s a sigmoid function?)
This function maps any real value to a value between 0 and 1. In machine learning, we use it to map predictions to probabilities. At X=0 the sigmoid is exactly 0.5, so here we can take 0.5 as the threshold value for determining the classes (i.e. 1 or 0). If the output is greater than 0.5, we classify the observation as class-1 (Y=1); if it's less than 0.5, we say it belongs to class-0 (Y=0).
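The mapping and the thresholding rule above can be sketched in a few lines (a minimal illustration; the function names are my own):

```python
import math

def sigmoid(z):
    # Maps any real value into the open interval (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

def classify(z, threshold=0.5):
    # Class 1 if the probability exceeds the threshold, else class 0
    return 1 if sigmoid(z) > threshold else 0

print(sigmoid(0))      # 0.5 -- exactly the threshold
print(classify(2.5))   # 1
print(classify(-2.5))  # 0
```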
Now let us see how we can use logistic regression for classification purposes
If you know linear regression (if not, read about it from the link at the bottom), you can think of logistic regression as a generalized form of it: instead of outputting the weighted sum of inputs directly, we pass it through a function that maps any real value to a value between 0 and 1.
If we took the weighted sum of inputs as the output, as we do in linear regression, the value could be greater than 1 (or less than 0), but we want values between 0 and 1. This is the reason linear regression can't be used for classification purposes.
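A quick sketch of this point, with made-up weights and inputs: the raw weighted sum lands outside [0, 1], while passing it through the sigmoid yields a usable probability.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Illustrative weights, bias, and inputs (made-up numbers)
w, b = [0.8, 1.5], 0.3
x = [2.0, 1.0]

# Raw weighted sum, as in linear regression
z = sum(wi * xi for wi, xi in zip(w, x)) + b
# Squashed into (0, 1), usable as P(Y=1 | X)
p = sigmoid(z)

print(z)  # 3.4 -- not a valid probability
print(p)  # ~0.97 -- a valid probability
```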
Let’s look at some assumptions of logistic regression
- For logistic regression, we require quite a large sample size.
- Only unique and meaningful variables should be included.
- The independent variables are linearly related to the log odds.
- The model should have little or no multicollinearity.
- Binary logistic regression requires the dependent variable to be binary.
Decision boundary: the boundary that separates the discrete classes; e.g., in the example above, 0.5 can be taken as the decision boundary.
Vectorization is a technique where we avoid explicit loops because it improves efficiency and speed. Let's see an example: we have 2 arrays, each consisting of 1 million elements. We multiply the arrays element-wise and sum the elements of the resulting array, first with (1) a for loop and then with (2) vectorization, and compare the time difference.
In this run, the vectorized version turned out to be approx. 47 times faster than the for loop (the exact speed-up varies by machine).
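The comparison described above can be sketched as follows (using NumPy; variable names are my own, and the timings you see will differ from the author's):

```python
import time
import numpy as np

n = 1_000_000
a = np.random.rand(n)
b = np.random.rand(n)

# (1) Explicit for loop: multiply element-wise and accumulate
start = time.perf_counter()
total_loop = 0.0
for i in range(n):
    total_loop += a[i] * b[i]
loop_time = time.perf_counter() - start

# (2) Vectorized: np.dot does the same multiply-and-sum internally
start = time.perf_counter()
total_vec = np.dot(a, b)
vec_time = time.perf_counter() - start

print(f"loop: {loop_time:.4f}s, vectorized: {vec_time:.4f}s")
```

Both versions compute the same number; the speed-up comes from NumPy doing the loop in optimized compiled code instead of the Python interpreter.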
Types of Logistic Regression
- Binary Logistic Regression: here we have only two possible outcomes. E.g. yes or no
- Multinomial Logistic Regression: here the outcomes are more than two, e.g. dog, cat, fish, etc.
- Ordinal Logistic Regression: more than two categories with ordering. E.g. IMDb rating from 1–10.
Simple Logistic Regression
The output we want = 0 or 1
Our hypothesis (or say equation) => Z=WX+B
hθ(X) = sigmoid(Z)
Analysis of hypothesis
The hypothesis outputs the estimated probability; we use this to infer how confident the prediction is, compared to the actual value. Let's take a basic example with radio frequencies: we have 2 radio stations, A and B. Station A operates at a frequency of 91.43 Hz and station B at 97.08 Hz. Consider the following example.
X = [x₀, x₁] = [1, frequency]
Now, based on the x₁ value, say we get an estimated probability of 0.88 (we take Y=1 for A and Y=0 for B). This means there is an 88% chance that the signal is from station A.
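A minimal sketch of this example. The parameter values θ here are made up purely so the hypothesis outputs roughly the 0.88 from the text; a real model would learn them from data.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def h(x, theta):
    # hθ(X) = sigmoid(θ·X), with X = [x0, x1] = [1, frequency]
    z = sum(t * xi for t, xi in zip(theta, x))
    return sigmoid(z)

# Hypothetical parameters, chosen only to reproduce ~0.88
theta = [65.0, -0.6891]
x = [1.0, 91.43]  # bias term and the observed frequency

p = h(x, theta)
print(f"P(Y=1 | X) = {p:.2f}")  # estimated probability the signal is from station A
```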
Mathematically we can write it as,
hθ(X) = P(Y=1 | X; θ), the probability that Y=1 given X, parameterized by θ.
P(Y=1 | X; θ) + P(Y=0 | X; θ) = 1
P(Y=0 | X; θ) = 1- P(Y=1 | X; θ)
This justifies the name 'logistic regression': data is fitted to a linear regression model, which is then acted upon by a logistic function to predict the target categorical dependent variable.
Simplified cost function
Cost(hθ(X), y) = -y·log(hθ(X)) - (1-y)·log(1-hθ(X))
If y=1, the (1-y) term becomes zero, so -log(hθ(X)) alone will be present.
If y=0, the y term becomes zero, so -log(1-hθ(X)) alone will be present.
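This cost (binary cross-entropy) for a single example can be written directly from the formula above; the sample probabilities here are just illustrative:

```python
import math

def cost(h, y):
    # Binary cross-entropy for one example:
    # -y*log(h) - (1-y)*log(1-h)
    return -y * math.log(h) - (1 - y) * math.log(1 - h)

print(cost(0.9, 1))  # small cost: confident and correct
print(cost(0.9, 0))  # large cost: confident but wrong
```

Note how the cost is small when the predicted probability agrees with the label and grows sharply when a confident prediction is wrong.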
Why this cost function?
Ŷ = P(Y=1 | X), Ŷ is the probability that Y=1, given X
(1-Ŷ) = P(Y=0 | X)
P(Y|X) = Ŷ^Y · (1-Ŷ)^(1-Y)
If Y=1 => P(Y|X) = Ŷ, and if Y=0 => P(Y|X) = 1-Ŷ
Taking the log: log(Ŷ^Y · (1-Ŷ)^(1-Y)) = Y·log(Ŷ) + (1-Y)·log(1-Ŷ)
=> log P(Y|X) = -L(Ŷ, Y)
We take the negative here because, as we train our model, we maximize the probability by minimizing the loss function. Decreasing the cost increases the maximum likelihood, assuming the samples are drawn from an independent and identical distribution (i.i.d.).
But why can't the cost function used for linear regression be used for logistic regression?
If we used it for logistic regression, it would be a non-convex function of the parameters (θ), and gradient descent is guaranteed to converge to the global minimum only if the function is convex.
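To close the loop, here is a minimal sketch of training by batch gradient descent on the convex cross-entropy loss. The tiny one-feature dataset, learning rate, and iteration count are all made up for illustration:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Tiny made-up dataset: one feature, binary label
X = [0.5, 1.0, 1.5, 3.0, 3.5, 4.0]
Y = [0, 0, 0, 1, 1, 1]

w, b, lr = 0.0, 0.0, 0.1
for _ in range(2000):
    # Gradients of the average cross-entropy loss w.r.t. w and b
    dw = sum((sigmoid(w * x + b) - y) * x for x, y in zip(X, Y)) / len(X)
    db = sum((sigmoid(w * x + b) - y) for x, y in zip(X, Y)) / len(X)
    w -= lr * dw
    b -= lr * db

preds = [1 if sigmoid(w * x + b) > 0.5 else 0 for x in X]
print(preds)  # should match the labels on this separable toy data
```

Because the loss is convex in (w, b), this descent heads toward the single global minimum rather than getting stuck in a local one.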