Original article was published by Raoof Naushad on Artificial Intelligence on Medium

# Generalised Linear Models — Basics and Implementation

# Probability Distributions:

Probability distributions are fundamental to statistics, just like data structures are to computer science.

Things happen all the time:

1) dice are rolled,

2) it rains,

3) buses arrive.

After the fact, the specific outcomes are certain:

- the dice came up 3 and 4,
- there was half an inch of rain today,
- the bus took 3 minutes to arrive.

Probability distributions describe what we think the probability of each outcome is, which is sometimes more interesting to know than simply which single outcome is most likely.

They come in many shapes, but in only one size: probabilities in a distribution always add up to **1**.

**For example**, flipping a fair coin has two outcomes: it lands heads or tails.

Before the flip, we believe there’s a 1 in 2 chance, or 0.5 probability, of heads. The same is true for tails.

That’s a probability distribution over the two outcomes of the flip, and this is an example of **Bernoulli distribution.**

Above, both outcomes were equally likely. The Bernoulli probability mass function has two bars of equal height, representing the two equally probable outcomes of 0 and 1 at either end.
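As a quick illustration, here is a minimal sketch (using numpy, assuming a fair coin) that simulates Bernoulli trials and checks that the empirical probabilities form a valid distribution:

```python
import numpy as np

# Simulate 10,000 fair-coin flips (Bernoulli trials with p = 0.5)
rng = np.random.default_rng(seed=0)
flips = rng.binomial(n=1, p=0.5, size=10_000)

p_heads = flips.mean()   # empirical P(heads)
p_tails = 1 - p_heads    # empirical P(tails)

# Probabilities in a distribution always add up to 1
assert abs(p_heads + p_tails - 1.0) < 1e-12
```

With a fair coin, `p_heads` hovers near 0.5, and the two empirical probabilities sum to 1 by construction, as any distribution must.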

# Exponential Family

Some of the basic linear regression and classification algorithms can also be derived from the general form of Exponential Family.

A distribution belongs to the exponential family if it can be transformed into the general form:

p(y; η) = b(y) exp(ηᵀ T(y) − a(η))

where

**η** is the canonical (natural) parameter,

**T(y)** is the sufficient statistic,

**a(η)** is the cumulant function or log partition function, and

**b(y)** is the base measure.

The exponential family includes the **Gaussian, binomial, multinomial, Poisson, Gamma** and many other distributions.

A fixed choice of T, a and b defines a family (or set) of distributions parameterized by η; as we vary η, we get different distributions within this family.
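As a concrete example, the Bernoulli pmf ϕ^y (1 − ϕ)^(1 − y) can be rewritten as exp(y · log(ϕ/(1 − ϕ)) + log(1 − ϕ)), which matches the general form with η = log(ϕ/(1 − ϕ)), T(y) = y, a(η) = log(1 + e^η) and b(y) = 1. A numerical sketch of this equivalence (the function names are illustrative):

```python
import numpy as np

def bernoulli_pmf(y, phi):
    """Standard parameterization: phi^y * (1 - phi)^(1 - y)."""
    return phi**y * (1 - phi)**(1 - y)

def bernoulli_exp_family(y, phi):
    """Exponential-family form b(y) * exp(eta * T(y) - a(eta)) with
    eta = log(phi / (1 - phi)), T(y) = y, a(eta) = log(1 + e^eta), b(y) = 1."""
    eta = np.log(phi / (1 - phi))
    a = np.log(1 + np.exp(eta))
    return np.exp(eta * y - a)

# The two parameterizations agree for both outcomes
for y in (0, 1):
    assert np.isclose(bernoulli_pmf(y, 0.3), bernoulli_exp_family(y, 0.3))
```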

Some examples of exponential family distributions used in machine learning:

1) Bernoulli distribution

2) Gaussian distribution

3) Poisson distribution

# Generalized Linear Models

Learning GLM lets you understand how we can use **probability distributions as building blocks** for modeling.

## Case 1

Consider the probability distribution below. Does it show any properties?

The relationship is linear, and the spread of the points suggests that the data follow a **normal distribution** with a **fixed variance**.

**Linear regression** is used to predict the value of a continuous variable y from a linear combination of explanatory variables X. In the univariate case, linear regression can be expressed as:

yᵢ = β₀ + β₁xᵢ + εᵢ, where εᵢ ∼ N(0, σ²)

Here, i indicates the index of each sample. Notice that this model assumes a normal distribution for the noise term.

Thus we see that linear regression can be modelled using the normal distribution.
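A minimal sketch of this, fitting a univariate linear regression by least squares on synthetic data (the true coefficients 1.0 and 2.0 are invented for the example):

```python
import numpy as np

# Synthetic data consistent with the model: y = 1 + 2x + Gaussian noise
rng = np.random.default_rng(seed=42)
x = np.linspace(0, 10, 100)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=x.shape)

# Least-squares fit: design matrix with an intercept column
X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
intercept, slope = beta
```

With moderate noise, the fitted `intercept` and `slope` land close to the values 1.0 and 2.0 used to generate the data.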

## Case 2

Assume you need to predict the number of defective products (Y) from a sensor value (x) as the explanatory variable. The scatter plot looks like this.

Consider this graph. Does it follow linearity, or can we model it using a **normal distribution**? Is linear regression a good way to solve it?

1) The relationship between X and Y does not look linear. It’s more likely to be **exponential**.

2) The **variance of Y does not look constant** with regard to X. Here, the variance of Y seems to increase when X increases.

3) As Y represents the number of products, it always has to be a positive integer. In other words, **Y is a discrete variable**. However, the normal distribution used for linear regression assumes continuous variables. This also means the prediction by linear regression can be negative. It’s not appropriate for this kind of count data.

Here, the more proper model you can think of is the **Poisson regression model**. Poisson regression is an example of generalized linear models (GLM).

The Poisson distribution is used to model count data. It has only one parameter, λ, which is both the mean and the variance of the distribution. This means **the larger the mean, the larger the standard deviation**.

The magenta curve is the prediction by Poisson regression, and it matches the data well. This is how we use **probability distributions** for modelling.
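A sketch of Poisson regression fitted by Newton's method (iteratively reweighted least squares) on synthetic count data; the true coefficients 0.5 and 0.3 are invented for this example, and in practice a library such as statsmodels would normally be used:

```python
import numpy as np

# Synthetic count data: log(lambda) = 0.5 + 0.3 * x (log link)
rng = np.random.default_rng(seed=1)
x = rng.uniform(0, 5, size=500)
lam_true = np.exp(0.5 + 0.3 * x)
y = rng.poisson(lam_true)

# Design matrix with intercept; fit by Newton's method / IRLS
X = np.column_stack([np.ones_like(x), x])
beta = np.array([np.log(y.mean()), 0.0])   # safe starting point

for _ in range(25):
    lam = np.exp(X @ beta)                 # current mean, always positive
    grad = X.T @ (y - lam)                 # score (gradient of log-likelihood)
    H = X.T @ (X * lam[:, None])           # Fisher information
    beta += np.linalg.solve(H, grad)       # Newton step
```

The recovered `beta` approaches the generating coefficients (0.5, 0.3), and the fitted mean `exp(X @ beta)` is positive by construction, matching the count nature of the data.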

# Constructing GLMs

## Assumptions to follow

1) **y | x; θ ∼ ExponentialFamily(η)**. I.e., given x and θ, the distribution of y follows some exponential family distribution, with parameter η.

2) Given x, our goal is to predict the expected value of T(y) given x. In most of our examples, we will have T(y) = y, so this means we would like the prediction h(x) output by our learned hypothesis h to satisfy **h(x) = E[y|x]**.

3) The natural parameter η and the inputs x are related linearly: **η = θᵀx**.

First, let's show that the Bernoulli and Gaussian distributions are members of the exponential family.

## Deriving — Least Square Linear Regression from GLM

1) Type of data => The target variable y is continuous

2) Modelling => We model the distribution of y given x as Gaussian Distribution y|x; θ ∼ N (µ, σ2 )

3) First equality is based on Assumption 2

4) Second equality is because we use Gaussian as Probability distribution, So expected value is µ

5) The third equality follows from Assumption 1 and the exponential-family derivation of the Gaussian (where µ = η).

6) The last one follows Assumption 3: η and the input x are related linearly.
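Written out, the chain of equalities in steps 3–6 is (in standard GLM notation):

```latex
\begin{aligned}
h_\theta(x) &= E[y \mid x; \theta] && \text{(Assumption 2)} \\
            &= \mu                 && \text{(Gaussian expectation)} \\
            &= \eta                && \text{(for the Gaussian, } \mu = \eta\text{)} \\
            &= \theta^T x          && \text{(Assumption 3)}
\end{aligned}
```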

## Deriving — Logistic Regression from GLM

1) Type of data => The target variable y ∈ {0, 1}

2) Modelling => We model the distribution of y given x as Bernoulli Distribution y|x; θ ∼ Bernoulli(ϕ)

3) First equality is based on Assumption 2

4) The second equality holds because we use the Bernoulli distribution, so the expected value is ϕ

5) The third equality follows from Assumption 1 and the exponential-family derivation of the Bernoulli distribution, which gives ϕ = 1/(1 + e^−η).

6) The last one follows Assumption 3: η and the input x are related linearly.
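Written out, the chain of equalities in steps 3–6 is (in standard GLM notation):

```latex
\begin{aligned}
h_\theta(x) &= E[y \mid x; \theta]            && \text{(Assumption 2)} \\
            &= \phi                           && \text{(Bernoulli expectation)} \\
            &= \frac{1}{1 + e^{-\eta}}        && \text{(Bernoulli: } \phi = 1/(1 + e^{-\eta})\text{)} \\
            &= \frac{1}{1 + e^{-\theta^T x}}  && \text{(Assumption 3)}
\end{aligned}
```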

# Components of Generalized Linear Models

There are three components in generalized linear models.

- Linear predictor
- Link function
- Probability distribution

# 1) Poisson Regression

1) The linear predictor is just a linear combination of the parameters (b) and the explanatory variable (x).

2) Link function literally “links” the linear predictor and the parameter for probability distribution. In the case of Poisson regression, the typical link function is the log link function.

3) The last component is the probability distribution which generates the observed variable y.

## Note

The prediction curve is exponential, as the inverse of the log link function is the exponential function. From this, it is also clear that the Poisson parameter calculated from the linear predictor is guaranteed to be positive.
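A one-line check of this positivity claim (the linear-predictor values are illustrative):

```python
import numpy as np

# The log link maps the positive Poisson mean onto the whole real line;
# its inverse, exp, maps any linear-predictor value back to a positive mean.
linear_predictor = np.array([-5.0, -1.0, 0.0, 2.0, 10.0])
lam = np.exp(linear_predictor)   # inverse of the log link

assert np.all(lam > 0)           # predicted counts can never be negative
```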

# 2) Linear Regression

Linear regression is also an example of GLM. It just uses identity link function (the linear predictor and the parameter for the probability distribution are identical) and normal distribution as the probability distribution.

# 3) Logistic Regression

If you use the logit function as the link function and the binomial / Bernoulli distribution as the probability distribution, the model is called logistic regression.

The right-hand side of the second equation is called the logistic function; therefore, this model is called **logistic regression**. As the logistic function returns values between 0 and 1 for arbitrary inputs, its output is a proper parameter for the **binomial distribution**.
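A minimal sketch of the logistic function and its range (numpy-based, with illustrative inputs):

```python
import numpy as np

def logistic(z):
    """Inverse of the logit link: maps any real z into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-10, 10, 101)
p = logistic(z)

# Every output is a valid Bernoulli probability, and logistic(0) = 0.5
assert np.all((p > 0) & (p < 1))
assert np.isclose(logistic(0.0), 0.5)
```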