Bayesian Learning for ML:


Hey everyone!!! Today we are going to learn about modelling concepts in probability, so let's jump into it.

Introduction:

Bayesian machine learning is a particular set of approaches to probabilistic machine learning (for other probabilistic models, see Supervised Learning).

Bayesian learning treats model parameters as random variables — in Bayesian learning, parameter estimation amounts to computing posterior distributions for these random variables based on the observed data.

Bayesian probability interprets probabilities as degrees of "partial belief".

Bayesian estimation calculates the validity of a proposition.

The validity of the proposition depends on two things:

i) the prior estimate, and

ii) new relevant evidence.

Hypothesis Space (H):
The hypothesis space is the set of all possible legal hypotheses. This is the set from which the machine learning algorithm determines the single hypothesis that best describes the target function or the outputs.

Hypothesis (h):
A hypothesis is a function that best describes the target in supervised machine learning. The hypothesis that an algorithm comes up with depends on the data and also on the restrictions and bias that we have imposed on it.

Bayes Theorem:

Bayes' theorem is a fundamental result of probability theory: it expresses the posterior probability P(H|D) of a hypothesis as the probability of the data given the hypothesis, P(D|H), multiplied by the prior probability of the hypothesis, P(H), divided by the probability of seeing the data, P(D). We have already seen one application of Bayes' theorem in class: in the analysis of information cascades, we found that rational decisions can be made in which one's own personal information is discarded, based on the conditional probabilities calculated via Bayes' theorem.

P(h|D) = P(D|h) P(h) / P(D)

By the product rule: P(h ∧ D) = P(h) · P(D|h)

Since the joint probability is commutative, we also have: P(D ∧ h) = P(D) · P(h|D)

Note:

Setting these two expressions equal, we arrive at the Bayesian formula given above. Also,

P(h) = prior probability of the hypothesis.

P(D|h) = likelihood of the data under the hypothesis, and P(D) = marginal probability (evidence) of the data.

Application of the Bayes Theorem:

Applications of the theorem are widespread and not limited to the financial realm. As an example, Bayes' theorem can be used to determine the accuracy of medical test results by taking into consideration how likely any given person is to have the disease and the general accuracy of the test.
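To make this concrete, here is a minimal Python sketch of the medical-test example; the prevalence, sensitivity, and false-positive rate below are made-up illustrative numbers, not real clinical figures.

```python
# Minimal sketch of Bayes' theorem for a medical test.
# All numbers here are invented for illustration.

prevalence = 0.01        # P(disease): prior
sensitivity = 0.95       # P(positive | disease): likelihood
false_positive = 0.05    # P(positive | no disease)

# P(positive) = P(pos|disease)P(disease) + P(pos|no disease)P(no disease)
p_positive = sensitivity * prevalence + false_positive * (1 - prevalence)

# Bayes' theorem: P(disease | positive) = P(positive | disease) P(disease) / P(positive)
posterior = sensitivity * prevalence / p_positive
print(f"P(disease | positive test) = {posterior:.3f}")   # ~0.161
```

Even with an accurate test, the low prior (prevalence) keeps the posterior probability of disease fairly small.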

MAP Hypothesis:

In Bayesian statistics, a maximum a posteriori (MAP) estimate is an estimate of an unknown quantity that equals the mode of the posterior distribution. The MAP can be used to obtain a point estimate of an unobserved quantity on the basis of empirical data.

h(MAP) = argmax P(h|D)

{where h ∈ H, and H is the hypothesis space}

h(MAP) = argmax P(D|h) P(h)   {since P(D) does not depend on h, it can be dropped}
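As a quick illustration, here is a toy Python sketch of the MAP rule over a small, hypothetical hypothesis space; the priors and likelihoods are invented for illustration.

```python
# Toy sketch of h(MAP) = argmax_h P(D|h) P(h) over a made-up hypothesis space.

priors = {"h1": 0.7, "h2": 0.2, "h3": 0.1}          # P(h)
likelihoods = {"h1": 0.05, "h2": 0.30, "h3": 0.90}  # P(D|h)

# The posterior is proportional to likelihood * prior; P(D) is the same for
# every hypothesis, so it can be ignored when taking the argmax.
unnormalized = {h: likelihoods[h] * priors[h] for h in priors}
h_map = max(unnormalized, key=unnormalized.get)
print(h_map, unnormalized)   # h3 wins: 0.9 * 0.1 = 0.09 > 0.06 > 0.035
```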

ML Hypothesis:

Maximum likelihood estimation is a method that determines values for the parameters of a model. The parameter values are found such that they maximize the likelihood that the process described by the model produced the data that were actually observed.

h(ML) = argmax P(D|h)
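For example, here is a small Python sketch of the ML hypothesis for a coin-flip model, found by a grid search over candidate biases; the data (7 heads in 10 tosses) is made up for illustration.

```python
import numpy as np

# Sketch of h(ML) = argmax_h P(D|h) for a coin-flip model with made-up data.
heads, tosses = 7, 10

# Candidate hypotheses: possible values of the coin's bias P(heads).
candidates = np.linspace(0.01, 0.99, 99)

# Likelihood of the data under each hypothesis (binomial, constant factor dropped).
likelihood = candidates**heads * (1 - candidates)**(tosses - heads)

h_ml = candidates[np.argmax(likelihood)]
print(h_ml)   # ~0.7, the empirical frequency of heads
```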

Learn a real-valued function:

Here we want to learn a hypothesis h that approximates the target function f from noisy training examples.

f = Target function.

Training examples: (xᵢ, dᵢ)

dᵢ = f(xᵢ) + εᵢ

where εᵢ (epsilon) is a noise term generated independently for each example.

The noise εᵢ is drawn from a normal distribution with mean zero and variance σ², so dᵢ follows a normal distribution with mean f(xᵢ) and variance σ².

h(ML) = argmax over h of P(D|h)

h(ML) = argmax over h of ∏(i=1 to m) [ 1/√(2πσ²) ] · exp( −(dᵢ − h(xᵢ))² / (2σ²) )

Taking the log, dropping the terms that do not depend on h, and flipping the sign, we get:

h(ML) = argmin over h of Σ(i=1 to m) (dᵢ − h(xᵢ))²
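So under Gaussian noise, the maximum likelihood hypothesis is exactly the least-squares hypothesis. This can be checked numerically; the sketch below uses a made-up target function, noise level, and hypothesis family, and confirms that the slope maximizing the Gaussian likelihood is the same one minimizing the sum of squared errors.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: d_i = f(x_i) + eps_i, with f(x) = 2x and Gaussian noise.
# Target function and noise level are invented for illustration.
m, sigma = 50, 0.5
x = rng.uniform(0, 1, m)
d = 2.0 * x + rng.normal(0, sigma, m)

# Candidate hypotheses h_w(x) = w * x, for a grid of slopes w.
slopes = np.linspace(0.0, 4.0, 401)

def log_likelihood(w):
    # Gaussian log-likelihood of the data under hypothesis h_w
    r = d - w * x
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2) - r**2 / (2 * sigma**2))

def sse(w):
    # Sum of squared errors under hypothesis h_w
    return np.sum((d - w * x) ** 2)

w_ml = slopes[np.argmax([log_likelihood(w) for w in slopes])]
w_ls = slopes[np.argmin([sse(w) for w in slopes])]
print(w_ml, w_ls)   # identical: maximizing the likelihood = minimizing squared error
```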

Bayes Optimal Classifier:

The Bayes optimal classifier is a probabilistic model that makes the most probable prediction for a new example. Bayes' theorem provides a principled way of calculating the conditional probability it needs, called the posterior probability.

Usually, a dataset D is considered to consist of n i.i.d. samples xᵢ of a distribution that generates your data. Then, you build a predictive model from the given data: given a sample xᵢ, you predict the class f̂(xᵢ), whereas the real class of the sample is f(xᵢ).

However, in theory, you could decide not to choose one particular model f̂_chosen, but rather consider all possible models f̂ at once and combine them somehow into one big model F̂.

Of course, given the data, many of the smaller models could be quite improbable or inappropriate (for example, models that predict only one value of the target, even though there are multiple values of the target in your dataset D).

In any case, you want to predict the target value of new samples, which are drawn from the same distribution as the xᵢ. A good measure e of the performance of your model would be

e(model) = P[f(X) = model(X)],

i.e., the probability that you predict the true target value for a randomly sampled X.

Using Bayes' formula, you can compute the probability that a new sample x has target value v, given the data D:

P(v|D) = Σ over f̂ of P(v|f̂) · P(f̂|D).

Question: Given a new instance x, what is its most probable classification?

Note that h(MAP)(x) is not necessarily the most probable classification, as the following example shows.

Example: let P(h1|D) = 0.4, P(h2|D) = 0.3, P(h3|D) = 0.3.

Given a new instance x, we have h1(x) = +, h2(x) = −, h3(x) = −.

What is the most probable classification of x?

Using Bayes optimal classification:

P(h1|D) = 0.4, P(−|h1) = 0, P(+|h1) = 1

P(h2|D) = 0.3, P(−|h2) = 1, P(+|h2) = 0

P(h3|D) = 0.3, P(−|h3) = 1, P(+|h3) = 0

Therefore Σ P(+|hᵢ)P(hᵢ|D) = 0.4 and Σ P(−|hᵢ)P(hᵢ|D) = 0.6, so the Bayes optimal classification of x is −, even though the MAP hypothesis h1 predicts +.
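Here is the same calculation as a short Python sketch of the Bayes optimal rule P(v|D) = Σ P(v|h) P(h|D), using the numbers from the example above:

```python
# Bayes optimal classification for the example above.

posteriors = {"h1": 0.4, "h2": 0.3, "h3": 0.3}    # P(h|D)
predictions = {"h1": "+", "h2": "-", "h3": "-"}   # each hypothesis's vote on x

votes = {"+": 0.0, "-": 0.0}
for h, p in posteriors.items():
    # P(v|h) is 1 for the class the hypothesis predicts and 0 otherwise
    votes[predictions[h]] += p

print(votes)                       # {'+': 0.4, '-': 0.6}
print(max(votes, key=votes.get))   # '-' is the Bayes optimal classification
```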

Gibbs Sampling:

Gibbs sampling is a Markov Chain Monte Carlo (MCMC) algorithm where each random variable is iteratively resampled from its conditional distribution given the remaining variables. It’s a simple and often highly effective approach for performing posterior inference in probabilistic models.

Gibbs sampling is commonly used for statistical inference (e.g. determining the best value of a parameter, such as determining the number of people likely to shop at a particular store on a given day, the candidate a voter will most likely vote for, etc.).
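As a minimal sketch, here is a Gibbs sampler for a bivariate normal distribution, where both conditional distributions are known in closed form; the target distribution and its correlation are chosen only for illustration.

```python
import numpy as np

# Minimal Gibbs sampler sketch for a standard bivariate normal with correlation rho.
# Each variable is resampled from its conditional given the other, as described above.
rng = np.random.default_rng(0)
rho, n_samples = 0.8, 5000

x, y = 0.0, 0.0
samples = np.empty((n_samples, 2))
for i in range(n_samples):
    # Conditionals of a standard bivariate normal:
    # x | y ~ N(rho * y, 1 - rho^2), and symmetrically for y | x.
    x = rng.normal(rho * y, np.sqrt(1 - rho**2))
    y = rng.normal(rho * x, np.sqrt(1 - rho**2))
    samples[i] = (x, y)

print(np.corrcoef(samples[1000:].T)[0, 1])   # ~0.8 after discarding burn-in
```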

Thank you:

So with this we come to an end. I hope the above content proved to be fruitful for you. Don't forget to show your appreciation by hitting the clap button.