STATISTICS FOR DATA SCIENCE

Original article was published by Rakib Ansari on Deep Learning on Medium


ALL IMPORTANT CONCEPTS OF STATISTICS IN DATA SCIENCE

1. VARIABLE : It is a placeholder that stores values.

2. Random variable : It is a variable whose value is determined by the outcome of a random process (for example, the number shown when a die is rolled).

It is of two types :

A. Numerical variable : A numerical variable is one that takes numeric values, possibly any value within a finite or infinite interval (e.g., height, weight, temperature, blood glucose, …)

Numerical variable is further divided into two parts :

A.1. Continuous (floating-point number) : A continuous variable can take any value within a range, including decimal values. For example : 5.6, 7.8, 0.001, 846.245

A.2. Discrete (whole number) : A discrete variable takes separate, countable values, typically whole numbers. For example : 0, 1, 2, 3, 4, 5, 6

B. Categorical Variable : A categorical variable is a variable that can take on one of a limited, and usually fixed, number of possible values (e.g. race, sex, age group)

Categorical Variable is further divided into two parts :

B.1. Nominal : A nominal variable is a categorical variable whose possible values have no natural order (e.g. race, sex).

B.2. Ordinal : An ordinal variable is a categorical variable for which the possible values are ordered (e.g. education level (“high school”, ”BS”, ”MS”, ”PhD”))

RANDOM VARIABLE CONCLUSION : A random variable is either numerical (continuous or discrete) or categorical (nominal or ordinal).

3. MEASURE OF CENTRAL TENDENCIES :

A. MEAN : It is the sum of a collection of numbers divided by the count of numbers in the collection.

mean = (sum of the numbers in the collection) / (count of numbers in the collection)

B. MEDIAN : The “middle” of a sorted list of numbers (when there are two middle numbers, we average them).

C. MODE : The mode of a set of data values is the value that appears most often.

NOTE : The mean, median and mode are commonly used to fill in (impute) missing values.

4. RANGE : The Range is the difference between the lowest and highest values. Example: In {4, 6, 9, 3, 7} the lowest value is 3, and the highest is 9. So the range is 9 − 3 = 6.
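The measures above can be computed with Python's standard `statistics` module. This is a minimal sketch: the `ratings` and `with_gap` lists are made-up illustrations, and filling a gap with the median is only one of several possible choices.

```python
import statistics

values = [4, 6, 9, 3, 7]             # the example set used for the range

mean = statistics.mean(values)       # sum of values / count of values
median = statistics.median(values)   # middle value of the sorted list
value_range = max(values) - min(values)

# The mode needs repeated values, so a second illustrative list is used.
ratings = [2, 3, 3, 5, 3, 2]
mode = statistics.mode(ratings)      # most frequent value

# Filling a missing value (None) with the median of the observed values.
with_gap = [4, 6, None, 3, 7]
observed = [v for v in with_gap if v is not None]
fill = statistics.median(observed)
filled = [fill if v is None else v for v in with_gap]

print(mean, median, mode, value_range, filled)
```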

MEAN
MEDIAN VS MODE VS RANGE

5. POPULATION, SAMPLE, POPULATION MEAN, SAMPLE MEAN :

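POPULATION : A population is the complete set of similar items or events that we want to study.

SAMPLE : A sample is a smaller collection of items drawn from the population.

Almost every dataset that we use to build an ML model is a sample of data, not the full population.

Population vs sample use case : an exit poll in an election. The voters who are surveyed are the sample; all the voters are the population.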

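POPULATION MEAN : The population mean is the average of a characteristic over every member of the population. It is usually denoted μ.

SAMPLE MEAN : The sample mean is the average of the sample data. It is used as an estimate of the population mean.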
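As a rough illustration, the sketch below builds a hypothetical population (simulated voter ages, not real data) and compares its mean with the mean of a random sample, in the spirit of an exit poll. The sizes and the seed are arbitrary assumptions.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is repeatable

# Hypothetical population: the ages of 100,000 voters.
population = [random.randint(18, 90) for _ in range(100_000)]
population_mean = statistics.mean(population)

# The "exit poll": a random sample of 1,000 voters, drawn without replacement.
sample = random.sample(population, 1_000)
sample_mean = statistics.mean(sample)

error = abs(sample_mean - population_mean)
print(population_mean, sample_mean, error)
```

The sample mean will not match the population mean exactly, but with a random sample of this size it is normally close.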

POPULATION VS SAMPLE
POPULATION MEAN VS SAMPLE MEAN

6. VARIANCE :

Variance : It is the expectation of the squared deviation of a random variable from its mean. Informally, it measures how far a set of numbers is spread out from their average value.

7. Standard deviation and measure of dispersion:

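Standard deviation (SD) is the most commonly used measure of dispersion. It is a measure of the spread of data about the mean. SD is the square root of the variance: take the sum of squared deviations from the mean, divide by the number of observations, and take the square root of the result. For a sample, the sum is usually divided by n − 1 instead of n.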

The standard deviation is a measure of the amount of variation or dispersion of a set of values. A low standard deviation indicates that the values tend to be close to the mean of the set, while a high standard deviation indicates that the values are spread out over a wider range.
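The calculation can be written out step by step. The sketch below uses a small made-up data set and shows both the population form (divide by n) and the sample form (divide by n − 1), then cross-checks the results against the `statistics` module.

```python
import math
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]
n = len(data)
mean = sum(data) / n

squared_deviations = [(x - mean) ** 2 for x in data]
population_variance = sum(squared_deviations) / n       # divide by n
sample_variance = sum(squared_deviations) / (n - 1)     # divide by n - 1
population_sd = math.sqrt(population_variance)
sample_sd = math.sqrt(sample_variance)

# Cross-check against the standard library implementations.
library_population_sd = statistics.pstdev(data)
library_sample_sd = statistics.stdev(data)
print(population_variance, population_sd, sample_variance, sample_sd)
```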

STANDARD DEVIATION

8. GAUSSIAN/NORMAL DISTRIBUTION :

Normal distribution, also known as the Gaussian distribution, is a probability distribution that is symmetric about the mean, showing that data near the mean are more frequent in occurrence than data far from the mean. In graph form, normal distribution will appear as a bell curve.

A Gaussian distribution can be converted to the standard normal distribution (mean = 0 and standard deviation = 1) by computing the z-score of each value: z = (x − mean) / standard deviation.

GAUSSIAN / NORMAL DISTRIBUTION

9. STANDARD NORMAL DISTRIBUTION :

The standard normal distribution is a normal distribution with a mean of zero and standard deviation of 1.

Empirical rule :
About 68.3% of values lie within 1 standard deviation of the mean
About 95.4% of values lie within 2 standard deviations of the mean
About 99.7% of values lie within 3 standard deviations of the mean
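These percentages can be checked numerically. The sketch below uses `statistics.NormalDist` from the Python standard library (available from Python 3.8) to compute the share of a standard normal distribution that lies within 1, 2 and 3 standard deviations of the mean.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def share_within(k):
    """Probability that a standard normal value lies within k SDs of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

within_1, within_2, within_3 = (share_within(k) for k in (1, 2, 3))
print(round(within_1, 4), round(within_2, 4), round(within_3, 4))
```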

STANDARD NORMAL DISTRIBUTION

10. Z-SCORE :

The value of the z-score tells you how many standard deviations a value is away from the mean. If a z-score is equal to 0, the value is on the mean. A positive z-score indicates the raw score is higher than the mean, and a negative z-score indicates it is lower. For example, if a z-score is equal to +1, it is 1 standard deviation above the mean.
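A z-score is a one-line calculation. In the sketch below, the test-score figures (mean 100, standard deviation 15) are hypothetical, and `z_score` is just an illustrative helper name.

```python
import statistics

def z_score(x, mean, sd):
    """Number of standard deviations that x lies from the mean."""
    return (x - mean) / sd

# Hypothetical test scores with mean 100 and standard deviation 15.
z_high = z_score(130, mean=100, sd=15)   # above the mean, so positive
z_low = z_score(85, mean=100, sd=15)     # below the mean, so negative

# Standardizing a whole data set gives mean 0 and standard deviation 1.
data = [2, 4, 4, 4, 5, 5, 7, 9]
mu = statistics.mean(data)
sigma = statistics.pstdev(data)
standardized = [z_score(x, mu, sigma) for x in data]
print(z_high, z_low, standardized)
```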

Z-SCORE

10. PROBABILITY DENSITY FUNCTION :

A probability density function, or density of a continuous random variable, is a function whose value at any given sample in the sample space can be interpreted as providing a relative likelihood that the value of the random variable would equal that sample.

PROBABILITY DENSITY FUNCTION

11. CUMULATIVE DISTRIBUTION FUNCTION :

The cumulative distribution function (CDF) of a real-valued random variable X, evaluated at x, is the probability that X will take a value less than or equal to x: F(x) = P(X ≤ x).
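The link between the density and the CDF can be seen numerically. This sketch uses a standard normal distribution as an assumed example: the probability of an interval is the difference of two CDF values, and the same number is recovered by summing density times width over the interval.

```python
from statistics import NormalDist

dist = NormalDist(mu=0, sigma=1)

density_at_0 = dist.pdf(0)    # height of the curve, not a probability
prob_below_0 = dist.cdf(0)    # P(X <= 0)

# Probability of an interval = difference of two CDF values.
prob_between = dist.cdf(1) - dist.cdf(-1)

# The same probability approximated by summing density * width (midpoint rule).
steps = 2000
width = 2 / steps
approx = sum(dist.pdf(-1 + (i + 0.5) * width) * width for i in range(steps))
print(density_at_0, prob_below_0, prob_between, approx)
```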

CUMULATIVE DISTRIBUTION FUNCTION

12. HYPOTHESIS TESTING :

Hypothesis testing in statistics is a way for you to test the results of a survey or experiment to see if you have meaningful results. You’re basically testing whether your results are valid by figuring out the odds that your results have happened by chance. If your results may have happened by chance, the experiment won’t be repeatable and so has little use.
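As a small worked example of "the odds that your results have happened by chance", the sketch below tests whether a coin is fair after seeing 60 heads in 100 flips. The numbers, the "null hypothesis" framing (the coin is fair) and the 0.05 threshold are assumptions chosen for illustration, not something taken from the text above.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

n_flips, heads = 100, 60
p_null = 0.5   # null hypothesis: the coin is fair

# Two-sided p-value: total probability of outcomes at least as far
# from the expected 50 heads as the observed result.
distance = abs(heads - n_flips * p_null)
p_value = sum(
    binomial_pmf(k, n_flips, p_null)
    for k in range(n_flips + 1)
    if abs(k - n_flips * p_null) >= distance
)

alpha = 0.05   # conventional significance level, an assumption here
reject_null = p_value < alpha
print(p_value, reject_null)
```

With these numbers the p-value comes out slightly above 0.05, so at that threshold the result could plausibly be due to chance.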

HYPOTHESIS TESTING

13. KERNEL DENSITY ESTIMATION(KDE) :

Kernel density estimation (KDE) is a non-parametric way to estimate the probability density function of a random variable.

Kernel density estimates are closely related to histograms, but can be endowed with properties such as smoothness or continuity by using a suitable kernel.
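A kernel density estimate can be sketched in a few lines: place a smooth bump (here a Gaussian kernel) on every data point and average the bumps. The data values and the bandwidth of 0.5 below are arbitrary assumptions; in practice the bandwidth is tuned to the data.

```python
import math

def gaussian_kernel(u):
    """Standard normal density, used as the smoothing kernel."""
    return math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi)

def kde(x, data, bandwidth):
    """Kernel density estimate at x: the average of kernels centred on each point."""
    n = len(data)
    return sum(gaussian_kernel((x - xi) / bandwidth) for xi in data) / (n * bandwidth)

data = [1.0, 1.5, 2.0, 2.2, 3.1, 5.0, 5.3]
bandwidth = 0.5   # assumed smoothing parameter, chosen by hand

grid = [i * 0.01 for i in range(-500, 1200)]    # from -5.00 to 11.99
density = [kde(x, data, bandwidth) for x in grid]
area = sum(d * 0.01 for d in density)           # a density should integrate to about 1
print(area)
```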

KERNEL DENSITY ESTIMATOR

14. CENTRAL LIMIT THEOREM :

The central limit theorem states that if you have a population with mean μ and standard deviation σ and take sufficiently large random samples of size N from the population with replacement, then the distribution of the sample means will be approximately normal, with mean μ and standard deviation σ/√N.

The central limit theorem tells us that, whatever the shape of the population distribution (as long as its variance is finite), the sampling distribution of the mean will approach normality as the sample size (N) increases.
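A quick simulation can make this concrete. The sketch below draws many samples from a strongly skewed population (an exponential distribution with mean 1 and standard deviation 1, chosen as an assumption) and looks at the distribution of the sample means. The sample size, number of samples and seed are arbitrary.

```python
import math
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

sample_size = 50     # N, the size of each sample (assumed value)
n_samples = 2000     # how many samples are drawn

sample_means = []
for _ in range(n_samples):
    sample = [random.expovariate(1.0) for _ in range(sample_size)]
    sample_means.append(statistics.mean(sample))

mean_of_means = statistics.mean(sample_means)
sd_of_means = statistics.stdev(sample_means)
expected_sd = 1.0 / math.sqrt(sample_size)   # sigma / sqrt(N)
print(mean_of_means, sd_of_means, expected_sd)
```

The sample means cluster around the population mean, and their spread is close to σ/√N even though the population itself is far from normal.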

CENTRAL LIMIT THEOREM

15. SKEWNESS :

Skewness refers to distortion or asymmetry in a symmetrical bell curve, or normal distribution, in a set of data. If the curve is shifted to the left or to the right, it is said to be skewed. Skewness can be quantified as a representation of the extent to which a given distribution varies from a normal distribution.
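One common way to quantify skewness is the moment coefficient: the third central moment divided by the cube of the standard deviation. The sketch below assumes that definition (other definitions exist) and uses small made-up data sets.

```python
def skewness(data):
    """Moment coefficient of skewness: third central moment / SD cubed."""
    n = len(data)
    mean = sum(data) / n
    m2 = sum((x - mean) ** 2 for x in data) / n   # population variance
    m3 = sum((x - mean) ** 3 for x in data) / n   # third central moment
    return m3 / m2 ** 1.5

symmetric = [1, 2, 3, 4, 5]
right_tailed = [1, 1, 2, 2, 3, 10]            # long tail to the right
left_tailed = [-x for x in right_tailed]      # mirror image, long tail to the left

print(skewness(symmetric), skewness(right_tailed), skewness(left_tailed))
```

A symmetric data set gives a value of zero, a long right tail gives a positive value, and a long left tail gives a negative value.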

SKEWNESS

16. COVARIANCE :

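Covariance is a measure of the joint variability of two random variables. If the greater values of one variable mainly correspond with the greater values of the other variable, and the same holds for the lesser values, the covariance is positive. If greater values of one variable mainly correspond with lesser values of the other, the covariance is negative.

The sign of the covariance tells the direction of the relationship. Its magnitude depends on the units of the variables, so on its own it does not show how strong the relationship is.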
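A minimal sketch of the sample covariance (with an n − 1 denominator, which is an assumption about the variant intended) is shown below. The `hours` and `score` lists are invented. The last line shows that changing the unit of one variable changes the covariance even though the relationship is the same.

```python
def covariance(xs, ys):
    """Sample covariance: summed product of deviations divided by n - 1."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

hours = [1, 2, 3, 4, 5]
score = [2, 4, 6, 8, 10]

cov_positive = covariance(hours, score)         # both rise together
cov_negative = covariance(hours, score[::-1])   # one rises as the other falls
# Same relationship with hours expressed in minutes: the value scales by 60.
cov_minutes = covariance([h * 60 for h in hours], score)
print(cov_positive, cov_negative, cov_minutes)
```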

COVARIANCE FORMULA
POSITIVE, NEGATIVE AND ZERO COVARIANCE

17. PEARSON CORRELATION COEFFICIENT :

Pearson’s correlation coefficient (r) is a measure of the strength of the linear association between two variables.

Pearson Correlation Coefficient helps in feature selection.
Pearson Correlation Coefficient lies between -1 and 1.

Pearson Correlation Coefficient tells magnitude and direction.
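Pearson's r can be computed as the covariance divided by the product of the two standard deviations. The sketch below is one straightforward way to write it, with invented data, and shows that rescaling a variable leaves r unchanged.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: covariance scaled by both standard deviations."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

hours = [1, 2, 3, 4, 5]
score = [2, 4, 6, 8, 10]

r_up = pearson_r(hours, score)                       # perfect positive line
r_down = pearson_r(hours, score[::-1])               # perfect negative line
r_minutes = pearson_r([h * 60 for h in hours], score)   # unit change
print(r_up, r_down, r_minutes)
```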

PEARSON CORRELATION COEFFICIENT
FORMULA OF PEARSON CORRELATION COEFFICIENT

18. SPEARMAN RANK CORRELATION :

It assesses how well the relationship between two variables can be described using a monotonic function (a function between ordered sets that preserves or reverses the given order).

Spearman’s rank correlation coefficient tells magnitude and direction even when the relationship is non-linear (as long as it is monotonic), and it is less sensitive to outliers because it works on ranks rather than raw values.
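Spearman's coefficient can be computed as the Pearson correlation of the ranks. The sketch below assumes that formulation, with tied values given the average of their ranks. The helper names and the cubic example data are illustrative only.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def ranks(values):
    """Rank values from 1 to n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def spearman_rho(xs, ys):
    """Spearman correlation: Pearson correlation applied to the ranks."""
    return pearson_r(ranks(xs), ranks(ys))

x = [1, 2, 3, 4, 5, 6]
y = [v ** 3 for v in x]    # monotonic but not linear

rho = spearman_rho(x, y)
r = pearson_r(x, y)
print(rho, r)
```

Because y always increases with x, the ranks agree perfectly and Spearman's coefficient is 1, while Pearson's r is lower because the relationship is not a straight line.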

FORMULA OF SPEARMAN RANK CORRELATION

SAME RESULT WHEN THERE IS NO OUTLIER

SPEARMAN GIVES BETTER RESULT WITH OUTLIERS

POSITIVE SPEARMAN CORRELATION

NEGATIVE SPEARMAN CORRELATION

19. Q-Q PLOT :

Q–Q (quantile-quantile) plot is a probability plot, which is a graphical method for comparing two probability distributions by plotting their quantiles against each other.

A Q–Q plot is used to compare the shapes of distributions, providing a graphical view of how properties such as location, scale, and skewness are similar or different in the two distributions.
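The points of a Q-Q plot against a normal distribution can be computed without any plotting library. In this sketch the sample values are made up, the reference distribution is a normal fitted to the sample, and the plotting positions (i + 0.5) / n are one common convention among several.

```python
from statistics import NormalDist, mean, stdev

def normal_qq_points(sample):
    """Pair each sorted sample value with the matching normal quantile."""
    ordered = sorted(sample)
    n = len(ordered)
    reference = NormalDist(mean(ordered), stdev(ordered))
    # Plotting positions (i + 0.5) / n are one common convention.
    theoretical = [reference.inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(theoretical, ordered))

sample = [4.1, 5.0, 5.2, 5.9, 6.3, 6.8, 7.4, 8.0, 9.5]
points = normal_qq_points(sample)
for theoretical_q, sample_q in points:
    print(round(theoretical_q, 2), sample_q)
```

If the sample is close to normal, the pairs fall near the straight line where the two quantiles are equal. Systematic curvature suggests skewness or heavy tails.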

20. CHEBYSHEV’S INEQUALITY :

Chebyshev’s inequality guarantees that, for a wide class of probability distributions, no more than a certain fraction of values can be more than a certain distance from the mean.

Specifically, no more than 1/k² of the distribution’s values can be more than k standard deviations away from the mean (or equivalently, at least 1 − 1/k² of the distribution’s values are within k standard deviations of the mean).
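The bound can be checked empirically on data that is clearly not normal. The sketch below uses simulated exponential data (an arbitrary choice) and compares the observed fraction of values beyond k standard deviations with the 1/k² bound for a few values of k.

```python
import random
import statistics

random.seed(1)
data = [random.expovariate(1.0) for _ in range(10_000)]   # skewed data

mu = statistics.mean(data)
sigma = statistics.pstdev(data)

def fraction_beyond(k):
    """Share of values more than k standard deviations from the mean."""
    return sum(abs(x - mu) > k * sigma for x in data) / len(data)

# Observed fraction next to the Chebyshev bound 1 / k**2 for each k.
checks = {k: (fraction_beyond(k), 1 / k**2) for k in (1.5, 2, 3)}
for k, (observed, bound) in checks.items():
    print(k, observed, bound)
```

The observed fractions are usually well below the bound. Chebyshev's inequality is a worst-case guarantee, not an estimate.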

CHEBYSHEV’S INEQUALITY FORMULA

21. BINOMIAL DISTRIBUTION :

A binomial distribution gives the probability of each possible number of SUCCESS outcomes in an experiment or survey that is repeated multiple times. Each repetition has two possible outcomes (the prefix “bi” means two, or twice). For example, a coin toss has only two possible outcomes: heads or tails, and taking a test could have two possible outcomes: pass or fail.

Binomial distributions must also meet the following three criteria:

A. The number of observations or trials is fixed.

B. Each observation or trial is independent.

C. The probability of success is exactly the same from one trial to another.

Real Life Examples :

If a new drug is introduced to cure a disease, it either cures the disease (it’s successful) or it doesn’t cure the disease (it’s a failure). If you purchase a lottery ticket, you’re either going to win money, or you aren’t. Basically, anything you can think of that can only be a success or a failure can be represented by a binomial distribution, provided the three criteria above are met.

BINOMIAL DISTRIBUTION FORMULA
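P(X = k) = C(n, k) · p^k · (1 − p)^(n − k), where n stands for the number of times the experiment runs, k is the number of successes, p represents the probability of success on one trial, and C(n, k) is the number of ways to choose k trials out of n.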
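The standard binomial probability formula translates directly into code. The sketch below uses `math.comb` for the number of ways to choose k successes out of n trials. The coin-toss numbers are just an example.

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(exactly k successes in n independent trials with success probability p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

# Probability of exactly 5 heads in 10 tosses of a fair coin.
p_five_heads = binomial_pmf(5, n=10, p=0.5)

# The probabilities of all possible outcomes (0 to 10 heads) add up to 1.
total = sum(binomial_pmf(k, 10, 0.5) for k in range(11))
print(p_five_heads, total)
```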

22. BERNOULLI DISTRIBUTION :

A Bernoulli distribution is a discrete probability distribution for a Bernoulli trial, a random experiment that has only two outcomes (usually called a “Success” or a “Failure”). For example, the probability of getting heads (a “success”) while flipping a coin is 0.5. The probability of “failure” is 1 − p (1 minus the probability of success, which also equals 0.5 for a coin toss). It is a special case of the binomial distribution for n = 1. In other words, it is a binomial distribution with a single trial (e.g. a single coin toss).
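A single Bernoulli trial is easy to simulate. In the sketch below, `bernoulli_trial` is an illustrative helper, and the seed and number of flips are arbitrary. It also shows how summing several trials gives one draw from a binomial distribution.

```python
import random

random.seed(7)

def bernoulli_trial(p):
    """Return 1 (success) with probability p, otherwise 0 (failure)."""
    return 1 if random.random() < p else 0

p = 0.5   # fair coin
flips = [bernoulli_trial(p) for _ in range(10_000)]
share_of_heads = sum(flips) / len(flips)

# Adding up n Bernoulli trials gives one draw from a binomial(n, p).
heads_in_ten = sum(bernoulli_trial(p) for _ in range(10))
print(share_of_heads, heads_in_ten)
```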

23. LOG-NORMAL DISTRIBUTION :

A log-normal distribution is a continuous probability distribution of a random variable whose logarithm is normally distributed. Thus, if the random variable X is log-normally distributed, then Y = ln(X) has a normal distribution.

LOG-NORMAL DISTRIBUTION

The log-normal distribution is extremely useful when analyzing stock prices, as long as the growth factor used is assumed to be normally distributed.

The log-normal distribution curve can therefore be used to help better identify the compound return that the stock can expect to achieve over a period of time. Note that log-normal distributions are positively skewed with long right tails due to low mean values and high variances in the random variables.
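The defining property, that the logarithm of a log-normal variable is normal, can be checked by simulation. The sketch below uses `random.lognormvariate` from the standard library. The parameters mu = 0 and sigma = 0.5 and the seed are assumed values for illustration.

```python
import math
import random
import statistics

random.seed(3)
mu, sigma = 0.0, 0.5   # parameters of the underlying normal (assumed values)

x = [random.lognormvariate(mu, sigma) for _ in range(10_000)]
y = [math.log(v) for v in x]   # Y = ln(X)

# The logs should look normal with mean mu and standard deviation sigma.
print(statistics.mean(y), statistics.stdev(y))

# Positive skew: the long right tail pulls the mean above the median.
print(statistics.mean(x), statistics.median(x))
```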

24. POWER LAW :

The power law (also called the scaling law) states that a relative change in one quantity results in a proportional relative change in another: one quantity varies as a power of the other (y = a·x^k). The simplest example of the law in action is a square; if you double the length of a side (say, from 2 to 4 inches) then the area will quadruple (from 4 to 16 square inches).
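The square example can be written as a power law y = a·x^k with a = 1 and k = 2. The short sketch below shows that doubling x multiplies y by 2^k, and that the exponent can be recovered from the ratio of logarithms.

```python
import math

def area(side):
    """Area of a square: a power law y = a * x**k with a = 1 and k = 2."""
    return side ** 2

ratio = area(4) / area(2)   # effect of doubling the side length

# The exponent is the slope on a log-log scale.
exponent = math.log(area(4) / area(2)) / math.log(4 / 2)
print(ratio, exponent)
```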

POWER LAW

25. BOX-COX TRANSFORM :

A Box-Cox transformation is a transformation of a non-normal dependent variable into an approximately normal shape. It applies only to strictly positive data. Normality is an important assumption for many statistical techniques; if your data isn’t normal, applying a Box-Cox transformation means that you are able to run a broader number of tests.
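The one-parameter Box-Cox transform is (x^λ − 1)/λ for λ ≠ 0 and ln(x) for λ = 0. The sketch below applies it with λ = 0 to a made-up right-skewed data set. In practice λ is usually estimated from the data (for example by maximum likelihood) rather than fixed by hand as it is here.

```python
import math

def box_cox(x, lam):
    """One-parameter Box-Cox transform of a strictly positive value x."""
    if x <= 0:
        raise ValueError("Box-Cox requires strictly positive data")
    if lam == 0:
        return math.log(x)
    return (x ** lam - 1) / lam

def skewness(data):
    """Moment coefficient of skewness, used here to compare shapes."""
    n = len(data)
    mean = sum(data) / n
    m2 = sum((v - mean) ** 2 for v in data) / n
    m3 = sum((v - mean) ** 3 for v in data) / n
    return m3 / m2 ** 1.5

data = [1, 2, 4, 8, 16, 32, 64]                 # strongly right-skewed
transformed = [box_cox(x, 0) for x in data]     # lam = 0 is the log transform

print(skewness(data), skewness(transformed))
```

For this data set the log transform removes the skew, because the values double at each step and their logarithms are evenly spaced.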

26. POISSON DISTRIBUTION :

The Poisson distribution is the discrete probability distribution of the number of events occurring in a given time period, given the average number of times the event occurs over that time period.

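EXAMPLE : A certain fast-food restaurant gets an average of 3 visitors to the drive-through per minute. This is just an average, however; the actual number in any given minute can vary, and the Poisson distribution gives the probability of each possible count (0, 1, 2, … visitors).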
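For the drive-through example, the Poisson probability of seeing exactly k visitors in a minute is e^(−λ)·λ^k / k!, with λ = 3. The sketch below evaluates it for a few cases. The specific questions asked (no visitors, exactly three, more than five) are illustrative choices.

```python
import math

def poisson_pmf(k, lam):
    """P(exactly k events) when events occur at an average rate lam per interval."""
    return math.exp(-lam) * lam**k / math.factorial(k)

rate = 3   # average visitors per minute, from the drive-through example

p_none = poisson_pmf(0, rate)    # a minute with no visitors
p_three = poisson_pmf(3, rate)   # a minute with exactly 3 visitors
p_more_than_five = 1 - sum(poisson_pmf(k, rate) for k in range(6))
print(p_none, p_three, p_more_than_five)
```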

POISSON DISTRIBUTION

27. NON-GAUSSIAN DISTRIBUTION :

Although the normal distribution takes center stage in statistics, many processes follow a non normal distribution. This can be due to the data naturally following a specific type of non normal distribution (for example, bacteria growth naturally follows an exponential distribution). In other cases, your data collection methods or other methodologies may be at fault.

Types of Non Normal Distribution

  1. Beta Distribution.
  2. Exponential Distribution.
  3. Gamma Distribution.
  4. Inverse Gamma Distribution.
  5. Log Normal Distribution.
  6. Logistic Distribution.
  7. Maxwell-Boltzmann Distribution.
  8. Poisson Distribution.
  9. Skewed Distribution.
  10. Symmetric Distribution.
  11. Uniform Distribution.
  12. Unimodal Distribution.
  13. Weibull Distribution.

Reasons for the Non Normal Distribution :

  1. Outliers
  2. Multiple distributions may be combined in your data.
  3. Insufficient Data.
  4. Data may be inappropriately graphed.

Dealing with Non Normal Distributions

You have several options for handling your non normal data. Many tests, including the one sample Z test, T test and ANOVA assume normality. You may still be able to run these tests if your sample size is large enough (as a rough rule of thumb, usually over 20 items). You can also choose to transform the data with a function, forcing it to fit a normal model. However, if you have a very small sample, a sample that is skewed or one that naturally fits another distribution type, you may want to run a non parametric test. A non parametric test is one that doesn’t assume the data fits a specific distribution type. Non parametric tests include the Wilcoxon signed rank test, the Mann-Whitney U Test and the Kruskal-Wallis test.
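To give a feel for how a rank-based test avoids distributional assumptions, the sketch below computes only the U statistic of the Mann-Whitney U test by counting, for every pair of values from the two groups, which one is larger. The two groups are invented, and turning U into a p-value (via tables or a statistics library) is left out.

```python
def mann_whitney_u(a, b):
    """U statistic for group a: pairs where a beats b, with ties counting one half."""
    u = 0.0
    for x in a:
        for y in b:
            if x > y:
                u += 1
            elif x == y:
                u += 0.5
    return u

group_a = [12, 15, 11, 19, 14]
group_b = [22, 25, 17, 30, 28]

u_a = mann_whitney_u(group_a, group_b)
u_b = mann_whitney_u(group_b, group_a)
print(u_a, u_b)
```

The two statistics always add up to the number of pairs. A value of U near 0 (or near the number of pairs) means one group is almost always larger than the other.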
