Introduction to Statistics. Day 6–7 (100daysofdscode)

Source: Deep Learning on Medium


Kindly see the curriculum and topic for each week: https://gist.github.com/Emekaborisama/c172599855a757b624c783c9a64ebd0c

Hi, welcome to week 2. Our topic for this week is Statistics and Probability (Maths for Data Science).

In our previous class, we talked about 9 Python libraries for Data Science and did an exercise course at Datacamp.

Today we will learn Statistics.

At the end of today’s session, we will understand:

  1. The basics of Statistics
  2. The roles of Statistics
  3. The basic concepts of Statistics
  4. Sources and methods of data collection
  5. Summarizing and normalizing data

Introduction to Statistics

As time goes on, the activities of people and of the various organizations often referred to as firms increase exponentially.
This creates a need for both people and firms to make quality decisions about these activities. The quality and quantity of information required to make those decisions also increase.
The management of any firm requires a scientific method to extract meaningful value from the data collected. Statistics plays an important role as a management tool for making decisions.
Statistics has many definitions from different perspectives, but I prefer this one: Statistics is the scientific method used for collecting, analyzing, classifying, and presenting information in such a way that we gain a comprehensible understanding of the reality the information represents.

Roles of Statistics

1. Household: Statistics is useful in everyday life. Going to the market, deciding what kind of food to buy, and choosing the quality and quantity of each item to maximize satisfaction all rely on statistics.
2. Government: Governments use statistics as a tool for collecting data on economic aggregates such as gross national product, savings, consumption, and national income.
Governments also use statistics for the census. The various forms sent to individuals and firms for tax returns, costs, wage rates, and annual income generate a lot of statistical data used by the government.
3. Business: Businesses use statistics for decision making on production, marketing, administration, and the labor force.
Management also uses statistics to establish a relationship between two or more variables in order to predict one variable in terms of the others. Other sectors that use statistics include:
a. Healthcare
b. Information technology
c. Agriculture
d. Education, etc.

Basic concepts of stats

1. Entity: This may be a person, place, or thing on which we make observations. For instance, in studying the nutritional well-being of pupils in a secondary school, the entity is a pupil in the school.
2. Variable: In programming, a variable is used to store a value, but in statistics, variables are characteristics of an entity. For instance, the weight of pupils in the secondary school is a variable.
3. Quantitative variable: A variable whose values are numerical, for instance, the hourly patronage of a restaurant.
4. Qualitative variable: A variable whose values describe a quality or category and cannot be measured or counted as a number, for example the taste of a biscuit. (The number of grains of rice in a bag, by contrast, can be counted, so it is quantitative.)
5. Discrete variable: A variable whose values are whole numbers. Examples include the number of female students at Harvard University and the number of female staff at IBM. A discrete variable cannot take the decimal values in between (11.1, 12.1, 13.1, etc.).
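
The distinctions above can be sketched in Python. This is only a rough heuristic under stated assumptions: the function name `kind_of_variable` and the sample values are hypothetical, and "continuous" is used here as the usual name for a quantitative variable that is not discrete, a term the lesson itself does not introduce.

```python
from numbers import Number

def kind_of_variable(values):
    """Rough heuristic: classify a list of observations by variable type."""
    # Any non-numeric observation (e.g. a taste label) makes it qualitative.
    if not all(isinstance(v, Number) and not isinstance(v, bool) for v in values):
        return "qualitative"
    # Numeric observations that are all whole numbers are treated as discrete.
    if all(float(v).is_integer() for v in values):
        return "discrete"
    # Numeric observations with decimal parts (assumed name: continuous).
    return "continuous"

weights_kg = [31.5, 28.2, 35.0]     # hypothetical pupil weights
female_staff = [11, 12, 13]         # hypothetical head counts
biscuit_taste = ["sweet", "salty"]  # hypothetical taste labels

print(kind_of_variable(weights_kg))     # continuous
print(kind_of_variable(female_staff))   # discrete
print(kind_of_variable(biscuit_taste))  # qualitative
```

A real dataset needs more care than this (a column of whole-number weights would be misread as discrete), so treat the result as a first guess.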

Data nature, source and method

I am sure you know that statistics uses numerical data. This data is divided into two types:

1. Primary data

2. Secondary data

Primary data are data collected by the person who wants to make use of them. They are collected first-hand, for the specific purpose they are used for.

Examples of primary data:
1. The height and weight of students, collected to determine their nutritional well-being.
2. The measurements of a client's width and length, taken so that their clothes fit properly.

Various ways of collecting primary data include surveys, questionnaires, interviews, observation, and experiments.

Secondary data

These are data collected for some other purpose, frequently for administrative or talent management reasons, and then reused for a purpose they were not originally collected for.
Secondary data come from textbooks, journals, open dataset libraries, newspapers, APIs, and of course Google.
Examples of secondary data are:
1. The crime rate in a particular area over a period of time, collected from the police.
2. Tweets or replies collected from Twitter.
3. Your Kaggle dataset.

Methods of collecting data
1. Questionnaire
2. Survey
3. Interview
4. Observation
5. Experiment
6. Open dataset library

Summarization/normalization of Data

When data are collected, they are in a raw format that can be difficult to analyze.
The data should first be arranged in order of magnitude, either ascending or descending.
E.g. suppose you have 6 pupils and their ages are [2,8,3,4,0,8].
As given, the values have no arrangement. In ordered (ascending) form they become:
[0,2,3,4,8,8]
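
In Python, the same arrangement can be done with the built-in `sorted` function. This is a minimal sketch using the six ages above; the variable names are only illustrative.

```python
ages = [2, 8, 3, 4, 0, 8]

# sorted() returns a new list and leaves the raw data untouched.
ascending = sorted(ages)
descending = sorted(ages, reverse=True)

print(ascending)   # [0, 2, 3, 4, 8, 8]
print(descending)  # [8, 8, 4, 3, 2, 0]
```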

Why we need to arrange our data
1. We can see whether any value is duplicated
2. We can identify zero values
3. We can easily divide the data into sections
4. It speeds up the analysis process
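
The first three points can be checked in a few lines of Python. This is a sketch under assumptions: it reuses the ordered ages from the example, and splitting into two halves is just one possible way to "divide the data into sections".

```python
from collections import Counter

ordered_ages = [0, 2, 3, 4, 8, 8]

# 1. Values that occur more than once.
duplicates = [value for value, count in Counter(ordered_ages).items() if count > 1]

# 2. Whether a zero value is present.
has_zero = 0 in ordered_ages

# 3. Split the ordered data into two sections (lower and upper half).
middle = len(ordered_ages) // 2
lower_half, upper_half = ordered_ages[:middle], ordered_ages[middle:]

print(duplicates)              # [8]
print(has_zero)                # True
print(lower_half, upper_half)  # [0, 2, 3] [4, 8, 8]
```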

Frequency distribution
An ordered array doesn’t summarise data efficiently, but a frequency distribution table does: it lists each value alongside the number of times it occurs.
e.g. ages of pupils in Grade 2: [0,5,3,2,8,9,7,6]
In ordered form:
[0,2,3,5,6,7,8,9]

┌─────────┬───────────────┐
│ Age (x) │ Frequency (f) │
├─────────┼───────────────┤
│ 0       │ 1             │
│ 2       │ 1             │
│ 3       │ 1             │
│ 5       │ 1             │
│ 6       │ 1             │
│ 7       │ 1             │
│ 8       │ 1             │
│ 9       │ 1             │
└─────────┴───────────────┘
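
One way to build a frequency distribution in Python is with `collections.Counter` from the standard library. The sketch below uses the Grade 2 ages from the example; the names `frequency` and `table` are only illustrative.

```python
from collections import Counter

ages = [0, 5, 3, 2, 8, 9, 7, 6]

# Counter maps each distinct value to the number of times it occurs.
frequency = Counter(ages)

# Order the table by value so it reads like the ordered array.
table = {age: frequency[age] for age in sorted(frequency)}

for age, count in table.items():
    print(age, count)
```

Every age in this small sample occurs exactly once, so each frequency is 1. With repeated values, such as the earlier ages [2,8,3,4,0,8], the count for 8 would be 2.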

Grouping Data
In the previous example we could build a frequency table without grouping the data because there were only a few values. When the values run into thousands or millions, we need to group the data into classes.
Grouping is easily done using class intervals.
A class interval groups closely related values into one class.
Class intervals for the ages of pupils in grade 10 could be:
10–13, 14–17, 18–21, 22–25
e.g.
students in a maths class in grade 1:
[10,23,13,12,11,19,17,15,14,23,20,21,22,24]
Arrange them in order:
[10,11,12,13,14,15,17,19,20,21,22,23,23,24]

┌────────┬─────────┬───────────┐
│ Class  │ Tallies │ Frequency │
├────────┼─────────┼───────────┤
│ 10–13  │ IIII    │ 4         │
│ 14–17  │ III     │ 3         │
│ 18–21  │ III     │ 3         │
│ 22–25  │ IIII    │ 4         │
└────────┴─────────┴───────────┘
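
The grouping can be reproduced in Python by counting how many values fall inside each class interval. This is a minimal sketch that assumes both ends of each interval are inclusive, as in the table; the variable names are illustrative.

```python
values = [10, 23, 13, 12, 11, 19, 17, 15, 14, 23, 20, 21, 22, 24]

# Class intervals as (lower limit, upper limit), both inclusive.
class_intervals = [(10, 13), (14, 17), (18, 21), (22, 25)]

grouped = {}
for low, high in class_intervals:
    label = f"{low}-{high}"
    # Count the values that fall inside this class.
    grouped[label] = sum(1 for v in values if low <= v <= high)

for label, count in grouped.items():
    # Print one tally mark per value, then the frequency.
    print(f"{label}: {'I' * count} ({count})")
```

The frequencies add up to 14, the number of values, which is a quick check that no value was missed or counted twice.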

Graphical representation of data

Graphical representation makes your data meaningful to stakeholders.
There are various ways of presenting your data, including:
1. PIE CHART
2. BAR CHART
3. HISTOGRAM
4. SCATTER CHART
5. LINE GRAPHS

In this session, we are only going to look at the pie chart.
PIE CHART: This represents each item in your data as a sector whose angle is proportional to its value.
Sectorial angle = (value of the item / total value of all the items) × 360°
e.g. suppose we have a budget allocation:

┌────────────┬────────────┐
│ Item       │ Amount (N) │
├────────────┼────────────┤
│ Feeding    │ N9625      │
│ Rent       │ N4125      │
│ Education  │ N5500      │
│ Savings    │ N6875      │
│ Other      │ N1375      │
└────────────┴────────────┘

You are required to represent the data on a pie chart. First we compute the sectorial angle of each item. The total is N27500, so for example Feeding = 9625 / 27500 × 360° = 126°.

┌────────────┬───────────┐
│ Item       │ Angle (°) │
├────────────┼───────────┤
│ Feeding    │ 126       │
│ Rent       │ 54        │
│ Education  │ 72        │
│ Savings    │ 90        │
│ Other      │ 18        │
│ Total      │ 360       │
└────────────┴───────────┘
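
The sectorial angle formula can be checked with a short Python sketch. The function name `sector_angles` is hypothetical, and the budget figures are taken with Education as N5500, the amount that matches the 72° angle in the table.

```python
# Budget amounts in naira (Education assumed to be 5500, see the note above).
budget = {
    "Feeding": 9625,
    "Rent": 4125,
    "Education": 5500,
    "Savings": 6875,
    "Other": 1375,
}

def sector_angles(amounts):
    """Return each item's sectorial angle: value / total * 360."""
    total = sum(amounts.values())
    return {item: amount / total * 360 for item, amount in amounts.items()}

angles = sector_angles(budget)
for item, angle in angles.items():
    print(f"{item}: {angle:.0f} degrees")
```

The angles always sum to 360, which is a useful check on the arithmetic. If matplotlib is available, `matplotlib.pyplot.pie` accepts the raw amounts directly and works out the sector sizes itself.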

So far we have seen how much each item contributes to the overall budget.
In conclusion, we have been able to understand:
1. Statistics
2. Roles of Stats
3. The basic concept of stats
4. Source and method of Data Collection
5. Summarizing and normalizing Data

When you are done with this course, kindly share your progress on your social media accounts.

Hints

Day 6–7: Introduction to Statistics

Day 6–7 Lesson: At the end of today’s session I have a comprehensive understanding of Statistics. Best Data Science journey ever.

#100daysofcode #100daysofDscode #100days #Day6–7 #DataScience #MachineLearning #Ai

THANKS, AND SEE YOU ON DAY 8.

Kindly anticipate a Datacamp course on this topic.