Data by Tarun
In statistics, data are the individual pieces of factual information recorded for use in analysis. The two main stages of data analysis are interpretation and presentation; statistics are the result of data analysis. Data classification and data handling are important processes, since they involve a multitude of tags and labels that define the data, its integrity, and its confidentiality. In this article, we discuss data sampling and hypothesis testing in detail.
Data sampling is a statistical analysis technique used to select, manipulate, and analyse a representative subset of data points in order to identify patterns and trends in the larger data set being examined. It is the practice of selecting a group of individuals from a population in order to study the whole population.
Let’s say we want to know the percentage of people who use iPhones in a city. One way to do this is to call everyone in the city and ask what type of phone they use. The other way is to select a smaller subgroup of individuals, ask them the same question, and then use this information as an approximation of the total population.
However, this process is not as simple as it sounds. Whenever you follow this method, your sample size has to be ideal - neither too large nor too small. Once you have decided on the size of your sample, you must use the right sampling technique to collect the sample from the population. Ultimately, every sampling technique falls under two broad categories:
Probability sampling - Random selection techniques are used to select the sample.
Non-probability sampling - Non-random selection techniques based on certain criteria are used to
select the sample.
Probability Sampling Techniques
Probability sampling is one of the two broad categories of sampling techniques. It gives every member of the population a chance of being selected. It is mainly used in quantitative research when you want to produce results that are representative of the whole population.
1. Simple Random Sampling
In simple random sampling, the researcher selects the participants entirely at random. Tools such as random number generators and random number tables are used, so the selection is based entirely on chance.
Example: The researcher assigns every member in a company database a number from 1 to 1000
(depending on the size of the company) and then uses a random number generator to select 100
members.
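As a sketch, the selection above can be done with Python's standard random module; the database is represented here simply as the member numbers 1 to 1000, and the seed is fixed only so the sketch is reproducible:

```python
import random

# Hypothetical company database: members numbered 1 to 1000
population = list(range(1, 1001))

random.seed(42)  # fixed seed so the sketch is reproducible
sample = random.sample(population, 100)  # 100 members, chosen without replacement

print(len(sample))       # 100
print(len(set(sample)))  # 100 -> no member is picked twice
```

random.sample draws without replacement, which matches the idea that each of the 100 selected members is a distinct person.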
2. Systematic Sampling
In systematic sampling, every member of the population is assigned a number, just as in simple random sampling. However, instead of randomly generating numbers, the samples are chosen at regular intervals.
Example: The researcher assigns every member in the company database a number. Instead of
randomly generating numbers, a random starting point (say 5) is selected. From that number
onwards, the researcher selects every, say, 10th person on the list (5, 15, 25, and so on) until the
sample is obtained.
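The interval-based selection described above can be sketched in a few lines of Python; the population and interval sizes are the hypothetical ones from the example:

```python
import random

population = list(range(1, 1001))  # members numbered 1..1000
interval = 10                      # select every 10th member

random.seed(0)
start = random.randint(1, interval)       # random starting point between 1 and 10
sample = population[start - 1::interval]  # start, start+10, start+20, ...

print(len(sample))  # 100 -> 1000 members / interval of 10
print(sample[:3])   # the first three selected members
```

Note that only the starting point is random; once it is chosen, every subsequent selection is determined by the interval.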
3. Stratified Sampling
In stratified sampling, the population is subdivided into subgroups, called strata, based on some characteristic (age, gender, income, etc.). After forming the subgroups, you can then use random or systematic sampling to select a sample from each subgroup. This method allows you to draw more precise conclusions because it ensures that every subgroup is properly represented.
Example: If a company has 500 male employees and 100 female employees, the researcher wants to ensure that the sample reflects this gender ratio. So, the population is divided into two subgroups based on gender, and a sample is drawn from each.
4. Cluster Sampling
In cluster sampling, the population is divided into subgroups (clusters), where each subgroup has characteristics similar to those of the whole population. Instead of sampling from every subgroup, you randomly select entire subgroups. This method is helpful when dealing with large and dispersed populations.
Example: A company has over a hundred offices in ten cities across the world, each with roughly the same number of employees in similar job roles. The researcher randomly selects two or three offices and uses everyone in them as the sample.
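The key difference from the other techniques is that whole clusters are selected, not individuals. A sketch under assumed numbers (ten offices of 50 employees each, three offices chosen):

```python
import random

random.seed(2)

# Hypothetical offices: each office is a cluster of 50 employees
offices = {f"office_{i}": [f"emp_{i}_{j}" for j in range(50)] for i in range(10)}

# Randomly select 3 entire offices and survey everyone in them
chosen = random.sample(list(offices), 3)
sample = [emp for office in chosen for emp in offices[office]]

print(chosen)       # the three selected office names
print(len(sample))  # 150 -> every employee in the selected clusters
```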
Hypothesis Testing
A hypothesis is a statement, assumption, or claim about the value of a population parameter (mean, variance, median, etc.). It is an educated guess about something in the world around you, and it should be testable, either by experiment or observation.
For example, if we make the statement "Dhoni is the best Indian captain ever," this is an assumption we are making based on the team's average wins and losses under his captaincy. We can test this statement against all the match data.
The null hypothesis is the hypothesis to be tested for possible rejection under the assumption that it is true. The concept is similar to "innocent until proven guilty": we assume innocence until we have enough evidence to prove that a suspect is guilty. In simple language, the null hypothesis is the already accepted statement; for example, "the sky is blue." We already accept this statement.
It is denoted by H0.
The alternative hypothesis complements the null hypothesis. It is the opposite of the null hypothesis, such that the alternative and null hypotheses together cover all possible values of the population parameter.
It is denoted by H1.
Let’s understand this with an example:
A soap company claims that its product kills, on average, 99% of germs. To test this claim, we formulate the null and alternative hypotheses.
Null hypothesis (H0): average = 99%
Alternative hypothesis (H1): average is not equal to 99%
Note: When we test a hypothesis, we assume the null hypothesis to be true until there is sufficient evidence in the sample to prove it false. In that case, we reject the null hypothesis and support the alternative hypothesis. If the sample fails to provide sufficient evidence for us to reject the null hypothesis, we still cannot say that the null hypothesis is true, because the conclusion is based on just the sample data. To say that the null hypothesis is true, we would have to study the whole population.
Simple and Composite Hypothesis Testing
When a hypothesis specifies an exact value of the parameter, it is a simple hypothesis; if it specifies a range of values, it is called a composite hypothesis.
E.g., a motorcycle company claiming that a certain model gives an average mileage of 100 km per litre is a case of a simple hypothesis.
"The average age of students in a class is greater than 20" is a composite hypothesis.
Type I and Type II errors are among the most important topics in hypothesis testing.
A false positive (Type I error) occurs when you reject a true null hypothesis.
A false negative (Type II error) occurs when you fail to reject a false null hypothesis.
The probability of committing a Type I error (false positive) is equal to the significance level, or the size of the critical region, α.
α = P[rejecting H0 when H0 is true]
The probability of committing a Type II error (false negative) is denoted by β. The quantity 1 − β is called the "power of the test".
β = P[not rejecting H0 when H1 is true]
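Both error rates can be estimated by simulation. The sketch below repeatedly runs a two-tailed z-test (a standard test assumed here for illustration; the sample size, true means, and trial count are all hypothetical) and counts how often each kind of error occurs:

```python
import math
import random

random.seed(3)

def z_test_rejects(true_mean, mu0=0.0, sigma=1.0, n=30, alpha_z=1.96):
    """Draw one sample of size n and report whether a two-tailed z-test
    rejects H0: mean = mu0 (1.96 is the critical value for alpha = 0.05)."""
    data = [random.gauss(true_mean, sigma) for _ in range(n)]
    z = (sum(data) / n - mu0) / (sigma / math.sqrt(n))
    return abs(z) > alpha_z

trials = 5000

# H0 actually true: the rejection rate estimates alpha (Type I error rate)
type1 = sum(z_test_rejects(true_mean=0.0) for _ in range(trials)) / trials

# H0 actually false (true mean = 0.3): the non-rejection rate estimates beta
type2 = sum(not z_test_rejects(true_mean=0.3) for _ in range(trials)) / trials

print(f"estimated alpha ~ {type1:.3f}")  # should be close to 0.05
print(f"estimated beta  ~ {type2:.3f}")  # power of the test is 1 - beta
```

The simulation makes the trade-off concrete: α is fixed by the chosen critical value, while β depends on how far the true mean lies from the hypothesized one.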
Example:
A person is arrested on the charge of burglary. A jury has to decide guilty or not guilty.
H0: Person is innocent
H1: Person is guilty
A Type I error occurs if the jury convicts the person [rejects H0] although the person is innocent [H0 is true].
A Type II error occurs if the jury releases the person [does not reject H0] although the person is guilty [H1 is true].
Level of confidence
As the name suggests, the level of confidence expresses how confident we are in our decision. It is the complement of the significance level (1 − α); a 95% level of confidence is the conventional choice, and lower levels are rarely accepted in practice.
Level of significance(α)
The significance level, in the simplest of terms, is the threshold probability of incorrectly rejecting
the null hypothesis when it is in fact true. This is also known as the type I error rate.
It is the probability of a type 1 error. It is also the size of the critical region.
Generally, strong control of α is desired, and in tests it is fixed in advance at very low levels such as 0.05 (5%) or 0.01 (1%).
If H0 is not rejected at a significance level of 5%, this does not prove H0 true; it only means the sample did not provide sufficient evidence against it at that level.
One-tailed and two-tailed Hypothesis Testing
If the alternative hypothesis allows deviations in both directions (less than and greater than) from the value of the parameter specified in the null hypothesis, it is called a two-tailed test.
If the alternative hypothesis allows deviations in only one direction (either less than or greater than) from the value of the parameter specified in the null hypothesis, it is called a one-tailed test.
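The difference shows up directly in how the p-value is computed. Returning to the soap example, a sketch of a one-sample z-test using only the standard library (the sample mean 98.2%, standard deviation 2%, and sample size 50 are hypothetical numbers chosen for illustration):

```python
import math

def normal_cdf(x):
    # Standard normal CDF via the error function (no external libraries)
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

# Hypothetical data: observed mean germ-kill rate 98.2% over n = 50 trials,
# assumed known standard deviation 2%; H0: mean = 99%
mu0, xbar, sigma, n = 99.0, 98.2, 2.0, 50
z = (xbar - mu0) / (sigma / math.sqrt(n))

p_two_tailed = 2 * (1 - normal_cdf(abs(z)))  # H1: mean != 99 (both directions)
p_one_tailed = normal_cdf(z)                 # H1: mean < 99 (one direction)

print(f"z = {z:.3f}")
print(f"two-tailed p = {p_two_tailed:.4f}")
print(f"one-tailed p = {p_one_tailed:.4f}")
```

For the same data, the one-tailed p-value is half the two-tailed one, which is why the choice of alternative hypothesis must be made before looking at the data.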