
Q- What is Data? Define Data?

In statistics, data are the individual pieces of factual information recorded and used
for the purpose of analysis. The two processes of data analysis are interpretation and
presentation. Statistics are the result of data analysis. Data classification and data
handling are important processes, as they involve a multitude of tags and labels that
define the data, its integrity and its confidentiality. In this article, we are going to
discuss the different types of data in statistics in detail.

Types of Classification of Data in Statistics

Qualitative or Categorical Data


Qualitative data, also known as categorical data, describes data that fits into
categories. Qualitative data are not numerical. Categorical information involves
categorical variables that describe features such as a person’s gender, home town, etc.
Categorical measures are defined in terms of natural language specifications, not in
terms of numbers.
Sometimes categorical data can hold numerical values (quantitative values), but those values
do not have mathematical sense. Examples of categorical data are birthdate, favourite
sport and school postcode. Here, the birthdate and school postcode hold quantitative values,
but they do not carry numerical meaning.
Nominal Data
Nominal data is a type of qualitative information that labels variables without providing
any numerical value. Nominal data is also called the nominal scale. It cannot be ordered
or measured. Sometimes, though, nominal data can be both qualitative and quantitative (for
example, numeric labels such as postcodes). Examples of nominal data are letters, symbols,
words, gender, etc.
Nominal data are examined using the grouping method. In this method, the data are
grouped into categories, and then the frequency or the percentage of each category can be
calculated. These data are visually represented using pie charts.
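The grouping method described above can be sketched in a few lines of Python. The survey responses below are made-up illustrative data:

```python
from collections import Counter

# Hypothetical nominal data: favourite sports reported by 10 respondents
responses = ["cricket", "football", "cricket", "tennis",
             "cricket", "football", "tennis", "cricket",
             "football", "cricket"]

# Group the data into categories and count the frequency of each
counts = Counter(responses)

# Convert frequencies to percentages (the usual summary for nominal data)
total = len(responses)
percentages = {sport: 100 * n / total for sport, n in counts.items()}

print(counts)        # frequency of each category
print(percentages)   # share of each category
```

The resulting frequencies or percentages are exactly the quantities one would plot as slices of a pie chart.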
Ordinal Data
Ordinal data is a type of data that follows a natural order. The significant feature
of ordinal data is that the differences between the data values are not determined. This
variable is mostly found in surveys, finance, economics, questionnaires, and so on.
Ordinal data is commonly represented using a bar chart. These data are investigated
and interpreted through many visualisation tools. The information may be expressed using
tables in which each row shows a distinct category.
Quantitative or Numerical Data
Quantitative data, also known as numerical data, represents numerical values
(i.e., how much, how often, how many). Numerical data gives information about the
quantities of a specific thing. Some examples of numerical data are height, length, size,
weight, and so on. The quantitative data can be classified into two different types based on
the data sets. The two different classifications of numerical data are discrete data and
continuous data.
Discrete Data
Discrete data can take only discrete values. Discrete information contains only a finite
number of possible values. Those values cannot be subdivided meaningfully. Here, things
can be counted in the whole numbers.
Example: Number of students in the class
Continuous Data
Continuous data is data that can be measured on a scale. It has an infinite number of
possible values that can be selected within a given range.
Example: Temperature range

Q- What do you mean by Data Sampling?

Data sampling is a statistical analysis technique used to select, manipulate and analyse a
representative subset of data points to identify patterns and trends in the larger data set being
examined.
It is the practice of selecting a subset of individuals from a population in order to study the
whole population.
Let’s say we want to know the percentage of people who use iPhones in a city, for example. One way
to do this is to call up everyone in the city and ask them what type of phone they use. The other way
would be to get a smaller subgroup of individuals and ask them the same question, and then use this
information as an approximation of the total population.
However, this process is not as simple as it sounds. Whenever you follow this method, your sample
size has to be ideal - it should not be too large or too small. Then once you have decided on the size
of your sample, you must use the right type of sampling techniques to collect a sample from the
population. Ultimately, every sampling type comes under two broad categories:
Probability sampling - Random selection techniques are used to select the sample.
Non-probability sampling - Non-random selection techniques based on certain criteria are used to
select the sample.
Probability Sampling Techniques

Probability sampling is one of the two broad categories of sampling techniques. It gives
every member of the population a chance of being selected. It is mainly used in
quantitative research when you want to produce results representative of the whole population.
1. Simple Random Sampling
In simple random sampling, the researcher selects the participants randomly. A number of
tools, such as random number generators and random number tables, are used; selection is
based entirely on chance.
Example: The researcher assigns every member in a company database a number from 1 to 1000
(depending on the size of the company) and then uses a random number generator to select 100
members.
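The example above can be sketched with Python's standard library; the population of 1000 numbered members is hypothetical:

```python
import random

# Hypothetical company database: members numbered 1 to 1000
population = list(range(1, 1001))

rng = random.Random(42)               # seeded only so the sketch is reproducible
sample = rng.sample(population, 100)  # 100 members, chosen purely by chance

print(len(sample))       # 100
print(len(set(sample)))  # 100 -- sampling is without replacement
```

`random.sample` draws without replacement, so no member can appear in the sample twice, which matches how a researcher would select distinct people.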
2. Systematic Sampling
In systematic sampling, every member of the population is given a number, as in simple
random sampling. However, instead of randomly generating numbers, the samples are chosen at
regular intervals.
Example: The researcher assigns every member in the company database a number. Instead of
randomly generating numbers, a random starting point (say 5) is selected. From that number
onwards, the researcher selects every, say, 10th person on the list (5, 15, 25, and so on) until the
sample is obtained.
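A minimal sketch of this interval-based selection, again using a hypothetical database of 1000 numbered members and an interval of 10:

```python
import random

population = list(range(1, 1001))  # hypothetical member IDs 1..1000
k = 10                             # sampling interval (every 10th member)

rng = random.Random(0)
start = rng.randrange(k)           # random starting point within the first interval
sample = population[start::k]      # every k-th member from that starting point

print(len(sample))                 # 1000 / 10 = 100
print(sample[:3])                  # first few selected IDs, spaced 10 apart
```

Only the starting point is random; every later selection is determined by the fixed interval, which is what distinguishes systematic from simple random sampling.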
3. Stratified Sampling
In stratified sampling, the population is subdivided into subgroups, called strata, based on some
characteristics (age, gender, income, etc.). After forming a subgroup, you can then use random or
systematic sampling to select a sample for each subgroup. This method allows you to draw more
precise conclusions because it ensures that every subgroup is properly represented.
Example: If a company has 500 male employees and 100 female employees, the researcher wants to
ensure that the sample reflects the gender split as well. So, the population is divided into
two subgroups based on gender, and a random sample is then drawn from each subgroup.
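The steps above (form strata, then sample from each) can be sketched as follows; the 500/100 gender split and the 10% sampling fraction are taken from the example, and the helper function is a made-up name for illustration:

```python
import random

# Hypothetical workforce: 500 male and 100 female employees
population = [("M", i) for i in range(500)] + [("F", i) for i in range(100)]

def stratified_sample(pop, key, fraction, rng):
    """Sample the same fraction from each stratum, so every
    subgroup is represented in proportion to its size."""
    strata = {}
    for item in pop:
        strata.setdefault(key(item), []).append(item)
    sample = []
    for members in strata.values():
        n = round(len(members) * fraction)
        sample.extend(rng.sample(members, n))
    return sample

rng = random.Random(1)
sample = stratified_sample(population, key=lambda x: x[0], fraction=0.1, rng=rng)

print(len(sample))  # 50 male + 10 female = 60
```

Because the same fraction is drawn from each stratum, the 5:1 gender ratio of the population is preserved exactly in the sample.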
4. Cluster Sampling
In cluster sampling, the population is divided into subgroups, but each subgroup has similar
characteristics to the whole sample. Instead of selecting a sample from each subgroup, you
randomly select an entire subgroup. This method is helpful when dealing with large and diverse
populations.
Example: A company has over a hundred offices in ten cities across the world, each with
roughly the same number of employees in similar job roles. The researcher randomly selects
two or three offices and uses their employees as the sample.
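The contrast with stratified sampling can be sketched as below: whole clusters are picked at random, and every member of a chosen cluster enters the sample. The ten offices of twenty employees each are hypothetical:

```python
import random

# Hypothetical offices (clusters), each with a similar mix of employees
offices = {f"office_{i}": [f"emp_{i}_{j}" for j in range(20)] for i in range(10)}

rng = random.Random(7)
chosen = rng.sample(list(offices), 3)  # randomly pick 3 whole clusters

# Every employee in the chosen offices becomes part of the sample
sample = [emp for office in chosen for emp in offices[office]]

print(chosen)
print(len(sample))  # 3 offices x 20 employees = 60
```

Note the difference from stratified sampling: there we sampled *within* every subgroup, whereas here we randomly select *entire* subgroups.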

Non-Probability Sampling Techniques


Non-probability sampling is the other broad category of sampling techniques. In non-
probability sampling, not every individual has a chance of being included in the sample. This
sampling method is easier and cheaper but carries a high risk of sampling bias. It is often
used in exploratory and qualitative research with the aim of developing an initial
understanding of the population.
1. Convenience Sampling
In this sampling method, the researcher simply selects the individuals who are most easily
accessible to them. This is an easy way to gather data, but there is no way to tell whether
the sample is representative of the entire population. The only criterion is that people are
available and willing to participate.
Example: The researcher stands outside a company and asks the employees coming in to answer
questions or complete a survey.
2. Voluntary Response Sampling
Voluntary response sampling is similar to convenience sampling in that the only criterion
is that people are willing to participate. However, instead of the researcher choosing the
participants, the participants volunteer themselves.
Example: The researcher sends out a survey to every employee in a company and gives them the
option to take part in it.
3. Purposive Sampling
In purposive sampling, the researcher uses their expertise and judgment to select a sample that they
think is the best fit. It is often used when the population is very small and the researcher only wants
to gain knowledge about a specific phenomenon rather than make statistical inferences.
Example: The researcher wants to know about the experiences of disabled employees at a company.
So, the sample is purposefully selected from this population.
4. Snowball Sampling
In snowball sampling, the research participants recruit other participants for the study. It is used
when participants required for the research are hard to find. It is called snowball sampling because
like a snowball, it picks up more participants along the way and gets larger and larger.
Example: The researcher wants to know about the experiences of homeless people in a city. Since
there is no detailed list of homeless people, a probability sample is not possible. The only way to get
the sample is to get in touch with one homeless person who will then put you in touch with other
homeless people in a particular area.

Q- What do you mean by Hypothesis Testing?

The hypothesis is a statement, assumption or claim about the value of the parameter (mean,
variance, median etc.).
A hypothesis is an educated guess about something in the world around you. It should be testable,
either by experiment or observation.
For example, if we make the statement “Dhoni is the best Indian captain ever”, this is an
assumption we are making based on the average wins and losses the team had under his
captaincy. We can test this statement using all the match data.

Null and Alternative Hypothesis Testing

The null hypothesis is the hypothesis to be tested for possible rejection under the assumption
that it is true. The concept is similar to “innocent until proven guilty”: we assume innocence
until we have enough evidence to prove that a suspect is guilty.
In simple language, we can understand the null hypothesis as already accepted statements, for
example, Sky is blue. We already accept this statement.
It is denoted by H0.
The alternative hypothesis complements the Null hypothesis. It is the opposite of the null
hypothesis such that both Alternate and null hypothesis together cover all the possible values of the
population parameter.
It is denoted by H1.
Let’s understand this with an example:
A soap company claims that its product kills, on average, 99% of germs. To test the company's
claim, we formulate the null and alternate hypotheses.
Null hypothesis (H0): average = 99%
Alternate hypothesis (H1): average ≠ 99%
Note: When we test a hypothesis, we assume the null hypothesis to be true until there is sufficient
evidence in the sample to prove it false. In that case, we reject the null hypothesis and support the
alternate hypothesis. If the sample fails to provide sufficient evidence for us to reject the null
hypothesis, we cannot say that the null hypothesis is true because it is based on just the sample
data. For saying the null hypothesis is true we will have to study the whole population data.
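The soap example can be sketched as a one-sample z-test for a proportion, which is one common way to test such a claim. The trial counts below are made up for illustration, and the normal approximation is an assumption:

```python
import math

# Hypothetical trial: in 1000 tests the soap killed the germs 983 times
n, successes = 1000, 983
p0 = 0.99                      # claimed proportion (H0: p = 0.99)
p_hat = successes / n          # observed proportion

# z-statistic: how many standard errors the observation is from the claim
se = math.sqrt(p0 * (1 - p0) / n)
z = (p_hat - p0) / se

# Two-sided p-value from the standard normal CDF (H1: p != 0.99)
p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

alpha = 0.05
print(f"z = {z:.3f}, p-value = {p_value:.4f}")
if p_value < alpha:
    print("Reject H0: the sample provides evidence against the 99% claim.")
else:
    print("Fail to reject H0: insufficient evidence against the claim.")
```

Note how this mirrors the paragraph above: a small p-value lets us reject H0, while a large one only means the sample failed to provide evidence against it, not that H0 is true.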
Simple and Composite Hypothesis Testing
When a hypothesis specifies an exact value of the parameter, it is a simple hypothesis and if it
specifies a range of values then it is called a composite hypothesis.
e.g., A motorcycle company claiming that a certain model gives an average mileage of 100 km
per litre is a case of a simple hypothesis.
“The average age of students in a class is greater than 20” is a composite hypothesis.

Type I and Type II Error

Type I and Type II errors are among the most important topics in hypothesis testing.

A false positive (Type I error): when you reject a true null hypothesis.
A false negative (Type II error): when you fail to reject a false null hypothesis.
The probability of committing a Type I error (false positive) is equal to the significance
level, or the size of the critical region, α.
α = P[rejecting H0 when H0 is true]
The probability of committing a Type II error (false negative) is denoted β. The quantity
1 − β is called the ‘power of the test’.
β = P[not rejecting H0 when H1 is true]
Example:
A person is arrested on a charge of burglary. A jury has to decide guilty or not guilty.
H0: Person is innocent
H1: Person is guilty
A Type I error occurs if the jury convicts the person [rejects H0] although the person is
innocent [H0 is true].
A Type II error occurs if the jury releases the person [does not reject H0] although the
person is guilty [H1 is true].
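The definitions of α and β above can be checked empirically with a small simulation. This is a sketch, not a standard recipe: the true means (100 and 100.3), sample size, and number of trials are all invented for illustration, and a two-sided z-test with known σ is assumed:

```python
import math
import random
import statistics

def z_test_rejects(sample, mu0, sigma, alpha=0.05):
    """Two-sided z-test: reject H0 (mean = mu0) if |z| exceeds the
    critical value for the 5% significance level (about 1.96)."""
    n = len(sample)
    z = (statistics.fmean(sample) - mu0) / (sigma / math.sqrt(n))
    return abs(z) > 1.96

rng = random.Random(0)
trials, n, sigma = 2000, 30, 1.0

# Type I error rate: H0 is actually true (true mean really is 100)
false_pos = sum(
    z_test_rejects([rng.gauss(100, sigma) for _ in range(n)], 100, sigma)
    for _ in range(trials)
) / trials

# Type II error rate: H0 is false (true mean is 100.3, but we test mean = 100)
false_neg = sum(
    not z_test_rejects([rng.gauss(100.3, sigma) for _ in range(n)], 100, sigma)
    for _ in range(trials)
) / trials

print(f"estimated alpha ~ {false_pos:.3f}")  # should land near 0.05
print(f"estimated beta  ~ {false_neg:.3f}")  # power of the test = 1 - beta
```

The estimated α comes out close to the 0.05 we built into the test, while β depends on how far the true mean is from the hypothesised one: a bigger shift or a bigger sample shrinks β and raises the power.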

Level of confidence
As the name suggests, the level of confidence expresses how confident we are in our decision.
It equals 1 − α; a 95% level of confidence (corresponding to α = 0.05) is the most commonly
used convention, though other levels such as 90% or 99% are also used.
Level of significance(α)
The significance level, in the simplest of terms, is the threshold probability of incorrectly rejecting
the null hypothesis when it is in fact true. This is also known as the type I error rate.
It is the probability of a type 1 error. It is also the size of the critical region.
Generally, strong control of α is desired, and in tests it is fixed in advance at a low level
such as 0.05 (5%) or 0.01 (1%).
If H0 is not rejected at a significance level of 5%, we cannot conclude that the null
hypothesis is true; we can only say that the sample does not provide sufficient evidence
against it at the 95% confidence level.
One-tailed and two-tailed Hypothesis Testing

If the alternate hypothesis gives the alternate in both directions (less than and greater than)
of the value of the parameter specified in the null hypothesis, it is called a Two-tailed test.
If the alternate hypothesis gives the alternate in only one direction (either less than or
greater than) of the value of the parameter specified in the null hypothesis, it is called
a One-tailed test.

e.g., if H0: mean= 100 H1: mean not equal to 100


here according to H1, mean can be greater than or less than 100. This is an example of a
Two-tailed test
Similarly, if H0: mean >= 100, then H1: mean < 100.
Here, according to H1, the mean can only be less than 100, so this is a One-tailed test.
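The relationship between the two kinds of test shows up directly in the p-values. The sketch below assumes a z-statistic of 1.8 has already been computed from some hypothetical sample:

```python
import math

def normal_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

z = 1.8  # hypothetical test statistic from a z-test of H0: mean = 100

# Two-tailed: H1 says mean != 100, so extreme values in EITHER direction count
p_two_tailed = 2 * (1 - normal_cdf(abs(z)))

# One-tailed (right): H1 says mean > 100, so only large values count
p_one_tailed = 1 - normal_cdf(z)

print(f"two-tailed p = {p_two_tailed:.4f}")  # exactly double the one-tailed p
print(f"one-tailed p = {p_one_tailed:.4f}")
```

For the same statistic, the two-tailed p-value is double the one-tailed one, because the rejection probability α is split between both tails.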
Critical Region
The critical region is that region in the sample space in which if the calculated value lies then
we reject the null hypothesis.

Let’s understand this with an example:


Suppose you are looking to rent an apartment. You listed out all the available apartments
from different real estate websites. You have a budget of Rs. 15,000/month. You cannot
spend more than that. The list of apartments you have made has a price ranging from
7000/month to 30,000/month.
You select a random apartment from the list and assume below hypothesis:
H0: You will rent the apartment.
H1: You won’t rent the apartment.
Now, since your budget is 15000, you have to reject all the apartments above that price.
Here all the Prices greater than 15000 become your critical region. If the random
apartment’s price lies in this region, you have to reject your null hypothesis and if the
random apartment’s price doesn’t lie in this region, you do not reject your null hypothesis.
The critical region lies in one tail or both tails of the probability distribution curve,
according to the alternative hypothesis. The critical region is a pre-defined area
corresponding to a cut-off value in the probability distribution curve; its size
(probability) is denoted by α.
Critical values are values separating the values that support or reject the null hypothesis
and are calculated on the basis of alpha.
We will see more examples later on, and it will become clear how we choose α.
Based on the alternative hypothesis, three cases of critical region arise:
Case 1) H1: parameter ≠ value. The critical region lies in both tails. This is a
Two-tailed (double-tailed) test.
Case 2) H1: parameter < value. The critical region lies in the left tail. This is
also called a Left-tailed test.
Case 3) H1: parameter > value. The critical region lies in the right tail. This is
also called a Right-tailed test.
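For a z-test, the cut-off values bounding these critical regions can be computed directly from α. This sketch assumes a standard normal test statistic and α = 0.05:

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1
alpha = 0.05

# Two-tailed test: alpha is split equally between the two tails
two_tailed_cut = std_normal.inv_cdf(1 - alpha / 2)   # about 1.96

# One-tailed tests: the whole alpha sits in a single tail
right_tailed_cut = std_normal.inv_cdf(1 - alpha)     # about 1.645
left_tailed_cut = std_normal.inv_cdf(alpha)          # about -1.645

print(f"two-tailed:   reject H0 if |z| > {two_tailed_cut:.3f}")
print(f"right-tailed: reject H0 if z > {right_tailed_cut:.3f}")
print(f"left-tailed:  reject H0 if z < {left_tailed_cut:.3f}")
```

Splitting α across two tails pushes the cut-off further out (1.96 vs 1.645), which is why a two-tailed test needs stronger evidence to reject H0 in any one direction.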
