Statistics ClassNotes_1

Introduction

Statistics is a fundamental field that plays a crucial role in various aspects of our lives. It
involves the collection, analysis, interpretation, and presentation of data using
mathematical and statistical methods. Statistics is widely used in fields such as business,
psychology, agriculture, banking, biology, data science, and many others.
One of the key aspects of statistics is its ability to communicate stories and insights through
numbers and graphical representations. For example, using statistics, we can summarize and
communicate information about the number of COVID-19 cases in different countries, the
effectiveness of vaccines, or the impact of wearing helmets or seatbelts on reducing fatalities
in accidents.
Statistics helps us understand and address uncertainty and variation. It provides tools and
techniques, such as probability concepts, to deal with uncertain events and sources of
variation. For example, statistical methods such as regression analysis, hypothesis testing,
and descriptive statistics allow us to make predictions, test hypotheses, and describe patterns
in data.
Statistics is not limited to a specific field but is involved in nearly all fields of study. It is
essential for making informed decisions, evaluating evidence, and drawing conclusions based
on data. A widely quoted paraphrase of H.G. Wells's 1903 writing holds that statistical
thinking would one day be as necessary for good citizenship as the ability to read and
write. Statistics continues to be a critical
component of modern society, helping us better understand the world and make informed
decisions in a data-driven era.

Descriptive and Inferential Statistics


Descriptive and inferential statistics are two branches of statistical analysis.
Descriptive Statistics
Descriptive statistics summarizes data using measures such as the mean, median, mode, and
standard deviation. Its methods describe the features of a dataset, whether that dataset is a
sample or an entire population: the centre, the variability, and the shape of the distribution.
Descriptive statistics does not make inferences or generalizations about a larger population;
it focuses on the data at hand.
Let's say you want to describe the test scores of a group of students who took a math exam.
You have the following scores for a sample of 10 students: 75, 80, 85, 90, 92, 93, 95, 98, 99,
100. How are you going to describe them? You can use numerical measures, such as measures
of central tendency and measures of dispersion, together with frequency distributions and
data visualizations. Data visualizations are usually considered part of descriptive statistics.
They are graphical representations that provide a visual summary of the main features or
patterns in a dataset, and they can be used to explore, analyse, and communicate data in a
clear and effective manner.
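As a minimal sketch, the ten exam scores listed above can be summarized with Python's standard `statistics` module. The variable names are illustrative, and the choice of mean, median, range, and sample standard deviation is only one reasonable set of descriptive measures.

```python
import statistics

# Test scores of the 10 students from the example above.
scores = [75, 80, 85, 90, 92, 93, 95, 98, 99, 100]

mean_score = statistics.mean(scores)      # measure of central tendency
median_score = statistics.median(scores)  # middle of the sorted scores
score_range = max(scores) - min(scores)   # simplest measure of dispersion
sample_sd = statistics.stdev(scores)      # sample standard deviation (n - 1 denominator)

print(f"mean   = {mean_score}")    # 90.7
print(f"median = {median_score}")  # 92.5
print(f"range  = {score_range}")   # 25
print(f"sd     = {sample_sd:.2f}")
```

None of these numbers says anything about students outside the group; they only describe the ten scores themselves.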
Inferential Statistics
In contrast, inferential statistics uses statistical methods to draw conclusions or make
predictions about a larger population based on a sample of data. It involves testing
hypotheses, estimating parameters, and making predictions, all grounded in probability
theory. Inferential statistics builds on descriptive statistics: summaries computed from the
sample are the inputs to hypothesis tests and other inferences. Predictive methods and
pattern-finding approaches in data science are derived from statistics, and inferential
methods can also sharpen the insight gained from exploratory data analysis when used
properly. Data science uses and extends these statistical concepts to deal with big data.

For example, if you wanted to understand the average height of a population of people, you
might collect a sample of data and use inferential statistics to estimate the population mean
height based on the sample mean and sample size.
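The height example can be sketched as follows. The heights here are simulated purely for illustration (the centre of 165 cm, the spread of 10 cm, and the sample size of 100 are assumptions, not data from these notes), and the interval uses the common normal approximation with the critical value 1.96.

```python
import math
import random
import statistics

random.seed(42)  # fixed seed so the sketch is reproducible

# Hypothetical sample: heights (cm) of 100 people, simulated for illustration only.
sample_heights = [random.gauss(165, 10) for _ in range(100)]

n = len(sample_heights)
sample_mean = statistics.mean(sample_heights)
sample_sd = statistics.stdev(sample_heights)

# The standard error of the mean shrinks as the sample size grows.
standard_error = sample_sd / math.sqrt(n)

# Approximate 95% confidence interval for the population mean.
lower = sample_mean - 1.96 * standard_error
upper = sample_mean + 1.96 * standard_error

print(f"sample mean = {sample_mean:.1f} cm")
print(f"approx. 95% interval = ({lower:.1f}, {upper:.1f}) cm")
```

The sample mean is the point estimate; the sample size enters through the standard error, which controls how wide the interval is.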

In summary, descriptive statistics is used to describe and summarize data, while inferential
statistics is used to make inferences or predictions about a larger population based on a
sample of data.

Sample vs population
Sampling is a common practice in statistical analysis as it is often impractical or costly to
study an entire population. A sample is a smaller subset of a population that is carefully
selected to represent the characteristics of the larger population. By studying a well-designed
sample, researchers can make inferences about the larger population. However, it's important
to ensure that the sample is representative and unbiased to avoid misleading results. The size
and composition of the sample are critical factors that affect the accuracy of the findings. A
larger sample generally provides more reliable results, while a smaller sample may have
higher variability. Understanding the relationship between sample and population is crucial in
designing and interpreting research studies, as the generalizability of findings to the broader
population depends on the quality of the sample.

Figure 1: An important concept of inferential statistics. We recruit samples from a population
using a sampling technique that avoids bias. We test a previously proposed hypothesis on the
recruited sample. Based on the statistical analysis, we draw inferences about the population.

Before moving to sampling techniques, it is important to understand the difference between
sample and population.
Population: A population refers to the entire group or set of individuals, objects, or events
that a data scientist is interested in studying. It includes all the members of a particular group
or the entire target population. For example, if a researcher is interested in studying the
average height of all adults in a country, the population would be all adults living in that
country.
Sample: A sample is a smaller subset of a population that is selected for analysis. It is a
representative portion of the population that is carefully chosen to draw conclusions about the
entire population. For example, if a researcher randomly selects 500 adults from the
population of a country to measure their height, these 500 adults would form the sample.

Example: In a COVID-19 study, 82,583 cases were reported in Telangana state during the
second wave. When 1000 cases were randomly selected, there were 632 males and 362
females.

Population: All 82,583 cases in Telangana state during the second COVID-19 wave.
Sample: The 1000 randomly selected cases.
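The arithmetic behind this example can be sketched as below, using only the figures quoted above. Note that the reported counts (632 and 362) add up to 994 rather than 1000; the sketch simply reports the difference and does not guess how the remaining cases were classified.

```python
population_size = 82_583   # all cases in the state during the second wave
sample_size = 1_000        # randomly selected cases
males, females = 632, 362  # counts as reported in the example

# Fraction of the population that ended up in the sample.
sampling_fraction = sample_size / population_size

# Sample proportions (statistics computed from the sample).
prop_male = males / sample_size
prop_female = females / sample_size

# Cases in the sample not covered by the two reported counts.
unaccounted = sample_size - (males + females)

print(f"sampling fraction = {sampling_fraction:.4f}")
print(f"proportion male   = {prop_male:.3f}")
print(f"proportion female = {prop_female:.3f}")
print(f"unaccounted cases = {unaccounted}")
```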

Parameter vs Statistic
In statistics, a parameter and a statistic are both numerical measures: a parameter describes
a characteristic of a population, and a statistic describes a characteristic of a sample.
1- Parameter: A parameter is a numerical measure that describes a characteristic of an
entire population. It is a fixed value that typically cannot be calculated exactly, but is
estimated using data from a sample. For example, the population mean (average)
income of all adults in a country is a parameter, which represents the true but
unknown value for the entire population.

2- Statistic: A statistic is a numerical measure that describes a characteristic of a sample,
which is a subset of a population. It is calculated from sample data and provides an
estimate or approximation of the population parameter. For example, the sample mean
income of 500 randomly selected adults from a country is a statistic, which provides
an estimate of the population mean income based on the data collected from the
sample.
Example: Consider a study that aims to estimate the average height of all students in a school
district. The true average height of all students in the district, which is unknown, is the
parameter. If the researcher measures the heights of a random sample of 100 students and
calculates the sample mean height, that value is a statistic. The sample mean height is then
used as an estimate of the population mean height, that is, of the parameter.
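The distinction can be illustrated with a small simulation. Everything here is hypothetical: the district of 5,000 students and their heights are generated at random, so the "true" parameter is known only because the whole population was simulated.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical population: heights (cm) of every student in a district.
population = [random.gauss(150, 12) for _ in range(5_000)]

# Parameter: a fixed value, normally unknown in a real study.
parameter = statistics.mean(population)

# Statistic: computed from a sample, so it changes from sample to sample.
sample_means = []
for _ in range(5):
    sample = random.sample(population, 100)
    sample_means.append(statistics.mean(sample))

print(f"population mean (parameter): {parameter:.2f}")
for i, m in enumerate(sample_means, start=1):
    print(f"sample {i} mean (statistic):  {m:.2f}")
```

Each run of the loop draws a different sample of 100 students, so the statistic varies around the single fixed parameter.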

Data Types in statistics


Statistics and data science are incomplete without data. Data are a collection of observations,
such as height, bank account numbers, gender information and blood sugar level. These
collected data are used to test a proposed hypothesis. A single data value is called a
datum, a term rarely used. (The term “data” is plural, so strictly it is correct to say
“data are…”, not “data is…”.) Every study or experiment generates a set of data. The type of
data influences the statistical methods that can be used to analyse it.
Let’s have an example here. Shiva and Sai are roommates. Both are using cell phones. Shiva
has an iPhone, but Sai is using Samsung, and he has two phones. Both watch movies only on
their cell phones. For the same movie, the iPhone consumes 2.35 GB of data. However, the
Samsung phone consumes 2.78 GB of data. Is this not a real-life example? The whole
example talks about data only. Now it's your turn to figure out different types of data in the
above example. We will discuss this example again at the end of the chapter, hoping that you
will be able to answer these on your own at the end of this chapter.

A good researcher focuses on hypotheses first and collects data based on the hypothesis. In
other words, the hypothesis is primary and the data are secondary. The data we gather are
raw data, which means that before doing any analysis we must perform some data cleaning
and manipulation steps.

Data
Data refers to a collection of facts, information, or observations that are represented in
various formats, such as numbers, text, images, audio, or video. Data can be either qualitative
or quantitative, and it serves as the foundation for information and knowledge in various
fields, including science, business, technology, and social sciences.
Data can be collected through various methods, such as surveys, experiments, observations,
or through existing sources like databases or records. Data can be raw or processed, and it can
be organized, analyzed, and interpreted to extract meaningful insights, patterns, or trends.
Data is used to make informed decisions, support research findings, develop theories, create
models, and drive decision-making in various fields of study and practice.
Examples of data include sales figures, temperature readings, customer feedback, survey
responses, medical records, social media posts, DNA sequences, satellite images, and much
more. Proper management, analysis, and interpretation of data are essential for extracting
valuable information and knowledge that can inform decision-making and drive meaningful
outcomes.
Variable
Data are often discussed in terms of variables. A variable is defined as “any characteristic
that varies from one member of a population to another”. Examples include the height of an
animal, the weight of a cell phone, gender information, etc. Understanding data is imperative
for statistics and data science. Without proper knowledge of data types, it is impossible to
apply proper statistical methods, data science approaches, and even good data visualization
techniques. For example, regression techniques typically model a continuous response
variable, while classification models predict a qualitative (categorical) response. We use
qualitative data to create bar plots and quantitative data to create histograms. These terms
might be new, but they are necessary in the field of statistics and data science. A random
variable is a variable that takes on different values with certain probabilities in a statistical
experiment.
Types of Variables
There are two basic types of variables in statistics, qualitative and quantitative. Qualitative
variables are also known as categorical variables. Quantitative variables are also known as
numerical variables.

Qualitative Data/Categorical Data


Qualitative data is non-numerical data that describes qualities or characteristics. It is often
subjective and relies on human interpretation. Examples of qualitative data include gender,
educational qualification, class grade, religion, name, zip code, etc. Qualitative data is further
divided into two sub-types:
Nominal Variables
A nominal variable is a type of qualitative or categorical variable in statistics. It is a variable
that represents categories or groups that are mutually exclusive and have no inherent
numerical order or ranking. In other words, nominal variables do not have any inherent
numerical or quantitative meaning, and the categories or groups are simply used to classify
data. We cannot say that males are greater than females or vice versa. If they are coded as 0
and 1, we cannot calculate magnitude or perform calculations such as mean, median etc.
Examples of nominal variables include:

● Gender (e.g., male, female, non-binary)

● Ethnicity (e.g., Asian, Black, Hispanic, White)

● Marital status (e.g., married, single, divorced, widowed)

● Type of vehicle (e.g., car, truck, motorcycle)

● Political affiliation (e.g., BJP, Congress, BSR, Independent)

In statistical analysis, nominal variables are typically used for descriptive purposes, such as
calculating frequencies and percentages, and for making comparisons between categories.
They can also be used in inferential statistics, such as chi-squared tests, to determine if there
are significant differences between categories. However, it's important to note that nominal
variables do not have any inherent numerical meaning or hierarchy, and any numerical values
assigned to them are purely for coding or labelling purposes, rather than representing actual
quantitative measurements.
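A minimal sketch of the descriptive summaries that suit a nominal variable: frequencies, percentages, and the most common category. The marital-status responses below are made up for illustration.

```python
from collections import Counter

# Hypothetical survey responses for a nominal variable (marital status).
responses = [
    "married", "single", "single", "divorced", "married",
    "single", "widowed", "married", "single", "divorced",
]

counts = Counter(responses)  # frequency of each category
total = len(responses)
percentages = {category: 100 * count / total for category, count in counts.items()}

# The mode is the only measure of "centre" that makes sense for nominal data.
mode_category = counts.most_common(1)[0][0]

for category, count in counts.most_common():
    print(f"{category:<9} {count:>2}  {percentages[category]:.0f}%")
print("mode:", mode_category)
```

No ordering or arithmetic on the category labels is involved; the code only counts how often each label occurs.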
Ordinal Variable
An ordinal variable is a type of categorical variable whose categories have a specific order or
ranking. It conveys the relative position of the categories, but the intervals between them are
not necessarily equal. Only the order matters; the magnitude of any numeric codes is still not
meaningful. Examples of ordinal variables include:
1- Education level (e.g., high school diploma, bachelor's degree, master's degree, etc.)
2- Likert scale responses (e.g., strongly agree, agree, neutral, disagree, strongly
disagree)
3- Socioeconomic status (e.g., low income, middle income, high income)
4- Performance ratings (e.g., poor, fair, good, excellent)
5- Survey responses with ordered options (e.g., low, medium, high)
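Because ordinal categories can be ranked, a median category is meaningful even though a mean is not. The sketch below uses the performance-rating scale from the list above with made-up ratings; the `rank` mapping is only a way to encode the order, not a measurement.

```python
# Ordered categories, from lowest to highest.
levels = ["poor", "fair", "good", "excellent"]
rank = {level: position for position, level in enumerate(levels)}

# Hypothetical ratings from seven reviewers.
ratings = ["good", "fair", "excellent", "good", "poor", "good", "fair"]

# Sort by position in the ordering, then take the middle element.
sorted_ratings = sorted(ratings, key=rank.get)
median_rating = sorted_ratings[len(sorted_ratings) // 2]

print(sorted_ratings)
print("median rating:", median_rating)  # good
```

With an odd number of ratings the middle element is unambiguous; with an even number, the two middle categories may differ and there is no meaningful way to average them.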

Quantitative Data
Quantitative data, on the other hand, is numerical data that can be measured and analysed
using mathematical and statistical methods. It is often objective and can be represented in
graphs or tables. Examples of quantitative data include measurements such as height, weight,
or blood pressure, and counts such as the number of people in a survey. Quantitative data is
further divided into two sub-types:
Discrete variables
Discrete variables are a type of quantitative variable that can only take on specific, distinct
values, often represented by integers or whole numbers. They are characterized by gaps or
jumps between values, and they do not have any intermediate values between the distinct
values. Both order and magnitude are meaningful. Examples of discrete variables include:
1- Number of children in a family (e.g., 0, 1, 2, 3, etc.)
2- Number of cars in a parking lot (e.g., 0, 1, 2, 3, etc.)
3- Number of customer complaints received in a day (e.g., 0, 1, 2, 3, etc.)
4- Number of items sold in a store (e.g., 0, 1, 2, 3, etc.)
5- Number of students absent from school (e.g., 0, 1, 2, 3, etc.)
Discrete variables are used in various fields, such as statistics, economics, business, and
social sciences, and they can be analysed using techniques such as frequency distributions,
probability distributions, and statistical tests.

Continuous variables
A continuous variable is a type of quantitative variable that can take on any value within a
certain range, often represented by real numbers. It has an infinite number of possible values
and can take a value at any point along a continuous scale, without any gaps or jumps. Both
order and magnitude are meaningful. Examples of continuous variables include:
1- Height (e.g., 5.5 feet, 6.2 feet, 5.9 feet, etc.)
2- Weight (e.g., 150 lbs, 175 lbs, 130 lbs, etc.)
3- Temperature (e.g., 98.6°F, 72.3°F, 100.2°F, etc.)
4- Time (e.g., 2.5 hours, 1.75 hours, 0.25 hours, etc.)
5- Blood pressure (e.g., 120/80 mmHg, 140/90 mmHg, 110/70 mmHg, etc.)
Continuous variables are used in various fields, such as physics, engineering, medicine, and
social sciences, and they can be analysed using techniques such as probability density
functions, regression analysis, and hypothesis testing. They are often represented on a
continuous scale, such as a line graph or scatter plot, to visualize their values and patterns.
Qualitative and quantitative data are used for different purposes and are analysed using
different statistical methods. Qualitative data is often used to generate ideas or hypotheses,
while quantitative data is used to test those hypotheses or make predictions. Qualitative data
is analysed using methods such as thematic analysis, while quantitative data is analysed using
methods such as regression analysis, t-tests, and ANOVA.
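One practical consequence of the discrete/continuous distinction is how values are tabulated: discrete values can be counted directly, while continuous values are usually grouped into intervals first (the idea behind a histogram). The data and the half-foot bin width below are illustrative assumptions.

```python
from collections import Counter

# Discrete variable: number of children in ten hypothetical families.
children = [0, 2, 1, 3, 2, 0, 1, 2, 2, 1]
child_freq = Counter(children)  # each distinct value gets its own count

# Continuous variable: hypothetical heights in feet.
heights = [5.5, 6.2, 5.9, 5.1, 5.7, 6.0, 5.4, 5.8]
bin_width = 0.5

# Map each height to the lower edge of its bin, e.g. 5.7 -> 5.5.
height_bins = Counter((h // bin_width) * bin_width for h in heights)

print("children:", dict(sorted(child_freq.items())))
for lower_edge, count in sorted(height_bins.items()):
    print(f"[{lower_edge:.1f}, {lower_edge + bin_width:.1f}) ft: {count}")
```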

Sampling techniques

Sampling
Sampling is the process of selecting a sample from a population. There are various sampling
methods, such as random sampling, stratified sampling, and convenience sampling, among
others, that are used to select samples in studies.
Example: Let's consider a research study that aims to investigate the average income of
employees in a particular company. The population would be all the employees working in
that company. However, it may be impractical or time-consuming to collect income data
from all employees. Therefore, the researcher may randomly select a sample of 200
employees from the company as a representative subset. The income data collected from this
sample would be used to make inferences about the average income of the entire employee
population in the company.
Sampling bias
Sampling bias refers to an error or distortion that occurs in the process of selecting a sample
from a population, resulting in a sample that is not representative of the entire population.
This can lead to inaccurate or misleading conclusions when generalizing findings from the
sample to the population. Sampling bias can occur for various reasons, such as non-random
sampling, self-selection bias, volunteer bias, and measurement bias. It can result in
over-representing or under-representing certain groups or characteristics in the sample,
leading to biased estimates and incorrect inferences. Proper sampling techniques, such as
random sampling, stratified sampling, or cluster sampling, can help minimize sampling bias
and ensure a more accurate representation of the population.
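The effect of sampling bias can be seen in a small simulation. The company below is entirely hypothetical (800 staff and 200 managers with made-up income distributions); the biased sample deliberately surveys only the managers, while the random sample draws from everyone.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is reproducible

# Hypothetical company: 800 staff and 200 managers with different income levels.
staff = [random.gauss(40_000, 5_000) for _ in range(800)]
managers = [random.gauss(90_000, 8_000) for _ in range(200)]
incomes = staff + managers

population_mean = statistics.mean(incomes)

# Random sample: every employee has the same chance of being selected.
random_sample = random.sample(incomes, 200)

# Biased sample: only one group is surveyed (over-represents managers).
biased_sample = managers

random_error = abs(statistics.mean(random_sample) - population_mean)
biased_error = abs(statistics.mean(biased_sample) - population_mean)

print(f"population mean : {population_mean:,.0f}")
print(f"random sample   : off by {random_error:,.0f}")
print(f"biased sample   : off by {biased_error:,.0f}")
```

Both samples have the same size, so the difference in error comes from how the sample was selected, not from how large it is.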
Sample recruitment for a study is a real challenge. Multiple methods can be used to recruit
samples in the study. Sampling techniques in real life depend on your aim, time, funds to
spend and sample availability. Domain expertise also helps us to find a better approach.
Sampling techniques are basically of two types: probability sampling and non-probability
sampling.

Probability Sampling Techniques


Probability sampling gives every member of the population a chance to be part of a study,
and we can determine the probability of each member being sampled.
● Everyone has a chance to be part of the study

● People are selected at random

● We can generalize the results to the population, that is, make inferences about the
population based on the sample results
There are many types of probability sampling techniques. Here are some very commonly
used techniques:

● Simple random sampling


It is the simplest sampling technique and gives everyone an equal chance to be part of
the study based on random selection. For example, 50 students are chosen at random
from the 785 students at OdinSchool.

Source: https://www.simplypsychology.org/simple-random-sampling.html
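A minimal sketch of simple random sampling for the example above, assuming the 785 students can be identified by the hypothetical ID numbers 1 to 785:

```python
import random

random.seed(7)  # fixed seed so the sketch is reproducible

# Hypothetical student IDs 1..785 standing in for the full student list.
student_ids = list(range(1, 786))

# Draw 50 students without replacement; every student is equally likely.
chosen = random.sample(student_ids, 50)

print(sorted(chosen))
```

`random.sample` draws without replacement, so no student can be selected twice.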

● Systematic sampling
Samples are recruited from a population starting at a random point and then at a fixed
periodic interval. For example, starting from a randomly chosen student, every 10th
student in order of admission is chosen from OdinSchool.

Source: https://www.dreamstime.com/systematic-sampling-method-statistics-research-sample-collecting-data-scientific-survey-techniques-systematic-
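Systematic sampling can be sketched in the same way, again assuming hypothetical IDs 1 to 785 listed in order of admission. The random starting point is chosen within the first interval, and every 10th student after that is selected.

```python
import random

random.seed(3)  # fixed seed so the sketch is reproducible

# Hypothetical student IDs 1..785, assumed to be in order of admission.
student_ids = list(range(1, 786))

interval = 10
start = random.randrange(interval)  # random starting index in 0..9

# Take the starting student and every 10th student after that.
chosen = student_ids[start::interval]

print("first selected ID:", chosen[0])
print("number selected  :", len(chosen))
```

Only the starting point is random; once it is fixed, the rest of the sample is fully determined by the interval.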

Non-probability Sampling Techniques


Not everyone has a chance to be a part of the study. Hence, the results are specific to the
selected sample and cannot be generalized to the population. Here are a few examples of
non-probability sampling techniques; there is no need to go into depth on them, because
their results cannot be used to make inferences about the population.
