Statistics ClassNotes_1
Statistics is a fundamental field that plays a crucial role in various aspects of our lives. It
involves the collection, analysis, interpretation, and presentation of data using
mathematical and statistical methods. Statistics is widely used in fields such as business,
psychology, agriculture, banking, biology, data science, and many others.
One of the key aspects of statistics is its ability to communicate stories and insights through
numbers and graphical representations. For example, using statistics, we can summarize and
communicate information about the number of COVID-19 cases in different countries, the
effectiveness of vaccines, or the impact of wearing helmets or seatbelts on reducing fatalities
in accidents.
Statistics helps us understand and address uncertainty and variation. It provides tools and
techniques, such as probability concepts, to deal with uncertain events and sources of
variation. For example, statistical methods such as regression analysis, hypothesis testing,
and descriptive statistics allow us to make predictions, test hypotheses, and describe patterns
in data.
Statistics is not limited to a specific field but is involved in nearly all fields of study. It is
essential for making informed decisions, evaluating evidence, and drawing conclusions based
on data. A remark often attributed to H.G. Wells (1903) holds that statistical thinking would
one day be as necessary for good citizenship as the ability to read and write. Statistics continues to be a critical
component of modern society, helping us better understand the world and make informed
decisions in a data-driven era.
For example, if you wanted to understand the average height of a population of people, you
might collect a sample of data and use inferential statistics to estimate the population mean
height based on the sample mean and sample size.
In summary, descriptive statistics is used to describe and summarize data, while inferential
statistics is used to make inferences or predictions about a larger population based on a
sample of data.
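The height example can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed method: the heights are simulated rather than real data, the helper name estimate_mean is made up for this sketch, and the interval uses the normal approximation (z = 1.96), which is an assumption that suits reasonably large samples.

```python
import math
import random
import statistics

def estimate_mean(sample, z=1.96):
    """Return (sample mean, standard error, approximate 95% interval)."""
    n = len(sample)
    mean = statistics.mean(sample)
    # Standard error of the mean: sample standard deviation / sqrt(n)
    se = statistics.stdev(sample) / math.sqrt(n)
    return mean, se, (mean - z * se, mean + z * se)

random.seed(0)  # fixed seed so the sketch is repeatable
# Hypothetical sample of 100 heights in cm (simulated, not real data)
heights = [random.gauss(170, 10) for _ in range(100)]

mean, se, (low, high) = estimate_mean(heights)
print(f"sample mean = {mean:.1f} cm, SE = {se:.2f}")
print(f"approximate 95% interval for the population mean: ({low:.1f}, {high:.1f})")
```

The sample mean is the descriptive part; the standard error and interval are the inferential part, since they say something about the unseen population.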
Sample vs population
Sampling is a common practice in statistical analysis as it is often impractical or costly to
study an entire population. A sample is a smaller subset of a population that is carefully
selected to represent the characteristics of the larger population. By studying a well-designed
sample, researchers can make inferences about the larger population. However, it's important
to ensure that the sample is representative and unbiased to avoid misleading results. The size
and composition of the sample are critical factors that affect the accuracy of the findings. A
larger sample generally provides more reliable results, while a smaller sample may have
higher variability. Understanding the relationship between sample and population is crucial in
designing and interpreting research studies, as the generalizability of findings to the broader
population depends on the quality of the sample.
Figure 1: The central idea of inferential statistics. We draw a sample from a
population using a sampling technique chosen to avoid bias, test a previously stated hypothesis on that
sample, and, based on the statistical analysis, draw inferences about the population.
Example: In a COVID-19 study, it was mentioned that there were 82,583 cases found in
Telangana state during the second wave. When 1000 cases were randomly selected there
were 632 males and 362 females.
Population: All 82,583 cases in Telangana state during the second COVID-19 wave.
Sample: The 1000 randomly selected cases.
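As a rough sketch of how the sample relates to the population in this example, the snippet below computes the sampling fraction and the sample proportion of males. It uses only the figures quoted above (82,583 cases, 1000 sampled, 632 males); the standard-error formula for a proportion is a standard result, and the variable names are chosen for this illustration.

```python
import math

population_size = 82_583   # all cases in the example
sample_size = 1_000        # randomly selected cases
males_in_sample = 632      # reported male count in the sample

# Share of the population that was actually examined
sampling_fraction = sample_size / population_size

# Sample proportion of males, used as an estimate for the population
p_male = males_in_sample / sample_size

# Standard error of a sample proportion: sqrt(p * (1 - p) / n)
se = math.sqrt(p_male * (1 - p_male) / sample_size)

print(f"sampling fraction: {sampling_fraction:.2%}")
print(f"estimated proportion of males: {p_male:.3f} (SE {se:.3f})")
```

Even though only a little over 1% of the cases were sampled, the standard error of the estimated proportion is small because the sample was drawn at random and is fairly large.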
Parameter vs Statistic
In statistics, parameters and statistics are both numerical measures that describe
characteristics of a population or sample, respectively.
1- Parameter: A parameter is a numerical measure that describes a characteristic of an
entire population. It is a fixed value that typically cannot be calculated exactly, but is
estimated using data from a sample. For example, the population mean (average)
income of all adults in a country is a parameter, which represents the true but
unknown value for the entire population.
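The distinction can be made concrete with a small simulation. The sketch below assumes a hypothetical population of 10,000 simulated incomes, so that the parameter (the population mean) can be computed directly for comparison; in real studies it usually cannot. The corresponding measure computed from a sample is the statistic, and it changes from sample to sample.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is repeatable

# Hypothetical population of 10,000 incomes (simulated, right-skewed)
population = [random.lognormvariate(10, 0.5) for _ in range(10_000)]

# Parameter: one fixed value describing the whole population
parameter = statistics.mean(population)

# Statistic: computed from a sample, so it varies between samples
sample_means = [
    statistics.mean(random.sample(population, 200))
    for _ in range(5)
]

print(f"population mean (parameter): {parameter:,.0f}")
for i, m in enumerate(sample_means, start=1):
    print(f"sample {i} mean (statistic): {m:,.0f}")
```

Each sample mean lands near the population mean but none equals it exactly, which is why a statistic is treated as an estimate of the parameter.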
A good researcher focuses on hypotheses first and collects data based on the hypothesis.
In other words, the hypothesis is primary and the data is secondary. The data we gather is raw data,
which means that before doing any analysis we must perform some data cleaning and
manipulation steps.
Data
Data refers to a collection of facts, information, or observations that are represented in
various formats, such as numbers, text, images, audio, or video. Data can be either qualitative
or quantitative, and it serves as the foundation for information and knowledge in various
fields, including science, business, technology, and social sciences.
Data can be collected through various methods, such as surveys, experiments, observations,
or through existing sources like databases or records. Data can be raw or processed, and it can
be organized, analyzed, and interpreted to extract meaningful insights, patterns, or trends.
Data is used to make informed decisions, support research findings, develop theories, create
models, and drive decision-making in various fields of study and practice.
Examples of data include sales figures, temperature readings, customer feedback, survey
responses, medical records, social media posts, DNA sequences, satellite images, and much
more. Proper management, analysis, and interpretation of data are essential for extracting
valuable information and knowledge that can inform decision-making and drive meaningful
outcomes.
Variable
Data are often discussed in terms of variables. A variable is defined as “any characteristic
that varies from one member of a population to another”. Examples include the height of an
animal, the weight of a cell phone, and gender. Understanding data types is essential
for statistics and data science: without it, we cannot choose
appropriate statistical methods, modelling approaches, or data visualizations.
For example, regression techniques typically model a continuous (quantitative) response variable, while classification
models predict a qualitative (categorical) outcome. Bar plots are used for qualitative data and histograms for quantitative
data. These terms may be new, but they are necessary in statistics
and data science. A random variable is a variable that takes on different values with certain
probabilities in a statistical experiment.
Types of Variables
There are two basic types of variables in statistics, qualitative and quantitative. Qualitative
variables are also known as categorical variables. Quantitative variables are also known as
numerical variables.
In statistical analysis, nominal variables are typically used for descriptive purposes, such as
calculating frequencies and percentages, and for making comparisons between categories.
They can also be used in inferential statistics, such as chi-squared tests, to determine if there
are significant differences between categories. However, it's important to note that nominal
variables do not have any inherent numerical meaning or hierarchy, and any numerical values
assigned to them are purely for coding or labelling purposes, rather than representing actual
quantitative measurements.
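The frequencies and percentages mentioned above can be computed as follows. This is a small sketch with hypothetical blood-group observations; blood group is used here only as an assumed example of a nominal variable.

```python
from collections import Counter

# Hypothetical nominal observations: labels with no order or magnitude
blood_groups = ["A", "O", "B", "O", "AB", "A", "O", "B", "O", "A"]

counts = Counter(blood_groups)            # frequency of each category
total = len(blood_groups)
percentages = {group: 100 * c / total for group, c in counts.items()}

# The mode (most frequent category) is a valid summary for nominal data;
# a mean or median of the labels would be meaningless.
mode = counts.most_common(1)[0][0]

for group, c in counts.most_common():
    print(f"{group:>2}: {c} ({percentages[group]:.0f}%)")
print("mode:", mode)
```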
Ordinal Variable
An ordinal variable is a type of categorical variable that represents data with categories that
have a specific order or ranking. It conveys relative differences or preferences among
categories but does not have equal intervals between them. Only the order matters; the
magnitude of any numbers assigned to the categories is not meaningful. Examples of ordinal variables include:
1- Education level (e.g., high school diploma, bachelor's degree, master's degree, etc.)
2- Likert scale responses (e.g., strongly agree, agree, neutral, disagree, strongly
disagree)
3- Socioeconomic status (e.g., low income, middle income, high income)
4- Performance ratings (e.g., poor, fair, good, excellent)
5- Survey responses with ordered options (e.g., low, medium, high)
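Because ordinal categories have an order but no fixed spacing, they can be sorted and summarized by a median, but not meaningfully averaged. The sketch below assumes a five-point Likert scale and hypothetical responses; the rank numbers exist only to encode the order.

```python
import statistics

# Categories listed from lowest to highest; the position encodes the order
LEVELS = ["strongly disagree", "disagree", "neutral", "agree", "strongly agree"]
rank = {label: i for i, label in enumerate(LEVELS)}

# Hypothetical survey responses
responses = ["agree", "neutral", "strongly agree", "disagree", "agree"]

# Sort by rank, not alphabetically
ordered = sorted(responses, key=rank.__getitem__)

# median_low always returns an observed rank, which maps back to a label
median_rank = statistics.median_low(rank[r] for r in responses)
median_response = LEVELS[median_rank]

print("ordered:", ordered)
print("median response:", median_response)
```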
Quantitative Data
Quantitative data, on the other hand, is numerical data that can be measured and analysed
using mathematical and statistical methods. It is often objective and can be represented in
graphs or tables. Examples of quantitative data include measurements such as height, weight,
or blood pressure, and counts such as the number of people in a survey. Quantitative data is
further divided into two sub-types:
Discrete variables
Discrete variables are a type of quantitative variable that can only take on specific, distinct
values, often represented by integers or whole numbers. They are characterized by gaps or
jumps between values, and they do not have any intermediate values between the distinct
values. Both order and magnitude are meaningful. Examples of discrete variables include:
1- Number of children in a family (e.g., 0, 1, 2, 3, etc.)
2- Number of cars in a parking lot (e.g., 0, 1, 2, 3, etc.)
3- Number of customer complaints received in a day (e.g., 0, 1, 2, 3, etc.)
4- Number of items sold in a store (e.g., 0, 1, 2, 3, etc.)
5- Number of students absent from school (e.g., 0, 1, 2, 3, etc.)
Discrete variables are used in various fields, such as statistics, economics, business, and
social sciences, and they can be analysed using techniques such as frequency distributions,
probability distributions, and statistical tests.
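A frequency distribution for a discrete variable can be sketched as below, using hypothetical counts of children per family. Exact fractions are used so the relative frequencies sum to exactly 1; that choice is just for this illustration.

```python
from collections import Counter
from fractions import Fraction

# Hypothetical data: number of children in each of 10 families
children = [0, 2, 1, 3, 2, 0, 1, 2, 4, 2]
n = len(children)

# Relative frequency of each distinct value (an empirical distribution)
pmf = {k: Fraction(c, n) for k, c in sorted(Counter(children).items())}

# Magnitude is meaningful for discrete data, so a mean can be computed
mean_children = sum(k * p for k, p in pmf.items())

for k, p in pmf.items():
    print(f"{k} children: {float(p):.1f}")
print("mean number of children:", float(mean_children))
```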
Continuous variables
A continuous variable is a type of quantitative variable that can take on any value within a
certain range, often represented by real numbers. They are characterized by having an infinite
number of possible values and can have values at any point along a continuous scale, without
any gaps or jumps. Both order and magnitude are meaningful. Examples of continuous
variables include:
1- Height (e.g., 5.5 feet, 6.2 feet, 5.9 feet, etc.)
2- Weight (e.g., 150 lbs, 175 lbs, 130 lbs, etc.)
3- Temperature (e.g., 98.6°F, 72.3°F, 100.2°F, etc.)
4- Time (e.g., 2.5 hours, 1.75 hours, 0.25 hours, etc.)
5- Blood pressure (e.g., 120/80 mmHg, 140/90 mmHg, 110/70 mmHg, etc.)
Continuous variables are used in various fields, such as physics, engineering, medicine, and
social sciences, and they can be analysed using techniques such as probability density
functions, regression analysis, and hypothesis testing. They are often represented on a
continuous scale, such as a line graph or scatter plot, to visualize their values and patterns.
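Since a continuous variable rarely repeats an exact value, it is usually summarized by grouping values into intervals, which is the idea behind a histogram. The sketch below bins hypothetical heights in feet into half-foot intervals; the bin width and the helper name bin_start are assumptions made for this example.

```python
from collections import Counter

# Hypothetical heights in feet
heights_ft = [5.5, 6.2, 5.9, 5.1, 5.7, 6.0, 5.4, 5.8]
bin_width = 0.5

def bin_start(x, width):
    """Return the lower edge of the interval [edge, edge + width) containing x."""
    return (x // width) * width

# Count how many observations fall into each interval
histogram = Counter(bin_start(h, bin_width) for h in heights_ft)

for edge in sorted(histogram):
    print(f"[{edge:.1f}, {edge + bin_width:.1f}): {'#' * histogram[edge]}")
```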
Qualitative and quantitative data are used for different purposes and are analysed using
different statistical methods. Qualitative data is often used to generate ideas or hypotheses,
while quantitative data is used to test those hypotheses or make predictions. Qualitative data
is analysed using methods such as thematic analysis, while quantitative data is analysed using
methods such as regression analysis, t-tests, and ANOVA.
Sampling techniques
Sampling
Sampling is the process of selecting a sample from a population. There are various sampling
methods, such as random sampling, stratified sampling, and convenience sampling, among
others, that are used to select samples in studies.
Example: Let's consider a research study that aims to investigate the average income of
employees in a particular company. The population would be all the employees working in
that company. However, it may be impractical or time-consuming to collect income data
from all employees. Therefore, the researcher may randomly select a sample of 200
employees from the company as a representative subset. The income data collected from this
sample would be used to make inferences about the average income of the entire employee
population in the company.
Sampling bias
Sampling bias refers to an error or distortion that occurs in the process of selecting a sample
from a population, resulting in a sample that is not representative of the entire population.
This can lead to inaccurate or misleading conclusions when generalizing findings from the
sample to the population. Sampling bias can occur for various reasons, such as
non-random sampling, self-selection bias, volunteer bias, and measurement bias. It can result in
over-representing or under-representing certain groups or characteristics in the sample,
leading to biased estimates and incorrect inferences. Proper sampling techniques, such as
random sampling, stratified sampling, or cluster sampling, can help minimize sampling bias
and ensure a more accurate representation of the population.
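The effect of sampling bias can be seen in a small simulation. The sketch below assumes a hypothetical company in which income rises with years of service, then compares a convenience sample (only the longest-serving employees) with a simple random sample of the same size. All figures are simulated for illustration.

```python
import random
import statistics

random.seed(7)  # fixed seed so the sketch is repeatable

# Hypothetical population: (years of service, income), with income rising with years
population = []
for _ in range(5_000):
    years = random.randint(0, 30)
    income = 30_000 + 2_000 * years + random.gauss(0, 5_000)
    population.append((years, income))

true_mean = statistics.mean(income for _, income in population)

# Biased (convenience) sample: the 200 longest-serving employees only
convenience = sorted(population, key=lambda p: p[0], reverse=True)[:200]
biased_mean = statistics.mean(income for _, income in convenience)

# Simple random sample: every employee has the same chance of selection
random_sample = random.sample(population, 200)
random_mean = statistics.mean(income for _, income in random_sample)

print(f"population mean:         {true_mean:,.0f}")
print(f"convenience sample mean: {biased_mean:,.0f}")
print(f"random sample mean:      {random_mean:,.0f}")
```

The convenience sample over-represents senior employees and so overstates the average income, while the random sample stays close to the population mean.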
Recruiting a sample for a study is a real challenge, and multiple methods can be used.
In practice, the choice of sampling technique depends on the aim of the study, the time and funds
available, and how accessible the sample is. Domain expertise also helps in finding a better approach.
Sampling techniques are basically of two types: probability sampling and non-probability
sampling.
Source: https://www.simplypsychology.org/simple-random-sampling.html
● Systematic sampling
Samples are recruited from a population according to a random starting point, but
with a fixed periodic interval. For example, every 10th student based on admission
will be chosen from OdinSchool.
Source: https://www.dreamstime.com/systematic-sampling-method-statistics-research-sample-collecting-data-scientific-survey-techniques-systematic-
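Systematic sampling as described above (a random starting point, then every 10th student) can be sketched as follows. The roster of 100 student IDs and the function name systematic_sample are hypothetical and used only for illustration.

```python
import random

def systematic_sample(items, interval, rng=random):
    """Pick a random start within the first interval, then every interval-th item."""
    start = rng.randrange(interval)   # random starting point: 0 .. interval - 1
    return items[start::interval]     # fixed periodic interval from there on

random.seed(3)  # fixed seed so the sketch is repeatable

# Hypothetical roster of 100 students, ordered by admission
students = [f"S{i:03d}" for i in range(1, 101)]

chosen = systematic_sample(students, interval=10)
print(chosen)
```

With 100 students and an interval of 10, the sample always contains 10 students; only the starting point is random.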