
Basic Concepts of Statistics

Introduction

Abel CHEROUAT, Université de Technologie de Troyes


abel.cherouat@utt.fr
Statistics are used in our everyday lives in various ways.
Here are some examples of where we use mean,
median, and mode in everyday situations:

1. Shopping: When we go shopping, we often look at the prices of different products and compare them. The mean price
of the products can give us an idea of the average price range of the products we want to buy.

2. Grades: Students receive grades on their assignments, tests, and exams. The mean, median, or mode of these scores
summarizes their performance and, set against the class results, gives students an idea of how well they did
compared to others in their class.

3. Sports: In sports, we often use statistics to analyze the performance of individual players or teams. Mean, median, and
mode can be used to determine the average points per game, rebounds per game, or assists per game.

4. Health: In the field of medicine, statistics play a crucial role in analyzing health-related data. For example, the average
weight, height, and body mass index (BMI) of a population can be determined using mean, median, and mode.

5. Finance: Statistics are used extensively in finance, including analyzing the performance of stocks, bonds, and other
financial instruments. Mean, median, and mode can be used to determine the average returns or losses of a particular
investment.

6. Public Opinion: In surveys and polls, statistics are used to analyze public opinion on various issues. Mean, median, and
mode can be used to determine the average responses or the most common response to a particular question.
7. Quality Control: Order statistics are used to analyze and evaluate the quality of products in manufacturing
processes. For example, the minimum and maximum values of a set of measurements can be used to determine
whether a particular product meets certain quality standards.

8. Environmental Science: Order statistics are used to analyze extreme events, such as floods, droughts, and heat
waves, in environmental science. For example, the maximum and minimum values of temperature or rainfall in a
particular region can be used to evaluate the risk of such events occurring in the future.
Statistical methods are used to analyze data related to environmental issues, such as climate change, air pollution, and
water quality.

9. Epidemiology: Order statistics are used in epidemiology to study the distribution of diseases in a population. For
example, the k-th highest incidence rate of a disease can be used to estimate the risk of getting the disease.


10. Health: Averaging repeated measurements gives a more accurate reflection of a variable than any single reading.
For example, when measuring blood pressure we often take 3 readings at regular intervals and then calculate their
average. This reduces the effect of an erratic reading, which often happens with home electronic blood pressure
monitors, when an accurate measurement of blood pressure is crucial for a person.

11. Risk aversion: Would you rather win $30 guaranteed or take a 50% chance of winning $100?
In statistics, the expected value of the gamble is 50% * $100 + 50% * $0 = $50.
Someone who is more risk-averse would likely take the guaranteed $30, while someone who is more risk-seeking would take the gamble.
This could extend to non-monetary situations such as human relationships.
In general, the expected value of a choice = the sum, over all possible outcomes of that choice, of the probability of
each outcome * the utility of that outcome.

12. Manufacturing: Suppose a car manufacturer wants to know what percentage of its vehicles experience a particular
mechanical problem within the first year of ownership.
The manufacturer might collect data on a random sample of cars sold in the past year, record whether or not each car
experienced the problem, and then use statistical methods to analyze the data and estimate the overall percentage of
cars with the problem.
Once the manufacturer has this estimate, it can use it to make decisions about how to address the problem. For example,
if the estimate is high, the manufacturer might issue a recall or redesign the affected component.
Alternatively, if the estimate is low, the manufacturer might choose to focus its resources on other areas of improvement.
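
As a rough illustration of items 10 and 11 above, the following Python sketch averages three blood pressure readings and computes the expected value of the two choices. The readings are made-up values used only for illustration, and the helper name `expected_value` is an assumption, not part of the course material.

```python
from statistics import mean

# Hypothetical systolic readings (mmHg) taken a few minutes apart.
readings = [128, 141, 131]
average_reading = mean(readings)

def expected_value(outcomes):
    """Sum of probability * payoff over all (probability, payoff) pairs."""
    return sum(p * payoff for p, payoff in outcomes)

sure_thing = expected_value([(1.0, 30)])          # guaranteed $30
gamble = expected_value([(0.5, 100), (0.5, 0)])   # 50% chance of $100

print(f"Average of the three readings: {average_reading:.1f} mmHg")
print(f"Expected value of the sure $30: ${sure_thing:.2f}")
print(f"Expected value of the gamble:   ${gamble:.2f}")
```

The gamble has the higher expected value, so choosing the guaranteed $30 reflects a preference for certainty rather than a calculation error.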
As for future applications of mathematical statistics, there are many possibilities.
With the increasing availability of data, there will be a growing need for statistical
methods to analyze and make sense of this data.
1. Predictive Analytics: Statistical methods will be used to predict future outcomes based on historical
data, such as predicting the likelihood of a disease outbreak or the success of a marketing campaign.

2. Machine Learning: Statistical methods will be used to develop algorithms and models for machine
learning applications, such as speech recognition, image recognition, and natural language
processing.

3. Personalized Medicine: Statistical methods will be used to analyze large datasets of patient
information to develop personalized treatment plans and predict treatment outcomes.

4. Smart Cities: Statistical methods will be used to analyze data from sensors and other sources in
smart cities to improve urban planning, traffic management, and energy efficiency.

5. Quantum Computing: Statistical methods will play an important role in developing algorithms for
quantum computing, which has the potential to revolutionize many fields, including cryptography,
drug discovery, and materials science.

There are several common statistical terms that are often used in everyday life:

• Average: This refers to the typical or central value in a set of data. It can be calculated using various
measures, such as the mean, median, or mode.

• Probability: This refers to the likelihood or chance of an event occurring. It is often expressed as a
percentage or a fraction.

• Standard deviation: This refers to the amount of variation or spread in a set of data. A smaller standard
deviation indicates that the data is clustered closely around the mean, while a larger standard deviation
indicates that the data is more spread out.

• Confidence interval: This refers to a range of values that is likely to contain the true value of a population
parameter, such as the mean or proportion. It is often used to express the precision of a sample statistic.

• Regression: This refers to a statistical method that is used to model the relationship between two or more
variables. It is often used to predict the value of one variable based on the value of another variable.
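
To make the terms above concrete, here is a minimal Python sketch that computes an average, a standard deviation, an approximate confidence interval, and a simple regression line. The data (hours studied and exam scores for eight students) are invented for illustration, and the 1.96 multiplier assumes a normal approximation; a t-based interval would be somewhat wider for a sample this small.

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical sample: hours studied (x) and exam score (y) for 8 students.
hours = [1, 2, 3, 4, 5, 6, 7, 8]
scores = [52, 55, 61, 64, 70, 71, 78, 83]

# Average and standard deviation of the scores.
avg = mean(scores)
sd = stdev(scores)  # sample standard deviation (n - 1 denominator)

# Approximate 95% confidence interval for the mean score,
# using the normal critical value 1.96.
half_width = 1.96 * sd / sqrt(len(scores))
ci = (avg - half_width, avg + half_width)

# Simple linear regression by least squares:
# score = intercept + slope * hours
x_bar = mean(hours)
slope = (sum((x - x_bar) * (y - avg) for x, y in zip(hours, scores))
         / sum((x - x_bar) ** 2 for x in hours))
intercept = avg - slope * x_bar

print(f"mean = {avg:.2f}, sd = {sd:.2f}")
print(f"approx. 95% CI for the mean: ({ci[0]:.2f}, {ci[1]:.2f})")
print(f"regression line: score = {intercept:.2f} + {slope:.2f} * hours")
```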
TEACHING METHODS
This course enables learners to acquire knowledge of basic statistical ideas, methods and
terminology.

Study of the content of the course enables learners to represent and use statistical data in
graphical, diagrammatic and tabular forms, interpret statistical statements, calculations and
diagrams, perform statistical calculations accurately and acquire knowledge of elementary
ideas in probability.

Python is one of the most popular programming languages for data science. This course introduces
Python within the context of the closely related areas of statistics and data science.

ASSESSMENT AND GRADING


• Homework: 25%
• Project (Python): 30%
• Final exam: 45%
OUTLINE
• Introduction to statistics.
• Descriptive statistics.
• Exercises.
• Statistics and Excel.
• Representing Data.
• Data science and Python.
• Exercises.

Statistics can be used to explain many things like DNA
testing, factors associated with diseases (like cancer or
heart disease) or the lottery.

Statistics are present everywhere in our day-to-day life,
from presidential election polls and weather prediction
probabilities to data science and machine learning.

Statistics is the branch of mathematics that deals with the
collection, organization, analysis, interpretation, and
representation of data.

Machine Learning (or AI), one of the most sought-after
technologies at present, is essentially the statistical analysis
of data to help computers make decisions based on
repeatable characteristics found in the data.
When is statistics actually used in real life?

Example 1: Weather Forecasting
Statistics is used heavily in the field of weather forecasting.
In particular, probability is used by weather forecasters to assess how likely it is
that there will be rain, snow, clouds, etc. on a given day in a certain area.
Forecasters will regularly say things like “there is a 90% chance of rain today
after 5 PM” to indicate that there’s a high likelihood of rain during
certain hours.

Example 2: Sales Tracking


Retail companies often use descriptive statistics like the mean,
median, mode, standard deviation, and interquartile range to track
the sales behavior of certain products.

This gives companies an idea of how many products they can expect
to sell during different time periods and allows them to know how
much they should keep in inventory.
Example 3: Health Insurance
Health insurance companies often use statistics and probability to
determine how likely it is that certain individuals will spend a certain
amount on healthcare each year.
For example, an actuary at a health insurance company might use factors like
age, existing medical conditions, current health status, etc. to determine
that there’s an 80% probability that a certain individual will spend $10,000 or
more on healthcare in a given year.

Example 4: Traffic
Traffic engineers regularly use statistics to monitor total
traffic in different areas of a city, which allows them to
decide whether or not they should add or remove
roads to optimize traffic flow.

Also, traffic engineers often use time series analysis to
monitor how traffic changes throughout the day so
they can optimize the behavior of traffic lights.

Example 5: Investing
Investors use statistics and probability to assess how likely it is that a
certain investment will pay off.
For example, a given investor might determine that there is a 5%
chance that the stock of company A will increase 100x during the
upcoming year.
Based on this probability, they’ll decide how much of their portfolio
to invest in the stock.

Example 6: Medical Studies


Statistics is regularly used in medical studies to understand how
different factors are related.
For example, medical professionals often use correlation to analyze how
factors like weight, height, smoking habits, exercise habits, and
diet are related.
If a certain diet and overall weight are found to be negatively
correlated, a medical professional may recommend the diet to
an individual who needs to lose weight.

Example 7: Manufacturing
Statistics is often used in manufacturing to monitor the efficiency of
different processes.
For example, manufacturing engineers may collect a random sample of
widgets from a certain assembly line and track how many of the
widgets are defective.
They may then perform a one proportion z-test to determine if the
proportion of widgets that are defective is lower than a certain
value that is considered acceptable.
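
The one proportion z-test mentioned above can be sketched in Python as follows. The sample size, defect count, and acceptable rate are hypothetical numbers, and the function name `one_proportion_z_test` is an assumption introduced for this example. The test relies on the normal approximation, which is reasonable only when the sample is large enough.

```python
from math import sqrt
from statistics import NormalDist

def one_proportion_z_test(defects, n, p0):
    """Left-tailed z-test of H0: p >= p0 against H1: p < p0."""
    p_hat = defects / n
    standard_error = sqrt(p0 * (1 - p0) / n)  # uses p0 because H0 is assumed true
    z = (p_hat - p0) / standard_error
    p_value = NormalDist().cdf(z)             # area to the left of z
    return p_hat, z, p_value

# Hypothetical numbers: 12 defective widgets in a random sample of 400,
# tested against an acceptable defect rate of 5%.
p_hat, z, p_value = one_proportion_z_test(defects=12, n=400, p0=0.05)

print(f"sample proportion = {p_hat:.3f}")
print(f"z statistic       = {z:.3f}")
print(f"p-value           = {p_value:.4f}")
```

A small p-value (for instance below a chosen 0.05 significance level) would support the claim that the defect proportion is lower than the acceptable value.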

Example 8: Urban Planning


Statistics is regularly used by urban planners to decide how
many apartments, shops, stores, etc. should be built in a
certain area based on population growth patterns.
For example, if an urban planner sees that the population in a
certain part of the city is growing at an exponential rate
compared to other parts of the city, then they may decide to
prioritize building new apartment complexes in that part of
town over another area.

Statistical Methods in Quality Improvement
Pareto analysis
Pareto analysis identifies the most important quality-related problems to resolve in a process.

Pareto chart
A Pareto chart shows the frequency of occurrences of quality-related problems to highlight those that need
the most attention.

What is the Pareto Principle?


• The Pareto Principle states that 80% of the results are determined by 20% of the causes.
• Therefore, you should try to find the 20% of defect types that cause 80% of all defects.

Example :
Pareto charts are a common tool used by manufacturers to analyze quality and defect data, providing a
simple visual representation as to the frequency of certain issues and the cumulative percentage of their
occurrence.

Type of Defect    Frequency of Defect    % of Total       Cumulative %
Button Defect     23                     23/59 = 39.0     39.0
Pocket Defect     16                     27.1             39.0 + 27.1 = 66.1
Collar Defect     10                     16.9             83.1
Cuff Defect       7                      11.9             94.9
Sleeve Defect     3                      5.1              100.0
Total             59                     -                -

Statistical Analysis:
While the 80/20 rule does not apply perfectly to the example above, focusing on just 2 of the 5 types of defects
(Button and Pocket) has the potential to remove the majority of all defects (66%).
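
The percentages in a Pareto table can be reproduced with a few lines of Python. This is a minimal sketch using the defect counts from the example above; the cumulative percentages are computed from the unrounded counts, so a value may differ in the last digit from a sum of already rounded percentages.

```python
# Defect counts taken from the shirt-defect example above.
defects = {"Button": 23, "Pocket": 16, "Collar": 10, "Cuff": 7, "Sleeve": 3}

total = sum(defects.values())
running = 0
pareto_rows = []

# Sort from most to least frequent, then accumulate the counts.
for name, count in sorted(defects.items(), key=lambda kv: kv[1], reverse=True):
    running += count
    pareto_rows.append((name, count,
                        round(100 * count / total, 1),
                        round(100 * running / total, 1)))

print(f"{'Defect':<8}{'Freq':>6}{'%':>8}{'Cum %':>8}")
for name, count, pct, cum_pct in pareto_rows:
    print(f"{name:<8}{count:>6}{pct:>8.1f}{cum_pct:>8.1f}")
```

Reading down the cumulative column shows how many defect types must be addressed to cover a given share of all defects.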

What is Statistics?
1. Collecting Data
e.g., Sample, Survey, Observe, Simulate
2. Characterizing Data
e.g., Organize/Classify, Count, Summarize
3. Presenting Data
e.g., Tables, Charts, Statements
4. Interpreting Results
e.g., Infer, Conclude, Specify Confidence
Why? Data Analysis -> Decision-Making
Statistics is the science of data: collecting, classifying, summarizing, organizing, analyzing, and
interpreting numerical information.

Why Study Statistics?
1. Numerical information is everywhere.

2. Statistical techniques are used to make decisions that affect our daily lives.

3. The knowledge of statistical methods will help you understand how decisions are made and
give you a better understanding of how they affect you.

4. No matter what line of work you select, you will find yourself faced with decisions where an
understanding of data analysis is helpful.

Statistics is the science of conducting studies to collect, organize, summarize, analyze, present,
interpret and draw conclusions from data.

Steps of Statistical Investigation

Begin with a research question, then proceed with these steps:

1. Produce Data: Determine what to measure, then collect data.

2. Explore the Data: Analyze and summarize the data (also called exploratory data analysis).

3. Draw a Conclusion: Use the data, probability, and statistical inference to draw a conclusion
about the population.

Definition of Statistics:
The science of collection, presentation, analysis, and reasonable interpretation of data.

• Statistics presents a rigorous scientific method for gaining insight into data.

• Example: suppose we measure the weight of 100 patients in a study. With so many
measurements, simply looking at the data fails to provide an informative account.

• However, statistics can give an instant overall picture of the data based on graphical presentation
or numerical summarization, irrespective of the number of data points.

• Besides data summarization, another important task of statistics is to make inferences (steps in
reasoning, moving from premises to logical consequences; etymologically, the word infer means to carry
forward) and to predict relations between variables.

Definition of Data:
Facts or figures, numerical or otherwise, collected with a definite purpose are called
data.

• Every day we come across a lot of information in the form of facts, numerical figures, tables,
graphs, etc.

• These are provided by newspapers, television, magazines and other means of
communication.

• These may relate to profits of a company, temperatures of cities, expenditures in various sectors
of a five-year plan, polling results, and so on.

Types of Research Questions

Type of Research Question: Make an estimate about the population (often an estimate about an average value or
a proportion with a given characteristic)
Example: What proportion of all college students are enrolled at a community college?
Population / Cause and Effect: Population
Type of Study: Observational Study

Type of Research Question: Test a claim about the population (often a claim about an average value or a
proportion with a given characteristic)
Example: Do the majority of community college students qualify for federal student loans?
Population / Cause and Effect: Population
Type of Study: Observational Study

Type of Research Question: Compare two populations (often a comparison of population averages or proportions
with a given characteristic)
Example: Are college athletes more likely than nonathletes to receive academic advising?
Population / Cause and Effect: Population
Type of Study: Observational Study

Type of Research Question: Investigate a relationship between two variables in the population
Example: Is academic counseling associated with quicker completion of a college degree?
Population / Cause and Effect: Population
Type of Study: Observational Study

Type of Research Question: Test cause and effect
Example: Does drinking red wine lower the risk of a heart attack?
Population / Cause and Effect: Cause and Effect
Type of Study: Experiment

Different Models of Statistics
Statistics is a broad term, and different statistical models and measures are used for different
purposes:

Skewness - a measure of the asymmetry of a probability distribution; it measures how far the
distribution of the data deviates from a symmetric curve such as the normal distribution.

ANOVA Statistics - used to evaluate the difference between the means of more than two groups.

Regression Analysis - determines the relationship between the variables.

Analysts use the ANOVA test to determine the impact of independent variables on the dependent variable.
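
As a small illustration of the skewness measure described above, the sketch below computes the moment coefficient of skewness (the mean of the cubed z-scores). The two data sets are invented, and other skewness formulas (for example, bias-corrected sample versions) give slightly different values.

```python
from statistics import mean, pstdev

def skewness(values):
    """Population (moment) skewness: the mean of the cubed z-scores."""
    m = mean(values)
    s = pstdev(values)
    return sum(((v - m) / s) ** 3 for v in values) / len(values)

symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 2, 2, 3, 15]  # one large value stretches the right tail

print(f"symmetric data:    {skewness(symmetric):.3f}")     # close to 0
print(f"right-skewed data: {skewness(right_skewed):.3f}")  # positive
```

A value near zero suggests a roughly symmetric distribution, a positive value a longer right tail, and a negative value a longer left tail.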
EXERCISE

Datasets and Data Tables

Dataset:
Data for a set of variables collected on a group of persons.

Data Table:
A dataset organized into a table, with one column for each variable and one row for each person.

Typical Data Table

OBS   AGE   BMI    FFNUM   TEMP(°F)   GENDER   LEVEL   QUESTION
1     26    23.2   0       61.0       0        1       1
2     30    30.2   9       65.5       1        3       2
3     32    28.9   17      59.6       1        3       4
4     37    22.4   1       68.4       1        2       3
5     33    25.5   7       64.5       0        3       5
6     29    22.3   1       70.2       0        2       2
7     32    23.0   0       67.3       0        1       1
8     33    26.3   1       72.8       0        3       1
9     32    22.2   3       71.5       0        1       4
10    33    29.1   5       63.2       1        1       4
11    26    20.8   2       69.1       0        1       3
12    34    20.9   4       73.6       0        2       3
13    31    36.3   1       66.3       0        2       5
14    31    36.4   0       66.9       1        1       5
15    27    28.6   2       70.2       1        2       2
16    36    27.5   2       68.5       1        3       3
17    35    25.6   143     67.8       1        3       4
18    31    21.2   11      70.7       1        1       2
19    36    22.7   8       69.8       0        2       1
20    33    28.1   3       67.8       0        2       1

Basic Concepts

Data:
Set of values of one or more variables recorded on one or more observational units

Sources of data:
1. Routinely kept records
2. Surveys (census)
3. Experiments
4. External source

Categories of data:
1. Primary data: observation, questionnaire, record form, interviews, survey
2. Secondary data: web, census, medical record, registry

Primary Data Vs Secondary Data

Primary Data
• Primary data is data that is collected for the first time through personal experiences or evidence,
particularly for research.

• It is also described as raw data or first-hand information.

• Collecting this kind of information is costly.

• The data is mostly collected through observations, physical testing, mailed questionnaires, surveys,
personal interviews, telephone interviews, case studies, focus groups, etc.

Secondary Data
• Secondary data is second-hand data that has already been collected and recorded by other researchers for
their own purpose, and not for the current research problem.

• It is accessible in the form of data collected from different sources such as government publications,
censuses, internal records of the organisation, books, journal articles, websites, reports, etc.

• This method of gathering data is affordable, readily available, and saves cost and time.

• However, one disadvantage is that the information was assembled for some other purpose and may
not meet the present research purpose or may not be accurate.

Discrete Vs continuous data
• Discrete data (countable) is information that can only take certain values. These values don’t
have to be whole numbers, but they are fixed values, such as shoe size, number of teeth,
number of kids, etc.

• Discrete variables are most often counts: finite, numeric, countable, non-negative
integers (5, 10, 15, and so on).

• Continuous data (measurable) is data that can take any value within a range. Height, weight, temperature
and length are all examples of continuous data.

• A continuous variable can also change over time and take different values at different time intervals,
like the weight of a person.
Definitions for Variables

• AGE: Age in years
• BMI: Body mass index, weight/height² in kg/m²
• FFNUM: The average number of times eating “fast food” in a week
• TEMP: High temperature for the day (°F)
• GENDER: 1- Female 0- Male
• LEVEL (exercise level): 1- Low 2- Medium 3- High
• QUESTION: What is your satisfaction rating for this Biostatistics session?
1- Very Satisfied 2- Somewhat Satisfied 3- Neutral 4- Somewhat Dissatisfied 5- Dissatisfied
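
The coded variables defined above can be decoded and summarized in Python. The sketch below is only an illustration: it uses the first five rows of the data table, and the names `COLUMNS`, `ROWS`, `GENDER_LABELS`, and `LEVEL_LABELS` are assumptions introduced for this example.

```python
from collections import Counter
from statistics import mean, median

COLUMNS = ("OBS", "AGE", "BMI", "FFNUM", "TEMP", "GENDER", "LEVEL", "QUESTION")

# First five rows of the data table, copied as-is.
ROWS = [
    (1, 26, 23.2, 0, 61.0, 0, 1, 1),
    (2, 30, 30.2, 9, 65.5, 1, 3, 2),
    (3, 32, 28.9, 17, 59.6, 1, 3, 4),
    (4, 37, 22.4, 1, 68.4, 1, 2, 3),
    (5, 33, 25.5, 7, 64.5, 0, 3, 5),
]

# Code books taken from the variable definitions.
GENDER_LABELS = {0: "Male", 1: "Female"}
LEVEL_LABELS = {1: "Low", 2: "Medium", 3: "High"}

# One dictionary per person: column name -> value.
records = [dict(zip(COLUMNS, row)) for row in ROWS]

# Numeric variables are summarized with a mean or median.
mean_bmi = mean(r["BMI"] for r in records)
median_age = median(r["AGE"] for r in records)

# Coded categorical variables are summarized with frequency counts.
gender_counts = Counter(GENDER_LABELS[r["GENDER"]] for r in records)
level_counts = Counter(LEVEL_LABELS[r["LEVEL"]] for r in records)

print(f"mean BMI = {mean_bmi:.2f}, median age = {median_age}")
print("gender:", dict(gender_counts))
print("exercise level:", dict(level_counts))
```

Note that GENDER, LEVEL, and QUESTION are stored as numbers, but they are codes for categories, so averaging them would not be meaningful.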
Terminology
• Categorical Variables
• Quantitative Variables
• Nominal Variables
• Ordinal Variables
• Binary Data
• Discrete and Continuous Data
• Interval and Ratio Variables
• Qualitative and Quantitative Traits/Characteristics of Data

Scales shown in the figure, top to bottom: Ratio, Interval, Ordinal, Nominal
Types of variables and data

Data
   Qualitative
      Numerical: Nominal, Ordinal
      Nonnumerical: Nominal, Ordinal
   Quantitative
      Numerical: Interval, Ratio


DATA: Variable - any characteristic of an individual or entity.
A variable can take different values for different individuals. Variables can be categorical or quantitative.

• Nominal - Categorical variables with no inherent order or ranking sequence, such as names or classes
(e.g., gender). Values may be written as numerals, but they carry no numerical meaning (e.g., I, II, III). The only
operation that can be applied to Nominal variables is counting how many observations fall in each category.

• Ordinal - Variables with an inherent rank or order, e.g. mild, moderate, severe. Values can be compared for
equality, or greater or less, but not how much greater or less.

• Interval - Values of the variable are ordered as in Ordinal, and additionally, differences between values
are meaningful. However, the scale has no absolute zero. Calendar dates and temperatures on the
Fahrenheit scale are examples. Addition and subtraction, but not multiplication and division, are meaningful
operations.

• Ratio - Variables with all properties of Interval plus an absolute, non-arbitrary zero point, e.g. age, weight,
temperature (Kelvin). Addition, subtraction, multiplication, and division are all meaningful operations.
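
The difference between these scales can be sketched in Python. The severity ranking and the temperatures below are illustrative values, and the helper `fahrenheit_to_kelvin` is an assumption introduced for this example.

```python
# Ordinal: categories can be ranked, but differences are not meaningful.
SEVERITY_RANK = {"mild": 1, "moderate": 2, "severe": 3}
patients = ["severe", "mild", "moderate", "mild"]
ranked = sorted(patients, key=SEVERITY_RANK.get)

# Interval: differences are meaningful, ratios are not (arbitrary zero).
def fahrenheit_to_kelvin(temp_f):
    return (temp_f - 32) * 5 / 9 + 273.15

low_f, high_f = 40.0, 80.0
difference_f = high_f - low_f   # a meaningful difference of 40 degrees
naive_ratio = high_f / low_f    # 2.0, but 80 °F is not "twice as hot" as 40 °F

# Ratio: the Kelvin scale has an absolute zero, so the ratio is meaningful.
true_ratio = fahrenheit_to_kelvin(high_f) / fahrenheit_to_kelvin(low_f)

print("ranked severities:", ranked)
print(f"difference in °F: {difference_f}")
print(f"ratio of °F values: {naive_ratio:.2f}")
print(f"ratio of Kelvin values: {true_ratio:.2f}")
```

The two ratios disagree because the Fahrenheit zero is arbitrary, which is why multiplication and division are reserved for ratio-scale variables.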
Qualitative analysis contrasts with quantitative analysis, which focuses on
numbers found in reports such as balance sheets.
Qualitative data

• The objects being studied are grouped into categories based on some qualitative trait.
• The resulting data are merely labels or categories: Categorical Data

Categorical data
   Nominal data
   Ordinal data

Examples:
• Eye color: blue, brown, black, green, etc.
• Smoking status: smoker, non-smoker
• Attitudes towards the death penalty: strongly disagree, disagree, neutral, agree, strongly agree.
Nominal data

• A type of categorical data in which objects fall into unordered categories.

• Studies measuring nominal data must ensure that the categories are mutually exclusive and that the
system of measurement is exhaustive.

• Variables that have only two possible responses, e.g. yes or no, are known as dichotomies.
Examples of Nominal Data

• Type of car
BMW, Mercedes, Lexus, Toyota, Renault, Ford, etc.

• Ethnicity
White British, Afro-Caribbean, Asian, Arab, Chinese, other, etc.

• Smoking status
Smoker, non-smoker
Binary Data

• A type of categorical data in which there are only two categories.

Examples:
• Smoking status - smoker, non-smoker
• Attendance - present, absent
• Result of an exam - pass, fail
• Status of student - undergraduate, postgraduate
Ordinal data

• Ordinal data is data that comprises categories that can be rank ordered.

• As with nominal data, the distance between categories cannot be calculated, but the
categories can be ranked above or below each other.
Examples of Ordinal Data

• Grades in exam - A+, A, B+, B, C+, C, D+, D, and fail.

• Degree of illness - none, mild, moderate, acute, chronic.

• Opinion of students about stats classes -
Very unhappy, unhappy, neutral, happy, ecstatic!
Nominal data (Binary) & Ordinal data

Examples

What is your gender? (please tick)
• Male
• Female

Did you enjoy the teaching session? (please tick)
• Yes
• No

What is your level of satisfaction with the new curriculum at the medical school? (please tick)
• Very satisfied
• Somewhat satisfied
• Neutral
• Somewhat dissatisfied
• Very dissatisfied
Quantitative Data

• The objects being studied are ‘measured’ based on some quantitative trait.
• The resulting data are a set of numbers.

Examples
• Pulse Rate
• Height
• Age
• Exam marks
• Time to complete a statistics test
• Number of cigarettes smoked

Quantitative data
   Discrete
   Continuous

Discrete Data
Only certain values are possible (there are gaps between the possible values). Implies counting.

Continuous Data
Theoretically, with a fine enough measuring device, any value is possible (there are no gaps between
the possible values). Implies measuring.
Examples of Discrete Data

• Number of children in a family
• Number of students passing a stats exam
• Number of crimes reported to the police
• Number of bicycles sold in a day.

Generally, discrete data are counts.
We would not expect to find 2.2 children in a family or 88.5 students passing an exam or 127.2 crimes being
reported to the police or half a bicycle being sold in one day.
Examples of Continuous Data
• Age (in years)
• Height (in cm)
• Weight (in kg)
• Systolic blood pressure (Sys. BP), hemoglobin (Hb), etc.

Generally, continuous data come from measurements.
Quant vs. Qual

Statistical Description of Data

• Statistics describes a numeric set of data by its
   • Center
   • Variability
   • Shape
• Statistics describes a categorical set of data by
   • Frequency, percentage or proportion of each category

Summary Measures in Descriptive Statistics

Summary Measures
   Central Tendency: Mean, Median, Mode, Midrange
   Quartile
   Variation: Range, Variance, Standard Deviation, Coefficient of Variation
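
The summary measures listed above can be computed with Python's standard `statistics` module. The sketch below uses the AGE column of the 20-row data table shown earlier in these slides. It treats the data as a sample (n - 1 denominator), and the quartile values depend on the convention used, so other software may report slightly different cut points.

```python
from statistics import mean, median, multimode, quantiles, stdev, variance

# AGE column of the 20-row data table shown earlier in these slides.
ages = [26, 30, 32, 37, 33, 29, 32, 33, 32, 33,
        26, 34, 31, 31, 27, 36, 35, 31, 36, 33]

# Central tendency
age_mean = mean(ages)
age_median = median(ages)
age_modes = multimode(ages)                  # a list, in case of ties
age_midrange = (min(ages) + max(ages)) / 2

# Quartiles (Q1, Q2, Q3) using the module's default method
q1, q2, q3 = quantiles(ages, n=4)

# Variation
age_range = max(ages) - min(ages)
age_variance = variance(ages)                # sample variance (n - 1)
age_sd = stdev(ages)                         # sample standard deviation
age_cv = age_sd / age_mean * 100             # coefficient of variation, in %

print(f"mean = {age_mean:.2f}, median = {age_median}, mode(s) = {age_modes}")
print(f"midrange = {age_midrange}, range = {age_range}")
print(f"Q1 = {q1}, Q2 = {q2}, Q3 = {q3}")
print(f"variance = {age_variance:.2f}, sd = {age_sd:.2f}, CV = {age_cv:.1f}%")
```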
