FDSA Unit-2


UNIT II DESCRIPTIVE ANALYTICS

Frequency distributions – Outliers – Interpreting distributions – Graphs – Averages –
Describing variability – Interquartile range – Variability for qualitative and ranked data –
Normal distributions – z scores – Correlation – Scatter plots – Regression –
Regression line – Least squares regression line – Standard error of estimate –
Interpretation of r² – Multiple regression equations – Regression toward the mean.
2.1 Frequency distributions

What is a frequency distribution?

The frequency of a value is the number of times it occurs in a dataset. A frequency distribution is the pattern of frequencies of a variable. It’s the number of times each possible value of a variable occurs in a dataset.
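
As a minimal sketch of this definition, the frequency of each value can be counted with `collections.Counter` from the Python standard library. The list of observations below is hypothetical and serves only as an illustration.

```python
from collections import Counter

# Hypothetical observations (illustrative values, not from the notes)
observations = ["red", "blue", "red", "green", "blue", "red"]

# Frequency of a value = number of times it occurs in the dataset
frequency = Counter(observations)

# Print the values from most to least frequent
for value, count in frequency.most_common():
    print(f"{value}: {count}")
```

The resulting mapping from value to count is an ungrouped frequency distribution.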

Types of frequency distributions

1. Ungrouped frequency distributions

2. Grouped frequency distributions

3. Relative frequency distributions

4. Cumulative frequency distributions

2.1.1 Ungrouped frequency distributions

- For categorical values (ordinal or nominal)


Example: Making an ungrouped frequency table

A gardener set up a bird feeder in their backyard. To help them decide how much and what type of birdseed to buy, they decide to record the bird species that visit their feeder. Over the course of one morning, the following birds visit their feeder:

2.1.2 Grouped frequency distributions

Example: Grouped frequency distribution

A sociologist conducted a survey of 20 adults. She wants to report the frequency distribution of the ages of the survey
respondents. The respondents were the following ages in years:

52, 34, 32, 29, 63, 40, 46, 54, 36, 36, 24, 19, 45, 20, 28, 29, 38, 33, 49, 37
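
One possible way to tabulate these ages is sketched below. The class width of 10 and the choice to start the first class at the minimum age are assumptions made for illustration; other class layouts are equally valid. The sketch also prints the relative and cumulative frequency of each class.

```python
ages = [52, 34, 32, 29, 63, 40, 46, 54, 36, 36,
        24, 19, 45, 20, 28, 29, 38, 33, 49, 37]

# Assumed class layout: width 10, first class starts at the minimum age
width = 10
start = min(ages)

# Count how many ages fall in each class [lower, lower + width)
counts = {}
lower = start
while lower <= max(ages):
    counts[lower] = sum(lower <= a < lower + width for a in ages)
    lower += width

n = len(ages)
cumulative = 0
print("class   freq  relative  cumulative")
for lower, freq in counts.items():
    cumulative += freq  # running total of frequencies
    label = f"{lower}-{lower + width - 1}"
    print(f"{label:<7} {freq:>4}  {freq / n:>8.2f}  {cumulative:>10}")
```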

2.1.3 Relative frequency distributions



2.1.4 Cumulative frequency distributions



2.1.5 How to graph a frequency distribution


2.2 Outliers
In statistics, an outlier is a data point that significantly deviates from the rest of the data.

Example Data: 60,64,66,70,72,72,74,90


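Python program (box plot of the example data):

import matplotlib.pyplot as plt

# Example data; 90 lies far from the other values
x = [60, 64, 66, 70, 72, 72, 74, 90]
plt.boxplot(x)  # points beyond the whiskers are drawn as outliers
plt.show()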
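
The same data can also be checked numerically. The sketch below assumes Tukey's 1.5 × IQR rule, a common convention for flagging outliers, and uses the default quartile method of `statistics.quantiles`. Other quartile conventions give slightly different fences.

```python
import statistics

x = [60, 64, 66, 70, 72, 72, 74, 90]

# First and third quartiles (default method; other conventions exist)
q1, _, q3 = statistics.quantiles(x, n=4)
iqr = q3 - q1

# Tukey's rule: flag points more than 1.5 * IQR beyond the quartiles
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [v for v in x if v < lower_fence or v > upper_fence]

print(outliers)
```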

2.2.1 Causes of outliers:

● Variability in measurement: Maybe the instrument used to collect data wasn't calibrated properly.
● Novel data: It could represent a genuine new discovery that doesn't fit the current understanding.
● Error: Sometimes, mistakes happen during data collection or analysis.
2.3 interpreting distributions

A data distribution shows how data points are spread across the possible values: where they cluster, how widely they vary, and whether they lean to one side.
Knowing the distribution helps you understand and predict data, make informed decisions, and choose the right analysis methods.
Types of Distributions

1. Discrete Distributions: These describe data that can only take on certain distinct values, like the number of heads in 5 coin
flips (0, 1, 2, 3, 4, or 5). Some common discrete distributions include:

Bernoulli distribution: Describes the probability of success or failure in a single trial (e.g., coin toss).
Binomial distribution: Describes the probability of getting a certain number of successes in a fixed number of trials
(e.g., getting 3 heads in 5 coin flips).
Poisson distribution: Describes the probability of a certain number of events occurring in a fixed interval of time or
space (e.g., the number of customers arriving at a store in an hour).

2. Continuous Distributions: These describe data that can take on any value within a continuous range, like the height of
people (any value between 0 and, say, 3 meters). Some common continuous distributions include:

Normal distribution (bell curve): Describes data that is symmetric and bell-shaped, with most values clustered
around the mean (e.g., heights of people).
Uniform distribution: Describes data where all values within a certain range are equally likely (e.g., random numbers
between 0 and 1).
Exponential distribution: Describes the time between events in a Poisson process (e.g., the time between customer
arrivals at a store).
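
As a minimal sketch, the binomial and Poisson probabilities described above can be computed directly from their standard formulas. The function names and the store's arrival rate of 4 customers per hour are assumptions made for illustration.

```python
from math import comb, exp, factorial

def binomial_pmf(k, n, p):
    """P(exactly k successes in n independent trials, success probability p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def poisson_pmf(k, rate):
    """P(exactly k events in an interval with the given average rate)."""
    return rate**k * exp(-rate) / factorial(k)

# Probability of getting 3 heads in 5 fair coin flips
p_three_heads = binomial_pmf(3, 5, 0.5)
print(p_three_heads)

# Hypothetical store averaging 4 customers per hour: P(exactly 2 arrive)
p_two_customers = poisson_pmf(2, 4)
print(round(p_two_customers, 4))
```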
The normal distribution is commonly used in machine learning and data science.

Types of Skewness
The three types of skewness are:

● Positive skewness (Mean > Median > Mode)
● Negative skewness (Mode > Median > Mean)
● Zero skewness (Mean = Median = Mode), as in a normal distribution
2.3 interpreting distributions (Practice Problems)

Problem 1: Find the skewness of the data (2, 4, 6, 6).

Mean = 4.5
Median = 5
Mode = 6
S.D. = 1.915 (sample standard deviation, dividing by n − 1)
Skewness = 3(Mean − Median)/S.D. = 3(4.5 − 5)/1.915
Skewness ≈ −0.783
So, the skewness of these data is negative.
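
The calculation can be sketched in Python with the standard `statistics` module. This sketch assumes the sample standard deviation (dividing by n − 1). Choosing a different standard deviation changes the size of the result but not its sign.

```python
import statistics

data = [2, 4, 6, 6]

mean = statistics.mean(data)
median = statistics.median(data)
sd = statistics.stdev(data)  # sample standard deviation (n - 1)

# Pearson's median skewness: 3 * (mean - median) / standard deviation
skewness = 3 * (mean - median) / sd
print(round(skewness, 3))
```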

Problem 2: A local library tracks the number of books borrowed by patrons each month. The mean number of books borrowed is 5 and the median is 3. Analyze the skewness of the book borrowing data.

We can't calculate the exact skewness value because the standard deviation is not given.
● Mean (5) is higher than Median (3): This suggests a positive skew.
Rules:
● Mean > Median: Suggests a positive skew, with a longer tail towards higher values.
● Mean < Median: Suggests a negative skew, with a longer tail towards lower values.
● Mean ≈ Median: Indicates a relatively symmetrical distribution.

Problem: A bakery analyzed its daily bread sales for the past week. They found the following:

● Mean daily sales: 25 loaves


● Median daily sales: 20 loaves
Analyze the skewness of the daily bread sales data.
Solutions:
1. Mean > Median: This is the key observation. The higher mean compared to the median suggests a distribution where the
"tail" extends towards higher values.
2. Positive Skew Interpretation: In a positively skewed distribution, a few high values can significantly influence the
average (mean) upwards, while the majority of data points might fall around the median. This aligns with the bakery
scenario where occasional high sales days could pull the mean up compared to the more typical daily sales reflected in
the median.
Limitations:

We can't calculate the exact skewness value due to missing information about the standard deviation.
2.4 Graphs
Graphs present data visually so that patterns, trends, and relationships are easier to see than in raw numbers.
The following types of graphs are commonly used in data science:

1. Line Graph:

● Purpose: Shows trends and changes over time.


● Use cases: Tracking stock prices, website traffic, weather patterns, etc.

2. Bar Graph:

● Purpose: Compares categories or discrete data points.


● Use cases: Comparing sales figures across different products, etc.
3. Scatter Plot:

● Purpose: Identifies relationships between two continuous variables.


● Use cases: Exploring the correlation between advertising spending and sales,
customer age and purchase amount, etc.

4. Histogram:

● Purpose: Visualizes the distribution of continuous data.


● Use cases: Understanding the spread and shape of data like income levels,
test scores, customer age, etc.

5. Pie Chart:

● Purpose: Shows the composition of a whole, highlighting the proportion of different categories.
● Use cases: Representing market share, budget allocation, customer preferences, etc.
6. Box Plot:

● Purpose: Summarizes the distribution of a dataset, showing the median, quartiles, and outliers.
● Use cases: Comparing the distribution of salaries across different departments, analyzing customer satisfaction ratings, etc.

7. Heatmap:

● Purpose: Visualizes data points as color intensities within a matrix, revealing patterns and relationships.
● Use cases: Analyzing correlations between genes, exploring stock market trends, visualizing customer sentiment on a product feature matrix, etc.

8. Network Graph:

● Purpose: Represents entities (nodes) and their relationships (edges), often used for social network analysis, knowledge graphs, etc.
● Use cases: Analyzing relationships between people in a social network, exploring connections between products in a recommendation system, visualizing the flow of information in a knowledge base, etc.
2.5 Averages (Mean, Median, Mode)

The mean, or average, is calculated by adding all the values in the data set and then dividing by the number of values. For example, the sum of the data set (2, 3, 4, 5, 6) is 20, so the mean is 20/5 = 4.
The median is the middle value of the sorted data (4 in this example), and the mode is the value that occurs most often.
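
A short sketch of the three averages using Python's standard `statistics` module follows. The first data set is the one from the example. The second data set, used for the mode, is hypothetical, because the example data has no repeated value.

```python
import statistics

data = [2, 3, 4, 5, 6]

mean = statistics.mean(data)      # sum of values / number of values
median = statistics.median(data)  # middle value of the sorted data
print(mean, median)

# The mode needs repeated values; this second data set is hypothetical
scores = [2, 4, 6, 6]
mode = statistics.mode(scores)    # most frequent value
print(mode)
```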

2.6 describing variability (Range, Interquartile Range, Variance, Standard Deviation)

Let's use the following example data set of exam scores: 78, 82, 85, 88, 90, 92, 95, 98, 100.

Range:

The range is the difference between the highest and lowest values in the data set.

● Calculation: Range = Highest value - Lowest value


● Solution: Range = 100 - 78 = 22
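Interquartile Range (IQR):

The IQR is the difference between the third quartile (Q3) and the first quartile (Q1): IQR = Q3 − Q1. It describes the spread of the middle 50% of the data.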
Variance and Standard deviation

1. Variance is the average squared deviation of the data points from the mean. It measures how far, on average, the data lie from the mean, in squared units.
2. Standard deviation is the square root of the variance. It measures the spread of the data in the same units as the original data, which makes it easier to interpret than the variance.
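
The measures in this section can be computed for the exam scores as sketched below. Two assumptions are made here: the quartiles use the default method of `statistics.quantiles` (other textbook conventions can give slightly different quartiles), and the scores are treated as a whole population, so the variance divides by n.

```python
import statistics

scores = [78, 82, 85, 88, 90, 92, 95, 98, 100]

# Range: highest value minus lowest value
data_range = max(scores) - min(scores)

# Quartiles via the default ("exclusive") method; other conventions exist
q1, q2, q3 = statistics.quantiles(scores, n=4)
iqr = q3 - q1

# Population variance: average squared deviation from the mean
variance = statistics.pvariance(scores)
std_dev = statistics.pstdev(scores)  # square root of the variance

print(data_range, iqr, round(variance, 2), round(std_dev, 2))
```

For a sample rather than a population, `statistics.variance` and `statistics.stdev` divide by n − 1 instead.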
2.7 variability for qualitative and ranked data
Example: Variability in normal distributions

You are investigating the amounts of time spent on phones daily by different groups of people.

Using simple random samples, you collect data from 3 groups:

● Sample A: high school students,


● Sample B: college students,
● Sample C: adult full-time employees.
2.8 Normal Distributions
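The normal distribution, also known as the Gaussian distribution, is a bell-shaped curve that describes data where most values cluster around the center (mean).

Probability density function (PDF) formula:

$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$

where μ is the mean and σ is the standard deviation.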
Example: Calculate the probability density of the normal distribution at x = 3, given μ = 4 and σ = 2.
Solution: Given variable x = 3,
Mean μ = 4 and
Standard deviation σ = 2.
By the formula for the probability density of the normal distribution, we can write:

$f(3) = \frac{1}{2\sqrt{2\pi}} \, e^{-\frac{(3-4)^2}{2 \cdot 2^2}} = \frac{1}{5.0133} \, e^{-0.125} \approx 0.1995 \times 0.8825 \approx 0.1760$

Hence, f(3; 4, 2) ≈ 0.1760.
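
As a check on this kind of calculation, the sketch below evaluates the normal density formula at x = 3 with μ = 4 and σ = 2, and compares the result with `statistics.NormalDist` from the standard library. The function name `normal_pdf` is chosen here for illustration.

```python
import math
from statistics import NormalDist

def normal_pdf(x, mu, sigma):
    """Density of the normal distribution with mean mu and SD sigma at x."""
    coefficient = 1 / (sigma * math.sqrt(2 * math.pi))
    exponent = -((x - mu) ** 2) / (2 * sigma**2)
    return coefficient * math.exp(exponent)

density = normal_pdf(3, 4, 2)
print(round(density, 4))

# Cross-check against the standard library implementation
assert math.isclose(density, NormalDist(mu=4, sigma=2).pdf(3))
```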


2.9 Z-score and Standard Normal Distribution
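A z-score measures how many standard deviations a value x lies from the mean: z = (x − μ)/σ. The standard normal distribution is the distribution of z-scores of normally distributed data; it has mean 0 and standard deviation 1.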
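
A z-score is usually computed as z = (x − mean) / standard deviation. The sketch below applies that standard formula to hypothetical numbers chosen for illustration.

```python
def z_score(x, mean, std_dev):
    """Number of standard deviations that x lies from the mean."""
    return (x - mean) / std_dev

# Hypothetical example: a score of 85 when the mean is 70 and the SD is 10
z = z_score(85, 70, 10)
print(z)
```

A positive z-score means the value lies above the mean, and a negative one means it lies below.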
2.9 Normal Distribution practice problems
Problem 1: For some computers, the time period between charges of the battery is normally distributed with
a mean of 50 hours and a standard deviation of 15 hours. Rohan has one of these computers and needs to
know the probability that the time period will be between 50 and 70 hours.

Problem 2: The speeds of cars on a motorway are measured using a radar unit. The speeds are normally distributed with a mean of 90 km/hr and a standard deviation of 10 km/hr. What is the probability that a car selected at random is moving at more than 100 km/hr?

Problem 3: A factory produces widgets, and the weights of the widgets are normally distributed with a mean of 100 grams and a standard deviation of 5 grams. (a) What percentage of widgets fall within 5 grams of the mean (between 95 and 105 grams)? (b) What is the probability that a randomly chosen widget weighs less than 80 grams?

Problem 4: A company sells running shoes. They know that the shoe size for their target market is normally distributed with an average size of 9 (US) and a standard deviation of 1.5. They recently received a shipment of 1000 new shoes. (a) How many shoes can they expect to be larger than size 11 (US)? (b) They want to offer a discount on shoes that are unlikely to sell due to size. Below what size should they discount if they want to target the bottom 10% of shoe sizes?
(Ans. a: z = (11 − 9)/1.5 ≈ 1.33 and P(Z > 1.33) = 1 − 0.9082 = 0.0918, so 0.0918 × 1000 ≈ 92 shoes.)
(Ans. b: The z-score with a cumulative area of 0.10 (10%) is approximately −1.28. Converting back to shoe size: −1.28 × 1.5 + 9 = 7.08, so sizes of about 7 and below.)
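
The four problems can be checked with `statistics.NormalDist` from the Python standard library, as sketched below. The variable names are chosen here for illustration. Exact values can differ slightly from table lookups that round z to two decimals.

```python
from statistics import NormalDist

# Problem 1: P(50 < X < 70), battery time ~ N(mean 50, SD 15)
battery = NormalDist(mu=50, sigma=15)
p1 = battery.cdf(70) - battery.cdf(50)

# Problem 2: P(X > 100), car speed ~ N(mean 90, SD 10)
p2 = 1 - NormalDist(mu=90, sigma=10).cdf(100)

# Problem 3: P(95 < X < 105) and P(X < 80), weight ~ N(mean 100, SD 5)
widget = NormalDist(mu=100, sigma=5)
p3_within = widget.cdf(105) - widget.cdf(95)
p3_below_80 = widget.cdf(80)

# Problem 4: shoe size ~ N(mean 9, SD 1.5), shipment of 1000 shoes
shoe = NormalDist(mu=9, sigma=1.5)
expected_large = 1000 * (1 - shoe.cdf(11))  # expected count above size 11
bottom_10_cutoff = shoe.inv_cdf(0.10)       # size below which 10% fall

print(f"Problem 1: {p1:.4f}")
print(f"Problem 2: {p2:.4f}")
print(f"Problem 3: {p3_within:.4f}, {p3_below_80:.6f}")
print(f"Problem 4: {expected_large:.1f} shoes, cutoff size {bottom_10_cutoff:.2f}")
```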
2.10 Correlation and Scatter Plots

Correlation is a statistical measure of the strength and direction of the relationship between two variables.
Ex: measuring how closely the dance moves of two friends, Alice and Bob, match each other.
2.10 Correlation (−1 ≤ r ≤ 1)
Example: Determine the correlation coefficient for the following data

Note: Draw a scatter plot for the above data.
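
The Pearson correlation coefficient can be computed from the deviations of each variable from its mean. The sketch below uses hypothetical paired data, not the data set from the example, and a helper named `pearson_r` chosen for illustration.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of products and squares of deviations from the means
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical paired data (not the data set from the example)
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

r = pearson_r(x, y)
print(round(r, 4))
```

A value of r near +1 or −1 indicates a strong linear relationship, and a value near 0 indicates a weak one.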


2.11 Regression
Regression is a statistical method that helps us to analyze and understand the relationship between
a dependent variable and one or more independent variables.
Ex: predicting Alice's next move based on how Bob is moving
2.12 Regression Line
❖ A regression line is a straight line that describes how a response variable y changes as an
explanatory variable x changes.
❖ A regression line can be used to predict the value of y for a given value of x.
2.12 least squares regression line
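The least squares regression line, ŷ = a + bx, minimizes the sum of the squared vertical distances of the points from the line, hence the phrase “least squares.”

Find the slope b from the deviations of x and y from their means (x − x̄ and y − ȳ):

$b = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2}$

Find a by substituting the means into y = a + bx:

$a = \bar{y} - b\,\bar{x}$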
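
A minimal sketch of the least squares calculation follows: the slope b is the sum of products of deviations divided by the sum of squared x deviations, and the intercept is a = ȳ − b·x̄. The data and the function name `least_squares_line` are hypothetical.

```python
def least_squares_line(xs, ys):
    """Return (a, b) for the least squares line y_hat = a + b * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: sum of products of deviations / sum of squared x deviations
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    # Intercept: the line passes through the point of means
    a = mean_y - b * mean_x
    return a, b

# Hypothetical data for illustration
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

a, b = least_squares_line(x, y)
print(f"y_hat = {a:.2f} + {b:.2f}x")
```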
2.13 standard error of estimate (Sigma)
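Find the sum of the squared errors, SSE = Σ(y − ŷ)², where ŷ is the value predicted by the regression line. For a sample of n points, the standard error of estimate is then √(SSE / (n − 2)).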
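
The sketch below computes the SSE and then the standard error of estimate for hypothetical data. It assumes the sample form of the formula, which divides SSE by n − 2, and it takes the fitted line ŷ = 2.2 + 0.6x as given for this data.

```python
import math

# Hypothetical data and its least squares line y_hat = 2.2 + 0.6x
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
a, b = 2.2, 0.6

# Sum of squared errors between observed and predicted values
sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))

# Standard error of estimate for a sample: sqrt(SSE / (n - 2))
n = len(x)
standard_error = math.sqrt(sse / (n - 2))

print(round(sse, 2), round(standard_error, 4))
```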
2.14 interpretation of r2
The coefficient of determination is a number between 0 and 1 that measures how well a
statistical model predicts an outcome.

Formula 2: using the output of the regression model

$R^2 = \frac{SSR}{SST}$

where SSR is the sum of squares due to regression and SST is the total sum of squares.

For example, R² = 0.74 means that about 74% of the variation in the outcome is explained by the model, so the predicted values are fairly close to the actual values.
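
As an illustration of the ratio SSR / SST, the sketch below computes R² for hypothetical data and the least squares line ŷ = 2.2 + 0.6x fitted to that data. It does not reproduce the 0.74 quoted above, which comes from a different data set.

```python
# Hypothetical data and its least squares line y_hat = 2.2 + 0.6x
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
a, b = 2.2, 0.6

predicted = [a + b * xi for xi in x]
mean_y = sum(y) / len(y)

# SSR: variation of the predictions around the mean (explained variation)
ssr = sum((p - mean_y) ** 2 for p in predicted)
# SST: variation of the observed values around the mean (total variation)
sst = sum((yi - mean_y) ** 2 for yi in y)

r_squared = ssr / sst
print(round(r_squared, 2))
```

Here about 60% of the variation in y is explained by the line.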
2.14 multiple regression equations

Multiple linear regression refers to a statistical technique that uses two or more independent variables
to predict the outcome of a dependent variable.
Example Problem:

Formula: y = b0 + b1x1 + b2x2
b1 = 3.148
b2 = -1.656
b0 = -6.867
Fitted equation: y = -6.867 + 3.148x1 - 1.656x2
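
Once the coefficients are known, a prediction is a direct substitution into the formula. The sketch below uses the coefficients listed above. The input values x1 = 3 and x2 = 2 are hypothetical, since the original data table is not reproduced here.

```python
# Coefficients from the example problem
B0, B1, B2 = -6.867, 3.148, -1.656

def predict(x1, x2):
    """Predicted y for the equation y = b0 + b1*x1 + b2*x2."""
    return B0 + B1 * x1 + B2 * x2

# Hypothetical input values, for illustration only
y_hat = predict(3, 2)
print(round(y_hat, 3))
```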
2.14 regression toward the mean
Regression toward the mean is a common statistical phenomenon that describes how extreme values
(either very high or very low) tend to move closer to the average (mean) in subsequent measurements.
Example
● A basketball player has an abnormally high number of points in one game. Regression to the
mean suggests their scoring average over the season will likely be closer to their typical
performance, not as high as this single game.
