Data Science Presentation
• Some ways you can describe patterns found in univariate data include central tendency
(mean, mode and median) and dispersion (range, variance, maximum, minimum, quartiles,
including the interquartile range, and standard deviation).
• You have several options for describing univariate data graphically, using different
types of graph or chart.
• Bivariate data occurs all the time in real-world situations, and we
typically use the following methods to analyze it:
• Scatterplots
• Correlation Coefficients
• Simple Linear Regression
Example 1: Business
• Businesses often collect bivariate data about total money spent on
advertising and total revenue.
• The business may decide to fit a simple linear regression model to this
dataset and find the following fitted model:
• This tells the business that for each additional dollar spent on advertising,
total revenue increases by an average of $2.70.
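As a rough illustration of how such a fitted line is obtained, the sketch below computes an ordinary least squares fit by hand. The advertising and revenue figures are hypothetical, chosen only so that the slope comes out near 2.70; they are not the business data referred to above, and `fit_line` is an illustrative helper name.

```python
def fit_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # slope = sum of cross-deviations / sum of squared x-deviations
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical figures, for illustration only
ad_spend = [100.0, 200.0, 300.0, 400.0, 500.0]
revenue = [430.0, 670.0, 980.0, 1210.0, 1510.0]

intercept, slope = fit_line(ad_spend, revenue)
print(f"revenue = {intercept:.2f} + {slope:.2f} * ad_spend")
```

The slope is the quantity interpreted in the text: the average change in revenue for each additional dollar of advertising.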
Example 2: Medical
Medical researchers often collect bivariate data to gain a better
understanding of the relationship between variables related to health.
For example, a researcher may collect the following data about age and
resting heart rate for 15 individuals:
• The researcher may then decide to calculate the correlation between
the two variables and find it to be 0.812.
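A correlation like this is usually the Pearson coefficient. The sketch below shows one way to compute it; the ages and heart rates are hypothetical values, not the researcher's 15 observations, so the result will differ from 0.812.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical ages (years) and resting heart rates (beats per minute)
ages = [25, 32, 41, 48, 56, 63]
heart_rates = [62, 66, 65, 72, 74, 79]

r = pearson_r(ages, heart_rates)
print(round(r, 3))
```

Values close to +1 or -1 indicate a strong linear relationship, while values near 0 indicate little linear relationship.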
• As multivariate analysis (MVA) has multiple variables, the variables are grouped and sorted on the
basis of their unique features.
• As multivariate data analysis deals with multiple variables, all the variables can either
be independent or dependent on each other.
• This helps the analysis to search for factors that can help in drawing accurate
conclusions.
• Since the analysis is tested, the drawn conclusions are closer to real-life situations.
multivariate data analysis
Importance of EDA in Data Science
• The Data Science field is now very important in the business world, as
it provides many opportunities to make vital business decisions by
analyzing the large volumes of data gathered.
• Understanding the data thoroughly requires exploring it from every
aspect.
• The impactful features found this way enable meaningful and beneficial
decisions; therefore, EDA occupies an invaluable place in Data
Science.
Objective of Exploratory Data Analysis
• The overall objective of exploratory data analysis is to obtain vital
insights and hence usually includes the following sub-objectives:
1. Data Collection
• Nowadays, data is generated in huge volumes and various forms
belonging to every sector of human life, like healthcare, sports,
manufacturing, tourism, and so on. Every business knows the
importance of using data beneficially by properly analyzing it.
However, this depends on collecting the required data from various
sources through surveys, social media, and customer reviews, to
name a few. Without collecting sufficient and relevant data, further
activities cannot begin.
2. Finding all Variables and Understanding Them
• When the analysis process starts, the first focus is on the available
data that gives a lot of information.
• This information contains changing values about various features or
characteristics, which helps to understand and get valuable insights
from them. It requires first identifying the important variables which
affect the outcome and their possible impact. This step is crucial for
the final result expected from any analysis.
3. Cleaning the Dataset
• The next step is to clean the data set, which may contain null values
and irrelevant information.
• These are to be removed so that data contains only those values that
are relevant and important from the target point of view.
• This reduces not only the analysis time but also the computational
power required.
• Preprocessing takes care of such issues, for example by identifying null values,
outliers, and other anomalies.
4. Identify Correlated Variables
• Univariate
• Bivariate
• Multivariate
• 1. Univariate Non-Graphical
• It is the simplest of all types of data analysis used in practice.
• As the name suggests, "uni" means that only one variable is considered;
its data (referred to here as the population) is compiled and studied.
• The main aim of univariate non-graphical EDA is to find out the details
about the distribution of the population data and to know some
specific parameters of statistics.
• The significant parameters which are estimated from a distribution
point of view are as follows:
• Central Tendency: This term refers to values located at the data's central
position or middle zone.
• The three generally estimated parameters of central tendency are mean,
median, and mode.
• Mean is the average of all values in data, while the mode is the value that
occurs the maximum number of times.
• The Median is the middle value with equal observations to its left and right.
• Range: The range is the difference between the maximum and minimum
value in the data, thus indicating the total spread between the lowest
and highest observations.
• Variance and Standard Deviation: Two more useful parameters are
standard deviation and variance.
• Variance is a measure of dispersion that indicates the spread of all
data points in a data set.
• It is the most widely used measure of dispersion and is the mean squared
difference between each data point and the mean, while standard
deviation is its square root.
• The larger the value of standard deviation, the farther the spread of
data, while a low value indicates more values clustering near the
mean.
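The parameters above can be computed with Python's standard `statistics` module. This is a minimal sketch on a small hypothetical data set. It uses the population variance (`pvariance`), which matches the "mean squared difference" definition given above; the sample variance (`statistics.variance`) divides by n - 1 instead.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical values

mean = statistics.mean(data)           # 5
median = statistics.median(data)       # 4.5 (average of the two middle values)
mode = statistics.mode(data)           # 4 (occurs three times)
data_range = max(data) - min(data)     # 7
variance = statistics.pvariance(data)  # 4 (mean squared deviation from the mean)
std_dev = statistics.pstdev(data)      # 2.0 (square root of the variance)

print(mean, median, mode, data_range, variance, std_dev)
```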
• 2. Univariate Graphical
• The graphs in this section are based on Auto MPG dataset available
on the UCI repository.
• Some common types of univariate graphics are:
• Stem-and-leaf Plots: This is a very simple but powerful EDA method
used to display quantitative data but in a shortened format.
• It displays the values in the data set, keeping each observation intact
but separating them as stem (the leading digits) and remaining or
trailing digits as leaves.
• Today, however, histograms are mostly used in its place.
• Histograms (Bar Charts): These plots are used to display both grouped
and ungrouped data.
• On the x-axis, values of the variable are plotted, while the y-axis shows
the number of observations or frequencies.
• Histograms make it simple to understand your data quickly: they reveal
properties such as central tendency, dispersion, outliers, etc.
• There are many types of bar charts, a few of which are listed below:
• Simple Bar Charts: These are used to represent categorical variables with rectangular bars,
where the different lengths correspond to the values of the variables.
• Multiple or Grouped charts: Grouped bar charts are bar charts representing multiple sets of data
items for comparison where a single color is used to denote one specific series in the dataset.
• Percentage Bar Charts: These are bar graphs that depict the data in the form of percentages for
each observation.
• Box Plots: These are used to display the distribution of quantitative values in the data. If the data
set also contains categorical variables, the plots can show a comparison between the categories.
• Further, if outliers are present in the data, they can be easily identified. These graphs are very
useful when comparisons are to be made in terms of percentiles, such as the 25%, 50%, and 75%
points (quartiles).
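The numbers behind a histogram (counts per bin) and a box plot (quartiles) can be sketched without any plotting library. The values and the bin width of 10 below are hypothetical and are not taken from the Auto MPG data set. Quartile conventions also differ between tools, so the "inclusive" method used here is only one reasonable choice.

```python
import statistics
from collections import Counter

values = [12, 15, 17, 18, 21, 22, 24, 25, 29, 31, 33, 38]  # hypothetical

# Histogram: map each value to the lower edge of its bin, then count per bin
bin_width = 10
counts = Counter((v // bin_width) * bin_width for v in values)
for lower in sorted(counts):
    print(f"{lower}-{lower + bin_width - 1}: {'#' * counts[lower]}")

# Box plot: the three quartiles (25%, 50%, and 75% points)
q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
print("Q1:", q1, "median:", q2, "Q3:", q3)
```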
• 3. Multivariate Non-Graphical
• The multivariate non-graphical exploratory data analysis technique is
usually used to show the connection between two or more variables
with the help of either cross-tabulation or statistics.
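A cross-tabulation simply counts how often each combination of two categorical variables occurs. The following is a minimal sketch using hypothetical survey records; the region and product names are invented for illustration.

```python
from collections import Counter

# Hypothetical survey records: (region, preferred product)
records = [
    ("North", "A"), ("North", "B"), ("North", "A"),
    ("South", "B"), ("South", "B"), ("South", "A"),
    ("East", "A"),
]

crosstab = Counter(records)  # missing combinations count as 0
rows = sorted({region for region, _ in records})
cols = sorted({product for _, product in records})

# Print one row per region and one column per product
print("region".ljust(8) + "".join(c.rjust(4) for c in cols))
for region in rows:
    cells = "".join(str(crosstab[(region, c)]).rjust(4) for c in cols)
    print(region.ljust(8) + cells)
```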
• 4. Multivariate Graphical
• Multivariate graphical EDA uses graphics to show the
relationships between two or more variables.
• Here the outcome can depend on more than two variables, and there
can also be multiple variables driving the change.
• Some common types of multivariate graphics include:
• A) Scatter Plot
• A run chart is a line chart of data plotted over time. In other words, a run
chart visually illustrates process performance or data values in a
time sequence.
• Rather than summary statistics, seeing data across time yields a more
accurate conclusion. A trend chart or time series plot is another name
for a run chart.
• D) Bubble Chart
Median: This identifies the value in the middle of all the values in the
dataset when values are ranked in order.
Data distribution
• Data distribution is a function that specifies all possible values for a
variable and also quantifies the relative frequency (probability of how
often they occur).
• Distributions are considered to be any population that has a
scattering of data.
Binomial Distribution
• Normal Distribution
• The most common and naturally occurring distribution is the Normal Distribution.
• It is also known as the Gaussian Distribution, and it appears in almost
every field:
• Finance, Statistics, Chemistry, you name it.
• It is an omnipresent distribution.
• A classic example is the distribution of SAT scores: a higher number of
students will score around the mean.
• As the distance increases from either side of the mean, the probability
decreases.
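import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
nums = np.random.normal(50, 5, 1000)  # 1000 draws with mean 50 and standard deviation 5
sns.set(style="darkgrid", palette="cividis")
fig, ax = plt.subplots(figsize=(15, 7))
sns.histplot(nums, kde=True, ax=ax)  # histplot replaces the deprecated distplot
plt.show()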
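To check numerically that such a sample behaves as described (most values near the mean, fewer further away), the sampling step can be repeated with the standard library alone. This is a sketch under assumptions: the seed is arbitrary and `random.gauss` stands in for `np.random.normal`. For a normal distribution, roughly 68% of values fall within one standard deviation of the mean.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed, chosen arbitrarily for repeatability
nums = [rng.gauss(50, 5) for _ in range(1000)]  # mean 50, standard deviation 5

sample_mean = statistics.mean(nums)
sample_std = statistics.stdev(nums)

# Fraction of draws lying within one standard deviation of the mean
within_one_sd = sum(abs(x - sample_mean) <= sample_std for x in nums) / len(nums)

print(f"mean={sample_mean:.2f} std={sample_std:.2f} within 1 sd={within_one_sd:.1%}")
```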
• Exponential Distribution
• The exponential distribution is often associated with the time elapsed
until some event happens.
• The events within the time interval occur continuously and at an
average constant rate.
• If you know high school chemistry, first-order chemical reactions and the
time until a radioactive atom decays follow an exponential
distribution.
• A more general example could be the number of months a car battery
lasts.
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.stats import expon
sample_space = np.arange(0, 5, 0.05)
sns.set(style="darkgrid", palette="muted")
fig, ax = plt.subplots(figsize=(15, 7))
plt.ylim(0, 1.6)  # ylim takes only the lower and upper limits
# scipy's scale parameter (beta) is the inverse of the rate parameter (lambda)
sns.lineplot(x=sample_space, y=expon.pdf(sample_space, scale=2), label="lambda = 0.5 and beta = 2")
sns.lineplot(x=sample_space, y=expon.pdf(sample_space, scale=1), label="lambda = 1 and beta = 1")
sns.lineplot(x=sample_space, y=expon.pdf(sample_space, scale=2/3), label="lambda = 1.5 and beta = 0.67")
plt.ylabel('f(x)')
plt.xlabel('x')
plt.title('Exponential distribution PDF', fontdict={'size': 16})
plt.show()
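The relationship between the rate and scale parameters can also be checked without SciPy. The sketch below writes out the density f(x) = rate * exp(-rate * x) and draws waiting times with the standard library. The rate of 0.5, the seed, and the helper name `expon_pdf` are illustrative choices.

```python
import math
import random
import statistics


def expon_pdf(x, rate):
    """Exponential density: rate * exp(-rate * x) for x >= 0, otherwise 0."""
    return rate * math.exp(-rate * x) if x >= 0 else 0.0


rate = 0.5        # events per unit time
scale = 1 / rate  # mean waiting time; this is what scipy calls "scale"

rng = random.Random(0)  # fixed seed, chosen arbitrarily
waits = [rng.expovariate(rate) for _ in range(10_000)]
sample_mean = statistics.mean(waits)

print(f"f(0) = {expon_pdf(0, rate)}, sample mean = {sample_mean:.2f}, scale = {scale}")
```

The sample mean should land close to the scale, which is why a scale of 2 corresponds to a rate of 0.5.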
Outlier Treatment
• One of the most important steps as part of data preprocessing is
detecting and treating the outliers as they can negatively affect the
statistical analysis and the training process of a machine learning
algorithm resulting in lower accuracy.
• What are Outliers?
• We have all heard the idiom ‘odd one out’, which means something
unusual in comparison to the others in a group.
• Boxplots
• Z-score
• Interquartile Range (IQR)
• Detecting outliers using Boxplot:
• Python code for boxplot is:
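Independently of any plotting library, the rule a box plot conventionally uses to flag outliers (points more than 1.5 × IQR beyond the quartiles) and the z-score rule can be sketched as follows. The data, the helper names, and the z-score threshold of 3 are illustrative assumptions.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Values lying more than k * IQR below Q1 or above Q3."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


def zscore_outliers(values, threshold=3.0):
    """Values more than `threshold` standard deviations from the mean."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return [v for v in values if abs(v - mean) / std > threshold]


# Hypothetical measurements; 102 is the "odd one out"
data = [10, 12, 12, 13, 12, 11, 14, 13, 15, 12, 11, 13, 14, 12, 102]

print(iqr_outliers(data))
print(zscore_outliers(data))
```

An extreme value inflates the mean and standard deviation used by the z-score, so in very small samples the z-score rule can miss an outlier that the IQR rule catches.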
Mean
• The "average" number; found by adding all data points and dividing
by the number of data points.
• Mean is used to find the average value around which your data values
range.
• Generally, when working with data, you may want to know the
average data value.
• This will give you a term that incorporates every data value from the
dataset.
• The mean (also called the arithmetic mean, as distinct from the geometric
mean) of a dataset is the sum of all values divided by the total
number of values.
• It's the most commonly used measure of central tendency and is
often referred to as the “average.”
Now, you will understand mean with the help of an example.
Consider a class whose students have obtained the following
marks out of 50 in mathematics :
You can see that there are 12 data points. So all you have to do is add
up each value and divide the result by 12, as shown below :
Hence, you get the mean as 37. This means that, on average, a student
belonging to the above class will score 37 out of 50 in mathematics.
What Is Mode?
• The Mode refers to the most frequently occurring value in your data.
• You find the frequency of occurrence of each number and the
number with the highest frequency is your mode.
• If there are no recurring numbers, then there is no mode in the data.
• Using the mode, you can find the most commonly occurring point in
your data.
• This is helpful when you have to find the central tendency of
categorical values, like the flavor of the most popular chip sold by a
brand.
• You cannot find the average based on the orders; instead, you choose
the chip flavor with the highest orders.
Over here, the value 35 occurs the most frequently and hence is the mode. But what if the values
are categorical? In that case, you must use the formula below:
Where,
We now come to the last of the mean, median, and mode trio - mode.
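For categorical values such as the chip flavors mentioned earlier, the mode can be found by counting occurrences. This is a minimal sketch with hypothetical orders; the flavor names are invented.

```python
from collections import Counter

# Hypothetical chip orders by flavor
orders = ["salted", "barbecue", "salted", "cheese", "barbecue", "salted"]

flavor_counts = Counter(orders)
top_flavor, top_count = flavor_counts.most_common(1)[0]

# A data set can have several modes, so collect every flavor tied for the top count
modes = [flavor for flavor, count in flavor_counts.items() if count == top_count]

print(top_flavor, top_count, modes)
```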
Skewness and Kurtosis
• Skewness is a measure of the symmetry or asymmetry of a data
distribution, and kurtosis measures whether the data is heavy-tailed or
light-tailed relative to a normal distribution.
• Data can be positive-skewed (the tail stretches towards the right side) or
negative-skewed (the tail stretches towards the left side).
What are the three types of skewness?
• Right skew (also called positive skew).
• A right-skewed distribution is longer on the right side of its peak than
on its left.
• Left skew (also called negative skew).
• A left-skewed distribution is longer on the left side of its peak than on
its right.
• Zero skew.
• 1. Positively Skewed:
• In a distribution that is Positively Skewed, the values are more
concentrated towards the left side, and the right tail is spread out.
• Hence, the peak of the distribution sits towards the left-hand side.
• The skewness coefficient is positive; the mean, median, and mode themselves need not be positive.
• In this distribution, typically Mean > Median > Mode.
Positively Skewed
2. Negatively Skewed:
• In a Negatively Skewed distribution, the data points are more
concentrated towards the right-hand side of the distribution, and the left tail is spread out.
• This places the peak of the distribution towards the right.
• The skewness coefficient is negative; the mean, median, and mode themselves need not be negative.
• In this distribution, typically Mode > Median > Mean.
Negatively Skewed
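Skewness and kurtosis can be computed from the central moments of the data. The sketch below uses the simple moment-based (population) definitions; statistical packages often apply small-sample corrections, so their results may differ slightly. The three data sets are hypothetical.

```python
import statistics


def central_moment(values, k):
    """k-th central moment: mean of (value - mean) ** k."""
    mean = statistics.fmean(values)
    return sum((v - mean) ** k for v in values) / len(values)


def skewness(values):
    """Moment-based skewness m3 / m2**1.5; positive means a longer right tail."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5


def excess_kurtosis(values):
    """m4 / m2**2 - 3, so that a normal distribution scores about 0."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3


right_skewed = [1, 2, 2, 3, 3, 3, 4, 10]     # long right tail
left_skewed = [-v for v in right_skewed]     # mirror image, long left tail
symmetric = [1, 2, 3, 4, 5]

for name, sample in [("right", right_skewed), ("left", left_skewed), ("symmetric", symmetric)]:
    print(name, round(skewness(sample), 3), round(excess_kurtosis(sample), 3))
```

In the right-skewed sample the mean (3.5) exceeds the median (3), in line with the ordering described above.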
What Is a Normal Distribution?
• A normal distribution is a continuous probability distribution for a
random variable.
• A random variable is a variable whose value depends on the outcome
of a random event.
• For example, flipping a coin will give you either heads or tails at
random.
• You cannot determine with absolute certainty whether the next
outcome is a head or a tail.
Normal Distribution