Data Science Process

Data Science
• Data science is the study of data using advanced
technologies (machine learning, artificial intelligence,
big data). It processes huge amounts of structured,
semi-structured, and unstructured data to extract
meaningful insights, from which patterns can be
identified to support decisions about new business
opportunities, the betterment of products/services,
and ultimately business growth. Data science is the
process of making sense of the big data used in business.
Exploratory Data Analysis
• Exploratory data analysis (EDA) is used by data
scientists to analyze and investigate data sets
and summarize their main characteristics,
often employing data visualization methods. It
helps determine how best to manipulate data
sources to get the answers you need, making
it easier for data scientists to discover
patterns, spot anomalies, test a hypothesis, or
check assumptions.
EDA Importance
• The main purpose of EDA is to help look at data before making any
assumptions. It can help identify obvious errors, reveal patterns
within the data, detect outliers or anomalous events, and find
interesting relationships among the variables.

• Data scientists can use exploratory analysis to ensure the results
they produce are valid and applicable to any desired business
outcomes and goals. EDA also helps stakeholders by confirming they
are asking the right questions. EDA can help answer questions about
standard deviations, categorical variables, and confidence intervals.
Once EDA is complete and insights are drawn, its features can then
be used for more sophisticated data analysis or modeling, including
machine learning.
EDA Tools
• Specific statistical functions and techniques you
can perform with EDA tools include:
• Clustering and dimension reduction techniques,
which help create graphical displays of high-
dimensional data containing many variables.
• Univariate visualization of each field in the raw
dataset, with summary statistics.
• Bivariate visualizations and summary statistics
that allow you to assess the relationship between
each variable in the dataset and the target
variable you’re looking at.
EDA Tools (Contd..)
• Multivariate visualizations, for mapping and
understanding interactions between different fields in
the data.
• K-means Clustering is a clustering method in
unsupervised learning where data points are assigned
into K groups, i.e. the number of clusters, based on the
distance from each group’s centroid. The data points
closest to a particular centroid will be clustered under
the same category. K-means Clustering is commonly
used in market segmentation, pattern recognition, and
image compression.
• Predictive models, such as linear regression, use
statistics and data to predict outcomes.
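The k-means procedure described above can be sketched in a few lines of plain Python. This is a minimal illustrative implementation, not a production one; the sample points are made up to form two obvious groups.

```python
import random

def dist2(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=20, seed=0):
    """Minimal k-means: assign each point to its nearest centroid,
    then move each centroid to the mean of its cluster."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: index of the nearest centroid for every point
        labels = [min(range(k), key=lambda i: dist2(p, centroids[i]))
                  for p in points]
        # Update step: recompute each centroid as its cluster's mean
        for i in range(k):
            members = [p for p, lab in zip(points, labels) if lab == i]
            if members:
                centroids[i] = tuple(sum(c) / len(members)
                                     for c in zip(*members))
    return centroids, labels

# Two well-separated groups of 2-D points
pts = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
centers, labels = kmeans(pts, k=2)
```

With well-separated data like this, any initialization converges quickly to the two natural clusters; real use would also handle empty clusters and multiple restarts.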
Types of EDA
• Univariate Non Graphical
• Univariate Graphical
• Multivariate nongraphical
• Multivariate graphical
Univariate non-graphical
• This is the simplest form of data analysis, where
the data being analyzed consists of just one
variable. Since it’s a single variable, it doesn’t
deal with causes or relationships. The main
purpose of univariate analysis is to describe
the data and find patterns that exist within it.
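Describing a single variable with summary statistics can be sketched with Python's built-in statistics module, here applied to the exam-score data used in the stem-and-leaf example later in the deck:

```python
import statistics

# Exam scores (out of 100) for 8 students
scores = [78, 82, 91, 65, 88, 95, 72, 85]

summary = {
    "n": len(scores),
    "mean": statistics.mean(scores),      # 82.0
    "median": statistics.median(scores),  # 83.5
    "stdev": statistics.stdev(scores),    # 10.0 (sample standard deviation)
    "min": min(scores),                   # 65
    "max": max(scores),                   # 95
}
```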
Univariate graphical
• Non-graphical methods don’t provide a full picture of
the data. Graphical methods are therefore required.
Common types of univariate graphics include:
• Stem-and-leaf plots, which show all data values and the
shape of the distribution.
• Histograms, bar plots in which each bar represents the
frequency (count) or proportion (count/total count) of
cases for a range of values.
• Box plots, which graphically depict the five-number
summary of minimum, first quartile, median, third
quartile, and maximum.
Stem and Leaf Plot
• A stem and leaf plot is a visual technique used in Exploratory Data
Analysis (EDA) to display the distribution of a single variable. It's
particularly useful for small to medium datasets.
• Components:
• Stem: Represents the leftmost digits (usually all digits except the last)
of a data point.
• Leaf: Represents the rightmost digit (usually the last digit) of a data
point.
• Construction:
1. Split the Data: Separate each data point into its stem and leaf based
on the chosen number of digits for the stem.
2. Organize Stems: List the unique stem values in ascending order on
the left side of the plot.
3. Place Leaves: For each stem value, write the corresponding leaves in
ascending order to the right, separated by a comma or space.
Example
• Data: Here's some data representing exam
scores (out of 100) for 8 students:
• 78, 82, 91, 65, 88, 95, 72, 85
Stem Leaf
------- ----
6 5
7 28
8      258
9 15
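The construction steps above can be sketched in plain Python; running it on the slide's exam scores reproduces the plot:

```python
def stem_and_leaf(data):
    """Map each value's leading digits (stem) to its last digits (leaves)."""
    plot = {}
    for value in sorted(data):
        stem, leaf = divmod(value, 10)  # e.g. 78 -> stem 7, leaf 8
        plot.setdefault(stem, []).append(leaf)
    return plot

scores = [78, 82, 91, 65, 88, 95, 72, 85]
for stem, leaves in sorted(stem_and_leaf(scores).items()):
    print(stem, "|", " ".join(str(leaf) for leaf in leaves))
# 6 | 5
# 7 | 2 8
# 8 | 2 5 8
# 9 | 1 5
```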
Histogram
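The counting behind a histogram can be sketched without any plotting library: bin each value and count per bin. The bin width of 10 is an arbitrary choice for illustration.

```python
def histogram_counts(data, bin_width):
    """Count values per half-open bin [lo, lo + bin_width)."""
    counts = {}
    for value in data:
        lo = (value // bin_width) * bin_width  # left edge of the bin
        counts[lo] = counts.get(lo, 0) + 1
    return counts

scores = [78, 82, 91, 65, 88, 95, 72, 85]
counts = histogram_counts(scores, bin_width=10)
# {60: 1, 70: 2, 80: 3, 90: 2}
```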
Box Plots
• Five Values
Minimum 17
Lower quartile 52
Median 69
Upper quartile 87
Maximum 100
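The five-number summary a box plot depicts can be computed with `statistics.quantiles`; note that quartile conventions vary, and Python's default "exclusive" method is used here on the exam-score data (the slide's values 17/52/69/87/100 come from a different, unshown dataset):

```python
import statistics

scores = [78, 82, 91, 65, 88, 95, 72, 85]
q1, med, q3 = statistics.quantiles(scores, n=4)  # default method="exclusive"

five_number = {
    "minimum": min(scores),   # 65
    "lower quartile": q1,     # 73.5
    "median": med,            # 83.5
    "upper quartile": q3,     # 90.25
    "maximum": max(scores),   # 95
}
```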
Multivariate nongraphical:
• Multivariate data arises from more than one
variable. Multivariate non-graphical EDA
techniques generally show the relationship
between two or more variables of the data
through cross-tabulation or statistics.
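Cross-tabulation of two categorical variables can be sketched with `collections.Counter`; the (gender, event) records below are hypothetical, made up for illustration:

```python
from collections import Counter

# Hypothetical records: one (gender, event) pair per observation
records = [
    ("F", "sprint"), ("F", "marathon"), ("M", "sprint"),
    ("M", "sprint"), ("F", "sprint"), ("M", "marathon"),
]

# Joint counts of the two categorical variables
crosstab = Counter(records)
# ('F', 'sprint'): 2, ('M', 'sprint'): 2,
# ('F', 'marathon'): 1, ('M', 'marathon'): 1
```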
Multivariate graphical:
• Multivariate data uses graphics to display
relationships between two or more sets of
data.
• The most used graphic is a grouped bar plot or
bar chart with each group representing one
level of one of the variables and each bar
within a group representing the levels of the
other variable.
• Other common types of multivariate graphics
include:
• Scatter plot, which is used to plot data points on a
horizontal and a vertical axis to show how much
one variable is affected by another.
• Multivariate chart, which is a graphical
representation of the relationships between
factors and a response.
• Bubble chart, which is a data visualization that
displays multiple circles (bubbles) in a two-
dimensional plot.
Grouped Bar Chart
Scatter Plot Example
A scatterplot that displays multiple variables through different effects

Bubble Chart Example
Multivariate Chart
• A multi-vari chart is a visualization of the relationship
between factors (input variables) and a response (output
variable). It is used as a preliminary tool to investigate
variation in the data, including cyclical variations and
interactions between factors.
Data Science Process
• The more examples you see of people doing
data science, the more you’ll find that they fit
into the general framework described in the
slides that follow.
Data Science Process
• First we have the Real World. Inside the Real
World are lots of people busy at various
activities. Some people are using Google+,
others are competing in the Olympics; there
are spammers sending spam, and there are
people getting their blood drawn. Say we have
data on one of these things.
Contd..
• Specifically, we’ll start with raw data—logs,
Olympics records, Enron employee emails, or
recorded genetic material (note there are lots of
aspects to these activities already lost even when
we have that raw data). We want to process this to
make it clean for analysis. So we build and use
pipelines of data munging: joining, scraping,
wrangling, or whatever you want to call it. To do
this we use tools such as Python, shell scripts, R, or
SQL, or all of the above.
• Eventually we get the data down to a nice format,
like something with columns:
name | event | year | gender | event time
Once we have this clean dataset, we should be doing
some kind of EDA. In the course of doing EDA, we may
realize that it isn’t actually clean because of
duplicates, missing values, absurd outliers, and data
that wasn’t actually logged or incorrectly logged. If
that’s the case, we may have to go back to collect
more data, or spend more time cleaning the dataset.
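The cleaning problems named above (duplicates, missing values, absurd outliers) can be sketched in plain Python on the name | event | year | gender | event time schema; the rows and the plausibility threshold are hypothetical:

```python
# Hypothetical raw rows in the name | event | year | gender | time schema
raw = [
    {"name": "Bolt", "event": "100m", "year": 2012, "gender": "M", "time": 9.63},
    {"name": "Bolt", "event": "100m", "year": 2012, "gender": "M", "time": 9.63},   # exact duplicate
    {"name": "Felix", "event": "200m", "year": 2012, "gender": "F", "time": None},  # missing value
    {"name": "Ghost", "event": "100m", "year": 2012, "gender": "M", "time": 0.1},   # absurd outlier
    {"name": "Blake", "event": "100m", "year": 2012, "gender": "M", "time": 9.75},
]

seen = set()
clean = []
for row in raw:
    key = tuple(row.items())
    if key in seen:                    # drop exact duplicates
        continue
    seen.add(key)
    if row["time"] is None:            # drop rows with missing values
        continue
    if not 9.0 < row["time"] < 60.0:   # drop physically absurd times
        continue
    clean.append(row)
```

In practice such rules come from domain knowledge, and rows flagged here would often prompt going back to the data source rather than being silently dropped.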
• Next, we design our model to use some
algorithm like k-nearest neighbor (k-NN),
linear regression, Naive Bayes, or something
else. The model we choose depends on the
type of problem we’re trying to solve, of
course, which could be a classification
problem, a prediction problem, or a basic
description problem.
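One of the algorithms mentioned above, k-nearest neighbor, is simple enough to sketch in pure Python; the labeled training points below are made up for illustration:

```python
from collections import Counter

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among the k nearest training points."""
    by_distance = sorted(
        train,
        key=lambda item: sum((a - b) ** 2 for a, b in zip(item[0], query)),
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Hypothetical labeled 2-D feature points
train = [((1, 1), "ham"), ((1, 2), "ham"), ((2, 1), "ham"),
         ((8, 8), "spam"), ((8, 9), "spam"), ((9, 8), "spam")]

print(knn_predict(train, (2, 2)))  # → ham
```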
• We then can interpret, visualize, report, or
communicate our results. This could take the
form of reporting the results up to our boss or
coworkers, or publishing a paper in a journal
and going out and giving academic talks about
it.
• Alternatively, our goal may be to build or
prototype a “data product”; e.g., a spam
classifier, or a search ranking algorithm, or a
recommendation system. Now the key here
that makes data science special and distinct from
statistics is that this data product then gets
incorporated back into the real world, and users
interact with that product, and that generates
more data, which creates a feedback loop.
• This is very different from predicting the weather,
say, where your model doesn’t influence the
outcome at all. For example, you might predict it
will rain next week, and unless you have some
powers we don’t know about, you’re not going to
cause it to rain. But if you instead build a
recommendation system that generates evidence
that “lots of people love this book,” say, then you
will know that you caused that feedback loop.
• Take this loop into account in any analysis you do by
adjusting for any biases your model caused. Your models are
not just predicting the future, but causing it! A data product
that is productionized and that users interact with is at one
extreme and the weather is at the other, but regardless of
the type of data you work with and the “data product” that
gets built on top of it—be it public policy determined by a
statistical model, health insurance, or election polls that get
widely reported and perhaps influence viewer opinions—
you should consider the extent to which your model is
influencing the very phenomenon that you are trying to
observe and understand.
