Data Analysis 2

4/2/20
Data Analysis: Visualisation
Data Mining
ITERA
Semester II 2019/2020

Describing Data: Wine Quality Dataset

• Size: 4897 x 12
• All numeric attributes
• Class attribute: quality
• No missing values
• Imbalanced data
• No instances for classes 1, 2, 10

Wine Quality Dataset: Outliers

This observation suggests that there are extreme values (outliers) in our data set. There is notably a large difference between the 75th percentile and max values of predictors “residual sugar”, “free sulfur dioxide”, “total sulfur dioxide”.

Data Analysis Process

Dataset (raw) → Import dataset → Data understanding → Data preparation → Export dataset → Dataset for modelling

Data understanding:
• Describing data
• Verifying data quality
• Next: exploring data (this ppt)

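The gap between the 75th percentile and the maximum noted on the outliers slide can be read off a `describe()` summary. A minimal sketch, assuming a small made-up column named `residual sugar` (the values are illustrative and are not rows from the Wine Quality dataset):

```python
import pandas as pd

# Hypothetical values for illustration only; not real Wine Quality rows.
toy = pd.DataFrame({"residual sugar": [1.2, 1.6, 5.2, 8.1, 9.9, 12.5, 65.8]})

summary = toy.describe()  # count, mean, std, min, 25%, 50%, 75%, max
q3 = summary.loc["75%", "residual sugar"]
vmax = summary.loc["max", "residual sugar"]

# A maximum far above the 75th percentile hints at extreme values.
print(summary)
print("max / 75th percentile:", vmax / q3)
```

A large ratio is only a hint; a boxplot of the same column shows whether the extreme values are isolated points.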
Exploring Data

• Exploring data by using data query, visualization, and statistics to indicate data characteristics or lead to interesting subsets for further examination.
• sampling,
• feature engineering,
• correlation

Data Query

• Selecting instances with single or multiple filtering conditions
• Selecting columns
• Aggregating data
• Grouping data

Selecting Instances using Python

Data query to interesting subset, e.g. missing value in Automobile dataset
• Single filtering condition

Selecting Instances using Python (2)

• Multiple filtering conditions

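Single and multiple filtering conditions can be sketched with boolean masks. This is an illustration only, assuming a tiny made-up frame `df_toy` that imitates the Automobile data (integer column labels, `'?'` for a missing value); it is not the code from the original slides:

```python
import pandas as pd

# Hypothetical stand-in for the Automobile data: integer column labels,
# '?' marking a missing value.
df_toy = pd.DataFrame({
    0: [3, 1, 2, 0],
    3: ["gas", "gas", "diesel", "diesel"],
    25: ["13495", "?", "17450", "?"],
})

# Single filtering condition: rows whose column 25 is missing.
missing_price = df_toy[df_toy[25] == "?"]

# Multiple filtering conditions: combine boolean masks with & (and) or | (or);
# each condition needs its own parentheses.
gas_and_missing = df_toy[(df_toy[3] == "gas") & (df_toy[25] == "?")]

print(missing_price.shape[0])  # number of matching rows
print(gas_and_missing)
```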
Selecting Columns

Aggregating Data

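Column selection and aggregation might look like the following sketch. The frame `df_toy` and its values are made up; the integer labels 10, 11 and 12 only imitate the column labelling used elsewhere in these slides:

```python
import pandas as pd

# Hypothetical numeric columns with integer labels.
df_toy = pd.DataFrame({
    10: [168.8, 171.2, 176.6, 192.7],
    11: [64.1, 65.5, 66.2, 71.4],
    12: [48.8, 52.4, 54.3, 55.7],
})

one_column = df_toy[10]            # a single label gives a Series
two_columns = df_toy[[10, 12]]     # a list of labels gives a DataFrame
by_position = df_toy.iloc[:, 0:2]  # first two columns by position

# Aggregation collapses each column to a single value per function.
print(one_column.mean())
print(two_columns.agg(["min", "max", "mean"]))
```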
Aggregating Data

Grouping Data by Category Values

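Grouping by category values can be sketched as counting rows per category and then aggregating within each group. The column names and values below are assumptions chosen for illustration:

```python
import pandas as pd

# Hypothetical rows; column names are illustrative.
df_toy = pd.DataFrame({
    "drive-wheels": ["fwd", "rwd", "fwd", "rwd", "4wd"],
    "price": [7000.0, 16000.0, 9000.0, 20000.0, 8000.0],
})

# Count how many rows fall in each category.
counts = df_toy["drive-wheels"].value_counts()

# Split the rows by category, then aggregate each group.
mean_price = df_toy.groupby("drive-wheels")["price"].mean()

print(counts)
print(mean_price)
```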
Exercise 1: Data Query

Execute this program for Automobile dataset, and explain its output.
for i in range(26):
    nr=df[df[i]=='?'].shape[0]
    if nr>0:
        print(i,':',nr)

Exercise 2: Data Query

Write a program to show the number of instances that contain missing values in the Automobile dataset.
Hint: The output should be 46.

Exercise 3: Data Query

Execute this program for Automobile dataset, and explain its output.
import numpy as np
df_3features=df[df.columns[np.r_[6:8,25]]]
df_3features=df_3features[df_3features[25]!='?']
df_3features[25] = df_3features[25].astype("float64")
df_grp=df_3features.groupby([df.columns[6],df.columns[7]]).mean()
df_pivot=df_grp.pivot_table(index=df.columns[7],columns=df.columns[6])
df_pivot

Data Analysis in Statistics

Descriptive Data Analysis (DDA)
A series of methods that summarize data (e.g. sample mean and standard deviation)

Exploratory Data Analysis (EDA)
A series of methods for generating hypotheses using visualizations. Analysis to prepare the data for modeling

Confirmatory Data Analysis (CDA)
A series of methods for statistical inference, calculation of p-values and interpretation of their implications for proving hypotheses

EDA will be conducted on dataset to understand the data & prepare the hypothesis
http://www.models.kvl.dk/sites/default/files/Data_Analysis.png, cited from Allen et al. (2018)

Exploratory Data Analysis

• EDA is NOT about making fancy visualizations or even aesthetically pleasing ones; the goal is to try and answer questions with data.
• EDA indeed makes sure that you explore the data in such a way that interesting features and relationships between features will become more clear.
• In EDA, you typically explore and compare many different variables with a variety of techniques to search and find systematic patterns.
https://www.datacamp.com/community/tutorials/exploratory-data-analysis-python

Exploratory Data Analysis (EDA): Automobile Dataset

Main goal: “What are the characteristics that have the most impact on the car price?”
● Summarize main characteristics of the data
● Gain better understanding of the dataset,
● Uncover relationships between different variables, and
● Extract important variables for the problem we are trying to solve

Data Visualization in Python

Data Visualization

01 Pandas
● provides basic plotting library
02 matplotlib
● 2-D plotting library that helps in visualizing figures. Matplotlib emulates Matlab-like graphs and visualizations.
03 seaborn
● a Python data visualization library based on matplotlib.
● It provides a high-level interface for drawing attractive and informative statistical graphics.

Histogram for Numerical Data

A histogram represents the distribution of data by forming bins along the range of the data and then drawing bars to show the number of observations that fall in each bin.
df[10].hist().set(xlabel='Attribute 10', ylabel='Frequency')

Histogram: Pandas vs seaborn

df[10].hist(grid=False).set(xlabel='Attribute 10', ylabel='Frequency')

import seaborn as sns
sns.distplot(df[10], kde=False, bins=5)

DistPlot: Parameter Bins

import seaborn as sns
import matplotlib.pyplot as plt
f, axes = plt.subplots(1, 2, figsize=(10,5), sharex=True)
sns.distplot(df[10], kde=False, ax=axes[0]).set(xlabel='Attribute 10', ylabel='Frequency')
sns.distplot(df[10], kde=False, bins=5, ax=axes[1]).set(xlabel='Attribute 10', ylabel='Frequency')

DistPlot: Parameter KDE

import seaborn as sns
import matplotlib.pyplot as plt
f, axes = plt.subplots(1, 3, figsize=(15,5), sharex=True)
sns.distplot(df[10], kde=False, bins=5, ax=axes[0]).set(xlabel='Attribute 10', ylabel='Frequency')
sns.distplot(df[10], bins=5, ax=axes[1]).set(xlabel='Attribute 10', ylabel='Density')
sns.distplot(df[10], bins=5, hist=False, ax=axes[2]).set(xlabel='Attribute 10', ylabel='Density')

Histogram for Categorical Data

A count plot can be thought of as a histogram across a categorical, instead of quantitative, variable. It shows the counts of observations in each categorical bin using bars.
https://seaborn.pydata.org/generated/seaborn.countplot.html
import seaborn as sns
# pass the column by keyword; newer seaborn releases do not accept it positionally as x
sns.countplot(x=df[6]).set(xlabel='Attribute 6', ylabel='Frequency')

CountPlot Parameters

sns.countplot(y=df[6])
sns.countplot(x=df[6],hue=df[3])

Exercise 4: Histogram for Price

Write a program to show a histogram for attribute 25 (price) with 3 bins.

BoxPlot

A boxplot is a standardized way of displaying the distribution of data based on a five-number summary (“minimum”, first quartile (Q1), median, third quartile (Q3), and “maximum”).
• The ends of the box show the upper and lower quartiles.
• The line that divides the box into 2 parts represents the median of the data.
• The extreme lines show the highest and lowest values excluding outliers.

BoxPlot: Example

import seaborn as sns
# Make boxplot for one group only
sns.boxplot(y=df[0]).set_title('basic plot')

BoxPlot Parameters
https://seaborn.pydata.org/generated/seaborn.boxplot.html

sns.boxplot(x=df[0]).set_title('horizontal')
sns.boxplot(y=df[0],notch=True).set_title('notched plot')

Notched BoxPlot
https://sites.google.com/site/davidsstatistics/home/notched-box-plots

BoxPlot with Outliers
https://towardsdatascience.com/understanding-boxplots-5e2df7bcbd51

BoxPlot with Outliers: Example

sns.boxplot(y=df[20]).set_title('basic plot')
IQR = Q3 - Q1 = 9.4 - 8.6 = 0.8
Max = Q3 + 1.5*IQR = 9.4 + 1.2 = 10.6
Min = Q1 - 1.5*IQR = 8.6 - 1.2 = 7.4

BoxPlot Grouped-by

import seaborn as sns
sns.boxplot(y=df[0], x=df[3]).set_title('group by fuel-type')

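The Min/Max arithmetic from the outlier example can be reproduced in code. A minimal sketch, assuming a made-up series chosen so that Q1 = 8.6 and Q3 = 9.4 (these are not values from the Automobile dataset):

```python
import pandas as pd

# Hypothetical values chosen so that Q1 = 8.6 and Q3 = 9.4.
values = pd.Series([7.0, 8.6, 8.6, 9.0, 9.0, 9.0, 9.4, 9.4, 23.0])

q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1
lower = q1 - 1.5 * iqr  # "Min" fence
upper = q3 + 1.5 * iqr  # "Max" fence

# Points outside the fences are the ones a boxplot draws individually.
outliers = values[(values < lower) | (values > upper)]
print(round(iqr, 2), round(lower, 2), round(upper, 2))
print(outliers.tolist())
```

In the drawn plot each whisker stops at the most extreme observation that still lies inside its fence, so a whisker can be shorter than the computed Min or Max.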
BoxPlot Grouped-by 2 Categories

sns.boxplot(y=df[0], x=df[3], hue=df[4]).set_title('grouped by fuel-type & aspiration')

Exercise 5: Boxplot drive-wheels and Price

Write a program to show a boxplot representing the effect of attribute 7 (drive-wheels) on attribute 25 (prices).
Hints: y=prices; x=drive-wheels

ScatterPlot

Scatter plot represents relationship between x and y.
df_price=df[df[25]!='?'][25].astype("int64")
sns.scatterplot(x=df[16], y=df_price).set(xlabel='Attribute 16', ylabel='Attribute 25')
https://seaborn.pydata.org/generated/seaborn.scatterplot.html

ScatterPlot & Correlation Coefficient

df_price_engine=df.iloc[:,[16,25]]
df_price_engine=df_price_engine[df_price_engine[25]!='?']
df_price_engine[25]=df_price_engine[25].astype("int64")
df_price_engine.corr()

Correlation Coefficient

• Correlation coefficients are a quantitative measure that describes the strength of association/relationship between two variables.
• The correlation between two sets of data tells us about how they move together. Would changing one help us predict the other?

Correlation Coefficient: Notes

• Correlation coefficient will lie between -1 and 1
• The greater the absolute value (closer to -1 or 1), the stronger the relationship between the variables:
  • The strongest correlation is a -1 or a 1
  • The weakest correlation is a 0
• A positive correlation means that as one variable increases, the other one tends to increase as well
• A negative correlation means that as one variable increases, the other one tends to decrease

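The sign and range of the coefficient can be checked on a small example. The sketch below computes the Pearson coefficient by hand and compares it with pandas; the frame `toy`, its column names and its values are assumptions made for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical paired measurements; the names are illustrative.
toy = pd.DataFrame({
    "engine_size": [90, 110, 130, 150, 180],
    "price": [6000, 9000, 12000, 16000, 22000],
    "mpg": [40, 34, 30, 25, 19],
})

def pearson(x, y):
    """Pearson r: co-variation of x and y scaled by their individual spreads."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx, dy = x - x.mean(), y - y.mean()
    return (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

r_pos = pearson(toy["engine_size"], toy["price"])  # both increase together
r_neg = pearson(toy["engine_size"], toy["mpg"])    # one rises while the other falls

print(r_pos, r_neg)
print(toy.corr())  # pandas computes the same coefficient for every column pair
```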
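Correlation Coefficient

Correlation Coefficient Heatmap

# numeric_only=True (pandas 1.5+) restricts the calculation to numeric columns;
# pandas 2.x raises an error if df still contains text columns
corr = df.corr(numeric_only=True)
sns.heatmap(corr, annot=True, fmt='.2f')
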
Heatmap Plot from Data Grid

Heatmap plot represents pivot table. Heat map takes a rectangular grid of data and assigns a color intensity based on the data value at the grid points.
https://seaborn.pydata.org/generated/seaborn.heatmap.html

Heatmap from Pivot Table

import seaborn as sns
sns.heatmap(df_pivot, annot=True, fmt='.2f').set(xlabel='Attribute 6', ylabel='Attribute 7')

Generate Pivot Table: Example

import numpy as np
#select body-style(6), drive-wheels(7), price(25) from df
df_3features=df[df.columns[np.r_[6:8,25]]]
#select data where price is not null
df_3features=df_3features[df_3features[25]!='?']
#convert price attribute in object type into float
df_3features[25] = df_3features[25].astype("float64")
#group by body-style(6), drive-wheels(7)
df_grp=df_3features.groupby([df.columns[6],df.columns[7]]).mean()
#create pivot table
df_pivot=df_grp.pivot_table(index=df.columns[7],columns=df.columns[6])

PairPlot

PairPlot draws scatterplots for joint relationships and histograms for univariate distributions.
A PairPlot allows us to see both distribution of single variables and relationships between two variables.
https://seaborn.pydata.org/generated/seaborn.pairplot.html

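The group-then-pivot steps of the pivot table example can also be written as one `pivot_table` call with an aggregation function. A minimal sketch on made-up rows (the column names and prices are assumptions, not Automobile data):

```python
import pandas as pd

# Hypothetical rows; column names are illustrative.
toy = pd.DataFrame({
    "body-style": ["sedan", "sedan", "hatchback", "hatchback", "sedan"],
    "drive-wheels": ["fwd", "rwd", "fwd", "rwd", "fwd"],
    "price": [9000.0, 21000.0, 7000.0, 15000.0, 11000.0],
})

# Average price for every (drive-wheels, body-style) pair:
# one category becomes the row index, the other the columns.
pivot = toy.pivot_table(
    values="price", index="drive-wheels", columns="body-style", aggfunc="mean"
)
print(pivot)
```

The resulting grid has the rectangular shape that `sns.heatmap` expects.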
PairPlot

import numpy as np
import pandas as pd
import seaborn as sns
#select length(10),width(11),height(12),price(25) from df
df_4features=df[df.columns[np.r_[10:13,25]]]
#select data where price is not null
df_4features=df_4features[df_4features[25]!='?']
#convert price attr. in object type into float
df_4features[25] = df_4features[25].astype("float64")
#bin price into 5 equal-width intervals and treat the bins as categories
df_4features[25] = pd.cut(df_4features[25], 5)
df_4features[25] = df_4features[25].astype("category")
sns.pairplot(data=df_4features,hue=df.columns[25])

https://infogram.com/page/choose-the-right-chart-data-visualization

Exercise 5: Data Visualization

1. Load white Wine Quality dataset (https://archive.ics.uci.edu/ml/datasets/wine+quality)
   df = pd.read_csv("winequality-white.csv",sep=';')
2. Explore the data by using visualization. Report interesting insights, especially about outliers and imbalanced dataset.
