
4/2/20

Data Analysis: Visualisation
Data Mining
ITERA
Semester II 2019/2020

Describing Data: Wine Quality Dataset

• Size: 4897 x 12
• All numeric attributes
• Class attribute: quality
• No missing values
• Imbalanced data
• No instances for class 1, 2, 10
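Each property listed for the Wine Quality dataset (size, attribute types, missing values, class balance) can be checked with a short pandas call. The sketch below uses a small made-up frame in place of the real file; the column names and values are illustrative assumptions only.

```python
import pandas as pd

# Toy stand-in for the wine data (values are made up for illustration).
wine = pd.DataFrame({
    "alcohol": [9.4, 10.1, 11.2, 9.8, 12.5, 10.4],
    "pH":      [3.2, 3.3, 3.1, 3.0, 3.4, 3.2],
    "quality": [5, 6, 6, 6, 7, 5],
})

n_rows, n_cols = wine.shape                  # size of the dataset
all_numeric = all(pd.api.types.is_numeric_dtype(t) for t in wine.dtypes)
n_missing = int(wine.isnull().sum().sum())   # total number of missing cells
class_counts = wine["quality"].value_counts().sort_index()

print(n_rows, n_cols, all_numeric, n_missing)
print(class_counts)  # uneven counts indicate an imbalanced class attribute
```

Classes that never occur simply do not appear in the `value_counts()` result.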

Wine Quality Dataset: Outliers

This observation suggests that there are extreme values (outliers) in our data set. There is notably a large difference between the 75th percentile and max values of predictors "residual sugar", "free sulfur dioxide", "total sulfur dioxide".

Data Analysis Process

[Diagram: Dataset (raw) -> Import dataset -> Data understanding -> Data preparation -> Export dataset -> Dataset for modelling. Data understanding: describing data, verifying data quality; next: exploring data (this presentation).]
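The gap between the 75th percentile and the maximum noted for the wine predictors can be read from the output of `describe()`. This is a minimal sketch on a made-up series; the values are assumptions chosen so that one reading is extreme.

```python
import pandas as pd

# Made-up values; the last reading mimics an outlier.
sugar = pd.Series([1.2, 1.6, 2.0, 5.5, 6.1, 8.0, 9.9, 65.8],
                  name="residual sugar")

summary = sugar.describe()   # count, mean, std, min, 25%, 50%, 75%, max
gap = summary["max"] - summary["75%"]
iqr = summary["75%"] - summary["25%"]

print(summary)
# A maximum far above the 75th percentile, relative to the IQR, hints at outliers.
print("max - Q3:", gap, " IQR:", iqr)
```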

Exploring Data

• Exploring data by using data query, visualization, and statistics to indicate data characteristics or lead to interesting subsets for further examination.
• Sampling
• Feature engineering
• Correlation

Data Query

• Selecting instances with single or multiple filtering conditions
• Selecting columns
• Aggregating data
• Grouping data

Selecting Instances using Python

Data query to interesting subset, e.g. missing value in Automobile dataset
• Single filtering condition

Selecting Instances using Python (2)

• Multiple filtering conditions
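Single and multiple filtering conditions can be sketched with boolean masks. The frame below is a made-up stand-in that follows the conventions used for the Automobile dataset elsewhere in this deck: integer column labels and `'?'` for missing values.

```python
import pandas as pd

# Toy frame with integer column labels, as when a CSV is read without a header.
df = pd.DataFrame({
    0: ["audi", "bmw", "audi", "volvo"],
    1: ["gas", "gas", "diesel", "gas"],
    2: ["13950", "?", "17450", "?"],   # '?' marks a missing price
})

# Single filtering condition: rows whose price is missing.
missing_price = df[df[2] == "?"]

# Multiple filtering conditions: combine boolean masks with & (and) or | (or),
# wrapping each condition in parentheses.
gas_with_price = df[(df[1] == "gas") & (df[2] != "?")]

print(missing_price)
print(gas_with_price)
```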

Selecting Columns

Aggregating Data

Grouping Data by Category Values
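The three query operations named on these slides (selecting columns, aggregating, grouping by category values) might look as follows in pandas. The column names and values are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "body":  ["sedan", "sedan", "wagon", "wagon", "sedan"],
    "drive": ["fwd", "rwd", "fwd", "fwd", "fwd"],
    "price": [10000.0, 20000.0, 12000.0, 14000.0, 12000.0],
})

# Selecting columns: pass a list of labels.
subset = df[["body", "price"]]

# Aggregating data: reduce a column to summary values.
stats = df["price"].agg(["min", "mean", "max"])

# Grouping data by category values: one aggregate per group.
mean_by_body = df.groupby("body")["price"].mean()

print(subset.columns.tolist())
print(stats)
print(mean_by_body)
```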

Exercise 1: Data Query

Execute this program for Automobile dataset, and explain its output.

for i in range(26):
    nr=df[df[i]=='?'].shape[0]
    if nr>0:
        print(i,':',nr)

Exercise 2: Data Query

Write a program to show the number of instances that contain missing values in the Automobile dataset.
Hint: The output should be 46.

Exercise 3: Data Query

Execute this program for Automobile dataset, and explain its output.

import numpy as np

df_3features=df[df.columns[np.r_[6:8,25]]]
df_3features=df_3features[df_3features[25]!='?']
df_3features[25] = df_3features[25].astype("float64")
df_grp=df_3features.groupby([df.columns[6],df.columns[7]]).mean()
df_pivot=df_grp.pivot_table(index=df.columns[7],columns=df.columns[6])
df_pivot

Data Analysis in Statistics

Descriptive Data Analysis (DDA)
A series of methods that summarize data (e.g. sample mean and standard deviation)

Exploratory Data Analysis (EDA)
A series of methods for generating hypotheses using visualizations. Analysis to prepare the data for modeling

Confirmatory Data Analysis (CDA)
A series of methods for statistical inference, calculation of p-values and interpretation of their implications for testing hypotheses

EDA will be conducted on the dataset to understand the data & prepare the hypothesis

http://www.models.kvl.dk/sites/default/files/Data_Analysis.png, cited from Allen et al. (2018)

Exploratory Data Analysis

• EDA is NOT about making fancy visualizations or even aesthetically pleasing ones, the goal is to try and answer questions with data.
• EDA indeed makes sure that you explore the data in such a way that interesting features and relationships between features will become more clear.
• In EDA, you typically explore and compare many different variables with a variety of techniques to search and find systematic patterns.

https://www.datacamp.com/community/tutorials/exploratory-data-analysis-python

Exploratory Data Analysis (EDA): Automobile Dataset

Main goal: “What are the characteristics that have the most impact on the car price?”

● Summarize main characteristics of the data
● Gain better understanding of the dataset,
● Uncover relationships between different variables, and
● Extract important variables for the problem we are trying to solve

Data Visualization in Python

01 Pandas
● provides basic plotting functions

02 matplotlib
● 2-D plotting library that helps in visualizing figures. Matplotlib emulates MATLAB-like graphs and visualizations.

03 seaborn
● a Python data visualization library based on matplotlib.
● It provides a high-level interface for drawing attractive and informative statistical graphics.

Data Visualization

Histogram for Numerical Data

A histogram represents the distribution of data by forming bins along the range of the data and then drawing bars to show the number of observations that fall in each bin.

df[10].hist().set(xlabel='Attribute 10', ylabel='Frequency')

Histogram: Pandas vs seaborn

# pandas
df[10].hist(grid=False).set(xlabel='Attribute 10', ylabel='Frequency')

# seaborn (distplot is deprecated since seaborn 0.11; histplot is its replacement for histograms)
import seaborn as sns
sns.distplot(df[10], kde=False, bins=5)
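The binning step behind a histogram can be reproduced without plotting. The sketch below uses NumPy's `np.histogram` on made-up values to show how the range is split into equal-width bins and how observations are counted per bin.

```python
import numpy as np

values = np.array([1, 2, 2, 3, 5, 6, 6, 7, 9, 10])

# Split the range [1, 10] into 3 equal-width bins and count values per bin.
counts, edges = np.histogram(values, bins=3)

# Each bin is half-open [left, right), except the last, which includes its right edge.
for left, right, n in zip(edges[:-1], edges[1:], counts):
    print(f"{left:.1f} to {right:.1f}: {n}")
```

The counts are the bar heights a histogram with `bins=3` would draw for the same data.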

DistPlot: Parameter Bins

import seaborn as sns
import matplotlib.pyplot as plt
f, axes = plt.subplots(1, 2, figsize=(10,5), sharex=True)
sns.distplot(df[10], kde=False, ax=axes[0]).set(xlabel='Attribute 10', ylabel='Frequency')
sns.distplot(df[10], kde=False, bins=5, ax=axes[1]).set(xlabel='Attribute 10', ylabel='Frequency')

DistPlot: Parameter KDE

import seaborn as sns
import matplotlib.pyplot as plt
f, axes = plt.subplots(1, 3, figsize=(15,5), sharex=True)
sns.distplot(df[10], kde=False, bins=5, ax=axes[0]).set(xlabel='Attribute 10', ylabel='Frequency')
sns.distplot(df[10], bins=5, ax=axes[1]).set(xlabel='Attribute 10', ylabel='Density')
sns.distplot(df[10], bins=5, hist=False, ax=axes[2]).set(xlabel='Attribute 10', ylabel='Density')

Histogram for Categorical Data

A count plot can be thought of as a histogram across a categorical, instead of quantitative, variable. It shows the counts of observations in each categorical bin using bars.

import seaborn as sns
# pass the column as x explicitly; seaborn 0.12 and later treat the first positional argument as data
sns.countplot(x=df[6]).set(xlabel='Attribute 6', ylabel='Frequency')

CountPlot Parameters
https://seaborn.pydata.org/generated/seaborn.countplot.html

import seaborn as sns
sns.countplot(y=df[6])

import seaborn as sns
sns.countplot(x=df[6],hue=df[3])
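The bar heights of a count plot are plain category counts, so they can be inspected as numbers first. A minimal sketch with hypothetical column names: `value_counts()` matches a count plot on one column, and `pd.crosstab` matches a count plot with a second column as `hue`.

```python
import pandas as pd

df = pd.DataFrame({
    "body": ["sedan", "wagon", "sedan", "hatchback", "sedan", "wagon"],
    "fuel": ["gas", "gas", "diesel", "gas", "gas", "diesel"],
})

# Bar heights of a count plot on one categorical column.
counts = df["body"].value_counts()

# Bar heights when a second categorical column is used as hue.
counts_by_fuel = pd.crosstab(df["body"], df["fuel"])

print(counts)
print(counts_by_fuel)
```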

Exercise 4: Histogram for Price

Write a program to show a histogram for attribute 25 (price) with 3 bins.

BoxPlot

A boxplot is a standardized way of displaying the distribution of data based on a five number summary (“minimum”, first quartile (Q1), median, third quartile (Q3), and “maximum”).
• The ends of the box show the upper and lower quartiles.
• The line that divides the box into 2 parts represents the median of the data.
• The extreme lines show the highest and lowest values excluding outliers.
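The five number summary behind a boxplot can be computed directly. The sketch below uses `np.percentile` on made-up data. It reports the plain minimum and maximum, whereas the whisker ends of a boxplot exclude outliers and can therefore differ.

```python
import numpy as np

data = np.array([7.0, 8.5, 8.6, 9.0, 9.0, 9.2, 9.4, 9.5, 10.0])

# np.percentile interpolates linearly between data points by default.
minimum, q1, median, q3, maximum = np.percentile(data, [0, 25, 50, 75, 100])

print(minimum, q1, median, q3, maximum)
```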

BoxPlot: Example

import seaborn as sns
# Make boxplot for one group only
sns.boxplot(y=df[0]).set_title('basic plot')

BoxPlot Parameters
https://seaborn.pydata.org/generated/seaborn.boxplot.html

import seaborn as sns
sns.boxplot(x=df[0]).set_title('horizontal')

import seaborn as sns
sns.boxplot(y=df[0],notch=True).set_title('notched plot')

Notched BoxPlot
https://sites.google.com/site/davidsstatistics/home/notched-box-plots

BoxPlot with Outliers
https://towardsdatascience.com/understanding-boxplots-5e2df7bcbd51

BoxPlot with Outliers: Example

import seaborn as sns
sns.boxplot(y=df[20]).set_title('basic plot')

IQR = Q3 - Q1 = 9.4 - 8.6 = 0.8
Max (upper whisker limit) = Q3 + 1.5*IQR = 9.4 + 1.2 = 10.6
Min (lower whisker limit) = Q1 - 1.5*IQR = 8.6 - 1.2 = 7.4

BoxPlot Grouped-by

import seaborn as sns
sns.boxplot(y=df[0], x=df[3]).set_title('group by fuel-type')
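The whisker limits computed for attribute 20 (Q1 = 8.6, Q3 = 9.4) can be turned into an outlier test: any value outside the range from Q1 - 1.5*IQR to Q3 + 1.5*IQR is drawn as an individual point. In this sketch only the two quartiles come from the slide; the readings are hypothetical.

```python
import numpy as np

q1, q3 = 8.6, 9.4                 # quartiles quoted on the slide
iqr = q3 - q1
lower_limit = q1 - 1.5 * iqr      # lower whisker limit
upper_limit = q3 + 1.5 * iqr      # upper whisker limit

# Hypothetical readings to classify against the limits.
readings = np.array([7.0, 8.5, 9.0, 9.4, 10.0, 21.0, 23.0])
is_outlier = (readings < lower_limit) | (readings > upper_limit)

print(round(lower_limit, 2), round(upper_limit, 2))
print(readings[is_outlier])
```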

BoxPlot Grouped-by 2 Categories

import seaborn as sns
sns.boxplot(y=df[0], x=df[3], hue=df[4]).set_title('grouped by fuel-type & aspiration')

Exercise 5: Boxplot drive-wheels and Price

Write a program to show a boxplot representing the effect of attribute 7 (drive-wheels) on attribute 25 (price).
Hints: y=prices; x=drive-wheels

ScatterPlot

A scatter plot represents the relationship between x and y.

import seaborn as sns
df_price=df[df[25]!='?'][25].astype("int64")
sns.scatterplot(x=df[16], y=df_price).set(xlabel='Attribute 16', ylabel='Attribute 25')

https://seaborn.pydata.org/generated/seaborn.scatterplot.html

ScatterPlot & Correlation Coefficient

df_price_engine=df.iloc[:,[16,25]]
df_price_engine=df_price_engine[df_price_engine[25]!='?']
df_price_engine[25]=df_price_engine[25].astype("int64")
df_price_engine.corr()

Correlation Coefficient

• Correlation coefficients are a quantitative measure that describes the strength of association/relationship between two variables.
• The correlation between two sets of data tells us about how they move together. Would changing one help us predict the other?

Correlation Coefficient: Notes

• Correlation coefficient will lie between -1 and 1
• The greater the absolute value (closer to -1 or 1), the stronger the relationship between the variables:
  • The strongest correlation is a -1 or a 1
  • The weakest correlation is a 0
• A positive correlation means that as one variable increases, the other one tends to increase as well
• A negative correlation means that as one variable increases, the other one tends to decrease
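The notes on sign and strength can be checked numerically. The sketch below assumes the Pearson coefficient, which is the default method of pandas `.corr()`. The helper `pearson_r` is a hypothetical name written out for illustration, and the data are made up.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_up = 2 * x + 1         # increases as x increases
y_down = -3 * x + 10     # decreases as x increases

def pearson_r(a, b):
    """Pearson correlation: covariance divided by the product of the spreads."""
    a_c, b_c = a - a.mean(), b - b.mean()
    return (a_c * b_c).sum() / np.sqrt((a_c ** 2).sum() * (b_c ** 2).sum())

print(pearson_r(x, y_up))           # perfect positive linear relationship
print(pearson_r(x, y_down))         # perfect negative linear relationship
print(np.corrcoef(x, y_up)[0, 1])   # NumPy's built-in agrees with the helper
```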

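Correlation Coefficient Heatmap

import seaborn as sns
# numeric_only=True restricts the correlation to numeric columns;
# pandas 2.0 and later raise an error on text columns otherwise
corr = df.corr(numeric_only=True)
sns.heatmap(corr, annot=True, fmt='.2f')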

Heatmap Plot from Data Grid

A heatmap plot can represent a pivot table. A heatmap takes a rectangular grid of data and assigns a color intensity based on the data value at the grid points.

https://seaborn.pydata.org/generated/seaborn.heatmap.html

Heatmap from Pivot Table

import seaborn as sns
sns.heatmap(df_pivot, annot=True, fmt='.2f').set(xlabel='Attribute 6', ylabel='Attribute 7')

Generate Pivot Table: Example

import numpy as np
#select body-style(6), drive-wheels(7), price(25) from df
df_3features=df[df.columns[np.r_[6:8,25]]]
#select data where price is not missing
df_3features=df_3features[df_3features[25]!='?']
#convert price attribute in object type into float
df_3features[25] = df_3features[25].astype("float64")
#group by body-style(6), drive-wheels(7)
df_grp=df_3features.groupby([df.columns[6],df.columns[7]]).mean()
#create pivot table
df_pivot=df_grp.pivot_table(index=df.columns[7],columns=df.columns[6])

PairPlot

A PairPlot draws scatterplots for joint relationships and histograms for univariate distributions.

A PairPlot allows us to see both the distribution of single variables and relationships between two variables.

https://seaborn.pydata.org/generated/seaborn.pairplot.html
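The group-then-pivot steps of the pivot table example can also be collapsed into one `pivot_table` call with `aggfunc="mean"`. This is an alternative sketch on made-up data with hypothetical column names, not the code from the slide.

```python
import pandas as pd

df = pd.DataFrame({
    "body":  ["sedan", "sedan", "wagon", "wagon", "sedan"],
    "drive": ["fwd", "rwd", "fwd", "fwd", "fwd"],
    "price": [10000.0, 20000.0, 12000.0, 14000.0, 12000.0],
})

# One row per drive value, one column per body value, mean price in each cell.
pivot = df.pivot_table(values="price", index="drive",
                       columns="body", aggfunc="mean")
print(pivot)
```

Combinations that never occur in the data (here rwd and wagon) show up as NaN, which a heatmap leaves blank.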

PairPlot

import numpy as np
import pandas as pd
import seaborn as sns

#select length(10),width(11),height(12),price(25) from df
df_4features=df[df.columns[np.r_[10:13,25]]]
#select data where price is not missing
df_4features=df_4features[df_4features[25]!='?']
#convert price attr. in object type into float
df_4features[25] = df_4features[25].astype("float64")
#bin price into 5 equal-width intervals and use them as categories
df_4features[25] = pd.cut(df_4features[25], 5)
df_4features[25] = df_4features[25].astype("category")

sns.pairplot(data=df_4features,hue=df.columns[25])

https://infogram.com/page/choose-the-right-chart-data-visualization
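The PairPlot code relies on `pd.cut` to turn the numeric price into five categories for the `hue` colouring. A minimal sketch of that binning step on made-up prices:

```python
import pandas as pd

prices = pd.Series([5000.0, 7500.0, 12000.0, 18000.0, 30000.0, 45000.0])

# Five equal-width intervals spanning the minimum to the maximum of the data.
price_bins = pd.cut(prices, 5)

print(price_bins.cat.categories)                 # the five intervals
print(price_bins.value_counts().sort_index())    # observations per interval
```

Equal-width bins can be very unevenly filled when the data are skewed, which is worth keeping in mind when reading the colours in the plot.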

Exercise 6: Data Visualization

1. Load white Wine Quality dataset (https://archive.ics.uci.edu/ml/datasets/wine+quality)
   df = pd.read_csv("winequality-white.csv",sep=';')
2. Explore the data by using visualization. Report interesting insights, especially about outliers and the imbalanced dataset.