Data Analysis 2
This observation suggests that there are extreme values (outliers) in our data set. There is notably a large difference between the 75th percentile and the max values of the predictors “residual sugar”, “free sulfur dioxide”, and “total sulfur dioxide”.

Describing data
Dataset for modelling
Verifying data quality
Next: exploring data (this ppt)
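The outlier observation above (a large gap between the 75th percentile and the maximum) can be checked numerically. The following is a minimal sketch, assuming pandas is available; the values and the name df_demo are made up for illustration and are not the data set from the slides.

```python
import pandas as pd

# Made-up values for illustration only (NOT the data from the slides).
df_demo = pd.DataFrame({
    "residual sugar": [1.2, 1.6, 2.0, 2.4, 2.8, 3.2, 3.6, 4.0, 65.8],
    "alcohol": [9.0, 9.5, 10.0, 10.5, 11.0, 11.5, 12.0, 12.5, 13.0],
})

# describe() reports count, mean, std, min, 25%, 50%, 75% and max per column.
summary = df_demo.describe()

# Express the gap between the maximum and the 75th percentile in IQR units.
iqr = summary.loc["75%"] - summary.loc["25%"]
gap_in_iqrs = (summary.loc["max"] - summary.loc["75%"]) / iqr
print(gap_in_iqrs.round(1))
```

A gap well above 1.5 IQRs, as for the hypothetical "residual sugar" column here, is one common hint that the column contains extreme values.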
4/2/20
its output.
sample mean and standard deviation)
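import numpy as np

# Keep columns 6, 7 and 25 of the header-less dataset
df_3features = df[df.columns[np.r_[6:8, 25]]]
# Drop rows whose column 25 holds the missing-value marker '?'; copy so the
# assignment below does not trigger pandas' SettingWithCopyWarning
df_3features = df_3features[df_3features[25] != '?'].copy()
df_3features[25] = df_3features[25].astype("float64")
# Mean of column 25 for each (column 6, column 7) combination
df_grp = df_3features.groupby([df.columns[6], df.columns[7]]).mean()
# Rows: column 7 values; columns: column 6 values
df_pivot = df_grp.pivot_table(index=df.columns[7], columns=df.columns[6])
df_pivot

Exploratory Data Analysis (EDA)
A series of methods for generating hypotheses using visualizations.
Analysis to prepare the data for modeling.

Confirmatory Data Analysis (CDA)
A series of methods for statistical inference, calculation of p-values, and interpretation of their implications for proving hypotheses.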
EDA will be conducted on the dataset to understand the data and prepare the hypotheses.
• EDA is NOT about making fancy visualizations or even aesthetically pleasing ones, the goal is to try and answer questions with data.
• EDA indeed makes sure that you explore the data in such a way that interesting features and relationships between features will become more clear.
• In EDA, you typically explore and compare many different variables with a variety of techniques to search and find systematic patterns.

Main goal: “What are the characteristics that have the most impact on the car price?”
● Summarize main characteristics of the data
● Gain better understanding of the dataset,
● Uncover relationships between different variables, and
● Extract important variables for the problem we are trying to solve
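As a first pass at the main goal stated above, one possible approach is to summarize the numeric columns and rank them by their correlation with price. This is only a sketch under stated assumptions: the rows are made up, and the column names (engine_size, horsepower, city_mpg, price) are hypothetical stand-ins for the car dataset used in the slides.

```python
import pandas as pd

# Made-up rows for illustration; the column names are assumptions.
cars = pd.DataFrame({
    "engine_size": [97, 109, 120, 136, 152, 181, 209],
    "horsepower": [69, 85, 97, 110, 121, 160, 182],
    "city_mpg": [31, 27, 27, 19, 19, 19, 16],
    "price": [7349, 8495, 9959, 15250, 16500, 17199, 30760],
})

# Summarize the main characteristics of each column.
print(cars.describe())

# Pearson correlation of every feature with price, then rank by strength.
corr_with_price = cars.corr()["price"].drop("price")
ranking = corr_with_price.abs().sort_values(ascending=False)
print(ranking)
```

Correlation only captures linear relationships between numeric variables, so a ranking like this is a starting point for hypotheses rather than a final answer.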
https://www.datacamp.com/community/tutorials/exploratory-data-analysis-python
https://seaborn.pydata.org/generated/seaborn.countplot.html
Exercise 4: Histogram for Price

BoxPlot
The ends of the box show the upper and lower quartiles.
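For the price histogram exercise above, the binning step can be sketched without a plotting library. The example below assumes NumPy is available and uses made-up prices, not the dataset from the slides; the choice of 5 bins is arbitrary.

```python
import numpy as np

# Made-up prices for illustration only.
prices = np.array([5000, 6500, 7000, 7200, 8000, 9500, 12000, 15000, 30000])

# np.histogram splits the value range into equal-width bins and counts
# how many observations fall into each one.
counts, bin_edges = np.histogram(prices, bins=5)

# Print a simple text histogram, one row per bin.
for count, left, right in zip(counts, bin_edges[:-1], bin_edges[1:]):
    print(f"{left:>8.0f} to {right:>8.0f}: {'#' * int(count)}")
```

Most of the made-up prices fall into the lowest bin, which is the kind of right-skewed shape a histogram makes visible at a glance.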
https://towardsdatascience.com/understanding-boxplots-5e2df7bcbd51
https://sites.google.com/site/davidsstatistics/home/notched-box-plots
IQR = Q3 - Q1 = 9.4 - 8.6 = 0.8
Max = Q3 + 1.5*IQR = 9.4 + 1.2 = 10.6
Min = Q1 - 1.5*IQR = 8.6 - 1.2 = 7.4

import seaborn as sns
sns.boxplot(y=df[0], x=df[3]).set_title('group by fuel-type')
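The IQR arithmetic above (Q1 = 8.6, Q3 = 9.4) can be reproduced with a small helper. This is a minimal sketch: the function name whisker_limits and the sample values are hypothetical, and the factor 1.5 is the conventional multiplier used in the calculation above.

```python
def whisker_limits(q1, q3, k=1.5):
    """Return the (lower, upper) limits located k * IQR beyond the quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Quartiles taken from the worked example.
lower, upper = whisker_limits(8.6, 9.4)
print(round(lower, 2), round(upper, 2))

# Made-up observations: anything outside the limits is flagged as an outlier.
values = [7.0, 8.0, 9.0, 10.0, 11.2]
outliers = [v for v in values if v < lower or v > upper]
print(outliers)
```

In a drawn box plot the whiskers usually stop at the most extreme data points that still lie inside these limits, so the limits act as cut-offs rather than as guaranteed whisker positions.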
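# Correlate numeric columns only; recent pandas versions raise an error
# when non-numeric columns are included
corr = df.corr(numeric_only=True)
sns.heatmap(corr, annot=True, fmt='.2f')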
PairPlot
sns.pairplot(data=df_4features,hue=df.columns[25])
https://infogram.com/page/choose-the-right-chart-data-visualization