
(Data Mining Project)

1.1 Reading the data


Top Five Rows

Bottom Five Rows


Information about the Given Dataset

Insights:

 The dataset contains 23,066 observations and 19 variables.


 Looking at the data, 13 of the columns are numeric (6 float + 7 int); the remaining 6 are of object dtype.

 There are some missing values in some columns.

 Certain columns contain NaN entries; let's understand more about this data.

 Before analysing the histograms, we should check whether all the columns are relevant.
Summary of the Data
Checking for Missing Values

Insights:

 There are missing values in the columns 'CPC', 'CTR' and 'CPM', and there are no duplicate values.

1.2 Treating the Missing Values


Insights: The missing values in 'CPC', 'CTR' and 'CPM' are filled by defining a function and calling it on each column.
CPM = (Total Campaign Spend / Number of Impressions) * 1,000
CPC = Total Cost (spend) / Number of Clicks
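The imputation above can be sketched as follows. This is a minimal illustration, not the report's exact code: the column names `Spend`, `Impressions`, `Clicks`, `CPM`, `CPC`, `CTR` follow the report, and the CTR formula (clicks per impression, as a percentage) is the standard definition, which the report does not state explicitly.

```python
import numpy as np
import pandas as pd

def fill_cost_metrics(df):
    """Fill missing CPM/CPC/CTR from Spend, Impressions and Clicks."""
    df = df.copy()
    # CPM = (Total Campaign Spend / Number of Impressions) * 1,000
    df["CPM"] = df["CPM"].fillna(df["Spend"] / df["Impressions"] * 1000)
    # CPC = Total Cost (Spend) / Number of Clicks
    df["CPC"] = df["CPC"].fillna(df["Spend"] / df["Clicks"])
    # CTR = (Clicks / Impressions) * 100 -- standard definition, assumed here
    df["CTR"] = df["CTR"].fillna(df["Clicks"] / df["Impressions"] * 100)
    return df

# tiny hypothetical sample to demonstrate the fill
demo = pd.DataFrame({
    "Spend":       [100.0, 50.0],
    "Impressions": [20000, 10000],
    "Clicks":      [200, 100],
    "CPM":         [np.nan, 4.0],
    "CPC":         [0.6, np.nan],
    "CTR":         [np.nan, 1.0],
})
filled = fill_cost_metrics(demo)
```

Only the rows that were NaN are recomputed; values already present (e.g. the 4.0 CPM) are left untouched by `fillna`.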

1.3 Checking the Outliers:


Observations:

Here the outliers are treated:


Observations: Outliers are treated using the third method, capping and flooring.
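Capping and flooring is usually done at the Tukey whisker limits, Q1 − 1.5·IQR and Q3 + 1.5·IQR. A minimal sketch, assuming that convention (the report does not state which bounds it used):

```python
import pandas as pd

def cap_floor_iqr(s, k=1.5):
    """Floor values below Q1 - k*IQR and cap values above Q3 + k*IQR."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return s.clip(lower=lower, upper=upper)

# hypothetical series with one obvious outlier (100)
s = pd.Series([1, 2, 3, 4, 5, 100])
capped = cap_floor_iqr(s)
```

After capping, the extreme value 100 is pulled down to the upper fence while the in-range values are unchanged, which preserves the row instead of dropping it.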

1.4 Scaling the Data Using the z-score Method


z-scores can improve the performance and accuracy of some algorithms, such as
linear regression, logistic regression, k-means clustering and principal component
analysis. Furthermore, z-scores make the data more interpretable and comparable,
since the scaled variables have a mean of zero and a standard deviation of one.

Checking the outliers after scaling the data

Observations: After scaling, there are no outliers except in the 'Ad Size' and 'Fee' columns;
the data is now standardized for performing PCA.
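The z-score transform itself is a one-liner; a minimal sketch on a small hypothetical frame (the report presumably used `scipy.stats.zscore` or `StandardScaler`, but the arithmetic is the same):

```python
import pandas as pd

def zscore(df):
    """Standardize each column to mean 0 and std 1 (population std, ddof=0)."""
    return (df - df.mean()) / df.std(ddof=0)

# hypothetical numeric frame standing in for the campaign data
demo = pd.DataFrame({"x": [1.0, 2.0, 3.0], "y": [10.0, 20.0, 30.0]})
scaled = zscore(demo)
```

Note that scaling is a linear shift-and-stretch: it changes units, not shape, which is why the skew and outlier pattern survive the transform.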

K-means clustering is performed here for k = 1 up to 10.


WSS (within-cluster sum of squares) values for k = 1 to 10, listed over two near-identical runs:

[299858.0000000001,
188021.88212355843,
138750.20118781904,
101038.12425527892,
64879.90747198436,
55705.68918071641,
49984.62646428969,
44590.74405627693,
40345.27722151931,
37652.9091388024,
299858.0000000001,
188022.07154238026,
138828.06525702652,
101038.12574043206,
64879.90747198436,
55705.68918071641,
49984.752108662455,
44590.648668953334,
40345.2752819258,
37375.43957426051]
Insights: Here each record is assigned to one of the k clusters according to its distance
from each cluster centroid, so as to minimize the dispersion within the clusters; each
centroid is the average of the records assigned to it.
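The WSS list above comes from fitting k-means for each k and recording the inertia. A minimal sketch using scikit-learn on synthetic stand-in data (the real input would be the scaled campaign frame):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# two well-separated synthetic blobs stand in for the scaled data
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(8, 1, (50, 2))])

wss = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=1).fit(X)
    wss.append(km.inertia_)  # within-cluster sum of squares for this k
```

Plotting `wss` against `range(1, 11)` gives the elbow plot discussed below; the inertia always decreases as k grows, so the point of diminishing returns, not the minimum, is what identifies the cluster count.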

1.5 Performing hierarchical clustering by constructing a dendrogram using Ward linkage and Euclidean distance:

Constructing the dendrogram by calling the dendrogram function:

Viewing the last 10 merged clusters using truncation, with p=10, we get:
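The Ward/Euclidean dendrogram with the last-10-merges truncation can be sketched with SciPy as follows (synthetic stand-in data; `no_plot=True` is used here only so the example returns the tree structure instead of drawing it):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 4))  # stand-in for the scaled data

# Ward linkage requires (and defaults to) Euclidean distance
Z = linkage(X, method="ward", metric="euclidean")

# truncate_mode="lastp", p=10 keeps only the last 10 merged clusters
tree = dendrogram(Z, truncate_mode="lastp", p=10, no_plot=True)
```

Each row of `Z` records one merge (the two cluster indices, the merge distance, and the new cluster size), so n observations always produce n−1 merges.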
1.6 Elbow plot (up to n=10) and identifying the optimum number of clusters for the
k-means algorithm.

Observations: When we move from k=1 to k=2, there is a significant drop in the WSS
value, and the drops from k=2 to k=3 and from k=3 to k=4 are significant as well. But
from k=4 to k=5 and k=5 to k=6, the drop in value shrinks considerably. In other
words, the WSS does not drop significantly beyond k=4, so 4 is the optimal number of
clusters.

1.7 Clustering: silhouette scores for up to 10 clusters, and identifying the optimum
number of clusters.
Observations: The silhouette_samples function computes the silhouette width for each row.

The silhouette_score is 0.3762959936337292, from which we can conclude that the clusters are reasonably well separated.
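A minimal sketch of the silhouette computation with scikit-learn, on synthetic stand-in data (the report's score of ~0.376 comes from the actual scaled dataset with its chosen k):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (40, 2)), rng.normal(6, 1, (40, 2))])

labels = KMeans(n_clusters=2, n_init=10, random_state=1).fit_predict(X)
widths = silhouette_samples(X, labels)   # one silhouette width per row
overall = silhouette_score(X, labels)    # the mean of the per-row widths
```

Silhouette widths range from −1 to 1; values near 1 mean a row sits deep inside its own cluster, values near 0 mean it lies between clusters, so a mean around 0.38 indicates moderate, not perfect, separation.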

Cluster Profiling:

1.8 Concluding the project with a summary of my observations

 The dataset contains 23,066 observations and 19 variables.


 Looking at the data, 13 of the columns are numeric (6 float + 7 int); the remaining 6 are of object dtype.
 There are some missing values in some columns.
 Certain columns contain NaN entries; let's understand more about this data.
 Before analysing the histograms, we should check whether all the columns are relevant.

 There are no duplicate values.


 The missing values in 'CPC', 'CTR' and 'CPM' are filled by defining a function and calling it on each column.
 CPM = (Total Campaign Spend / Number of Impressions) * 1,000

 CPC = Total Cost (spend) / Number of Clicks

 Skewness: Ad Length 0.33, Ad Width 0.21, Ad Size 1.21, Available_Impressions 1.25,
Matched_Queries 1.21, Impressions 1.2, Clicks 1.2, Spend 1.17, Fee -1.58,
Revenue 1.19, CTR 0.57, CPM 0.63, CPC 1.2
 Almost all the variables have outliers and a right skew; Ad Length and Ad Width have
no outliers, and the Fee column (left-skewed) has outliers on both sides.
 Outliers are treated using the third method, capping and flooring.
 The data is scaled using the z-score method.
 z-scores can improve the performance and accuracy of some algorithms, such as
linear regression, logistic regression, k-means clustering and principal component
analysis. Furthermore, z-scores make the data more interpretable and
comparable, since the scaled variables have a mean of zero and a standard deviation of one.
 After scaling, there are no outliers except in the 'Ad Size' and 'Fee' columns; the data
is now standardized for performing PCA.

 Performed hierarchical clustering by constructing a dendrogram using Ward linkage and Euclidean distance.

 When we move from k=1 to k=2, there is a significant drop in the WSS value, and the
drops from k=2 to k=3 and from k=3 to k=4 are significant as well. But from k=4 to
k=5 and k=5 to k=6, the drop in value shrinks considerably. In other words, the WSS
does not drop significantly beyond k=4, so 4 is the optimal number of clusters.
 The silhouette_samples function computes the silhouette width for each row.

 The silhouette_score is 0.3762959936337292, from which we can conclude that the
clusters are reasonably well separated.
Part 2

2.1 Reading the Data

Top 5 Rows
Bottom 5 Rows:

Checking Information

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 640 entries, 0 to 639
Data columns (total 59 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 State Code 640 non-null int64
1 Dist.Code 640 non-null int64
2 No_HH 640 non-null int64
3 TOT_M 640 non-null int64
4 TOT_F 640 non-null int64
5 M_06 640 non-null int64
6 F_06 640 non-null int64
7 M_SC 640 non-null int64
8 F_SC 640 non-null int64
9 M_ST 640 non-null int64
10 F_ST 640 non-null int64
11 M_LIT 640 non-null int64
12 F_LIT 640 non-null int64
13 M_ILL 640 non-null int64
14 F_ILL 640 non-null int64
15 TOT_WORK_M 640 non-null int64
16 TOT_WORK_F 640 non-null int64
17 MAINWORK_M 640 non-null int64
18 MAINWORK_F 640 non-null int64
19 MAIN_CL_M 640 non-null int64
20 MAIN_CL_F 640 non-null int64
21 MAIN_AL_M 640 non-null int64
22 MAIN_AL_F 640 non-null int64
23 MAIN_HH_M 640 non-null int64
24 MAIN_HH_F 640 non-null int64
25 MAIN_OT_M 640 non-null int64
26 MAIN_OT_F 640 non-null int64
27 MARGWORK_M 640 non-null int64
28 MARGWORK_F 640 non-null int64
29 MARG_CL_M 640 non-null int64
30 MARG_CL_F 640 non-null int64
31 MARG_AL_M 640 non-null int64
32 MARG_AL_F 640 non-null int64
33 MARG_HH_M 640 non-null int64
34 MARG_HH_F 640 non-null int64
35 MARG_OT_M 640 non-null int64
36 MARG_OT_F 640 non-null int64
37 MARGWORK_3_6_M 640 non-null int64
38 MARGWORK_3_6_F 640 non-null int64
39 MARG_CL_3_6_M 640 non-null int64
40 MARG_CL_3_6_F 640 non-null int64
41 MARG_AL_3_6_M 640 non-null int64
42 MARG_AL_3_6_F 640 non-null int64
43 MARG_HH_3_6_M 640 non-null int64
44 MARG_HH_3_6_F 640 non-null int64
45 MARG_OT_3_6_M 640 non-null int64
46 MARG_OT_3_6_F 640 non-null int64
47 MARGWORK_0_3_M 640 non-null int64
48 MARGWORK_0_3_F 640 non-null int64
49 MARG_CL_0_3_M 640 non-null int64
50 MARG_CL_0_3_F 640 non-null int64
51 MARG_AL_0_3_M 640 non-null int64
52 MARG_AL_0_3_F 640 non-null int64
53 MARG_HH_0_3_M 640 non-null int64
54 MARG_HH_0_3_F 640 non-null int64
55 MARG_OT_0_3_M 640 non-null int64
56 MARG_OT_0_3_F 640 non-null int64
57 NON_WORK_M 640 non-null int64
58 NON_WORK_F 640 non-null int64
dtypes: int64(59)
memory usage: 295.1 KB

Insights: There are 640 rows and 59 columns, all of integer data type. There are no duplicates in these columns. There are
some missing values in some of the columns.

Missing values are treated here:


2(i). Checking the highest and lowest gender ratio state-wise

According to the above graph, Uttar Pradesh has the highest gender ratio, while
Lakshadweep and Chandigarh have the lowest.
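The state-wise comparison above can be sketched with a `groupby`. This assumes a state-name column called `State` (the listed columns only show `State Code`) and the Indian census convention of females per 1000 males; the column names `TOT_F`/`TOT_M` are from the dataset:

```python
import pandas as pd

def gender_ratio_by_state(df, state_col="State"):
    """Females per 1000 males, per state, sorted highest first."""
    g = df.groupby(state_col)[["TOT_F", "TOT_M"]].sum()
    return (g["TOT_F"] / g["TOT_M"] * 1000).sort_values(ascending=False)

# hypothetical two-state sample: state A has two districts
demo = pd.DataFrame({
    "State": ["A", "A", "B"],
    "TOT_M": [1000, 1000, 500],
    "TOT_F": [950, 1050, 400],
})
ratio = gender_ratio_by_state(demo)
```

Summing before dividing matters: a state's ratio is total females over total males, not the average of its district ratios.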

2(ii) Descriptive table:


According to the descriptive table, Raigarh district has the highest gender ratio and
Lahul & Spiti the lowest.
I have chosen the 'No_HH', 'TOT_M', 'TOT_F', 'TOT_WORK_M' and
'TOT_WORK_F' columns to perform the EDA.

Univariate Analysis:

According to the above graph, all variables are right-skewed, indicating positive
skewness.
Bivariate Analysis

The above graph shows that there is a correlation between 'TOT_M' and 'TOT_F'.

The above graph shows that there is a correlation between 'TOT_WORK_M' and
'TOT_WORK_F'.

2.2 Outlier treatment is not necessary unless the outliers result from a processing
mistake or a wrong measurement. True outliers must be kept in the data.
2.3 Checking the outliers before scaling the data

The data is scaled using the z-score method.


Checking the outliers after scaling the data:
Scaling the data does not affect the outliers.

2.4

Covariance matrix
Eigenvectors:
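The covariance matrix and its eigenvectors can be sketched with NumPy. Synthetic stand-in data is used here; on the real dataset this would follow the z-scored census frame from 2.3:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
Xs = (X - X.mean(axis=0)) / X.std(axis=0)   # z-scored columns

cov = np.cov(Xs, rowvar=False)              # covariance matrix (features in columns)
eigvals, eigvecs = np.linalg.eigh(cov)      # eigh: for symmetric matrices

# sort eigenpairs by descending eigenvalue, as PCA conventionally does
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
```

On z-scored data the covariance matrix coincides (up to the ddof convention) with the correlation matrix, and the sorted eigenvectors are exactly the principal component directions used in the next section.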

2.5

Cumulative explained variance ratio to find a cut-off for selecting the number of PCs
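The cut-off search can be sketched with scikit-learn's PCA on synthetic correlated data; the 90% threshold below is a common convention and an assumption, since the report does not state which cut-off it used:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# correlated synthetic data: 6 observed columns driven by 2 latent factors
base = rng.normal(size=(200, 2))
X = np.hstack([base,
               base @ rng.normal(size=(2, 4)) + 0.1 * rng.normal(size=(200, 4))])

pca = PCA().fit(X)
cum = np.cumsum(pca.explained_variance_ratio_)      # cumulative variance curve
n_components = int(np.argmax(cum >= 0.90)) + 1      # smallest k reaching 90%
```

Plotting `cum` against the component index gives the usual scree-style curve; the chosen `n_components` is where the curve first crosses the threshold.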
2.6

Compare how the original features influence various PCs:


2.7

Heatmap: compare how the original features influence the various PCs
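The loadings behind such a heatmap can be sketched as follows. The five column names are taken from the report's chosen EDA columns, but the data here is synthetic, and the (commented-out) seaborn call is one common way to draw the heatmap, not necessarily the report's:

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
cols = ["No_HH", "TOT_M", "TOT_F", "TOT_WORK_M", "TOT_WORK_F"]
X = pd.DataFrame(rng.normal(size=(50, 5)), columns=cols)

pca = PCA(n_components=3).fit(X)
# loadings table: rows = PCs, columns = original features
loadings = pd.DataFrame(pca.components_,
                        index=[f"PC{i + 1}" for i in range(3)],
                        columns=cols)
# import seaborn as sns; sns.heatmap(loadings, annot=True, cmap="coolwarm")
```

Each row of `components_` is a unit vector, so the cell magnitudes are directly comparable: a large absolute value means that feature dominates that PC.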
