DM Project FaizanAliSayyed 10sep
MINING
DSBA-AS Module
Table of Contents
Home
Index
  List of Figures
  List of Tables
Clustering
  Problem statement
  Data introduction information
  Solution Clustering Clean Ads Data problem
  Summary
PCA
  Problem statement
  Data introduction information
  Solution PCA India Census Data problem
  Summary
List of Figures
Figure 1 Clean Ads Data shape
Figure 2 Clean Ads Data info
Figure 3 Clean Ads Data head
Figure 4 Clean Ads Data tail
Figure 5 Clean Ads Data describe
Figure 6 Clean Ads Data null values
Figure 7 Clean Ads Data duplicated rows
Figure 8 Revised Clean Ads Data info
Figure 9 Box Plot: Clean Ads Data
Figure 10 Clean Ads Data Old describe 2
Figure 11 Old Shape
Figure 12 Clustering Clean Data outlier treatment
Figure 13 Clean up data info after outlier treatment
Figure 14 Clean Ads Data description before scaling
Figure 15 Clean Ads Data description after scaling
Figure 16 Dendrogram Ward-Euclidean
Figure 17 Dendrogram Ward-Euclidean (p=10)
Figure 18 WSS values Clean Ads Data
Figure 19 Elbow Plot Clean Ads Data
Figure 20 Clean Ads Data Silhouette analysis
Figure 21 Assigning clusters to the rows
Figure 22 Cluster Distribution Clean Ads Data
Figure 23 Clean Ads Data Cluster summary
Figure 24 Dendrogram: hard to see labels for the 8 clusters
Figure 25 Cutting the Dendrogram with suitable clusters
Figure 26 India Census Data shape
Figure 27 India Census Data head
Figure 28 India Census Data tail
Figure 29 India Census Data info
Figure 30 Statistical summary India Census Data
Figure 31 India Census Data null values
Figure 32 India Census Data duplicate rows
Figure 33 India Census sample data with shortlisted variables
Figure 34 No of areas in each state
Figure 35 Bivariate analysis numeric data India Census
Figure 36 Boxplot for all variables
Figure 37 Scaled data head for India Census Data
Figure 38 Before vs after scaling Data info India Census
Figure 39 Boxplot scaled variables India Census Data
Figure 40 p-value for Factor_analyzer
Figure 41 KMO Index
Figure 42 PCA decomposition
Figure 43 Eigen vector 1
Figure 44 Variable component
Figure 45 Cumulative variance
Figure 46 Scree plot 1
Figure 47 Shape of revised PCA
Figure 48 Principal Components 2
Figure 49 Explained variance of component
Figure 50 PCA Data frame head
Figure 51 Rectangular magnitude for columns in India Census Data variables
Figure 52 New Data India Census PCA shape
Figure 53 Summary PCA India Census Data
Figure 54 PCA Boxplot India Census
List of Tables
List of Equations
ads24x7 is a Digital Marketing company that has just received seed funding of
$10 Million. The company is expanding into Marketing Analytics. It has collected
data from its Marketing Intelligence team and now wants you (its newly
appointed data analyst) to segment types of ads based on the features provided.
Use a clustering procedure to segment the ads into homogeneous groups.
CPM = (Total Campaign Spend / Number of Impressions) * 1,000. Note that the
Total Campaign Spend refers to the 'Spend' Column in the dataset and the
Number of Impressions refers to the 'Impressions' Column in the dataset.
CPC = Total Cost (spend) / Number of Clicks. Note that the Total Cost (spend)
refers to the 'Spend' Column in the dataset and the Number of Clicks refers to the
'Clicks' Column in the dataset.
CTR = (Total Measured Clicks / Total Measured Ad Impressions) * 100. Note that
the Total Measured Clicks refers to the 'Clicks' Column in the dataset and the
Total Measured Ad Impressions refers to the 'Impressions' Column in the dataset.
The Data Dictionary and the detailed description of the formulas for CPM, CPC
and CTR are given in sheet 2 of the Excel file 'Clustering Clean ads_data'.
Sl. No | Column Name | Column Description
1 | Timestamp | The Timestamp of the particular Advertisement.
2 | InventoryType | The Inventory Type of the particular Advertisement. Format 1 to 7. This is a Categorical Variable.
3 | Ad - Length | The Length Dimension of the particular Advertisement.
4 | Ad- Width | The Width Dimension of the particular Advertisement.
5 | Ad Size | The Overall Size of the particular Advertisement. Length*Width.
6 | Ad Type | The type of the particular Advertisement. This is a Categorical Variable.
7 | Platform | The platform in which the particular Advertisement is displayed. Web, Video or App. This is a Categorical Variable.
8 | Device Type | The type of the device which supports the particular Advertisement. This is a Categorical Variable.
9 | Format | The Format in which the Advertisement is displayed. This is a Categorical Variable.
10 | Available_Impressions | How often the particular Advertisement is shown. An impression is counted each time an Advertisement is shown on a search result page or other site on a Network.
11 | Matched_Queries | Matched search queries data is pulled from the Advertising Platform and consists of the exact searches typed into the search Engine that generated clicks for the particular Advertisement.
12 | Impressions | The impression count of the particular Advertisement out of the total available impressions.
13 | Clicks | It is a marketing metric that counts the number of times users have clicked on the particular advertisement to reach an online property.
14 | Spend | It is the amount of money spent on specific ad variations within a specific campaign or ad set. This metric helps regulate ad performance.
15 | Fee | The percentage of the Advertising Fees payable by Franchise Entities.
16 | Revenue | It is the income that has been earned from the particular advertisement.
17 | CTR | CTR stands for "Click through rate". CTR is the number of clicks that your ad receives divided by the number of times your ad is shown. Formula used here is CTR = (Total Measured Clicks / Total Measured Ad Impressions) * 100. Note that the Total Measured Clicks refers to the 'Clicks' Column and the Total Measured Ad Impressions refers to the 'Impressions' Column.
18 | CPM | CPM stands for "cost per 1000 impressions." Formula used here is CPM = (Total Campaign Spend / Number of Impressions) * 1,000. Note that the Total Campaign Spend refers to the 'Spend' Column and the Number of Impressions refers to the 'Impressions' Column.
19 | CPC | CPC stands for "Cost-per-click". Cost-per-click (CPC) bidding means that you pay for each click on your ads. The Formula used here is CPC = Total Cost (spend) / Number of Clicks. Note that the Total Cost (spend) refers to the 'Spend' Column and the Number of Clicks refers to the 'Clicks' Column.
Table 1 CleanAds Data Dictionary
Clustering: Read the data and perform basic analysis such as printing a few
rows (head and tail), info, data summary, null values, duplicate values, etc.
Data Shape
Data info
There are 4736 null values in the CTR, CPM, and CPC columns
Duplicated rows
Summary:
There are 23066 rows and 19 columns: 6 columns are float, 7 columns are integer type, and
6 are categorical variables. There are no duplicate rows in the dataset, but the calculated
columns CTR, CPC and CPM have missing values (4736 rows).
Treat missing values in CPC, CTR and CPM using the formula given.
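Using the data dictionary, we create functions to define CPC, CTR and CPM:
CTR stands for "Click through rate". CTR is the number of clicks that your ad receives divided
by the number of times your ad is shown.
CTR = (Total Measured Clicks / Total Measured Ad Impressions) * 100.
Note that the Total Measured Clicks refers to the 'Clicks' Column and the Total Measured Ad
Impressions refers to the 'Impressions' Column.
CPM stands for "cost per 1000 impressions."
CPM = (Total Campaign Spend / Number of Impressions) * 1,000.
Note that the Total Campaign Spend refers to the 'Spend' Column and the Number of
Impressions refers to the 'Impressions' Column.
CPC stands for "Cost-per-click".
CPC = Total Cost (spend) / Number of Clicks.
Note that the Total Cost (spend) refers to the 'Spend' Column and the Number of Clicks refers
to the 'Clicks' Column.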
Table 2 CTR CPM CPC Define
After extending the formula for CTR, CPM and CPC, we have zero null values.
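The missing-value treatment described here can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function name `fill_derived_metrics` and the toy rows are hypothetical, and only the column names ('Clicks', 'Impressions', 'Spend', 'CTR', 'CPM', 'CPC') come from the data dictionary.

```python
import numpy as np
import pandas as pd


def fill_derived_metrics(df: pd.DataFrame) -> pd.DataFrame:
    """Fill missing CTR, CPM and CPC using the data-dictionary formulas."""
    out = df.copy()
    ctr = out["Clicks"] / out["Impressions"] * 100   # click-through rate in %
    cpm = out["Spend"] / out["Impressions"] * 1000   # cost per 1000 impressions
    cpc = out["Spend"] / out["Clicks"]               # cost per click
    # Only the missing entries are replaced; existing values are kept.
    out["CTR"] = out["CTR"].fillna(ctr)
    out["CPM"] = out["CPM"].fillna(cpm)
    out["CPC"] = out["CPC"].fillna(cpc)
    return out


# Toy rows with made-up values (not taken from the project data).
ads = pd.DataFrame({
    "Impressions": [1000, 2000, 500],
    "Clicks": [10, 40, 5],
    "Spend": [2.0, 10.0, 1.0],
    "CTR": [1.0, np.nan, np.nan],
    "CPM": [2.0, np.nan, 2.0],
    "CPC": [0.2, np.nan, np.nan],
})
filled = fill_derived_metrics(ads)
print(filled.isna().sum())
```

Using `fillna` with a computed series leaves the already-populated rows untouched, which matches the idea of extending the formula only to the rows where the calculated fields are missing.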
Based on the above figure, we can see there are outliers in CPC, CPM, CTR, Revenue, Fee,
Spend, Clicks, Impressions, Matched_Queries, Available_Impressions, and Ad Size.
The only quantitative columns without outliers are Ad Width and Ad Length.
We do not need to treat CPM, CTR, CPC and Ad Size, as they are calculated fields that depend
on other columns. Fee is a percentage that can legitimately vary up to 100%, so we leave it
untreated and treat only the remaining columns.
Sl. No | Column Name | Column Description
10 | Available_Impressions | How often the particular Advertisement is shown. An impression is counted each time an Advertisement is shown on a search result page or other site on a Network.
11 | Matched_Queries | Matched search queries data is pulled from Advertising Platform and consists of the exact searches typed into the search Engine that generated clicks for the particular Advertisement.
12 | Impressions | The impression count of the particular Advertisement out of the total available impressions.
13 | Clicks | It is a marketing metric that counts the number of times users have clicked on the particular advertisement to reach an online property.
14 | Spend | It is the amount of money spent on specific ad variations within a specific campaign or ad set. This metric helps regulate ad performance.
16 | Revenue | It is the income that has been earned from the particular advertisement.
Table 3 Clean Ads Data Dictionary for outliers to be treated
Note: We could have replaced the outliers with median or mean values, but that would pile
many rows onto the same value and create artificially dense groups. To keep the clusters
clearly separated, we remove the outlier rows instead.
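A minimal sketch of row-wise outlier removal is shown below. The report does not state which rule defines an outlier; this sketch assumes the usual box-plot rule (values beyond 1.5 times the IQR from the quartiles), and the helper name `remove_iqr_outliers` and the toy values are hypothetical.

```python
import pandas as pd


def remove_iqr_outliers(df: pd.DataFrame, columns, k: float = 1.5) -> pd.DataFrame:
    """Drop rows that fall outside [Q1 - k*IQR, Q3 + k*IQR] in any listed column."""
    keep = pd.Series(True, index=df.index)
    for col in columns:
        q1, q3 = df[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        # A row is kept only if it is inside the fences for every column.
        keep &= df[col].between(q1 - k * iqr, q3 + k * iqr)
    return df[keep]


# Toy data with one obvious outlier in 'Spend' (made-up values).
toy = pd.DataFrame({
    "Spend": [1.0, 2.0, 3.0, 4.0, 5.0, 100.0],
    "Clicks": [10, 12, 11, 13, 12, 11],
})
cleaned = remove_iqr_outliers(toy, ["Spend", "Clicks"])
print(toy.shape, "->", cleaned.shape)
```

Removing rows shrinks the sample, so it is worth comparing the shape before and after, as the report does with its "Old Shape" figure.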
Perform z-score scaling and discuss how it affects the speed of the algorithm.
Data description before and after scaling
Using Z score.
Z-score scaling puts every column on a comparable scale, so no single large-valued column
dominates the distance calculations. Iterative, distance-based algorithms such as K-means
therefore tend to converge in fewer iterations on scaled data. The scaling step itself adds a
small one-off computation to transform the data.
Z scores make the data more interpretable and comparable, as each scaled column has a mean
of 0 and a standard deviation of 1.
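As a small illustration of z-score scaling, the sketch below uses scikit-learn's `StandardScaler` on made-up values. The report does not say which function was used for scaling, so the choice of `StandardScaler` is an assumption.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy columns on very different scales (made-up values).
toy = pd.DataFrame({
    "Impressions": [1000.0, 2000.0, 500.0, 4000.0],
    "Clicks": [10.0, 40.0, 5.0, 100.0],
})

scaler = StandardScaler()
scaled = pd.DataFrame(scaler.fit_transform(toy), columns=toy.columns)

# StandardScaler uses the population standard deviation (ddof=0).
col_means = scaled.mean()
col_stds = scaled.std(ddof=0)
print(col_means.round(6))
print(col_stds.round(6))
```

After scaling, both columns have mean 0 and standard deviation 1, so neither dominates a Euclidean distance.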
To check the optimal number of clusters we use WSS (Within-cluster Sum of Squares).
WSS scores
Based on the above figures, as we move from k=1 to k=2 there is a significant drop in the
value. From k=2 up to k=8 the value continues to drop significantly.
From k=8 to k=9 the drop becomes much smaller. Hence the WSS does not drop significantly
beyond 8, so 8 is a reasonable number of clusters. Depending on the business problem, it is
advisable to pick the k with the highest or second-highest drop. For now we can
go with either 6 or 8.
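The WSS (elbow) computation can be sketched as below. This assumes K-means from scikit-learn, where the fitted model's `inertia_` attribute is the within-cluster sum of squares; the synthetic blobs stand in for the scaled ads data and are not the project's data.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for the scaled data (assumption: 4 blobs, 2 features).
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=1.0, random_state=42)

wss = []
for k in range(1, 11):
    model = KMeans(n_clusters=k, n_init=10, random_state=42)
    model.fit(X)
    wss.append(model.inertia_)  # within-cluster sum of squares for this k

# Drop in WSS when going from k to k+1; the elbow is where the drops flatten.
drops = [wss[i] - wss[i + 1] for i in range(len(wss) - 1)]
for k, drop in enumerate(drops, start=1):
    print(f"k={k} -> k={k + 1}: drop {drop:.1f}")
```

Plotting `wss` against `k` gives the elbow plot; reading the list of drops is the numeric equivalent of looking for the bend.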
The Silhouette Score is a measure of how similar an object is to its own cluster compared to
other clusters. It ranges from -1 to 1, with higher values indicating better clustering.
Based on the above figure and the elbow plot, the highest silhouette score is for 8
clusters, hence we set the number of clusters to 8.
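A rough sketch of comparing silhouette scores across candidate values of k follows. It uses scikit-learn's `silhouette_score` on synthetic, well-separated blobs (an assumption made so the example is easy to check), not the ads data.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Three well-separated synthetic clusters (illustrative data only).
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
X, _ = make_blobs(n_samples=300, centers=centers, cluster_std=0.5, random_state=42)

scores = {}
for k in range(2, 7):  # the silhouette score needs at least 2 clusters
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(scores)
print("k with the highest silhouette score:", best_k)
```

The same loop applied to the scaled ads data would give the values behind the silhouette analysis figure; the k with the highest score is the candidate to compare against the elbow plot.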
Summary:
An optimum of 8 clusters can be created from the sample obtained after removing
outliers and treating missing values in the dataset.
Clicks on ads and Impressions of ads both rise together with Revenue.
The spend on ads varies with the specific campaign, and it can impact the
revenue.
For ads24x7, a Digital Marketing company, the sample data has helped identify
8 separate segments to be targeted with its campaigns.
Cluster 1 (3349): low CTR, very high clicks, very high fees, high revenue. Top target group: high engagement/returns.
PCA FH (FT): Primary census abstract for female headed households excluding
institutional households (India & States/UTs - District Level), Scheduled tribes -
2011 PCA for Female Headed Household Excluding Institutional Household. The
Indian Census has the reputation of being one of the best in the world. The first
Census in India was conducted in the year 1872. This was conducted at different
points of time in different parts of the country. In 1881 a Census was taken for the
entire country simultaneously. Since then, Census has been conducted every ten
years, without a break. Thus, the Census of India 2011 was the fifteenth in this
unbroken series since 1872, the seventh after independence and the second
census of the third millennium and twenty-first century. The census has
continued uninterrupted despite several adversities like wars, epidemics,
natural calamities, political unrest, etc. The Census of India is conducted under
the provisions of the Census Act 1948 and the Census Rules, 1990. The Primary
Census Abstract, which is an important publication of the 2011 Census, gives basic
information on Area, Total Number of Households, Total Population, Scheduled
Castes, Scheduled Tribes Population, Population in the age group 0-6, Literates,
Main Workers and Marginal Workers classified by the four broad industrial
categories, namely, (i) Cultivators, (ii) Agricultural Laborers, (iii) Household
Industry Workers, and (iv) Other Workers and also Non-Workers. The
characteristics of the Total Population include Scheduled Castes, Scheduled
Tribes, Institutional and Houseless Population and are presented by sex and
rural-urban residence. Census 2011 covered 35 States/Union Territories, 640
districts, 5,924 sub-districts, 7,935 Towns and 6,40,867 Villages.
The data collected has so many variables that it is difficult to find useful
details without using Data Science techniques. You are tasked to perform
detailed EDA and identify the optimum Principal Components that explain the most
variance in the data. Use Sklearn only.
Note: The 24 variables given in the Rubric are just for performing EDA. You will
have to consider the entire dataset, including all the variables, for performing PCA.
Read the data and perform basic checks like checking head, info, summary, nulls,
and duplicates, etc.
Reading the excel data for India Census
Data shape
Data Head
Data tail
Data info
Data summary:
Data dictionary
Name Description
State State Code
District District Code
Name Name
TRU1 Area Name
No_HH No of Household
TOT_M Total population Male
TOT_F Total population Female
M_06 Population in the age group 0-6 Male
F_06 Population in the age group 0-6 Female
M_SC Scheduled Castes population Male
F_SC Scheduled Castes population Female
M_ST Scheduled Tribes population Male
F_ST Scheduled Tribes population Female
M_LIT Literates population Male
F_LIT Literates population Female
M_ILL Illiterate Male
F_ILL Illiterate Female
TOT_WORK_M Total Worker Population Male
TOT_WORK_F Total Worker Population Female
MAINWORK_M Main Working Population Male
MAINWORK_F Main Working Population Female
MAIN_CL_M Main Cultivator Population Male
MAIN_CL_F Main Cultivator Population Female
MAIN_AL_M Main Agricultural Labourers Population Male
MAIN_AL_F Main Agricultural Labourers Population Female
MAIN_HH_M Main Household Industries Population Male
MAIN_HH_F Main Household Industries Population Female
MAIN_OT_M Main Other Workers Population Male
MAIN_OT_F Main Other Workers Population Female
MARGWORK_M Marginal Worker Population Male
MARGWORK_F Marginal Worker Population Female
MARG_CL_M Marginal Cultivator Population Male
MARG_CL_F Marginal Cultivator Population Female
MARG_AL_M Marginal Agriculture Labourers Population Male
MARG_AL_F Marginal Agriculture Labourers Population Female
MARG_HH_M Marginal Household Industries Population Male
MARG_HH_F Marginal Household Industries Population Female
MARG_OT_M Marginal Other Workers Population Male
MARG_OT_F Marginal Other Workers Population Female
MARGWORK_3_6_M Marginal Worker Population 3-6 Male
MARGWORK_3_6_F Marginal Worker Population 3-6 Female
Summary:
There are 640 rows and 61 columns, out of which 59 columns are integer type, and 2 are
categorical variables. There are no null values in the data set, and there are no duplicate rows
as well.
Perform detailed Exploratory analysis by creating certain questions like (i) Which
state has highest gender ratio and which has the lowest? (ii) Which district has
the highest & lowest gender ratio? Pick 5 variables out of the given 24 variables
below for EDA.
To analyze the data, we keep the following columns in the data set:
State code as a categorical variable, plus No of households, Total Male and Total Female
population, and Male and Female literate population.
Sr. no. | Name | Description
1 | State | State Code
2 | No_HH | No of Household
3 | TOT_M | Total population Male
4 | TOT_F | Total population Female
5 | M_LIT | Literates population Male
6 | F_LIT | Literates population Female
Table 6 Shortlisted columns Data Dictionary
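The gender-ratio question from the rubric could be answered along these lines. The report does not define gender ratio, so this sketch assumes the common definition of females per 1000 males; the state labels and population figures are made up, and only the column names `State`, `TOT_M` and `TOT_F` come from the data dictionary.

```python
import pandas as pd

# Toy district-level rows (made-up values, hypothetical state labels).
census = pd.DataFrame({
    "State": ["A", "A", "B", "B", "C"],
    "TOT_M": [1000, 500, 800, 1200, 600],
    "TOT_F": [1100, 450, 700, 1100, 660],
})

# Aggregate districts to state level before taking the ratio.
by_state = census.groupby("State")[["TOT_M", "TOT_F"]].sum()
by_state["gender_ratio"] = by_state["TOT_F"] / by_state["TOT_M"] * 1000

highest = by_state["gender_ratio"].idxmax()
lowest = by_state["gender_ratio"].idxmin()
print(by_state)
print("highest:", highest, "lowest:", lowest)
```

Summing the populations first and then dividing avoids averaging district ratios, which would give small districts the same weight as large ones.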
We choose not to treat outliers for this case. Do you think that treating outliers for
this case is necessary?
No, in my view treating outliers is not necessary in this case, as the data is primary source
data, collected door to door, and should be accurate to a large extent.
The data is essentially binary in form: a person either belongs or does not belong to the
literate/illiterate category, to the working or non-working category, and similarly to each age
group bucket. Extreme values could arise only from human error in recording the data, and the
chances of that are very low.
These outliers are valid outliers and shouldn't be treated. If we eliminated them, we would
remove a necessary representation of a state from the census data, which would distort
the analysis.
Scale the Data using z-score method. Does scaling have any impact on outliers?
Compare boxplots before and after scaling and comment.
To perform scaling on the given dataset, let's consider all the variables that are numeric.
Applying z score
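On the question of whether scaling affects outliers: z-score scaling subtracts a constant and divides by a positive constant, so it shifts and rescales every value without changing their order or relative spacing. The sketch below checks this on made-up values using the 1.5 times IQR box-plot rule (an assumption, since the report does not name the rule); the helper `iqr_outlier_flags` is hypothetical.

```python
import numpy as np


def iqr_outlier_flags(values: np.ndarray, k: float = 1.5) -> np.ndarray:
    """Return True for points outside the box-plot whiskers."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)


# Made-up column with one extreme value.
raw = np.array([12.0, 15.0, 14.0, 13.0, 16.0, 90.0])
scaled = (raw - raw.mean()) / raw.std()  # z-score

flags_raw = iqr_outlier_flags(raw)
flags_scaled = iqr_outlier_flags(scaled)
print(flags_raw)
print(flags_scaled)
```

The same point is flagged before and after scaling, so the boxplots keep the same shape and only the axis values change.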
Perform all the required steps for PCA (use sklearn only). Create the covariance
matrix. Get the eigen values and eigen vectors.
Find below steps for PCA (Principal Component Analysis)
Bartlett's test of sphericity tests the null hypothesis that the variables are uncorrelated in
the population.
As the reported p-value is 0, we can reject the null hypothesis and conclude that at least one
pair of variables in the data is correlated. Hence we can proceed with PCA.
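For reference, Bartlett's test statistic can be computed directly from the correlation matrix. The figure list suggests the report used the factor_analyzer package; the sketch below is a hand-rolled version using NumPy and SciPy on synthetic correlated data, with a hypothetical function name `bartlett_sphericity`.

```python
import numpy as np
from scipy.stats import chi2


def bartlett_sphericity(data: np.ndarray):
    """Bartlett's test that the correlation matrix is the identity matrix."""
    n, p = data.shape
    corr = np.corrcoef(data, rowvar=False)
    _, logdet = np.linalg.slogdet(corr)  # log of the determinant of R
    statistic = -(n - 1 - (2 * p + 5) / 6) * logdet
    dof = p * (p - 1) / 2
    p_value = chi2.sf(statistic, dof)    # upper-tail probability
    return statistic, dof, p_value


# Synthetic data: three columns sharing a common component, so they correlate.
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 1))
data = base + 0.3 * rng.normal(size=(200, 3))

statistic, dof, p_value = bartlett_sphericity(data)
print(statistic, dof, p_value)
```

A very small p-value, as in the report, means the correlation matrix is far from the identity, which is the situation where PCA can compress the variables.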
KMO Test
The Kaiser-Meyer-Olkin (KMO) test reports a measure of sampling adequacy (MSA).
Generally, if MSA is less than 0.5, PCA is not recommended, since no reduction is expected.
On the other hand, MSA > 0.7 is expected to provide a considerable reduction in the
dimension and extraction of meaningful components.
As the KMO index value is 0.807, which is above 0.7, a considerable reduction in dimension
and meaningful components can be expected.
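The overall KMO index compares squared correlations with squared partial correlations. A minimal hand-rolled sketch is given below, using synthetic data with one shared factor; the report itself appears to rely on a library routine, and the function name `kmo_index` is hypothetical.

```python
import numpy as np


def kmo_index(data: np.ndarray) -> float:
    """Overall Kaiser-Meyer-Olkin measure of sampling adequacy."""
    corr = np.corrcoef(data, rowvar=False)
    inv = np.linalg.inv(corr)
    # Partial correlations come from the scaled inverse correlation matrix.
    scale = np.sqrt(np.outer(np.diag(inv), np.diag(inv)))
    partial = -inv / scale
    off_diag = ~np.eye(corr.shape[0], dtype=bool)
    r2 = np.sum(corr[off_diag] ** 2)
    q2 = np.sum(partial[off_diag] ** 2)
    return r2 / (r2 + q2)


# Synthetic data: four columns driven by one common factor plus noise.
rng = np.random.default_rng(0)
factor = rng.normal(size=(500, 1))
data = factor + 0.5 * rng.normal(size=(500, 4))

kmo = kmo_index(data)
print(round(kmo, 3))
```

When the partial correlations are small relative to the plain correlations, the index approaches 1, which indicates that the variables share enough common variance for dimension reduction.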
Identify the optimum number of PCs (for this project, take at least 90%
explained variance). Show Scree plot.
Obtaining the eigenvectors when the number of Principal Components is kept equal to the
number of features in the scaled data
We can see above that more than 84% of the variance is explained by 5 Principal Components.
Around 93% of the variance is explained by 9 Principal Components.
Around 97% of the variance is explained by 14 Principal Components.
The number of components can be decided based upon the cumulative variance.
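Choosing the number of components from the cumulative explained variance can be sketched as follows. The data here are synthetic (three latent factors mixed into ten columns), and the 90% threshold is the one named in the rubric. The sketch also checks that the eigenvalues reported by sklearn's `PCA` match those of the covariance matrix of the scaled data.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the census variables (assumption: 3 latent factors).
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 3))
mixing = rng.normal(size=(3, 10))
X = latent @ mixing + 0.3 * rng.normal(size=(200, 10))
X_scaled = StandardScaler().fit_transform(X)

# Fit PCA with as many components as features.
pca_full = PCA(n_components=X_scaled.shape[1]).fit(X_scaled)
eigenvalues = pca_full.explained_variance_   # variance along each component
eigenvectors = pca_full.components_          # one row per component

# Same eigenvalues from the sample covariance matrix, sorted descending.
cov = np.cov(X_scaled, rowvar=False)
cov_eigenvalues = np.sort(np.linalg.eigvalsh(cov))[::-1]

# Smallest number of components whose cumulative variance reaches 90%.
cumulative = np.cumsum(pca_full.explained_variance_ratio_)
n_components = int(np.argmax(cumulative >= 0.90) + 1)
print(cumulative.round(3))
print("components needed for 90%:", n_components)
```

Plotting `eigenvalues` against the component number gives the scree plot, and `cumulative` is the series read off in the statements above.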
Compare PCs with Actual Columns and identify which is explaining most
variance. Write inferences about all the Principal components in terms of actual
variables.
Applying PCA with the chosen number of components to get the loadings and the component
output.
Now we can create a dataframe of the component loadings against each field and identify the
pattern
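A loadings table of this kind can be built roughly as below. The data are random and purely illustrative; the column names are borrowed from the shortlisted census variables, and "loadings" here means the entries of the eigenvectors in `pca.components_` (some texts additionally scale them by the square root of the eigenvalue).

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Random illustrative data using the shortlisted column names.
rng = np.random.default_rng(1)
features = ["No_HH", "TOT_M", "TOT_F", "M_LIT", "F_LIT"]
raw = pd.DataFrame(
    rng.normal(size=(100, 5)) @ rng.normal(size=(5, 5)),
    columns=features,
)

scaled = StandardScaler().fit_transform(raw)
pca = PCA(n_components=2).fit(scaled)

# Rows are original variables, columns are principal components.
loadings = pd.DataFrame(pca.components_.T, index=features, columns=["PC1", "PC2"])

# Variable with the largest absolute weight on the first component.
top_feature_pc1 = loadings["PC1"].abs().idxmax()
print(loadings.round(3))
print("largest weight on PC1:", top_feature_pc1)
```

Reading down each column of `loadings` shows which original variables drive that component, which is the basis for writing inferences about the components in terms of actual variables.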
Concatenating the PCA data with the categorical variables, we see the revised shape, head
and description for the variables