
GRADED PROJECT - DATA MINING
DSBA-AS Module

Faizan Ali Sayyed


Table of Contents
Home
Index
List of Figures
List of Tables
Clustering
    Problem statement
    Data introduction information
    Solution Clustering Clean Ads Data problem
    Summary
PCA
    Problem statement
    Data introduction information
    Solution PCA India Census Data problem
    Summary

Great Learning- DSBA- AS Project- Faizan Ali Sayyed



List of Figures
Figure 1 Clean Ads Data Shape
Figure 2 Clean Ads Data info
Figure 3 Clean Ads Data head
Figure 4 Clean Ads Data tail
Figure 5 Clean Ads Data describe
Figure 6 Clean Ads Data null values
Figure 7 Clean Ads Data duplicated rows
Figure 8 Revised Clean Ads Data info
Figure 9 Box Plot: Clean Ads Data
Figure 10 Clean Ads Data Old describe 2
Figure 11 Old Shape
Figure 12 Clustering Clean Data outlier treatment
Figure 13 Clean up data info after outlier treatment
Figure 14 Clean Ads Data description before scaling
Figure 15 Clean Ads Data description after scaling
Figure 16 Dendrogram Ward-Euclidean
Figure 17 Dendrogram Ward-Euclidean (p=10)
Figure 18 WSS values Clean Ads Data
Figure 19 Elbow Plot Clean Ads Data
Figure 20 Clean Ads Data Silhouette analysis
Figure 21 Assigning clusters to the rows
Figure 22 Cluster Distribution Clean Ads Data
Figure 23 Clean Ads Data Cluster summary
Figure 24 Dendrogram hard to see labels for the 8 clusters
Figure 25 Cutting the Dendrogram with suitable clusters
Figure 26 India Census Data shape
Figure 27 India Census Data head
Figure 28 India Census Data tail
Figure 29 India Census Data info
Figure 30 Statistical summary India Census Data
Figure 31 India Census data null values
Figure 32 India Census Data duplicate rows
Figure 33 India census Sample data with shortlisted variables
Figure 34 No of areas in each state
Figure 35 Bivariate analysis numeric data India Census
Figure 36 Boxplot for all variables
Figure 37 Scaled data head for India census Data
Figure 38 Before vs after scaling Data info India Census
Figure 39 Boxplot scaled variables India Census Data
Figure 40 p-value for Factor_analyzer
Figure 41 KMO Index
Figure 42 PCA decomposition
Figure 43 Eigen vector 1
Figure 44 Variable component
Figure 45 Cumulative variance
Figure 46 Scree plot 1
Figure 47 Shape of revised PCA
Figure 48 Principal Components 2
Figure 49 Explained variance of component
Figure 50 PCA Data frame head
Figure 51 Rectangular magnitude for columns in India census data variables
Figure 52 New Data India Census PCA shape
Figure 53 Summary PCA India Census Data
Figure 54 PCA Boxplot India Census

List of Tables

Table 1 CleanAds Data Dictionary
Table 2 CTR CPM CPC Define
Table 3 Clean Ads Data Dictionary for outliers to be treated
Table 4 Cluster Summary table
Table 5 India Census Data Dictionary
Table 6 Shortlisted columns Data Dictionary

List of Equations

Equation 1 Linear equation for first Principal Component


Clustering - Digital Ads Data:

ads24x7 is a digital marketing company that has now received seed funding of
$10 million. It is expanding into marketing analytics. The company collected
data from its Marketing Intelligence team and now wants you (its newly
appointed data analyst) to segment the types of ads based on the features provided.
Use a clustering procedure to segment the ads into homogeneous groups.

The following three features are commonly used in digital marketing:

CPM = (Total Campaign Spend / Number of Impressions) * 1,000. Note that the
Total Campaign Spend refers to the 'Spend' Column in the dataset and the
Number of Impressions refers to the 'Impressions' Column in the dataset.

CPC = Total Cost (spend) / Number of Clicks. Note that the Total Cost (spend)
refers to the 'Spend' Column in the dataset and the Number of Clicks refers to the
'Clicks' Column in the dataset.

CTR = Total Measured Clicks / Total Measured Ad Impressions x 100. Note that
the Total Measured Clicks refers to the 'Clicks' Column in the dataset and the
Total Measured Ad Impressions refers to the 'Impressions' Column in the dataset.

The Data Dictionary and the detailed description of the formulas for CPM, CPC
and CTR are given in sheet 2 of the Excel file Clustering Clean ads_data.


Perform the following in the given order:

• Read the data and perform basic analysis such as printing a few rows
  (head and tail), info, data summary, null values, duplicate values, etc.
• Treat missing values in CPC, CTR and CPM using the formula given. You
  may refer to the Bank_KMeans Solution File to understand the coding
  behind treating the missing values using a specific formula. You basically
  have to create a user-defined function and then call the function for
  imputing.
• Check if there are any outliers.
• Do you think treating outliers is necessary for K-Means clustering? Based
  on your judgement decide whether to treat outliers and if yes, which
  method to employ. (As an analyst your judgement may be different from
  another analyst).
• Perform z-score scaling and discuss how it affects the speed of the
  algorithm.
• Perform clustering and do the following:
  o Perform Hierarchical clustering by constructing a Dendrogram using WARD
    and Euclidean distance.
  o Make an Elbow plot (up to n=10) and identify the optimum number of
    clusters for the k-means algorithm.
  o Print silhouette scores for up to 10 clusters and identify the optimum
    number of clusters.
  o Profile the ads based on the optimum number of clusters using
    silhouette score and your domain understanding
    [Hint: Group the data by clusters and take sum or mean to identify
    trends in clicks, spend, revenue, CPM, CTR, & CPC based on
    Device Type. Make bar plots.]
• Conclude the project by providing a summary of your learnings.


Data Dictionary CleanAds Data:

Sl. No | Column Name | Column Description
1 | Timestamp | The Timestamp of the particular Advertisement.
2 | InventoryType | The Inventory Type of the particular Advertisement. Format 1 to 7. This is a Categorical Variable.
3 | Ad - Length | The Length Dimension of the particular Advertisement.
4 | Ad- Width | The Width Dimension of the particular Advertisement.
5 | Ad Size | The Overall Size of the particular Advertisement. Length*Width.
6 | Ad Type | The type of the particular Advertisement. This is a Categorical Variable.
7 | Platform | The platform in which the particular Advertisement is displayed. Web, Video or App. This is a Categorical Variable.
8 | Device Type | The type of the device which supports the particular Advertisement. This is a Categorical Variable.
9 | Format | The Format in which the Advertisement is displayed. This is a Categorical Variable.
10 | Available_Impressions | How often the particular Advertisement is shown. An impression is counted each time an Advertisement is shown on a search result page or other site on a Network.
11 | Matched_Queries | Matched search queries data is pulled from Advertising Platform and consists of the exact searches typed into the search Engine that generated clicks for the particular Advertisement.
12 | Impressions | The impression count of the particular Advertisement out of the total available impressions.
13 | Clicks | It is a marketing metric that counts the number of times users have clicked on the particular advertisement to reach an online property.
14 | Spend | It is the amount of money spent on specific ad variations within a specific campaign or ad set. This metric helps regulate ad performance.
15 | Fee | The percentage of the Advertising Fees payable by Franchise Entities.
16 | Revenue | It is the income that has been earned from the particular advertisement.
17 | CTR | CTR stands for "Click through rate". CTR is the number of clicks that your ad receives divided by the number of times your ad is shown. Formula used here is CTR = Total Measured Clicks / Total Measured Ad Impressions x 100. Note that the Total Measured Clicks refers to the 'Clicks' Column and the Total Measured Ad Impressions refers to the 'Impressions' Column.
18 | CPM | CPM stands for "cost per 1000 impressions." Formula used here is CPM = (Total Campaign Spend / Number of Impressions) * 1,000. Note that the Total Campaign Spend refers to the 'Spend' Column and the Number of Impressions refers to the 'Impressions' Column.
19 | CPC | CPC stands for "Cost-per-click". Cost-per-click (CPC) bidding means that you pay for each click on your ads. The Formula used here is CPC = Total Cost (spend) / Number of Clicks. Note that the Total Cost (spend) refers to the 'Spend' Column and the Number of Clicks refers to the 'Clicks' Column.
Table 1 CleanAds Data Dictionary


Clustering: Read the data and perform basic analysis such as printing a few
rows (head and tail), info, data summary, null values duplicate values, etc.

Data Shape

Figure 1 Clean Ads Data Shape

Data info

Figure 2 Clean Ads Data info

Data head and tail

Figure 3 Clean Ads Data head


Figure 4 Clean Ads Data tail

Data Describe statistically

Figure 5 Clean Ads Data describe

Data null values

Figure 6 Clean Ads Data null values

There are 4736 null values in each of the CTR, CPM, and CPC columns.


Duplicated rows

Figure 7 Clean Ads Data duplicated rows

Data has no duplicate rows

Summary:

The dataset has 23066 rows and 19 columns: 6 float columns, 7 integer columns,
and 6 categorical columns. There are no duplicate rows, but the calculated
columns CTR, CPC and CPM have missing values (4736 rows).
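
The checks summarised above map onto a handful of pandas calls. The following is a minimal sketch on a tiny made-up frame. The toy values and the name `ads` are assumptions for illustration, not the project data or the author's notebook code.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the ads data (hypothetical values, not the real file).
ads = pd.DataFrame({
    "Impressions": [1000, 2000, 1500, 1200],
    "Clicks": [10, 40, 15, 6],
    "Spend": [5.0, 12.0, 6.0, 3.0],
    "CTR": [1.0, np.nan, 1.0, np.nan],
})

n_rows, n_cols = ads.shape                   # (rows, columns)
print(ads.head(2))                           # first rows
print(ads.tail(2))                           # last rows
ads.info()                                   # dtypes and non-null counts
print(ads.describe())                        # statistical summary

null_counts = ads.isnull().sum()             # missing values per column
n_duplicates = int(ads.duplicated().sum())   # fully duplicated rows
print(null_counts)
print("duplicate rows:", n_duplicates)
```

On the real data, the same calls would be run on the frame loaded from the Excel file.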


Treat missing values in CPC, CTR and CPM using the formula given.

Using the formulas in the data dictionary, we create functions to compute CPC, CTR and CPM:

CTR | CTR stands for "Click through rate". CTR is the number of clicks that your ad receives divided by the number of times your ad is shown. CTR = Total Measured Clicks / Total Measured Ad Impressions x 100. Note that the Total Measured Clicks refers to the 'Clicks' Column and the Total Measured Ad Impressions refers to the 'Impressions' Column.
CPM | CPM stands for "cost per 1000 impressions." CPM = (Total Campaign Spend / Number of Impressions) * 1,000. Note that the Total Campaign Spend refers to the 'Spend' Column and the Number of Impressions refers to the 'Impressions' Column.
CPC | CPC stands for "Cost-per-click". Cost-per-click (CPC) bidding means that you pay for each click on your ads. CPC = Total Cost (spend) / Number of Clicks. Note that the Total Cost (spend) refers to the 'Spend' Column and the Number of Clicks refers to the 'Clicks' Column.
Table 2 CTR CPM CPC Define

Revised Data info

Figure 8 Revised Clean Ads Data info

After applying the formulas for CTR, CPM and CPC to the rows with missing values, there are no null values left.
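
A formula-based imputation of this kind might look like the sketch below. The helper name `impute_with_formula` and the toy values are hypothetical, and the author's actual functions are not shown in the report. Only the three formulas come from the data dictionary.

```python
import numpy as np
import pandas as pd

def ctr(df):
    # CTR = Clicks / Impressions x 100
    return df["Clicks"] / df["Impressions"] * 100

def cpm(df):
    # CPM = (Spend / Impressions) * 1,000
    return df["Spend"] / df["Impressions"] * 1000

def cpc(df):
    # CPC = Spend / Clicks
    return df["Spend"] / df["Clicks"]

def impute_with_formula(df, column, formula):
    """Fill only the missing entries of `column` with the formula result."""
    out = df.copy()
    out[column] = out[column].fillna(formula(out))
    return out

# Toy rows (assumed values): the last two rows miss all three metrics.
toy = pd.DataFrame({
    "Impressions": [1000, 2000, 500],
    "Clicks": [10, 40, 5],
    "Spend": [5.0, 12.0, 2.0],
    "CTR": [1.0, np.nan, np.nan],
    "CPM": [5.0, np.nan, np.nan],
    "CPC": [0.5, np.nan, np.nan],
})

for column, formula in (("CTR", ctr), ("CPM", cpm), ("CPC", cpc)):
    toy = impute_with_formula(toy, column, formula)

remaining_nulls = int(toy[["CTR", "CPM", "CPC"]].isnull().sum().sum())
print(toy)
print("remaining nulls:", remaining_nulls)
```

Values that were already present are left untouched. Rows with zero clicks or zero impressions would produce infinite or undefined ratios and would need separate handling.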


Check if there are any outliers.


Checking for outliers using Boxplot chart

Figure 9 Box Plot: Clean Ads Data

Based on the above figure, we can see there are outliers for CPC, CPM, CTR, Revenue, Fee,
Spend, Clicks, impressions, Matched Queries, Available Impressions, and Ad size.

The only quantitative columns without outliers are Ad width and Ad length.
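
The points a box plot draws beyond its whiskers are conventionally those more than 1.5 x IQR outside the quartiles, so the visual check can be backed by a count per column. This is a small sketch with assumed toy values and a hypothetical helper name, not the project's code.

```python
import pandas as pd

def iqr_outlier_mask(series, k=1.5):
    """True where a value lies outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

# Toy data (assumed): one extreme Clicks value, no extreme widths.
toy = pd.DataFrame({
    "Clicks": [10, 12, 11, 13, 12, 200],
    "Ad_Width": [300, 300, 250, 250, 300, 300],
})

outlier_counts = {col: int(iqr_outlier_mask(toy[col]).sum()) for col in toy.columns}
print(outlier_counts)
```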


Do you think treating outliers is necessary for K-Means clustering? Based on
your judgement decide whether to treat outliers and if yes, which method to
employ

Yes. Treating outliers is necessary for K-Means clustering: centroids are means
and distances are squared, so a few extreme values can pull a centroid away from
the bulk of its cluster. We treat outliers using the IQR method.

We do not treat CPM, CTR, CPC and Ad Size, as they are calculated fields that
depend on other columns. Fee is a percentage and can vary up to 100%, so it is
also left as is. We treat the remaining columns, listed below.

Sl. No | Column Name | Column Description
10 | Available_Impressions | How often the particular Advertisement is shown. An impression is counted each time an Advertisement is shown on a search result page or other site on a Network.
11 | Matched_Queries | Matched search queries data is pulled from Advertising Platform and consists of the exact searches typed into the search Engine that generated clicks for the particular Advertisement.
12 | Impressions | The impression count of the particular Advertisement out of the total available impressions.
13 | Clicks | It is a marketing metric that counts the number of times users have clicked on the particular advertisement to reach an online property.
14 | Spend | It is the amount of money spent on specific ad variations within a specific campaign or ad set. This metric helps regulate ad performance.
16 | Revenue | It is the income that has been earned from the particular advertisement.
Table 3 Clean Ads Data Dictionary for outliers to be treated

Figure 10 Clean Ads Data Old describe 2

Figure 11 Old Shape


The step-by-step outlier treatment:

Note: We could have replaced the outliers with the median or mean, but that would
pile many rows onto a single value and create artificially dense groups. To keep
the clusters clear, we remove the outlier rows instead.
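
Removing outlier rows with the IQR rule could be sketched as follows. This is an illustration under assumptions: the helper name and toy values are made up, and the fences are computed once on the original data for all columns, which may differ from the step-by-step procedure shown in the figure.

```python
import pandas as pd

def remove_iqr_outliers(df, columns, k=1.5):
    """Drop rows where any of `columns` falls outside its IQR fences."""
    keep = pd.Series(True, index=df.index)
    for col in columns:
        q1, q3 = df[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        # between() is inclusive, so values exactly on a fence are kept.
        keep &= df[col].between(q1 - k * iqr, q3 + k * iqr)
    return df[keep]

# Toy data (assumed): one extreme Spend value.
toy = pd.DataFrame({
    "Spend": [5.0, 6.0, 5.5, 6.5, 6.0, 90.0],
    "Revenue": [8.0, 9.0, 8.5, 9.5, 9.0, 10.0],
})

cleaned = remove_iqr_outliers(toy, ["Spend", "Revenue"])
print(len(toy), "->", len(cleaned), "rows")
```

Dropping rows shrinks the sample, which is why the shape before and after treatment is worth recording.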

Figure 12 Clustering Clean Data outlier treatment


Figure 13 Clean up data info after outlier treatment

Note: The categorical columns were dropped from the dataset.


Perform z-score scaling and discuss how it affects the speed of the algorithm.
Data description before and after scaling
Using Z score.

Figure 14 Clean Ads Data description before scaling

Figure 15 Clean Ads Data description after scaling

Z-score scaling puts every column on a comparable scale, so columns with large
raw values (such as Impressions) no longer dominate the Euclidean distances.
Distance-based iterative algorithms such as K-Means then tend to converge in
fewer iterations. The scaling step itself adds a small one-time cost to
transform the data.

Z-scores also make the columns easier to interpret and compare, as each scaled
column has a mean of 0 and a standard deviation of 1.
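
As a worked illustration, the z-score of a value is (value - column mean) / column standard deviation. The sketch below applies this with plain pandas on assumed toy values. It uses the population standard deviation (`ddof=0`), which is also what scikit-learn's `StandardScaler` uses. The report does not show which call the author used.

```python
import pandas as pd

# Toy numeric columns on very different scales (assumed values).
toy = pd.DataFrame({
    "Impressions": [1000.0, 2000.0, 3000.0],
    "Clicks": [10.0, 20.0, 60.0],
})

# z = (x - mean) / std, computed column by column.
scaled = (toy - toy.mean()) / toy.std(ddof=0)

print(scaled)
print(scaled.mean().round(6))        # approximately 0 for every column
print(scaled.std(ddof=0).round(6))   # 1 for every column
```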


Perform clustering and do the following:

• Perform Hierarchical clustering by constructing a Dendrogram using WARD and
  Euclidean distance.
• Make an Elbow plot (up to n=10) and identify the optimum number of clusters
  for the k-means algorithm.
• Print silhouette scores for up to 10 clusters and identify the optimum number
  of clusters.
• Profile the ads based on the optimum number of clusters using silhouette score
  and your domain understanding

Figure 16 Dendrogram Ward-Euclidean

Using p=10, we show only the last 10 merged clusters.

Figure 17 Dendrogram Ward-Euclidean (p=10)
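
A Ward linkage with Euclidean distance, and the truncated view with p=10, can be obtained with SciPy along these lines. This is a sketch on random stand-in data, not the author's code, and `no_plot=True` is used so that only the tree structure is computed. Dropping that argument draws the figure with matplotlib.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage

# Random stand-in for the scaled numeric columns (assumed data).
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 3))

# Ward linkage on Euclidean distances: one merge per row of Z.
Z = linkage(X, method="ward", metric="euclidean")

# truncate_mode="lastp" keeps only the last p merged clusters in view.
tree = dendrogram(Z, truncate_mode="lastp", p=10, no_plot=True)

print(Z.shape)       # (n_samples - 1, 4)
print(tree["ivl"])   # labels of the condensed leaves
```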


To check the optimal number of clusters, we use WSS (within-cluster sum of squares).

WSS scores

Figure 18 WSS values Clean Ads Data

Figure 19 Elbow Plot Clean Ads Data

Based on the above figures, WSS drops sharply as we move from k=1 to k=2, and it
continues to drop noticeably from k=2 up to k=8.

From k=8 to k=9 the drop becomes much smaller, so WSS does not fall
significantly beyond 8, which makes 8 a reasonable elbow. Depending on the
business problem, it is advisable to pick the k with the highest or
second-highest drop. For now we can go with either 6 or 8.
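
The WSS values behind an elbow plot can be collected from the `inertia_` attribute of scikit-learn's `KMeans`. The following is a minimal sketch on synthetic blobs. The data, `n_init` and `random_state` are assumptions, not the settings used in the report.

```python
import numpy as np
from sklearn.cluster import KMeans

# Three well-separated synthetic blobs (assumed data).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(40, 2)) for c in (0, 5, 10)])

wss = []
for k in range(1, 11):
    model = KMeans(n_clusters=k, n_init=10, random_state=1).fit(X)
    wss.append(model.inertia_)   # within-cluster sum of squares for this k

# Drop in WSS when going from k to k+1; the elbow is where drops flatten out.
drops = [wss[i] - wss[i + 1] for i in range(len(wss) - 1)]
for k, value in enumerate(wss, start=1):
    print(k, round(value, 2))
```

Plotting `wss` against k gives the elbow plot. On this toy data the curve flattens after k=3.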


Figure 20 Clean Ads Data Silhouette analysis

The Silhouette Score is a measure of how similar an object is to its own cluster compared to
other clusters. It ranges from -1 to 1, with higher values indicating better clustering.

Based on the above figure, the highest silhouette score is for 8 clusters, which
agrees with the elbow plot; hence we set the number of clusters to 8.
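
Silhouette scores for a range of k can be printed with scikit-learn's `silhouette_score`. This sketch again uses synthetic blobs and assumed settings, so the best k here (3) says nothing about the ads data.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Three well-separated synthetic blobs (assumed data).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(40, 2)) for c in (0, 5, 10)])

scores = {}
for k in range(2, 11):   # the silhouette score needs at least 2 clusters
    labels = KMeans(n_clusters=k, n_init=10, random_state=1).fit_predict(X)
    scores[k] = silhouette_score(X, labels)
    print(k, round(scores[k], 3))

best_k = max(scores, key=scores.get)
print("best k by silhouette:", best_k)
```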


Conclude the project by providing summary of your learnings.

Figure 21 Assigning clusters to the rows

Figure 22 Cluster Distribution Clean Ads Data

Figure 23 Clean Ads Data Cluster summary
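
Assigning cluster labels to rows and summarising each cluster, as in Figures 21 to 23, could be done along these lines. The toy frame, the two-cluster setting and the column choice are assumptions for illustration only.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Toy ads with two obvious groups (assumed values).
toy = pd.DataFrame({
    "Clicks": [1, 2, 1, 50, 55, 52],
    "Spend": [0.5, 0.6, 0.4, 20.0, 22.0, 21.0],
})

# Cluster on z-scored features, then attach the labels to the original rows.
scaled = (toy - toy.mean()) / toy.std(ddof=0)
toy["Cluster"] = KMeans(n_clusters=2, n_init=10, random_state=1).fit_predict(scaled)

sizes = toy["Cluster"].value_counts()      # rows per cluster
profile = toy.groupby("Cluster").mean()    # mean of each feature per cluster
print(sizes)
print(profile)
```

Keeping the profile in original units rather than z-scores makes the cluster means easier to read against the business metrics.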


Figure 24 Dendrogram hard to see labels for the 8 clusters

Figure 25 Cutting the Dendrogram with suitable clusters
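
Cutting a dendrogram into a chosen number of flat clusters is commonly done with SciPy's `fcluster`. The sketch below uses synthetic data and three clusters as assumptions. The report does not show the call that produced Figure 25.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Three well-separated synthetic blobs of 20 points each (assumed data).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(20, 2)) for c in (0, 5, 10)])

Z = linkage(X, method="ward")

# criterion="maxclust" cuts the tree so that at most t flat clusters remain.
labels = fcluster(Z, t=3, criterion="maxclust")

print(np.unique(labels, return_counts=True))   # labels start at 1
```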


Summary:
• An optimum of 8 clusters can be created for the sample obtained after removing
  outliers and treating missing values in the dataset.
• Clicks and Impressions on an ad are both positively related to Revenue.
• Spend on ads varies with the specific campaign and can impact revenue.
• For ads24x7, a digital marketing company, the sample data has helped identify
  8 separate segments to target with its campaigns.

Cluster/ Segment | Users | Impressions | Comments
Cluster 0 | 5816 | Largest segment; Lowest clicks; Low CTR | Hard to target
Cluster 1 | 3349 | Very high clicks; Very high fees; High revenue; Highest search matches; High CTR, very low CPC, CPM | Top target group, high engagement/returns
Cluster 2 | 1390 | Very high Impressions; Low CTR | Underutilized group; Don't need to spend too much as effective; CTR is quite low
Cluster 3 | 1743 | Very high impressions; High CTR, Low CPM; High Fees; High spend | Favorable normal segment
Cluster 4 | 892 | Moderate size; High CTR; Moderate CPM, CPC; Moderate spend; High fees | Favorable moderate segment
Cluster 5 | 115 | Very high clicks; Very high fees; Low revenue; Highest search matches; Highest CTR, Low CPM, CPC | Highly engaged target group
Cluster 6 | 2463 | Positive impressions and clicks; Low CTR | Low target group; Relatively costly to generate clicks
Cluster 7 | 102 | Smallest segment; Lowest clicks; Worst click rate | Most costly group to generate a click or an impression
Table 4 Cluster Summary table


PCA India Census Data:

PCA FH (FT): Primary census abstract for female headed households excluding
institutional households (India & States/UTs - District Level), Scheduled tribes -
2011 PCA for Female Headed Household Excluding Institutional Household. The
Indian Census has the reputation of being one of the best in the world. The first
Census in India was conducted in the year 1872. This was conducted at different
points of time in different parts of the country. In 1881 a Census was taken for the
entire country simultaneously. Since then, Census has been conducted every ten
years, without a break. Thus, the Census of India 2011 was the fifteenth in this
unbroken series since 1872, the seventh after independence and the second
census of the third millennium and twenty-first century. The census has
continued uninterrupted despite several adversities like wars, epidemics,
natural calamities, political unrest, etc. The Census of India is conducted under
the provisions of the Census Act 1948 and the Census Rules, 1990. The Primary
Census Abstract, which is an important publication of the 2011 Census, gives basic
information on Area, Total Number of Households, Total Population, Scheduled
Castes, Scheduled Tribes Population, Population in the age group 0-6, Literates,
Main Workers and Marginal Workers classified by the four broad industrial
categories, namely, (i) Cultivators, (ii) Agricultural Laborers, (iii) Household
Industry Workers, and (iv) Other Workers and also Non-Workers. The
characteristics of the Total Population include Scheduled Castes, Scheduled
Tribes, Institutional and Houseless Population and are presented by sex and
rural-urban residence. Census 2011 covered 35 States/Union Territories, 640
districts, 5,924 sub-districts, 7,935 Towns and 6,40,867 Villages.
The data collected has so many variables that it is difficult to find useful
details without using Data Science techniques. You are tasked to perform
detailed EDA and identify the optimum Principal Components that explain the most
variance in the data. Use Sklearn only.

Note: The 24 variables given in the Rubric are just for performing EDA. You will
have to consider the entire dataset, including all the variables, for performing PCA.


Read the data and perform basic checks like checking head, info, summary, nulls,
and duplicates, etc.
Reading the excel data for India Census

Data shape

Figure 26 India Census Data shape

Data Head

Figure 27 India Census Data head

Data tail

Figure 28 India Census Data tail


Data info

Figure 29 India Census Data info


Data summary:

Figure 30 Statistical summary India Census Data

Data null values check:

Figure 31 India Census data null values

No null values present in the data.

Data duplicate rows check:

Figure 32 India Census Data duplicate rows


Data dictionary

Name Description
State State Code
District District Code
Name Name
TRU1 Area Name
No_HH No of Household
TOT_M Total population Male
TOT_F Total population Female
M_06 Population in the age group 0-6 Male
F_06 Population in the age group 0-6 Female
M_SC Scheduled Castes population Male
F_SC Scheduled Castes population Female
M_ST Scheduled Tribes population Male
F_ST Scheduled Tribes population Female
M_LIT Literates population Male
F_LIT Literates population Female
M_ILL Illiterate Male
F_ILL Illiterate Female
TOT_WORK_M Total Worker Population Male
TOT_WORK_F Total Worker Population Female
MAINWORK_M Main Working Population Male
MAINWORK_F Main Working Population Female
MAIN_CL_M Main Cultivator Population Male
MAIN_CL_F Main Cultivator Population Female
MAIN_AL_M Main Agricultural Labourers Population Male
MAIN_AL_F Main Agricultural Labourers Population Female
MAIN_HH_M Main Household Industries Population Male
MAIN_HH_F Main Household Industries Population Female
MAIN_OT_M Main Other Workers Population Male
MAIN_OT_F Main Other Workers Population Female
MARGWORK_M Marginal Worker Population Male
MARGWORK_F Marginal Worker Population Female
MARG_CL_M Marginal Cultivator Population Male
MARG_CL_F Marginal Cultivator Population Female
MARG_AL_M Marginal Agriculture Labourers Population Male
MARG_AL_F Marginal Agriculture Labourers Population Female
MARG_HH_M Marginal Household Industries Population Male
MARG_HH_F Marginal Household Industries Population Female
MARG_OT_M Marginal Other Workers Population Male
MARG_OT_F Marginal Other Workers Population Female
MARGWORK_3_6_M Marginal Worker Population 3-6 Male
MARGWORK_3_6_F Marginal Worker Population 3-6 Female

MARG_CL_3_6_M Marginal Cultivator Population 3-6 Male
MARG_CL_3_6_F Marginal Cultivator Population 3-6 Female
MARG_AL_3_6_M Marginal Agriculture Labourers Population 3-6 Male
MARG_AL_3_6_F Marginal Agriculture Labourers Population 3-6 Female
MARG_HH_3_6_M Marginal Household Industries Population 3-6 Male
MARG_HH_3_6_F Marginal Household Industries Population 3-6 Female
MARG_OT_3_6_M Marginal Other Workers Population Person 3-6 Male
MARG_OT_3_6_F Marginal Other Workers Population Person 3-6 Female
MARGWORK_0_3_M Marginal Worker Population 0-3 Male
MARGWORK_0_3_F Marginal Worker Population 0-3 Female
MARG_CL_0_3_M Marginal Cultivator Population 0-3 Male
MARG_CL_0_3_F Marginal Cultivator Population 0-3 Female
MARG_AL_0_3_M Marginal Agriculture Labourers Population 0-3 Male
MARG_AL_0_3_F Marginal Agriculture Labourers Population 0-3 Female
MARG_HH_0_3_M Marginal Household Industries Population 0-3 Male
MARG_HH_0_3_F Marginal Household Industries Population 0-3 Female
MARG_OT_0_3_M Marginal Other Workers Population 0-3 Male
MARG_OT_0_3_F Marginal Other Workers Population 0-3 Female
NON_WORK_M Non Working Population Male
NON_WORK_F Non Working Population Female
Table 5 India Census Data Dictionary

Summary:

There are 640 rows and 61 columns, of which 59 columns are integer type and 2 are
categorical variables. The data set has no null values and no duplicate rows.
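
The checks behind this summary can be sketched as below. This is a minimal illustration on a toy data frame, not the census file itself; the column "Area Name" and all values are assumptions made for the example.

```python
import pandas as pd

# Toy stand-in for the census data (hypothetical values and "Area Name" column).
df = pd.DataFrame({
    "State": ["A", "A", "B", "B"],
    "Area Name": ["d1", "d2", "d3", "d4"],
    "No_HH": [100, 250, 80, 120],
    "TOT_M": [300, 700, 210, 400],
    "TOT_F": [290, 650, 230, 390],
})

n_rows, n_cols = df.shape
# Numeric columns versus everything else (treated here as categorical).
n_numeric = df.select_dtypes(include="number").shape[1]
n_categorical = n_cols - n_numeric
# Total count of missing cells and of fully duplicated rows.
n_nulls = int(df.isnull().sum().sum())
n_duplicates = int(df.duplicated().sum())

print(n_rows, n_cols, n_numeric, n_categorical, n_nulls, n_duplicates)
```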


Perform detailed Exploratory analysis by creating certain questions like (i) Which
state has highest gender ratio and which has the lowest? (ii) Which district has
the highest & lowest gender ratio? Pick 5 variables out of the given 24 variables
below for EDA.
To analyse the data, we keep the following columns in the data set: State code as the
categorical variable, and No of households, Total Male and Total Female population, and
Male and Female literate population as the numeric variables.
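
One way to answer the gender ratio questions is sketched below. It assumes the gender ratio is defined as females per male, and it uses toy values with a hypothetical "Area Name" column for the district; the real figures come from the census data.

```python
import pandas as pd

# Toy rows (hypothetical values); one row per district.
df = pd.DataFrame({
    "State": ["A", "A", "B", "B", "C"],
    "Area Name": ["d1", "d2", "d3", "d4", "d5"],
    "TOT_M": [300, 700, 210, 400, 500],
    "TOT_F": [290, 650, 230, 390, 450],
})

# State level: sum the populations first, then take the ratio.
state_totals = df.groupby("State")[["TOT_M", "TOT_F"]].sum()
state_totals["gender_ratio"] = state_totals["TOT_F"] / state_totals["TOT_M"]
highest_state = state_totals["gender_ratio"].idxmax()
lowest_state = state_totals["gender_ratio"].idxmin()

# District level: the ratio is computed row by row.
district_ratio = (df["TOT_F"] / df["TOT_M"]).set_axis(df["Area Name"])
highest_district = district_ratio.idxmax()
lowest_district = district_ratio.idxmin()

print(highest_state, lowest_state, highest_district, lowest_district)
```

Summing before dividing matters at state level: averaging the district ratios would weight a small district as heavily as a large one.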

Figure 33 India census Sample data with shortlisted variables

Sr. no. Name Description
1 State State Code
2 No_HH No of Household
3 TOT_M Total population Male
4 TOT_F Total population Female
5 M_LIT Literates population Male
6 F_LIT Literates population Female
Table 6 Shortlisted columns Data Dictionary

Using the above data, let's analyse it to obtain insights.


Figure 34 No of areas in each state

Figure 35 Bivariate analysis numeric data India Census


Figure 36 Boxplot for all variables


We choose not to treat outliers for this case. Do you think that treating outliers for
this case is necessary?
No, in my view treating outliers is not necessary in this case. The data presented is
primary source data, collected house to house, and should be accurate to a large extent.

Each count is built from binary classifications: a person either belongs or does not belong to
the literate category, to the working category, and similarly to each age group bucket. Errors
can come only from human mistakes in recording the data, and the chances of that are very low.

These outliers are valid outliers and should not be treated. Eliminating them would remove a
necessary representation of a state from the census data and distort the analysis.


Scale the Data using z-score method. Does scaling have any impact on outliers?
Compare boxplots before and after scaling and comment.

To scale the given data set, let's consider all the numeric variables.

Applying z score
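
A minimal sketch of z-score scaling on toy values (not the census data) is given below. It also counts outliers with the usual 1.5 x IQR boxplot rule before and after scaling; the helper name `iqr_outlier_count` is hypothetical. Because the z-score is a linear rescaling of each column, the same points fall outside the whiskers in both cases.

```python
import pandas as pd

# Toy numeric columns with one clearly large value each (hypothetical values).
df_num = pd.DataFrame({
    "No_HH": [100, 120, 110, 130, 900],
    "TOT_M": [300, 320, 310, 350, 2500],
})

def iqr_outlier_count(s):
    """Count points outside the 1.5 * IQR whiskers of a boxplot."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return int(((s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)).sum())

# z-score: subtract the column mean, divide by the population standard deviation.
scaled = (df_num - df_num.mean()) / df_num.std(ddof=0)

before = {c: iqr_outlier_count(df_num[c]) for c in df_num.columns}
after = {c: iqr_outlier_count(scaled[c]) for c in scaled.columns}
print(before, after)
```

Scaling changes the units on the boxplot axis and makes the variables comparable, but it does not remove or create outliers.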

Figure 37 Scaled data head for India census Data


Figure 38 Before vs after scaling Data info India Census


Figure 39 Boxplot scaled variables India Census Data


Perform all the required steps for PCA (use sklearn only). Create the covariance
matrix. Get eigen values and eigen vectors.
The steps for PCA (Principal Component Analysis) are listed below.

1. Outlier treatment (not applied here, as the outliers are valid)
2. Scale the data
3. Create the covariance matrix
4. Extract the eigenvectors
5. Find the eigenvalues
6. Create a scree plot of the explained variance
7. Find a cut-off for selecting the number of PCs
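
Steps 2 to 5 can be sketched as follows. The example uses randomly generated data rather than the census data, and it computes the covariance matrix and its eigen decomposition with NumPy only to show that they agree with what sklearn's `PCA` reports; it is an illustration, not the code used for the figures in this report.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data with some correlation between columns (assumption for the demo).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
X[:, 1] += 0.8 * X[:, 0]
X[:, 3] += 0.5 * X[:, 2]

# Step 2: z-score scaling.
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

# Step 3: covariance matrix of the scaled data (np.cov divides by n - 1).
cov_matrix = np.cov(X_scaled, rowvar=False)

# Steps 4 and 5: eigenvectors and eigenvalues, sorted from largest to smallest.
eig_vals, eig_vecs = np.linalg.eigh(cov_matrix)
order = np.argsort(eig_vals)[::-1]
eig_vals, eig_vecs = eig_vals[order], eig_vecs[:, order]

# sklearn gives the same quantities: explained_variance_ holds the eigenvalues
# and the rows of components_ are the eigenvectors (up to sign).
pca = PCA(n_components=X_scaled.shape[1]).fit(X_scaled)
print(eig_vals)
print(pca.explained_variance_)
```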

Statistical tests to be done before PCA

Bartlett's Test of Sphericity

Bartlett's test of sphericity tests the hypothesis that the variables are uncorrelated in the
population.

H0: All variables in the data are uncorrelated

HA: At least one pair of variables in the data is correlated

• If the null hypothesis cannot be rejected, then PCA is not advisable.
• If the p-value is small, we can reject the null hypothesis and conclude that at least one
pair of variables in the data is correlated; hence PCA is recommended.

Figure 40 p-value for Factor_analyzer

As the reported p-value is 0, well below any conventional significance level, we reject the
null hypothesis and conclude that at least one pair of variables in the data is correlated.
Hence we can proceed with PCA.
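
For reference, the test statistic can be computed by hand as sketched below. The function name `bartlett_sphericity` and the synthetic data are assumptions for illustration. The `factor_analyzer` package named in the figure caption provides `calculate_bartlett_sphericity`, which returns the chi-square value and the p-value.

```python
import numpy as np
from scipy.stats import chi2

def bartlett_sphericity(X):
    """Bartlett's test that the correlation matrix is an identity matrix.

    Statistic: -(n - 1 - (2p + 5) / 6) * ln|R|, chi-square with p(p - 1)/2
    degrees of freedom, where R is the sample correlation matrix.
    """
    n, p = X.shape
    corr = np.corrcoef(X, rowvar=False)
    _, logdet = np.linalg.slogdet(corr)
    statistic = -(n - 1 - (2 * p + 5) / 6) * logdet
    dof = p * (p - 1) / 2
    p_value = chi2.sf(statistic, dof)
    return statistic, p_value

# Synthetic, strongly correlated columns (assumption for the demo).
rng = np.random.default_rng(0)
base = rng.normal(size=(300, 1))
X = base + 0.5 * rng.normal(size=(300, 4))

statistic, p_value = bartlett_sphericity(X)
print(statistic, p_value)
```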

KMO Test

The Kaiser-Meyer-Olkin (KMO) measure of sampling adequacy (MSA) is an index used to
examine how appropriate PCA is.


Generally, if MSA is less than 0.5, PCA is not recommended, since no reduction is expected.
On the other hand, MSA > 0.7 is expected to provide a considerable reduction in the
dimension and extraction of meaningful components.
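
The overall KMO index compares the squared correlations with the squared partial correlations. A minimal hand-written sketch on synthetic data is shown below; the function name `kmo_overall` is hypothetical. The `factor_analyzer` package offers `calculate_kmo`, which returns the per-variable values and the overall index.

```python
import numpy as np

def kmo_overall(X):
    """Overall KMO: sum(r^2) / (sum(r^2) + sum(partial r^2)) over off-diagonals."""
    corr = np.corrcoef(X, rowvar=False)
    inv = np.linalg.inv(corr)
    # Partial correlations come from the scaled, negated inverse correlation matrix.
    scale = np.sqrt(np.outer(np.diag(inv), np.diag(inv)))
    partial = -inv / scale
    off_diag = ~np.eye(corr.shape[0], dtype=bool)
    r2 = (corr[off_diag] ** 2).sum()
    p2 = (partial[off_diag] ** 2).sum()
    return r2 / (r2 + p2)

# Five synthetic variables driven by one common factor (assumption for the demo).
rng = np.random.default_rng(0)
factor = rng.normal(size=(500, 1))
X = factor + 0.5 * rng.normal(size=(500, 5))

kmo = kmo_overall(X)
print(round(kmo, 3))
```

When variables share a common factor, the partial correlations are small relative to the plain correlations, which pushes the index towards 1.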

Figure 41 KMO Index

As the KMO index is 0.807, which is above 0.7, PCA is expected to provide a considerable
reduction in the dimension and extract meaningful components.

Identify the optimum number of PCs (for this project, take at least 90%
explained variance). Show Scree plot.

Figure 42 PCA decomposition

Obtaining the eigenvectors when the number of principal components equals the number
of features in the scaled data

Figure 43 Eigen vector 1

Obtaining variance of individual component

Figure 44 Variable component


Obtaining the Cumulative Sum of the Explained Variance

Figure 45 Cumulative variance

• We can see above that more than 84% of the variance is explained by 5 Principal Components.
• Around 93% of the variance is explained by 9 Principal Components.
• Around 97% of the variance is explained by 14 Principal Components.


Figure 46 Scree plot 1

The number of components can be decided based on the cumulative variance.

Hence, we take the number of components as 5, for which the cumulative explained variance
is around 84%.
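
The cut-off can also be read off programmatically for any chosen threshold. The sketch below uses synthetic data and a hypothetical helper `components_for`; it returns the smallest number of components whose cumulative explained variance reaches the threshold. With the figures quoted above (more than 84% at 5 components, around 93% at 9), a 0.90 threshold would be reached somewhere between 6 and 9 components.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data: 8 observed columns driven by 2 latent factors plus noise.
rng = np.random.default_rng(42)
latent = rng.normal(size=(300, 2))
mixing = rng.normal(size=(2, 8))
X = latent @ mixing + 0.3 * rng.normal(size=(300, 8))
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

# Fit PCA with as many components as features, then accumulate the ratios.
pca = PCA().fit(X_scaled)
cum_var = np.cumsum(pca.explained_variance_ratio_)

def components_for(threshold):
    """Smallest number of PCs whose cumulative explained variance >= threshold."""
    return int(np.searchsorted(cum_var, threshold) + 1)

k = components_for(0.90)
print(k, cum_var.round(3))
```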


Compare PCs with Actual Columns and identify which is explaining most
variance. Write inferences about all the Principal components in terms of actual
variables.
Applying PCA with the chosen number of components to get the loadings and the component
output.

Figure 47 Shape of revised PCA

Figure 48 Principal Components 2


Figure 49 Explained variance of component

Now we can create a dataframe of the component loadings against each field and identify the
pattern.
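
A loadings data frame of this kind can be built as sketched below. The data is synthetic and only the column names are taken from the shortlisted variables; two components are used to keep the example small.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

cols = ["No_HH", "TOT_M", "TOT_F", "M_LIT", "F_LIT"]

# Synthetic columns sharing one common driver (assumption for the demo).
rng = np.random.default_rng(1)
base = rng.normal(size=(100, 1))
X = pd.DataFrame(base + 0.4 * rng.normal(size=(100, 5)), columns=cols)
X_scaled = (X - X.mean()) / X.std(ddof=0)

pca = PCA(n_components=2).fit(X_scaled)

# Rows are the original fields, columns are the principal components.
loadings = pd.DataFrame(pca.components_.T, index=cols, columns=["PC1", "PC2"])

# The field with the largest absolute loading on each component.
dominant = loadings.abs().idxmax()
print(loadings.round(3))
print(dominant)
```

Reading down a column shows which original variables drive that component; loadings of similar size and sign indicate variables that move together.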

Figure 50 PCA Data frame head

Figure 51 Rectangular magnitude for columns in India census data variables

Concatenating the PCA data with the categorical variables, we see the revised shape, head
and description of the variables
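
The concatenation step can be sketched as follows, again on toy values with a hypothetical "Area Name" column. The component scores are placed in a data frame that shares the index of the original rows, so that `pd.concat` lines them up correctly.

```python
import pandas as pd
from sklearn.decomposition import PCA

# Toy data (hypothetical values): two categorical and three numeric columns.
df = pd.DataFrame({
    "State": ["A", "A", "B", "B", "C", "C"],
    "Area Name": ["d1", "d2", "d3", "d4", "d5", "d6"],
    "TOT_M": [300, 700, 210, 400, 500, 620],
    "TOT_F": [290, 650, 230, 390, 450, 640],
    "M_LIT": [200, 500, 150, 260, 330, 410],
})
num_cols = ["TOT_M", "TOT_F", "M_LIT"]

# Scale, reduce to two components, and keep the original row index.
scaled = (df[num_cols] - df[num_cols].mean()) / df[num_cols].std(ddof=0)
scores = PCA(n_components=2).fit_transform(scaled)
pca_df = pd.DataFrame(scores, columns=["PC1", "PC2"], index=df.index)

# Categorical columns side by side with the component scores.
final_df = pd.concat([df[["State", "Area Name"]], pca_df], axis=1)
print(final_df.shape)
print(final_df.head())
```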


Figure 52 New Data India Census PCA shape

Figure 53 Summary PCA India Census Data

Figure 54 PCA Boxplot India Census


Write linear equation for first PC.


Principal components are linear combinations of the original variables. Each PC is a linear
combination of all the variables, or of the scaled variables, as the case may be. Some of the
coefficients may be very small or close to 0. We present the linear combinations
that make up the first 5 PCs.
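
The linear equation for the first PC can be generated from the fitted loadings as sketched below. The data is synthetic and the coefficients it prints are not those of the census data; the check at the end confirms that the equation reproduces the scores sklearn returns.

```python
import numpy as np
from sklearn.decomposition import PCA

cols = ["No_HH", "TOT_M", "TOT_F", "M_LIT", "F_LIT"]

# Synthetic scaled data (assumption for the demo).
rng = np.random.default_rng(7)
base = rng.normal(size=(150, 1))
X = base + 0.5 * rng.normal(size=(150, 5))
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

pca = PCA(n_components=2).fit(X_scaled)

# The first row of components_ holds the coefficients of PC1.
weights = pca.components_[0]
equation = "PC1 = " + " + ".join(
    f"({w:.3f} * {c})" for w, c in zip(weights, cols)
)
print(equation)

# PC1 score of every row: weighted sum of its scaled variables.
pc1_manual = X_scaled @ weights
pc1_sklearn = pca.transform(X_scaled)[:, 0]
```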

Equation 1 Linear equation for first Principal Component
