Data Science Training Report
DECLARATION
I hereby certify that the work which is being presented in the report entitled "Data Science", in fulfillment of the requirement for completion of one-month industrial training in the Department of Computer Science and Engineering of the "Institute of Engineering and Technology, Bundelkhand University", is an authentic record of my own work carried out during industrial training.
Aman Husain
191381030010
CSE, 7th sem
ACKNOWLEDGEMENT
The work in this report is the outcome of continuous work over a period of time and drew intellectual support from Internshala and other sources. I would like to express my profound gratitude and indebtedness to Internshala, which helped me in the completion of the training. I am thankful to the Internshala Training Associates for teaching and assisting me in making the training successful.
Aman Husain
191381030010
CSE, 7th sem
Introduction to Organization:
Internshala is an internship and online training platform, based in Gurgaon, India.
Founded by Sarvesh Agrawal, an IIT Madras alumnus, in 2010, the website
helps students find internships with organisations in India. The platform started in 2010 as a WordPress blog that aggregated internships across India and published articles on education, technology and the skill gap. The website was launched in 2013. Internshala launched its online trainings in 2014. The platform is used by 2.0 Mn+ students and 70,000+ companies. At the core of the idea is the belief that
internships, if managed well, can make a positive difference to the student, to
the employer, and to the society at large. Hence, the ad-hoc culture surrounding
internships in India should and would change. Internshala aims to be the driver
of this change.
About Training:
The Data Science Training by Internshala is a 6-week online training program in which Internshala aims to provide a comprehensive introduction to data science. In this training program, you will learn the basics of Python, statistics, predictive modeling, and machine learning. The training program has video tutorials and is packed with assignments, assessment tests, quizzes, and practice exercises to give you a hands-on learning experience. At the end of this training program, you will have a solid understanding of data science and will be able to build an end-to-end predictive model. For doubt clearing, you can post your queries on the forum and get answers within 24 hours.
Table of Contents
Introduction to Organization
About Training
Module-1: Introduction to Data Science
1.1. Data Science Overview
Predictive modeling:
Predictive modeling is a form of artificial intelligence that uses data mining and probability to forecast
or estimate more granular, specific outcomes.
For example, predictive modeling could help identify customers who are likely to purchase our new
One AI software over the next 90 days.
Machine Learning:
Machine learning is a branch of artificial intelligence (AI) where computers learn to act and adapt to new data without being explicitly programmed to do so. The computer is able to act independently of human interaction.
Forecasting:
Forecasting is a process of predicting or estimating future events based on past and present data, most commonly by analysis of trends. "Guessing" doesn't cut it. A forecast, unlike a prediction, must have logic to it. It must be defendable. This logic is what differentiates it from the magic 8-ball's lucky guess. After all, even a broken watch is right twice a day.
Netflix knew that significant numbers of people who liked Fincher also liked Wright. All this
information combined to suggest that buying the series would be a good investment for the
company.
• Python is currently the most widely used multi-purpose, high-level programming language.
• Python allows programming in Object-Oriented and Procedural paradigms.
• Python programs are generally smaller than those written in other programming languages like Java. Programmers have to type relatively less, and the indentation requirement of the language keeps the code readable.
• Python is used by almost all tech-giant companies like Google, Amazon, Facebook, Instagram, Dropbox, Uber, etc.
• The biggest strength of Python is its huge collection of standard libraries, which can be used for the following:
• Machine Learning
• GUI Applications (like Kivy, Tkinter, PyQt etc. )
• Web frameworks like Django (used by YouTube, Instagram, Dropbox)
• Image processing (like OpenCV, Pillow)
• Web scraping (like Scrapy, BeautifulSoup, Selenium)
• Test frameworks
• Multimedia
• Scientific computing
• Text processing and many more.
a. Arithmetic Operators:

OPERATOR   DESCRIPTION                                                                      SYNTAX
+          Addition: adds two operands                                                      x + y
-          Subtraction: subtracts two operands                                              x - y
*          Multiplication: multiplies two operands                                          x * y
/          Division (float): divides the first operand by the second                        x / y
//         Division (floor): divides the first operand by the second                        x // y
%          Modulus: returns the remainder when the first operand is divided by the second   x % y
**         Power: returns the first operand raised to the power of the second               x ** y
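A short runnable sketch of these arithmetic operators (the operand values are illustrative, not from the training material):

# Arithmetic operators in Python
x, y = 7, 3
print(x + y)   # 10  addition
print(x - y)   # 4   subtraction
print(x * y)   # 21  multiplication
print(x / y)   # 2.333... float division
print(x // y)  # 2   floor division
print(x % y)   # 1   modulus (remainder)
print(x ** y)  # 343 power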
b. Relational Operators:
Relational operators compare values. They return either True or False according to the condition.

OPERATOR   DESCRIPTION                                                                               SYNTAX
>          Greater than: True if the left operand is greater than the right                          x > y
<          Less than: True if the left operand is less than the right                                x < y
!=         Not equal to: True if the operands are not equal                                          x != y
>=         Greater than or equal to: True if the left operand is greater than or equal to the right  x >= y
c. Logical operators:
Logical operators perform Logical AND, Logical OR and Logical NOT operations.

OPERATOR   DESCRIPTION                                          SYNTAX
and        Logical AND: True if both the operands are true      x and y
or         Logical OR: True if either of the operands is true   x or y
not        Logical NOT: True if the operand is false            not x
d. Bitwise operators:
Bitwise operators act on bits and perform bit-by-bit operations.

OPERATOR   DESCRIPTION   SYNTAX
|          Bitwise OR    x | y
~          Bitwise NOT   ~x
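A brief sketch of relational, logical and bitwise operators in action (the values are illustrative):

# Relational, logical and bitwise operators
x, y = 10, 4
print(x > y, x < y, x != y, x >= y)   # True False True True
print(x > 5 and y > 5)                # False — both conditions must hold
print(x > 5 or y > 5)                 # True — one condition is enough
print(not (x > 5))                    # False
print(x | y)                          # 14 — 1010 | 0100 = 1110
print(~x)                             # -11 — bitwise NOT of 10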
e. Assignment operators:
Assignment operators are used to assign values to variables.

OPERATOR   DESCRIPTION                                                                  SYNTAX
=          Assigns the value of the right side of the expression to the left operand    x = y
<<=        Performs bitwise left shift on the operands and assigns the result to the left operand   a <<= b (a = a << b)
Operator Precedence:

OPERATOR   DESCRIPTION                                   ASSOCIATIVITY
()         Parentheses                                   left-to-right
**         Exponent                                      right-to-left
* / %      Multiplication/division/modulus               left-to-right
+ -        Addition/subtraction                          left-to-right
<< >>      Bitwise shift left, bitwise shift right       left-to-right
< <=       Relational less than/less than or equal to    left-to-right
• Python is case-sensitive, and so are Python identifiers. Name and name are two different
identifiers.
b. Assigning and Reassigning Python Variables:
• To assign a value to Python variables, you don't need to declare their type.
• You name it according to the rules stated in section 2a, and type the value after the equal sign (=).
• You can't put the identifier on the right-hand side of the equal sign.
• Neither can you assign Python variables to a keyword.
c. Multiple Assignment:
• You can assign values to multiple Python variables in one statement.
• You can assign the same value to multiple Python variables.
d. Deleting Variables:
• You can also delete Python variables using the keyword 'del'.
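A small sketch of assignment, multiple assignment and deletion (the variable names are illustrative):

# Assigning, reassigning and deleting variables
course = "data science"     # no type declaration needed
a, b, c = 1, 2.5, "three"   # multiple variables in one statement
x = y = z = 0               # same value to multiple variables
del course                  # 'course' no longer exists after this line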
Data Types:
A. Python Numbers:
There are four numeric Python data types.
a. int
int stands for integer. This Python Data Type holds signed integers. We can use the type() function to
find which class it belongs to.
b. float
This Python Data Type holds floating-point real values. An int can only store the number 3, but float
can store 3.25 if you want.
c. long
This Python Data type holds a long integer of unlimited length. But this construct does not exist in
Python 3.x.
d. complex
This Python Data type holds a complex number. A complex number looks like this: a+bj. Here, a is the real part, b is the imaginary part, and j is the imaginary unit.
B. Strings:
A string is a sequence of characters. Python does not have a char data type, unlike C++ or Java. You
can delimit a string using single quotes or double-quotes.
a. Spanning a String Across Lines:
To span a string across multiple lines, you can use triple quotes.
b. Displaying Part of a String:
You can display a character from a string using its index in the string. Remember, indexing starts with
0.
c. String Formatters:
String formatters allow us to print characters and values at once. You can use the
% operator.
d. String Concatenation:
You can concatenate (join) strings using the + operator. However, you cannot concatenate values of different types.
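A short sketch of the string operations above (the strings are illustrative):

# Strings: spanning lines, indexing, formatting, concatenation
s = """This string
spans multiple lines"""
print(s[0])                            # 'T' — indexing starts with 0
print("Value of pi: %.2f" % 3.14159)   # the % string formatter
print("data" + " " + "science")        # concatenation with +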
C. Python Lists:
A list is a collection of values. Remember, it may contain different types of values.
To define a list, you must put values separated with commas in square brackets. You don't need to declare a type for a list either.
a. Slicing a List
You can slice a list the way you'd slice a string, with the slicing operator. Indexing for a list begins with 0, like for a string. Python doesn't have arrays.
b. Length of a List
Python supports an inbuilt function to calculate the length of a list.
c. Reassigning Elements of a List
A list is mutable. This means that you can reassign elements later on.
d. Iterating on the List
To iterate over the list we can use the for loop. By iterating, we can access each element one by one, which is very helpful when we need to perform some operation on each element of the list.
e. Multidimensional Lists
A list may have more than one dimension. See DataFlair's tutorial on Python Lists for a detailed look at this.
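A compact sketch covering slicing, length, reassignment and iteration on lists (the values are illustrative):

# Python list operations
nums = [10, 20, 30, 40, 50]
print(nums[1:4])     # [20, 30, 40] — slicing, indexing starts at 0
print(len(nums))     # 5 — inbuilt length function
nums[0] = 99         # lists are mutable, so reassignment works
for n in nums:       # iterating over each element
    print(n)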
D. Python Tuples:
A tuple is like a list. You declare it using parentheses instead.
a. Accessing and Slicing a Tuple
You access a tuple the same way as you'd access a list. The same goes for slicing it.
b. A Tuple is Immutable
A Python tuple is immutable. Once declared, you can't change its size or elements.
E. Dictionaries:
A dictionary holds key-value pairs. Declare it in curly braces, with pairs separated by commas. Separate keys and values by a colon (:). The type() function works with dictionaries too.
a. Accessing a Value
To access a value, you mention the key in square brackets.
b. Reassigning Elements
You can reassign a value to a key.
c. List of Keys
Use the keys() function to get a list of keys in the dictionary.
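A minimal sketch of the dictionary operations just described (the keys and values are illustrative):

# Dictionary: access, reassign, list keys
marks = {"maths": 91, "physics": 84}
print(marks["maths"])   # access a value via its key
marks["maths"] = 95     # reassign a value to a key
print(marks.keys())     # dict_keys(['maths', 'physics'])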
F. Bool:
A Boolean value can be True or False.
G. Sets:
A set can have a list of values. Define it using curly braces. It keeps only one instance of any value present more than once. However, a set is unordered, so it doesn't support indexing. Also, it is mutable. You can change its elements or add more. Use the add() and remove() methods to do so.
H. Type Conversion:
Since Python is dynamically-typed, you may want to convert a value into another type. Python supports a list of functions for the same (a short sketch follows the list below).
a. int()
b. float()
c. bool()
d. set()
e. list()
f. tuple()
g. str()
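A quick sketch of these conversion functions (the inputs are illustrative):

# Type conversion functions
print(int("42"), float(3), bool(0))   # 42 3.0 False
print(set([1, 1, 2]), list("abc"))    # {1, 2} ['a', 'b', 'c']
print(tuple([1, 2]), str(3.14))       # (1, 2) 3.14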
2.4. Conditional Statements
a. If statements
The if statement is one of the most commonly used conditional statements in most programming languages. It decides whether certain statements need to be executed or not. The if statement checks a given condition; if the condition is true, then the set of code present inside the if block will be executed.
The if condition evaluates a Boolean expression and executes the block of code only when the Boolean expression becomes TRUE.
Syntax:
if (Boolean expression):
    Block of code  # Set of statements to execute if the condition is true
b. If-else statements
The statement itself tells us that if a given condition is true, then the statements present inside the if block are executed, and if the condition is false, then the else block is executed.
The else block will execute only when the condition becomes false; this is the block where you will perform some actions when the condition is not true.
The if-else statement evaluates the Boolean expression and executes the block of code present inside the if block if the condition becomes TRUE, and executes the block of code present in the else block if the condition becomes FALSE.
Syntax:
if (Boolean expression):
    Block of code  # Set of statements to execute if condition is true
else:
    Block of code  # Set of statements to execute if condition is false
c. elif statements
In Python, we have one more conditional statement called the elif statement. The elif statement is used to check multiple conditions, but only if the given if condition is false. It's similar to an if-else statement, with the only difference being that in else we do not check the condition, whereas in elif we do check the condition.
Elif statements are similar to if-else statements, but elif statements can evaluate multiple conditions.
Syntax:
if (condition):
    # Set of statements to execute if condition is true
elif (condition):
    # Set of statements to be executed when if condition is false and elif condition is true
else:
    # Set of statements to be executed when both if and elif conditions are false
d. Nested if-else statements
Syntax:
if (condition):
    # Statements to execute if condition is true
    if (condition):
        # Statements to execute if condition is true
    else:
        # Statements to execute if condition is false
else:
    # Statements to execute if condition is false
e. elif Ladder
We have seen the elif statement, but what is this elif ladder? As the name itself suggests, it is a program which contains a ladder of elif statements, i.e. elif statements structured in the form of a ladder.
This statement is used to test multiple expressions.
Syntax:
if (condition):
    # Set of statements to execute if condition is true
elif (condition):
    # Set of statements to be executed when if condition is false and elif condition is true
elif (condition):
    # Set of statements to be executed when both if and first elif condition are false and second elif condition is true
elif (condition):
    # Set of statements to be executed when if, first elif and second elif conditions are false and third elif condition is true
else:
    # Set of statements to be executed when all if and elif conditions are false
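A runnable sketch of an if/elif/else ladder (the marks and grade thresholds are illustrative):

# if / elif / else ladder
marks = 72
if marks >= 90:
    print("Grade A")
elif marks >= 60:
    print("Grade B")    # this branch runs for marks = 72
else:
    print("Grade C")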
2.5. Looping Constructs
Loops:
a. while loop:
Repeats a statement or group of statements while a given condition is TRUE. It tests the condition
before executing the loop body.
Syntax:
while expression:
    statement(s)
for loop:
Executes a sequence of statements multiple times and abbreviates the code that manages the loop
variable.
Syntax:
for iterating_var in sequence:
    statement(s)
b. nested loops:
You can use one or more loops inside any other while or for loop.
Syntax of nested for loop:
for iterating_var in sequence:
    for iterating_var in sequence:
        statement(s)
    statement(s)
Syntax of nested while loop:
while expression:
    while expression:
        statement(s)
    statement(s)
Loop Control Statements:
a. break statement:
Terminates the loop statement and transfers execution to the statement immediately following the loop.
b. continue statement:
Causes the loop to skip the remainder of its body and immediately retest its condition prior to
reiterating.
c. pass statement:
The pass statement in Python is used when a statement is required syntactically but you do not want
any command or code to execute.
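A short sketch tying together loops and the three control statements (the ranges are illustrative):

# Loops with break, continue and pass
for i in range(5):
    if i == 3:
        break          # terminate the loop at 3
    if i == 1:
        continue       # skip the rest of the body for 1
    print(i)           # prints 0 and 2

n = 3
while n > 0:           # condition tested before each iteration
    n -= 1

def placeholder():
    pass               # syntactically required, does nothing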
2.6. Functions
B. User-Defined Functions:
These are functions that are defined by the user, for simplicity and to avoid repetition of code. They are defined using the def keyword.
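A minimal sketch of a user-defined function (the name and logic are illustrative):

# A user-defined function created with def
def square(x):
    return x * x   # avoids repeating this code wherever it is needed

print(square(5))   # 25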
2.7. Data Structure
Python has implicit support for Data Structures which enable you to store and access data. These
structures are called List, Dictionary, Tuple and Set.
2.8. Lists
Lists in Python are the most versatile data structure. They are used to store heterogeneous data items,
from integers to strings or even another list! They are also mutable, which means that their
elements can be changed even after the list is created.
Creating Lists
Lists are created by enclosing elements within [square] brackets and each item is separated by a
comma.
Creating lists in Python
Since each element in a list has its own distinct position, having duplicate values in a list is not a
problem.
Accessing List elements
To access elements of a list, we use Indexing. Each element in a list has an index related to it
depending on its position in the list. The first element of the list has the index 0, the next
element has index 1, and so on. The last element of the list has an index of one less than the
length of the list.
Indexing in Python lists
While positive indexes return elements from the start of the list, negative indexes return values from
the end of the list. This saves us from the trivial calculation which we would have to
otherwise perform if we wanted to return the nth element from the end of the list. So instead
of trying to return List_name[len(List_name)-1] element, we can simply write List_name[-1].
Using negative indexes, we can return the nth element from the end of the list easily. If we wanted to
return the first element from the end, or the last index, the associated index is -1. Similarly,
the index for the second last element will be -2, and so on. Remember, the 0th index will still
refer to the very first element in the list.
Appending values in Lists
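The appending example was an image in the original; a small sketch of the same idea (the values are illustrative):

# Appending values to a list
langs = ["python", "r"]
langs.append("sql")       # adds a single element at the end
print(langs)              # ['python', 'r', 'sql']
print(langs[-1])          # 'sql' — the new last element, via negative index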
2.9. Dictionaries
A dictionary is another Python data structure used to store heterogeneous objects; it is mutable but unordered.
Generating Dictionary
Dictionaries are generated by writing keys and values within { curly } brackets, with each key separated from its value by a colon, and each key-value pair separated by a comma:
Using the key of the item, we can easily extract the associated value of the item:
Dictionaries are very useful for accessing items quickly because, unlike lists and tuples, a dictionary does not have to iterate over all the items to find a value. A dictionary uses the item's key to quickly find the item's value. This concept is called hashing.
We can even access these values simultaneously using the items() method which returns the respective
key and value pair for each element of the dictionary.
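A brief sketch of items() in a loop (the dictionary is illustrative):

# Accessing keys and values together with items()
prices = {"apple": 30, "mango": 50}
for key, value in prices.items():
    print(key, value)    # each pair printed on its own line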
Using csv.reader(): At first, the CSV file is opened using the open() method in 'r' mode (which specifies read mode while opening a file), which returns the file object. It is then read using the reader() method of the csv module, which returns the reader object that iterates through the lines of the specified CSV document.
Note: The 'with' keyword is used along with the open() method, as it simplifies exception handling and automatically closes the CSV file.
import csv
import pandas
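The full code listing was an image in the original; a minimal sketch of csv.reader(), assuming a file named data.csv exists (the filename is illustrative):

# Reading a CSV file with csv.reader()
import csv

with open("data.csv", "r") as f:     # 'with' closes the file automatically
    reader = csv.reader(f)
    for row in reader:
        print(row)                   # each row is a list of strings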
DataFrame Methods:

FUNCTION         DESCRIPTION
value_counts()   Method counts the number of times each unique value occurs within the Series
isnull()         Method creates a Boolean Series for extracting rows with null values
notnull()        Method creates a Boolean Series for extracting rows with non-null values
between()        Method creates a Boolean Series for extracting rows whose values lie within a predefined range
dtypes           Returns a Series with the data type of each column
iloc[]           Method for purely integer-location based selection of rows and columns by position
nsmallest()      Method pulls out the rows with the smallest values in a column of the DataFrame
nlargest()       Method pulls out the rows with the largest values in a column of the DataFrame
ndim             Returns the number of DataFrame dimensions: 1 if Series, otherwise 2 if DataFrame
dropna()         Method allows the user to analyze and drop rows/columns with null values from a DataFrame
fillna()         Method manages and lets the user replace NaN values with some value of their own
duplicated()     Method creates a Boolean Series and uses it to extract rows that have duplicate values
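A compact sketch exercising a few of these methods on a toy DataFrame (the data is illustrative):

# A few common DataFrame/Series methods
import pandas as pd

df = pd.DataFrame({"city": ["A", "B", "A", "C"],
                   "population": [5.0, None, 5.0, 2.1]})
print(df["city"].value_counts())      # occurrences of each unique value
print(df["population"].isnull())      # Boolean Series marking null values
print(df.dropna())                    # rows without null values
print(df.fillna(0))                   # NaN replaced with a chosen value
print(df.nlargest(2, "population"))   # rows with the largest values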
(ii) Median:
It is a measure of the central value of a sample set. The data set is ordered from lowest to highest value and the exact middle value is taken as the median.
(iii) Mode:
It is the value that occurs most frequently in the sample set. The value repeated most of the time in the data set is the mode.
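The worked examples were images in the original; a small sketch computing these measures with the standard library (the sample values are illustrative):

# Mean, median and mode with the statistics module
import statistics

sample = [4, 1, 2, 2, 3, 5]
print(statistics.mean(sample))    # 2.833... — arithmetic average
print(statistics.median(sample))  # 2.5 — middle of the ordered data
print(statistics.mode(sample))    # 2 — most frequent value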
Boxplot: It is based on the percentiles of the data, as shown in the figure below. The top and bottom of the boxplot are the 75th and 25th percentiles of the data. The extended lines are known as whiskers and cover the range of the rest of the data.
# BoxPlot Population In Millions
fig, ax1 = plt.subplots()
fig.set_size_inches(9, 15)
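The listing above is only a fragment; a hedged completion, assuming a pandas DataFrame named data with a Population column (the same object the density-plot code later in this section uses):

# BoxPlot - Population (sketch; 'data' with a Population column is assumed)
import matplotlib.pyplot as plt

fig, ax1 = plt.subplots()
fig.set_size_inches(9, 15)
ax1.boxplot(data.Population.dropna())        # box spans the 25th-75th percentiles
ax1.set_title("Boxplot - Population", fontsize = 20)
plt.show()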
Frequency Table: It is a tool to distribute the data into equally spaced ranges (segments) and tells us how many values fall in each segment.
Histogram: It is a way of visualizing the data distribution of a frequency table, with bins on the x-axis and the data count on the y-axis.
Code – Histogram
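The histogram listing was an image in the original; a minimal sketch under the same assumption of a data DataFrame with a Population column:

# Histogram - Population (sketch; 'data' is assumed as above)
import matplotlib.pyplot as plt

fig, ax2 = plt.subplots()
ax2.hist(data.Population.dropna(), bins = 10)
ax2.set_xlabel("Population", fontsize = 15)
ax2.set_ylabel("Frequency", fontsize = 15)
plt.show()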
Density Plot: It is related to the histogram, as it shows the data values distributed as a continuous line. It is a smoothed version of the histogram. The output below is the density plot superposed over a histogram.
Code – Density plot for the data
# Density Plot - Population
import seaborn as sns

ax3 = sns.distplot(data.Population, kde = True)
ax3.set_ylabel("Density", fontsize = 15)
ax3.set_xlabel("Murder Rate per Million", fontsize = 15)
ax3.set_title("Density Plot - Population", fontsize = 20)
Probability refers to the extent of occurrence of events. When an event occurs, like throwing a ball or picking a card from a deck, then there must be some probability associated with that event.
In terms of mathematics, probability refers to the ratio of wanted outcomes to the total number of possible outcomes. There are three approaches to the theory of probability, namely:
1. Empirical Approach
2. Classical Approach
3. Axiomatic Approach
In this section, we are going to study the Axiomatic Approach. In this approach, we represent the probability in terms of the sample space (S) and other terms.
Basic Terminologies:
Random Event – If the repetition of an experiment occurs several times under similar conditions, and it does not produce the same outcome every time, but the outcome in a trial is one of several possible outcomes, then such an experiment is called a random event or a probabilistic event.
Elementary Event – The elementary event refers to the outcome of each random event performed. Whenever the random event is performed, each associated outcome is known as an elementary event.
Sample Space – Sample space refers to the set of all possible outcomes of a random event. For example, when a coin is tossed, the possible outcomes are head and tail.
Event – An event refers to the subset of the sample space associated with a random event.
Occurrence of an Event – An event associated with a random event is said to occur if any one of the elementary events belonging to it is an outcome.
Sure Event – An event associated with a random event is said to be a sure event if it always occurs whenever the random event is performed.
Impossible Event – An event associated with a random event is said to be an impossible event if it never occurs whenever the random event is performed.
Compound Event – An event associated with a random event is said to be a compound event if it is the disjoint union of two or more elementary events.
Mutually Exclusive Events – Two or more events associated with a random event are said to be mutually exclusive events if the occurrence of any one of them prevents the occurrence of all the others. This means that no two or more of these events can occur simultaneously.
Exhaustive Events – Two or more events associated with a random event are said to be
exhaustive events if their union is the sample space.
Probability of an Event – If there are total p possible outcomes associated with a random experiment
and q of them are favourable outcomes to the event A, then the probability of event A is
denoted by P(A) and is given by
P(A) = q/p
It can be observed from the above graph that the distribution is symmetric about its center, which is also the mean (0 in this case). This makes events at equal deviations from the mean equally probable. The density is highly centered around the mean, which translates to lower probabilities for values away from the mean.
Probability Density Function –
The probability density function of the general normal distribution is given as

f(x) = (1 / (σ√(2π))) · e^(−(x − μ)² / (2σ²))

In the above formula, all the symbols have their usual meanings: σ is the Standard Deviation and μ is the Mean. It is easy to get overwhelmed by the above formula while trying to understand everything in one glance, but we can try to break it down into smaller pieces so as to get an intuition as to what is going on. The z-score, z = (x − μ)/σ, is a measure of how many standard deviations away a data point is from the mean.
The exponent of e in the above formula is −1/2 times the square of the z-score. This is actually in accordance with the observations that we made above: values away from the mean have a lower probability compared to values near the mean. Values away from the mean have a higher z-score and consequently a lower probability, since the exponent is negative. The opposite is true for values closer to the mean.
This gives way to the 68-95-99.7 rule, which states that the percentages of values that lie within a band around the mean in a normal distribution with a width of two, four and six standard deviations comprise 68%, 95% and 99.7% of all the values. The figure given below shows this rule.
The effects of μ and σ on the distribution are shown below. Here μ is used to reposition the center of the distribution, and consequently move the graph left or right, and σ is used to flatten or inflate the curve.
If the t-value is large, the two groups are likely to belong to different populations. If the t-value is small, the two groups are likely to belong to the same population.
There are three types of t-tests, and they are categorized as dependent and independent t-tests.
1. Independent samples t-test: compares the means for two groups.
2. Paired sample t-test: compares means from the same group at different times (say, one year apart).
3. One sample t-test: tests the mean of a single group against a known mean.
Here, x̄ and ȳ are the means of the given sample sets, n is the total number of samples, and xi and yi are the individual samples of the sets.
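A small sketch of the three t-tests with SciPy (the sample arrays are illustrative):

# t-tests with scipy.stats
from scipy import stats

a = [5.1, 4.9, 6.2, 5.8, 5.5]
b = [4.2, 4.8, 4.1, 4.9, 4.4]
print(stats.ttest_ind(a, b))        # independent samples t-test
print(stats.ttest_rel(a, b))        # paired sample t-test (equal-length samples)
print(stats.ttest_1samp(a, 5.0))    # one sample t-test against known mean 5.0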
2. Data Collection:
It involves the historical or past data from an authorized source over which predictive analysis is to be performed.
3. Data Cleaning:
Data Cleaning is the process in which we refine our data sets. In the process of data cleaning, we remove unnecessary and erroneous data. It involves removing redundant and duplicate data from our data sets.
4. Data Analysis:
It involves the exploration of data. We explore the data and analyze it thoroughly in order to identify
some patterns or new outcomes from the data set. In this stage, we discover useful
information and conclude by identifying some patterns or trends.
5. Build Predictive Model:
In this stage of predictive analysis, we use various algorithms to build predictive models based on the patterns observed. It requires knowledge of Python, R, statistics, MATLAB and so on. We also test our hypothesis using standard statistical models.
6. Validation:
It is a very important step in predictive analysis. In this step, we check the efficiency of our model by
performing various tests. Here we provide sample input sets to check the validity of our
model. The model needs to be evaluated for its accuracy in this stage.
7. Deployment:
In deployment, we make our model work in a real environment, where it helps in everyday decision making, and we make it available for use.
8. Model Monitoring:
Regularly monitor your models to check performance and ensure that we have proper results. This means seeing how model predictions are performing against actual data sets.
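A hedged, minimal sketch of the build-and-validate steps with scikit-learn; the synthetic data, model choice and metric are illustrative, not prescribed by the training material:

# Minimal predictive-modeling sketch: build, then validate
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples = 200, random_state = 0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 0)
model = LogisticRegression().fit(X_train, y_train)      # build the model
print(accuracy_score(y_test, model.predict(X_test)))    # validate on held-out data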
4.4. Hypothesis Generation
A hypothesis is a function that best describes the target in supervised machine learning. The hypothesis that an algorithm comes up with depends upon the data and also upon the restrictions and bias that we have imposed on the data. To better understand the Hypothesis Space and Hypothesis, consider the following coordinate plot that shows the distribution of some data:
Access modes govern the type of operations possible in the opened file. The access mode refers to how the file will be used once it's opened. These modes also define the location of the file handle in the file. The file handle is like a cursor, which defines from where the data has to be read or written in the file. Different access modes for reading a file are:
1. Read Only ('r'): Open text file for reading. The handle is positioned at the beginning of the file. If the file does not exist, an I/O error is raised. This is also the default mode in which a file is opened.
2. Read and Write ('r+'): Open the file for reading and writing. The handle is positioned at the beginning of the file. Raises an I/O error if the file does not exist.
3. Append and Read ('a+'): Open the file for reading and writing. The file is created if it does not exist. The handle is positioned at the end of the file. The data being written will be inserted at the end, after the existing data.
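A small sketch of these access modes (the file name is illustrative):

# File access modes: 'a+' then 'r'
with open("notes.txt", "a+") as f:   # file is created if it does not exist
    f.write("a new line\n")          # data goes at the end

with open("notes.txt", "r") as f:    # read-only, handle at the beginning
    print(f.read())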
4.8. Variable Identification
First, identify Predictor (Input) and Target (output) variables. Next, identify the data type and
category of the variables.
Example: Suppose we want to predict whether the students will play cricket or not (refer to the data set below). Here you need to identify the predictor variables, the target variable, the data type of the variables and the category of the variables.
Note: Univariate analysis is also used to highlight missing and outlier values. In the upcoming sections, we will look at methods to handle missing and outlier values.
Continuous & Continuous: While doing bi-variate analysis between two continuous variables, we should look at a scatter plot. It is a nifty way to find out the relationship between two variables. The pattern of the scatter plot indicates the relationship between the variables. The relationship can be linear or non-linear.
A scatter plot shows the relationship between two variables but does not indicate the strength of the relationship between them. To find the strength of the relationship, we use correlation. Correlation varies between -1 and +1.
● -1: perfect negative linear correlation
● +1: perfect positive linear correlation
● 0: No correlation
Various tools have functions or functionality to identify the correlation between variables. In Excel, the function CORREL() is used to return the correlation between two variables, and SAS uses the procedure PROC CORR to identify the correlation. These functions return the Pearson correlation value to identify the relationship between two variables:
In the above example, we have a good positive relationship (0.65) between the two variables X and Y.
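A quick sketch of computing the Pearson correlation in Python (the sample series are illustrative):

# Pearson correlation with pandas
import pandas as pd

x = pd.Series([1, 2, 3, 4, 5])
y = pd.Series([2, 4, 5, 4, 6])
print(x.corr(y))   # Pearson correlation coefficient by default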
Categorical & Categorical: To find the relationship between two categorical variables, we can use the following methods:
● Two-way table: We can start analyzing the relationship by creating a two-way table of count and count%. The rows represent the categories of one variable and the columns represent the categories of the other variable. We show the count or count% of observations available in each combination of row and column categories.
● Stacked Column Chart: This method is more of a visual form of the two-way table.
● Chi-Square Test: This test is used to derive the statistical significance of the relationship between the variables. It also tests whether the evidence in the sample is strong enough to generalize the relationship to a larger population. Chi-square is based on the difference between the expected and observed frequencies in one or more categories of the two-way table. It returns the probability for the computed chi-square distribution with the degrees of freedom.
Probability of 0: It indicates that both categorical variables are dependent.
Probability of 1: It shows that both variables are independent.
Probability less than 0.05: It indicates that the relationship between the variables is significant at 95% confidence. The chi-square test statistic for a test of independence of two categorical variables is found by:

χ² = Σ (O − E)² / E

where O represents the observed frequency and E is the expected frequency under the null hypothesis, computed by:

E = (row total × column total) / sample size
From the previous two-way table, the expected count for product category 1 to be of small size is 0.22. It is derived by taking the row total for Size (9) times the column total for Product category (2), then dividing by the sample size (81). This procedure is conducted for each cell. The statistical measures used to analyze the power of the relationship are:
Different data science languages and tools have specific methods to perform the chi-square test. In SAS, we can use Chisq as an option with Proc freq to perform this test.
Categorical & Continuous: While exploring the relation between categorical and continuous variables, we can draw box plots for each level of the categorical variable. If the levels are small in number, the plot will not show the statistical significance. To look at the statistical significance we can perform a Z-test, T-test or ANOVA.
● Z-Test/T-Test: Either test assesses whether the means of two groups are statistically different from each other or not.
● ANOVA: It assesses whether the averages of more than two groups are statistically different.
Notice the missing values in the image shown above: in the left scenario, we have not treated missing values. The inference from this data set is that the chances of playing cricket among males are higher than among females. On the other hand, if you look at the second table, which shows data after treatment of missing values (based on gender), we can see that females have higher chances of playing cricket compared to males.
We looked at the importance of treatment of missing values in a dataset. Now, let's identify the reasons for the occurrence of these missing values. They may occur at two stages:
1. Data Extraction: It is possible that there are problems with the extraction process. In such cases, we should double-check for correct data with the data guardians. Some hashing procedures can also be used to make sure the data extraction is correct. Errors at the data extraction stage are typically easy to find and can be corrected easily as well.
2. Data collection: These errors occur at the time of data collection and are harder to correct. They can be categorized into four types:
o Missing completely at random: This is a case when the probability of a missing variable is the same for all observations. For example: respondents of a data collection process decide that they will declare their earnings after tossing a fair coin. If a head occurs, the respondent declares his/her earnings and vice versa. Here each observation has an equal chance of a missing value.
o Missing at random: This is a case when the variable is missing at random and the missing ratio varies for different values/levels of other input variables. For example: we are collecting data for age, and females have a higher missing rate compared to males.
o Missing that depends on unobserved predictors: This is a case when the missing values are not random and are related to an unobserved input variable. For example: in a medical study, if a particular diagnostic causes discomfort, then there is a higher chance of drop-out from the study. This missing value is not at random unless we have included "discomfort" as an input variable for all patients.
o Missing that depends on the missing value itself: This is a case when the probability of a missing value is directly correlated with the missing value itself. For example: people with higher or lower income are likely to provide a non-response about their earnings.
1. Deletion: It is of two types: List Wise Deletion and Pair Wise Deletion.
o In list wise deletion, we delete observations where any of the variables are missing. Simplicity is one of the major advantages of this method, but this method reduces the power of the model because it reduces the sample size.
o In pair wise deletion, we perform analysis with all cases in which the variables of interest are present. The advantage of this method is that it keeps as many cases as possible available for analysis. One of the disadvantages of this method is that it uses different sample sizes for different variables.
o Deletion methods are used when the nature of the missing data is "missing completely at random"; otherwise, non-random missing values can bias the model output.
2. Mean/Mode/Median Imputation: Imputation is a method to fill in the missing values with estimated ones. The objective is to employ known relationships that can be identified in the valid values of the data set to assist in estimating the missing values. Mean/mode/median imputation is one of the most frequently used methods. It consists of replacing the missing data for a given attribute by the mean or median (quantitative attribute) or mode (qualitative attribute) of all known values of that variable. It can be of two types:
o Generalized Imputation: In this case, we calculate the mean or median for all non-missing values of that variable, then replace the missing values with the mean or median. As in the above table, the variable "Manpower" is missing, so we take the average of all non-missing values of "Manpower" (28.33) and then replace the missing value with it.
o Similar case Imputation: In this case, we calculate the average for gender "Male" (29.75) and "Female" (25) individually over non-missing values, then replace the missing value based on gender. For "Male", we will replace missing values of manpower with 29.75 and for "Female" with 25.
3. Prediction Model: A prediction model is one of the more sophisticated methods for handling missing data. Here, we create a predictive model to estimate the values that will substitute the missing data. In this case, we divide our data set into two sets: one set with no missing values for the variable and another with missing values. The first data set becomes the training data set of the model, while the second data set with missing values is the test data set, and the variable with missing values is treated as the target variable. Next, we create a model to predict the target variable based on the other attributes of the training data set and populate the missing values of the test data set. We can use regression, ANOVA, logistic regression and various other modeling techniques to perform this. There are two drawbacks to this approach:
o The model-estimated values are usually more well-behaved than the true values.
o If there are no relationships between the attributes in the data set and the attribute with missing values, then the model will not be precise for estimating missing values.
4. KNN Imputation: In this method of imputation, the missing values of an attribute are imputed using the given number of attributes that are most similar to the attribute whose values are missing. The similarity of two attributes is determined using a distance function. It is known to have certain advantages and disadvantages.
o Advantages:
▪ k-nearest neighbour can predict both qualitative & quantitative attributes
▪ Creation of a predictive model for each attribute with missing data is not required
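A hedged sketch of KNN imputation with scikit-learn's KNNImputer (the array and parameter choices are illustrative):

# KNN imputation: fill NaNs from the most similar rows
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [7.0, 6.0],
              [8.0, 5.0]])
imputer = KNNImputer(n_neighbors = 2)
print(imputer.fit_transform(X))   # NaN replaced using the 2 nearest rows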
● Any value beyond the range of Q1 − 1.5 × IQR to Q3 + 1.5 × IQR can be considered an outlier (a short sketch of this rule follows the list)
● Use capping methods: any value outside the range of the 5th and the 95th percentile can be considered an outlier
● Data points three or more standard deviations away from the mean are considered outliers
● Outlier detection is merely a special case of the examination of data for influential data points, and it also depends on the business understanding
● Bivariate and multivariate outliers are typically measured using either an index of influence or leverage, or distance. Popular indices such as Mahalanobis' distance and Cook's D are frequently used to detect outliers.
● In SAS, we can use PROC Univariate and PROC SGPLOT. To identify outliers and influential observations, we also look at statistical measures like STUDENT, COOKD, RSTUDENT and others.
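A short sketch of the IQR rule from the first bullet (the sample data is illustrative):

# Detecting outliers with the IQR rule
import numpy as np

data = np.array([10, 12, 12, 13, 12, 11, 29])    # 29 looks suspicious
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
print(data[(data < lower) | (data > upper)])      # -> [29]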
Most of the ways to deal with outliers are similar to the methods for missing values, like deleting observations, transforming them, binning them, treating them as a separate group, imputing values and other statistical methods. Here, we will discuss the common techniques used to deal with outliers:
Deleting observations: We delete outlier values if they are due to data entry error or data processing error, or if the outlier observations are very small in number. We can also use trimming at both ends to remove outliers.
Transforming and binning values: Transforming variables can also eliminate outliers. The natural log of a value reduces the variation caused by extreme values. Binning is also a form of variable transformation. The Decision Tree algorithm deals with outliers well due to the binning of variables. We can also use the process of assigning weights to different observations.
Imputing: Like imputation of missing values, we can also impute outliers. We can use mean, median or mode imputation methods. Before imputing values, we should analyse whether it is a natural or an artificial outlier. If it is artificial, we can go with imputing values. We can also use a statistical model to predict the values of outlier observations and then impute them with the predicted values.
Treat separately: If there is a significant number of outliers, we should treat them separately in the statistical model. One approach is to treat both groups as two different groups, build individual models for both groups and then combine the output.
● A symmetric distribution is preferred over a skewed distribution, as it is easier to interpret and generate inferences from. Some modeling techniques require a normal distribution of variables. So, whenever we have a skewed distribution, we can use transformations which reduce skewness. For a right-skewed distribution, we take the square/cube root or logarithm of the variable, and for a left-skewed distribution, we take the square/cube or exponential of the variable.
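A tiny sketch of a skew-reducing log transform (the right-skewed sample is illustrative):

# Log transform to reduce right skew
import numpy as np

right_skewed = np.array([1, 2, 2, 3, 4, 50])
print(np.log(right_skewed))   # the extreme value 50 is pulled in sharply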
Any change in the coefficient leads to a change in both the direction and the steepness of the logistic function. Positive slopes result in an S-shaped curve and negative slopes result in a Z-shaped curve.
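A minimal sketch of the logistic (sigmoid) function behind this behaviour (the coefficient names b0 and b1 are illustrative):

# Logistic function: 1 / (1 + e^-(b0 + b1*x))
import numpy as np

def logistic(x, b0 = 0.0, b1 = 1.0):
    return 1.0 / (1.0 + np.exp(-(b0 + b1 * x)))

x = np.linspace(-6, 6, 5)
print(logistic(x))             # rises in an S-shape for positive b1
print(logistic(x, b1 = -1.0))  # falls in a Z-shape for negative b1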
4.18. Decision Trees
Decision Tree: The decision tree is among the most powerful and popular tools for classification and prediction. A decision tree is a flowchart-like tree structure, where each internal node denotes a test on an attribute, each branch represents an outcome of the test, and each leaf node (terminal node) holds a class label.
Decision Tree Representation:
Decision trees classify instances by sorting them down the tree from the root to some leaf node, which provides the classification of the instance. An instance is classified by starting at the root node of the tree, testing the attribute specified by this node, then moving down the tree branch corresponding to the value of the attribute, as shown in the above figure. This process is then repeated for the subtree rooted at the new node.
Strengths and Weaknesses of the Decision Tree Approach
The strengths of decision tree methods are:
● Decision trees are able to generate understandable rules.
● Decision trees perform classification without requiring much computation.
● Decision trees are able to handle both continuous and categorical variables.
● Decision trees provide a clear indication of which fields are most important for prediction or
classification.
The weaknesses of decision tree methods are:
● Decision trees are less appropriate for estimation tasks where the goal is to predict the value of a continuous attribute.
● Decision trees are prone to errors in classification problems with many classes and a relatively small number of training examples.
● Decision trees can be computationally expensive to train. The process of growing a decision tree is computationally expensive. At each node, each candidate splitting field must be sorted before its best split can be found. In some algorithms, combinations of fields are used and a search must be made for optimal combining weights. Pruning algorithms can also be expensive, since many candidate sub-trees must be formed and compared.
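A hedged sketch of training a decision tree classifier with scikit-learn (the iris dataset and parameters are purely illustrative):

# Decision tree classification
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y = True)
clf = DecisionTreeClassifier(max_depth = 3, random_state = 0)
clf.fit(X, y)                 # each internal node tests one attribute
print(clf.predict(X[:5]))     # leaf nodes give the class labels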
4.19. K-means
k-means clustering tries to group similar kinds of items in the form of clusters. It finds the similarity between the items and groups them into clusters. The k-means clustering algorithm works in three steps. Let's see what these three steps are.
Let us understand the above steps with the help of the figure, because a good picture is better than a thousand words.
● Figure 1 shows the representation of data for two different items. The first item is shown in blue color and the second item is shown in red color. Here I am choosing the value of K randomly as 2. There are different methods by which we can choose the right K value.
● In figure 2, we join the two selected points. Now, to find out the centroid, we draw a perpendicular line to that line. The points will move to their centroid. If you look carefully, you will see that some of the red points have now moved to the blue points and belong to the group of blue color items.
● The same process continues in figure 3. We join the two points, draw a perpendicular line to that, and find out the centroid. Now the two points move to their centroid, and again some of the red points get converted to blue points.
● The same process is happening in figure 4. This process continues until we get two completely different clusters of these groups.
One of the most challenging tasks in this clustering algorithm is to choose the right value of K. What should the right K value be? How do we choose it? Let us find the answer to these questions. If you choose the K value randomly, it might be correct or it may be wrong. If you choose the wrong value, it will directly affect your model performance. So there are two methods by which you can select the right value of K:
1. Elbow Method.
2. Silhouette Method.
When the value of K is 1, the within-cluster sum of squares will be high. As the value of K increases, the within-cluster sum of squares will decrease.
Finally, we plot a graph between the K values and the within-cluster sum of squares to get the K value. We examine the graph carefully. At some point, the graph will decrease abruptly. That point will be considered the value of K.
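A hedged sketch of the elbow method with scikit-learn's KMeans; the blob data is illustrative, and inertia_ is the within-cluster sum of squares:

# Elbow method: within-cluster sum of squares for each K
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples = 300, centers = 3, random_state = 0)
for k in range(1, 7):
    km = KMeans(n_clusters = k, n_init = 10, random_state = 0).fit(X)
    print(k, km.inertia_)    # look for the point where the drop flattens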
Silhouette Method
The silhouette method is somewhat different. Like the elbow method, it also picks a range of K values and draws the silhouette graph. It calculates the silhouette coefficient of every point. It calculates the average distance of each point to the points within its own cluster, a(i), and the average distance of the point to its next closest cluster, called b(i).
Note: The a(i) value should be less than the b(i) value, that is a(i) << b(i).
Now that we have the values of a(i) and b(i), we calculate the silhouette coefficient using the formula:

s(i) = (b(i) − a(i)) / max(a(i), b(i))

Now we can calculate the silhouette coefficient of all the points in the clusters and plot the silhouette graph. This plot will also be helpful in detecting outliers. The silhouette coefficient lies between -1 and 1.
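A matching sketch for the silhouette method (same illustrative blob data; the silhouette score needs at least two clusters):

# Silhouette method: average coefficient for each K
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples = 300, centers = 3, random_state = 0)
for k in range(2, 7):
    labels = KMeans(n_clusters = k, n_init = 10, random_state = 0).fit_predict(X)
    print(k, silhouette_score(X, labels))   # choose K with the highest score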
Also, check for the plot which has fewer outliers, which means a less negative value. Then choose that value of K to tune your model.
Advantages of K-means