
Basics of Data Science

UNIT 1 INTRODUCTION TO DATA SCIENCE


1.0 Introduction
1.1 Objectives
1.2 Data Science - Definition
1.3 Types of Data
1.3.1 Statistical Data Types
1.3.2 Sampling
1.4 Basic Methods of Data Analysis
1.4.1 Descriptive Analysis
1.4.2 Exploratory Analysis
1.4.3 Inferential Analysis
1.4.4 Predictive Analysis
1.5 Common Misconceptions of Data Analysis
1.6 Applications of Data Science
1.7 Data Science Life cycle
1.8 Summary
1.9 Solutions/Answers

1.0 INTRODUCTION
The Internet and communication technology has grown tremendously in the past decade
leading to generation of large amount of unstructured data. This unstructured data
includes data such as, unformatted textual, graphical, video, audio data etc., which is
being generated as a result of people use of social media and mobile technologies. In
addition, as there is a tremendous growth in the digital eco system of organisation, large
amount of semi-structured data, like XML data, is also being generated at a large rate.
All such data is in addition to the large amount of data that results from organisational
databases and data warehouses. This data may be processed in real time to support
decision making process of various organisations. The discipline of data science focuses
on the processes of collection, integration and processing of large amount of data to
produce useful decision making information, which may be useful for informed
decision making.

This unit introduces you to the basic concepts of data science. It provides an
introduction to the different types of data used in data science. It also points to different
types of analysis that can be performed using data science. Further, the Unit
introduces some of the common misconceptions of data analysis.

1.1 OBJECTIVES
At the end of this unit you should be able to:
• Define the term data science in the context of an organization
• explain different types of data
• list and explain different types of analysis that can be performed on data
• explain the common mistakes about data size
• define the concept of data dredging
• List some of the applications of data science
• Define the life cycle of data science

1.2 DATA SCIENCE-DEFINITION
Data Science is a multi-disciplinary science with an objective to perform data analysis
to generate knowledge that can be used for decision making. This knowledge can be
in the form of similar patterns or predictive planning models, forecasting models etc.
A data science application collects data and information from multiple heterogenous
sources, cleans, integrates, processes and analyses this data using various tools and
presents information and knowledge in various visual forms.

As stated earlier data science is a multi-disciplinary science, as shown in Figure 1.

[Figure 1 depicts Data Science at the intersection of Computing (programming, visualization, machine learning, big data, database systems), Mathematics (modelling and simulation, statistics) and Domain Knowledge.]
Figure 1: Data Science

What are the advantages of data science in an organisation? The following are some
of the areas in which data science can be useful:
• It helps in making business decisions, such as assessing the health of
companies with whom an organisation plans to collaborate.
• It may help in making better predictions for the future, such as making
strategic plans of the company based on present trends.
• It may identify similarities among various data patterns, leading to
applications like fraud detection, targeted marketing etc.
In general, data science is a way forward for business decision making, especially in
the present-day world, where data is being generated at the scale of zettabytes.

Data Science can be used in many organisations, some of the possible usage of data
science are as given below:

• It has great potential for finding the best dynamic route from a source to a
destination. Such an application may constantly monitor the traffic flow and
predict the best route based on collected data.
• It may bring down the logistics costs of an organisation by suggesting the best
time and route for transporting goods.
• It can minimize marketing expenses by identifying similar group buying
patterns and performing selective advertising based on the data obtained.
• It can help in making public health policies, especially in the cases of
disasters.
• It can be useful in studying the environmental impact of various
developmental activities
• It can be very useful in saving resources in smart cities

1.3 TYPES OF DATA

The type of data is one of the important aspects that determines the type of
analysis to be performed on it. In data science, the following are the
different types of data that are required to be processed:

1. Structured Data
2. Semi-Structured Data
3. Unstructured data
4. Data Streams

Structured Data

Since the start of the era of computing, computers have been used as data
processing devices. However, it was not until the 1960s that businesses started
using computers for processing their data. One of the most popular languages of
that era was the Common Business-Oriented Language (COBOL). COBOL had a
data division, which was used to represent the structure of the data being processed.
This was followed by the disruptive, seminal design of relational technology by E.F.
Codd. This led to the creation of relational database management systems
(RDBMS). An RDBMS allows structured storage, retrieval and processing of the
integrated data of an organisation that can be securely shared among several
applications. The RDBMS technology also supported secure transactions and, thus,
became a major source of data generation. Figure 2 shows a sample structure
of data that may be stored in a relational database system. One of the key
characteristics of structured data is that it can be associated with a schema. In
addition, each schema element may be related to a specific data type.

Customer (custID, custName, custPhone, custAddress, custCategory, custPAN, custAadhar)


Account (AccountNumber,custIDoffirstaccountholder,AccountType, AccountBalance)
JointHolders (AccountNumber, custID)
Transaction(transDate, transType, AccountNumber, Amountoftransaction)

Figure 2: A sample schema of structured data

The relational data is structured data, and a large amount of this structured data is
being collected by various organisations as the backend to most applications. In the
1990s, the concept of a data warehouse was introduced. A data warehouse is a
subject-oriented, time-variant aggregation of data of an organisation that can
be used for decision making. Data in a data warehouse is represented using
dimension tables and fact tables. The dimension tables classify the data of the
fact tables. You have already studied various schemas in the context of data
warehouses in MCS221. The data of a data warehouse is also structured in nature
and can be used for analytical data processing and data mining. In addition,
many different types of database management systems have been developed,
which mostly store structured data.

However, with the growth of communication and mobile technologies, many
different applications became very popular, leading to the generation of very large
amounts of semi-structured and unstructured data. These are discussed next.

Semi-structured Data

As the name suggests, semi-structured data has some structure in it. The structure of
semi-structured data is due to the use of tags or key/value pairs. Common
forms of semi-structured data are produced through XML, JSON objects, server
logs, EDI data, etc. Examples of semi-structured data are shown in Figure 3.

An XML fragment:

<Book>
  <title>Data Science and Big Data</title>
  <author>R Raman</author>
  <author>C V Shekhar</author>
  <yearofpublication>2020</yearofpublication>
</Book>

A JSON object:

"Book": {
  "Title": "Data Science",
  "Price": 5000,
  "Year": 2020
}

Figure 3: Sample semi-structured data

Unstructured Data
The unstructured data does not follow any schema definition. For example, a
written text like content of this Unit is unstructured. You may add certain
headings or meta data for unstructured data. In fact, the growth of internet has
resulted in generation of Zetta bytes of unstructured data. Some of the
unstructured data can be as listed below:
• Large written textual data such as email data, social media data etc..
• Unprocessed audio and video data
• Image data and mobile data
• Unprocessed natural speech data
• Unprocessed geographical data
In general, this data requires huge storage space, newer processing methods
and faster processing capabilities.
Data Streams
A data stream is characterised by a sequence of data over a period of time.
Such data may be structured, semi-structured or unstructured, but it gets
generated repeatedly. For example, an IoT device like a weather sensor will
generate a data stream of pressure, temperature, wind direction, wind speed,
humidity etc. for the particular place where it is installed. Such data is huge, and for
many applications it is required to be processed in real time. In general, not all
the data of a stream is required to be stored, and such data is required to be
processed only for a specific duration of time.

1.3.1 Statistical Data Types

There are two distinct types of data that can be used in statistical analysis.
These are – Categorical data and Quantitative data

Categorical or qualitative Data:

Categorical data is used to define the category of data; for example, the occupation
of a person may take values from the categories "Business", "Salaried", "Others"
etc. The categorical data can be of two distinct measurement scales, called
Nominal and Ordinal, which are given in Figure 4. If the categories are not
related, then the categorical data is of the Nominal type; for example, the
Business and Salaried categories have no relationship, therefore occupation is
of Nominal type. However, a categorical variable like age category, defining
age in the categories "0 or more but less than 26", "26 or more but less than 46",
"46 or more but less than 61" and "More than 61", has a specific ordering. For
example, persons in the age category "More than 61" are older than persons in any
other age category.

Quantitative Data:

Quantitative data is numeric data, which can be measured on different scales.
Quantitative data is also of two basic types - discrete, which
represents distinct numbers like 2, 3, 5, ..., or continuous, which represents
continuous values of a given variable; for example, your height can be
measured using a continuous scale.

Measurement scale of data

Data are raw facts; for example, student data may include the name, gender, age and
height of a student. The name typically is distinguishing data that tries to
distinctly identify two data items, just like a primary key in a database.
However, the name or any other identifying data may not be useful for
performing data analysis in data science. The data such as Gender, Age and Height
may be used to answer queries of the kind: Is there a difference in the height of
boys and girls in the age range 10-15 years? One of the important questions is:
how do you measure the data so that it is recorded consistently? Stanley
Stevens, a psychologist, defined the following four characteristics of any scale
of measurement:

• Every representation of the measure should be unique; this is referred
to as identity of a value (IDV).
• The second characteristic is magnitude (M), which can be used to
compare values; for example, a weight of 70.5 kg is more than 70.2 kg.
• The third characteristic is equality of intervals (EI) used to represent
the data; for example, the difference between 25 and 30 is 5 intervals,
which is the same as the difference between 41 and 46, which is also 5
intervals.
• The final characteristic is a defined minimum or zero value (MZV); for
example, on the Kelvin scale, temperature has an absolute zero value,
whereas the intelligence quotient cannot be defined as zero.

Based on these characteristics four basic measurement scales are defined.


Figure 4 defines these measurements, their characteristics and examples.

Measurement Scale   IDV   M                   EI    MZV   Example
Nominal             Yes   No                  No    No    Gender (F - Female, M - Male)
Ordinal             Yes   For rank ordering   No    No    A hypothetical Income Category:
                                                          1 - "0 or more but less than 26"
                                                          2 - "26 or more but less than 46"
                                                          3 - "46 or more but less than 61"
                                                          4 - "More than 61"
Interval            Yes   Yes                 Yes   No    IQ, Temperature in Celsius
Ratio               Yes   Yes                 Yes   Yes   Temperature in Kelvin, Age
Figure 4: Measurement Scales of Data

1.3.2 Sampling

In general, the size of data that is to be processed today is quite large. This
leads you to the question of whether you would use the entire data or some
representative sample of this data. In several data science techniques, sample
data is used to develop an exploratory model as well. Thus, even in data
science, sampling is one of the ways to enhance the speed of exploratory
data analysis. The population in this case is the entire set of data in which you
may be interested. Figure 5 shows the relationship between population and
sample. One of the questions asked in this context is: what should be
the size of a good sample? You may have to find the answer in the literature.
However, you may please note that a good sample is representative of its
population.

Figure 5: Population and Sample

One of the key objectives of statistics, which uses sample data, is to determine
the statistic of the sample and find the probability that the statistic computed
for the sample estimates the parameters of the population with a specific
level of accuracy. Please note that the terms stated above are very important
and are explained in the following table:

Term        Used for                                          Example
Statistic   Computed for the Sample                           Sample mean (x̅), sample standard deviation (s), sample size (n)
Parameter   Predicted from the sample, about the Population   Population mean (µ), population standard deviation (σ), population size (N)
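As a small illustration of the difference between a statistic and a parameter, the following Python sketch draws a random sample and compares the sample statistics with the population parameters; the population of heights used here is purely hypothetical and generated only for illustration.

# A minimal sketch: sample statistics vs. population parameters.
# The "population" here is hypothetical, generated only for illustration.
import random
import statistics

random.seed(1)
population = [random.gauss(166, 10) for _ in range(10000)]   # hypothetical heights

sample = random.sample(population, 100)                      # a sample of size n = 100

print("Population mean (mu):      ", round(statistics.mean(population), 2))
print("Population std dev (sigma):", round(statistics.pstdev(population), 2))
print("Sample mean (x-bar):       ", round(statistics.mean(sample), 2))
print("Sample std dev (s):        ", round(statistics.stdev(sample), 2))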

Next, we discuss different kinds of analysis that can be performed on data.

Check Your Progress 1:


1. Define the term data science.

2. Differentiate between structured, semi-structured, unstructured and


stream data.

3. What would be the measurement scale for the following? Give reason
in support of your answer.
Age, AgeCategory, Colour of eye, Weight of students of a class, Grade
of students, 5-point Likert scale

1.4 BASIC METHODS OF DATA ANALYSIS


The data for data science is obtained from several data sources. This data is first
cleaned of errors and duplication, aggregated, and then presented in a form that can be
analysed by various methods. In this section, we define some of the basic methods
used for analysing data: descriptive analysis, exploratory data analysis, inferential
data analysis and predictive analysis.

1.4.1 Descriptive Analysis

Descriptive analysis is used to present basic summaries about data; however, it makes
no attempt to interpret the data. These summaries may include different statistical
values and certain graphs. Different types of data are described in different ways.
The following example illustrates this concept:

Example 1: Consider the data given in the following Figure 6. Show the summary of
categorical data in this Figure.

Enrolment Number Gender Height


S20200001 F 155
S20200002 F 160
S20200003 M 179
S20200004 F 175

S20200005 M 173
S20200006 M 160
S20200007 M 180
S20200008 F 178
S20200009 F 167
S20200010 M 173
Figure 6: Sample Height Data

Please note that the enrolment number variable need not be used in the analysis, so no
summary is computed for it.

Descriptive of Categorical Data:


The Gender is a categorical variable in Figure 6. The summary in this case would
be in terms of a frequency table of the various categories. For example, for the given
data, the frequency distribution would be:
Gender Frequency Proportion Percentage
Female (F) 5 0.5 50%
Male (M) 5 0.5 50%

In addition, you can draw a bar chart or pie chart to describe the data of the Gender
variable. The pie chart for such data is shown in Figure 7. Details of different
charts are explained in Unit 4. In general, you draw a bar graph when the number
of categories is large.

Figure 7: The Pie Chart
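A frequency table such as the one above can be produced in a few lines of Python. The sketch below (assuming the pandas library is installed) reproduces the Gender summary for the data of Figure 6:

# Minimal sketch of descriptive statistics for a categorical variable,
# using the Gender values from Figure 6 (pandas assumed to be installed).
import pandas as pd

gender = pd.Series(["F", "F", "M", "F", "M", "M", "M", "F", "F", "M"])

counts = gender.value_counts()                      # frequencies per category
proportions = gender.value_counts(normalize=True)   # relative frequencies

summary = pd.DataFrame({"Frequency": counts,
                        "Proportion": proportions,
                        "Percentage": (proportions * 100).round(1)})
print(summary)
# counts.plot.pie() or counts.plot.bar() would draw the charts of Figure 7
# (plotting additionally requires matplotlib).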

Descriptive of Quantitative Data:


The height is a quantitative variable. The description of quantitative data is given
in the following two ways:
1. Describing the central tendencies of the data
2. Describing the spread of the data.
Central tendencies of Quantitative data: Mean and Median are two basic measures
that define the centre of the data, though in different ways. They are defined below
with the help of an example.

Example 2: Find the mean and median of the following data:

Data Set (n observations) 1 2 3 4 5 6 7 8 9 10 11


x 4 21 25 10 18 9 7 14 11 19 14

The mean can be computed as:

x̅ = (Σ x) / n

For the given data:
x̅ = (4 + 21 + 25 + 10 + 18 + 9 + 7 + 14 + 11 + 19 + 14) / 11
Mean x̅ = 13.82

The median of the data is the middle value of the sorted data. First the data is sorted,
and then the median is computed using the following formula:

If n is even:
median = [Value at the (n/2)th position + Value at the (n/2 + 1)th position] / 2

If n is odd:
median = Value at the ((n + 1)/2)th position
For this example, the sorted data is as follows:

Data Set (n observations) 1 2 3 4 5 6 7 8 9 10 11


x 4 7 9 10 11 14 14 18 19 21 25

So, the median is:

median = Value at the ((11 + 1)/2)th position = Value at the 6th position = 14
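The same computations can be verified with Python's standard library statistics module, using the data of Example 2:

# Verifying the mean and median of Example 2 with the standard library.
import statistics

x = [4, 21, 25, 10, 18, 9, 7, 14, 11, 19, 14]

print("Mean:  ", round(statistics.mean(x), 2))   # 13.82
print("Median:", statistics.median(x))           # 14, the middle value of the sorted data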

You may please note that outliers, which are defined as values highly different from
most other values, can impact the mean but not the median. For example, if one
observation in the data, as shown in example 2 is changed as:

Data Set (n observations) 1 2 3 4 5 6 7 8 9 10 11


x 4 7 9 10 11 14 14 18 19 21 100

Then the median will still remain 14; however, the mean will change to 20.64, which is
quite different from the earlier mean. Thus, you should be careful about the presence
of outliers while analysing data.

Interestingly, the relationship between mean and median may be useful in determining
the nature of the data. The following table describes these conditions:

Relationship between mean and median    Comments about observations
Almost equal values of mean and median  The distribution of data may be symmetric in
                                        nature (mean, median and mode nearly coincide)
Mean >> Median                          The distribution may be right-skewed
                                        (typically mode < median < mean)
Mean << Median                          The distribution may be left-skewed
                                        (typically mean < median < mode)

[The third column of the original table contained sketches of the possible frequency curves for each case.]
Figure 8: Mean and Median for possible data
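In practice, a quick way to judge the direction of skew is to compare the mean with the median, optionally alongside a computed skewness coefficient. A minimal sketch, assuming scipy is installed, using the outlier-modified data shown above:

# Comparing mean and median to judge the direction of skew.
import statistics
from scipy.stats import skew    # scipy is assumed to be installed

data = [4, 7, 9, 10, 11, 14, 14, 18, 19, 21, 100]   # Example 2 data with an outlier

mean_value = statistics.mean(data)
median_value = statistics.median(data)
print("mean =", round(mean_value, 2), " median =", median_value)
print("skewness coefficient =", round(float(skew(data)), 2))

if mean_value > median_value:
    print("mean > median: the data are likely right-skewed")
elif mean_value < median_value:
    print("mean < median: the data are likely left-skewed")
else:
    print("mean = median: the data may be roughly symmetric")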

The concept of data distribution is explained in the next Unit.

Mode: Mode is defined as the most frequent value of a set of observations. For
example, in the data of Example 2, the value 14, which occurs twice, is the mode. The
mode value need not be a mid-value; rather, it can be any value of the observations. It
just communicates the most frequently occurring value. In a frequency graph, the
mode is represented by a peak in the data. For example, in the graphs referred to in
Figure 8, the value corresponding to the peak is the mode.

Spread of Quantitative data: Another important aspect of describing quantitative
data is the spread or variability of the observed data. Some of the measures of the
spread of data are given in Figure 9.

Measure: Range
Description: Minimum to maximum value.
Example (data of Example 2): 4 to 25

Measure: Variance
Description: The sum of the squares of the differences between the observations and the
sample mean, divided by (n − 1); the nth difference can be determined from the other
(n − 1) computed differences, as the overall sum of the differences has to be zero. The
formula for the sample variance is:

s² = (1 / (n − 1)) Σ (x − x̅)²

However, in case you are determining the population variance, you can use the following
formula:

σ² = (1 / N) Σ (x − µ)²

Example (data of Example 2): 40.96 (try both formulas and match the answer)

Measure: Standard Deviation
Description: The standard deviation is one of the most used measures for finding the
spread or variability of data. It can be computed as:

For the sample:     s = √[ (1 / (n − 1)) Σ (x − x̅)² ]
For the population: σ = √[ (1 / N) Σ (x − µ)² ]

Example (data of Example 2): 6.4 (try both formulas and match the answer)

Measure: 5-Point Summary and Interquartile Range (IQR)
Description: For creating the 5-point summary, first you need to sort the data. The
five-point summary is defined as follows (values computed from the sorted data of
Example 2):
Minimum value (Min)                      Min = 4
1st Quartile, <= 25% of values (Q1)      Q1 = (9 + 10)/2 = 9.5
2nd Quartile, the median (M)             M = 14
3rd Quartile, <= 75% of values (Q3)      Q3 = (18 + 19)/2 = 18.5
Maximum value (Max)                      Max = 25
The IQR is the difference between the 3rd and 1st quartile values: IQR = 18.5 − 9.5 = 9
Figure 9: The Measure of Spread or Variability

The IQR can also be used to identify suspected outliers. In general, a suspected outlier
can exist in the following two ranges:
Observations/values less than Q1 − 1.5 × IQR
Observations/values more than Q3 + 1.5 × IQR
For Example 2, IQR is 9 and 1.5 × IQR = 13.5; therefore suspected outliers are:
values < (9.5 − 13.5) = −4, or values > (18.5 + 13.5) = 32.
Thus, there is no outlier in the initial data of Example 2.
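The spread measures of Figure 9 and the outlier bounds above can be verified with numpy (assumed to be installed):

# Spread measures for the data of Example 2 (numpy assumed to be installed).
import numpy as np

x = np.array([4, 7, 9, 10, 11, 14, 14, 18, 19, 21, 25])

print("Range:", x.min(), "to", x.max())                        # 4 to 25
print("Sample variance (n - 1):", round(x.var(ddof=1), 2))     # about 40.96
print("Sample std deviation   :", round(x.std(ddof=1), 2))     # about 6.4

q1, median, q3 = np.percentile(x, [25, 50, 75])
iqr = q3 - q1
print("5-point summary:", x.min(), q1, median, q3, x.max())    # 4, 9.5, 14, 18.5, 25
print("IQR:", iqr)                                             # 9
print("Suspected-outlier bounds:", q1 - 1.5 * iqr, "to", q3 + 1.5 * iqr)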

For the quantitative data, you may also draw various plots, such as a histogram, box plot etc.
These plots are explained in Unit 4 of this block.

Check Your Progress 2


1. Age category of a student is categorical data. What information would you
like to show for its descriptive analysis?

2. Age is a quantitative data; how will you describe its data?

3. How can you find that given data is left skewed?

4. What is IQR? Can it be used to find outliers?

1.4.2 Exploratory Analysis

Exploratory data analysis was suggested by John Tukey of Princeton University in
the 1960s as a group of methods that can be used to learn about possible relationships
amongst data. After you have obtained relevant data for analysis, instead of
performing the final analysis, you may like to explore the data for possible
relationships using exploratory data analysis. In general, graphs are some of the best
ways to perform exploratory analysis. Some of the common methods that you can
use during exploratory analysis are as follows:

1. As a first step, you may compute the descriptive statistics of the various
categorical (qualitative) and quantitative variables of your data. Such information
is very useful in determining the suitability of data for the purpose of analysis.
This may also help you in data cleaning, modification and transformation of data.
a. For the qualitative data, you may create frequency tables and bar
charts to know the distribution of data among different categories. A
balanced distribution of data among categories is most desirable.
However, such a distribution may not be possible in actual situations.
Several methods have been suggested to deal with such situations.
Some of those will be discussed in later units.
b. For the quantitative data, you may compute the mean, median,
standard deviation, skewness and kurtosis. The kurtosis value relates
to peaks (determined by mode) in the data. In addition, you may also
draw the charts like histogram to look into frequency distribution.
2. Next, after performing the univariate analysis, you may try to perform some
bi-variate analysis. Some of the basic statistics that you can compute for bi-
variate analysis include the following:
a. Make a two-way table between categorical variables and make related
stacked bar charts. You may also use chi-square testing to find any
significant relationships.
b. You may draw side-by-side box plots to check if the data of various
categories differ.
c. You may draw a scatterplot and check the correlation coefficient, if one
exists between two variables.
3. Finally, you may like to look into the possibilities of multi-variate
relationships amongst data. You may use dimensionality reduction through
techniques such as feature extraction or principal component analysis, you may
perform clustering to identify possible sets of classes in the solution space, or
you may use graphical tools, like bubble charts, to visualize the data.
It may be noted that exploratory data analysis helps in identifying possible
relationships amongst data, but it does not promise that a causal relationship exists
amongst the variables. The causal relationship has to be ascertained through qualitative
analysis. Let us explain exploratory data analysis with the help of an example.

Example 3: Consider the sample data of students given in Figure 6 about Gender and
Height. Let us explore this data for an analytical question: Does Height depend on
Gender?

You can perform exploratory analysis on this data by drawing a side-by-side box
plot of the heights of male and female students. This box plot is shown in Figure 10.

Figure 10: Exploratory Data Analysis
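A side-by-side box plot like the one in Figure 10 can be drawn with matplotlib (assumed to be installed); the height values below are taken from the Gender rows of Figure 6:

# Sketch of the side-by-side box plot of Figure 10 (matplotlib assumed installed).
import matplotlib.pyplot as plt

female_heights = [155, 160, 175, 178, 167]   # F rows of Figure 6
male_heights = [179, 173, 160, 180, 173]     # M rows of Figure 6

plt.boxplot([female_heights, male_heights])
plt.xticks([1, 2], ["Female", "Male"])
plt.ylabel("Height (cm)")
plt.title("Height by Gender")
plt.show()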

Please note that the box plot of Figure 10 shows that, on average, the height of male
students is more than that of female students. Does this result apply, in general, to the
population? To answer this question, you need to find the probability of
occurrence of such sample data; therefore, inferential analysis may need to be performed.

1.4.3 Inferential Analysis


Inferential analysis is performed to answer the question: what is the probability that
the results obtained from an analysis of the sample can be applied to the entire population?
A detailed discussion on various terms used in inferential analysis in the context of
statistical analysis is given in Unit 2. You can perform many different types of
statistical tests on data. Some of these well-known tools for data analysis are listed in
Figure 11.
Test Why Performed?
Univariate Analysis: Z-Test or T-test To determine, if the computed value of
mean of a sample can be applicable for
the population and related confidence
interval.
Bivariate Chi-square test To test the relationship between two
categorical variables or groups of data
Two sample T-Test To test the difference between the means
of two variables or groups of data
One way ANOVA To test the difference in mean of more
than two variables or groups of data
F-Test It can be used to determine the equality
of variance of two or more groups of
data
Correlation analysis Determines the strength of relationship
between two variables
Regression analysis Examines the dependence of one
variable over a set of independent
variables
Decision Trees Supervised learning used for
classification
Clustering Non-supervised Learning
Figure 11: Some tools for data analysis

You may have read about many of these tests in the Data Warehousing and Data Mining
and the Artificial Intelligence and Machine Learning courses. In addition, you may refer to
the further readings for these tools. The following example explains the importance of
inferential analysis.
Example 4: Figure 10 in Example 3 shows the box plot of the heights of male and female
students. Can you infer from the boxplot and the sample data (Figure 6) whether there is
a difference in the heights of male and female students?
In order to infer whether there is a difference between the heights of the two groups (male and
female students), a two-sample t-test was run on the data. The output of this t-test is
shown in Figure 12.
t-Test (two tail): Assuming Unequal Variances
Female Male
Mean 167 173
Variance 94.5 63.5
Observations 5 5
Computed t-value -1.07
p-value 0.32
Critical t-value 2.30

Figure 12: The Output of two sample t-test (two tail)

13
Introduction to Data Science
Figure 12 shows that the mean height of the female students is 167 cm, whereas for
the male students it is 173 cm. The variance for the female students is 94.5, whereas for
the male students it is 63.5. Each group is summarised on the basis of 5 observations. The
computed t-value is -1.07 and the p-value is 0.32. As the p-value is greater than 0.05,
you cannot conclude that the average male student height is different from the average
female student height.
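The t-test of Figure 12 can be reproduced with scipy (assumed to be installed); equal_var=False selects the unequal-variances (Welch) version of the test used above:

# Two-sample t-test (unequal variances) on the height data of Figure 6.
from scipy import stats

female_heights = [155, 160, 175, 178, 167]   # from Figure 6
male_heights = [179, 173, 160, 180, 173]

result = stats.ttest_ind(female_heights, male_heights, equal_var=False)
print("t-statistic:", round(result.statistic, 2))   # approximately -1.07
print("p-value:    ", round(result.pvalue, 2))      # approximately 0.32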

1.4.4 Predictive Analysis

The availability of large amounts of data and advanced algorithms for mining and
analysing large data sets has led the way to advanced predictive analysis. The predictive
analysis of today uses tools from Artificial Intelligence, Machine Learning, Data
Mining, data stream processing, data modelling etc. to make predictions for the strategic
planning and policies of organisations. Predictive analysis uses large amounts of data
to identify potential risks and aid the decision-making process. It can be used in
several data-intensive industries like electronic marketing, financial analysis,
healthcare applications, etc. For example, in the healthcare industry, predictive
analysis may be used to determine the public health infrastructure
requirements for the future based on present health data.
Advancements in artificial intelligence, data modelling and machine learning have also led
to prescriptive analysis. Prescriptive analysis aims to take predictions one step
further and suggest solutions to present and future issues.
A detailed discussion on these topics is beyond the scope of this Unit. You may refer
to further readings for more information on these.

1.5 COMMON MISCONCEPTIONS OF DATA


ANALYSIS
In this section, we discuss three misconceptions that can affect the results of data
science. These misconceptions are explained with the help of examples only.

Correlation is not Causation: Correlation analysis establishes relationship between


two variables. For example, consider three variables, namely attendance of student
(attend), marks obtained by student (marks) and weekly hours spent by a student for
the study (study). While analysing data, you found that there is a strong correlation
between the variables attend and marks. However, does it really mean that higher
attendance causes students to obtain better marks? There is another possibility that
both study and marks, as well as study and attend, are correlated. A motivated student
may be spending a higher number of hours studying at home, which may lead to better marks.
Similarly, a motivated student who is putting a greater number of hours into his/her
study may be attending school regularly. Thus, the correlation between study and
marks and between study and attend results in an observed, non-causal correlation
between attend and marks. This situation is shown in Figure 13.

[Figure 13 shows Study causing both Attend and Marks; the correlation observed between Attend and Marks is not a causal one.]
Figure 13: Correlation does not mean causation

Simpson's Paradox: Simpson's paradox is an interesting situation, which
sometimes leads to wrong interpretations. Consider two Universities, say
University U1 and University U2, and the pass-out data of these Universities:

University Student Passed Passed % Student Failed Failed % Total


U1 4850 97% 150 3% 5000
U2 1960 98% 40 2% 2000
Figure 14: The Results of the Universities
As you may observe from the data above, University U2 is performing better as
far as passing percentage is concerned. However, during a detailed data inspection, it
was noted that both the Universities were running a Basic Adult Literacy Programme,
which in general has a slightly poorer result. In addition, the data of the literacy
programme is to be compiled separately. Therefore, the data for the Universities
would be:
General Programmes:
University   Student Passed   Passed %   Student Failed   Failed %   Total
U1           1480             98.7%      20               1.3%       1500
U2           1480             98.7%      20               1.3%       1500
Adult Literacy Programme:
University   Student Passed   Passed %   Student Failed   Failed %   Total
U1           3370             96.3%      130              3.7%       3500
U2           480              96.0%      20               4.0%       500
Figure 15: The result after a new grouping is added
You may observe that, with the additional grouping due to the adult literacy programme,
the corrected data shows that U1 is performing at least as well as U2: the pass-out rate
for the General programmes is the same, and the pass-out rate for the Adult Literacy
programme is better for U1. You must note the changes in the percentages. This is
Simpson's paradox.
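The reversal can be seen directly by computing the pass rates per programme and overall from the counts of Figures 14 and 15, as in the short sketch below:

# Illustrating Simpson's paradox with (passed, total) counts from Figures 14 and 15.
counts = {
    "U1": {"General": (1480, 1500), "Adult Literacy": (3370, 3500)},
    "U2": {"General": (1480, 1500), "Adult Literacy": (480, 500)},
}

for university, programmes in counts.items():
    total_passed = sum(passed for passed, total in programmes.values())
    total_students = sum(total for passed, total in programmes.values())
    per_programme = ", ".join(f"{name}: {100 * p / n:.1f}%"
                              for name, (p, n) in programmes.items())
    print(f"{university}: overall {100 * total_passed / total_students:.1f}%  ({per_programme})")
# U1 is at least as good in each programme, yet U2 shows the higher overall percentage.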

Data Dredging: Data dredging, as the name suggests, is extensive analysis of very
large data sets. Such analysis results in the generation of a large number of data
associations. Many of those associations may not be causal and thus require further
exploration through other techniques. Therefore, it is essential that every data
association found in a large data set be investigated further before reporting it as a
conclusion of the study.

1.6 APPLICATIONS OF DATA SCIENCE
Data Science is useful in analysing large data sets to produce useful information that
can be used for business development and can help in the decision-making process. This
section highlights some of the applications of data science.

Applications using Similarity analysis


These applications use similarity analysis of data using various algorithms, resulting
in classification or clustering of data into several categories. Some of these
applications are:
• Spam detection system: This system classifies the emails into spam and non-
spam categories. It analyses the IP addresses of mail origin, word patterns
used in mails, word frequency etc. to classify a mail as spam or not.
• Financial Fraud detection system: This is one of the major applications for
online financial services. Basic principle is once again to classify the
transactions as safe or unsafe transactions based on various parameters of the
transactions.
• Recommendation of Products: Several e-commerce companies have data
about your buying patterns, information about your searches on their portal and
other information about your account with them. This information can be
clustered into classes of buyers, which can be used to recommend various
products to you.

Applications related to Web Searching


These applications primarily help you in finding content on the web more effectively.
Some of the applications in this category are the search algorithms used by
various search engines. These algorithms attempt to find good websites based on
the search terms. They may use tools related to the semantics of your search terms,
indexing of important websites and terms, link analysis etc. In addition, the predictive
text feature of a browser is also an example of the use of data science.

Applications related to Healthcare System


Data science can be extremely useful for healthcare applications. Some of the
applications may involve processing and analysing images for neonatal care or to
detect possibilities of tumours, deformities, problems in organs etc. In addition, there
can be applications for establishing relationships of diseases to certain factors, creating
recommendations for public health based on public health data, genomic analysis,
and the creation and testing of new drugs. The possibility of using streaming data for
monitoring patients is also a potential area for the use of data science in healthcare.

Applications related to Transport sector


These applications may investigate the possibilities of finding the best routes (air, road
etc.); for example, many e-commerce companies need to plan the most economical
ways of providing logistic support from their warehouses to the customer. Another
example is finding the best dynamic route from a source to a destination with dynamic
load on road networks. Such applications are required to process streams of data.

In general, data science can be used for the benefit of society. It should be used
creatively to improve the effective resource utilization, which may lead to sustainable
development. The ultimate goal of data science applications should be to help us
protect our environment and human welfare.

1.7 DATA SCIENCE LIFE CYCLE
So far, we have discussed various aspects of data science in the previous
sections. In this section, we discuss the life cycle of a data-science-based
application. In general, data science application development may involve the
following stages:

Data Science Project Requirements Analysis Phase


The first and foremost step of a data science project is to identify its objectives.
This identification of objectives is also coupled with the
study of the benefits of the project, resource requirements and the cost of the project. In
addition, you need to make a project plan, which includes project deliverables and
the associated time frame. The data that is required to be used for the project is
also decided. This phase is similar to requirement study and project planning
and scheduling.

Data collection and Preparation Phase


In this phase, first all the data sources are identified, followed by designing the
process of data collection. It may be noted that data collection may be a continuous
process. Once the data sources are identified then data is checked for duplication of
data, consistency of data, missing data, and availability timeline of data. In addition,
data may be integrated, aggregated or transformed to produce data for a defined set of
attributes, which are identified in the requirements phase.

Descriptive data analysis


Next, the data is analysed using univariate and bivariate analysis techniques. This will
generate descriptive information about the data. This phase can also be used to
establish the suitability and validity of the data as per the requirements of data analysis.
This is a good time to review your project requirements vis-à-vis collected data
characteristics.

Data Modelling and Model Testing


Next, a number of data models based on the data are developed. All these data models
are then tested for their validity with test data. The accuracies of the various models are
compared and contrasted, and a final model is proposed for data analysis.

Model deployment and Refinement


The best tested model is used to address the data science problem; however, this
model must be constantly refined, as the decision-making environment keeps
changing and new data sets and attributes may appear with time. The refinement
process goes through all the previous steps again.

Thus, in general, data science project follows a spiral of development. This is shown
in Figure 16.

[Figure 16 shows the stages as a cycle: Data Science Project Requirements Analysis Phase -> Data Collection and Preparation Phase -> Descriptive Data Analysis -> Data Modelling and Model Testing -> Model Deployment and Refinement, which feeds back into requirements analysis.]
Figure 16: A sample Life cycle of Data Science Project

Check Your Progress 3


1. What are the advantages of using a boxplot?

2. How is inferential analysis different from exploratory analysis?

3. What is Simpson’s paradox?

1.8 SUMMARY
This Unit introduces basic statistical and analytical concepts of data science. The Unit
first introduces you to the definition of data science. Data science as a discipline
uses concepts from computing, mathematics and domain knowledge. The types of
data for data science are defined in two different ways: first, on the basis
of the structure and generation rate of data, and next by the measurement scales that
can be used to capture the data. In addition, the concept of sampling has been defined
in this Unit.
This Unit also explains some of the basic methods used for analysis, which include
descriptive, exploratory, inferential and predictive analysis. A few interesting
misconceptions related to data science have also been explained with the help of
examples. This unit also introduces you to some of the applications of data science
and the data science life cycle. Given the ever-advancing technology, you are advised
to keep reading about newer data science applications.

1.9 SOLUTIONS/ANSWERS

Check Your Progress 1:

1. Data science integrates the principles of computer science,
mathematics and domain knowledge to create mathematical models
that show relationships amongst data attributes. In addition, data
science uses data to perform predictive analysis.
2. Structured data has a defined dimensional structure clearly identified
by attributes, for example, tables, data cubes, etc. Semi-structured data
has some structure due to the use of tags; however, the structure may be
flexible, for example, XML data. Unstructured data has no structure at
all, like long texts. Data streams, on the other hand, may be structured,
semi-structured or unstructured data that is being produced
continuously.
3. Age would be measured on a ratio scale. Age category would be categorical
data on an ordinal scale, as there is an ordering among the categories but
the difference between them cannot be defined quantitatively. Colour of
eye is of nominal scale. Weight of the students of a class is on a ratio scale.
Grade is a measure on an ordinal scale. A 5-point Likert scale also gives
ordinal data.

Check Your Progress 2:

1. Descriptive of categorical data may include the total number of


observations, frequency table and bar or pie chart.
2. The descriptive of age may include mean, median, mode, skewness,
kurtosis, standard deviation and histogram or box plot.
3. For left-skewed data, the mean is substantially lower than the median and
the mode.
4. The difference between the Quartile 3 and Quartile 1 is interquartile
range (IQR). In general, suspected outliers are at a distance of 1.5 times
IQR higher than 3rd quartile or 1.5 times IQR lower than 1st quartile.

Check Your Progress 3:

1. Box plots show the 5-point summary of data. A well-spread box plot is an
indicator of normally distributed data. Side-by-side box plots can be
used to compare the scale data values of two or more categories.
2. Inferential analysis also computes a p-value, which determines whether the
results obtained by exploratory analysis are significant enough that the
results may be applicable to the population.
3. Simpson's paradox signifies that a statistic computed on grouped data may
sometimes produce results that are contrary to the same statistic applied
to the ungrouped data.

UNIT 2 PROBABILITY AND STATISTICS FOR
DATA SCIENCE

2.0 Introduction
2.1 Objectives
2.2 Probability
2.2.1 Conditional Probability
2.2.2 Bayes Theorem
2.3 Random Variables and Basic Distributions
2.3.1 Binomial Distribution
2.3.2 Probability Distribution of Continuous Random Variable
2.3.3 The Normal Distribution
2.4 Sampling Distribution and the Central Limit Theorem
2.5 Statistical Hypothesis Testing
2.5.1 Estimation of Parameters of the Population
2.5.2 Significance Testing of Statistical Hypothesis
2.5.3 Example using Correlation and Regression
2.5.4 Types of Errors in Hypothesis Testing
2.6 Summary
2.7 Solution/Answers

2.0 INTRODUCTION

In the previous unit of this Block, you were introduced to the basic concepts of
data science, including the basic types of data, basic methods of data
analysis, applications and the life cycle of data science. This Unit introduces
you to the basic concepts of probability and statistics related to data
science.
It introduces the concept of conditional probability and Bayes theorem. This is
followed by a discussion on basic probability distributions, highlighting their
significance and use. These distributions include the Binomial and Normal
distributions, the two most used distributions for discrete and continuous
variables respectively. The Unit also introduces you to the concept of the sampling
distribution and the central limit theorem. Finally, this unit covers the concepts of
statistical hypothesis testing with the help of an example of correlation. You
may refer to the further readings for more details on these topics, if needed.

2.1 OBJECTIVES

After going through this unit, you should be able to:

• compute the conditional probability of events;
• use Bayes theorem in problem solving;
• explain the concept of a random variable;
• explain the characteristics of the binomial and normal distributions;
• describe the sampling distribution and the central limit theorem;
• state a statistical hypothesis; and
• perform significance testing.


2.2 PROBABILITY

Probability is a measure of the occurrence of a specific event amongst a
group of events, if the occurrence of the events is observed over a large number
of trials. For example, the possibility of getting 1 while rolling a fair die is 1/6. You
may please note that you need to observe this event by repeatedly rolling a die
for a large number of trials to arrive at this probability, or you may determine the
probability by finding the ratio of this outcome to the total number
of possible outcomes, which should be equally likely. Thus, the probability of an
event (E) can be computed using the following formula:

P(E) = (Number of outcomes in the set of all possible outcomes that result in event E) / (Number of outcomes in the set of all possible outcomes)        (1)

In the equation above, the set of all possible outcomes is also called sample
space. In addition, it is expected that all the outcomes are equally likely to occur.

Consider that you decided to roll two fair dice together at the same time. Will
the outcome of the first die affect the outcome of the second die? It will not, as
both the outcomes are independent of each other. In general, two trials
are independent if the outcome of the first trial does not affect the outcome of the
second trial and vice-versa; else the trials are dependent trials.
How do we compute the probability for more than one event in a sample space? Let
us explain this with the help of an example.

Example 1: A fair die having six equally likely outcomes is to be thrown, then:
(i) The sample space is: {1, 2, 3, 4, 5, 6}
(ii) An Event A is "die shows 2"; then the outcome of event A is {2}, and the
probability P(A) = 1/6
(iii) An Event B is "die shows an odd face"; then Event B is {1, 3, 5}, and the
probability of Event B is P(B) = 3/6 = 1/2
(iv) An Event C is "die shows an even face"; then Event C is {2, 4, 6}, and the
probability of Event C is P(C) = 3/6 = 1/2
(v) Events A and B are disjoint events, as no outcome is common between them.
So are Events B and C. But Events A and C are not disjoint.
(vi) The intersection of Events A and B is the null set {}, as they are disjoint
events; therefore, the probability that events A and B both occur,
viz. P(A∩B), is 0. However, the intersection of A and C is {2}; therefore,
P(A∩C) = 1/6.
(vii) The union of Events A and B is {1, 2, 3, 5}; therefore, the
probability that event A or event B occurs, viz. P(A∪B), is 4/6 = 2/3.
Whereas the union of Events B and C is {1, 2, 3, 4, 5, 6}; therefore,
P(B∪C) = 6/6 = 1.

Please note that the following formula can be derived from the above example.
Probability of occurrence of any of the two events A or B (also called union of
events) is:
𝑃(𝐴 ∪ 𝐵) = 𝑃(𝐴) + 𝑃(𝐵) − 𝑃(𝐴 ∩ 𝐵) (2)

For example 1, you may compute the probability of occurrence of event A or C
as:
𝑃(𝐴 ∪ 𝐶) = 𝑃(𝐴) + 𝑃(𝐶) − 𝑃(𝐴 ∩ 𝐶)
= 1/6 + 1/2 – 1/6 = 1/2.
In the case of disjoint events, since 𝑃(𝐴 ∩ 𝐵) is zero, therefore, the equation
(2) will reduce to:
𝑃(𝐴 ∪ 𝐵) = 𝑃(𝐴) + 𝑃(𝐵) (3)

Probability of Events in independent trials:


This is explained with the help of the following example.
Example 2: A fair die was thrown twice, what is the probability of getting a 2
in the first throw and 4 or 5 in the second throw.
The probability of getting a 2 in the first throw (say event X) is P(X) = 1/6
The probability of getting {4, 5} in the second throw (say event Y) is P(Y) =
2/6.
Both these events are independent of each other, therefore, you need to use the
formula for intersection of independent events, which is:
𝑃(𝑋 ∩ 𝑌) = 𝑃(𝑋) × 𝑃(𝑌) (4)
Therefore, the probability P(X ∩ Y) = 1/6 × 2/6 = 1/18.
This rule is applicable even with more than two independent events. However,
this rule will not apply if the events are not independent.

2.2.1 Conditional Probability

Conditional probability is defined for the probability of occurrence of an event,


when another event has occurred. Conditional probability addresses the
following question.

Given two events X and Y with the probability of occurrences P(X) and
P(Y) respectively. What would be the probability of occurrence of X if
the other event Y has actually occurred?
Let us analyse the problem further. Since event Y has already occurred, the
sample space reduces to the sample space of event Y. In addition, the
possible outcomes for the occurrence of X are the outcomes at the
intersection of X and Y, as that is the only part of X which lies within the sample
space of Y. Figure 1 shows this with the help of a Venn diagram.

[Figure 1 shows a Venn diagram of two overlapping events X and Y inside the initial sample space; after Y has occurred, the sample space reduces to Y, and the possible outcomes of X are those in X ∩ Y.]

Figure 1: The conditional probability of event X given that event Y has occurred

You can compute the conditional probability using the following equation:

P(X/Y) = P(X ∩ Y) / P(Y)        (5)

where P(X/Y) is the conditional probability of occurrence of event X, if event
Y has occurred.
For example, in Example 1, what is the probability of occurrence of event A, if
event C has occurred?
You may please note that P(A∩C) = 1/6 and P(C) = 1/2; therefore, the conditional
probability P(A/C) would be:

P(A/C) = P(A ∩ C) / P(C) = (1/6) / (1/2) = 1/3

What would be the conditional probability of disjoint events? You may find the
answer by computing P(A/B) for Example 1.

What would be the conditional probability for Independent events?


The equation (5) of conditional probability can be used to derive a very
interesting results, as follows:
You can rewrite equation (5) as,
𝑃(𝑋 ∩ 𝑌) = 𝑃(𝑋/𝑌) × 𝑃(𝑌) (5a)
Similarly, you can rewrite equation (5) for 𝑃(𝑌/𝑋) as,
P(Y/X) = P(X ∩ Y) / P(X)   or   P(X ∩ Y) = P(Y/X) × P(X)        (5b)
Using equation 5(a) and equation 5(b) you can conclude the following:
𝑃(𝑋 ∩ 𝑌) = 𝑃(𝑋/𝑌) × 𝑃(𝑌) = 𝑃(𝑌/𝑋) × 𝑃(𝑋) (6)

Independent events are a special case for conditional probability. As the two
events are independent of each other, the occurrence of any one of the
events does not change the probability of occurrence of the other event.
Therefore, for independent events X and Y:
P(X/Y) = P(X) and P(Y/X) = P(Y)        (7)
In fact, equation (7) can be used to test whether two events are independent.

2.2.2 Bayes Theorem

Bayes theorem is an important theorem, which deals with conditional
probability. Mathematically, Bayes theorem can be written, using equation (6), as:

P(X/Y) × P(Y) = P(Y/X) × P(X)

or

P(X/Y) = [P(Y/X) × P(X)] / P(Y)        (8)

Example 3: Assume that you have two bags, namely Bag A and Bag B. Bag A
contains 5 green and 5 red balls, whereas Bag B contains 3 green and 7 red
balls. Assume that you have drawn a red ball; what is the probability that this
red ball is drawn from Bag B?

In this example,
Let the event X be "Drawing a Red Ball". The probability of drawing a red ball
can be computed as follows:
You may select a Bag and then draw a ball. Therefore, the probability
will be computed as:
(Probability of selection of Bag A) × (Probability of selection of a red ball
from Bag A) + (Probability of selection of Bag B) × (Probability of
selection of a red ball from Bag B)
P(Red) = (1/2 × 5/10 + 1/2 × 7/10) = 3/5

Let the event Y be "Selection of Bag B from the two Bags", assuming equally
likely selection of the Bags. Therefore, P(BagB) = 1/2.
In addition, if Bag B is selected, then the probability of drawing a Red ball is
P(Red/BagB) = 7/10, as Bag B has already been selected and it has 3 Green and
7 Red balls.
As per the Bayes Theorem:

P(BagB/Red) = [P(Red/BagB) × P(BagB)] / P(Red)

P(BagB/Red) = (7/10 × 1/2) / (3/5) = 7/12

Bayes theorem is a powerful tool to revise your estimate, given that a particular
event has occurred. Thus, you may be able to change your predictions.
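The result of Example 3 can be verified both by direct computation and by simulation, as in the sketch below (the exact answer is 7/12, about 0.583):

# Verifying Example 3 (Bayes theorem) by direct computation and by simulation.
import random
from fractions import Fraction

# Direct computation using the Bayes theorem of equation (8)
p_bag_b = Fraction(1, 2)
p_red_given_a = Fraction(5, 10)
p_red_given_b = Fraction(7, 10)
p_red = Fraction(1, 2) * p_red_given_a + Fraction(1, 2) * p_red_given_b
print("P(BagB/Red) =", p_red_given_b * p_bag_b / p_red)       # 7/12

# Simulation of the same experiment
random.seed(0)
red_draws = 0
red_from_bag_b = 0
for _ in range(100_000):
    bag = random.choice(["A", "B"])
    ball = random.choice("GGGGGRRRRR" if bag == "A" else "GGGRRRRRRR")
    if ball == "R":
        red_draws += 1
        if bag == "B":
            red_from_bag_b += 1
print("Simulated estimate:", red_from_bag_b / red_draws)      # about 0.583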

Check Your Progress 1

1. Is P(X/Y) = P(Y/X)?
2. How can you use probabilities to find out whether two events are independent?
3. The MCA batches of University A and University B consists of 20 and
30 students respectively. University A has 10 students who have
obtained more than 75% marks and University B has 20 such students.
A recruitment agency selects one of the students who have more than
75% marks from these two Universities. What is the probability that the
selected student is from University A?

2.3 RANDOM VARIABLES AND BASIC


DISTRIBUTIONS

In statistics, in general, you perform random experiments to study particular
characteristics of a problem situation. These random experiments, which are
performed in an almost identical experimental setup and environment, determine
the attributes or factors that may be related to the problem situation. The
outcomes of these experiments can take different values based on probability
and are termed random variables. This section discusses the concept of
random variables.

Example 4: Consider that you want to define the characteristics of a random
experiment of tossing a coin, say 3 tosses, and you select "Number of
Heads" as your variable, say X. You may define the possible outcomes of the sample
space for the tosses as:
Outcomes HHH HHT HTH HTT THH THT TTH TTT


Number of Heads 3 2 2 1 2 1 1 0
(X)
Figure 2: Mapping of outcomes of sample space to Random variable.

Using the data of Figure 2, you can create the following frequency table, which
can also be converted to probability.
X Frequency Probability P(X)
0 1 1/8
1 3 3/8
2 3 3/8
3 1 1/8
Total 8 Sum of all P(X) = 1
Figure 3: The Frequency and Probability of Random Variable X

Random variables are of two kinds:

• Discrete Random Variables
• Continuous Random Variables

Discrete random variables, as the name suggests, can take discrete values only.
Figure 3 shows a discrete random variable X. A discrete random variable, as a
convention, is represented using a capital letter. The individual values
are represented using lowercase letters; e.g., for the discrete variable X of
Figure 3, the discrete values are x0, x1, x2 and x3. Please note that their values are
also 0, 1, 2 and 3 respectively. Similarly, to represent the individual probabilities, you
may use the names p0, p1, p2 and p3. Please also note that the sum of all these
probabilities is 1, e.g., in Figure 3, p0 + p1 + p2 + p3 = 1.
Probability Distribution of Discrete Random Variable
For the discrete random variable X, which is defined as the number of head in
three tosses of coin, the pair (xi,pi), for i=0 to 3, defines the probability
distribution of the random variable X. Similarly, you can define the probability
distribution for any discrete random variable. The probability distribution of a
discrete random variable has two basic properties:
• The pi should be greater than or equal to zero, but always less than or
equal to 1.
• The sum of all pi should be 1.
Figure 4 shows the probability distribution of X (the number of heads in three
tosses of coin) in graphical form.

[Figure 4 shows a bar chart of the probability distribution: P(X=0) = 0.125, P(X=1) = 0.375, P(X=2) = 0.375, P(X=3) = 0.125.]

Figure 4: Probability Distribution of Discrete Random Variable X (Number of heads in 3 tosses of a coin)

Another important value defined for a probability distribution is the mean or
expected value, which, for the random variable X, is computed using the following
equation:

µ = Σ (from i = 0 to n) xi × pi        (9)

Thus, the mean or expected number of heads in three tosses would be:
µ = x0 × p0 + x1 × p1 + x2 × p2 + x3 × p3
µ = 0 × 1/8 + 1 × 3/8 + 2 × 3/8 + 3 × 1/8 = 12/8 = 1.5
Therefore, in a trial of 3 tosses of a coin, the mean number of heads is 1.5.
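The distribution of Figure 3 and the expected value computed above can also be obtained by enumerating all eight outcomes of the three tosses, as the following sketch shows:

# Enumerating three coin tosses to obtain the distribution of X = number of heads.
from itertools import product
from collections import Counter
from fractions import Fraction

outcomes = list(product("HT", repeat=3))           # HHH, HHT, ..., TTT: 8 outcomes
counts = Counter(outcome.count("H") for outcome in outcomes)

mean = Fraction(0)
for x, frequency in sorted(counts.items()):
    p = Fraction(frequency, len(outcomes))
    print(f"P(X = {x}) = {p}")                     # 1/8, 3/8, 3/8, 1/8
    mean += x * p
print("Expected value E(X) =", mean)               # 3/2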

2.3.1 Binomial Distribution

Binomial distribution is a discrete distribution. It shows the probability
distribution of a discrete random variable. The Binomial distribution involves
an experiment consisting of Bernoulli trials, which has the following
characteristics:
• A number of trials are conducted, say n.
• There can be only two possible outcomes of a trial - Success (say s) or
Failure (say f).
• Each trial is independent of all the other trials.
• The probability of the outcome Success (s), as well as Failure (f), is the same
in each and every independent trial.
For example, in the experiment of tossing three coins, the outcome success is
getting a head in a trial. One possible outcome for this experiment is THT, which
is one of the outcomes of the sample space shown in Figure 2.
You may please note that in the case of n = 3, for the random variable X, which
represents the number of heads, a success is getting a Head, while a failure is
getting a Tail. Thus, THT is actually Failure, Success, Failure. The probability
for such cases can be computed as shown earlier. In general, in the Binomial
distribution, the probability of r successes is represented as:

P(X = r) or pr = nCr × s^r × f^(n−r)        (10)

where s is the probability of success and f is the probability of failure in each
trial. The value of nCr is computed using the combination formula:

nCr = n! / (r! (n − r)!)        (11)
For the case of three tosses of a coin, where X is the number of heads in the three
tosses, n = 3 and both s and f are 1/2, the probabilities as per the Binomial
distribution would be:

P(X = 0) or p0 = 3C0 × s^0 × f^3 = [3!/(0!(3−0)!)] × (1/2)^0 × (1/2)^3 = 1/8
P(X = 1) or p1 = 3C1 × s^1 × f^2 = [3!/(1!(3−1)!)] × (1/2)^1 × (1/2)^2 = 3/8
P(X = 2) or p2 = 3C2 × s^2 × f^1 = [3!/(2!(3−2)!)] × (1/2)^2 × (1/2)^1 = 3/8
P(X = 3) or p3 = 3C3 × s^3 × f^0 = [3!/(3!(3−3)!)] × (1/2)^3 × (1/2)^0 = 1/8

which is the same as in Figure 2 and Figure 3.
Finally, the mean and standard deviation of the Binomial distribution for n trials,
each having a probability of success s, can be defined using the following
formulas:
μ = n × s                 (12a)
σ = √(n × s × (1 − s))                 (12b)
Therefore, for the variable X which represents the number of heads in three tosses
of a coin, the mean and standard deviation are:
μ = n × s = 3 × 1/2 = 1.5
σ = √(n × s × (1 − s)) = √(3 × 1/2 × (1 − 1/2)) = √3/2

The distribution of a discrete random variable, thus, allows you to compute the
probability of occurrence of a specific number of successes, as well as the mean or
expected value of a random probability experiment.
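The Binomial probabilities, mean and standard deviation computed above can be cross-checked with a few lines of code. The following is a minimal sketch, assuming Python with SciPy is available; the variable names are our own and are not part of the unit.

from scipy.stats import binom

n, s = 3, 0.5                        # number of trials and probability of success
for r in range(n + 1):
    # P(X = r) = nC_r * s^r * (1 - s)^(n - r), as in equation (10)
    print(f"P(X = {r}) = {binom.pmf(r, n, s):.3f}")   # 0.125, 0.375, 0.375, 0.125

print("mean =", binom.mean(n, s))    # n * s = 1.5, as in equation (12a)
print("std  =", binom.std(n, s))     # sqrt(n * s * (1 - s)) = sqrt(3)/2, as in equation (12b)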

2.3.2 Probability Distribution of Continuous Random Variable

A continuous variable is measured using scale or interval measures. For
example, the height of the students of a class can be measured using an interval
measure. You can study the probability distribution of a continuous random
variable also; however, it is quite different from the distribution of a discrete
variable. Figure 5 shows a sample histogram of the height of 100 students of
a class. You may notice that it is essentially a grouped frequency distribution.

[Figure 5 shows a histogram of the frequency of 'Height', with height intervals from [140, 145] to (190, 195] on the horizontal axis and frequency on the vertical axis; the tallest bar, with frequency 27, corresponds to the interval (165, 170].]

Figure 5: Histogram of Height of 100 students of a Class

The mean of the height was 166 and the standard deviation was about 10. The
probability that a student's height lies in the interval 165 to 170 is 0.27.

In general, for a large data set, the distribution of a continuous random variable is
represented as a smooth curve, which has the following characteristics:
• The probability in each interval would be between 0 and 1. To compute the
probability in an interval you need to compute the area of the curve between
the starting and end points of that interval.
• The total area of the curve would be 1.

2.3.3 The Normal Distribution


An interesting probability distribution of a continuous random variable is the
Normal Distribution, which was first demonstrated by the German scientist C.F.
Gauss. Therefore, it is sometimes also called the Gaussian distribution. The
Normal distribution has the following properties:
• The normal distribution occurs in many real-life situations, such as the height
distribution of people, marks of students, intelligence quotient of people, etc.
• The curve looks like a bell-shaped curve.
• The curve is symmetric about the mean value (μ). Therefore, about half of
the probability distribution curve lies towards the left of the mean and the
other half lies towards the right of the mean.
• If the standard deviation of the curve is σ, then about 68% of the data values
would be in the range (μ-σ) to (μ+σ) (Refer to Figure 6)
• About 95% of the data values would be in the range (μ-2σ) to (μ+2σ) (Refer
to Figure 6)
• About 99.7% of the data values would be in the range (μ-3σ) to (μ+3σ) (Refer
to Figure 6).
• The skewness of the normal distribution is zero, and its excess kurtosis (kurtosis − 3) is also zero.
• The probability density of the normal distribution is represented using
a mathematical equation with parameters μ and σ, namely
f(x) = (1/(σ√(2π))) × exp(−(x − μ)²/(2σ²)). You may refer to the
further readings for more details.

[Figure 6 shows a bell-shaped normal curve with the horizontal axis marked at μ − 3σ, μ − 2σ, μ − σ, μ, μ + σ, μ + 2σ and μ + 3σ, and regions covering about 68%, 95% and 99.7% of the area respectively.]

Figure 6: Normal Distribution of Data

Computing probability using Normal Distribution:


The Normal distribution can be used to compute the z-score, which computes
the distance of a value x from its mean in terms of its standard deviation.
For a given continuous random variable X, a value x of X, and a normal probability
distribution with parameters μ and σ, the z-score is computed as:
z = (x − μ) / σ                 (13)
You can find the cumulative probability at a particular z-value using the Normal
distribution; for example, the shaded portion of Figure 7 shows the
cumulative probability at z = 1.3, and the probability of the shaded portion at this
point is 0.9032.

[Figure 7 shows a normal curve with the area under the curve to the left of μ + 1.3σ (i.e., z = 1.3) shaded; this area is 0.9032.]

Figure 7: Computing Probability using Normal Distribution

Standard Normal Distribution is a standardized form of normal distribution,


which allows comparison of various different normal curves. A standard normal
curve would have the value of mean (μ) as zero and standard deviation (σ) as 1.
The z-score for the standard normal distribution would be:
z = (x − 0) / 1 = x
Therefore, for the standard normal distribution, the z-score is the same as the value of x.
This means that the interval z = −2 to z = +2 contains about 95% of the area under
the standard normal curve.
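The cumulative probabilities discussed above can also be obtained numerically. The following is a minimal sketch, assuming Python with SciPy; the value x = 179 (with μ = 166 and σ = 10 from the height example) is an assumed illustrative value.

from scipy.stats import norm

# Area under the standard normal curve to the left of z = 1.3 (Figure 7)
print(norm.cdf(1.3))                        # about 0.9032

# z-score of an assumed value x under a normal distribution with mean mu and SD sigma (equation 13)
mu, sigma, x = 166, 10, 179
z = (x - mu) / sigma                        # 1.3
print(z, norm.cdf(x, loc=mu, scale=sigma))  # the same area as norm.cdf(1.3)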

In addition to the Normal distribution, a large number of other probability distributions
have been studied. Some of these distributions are the Poisson distribution,
the Uniform distribution, the Chi-square distribution, etc. Each of these distributions is
represented by a characteristic equation involving a set of parameters. A
detailed discussion on these distributions is beyond the scope of this Unit. You
may refer to the Further Readings for more details on these distributions.

2.4 SAMPLING DISTRIBUTION AND THE CENTRAL LIMIT THEOREM

With the basic introduction, as above, next we discuss one of the important
aspects of samples and populations, called the sampling distribution. A typical
statistical experiment may be based on a specific sample of data that is
collected by the researcher. Such data is termed the primary data. The
question is: can the statistical results obtained by you using the primary data
be applied to the population? If yes, what may be the accuracy of such an
inference? To answer this question, you must study the sampling distribution.
A sampling distribution is also a probability distribution; however, this
distribution shows the probability of choosing a specific sample from the
population. In other words, a sampling distribution is the probability distribution
of the means of the random samples of the population. The probability in this
distribution defines the likelihood of the occurrence of the specific mean of the
sample collected by the researcher. The sampling distribution determines whether
the statistics of the sample fall close to the population parameters or not. The
following example explains the concept of sampling distribution in the context
of a categorical variable.

Example 5: Consider a small population of just 5 persons, who vote on the question
"Should Data Science be made a Core Course in Computer Science? (Yes/No)". The
following table shows the population:

P1 P2 P3 P4 P5 Population Parameter (proportion) (p)


Yes Yes No No No 0.4

Figure 8: A hypothetical population

Suppose you take a sample of size (n) = 3 and collect random samples. The
following are the possible sets of random samples:
Sample Sample Proportion (𝑝̂ )
P1, P2, P3 0.67
P1, P2, P4 0.67
P1, P2, P5 0.67
P1, P3, P4 0.33
P1, P3, P5 0.33
P1, P4, P5 0.33
P2, P3, P4 0.33
P2, P3, P5 0.33
P2, P4, P5 0.33
P3, P4, P5 0.00
Frequency of all the sample proportions is:
𝑝̂ Frequency
0 1
0.33 6
0.67 3

Figure 9: Sampling proportions

The mean of all these sample proportions = (0 × 1 + 0.33 × 6 + 0.67 × 3)/10
= 0.4 (ignoring round-off errors)

[Figure 10 shows a bar chart of frequency versus sample proportion: frequency 1 at proportion 0, frequency 6 at 0.33 and frequency 3 at 0.67.]

Figure 10: The Sampling Proportion Distribution

Please notice the nature of the sampling proportion distribution; it looks close
to a Normal distribution curve. In fact, you can verify this by creating an
example with 100 data points and a sample size of 30.

Given a sample size n and parameter proportion p of a particular category, then


the sampling distribution for the given sample size would fulfil the following:

mean proportion = p                 (14a)
Standard Deviation = √(p × (1 − p) / n)                 (14b)
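Equations (14a) and (14b) can be verified by simulation on a larger population. The following is a minimal sketch, assuming Python with NumPy; the proportion p = 0.4 (as in Example 5) and the sample size n = 100 are assumptions made for illustration.

import numpy as np

rng = np.random.default_rng(0)
p, n, trials = 0.4, 100, 10_000          # population proportion, sample size, number of samples

# Each simulated sample of size n yields one sample proportion
sample_props = rng.binomial(n, p, size=trials) / n

print(sample_props.mean())               # close to p (equation 14a)
print(sample_props.std())                # close to sqrt(p * (1 - p) / n) (equation 14b)
print(np.sqrt(p * (1 - p) / n))          # about 0.049 here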

Let us extend the sampling distribution to interval variables. The following example
explains different aspects of the sampling distribution:
Example 6: Consider a small population consisting of the ages of just 5 persons. The
following table shows the population:

P1 P2 P3 P4 P5 Population mean (μ)


20 25 30 35 40 30

Figure 8: A hypothetical population

Suppose you take a sample of size (n) = 3 and collect random samples. The
following are the possible sets of random samples:
Sample Sample Mean (𝑥̅ )
P1, P2, P3 25
P1, P2, P4 26.67
P1, P2, P5 28.33
P1, P3, P4 28.33
P1, P3, P5 30
P1, P4, P5 31.67
P2, P3, P4 30
P2, P3, P5 31.67
P2, P4, P5 33.33
P3, P4, P5 35

Figure 11: Mean of Samples

The mean of all these sample means = 30, which is the same as the population mean μ.
The histogram of the data is shown in Figure 12.

[Figure 12 shows a histogram of the 10 sample means, with mean-value intervals from [25, 26.5] to (34, 35.5] on the horizontal axis and frequency on the vertical axis.]

Figure 12: Frequency distribution of sample means

Given a sample size n and population mean μ, then the sampling distribution for
the given sample size would fulfil the following:
Mean of sample means = μ                 (15a)
Standard Deviation of Sample Means = σ / √n                 (15b)
Therefore, the z-score computation for sampling distribution will be as per the
following equation:
Note: You can obtain this equation from equation (13), as this is a
distribution of means, therefore, x of equation (13) is 𝑥̅ , and standard
deviation of sampling distribution is given by equation (15b).
z = (x̄ − μ) / (σ / √n)                 (15c)

Please note that the histogram of the mean of samples is close to normal
distribution.

Such observations led to the Central Limit Theorem, which proposes the following:
Central Limit Theorem: Assume that a sample of size n is drawn from a population that
has mean μ and standard deviation σ. The central limit theorem states that, with the
increase in n, the sampling distribution, i.e. the distribution of the means of the samples,
approaches closer to the normal distribution.

However, it may be noted that the central limit theorem is applicable only if you have
collected independent random samples, where the size of the sample is sufficiently large,
yet is less than 10% of the population. Therefore, Example 5 and Example 6 are
not true representations of the theorem; rather, they are given to illustrate the concept.
Further, it may be noted that the central limit theorem does not put any constraint on
the distribution of the population. Equation (15) is a result of the central limit theorem.
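The theorem can also be illustrated by simulation. The following is a minimal sketch, assuming Python with NumPy; the skewed (exponential) population, the sample size of 50 and sampling with replacement are assumptions made only for illustration.

import numpy as np

rng = np.random.default_rng(1)
population = rng.exponential(scale=10, size=100_000)   # a deliberately non-normal population
mu, sigma = population.mean(), population.std()

n, trials = 50, 5_000
# Draw many random samples (with replacement, a close approximation for a large population)
sample_means = rng.choice(population, size=(trials, n)).mean(axis=1)

print(sample_means.mean(), mu)                  # mean of sample means is close to mu (equation 15a)
print(sample_means.std(), sigma / np.sqrt(n))   # SD of sample means is close to sigma/sqrt(n) (equation 15b)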

Does the Size of sample have an impact on the accuracy of results?


Consider that a population size is 100,000 and you have collected a sample of size
n = 100, which is sufficiently large to fulfil the requirements of the central limit theorem.
Will there be any advantage of taking a higher sample size, say n = 400? The next section
addresses this issue in detail.

Check Your Progress 2


1. A fair dice is thrown 3 times, compute the probability distribution of the
outcome number of times an even number appears on the dice.

2. What would be the probability of getting different numbers of heads, if a fair
coin is tossed 4 times?

3. What would be the mean and standard deviation for the random variable of
Question 2.

4. What is the mean and standard deviation for standard normal distribution?

5. A country has a population of 1 billion, out of which 1% are students of
class 10. A representative sample of 10000 students of class 10 were asked the
question "Is Mathematics difficult or easy?". Assuming that the population
proportion for this question was reported to be 0.36, what would be the
standard deviation of the sampling distribution?

6. Given a quantitative variable, what is the mean and standard deviation of


sampling distribution?


2.5 STATISTICAL HYPOTHESIS TESTING

In general, statistical analysis is mainly used in the two situations:


S1. To determine whether students of class 12 play some sport, a simple random
survey collected data from 1000 students. Of these, 405 students
stated that they play some sport. Using this information, can you infer
that students of class 12 give less importance to sports? Such a decision
would require you to estimate the population parameters.
S2. In order to study the effect of sports on the performance of students in class 12,
a study was performed. It used random sampling and
collected the data of 1000 students, which included the percentage of
marks obtained by the student and the hours spent by the
student on sports per week during class 12. This kind of decision can be
made through hypothesis testing.

In this section, let us analyse both these situations.

2.5.1 Estimation of Parameters of the Population


One of the simplest ways to estimate a parameter value is point estimation.
The key characteristics of such an estimate are that it should be unbiased, such as
a mean or median that lies towards the centre of the data, and that it should have as
small a standard deviation as possible. For example, a point estimate for situation
S1 above would be that 40.5% of students play some sport. This point estimate,
however, may not be precise and may have some margin of error. Therefore, a
better estimation would be to define an interval that contains the value of the
parameter of the population. This interval, called the confidence interval, includes
the point estimate along with a possible margin of error. The probability that the
chosen confidence interval contains the population parameter is normally chosen
as 0.95. This probability is called the confidence level. Thus, you can state with
95% confidence that the confidence interval contains the parameter. Is the value
of the confidence level as 0.95 arbitrary? As you know, the sampling distribution
for computing a proportion is normal if the sample size (n) is large. Therefore, to
answer the question asked above, you may study Figure 13, which shows the
probability distribution of the sampling distribution.

[Figure 13 shows a normal curve centred at the population proportion p, with the non-shaded area lying between p − 2 StDev and p + 2 StDev; the sample proportion p̂ falls in this area with 95% probability.]

Figure 13: Confidence Level 95% for a confidence interval (non-shaded area).

Since you have selected a confidence level of 95%, you are expecting that the
proportion of the sample (p̂) can be in the interval (population proportion (p) −
2 × (Standard Deviation)) to (population proportion (p) + 2 × (Standard
Deviation)), as shown in Figure 13. The probability of occurrence of p̂ in this
interval is 95% (please refer to Figure 6). Therefore, the confidence level is 95%.
In addition, note that you do not know the value of p; that is what you are
estimating, therefore, you would be computing p̂. You may observe in Figure
13 that the value of p will be in the interval (p̂ − 2 × (Standard Deviation)) to
(p̂ + 2 × (Standard Deviation)). The standard deviation of the sampling distribution
can be computed using equation (14b). However, as you are estimating the value
of p, you cannot compute the exact value of the standard deviation.
Rather, you can compute the standard error, which is obtained by estimating the
standard deviation using the sample proportion (p̂), with the following formula:
Standard Error (StErr) = √(p̂ × (1 − p̂) / n)
Therefore, the confidence interval is estimated as (p̂ − 2 × StErr) to (p̂ + 2 × StErr).
In general, for a specific confidence level, you can use a specific z-score
instead of 2. Therefore, the confidence interval, for large n, is: (p̂ − z × StErr) to
(p̂ + z × StErr).
In practice, you may use confidence levels of 90%, 95% or 99%. The z-scores
used for these confidence levels are 1.65, 1.96 (not 2) and 2.58 respectively.
Example 7: Consider the statement S1 of this section and estimate the
confidence interval for the given data.
For the sample, the proportion of class 12 students who play some sport is:
p̂ = 405/1000 = 0.405
The sample size (n) = 1000
StErr = √(p̂ × (1 − p̂) / n) = √(0.405 × (1 − 0.405) / 1000) = 0.016
Therefore, the Confidence Interval for the confidence level 95% would be:
(0.405 − 1.96 × 0.016) to (0.405 + 1.96 × 0.016)
0.374 to 0.436
Therefore, with a confidence of 95%, you can state that the proportion of class 12
students who play some sport is in the range 37.4% to 43.6%.
How can you reduce the size of this interval? You may observe that
StErr is inversely proportional to the square root of the sample size. Therefore,
you would have to increase the sample size approximately 4 times to reduce the
standard error to approximately half.
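The computation of Example 7 can be reproduced in a few lines of Python; this minimal sketch keeps the unrounded standard error, so it prints an interval of roughly 0.375 to 0.435, while the unit rounds StErr to 0.016 and reports 0.374 to 0.436.

import math

p_hat, n, z = 0.405, 1000, 1.96                 # sample proportion, sample size, z for 95%
st_err = math.sqrt(p_hat * (1 - p_hat) / n)     # about 0.0155, reported as 0.016 in the text
low, high = p_hat - z * st_err, p_hat + z * st_err
print(round(st_err, 4), round(low, 3), round(high, 3))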
Confidence Interval to estimate mean
You can find the confidence interval for estimating the mean in a similar manner,
as you have done for the case of proportions. However, in this case you need to
estimate the standard error in the estimated mean using the variation of equation
(15b), as follows:
Standard Error in Sample Mean = s / √n
where s is the standard deviation of the sample.

Example 8: The following table lists the height of a sample of 100 students of
class 12 in centimetres. Estimate the average height of students of class 12.
170 164 168 149 157 148 156 164 168 160
149 171 172 159 152 143 171 163 180 158
167 168 156 170 167 148 169 179 149 171
164 159 169 175 172 173 158 160 176 173

159 160 162 169 168 164 165 146 156 170
163 166 150 165 152 166 151 157 163 189
176 185 153 181 163 167 155 151 182 165
189 168 169 180 158 149 164 171 189 192
171 156 163 170 186 187 165 177 175 165
167 185 164 156 143 172 162 161 185 174

Figure 14: Random sample of height of students of class 12 in centimetres

The sample mean and sample standard deviation are computed and shown
below:
Sample Mean (x̄) = 166; Standard Deviation of the sample (s) = 11
Therefore, the confidence interval of the mean height of the
students of class 12 can be computed as:
Mean height (x̄) = 166
The sample size (n) = 100
Standard Error in Sample Mean = 11 / √100 = 1.1
The Confidence Interval for the confidence level 95% would be:
(166 − 1.96 × 1.1) to (166 + 1.96 × 1.1)
163.8 to 168.2
Thus, with a confidence of 95%, you can state that the average height of class 12
students is between 163.8 and 168.2 centimetres.
You may please note that in Example 8 we have used the t-distribution for means,
as we have used the sample's standard deviation rather than the population standard
deviation. The t-distribution of means is slightly more restrictive than the z-
distribution. The t-value is computed in the context of the sampling distribution by
the following equation:
t = (x̄ − μ) / (s / √n)                 (16)
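The interval of Example 8 can be reproduced as follows. This is a minimal sketch, assuming Python with SciPy; it uses the exact t critical value for 99 degrees of freedom (about 1.98), whereas the unit uses 1.96, so the two intervals differ only marginally.

import math
from scipy import stats

x_bar, s, n = 166, 11, 100                 # sample mean, sample SD and sample size from Example 8
st_err = s / math.sqrt(n)                  # 1.1
t_crit = stats.t.ppf(0.975, df=n - 1)      # about 1.984 for df = 99
print(x_bar - t_crit * st_err, x_bar + t_crit * st_err)   # roughly 163.8 to 168.2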

2.5.2 Significance Testing of Statistical Hypothesis

In this section, we discuss how to test the statement S2, given in
section 2.5. A number of experimental studies are conducted in statistics with
the objective to infer whether the data supports a hypothesis or not. Significance
testing may involve the following phases:
1. Testing Pre-conditions on Data:
Prior to performing the test of significance, you should check the pre-conditions
of the test. Most of the statistical tests require random sampling, a large amount of
data for each possible category being tested and a normal distribution of the
population.
2. Making the statistical Hypotheses: You make the statistical hypotheses about the
parameters of the population. There are two basic hypotheses in statistical
testing – the Null Hypothesis and the Alternative Hypothesis.
Null Hypothesis: The null hypothesis either defines a particular value for the
parameter or specifies that there is no difference or no change in the specified
parameters. It is represented as H0.
Alternative Hypothesis: The alternative hypothesis specifies the values or the difference
in parameter values. It is represented as either H1 or Ha. We use the convention
Ha.
For example, for the statement S2 of Section 2.5, the two hypotheses would be:
H0: There is no effect of hours of study on the marks percentage of class 12.
Ha: The marks of class 12 students improve with the hours of study.

Please note that the hypothesis above is one-sided, as your assumption is that the
marks would increase with hours of study. The second one-sided hypothesis may
relate to a decrease in marks with hours of study. However, in most cases the
hypothesis will be two-sided, which just claims that one variable will cause a
difference in the second. For example, the two-sided hypothesis for statement S2
would be that the hours of study of students make a difference (they may either increase
or decrease) in the marks of students of class 12. In general, one-sided tests are
called one-tailed tests and two-sided tests are called two-tailed tests.

In general, the alternative hypothesis relates to the research hypothesis. Please also
note that the alternative hypothesis given above is a one-way hypothesis, as it only
states the effect in terms of an increase in marks. In general, you may have
an alternative hypothesis which is two-way (increase or decrease; less or more,
etc.).

3. Perform the desired statistical analysis:
Next, you perform exploratory analysis and produce a number of charts to
explore the nature of the data. This is followed by performing a statistical
significance test like chi-square, independent sample t-test, ANOVA, non-
parametric tests, etc., which is decided on the basis of the size of the sample, and the
type and characteristics of the data. These tests assume the null hypothesis to
be true. A test may generate parameter values based on the sample and a
probability called the p-value, which is evidence against the null hypothesis. This
is shown in Figure 15.

[Figure 15 shows a two-tailed test on a normal curve: each shaded tail has a p-value (area) of 0.025.]

Figure 15: p-value of test statistics

4. Analysing the results:


In this step, you should analyse your results. As stated in Unit 1, you must not
just draw your conclusion based on statistics, but support it with analytical
reasoning.

Example 9: We demonstrate the problem of finding a relationship between


study hours and marks percentage (S2 of section 2.5), however, by using only
sample data of 10 students (this is hypothetical data, used just for
illustration purposes), which is given as follows:

Weekly Study Hours (wsh) 21 19 7 11 16 17 4 9 7 18

Marks Percentage (mp) 96 92 63 76 89 80 56 70 61 81

In order to find such a relationship, you may like to perform basic exploratory
analysis. In this case, let us make a scatter plot between the two variables, taking
wsh as an independent variable and mp as a dependent variable. This scatter plot
is shown in Figure 16

[Figure 16 shows a scatter plot with Weekly Study Hours (wsh) on the horizontal axis (0 to 25) and Marks Percentage (mp) on the vertical axis (0 to 120).]

Figure 16: Scatter plot of Weekly Study Hours vs. Marks Percentage.

The scatter plot of Figure 16 suggests that the two variables may be associated.
But how to determine the strength of this association? In statistics, you use
Correlation, which may be used to determine the strength of linear association
between two quantitative variables. This is explained next.

2.5.3 Example using Correlation and Regression

As stated, correlation is used to determine the strength of a linear association. But
how is the correlation measured?
Consider two quantitative variables x and y, and a set of n pairs of values of these
variables (for example, the wsh and mp values shown in Example 9). You can
compute a correlation coefficient, denoted by r, using the following equation:
r_xy = [ Σ (for i = 1 to n) ((x_i − x̄)/s_x) × ((y_i − ȳ)/s_y) ] / (n − 1)                 (16)
The following are the characteristics of the correlation coefficient (r):
• The value of r lies between +1 and -1.
• A positive value of r means that value of y increases with increase in
value of x and the value of y decreases with decrease in value of x.
• A negative value of r means that value of y increases with decrease in
value of x and the value of y decreases with increase in value of x.
• If the value of r is closer to +1 or -1, then it indicates that association is
a strong linear association.
• Simple scaling of one of the variables does not change the correlation.
• Correlation does not specify the dependent and independent variables.
• Please remember that correlation does not imply causation. Causation has
to be established with reasoning.

The data of Example 9 shows a positive correlation. It can be computed as
follows:
Mean of wsh = 12.9; Standard Deviation of wsh (Sample) = 5.98980616
Mean of mp = 76.4; Standard Deviation of mp (Sample) = 13.7210301
r_wsh,mp = 8.61944034 / (10 − 1) = 0.95771559
Therefore, the data shows a strong positive correlation.
You may also use any statistical tool to find the correlation; we used MS-Excel,
which gave the following output of correlation:

Weekly Study Hours (wsh) Marks Percentage (mp)


Weekly Study Hours (wsh) 1
Marks Percentage (mp) 0.957715593 1
Figure 17: The Correlation coefficient
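The same correlation can be obtained outside MS-Excel as well. The following is a minimal sketch, assuming Python with NumPy, using the Example 9 data.

import numpy as np

wsh = np.array([21, 19, 7, 11, 16, 17, 4, 9, 7, 18])      # weekly study hours (Example 9)
mp  = np.array([96, 92, 63, 76, 89, 80, 56, 70, 61, 81])  # marks percentage (Example 9)

print(wsh.mean(), wsh.std(ddof=1))     # 12.9 and about 5.99
print(mp.mean(), mp.std(ddof=1))       # 76.4 and about 13.72
print(np.corrcoef(wsh, mp)[0, 1])      # about 0.9577, as in Figure 17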
As the linear correlation between wsh and mp variables is strong, therefore, you
may like to find a line, called linear regression line, that may describe this
association. The accuracy of regression line, in general, is better for higher
correlation between the variables.
Single Linear Regression:
A single linear regression predicts a response variable or dependent variable (say
y) using one explanatory variable or independent variable (say x). The equation
of single linear regression can be defined as follows:
y_predicted = a + b × x                 (17)
Here, y_predicted is the predicted value of the response variable (y), x is the explanatory
variable, a is the intercept with respect to y and b is called the slope of the
regression line. In general, when you fit a linear regression line to a set of data,
there will be a certain difference between y_predicted and the observed value of the
data (say y_observed). This difference between the observed value and the predicted
value, that is (y_observed − y_predicted), is called the residual. One of the most widely
used methods of finding the regression line is the method of least squares, which
minimises the sum of squares of these residuals. The following equation can be
used for computing a residual:
Residual = y_observed − y_predicted                 (18)
The objective of the least squares method in regression is to minimise the sum of
squares of the residuals of all the n observed values. This sum is given in the
following equation:
SumOfResidualSquares = Σ (for i = 1 to n) (y_observed − y_predicted)²                 (19)

Another important issue with a regression model is to determine the predictive
power of the model, which is computed using the square of the correlation (r²).
The value of r² can be computed as follows:
• In case you are not using regression, you can predict the value of y
using the mean. In such a case, the difference between the predicted value and the
observed value would be given by the following equation:
ErrorUsingMean = y_observed − ȳ                 (20)
• The total sum of squares of this error can be computed using the
following equation:
TotalSumOfSquares = Σ (for i = 1 to n) (y_observed − ȳ)²                 (21)
The use of the regression line reduces the error in the prediction of the value of y.
Equation (19) represents this squared error. Thus, the use of regression helps
in reducing the error. The proportion r² is actually the predictive power of the
regression and is represented using the following equation:
r² = [ Σ (for i = 1 to n) (y_observed − ȳ)² − Σ (for i = 1 to n) (y_observed − y_predicted)² ] / Σ (for i = 1 to n) (y_observed − ȳ)²                 (22)

As stated earlier, r2 can also be computed by squaring the value of r.

On performing regression analysis on the observed data of Example 9, the
statistics shown in Figure 18 are generated.

Regression Statistics
Multiple R 0.9577
R Square 0.9172
Adjusted R Square 0.9069
Standard Error 4.1872
Observations 10.0000

ANOVA
df SS F Significance F
Regression 1.0000 1554.1361 88.6407 0.0000
Residual 8.0000 140.2639
Total 9.0000 1694.4000

Coefficients Standard Error t Stat P-value


Intercept 48.0991 3.2847 14.6435 0.0000
Weekly Study Hours (wsh) 2.1939 0.2330 9.4149 0.0000
Figure 18: A Selected Regression output

The regression analysis results, as shown above are discussed below:


• Assumptions for the regression model:
o Data sample is collected using random sampling.
o For every value of x, the value of y in the population
§ is normally distributed
§ has same standard deviation
o The mean value of y in the population follows the regression equation
(17)
• Various Null hypothesis related to regression are:
o For the analysis of variance (ANOVA) output in the regression:
§ H0A: All the coefficients of model are zero, therefore, the
model cannot predict the value of y.
o For the Intercept:
§ H0I: Intercept =0.
o For the wsh:
§ H0wsh: wsh =0.
• The Significance F in the ANOVA is 0, therefore, you can reject the Null
hypothesis H0A and determine that this model can predict the value
of y. Please note that the high F value supports this observation.
• The p-values related to the intercept and wsh are almost 0, therefore, you can
reject the Null hypotheses H0I and H0wsh.
• The regression line has the equation:
mp_predicted = 48.0991 + 2.1939 × wsh
• You can compute the sum of squares (SS) using Equation (19) and
Equation (21).
• The degrees of freedom, in the context of statistics, is the number of values
used in computing a statistic that are free to vary.

• The term “Multiple R” in the Regression Statistics defines the correlation
between the dependent variable (say y) and the set of independent or
explanatory variables in the regression model. Thus, Multiple R is similar
to the correlation coefficient (r), except that it is used when multiple
regression is performed. Most software expresses the results in terms of
Multiple R, instead of r, to represent the regression output. Similarly, R
Square is used in multiple regression, instead of r². The proposed model
has a large r², and therefore can be considered for deployment.
You can go through further readings for more details on all the terms discussed
above.
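Most of the regression output of Figure 18 can be reproduced with any statistical library. The following is a minimal sketch, assuming Python with SciPy and the Example 9 data; scipy.stats.linregress reports the intercept, slope, r value and the p-value of the slope, but not the full ANOVA table.

import numpy as np
from scipy import stats

wsh = np.array([21, 19, 7, 11, 16, 17, 4, 9, 7, 18])
mp  = np.array([96, 92, 63, 76, 89, 80, 56, 70, 61, 81])

result = stats.linregress(wsh, mp)
print(result.intercept, result.slope)      # about 48.0991 and 2.1939 (the regression line)
print(result.rvalue ** 2)                  # R Square, about 0.9172
print(result.pvalue)                       # p-value for the slope, almost 0 here

predicted = result.intercept + result.slope * wsh
residuals = mp - predicted                 # the vertical differences plotted in Figure 20
print(np.sum(residuals ** 2))              # residual sum of squares (equation 19), about 140.26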
Figure 19 shows the regression line for the data of Example 9. You may
observe that a residual is the vertical difference between the Marks Percentage
and the predicted marks percentage. These residuals are shown in Figure 20.

[Figure 19 shows the line fit plot of Marks Percentage (mp) and Predicted Marks Percentage (mp) against Weekly Study Hours (wsh); the fitted line is y = 2.1939x + 48.099.]

Figure 19: The Regression Line

[Figure 20 shows the residuals (roughly between −10 and 10) plotted against Weekly Study Hours (wsh).]

Figure 20: The Residual Plot


2.5.4 Types of Errors in Hypothesis Testing

In sections 2.5.1 and 2.5.2, we discussed testing the Null
hypothesis. You either reject the Null hypothesis and accept the alternative
hypothesis based on the computed probability or p-value, or you fail to reject
the Null hypothesis. The decisions in such hypothesis testing would be:
• You reject the Null hypothesis for a confidence level of 95% based on the p-
value, which lies in the shaded portion, that is p-value < 0.05 for a two-tailed
hypothesis (that is, both the shaded portions in Figure 15, each an area of
probability 0.025). Please note that in the case of a one-tailed test, you would
consider only one shaded area of Figure 15; therefore, you would be
considering p-value < 0.05 in only one of the two shaded areas.
• You fail to reject the Null hypothesis for a confidence level of 95% when p-
value > 0.05.

The two decisions stated above could be incorrect, as you are considering a
confidence level of 95%. The following table shows this situation.

Final Decision
The Actual H0 is Rejected, that is, you You fail to reject H0, as you do not
Scenario have accepted the have enough evidence to accept
Alternative hypothesis the Alternative hypothesis
H0 is True This is called a TYPE-I error You have arrived at a correct
decision
H0 is False You have arrived at a correct This is called a TYPE-II error
decision

For example, assume that a medicine is tested for a disease and this medicine is
NOT a cure of the disease. You would make the following hypotheses:
H0: The medicine has no effect for the disease
Ha: The medicine improves the condition of patient.
However, if the data is such that, for a confidence level of 95%, the p-value is
computed to be less than 0.05, then you will reject the null hypothesis, which is
a Type-I error. The chance of a Type-I error for this confidence level is 5%.
This error would mean that the medicine will get approval, even though it has
no effect on curing the disease.

Now assume instead that a medicine is tested for a disease and this medicine
is a cure for the disease. The hypotheses still remain the same as above. However,
if the data is such that, for a confidence level of 95%, the p-value is computed
to be more than 0.05, then you will not be able to reject the null hypothesis,
which is a Type-II error. This error would mean that a medicine which can cure
the disease will not be accepted.

Check Your Progress 3


1. A random sample of 100 students was collected to find their opinion about
whether practical sessions in teaching should be increased. About 53 students voted for
increasing the practical sessions. What would be the confidence interval of the
population proportion of the students who would favour increasing the
practical sessions? Use confidence levels 90%, 95% and 99%.

2. The Weight of 20 students, in Kilograms, is given in the following table
65 75 55 60 50 59 62 70 61 57
62 71 63 69 55 51 56 67 68 60
Find the estimated weight of the student population.

3. A class of 10 students were given a validated test prior to and after completing
a training course. The marks of the students in those tests are given as under:
Marks before Training (mbt) 56 78 87 76 56 60 59 70 61 71
Marks after training (mat) 55 79 88 90 87 75 66 75 66 78
With a significance level of 95% can you say that the training course was useful?

2.6 SUMMARY
This Unit introduces you to the basic probability and statistics related to data
science. The unit first introduces the concept of conditional probability, which
defines the probability of an event given a specific event has occurred. This is
followed by discussion on the Bayes theorem, which is very useful in finding
conditional probabilities. Thereafter, the unit explains the concept of discrete
and continuous random variables. In addition, the Binomial distribution and
normal distribution were also explained. Further, the unit explained the
concept of sampling distribution and central limit theorem, which forms the
basis of the statistical analysis. The Unit also explain the use of confidence
level and intervals for estimating the parameters of the population. Further, the
unit explains the process of significance testing by taking an example related to
correlation and regression. Finally, the Unit explains the concept of errors in
hypothesis testing. You may refer to further readings for more details on these
concepts.

2.7 SOLUTIONS/ANSWERS

☞ Check Your Progress – 1


1. Is P(Y/X) = P(X/Y)? No. Please check in Example 3: the probability
P(Red/BagB) is 7/10, whereas P(BagB/Red) is 7/12.
2. Consider two independent events A and B, first compute P(A) and P(B).
The probability of any one of these events to occur would be computed
by equation (2), which is:
𝑃(𝐴 ∪ 𝐵) = 𝑃(𝐴) + 𝑃(𝐵) − 𝑃(𝐴 ∩ 𝐵)
The probability of occurrence of both the events will be computed using
the equation (4), which is:
𝑃(𝑋 ∩ 𝑌) = 𝑃(𝑋) × 𝑃(𝑌)
3. Let us assume Event X as “A student is selected from University A”.
Assuming that either University can be selected with equal probability,
P(UniA) = 1/2.
Let Event Y be “A student who has obtained more than 75% marks
is selected”. This probability P(StDis) = 1/2 × 10/20 + 1/2 × 20/30 = 7/12
In addition, P(StDis/UniA) = 10/20 = 1/2
P(UniA/StDis) = [P(StDis/UniA) × P(UniA)] / P(StDis) = (1/2 × 1/2) / (7/12) = 3/7
Check Your Progress 2
1. As the probability of getting an even number (E) or an odd number (O) is equal
in each throw of the dice, the following eight outcomes are possible:
Outcomes: EEE EEO EOE EOO OEE OEO OOE OOO
Number of times an even number appears (X): 3 2 2 1 2 1 1 0
Therefore, the probability distribution would be:
X Frequency Probability P(X)
0 1 1/8
1 3 3/8
2 3 3/8
3 1 1/8
Total 8 Sum of all P(X) = 1

2. This can be determined by using the Binomial distribution with X=0, 1, 2, 3 and
4, as follows (s and f both are 1/2):
P(X = 0) or p_0 = 4C_0 × s^0 × f^(4−0) = 4!/(0! × (4 − 0)!) × (1/2)^0 × (1/2)^4 = 1/16
P(X = 1) or p_1 = 4C_1 × s^1 × f^(4−1) = 4!/(1! × (4 − 1)!) × (1/2)^1 × (1/2)^3 = 4/16
P(X = 2) or p_2 = 4C_2 × s^2 × f^(4−2) = 4!/(2! × (4 − 2)!) × (1/2)^2 × (1/2)^2 = 6/16
P(X = 3) or p_3 = 4C_3 × s^3 × f^(4−3) = 4!/(3! × (4 − 3)!) × (1/2)^3 × (1/2)^1 = 4/16
P(X = 4) or p_4 = 4C_4 × s^4 × f^(4−4) = 4!/(4! × (4 − 4)!) × (1/2)^4 × (1/2)^0 = 1/16

3. The number of tosses (n) = 4 and s = ½, therefore,


μ = n × s = 4 × 1/2 = 2
σ = √(n × s × (1 − s)) = √(4 × 1/2 × (1 − 1/2)) = 1

4. Mean = 0 and Standard deviation = 1.

5. Standard deviation of the sampling distribution =
√(p × (1 − p) / n) = √(0.36 × (1 − 0.36) / 10000) = (0.6 × 0.8) / 100 = 0.0048
The large sample size results in high accuracy of the results.
6. Mean of sample means = μ
Standard Deviation of Sample Means = σ / √n

Check Your Progress 3


1. The value of the sample proportion p̂ = 53/100 = 0.53
Therefore, StErr = √(p̂ × (1 − p̂) / n) = √(0.53 × (1 − 0.53) / 100) = 0.05
The Confidence interval for 90%:
(0.53 ± 1.65 × 0.05), which is 0.4475 to 0.6125
The Confidence interval for 95%:
(0.53 ± 1.96 × 0.05), which is 0.432 to 0.628
The Confidence interval for 99%:
(0.53 ± 2.58 × 0.05), which is 0.401 to 0.659
2. Sample Mean (𝑥̅ ) = 61.8; Standard Deviation of sample (s) = 6.787
Sample size (n) = 20
Standard Error in Sample Mean = 6.787 / √20 = 1.52
The Confidence Interval for the confidence level 95% would be:
(61.8 ± 1.96 × 1.52) = 58.8 to 64.8

3. Analysis: This kind of problem requires you to find whether there is a significant
difference in the means of the test results before and after the training course. In
addition, the sample size is 10 and the same group of persons is tested
twice; therefore, a paired sample t-test may be used to test the difference of the
means. You can follow all the steps of hypothesis testing for this example.
1. Testing Pre-condition on Data:
• The students who were tested through this training course were
randomly selected.
• The population test scores, in general, are normally distributed.
• The sample size is small, therefore, a robust test may be used.
2. The Hypotheses
H0: mean(mbt) = mean(mat)
H1: mean(mbt) < mean(mat)
3. The results of the analysis are given below
(Please note H1 is one sided hypothesis, as you are trying to find if
training was useful for the students)
t-Test: Paired Two Sample for Means
Marks before Training (mbt) Marks after training (mat)
Mean 67.4 75.9
Variance 112.9333333 124.1
Observations 10 10
df 9
t Stat -2.832459252
P(T<=t) one-tail 0.009821702
t Critical one-tail 1.833112933

4. Analysis of results: The one tail p-value suggests that you reject the
null hypothesis. The difference in the means of the two results is
significant enough to determine that the scores of the student have
improved after the training.
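The paired t-test above can be reproduced as follows. This is a minimal sketch, assuming Python with SciPy 1.6 or later (for the alternative argument), using the data of Question 3.

from scipy import stats

mbt = [56, 78, 87, 76, 56, 60, 59, 70, 61, 71]   # marks before training
mat = [55, 79, 88, 90, 87, 75, 66, 75, 66, 78]   # marks after training

# One-tailed paired t-test; H1 is that the marks increased after the training
t_stat, p_value = stats.ttest_rel(mbt, mat, alternative='less')
print(t_stat, p_value)        # about -2.83 and 0.0098, matching the table above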

Basics of Data Science
UNIT 3 DATA PREPARATION FOR ANALYSIS
3.0 Introduction
3.1 Objectives
3.2 Need for Data Preparation
3.3 Data preprocessing
3.3.1 Data Cleaning
3.3.2 Data Integration
3.3.3 Data Reduction
3.3.4 Data Transformation
3.4 Selection and Data Extraction
3.5 Data Curation
3.5.1 Steps of Data Curation
3.5.2 Importance of Data Curation
3.6 Data Integration
3.6.1 Data Integration Techniques
3.6.2 Data Integration Approaches
3.7 Knowledge Discovery
3.8 Summary
3.9 Solutions/Answers
3.10 Further Readings

3.0 INTRODUCTION

In the previous unit of this Block, you were introduced to the basic concepts of
conditional probability, Bayes Theorem and probability distribution including
Binomial and Normal distributions. The Unit also introduces you to the concept
of the sampling distribution, central limit theorem and statistical hypothesis
testing. This Unit introduces you to the process of data preparation for Data
Analysis. Data preparation is one of the most important processes, as it leads to
good quality data, which will result in accurate results of the data analysis. This
unit covers data selection, cleaning, curation, integration, and knowledge
discovery from the stated data. In addition, this unit gives you an overview of
data quality and of how data preparation for analysis is done. You may refer to further
readings for more details on these topics.

3.1 OBJECTIVES

After finishing this unit, you will be able to:


• Describe the meaning of "data quality"
• Explain basic techniques for data preprocessing
• Use the technique of data selection and extraction
• Define data curation and data integration
• Describe the process of knowledge discovery.

3.2 NEED FOR DATA PREPARATION

In the present time, data is one of the key resources for a business. Data is
processed to create information; information is integrated to create knowledge.
Since knowledge is power, it has evolved into a modern currency, which is
valued and traded between parties. Everyone wants to discuss the knowledge
and benefits they can gain from data. Data is one of the most significant
resources available to marketers, agencies, publishers, media firms, and others
today for a reason. But only high-quality data is useful. We can determine a data
set's reliability and suitability for decision-making by looking at its quality.
Degrees are frequently used to gauge this quality. The usefulness of the data for
the intended purpose and its completeness, accuracy, timeliness, consistency,
validity, and uniqueness are used to determine the data's quality. In simpler
terms, data quality refers to how accurate and helpful the data are for the task at
hand. Further, data quality also refers to the actions that apply the necessary
quality management procedures and methodologies to make sure the data is
useful and actionable for the data consumers. A wide range of elements,
including accuracy, completeness, consistency, timeliness, uniqueness, and
validity, influence data quality. Figure 1 shows the basic factors of data quality.

[Figure 1 shows the factors of data quality arranged around a central "Data Quality" label: completeness, accuracy, timeliness, consistency, validity and uniqueness.]

Figure 1: Factors of Data Quality

These factors are explained below:

• Accuracy - The data must be true and reflect events that actually take
place in the real world. Accuracy measures determine how closely the
figures agree with the verified right information sources.
• Completeness - The degree to which the data is complete determines
how well it can provide the necessary values.
• Consistency - Data consistency is the homogeneity of the data across
applications, networks, and when it comes from several sources. For
example, identical datasets should not conflict if they are stored in
different locations.

• Timeliness - Data that is timely is readily available whenever it is
needed. The timeliness factor also entails keeping the data accurate; to
make sure it is always available and accessible and updated in real-time.
• Uniqueness - Uniqueness is defined as the lack of duplicate or redundant
data across all datasets. The collection should contain zero duplicate
records.
• Validity - Data must be obtained in compliance with the firm's defined
business policies and guidelines. The data should adhere to the
appropriate, recognized formats, and all dataset values should be within
the defined range.

Consider yourself a manager at a company, say XYZ Pvt Ltd, who has been tasked
with researching the sales statistics for a specific organization, say ABC. You
immediately get to work on this project by carefully going through the ABC
company's database and data warehouse for the parameters or dimensions (such
as the product, price, and units sold), which may be used in your study. However,
your enthusiasm suffers a major problem when you see that several of the
attributes for different tuples do not have any recorded values. You want to
incorporate the information in your study on whether each item purchased was
marked down, but you find that this data has not been recorded. According to users
of this database system, the data recorded for some transactions contained
mistakes, such as strange numbers and anomalies.

The three characteristics of data quality—accuracy, completeness, and


consistency—are highlighted in the paragraph above. Large databases and data
warehouses used in the real world frequently contain inaccurate, incomplete, and
inconsistent data. What may be the causes of such erroneous data in the
databases? There may be problems with the data collection tools, which may
result in mistakes during the data-entering process. Personal biases, for example,
when users do not want to submit personal information, they may purposefully
enter inaccurate data values for required fields (for example, by selecting the
birthdate field's presented default value of "January 1"). Disguised missing data
is what we call this. There may also be data transfer errors. The use of
synchronized data transit and consumption may be constrained by technological
limitations, such as a short buffer capacity. Unreliable data may result from
differences in naming conventions, data codes, or input field types (e.g., date).
In addition, cleansing of data may be needed to remove duplicate tuples.

Incomplete data can be caused by a variety of circumstances. Certain properties,


such as customer information for sales transaction data, might not always be
accessible. It is likely that some data was omitted since it was not thought to be
important at the time of input. A misinterpretation or malfunctioning technology
could prevent the recording of essential data. For example, the data, which did
not match the previously stored data, was eliminated. Furthermore, it is likely
that the data's past alterations or histories were not documented. In particular,
for tuples with missing values for some properties, it could be required to infer
missing data.

3.3 DATA PREPROCESSING

Preprocessing is the process of taking raw data and turning it into information
that may be used. Data cleaning, data integration, data reduction, data
transformation, and data discretization are the main phases of data preprocessing
(see Figure 2).

[Figure 2 lists the data pre-processing steps: data cleaning, data integration, data transformation and data reduction.]

Figure 2: Data pre-processing

3.3.1 Data Cleaning

Data cleaning is an essential step in data pre-processing. It is also referred to as
scrubbing. It is crucial for the construction of a good model, yet it is the step that is
most frequently overlooked. Real-world
data typically exhibit incompleteness, noise, and inconsistency. In addition to
addressing discrepancies, this task entails filling in missing values, smoothing
out noisy data, and eliminating outliers. Errors are decreased, and data quality is
enhanced, via data cleansing. Although it might be a time-consuming and
laborious operation, it is necessary to fix data inaccuracies and delete bad entries.

a. Missing Values
Consider that you need to study the customer and sales data of the ABC
Company. As pointed out earlier, numerous tuples lack recorded
values for a number of attributes, including customer
income. The following techniques can be used to fill in the values
that are missing for this attribute (a small code sketch follows this list).
i. Ignore the tuple: Typically, this is carried out in the
absence of a class label (assuming the task involves
classification). This method is particularly detrimental
when each attribute has a significantly different percentage
of missing values. By disregarding the remaining
characteristics in the tuple, we avoid using their values.
ii. Manually enter the omitted value: In general, this
strategy is time-consuming and might not be practical for
huge data sets with a substantial number of missing values.
iii. Fill up the blank with a global constant: A single
constant, such as "Unknown" or “−∞”, should be used to
replace all missing attribute values. If missing data are
replaced with, say, "Unknown," the analysis algorithm can
mistakenly think that they collectively comprise valid data.
So, despite being simple, this strategy is not perfect.
iv. To fill in the missing value, use a measure of the
attribute's central tendency (such as the mean or
median): The median should be used for skewed data
distributions, while the mean can be used for normal
(symmetric) data distributions. Assume, for instance, that
the ABC company’s customer income data distribution is
symmetric and that the mean income is INR 50,000/-. Use
this value to fill in the income value that is missing.
v. For all samples that belong to the same class as the
specified tuple, use the mean or median: For instance, if
we were to categorize customers based on their credit risk,
the mean income value of customers who belonged to the
same credit risk category as the given tuple might be used
to fill in the missing value. If the data distribution is skewed
for the relevant class, it is best to utilize the median value.
vi. Fill in the blank with the value that is most likely to be
there: This result can be reached using regression,
inference-based techniques using a Bayesian
formalization, or decision tree induction. As an example,
using the other characteristics of your data's customers, you
may create a decision tree to forecast the income's missing
numbers.
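The following is a minimal sketch of strategies (iii), (iv) and (v) above, assuming Python with pandas; the column names and values are hypothetical and used only for illustration.

import numpy as np
import pandas as pd

# A hypothetical customer table with missing income values
df = pd.DataFrame({
    "income":      [50_000, np.nan, 42_000, np.nan, 61_000],
    "credit_risk": ["low",  "low",  "high", "high", "low"],
})

df["income_const"] = df["income"].fillna(-1)                    # (iii) a global constant
df["income_mean"]  = df["income"].fillna(df["income"].mean())   # (iv) the overall mean
df["income_class"] = df["income"].fillna(                       # (v) the class-wise mean
    df.groupby("credit_risk")["income"].transform("mean"))
print(df)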
b. Noisy Data
Noise is the variance or random error in a measured variable. It
is possible to recognize outliers, which might be noise,
employing tools for data visualization and basic statistical
description techniques (such as scatter plots and boxplots). How
can the data be "smoothed" out to reduce noise given a numeric
property, like price, for example? The following are some of the
data-smoothing strategies.
i. Binning: Binning techniques smooth sorted data values
by looking at their "neighbourhood", that is, the nearby values.
The sorted values are distributed into a number of
"buckets" or bins. Binning techniques carry out local
smoothing since they look at the values' surroundings.
When smoothing by bin means, each value in the bin is
replaced by the bin's mean value. As an illustration,
suppose a bin contains the three numbers 4, 8 and 15. The
average of these three numbers in the bin is 9.
Consequently, the value nine replaces each of the bin's
original values.
Similarly, smoothing by bin medians, which substitutes
the bin median for each bin value, can be used. Bin
boundaries often referred to as minimum and maximum
values in a specific bin can also be used in place of bin
values. This type of smoothing is called smoothing by
bin boundaries. In this method, the nearest boundary
value is used to replace each bin value. In general, the
smoothing effect increases with increasing bin width. As an
alternative, bins may have identical widths with constant
interval ranges of values (a small binning sketch is given after this list of smoothing strategies).
ii. Regression: Regression is a method for adjusting the data values to a
function and may also be used to smooth out the data. Finding the "best"
line to fit two traits (or variables) is the goal of linear regression, which
enables one attribute to predict the other. As an extension of linear
regression, multiple linear regression involves more than two features
and fits the data to a multidimensional surface.
iii. Outlier analysis: Clustering, for instance, the grouping of comparable
values into "clusters," can be used to identify outliers. It makes sense to
classify values that are outliers as being outside the set of clusters.
iv. Data discretization, a data transformation and data reduction technique,
is an extensively used data smoothing technique. The number of distinct
values for each property is decreased, for instance, using the binning
approaches previously discussed. This functions as a form of data
reduction for logic-based data analysis methods like decision trees,
which repeatedly carry out value comparisons on sorted data. Concept
hierarchies are a data discretization technique that can also be applied to
smooth out the data. The quantity of data values that the analysis process
must process is decreased by a concept hierarchy. For example, the price
variable, which represents the price value of commodities, may be
discretized into “lowly priced”, “moderately priced”, and “expensive”
categories.
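The following is a minimal binning sketch, assuming Python with pandas; the nine price values are assumed for illustration and include the bin (4, 8, 15), whose mean of 9 is used in the binning example above.

import pandas as pd

prices = pd.Series([4, 8, 15, 21, 21, 24, 25, 28, 34])   # assumed sorted price values

# Equal-frequency (equal-depth) bins of three values each, then smoothing by bin means
bins = pd.qcut(prices, q=3, labels=False)
smoothed = prices.groupby(bins).transform("mean")
print(smoothed.tolist())    # the first bin (4, 8, 15) is replaced by its mean, 9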

Steps of Data Cleaning


The following are the various steps of the data cleaning process.
1. Remove duplicate or irrelevant observations- Duplicate data may be
produced when data sets from various sources are combined, scraped, or data is
obtained from clients or other departments.
2. Fix structural errors - When measuring or transferring data, you may come
across structural mistakes such as unusual naming practices, typographical
errors, or wrong capitalization. Such inconsistencies may lead to mislabeled
categories or classes. For instance, "N/A" and "Not Applicable", which might
be present on any given document, may create two different classifications.
Rather, they should be studied under the same heading or missing values.
3. Managing Unwanted outliers -Outliers might cause problems in certain
models. Decision tree models, for instance, are more robust to outliers than
linear regression models. In general, we should not eliminate outliers unless
there is a compelling reason to do so. Sometimes removing them can improve
performance, but not always. Therefore, the outlier must be eliminated for a

good cause, such as suspicious measurements that are unlikely to be present in
the real data.
4. Handling missing data - Missing data is a deceptively difficult issue in
machine learning. We cannot simply ignore or remove the missing observations.
They must be treated carefully, since they can indicate a serious problem. Data
gaps resemble missing puzzle pieces: dropping an observation is like
denying that the puzzle slot is there, while naively filling it in is like trying to put a piece
from another puzzle into this one. Furthermore, we need to be aware of how we record
missing data. Instead of just filling a gap with the mean, you can flag the value
as missing and then fill it; this flagging-and-filling approach effectively lets
the model account for the missingness itself.
5. Validate and QA-You should be able to respond to these inquiries as part of
fundamental validation following the data cleansing process, for example:
o Does the data make sense?
o Does the data abide by the regulations that apply to its particular field?
o Does it support or refute your hypothesis? Does it offer any new
information?
o Can you see patterns in the data that will support your analysis?
o Is there a problem with the data quality?

Methods of Data Cleaning


The following are some of the methods of data cleaning.
1. Ignore the tuples: This approach is not particularly practical, and is
generally used only when a tuple has several attributes with missing
values.
2. Fill in the missing values: Entering the missing values by hand is
practical only for small data sets, as the process could take a long time.
Other options include filling the missing value with the attribute mean or
with the most probable value.
3. Binning method: This strategy is fairly easy to comprehend. The sorted
data is split into a number of equal-sized bins, and the values in each bin
are smoothed using the nearby values, for example by replacing them
with the bin mean, median or boundaries.
4. Regression: The data is smoothed out with the help of a regression
function. The regression may be linear or multiple. Multiple regression
has more than one independent variable, whereas simple linear regression
has only one.
5. Clustering: This approach groups similar data values into "clusters".
The outliers can then be identified as the values that fall outside the
clusters.

3.3.2 Data Integration

Data from many sources, such as files, data cubes, databases (both relational and
non-relational), etc., must be combined during this procedure. Both
homogeneous and heterogeneous data sources are possible. Structured,
unstructured, or semi-structured data can be found in the sources. Redundancies
and inconsistencies can be reduced and avoided with careful integration.

a. Entity Identification Problem: Data integration, which gathers data from
several sources into coherent data storage, like data warehousing, will likely
be required for your data analysis project. Several examples of these sources
include various databases, data cubes, and flat files.
During data integration, there are a number of things to consider. Integration
of schemas and object matching might be challenging. How is it possible to
match comparable real-world things across different data sources? The entity
identification problem describes this. How can a computer or data analyst be
sure that a client's ID in one database and their customer number in another
database refer to the same attribute? Examples of metadata for each attribute
include the name, definition, data type, permissible range of values, and null
rules for handling empty, zero, or null values. Such metadata can be used to
avoid errors during the integration of the schema. The data may also be
transformed with the aid of metadata. For example, in one instance of an
organization's database, the code for pay data might be "H" for high income
and "S" for small income, while in another instance the same pay codes may
be recorded as 1 and 2.
When comparing attributes from one database to another during integration,
the data structure must be properly considered. This is done to make sure
that any referential constraints and functional dependencies on attributes
present in the source system are also present in the target system. For
instance, a discount might be applied to the entire order by one system, but
only to certain items by another. If this is not found before integration, things
in the target system can be incorrectly dismissed.
b. Redundancy and Correlation Analysis: Another crucial problem in data
integration is redundancy. If an attribute (like annual income, for example)
can be "derived" from another attribute or group of data, it may be redundant.
Inconsistent attributes or dimension names can also bring redundancies in
the final data set.
Correlation analysis can identify some redundancies. Based on the available
data, such analysis can quantify the strength of the relationship between two
attributes. The chi-square (χ2) test is employed for finding relationships between
nominal attributes, while numeric attributes can be analyzed using the correlation
coefficient and covariance, which examine how one attribute's values vary with
those of another (a small code sketch of these checks appears at the end of this
subsection).
c. Tuple Duplication: Duplication should be identified at the tuple level in
addition to being caught between attributes (e.g., when, for a particular
unique data entry case, there are two or more identical tuples). Additional
sources of data redundancy include the use of denormalized tables, which are
frequently used to increase performance by avoiding joins, and faulty data
entry or updating of only some (not all) of the redundant data occurrences. Inconsistencies
frequently appear between different duplicates. For example, there may be
inconsistency as the same purchaser's name may appear with multiple
addresses within the purchase order database. This might happen if a
database for purchase orders has attributes for the buyer's name and address
rather than a foreign key to this data.

d. Data Value Conflict Detection and Resolution: Data value conflicts must
be found and resolved as part of data integration. As an illustration,
attribute values from many sources may vary for the same real-world thing.
Variations in representation, scale, or encoding may be the cause of this. In
one system, a weight attribute might be maintained in British imperial
units, while in another, metric units. For a hotel chain, the cost of rooms in
several cities could include various currencies, services (such as a
complimentary breakfast) and taxes. Similarly, every university may have
its own curriculum and grading system. When sharing information among
them, one university might use the quarter system, provide three database
systems courses, and grade students from A+ to F, while another would use
the semester system, provide two database systems courses, and grade
students from 1 to 10. Information interchange between two such
universities is challenging because it is challenging to establish accurate
course-to-grade transformation procedures between the two universities.
The abstraction level of attributes might also differ, so an attribute in one
system might be recorded at a lower abstraction level than the "identical"
attribute in another. For example, an attribute with the same name may refer
to the total sales of one branch of a company in one database, whereas in
another database it may refer to the company's total sales across all regional
shops.
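
The redundancy checks described in point (b) and the duplicate-tuple check in point (c) can be sketched with pandas and SciPy; the small data set below is invented purely for illustration.

    import pandas as pd
    from scipy.stats import chi2_contingency, pearsonr

    df = pd.DataFrame({
        "monthly_income": [30, 45, 60, 75, 90, 30],
        "annual_income":  [360, 540, 720, 900, 1080, 360],
        "gender":         ["M", "F", "F", "M", "F", "M"],
        "preferred_plan": ["A", "B", "B", "A", "B", "A"],
    })

    # Numeric redundancy: a correlation close to 1 suggests that one attribute
    # can be derived from the other
    r, _ = pearsonr(df["monthly_income"], df["annual_income"])
    print("correlation:", r)

    # Nominal attributes: chi-square test of independence
    table = pd.crosstab(df["gender"], df["preferred_plan"])
    chi2, p, dof, expected = chi2_contingency(table)
    print("chi-square:", chi2, "p-value:", p)

    # Tuple duplication: identical rows entered more than once
    print(df[df.duplicated(keep=False)])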

3.3.3 Data Reduction

In this phase, the data is trimmed: the number of records, attributes, or
dimensions can be reduced. When reducing data, one should keep in mind that
the outcomes from the reduced data should be essentially the same as those
from the original data.
Consider that you have chosen some data for analysis from ABC Company’s data
warehouse. The data set will probably be enormous! Large-scale complex data
analysis and mining can be time-consuming, rendering such a study impractical
or unfeasible. Techniques for data reduction can be applied to create a condensed
version of the data set that is considerably smaller while meticulously retaining
the integrity of the original data. In other words, mining the smaller data set
should be more efficient while yielding essentially the same analytical
outcomes. This section begins with an overview of data reduction tactics and
then delves deeper into specific procedures. Data compression, dimensionality
reduction, and numerosity reduction are all methods of data reduction.
a. Dimensionality reduction refers to the process of lowering the number of
random variables or qualities. Principal components analysis and wavelet
transformations are techniques used to reduce data dimensions by
transforming or rescaling the original data. By identifying and eliminating
duplicated, weakly relevant, or irrelevant features or dimensions, attribute
subset selection is a technique for dimensionality reduction.
b. Numerosity reduction strategies substitute different, more compact forms
of data representation for the original data volume. Both parametric and non-
parametric approaches are available. In parametric techniques, a model is
employed to estimate the data, which frequently necessitates the
maintenance of only the data parameters rather than the actual data. (Outliers
may also be stored.) Examples include log-linear models and regression.
Nonparametric methods include the use of histograms, clustering, sampling,
and data cube aggregation to store condensed versions of the data.
c. Transformations are used in data compression to create a condensed or
"compressed" version of the original data. Lossless data compression is used
when the original data can be recovered from the compressed data without
any information being lost. Alternatively, lossy data reduction is employed
when we can only precisely retrieve a fraction of the original data. There are
a number of lossless string compression algorithms; however, they typically
permit only a small amount of data manipulation. Techniques for reducing
numerosity and dimensions can also be categorized as data compression
methods.
There are also other ways of organizing data reduction techniques. In any case,
the time saved by analysing a smaller data set should not be "erased" or
outweighed by the computational effort required for the data reduction itself.
Data Discretization: It is regarded as a component of data reduction. Raw
numerical values are replaced by interval or concept labels; that is, data
discretization transforms numerical data by converting values to such labels.
These techniques enable data analysis at various levels of granularity by
automatically generating concept hierarchies for the data. Binning, histogram
analysis, decision tree analysis, cluster analysis, and correlation analysis are
examples of discretization techniques. Concept hierarchies for nominal data
may be produced based on the definitions of the schema and the distinct
attribute values of every attribute.
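
As a concrete illustration of binning-based discretization, the following minimal pandas sketch (with invented ages and hypothetical cut-off points) replaces raw numeric values by interval labels and by concept labels.

    import pandas as pd

    ages = pd.Series([5, 13, 22, 34, 47, 58, 66, 71], name="age")

    # Interval labels for each value
    intervals = pd.cut(ages, bins=[0, 10, 20, 40, 60, 100])

    # Concept labels forming a simple concept hierarchy
    concepts = pd.cut(ages, bins=[0, 18, 60, 100],
                      labels=["youth", "adult", "senior"])
    print(pd.concat([ages, intervals, concepts], axis=1))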
3.3.4 Data Transformation

This procedure is used to change the data into formats that are suited to the
analytical process, i.e., the data is transformed or consolidated into
analysis-ready forms. The following are some data transformation strategies
(a short sketch of aggregation and normalization follows the list):
a. Smoothing, which attempts to reduce noise in the data. Binning, regression,
and clustering are some of the methods.
b. Attribute construction (or feature construction), wherein, in order to aid
the analysis process, additional attributes are constructed and added from the
set of attributes provided.
c. Aggregation, where data is subjected to aggregation or summary procedures
to calculate monthly and yearly totals; for instance, the daily sales data may
be combined to produce monthly or yearly sales. This process is often used
to build a data cube for data analysis at different levels of abstraction.
d. Normalization, where the attribute data is resized to fit a narrower range:
−1.0 to 1.0; or 0.0 to 1.0.
e. Discretization, where interval labels replace the raw values of a numeric
attribute (e.g., age) (e.g., 0–10, 11–20, etc.) or conceptual labels (e.g., youth,
adult, senior). A concept hierarchy for the number attribute can then be
created by recursively organizing the labels into higher-level concepts. To
meet the demands of different users, more than one concept hierarchy might
be built for the same characteristic.
f. Concept hierarchy creation using nominal data allows for the
extrapolation of higher-level concepts like a street to concepts like a city or
country. At the schema definition level, numerous hierarchies for nominal
qualities can be automatically created and are implicit in the database
structure.
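
A minimal pandas sketch of aggregation (daily sales rolled up to monthly totals) and min-max normalization to the range 0.0 to 1.0; the figures below are invented.

    import pandas as pd

    daily = pd.DataFrame({
        "date": pd.date_range("2023-01-01", periods=6, freq="D"),
        "sales": [120.0, 90.0, 150.0, 200.0, 170.0, 60.0],
    })

    # Aggregation: summarize daily sales into monthly totals
    monthly = daily.resample("M", on="date")["sales"].sum()

    # Normalization: min-max rescaling of the sales attribute to [0.0, 1.0]
    s = daily["sales"]
    daily["sales_scaled"] = (s - s.min()) / (s.max() - s.min())
    print(monthly)
    print(daily)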

Check Your Progress 1:


1. What is meant by data preprocessing?

2. Why is preprocessing important?

3. What are the 5 characteristics of data processing?

4. What are the 5 major steps of data preprocessing?

5. What is data cleaning?

6. What is the importance of data cleaning?

7. What are the main steps of Data Cleaning?

3.4 SELECTION AND DATA EXTRACTION

The process of choosing the best data source, data type, and collection tools is
known as data selection. Prior to starting the actual data collection procedure,
data selection is conducted. This concept makes a distinction between selective
data reporting (excluding data that is not supportive of a study premise) and
active/interactive data selection (using obtained data for monitoring
activities/events or conducting secondary data analysis). Data integrity may be
impacted by how acceptable data are selected for a research project.

The main goal of data selection is to choose the proper data type, source, and
tool that enables researchers to effectively solve research issues. This decision
typically depends on the discipline and is influenced by the research that has
already been done, the body of that research, and the availability of the data
sources.

Integrity issues may arise when decisions about which "appropriate" data to
collect are centred primarily on cost and convenience rather than on the
data's ability to successfully address the research questions. Cost and
convenience are unquestionably important variables to consider while making a
decision; however, researchers should consider how much these factors may
skew the results of their study.

Data Selection Issues


When choosing data, researchers should be conscious of a few things, including:
• Researchers can appropriately respond to the stated research questions when
the right type and appropriate data sources are used.
• Appropriate methods for obtaining a representative sample are used.
• The appropriate tools for gathering data are used. It is difficult to separate
the choice of data type and source from the tools used to get the data. The
type/source of data and the procedures used to collect it should be
compatible.

Types and Sources of Data: Different data sources and types can be displayed in
a variety of ways. There are two main categories of data:
• Quantitative data are expressed as numerical measurements at the interval
and ratio levels.
• Qualitative data can take the form of text, images, music, and video.

Although preferences within scientific disciplines differ as to which type of data
is preferred, some researchers employ information from quantitative and
qualitative sources to comprehend a certain event more thoroughly.
Researchers get data from people that may be qualitative (such as by studying
child-rearing techniques) or quantitative (biochemical recording markers and
anthropometric measurements). Field notes, journals, laboratory notes,
specimens, and firsthand observations of people, animals, and plants can all be
used as data sources. Data type and source interactions happen frequently.
Choosing the right data is discipline-specific and primarily influenced by the
investigation's purpose, the body of prior research, and the availability of data
sources. The following list of questions will help you choose the right data type
and sources:
• What is the research question?
• What is the investigation's field of study? (This establishes the guidelines for
any investigation; the selected data should not go beyond what is necessary for
the investigation.)
• According to the literature (prior research), what kind of information should
be gathered?
• Which form of data—qualitative, quantitative, or a combination of both—
should be considered?
Data extraction is the process of gathering or obtaining many forms of data
from numerous sources, many of which may be erratically organized or wholly
unstructured. Consolidating, processing and refining data enable information to
be kept in a centralized area so that it can be altered. These venues could be
local, online, or a combination of the two.
Data Extraction and ETL
To put the importance of data extraction into perspective, it is helpful to quickly
assess the ETL process as a whole.
1. Extraction: One or more sources or systems are used to collect the data.
Relevant data is located, identified, and then prepared for processing or
transformation during the extraction phase. One can finally analyse them for
business knowledge by combining various data types through extraction.
2. Transformation: It can be further refined after the data has been effectively
extracted. Data is cleansed, sorted, and structured as part of the
transformation process. Duplicate entries will be removed, missing values
will be filled in or removed, and audits will be performed, for example, in
order to offer data that is reliable, consistent, and usable.
3. Loading: The high-quality, converted data is subsequently sent to a single,
centralized target location for storage and analysis.
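
A compact sketch of the three ETL stages using pandas and SQLite; the file name, table name, and column names below are assumptions made only for illustration.

    import sqlite3
    import pandas as pd

    # 1. Extraction: collect data from a (hypothetical) CSV source
    raw = pd.read_csv("orders_export.csv")

    # 2. Transformation: deduplicate, drop incomplete rows and fix data types
    clean = (raw.drop_duplicates()
                .dropna(subset=["order_id"])
                .assign(order_date=lambda d: pd.to_datetime(d["order_date"])))

    # 3. Loading: write the refined data to a single, centralized store
    with sqlite3.connect("warehouse.db") as conn:
        clean.to_sql("orders", conn, if_exists="replace", index=False)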

Data extraction tools


The tools listed below can be used for tasks other than a simple extraction. They
can be grouped into the following categories:
1. ScrapeStorm: One data extraction tool you may consider is ScrapeStorm.
It is software that uses AI to scrape the web or gather data. It is
compatible with Windows, Mac, or Linux operating systems and has a
simple and straightforward visual operation. This program automatically
detects objects like emails, numbers, lists, forms, links, photos, and
prices. When transferring extracted data to Excel, MongoDB, CSV,
HTML, TXT, MySQL, SQL Server, PostgreSQL, Google Sheets, or
WordPress, it can make use of a variety of export strategies.
2. Altair Monarch: Monarch is desktop-based, self-service, and does not
require any programming. It can link to various data sources, including
big data, cloud-based data, and both organized and unstructured data. It
connects to, cleans, and processes data with high speed and no errors. It
employs more than 80 built-in data preparation functions. Less time is
wasted on making data legible so that more time can be spent on creating
higher-level knowledge.
3. Klippa: The processing of contracts, invoices, receipts, and passports can
be done using the cloud with Klippa. For the majority of documents, the
conversion time may be between one and five seconds. The data
classification and manipulation may be done online, round-the-clock, and
supports a variety of file types, including PDF, JPG, and PNG. It can also
convert between JSON, PDF/A, XLSX, CSV, and XML. Additionally,
the software handles file sharing, custom branding, payment processing,
cost management, and invoicing management.
4. NodeXL: NodeXL Basic is a free, open-source add-on extension for
Microsoft Excel 2007, 2010, 2013, and 2016. Since the software is an
add-on, it does not perform data integration; instead, it focuses on social
network analytics. Advanced network analytics, text and sentiment analysis,
and robust report generation are extra capabilities included with NodeXL Pro.

Check Your Progress 2:


1. What is the data selection process?
2. What is Data Extraction and define the term ETL?
3. What are the challenges of data extraction?

3.5 DATA CURATION

Data curation is creating, organizing and managing data sets so that people
looking for information can access and use them. It comprises compiling,
arranging, indexing, and categorizing data for users inside of a company, a
group, or the general public. To support business decisions, academic needs,
scientific research, and other initiatives, data can be curated. Data curation is a
step in the larger data management process that helps prepare data sets for usage
in business intelligence (BI) and analytics applications. In other cases, the
curation process might be fed with ready-made data for ongoing management
and maintenance. In organizations without a dedicated data curator role,
data stewards, data engineers, database administrators, data scientists, or
business users may fill that role.

3.5.1 Steps of data curation

There are numerous tasks involved in curating data sets, which can be divided
into the following main steps.

• Determine the data that will be required for the proposed analytics
applications.
• Map the data sets and note the metadata that goes with them.
• Collect the data sets.
• Ingest the data into a system such as a data lake or a data warehouse.
• Cleanse the data to remove abnormalities, inconsistencies, and
mistakes, including missing values, duplicate records, and spelling
mistakes.
• Model, organize, and transform the data to prepare it for specific
analytics applications.
• Create searchable indexes of the data sets to make them accessible to
users.
• Maintain and manage the data in compliance with the requirements
of ongoing analytics and the laws governing data privacy and
security.
3.5.2 Importance of Data Curation
The following are the reasons for performing data curation.
1. Helps to organize pre-existing data for a corporation: Businesses produce
a large amount of data on a regular basis, however, this data can
occasionally be lacking. When a customer clicks on a website, adds
something to their cart, or completes a transaction, an online clothes
retailer might record that information. Data curators assist businesses in
better understanding vast amounts of information by assembling prior
data into data sets.
2. Connects professionals in different departments: When a company
engages in data curation, it often brings together people from several
departments who might not typically collaborate. Data curators might
collaborate with stakeholders, system designers, data scientists, and data
analysts to collect and transfer information.
3. Ensures high-quality data: High-quality data typically uses organizational
techniques that make it simple to grasp and contains fewer errors.
Because the data curation process entails cleansing the data, curators can
make sure that a company's research and information remain of the highest
calibre. Removing unnecessary information also makes research more
concise, which may facilitate better data set structure.
4. Makes data easy to understand: Data curators make sure there are no
errors and utilize proper formatting. This makes it simpler for specialists
who are not knowledgeable about a research issue to comprehend a data
set.
5. Allows for higher cost and time efficiency: A business may spend more
time and money organizing and distributing data if it does not regularly
employ data curation. Because prior data is already organized and
distributed, businesses that routinely do data curation may be able to save
time, effort, and money. Businesses can reduce the time it takes to obtain
and process data by using data curators, who handle the data.

Check Your Progress 3:

1. What is the Importance of Data Curation?

2. Explain Data Curation.

3. What are the goals of data curation?

4. What are the benefits of data curation?

3.6 DATA INTEGRATION

Data integration creates coherent data storage by combining data from several
sources. Smooth data integration is facilitated by the resolution of semantic
heterogeneity, metadata, correlation analysis, tuple duplicate identification, and
data conflict detection. It is a tactic that combines data from several sources so
that consumers may access it in a single, consistent view that displays their
status. Systems can communicate using flat files, data cubes, or numerous
databases. Data integration is crucial because it maintains data accuracy while
providing a consistent view of dispersed data. It helps the analysis tools extract
valuable information, which in turn helps the executive and management make
tactical choices that will benefit the company.

3.6.1 Data Integration Techniques


Manual Integration-When integrating data, this technique avoids employing
automation. The data analyst gathers, purifies, and integrates the data to create
actionable data. A small business with a modest amount of data can use this
approach. Nevertheless, the extensive, complex, and ongoing integration will
take a lot of time. It takes time because every step of the process must be
performed manually.
Middleware Integration-Data from many sources are combined, normalized, and
stored in the final data set using middleware software. This method is employed
when an organization has to integrate data from historical systems into modern
systems. Software called middleware serves as a translator between antiquated
and modern systems. You could bring an adapter that enables the connection of
two systems with various interfaces. It only works with specific systems.
Application-based integration- To extract, transform, and load data from various
sources, it uses software applications. Although this strategy saves time and
effort, it is a little more difficult because creating such an application requires
technical knowledge.

Uniform Access Integration- This method integrates information from a wider
range of sources. In this instance, however, the data is left in its initial place and
is not moved. To put it simply, this technique produces a unified view of the
combined data. The integrated data does not need to be saved separately because
the end user only sees the integrated view.

3.6.2 Data Integration Approaches


There are two basic data integration approaches. These are –
Tight Coupling- It combines data from many sources into a single physical
location using ETL (Extraction, Transformation, and Loading) tools.
Loose Coupling- With loose coupling, the data remains in the actual source
databases. This method offers an interface that receives a user query, converts
it into a format that the source databases can understand, and transmits the
query directly to the source databases to get the answer.
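
As an illustration of combining two sources into a single coherent view (including resolving an entity identification mismatch between key names), the following pandas sketch uses invented customer data.

    import pandas as pd

    # Two hypothetical sources describing the same customers
    sales = pd.DataFrame({"cust_id": [1, 2, 3],
                          "total_sales": [2500, 1200, 980]})
    support = pd.DataFrame({"customer_no": [1, 2, 4],
                            "open_tickets": [0, 2, 1]})

    # Map the differing key names, then combine both sources into one view
    combined = sales.merge(support, left_on="cust_id",
                           right_on="customer_no", how="outer")
    print(combined)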

Check Your Progress 4:


1. What is meant by data integration?
2. What is an example of data integration?
3. What is the purpose of data integration?

3.7 KNOWLEDGE DISCOVERY


Knowledge discovery in databases (KDD) is the process of obtaining pertinent
knowledge from a body of data. This well-known knowledge discovery
method includes several processes, such as data preparation and selection, data
cleansing, incorporating prior knowledge about the data sets, and interpreting
precise answers from the observed results.
Marketing, fraud detection, telecommunications, and manufacturing are some of
the key KDD application areas. In the last ten years, the KDD process has
reached its pinnacle. Inductive learning, Bayesian statistics, semantic query
optimization, knowledge acquisition for expert systems, and information theory
are just a few of the numerous discovery-related methodologies it now houses.
Extraction of high-level knowledge from low-level data is the ultimate objective.
Due to the accessibility and quantity of data available today, knowledge
discovery is a challenge of astounding importance and necessity. Given how
swiftly the topic has expanded, it is not surprising that professionals and experts
today have access to a wide variety of approaches and tools.

Steps of Knowledge Discovery


1. Developing an understanding of the application domain: Knowledge
discovery starts with this preliminary step. It establishes the framework for
selecting the best course of action for a variety of options, such as
transformation, algorithms, representation, etc. The individuals in charge of a
KDD project need to be aware of the end users' goals as well as the environment
in which the knowledge discovery process will take place.

2. Selecting and producing the data set that will be used for discovery -Once
the objectives have been specified, the data that will be used for the knowledge
discovery process should be identified. Determining what data is accessible,
obtaining essential information, and then combining all the data for knowledge
discovery into one set are the factors that will be considered for the procedure.
Knowledge discovery is important since it extracts knowledge and insight from
the given data. This provides the framework for building the models.
3. Preprocessing and cleansing – This step helps in increasing the data
reliability. It comprises data cleaning, like handling the missing quantities and
removing noise or outliers. In this situation, it might make use of sophisticated
statistical methods or an analysis algorithm. For instance, the goal of the Data
Mining supervised approach may change if it is determined that a certain
attribute is unreliable or has a sizable amount of missing data. After developing
a prediction model for these features, missing data can be forecasted. A variety
of factors affect how much attention is paid to this level. However, breaking
down the components is important and frequently useful for enterprise data
frameworks.
4. Data Transformation-This phase entails creating and getting ready the
necessary data for knowledge discovery. Here, techniques of attribute
transformation (such as discretization of numerical attributes and functional
transformation) and dimension reduction (such as feature selection, feature
extraction, record sampling etc.) are employed. This step, which is frequently
very project-specific, can be important for the success of the KDD project.
Proper transformation results in proper analysis and proper conclusions.
5. Prediction and description- The decisions to use classification, regression,
clustering, or any other method can now be made. Mostly, this uses the KDD
objectives and the decisions made in the earlier phases. A forecast and a
description are two of the main objectives of knowledge discovery. The
visualization aspects are included in descriptive knowledge discovery. Inductive
learning, which generalizes a sufficient number of prepared models to produce
a model either explicitly or implicitly, is used by the majority of knowledge
discovery techniques. The fundamental premise of the inductive technique is
that the prepared model holds true for the examples that follow.
6. Deciding on the knowledge discovery algorithm - Having determined the
overall technique, we now choose the specific algorithm to be applied when
searching for patterns, possibly using several inducers. If precision
and understandability are compared, the former is improved by neural networks,
while decision trees improve the latter. There are numerous ways that each meta-
learning system could be successful. The goal of meta-learning is to explain why
a data analysis algorithm is successful or unsuccessful in solving a particular
problem. As a result, this methodology seeks to comprehend the circumstances
in which a data analysis algorithm is most effective. Every algorithm has
parameters and learning techniques, including tenfold cross-validation or a
different division for training and testing.
7. Utilizing the Data Analysis Algorithm-Finally, the data analysis algorithm
is put into practice. The approach might need to be applied several times before
producing a suitable outcome at this point. For instance, you may re-run the
algorithm after altering parameters such as the minimum number of instances
in a single decision tree leaf.
8. Evaluation-In this stage, the patterns, principles, and dependability of the
results of the knowledge discovery process are assessed and interpreted in light
of the objective outlined in the preceding step. Here, we take into account the
preprocessing steps and how they impact the final results. As an illustration, add
a feature in step 4 and then proceed. The primary considerations in this step are
the understanding and utility of the induced model. In this stage, the identified
knowledge is also documented for later use.
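
The later steps of this process (selection, preprocessing and transformation, choosing and applying an analysis algorithm, and evaluation) can be sketched with scikit-learn; this is only an illustrative pipeline on a built-in sample data set, not a prescribed procedure.

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import accuracy_score

    # Selection: choose the data set for discovery
    X, y = load_iris(return_X_y=True)

    # Preprocessing/transformation: rescale the attributes
    X = StandardScaler().fit_transform(X)

    # Choose and apply a data analysis algorithm (prediction by classification)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
    model = DecisionTreeClassifier(max_leaf_nodes=5).fit(X_tr, y_tr)

    # Evaluation: assess the induced model against the stated objective
    print("accuracy:", accuracy_score(y_te, model.predict(X_te)))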
Check Your Progress 5:
1. What is Knowledge Discovery?
2. What are the Steps involved in Knowledge Discovery?
3. What are knowledge discovery tools?
4. Explain the process of KDD.

3.8 SUMMARY

Despite the development of several methods for preparing data, the intricacy of
the issue and the vast amount of inconsistent or unclean data mean that this field
of study is still very active. This unit gives a general overview of data pre-
processing and describes how to turn raw data into usable information. The
preprocessing of the raw data included data integration, data reduction,
transformation, and discretization. In this unit, we have discussed five different
data-cleaning techniques that can make data more reliable and produce high-
quality results. Building, organizing, and maintaining data sets is known as data
curation. A data curator usually determines the necessary data sets and makes
sure they are gathered, cleaned up, and changed as necessary. The curator is also
in charge of providing users with access to the data sets and information related
to them, such as their metadata and lineage documentation. The primary goal of
the data curator is to make sure users have access to the appropriate data for
analysis and decision-making. Data integration is the procedure of fusing
information from diverse sources into a single, coherent data store. The unit also
introduced knowledge discovery techniques and procedures.

3.9 SOLUTIONS/ANSWERS

Check Your Progress 1:


1. As a part of data preparation, data preprocessing refers to any type of
processing done on raw data to get it ready for a data processing
technique. It has long been regarded as a crucial first stage in the data
mining process.

2. It raises the reliability and accuracy of the data. Preprocessing data can
increase the correctness and quality of a dataset, making it more
dependable by removing missing or inconsistent data values brought by
human or computer mistakes. It ensures consistency in data.

3. Data quality is characterized by five characteristics: correctness,
completeness, reliability, relevance, and timeliness.

4. The five major steps of data preprocessing are:


• Data quality assessment
• Data cleaning
• Data integration
• Data transformation
• Data reduction

5. The practice of correcting or deleting inaccurate, damaged, improperly


formatted, duplicate, or incomplete data from a dataset is known as data
cleaning. There are numerous ways for data to be duplicated or
incorrectly categorized when merging multiple data sources.

6. Data cleansing, sometimes referred to as data cleaning or scrubbing, is


the process of locating and eliminating mistakes, duplication, and
irrelevant data from a raw dataset. Data cleansing, which is a step in the
preparation of data, ensures that the cleaned data is used to create
accurate, tenable visualizations, models, and business choices.

7. Step 1: Remove irrelevant data; Step 2: Deduplicate your data; Step 3:


Fix structural errors; Step 4: Deal with missing data; Step 5: Filter out
data outliers; Step 6: Validate your data.

Check Your Progress 2:


1. The process of retrieving data from the database that are pertinent to
the analysis activity is known as data selection. Sometimes the data
selection process comes before data transformation and consolidation.
2. Data extraction is the process of gathering or obtaining many forms of
data from numerous sources, many of which may be erratically
organized or wholly unstructured. The process of extracting,
transforming, and loading data is called ETL. Thus, ETL integrates
information from several data sources into a single consistent data
store, which can be a data warehouse or data analytics system.
3. The cost and time involved in extracting data, as well as the accuracy
of the data, are obstacles. The correctness of the data depends on the
quality of the data source, which can be an expensive and time-
consuming procedure.

Check Your Progress 3:


1. It entails gathering, organizing, indexing, and cataloguing information
for users within an organization, a group, or the wider public. Data
curation can help with academic needs, commercial decisions,
scientific research, and other endeavors.
2. The process of producing, arranging, and managing data sets so that
users who are looking for information can access and use them is
known as data curation. Data must be gathered, organized, indexed, and
catalogued for users within an organization, group, or the broader
public.
3. By gathering pertinent data into organized, searchable data assets, data
curation's overarching goal is to speed up the process of extracting
insights from raw data.
4. The benefits of data curation are:
o Easily discover and use data.
o Ensure data quality.
o Maintain metadata linked with data.
o Ensure compliance through data lineage and classification.

Check Your Progress 4:


1. Data integration is used to bring together data from several sources to
give people a single perspective. Making data more readily available,
easier to consume, and easier to use by systems and users is the
foundation of data integration.
2. In the case of customer data integration, information about each
customer is extracted from several business systems, such as sales,
accounts, and marketing, and combined into a single picture of the
client for use in customer service, reporting, and analysis.
3. Data integration combines data collected from various platforms to
increase its value for your company. It enables your staff to collaborate
more effectively and provide more for your clients. You cannot access
the data collected in different systems without data integration.

Check Your Progress 5:


1. Knowledge discovery is the labour-intensive process of extracting
implicit, previously unknown, and potentially useful information from
databases.
2. Steps of Knowledge Discovery:
• Developing an understanding of the application domain
• Selecting and producing the data set that will be used for the discovery
• Preprocessing and cleansing
• Data Transformation
• Prediction and description
• Deciding on a data analysis algorithm
• Utilizing the data analysis algorithm
• Evaluation
3. The process can benefit from a variety of qualitative and quantitative
methods and techniques, such as knowledge surveys, questionnaires,
one-on-one and group interviews, focus groups, network analysis, and
observation. It can be used to locate communities and specialists.
4. Knowledge Discovery from Data, often known as KDD, is another
commonly used phrase that is treated as a synonym for data mining.
Others see data mining as just a crucial stage in the knowledge
discovery process when intelligent techniques are used to extract data
patterns. The steps involved in knowledge discovery from data are as
follows:
• Data cleaning (to remove noise or irrelevant data).
• Data integration (where multiple data sources may be combined).
• Data selection (where data relevant to the analysis task are retrieved
from the database).
• Data transformation (where data are consolidated into forms
appropriate for mining by performing summary or aggregation
functions, for example).
• Data mining (an important process where intelligent methods are
applied in order to extract data patterns).
• Pattern evaluation (to identify the fascinating patterns representing
knowledge based on some interestingness measures).
• Knowledge presentation (where knowledge representation and
visualization techniques are used to present the mined knowledge to
the user).

3.10 FURTHER READINGS

References

Data Preprocessing in Data Mining - GeeksforGeeks. (2019, March 12). GeeksforGeeks;


GeeksforGeeks. https://www.geeksforgeeks.org/data-preprocessing-in-data-mining/

Data Cleaning in Data Mining - Javatpoint. (n.d.). Www.Javatpoint.Com. Retrieved February 11,
2023, from https://www.javatpoint.com/data-cleaning-in-data-mining

Data Integration in Data Mining - GeeksforGeeks. (2019, June 27). GeeksforGeeks;


GeeksforGeeks. https://www.geeksforgeeks.org/data-integration-in-data-mining/

Dowd, R., Recker, R.R., Heaney, R.P. (2000). Study subjects and ordinary patients. Osteoporos
Int. 11(6): 533-6.

Fourcroy, J.L. (1994). Women and the development of drugs: why can’t a woman be more like a
man? Ann N Y Acad Sci, 736:174-95.

Goehring, C., Perrier, A., Morabia, A. (2004). Spectrum Bias: a quantitative and graphical analysis
of the variability of medical diagnostic test performance. Statistics in Medicine, 23(1):125-35.

Gurwitz,J.H., Col. N.F., Avorn, J. (1992). The exclusion of the elderly and women from clinical
trials in acute myocardial infarction. JAMA, 268(11): 1417-22.

Hartt, J., Waller, G. (2002). Child abuse, dissociation, and core beliefs in bulimic disorders. Child
Abuse Negl. 26(9): 923-38.

Kahn, K.S, Khan, S.F, Nwosu, C.R, Arnott, N, Chien, P.F.(1999). Misleading authors’ inferences
in obstetric diagnostic test literature. American Journal of Obstetrics and Gynaecology., 181(1`),
112-5.

KDD Process in Data Mining - GeeksforGeeks. (2018, June 11). GeeksforGeeks;


GeeksforGeeks. https://www.geeksforgeeks.org/kdd-process-in-data-mining/

Maynard, C., Selker, H.P., Beshansky, J.R.., Griffith, J.L., Schmid, C.H., Califf, R.M., D’Agostino,
R.B., Laks, M.M., Lee, K.L., Wagner, G.S., et al. (1995). The exclusions of women from clinical
trials of thrombolytic therapy: implications for developing the thrombolytic predictive instrument
database. Med Decis Making (Medical Decision making: an international journal of the Society
for Medical Decision Making), 15(1): 38-43.

Pratt, M. K. (2022, January 31). What is Data Curation? - Definition from


SearchBusinessAnalytics. Business Analytics; TechTarget.
https://www.techtarget.com/searchbusinessanalytics/definition/data-curation

Robinson, D., Woerner, M.G., Pollack, S., Lerner, G. (1996). Subject selection bias in clinical:
data from a multicenter schizophrenia treatment center. Journal of Clinical Psychopharmacology,
16(2): 170-6.

Sharpe, N. (2002). Clinical trials and the real world: selection bias and generalisability of trial
results. Cardiovascular Drugs and Therapy, 16(1): 75-7.

Walter, S.D., Irwig, L., Glasziou, P.P. (1999). Meta-analysis of diagnostic tests with imperfect
reference standards. J Clin Epidemiol., 52(10): 943-51.

What is Data Extraction? Definition and Examples | Talend. (n.d.). Talend - A Leader in Data
Integration & Data Integrity. Retrieved February 11, 2023, from
https://www.talend.com/resources/data-extraction-defined/

Whitney, C.W., Lind, B.K., Wahl, P.W. (1998). Quality assurance and quality control in
longitudinal studies. Epidemiologic Reviews, 20(1): 71-80.

UNIT 4: DATA VISUALIZATION AND
INTERPRETATION
Structure
4.0 Introduction
4.1 Objectives
4.2 Different types of plots
4.3 Histograms
4.4 Box plots
4.5 Scatter plots
4.6 Heat map
4.7 Bubble chart
4.8 Bar chart
4.9 Distribution plot
4.10 Pair plot
4.11 Line graph
4.12 Pie chart
4.13 Doughnut chart
4.14 Area chart
4.15 Summary
4.16 Answers
4.17 References

4.0 INTRODUCTION
The previous units of this course cover different aspects of data analysis, including the
basics of data science, basic statistical concepts related to data science, and data
pre-processing. This unit explains the different types of plots used for data visualization
and interpretation, discusses how they are constructed, and describes the use cases
associated with each of them. It will help you to appreciate the real-world need for a
workforce trained in visualization techniques and to design, develop, and interpret
visual representations of data. The unit also outlines the best practices associated with
the construction of different types of plots.

4.1 OBJECTIVES
After going through this unit, you will be able to:
• Explain the key characteristics of various types of plots for data visualization;
• Explain how to design and create data visualizations;
• Summarize and present the data in meaningful ways;
• Define appropriate methods for collecting, analysing, and interpreting the
numerical information.

4.2 DIFFERENT TYPES OF PLOTS


As more and more data become available to us, there are many more varieties of charts
and graphs than before. In fact, the amount of data that we produce, acquire, copy,
and use is expected to nearly double by 2025. Data visualisation is therefore crucial and
serves as a powerful tool for organisations. One can benefit from graphs and charts in
the following ways:
• Encouraging the group to act proactively.
• Showcasing progress toward the goal to the stakeholders
• Displaying core values of a company or an organization to the audience.

Moreover, data visualisation can bring heterogeneous teams together around new
objectives and foster the trust among the team members. Let us discuss about various
graphs and charts that can be utilized in expression of various aspects of businesses.

4.3 HISTOGRAMS
A histogram visualises the distribution of data across distinct groups with continuous
classes. It is represented with set of rectangular bars with widths equal to the class
intervals and areas proportional to frequencies in the respective classes. A histogram
may hence be defined as a graphic of a frequency distribution that is grouped and has
continuous classes. It provides an estimate of the distribution of values, their extremes,
and the presence of any gaps or out-of-the-ordinary numbers. They are useful in
providing a basic understanding of the probability distribution.

Constructing a Histogram: To construct a histogram, the data is grouped into specific


class intervals, or “bins” and plotted along the x-axis. These represent the range of the
data. Then, the rectangles are constructed with their bases along the intervals for each
class. The height of these rectangles is measured along the y-axis representing the
frequency for each class interval. It's important to remember that in these
representations, every rectangle is next to another because the base spans the spaces
between class boundaries.
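
A minimal matplotlib sketch of constructing a histogram; the wing-length values and the bin edges below are illustrative only (compare the housefly example later in this section).

    import matplotlib.pyplot as plt

    # Hypothetical wing lengths (in 1/10 mm)
    wing_lengths = [37, 38, 39, 41, 42, 43, 44, 45, 45, 46,
                    47, 47, 48, 49, 50, 51, 52, 53, 54, 55]

    # Class intervals ("bins") go on the x-axis; frequencies on the y-axis
    plt.hist(wing_lengths, bins=range(36, 58, 2), edgecolor="black")
    plt.xlabel("Wing length (1/10 mm)")
    plt.ylabel("Frequency")
    plt.title("Histogram of housefly wing lengths")
    plt.show()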

Use Cases: When it is necessary to illustrate or compare the distribution of specific


numerical data across several ranges of intervals, histograms can be employed. They
can aid in visualising the key meanings and patterns associated with a lot of data. They
may help a business or organization in decision-making process. Some of the use cases
of histograms include-

• Distribution of salaries in an organisation


• Distribution of height in one batch of students of a class, student performance
on an exam,
• Customers by company size, or the frequency of a product problem.
Best Practices

• Analyse various data groups: The best data groupings can be found by
creating a variety of histograms.
• Break down compartments using colour: The same chart can display a
second set of categories by colouring the bars that represent each category.
Types of Histogram
Normal distribution: In a normal distribution, points are equally likely to occur on
either side of the mean, giving the familiar symmetric, bell-shaped histogram.

Example: Consider the following bins, which show the frequency of housefly wing
lengths measured in tenths of a millimetre (1/10 mm).

Bin Frequency Bin Frequency


36-38 2 46-48 19
38-40 4 48-50 15
40-42 10 50-52 10
42-44 15 52-54 4
44-46 19 54-56 2

Bimodal Distribution: This distribution has two peaks. In the case of a bimodal
distribution, the data must be segmented before being analysed as normal distributions
in their own right.
Example:

Variable Frequency
0 2
1 6
2 4
3 2
4 4
5 6
6 4

(Chart: Bimodal distribution, frequency plotted against the variable, showing two peaks.)

Right-skewed distribution: A distribution that is skewed to the right is sometimes


referred to as a positively skewed distribution. A right-skewed distribution is one that
has a greater percentage of data values on the left and a lesser percentage on the right.
Whenever the data have a range boundary on the left side of the histogram, a right-
skewed distribution is frequently the result.
Example:
Left-skewed distribution: A distribution that is skewed to the left is sometimes
referred to as a negatively skewed distribution. A left-skewed distribution has a
greater proportion of data values on the right side and a lesser proportion on the
left. When the data have a range limit on the right side of the histogram, a
left-skewed distribution commonly results. An alternative name for this is a
left-tailed distribution.
Example:

A random distribution: A random distribution is characterised by the absence of any


clear pattern and the presence of several peaks. When constructing a histogram using a
random distribution, it is possible that several distinct data attributes will be blended
into one. As a result, the data ought to be partitioned and investigated independently.
Example:

Edge Peak Distribution: When there is an additional peak at the edge of the
distribution that does not belong there, this type of distribution is called an edge peak
distribution. Unless you are certain that your data set is expected to contain such
outliers (e.g., a few extreme responses on a survey), this almost always indicates that
you have plotted or collected your data incorrectly.
Comb Distribution: Because the distribution seems to resemble a comb, with
alternating high and low peaks, this type of distribution is given the name "comb
distribution." Rounding off an object might result in it having a comb-like form. For
instance, if you are measuring the height of the water to the nearest 10 centimetres but
your class width for the histogram is only 5 centimetres, you may end up with a comb-
like appearance.

Example
Histogram for the population data of a group of 86 people:

Age Group (in years) Population Size


20-25 23
26-30 18
31-35 15
36-40 6
41-45 11
46-50 13
TOTAL 86
(Chart: Histogram of population size (frequency) against the age-group bins for the 86 people in the table above.)

Check Your Progress 1


1. What is the difference between a Bar Graph and a Histogram?
……………………………………………………………………………………

……………………………………………………………………………………

2. Draw a Histogram for the following data:

Class Interval Frequency


0 − 10 35
10 − 20 70
20 − 30 20
30 − 40 40
40 − 50 50

3. Why is histogram used?


……………………………………………………………………………………

……………………………………………………………………………………
4. What do histograms show?
………………………………………………………………………………………
………………………………………………………………………………………

4.4 BOX PLOTS


When displaying data distributions using the five essential summary statistics of
minimum, first quartile, median, third quartile, and maximum, box-and-whisker plots,
also known as boxplots, are widely employed. It is a visual depiction of data that aids
in determining how widely distributed or how much the data values change. These
boxplots make it simple to compare the distributions since it makes the centre, spread,
and overall range understandable. They are utilised for data analysis wherein the
graphical representations are used to determine the following:
1. Shape of Distribution
2. Central Value
3. Variability of Data
Constructing a Boxplot: The two components of the graphic are described by their
names: the box, which shows the median value of data along with the first and third
quartiles (25 percentile and 75 percentile), and the whiskers, which shows the remaining
data. The 3rd quartile's difference from the first quartile of data is called the interquartile
range. The maximum and minimum points in the data can also be displayed using the
whiskers. Points beyond 1.5 × the interquartile range can be identified as suspected
outliers.
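
A minimal matplotlib sketch of a boxplot, using marks similar to those in the example table later in this section; the values are illustrative only.

    import matplotlib.pyplot as plt

    # Hypothetical marks of three sections in one subject
    section_a = [59, 96, 78, 96, 65, 78, 68, 96, 85, 93]
    section_b = [65, 73, 57, 79, 55, 65, 61, 98, 63, 88]
    section_c = [82, 66, 81, 73, 94, 56, 85, 56, 85, 68]

    # Each box shows the median and the first and third quartiles; whiskers
    # extend to points within 1.5 x IQR, and anything beyond is drawn as a
    # suspected outlier
    plt.boxplot([section_a, section_b, section_c])
    plt.xticks([1, 2, 3], ["Section A", "Section B", "Section C"])
    plt.ylabel("Marks")
    plt.show()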

Use Cases: A boxplot is frequently used to demonstrate whether a distribution is


skewed and whether the data set contains any potential outliers, or odd observations.
Boxplots are also very useful when comparing distributions or working with large data sets. Examples of box
plots include plotting the:
• Gas efficiency of vehicles
• Time spent reading across readers
Best Practices
• Cover the points within the box: This aids the viewer in concentrating on the
outliers.
• Box plot comparisons between categorical dimensions: Box plots are
excellent for quickly comparing dataset distributions.
Example

Subject Section A Section B Section C


English 59 65 82
Math 96 73 66
Science 78 57 81
Economics 96 79 73
English 65 55 94
Math 78 65 56
Science 68 61 85
Economics 96 98 56
English 85 63 85
Math 93 88 68
Science 94 66 94
Economics 67 59 86
English 82 66 96
Math 64 79 63
Science 55 90 97
Economics 73 89 95
English 89 66 75
Math 57 81 73
Science 67 92 88
Economics 78 65 69
The boxplots clearly show that Section B has performed poorly in English, whereas
Section C has performed poorly in Maths. Section A has a mostly balanced performance,
but the marks of its students are the most dispersed.
Check Your Progress 2
1. How to correctly interpret a boxplot?
……………………………………………………………………………………

……………………………………………………………………………………
2. What are the most important parts of a box plot?
……………………………………………………………………………………

……………………………………………………………………………………

3. What is the uses of box plot?


………………………………………………………………………………………………

………………………………………………………………………………

4. How do you describe the distribution of a box plot?


……………………………………………………………………………………...
……………………………………………………………………………………...

4.5 SCATTER PLOTS

Scatter plot is the most commonly used chart when observing the relationship between
two quantitative variables. It works particularly well for quickly identifying possible
correlations between different data points. The relationship between multiple variables
can be efficiently studied using scatter plots, which show whether one variable is a good
predictor of another or whether they normally fluctuate independently. Multiple distinct
data points are shown on a single graph in a scatter plot. Following that, the chart can
be enhanced with analytics like trend lines or cluster analysis. It is especially useful for
quickly identifying potential correlations between data points.

Constructing a Scatter Plot: Scatter plots are mathematical diagrams or plots that rely
on Cartesian coordinates. In this type of graph, the categories being compared are
represented by the circles on the graph (shown by the colour of the circles) and the
numerical volume of the data (indicated by the circle size). One colour on the graph
allows you to represent two values for two variables related to a data set, but two colours
can also be used to include a third variable.
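
A minimal matplotlib sketch of a scatter plot of two quantitative variables; the temperature and sales values are illustrative (a small subset of the example that follows).

    import matplotlib.pyplot as plt

    # Hypothetical observations of two quantitative variables
    temperature = [17, 22, 27, 29, 31, 35, 38, 42, 45]
    sales = [1750, 1500, 2667, 2718, 3681, 3057, 3500, 3984, 5365]

    plt.scatter(temperature, sales)
    plt.xlabel("Temperature (deg C)")
    plt.ylabel("Sale of ice-cream (Rs.)")
    plt.title("Sales of ice-cream against temperature")
    plt.show()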

Use Cases: Scatter charts are great in scenarios where you want to display both
distribution and the relationship between two variables.
• Display the relationship between time-on-platform (How Much Time Do
People Spend on Social Media) and churn (the number of people who stopped
being customers during a set period of time).
• Display the relationship between salary and years spent at company
Best Practices
• Analyze clusters to find segments: Based on your chosen variables, cluster
analysis divides up the data points into discrete parts.
• Employ highlight actions: You can rapidly identify which points in your
scatter plots share characteristics by adding a highlight action, all the while
keeping an eye on the rest of the dataset.
• Customize individual marks: Add a simple visual cue to your graph that makes
it easy to distinguish between various point groups.

Example

Temperature (in deg C) Sale of Ice-Cream


17 ₹ 1,750.00
18 ₹ 1,603.00
22 ₹ 1,500.00
29 ₹ 2,718.00
27 ₹ 2,667.00
28 ₹ 3,422.00
31 ₹ 3,681.00
23 ₹ 2,734.00
24 ₹ 2,575.00
25 ₹ 2,869.00
35 ₹ 3,057.00
36 ₹ 3,846.00
38 ₹ 3,500.00
41 ₹ 3,496.00
42 ₹ 3,984.00
29 ₹ 4,109.00
39 ₹ 5,336.00
35 ₹ 5,197.00
42 ₹ 5,426.00
45 ₹ 5,365.00
(Chart: Scatter plot of the sale of ice-cream against temperature in degree C.)

Please note that a linear trendline has been fitted to the scatter plot, indicating that the
sales of ice-cream increase with temperature.
Check Your Progress 3

1. What are the characteristics of a scatter plot?


……………………………………………………………………………………

……………………………………………………………………………………

2. What components make up a scatter plot?

………………………………………………………………………………………
………………………………………………………………………………………

3. What is the purpose of a scatter plot?

……………………………………………………………………………………

……………………………………………………………………………………
4. What are the 3 types of correlations that can be inferred from scatter plots?
……………………………………………………………………………………

……………………………………………………………………………………

4.6 HEAT MAP

Heatmaps are two-dimensional graphics that show data trends through colour
shading. They are an example of part to whole chart in which values are represented
using colours. A basic heat map offers a quick visual representation of the data. A
user can comprehend complex data sets with the help of more intricate heat maps.
Heat maps can be presented in a variety of ways, but they all have one thing in
common: they all make use of colour to convey correlations between data
values. Heat maps are more frequently utilised to present a more comprehensive
view of massive amounts of data. It is especially helpful because colours are simpler
to understand and identify than plain numbers.

Heat maps are highly flexible and effective at highlighting trends. Heatmaps are
naturally self-explanatory, in contrast to other data visualisations that require
interpretation. The greater the quantity/volume, the deeper the colour (the higher
the value, the tighter the dispersion, etc.). Heat Maps dramatically improve the
ability of existing data visualisations to quickly convey important data insights.

Use Cases: Heat Maps are primarily used to better show the enormous amounts of
data contained inside a dataset and help guide users to the parts of data
visualisations that matter most.
• Average monthly temperatures across the years
• Departments with the highest amount of attrition over time.
• Traffic across a website or a product page.
• Population density/spread in a geographical location.
Best Practices
• Select the proper colour scheme: This style of chart relies heavily on
colour, therefore it's important to pick a colour scheme that complements
the data.
• Specify a legend: As a related point, a heatmap must typically contain a
legend describing how the colours correspond to numerical values.

Example
Region-wise monthly sale of a SKU (stock-keeping unit)
MONTH
ZONE JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC
NORTH 75 84 61 95 77 82 74 92 58 90 54 83
SOUTH 50 67 89 61 91 77 80 72 82 78 58 63
EAST 62 50 83 95 83 89 72 96 96 81 86 82
WEST 69 73 59 73 57 61 58 60 97 55 81 92

The distribution of sales is shown in the sample heatmap above, broken down by
zone and spanning a 12-month period. Like in a typical data table, each cell displays
a numeric count, but the count is also accompanied by a colour, with higher counts
denoting deeper hues.
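A minimal Python sketch of how the above heat map could be drawn, assuming the pandas, seaborn and matplotlib libraries; the colour map and the colour-bar label are illustrative assumptions.

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

months = ["JAN", "FEB", "MAR", "APR", "MAY", "JUN",
          "JUL", "AUG", "SEP", "OCT", "NOV", "DEC"]
data = {
    "NORTH": [75, 84, 61, 95, 77, 82, 74, 92, 58, 90, 54, 83],
    "SOUTH": [50, 67, 89, 61, 91, 77, 80, 72, 82, 78, 58, 63],
    "EAST":  [62, 50, 83, 95, 83, 89, 72, 96, 96, 81, 86, 82],
    "WEST":  [69, 73, 59, 73, 57, 61, 58, 60, 97, 55, 81, 92],
}
sales = pd.DataFrame(data, index=months).T   # zones as rows, months as columns

# annot=True prints the numeric count in each cell; darker cells mean higher counts
sns.heatmap(sales, annot=True, fmt="d", cmap="Blues", cbar_kws={"label": "Sales"})
plt.title("Region-wise monthly sale of a SKU")
plt.show()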

Check Your Progress 4

1. What type of input is needed for a heat map?


……………………………………………………………………………………

……………………………………………………………………………………
2. What kind of information does a heat map display?
……………………………………………………………………………………

……………………………………………………………………………………
3. What can be seen in heatmap?
……………………………………………………………………………………

……………………………………………………………………………………

4.7 BUBBLE CHART

Bubble diagrams are used to show the relationships between different variables. They
are frequently used to represent data points in three dimensions, specifically when the
bubble size, y-axis, and x-axis are all present. Using location and size, bubble charts
demonstrate relationships between data points. However, bubble charts have a restricted
data size capability since too many bubbles can make the chart difficult to read.
Although technically not a separate type of visualisation, bubbles can be used to show
the relationship between three or more measurements in scatter plots or maps by adding
complexity. By altering the size and colour of circles, large amounts of data are
presented concurrently in visually pleasing charts.

Constructing a Bubble Chart: For each observation of a pair of numerical variables


(A, B), a bubble or disc is drawn and placed in a Cartesian coordinate system
horizontally according to the value of variable A and vertically according to the value
of variable B. The area of the bubble serves as a representation for a third numerical
variable (C). Using various colours in various bubbles, you may even add a fourth
dataset (D: numerical or categorical).
By using location and proportions, bubble charts are frequently used to compare and illustrate the relationships between categorised data points. The overall image of a bubble chart can be utilised to look for patterns and relationships.

Use Cases: Usually, the positioning and ratios of the size of the bubbles/circles on this
chart are used to compare and show correlations between variables. Additionally, it is
utilised to spot trends and patterns in data.
• AdWords’ analysis: CPC vs Conversions vs share of total conversions
• Relationship between life expectancy, GDP per capita and population size
Best Practices:
• Add colour: A bubble chart can gain extra depth by using colour.
• Set bubble size in appropriate proportion.
• Overlay bubbles on maps: From bubbles, a viewer can immediately determine
the relative concentration of data. These are used as an overlay to provide the
viewer with context for geographically-related data.
Example

Item Code Units Sold Sales (in Rs.) Profit %


PC001 325 ₹ 14,687.00 22%
PC002 1130 ₹ 16,019.00 18%
PC003 645 ₹ 16,100.00 25%
PC004 832 ₹ 12,356.00 9%
PC005 1200 ₹ 21,500.00 32%
PC006 925 ₹ 16,669.00 21%
PC007 528 ₹ 13,493.00 13%
PC008 750 ₹ 18,534.00 14%
PC009 432 ₹ 13,768.00 6%
PC0010 903 ₹ 22,043.00 11%

The three variables in this example are sales, profits, and the number of units sold.
Therefore, all three variables and their relationship can be displayed simultaneously
using a bubble chart.
Figure: Bubble chart of Sales (in INR, y-axis) versus the Number of units sold (x-axis), with the size of each bubble representing Profit %.
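A minimal Python sketch of the bubble chart above, assuming matplotlib; the scaling factor used to turn Profit % into a bubble area is an arbitrary choice made only for readability.

import matplotlib.pyplot as plt

units_sold = [325, 1130, 645, 832, 1200, 925, 528, 750, 432, 903]
sales = [14687, 16019, 16100, 12356, 21500, 16669, 13493, 18534, 13768, 22043]
profit_pct = [22, 18, 25, 9, 32, 21, 13, 14, 6, 11]

# The s= argument sets the marker area; scale profit % so bubbles stay visible
sizes = [p * 30 for p in profit_pct]
plt.scatter(units_sold, sales, s=sizes, alpha=0.5, color="teal", edgecolors="black")

plt.xlabel("Number of units sold")
plt.ylabel("Sales (in Rs.)")
plt.title("Sales and Profit versus the Quantity sold")
plt.show()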

Check Your Progress 5

1. What is bubble chart?


……………………………………………………………………………………

……………………………………………………………………………………
2. What is a bubble chart used for?
……………………………………………………………………………………

……………………………………………………………………………………
3. What is the difference between scatter plot and bubble chart?
……………………………………………………………………………………

……………………………………………………………………………………
4. What is bubble size in bubble chart?
……………………………………………………………………………….

……………………………………………………………………………….

4.8 BAR CHART

A bar chart is a graphical depiction of numerical data that uses rectangles (or
bars) with equal widths and varied heights. In the field of statistics, bar charts
are one of the methods for handling data.

Constructing a Bar Chart: The x-axis corresponds to the horizontal line, and
the y-axis corresponds to the vertical line. The y-axis represents frequency in
this graph. Write the names of the data items whose values are to be noted along
the x-axis that is horizontal.
Along the horizontal axis, choose the uniform width of bars and the uniform
gap between the bars. Pick an appropriate scale to go along the y-axis that runs
vertically so that you can figure out how high the bars should be based on the
values that are presented. Determine the heights of the bars using the scale you
selected, then draw the bars using that information.

Types of Bar chart: Bar Charts are mainly classified into two types:
Horizontal Bar Charts: A horizontal bar chart depicts the data as horizontal bars whose lengths are proportional to their respective measures. When using a chart of this type, the categories of the data are indicated on the y-axis.

Example:

Vertical Bar Charts: A vertical bar chart displays vertical bars on graph (chart)
paper. These rectangular bars in a vertical orientation represent the
measurement of the data. The quantities of the variables that are written along
the x-axis are represented by these rectangular bars.

Example:
We can further divide bar charts into two basic categories:

Grouped Bar Charts: The grouped bar graph is also referred to as the clustered
bar graph (graph). It is valuable for at least two separate types of data. The
horizontal (or vertical) bars in this are categorised according to their position.
If, for instance, the bar chart is used to show three groups, each of which has
numerous variables (such as one group having four data values), then different
colours will be used to indicate each value. When there is a close relationship
between two sets of data, each group's colour coding will be the same.

Example:

Stacked Bar Charts: The composite bar chart is also referred to as the stacked
bar chart. It illustrates how the overall bar chart has been broken down into its
component pieces. We utilise bars of varying colours and clear labelling to
determine which category each item belongs to. As a result, in a chart with
stacked bars, each parameter is represented by a single rectangular bar. Multiple
segments, each of a different colour, are displayed within the same bar. The
various components of each separate label are represented by the various
segments of the bar. It is possible to draw it in either the vertical or horizontal
plane.

Example:

Use cases: Bar charts are typically employed to display quantitative data. The
following is a list of some of the applications of the bar chart-
• In order to clearly illustrate the relationships between various variables,
bar charts are typically utilised. When presented in a pictorial format,
the parameters can be more quickly and easily envisioned by the user.
• Bar charts are the quickest and easiest way to display extensive
amounts of data while saving time. It is utilised for studying trends over
extended amounts of time.

Best Practices:
• Use a common zero valued baseline
• Maintain rectangular forms for your bars
• Consider the ordering of category level and use colour wisely.

Example:

Region Sales

East 6,123

West 2,053
South 4,181

North 3,316

Figure: Horizontal bar chart of Sales By Region (East 6,123; South 4,181; North 3,316; West 2,053).
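A minimal Python sketch of the horizontal bar chart above, assuming matplotlib; the colour and the value labels next to each bar are illustrative choices.

import matplotlib.pyplot as plt

regions = ["East", "West", "South", "North"]
sales = [6123, 2053, 4181, 3316]

# barh() draws horizontal bars from a zero-valued baseline
plt.barh(regions, sales, color="cornflowerblue")
plt.xlabel("Sales")
plt.title("Sales By Region")

# Label each bar with its value
for i, value in enumerate(sales):
    plt.text(value + 50, i, f"{value:,}", va="center")

plt.show()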

Check your progress 6:

1. When should we use bar chart?


……………………………………………………………………………
……………………………………………………………………………
2. What are the different types of bar chart?
……………………………………………………………………………
……………………………………………………………………………
3. Draw a vertical bar chart.
……………………………………………………………………………
……………………………………………………………………………
4. Draw a horizontal bar chart.

Use the following data to answer questions 3 and 4:

4.9 DISTRIBUTION PLOT

Distribution charts visually assess the distribution of sample data by contrasting the actual distribution of the data with the theoretical values expected from a certain distribution. In addition to more traditional hypothesis tests, distribution plots can be used to establish whether the sample data follow a particular distribution. The distribution plot is useful for analysing the relationship between the range of a set of numerical data and its distribution. The values of the data are represented as points along an axis.

Constructing a Distribution Plot: You must utilise one or two dimensions, together
with one measure, in a distribution plot. You will get a single line visualisation if you
only use one dimension. If you use two dimensions, each value of the outer, second
dimension will produce a separate line.

Use Cases: Distribution of a data set shows the frequency of occurrence of each
possible outcome of a repeatable event observed many times. For instance:
• Height of a population.
• Income distribution in an economy
• Test scores listed by percentile.

Best Practices:
• It is advisable to have equal class widths.
• The class intervals should be mutually exclusive and non-overlapping.
• Open-ended classes at the lower and upper limits (e.g., <10, >100) should be
avoided.
Example

Sales Amount No. of Clients


1-1000 23
1001-2000 19
2001-3000 22
3001-4000 19
4001-5000 27
5001-6000 25
6001-7000 17
7001-8000 26
8001-9000 23
9001-10000 12
Grand Total 213

Figure: Column chart of the Sales Amount Distribution, plotting the number of clients (y-axis) against each sales-amount class interval from 1-1000 up to 9001-10000 (x-axis).
Check your progress 7:

Q.1 What is the distribution plot?


…………………………………………………………………………………………
…………………………………………………………………………………………

Q.2 When should we use distribution plot?


…………………………………………………………………………………………
…………………………………………………………………………………………

Q.3 What do distribution graphs show?


…………………………………………………………………………………………
…………………………………………………………………………………………..

4.10 PAIR PLOT

The pairs plot is an extension of the histogram and the scatter plot, which are both fundamental figures. The histogram along the diagonal gives us the ability to see the distribution of a single variable, while the scatter plots on the upper and lower triangles show the relationship (or lack thereof) between two variables.

A pair plot can be utilised to gain an understanding of the optimum collection of


characteristics to describe a relationship between two variables or to create clusters
that are the most distinct from one another. Additionally, it is helpful to construct
some straightforward classification models by drawing some straightforward lines or
making linear separations in our data set.
Constructing a Pair Plot: If you have m attributes in your dataset, it creates a figure
with m x m subplots. Each attribute's univariate histograms (distributions) make up the
main-diagonal subplots. For a non-diagonal subplot, assume a position (i, j). The
dataset's samples are all plotted using a coordinate system with the characteristics i and
j as the axes. In other words, it projects the dataset on these two attributes only. This is
particularly interesting to visually inspect how the samples are spread with respect to
these two attributes ONLY. The "shape" of the spread can give you valuable insight on
the relation between the two attributes.

Use Cases: A pairs plot allows us to see both distribution of single variables and
relationships between two variables. It helps to identify the most distinct clusters or the
optimum combination of attributes to describe the relationship between two variables.
• By creating some straightforward linear separations or basic lines in our data
set, it also helps to create some straightforward classification models.
• Analysing socio-economic data of a population.

Best Practices:
• Use a different colour palette.
• For each colour level, use a different marker.
Example:
calories protein fat sodium fiber rating
70 4 1 130 10 68.40297
120 3 5 15 2 33.98368
70 4 1 260 9 59.42551
50 4 0 140 14 93.70491
110 2 2 180 1.5 29.50954
110 2 0 125 1 33.17409
130 3 2 210 2 37.03856
90 2 1 200 4 49.12025
90 3 0 210 5 53.31381
120 1 2 220 0 18.04285
110 6 2 290 2 50.765
120 1 3 210 0 19.82357
110 3 2 140 2 40.40021
110 1 1 180 0 22.73645
110 2 0 280 0 41.44502
100 2 0 290 1 45.86332
110 1 0 90 1 35.78279
110 1 1 180 0 22.39651
110 3 3 140 4 40.44877
110 2 0 220 1 46.89564
100 2 1 140 2 36.1762
100 2 0 190 1 44.33086
110 2 1 125 1 32.20758
110 1 0 200 1 31.43597
100 3 0 0 3 58.34514
120 3 2 160 5 40.91705
120 3 0 240 5 41.01549
110 1 1 135 0 28.02577
100 2 0 45 0 35.25244
110 1 1 280 0 23.80404
100 3 1 140 3 52.0769
110 3 0 170 3 53.37101
120 3 3 75 3 45.81172
120 1 2 220 1 21.87129
110 3 1 250 1.5 31.07222
110 1 0 180 0 28.74241
110 2 1 170 1 36.52368
140 3 1 170 2 36.47151
110 2 1 260 0 39.24111
100 4 2 150 2 45.32807
110 2 1 180 0 26.73452
100 4 1 0 0 54.85092
150 4 3 95 3 37.13686
150 4 3 150 3 34.13977
160 3 2 150 3 30.31335
100 2 1 220 2 40.10597
120 2 1 190 0 29.92429
140 3 2 220 3 40.69232
90 3 0 170 3 59.64284
130 3 2 170 1.5 30.45084
120 3 1 200 6 37.84059
100 3 0 320 1 41.50354
50 1 0 0 0 60.75611
50 2 0 0 1 63.00565
100 4 1 135 2 49.51187
100 5 2 0 2.7 50.82839
120 3 1 210 5 39.2592
100 3 2 140 2.5 39.7034
90 2 0 0 2 55.33314
110 1 0 240 0 41.99893
110 2 0 290 0 40.56016
80 2 0 0 3 68.23589
90 3 0 0 4 74.47295
90 3 0 0 3 72.80179
110 2 1 70 1 31.23005
110 6 0 230 1 53.13132
90 2 0 15 3 59.36399
110 2 1 200 0 38.83975
140 3 1 190 4 28.59279
100 3 1 200 3 46.65884
110 2 1 250 0 39.10617
110 1 1 140 0 27.7533
100 3 1 230 3 49.78745
100 3 1 200 3 51.59219
110 2 1 200 1 36.18756
The pair plot can be interpreted as follows:

The variable names are displayed along the boxes of the diagonal. Each of the remaining boxes shows a scatterplot of one pairwise combination of variables. For instance, the box in the top right corner of the matrix shows a scatterplot of the values for rating and sodium, and the other off-diagonal boxes can be read in the same way. From this single visualisation we can see the association between each pair of variables in our dataset: for instance, calories and rating appear to have a negative link, while protein and fat appear to be unrelated.
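A minimal Python sketch of such a pair plot, assuming the pandas, seaborn and matplotlib libraries; only the first few rows of the example data are typed in (with ratings rounded) to keep the sketch short.

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.DataFrame({
    "calories": [70, 120, 70, 50, 110, 110, 130, 90],
    "protein":  [4, 3, 4, 4, 2, 2, 3, 2],
    "fat":      [1, 5, 1, 0, 2, 0, 2, 1],
    "sodium":   [130, 15, 260, 140, 180, 125, 210, 200],
    "fiber":    [10, 2, 9, 14, 1.5, 1, 2, 4],
    "rating":   [68.40, 33.98, 59.43, 93.70, 29.51, 33.17, 37.04, 49.12],
})

# Diagonal: histogram of each variable; off-diagonal: pairwise scatter plots
sns.pairplot(df)
plt.show()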

Check your progress 8:

1. Why pair plot is used?


……………………………………………………………………………
……………………………………………………………………………

2. How do you read a pairs plot?


……………………………………………………………………………
……………………………………………………………………………

3. What does a pair plot show?


……………………………………………………………………………
……………………………………………………………………………

4.11 LINE GRAPH

A graph that depicts change over time by means of points and lines is known as a line
graph, line chart, or line plot. It is a graph that shows a line connecting a lot of points
or a line that shows how the points relate to one another. The graph is represented by
the line or curve that connects successive data points to show quantitative data between
two variables that are changing. The values of these two variables are compared along
a vertical axis and a horizontal axis in linear graphs.

One of the most significant uses of line graphs is tracking changes over both short and
extended time periods. It is also used to compare the changes that have taken place for
diverse groups over the course of the same time period. It is strongly advised to use a
line graph rather than a bar graph when working with data that only has slight
fluctuations.

As an illustration, the finance department of a company may want to visualise how its current cash balance has changed over time. In that case, they would plot the points over the horizontal and vertical axes using a line graph, with the horizontal axis typically representing the time period that the data span.

Following are the types of line graphs:

1. Simple Line Graph: Only a single line is plotted on the graph.

Example:

Time (hr) Distance (km)


0.5 180
1 360
1.5 540
2 720
2.5 900
3 1080

2. Multiple Line Graph: The same set of axes is used to plot several lines. An
excellent way to compare similar objects over the same time period is via a
multiple line graph.

Example:

Time(hr) Rahul dist.(km) Mahesh dist. (km)


0.5 180 200
1 360 400
1.5 540 600
2 720 800
2.5 900 1000
3 1080 1200
3. Compound Line Graph: Whenever one piece of information may be broken
down into two or more distinct pieces of data. A compound line graph is the
name given to this particular kind of line graph. To illustrate each component
that makes up the whole, lines are drawn. The line at the top displays the total,
while the line below displays a portion of the total. The size of each component
can be determined by the distance that separates every pair of lines.

Example:

Time Cars Buses Bikes


1-2pm 37 45 42
2-3pm 44 34 26
3-4pm 23 39 27
4-5pm 29 41 48

Constructing a line graph: When we have finished creating the data tables, we will
then use those tables to build the linear graphs. These graphs are constructed by plotting
a succession of points, which are then connected together with straight lines to offer a
straightforward method for analysing data gathered over a period of time. It provides a
very good visual format of the outcome data that was gathered over the course of time.

Use cases: Tracking changes over both short and long time periods is an important
application of line graphs. Additionally, it is utilised to compare changes over the same
time period for various groups. Anytime there are little changes, using a line graph
rather than a bar graph is always preferable.

• Straight line graphs can be used to explain potential future contract


markets and business prospects.
• To determine the precise strength of medications, a straight-line graph
is employed in both medicine and pharmacy.
• The government uses straight line graphs for both research and
budgetary planning.

• Chemistry and biology both use linear graphs.

• To determine whether our body weight is acceptable for our height,


straight line graphs are employed.
Best Practices

• Only connecting adjacent values along an interval scale should be done with
lines.
• In order to provide correct insights, intervals should be of comparable size.
• Select a baseline that makes sense for your set of data; a zero baseline might
not adequately capture changes in the data.
• Line graphs are only helpful for comparing data sets if the axes have the same
scales.

Example:

Sales 2011 2012 2013 2014 2015 2016 2017 2018


North 12000 13000 12500 14500 17300 16000 18200 22000
South 9000 9000 9000 9500 9500 9500 10000 9000
West 28000 27500 24000 25000 24500 24750 28000 29000
East 18000 8000 7000 22000 13000 14500 16500 17000
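A minimal Python sketch of a multiple line graph for the region-wise yearly sales above, assuming matplotlib; the markers, labels and title are illustrative choices.

import matplotlib.pyplot as plt

years = [2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018]
sales = {
    "North": [12000, 13000, 12500, 14500, 17300, 16000, 18200, 22000],
    "South": [9000, 9000, 9000, 9500, 9500, 9500, 10000, 9000],
    "West":  [28000, 27500, 24000, 25000, 24500, 24750, 28000, 29000],
    "East":  [18000, 8000, 7000, 22000, 13000, 14500, 16500, 17000],
}

# One line per region, all plotted over the same time axis for comparison
for region, values in sales.items():
    plt.plot(years, values, marker="o", label=region)

plt.xlabel("Year")
plt.ylabel("Sales")
plt.title("Region-wise Sales, 2011-2018")
plt.legend()
plt.show()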

Check your progress 9:

Q.1 What is the line graph?


…………………………………………………………………………………………
…………………………………………………………………………………………

Q.2 Where can we use line graph?


…………………………………………………………………………………………
…………………………………………………………………………………………

Q.3 Draw a line chart from the following information:


4.12 PIE CHART

A pie chart, often referred to as a circle chart, is a style of graph that can be used to
summarise a collection of nominal data or to show the many values of a single variable
(e.g. percentage distribution). Such a chart resembles a circle that has been divided into
a number of equal halves. Each segment corresponds to a specific category. The overall
size of the circle is divided among the segments in the same proportion as the category's
share of the whole data set.

A pie chart often depicts the individual components that make up the whole. In order to
bring attention to a particular piece of information that is significant, the illustration
may, on occasion, show a portion of the pie chart that is cut away from the rest of the
diagram. This type of chart is known as an exploded pie chart.
Types of a Pie chart: There are mainly two types of pie charts one is 2D pie chart and
another is 3D pie chart. This can be further classified into flowing categories:

1. Simple Pie Chart: The most fundamental kind of pie chart is referred to simply as
a pie chart and is known as a simple pie chart. It is an illustration that depicts a pie
chart in its most basic form.

Example:

Pets Owners (%)


Cats 38
Dogs 41
Birds 16
Reptiles 3
Small Mammals 2

Figure: Simple pie chart of Owners (%), with slices for Cats, Dogs, Birds, Reptiles and Small Mammals.

2. Exploded Pie Chart: In an exploded pie chart, one or more slices are pulled away from the rest of the pie instead of keeping all the elements joined together. It is common practice to do this in order to draw attention to a certain section or slice of a pie chart.

Example:

Pets Owners (%)


Cats 38
Dogs 41
Birds 16
Reptiles 3
Small Mammals 2
Figure: Exploded pie chart of Owners (%), with one slice separated from the rest of the pie.

3. Pie of Pie: The pie of pie method is a straightforward approach that enables more
categories to be represented on a pie chart without producing an overcrowded and
difficult-to-read graph. A pie chart that is generated from an already existing pie chart
is referred to as a "pie of pie".

Example:

Pets Owners (%)


Cats 38
Dogs 41
Birds 16
Reptiles 3
Small Mammals 2

4. Bar of Pie: A bar of pie is an additional straightforward method for showing additional categories on a pie chart while minimising space consumption on the pie chart itself. The expansion developed from the already existing pie chart is a bar graph rather than a second pie, although both serve comparable objectives.

Example:

Pets Owners (%)


Cats 38
Dogs 41
Birds 16
Reptiles 3
Small Mammals 2
Constructing a Pie chart: “The total value of the pie is always 100%”
To work out with the percentage for a pie chart, follow the steps given below:

• Categorize the data


• Calculate the total
• Divide the categories
• Convert into percentages
• Finally, calculate the degrees

Therefore, the pie chart formula is given as (Given Data/Total value of Data) × 360°
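As a worked example using the pets data above, the Dogs category is 41% of the total, so its slice spans (41/100) × 360° = 147.6°, while the Cats slice spans (38/100) × 360° = 136.8°.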

Use cases: If you want your audience to get a general idea of the part-to-whole relationship in your data, and comparing the exact sizes of the slices is not as critical to you, then you should use pie charts. They are also useful for indicating that a certain portion of the whole is disproportionately small or large.
• Voting preference by age group
• Market share of cloud providers

Best Practices

• Fewer pie wedges are preferred: The observer may struggle to interpret the chart's
significance if there are too many proportions to compare. Similar to this, keep the
overall number of pie charts on dashboards to a minimum.
• Overlay pies on maps: Pie charts can be used to further deconstruct geographic tendencies in your data and produce an engaging display.
Example

COMPANY MARKET SHARE


Company A 24%
Company B 13%
Company C 8%
Company D 33%
Company E 22%

Figure: Pie chart of MARKET SHARE, with slices for Company A (24%), Company B (13%), Company C (8%), Company D (33%) and Company E (22%).
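A minimal Python sketch of the market-share pie chart above, assuming matplotlib; pulling out the largest slice with the explode option is an optional emphasis, not part of the original example.

import matplotlib.pyplot as plt

companies = ["Company A", "Company B", "Company C", "Company D", "Company E"]
share = [24, 13, 8, 33, 22]

explode = (0, 0, 0, 0.1, 0)   # offset only Company D, the largest slice
plt.pie(share, labels=companies, autopct="%1.0f%%", explode=explode, startangle=90)
plt.title("MARKET SHARE")
plt.axis("equal")             # keep the pie circular
plt.show()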

Check your progress 10:

Q1. What is the pie chart?


…………………………………………………………………………………………
…………………………………………………………………………………………

Q2. What are the different type of pie charts?


…………………………………………………………………………………………
…………………………………………………………………………………………

Q.3 Draw a pie chart from the following information:

4.13 DOUGHNUT CHART

Pie charts have been superseded by a more user-friendly alternative called a doughnut
chart, which makes reading pie charts much simpler. It is recognised that these charts
express the relationship of 'part-to-whole,' which is when all of the parts represent one
hundred percent when collected together. It presents survey questions or data with a
limited number of categories for making comparisons.

In comparison to pie charts, they provide more condensed and straightforward representations. In addition, the centre hole can be used to display relevant information. You might use them in segments, where each arc indicates a proportional value associated with a different piece of data.

Constructing a Doughnut chart: A doughnut chart, like a pie chart, illustrates the
relationship of individual components to the whole, but unlike a pie chart, it can display
more than one data series at the same time. A ring is added to a doughnut chart for each
data series that is plotted within the chart itself. The beginning of the first data series
can be seen near the middle of the chart. A specific kind of pie chart called a doughnut
chart is used to show the percentages of categorical data. The amount of data that falls
into each category is indicated by the size of that segment of the donut. The creation of
a donut chart involves the use of a string field and a number, count of features, or
rate/ratio field.

There are two types of doughnut chart one is normal doughnut chart and another is
exploded doughnut chart. Exploding doughnut charts, much like exploded pie charts,
highlight the contribution of each value to a total while emphasising individual values.
However, unlike exploded pie charts, exploded doughnut charts can include more than
one data series.

Use cases: Doughnut charts are good to use when comparing sets of data. By using the
size of each component to reflect the percentage of each category, they are used to
display the proportions of categorical data. A string field and a count of features,
number, rate/ratio, or field are used to make a doughnut chart.
• Android OS market share
• Monthly sales by channel

Best Practices

• Stick to five slices or fewer, because thin, long-tail slices become unreadable and hard to compare.
• Use this chart to display one point in time with the help of the filter legend.
• Well-formatted and informative labels are essential because the information
conveyed by circular shapes alone is not enough and is imprecise.
• It is a good practice to sort the slices to make it more clear for comparison.
Example:

Project Status
Completed 30%
Work in progress 25%
Incomplete 45%
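A minimal Python sketch of the project-status doughnut chart above, assuming matplotlib; shrinking the wedge width is what turns the pie into a ring.

import matplotlib.pyplot as plt

status = ["Completed", "Work in progress", "Incomplete"]
share = [30, 25, 45]

# wedgeprops width < 1 leaves a hole in the centre, producing the doughnut shape
plt.pie(share, labels=status, autopct="%1.0f%%",
        wedgeprops={"width": 0.4}, startangle=90)
plt.title("Project Status")
plt.show()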

Check your progress 11:

Q1. What is the doughnut chart?


…………………………………………………………………………………
…………………………………………………………………………………

Q.2 What distinguishes a doughnut chart from a pie chart?


…………………………………………………………………………………………
…………………………………………………………………………………………

Q.3 Draw a doughnut chart from the following information:

Product 2020 2021


x 40 50
y 30 60
z 60 70

4.14 AREA CHART

An area chart, a hybrid of a line and bar chart, shows the relationship between the
numerical values of one or more groups and the development of a second variable, most
often the passage of time. The inclusion of shade between the lines and a baseline,
similar to a bar chart's baseline, distinguishes a line chart from an area chart. An area
chart has this as its defining feature.

Types of Area Chart:

Overlapping area chart: An overlapping area chart results if we wish to look at how
the values of the various groups compare to one another. The conventional line chart
serves as the foundation for an overlapping area chart. One point is plotted for each
group at each of the horizontal values, and the height of the point indicates the group's
value on the vertical axis variable.
All of the points for a group are connected from left to right by a line. A zero baseline
is supplemented by shading that is added by the area chart between each line. Because
the shading for different groups will typically overlap to some degree, the shading itself
incorporates a degree of transparency to ensure that the lines delineating each group
may be seen clearly at all times.

The shading brings attention to the group that has the highest value by highlighting that group's pure hue. Take care that one series is not always higher than the other, as this could cause the plot to be confused with the stacked area chart, which is the other form of area chart. In circumstances like these, the most prudent course of action is to stick to the traditional line chart.

Months (2016) Web Android IOS


June 0 -
July 70k -
Aug 55k 80k
Sep 60k 165k 80k
Oct 70k 165k 295k
Nov 80k 200k 290k
Dec 40k 125k 155k

Stacked area chart: The stacked area chart is what is often meant to be conveyed when
the phrase "area chart" is used in general conversation. When creating the chart of
overlapping areas, each line was tinted based on its vertical value all the way down to
a shared baseline. Plotting lines one at a time creates the stacked area chart, which uses
the height of the most recent group of lines as a moving baseline. Therefore, the total
that is obtained by adding up all of the groups' values will correspond to the height of
the line that is entirely piled on top.

When you need to keep track of both the total value and the breakdown of that total by
groups, you should make use of a stacked area chart. This type of chart will allow you
to do both at the same time. By contrasting the heights of the individual curve segments,
we are able to obtain a sense of how the contributions made by the various subgroups
stack up against one another and the overall sum.

Example:
Year Printers Projectors White Boards
2017 32 45 28
2018 47 43 40
2019 40 39 43
2020 37 40 41
2021 39 49 39
Figure: Stacked area chart of the yearly values of Printers, Projectors and White Boards, with each series stacked so that the top edge shows the total.
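A minimal Python sketch of the stacked area chart above, assuming matplotlib; the y-axis label is an assumption, since the example table does not state a unit.

import matplotlib.pyplot as plt

years = [2017, 2018, 2019, 2020, 2021]
printers = [32, 47, 40, 37, 39]
projectors = [45, 43, 39, 40, 49]
white_boards = [28, 40, 43, 41, 39]

# Each series is stacked on top of the previous one; the top edge is the total
plt.stackplot(years, printers, projectors, white_boards,
              labels=["Printers", "Projectors", "White Boards"], alpha=0.7)
plt.xlabel("Year")
plt.ylabel("Units")
plt.title("Stacked Area chart")
plt.legend(loc="upper left")
plt.show()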

Use Cases: In most cases, many lines are drawn on an area chart in order to create a
comparison between different groups (also known as series) or to illustrate how a whole
is broken down into its component pieces. This results in two distinct forms of area
charts, one for each possible application of the chart.
• Magnitude of a single quantitative variable's trend - An increase in a public
company's revenue reserves, programme enrollment from a qualified subgroup by
year, and trends in mortality rates over time by primary causes of death are just a
few examples.
• Comparison of the contributions made by different category members (or
group)- the variation in staff sizes among departments, or support tickets opened
for various problems.
• Birth and death rates over time for a region, the magnitudes of cost vs. revenue for
a business, the magnitudes of export vs. import over time for a country

Best Practices:

• To appropriately portray the proportionate difference in the data, start the y-axis at
0.
• To boost readability, choose translucent, contrasting colours.
• Keep highly variable data at the top of the chart and low variable data at the bottom
during stacking.
• If you need to show how each value over time contributes to a total, use a stacked
area chart.
• However, it is recommended to utilise 100% stacked area charts if you need to
demonstrate a part to whole relationship in a situation where the cumulative total is
unimportant.

Example:
The area chart data for Web, Android and IOS given earlier in this section belongs to a tele-service offered by various television-based applications. In this data, there are different types of subscribers who are using the services provided by these applications in different months.

Check your progress 12:

Q1. What is area chart?


…………………………………………………………………………………………
…………………………………………………………………………………………

Q2. What are types of area charts?


…………………………………………………………………………………………
…………………………………………………………………………………………

Q3. Draw an area chart from the following information:

Product A Product B Product C


2017 2000 600 75
2018 2200 450 85
2019 2100 500 125
2020 3000 750 123

4.15 SUMMARY
This Unit introduces you to some of the basic charts that are used in data science. The
Unit defines the characteristics of Histograms, which are very popular in univariate
frequency analysis of quantitative variables. It then discusses the importance and
various terms used in the box plots, which are very useful while comparing quantitative
variable over some qualitative characteristic. Scatter plots are used to visualise the
relationships between two quantitative variables. The Unit also discusses about the heat
map, which are excellent visual tools for comparing values. In case three variables are
to be compared then you may use bubble charts. The unit also highlights the importance
of bar charts, distribution plots, pair plots and line graphs. In addition, it highlights the
importance of Pie chart, doughnut charts and area charts for visualising different kinds
of data. In addition, there are many other kinds of charts that are used in different analytical tools. You may read about them from the references.
4.16 ANSWERS

Check Your Progress 1


i. A bar graph is a pictorial representation using vertical and
horizontal bars in a graph. The length of bars are proportional
to the measure of data. It is also called bar chart. A histogram
is also a pictorial representation of data using rectangular bars,
that are adjacent to each other. It is used to represent grouped
frequency distribution with continuous classes.

ii.

iii. It is used to summarise continuous or discrete data that is


measured on an interval scale. It is frequently used to
conveniently depict the key characteristics of the data
distribution.

iv. A histogram is a graphic depiction of data points arranged into


user-specified ranges. The histogram, which resembles a bar
graph in appearance, reduces a data series into an intuitive
visual by collecting numerous data points and organising them
into logical ranges or bins.

Check Your Progress 2


1. Follow these instructions to interpret a boxplot. :
Step 1: Evaluate the major characteristics. Look at the distribution's
centre and spread. Examine the potential impact of the sample size on
the boxplot's visual appeal.
Step 2: Search for signs of anomalous or out-of-the-ordinary data.
Skewed data suggest that data may not be normal. Other situations in
your data may be indicated by outliers.
Step 3: Evaluate and compare groups. Evaluate and compare the centre
and spread of groups if your boxplot contains them.
2. A boxplot is a common method of showing data distribution based on a five-number summary: the minimum, first quartile (Q1), median, third quartile (Q3), and maximum. It also helps you learn more about the values of your outliers.
3. Box plots are generally used for 3 purposes -
• Finding outliers in the data
• Finding the dispersion of data from a median
• Finding the range of data

4. The box plot distribution will reveal the degree to which the data are clustered, how skewed they are, and also how symmetrical they are.
• Positively Skewed: The box plot is positively skewed if the distance from the median to the maximum is greater than the distance from the median to the minimum.
• Negatively skewed: Box plots are said to be negatively skewed if the distance from the median to the minimum is higher than the distance from the median to the maximum.
• Symmetric: When the median of a box plot is equally spaced from both the maximum and minimum values, the box plot is said to be symmetric.

Check Your Progress 3


1.

• The most practical method for displaying bivariate (2-variable) data is a scatter plot.
• A scatter plot can show the direction of a relationship between two variables when
there is an association or interaction between them (positive or negative).
• The linearity or nonlinearity of an association or relationship can be ascertained
using a scatter plot.
• A scatter plot reveals anomalies, questionably measured data, or incorrectly plotted
data visually.

2.

• The Title- A brief description of what is in your graph is provided in the title.
• The Legend- The meaning of each point is explained in the legend.
• The Source- The source explains how you obtained the data for your graph.
• Y-Axis.
• The Data.
• X-Axis.

3. A scatter plot is composed of a horizontal axis containing the measured values of one
variable (independent variable) and a vertical axis representing the measurements of the
other variable (dependent variable). The purpose of the scatter plot is to display what
happens to one variable when another variable is changed.
4.

• Positive Correlation.
• Negative Correlation.
• No Correlation (None)

Check Your Progress 4


1. Three main types of input exist to plot a heatmap: wide format, correlation
matrix, and long format.
Wide format: The wide format (or the untidy format) is a matrix where
each row is an individual, and each column is an observation. In this case,
the heatmap makes a visual representation of the matrix: each square of the
heatmap represents a cell. The color of the cell changes according to its
value.
Correlation matrix: Suppose you measured several variables for n
individuals. A common task is to check if some variables are correlated.
You can easily calculate the correlation between each pair of variables, and
plot this as a heatmap. This lets you discover which variable is related to
the other.
Long format: In the tidy or long format, each line represents an
observation. You have 3 columns: individual, variable name, and value (x,
y and z). You can plot a heatmap from this kind of data.

2. A heat map is a two-dimensional visualisation of data in which colours


stand in for values. A straightforward heat map offers a quick visual
representation of the data. The user can comprehend complex data sets with
the help of more intricate heat maps.

3. Using one variable on each axis, heatmaps are used to display relationships
between two variables. You can determine if there are any trends in the
values for one or both variables by monitoring how cell colours vary across
each axis.

Check Your Progress 5

1. A bubble chart is a variant of a scatter chart in which the data points are swapped out for bubbles, with the size of the bubbles serving as a representation of an additional dimension of the data. A bubble chart's horizontal and vertical axes are both value axes.

2. To identify whether at least three numerical variables are connected or


exhibit a pattern, bubble charts are utilised. They could be applied in
specific situations to compare categorical data or demonstrate trends
across time.

3. In scatter charts, one numeric field is displayed on the x-axis and


another on the y-axis, making it simple to see the correlation between
the two values for each item in the chart. A third numerical field in a
bubble chart regulates the size of the data points.

4. In a bubble chart, the bubble size encodes a third numerical variable: the area of each bubble is drawn in proportion to that variable's value, usually within a fixed range of point sizes so that very small values remain visible. To construct a chart that displays many dimensions, bubble size can be combined with colour by value.

Check Your Progress 6


Answer 1:
In the process of statistics development, bar charts are typically employed to
display the data. The following is a list of some of the applications of the bar
chart:
To clearly illustrate the relationships between various variables, bar charts are
typically utilised. When presented in a pictorial format, the parameters can be
more quickly and easily envisioned by the user.
Bar charts are the quickest and easiest way to display extensive amounts of data
while also saving time.
The method of data representation that is most commonly utilised. As a result,
it is utilised in a variety of different sectors.
When studying trends over extended amounts of time, it is helpful to have this
information.

Answer 2:
Charts are primarily divided into two categories:

Horizontal Bar Charts:

Vertical Bar Charts

We can further divide into two types:

Grouped Bar Charts


Stacked Bar Charts
Answer 3:

Answer 4:

Check Your Progress 7:


1. For visually assessing the distribution of sample data, you can draw
distribution charts. Using these charts, you can contrast the actual
distribution of the data with the theoretical values expected from a
certain distribution.
2. The distribution plot is useful for analysing the relationship
between the range of a set of numerical data and its distribution.
You are only allowed to use one or two dimensions and one
measure when creating a distribution graphic.
3. These graphs show - how the data is distributed; how the data is
composed; how values relate to one another.

Check Your Progress 8:


1. We can visualise pairwise relationships between variables in a dataset
using pair plots. By condensing a lot of data into a single figure, this gives the
data a pleasant visual representation and aids in our understanding of the data.

2. For variables a, b, c and d, the first row shows a scatter plot of a and b, then one of a and c, and finally one of a and d. The second row starts with b and a (symmetric to the first row), followed by b and c, b and d, and so on. No sums, mean squares, or other calculations are performed on the pairs: whatever appears in your pairs plot is exactly what is in your data frame.

3. Pair plots are used to determine the most distinct clusters or the best
combination of features to describe a connection between two variables. By
creating some straightforward linear separations or basic lines in our data set,
it also helps to create some straightforward classification models.

Check Your Progress 9:


1. A graph that depicts change over time by means of points and lines is known
as a line graph, line chart, or line plot. It is a chart that depicts a line uniting
numerous points or a line that illustrates the relation between the points. The
line or curve used to depict quantitative data between two changing variables
in the graph combines a sequence of succeeding data points to create a
representation of the graph.

2. Tracking changes over a short as well as a long period of time is one of the
most important applications of line graphs. Additionally, it is utilised to
compare the modifications that have occurred for various groups throughout
the course of the same period of time. When dealing with data that has only
minor variations, using a line graph rather than a bar graph is strongly
recommended. For instance, the finance team at a corporation may wish to
chart the evolution of the cash balance that the company now possesses
throughout the course of time.

3.

Check Your Progress 10:

1. A pie chart, often referred to as a circle chart, is a style of graph that can be used to
summarise a collection of nominal data or to show the many values of a single variable.
(e.g. percentage distribution).

2. There are mainly two types of pie charts one is 2D pie chart and another is 3D pie
chart. This can be further classified into flowing categories:

1. Simple Pie Chart

2. Exploded Pie Chart

3. Pie of Pie
4. Bar of Pie

3.

Check Your Progress 11:

1. Pie charts have been superseded by a more user-friendly alternative called a doughnut
chart, which makes reading pie charts much simpler. It is recognised that these charts
express the relationship of 'part-to-whole,' which is when all of the parts represent one
hundred percent when collected together. In comparison to pie charts, they provide for
more condensed and straightforward representations.
2. A donut chart is similar to a pie chart, with the exception that the centre is cut out. When you want to display particular dimensions, you use arc segments rather than slices. Just like a pie chart, this form of chart can assist you in comparing certain categories or dimensions to the greater whole; nevertheless, it has a few advantages over its pie chart counterpart.
3. Figure: Doughnut chart of Product Sales, with segments for products x, y and z.

Check Your Progress 12:

1. An area chart shows how the numerical values of one or more groups change in
proportion to the development of a second variable, most frequently the passage of time.
It combines the features of a line chart and a bar chart. A line chart can be differentiated
from an area chart by the addition of shading between the lines and a baseline, just like
in a bar chart. This is the defining characteristic of an area chart.

2. Overlapping area chart and Stacked area chart

3. Figure: Area chart of the sales of Product A, Product B and Product C from 2017 to 2020.

4.17 REFERENCES

• Useful Ways to Visualize Your Data (With Examples) (PDF)
• Data Visualization Cheat Sheet (PDF)
• Which chart or graph is right for you? (PDF)
• https://www.excel-easy.com/examples/frequency-distribution.html
• https://analyticswithlohr.com/2020/09/15/556/
• https://www.fusioncharts.com/line-charts
• https://evolytics.com/blog/tableau-201-make-stacked-area-chart/
• https://chartio.com/learn/charts/area-chart-complete-guide/
• https://www.lifewire.com/exploding-pie-charts-in-excel-3123549
• https://www.formpl.us/resources/graph-chart/line/
• https://www150.statcan.gc.ca/n1/edu/power-pouvoir/ch9/bargraph-diagrammeabarres/5214818-eng.htm
• https://sixsigmamania.com/?p=475
• https://study.com/academy/lesson/measures-of-dispersion-variability-and-skewness.html
