
CS3352 FOUNDATIONS OF DATA SCIENCE

II YEAR / III SEMESTER B.Tech.- INFORMATION TECHNOLOGY

UNIT I
INTRODUCTION

COMPILED BY,

Mr.M.KARTHIKEYAN, M.E., HoD / IT

VERIFIED BY

HOD PRINCIPAL CEO/CORRESPONDENT

DEPARTMENT OF INFORMATION TECHNOLOGY

SENGUNTHAR COLLEGE OF ENGINEERING, TIRUCHENGODE – 637 205.


CREDIT POINT
ANNA UNIVERSITY, CHENNAI
AFFILIATED INSTITUTIONS
R-2021
B.TECH INFORMATION TECHNOLOGY

S.No. | Course Code | Course Title | Category | L | T | P | Total Contact Periods | Credits

THEORY
1. | MA3354 | Discrete Mathematics | BSC | 3 | 1 | 0 | 4 | 4
2. | CS3351 | Digital Principles and Computer Organization | ESC | 3 | 0 | 2 | 5 | 4
3. | CS3352 | Foundations of Data Science | PCC | 3 | 0 | 0 | 3 | 3
4. | CD3291 | Data Structures and Algorithms | PCC | 3 | 0 | 0 | 3 | 3
5. | CS3391 | Object Oriented Programming | PCC | 3 | 0 | 0 | 3 | 3

PRACTICALS
6. | CD3281 | Data Structures and Algorithms Laboratory | PCC | 0 | 0 | 4 | 4 | 2
7. | CS3381 | Object Oriented Programming Laboratory | PCC | 0 | 0 | 3 | 3 | 1.5
8. | CS3361 | Data Science Laboratory | PCC | 0 | 0 | 4 | 4 | 2
9. | GE3361 | Professional Development | EEC | 0 | 0 | 2 | 2 | 1

TOTAL | | | | 15 | 1 | 15 | 31 | 23.5
CS3352   FOUNDATIONS OF DATA SCIENCE                                L T P C
                                                                    3 0 0 3
COURSE OBJECTIVES:
 To understand the data science fundamentals and process.
 To learn to describe the data for the data science process.
 To learn to describe the relationship between data.
 To utilize the Python libraries for Data Wrangling.
 To present and interpret data using visualization libraries in Python.

UNIT I INTRODUCTION
Data Science: Benefits and uses – facets of data - Data Science Process: Overview –
Defining research goals – Retrieving data – Data preparation - Exploratory Data
analysis – build the model– presenting findings and building applications - Data Mining -
Data Warehousing – Basic Statistical descriptions of Data

UNIT II DESCRIBING DATA


Types of Data - Types of Variables -Describing Data with Tables and Graphs –Describing
Data with Averages - Describing Variability - Normal Distributions and Standard (z) Scores

UNIT III DESCRIBING RELATIONSHIPS


Correlation –Scatter plots –correlation coefficient for quantitative data –computational
formula for correlation coefficient – Regression –regression line –least squares
regression line – Standard error of estimate – interpretation of r2 –multiple regression
equations –regression towards the mean

UNIT IV PYTHON LIBRARIES FOR DATA WRANGLING


Basics of Numpy arrays –aggregations –computations on arrays –comparisons, masks,
boolean logic – fancy indexing – structured arrays – Data manipulation with Pandas –
data indexing and selection – operating on data – missing data – Hierarchical
indexing – combining datasets – aggregation and grouping – pivot tables

UNIT V DATA VISUALIZATION


Importing Matplotlib – Line plots – Scatter plots – visualizing errors – density and
contour plots – Histograms – legends – colors – subplots – text and annotation –
customization – three dimensional plotting - Geographic Data with Basemap -
Visualization with Seaborn.
COURSE OUTCOMES:
At the end of this course, the students will be able to:
CO1: Define the data science process
CO2: Understand different types of data description for the data science process
CO3: Gain knowledge on relationships between data
CO4: Use the Python libraries for data wrangling
CO5: Apply visualization libraries in Python to interpret and explore data
TOTAL: 45 PERIODS

TEXTBOOKS:

REFERENCE:
1. Allen B. Downey, “Think Stats: Exploratory Data Analysis in
Python”, Green Tea Press,2014.
SENGUNTHAR COLLEGE OF ENGINEERING, TIRUCHENGODE- 637205

DEPARTMENT OF INFORMATION TECHNOLOGY

LECTURE PLAN

Subject Code : CS3352

Subject Name : FOUNDATIONS OF DATA SCIENCE

Name of the faculty : M.KARTHIKEYAN

Designation: : HOD / IT

Course : III Semester B.Tech. – Information Technology

Academic Year : 2022– 2023 (ODD SEMESTER)

RECOMMENDED TEXT BOOKS / REFERENCE BOOKS

Sl. No. | Title of the Book | Author | Reference
1 | Introducing Data Science, Manning Publications, 2016. (Unit I) | David Cielen, Arno D. B. Meysman, and Mohamed Ali | T1
2 | Statistics, Eleventh Edition, Wiley Publications, 2017. (Units II and III) | Robert S. Witte and John S. Witte | T2
3 | Python Data Science Handbook, O'Reilly, 2016. (Units IV and V) | Jake VanderPlas | T3
S.No. | Topic | Reference | Teaching Aid | No. of Hours

UNIT I INTRODUCTION
1 | Data Science: Benefits and uses | T1-CH1 | Black Board | 1
2 | Facets of data | T1-CH1 | Black Board | 1
3 | Data Science Process: Overview | T1-CH1 | Black Board | 1
4 | Defining research goals – Retrieving data – Data preparation | T1-CH1 | Black Board | 1
5 | Exploratory Data analysis | T1-CH1 | Black Board | 1
6 | Build the model | T1-CH1 | Black Board | 1
7 | Presenting findings and building applications | T1-CH1 | Black Board | 1
8 | Data Mining – Data Warehousing | T1-CH1 | Black Board | 1
9 | Basic Statistical descriptions of Data | T1-CH1 | PPT | 1

UNIT II DESCRIBING DATA
10 | Types of Data | T2-CH1 | Black Board | 1
11 | Types of Variables | T2-CH1 | Black Board | 1
12 | Describing Data with Tables and Graphs | T2-CH1 | Black Board | 1
13 | Describing Data with Averages | T2-CH1 | Black Board | 1
14 | Describing Variability | T2-CH1 | Black Board | 1
15 | Normal Distributions | T2-CH1 | Black Board | 1
16 | Standard (z) Scores | T2-CH1 | Black Board | 1
17 | Standard (z) Scores (contd.) | T2-CH1 | Black Board | 1

UNIT III DESCRIBING RELATIONSHIPS
18 | Correlation | T2-CH2 | Black Board | 1
19 | Scatter plots – correlation coefficient for quantitative data | T2-CH2 | Black Board | 1
20 | Computational formula for correlation coefficient | T2-CH2 | Black Board | 1
21 | Regression – regression line | T2-CH2 | Black Board | 1
22 | Least squares regression line | T2-CH2 | Black Board | 1
23 | Standard error of estimate | T2-CH2 | Black Board | 1
24 | Interpretation of r² | T2-CH2 | Black Board | 1
25 | Multiple regression equations | T2-CH2 | Black Board | 1
26 | Regression towards the mean | T2-CH2 | PPT | 1
27 | Regression towards the mean (contd.) | T2-CH2 | PPT | 1

UNIT IV PYTHON LIBRARIES FOR DATA WRANGLING
28 | Basics of Numpy arrays | T3-CH1 | Black Board | 1
29 | Aggregations – computations on arrays | T3-CH1 | Black Board | 1
30 | Comparisons, masks, boolean logic | T3-CH1 | PPT | 1
31 | Fancy indexing – structured arrays | T3-CH1 | Black Board | 1
32 | Data manipulation with Pandas | T3-CH1 | Black Board | 1
33 | Data indexing and selection | T3-CH1 | Black Board | 1
34 | Operating on data | T3-CH1 | Black Board | 1
35 | Missing data – Hierarchical indexing | T3-CH1 | Black Board | 1
36 | Combining datasets | T3-CH1 | Black Board | 1
37 | Aggregation and grouping – pivot tables | T3-CH1 | Black Board | 1

UNIT V DATA VISUALIZATION
38 | Importing Matplotlib | T3-CH2 | PPT | 1
39 | Line plots – Scatter plots | T3-CH2 | PPT | 1
40 | Visualizing errors | T3-CH2 | PPT | 1
41 | Density and contour plots | T3-CH2 | PPT | 1
42 | Histograms – legends – colors – subplots | T3-CH2 | DEMO | 1
43 | Text and annotation | T3-CH2 | DEMO | 1
44 | Customization – three dimensional plotting | T3-CH2 | PPT | 1
45 | Geographic Data with Basemap – Visualization with Seaborn | T3-CH2 | PPT | 1

Total | | | | 45
Revision | | | | 5
Total Hours | | | | 50

UNIT - I
INTRODUCTION

 Data Science: Benefits and uses


 facets of data
 Data Science Process: Overview – Defining research goals
 Retrieving data – Data preparation
 Exploratory Data analysis
 build the model– presenting findings and building applications
 Data Mining - Data Warehousing
 Basic Statistical descriptions of Data

LIST OF IMPORTANT QUESTIONS

UNIT - I

INTRODUCTION

PART – A
1. What is Data Science?
2. Differentiate between Data Analytics and Data Science
3. What are the challenges in Data Science?
4. List the facets of data.
5. What are the steps in the data science process?
6. Explain unstructured data and give an example.
7. What do you understand about linear regression?
8. What are outliers?
9. What do you understand by logistic regression?
10. What is a confusion matrix?
11. What do you understand about the true-positive rate and false-positive rate?
12. How is Data Science different from traditional application programming?
13. Explain the differences between supervised and unsupervised learning.
14. What is the difference between long format data and wide format data?
15. Mention some techniques used for sampling. What is the main advantage of
sampling?

PART – B

1. Explain various steps in the Data Science process (OR) the Data Science Lifecycle.
2. Explain the facets of data in detail.
3. What are the steps in data cleansing? Explain with an example.
4. Explain the common errors that occur in the data cleansing process. (7)
5. Explain in detail Data Warehousing and Data Mining.
6. Explain the steps involved in Data Science Modelling.

LIST OF IMPORTANT QUESTIONS
UNIT - I

INTRODUCTION

PART – A
1. What is Data Science?
Data Science is a field of computer science that explicitly deals with turning data into information
and extracting meaningful insights out of it. The reason why Data Science is so popular is that
the kind of insights it allows us to draw from the available data has led to some major innovations
in several products and companies. Using these insights, we are able to determine the taste of a
particular customer, the likelihood of a product succeeding in a particular market, etc.

2. Differentiate between Data Analytics and Data Science


Data Analytics | Data Science
Data Analytics is a subset of Data Science. | Data Science is a broad field that includes various subsets such as Data Analytics, Data Mining, Data Visualization, etc.
The goal of data analytics is to illustrate the precise details of retrieved insights. | The goal of data science is to discover meaningful insights from massive datasets and derive the best possible solutions to resolve business issues.
Requires just basic programming languages. | Requires knowledge of advanced programming languages.
It focuses on just finding the solutions. | Data Science not only focuses on finding the solutions but also predicts the future with past patterns or insights.
A data analyst's job is to analyse data in order to make decisions. | A data scientist's job is to provide insightful data visualizations from raw data that are easily understandable.

3. What are the challenges in Data Science?


 Multiple Data Sources
 Data Security
 Lack of Clarity on the Business Problem
 Undefined KPIs and Metrics
 Difficulty in Finding Skilled Data Scientists
 Getting Value Out of Data Science

4. List the facets of data.


There are many facets of data science, including:
 Identifying the structure of data.
 Cleaning, filtering, reorganizing, augmenting, and aggregating data.
 Visualizing data.
 Data analysis, statistics, and modeling.
 Machine Learning.
 Assembling data processing pipelines to link these steps.

5. What are the steps in data science process ?


The Data Science Process
Step 1: Frame the problem.
Step 2: Collect the raw data needed for your problem.
Step 3: Process the data for analysis.
Step 4: Explore the data.
Step 5: Perform in-depth analysis.
Step 6: Communicate the results of the analysis.

6. Explain unstructured data and give example.


Unstructured data is data that does not fit neatly into a data model, and it happens to be in greater abundance than structured data. Examples of unstructured data include rich media (media and entertainment data, surveillance data, geo-spatial data, audio, weather data) and document collections.

7. What do you understand about linear regression?


Linear regression helps in understanding the linear relationship between the dependent
and the independent variables. Linear regression is a supervised learning algorithm, which helps
in finding the linear relationship between two variables. One is the predictor or the independent
variable and the other is the response or the dependent variable. In Linear Regression, we try to
understand how the dependent variable changes w.r.t the independent variable. If there is only

one independent variable, then it is called simple linear regression, and if there is more than one
independent variable then it is known as multiple linear regression.
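As a minimal illustration (not from the prescribed textbooks), simple linear regression can be fitted in Python with scikit-learn; the data values below are invented:

import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1], [2], [3], [4], [5]])   # independent variable, shape (n_samples, 1)
y = np.array([52, 58, 61, 67, 73])        # dependent variable

model = LinearRegression().fit(X, y)       # find the best-fitting line y = a*x + b
print("slope:", model.coef_[0])
print("intercept:", model.intercept_)
print("prediction for x = 6:", model.predict([[6]])[0])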

8. What are outliers?

An outlier is an observation that lies an abnormal distance from other values in a random sample from a population.

9. What do you understand by logistic regression?


Logistic regression is a classification algorithm that can be used when the dependent variable is
binary. Let’s take an example. Here, we are trying to determine whether it will rain or not on the
basis of temperature and humidity.

Temperature and humidity are the independent variables, and rain would be our dependent
variable. So, the logistic regression algorithm actually produces an S shape curve.

Now, let us look at another scenario: Let’s suppose that x-axis represents the runs scored by
Virat Kohli and the y-axis represents the probability of the team India winning the match. From
this graph, we can say that if Virat Kohli scores more than 50 runs, then there is a greater
probability for team India to win the match. Similarly, if he scores less than 50 runs then the
probability of team India winning the match is less than 50 percent.

So, basically in logistic regression, the Y value lies within the range of 0 and 1. This is how
logistic regression works.
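A short, hedged sketch of the same idea in code, assuming scikit-learn is available and using invented temperature/humidity readings:

import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[30, 85], [32, 90], [25, 40], [28, 55],
              [31, 80], [24, 35], [27, 60], [33, 95]])  # temperature, humidity
y = np.array([1, 1, 0, 0, 1, 0, 0, 1])                  # 1 = rain, 0 = no rain

clf = LogisticRegression().fit(X, y)
# predict_proba returns values between 0 and 1, i.e., points on the S-shaped curve
print(clf.predict_proba([[29, 75]])[0, 1])   # probability of rain
print(clf.predict([[29, 75]]))               # predicted class label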

10. What is a confusion matrix?

The confusion matrix is a table that is used to evaluate the performance of a classification model. It tabulates the actual values and the predicted values in a 2×2 matrix.

True Positive: records where the actual values are true and the predicted values are also true.
False Negative: records where the actual values are true, but the predicted values are false.
False Positive: records where the actual values are false, but the predicted values are true.
True Negative: records where the actual values are false and the predicted values are also false.

The correct predictions are therefore the true positives and the true negatives. This is how the confusion matrix works.
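As an illustrative sketch (assuming scikit-learn, with made-up labels), the 2×2 matrix and its counts can be obtained as below; the rates asked about in the next question follow directly from these counts:

from sklearn.metrics import confusion_matrix

actual    = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
predicted = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

# With labels=[0, 1] the matrix is [[TN, FP], [FN, TP]].
tn, fp, fn, tp = confusion_matrix(actual, predicted, labels=[0, 1]).ravel()
print("TP:", tp, "FP:", fp, "FN:", fn, "TN:", tn)

print("True positive rate :", tp / (tp + fn))
print("False positive rate:", fp / (fp + tn))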

11. What do you understand about the true-positive rate and false-positive rate?
True positive rate: In Machine Learning, the true-positive rate (also referred to as sensitivity or recall) is used to measure the percentage of actual positives that are correctly identified.
Formula: True Positive Rate = True Positives / Positives

False positive rate: The false-positive rate is the probability of falsely rejecting the null hypothesis for a particular test. It is calculated as the ratio between the number of negative events wrongly categorized as positive (false positives) and the total number of actual negative events.
Formula: False Positive Rate = False Positives / Negatives

12. How is Data Science different from traditional application programming?


Data Science takes a fundamentally different approach in building systems that provide value
than traditional application development.

In traditional programming paradigms, we used to analyze the input, figure out the expected
output, and write code, which contains rules and statements needed to transform the provided
input into the expected output. As we can imagine, these rules were not easy to write, especially,
for data that even computers had a hard time understanding, e.g., images, videos, etc.

Data Science shifts this process a little bit. In it, we need access to large volumes of data that
contain the necessary inputs and their mappings to the expected outputs. Then, we use Data
Science algorithms, which use mathematical analysis to generate rules to map the given inputs
to outputs.

This process of rule generation is called training. After training, we use some data that was set
aside before the training phase to test and check the system’s accuracy. The generated rules are
a kind of a black box, and we cannot understand how the inputs are being transformed into
outputs.

However, if the accuracy is good enough, then we can use the system (also called a model).

As described above, in traditional programming, we had to write the rules to map the input to the
output, but in Data Science, the rules are automatically generated or learned from the given data.
This helped solve some really difficult challenges that were being faced by several companies.

13. Explain the differences between supervised and unsupervised learning.

Supervised and unsupervised learning are two types of Machine Learning techniques. They both
allow us to build models. However, they are used for solving different kinds of problems.
Supervised Learning | Unsupervised Learning
Works on data that contains both the inputs and the expected output, i.e., labeled data | Works on data that contains no mappings from input to output, i.e., unlabeled data
Used to create models that can be employed to predict or classify things | Used to extract meaningful information out of large volumes of data
Commonly used supervised learning algorithms: linear regression, decision tree, etc. | Commonly used unsupervised learning algorithms: K-means clustering, Apriori algorithm, etc.

14. What is the difference between the long format data and wide format data?

Long Format Data | Wide Format Data
Long format data has a column for possible variable types and a column for the values of those variables. | Wide format data has a column for each variable.
Each row in the long format represents one time point per subject; as a result, each subject will have many rows of data. | The repeated responses of a subject are in a single row, with each response in its own column.
This data format is most typically used in R analysis and for writing to log files at the end of each experiment. | This data format is most widely used in data manipulations and in stats programmes for repeated-measures ANOVAs, and is seldom used in R analysis.
A long format contains values that do repeat in the first column. | A wide format contains values that do not repeat in the first column.
Use df.melt() to convert wide form to long form. | Use df.pivot().reset_index() to convert long form into wide form.

15. Mention some techniques used for sampling. What is the main advantage of

sampling?

Sampling is defined as the process of selecting a sample from a group of people or from any
particular kind for research purposes. It is one of the most important factors which decides the
accuracy of a research/survey result.

Mainly, there are two types of sampling techniques:

Probability sampling: It involves random selection which makes every element get a chance to
be selected. Probability sampling has various subtypes in it, as mentioned below:

 Simple Random Sampling


 Stratified sampling
 Systematic sampling
 Cluster Sampling
 Multi-stage Sampling

 Non-probability sampling: Non-probability sampling follows non-random selection, which means the selection is done based on your ease or any other required criteria. This helps to collect the data easily. The following are various types of sampling in it:
o Convenience Sampling
o Purposive Sampling
o Quota Sampling
o Referral /Snowball Sampling

16 What is bias in Data Science?

Bias is a type of error that occurs in a Data Science model because of using an algorithm that is
not strong enough to capture the underlying patterns or trends that exist in the data. In other
words, this error occurs when the data is too complicated for the algorithm to understand, so it
ends up building a model that makes simple assumptions. This leads to lower accuracy because
of underfitting. Algorithms that can lead to high bias are linear regression, logistic regression, etc.

17. What is dimensionality reduction?

Dimensionality reduction is the process of converting a dataset with a high number of dimensions
(fields) to a dataset with a lower number of dimensions. This is done by dropping some fields or
columns from the dataset. However, this is not done haphazardly. In this process, the dimensions
or fields are dropped only after making sure that the remaining information will still be enough to
succinctly describe similar information.

18. Why is Python used for Data Cleaning in DS?

Data Scientists have to clean and transform the huge data sets in a form that they can work with.
It is important to deal with the redundant data for better results by removing nonsensical outliers,
malformed records, missing values, inconsistent formatting, etc.

Python libraries such as Matplotlib, Pandas, Numpy, Keras, and SciPy are extensively used
for Data cleaning and analysis. These libraries are used to load and clean the data and do

effective analysis. For example, a CSV file named “Student” has information about the students
of an institute like their names, standard, address, phone number, grades, marks, etc.
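A minimal, illustrative pandas cleaning sketch; the small in-memory table below stands in for the hypothetical "Student" CSV mentioned above:

import numpy as np
import pandas as pd

students = pd.DataFrame({
    "name":  ["Asha", "Asha", "Ravi", "Mena"],
    "marks": [88, 88, np.nan, 1000],          # a missing value and an implausible entry
    "grade": ["A", "A", None, "B"],
})

clean = (students
         .drop_duplicates()                                    # remove the duplicate record
         .assign(marks=lambda d: d["marks"].clip(upper=100))   # cap the malformed mark at 100
         .fillna({"marks": students["marks"].median(), "grade": "NA"}))  # fill missing values
print(clean)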

19. Why is R used in Data Visualization?

R provides the best ecosystem for data analysis and visualization with more than 12,000
packages in Open-source repositories. It has huge community support, which means you can
easily find the solution to your problems on various platforms like StackOverflow.

It has better data management and supports distributed computing by splitting the operations
between multiple tasks and nodes, which eventually decreases the complexity and execution
time of large datasets.

20. What are the popular libraries used in Data Science?

Below are the popular libraries used for data extraction, cleaning, visualization, and deploying DS
models:

 TensorFlow: Supports parallel computing with impeccable library management backed by


Google.
 SciPy: Mainly used for solving differential equations, multidimensional programming, data
manipulation, and visualization through graphs and charts.
 Pandas: Used to implement the ETL(Extracting, Transforming, and Loading the datasets)
capabilities in business applications.
 Matplotlib: Being free and open-source, it can be used as a replacement for MATLAB, which
results in better performance and low memory consumption.
 PyTorch: Best for projects which involve Machine Learning algorithms and Deep Neural
Networks.

21. What is variance in Data Science?

Variance is a type of error that occurs in a Data Science model when the model ends up being
too complex and learns features from data, along with the noise that exists in it. This kind of error
can occur if the algorithm used to train the model has high complexity, even though the data and
the underlying patterns and trends are quite easy to discover. This makes the model a very
sensitive one that performs well on the training dataset but poorly on the testing dataset, and on
any kind of data that the model has not yet seen. Variance generally leads to poor accuracy in
testing and results in overfitting.

22. What is pruning in a decision tree algorithm?

Pruning a decision tree is the process of removing the sections of the tree that are not necessary
or are redundant. Pruning leads to a smaller decision tree, which performs better and gives
higher accuracy and speed.

23. What is entropy in a decision tree algorithm?

In a decision tree algorithm, entropy is the measure of impurity or randomness. The entropy of a
given dataset tells us how pure or impure the values of the dataset are. In simple terms, it tells us
about the variance in the dataset.
For example, suppose we are given a box with 10 blue marbles. Then, the entropy of the box is 0 as it contains marbles of the same color, i.e., there is no impurity. If we need to draw a marble from the box, the probability of it being blue will be 1.0. However, if we replace 4 of the blue marbles with 4 red marbles, the probability of drawing a blue marble drops to 0.6 and the entropy of the box rises to about 0.97 bits.
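A short sketch that computes this entropy numerically (base-2 Shannon entropy; the marble counts above are the only inputs):

import math

def entropy(probabilities):
    # H = -sum(p * log2(p)) over outcomes with non-zero probability
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

print(entropy([1.0]))        # 10 blue marbles only  -> 0.0 (pure box)
print(entropy([0.6, 0.4]))   # 6 blue, 4 red marbles -> about 0.971 bits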

24. What is information gain in a decision tree algorithm?

When building a decision tree, at each step, we have to create a node that decides which feature
we should use to split data, i.e., which feature would best separate our data so that we can make
predictions. This decision is made using information gain, which is a measure of how much
entropy is reduced when a particular feature is used to split the data. The feature that gives the
highest information gain is the one that is chosen to split the data.
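A minimal worked sketch of information gain for one candidate split, using a tiny made-up label set (entropy of the parent minus the weighted entropy of the children):

import math

def entropy(labels):
    total = len(labels)
    counts = {c: labels.count(c) for c in set(labels)}
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

parent = ["yes", "yes", "yes", "yes", "no", "no", "no", "no"]
left   = ["yes", "yes", "yes", "yes"]   # one branch of a candidate split
right  = ["no", "no", "no", "no"]       # the other branch

weighted_children = ((len(left) / len(parent)) * entropy(left)
                     + (len(right) / len(parent)) * entropy(right))
gain = entropy(parent) - weighted_children
print(gain)   # 1.0 bit: this (perfect) split removes all impurity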

25. What is k-fold cross-validation?

In k-fold cross-validation, we divide the dataset into k equal parts. After this, we loop over the
entire dataset k times. In each iteration of the loop, one of the k parts is used for testing, and the

other k − 1 parts are used for training. Using k-fold cross-validation, each one of the k parts of the
dataset ends up being used for training and testing purposes.
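A brief scikit-learn sketch of 5-fold cross-validation; the iris dataset and logistic regression are placeholders, not part of the syllabus material:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
# cv=5 splits the data into 5 parts; each part is used once for testing.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores)         # one accuracy score per fold
print(scores.mean())  # average accuracy over the k folds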

26. Explain how a recommender system works.

A recommender system is a system that many consumer-facing, content-driven, online platforms


employ to generate recommendations for users from a library of available content. These
systems generate recommendations based on what they know about the users’ tastes from their
activities on the platform.

For example, imagine that we have a movie streaming platform, similar to Netflix or Amazon
Prime. If a user has previously watched and liked movies from action and horror genres, then it
means that the user likes watching the movies of these genres. In that case, it would be better to
recommend such movies to this particular user. These recommendations can also be generated
based on what users with a similar taste like watching.

27. What is a normal distribution?

Data distribution is a visualization tool to analyze how data is spread out or distributed. Data can
be distributed in various ways. For instance, it could be with a bias to the left or the right, or it
could all be jumbled up.

Data may also be distributed around a central value, i.e., mean, median, etc. This kind of
distribution has no bias either to the left or to the right and is in the form of a bell-shaped curve.
This distribution also has its mean equal to the median. This kind of distribution is called a normal
distribution.
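A quick numerical illustration with NumPy (simulated data, not a formal proof): for normally distributed samples the mean and median nearly coincide, reflecting the unbiased, bell-shaped form described above.

import numpy as np

rng = np.random.default_rng(0)
samples = rng.normal(loc=50, scale=10, size=100_000)  # mean 50, standard deviation 10

print(np.mean(samples))    # close to 50
print(np.median(samples))  # also close to 50 (no bias to the left or right)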

28. What is Deep Learning?

Deep Learning is a kind of Machine Learning, in which neural networks are used to imitate the
structure of the human brain, and just like how a brain learns from information, machines are also
made to learn from the information that is provided to them.

Deep Learning is an advanced version of neural networks to make the machines learn from data.
In Deep Learning, the neural networks comprise many hidden layers (which is why it is called

‘deep’ learning) that are connected to each other, and the output of the previous layer is the input
of the current layer.

29.Mention the Tools for Data Science

Following are some tools required for data science:

o Data Analysis tools: R, Python, Statistics, SAS, Jupyter, R Studio, MATLAB, Excel,
RapidMiner.
o Data Warehousing: ETL, SQL, Hadoop, Informatica/Talend, AWS Redshift
o Data Visualization tools: R, Jupyter, Tableau, Cognos.
o Machine learning tools: Spark, Mahout, Azure ML studio.

PART - B
1. Explain various steps in the Data Science process (OR) the Data Science Lifecycle.

The main phases of the data science life cycle are given below:

1. Discovery: The first phase is discovery, which involves asking the right questions. When you
start any data science project, you need to determine what are the basic requirements, priorities,
and project budget. In this phase, we need to determine all the requirements of the project such
as the number of people, technology, time, data, an end goal, and then we can frame the
business problem at the first hypothesis level.

2. Data preparation: Data preparation is also known as Data Munging. In this phase, we need to
perform the following tasks:

o Data cleaning

o Data Reduction

o Data integration

o Data transformation

After performing all the above tasks, we can easily use this data for our further processes.

3. Model Planning: In this phase, we need to determine the various methods and techniques to
establish the relation between input variables. We will apply Exploratory Data Analysis (EDA) by using various statistical formulas and visualization tools to understand the relations between variables and to see what the data can tell us. Common tools used for model planning are:

o SQL Analysis Services

o R

o SAS

o Python

4. Model-building: In this phase, the process of model building starts. We will create datasets for training and testing purposes. We will apply different techniques such as association, classification, and clustering to build the model; a short Python sketch follows the tool list below.

Following are some common Model building tools:

o SAS Enterprise Miner

o WEKA

o SPSS Modeler

o MATLAB
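As a minimal illustration of this phase (a Python/scikit-learn sketch rather than one of the tools listed above, with a built-in dataset standing in for project data), the data is split into training and testing sets and a classifier is fitted and scored:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

model = DecisionTreeClassifier().fit(X_train, y_train)   # build the model on the training set
print("test accuracy:", model.score(X_test, y_test))     # evaluate it on the held-out test set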

5. Operationalize: In this phase, we will deliver the final reports of the project, along with briefings, code, and technical documents. This phase provides a clear overview of the complete project performance and other components on a small scale before the full deployment.

6. Communicate results: In this phase, we will check whether we have reached the goal set in the initial phase. We will communicate the findings and final result to the business team.

Applications of Data Science:


o Image recognition and speech recognition:
Data science is currently used for image and speech recognition. When you upload an image on Facebook and start getting suggestions to tag your friends, this automatic tagging suggestion uses an image recognition algorithm, which is part of data science. When you say something to "Ok Google", Siri, Cortana, etc., and these devices respond as per the voice command, this is possible because of speech recognition algorithms.

o Gaming world:
In the gaming world, the use of machine learning algorithms is increasing day by day. EA Sports, Sony, and Nintendo are widely using data science to enhance the user experience.

o Internet search:
When we want to search for something on the internet, we use different types of search engines such as Google, Yahoo, Bing, Ask, etc. All these search engines use data science technology to make the search experience better, and you can get a search result within a fraction of a second.

o Transport:
Transport industries are also using data science technology to create self-driving cars. With self-driving cars, it will be easier to reduce the number of road accidents.

o Healthcare:
In the healthcare sector, data science provides a lot of benefits. Data science is being used for tumor detection, drug discovery, medical image analysis, virtual medical bots, etc.

o Recommendation systems:
Most companies, such as Amazon, Netflix, Google Play, etc., are using data science technology to create a better user experience with personalized recommendations. For example, when you search for something on Amazon and start getting suggestions for similar products, this is because of data science technology.

o Risk detection:
Finance industries have always had issues of fraud and risk of losses, but with the help of data science these risks can be reduced. Most finance companies are looking for data scientists to avoid risk and losses while increasing customer satisfaction.

2. Explain the facets of data in detail

In Data Science and Big Data you’ll come across many different types of data, and each of them

tends to require different tools and techniques. The main categories of data are these:

 Structured

 Unstructured

 Natural Language

 Machine-generated

 Graph-based

 Audio, video and images

 Streaming

Let’s explore all these interesting data types..

Structured Data

Structured data is data that depends on a data model and resides in a fixed field within a record. It's often easy to store structured data in tables within databases or Excel files.

SQL, or Structured Query Language, is the preferred way to manage and query data that resides in databases. You may also come across structured data that might give you a hard time storing it in a traditional relational database. Hierarchical data such as a family tree is one such example.

The world isn't made up of structured data, though; it's imposed upon it by humans and machines.
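As a small illustration of structured data living in fixed fields (an assumed example using Python's built-in sqlite3 module and pandas; the table and values are invented):

import sqlite3
import pandas as pd

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, department TEXT, salary REAL)")
conn.executemany("INSERT INTO employees VALUES (?, ?, ?)",
                 [("Asha", "IT", 52000.0), ("Ravi", "HR", 48000.0)])

# SQL is the natural way to query data that sits in fixed fields within records.
df = pd.read_sql("SELECT name, salary FROM employees WHERE department = 'IT'", conn)
print(df)
conn.close()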

Unstructured Data

Unstructured data is data that isn’t easy to fit into a data model because the content is context-

specific or varying. One example of unstructured data is your regular email. Although email

contains structured elements such as the sender, title, and body text, it’s a challenge to find the

number of people who have written an email complaint about a specific employee because so

many ways exist to refer to a person, for example. The thousands of different languages and

dialects out there further complicate this.

A human-written email is also a perfect example of natural language data.

Natural Language

Natural language is a special type of unstructured data; it's challenging to process because it requires knowledge of specific data science techniques and linguistics. The natural

language processing community has had success in entity recognition, topic recognition,

summarization, text completion, and sentiment analysis, but models trained in one domain

don’t generalize well to other domains. Even state-of-the-art techniques aren’t able to decipher the

meaning of every piece of text. This shouldn’t be a surprise though: humans struggle with natural

language as well. It’s ambiguous by nature. The concept of meaning itself is questionable here.

Have two people listen to the same conversation. Will they get the same meaning? The meaning

of the same words can vary when coming from someone upset or joyous.

Machine-generated Data

Machine-generated data is information that's automatically created by a computer, process,

application or other machine without human intervention. Machine-generated data is

becoming a major data resource and will continue to do so.

The analysis of Machine data relies on highly scalable tools, due to high volume and

speed.

Examples are web server logs, call detail records, network event logs, and telemetry.


Machine data of this kind fits nicely in classic, table-structured databases, but that is not the best approach for highly interconnected or "networked" data, where the relationships between entities have a valuable role to play.

Graph-based or Network Data

“Graph data” can be a confusing term because any data can be shown in a graph. “Graph” in this

case points to mathematical graph theory. In graph theory, a graph is a mathematical

structure to model pair-wise relationships between objects. Graph or network data is, in

short, data that focuses on the relationship or adjacency of objects.

The graph structures use nodes, edges, and properties to represent and store graphical

data.

Friends in a social network are an example of graph-based data.
Graph-based data is a natural way to represent social networks, and its structure allows you to

calculate specific metrics such as the influence of a person and the shortest path between two

people.

Graph databases are used to store graph-based data and are queried with specialized query

languages such as SPARQL.

Graph data poses its own challenges, but for a computer, interpreting audio and image data can be even more difficult.

Audio, Images and Videos

Audio, image, and video are data types that pose specific challenges to a data scientist. Tasks

that are trivial for humans, such as recognizing objects in pictures, turn out to be challenging for

computers.

Multimedia data in the form of audio, video, images and sensor signals have become an integral

part of everyday life. Moreover, they have revolutionized product testing and evidence collection

by providing multiple sources of data for quantitative and systematic assessment.

We have various libraries, development languages, and IDEs commonly used in the field, such as:
 MATLAB
 openCV
 ImageJ
 Python
 R
 Java
 C
 C++
 C#

Streaming Data

While streaming data can take almost any of the previous forms, it has an extra property.

The data flows into the system when an event happens instead of being loaded into a data

store in a batch. Although it isn't really a different type of data, we treat it here as such because

you need to adapt your process to deal with this type of information.

3. What are the steps in Data Cleansing, explain with example?

Data cleaning is the process of fixing or removing incorrect, corrupted, incorrectly formatted,
duplicate, or incomplete data within a dataset. When combining multiple data sources, there are
many opportunities for data to be duplicated or mislabeled. If data is incorrect, outcomes and
algorithms are unreliable, even though they may look correct. There is no one absolute way to
prescribe the exact steps in the data cleaning process because the processes will vary from
dataset to dataset. But it is crucial to establish a template for your data cleaning process so you
know you are doing it the right way every time.

Difference between data cleaning and data transformation

Data cleaning is the process that removes data that does not belong in your dataset. Data
transformation is the process of converting data from one format or structure into another.
Transformation processes can also be referred to as data wrangling, or data munging,
transforming and mapping data from one "raw" data form into another format for warehousing
and analyzing. This article focuses on the processes of cleaning that data.

Techniques used for data cleaning

While the techniques used for data cleaning may vary according to the types of data your
company stores, you can follow these basic steps to map out a framework for your organization.

Step 1: Remove duplicate or irrelevant observations

Remove unwanted observations from your dataset, including duplicate observations or irrelevant
observations. Duplicate observations will happen most often during data collection. When you
combine data sets from multiple places, scrape data, or receive data from clients or multiple
departments, there are opportunities to create duplicate data. De-duplication is one of the largest
areas to be considered in this process. Irrelevant observations are when you notice observations
that do not fit into the specific problem you are trying to analyze. For example, if you want to
analyze data regarding millennial customers, but your dataset includes older generations, you

might remove those irrelevant observations. This can make analysis more efficient and minimize
distraction from your primary target—as well as creating a more manageable and more
performant dataset.

Step 2: Fix structural errors

Structural errors are when you measure or transfer data and notice strange naming conventions,
typos, or incorrect capitalization. These inconsistencies can cause mislabeled categories or
classes. For example, you may find “N/A” and “Not Applicable” both appear, but they should be
analyzed as the same category.

Step 3: Filter unwanted outliers

Often, there will be one-off observations where, at a glance, they do not appear to fit within the
data you are analyzing. If you have a legitimate reason to remove an outlier, like improper data-
entry, doing so will help the performance of the data you are working with. However, sometimes it
is the appearance of an outlier that will prove a theory you are working on. Remember: just
because an outlier exists, doesn’t mean it is incorrect. This step is needed to determine the
validity of that number. If an outlier proves to be irrelevant for analysis or is a mistake, consider
removing it.

Step 4: Handle missing data

You can’t ignore missing data because many algorithms will not accept missing values. There
are a couple of ways to deal with missing data. Neither is optimal, but both can be considered.

1. As a first option, you can drop observations that have missing values, but doing this will
drop or lose information, so be mindful of this before you remove it.

2. As a second option, you can input missing values based on other observations; again,
there is an opportunity to lose integrity of the data because you may be operating from
assumptions and not actual observations.

3. As a third option, you might alter the way the data is used to effectively navigate null
values.
Step 5: Validate and QA

At the end of the data cleaning process, you should be able to answer these questions as a part
of basic validation:

 Does the data make sense?
 Does the data follow the appropriate rules for its field?
 Does it prove or disprove your working theory, or bring any insight to light?
 Can you find trends in the data to help you form your next theory?
 If not, is that because of a data quality issue?

False conclusions because of incorrect or “dirty” data can inform poor business strategy and
decision-making. False conclusions can lead to an embarrassing moment in a reporting meeting
when you realize your data doesn’t stand up to scrutiny. Before you get there, it is important to
create a culture of quality data in your organization. To do this, you should document the tools
you might use to create this culture and what data quality means to you.


Components of quality data

Determining the quality of data requires an examination of its characteristics, then weighing those
characteristics according to what is most important to your organization and the application(s) for
which they will be used.

5 characteristics of quality data

1. Validity. The degree to which your data conforms to defined business rules or constraints.

2. Accuracy. Ensure your data is close to the true values.

3. Completeness. The degree to which all required data is known.

4. Consistency. Ensure your data is consistent within the same dataset and/or across multiple data sets.

5. Uniformity. The degree to which the data is specified using the same unit of measure.

Benefits of data cleaning

Having clean data will ultimately increase overall productivity and allow for the highest quality
information in your decision-making. Benefits include:

 Removal of errors when multiple sources of data are at play.

 Fewer errors make for happier clients and less-frustrated employees.

 Ability to map the different functions and what your data is intended to do.

 Monitoring errors and better reporting to see where errors are coming from, making it
easier to fix incorrect or corrupt data for future applications.

 Using tools for data cleaning will make for more efficient business practices and quicker
decision-making.

4. Explain the common errors that occur in the data cleansing process.


Data Cleansing: Problems and Solutions

It is more important for any organization to have the right data as compared to a large data
set. Data cleansing solutions can have several problems during the process of data
scrubbing. The company needs to understand the various problems and figure out how to
tackle them. Some of the key data cleaning problems and solutions include -

 Data is never static


It is important that the data cleansing process arranges the data so that it is easily
accessible to everyone who needs it. The warehouse should contain unified data and not
in a scattered manner. The data warehouse must have a documented system which is
helpful for the employees to easily access the data from different sources. Data cleaning
also further helps to improve the data quality by removing inaccurate data as well as
corrupt and duplicate entries.

 Incorrect data may lead to bad decisions


While operating your business you rely on certain source of data, based on which you
make most of your business decisions. If the data has a lot of errors, the decisions you
take may be incorrect and prove to be hazardous for your business. The way you collect
data and how your data warehouse functions can easily have an impact on your
productivity.

 Incorrect data can affect client records

Complete client records are only possible when the names and addresses match. Names
and addresses of the client can be poor sources of data. To avoid these mistakes,
companies should provide external references which are capable of verifying the data,
supplementing data points and correcting any inconsistencies.

 Develop a data cleansing framework in advance


Data cleansing can be a time consuming and expensive job for your company. Once the
data is cleaned it needs to be stored in a secure location. The staff should keep a
complete log of the entire process so as to ascertain which data went through which
process. If a data scrubbing framework is not created in advance, the entire process can
become repetitive.

 Big data can bring in bigger problems


Big data needs regular cleansing to maintain its effectiveness. It requires complex
computer data analysis of semi-structured or structured and voluminous data. Data
cleansing helps in extracting information from such a big set of data and come up with
some data which can be used to make certain key business decisions.

5.Explain in detail about Data warehousing and Data Mining.

Data warehousing is a method of organizing and compiling data into one database, whereas
data mining deals with fetching important data from databases. Data mining attempts to depict
meaningful patterns through a dependency on the data that is compiled in the data warehouse.

DATA WAREHOUSE:

A data warehouse is where data can be collected for mining purposes, usually with large storage
capacity. Various organizations’ systems are in the data warehouse, where it can be fetched as
per usage.

Source 🡪 Extract 🡪 Transform 🡪 Load 🡪 Target.

(Data warehouse process)

Data warehouses collaborate data from several sources and ensure data accuracy, quality, and
consistency. System execution is boosted by differentiating the process of analytics from
traditional databases. In a data warehouse, data is sorted into a formatted pattern by type and as
needed. The data is examined by query tools using several patterns.

Data warehouses store historical data and handle requests faster, helping in online analytical
processing, whereas a database is used to store current transactions in a business process that
is called online transaction processing.
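A toy illustration of the Source 🡪 Extract 🡪 Transform 🡪 Load 🡪 Target flow in Python (in-memory SQLite tables with invented rows, purely as a sketch of the idea):

import sqlite3
import pandas as pd

source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE sales (region TEXT, amount TEXT)")
source.executemany("INSERT INTO sales VALUES (?, ?)",
                   [("south", "1200"), ("SOUTH", "800"), ("north", None)])

# Extract: pull the raw rows from the operational source
df = pd.read_sql("SELECT * FROM sales", source)

# Transform: fix types, unify formatting, drop incomplete rows
df["amount"] = pd.to_numeric(df["amount"], errors="coerce")
df["region"] = df["region"].str.title()
df = df.dropna()

# Load: write the cleaned, consistent data into the warehouse-style target table
target = sqlite3.connect(":memory:")
df.to_sql("sales_fact", target, index=False)
print(pd.read_sql("SELECT * FROM sales_fact", target))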

FEATURES OF DATA WAREHOUSES:

 Subject Oriented:
It provides you with important data about a specific subject like suppliers, products, promotion,
customers, etc. Data warehousing usually handles the analysis and modeling of data that assist
any organization to make data-driven decisions.

 Integrated:
Different heterogeneous sources are put together to build a data warehouse, such as flat files or relational databases.

 Time-Variant:
The data collected in a data warehouse is identified with a specific period.

 Nonvolatile:
This means the earlier data is not deleted when new data is added to the data warehouse. The
operational database and data warehouse are kept separate and thus continuous changes in the
operational database are not shown in the data warehouse.

APPLICATIONS OF DATA WAREHOUSES:

Data warehouses help analysts or senior executives analyze, organize, and use data for decision
making.

It is used in the following fields:

 Consumer goods
 Banking services
 Financial services
 Manufacturing
 Retail sectors

ADVANTAGES OF DATA WAREHOUSING:

 Cost-efficient and provides quality of data


 Performance and productivity are improved
 Accurate data access and consistency

DATA MINING:

In this process, data is extracted and analyzed to fetch useful information. In data mining hidden
patterns are researched from the dataset to predict future behavior. Data mining is used to
indicate and discover relationships through the data.

Data mining uses statistics, artificial intelligence, machine learning systems, and some databases
to find hidden patterns in the data. It supports business-related queries that are time-consuming
to resolve.

FEATURES OF DATA MINING:

 It is good with large databases and datasets


 It predicts future results
 It creates actionable insights
 It utilizes the automated discovery of patterns

ADVANTAGES OF DATA MINING:

 Fraud Detection:
It is used to find which insurance claims, phone calls, debit or credit purchases are fraud.

 Trend Analysis:
Existing marketplace trends are analyzed, which provides a strategic benefit as it helps in the reduction of costs, for example by manufacturing according to demand.

 Market Analysis:
It can predict the market and therefore help to make business decisions. For example: it can
identify a target market for a retailer, or certain types of products desired by types of customers.

DATA MINING TECHNIQUES:

 Classification:
It is used to fetch the appropriate information from the dataset and to segregate different classes
that are present in the dataset. Below are the classification models.

1. K-nearest neighbors
2. Support Vector Machine
3. Gaussian Naïve Bayes, etc.

 Clustering:
It is used to find similarities in data by putting related data together and helping to identify
different variations in the dataset. It helps to find hidden patterns. An example of clustering is text
mining, medical diagnostics, etc.
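A minimal clustering sketch (k-means via scikit-learn on a tiny invented 2-D dataset, just to show the idea of grouping similar points):

import numpy as np
from sklearn.cluster import KMeans

points = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],    # one dense group
                   [8.0, 8.0], [8.3, 7.9], [7.8, 8.2]])   # another dense group

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(kmeans.labels_)           # cluster assignment for each point
print(kmeans.cluster_centers_)  # the two discovered group centres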

 Association Rules:

They are used to identify a connection between two or more items. For example, if-then rules over items that are frequently purchased in tandem in a grocery store can show what proportion of items customers buy together. Lift, confidence, and support are measures used in association rules.

 Outlier Detection:
It is used to identify patterns that do not match the normal behavior in the data, as an outlier deviates from the rest of the data points. It helps in fraud detection, intrusion detection, etc. Boxplots and z-scores are common ways to detect outliers.
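A small z-score sketch of this idea (illustrative numbers; the threshold of 3 standard deviations is a common rule of thumb, not a fixed standard):

import numpy as np

values = np.array([10, 11, 9, 10, 12, 11, 10, 9, 11, 10, 12, 9, 10, 11, 10, 95])
z = (values - values.mean()) / values.std()   # how many standard deviations from the mean

print(np.round(z, 2))
print("flagged outliers:", values[np.abs(z) > 3])   # only the extreme value 95 is flagged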

6.Steps Involved in Data Science Modelling.

The key steps involved in Data Science Modelling are:

 Step 1: Understanding the Problem


 Step 2: Data Extraction
 Step 3: Data Cleaning
 Step 4: Exploratory Data Analysis
 Step 5: Feature Selection
 Step 6: Incorporating Machine Learning Algorithms
 Step 7: Testing the Models
 Step 8: Deploying the Model

Step 1: Understanding the Problem


The first step involved in Data Science Modelling is understanding the problem. A
Data Scientist listens for keywords and phrases when interviewing a line-of-business
expert about a business challenge. The Data Scientist breaks down the problem into a
procedural flow that always involves a holistic understanding of the business
challenge, the Data that must be collected, and various Artificial Intelligence and Data
Science approach that can be used to address the problem.

Step 2: Data Extraction
The next step in Data Science Modelling is Data Extraction. Not just any Data, but the
Unstructured Data pieces you collect, relevant to the business problem you’re trying to
address. The Data Extraction is done from various sources online, surveys, and
existing Databases.

Step 3: Data Cleaning

Data Cleaning is useful as you need to sanitize Data while gathering it. The following
are some of the most typical causes of Data Inconsistencies and Errors:

 Duplicate items are reduced from a variety of Databases.


 The error with the input Data in terms of Precision.
 Changes, Updates, and Deletions are made to the Data entries.
 Variables with missing values across multiple Databases.

Step 4: Exploratory Data Analysis

Exploratory Data Analysis (EDA) is a robust technique for familiarising yourself with
Data and extracting useful insights. Data Scientists sift through Unstructured Data to
find patterns and infer relationships between Data elements. Data Scientists use
Statistics and Visualisation tools to summarise Central Measurements and variability
to perform EDA.

If Data skewness persists, appropriate transformations are used to scale the


distribution around its mean. When Datasets have a lot of features, exploring them
can be difficult. As a result, to reduce the complexity of Model inputs, Feature
Selection is used to rank them in order of significance in Model Building for enhanced
efficiency. Using Business Intelligence tools like Tableau, MicroStrategy, etc. can be
quite beneficial in this step. This step is crucial in Data Science Modelling as the
Metrics are studied carefully for validation of Data Outcomes.

Step 5: Feature Selection

Feature Selection is the process of identifying and selecting the features that
contribute the most to the prediction variable or output that you are interested in, either
automatically or manually.

The presence of irrelevant characteristics in your Data can reduce the Model accuracy
and cause your Model to train based on irrelevant features. In other words, if the
features are strong enough, the Machine Learning Algorithm will give fantastic
outcomes. Two types of characteristics must be addressed:

 Consistent characteristics that are unlikely to change.


 Variable characteristics whose values change over time.

Step 6: Incorporating Machine Learning Algorithms

This is one of the most crucial processes in Data Science Modelling as the Machine
Learning Algorithm aids in creating a usable Data Model. There are a lot of algorithms
to pick from, the Model is selected based on the problem. There are three types of
Machine Learning methods that are incorporated:

1) Supervised Learning

It is based on the results of a previous operation that is related to the existing business operation.
Based on previous patterns, Supervised Learning aids in the prediction of an outcome. Some of
the Supervised Learning Algorithms are:

 Linear Regression
 Random Forest
 Support Vector Machines

2) Unsupervised Learning

This form of learning has no pre-existing consequence or pattern. Instead, it concentrates on


examining the interactions and connections between the presently available Data points. Some
of the Unsupervised Learning Algorithms are:

 Apriori Algorithm

 K-means Clustering
 Hierarchical Clustering
 Anomaly Detection

3) Reinforcement Learning

It is a fascinating Machine Learning technique that uses a dynamic Dataset that interacts with the
real world. In simple terms, it is a mechanism by which a system learns from its mistakes and
improves over time. Some of the Reinforcement Learning Algorithms are:

 Q-Learning
 State-Action-Reward-State-Action (SARSA)
 Deep Q Network


Step 7: Testing the Models

This is the next phase, and it’s crucial to check that our Data Science Modelling efforts
meet the expectations. The Data Model is applied to the Test Data to check if it’s
accurate and houses all desirable features. You can further test your Data Model to
identify any adjustments that might be required to enhance the performance and
achieve the desired results. If the required precision is not achieved, you can go back
to Step 5 (Machine Learning Algorithms), choose an alternate Data Model, and then
test the model again.

Step 8: Deploying the Model

The Model which provides the best result based on test findings is completed and
deployed in the production environment whenever the desired result is achieved
through proper testing as per the business needs. This concludes the process of Data
Science Modelling.

Applications of Data Science

Every industry benefits from the experience of Data Science companies, but the most
common areas where Data Science techniques are employed are the following:

 Banking and Finance: The banking industry can benefit from Data Science in
many aspects. Fraud Detection is a well-known application in this field that
assists banks in reducing non-performing assets.
 Healthcare: Health concerns are being monitored and prevented using
Wearable Data. The Data acquired from the body can be used in the medical
field to prevent future calamities.
 Marketing: Marketing offers a lot of potential, such as a more effective price
strategy. Pricing based on Data Science can help companies like Uber and E-
Commerce businesses enhance their profits.
 Government Policies: Based on Data gathered through surveys and other official sources, the government can use Data Science to better build policies that cater to the interests and wishes of the people.

