UNIT I
INTRODUCTION
S.NO.  COURSE CODE  COURSE TITLE                                  CATEGORY  PERIODS PER WEEK (L T P)  TOTAL CONTACT PERIODS  CREDITS
THEORY
1.     MA3354       Discrete Mathematics                          BSC       3 1 0                     4                      4
2.     CS3351       Digital Principles and Computer Organization  ESC       3 0 2                     5                      4
3.     CS3352       Foundations of Data Science                   PCC       3 0 0                     3                      3
4.     CD3291       Data Structures and Algorithms                PCC       3 0 0                     3                      3
5.     CS3391       Object Oriented Programming                   PCC       3 0 0                     3                      3
PRACTICALS
6.     CD3281       Data Structures and Algorithms Laboratory     PCC       0 0 4                     4                      2
7.     CS3381       Object Oriented Programming Laboratory        PCC       0 0 3                     3                      1.5
8.     CS3361       Data Science Laboratory                       PCC       0 0 4                     4                      2
9.     GE3361       Professional Development                      EEC       0 0 2                     2                      1
TOTAL                                                                       15 1 15                   31                     23.5
CS3352 FOUNDATIONS OF DATA SCIENCE
L T P C
3 0 0 3
COURSE OBJECTIVES:
To understand the data science fundamentals and process.
To learn to describe the data for the data science process.
To learn to describe the relationship between data.
To utilize the Python libraries for Data Wrangling.
To present and interpret data using visualization libraries in Python.
UNIT I INTRODUCTION
Data Science: Benefits and uses – Facets of data – Data Science Process: Overview – Defining research goals – Retrieving data – Data preparation – Exploratory data analysis – Build the model – Presenting findings and building applications – Data Mining – Data Warehousing – Basic statistical descriptions of data.
TEXTBOOKS:
REFERENCE:
1. Allen B. Downey, “Think Stats: Exploratory Data Analysis in Python”, Green Tea Press, 2014.
SENGUNTHAR COLLEGE OF ENGINEERING, TIRUCHENGODE - 637205
LECTURE PLAN
Designation: HOD / IT

S.No.  TOPIC                                      REFERENCES  TEACHING AID  NO. OF HOURS
UNIT I INTRODUCTION
UNIT III DESCRIBING RELATIONSHIPS
18     Correlation                                T2-CH2      Black Board   1
24     Interpretation of r2                       T2-CH2      Black Board   1
25     Multiple regression equations              T2-CH2      Black Board   1
36     Combining datasets                         T3-CH1      Black Board   1
37     Aggregation and grouping – pivot tables    T3-CH1      Black Board   1
38     Importing Matplotlib – Line plots          T3-CH2      PPT           1
39     Scatter plots                              T3-CH2      PPT           1
42     Histograms – legends – colors – subplots   T3-CH2      DEMO          1
UNIT - I
INTRODUCTION
PART – A
1. What is Data Science?
2. Differentiate between Data Analytics and Data Science
3. What are the challenges in Data Science?
4. List the facets of data.
5. What are the steps in the data science process?
6. Explain unstructured data and give an example.
7. What do you understand about linear regression?
8. What are outliers?
9. What do you understand by logistic regression?
10. What is a confusion matrix?
11. What do you understand about the true-positive rate and false-positive rate?
12. How is Data Science different from traditional application programming?
13. Explain the differences between supervised and unsupervised learning.
14. What is the difference between long format data and wide format data?
15. Mention some techniques used for sampling. What is the main advantage of
sampling?
PART – B
1. Explain various steps in the Data Science process (OR) the Data Science Lifecycle.
2. Explain the facets of data in detail.
3. What are the steps in Data Cleansing? Explain with an example.
4. Explain the common errors that occur in the data cleansing process. (7)
5. Explain in detail about Data Warehousing and Data Mining.
6. Steps Involved in Data Science Modelling.
LIST OF IMPORTANT QUESTIONS
UNIT - I
INTRODUCTION
PART – A
1. What is Data Science?
Data Science is a field of computer science that explicitly deals with turning data into information
and extracting meaningful insights out of it. The reason why Data Science is so popular is that
the kind of insights it allows us to draw from the available data has led to some major innovations
in several products and companies. Using these insights, we are able to determine the taste of a
particular customer, the likelihood of a product succeeding in a particular market, etc.
7. What do you understand about linear regression?
Linear regression models the linear relationship between a dependent variable and one or more independent variables. If there is only one independent variable, then it is called simple linear regression, and if there is more than one independent variable, then it is known as multiple linear regression.
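As a minimal illustration (made-up study-hours data, assuming scikit-learn is available):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: hours studied (independent) vs. exam score (dependent)
X = np.array([[1], [2], [3], [4], [5]])   # one independent variable -> simple linear regression
y = np.array([52, 57, 61, 68, 74])

model = LinearRegression().fit(X, y)
print(model.coef_, model.intercept_)      # slope and intercept of the fitted line
print(model.predict([[6]]))               # predicted score for 6 hours of study
```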
8. What are outliers?
An outlier is an observation that lies an abnormal distance from other values in a random sample from a population.
9. What do you understand by logistic regression?
Logistic regression is used to predict a binary (yes/no) outcome from one or more independent variables. For example, suppose we want to predict whether it will rain: temperature and humidity are the independent variables, and rain would be our dependent variable. The logistic regression algorithm produces an S-shaped curve.
Now, let us look at another scenario. Suppose the x-axis represents the runs scored by Virat Kohli and the y-axis represents the probability of team India winning the match. From this graph, we can say that if Virat Kohli scores more than 50 runs, there is a greater probability of team India winning the match. Similarly, if he scores fewer than 50 runs, the probability of team India winning the match is less than 50 percent.
So, in logistic regression, the Y value always lies within the range of 0 and 1. This is how logistic regression works.
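A minimal scikit-learn sketch of the idea, using made-up data (runs scored vs. match won):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: runs scored by a batsman vs. whether the team won (1) or lost (0)
runs = np.array([[10], [20], [35], [45], [55], [70], [85], [95]])
won  = np.array([0, 0, 0, 0, 1, 1, 1, 1])

clf = LogisticRegression().fit(runs, won)
# predict_proba returns values between 0 and 1 (points on the S-shaped curve)
print(clf.predict_proba([[30], [60]])[:, 1])
```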
10. What is a confusion matrix?
The confusion matrix is a table that is used to estimate the performance of a model. It tabulates the actual values and the predicted values in a 2×2 matrix.
True Positive (d): records where the actual values are true and the predicted values are also true.
False Negative (c): records where the actual values are true, but the predicted values are false.
False Positive (b): records where the actual values are false, but the predicted values are true.
True Negative (a): records where the actual values are false and the predicted values are also false.
The correct predictions are represented by the true positives and the true negatives. This is how the confusion matrix works.
11. What do you understand about the true-positive rate and false-positive rate?
True-positive rate: In Machine Learning, the true-positive rate, also referred to as sensitivity or recall, measures the percentage of actual positives that are correctly identified.
Formula: True-Positive Rate = True Positives / Total Positives
False-positive rate: The false-positive rate is the probability of falsely rejecting the null hypothesis for a particular test. It is calculated as the ratio of negative events wrongly categorized as positive (false positives) to the total number of actual negative events.
Formula: False-Positive Rate = False Positives / Total Negatives
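A small sketch, assuming scikit-learn and made-up labels, that builds the 2×2 confusion matrix and computes both rates:

```python
from sklearn.metrics import confusion_matrix

# Hypothetical actual and predicted labels (1 = positive, 0 = negative)
y_actual    = [1, 1, 1, 0, 0, 0, 1, 0]
y_predicted = [1, 0, 1, 0, 1, 0, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_actual, y_predicted).ravel()

tpr = tp / (tp + fn)   # true-positive rate (sensitivity / recall)
fpr = fp / (fp + tn)   # false-positive rate
print(tn, fp, fn, tp, tpr, fpr)
```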
12. How is Data Science different from traditional application programming?
In traditional programming paradigms, we used to analyze the input, figure out the expected
output, and write code, which contains rules and statements needed to transform the provided
input into the expected output. As we can imagine, these rules were not easy to write, especially,
for data that even computers had a hard time understanding, e.g., images, videos, etc.
Data Science shifts this process a little bit. In it, we need access to large volumes of data that
contain the necessary inputs and their mappings to the expected outputs. Then, we use Data
Science algorithms, which use mathematical analysis to generate rules to map the given inputs
to outputs.
This process of rule generation is called training. After training, we use some data that was set
aside before the training phase to test and check the system’s accuracy. The generated rules are
a kind of a black box, and we cannot understand how the inputs are being transformed into
outputs.
However, if the accuracy is good enough, then we can use the system (also called a model).
As described above, in traditional programming, we had to write the rules to map the input to the
output, but in Data Science, the rules are automatically generated or learned from the given data.
This helped solve some really difficult challenges that were being faced by several companies.
13. Explain the differences between supervised and unsupervised learning.
Supervised and unsupervised learning are two types of Machine Learning techniques. They both allow us to build models; however, they are used for solving different kinds of problems.
Supervised Learning:
o Works on data that contains both the inputs and the expected output, i.e., labeled data
o Used to create models that can be employed to predict or classify things
o Commonly used algorithms: linear regression, decision tree, etc.
Unsupervised Learning:
o Works on data that contains no mappings from input to output, i.e., unlabeled data
o Used to extract meaningful information out of large volumes of data
o Commonly used algorithms: K-means clustering, Apriori algorithm, etc.
14. What is the difference between the long format data and wide format data?
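Long format stores one row per variable-value pair, while wide format spreads a subject's values across multiple columns. A brief pandas sketch with made-up marks data:

```python
import pandas as pd

# Hypothetical wide-format data: one row per student, one column per subject
wide = pd.DataFrame({
    "name":    ["Asha", "Ravi"],
    "maths":   [88, 92],
    "physics": [79, 85],
})

# Wide -> long: one row per (student, subject) pair
long_df = wide.melt(id_vars="name", var_name="subject", value_name="marks")

# Long -> wide again
back = long_df.pivot(index="name", columns="subject", values="marks")
print(long_df)
print(back)
```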
15. Mention some techniques used for sampling. What is the main advantage of
sampling?
Sampling is defined as the process of selecting a sample from a group of people or from any
particular kind for research purposes. It is one of the most important factors which decides the
accuracy of a research/survey result.
Probability sampling: It involves random selection, which gives every element a chance of being selected. Probability sampling has various subtypes, such as simple random, systematic, stratified, and cluster sampling.
Non-Probability Sampling: Non-probability sampling follows non-random selection, which means the selection is done based on convenience or other required criteria. This helps to collect the data easily. The following are its various types:
o Convenience Sampling
o Purposive Sampling
o Quota Sampling
o Referral /Snowball Sampling
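As a rough pandas illustration (hypothetical customer data), simple random sampling and a stratified variant can look like this:

```python
import pandas as pd

# Hypothetical population of customers
population = pd.DataFrame({
    "customer_id": range(1, 101),
    "segment": ["retail"] * 60 + ["corporate"] * 40,
})

# Simple random sampling: every row has an equal chance of selection
random_sample = population.sample(n=10, random_state=42)

# Stratified sampling: take 10% from each segment so both groups are represented
stratified = population.groupby("segment", group_keys=False).sample(frac=0.1, random_state=42)
print(random_sample.shape, stratified["segment"].value_counts())
```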
Bias is a type of error that occurs in a Data Science model because of using an algorithm that is
not strong enough to capture the underlying patterns or trends that exist in the data. In other
words, this error occurs when the data is too complicated for the algorithm to understand, so it
ends up building a model that makes simple assumptions. This leads to lower accuracy because
of underfitting. Algorithms that can lead to high bias are linear regression, logistic regression,
etc.
Dimensionality reduction is the process of converting a dataset with a high number of dimensions
(fields) to a dataset with a lower number of dimensions. This is done by dropping some fields or
columns from the dataset. However, this is not done haphazardly. In this process, the dimensions
or fields are dropped only after making sure that the remaining information will still be enough to
succinctly describe similar information.
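A minimal sketch of one common dimensionality-reduction technique, PCA, using scikit-learn on made-up data:

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical dataset with 5 fields (dimensions) per record
rng = np.random.default_rng(0)
data = rng.normal(size=(100, 5))

# Reduce to 2 dimensions while keeping as much variance as possible
pca = PCA(n_components=2)
reduced = pca.fit_transform(data)
print(reduced.shape)                  # (100, 2)
print(pca.explained_variance_ratio_)  # how much information each component retains
```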
Data Scientists have to clean and transform the huge data sets in a form that they can work with.
It is important to deal with the redundant data for better results by removing nonsensical outliers,
malformed records, missing values, inconsistent formatting, etc.
Python libraries such as Matplotlib, Pandas, Numpy, Keras, and SciPy are extensively used
for Data cleaning and analysis. These libraries are used to load and clean the data and do
effective analysis. For example, a CSV file named “Student” has information about the students
of an institute like their names, standard, address, phone number, grades, marks, etc.
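As a hypothetical illustration of the “Student” file mentioned above, a few lines of pandas can load and begin cleaning such data (the file name and column names are assumed):

```python
import pandas as pd

# Load the hypothetical Student.csv described above
students = pd.read_csv("Student.csv")

# Basic cleaning: drop exact duplicates, strip stray spaces in names,
# and fill missing marks with the column average
students = students.drop_duplicates()
students["name"] = students["name"].str.strip()
students["marks"] = students["marks"].fillna(students["marks"].mean())

print(students.describe())   # quick statistical summary for analysis
```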
R provides the best ecosystem for data analysis and visualization with more than 12,000
packages in Open-source repositories. It has huge community support, which means you can
easily find the solution to your problems on various platforms like StackOverflow.
It has better data management and supports distributed computing by splitting the operations
between multiple tasks and nodes, which eventually decreases the complexity and execution
time of large datasets.
Popular libraries used for data extraction, cleaning, visualization, and deploying DS models include Pandas, NumPy, SciPy, Matplotlib, and Keras.
Variance is a type of error that occurs in a Data Science model when the model ends up being
too complex and learns features from data, along with the noise that exists in it. This kind of error
can occur if the algorithm used to train the model has high complexity, even though the data and
the underlying patterns and trends are quite easy to discover. This makes the model a very
sensitive one that performs well on the training dataset but poorly on the testing dataset, and on
any kind of data that the model has not yet seen. Variance generally leads to poor accuracy in
testing and results in overfitting.
Pruning a decision tree is the process of removing the sections of the tree that are not necessary
or are redundant. Pruning leads to a smaller decision tree, which performs better and gives
higher accuracy and speed.
In a decision tree algorithm, entropy is the measure of impurity or randomness. The entropy of a
given dataset tells us how pure or impure the values of the dataset are. In simple terms, it tells us
about the variance in the dataset.
For example, suppose we are given a box with 10 blue marbles. Then, the entropy of the box is 0, as it contains marbles of the same color, i.e., there is no impurity. If we need to draw a marble from the box, the probability of it being blue will be 1.0. However, if we replace 4 of the blue marbles with 4 red marbles, the probability of drawing a blue marble falls to 0.6 and the entropy of the box rises to roughly 0.97 bits.
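A quick check of those numbers in Python, using the standard Shannon entropy formula with base-2 logarithms:

```python
import math

def entropy(probabilities):
    """Shannon entropy in bits: H = -sum(p * log2(p))."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

print(entropy([1.0]))        # 10 blue marbles only -> 0.0 (pure box)
print(entropy([0.6, 0.4]))   # 6 blue, 4 red -> about 0.971 bits (impure box)
```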
When building a decision tree, at each step, we have to create a node that decides which feature
we should use to split data, i.e., which feature would best separate our data so that we can make
predictions. This decision is made using information gain, which is a measure of how much
entropy is reduced when a particular feature is used to split the data. The feature that gives the
highest information gain is the one that is chosen to split the data.
In k-fold cross-validation, we divide the dataset into k equal parts. After this, we loop over the
entire dataset k times. In each iteration of the loop, one of the k parts is used for testing, and the
other k − 1 parts are used for training. Using k-fold cross-validation, each one of the k parts of the
dataset ends up being used for training and testing purposes.
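A short scikit-learn sketch of 5-fold cross-validation on made-up data:

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20).reshape(10, 2)   # hypothetical dataset with 10 records
y = np.arange(10)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
for fold, (train_idx, test_idx) in enumerate(kf.split(X)):
    # in each iteration, 1 part is used for testing and k-1 parts for training
    print(f"fold {fold}: train={len(train_idx)} records, test={len(test_idx)} records")
```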
For example, imagine that we have a movie streaming platform, similar to Netflix or Amazon
Prime. If a user has previously watched and liked movies from action and horror genres, then it
means that the user likes watching the movies of these genres. In that case, it would be better to
recommend such movies to this particular user. These recommendations can also be generated
based on what users with a similar taste like watching.
Data distribution is a visualization tool to analyze how data is spread out or distributed. Data can
be distributed in various ways. For instance, it could be with a bias to the left or the right, or it
could all be jumbled up.
Data may also be distributed around a central value, i.e., mean, median, etc. This kind of
distribution has no bias either to the left or to the right and is in the form of a bell-shaped curve.
This distribution also has its mean equal to the median. This kind of distribution is called a normal
distribution.
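A small NumPy/Matplotlib sketch that simulates normally distributed values and shows the bell-shaped histogram described above:

```python
import numpy as np
import matplotlib.pyplot as plt

# Simulated data distributed around a central value (mean 50)
values = np.random.default_rng(1).normal(loc=50, scale=10, size=1000)

plt.hist(values, bins=30, edgecolor="black")
plt.title("Approximately normal (bell-shaped) distribution")
plt.xlabel("Value")
plt.ylabel("Frequency")
plt.show()
```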
Deep Learning is a kind of Machine Learning, in which neural networks are used to imitate the
structure of the human brain, and just like how a brain learns from information, machines are also
made to learn from the information that is provided to them.
Deep Learning is an advanced version of neural networks to make the machines learn from data.
In Deep Learning, the neural networks comprise many hidden layers (which is why it is called
‘deep’ learning) that are connected to each other, and the output of the previous layer is the input
of the current layer.
o Data Analysis tools: R, Python, Statistics, SAS, Jupyter, R Studio, MATLAB, Excel,
RapidMiner.
o Data Warehousing: ETL, SQL, Hadoop, Informatica/Talend, AWS Redshift
o Data Visualization tools: R, Jupyter, Tableau, Cognos.
o Machine learning tools: Spark, Mahout, Azure ML studio.
PART - B
1. Explain various steps in the Data Science process (OR) the Data Science Lifecycle.
The main phases of data science life cycle are given below:
1. Discovery: The first phase is discovery, which involves asking the right questions. When you start any data science project, you need to determine the basic requirements, priorities, and project budget. In this phase, we determine all the requirements of the project, such as the number of people, technology, time, data, and the end goal, and then we can frame the business problem at the first hypothesis level.
2. Data preparation: Data preparation is also known as Data Munging. In this phase, we need to
perform the following tasks:
o Data cleaning
o Data Reduction
o Data integration
o Data transformation
After performing all the above tasks, we can easily use this data for our further processes.
3. Model Planning: In this phase, we need to determine the various methods and techniques to establish the relations between input variables. We apply Exploratory Data Analysis (EDA), using various statistical formulas and visualization tools, to understand the relations between variables and to see what the data can tell us. Common tools used for model planning are:
o R
o SAS
o Python
4. Model-building: In this phase, the process of model building starts. We will create datasets
for training and testing purpose. We will apply different techniques such as association,
classification, and clustering, to build the model. Common tools used for model building are:
o WEKA
o SPSS Modeler
o MATLAB
5. Operationalize: In this phase, we will deliver the final reports of the project, along with
briefings, code, and technical documents. This phase provides you a clear overview of complete
project performance and other components on a small scale before the full deployment.
6. Communicate results: In this phase, we check whether we have reached the goal set in the initial phase. We communicate the findings and final results to the business team.
o Image and speech recognition: Data science is currently used for image and speech recognition. When you upload an image on Facebook, you start getting suggestions to tag your friends; this automatic tagging suggestion uses an image recognition algorithm, which is part of data science. Similarly, when you say something to "Ok Google", Siri, or Cortana and the device responds to your voice, this is made possible by speech recognition algorithms.
o Gaming world: In the gaming world, the use of machine learning algorithms is increasing day by day. EA Sports, Sony, and Nintendo are widely using data science to enhance the user experience.
o Internet search: When we want to search for something on the internet, we use search engines such as Google, Yahoo, Bing, Ask, etc. All these search engines use data science technology to make the search experience better and return results within a fraction of a second.
o Transport: Transport industries are also using data science technology to create self-driving cars. With self-driving cars, it will be easier to reduce the number of road accidents.
o Healthcare:
In the healthcare sector, data science is providing lots of benefits. Data science is being
used for tumor detection, drug discovery, medical image analysis, virtual medical bots, etc.
o Recommendation systems: Most companies, such as Amazon, Netflix, and Google Play, use data science technology to create a better user experience through personalized recommendations. For example, when you search for something on Amazon, you start getting suggestions for similar products; this is because of data science technology.
o Risk detection: Finance industries have always faced issues of fraud and risk of losses, but with the help of data science these risks can be reduced. Most finance companies are looking for data scientists to avoid risk and losses while increasing customer satisfaction.
2. Explain the facets of data in detail.
In Data Science and Big Data you’ll come across many different types of data, and each of them tends to require different tools and techniques. The main categories of data are these:
Structured
Unstructured
Natural Language
Machine-generated
Graph-based
Streaming
Structured Data
Structured data is data that depends on a data model and resides in a fixed field within a record. It’s often easy to store structured data in tables within databases or Excel files. SQL, or Structured Query Language, is the preferred way to manage and query data that resides in databases. You may also come across structured data that gives you a hard time storing it in a traditional database; hierarchical data such as a family tree is one such example. The world isn’t made up of structured data alone, though; most of it is unstructured.
Unstructured Data
Unstructured data is data that isn’t easy to fit into a data model because the content is context-
specific or varying. One example of unstructured data is your regular email. Although email
contains structured elements such as the sender, title, and body text, it’s a challenge to find the
number of people who have written an email complaint about a specific employee because so
many ways exist to refer to a person, for example. The thousands of different languages and
dialects out there further complicate this.
Natural Language
Natural language is a special type of unstructured data; it’s challenging to process because it requires knowledge of specific data science techniques and linguistics. The natural
language processing community has had success in entity recognition, topic recognition,
summarization, text completion, and sentiment analysis, but models trained in one domain
don’t generalize well to other domains. Even state-of-the-art techniques aren’t able to decipher the
meaning of every piece of text. This shouldn’t be a surprise though: humans struggle with natural
language as well. It’s ambiguous by nature. The concept of meaning itself is questionable here.
Have two people listen to the same conversation. Will they get the same meaning? The meaning
of the same words can vary when coming from someone upset or joyous.
Machine-generated Data
Machine-generated data is information that is automatically created by a computer, process, application, or other machine without human intervention. Because of its high volume and speed, the analysis of machine data relies on highly scalable tools. Examples are web server logs, call detail records, network event logs, and telemetry.
Graph-based Data
Tabular storage is not the best approach for highly interconnected or “networked” data, where the relationships between entities are the point of interest. “Graph data” can be a confusing term because any data can be shown in a graph; here, “graph” refers to mathematical graph theory, in which a graph is a structure that models pair-wise relationships between objects. Graph or network data is, in short, data that focuses on the relationships or adjacency of objects. Graph structures use nodes, edges, and properties to represent and store graph data.
Graph-based data is a natural way to represent social networks, and its structure allows you to calculate specific metrics such as the influence of a person and the shortest path between two people.
Graph databases are used to store graph-based data and are queried with specialized query languages such as SPARQL.
Audio, Image, and Video
Graph data poses its challenges, but for a computer interpreting audio and image data it can be even harder.
Audio, image, and video are data types that pose specific challenges to a data scientist. Tasks
that are trivial for humans, such as recognizing objects in pictures, turn out to be challenging for
computers.
Multimedia data in the form of audio, video, images and sensor signals have become an integral
part of everyday life. Moreover, they have revolutionized product testing and evidence collection
We have various libraries, development languages, and IDEs commonly used in the field, such as:
MATLAB
OpenCV
ImageJ
Python
R
Java
C
C++
C#
Streaming Data
While streaming data can take almost any of the previous forms, it has an extra property.
The data flows into the system when an event happens instead of being loaded into a data
store in a batch. Although it isn’t really a different type of data, we treat it here as such because you need to adapt your process to deal with this type of information.
3. What are the steps in Data Cleansing? Explain with an example.
Data cleaning is the process of fixing or removing incorrect, corrupted, incorrectly formatted,
duplicate, or incomplete data within a dataset. When combining multiple data sources, there are
many opportunities for data to be duplicated or mislabeled. If data is incorrect, outcomes and
algorithms are unreliable, even though they may look correct. There is no one absolute way to
prescribe the exact steps in the data cleaning process because the processes will vary from
dataset to dataset. But it is crucial to establish a template for your data cleaning process so you
know you are doing it the right way every time.
Data cleaning is the process that removes data that does not belong in your dataset. Data
transformation is the process of converting data from one format or structure into another.
Transformation processes can also be referred to as data wrangling, or data munging,
transforming and mapping data from one "raw" data form into another format for warehousing
and analyzing. This section focuses on the process of cleaning the data.
While the techniques used for data cleaning may vary according to the types of data your
company stores, you can follow these basic steps to map out a framework for your organization.
Step 1: Remove duplicate or irrelevant observations
Remove unwanted observations from your dataset, including duplicate observations or irrelevant
observations. Duplicate observations will happen most often during data collection. When you
combine data sets from multiple places, scrape data, or receive data from clients or multiple
departments, there are opportunities to create duplicate data. De-duplication is one of the largest
areas to be considered in this process. Irrelevant observations are when you notice observations
that do not fit into the specific problem you are trying to analyze. For example, if you want to
analyze data regarding millennial customers, but your dataset includes older generations, you
might remove those irrelevant observations. This can make analysis more efficient and minimize
distraction from your primary target—as well as creating a more manageable and more
performant dataset.
Step 2: Fix structural errors
Structural errors are when you measure or transfer data and notice strange naming conventions,
typos, or incorrect capitalization. These inconsistencies can cause mislabeled categories or
classes. For example, you may find “N/A” and “Not Applicable” both appear, but they should be
analyzed as the same category.
Step 3: Filter unwanted outliers
Often, there will be one-off observations where, at a glance, they do not appear to fit within the
data you are analyzing. If you have a legitimate reason to remove an outlier, like improper data-
entry, doing so will help the performance of the data you are working with. However, sometimes it
is the appearance of an outlier that will prove a theory you are working on. Remember: just
because an outlier exists, doesn’t mean it is incorrect. This step is needed to determine the
validity of that number. If an outlier proves to be irrelevant for analysis or is a mistake, consider
removing it.
Step 4: Handle missing data
You can’t ignore missing data because many algorithms will not accept missing values. There
are a couple of ways to deal with missing data. Neither is optimal, but both can be considered.
1. As a first option, you can drop observations that have missing values, but doing this will
drop or lose information, so be mindful of this before you remove it.
2. As a second option, you can input missing values based on other observations; again,
there is an opportunity to lose integrity of the data because you may be operating from
assumptions and not actual observations.
3. As a third option, you might alter the way the data is used to effectively navigate null
values.
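A rough pandas sketch of these steps on hypothetical customer data (the file and column names are illustrative only):

```python
import pandas as pd

df = pd.read_csv("customers.csv")                 # hypothetical raw dataset

# Step 1: remove duplicate observations
df = df.drop_duplicates()

# Step 2: fix structural errors, e.g. treat "Not Applicable" and "N/A" as one category
df["status"] = df["status"].replace({"Not Applicable": "N/A"})

# Step 3: filter an obvious outlier (impossible ages) after confirming it is a mistake
df = df[df["age"].between(0, 120)]

# Step 4: handle missing values, here by imputing income from other observations
df["income"] = df["income"].fillna(df["income"].median())
```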
Step 5: Validate and QA
At the end of the data cleaning process, you should be able to answer these questions as a part
of basic validation:
Does the data make sense?
Does the data follow the appropriate rules for its field?
Does it prove or disprove your working theory, or bring any insight to light?
Can you find trends in the data to help you form your next theory?
If not, is that because of a data quality issue?
False conclusions because of incorrect or “dirty” data can inform poor business strategy and
decision-making. False conclusions can lead to an embarrassing moment in a reporting meeting
when you realize your data doesn’t stand up to scrutiny. Before you get there, it is important to
create a culture of quality data in your organization. To do this, you should document the tools
you might use to create this culture and what data quality means to you.
Determining the quality of data requires an examination of its characteristics, then weighing those
characteristics according to what is most important to your organization and the application(s) for
which they will be used.
1. Validity. The degree to which your data conforms to defined business rules or constraints.
2. Accuracy. The degree to which your data is close to the true values.
3. Consistency. Ensure your data is consistent within the same dataset and/or across multiple data sets.
4. Uniformity. The degree to which the data is specified using the same unit of measure.
Having clean data will ultimately increase overall productivity and allow for the highest quality
information in your decision-making. Benefits include:
Removal of errors when multiple sources of data are at play.
Ability to map the different functions and what your data is intended to do.
Monitoring errors and better reporting to see where errors are coming from, making it
easier to fix incorrect or corrupt data for future applications.
Using tools for data cleaning will make for more efficient business practices and quicker
decision-making.
4. Explain the common errors that occur in the data cleansing process.
It is more important for any organization to have the right data as compared to a large data
set. Data cleansing solutions can have several problems during the process of data
scrubbing. The company needs to understand the various problems and figure out how to
tackle them. Some of the key data cleaning problems and solutions include -
Complete client records are only possible when the names and addresses match. Names
and addresses of the client can be poor sources of data. To avoid these mistakes,
companies should provide external references which are capable of verifying the data,
supplementing data points and correcting any inconsistencies.
5. Explain in detail about Data Warehousing and Data Mining.
Data warehousing is a method of organizing and compiling data into one database, whereas
data mining deals with fetching important data from databases. Data mining attempts to depict
meaningful patterns through a dependency on the data that is compiled in the data warehouse.
DATA WAREHOUSE:
A data warehouse is where data can be collected for mining purposes, usually with large storage capacity. Data from various organizations’ systems is brought together in the data warehouse, from where it can be fetched as per usage.
Data warehouses consolidate data from several sources and ensure data accuracy, quality, and
consistency. System execution is boosted by differentiating the process of analytics from
traditional databases. In a data warehouse, data is sorted into a formatted pattern by type and as
needed. The data is examined by query tools using several patterns.
Data warehouses store historical data and handle requests faster, helping in online analytical
processing, whereas a database is used to store current transactions in a business process that
is called online transaction processing.
Subject Oriented:
It provides you with important data about a specific subject like suppliers, products, promotion,
customers, etc. Data warehousing usually handles the analysis and modeling of data that assist
any organization to make data-driven decisions.
Integrated:
Different heterogeneous sources, such as flat files or relational databases, are put together to build a data warehouse.
Time-Variant:
The data collected in a data warehouse is identified with a specific period.
Nonvolatile:
This means the earlier data is not deleted when new data is added to the data warehouse. The
operational database and data warehouse are kept separate and thus continuous changes in the
operational database are not shown in the data warehouse.
Data warehouses help analysts or senior executives analyze, organize, and use data for decision making. Sectors that commonly use data warehouses include:
Consumer goods
Banking services
Financial services
Manufacturing
Retail sectors
DATA MINING:
In this process, data is extracted and analyzed to fetch useful information. In data mining hidden
patterns are researched from the dataset to predict future behavior. Data mining is used to
indicate and discover relationships through the data.
Data mining uses statistics, artificial intelligence, machine learning systems, and some databases
to find hidden patterns in the data. It supports business-related queries that are time-consuming
to resolve.
FEATURES OF DATA MINING:
Fraud Detection:
It is used to find which insurance claims, phone calls, debit or credit purchases are fraud.
Trend Analysis:
Existing marketplace trends are analyzed, which provides a strategic benefit as it helps in reducing costs, for example by manufacturing as per demand.
Market Analysis:
It can predict the market and therefore help to make business decisions. For example: it can
identify a target market for a retailer, or certain types of products desired by types of customers.
Classification:
It is used to fetch the appropriate information from the dataset and to segregate different classes
that are present in the dataset. Below are the classification models.
1. K-nearest neighbors
2. Support Vector Machine
3. Gaussian Naïve Bayes, etc.
Clustering:
It is used to find similarities in data by putting related data together and helping to identify
different variations in the dataset. It helps to find hidden patterns. An example of clustering is text
mining, medical diagnostics, etc.
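A small scikit-learn sketch of clustering on made-up two-dimensional points:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical points forming two loose groups
points = np.array([[1, 2], [1, 4], [2, 3],
                   [8, 8], [9, 10], [10, 9]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(kmeans.labels_)           # cluster assignment of each point
print(kmeans.cluster_centers_)  # centre of each discovered group
```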
Association Rules:
They are used to identify a connection of two or more items. For example, if-then scenarios of
items that are frequently purchased in tandem in a grocery store can calculate the proportion of
items that are bought by customers together. Lift, confidence, and support are techniques used in
association rules.
Outlier Detection:
It is used to identify patterns that do not match the normal behavior in the data, as the outlier
deviates from the rest of the data points. It helps in fraud detection, intrusion, etc. Boxplot and z-
score are ways to detect outliers.
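A short NumPy sketch of z-score based outlier detection on made-up values:

```python
import numpy as np

values = np.array([10, 12, 11, 13, 12, 11, 95])   # 95 is an obvious outlier

z_scores = (values - values.mean()) / values.std()
outliers = values[np.abs(z_scores) > 2]            # flag points more than 2 std devs away
print(z_scores.round(2))
print(outliers)                                    # -> [95]
```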
6. Steps Involved in Data Science Modelling.
Step 2: Data Extraction
The next step in Data Science Modelling is Data Extraction. Not just any Data, but the
Unstructured Data pieces you collect, relevant to the business problem you’re trying to
address. Data Extraction is done from various sources such as online sources, surveys, and existing Databases.
Step 3: Data Cleaning
Data Cleaning is useful as you need to sanitize Data while gathering it and remove the typical causes of Data inconsistencies and errors.
Step 4: Exploratory Data Analysis
Exploratory Data Analysis (EDA) is a robust technique for familiarising yourself with
Data and extracting useful insights. Data Scientists sift through Unstructured Data to
find patterns and infer relationships between Data elements. Data Scientists use
Statistics and Visualisation tools to summarise Central Measurements and variability
to perform EDA.
Step 5: Feature Selection
Feature Selection is the process of identifying and selecting the features that
contribute the most to the prediction variable or output that you are interested in, either
automatically or manually.
The presence of irrelevant characteristics in your Data can reduce the Model accuracy and cause your Model to train on irrelevant features; conversely, if the selected features are strong enough, the Machine Learning Algorithm will give good outcomes. Both relevant and irrelevant characteristics therefore have to be identified and addressed.
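A minimal scikit-learn sketch of automatic feature selection on made-up data (SelectKBest keeps the features most related to the target):

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))            # hypothetical dataset with 6 candidate features
y = (X[:, 0] + X[:, 3] > 0).astype(int)  # target actually depends on features 0 and 3 only

selector = SelectKBest(score_func=f_classif, k=2).fit(X, y)
print(selector.get_support(indices=True))   # indices of the selected (most relevant) features
```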
Step 6: Machine Learning Algorithms
This is one of the most crucial processes in Data Science Modelling, as the Machine Learning Algorithm aids in creating a usable Data Model. There are a lot of algorithms to pick from, and the Model is selected based on the problem. There are three types of
Machine Learning methods that are incorporated:
1) Supervised Learning
It is based on the results of a previous operation that is related to the existing business operation.
Based on previous patterns, Supervised Learning aids in the prediction of an outcome. Some of
the Supervised Learning Algorithms are:
Linear Regression
Random Forest
Support Vector Machines
2) Unsupervised Learning
It works on unlabeled data, with no mappings from input to output, and is used to discover hidden patterns or groupings in large volumes of data. Commonly used Unsupervised Learning Algorithms include K-means clustering and the Apriori algorithm.
3) Reinforcement Learning
It is a fascinating Machine Learning technique that uses a dynamic Dataset that interacts with the
real world. In simple terms, it is a mechanism by which a system learns from its mistakes and
improves over time. Some of the Reinforcement Learning Algorithms are:
Q-Learning
State-Action-Reward-State-Action (SARSA)
Deep Q Network
Step 7: Testing the Model
This is the next phase, and it’s crucial to check that our Data Science Modelling efforts
meet the expectations. The Data Model is applied to the Test Data to check if it’s
accurate and houses all desirable features. You can further test your Data Model to
identify any adjustments that might be required to enhance the performance and
achieve the desired results. If the required precision is not achieved, you can go back
to Step 6 (Machine Learning Algorithms), choose an alternate Data Model, and then
test the model again.
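A compact scikit-learn sketch of this train/test workflow on made-up data:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))            # hypothetical features
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # hypothetical label

# Hold out test data before training, then check accuracy on the held-out set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
print(accuracy_score(y_test, model.predict(X_test)))
```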
Step 8: Deploying the Model
The Model which provides the best result based on test findings is completed and
deployed in the production environment whenever the desired result is achieved
through proper testing as per the business needs. This concludes the process of Data
Science Modelling.
Every industry benefits from the experience of Data Science companies, but the most
common areas where Data Science techniques are employed are the following:
Banking and Finance: The banking industry can benefit from Data Science in
many aspects. Fraud Detection is a well-known application in this field that
assists banks in reducing non-performing assets.
Healthcare: Health concerns are being monitored and prevented using
Wearable Data. The Data acquired from the body can be used in the medical
field to prevent future calamities.
Marketing: Marketing offers a lot of potential, such as a more effective price
strategy. Pricing based on Data Science can help companies like Uber and E-Commerce businesses enhance their profits.
Government Policies: Based on Data gathered through surveys and other official sources, the government can use Data Science to better build policies that cater to the interests and wishes of the people.