Data Management and Data Transformation, Introduction To Machine Learning

By
Dr. Amod Kumar Tiwari
Asso. Professor, CSED
Rajkiya Engineering College, Sonbhadra
Outline
1. Preface
2. Definition
3. Introduction to Machine Learning (ML)
4. Need for ML
5. Types of Learning in ML
6. Applications of ML
7. Limitations of ML
8. ML and Data Management
9. Data Transformation in ML
Preface
DATA, DATA EVERYWHERE…
 Widespread use of personal computers and wireless communication
leads to “big data”.
 We are both producers and consumers of data.
 Data is not random, it has structure, e.g., customer behavior. We
need “big theory” to extract that structure from data for:
 Understanding the process
 Making predictions for the future
 Storing and processing such a huge volume of data is a major challenge.
 Extracting meaningful insight from the data pile is even more challenging.
 Extracted information is of high significance and aids decision
making.
 But is the data always valuable?
Definition
What is data?
Data is a collection of raw facts and figures that have no meaning on their
own but, when processed, lead to meaningful information.
Data Can Toil/Spoil…
Introduction To Machine Learning (ML)
Cont.
Five essential prerequisites for studying machine learning:
 Statistics Knowledge: Probability, Basic and Inferential Statistics
 Mathematical foundation: Linear Algebra and Calculus
 Programming Languages: Preferably Python (pandas, NumPy,
Matplotlib)
 Domain Knowledge: Related to the problem
 Common Sense – which isn’t common
Cont.
 Machine Learning: A systematic way of “learning” from “data” or
“past experience” by a machine (computers, smartphones,
robots, etc.)
 Learning: Making intelligent predictions or decisions based on data by
optimizing a model
 There is no need to “learn” to calculate payroll, because the rule is already known
 Learning is used when:
 Human expertise does not exist (navigating on Mars)
 Humans are unable to explain their expertise (speech
recognition)
 The solution changes over time (routing on a computer network)
 The solution needs to be adapted to particular cases (user biometrics)
Cont.
Standard Definition of Machine Learning
Need For ML
For tasks that are easily performed by humans but are complex for
computer systems to emulate (so that machines can take over these
tasks from humans), for example:
 Vision: Identify faces in a photograph, objects in a video or still
image, etc.
 Natural Language Processing: Translate a sentence from Hindi to
English, answer questions, identify the sentiment of a text, etc.
 Speech Recognition: Recognize spoken words and speak sentences
naturally.
 Game playing: Play games like chess, Go, Dota.
 Robotics: Walking, jumping, displaying emotions, driverless cars, etc.
Cont.
For tasks that are beyond human capabilities
 E.g. IBM Watson’s Jeopardy-playing machine.
Analysis of large and complex datasets
 E.g. analyzing social media data
Fields where there are very few (almost no) human experts
 Industrial/manufacturing control
 Testing and Quality Assurance
 Mass spectrometer analysis
 Drug design
 Astronomical discovery
Cont.
Beneficial when scenarios are highly volatile or rapidly changing
 Credit scoring
 Financial modeling
 Fraud detection
 Diagnosis
Types of Learning in ML
Cont.
Supervised Learning
 Supervised machine learning is a branch of ML that builds models
from a set of examples containing “known input – known output” pairs.
 In this case, we teach or train the machine using data that are
correctly labeled.
 The most widespread supervised algorithms are:
decision trees; support vector machines;
Bayesian classifier; linear discriminant analysis;
k-nearest neighbor; linear regression;
logistic regression; neural networks.
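As a minimal sketch of learning from “known input – known output” pairs, the block below implements k-nearest neighbor, one of the algorithms listed above, using only the Python standard library. The training data, the labels and the function name knn_predict are made up for illustration and do not come from the slides.

```python
from collections import Counter
import math

# Hypothetical labelled training set: (height_cm, weight_kg) -> label.
training_data = [
    ((150.0, 50.0), "small"),
    ((155.0, 55.0), "small"),
    ((160.0, 58.0), "small"),
    ((175.0, 80.0), "large"),
    ((180.0, 85.0), "large"),
    ((185.0, 90.0), "large"),
]

def knn_predict(query, examples, k=3):
    """Return the majority label among the k training examples nearest to query."""
    by_distance = sorted(examples, key=lambda pair: math.dist(query, pair[0]))
    nearest_labels = [label for _, label in by_distance[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Predict labels for two inputs the model has not seen before.
prediction_small = knn_predict((152.0, 52.0), training_data)
prediction_large = knn_predict((182.0, 88.0), training_data)
print(prediction_small, prediction_large)
```

The labels are what make this supervised: without the "small"/"large" column there would be nothing for the majority vote to return.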
Cont.
Unsupervised Learning
 Unlike supervised learning, this ML type does not need labels or
corresponding outputs to be provided. Instead, unsupervised
learning uses unlabeled input data and determines the structure of
the set.
 Unsupervised learning is typically used for clustering, anomaly
detection, association mining, and dimensionality reduction.
 Frequently used unsupervised algorithms are:
k-means clustering; association rules;
Principal Component Analysis; t-Distributed Stochastic
Neighbor Embedding.
[Figure: Usage scenarios of unsupervised learning algorithms]
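To illustrate how structure can be found without labels, here is a small sketch of k-means clustering in one dimension. The spending figures, the initial centroid guesses and the function name k_means_1d are assumptions chosen for the example, not part of the original material.

```python
def k_means_1d(points, centroids, iterations=10):
    """Lloyd's algorithm in one dimension: assign each point to its nearest
    centroid, then move each centroid to the mean of its assigned points."""
    centroids = list(centroids)
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Keep the old centroid if a cluster ends up empty.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

# Hypothetical unlabelled data: monthly spend of six customers.
spend = [10.0, 12.0, 11.0, 50.0, 52.0, 51.0]
final_centroids, final_clusters = k_means_1d(spend, centroids=[0.0, 60.0])
print(final_centroids, final_clusters)
```

No customer was ever tagged as a low or high spender; the two groups emerge from the distances alone.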
Cont.
Reinforcement Learning
 Reinforcement learning is a type of ML which lets
software agents and machines automatically identify the suitable
behavior within a particular situation, to increase their performance.
It also provides a way to overcome the limitations of deep learning
to address a multi-step problem.
 The focus of reinforcement learning is on regimented learning
processes in which the machine learning algorithm is provided with a
set of actions, parameters, and final values.
 It learns from past experiences and changes its approach in
response to a new situation, trying to achieve the best possible
outcome.
Cont.
 The machine’s goal is expressed in the form of a special signal
called a reward. This signal is granted to the machine each
time it completes a task correctly. By automating the
calculation of rewards, you can allow the machine to learn on its
own.
 The most popular reinforcement learning algorithms include:
Q-Learning; Temporal Difference; Monte Carlo Tree
Search; Asynchronous Advantage Actor-Critic (A3C)
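The reward idea can be sketched with the Q-Learning update rule on a toy problem. Everything in this block is an assumption made for illustration: a four-state corridor, a reward of 1 for reaching the last state, and the constants ALPHA (learning rate) and GAMMA (discount). To keep the run deterministic, the sketch sweeps over every state-action pair instead of letting an agent explore at random.

```python
N_STATES = 4          # states 0..3; state 3 is the goal (terminal)
ACTIONS = (0, 1)      # 0 = move left, 1 = move right
ALPHA, GAMMA = 0.5, 0.9

def step(state, action):
    """Deterministic toy environment: return (next_state, reward)."""
    next_state = max(0, state - 1) if action == 0 else state + 1
    reward = 1.0 if next_state == N_STATES - 1 else 0.0
    return next_state, reward

# Q-table: estimated long-term value of each action in each state.
q = [[0.0, 0.0] for _ in range(N_STATES)]
for _ in range(200):
    for state in range(N_STATES - 1):       # the terminal state is never updated
        for action in ACTIONS:
            next_state, reward = step(state, action)
            # Q-learning target: immediate reward plus discounted best next value.
            target = reward + GAMMA * max(q[next_state])
            q[state][action] += ALPHA * (target - q[state][action])

# Greedy policy: in each non-terminal state pick the action with the highest value.
policy = [max(ACTIONS, key=lambda a: q[s][a]) for s in range(N_STATES - 1)]
print(policy)
```

The learned policy moves right in every state, because that is the only direction in which the reward signal is ever granted.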
Applications of ML
 Image recognition: Identifying objects, persons, places, etc. in digital
images. Popular use cases of image recognition and face
detection include automatic friend-tagging suggestions by Facebook,
geotagging by Google, and biometrics.
 Speech recognition: The process of converting voice instructions into
text, e.g. speech-to-text, voice recognition, Google’s Voice
Search, and voice-based assistants such as Siri, Cortana, and Alexa.
 Product recommendations: Understanding user
interest using various machine learning algorithms and suggesting
products accordingly, e.g. Google recommendations,
YouTube video recommendations, and food recommendations in apps.
Cont.
 Self-driving cars: Automating driving with computers,
e.g. cars by Tesla, which uses an unsupervised
learning method to train the car models for object (people, vehicles
or any obstacle) detection, navigation, etc. to facilitate smooth
driving.
 Transportation and commuting: Apps provide an experience customized
to you. They automatically detect your
location and offer options to go home, to the office or to any
other frequent place based on your history and patterns, e.g.
Uber/Ola.
 Stock data prediction: Predicting the closing price of a stock using
time series models and neural networks.
Cont.
 Medical diagnosis: ML is used for disease identification, and for
classification and prediction of cancers and tumors using image
processing and numerical data analysis, e.g. 3D models that can
predict the exact position of lesions in the brain, classification of a
disease as lethal or non-lethal, and prediction of the recurrence of
cancer.
 Automatic language translation: Converts text from an unknown
language into a known one, e.g. Google's GNMT (Google Neural
Machine Translation).
 Basket analysis: Identifying frequently bought items and
rearranging the shelves to increase sales in the supermarket.
 Data analytics: Analyzing data to facilitate decision making,
e.g. sentiment analysis, business analytics, medical analytics, etc.
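As a rough sketch of the basket analysis application listed above, the block below counts which pairs of items appear together most often. The baskets are invented for the example, and real association mining would also compute measures such as support and confidence.

```python
from collections import Counter
from itertools import combinations

# Hypothetical shopping baskets, one set of items per transaction.
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "eggs"},
    {"bread", "butter", "eggs"},
]

pair_counts = Counter()
for basket in baskets:
    # Sort so that each pair is always counted under the same key.
    pair_counts.update(combinations(sorted(basket), 2))

# The most frequently co-purchased pair and how many baskets contain it.
top_pair, top_count = pair_counts.most_common(1)[0]
print(top_pair, top_count)
```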
Limitations of ML
 Limitation 1 (Ethics): If my self-driving car kills someone on
the road, whose fault is it?
 Limitation 2 (Deterministic problems): Machine learning is
stochastic, not deterministic.
 Limitation 3 (Data): A lack of data, or a lack of good data, leads to
wrong results.
 Limitation 4 (Misapplication): People blindly use
machine learning to solve statistical problems and statistical
techniques to solve machine learning problems. It should be noted
that statistical modeling is inherently confirmatory, and machine
learning is inherently exploratory.
 Limitation 5 (Interpretability): ML methods lack interpretability,
despite their apparent success, especially in the fields
of genomics, proteomics, metabolomics, etc.
AI/ML and Data Management
 Making decisions based on data is nothing new, but as companies
pursue the goal of becoming more insight-driven, it has become
clear that there is a need to adopt new technologies and methods
that facilitate data-centric decision-making at the heart of the
business. At the forefront here is artificial intelligence, or AI,
which includes machine learning (ML) and deep learning (DL).
 AI has the potential to transform nearly all aspects of life,
including how people work, study, travel, govern and pursue
leisure activities. But in order to take full advantage of everything
that AI has to offer, enterprises must also embed AI at the data
level, ensuring that AI enables the full scope of the data
management lifecycle, from ingestion to curation and discovery,
as well as driving applications that are built on that data.
Cont.
 When it comes to data and the systems that manage it, enterprises
are challenged in terms of increasing operational efficiencies and
providing greater data access to a variety of data consumers.
Enterprises need data management systems that run efficiently at
high performance and are capable of producing accurate results, and
they also need the data to be accessible to data scientists
for building AI-enabled applications.
 Data management systems and AI are synergistic. When AI
becomes embedded within and throughout the data management
system, it has the potential to improve database query accuracy
and performance, and to optimize system resources. Further, as
the underlying data platforms evolve to better support AI
initiatives – for example, by providing direct support for the use
of Python, Go, JSON and Jupyter notebooks –
Cont.
 Data from 451 Research’s latest Voice of the Enterprise: Data
Platforms and Analytics survey reveals the extent to which
enterprises see AI and ML as critical aspects of their data
platform and analytics initiatives. Two-thirds of all respondents
agree that AI and ML are an important component of their data
platform and analytics initiatives, but this figure increases to 88%
among the most data-driven companies (i.e., those at which
nearly all strategic decisions are data-driven).
Cont.
Business Impact
Improve Operational Efficiencies: Enterprises often struggle to
ensure that database systems are running efficiently. Queries that
overload the system, consume excessive resources or impact other
running jobs not only impact performance but also require manual
resources to rectify. ML can help by automating the management of
queries based on their likely resource consumption, providing a
more stable and reliable system that can prioritize queries, reducing
manual governance and monitoring of the database.
Improve Query Performance And Accuracy: ML-enabled database
querying can have a dramatic impact on increasing the overall
accuracy of or confidence in the query result. By executing queries
in a more efficient manner, enterprises can lower the time taken to
generate insight and improve business decisions.
Cont.
Empower business analysts: One of the primary challenges when
doing analytics has been to ‘democratize’ the technology to enable a
broader range of people to be able to make analytics-driven
decisions. Accelerating the development of AI-based applications
can enable the output of machine learning models to be placed in the
hands of domain experts and business decision-makers.
Accelerate data scientist productivity: 451 Research survey results
indicate that accessing and preparing data is one of the three most
significant barriers to ML adoption. An AI-enabled database can
help overcome this barrier to insight by accelerating data exploration
and lowering development times through the integration of developer
tools and frameworks.
Cont.
The automation of database admin tasks is set to change the role of
the DBA: Through the automation of mundane database
administration tasks such as database provisioning and performance
tuning, DBAs can focus their time on higher-impact tasks such as
architecture planning and data security.
Data Transformation in ML
 Data transformation is the technique of converting data from one
format to another. It can be divided into the
following steps, each of which is applied depending on the
complexity of the transformation.
 Data discovery: An exploratory step which
involves profiling the data using data profiling tools or
sometimes manual scripts. The goal of this step is to
understand the structure and characteristics of the data.
 Data mapping: The process of defining how
individual fields are mapped, modified, joined, filtered,
aggregated, etc. to produce the final desired output.
 Data transformation code: The process of generating
code (e.g., SQL, Python, R) which will transform the data
based on the data mapping rules.
 Code implementation: The process in which the
generated code is executed against the data to create the
desired output.
 Review of data: This step ensures that the output data
meets the transformation requirements. It is mostly
carried out by the business or end user.
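The five steps above can be sketched end to end on a tiny dataset. The records, field names and cleaning rules below are hypothetical and only meant to show how the steps relate; a real pipeline would typically use profiling tools and SQL or a dataframe library.

```python
# Hypothetical raw records, as they might arrive from a source system.
raw_rows = [
    {"name": " Asha ", "city": "delhi", "amount": "120.50"},
    {"name": "Ravi", "city": "Mumbai", "amount": "80"},
    {"name": "Meena", "city": "DELHI", "amount": "n/a"},
]

# 1. Data discovery: profile each field (here, just count distinct raw values).
profile = {field: len({row[field] for row in raw_rows}) for field in raw_rows[0]}

# 2. Data mapping: declare how each output field is derived from an input field.
def parse_amount(text):
    try:
        return float(text)
    except ValueError:
        return None  # unparseable amounts are flagged for the filter below

mapping = {
    "customer": lambda row: row["name"].strip(),
    "city": lambda row: row["city"].title(),
    "amount": lambda row: parse_amount(row["amount"]),
}

# 3 and 4. Transformation code and its execution: apply the mapping, drop bad rows.
def transform(rows):
    mapped = [{out: rule(row) for out, rule in mapping.items()} for row in rows]
    return [row for row in mapped if row["amount"] is not None]

clean_rows = transform(raw_rows)

# 5. Review: check that the output meets the stated requirements.
review_passed = all(isinstance(row["amount"], float) for row in clean_rows)
print(profile, clean_rows, review_passed)
```

Discovery reports three distinct city strings, although only two cities are present; that finding is what motivates the title-casing rule in the mapping.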
 Now that we have seen different steps involved in Data
Transformation, let’s get into some more details and see how to
transform the data into a machine-learning-digestible format. All
machine learning algorithms are based on mathematics. So, we
need to convert all the columns into numerical format. Before
that, let’s see all the different types of data we have.
 Taking a broader perspective, data is classified into numerical and
categorical data:
1. Numerical: As the name suggests, this is numeric data that is
quantifiable.
2. Categorical: String or other non-numeric data that is
qualitative in nature.
 Numerical data is further divided into the following:
i. Discrete: Any numerical data that is
countable is called discrete, for example, the number of people
in a family or the number of students in a class. Discrete data
can only take certain values (such as 1, 2, 3, 4, etc.).
ii. Continuous: Any numerical data that is measurable is called
continuous, for example, the height of a person or the time
taken to reach a destination. Continuous data can take virtually
any value (for example, 1.25, 3.8888, and 77.1276).
 Categorical data is further divided into the following:
i. Ordered: Any categorical data that has some order associated
with it is called ordered (ordinal) categorical data, for example, movie ratings
(excellent, good, bad, worst) and feedback (happy, not bad, bad).
You can think of ordered data as being something you could mark on
a scale.
ii. Nominal: Any categorical data that has no order is called nominal
categorical data. Examples include gender and country.
 From these different types of data, we will focus on categorical
data.
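The four data types described above can be told apart with a simple heuristic, sketched below. The function classify_column and its rules are assumptions for illustration: it treats all-integer columns as discrete, other numeric columns as continuous, and needs to be told the order of levels before it can call a categorical column ordered.

```python
def classify_column(values, ordered_levels=None):
    """Classify a column as discrete, continuous, ordered or nominal."""
    if all(isinstance(v, int) and not isinstance(v, bool) for v in values):
        return "numerical/discrete"
    if all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in values):
        return "numerical/continuous"
    # Order is not visible in the strings themselves, so it must be supplied.
    if ordered_levels is not None and set(values) <= set(ordered_levels):
        return "categorical/ordered"
    return "categorical/nominal"

family_size = [3, 4, 2, 5]
height_m = [1.62, 1.75, 1.80]
rating = ["good", "excellent", "bad"]
country = ["India", "Japan", "Brazil"]

kinds = [
    classify_column(family_size),
    classify_column(height_m),
    classify_column(rating, ordered_levels=["worst", "bad", "good", "excellent"]),
    classify_column(country),
]
print(kinds)
```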
Handling Categorical Data
 There are some algorithms that can work well with categorical
data, such as decision trees.
 But most machine learning algorithms cannot operate directly
with categorical data. These algorithms require the input and
output both to be in numerical form.
 If the output to be predicted is categorical, then after prediction
we convert them back to categorical data from numerical data.
 Let’s discuss some key challenges that we face while dealing with
categorical data:
i. High cardinality: Cardinality is the number of unique values in a
column. A high-cardinality column has a lot of different values. A good
example is User ID: in a table of 500 different users, the User ID
column would have 500 unique values.
ii. Rare occurrences: These data columns might have categories that
occur very rarely and therefore would not be significant enough to
have an impact on the model.
iii. Frequent occurrences: There might be a category in the data
column that occurs many times with very low variance, which
would fail to make an impact on the model.
iv. Won’t fit: This categorical data, left unprocessed, won’t fit our
model.
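The challenges listed above (high cardinality, rare occurrences and frequent occurrences) can be checked before any encoding is attempted. The sketch below is one possible way to do that; the function name profile_categories and the 5% and 95% thresholds are arbitrary assumptions, not values taken from the slides.

```python
from collections import Counter

def profile_categories(values, rare_share=0.05, dominant_share=0.95):
    """Summarise one categorical column: cardinality, rare and dominant categories."""
    counts = Counter(values)
    total = len(values)
    return {
        "cardinality": len(counts),
        "rare": sorted(c for c, n in counts.items() if n / total < rare_share),
        "dominant": sorted(c for c, n in counts.items() if n / total > dominant_share),
    }

user_ids = [f"U{i:03d}" for i in range(500)]          # every value is unique
payment = ["card"] * 97 + ["cash"] * 2 + ["cheque"]   # one category dominates

user_id_profile = profile_categories(user_ids)
payment_profile = profile_categories(payment)
print(user_id_profile["cardinality"], payment_profile)
```

A column like user_ids, where the cardinality equals the number of rows, carries no repeatable pattern for a model to learn from.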
1. Encoding: To address the problems associated with categorical
data, we can use encoding. This is the process by which we convert
a categorical variable into a numerical form. Here, we will look at
three simple methods of encoding categorical data.
2. Replacing: This is a technique in which we replace the
categorical data with a number. This is a simple replacement and
does not involve much logical processing. Let’s look at an exercise
to get a better idea of this.
Handling Categorical Data — Method 1 : Replacing
Replacement of Categorical data with a Number:

Using inplace=True
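The original exercise is not reproduced in this text, so the following is only a dependency-free sketch of the replacing technique, with made-up records and a made-up code table. It overwrites the categorical values in place and also shows the reverse mapping used to turn numeric predictions back into categories.

```python
# Hypothetical records with one categorical field, "size".
rows = [
    {"item": "shirt", "size": "small"},
    {"item": "coat", "size": "large"},
    {"item": "cap", "size": "medium"},
    {"item": "scarf", "size": "small"},
]

# Replacement table: each category is assigned a number.
size_codes = {"small": 1, "medium": 2, "large": 3}

def replace_in_place(records, column, codes):
    """Overwrite the categorical values in `column` with their numeric codes."""
    for record in records:
        record[column] = codes[record[column]]

replace_in_place(rows, "size", size_codes)
encoded_sizes = [row["size"] for row in rows]

# Decoding, e.g. to turn numeric predictions back into category names.
code_to_size = {code: name for name, code in size_codes.items()}
decoded_sizes = [code_to_size[code] for code in encoded_sizes]
print(encoded_sizes, decoded_sizes)
```

In pandas, DataFrame.replace accepts a similar mapping, and passing inplace=True modifies the existing frame instead of returning a new one.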

Thank You!
