16CS63: Machine Learning
Course Content
• Descriptive Methods
• Find human-interpretable patterns that describe the data.
From [Fayyad et al.] Advances in Knowledge Discovery and Data Mining, 1996
[Figure: the decision-support pyramid, with increasing potential to support business decisions toward the top. From bottom to top: Data Sources (paper, files, information providers, database systems, OLTP); Data Warehouses / Data Marts (OLAP, managed by the DBA); Data Exploration (statistical analysis, querying and reporting over task-relevant data, prepared via data cleaning and data integration); Making Decisions (end user).]
3/3/2021 Dept of CSE, FET, Jain 23
Data Mining: On What Kinds of Data?
• Relational database
• Advanced database and information repository
• Object-relational database
• Spatial and temporal data
• Time-series data
• Stream data
• Multimedia database
• Heterogeneous and legacy database
• Text databases & WWW
• OLAP server architectures: MOLAP, ROLAP, HOLAP
• Four perspectives:
• Business Perspective:
• characterises the procedures and standards by which the business works on a
day-to-day basis.
• Application Perspective:
• characterises the interactions among the procedures and standards utilised by the
organisation.
• Information Perspective:
• characterises and groups the raw information, such as record documents,
databases, pictures, presentations and spreadsheets that a company requires to
operate efficiently.
• Technology Perspective:
• characterises the hardware, operating systems, software and
networking solutions utilised by an organisation.
Definition
Machine learning is an application of artificial intelligence (AI) that provides systems
the ability to automatically learn and improve from experience without being
explicitly programmed.
• 1. Regression: Regression algorithms are mostly used to make predictions on numbers, i.e.,
when the output is a real or continuous value.
• Algorithms under Regression:
a. Simple Linear Regression Model: It is a statistical method that analyses the relationship
between two quantitative variables. This technique is mostly used in financial fields, real estate, etc.
b. Logistic Regression: It is used in cases such as fraud detection and clinical trials, wherever the
output is binary.
c. Support Vector Regression: SVR is a bit different from SVM. In simple regression, the aim is to
minimize the error, while in SVR, we adjust the error within a threshold.
d. Multivariate Regression Algorithm: This technique is used in the case of multiple predictor
variables. It can be implemented with matrix operations and Python’s NumPy library.
e. Multiple Regression Algorithm: It works with multiple quantitative variables in both linear and
non-linear regression algorithms.
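As a minimal sketch of the Simple Linear Regression model above, the closed-form least-squares fit can be computed directly; the data points here are made-up values for illustration:

```python
# Simple linear regression via the closed-form least-squares solution.

def fit_simple_linear(xs, ys):
    """Return (slope, intercept) minimizing the sum of squared errors."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance of x and y, and variance of x (unnormalized).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    intercept = mean_y - slope * mean_x
    return slope, intercept

xs = [1, 2, 3, 4, 5]
ys = [2.1, 4.0, 6.2, 8.1, 9.9]   # roughly y = 2x
slope, intercept = fit_simple_linear(xs, ys)
```

For this data the fitted slope comes out close to 2 and the intercept close to 0, matching the underlying relationship.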
• 3. Clustering: Clustering is a Machine Learning technique that involves grouping unlabeled data
points into specific clusters. If we have some objects or data points, then we can apply the clustering
algorithm(s) to analyze and group them as per their properties and features.
• Clustering methods:
• Density-based methods: In this method, clusters are dense regions of data points separated from other
clusters by regions of lower density.
• Hierarchical methods: The clusters formed in this method are the tree-like structures. This method forms trees
or clusters from the previous cluster. There are two types of hierarchical methods: Agglomerative (Bottom-up
approach) and Divisive (Top-down approach).
• Partitioning methods: This method partitions the objects into k clusters, and each partition forms a single
cluster.
• Grid based methods: In this method, data are combined into a number of cells that form a grid-like structure.
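The partitioning methods above can be sketched with a minimal k-means loop; the 1-D points and initial centers are made-up values for illustration:

```python
# Minimal k-means (a partitioning method): alternate assignment
# and center-update steps for a fixed number of iterations.

def kmeans(points, centers, iters=10):
    for _ in range(iters):
        # Assignment step: each point joins its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            j = min(range(len(centers)),
                    key=lambda j: (p - centers[j]) ** 2)
            clusters[j].append(p)
        # Update step: move each center to the mean of its cluster.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

points = [1.0, 1.2, 0.8, 9.0, 9.5, 10.0]
centers, clusters = kmeans(points, centers=[0.0, 5.0])
```

With these points the centers converge to roughly 1.0 and 9.5, i.e. the two visible groups.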
• 4. Association Analysis: Association rule mining finds interesting associations and relationships among
large sets of data items. A rule shows how frequently an itemset occurs in a transaction. A typical
example is Market Basket Analysis.
• Market Basket Analysis is one of the key techniques used by large retailers to show associations between
items. It allows retailers to identify relationships between the items that people buy together frequently.
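A rule's frequency is measured by its support and confidence; the sketch below computes both for one rule over invented market-basket transactions:

```python
# Support and confidence for one association rule over a tiny,
# made-up set of market-basket transactions.

transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk"},
]

def support(itemset):
    """Fraction of transactions containing every item in itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

# Rule: {bread} -> {milk}
sup = support({"bread", "milk"})   # itemset appears in 2 of 4 baskets
conf = sup / support({"bread"})    # of baskets with bread, share with milk
```

Here the rule {bread} → {milk} has support 0.5 and confidence 2/3: half of all baskets contain both items, and two of the three bread-buying baskets also contain milk.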
There are different ways an algorithm can model a problem based on its interaction
with the experience or environment, or whatever we choose to call the input data:
• 1. Supervised Learning
• 2. Unsupervised Learning
• 3. Reinforcement Learning
• Marketing : It can be used to characterize & discover customer segments for marketing purposes.
• Biology : It can be used for classification among different species of plants and animals.
• Libraries : It is used in clustering different books on the basis of topics and information.
• Insurance : It is used to analyse customers and their policies and to identify fraud.
What is Big Data Analytics?
Analyzing large volumes of data, or big data.
What Big Data Analytics isn’t?
• There are a variety of benefits to dimensionality reduction. A key benefit is that many data mining
algorithms work better if the dimensionality (the number of attributes in the data) is lower. This is partly
because dimensionality reduction can eliminate irrelevant features and reduce noise, and partly because of the
curse of dimensionality.
• Another benefit is that a reduction of dimensionality can lead to a more understandable model because the
model may involve fewer attributes. Also, dimensionality reduction may allow the data to be more easily
visualized.
• Even if dimensionality reduction doesn't reduce the data to two or three dimensions, data is often visualized
by looking at pairs or triplets of attributes, and the number of such combinations is greatly reduced. Finally,
the amount of time and memory required by the data mining algorithm is reduced with a reduction in
dimensionality.
• The term dimensionality reduction is often reserved for those techniques that reduce the dimensionality of a
data set by creating new attributes that are a combination of the old attributes. The reduction of
dimensionality by selecting new attributes that are a subset of the old is known as feature subset selection or
feature selection.
• Purpose:
• Avoid curse of dimensionality
• Reduce amount of time and memory required by data mining algorithms
• Allow data to be more easily visualized
• May help to eliminate irrelevant features or reduce noise
• Techniques
• Principal Component Analysis
• Others: supervised and non-linear techniques
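Principal Component Analysis can be sketched as an eigendecomposition of the covariance matrix; the data here is randomly generated with most of its variance along the first attribute, so the reduction from 2 attributes to 1 keeps nearly all the information:

```python
import numpy as np

# Minimal PCA sketch: center the data, eigendecompose the covariance
# matrix, and project onto the leading principal component.

rng = np.random.default_rng(0)
# Made-up 2-D data: column 0 has ~100x the variance of column 1.
X = rng.normal(size=(100, 2)) @ np.array([[3.0, 0.0], [0.0, 0.3]])

Xc = X - X.mean(axis=0)                 # center each attribute
cov = np.cov(Xc, rowvar=False)          # 2x2 covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
order = np.argsort(eigvals)[::-1]       # reorder to descending variance
components = eigvecs[:, order]

# Project onto the first principal component: 2 attributes -> 1.
X_reduced = Xc @ components[:, :1]
```

The new attribute is a linear combination of the old ones, which is exactly the "creating new attributes" sense of dimensionality reduction described above.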
The Curse of Dimensionality (Bellman, 1961)
Less is More
• Definitions of density and distance between points, which are critical for clustering
and outlier detection, become less meaningful as dimensionality increases.
• Redundant features
• duplicate much or all of the information contained in one or more other
attributes
• Example: purchase price of a product and the amount of sales tax paid
• Irrelevant features
• contain no information that is useful for the data mining task at hand
• Example: students' ID is often irrelevant to the task of predicting students'
GPA
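The price-and-sales-tax example above can be checked numerically: a redundant feature shows up as near-perfect correlation with another attribute. The prices below are made-up values, and a flat 8% tax rate is assumed:

```python
# Detecting a redundant feature via Pearson correlation, using the
# purchase-price vs. sales-tax example (made-up prices, flat 8% tax).

prices = [10.0, 25.0, 40.0, 55.0, 70.0]
taxes = [p * 0.08 for p in prices]   # tax is fully determined by price

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

r = pearson(prices, taxes)   # 1.0: tax duplicates price exactly
```

A correlation this close to 1 signals that one of the two attributes can be dropped without losing information.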
Feature Subset Selection
• Techniques:
• Brute-force approach: (next class)
• Try all possible feature subsets as input to data mining algorithm
• Embedded approaches:
• Feature selection occurs naturally as part of the data mining
algorithm
• Filter approaches:
• Features are selected before data mining algorithm is run
• Wrapper approaches:
• Use the data mining algorithm as a black box to find best subset of
attributes
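As an illustration of the filter approach above, a simple variance filter can drop near-constant features before any mining algorithm runs. The feature matrix and the 0.01 threshold are made-up values for illustration:

```python
# Filter approach sketch: rank features by variance and keep only
# those above a threshold, before running any mining algorithm.

data = [        # rows = samples, columns = features
    [1.0, 5.0, 0.0],
    [2.0, 5.0, 0.0],
    [3.0, 5.1, 0.0],
    [4.0, 4.9, 0.0],
]

def variance(col):
    m = sum(col) / len(col)
    return sum((v - m) ** 2 for v in col) / len(col)

n_features = len(data[0])
variances = [variance([row[j] for row in data]) for j in range(n_features)]
keep = [j for j, v in enumerate(variances) if v > 0.01]   # threshold
```

Here only feature 0 survives: feature 1 barely varies and feature 2 is constant, so a downstream algorithm never sees them. Unlike a wrapper approach, this selection is independent of the mining algorithm itself.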
• 2. Lack of Quality Data - While enhancing algorithms often consumes most of a developer's time in AI,
data quality is essential for the algorithms to function as intended. Noisy, dirty, and incomplete data are
common problems in Machine Learning; in practice there is often a lack of good data.