Analytics & Algorithms


A Primer on

Analytics & Algorithms


(with Select Ones)

Dr. D. V. Srinivas Kumar
School of Management Studies
E-mail: srinivasdaruri@gmail.com
MDP on Analytics
A process or set of rules to be followed in calculations or other problem-solving operations, especially by a computer.

In mathematics and computer science, an algorithm is a self-contained, step-by-step set of operations to be performed.

Algorithms perform calculation, data processing, and/or automated reasoning tasks.


A Few Select Ones

k-NN
Decision Trees
SVMs
MBA with Apriori Algorithm
k-NN

Assumption: birds of a feather flock together.

Similarity is measured using distance, for example Euclidean distance (sketched below).
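A minimal sketch of that distance in R, mirroring the style of the normalize() example further on; the function name dist_euclidean is only for illustration:

dist_euclidean <- function(p, q) {
  # Euclidean (straight-line) distance between two numeric feature vectors
  sqrt(sum((p - q)^2))
}

> dist_euclidean(c(1, 2), c(4, 6))
[1] 5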


How to Choose k?

Check what happens when:
• k = 1
• k = 3
• k = all the observations
The Way Out

Test several k values on a variety of test datasets and choose the one that delivers the best classification performance.

A less common, but interesting, solution to this problem is to choose a larger k, but apply a weighted voting process in which the vote of the closer neighbours is considered more authoritative than the vote of the far-away neighbours.

Many k-NN implementations offer this option.
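As an illustration (mine, not from the slides), a minimal sketch in R that tries several k values on held-out data and keeps the best performer; it assumes the class package and uses the built-in iris data. The distance-weighted voting mentioned above is offered, for example, by the kknn package.

# A sketch, assuming the class package is installed.
library(class)

set.seed(123)
idx   <- sample(nrow(iris), 100)      # 100 rows for training, the rest for testing
train <- iris[idx,  1:4]
test  <- iris[-idx, 1:4]
train_labels <- iris$Species[idx]
test_labels  <- iris$Species[-idx]

# Try several candidate k values and report test-set accuracy for each.
for (k in c(1, 3, 5, 11, 21)) {
  pred <- knn(train, test, cl = train_labels, k = k)
  cat("k =", k, " accuracy =", mean(pred == test_labels), "\n")
}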


The traditional method of rescaling features for k-NN is min-max normalization.

This process transforms a feature such that all of its values fall in a range between 0 and 1.

The formula for normalizing a feature is as follows:


normalize <- function(x) {
  # rescale a numeric vector so its values fall between 0 and 1
  return((x - min(x)) / (max(x) - min(x)))
}

> normalize(c(1, 2, 3, 4, 5))
[1] 0.00 0.25 0.50 0.75 1.00
> normalize(c(10, 20, 30, 40, 50))
[1] 0.00 0.25 0.50 0.75 1.00
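In practice the function is applied to each numeric column of the data set before running k-NN; a minimal sketch, using the built-in iris measurements as stand-in data:

# Apply min-max normalization to every numeric column (here, iris columns 1 to 4).
iris_n <- as.data.frame(lapply(iris[1:4], normalize))
summary(iris_n$Sepal.Length)   # each rescaled column now spans 0 to 1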
Decision Trees

An example task: predicting whether a potential movie would fall into one of three categories:
• Critical Success,
• Mainstream Hit, or
• Box Office Bust.
Divide and Conquer

Decision trees are built using a heuristic called recursive partitioning.

This approach is also commonly known as divide and conquer because it splits the data into subsets, which are then split repeatedly into even smaller subsets, and so on, until the process stops when the algorithm determines the data within the subsets are sufficiently homogeneous, or another stopping criterion has been met.
Divide and conquer might stop at a node when:

• All (or nearly all) of the examples at the node have the same class
• There are no remaining features to distinguish among the examples
• The tree has grown to a predefined size limit


Using the divide and conquer strategy, we can build a simple decision tree from this data.

First, to create the tree's root node, we split on the feature indicating the number of celebrities, partitioning the movies into groups with and without a significant number of A-list stars.
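The movie data itself is not included in these slides; as a stand-in, here is a minimal sketch of recursive partitioning in R using the rpart package and the built-in iris data (both are my choices, not the author's):

# A sketch of recursive partitioning on stand-in data, not the author's movie example.
library(rpart)

# Grow a classification tree; each split is chosen to make the child nodes purer.
fit <- rpart(Species ~ ., data = iris, method = "class",
             parms = list(split = "information"))   # entropy-based splitting
print(fit)   # lists each split and the class distribution at every node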
Form the Rules

The degree to which a subset of examples contains only a single class is known as purity, and any subset composed of only a single class is called pure.

There are various measurements of purity that can be used to identify the best decision tree splitting candidate.

Entropy is a concept borrowed from information theory that quantifies the randomness, or disorder, within a set of class values.

Sets with high entropy are very diverse and provide little information about other items that may also belong in the set, as there is no apparent commonality.

The decision tree hopes to find splits that reduce entropy, ultimately increasing homogeneity within the groups.
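As a quick illustration (my own, not from the slides), entropy for a set of class proportions can be computed directly in R:

entropy <- function(p) {
  # entropy in bits of a vector of class proportions (which sum to 1)
  p <- p[p > 0]          # treat 0 * log2(0) as 0
  -sum(p * log2(p))
}

> entropy(c(0.5, 0.5))   # maximally mixed two-class set
[1] 1
> entropy(c(0.6, 0.4))   # slightly purer
[1] 0.9709506
> entropy(1)             # completely pure set
[1] 0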
Support Vector Machines

SVM is a classification technique that is listed under supervised learning models in Machine Learning.

In layman's terms, it involves finding the hyperplane (a line in 2D, a plane in 3D, and a hyperplane in higher dimensions) that separates the two classes with the widest possible gap.

The data points that "support" this hyperplane on either side are called the "support vectors".

In the previous picture, the filled blue circle and the two filled squares are the support vectors.

For cases where the two classes of data are not linearly separable, the points are projected into an exploded (higher-dimensional) space where linear separation may be possible.

A problem involving multiple classes can be broken down into multiple one-versus-one or one-versus-rest binary classification problems.
An analogy: imagine balls of two colours on a table, separated by a stick placed so that the gap on either side is as big as possible; when the balls are too mixed up to separate with a stick, the table is flipped so the balls fly into the air and a sheet of paper is used to separate them.

Call the balls data, the stick a classifier, the biggest gap trick optimization, flipping the table kernelling, and the piece of paper a hyperplane.
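A minimal sketch of an SVM in R; the e1071 package and the built-in iris data are my assumptions, not part of the slides. With three classes, e1071's svm() handles the problem internally as a set of one-versus-one binary classifiers:

# A sketch, assuming the e1071 package is installed.
library(e1071)

# A radial (RBF) kernel projects the points into a higher-dimensional space
# where a linear separator may exist (the "flip the table" kernel trick).
fit  <- svm(Species ~ ., data = iris, kernel = "radial")
pred <- predict(fit, iris)
table(pred, iris$Species)   # confusion matrix on the training data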
Market Basket Analysis

Market Basket Analysis is a modelling technique based upon the theory that if you buy a certain group of items, you are more (or less) likely to buy another group of items.

• Credit card transactions: items purchased by credit card give insight into other products the customer is likely to purchase.
• Supermarket purchases: common combinations of products can be used to inform product placement on supermarket shelves.
• Telecommunication product purchases: commonly associated options (call waiting, caller display, etc.) help determine how to structure product bundles which maximise revenue.
• Banking services: the patterns of services used by retail customers are used to identify other services they may wish to purchase.
• Insurance claims: unusual combinations of insurance claims can be a sign of fraud.
• Medical patient histories: certain combinations of conditions can indicate increased risk of various complications.
Support

Support = # (or percent) of transactions that include both the antecedent and the consequent.

Example: support for the item set {Milk, Bread} is 4 out of 10 transactions, or 40%.

Supp((x1, x2, …) implies (y1, y2, …)) = (# of transactions containing all of x1, x2, … and y1, y2, …) / (total # of transactions)
Measures of Performance

Confidence: the % of antecedent transactions that also have the consequent item set.

Benchmark confidence = transactions with the consequent, as a % of all transactions.

Lift = confidence / (benchmark confidence)

Lift > 1 indicates a rule that is useful in finding consequent item sets (i.e., more useful than just selecting transactions randomly).
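These measures map directly onto the parameters of the apriori() function in the arules package; a minimal sketch, assuming arules is installed and using its bundled Groceries transaction data:

# A sketch, assuming the arules package is installed.
library(arules)

data(Groceries)   # example grocery-store transactions bundled with arules
rules <- apriori(Groceries,
                 parameter = list(support = 0.01, confidence = 0.25))

# Inspect the five rules with the highest lift (lift > 1 beats random selection).
inspect(sort(rules, by = "lift")[1:5])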
