16 Comparison of Data Science Algorithms

Comparison of Data Science Algorithms
Classification: Predicting a categorical target variable

Algorithm Description Model Input Output Pros Cons Use Cases
Tends to overfit the

Partitions the data Intuitive to data. Small
A set of rules
into smaller No The label explain to changes in input
to partition a
subsets where each restrictions cannot be nontechnical data can yield Marketing
Decision data set based
subset contains on variable numeric. It business users. substantially segmentation,
trees on the values of
(mostly) responses type for must be Normalizing different trees. fraud detection
the different
of one class (either predictors categorical predictors is not Selecting the right
predictors
“yes” or “no”) necessary parameters can be
challenging
Model can be
Models the A set of easily explained
No
relationship organized rules Prediction of to business users Manufacturing,
restrictions.
between input and that contain an target Divides the data set applications
Rule Accepts
output by deducing antecedent variable, in rectilinear where description
induction categorical, Easy to deploy
simple “IF/THEN” (inputs) and which is fashion of model is
numeric, and in almost any
rules from a data consequent categorical necessary
binary inputs tools and
set (output class)
applications
k-Nearest A lazy learner Entire training No Prediction of Requires very The deployment Image processing,
neighbors where no model is data set is the restrictions. target little time to runtime and applications
generalized. Any model However, the variable, build the model. storage where slower
new unknown data distance which is Handles missing requirements will response time is
point is compared calculations categorical attributes in the be expensive acceptable
against similar work better unknown record
known data points with numeric gracefully. Arbitrary selection
in the training set data. Data Works with of value of k
No description of
the model
needs to be nonlinear
Time required to Training data set
No model and needs to be
Predicts the output A lookup table restrictions. deploy is representative
Prediction of
class based on the of probabilities However, the minimum sample of
probability
Bayes’ theorem by and conditional probability population and
Naïve for all class Spam detections,
calculating class probabilities for calculation needs to have
Bayesian values, along Great algorithm text mining
conditional each attribute works better complete
with the for
probability and with an output with combinations of
winning class benchmarking.
prior probability class categorical input and output.
attributes Strong statistical Attributes need to
foundation be independent
A computational No easy way to

and mathematical explain the inner
Good at
model inspired by A network working of the
Prediction of modeling Image
the biological topology of model
Artificial All attributes target (label) nonlinear recognition, fraud
nervous system. layers and
neural should be variable, relationships. detection, quick
The weights in the weights to
networks numeric which is Fast response Requires response time
network learn to process input
categorical time in preprocessing data. applications
reduce the error data
deployment Cannot handle
between actual and
prediction missing attributes
Boundary The model is a Extremely robust Computational

Prediction of
detection algorithm vector equation against performance during Optical character
target (label)
Support that that allows one All attributes overfitting. training phase can recognition, fraud
variable,
vector identifies/defines to classify new should be Small changes to be slow. This may detection,
which can be
machines multi-dimensional data points into numeric input data do not be compounded by modeling “black-
categorical or
boundaries different regions affect boundary the effort needed to swan” events
numeric
separating data (classes) and, thus, do not optimize parameter
yield different
results. Good at
points belonging to
handling combinations
different classes
nonlinear
relationships
Leverages wisdom Achieving model

of the crowd. Reduces the independence is
Superset of
Employs a number A meta-model Prediction generalization tricky Most of the
restrictions
Ensemble of independent with individual for all class error. Takes practical
from the
models models to make a base models and values with a different search classifiers are
base model Difficult to explain
prediction and an aggregator winning class space into ensemble
used the inner working
aggregates the final consideration
prediction of the model
Regression: Predicting a numeric target variable

Algorithm Description Model Input Output Pros Cons Use Case
The classical
The model
predictive Cannot handle
consists of The workhorse of
model that missing data.
coefficients for most predictive
expresses the All The label Categorical data Pretty much any
each input modeling
Linear relationship attributes may be are not directly scenario that requires
predictor and techniques. Easy to
regression between inputs should be numeric or usable, but predicting a continuous
their statistical use and explain to
and an output numeric binominal require numeric value
significance. A nontechnical
parameter in the transformation
bias (intercept) business users
form of an into numeric
may be optional
equation
Logistic Technically, The model All The label One of the most Cannot handle Marketing scenarios
regression this is a consists of attributes may only common missing data. Not (e.g., will click or not
coefficients for
each input
predictor that
relate to the
classification
“logit.” classification intuitive when
method. But
Transforming should be be methods. dealing with a click), any general two-
structurally it is
the logit into numeric binominal Computationally large number of class problem
similar to linear
probabilities of efficient predictors
regression
occurrence (of
each class)
completes the
model
Association analysis: Unsupervised process for finding relationships between items

Measures the
Transactions List of Unsupervised
strength of Finds simple,
format with relevant approach with Requires
co- easy to Recommendation engines,
FP-Growth items in the rules minimal user preprocessing if
occurrence understand rules cross-selling, and content
and Apriori columns and developed inputs. Easy to input is of
between one like {Milk, suggestions
transactions from the understand different format
item with Diaper}→{Beer}
in the rows data set rules
another
Clustering: An unsupervised process for finding meaningful groups in data

Algorithm Description Model Input Output Pros Cons Use case
k-Means Data set is divided Algorithm No Data set is Simple to Specification Customer
restrictions.
finds k centroids However, the of k is
and all the data distance implement. arbitrary and
appended by segmentation, anomaly
points are calculations Can be used may not find
into k clusters by one of detection, applications
assigned to the work better for natural
finding k centroids the k cluster where globular
nearest centroid, with numeric dimension clusters.
labels clustering is natural
which forms a data. Data reduction Sensitive to
cluster should be outliers
normalized
Specification
No of density
restrictions. Finds the parameters. A
However, the natural bridge Applications where
List of clusters
Identifies clusters distance Cluster labels clusters of between two clusters are
and assigned data
as a high-density calculations based on any shape. clusters can nonglobular shapes and
DBSCAN points. Default
area surrounded by work better identified No need to merge the when the prior number
cluster 0 contains
low-density areas with numeric clusters mention cluster. of natural groupings is
noise points
data. Data number of Cannot cluster not known
should be clusters varying
normalized density data
set
Self- A visual clustering A two- No No explicit A visual way Number of Diverse applications
organizing technique with dimensional restrictions. clusters to explain centroids including visual data
maps roots from neural lattice where However, the identified. the clusters. (topology) is exploration, content
networks and similar data distance Similar data Reduces specified by suggestions, and
prototype clustering points are calculations points occupy multi- the user. Does dimension reduction
arranged next to work better either the same dimensional not find
each other with numeric cell or are data to two natural
data. Data placed next to dimensions clusters in the
should be each other in data
normalized the
neighborhood
Anomaly detection: Supervised and unsupervised techniques to find outliers in data

Every data
Accepts both point has a
numeric and distance Easy to
Outlier All data points
categorical score. The implement.
identified based are assigned a
Distance– attributes. higher the Works well Specification Fraud detection,
on distance distance score
based Normalization is distance, with of k is arbitrary preprocessing technique
to kth nearest based on nearest
required since the more numeric
neighbor neighbor
distance is likely the attributes
calculated data point
is an outlier
Every data
Accepts both point has a
Specification of
numeric and density Easy to
Outlier is All data points distance
categorical score. The implement.
identified based are assigned a parameter by
Density- attributes. lower the Works well Fraud detection,
on data points in density score the user.
based Normalization is density, the with preprocessing technique
low-density based on the Inability to
required since more likely numeric
regions neighborhood identify varying
density is the data attributes
density regions
calculated point is an
outlier
Local Outlier is All data points Accepts both Every data Can handle Specification of Fraud detection,
outlier identified based as assigned a numeric and point has a the varying distance preprocessing technique
factor on calculation of relative density categorical density density parameter by
relative density score based on attributes. score. The scenario the user
lower the
relative
Normalization is
density, the
in the the required since
more likely
neighborhood neighborhood density is
the data
calculated
point is an
outlier
Deep learning: Training using multiple layers of representation of data

Layer Type Description Input Output Pros Cons Use Cases
For most practical

Based on the Typically the
A tensor of classification
concept of applying output of Very
typically three or problems, conv
filters to incoming convolutional powerful and
more dimensions. layers have to be Classify almost any data
two-dimensional layer is flattened general
Two of the coupled with where spatial information is
representation of and passed purpose
dimensions dense layers highly correlated such as
data, such as through a dense network. The
Convolutional correspond to the which result in a images. Even audio data can
images. Machine or fully number of
image while a large number of be converted into images
learning is used to connected layer weights to be
third is sometimes weights to be (using fourier transforms)
automatically which usually learned in the
used for trained and thus and classified via conv nets.
determine the terminates in a conv layer is
color/channel lose any speed
correct weights for softmax output not very high.
encoding. advantages of a
the filters. layer.
pure conv layer.
Recurrent Just as conv nets are A sequence of RNNs can Unlike other RNNs suffer from Forecasting time series,
specialized for any type (time process types of vanishing (or natural language processing
analyzing spatially series, text, sequences and neural exploding) situations such as machine
correlated data, speech, etc). output other networks, gradients when translation, image
Layer Type Description Input Output Pros Cons Use Cases
recurrent nets are

specialized for RNNs have the sequences are
temporally sequences (many no restriction very long. RNNs
correlated data: to many), or that the input are also not
sequences. The data output a fixed shape of the amenable to many captioning.
can be sequences of tensor (many to data be of stacked layers due
numbers, audio one). fixed to the same
signals, or even dimension. reasons.
images
Recommenders: Finding the user’s preference of an item

Assumptio
Algorithm Description Input Output Pros Cons Use Case
n
Find a Cold start

The only input
cohort of problem for
needed is the
users who new users and
ratings matrix
provided Similar items eCommerce,
Ratings
Collaborative similar users or Complete music, new
matrix with
Filtering - ratings. items have d ratings Computation connection
user-item
neighborhood based Derive the similar matrix grows linearly recommendation
preferences.
outcome likes Domain with the s
rating from agnostic number of
the cohort items and
users users
CollaborativeFilterin Decompose User’s Ratings Complete Works in sparse Cannot Content

g - Latent matrix the user- preference matrix with d ratings matrix explain why recommendation
Assumptio
n
item matrix
into two
of an item
matrices (P
can be
and Q) with More accurate
better
latent than
explained
factors. Fill user-item neighborhood the prediction
factorization by their matrix s
the blank preferences. based is made
preference
values in the collaborative
of an item’s
ratings filtering
character
matrix by
(inferred)
dot product
of P and Q
Abstract the Addresses cold

Requires item
features of start problem
profile data set
the item and for new items
build item
profile. Use Recommen
Music
the item d items User-item
Complete recommendation
Content-based profile to similar to rating
d ratings Can provide from Pandora
filtering evaluate the those the matrix and
matrix explanations on Recommender and CiteSeer’s
user user liked Item profile
why the s are domain citation indexing
preference in the past
for the recommendatio specific
attributes in n is made
the item
profile
Content-based - A Every time User-item Complete Every user has a Storage and eCommerce,
Supervised learning personalized a user rating d ratings separate model computational content, and
models classificatio prefers an matrix and matrix and could be time connection
n or item, it is a Item profile independently recommendation
Assumptio
n
regression
model for
every single
user in the
system.
Learn a
vote of
classifier customized.
preference
based on Hyper s
for item
user likes or personalization
attributes
dislikes of
an item and
its
relationship
with item
attributes
Time series forecasting: Predicting future value of a variable

Decompose the
Increased Accuracy Application
time series into
Models for the understanding of depends on the where the
trend, Historical Forecasted
Decomposition individual the time series models used explanation of
seasonality, and value value
components by visualizing for components is
noise. Forecast
the components components important
the components
Exponential The future Learn the Historical Forecasted Applies to wide Multiple Cases where
smoothing value is the parameters of value value range of time seasonality in trend or
function of past the smoothing series with or the data make seasonality is
observations equation from without trend or the models not evident
the historical
seasonality cumbersome
data
The future
value is the
Parameter
ARIMA function of auto Forms a The optimal Applies on
values for
(autoregressive correlated past Historical Forecasted statistical p,d,q value is almost all
(p,d,q), AR,
integrated data points and value value baseline for unknown to types of time
and MA
moving average) the moving model accuracy begin with series data
coefficients
average of the
predictions
Create cross- Machine Uses any The Applies to

sectional data learning machine windowing user cases
Windowing-
set with time models like Historical Forecasted learning size, horizon, where the
based machine
lagged inputs regression, value value approaches on and skip time series has
learning
neural the cross- values are trend and/or
networks, etc. sectional data arbitrary seasonality
Feature selection: Selection of most important attributes

PCA combines the Each principal Numerical Numerical Efficient way to Sensitive to Most
(principal most component is a attributes attributes extract predictors scaling effects, numeric-
component important function of (reduced that are i.e., requires valued data
analysis) attributes into attributes in the set). Does uncorrelated to normalization of sets require
filter-based a fewer dataset not really each other. Helps attribute values dimension
number of require a to apply Pareto before reduction
transformed label principle in application.
attributes identifying Focus on variance
sometimes results
attributes with
in selecting noisy
highest variance
attributes
Data sets
require a Applications
Selecting
label. Can for feature
attributes No restrictions
Info gain Similar to only be selection
based on on variable Same as decision Same as decision
(filter- decision tree applied on where target
relevance to type for trees trees
based) model data sets variable is
the target or predictors
with categorical or
label
nominal numeric
label
Data sets
Extremely robust.
require a
Selecting Uses the chi- A fast and Applications
label. Can
attributes square test of efficient scheme for feature
Chi-square Categorical only be Sometimes
based on independence to identify which selection
(filter- (polynominal) applied on difficult to
relevance to to relate categorical where all
based) attributes data sets interpret
the target or predictors to variables to select variables are
with a
label label for a predictive categorical
nominal
model
label
Once a variable is Data sets

Selecting Multicollinearity added to the set, it with a large
Works in
Forward attributes The label problems can be is never removed number of
conjunction All attributes
Selection based on may be avoided. Speeds in subsequent input
with modeling should be
(wrapper- relevance to numeric or up the training iterations even if variables
methods such numeric
based) the target or binominal phase of the its influence on where feature
as regression
label modeling process the target selection is
diminishes required
Data sets
Selecting Multicollinearity Need to begin
Works in with few
Backward attributes The label problems can be with a full model,
conjunction All attributes input
elimination based on may be avoided. Speeds which can
with modeling should be variables
(wrapper- relevance to numeric or up the training sometimes be
methods such numeric where feature
based) the target or binominal phase of the computationally
as regression selection is
label modeling process intensive
required

16 Comparison of Data Science Algorithms

Uploaded by

Copyright:

Available Formats

You might also like

16 Comparison of Data Science Algorithms

Uploaded by

Document Information

Original Description:

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

16 Comparison of Data Science Algorithms

Uploaded by

Copyright:

Available Formats

Comparison of Data Science Algorithms

Classification: Predicting a categorical target variable

Tends to overfit the

A computational No easy way to

Boundary The model is a Extremely robust Computational

Leverages wisdom Achieving model

Regression: Predicting a numeric target variable

Association analysis: Unsupervised process for finding relationships between items

Clustering: An unsupervised process for finding meaningful groups in data

Anomaly detection: Supervised and unsupervised techniques to find outliers in data

Deep learning: Training using multiple layers of representation of data

For most practical

recurrent nets are

Recommenders: Finding the user’s preference of an item

Find a Cold start

CollaborativeFilterin Decompose User’s Ratings Complete Works in sparse Cannot Content

Abstract the Addresses cold

Time series forecasting: Predicting future value of a variable

Create cross- Machine Uses any The Applies to

Feature selection: Selection of most important attributes

Once a variable is Data sets

You might also like