13 14 BDT DataMining&MachineLearning PDF
Cyrus Lentin
Data Mining
Data Mining Is The Computing Process Of Discovering Patterns In Large Data Sets Involving Methods
At The Intersection Of Machine Learning, Statistics, And Database Systems
It Is An Essential Process Where Intelligent Methods Are Applied To Extract Data Patterns
The Overall Goal Of The Data Mining Process Is To Extract Information From A Data Set And
Transform It Into An Understandable Structure For Further Use
Aside From The Raw Analysis Step, It Involves
Database And Data Management Aspects
Data Pre-processing
Model And Inference Considerations
Interestingness Metrics
Complexity Considerations
Post-processing Of Discovered Structures
Visualization, And Online Updating
Data Mining Is The Analysis Step Of The "Knowledge Discovery In Databases" Process, Or KDD
Knowledge Discovery In Databases (KDD) Process Is Commonly Defined With The Stages:
Selection
Pre-processing
Transformation
Data Mining
Interpretation/Evaluation
Cross Industry Standard Process For Data Mining (CRISP-DM) Defines KDD In Six Phases:
Business Understanding
Data Understanding
Data Preparation
Modeling
Evaluation
Deployment
A Simplified Process Defines KDD As
Pre-processing
Data Mining
Results Validation
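The simplified process above (pre-processing, data mining, results validation) can be sketched end-to-end with scikit-learn. The dataset (iris), the scaler, and the choice of a decision tree are illustrative assumptions, not part of the notes:

```python
# Sketch of the simplified KDD process, assuming scikit-learn is available.
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Selection: assemble a target data set (iris used purely as an example)
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Pre-processing: scale features using statistics from the training data only
scaler = StandardScaler().fit(X_train)

# Data mining: fit a model to the pre-processed training data
model = DecisionTreeClassifier(random_state=0)
model.fit(scaler.transform(X_train), y_train)

# Results validation: measure quality on data the model has not seen
acc = accuracy_score(y_test, model.predict(scaler.transform(X_test)))
print(round(acc, 2))
```

Any model could stand in for the decision tree here; the point is the three-stage shape of the pipeline.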
Big Data Technology - Cyrus Lentin
Pre Processing
Before Data Mining Algorithms Can Be Used, A Target Data Set Must Be Assembled
As Data Mining Can Only Uncover Patterns Actually Present In The Data
The Target Data Set Must Be Large Enough To Contain These Patterns
The Target Data Should Also Remain Concise Enough To Be Mined Within An Acceptable Time Limit
A Common Source For Data Is A Data Mart Or Data Warehouse
Pre-processing Is Essential To Analyze The Multivariate Data Sets Before Data Mining
The Target Set Is Then Cleaned. Data Cleaning Removes The Observations Containing Noise And
Those With Missing Data
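Data cleaning as described above (dropping observations with noise or missing values) can be sketched with the standard library alone; the field names and the validity rule are illustrative assumptions:

```python
# Minimal sketch of data cleaning: rows with missing fields or
# impossible (noisy) values are removed from the target set.
raw = [
    {"age": 34, "income": 52000},
    {"age": None, "income": 48000},   # missing data -> removed
    {"age": 29, "income": -1},        # noise (impossible value) -> removed
    {"age": 41, "income": 61000},
]

def is_clean(row):
    # No missing values, and income must be non-negative (example rule)
    return all(v is not None for v in row.values()) and row["income"] >= 0

cleaned = [row for row in raw if is_clean(row)]
print(len(cleaned))  # 2 records survive cleaning
```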
Data mining can unintentionally be misused, producing results that appear significant but do not
actually predict future behaviour, cannot be reproduced on a new sample of data, and are therefore of
little use
Often this results from investigating too many hypotheses and not performing proper statistical
hypothesis testing
A simple version of this problem in machine learning is known as overfitting, but the same problem
can arise at different phases of the process and thus a train/test split - when applicable at all - may
not be sufficient to prevent this from happening
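Overfitting can be demonstrated directly: an unrestricted decision tree memorises its training split but scores noticeably lower on held-out data. The synthetic dataset and hyper-parameters below are illustrative assumptions:

```python
# Sketch of overfitting, assuming scikit-learn is available.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with 20% label noise, so perfect generalisation is impossible
X, y = make_classification(n_samples=400, n_features=20, n_informative=5,
                           flip_y=0.2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

# No depth limit: the tree can grow until it memorises the training data
tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
train_acc = tree.score(X_tr, y_tr)  # essentially perfect on seen data
test_acc = tree.score(X_te, y_te)   # noticeably lower on unseen data
print(train_acc, test_acc)
```

The gap between the two scores is the symptom the train/test split is designed to expose; as the notes say, a split alone may still not be enough when many hypotheses are investigated.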
With supervised learning, there is always a feedback loop to correct you based on the prediction
results. With unsupervised learning, there is no feedback, ie, there is no teacher to correct you
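The contrast can be seen in how the two kinds of model are fit: the supervised model receives the labels y (the "teacher") and its predictions can be scored against them, while the unsupervised model sees only X. The dataset and model choices are illustrative assumptions:

```python
# Supervised vs unsupervised learning, assuming scikit-learn is available.
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Supervised: fit on (X, y); known labels provide the feedback loop
supervised = LogisticRegression(max_iter=500).fit(X, y)
sup_acc = supervised.score(X, y)  # accuracy measured against the teacher

# Unsupervised: fit on X alone; no labels, hence no teacher to correct it
unsupervised = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
n_groups = len(set(unsupervised.labels_))  # groups found without feedback

print(sup_acc, n_groups)
```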
IMPORTANT
Independent Variables must be continuous numeric or categorical numeric values
Dependent Variables are also continuous numeric values
IMPORTANT
Independent Variables are all continuous numeric or categorical numeric values
Dependent Variables are always categorical numeric values
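Since the notes require all variables to be numeric, string-valued categories must first be encoded as categorical numeric codes. A minimal standard-library sketch, with an illustrative column:

```python
# Encode a categorical (string) variable as categorical numeric codes.
colours = ["red", "green", "red", "blue", "green"]

# Assign a stable integer code to each distinct category
codes = {c: i for i, c in enumerate(sorted(set(colours)))}
# {'blue': 0, 'green': 1, 'red': 2}

encoded = [codes[c] for c in colours]
print(encoded)  # [2, 1, 2, 0, 1]
```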
Types of Classification
Decision Trees
Random Forests
Naïve Bayes
Ensembles are a divide-and-conquer approach used to improve performance. The main principle
behind ensemble methods is that a group of weak learners can come together to form a strong
learner. Each classifier, individually, is a weak learner, while all the classifiers taken together are a
strong learner.
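The weak-learners-into-a-strong-learner idea can be sketched by comparing a single depth-1 decision tree (a "stump") against a random forest built from many such stumps; the forest typically outperforms any individual stump. The dataset and hyper-parameters are illustrative assumptions:

```python
# Ensemble sketch: many weak learners vs one, assuming scikit-learn.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=600, n_features=20, n_informative=6,
                           random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.4, random_state=1)

# One weak learner: a decision stump (depth 1, uses a single feature)
weak = DecisionTreeClassifier(max_depth=1, random_state=1).fit(X_tr, y_tr)

# An ensemble of 200 such stumps, each trained on a random resample
forest = RandomForestClassifier(n_estimators=200, max_depth=1,
                                random_state=1).fit(X_tr, y_tr)

print(weak.score(X_te, y_te), forest.score(X_te, y_te))
```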
Support
The Support Of An Itemset X, Supp(X), Is The Proportion Of Transactions In The Database In Which The
Itemset X Appears. It Signifies The Popularity Of An Itemset
Confidence
Signifies The Likelihood Of Item Y Being Purchased When Item X Is Purchased
Lift
This signifies the likelihood of the itemset Y being purchased when item X is purchased while taking
into account the popularity of Y
Conviction
This is interpreted as the ratio of the expected frequency that X occurs without Y (that is to say, the
frequency that the rule makes an incorrect prediction) if X and Y were independent, divided by the
observed frequency of incorrect predictions
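The four metrics above can be computed by hand for the rule {bread} → {milk} on a toy transaction database (the itemsets are purely illustrative):

```python
# Support, confidence, lift and conviction on a toy transaction database.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk"},
]
n = len(transactions)

def support(itemset):
    # Proportion of transactions that contain the whole itemset
    return sum(1 for t in transactions if itemset <= t) / n

supp_x = support({"bread"})            # 4/5 = 0.8
supp_y = support({"milk"})             # 4/5 = 0.8
supp_xy = support({"bread", "milk"})   # 3/5 = 0.6

confidence = supp_xy / supp_x               # P(Y | X)       = 0.75
lift = confidence / supp_y                  # vs popularity of Y = 0.9375
conviction = (1 - supp_y) / (1 - confidence)  # expected/observed misses = 0.8
print(confidence, lift, conviction)
```

A lift below 1 (as here) indicates bread and milk are bought together slightly less often than their individual popularity would predict.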