
LECTURE 6

Data analysis.
Data management.
1) Data analysis bases
2) Characteristics of data sample
3) Classification, Prediction
4) Classification by Decision Tree
Induction
5) What is data mining?
6) What is “big data”?
1) Data analysis bases
Data analysis is a process of inspecting,
cleansing, transforming, and modeling data
with the goal of discovering useful
information, informing conclusions, and
supporting decision-making.
Data analysis has multiple facets and
approaches, encompassing diverse techniques
under a variety of names, while being used in
different business, science, and social science
domains.
Data Analytics
- Raw data accumulated from various sources (e.g.
  discussion boards, emails, exam logs, and chat logs
  in e-learning systems) can be used to identify
  useful patterns and relationships (Bose, 2009).
- Exploratory visualization: exploratory data
  analysis that captures relationships that are
  perhaps unknown or at least less formally
  formulated.
- Confirmatory visualization: theory-driven; it
  checks relationships that were formulated in advance.
2) Characteristics of data sample
In any report or article, the structure of the
sample must be accurately described. It is
especially important to exactly determine the
structure of the sample (and specifically the size of
the subgroups) when subgroup analyses will be
performed during the main analysis phase.
The characteristics of the data sample can be
assessed by looking at:
- Basic statistics of important variables
- Scatter plots
- Correlations and associations
- Cross-tabulations
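The checks listed above can be sketched with the Python standard library alone. The sample below is invented purely for illustration, and the names `sample`, `pearson`, and `crosstab` are hypothetical; scatter plots are left out because they need a plotting library.

```python
from collections import Counter
from statistics import mean, median, stdev

# Hypothetical sample: (group, hours_studied, exam_score). All values are invented.
sample = [
    ("A", 2, 55), ("A", 4, 62), ("A", 6, 71), ("A", 8, 80),
    ("B", 1, 48), ("B", 3, 58), ("B", 5, 66), ("B", 9, 85),
]

hours = [row[1] for row in sample]
scores = [row[2] for row in sample]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


# Basic statistics of an important variable
basic_stats = {
    "n": len(scores),
    "mean": mean(scores),
    "median": median(scores),
    "stdev": stdev(scores),
}

# Subgroup sizes, which should be known before any subgroup analysis
group_sizes = Counter(row[0] for row in sample)

# Cross-tabulation: group against "score is at least 60"
crosstab = Counter((row[0], row[2] >= 60) for row in sample)

# Correlation between two numeric variables
correlation = pearson(hours, scores)
```

Reporting `group_sizes` first makes it obvious when a subgroup is too small to support a separate analysis.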
3) Classification, Prediction
Classification
- predicts categorical class labels (discrete or nominal)
- constructs a model from the training set and the values (class
  labels) of a classifying attribute, then uses the model to classify new data
Prediction
- models continuous-valued functions, for example to predict unknown or missing values
Typical applications:
- Credit approval
- Target marketing
- Medical diagnosis
- Fraud detection
Classification: A Two-Step Process
- Model construction: describing a set of predetermined classes
  - Each tuple/sample is assumed to belong to a predefined class, as
    determined by the class label attribute
  - The set of tuples used for model construction is the training set
  - The model is represented as classification rules, decision trees, or
    mathematical formulae
- Model usage: classifying future or unknown objects
  - Estimate the accuracy of the model
    - The known label of each test sample is compared with the
      result produced by the model
    - The accuracy rate is the percentage of test set samples that are
      correctly classified by the model
    - The test set must be independent of the training set; otherwise
      the accuracy estimate is over-optimistic and over-fitting goes unnoticed
Classification Process (1): Model Construction

Training Data -> Classification Algorithms -> Classifier (Model)

Training Data:
NAME  RANK            YEARS  TENURED
Mike  Assistant Prof  4      no
Mary  Assistant Prof  10     yes
Bill  Professor       5      yes
Jim   Associate Prof  11     yes
Dave  Assistant Prof  5      no
Anne  Associate Prof  3      no

Classifier (Model):
IF rank = ‘professor’ OR years > 6
THEN tenured = ‘yes’
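As a minimal sketch, the rule produced in this step can be written as a Python function and checked against the six training tuples from the slide. The function name `classify_tenure` is hypothetical; the data and the rule are the ones shown above.

```python
# Training data from the slide: (name, rank, years, tenured)
training_data = [
    ("Mike", "Assistant Prof", 4, "no"),
    ("Mary", "Assistant Prof", 10, "yes"),
    ("Bill", "Professor", 5, "yes"),
    ("Jim", "Associate Prof", 11, "yes"),
    ("Dave", "Assistant Prof", 5, "no"),
    ("Anne", "Associate Prof", 3, "no"),
]


def classify_tenure(rank, years):
    """Rule from the slide: IF rank = 'professor' OR years > 6 THEN tenured = 'yes'."""
    return "yes" if rank.lower() == "professor" or years > 6 else "no"


# Names of training tuples that the rule gets wrong (empty if the rule fits)
training_errors = [
    name
    for name, rank, years, tenured in training_data
    if classify_tenure(rank, years) != tenured
]
```

An empty `training_errors` list only shows that the rule fits the data it was built from; it says nothing yet about accuracy on new data.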
Classification Process (2): Use the Model in Prediction

Testing Data -> Classifier
Unseen Data -> Classifier -> Tenured?

Testing Data:
NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes

Unseen Data: (George, Professor, 5)
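To make the second step concrete, the sketch below applies the rule from the model-construction slide (rank = 'professor' OR years > 6) to the four testing tuples, computes the accuracy rate, and classifies the unseen sample. The function name `classify_tenure` is hypothetical.

```python
def classify_tenure(rank, years):
    """Rule from the model-construction step of this lecture."""
    return "yes" if rank.lower() == "professor" or years > 6 else "no"


# Testing data from the slide: (name, rank, years, known tenured label)
testing_data = [
    ("Tom", "Assistant Prof", 2, "no"),
    ("Merlisa", "Associate Prof", 7, "no"),
    ("George", "Professor", 5, "yes"),
    ("Joseph", "Assistant Prof", 7, "yes"),
]

# Compare the known label of each test sample with the model's result
misclassified = [
    name
    for name, rank, years, tenured in testing_data
    if classify_tenure(rank, years) != tenured
]
accuracy = 1 - len(misclassified) / len(testing_data)

# Classify the unseen sample (George, Professor, 5)
unseen_prediction = classify_tenure("Professor", 5)
```

On this test set the rule misclassifies one tuple (Merlisa, who has 7 years but is not tenured), so the accuracy rate is 75%.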
Issues (1): Data Preparation
- Data cleaning
  - Preprocess data in order to reduce noise and handle missing values
- Relevance analysis (feature selection)
  - Remove the irrelevant or redundant attributes
- Data transformation
  - Generalize and/or normalize data
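The three preparation steps can be sketched as small functions. Everything here is an assumption made for illustration: the records are invented, mean imputation is only one way to handle missing values, dropping constant attributes is a very crude stand-in for relevance analysis, and min-max scaling is one of several normalization methods.

```python
from statistics import mean

# Hypothetical records; None marks a missing value.
records = [
    {"age": 25, "income": 30000, "country": "KZ"},
    {"age": None, "income": 52000, "country": "KZ"},
    {"age": 41, "income": None, "country": "KZ"},
    {"age": 33, "income": 74000, "country": "KZ"},
]
numeric_columns = ["age", "income"]


def fill_missing(rows, columns):
    """Data cleaning: replace missing numeric values with the column mean."""
    cleaned = [dict(row) for row in rows]
    for col in columns:
        col_mean = mean(r[col] for r in rows if r[col] is not None)
        for r in cleaned:
            if r[col] is None:
                r[col] = col_mean
    return cleaned


def drop_constant(rows):
    """Relevance analysis (crude): drop attributes that take a single value."""
    keep = [c for c in rows[0] if len({r[c] for r in rows}) > 1]
    return [{c: r[c] for c in keep} for r in rows]


def min_max(rows, columns):
    """Data transformation: rescale each remaining numeric column to [0, 1]."""
    scaled = [dict(r) for r in rows]
    for col in columns:
        if col not in rows[0]:
            continue  # column was removed by an earlier step
        lo = min(r[col] for r in rows)
        hi = max(r[col] for r in rows)
        for r in scaled:
            r[col] = (r[col] - lo) / (hi - lo)
    return scaled


prepared = min_max(drop_constant(fill_missing(records, numeric_columns)), numeric_columns)
```

The order matters in this sketch: normalization runs last so that imputed values are scaled together with the observed ones.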
Issues (2): Evaluating Classification Methods
- Predictive accuracy
- Speed
  - time to construct the model
  - time to use the model
- Robustness
  - handling noise and missing values
- Scalability
  - efficiency in disk-resident databases
- Interpretability
  - understanding and insight provided by the model
- Goodness of rules
  - decision tree size
  - compactness of classification rules
4) Classification by Decision Tree Induction
- Decision tree
  - A flow-chart-like tree structure
  - An internal node denotes a test on an attribute
  - A branch represents an outcome of the test
  - Leaf nodes represent class labels or class distributions
- Decision tree generation consists of two phases
  - Tree construction
    - At the start, all the training examples are at the root
    - Examples are partitioned recursively based on selected attributes
  - Tree pruning
    - Identify and remove branches that reflect noise or outliers
- Use of a decision tree: classifying an unknown sample
  - Test the attribute values of the sample against the decision tree
Training Dataset

This follows an example from Quinlan’s ID3.

age    income  student  credit_rating
<=30   high    no       fair
<=30   high    no       excellent
31…40  high    no       fair
>40    medium  no       fair
>40    low     yes      fair
>40    low     yes      excellent
31…40  low     yes      excellent
<=30   medium  no       fair
<=30   low     yes      fair
>40    medium  yes      fair
<=30   medium  yes      excellent
31…40  medium  no       excellent
31…40  high    yes      fair
>40    medium  no       excellent
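ID3 chooses the attribute to split on by information gain, so a short sketch of that calculation may help show why age ends up at the root. One loud assumption: the table above does not list the class label, so the `buys_computer` values below are filled in to agree with the decision tree this lecture shows for the same data. The names `entropy` and `information_gain` are hypothetical helpers.

```python
from collections import Counter
from math import log2

ATTRIBUTES = ["age", "income", "student", "credit_rating"]

# First four fields come from the training table. The fifth (buys_computer)
# is an ASSUMPTION: it is not listed in the table and is chosen to be
# consistent with the decision tree shown for this data set.
ROWS = [
    ("<=30", "high", "no", "fair", "no"),
    ("<=30", "high", "no", "excellent", "no"),
    ("31…40", "high", "no", "fair", "yes"),
    (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),
    (">40", "low", "yes", "excellent", "no"),
    ("31…40", "low", "yes", "excellent", "yes"),
    ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),
    (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"),
    ("31…40", "medium", "no", "excellent", "yes"),
    ("31…40", "high", "yes", "fair", "yes"),
    (">40", "medium", "no", "excellent", "no"),
]


def entropy(labels):
    """Expected information (in bits) needed to identify the class of a sample."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, attr_index):
    """Entropy of the whole set minus the weighted entropy after the split."""
    labels = [row[-1] for row in rows]
    partitions = {}
    for row in rows:
        partitions.setdefault(row[attr_index], []).append(row[-1])
    remainder = sum(len(p) / len(rows) * entropy(p) for p in partitions.values())
    return entropy(labels) - remainder


gains = {name: information_gain(ROWS, i) for i, name in enumerate(ATTRIBUTES)}
best_attribute = max(gains, key=gains.get)
```

Under the assumed labels, age has the largest gain, which matches its position at the root of the tree; the same calculation would then be repeated inside each partition.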
Output: A Decision Tree for “buys_computer”

age?
- <=30: student?
  - no: buys_computer = no
  - yes: buys_computer = yes
- 31…40: buys_computer = yes
- >40: credit rating?
  - excellent: buys_computer = no
  - fair: buys_computer = yes
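Classifying an unknown sample means walking this tree from the root to a leaf. The sketch below encodes the tree as nested dictionaries, assuming the three branches of age are <=30, 31…40 and >40, as in the training table. The names `TREE` and `classify` are hypothetical.

```python
# The tree as nested dicts: {attribute: {attribute value: subtree or class label}}
TREE = {
    "age": {
        "<=30": {"student": {"no": "no", "yes": "yes"}},
        "31…40": "yes",
        ">40": {"credit_rating": {"excellent": "no", "fair": "yes"}},
    }
}


def classify(tree, sample):
    """Walk from the root to a leaf, testing one attribute per internal node."""
    while isinstance(tree, dict):
        attribute, branches = next(iter(tree.items()))
        tree = branches[sample[attribute]]
    return tree  # a leaf: the predicted buys_computer label


unknown = {"age": "<=30", "income": "medium", "student": "yes", "credit_rating": "fair"}
prediction = classify(TREE, unknown)
```

Note that income never appears in the tree, so its value has no effect on the prediction.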
5) What is Data Mining?
- Data mining (knowledge discovery from data)
  - Extraction of interesting (non-trivial, implicit,
    previously unknown and potentially useful) patterns
    or knowledge from huge amounts of data
- Alternative name
  - Knowledge discovery in databases (KDD)
- What is not data mining
  - Query processing
  - Expert systems or statistical programs
Data Mining: A KDD Process
Data mining is the core of the knowledge discovery process.

Databases -> Data Cleaning, Data Integration -> Data Warehouse
-> Selection -> Task-relevant Data -> Data Mining
-> Pattern Evaluation
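The stages of the KDD process can be sketched as a chain of small functions. This is a toy under stated assumptions: the two source databases are invented, and counting co-occurring item pairs stands in for the data mining step only because it is short; any mining technique could take its place. All function names are hypothetical.

```python
from collections import Counter
from itertools import combinations

# Two hypothetical source databases of purchase records (invented data).
store_a = [{"id": 1, "items": ["bread", "milk"]}, {"id": 2, "items": None}]
store_b = [
    {"id": 3, "items": ["bread", "milk", "eggs"]},
    {"id": 4, "items": ["milk", "eggs"]},
    {"id": 5, "items": ["bread", "milk"]},
]


def clean(records):
    """Data cleaning: drop records whose item list is missing."""
    return [r for r in records if r["items"]]


def integrate(*sources):
    """Data integration: combine the cleaned sources into one warehouse."""
    return [r for source in sources for r in source]


def select(warehouse):
    """Selection: keep only the task-relevant data (the item sets)."""
    return [frozenset(r["items"]) for r in warehouse]


def mine(baskets):
    """Data mining (stand-in): count how often each item pair occurs together."""
    counts = Counter()
    for basket in baskets:
        counts.update(combinations(sorted(basket), 2))
    return counts


def evaluate(patterns, n_baskets, min_support=0.5):
    """Pattern evaluation: keep pairs whose support reaches the threshold."""
    return {
        pair: n / n_baskets
        for pair, n in patterns.items()
        if n / n_baskets >= min_support
    }


warehouse = integrate(clean(store_a), clean(store_b))
baskets = select(warehouse)
interesting = evaluate(mine(baskets), len(baskets))
```

Keeping each stage as its own function mirrors the diagram: the mining step sees only cleaned, integrated, task-relevant data.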
Architecture: Typical Data Mining System
Layers, from top to bottom:
- Graphical user interface
- Pattern evaluation
- Data mining engine
- Database or data warehouse server
- Data cleaning & data integration, filtering
- Databases, data warehouse
A knowledge base sits alongside the pattern evaluation and
data mining engine layers.
Data Mining: Confluence of Multiple Disciplines
Data mining draws on:
- Database systems
- Statistics
- Machine learning
- Visualization
- Algorithms
- Other disciplines
Multi-Dimensional View of Data Mining
- Data to be mined
  - Relational, data warehouse, transactional, stream,
    object-oriented/relational, active, spatial, time-series,
    text, multi-media, heterogeneous, WWW
- Knowledge to be mined
  - Characterization, discrimination, association,
    classification, clustering, trend/deviation, outlier
    analysis, etc.
  - Multiple/integrated functions and mining at multiple
    levels
6) What is “big data”?
“Big Data are high-volume, high-velocity,
and/or high-variety information assets that
require new forms of processing to enable
enhanced decision making, insight discovery
and process optimization.”
Complicated (intelligent) analysis of data
may make small data “appear” to be “big”.
Bottom line: any data that exceeds our
current processing capability can be
regarded as “big”.
Computational View of Big Data
Layers, from top to bottom:
- Data Visualization
- Data Access, Data Analysis
- Data Understanding, Data Integration
- Formatting, Cleaning
- Storage, Data
Review questions

1. What is Big Data?
2. Explain the origin of Big Data technology.
3. What data can be attributed to big data?
4. Name the characteristics that describe big data.
Thank you for your attention!
