Data Analysis. Data Management

LECTURE 6
Data analysis.
Data management.
1) Data analysis basics
2) Characteristics of data sample
3) Classification, Prediction
4) Classification by Decision Tree Induction
5) What is data mining?
6) What is “big data”?
1) Data analysis basics
Data analysis is a process of inspecting,
cleansing, transforming, and modeling data
with the goal of discovering useful
information, informing conclusions, and
supporting decision-making.
Data analysis has multiple facets and
approaches, encompassing diverse techniques
under a variety of names, while being used in
different business, science, and social science
domains.
Data Analytics
- Raw data accumulated from various sources (e.g. discussion boards, emails, exam
  logs, chat logs in e-learning systems) can be used to identify fruitful patterns
  and relationships (Bose, 2009).
- Exploratory visualization: uses exploratory data analytics to capture
  relationships that are perhaps unknown or at least less formally formulated.
- Confirmatory visualization: theory-driven.
2) Characteristics of data sample
In any report or article, the structure of the
sample must be accurately described. It is
especially important to exactly determine the
structure of the sample (and specifically the size of
the subgroups) when subgroup analyses will be
performed during the main analysis phase.
The characteristics of the data sample can be
assessed by looking at:
- Basic statistics of important variables
- Scatter plots
- Correlations and associations
- Cross-tabulations
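The checks listed above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library; the sample values, the variable names, and the pass mark of 60 are hypothetical and do not come from the lecture.

```python
from collections import Counter
from statistics import mean, stdev

# Hypothetical sample: (group, hours_studied, exam_score) per respondent.
sample = [
    ("A", 2, 55), ("A", 4, 62), ("A", 6, 71),
    ("B", 3, 58), ("B", 5, 70), ("B", 8, 85),
]

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

hours = [h for _, h, _ in sample]
scores = [s for _, _, s in sample]

# Basic statistics of an important variable.
score_mean, score_sd = mean(scores), stdev(scores)

# Subgroup sizes, which matter when subgroup analyses are planned.
group_sizes = Counter(g for g, _, _ in sample)

# Correlation between two numeric variables.
r = pearson(hours, scores)

# Cross-tabulation: group versus pass/fail (the pass mark of 60 is an assumption).
crosstab = Counter((g, "pass" if s >= 60 else "fail") for g, _, s in sample)

print(f"mean={score_mean:.1f} sd={score_sd:.1f} r={r:.2f}")
print(dict(group_sizes))
print(dict(crosstab))
```

A scatter plot of `hours` against `scores` would complete the list; it is left out here to keep the sketch free of plotting dependencies.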
3) Classification, Prediction
Classification
- predicts categorical class labels (discrete or nominal)
- constructs a model from the training set and the values (class labels) of a
  classifying attribute, then uses that model to classify new data
Prediction
- models continuous-valued functions, for example, predicts unknown or missing values

Typical applications:
- Credit approval
- Target marketing
- Medical diagnosis
- Fraud detection
Classification: A Two-Step Process
- Model construction: describing a set of predetermined classes
  - Each tuple/sample is assumed to belong to a predefined class, as determined by
    the class label attribute
  - The set of tuples used for model construction is the training set
  - The model is represented as classification rules, decision trees, or
    mathematical formulae
- Model usage: classifying future or unknown objects
  - Estimate the accuracy of the model
    - The known label of each test sample is compared with the result the model
      assigns to it
    - Accuracy rate is the percentage of test set samples that are correctly
      classified by the model
    - The test set must be independent of the training set; otherwise over-fitting
      inflates the accuracy estimate
Classification Process (1): Model Construction
Training Data -> Classification Algorithms -> Classifier (Model)

Training data:
NAME   RANK            YEARS   TENURED
Mike   Assistant Prof  4       no
Mary   Assistant Prof  10      yes
Bill   Professor       5       yes
Jim    Associate Prof  11      yes
Dave   Assistant Prof  5       no
Anne   Associate Prof  3       no

Classifier (model):
IF rank = ‘professor’ OR years > 6
THEN tenured = ‘yes’
Classification Process (2): Use the Model in Prediction
Testing Data -> Classifier; Unseen Data -> Classifier -> Tenured?

Testing data:
NAME     RANK            YEARS   TENURED
Tom      Assistant Prof  2       no
Merlisa  Associate Prof  7       no
George   Professor       5       yes
Joseph   Assistant Prof  7       yes

Unseen data: (George, Professor, 5). Tenured?
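The two steps can be traced in code. The sketch below hard-codes the rule from the model-construction slide and applies it to the testing data and the unseen sample from this slide; the function and variable names are illustrative choices, not part of the lecture.

```python
# Rule learned in the model-construction step (from the lecture):
# IF rank = 'professor' OR years > 6 THEN tenured = 'yes'
def classify(rank, years):
    return "yes" if rank == "Professor" or years > 6 else "no"

# Testing data from the lecture: (name, rank, years, known label).
test_set = [
    ("Tom", "Assistant Prof", 2, "no"),
    ("Merlisa", "Associate Prof", 7, "no"),
    ("George", "Professor", 5, "yes"),
    ("Joseph", "Assistant Prof", 7, "yes"),
]

# Compare each known label with the label the model assigns.
correct = sum(classify(rank, years) == label for _, rank, years, label in test_set)
accuracy = correct / len(test_set)

# Classify the unseen sample (George, Professor, 5).
unseen_label = classify("Professor", 5)

print(f"accuracy = {accuracy:.0%}")
print(f"George tenured? {unseen_label}")
```

The rule misclassifies Merlisa (7 years, not tenured), so three of the four test samples are correct and the accuracy rate on this test set is 75%.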
Issues (1): Data Preparation
- Data cleaning
  - Preprocess data in order to reduce noise and handle missing values
- Relevance analysis (feature selection)
  - Remove the irrelevant or redundant attributes
- Data transformation
  - Generalize and/or normalize data
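Each preparation step can be illustrated with a small function. This is a sketch under stated assumptions: the records are hypothetical, mean imputation is only one of several ways to handle missing values, and min-max scaling is only one form of normalization.

```python
from statistics import mean

# Hypothetical records; None marks a missing value.
records = [
    {"id": 1, "years": 4, "salary": 40.0},
    {"id": 2, "years": 10, "salary": None},
    {"id": 3, "years": 5, "salary": 60.0},
    {"id": 4, "years": 11, "salary": 80.0},
]

def fill_missing(rows, key):
    """Data cleaning: replace missing values of `key` with the mean of known ones."""
    known = [r[key] for r in rows if r[key] is not None]
    fill = mean(known)
    return [{**r, key: fill if r[key] is None else r[key]} for r in rows]

def drop_attributes(rows, keys):
    """Relevance analysis: remove attributes judged irrelevant or redundant."""
    return [{k: v for k, v in r.items() if k not in keys} for r in rows]

def min_max(rows, key):
    """Data transformation: rescale `key` linearly to the range [0, 1].

    Assumes the attribute is not constant (otherwise hi == lo).
    """
    lo = min(r[key] for r in rows)
    hi = max(r[key] for r in rows)
    return [{**r, key: (r[key] - lo) / (hi - lo)} for r in rows]

cleaned = fill_missing(records, "salary")
selected = drop_attributes(cleaned, {"id"})   # "id" carries no predictive information
prepared = min_max(min_max(selected, "years"), "salary")
print(prepared)
```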
Issues (2): Evaluating Classification Methods
- Predictive accuracy
- Speed
  - time to construct the model
  - time to use the model
- Robustness
  - handling noise and missing values
- Scalability
  - efficiency in disk-resident databases
- Interpretability
  - understanding and insight provided by the model
- Goodness of rules
  - decision tree size
  - compactness of classification rules
4) Classification by Decision Tree Induction
- Decision tree
  - A flow-chart-like tree structure
  - An internal node denotes a test on an attribute
  - A branch represents an outcome of the test
  - Leaf nodes represent class labels or class distributions
- Decision tree generation consists of two phases
  - Tree construction
    - At the start, all the training examples are at the root
    - Partition the examples recursively based on selected attributes
  - Tree pruning
    - Identify and remove branches that reflect noise or outliers
- Use of a decision tree: classifying an unknown sample
  - Test the attribute values of the sample against the decision tree
Training Dataset
This follows an example from Quinlan’s ID3.

age     income  student  credit_rating
<=30    high    no       fair
<=30    high    no       excellent
31…40   high    no       fair
>40     medium  no       fair
>40     low     yes      fair
>40     low     yes      excellent
31…40   low     yes      excellent
<=30    medium  no       fair
<=30    low     yes      fair
>40     medium  yes      fair
<=30    medium  yes      excellent
31…40   medium  no       excellent
31…40   high    yes      fair
>40     medium  no       excellent
Output: A Decision Tree for “buys_computer”

age?
- <=30: student?
  - no: no
  - yes: yes
- 31…40: yes
- >40: credit rating?
  - excellent: no
  - fair: yes
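The tree can be written as a function and checked against the training rows. The sketch below also computes information gain, the attribute-selection measure used by ID3, to show why age ends up at the root. One assumption is made loudly: the training table prints no class column, so the labels here are obtained by applying the tree itself to each row. All function names are illustrative.

```python
from collections import Counter
from math import log2

# Training rows from the lecture: (age, income, student, credit_rating).
rows = [
    ("<=30", "high", "no", "fair"), ("<=30", "high", "no", "excellent"),
    ("31…40", "high", "no", "fair"), (">40", "medium", "no", "fair"),
    (">40", "low", "yes", "fair"), (">40", "low", "yes", "excellent"),
    ("31…40", "low", "yes", "excellent"), ("<=30", "medium", "no", "fair"),
    ("<=30", "low", "yes", "fair"), (">40", "medium", "yes", "fair"),
    ("<=30", "medium", "yes", "excellent"), ("31…40", "medium", "no", "excellent"),
    ("31…40", "high", "yes", "fair"), (">40", "medium", "no", "excellent"),
]
ATTRIBUTES = ("age", "income", "student", "credit_rating")

def buys_computer(age, income, student, credit_rating):
    """Walk the decision tree from the root to a leaf and return the class label."""
    if age == "<=30":
        return "yes" if student == "yes" else "no"
    if age == "31…40":
        return "yes"
    return "yes" if credit_rating == "fair" else "no"   # age > 40

# Assumption: labels come from applying the tree, since no class column is printed.
labels = [buys_computer(*row) for row in rows]

def entropy(values):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(values)
    return -sum(c / n * log2(c / n) for c in Counter(values).values())

def information_gain(index):
    """Entropy reduction obtained by partitioning the rows on one attribute."""
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[index], []).append(label)
    remainder = sum(len(p) / len(rows) * entropy(p) for p in partitions.values())
    return entropy(labels) - remainder

gains = {name: information_gain(i) for i, name in enumerate(ATTRIBUTES)}
best = max(gains, key=gains.get)
print(best, {k: round(v, 3) for k, v in gains.items()})
```

With these labels, age has the largest information gain of the four attributes, which is consistent with it being the first test in the tree.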
5) What is data mining?
- Data mining (knowledge discovery from data)
  - Extraction of interesting (non-trivial, implicit, previously unknown and
    potentially useful) patterns or knowledge from huge amounts of data
- Alternative name
  - Knowledge discovery in databases (KDD)
- What is not data mining?
  - Query processing
  - Expert systems or statistical programs
Data Mining: A KDD Process
Data mining is the core of the knowledge discovery process.

Figure: the KDD process, from source data to knowledge:
Databases -> Data Cleaning, Data Integration -> Data Warehouse -> Selection ->
Task-relevant Data -> Data Mining -> Pattern Evaluation
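The stages of the KDD process can be lined up as a small pipeline. This is a toy sketch, not a real mining system: the two source databases are hypothetical, counting co-occurring item pairs stands in for the data mining step, and the minimum support of 0.5 is an arbitrary threshold chosen for illustration.

```python
from collections import Counter
from itertools import combinations

# Two hypothetical source databases of purchase records; None marks a missing value.
db_a = [{"customer": 1, "items": ["pc", "mouse"]}, {"customer": 2, "items": None}]
db_b = [{"customer": 3, "items": ["pc", "mouse", "desk"]},
        {"customer": 4, "items": ["pc", "desk"]}]

def clean(records):
    """Data cleaning: drop records with missing values."""
    return [r for r in records if all(v is not None for v in r.values())]

def integrate(*sources):
    """Data integration: combine the cleaned sources into one warehouse."""
    return [r for source in sources for r in source]

def select(warehouse):
    """Selection: keep only the task-relevant attribute (the item lists)."""
    return [sorted(r["items"]) for r in warehouse]

def mine(transactions):
    """Data mining: count how often each pair of items occurs together."""
    return Counter(pair for t in transactions for pair in combinations(t, 2))

def evaluate(patterns, total, min_support=0.5):
    """Pattern evaluation: keep pairs whose support reaches the threshold."""
    return {p: c / total for p, c in patterns.items() if c / total >= min_support}

warehouse = integrate(clean(db_a), clean(db_b))
transactions = select(warehouse)
interesting = evaluate(mine(transactions), len(transactions))
print(interesting)
```

Each function corresponds to one box in the process; the output of one stage is the input of the next, which is the point the figure makes.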
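Architecture: Typical Data Mining System
Figure: layers of a typical data mining system, from top to bottom:
- Graphical user interface
- Pattern evaluation
- Data mining engine
- Database or data warehouse server
- Data cleaning & data integration, filtering
- Databases, data warehouse
A knowledge base supports the data mining engine and pattern evaluation.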
Data Mining: Confluence of Multiple Disciplines
Figure: data mining at the center, drawing on:
- Database systems
- Statistics
- Machine learning
- Visualization
- Algorithms
- Other disciplines
Multi-Dimensional View of Data Mining
- Data to be mined
  - Relational, data warehouse, transactional, stream, object-oriented/relational,
    active, spatial, time-series, text, multi-media, heterogeneous, WWW
- Knowledge to be mined
  - Characterization, discrimination, association, classification, clustering,
    trend/deviation, outlier analysis, etc.
  - Multiple/integrated functions and mining at multiple levels
6) What is “big data”?
“Big Data are high-volume, high-velocity, and/or high-variety information
assets that require new forms of processing to enable enhanced decision making,
insight discovery and process optimization”.
Complicated (intelligent) analysis of data may make small data “appear” to be
“big”.
Bottom line: any data that exceeds our current capability of processing can be
regarded as “big”.
Computational View of Big Data
Figure: stages in the computational view of big data: Data, Storage,
Formatting and Cleaning, Data Understanding, Data Access, Data Integration,
Data Analysis, Data Visualization
Questions
1. What is big data?
2. Explain the origin of big data technology.
3. What data can be attributed to big data?
4. Name the characteristics that describe big data.
Thank you for your attention!
