Data Analysis. Data Management
Data analysis.
Data management.
1) Data analysis bases
2) Characteristics of data sample
3) Classification, Prediction
4) Classification by Decision Tree Induction
5) What is data mining?
6) What is “big data”?
1) Data analysis bases
Data analysis is a process of inspecting,
cleansing, transforming, and modeling data
with the goal of discovering useful
information, informing conclusions, and
supporting decision-making.
Data analysis has multiple facets and
approaches, encompassing diverse techniques
under a variety of names, while being used in
different business, science, and social science
domains.
Data Analytics
Raw data accumulated from various sources (e.g. discussion
boards, emails, exam logs, chat logs in e-learning systems)
can be used to identify fruitful patterns and relationships
(Bose, 2009).
Exploratory visualization: uses exploratory data
analytics to capture relationships that are
perhaps unknown or at least less formally
formulated
Confirmatory visualization: theory-driven
2) Characteristics of data sample
In any report or article, the structure of the
sample must be accurately described. It is
especially important to exactly determine the
structure of the sample (and specifically the size of
the subgroups) when subgroup analyses will be
performed during the main analysis phase.
The characteristics of the data sample can be
assessed by looking at:
- Basic statistics of important variables
- Scatter plots
- Correlations and associations
- Cross-tabulations
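The checks listed above can be sketched with the Python standard library alone. This is a minimal illustration, not a prescribed workflow: the sample rows, the field meanings (group, hours studied, exam score, passed) and the helper `pearson` are all hypothetical. Scatter plots are left out because they need a plotting library.

```python
from collections import Counter
from statistics import mean, stdev

# Hypothetical sample: (group, hours_studied, exam_score, passed)
sample = [
    ("A", 2, 55, "no"),
    ("A", 4, 65, "yes"),
    ("A", 6, 80, "yes"),
    ("B", 1, 50, "no"),
    ("B", 3, 58, "no"),
    ("B", 5, 75, "yes"),
]

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

hours = [row[1] for row in sample]
scores = [row[2] for row in sample]

# Structure of the sample: size of each subgroup.
subgroup_sizes = Counter(row[0] for row in sample)

# Basic statistics of an important variable.
basic_stats = {
    "mean": mean(scores),
    "stdev": stdev(scores),
    "min": min(scores),
    "max": max(scores),
}

# Correlation between two numeric variables.
hours_score_corr = pearson(hours, scores)

# Cross-tabulation of two categorical variables: (group, passed) -> count.
crosstab = Counter((row[0], row[3]) for row in sample)

print(subgroup_sizes, basic_stats, hours_score_corr, crosstab, sep="\n")
```

Reporting `subgroup_sizes` first makes it easy to see whether each subgroup is large enough for the planned subgroup analyses.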
3) Classification, Prediction
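Classification
predicts categorical class labels: constructs a model from the training set and the
values (class labels) of a classifying attribute, then uses the model to classify new data
Prediction
models continuous-valued functions, i.e. predicts unknown or missing values
Typical applications
- Credit approval
- Target marketing
- Medical diagnosis
- Fraud detection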
Classification: A Two-Step Process
Model construction: describing a set of predetermined classes
- Each tuple/sample is assumed to belong to a predefined class, as
  determined by the class label attribute
- The set of tuples used for model construction is the training set
- The model is represented as classification rules, decision trees, or
  mathematical formulae
Model usage: classifying future or unknown objects
- Estimate the accuracy of the model
- The known label of each test sample is compared with the classified
  result from the model
- The accuracy rate is the percentage of test set samples that are
  correctly classified by the model
- The test set must be independent of the training set; otherwise the
  accuracy estimate is over-optimistic and over-fitting goes undetected
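The need for an independent test set can be shown with a deliberately over-fitted model. The sketch below is only an illustration: the tuples are invented, and `fit_memorizer` is a hypothetical classifier that memorizes its training tuples and falls back to the majority class for anything it has not seen.

```python
from collections import Counter

def fit_memorizer(training_set):
    """Hypothetical over-fitted model: look up seen tuples, else majority class."""
    table = dict(training_set)
    majority = Counter(label for _, label in training_set).most_common(1)[0][0]
    return lambda features: table.get(features, majority)

def accuracy_rate(model, samples):
    """Fraction of samples whose known label matches the model's output."""
    correct = sum(1 for features, label in samples if model(features) == label)
    return correct / len(samples)

# Invented data: ((rank, years), tenured)
training_set = [
    (("Assistant Prof", 3), "no"),
    (("Assistant Prof", 7), "yes"),
    (("Professor", 2), "yes"),
    (("Associate Prof", 6), "no"),
    (("Associate Prof", 3), "no"),
]
# Independent test set: no tuple also appears in the training set.
test_set = [
    (("Professor", 5), "yes"),
    (("Assistant Prof", 2), "no"),
    (("Associate Prof", 7), "yes"),
    (("Professor", 8), "yes"),
]

model = fit_memorizer(training_set)
train_accuracy = accuracy_rate(model, training_set)
test_accuracy = accuracy_rate(model, test_set)
print(f"accuracy on training set: {train_accuracy:.2f}")
print(f"accuracy on test set:     {test_accuracy:.2f}")
```

Measured on its own training data the memorizer looks perfect; only the independent test set reveals how poorly it generalizes.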
Classification Process (1): Model Construction
[Figure: Training Data -> Classification Algorithms -> Classifier; the Classifier is applied to Testing Data and to Unseen Data, e.g. (George, Professor, 5): Tenured?]

NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes
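As a rough sketch of the model-usage step, the four rows listed above can be treated as labelled test tuples (an assumption) and run through a classifier. The rule inside `classify` is hypothetical, chosen only for illustration; the slides do not state which rules the classifier learned.

```python
# Rows copied from the listing above: (name, rank, years, tenured).
testing_data = [
    ("Tom", "Assistant Prof", 2, "no"),
    ("Merlisa", "Associate Prof", 7, "no"),
    ("George", "Professor", 5, "yes"),
    ("Joseph", "Assistant Prof", 7, "yes"),
]

def classify(rank, years):
    """Hypothetical classification rule, assumed for illustration only."""
    return "yes" if rank == "Professor" or years > 6 else "no"

# Compare the known label of each test tuple with the classifier's result.
results = [
    (name, actual, classify(rank, years))
    for name, rank, years, actual in testing_data
]
accuracy_rate = sum(actual == predicted for _, actual, predicted in results) / len(results)

# Classify the unseen tuple (George, Professor, 5).
unseen_prediction = classify("Professor", 5)

for name, actual, predicted in results:
    print(f"{name}: actual={actual}, predicted={predicted}")
print("accuracy rate:", accuracy_rate)
print("Tenured?", unseen_prediction)
```

Under this assumed rule one of the four test tuples (Merlisa) is misclassified, which is the kind of discrepancy the accuracy estimate is meant to expose.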
Issues (1): Data Preparation
Data cleaning
Preprocess data in order to reduce noise and handle
missing values
Relevance analysis (feature selection)
Remove the irrelevant or redundant attributes
Data transformation
Generalize and/or normalize data
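The three preparation steps can be sketched on a toy record set. Everything here is an assumption made for illustration: the field names, the choice of mean imputation for missing values, treating `id` as the irrelevant attribute, and min-max scaling as the normalization.

```python
from statistics import mean

# Hypothetical records; None marks a missing value.
records = [
    {"id": 1, "age": 25, "income": 30000, "label": "no"},
    {"id": 2, "age": None, "income": 52000, "label": "yes"},
    {"id": 3, "age": 47, "income": 81000, "label": "yes"},
    {"id": 4, "age": 33, "income": 45000, "label": "no"},
]

def impute_mean(rows, field):
    """Data cleaning: replace missing values of `field` with the mean of the known ones."""
    fill = mean(r[field] for r in rows if r[field] is not None)
    return [{**r, field: fill if r[field] is None else r[field]} for r in rows]

def drop_fields(rows, fields):
    """Relevance analysis: remove attributes judged irrelevant or redundant."""
    return [{k: v for k, v in r.items() if k not in fields} for r in rows]

def min_max_normalize(rows, field):
    """Data transformation: rescale `field` to [0, 1]; assumes it is not constant."""
    values = [r[field] for r in rows]
    lo, hi = min(values), max(values)
    return [{**r, field: (r[field] - lo) / (hi - lo)} for r in rows]

cleaned = impute_mean(records, "age")
relevant = drop_fields(cleaned, {"id"})
prepared = min_max_normalize(min_max_normalize(relevant, "age"), "income")

for row in prepared:
    print(row)
```

Each helper returns new dictionaries rather than modifying `records`, so the raw data stays available for comparison.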
Issues (2): Evaluating Classification Methods
Predictive accuracy
Speed
- time to construct the model
- time to use the model
Robustness
- handling noise and missing values
Scalability
- efficiency in disk-resident databases
Interpretability
- understanding and insight provided by the model
Goodness of rules
- decision tree size
- compactness of classification rules
4) Classification by Decision Tree Induction
Decision tree
A flow-chart-like tree structure
Internal node denotes a test on an attribute
Branch represents an outcome of the test
Leaf nodes represent class labels or class distribution
Decision tree generation consists of two phases
Tree construction
At start, all the training examples are at the root
Partition examples recursively based on selected attributes
Tree pruning
Identify and remove branches that reflect noise or outliers
Use of decision tree: Classifying an unknown sample
Test the attribute values of the sample against the decision tree
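The construction and usage steps above can be sketched as follows. The slides say only that examples are partitioned "based on selected attributes"; choosing the attribute with the lowest weighted entropy after the split is an assumption made here, as are the toy rows and the attribute names `age` and `student`. Pruning is not implemented.

```python
import math
from collections import Counter

def entropy(rows):
    """Shannon entropy of the class labels in `rows`."""
    counts = Counter(label for _, label in rows)
    total = len(rows)
    return -sum(c / total * math.log2(c / total) for c in counts.values())

def partition(rows, attr):
    """Group rows by their value of `attr`."""
    groups = {}
    for features, label in rows:
        groups.setdefault(features[attr], []).append((features, label))
    return groups

def build_tree(rows, attributes):
    """Recursively partition the examples; all of them start at the root."""
    labels = [label for _, label in rows]
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1 or not attributes:
        return majority  # leaf node: a class label

    def split_entropy(attr):
        groups = partition(rows, attr).values()
        return sum(len(g) / len(rows) * entropy(g) for g in groups)

    best = min(attributes, key=split_entropy)  # assumed selection criterion
    remaining = [a for a in attributes if a != best]
    return {
        "attribute": best,  # internal node: a test on an attribute
        "branches": {v: build_tree(g, remaining) for v, g in partition(rows, best).items()},
        "default": majority,  # used when a sample has an unseen value
    }

def classify(tree, sample):
    """Test the sample's attribute values against the tree until a leaf is reached."""
    while isinstance(tree, dict):
        tree = tree["branches"].get(sample[tree["attribute"]], tree["default"])
    return tree

# Hypothetical training examples: (attribute values, class label).
training = [
    ({"age": "<=30", "student": "no"}, "no"),
    ({"age": "<=30", "student": "no"}, "no"),
    ({"age": "<=30", "student": "yes"}, "yes"),
    ({"age": "30..40", "student": "no"}, "yes"),
    ({"age": "30..40", "student": "yes"}, "yes"),
    ({"age": ">40", "student": "no"}, "yes"),
    ({"age": ">40", "student": "yes"}, "yes"),
]

tree = build_tree(training, ["age", "student"])
print(tree)
print(classify(tree, {"age": "<=30", "student": "yes"}))
```

On this toy data the root test is `age`, and only the `<=30` branch needs a further test on `student`; the other branches are already pure and become leaves.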
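Training Dataset
[Figure: decision tree with root test "age?" and branches <=30, 30..40 (labelled "overcast" in the figure) and >40; leaf labels: no, yes, no, yes]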
5) What is data mining?
Data mining (knowledge discovery from data)
Extraction of interesting (non-trivial, implicit,
previously unknown and potentially useful) patterns
or knowledge from huge amounts of data
Alternative name
Knowledge discovery in databases (KDD)
What is not data mining?
Query processing
Expert systems or statistical programs
Data Mining: A KDD Process
Data mining: core of the knowledge discovery process
[Figure: KDD process. Databases -> Data Cleaning and Data Integration -> Data Warehouse -> Selection -> Task-relevant Data -> Data Mining -> Pattern Evaluation]
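The stages of the KDD process can be sketched as a small pipeline of functions. This is an illustrative toy, not a reference implementation: the two source "databases", the record layout, and the choice of counting co-occurring item pairs as the mining step are all assumptions.

```python
from collections import Counter
from itertools import combinations

# Two hypothetical source databases; None marks a record with missing items.
store_a = [
    {"id": 1, "items": ["bread", "milk"]},
    {"id": 2, "items": ["bread", "butter", "milk"]},
    {"id": 3, "items": None},
]
store_b = [
    {"id": 4, "items": ["milk", "butter"]},
    {"id": 5, "items": ["bread", "milk"]},
]

def integrate(*sources):
    """Data integration: combine records from several sources."""
    return [record for source in sources for record in source]

def clean(records):
    """Data cleaning: drop records whose items are missing or empty."""
    return [r for r in records if r.get("items")]

def select(records):
    """Selection: keep only the task-relevant data (the item sets)."""
    return [set(r["items"]) for r in records]

def mine(transactions):
    """Data mining: count how often each pair of items occurs together."""
    counts = Counter()
    for items in transactions:
        counts.update(combinations(sorted(items), 2))
    return counts

def evaluate(patterns, n_transactions, min_support):
    """Pattern evaluation: keep pairs whose support reaches the threshold."""
    return {
        pair: count / n_transactions
        for pair, count in patterns.items()
        if count / n_transactions >= min_support
    }

warehouse = clean(integrate(store_a, store_b))
transactions = select(warehouse)
interesting = evaluate(mine(transactions), len(transactions), min_support=0.5)
print(interesting)
```

Each function maps to one stage of the process, so a stage can be swapped out (for example, a different mining step) without touching the others.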
Architecture: Typical Data Mining System
[Figure: layered architecture; recoverable labels: Graphical user interface, Pattern evaluation, Data Warehouse, Databases]
Data Mining: Confluence of Multiple Disciplines
[Figure: Data Mining at the centre, surrounded by Database Systems, Statistics, Machine Learning, Visualization, Algorithm, and Other Disciplines]
Multi-Dimensional View of Data Mining
Data to be mined
- Relational, data warehouse, transactional, stream,
object-oriented/relational, active, spatial, time-series,
text, multi-media, heterogeneous, WWW
Knowledge to be mined
- Characterization, discrimination, association,
classification, clustering, trend/deviation, outlier
analysis, etc.
- Multiple/integrated functions and mining at multiple
levels
6) What is “big data”?
“Big Data are high-volume, high-velocity,
and/or high-variety information
assets that require new forms of processing
to enable enhanced decision making, insight
discovery and process optimization.”
Complicated (intelligent) analysis of
data may make small data “appear” to be
“big”.
Bottom line: Any data that exceeds our
current processing capability can be
regarded as “big”.
Computational View of Big Data
[Figure: recoverable labels: Storage, Formatting/Cleaning, Data, Data Visualization]
Question