
1,2

1. data mining: automated analysis of massive data, extraction
of interesting patterns, known as KDD (knowledge discovery from data)
2. generalization: information integration and data warehouse
construction, cleaning, transformation, integration, (data
cube technology=scalable methods for computing
multidimensional aggregates), OLAP
3. association/correlation analysis: frequent patterns
4. classification: construct models based on training examples
(decision tree, naive Bayesian), applications: credit card fraud
detection, direct marketing (basket data analysis), medical (cluster analysis)
5. cluster analysis: unsupervised learning, group data into new
categories, maximizing intra-class similarity, minimizing
interclass similarity
6. trend, time-series and deviation analysis: regression and
value prediction
7. mining data streams: ordered, time-varying, potentially infinite
8. data object: represents an entity, (ex. customers, students)
9. attribute: data field, representing a characteristic or feature of a
data object (ex. name, age, salary)
10. attribute (feature) vector: the set of attributes describing an object (encoding sketch at end of this section)
11. nominal: symbols/names, each value represents a category or
state, also referred to as categorical, may be represented as
numbers (ex. hair color, marital status, customer ID), qualitative
12. binary: nominal with only 2 values, boolean, qualitative
a. symmetric: states equally valuable (same weight), ex. gender
b. asymmetric: states are NOT equally important, ex. medical test
13. ordinal: qualitative, values have meaningful order, but magnitude
between successive values is unknown, ex. professional rank/ grade,
useful for data reduction of numerical attributes
14. numeric: quantitative
a. interval-scaled: equal-size units, DON'T have a true zero point,
so values can't be expressed as multiples, ex. temperature, year
b. ratio-scaled: have zero point, can be expressed as
multiples, ex. years of experience, weight, salary
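A minimal sketch (assuming pandas is available; the table and its column names are made up) of how these attribute types could be encoded in a feature-vector table:

```python
import pandas as pd

# hypothetical customer table; each row is one attribute (feature) vector
customers = pd.DataFrame({
    "hair_color": ["black", "brown", "blond"],        # nominal (categorical)
    "smoker": [True, False, False],                    # binary
    "grade": pd.Categorical(["B", "A", "C"],
                            categories=["C", "B", "A"],
                            ordered=True),             # ordinal: order matters, gaps unknown
    "temperature_c": [36.6, 37.2, 36.9],               # interval-scaled: no true zero point
    "salary": [3000.0, 4500.0, 5200.0],                # ratio-scaled: true zero, ratios meaningful
})

print(customers.dtypes)
```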
2,3
1. discrete vs. continuous: discrete has a finite or countably infinite set of
values (ex. hair color, smoker, medical test, ID); if an attribute is
not discrete it's continuous (ex. height, weight, age)
2. mean: average of values, sensitive to outliers,
(trimmed mean=chop off extreme values at both ends)
3. median: middle value of ordered set, expensive to compute for
large # of observations
4. mode: value occurs most frequently
5. variance&SD: indicate how spread out a data distribution is
6. range: difference between the largest and smallest values
7. quantiles: points taken at regular intervals of a distribution, dividing it
into (almost) equal-size sets, (most common=quartiles e.g. Q1, and percentiles)
8. boxplots: visualization technique for the five-number summary (computation sketch at end of this section)
9. whiskers: terminate at min&max OR at the most extreme observations within 1.5 × IQR of the quartiles
10. scatter plot: values treated as coordinates & plotted as point in plane
11. statistical description: attributes, (ex. age, weight, color, grade)
12. similarity/dissimilarity: objects, (ex. customers), measure proximity
13. data matrix: object-by-attribute, n-by-p, AKA. feature vectors
14. dissimilarity matrix: object-by-object, n-by-n
15. normalize: give all attributes equal weight regardless of units (cm vs.
meters, grams vs. kilograms), e.g. map values to the interval [-1,1] / [0,1]
16. distance of object to itself is 0, distance is a symmetric function
17. preprocess data to satisfy the requirements of the intended use
18. accuracy: faulty instruments, errors by human/computer
19. completeness: different design phases, optional attributes
20. consistency: semantics, data type, field formats
21. believability: how much data are trusted by users
22. interpretability: how easy the data are understood
23. cleaning: filling in missing values, smoothing, identify or
remove outliers, resolve inconsistencies in the data
24. integration: data from multiple resources, map semantic concepts
25. reduction: reduced representation, smaller volume same result
26. discretization: raw data are replaced by ranges or higher concept lvl
27. transformation: normalization
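A minimal numpy sketch of the descriptive measures above (the age values are made up and already sorted; trimming one value per end is an arbitrary choice):

```python
import numpy as np
from statistics import mode

ages = np.array([22, 25, 25, 27, 30, 31, 33, 35, 38, 90])    # made-up, sorted; 90 is an outlier

mean = ages.mean()                                  # sensitive to the outlier
trimmed_mean = ages[1:-1].mean()                    # trimmed mean: chop one extreme value at each end
median = np.median(ages)                            # middle value of the ordered set
most_frequent = mode(ages.tolist())                 # mode: value that occurs most frequently (25)
variance, sd = ages.var(ddof=1), ages.std(ddof=1)   # how spread out the distribution is
data_range = ages.max() - ages.min()                # range = largest - smallest

# five-number summary used by boxplots: min, Q1, median, Q3, max
five_num = np.percentile(ages, [0, 25, 50, 75, 100])
print(mean, trimmed_mean, median, most_frequent, variance, sd, data_range, five_num)
```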
3,4
1. incomplete: lacking attribute values, only aggregate data
2. inconsistent: containing discrepancies in codes and names
3. cleaning: missing value may not imply an error
a. ignore tuple: NOT effective, unless the tuple contains several missing values
b. mean or median: mean for normal symmetric data, median
for skewed data, use for samples belonging to the same class
c. most probable value: predicted using regression, Bayesian inference, or decision tree induction
4. noise: random error or variance in a measured variable
5. data smoothing: binning, regression, outlier analysis
6. binning: smooth a sorted data value by consulting its neighborhood (sketch at end of this section)
a. local smoothing: sorted values are partitioned into a # of buckets/bins
b. equal-frequency bins: each bin has the same # of values
c. equal-width bins: interval range of values per bin is constant
7. regression: conform data values to a function
a. linear regression: find the best line to fit two attributes so
that one attribute can be used to predict the other
8. Potter's Wheel: automated interactive data cleaning tool
9. integration: reduce and avoid redundancies and
inconsistencies in the resulting data set
a. semantic heterogeneity: entity identification problem
b. structure of data: data dependencies and referential constraints
c. redundancy
10. metadata: avoid errors in schema integration & data
transformation
11. attribute is redundant if it can be "derived" from another attribute
12. redundancy can be detected by correlation analysis by
measuring how strongly one attribute implies the other (chi-
square for nominal), (correlation coefficient and covariance
for numeric)
13. correlation DOESN'T imply causality
14. tuple duplication: use of denormalized tables (often used to
improve performance by avoiding joins) is another source of data
redundancy
15. data value conflict: ex. grades recorded on two different grading scales in two sources
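A minimal sketch of equal-frequency binning with smoothing by bin means, plus equal-width bin boundaries (the price values and the choice of three bins are made up):

```python
import numpy as np

prices = np.array([4., 8., 15., 21., 21., 24., 25., 28., 34.])   # already sorted, made-up values

# equal-frequency (depth) bins: each bin holds the same number of values
bins = np.array_split(prices, 3)

# smoothing by bin means: every value in a bin is replaced by the bin's mean
smoothed = np.concatenate([np.full(len(b), b.mean()) for b in bins])
print(smoothed)            # [ 9.  9.  9. 22. 22. 22. 29. 29. 29.]

# equal-width bins: the value range is split into intervals of constant width
width = (prices.max() - prices.min()) / 3
edges = prices.min() + width * np.arange(4)
print(edges, np.digitize(prices, edges[1:-1]))   # bin index (0, 1, 2) for each value
```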
4
1. dimensionality reduction: reduce number of attributes
(wavelet transform, PCA, attribute subset selection);
numerosity reduction: replace the data by a smaller representation
a. parametric: a model is used to estimate the data, only the
model parameters are stored, "regression"
b. nonparametric: store reduced representations of the data (ex.
histogram, clustering, sampling)
2. compression: transformations applied to obtain a
"compressed" representation of original data (lossless, lossy)
3. attribute subset selection: find a min set of attributes such that
the resulting probability distribution of data is as close as
possible to the original distribution using all attributes,
exhaustive search can be prohibitively expensive
4. attribute construction: e.g. area attribute based on height and
width attributes
5. regression: data modeled to fit a straight line
6. Regression line equation: y = wx + b, (w is slope, b is y-intercept)
7. histogram: partitions data into disjoint subsets (buckets/bins)
a. if each bucket represents a single attribute value/frequency pair: singleton buckets
8. equal-width: width of each bucket range is uniform
9. equal-frequency(depth): frequency of each bucket is constant
10. sampling: large set represented by a smaller random data sample
11. SRSWOR (simple random sample without replacement): all tuples are equally likely to be sampled
12. smoothing: binning, regression
13. concept hierarchy: e.g. street generalized to higher-level concepts
(city or country), facilitate drilling and rolling to view data in multiple
granularity, can be explicitly specified by domain experts
14. to help avoid dependence on the choice of measurement units,
give all attributes equal weight using (min-max, z-score)
15. min-max normalization: linear transformation to a new range, e.g. [0,1]
16. z-score: attribute value normalized based on the mean and SD (sketch at end of this section)
17. discretization: concept hierarchy can be automatically formed
for both numeric and nominal data
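A minimal sketch of both normalization methods on made-up salary values; min-max here maps to [0, 1] via v' = (v - min) / (max - min), and z-score uses v' = (v - mean) / SD:

```python
import numpy as np

salaries = np.array([30000., 45000., 52000., 61000., 98000.])   # made-up values

# min-max normalization to [0, 1]
minmax = (salaries - salaries.min()) / (salaries.max() - salaries.min())

# z-score normalization
zscore = (salaries - salaries.mean()) / salaries.std()

print(minmax.round(3), zscore.round(3))
```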
5,6
1. items that are frequently associated together are represented as association rules
2. support and confidence are measures of rule interestingness (sketch at end of this section)
3. if itemset satisfies min_support then it is a frequent itemset
4. if the rule satisfies min_support & min_confidence then it is
strong
5. all subsets of a frequent itemset MUST also be frequent (Apriori property)
6. divide-and-conquer: to avoid costly candidate generation,
compress the database of frequent items into a frequent
pattern (FP) tree
7. NOT all association rules are interesting, so we use correlation
analysis
8. pattern evaluation methods: lift, χ² (chi-square)
9. classification: data analysis task where a model is
constructed to predict class labels(categories)
a. learning(training) step: construct classification model
b. classification step: predict class labels for given data (test
set)
10. decision tree: a flowchart-like tree structure, can be binary or
otherwise, NO domain knowledge required, NO parameter
setting, multidimensional data, fast
a. internal node: test on an attribute
b. branch: test outcome
c. leaf: class label
11. pure: a partition is pure if all its tuples belong to the same class
12. when an attribute is chosen to split the training data set, it's
removed from the attribute list
13. Terminating conditions:
a. all tuples in D belong to the same class
b. there are no remaining attributes on which the tuples may
be further partitioned, (majority voting=convert node
into a leaf and label it with the most common class in data
partition)
c. there are no tuples for a given branch, a leaf is created
with the majority class in data partition
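A minimal sketch of support and confidence over a made-up transaction set (the item names are hypothetical):

```python
# made-up transaction database
transactions = [
    {"milk", "bread", "butter"},
    {"bread", "butter"},
    {"milk", "bread"},
    {"milk", "diapers"},
    {"bread", "butter", "diapers"},
]
n = len(transactions)

def support(itemset):
    """Fraction of transactions containing every item in the itemset."""
    return sum(itemset <= t for t in transactions) / n

def confidence(antecedent, consequent):
    """support(A ∪ B) / support(A) for the rule A => B."""
    return support(antecedent | consequent) / support(antecedent)

print(support({"bread", "butter"}))          # 0.6 -> frequent if min_support <= 0.6
print(confidence({"bread"}, {"butter"}))     # 0.75 -> strong if min_confidence <= 0.75
```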
6,7
1. attribute selection measure: heuristic for selecting the
splitting criterion that best splits a partition into smaller mutually
exclusive classes, (measures: information gain, gain ratio)
2. information gain: minimizes the expected # of tests to classify a
tuple, guarantees a simple (but not necessarily the simplest) tree is found; it is
the difference between the original information requirement and the new
requirement after the split (entropy sketch at end of this section)
3. pruning: removes the least reliable branches
4. prepruning: statistically assess the goodness of a split before it takes
place, hard to choose thresholds for statistical significance
5. postpruning: remove sub-trees from already constructed
trees & replace with leaf node labeled with most frequent class
6. rule conflicts: a tuple fires more than one rule and the rules make
different class predictions
7. size ordering: the rule with the largest (toughest) antecedent has the
highest priority; it fires and returns its class prediction
8. rule ordering: rules are prioritized a priori according to
a. class-based ordering: classes sorted in decreasing order of importance
(most frequent class = highest order of prevalence)
b. rule-based ordering: measures of rule quality (accuracy,
size, domain expertise)
9. fallback rule: when no rules are triggered
10. naive Bayesian classifier: statistical classifier that predicts the
probability that a tuple belongs to a specific class, high
accuracy, speed, (class-conditional independence: each attribute's
effect on class determination is independent of the other attributes)
11. K-nearest neighbor: delay classification until new test data is
available, use similarity measure to compute distance between
test data tuple and each of the training data tuples (Euclidean,
Manhattan), k stands for the # of closest neighbors according to
measured distance, majority voting of their class labels used to
determine class of test tuple
12. linear regression: model a relationship between two sets of
variables, to make predictions about data (y: dependent,
x: independent)
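A minimal sketch of entropy and information gain (the tiny training set, its attribute names, and the class label are made up):

```python
import math
from collections import Counter

def entropy(labels):
    """Expected information (in bits) needed to classify a tuple."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(rows, attr, label):
    """Info(D) minus the weighted entropy of the partitions obtained by splitting on attr."""
    before = entropy([r[label] for r in rows])
    after = 0.0
    for value in {r[attr] for r in rows}:
        subset = [r[label] for r in rows if r[attr] == value]
        after += len(subset) / len(rows) * entropy(subset)
    return before - after

# made-up training tuples
data = [
    {"age": "youth", "buys": "no"},   {"age": "youth", "buys": "no"},
    {"age": "middle", "buys": "yes"}, {"age": "middle", "buys": "yes"},
    {"age": "senior", "buys": "yes"}, {"age": "senior", "buys": "no"},
]
print(information_gain(data, "age", "buys"))   # ~0.667: splitting on "age" removes most uncertainty
```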
8,9
1.holdout: randomly allocate 2/3 of data for training and the remaining 1/3 for testing
2.random subsampling: repeat holdout k times and take average accuracy
3.k-fold cross-validation: randomly partition dataset into k mutually
exclusive folds of approximately equal size
4.stratified k-fold cross-validation: class distribution in each fold is the same
as in initial dataset (10-fold is recommended)
5.ensemble: set of classifiers, each with a vote for a class label & produced from
different partition, majority voting to compose an aggregate classification
6.bootstrap: same size as dataset, sampling with replacement
7.cluster analysis: discovers unknown groups, partition a set of objects into
subsets/clusters, objects in a cluster are similar, dissimilar to objects in other
clusters, clusters are implicit, (ex.: business intelligence, image recognition, web
search, biology, security), used for pre-processing and outlier detection
a.partitioning: find mutually exclusive clusters of spherical shape, distance-
based, use mean/medoid to represent cluster center, small-to-medium,
(K-means: divide dataset into k mutually exclusive clusters represented by
centroids, centroid is a cluster's center point & the mean of its points, min
distance), cluster quality is measured by minimizing the sum of squared errors (sketch at end of this section)
b.hierarchical: multiple levels, CAN'T correct erroneous merges/splits,
may consider object "linkages"
i.agglomerative: bottom-up (merge), each object starts in its
own cluster, the two closest clusters are merged into one, iteratively merge
ii.divisive: top-down (split), all objects start in one big cluster,
divide into subclusters, recursively divide
c.density-based: finds arbitrarily shaped clusters, dense regions
separated by low-density region, each point must have a min # of
points within its "neighborhood", filter out outliers
i.DBSCAN: find core objects (objects with dense neighborhoods),
(neighborhood density: # of objects in the neighborhood), (MinPts:
density threshold), (core object: object whose ε-neighborhood contains at
least MinPts objects), (p is density-reachable from q if q is a core object
and p is in the neighborhood of q), (q&m are density-connected if there
exists an object o such that q&m are both density-reachable from o)
d.grid-based: multi-resolution grid data structure, fast processing
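A minimal numpy sketch of k-means with the sum of squared errors (SSE) as the quality measure (the points, k=2, the iteration count, and the seed are arbitrary choices; a library implementation such as scikit-learn's KMeans would normally be used):

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Plain k-means: assign each point to its nearest centroid, then recompute the means."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]    # init: k random data points
    for _ in range(iters):
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)   # point-centroid distances
        labels = d.argmin(axis=1)                                            # nearest-centroid assignment
        centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                              else centroids[j] for j in range(k)])          # keep old centroid if empty
    labels = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2).argmin(axis=1)
    sse = ((X - centroids[labels]) ** 2).sum()    # sum of squared errors: lower = more compact clusters
    return labels, centroids, sse

X = np.array([[1., 1.], [1.2, .8], [.9, 1.1], [8., 8.], [8.2, 7.9], [7.8, 8.1]])
labels, centroids, sse = kmeans(X, k=2)
print(labels, sse)
```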
10
1. Evaluation of clustering
a. assessing clustering tendency: determines whether a
given data set has a non-random structure (Hopkins
statistic: statistical tests for spatial randomness)
b. measuring clustering quality:
i. extrinsic: compare clustering against ground truth
(supervision) to capture:
1. cluster homogeneity: the purer the clusters (each containing
objects of a single class label), the better the clustering
2. cluster completeness: an object with a class label
belongs to the cluster representing that class label
3. rag bag: objects that can't be merged with any cluster should
be put into a rag bag (miscellaneous) cluster
4. small cluster preservation: splitting a small category
is more harmful than splitting a large category
5. precision: how many objects in the same cluster
belong to the same category as the object
6. recall: how many objects of the same category are
assigned to the same cluster
ii. intrinsic: measure how well the clusters are separated and how compact they are
1. silhouette coefficient (sketch at end of this section): difference between:
a. average distance between object o and all other
objects in its cluster (captures cluster
compactness) - smaller is better (more compact)
b. minimum average distance from o to all
clusters to which o does not belong (captures
degree of separation from other clusters) - larger
is better
c. compute average silhouette coefficient for all
objects in a cluster or over all of the dataset
i. +ve: clustering is good
ii. -ve: clustering is bad
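A minimal sketch of the silhouette coefficient in its standard form s(o) = (b - a) / max(a, b), where a is the average distance to the other objects in o's own cluster and b is the minimum average distance to any other cluster (the points and labels are made up):

```python
import numpy as np

def silhouette(o, X, labels, own):
    """Silhouette of one object o: (b - a) / max(a, b), in [-1, 1]."""
    same = X[labels == own]
    others = [X[labels == c] for c in set(labels) if c != own]
    # a: average distance from o to the other objects in its own cluster (compactness)
    a = np.mean([np.linalg.norm(o - p) for p in same if not np.array_equal(p, o)])
    # b: minimum average distance from o to the objects of any other cluster (separation)
    b = min(np.mean([np.linalg.norm(o - p) for p in cluster]) for cluster in others)
    return (b - a) / max(a, b)

X = np.array([[1., 1.], [1.1, .9], [8., 8.], [8.1, 7.9]])
labels = np.array([0, 0, 1, 1])
print(silhouette(X[0], X, labels, own=0))   # close to +1: object is well clustered
```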
11
1. outlier: data object that deviates significantly from the normal
objects, different from noise, (noise: random error/variance
and should be removed before outlier detection) (applications:
credit card fraud detection, telecom fraud detection)
a. global: deviates significantly from the rest of the data set (ex.
intrusion detection); challenge: how to measure the deviation
b. contextual: deviates with respect to a context (time,
location), (ex. temperature)
c. collective: subset of objects collectively deviates significantly from
the dataset, (ex. multiple order delays, DoS attacks)
2. challenges for outlier detection: normal models are challenging
to build, distinction between normalcy and anomaly is
ambiguous, noise hides outliers, specify degree of an outlier
3. supervised: as classification problem, outliers are rare, recall
is more important than accuracy, more effective
4. unsupervised: normal objects are clustered into multiple groups,
can't detect collective outliers effectively(normal objects may not
share strong patterns, collective outliers may share high
similarity), have high false rate but still miss many real outliers
5. semi-supervised: available labels may cover outliers, normal objects, or
both; a small number of labeled outliers may not cover the
possible outliers well; labeled normal objects can improve the quality of
models of normal objects learned by unsupervised methods
6. statistical: model-based, stochastic model, data not following
model are outliers, effectiveness highly depends on whether
the assumptions of statistical model holds in real data
(parametric, non-parametric)
a. parametric: assume normal data are generated by a distribution
with parameter θ, the PDF gives the probability of being generated by the
distribution (smaller means more likely an outlier), (boxplots: for univariate
outliers, based on the quartiles and IQR), (χ²-statistic: for multivariate
outliers, parameter is the mean) (sketch at end of this section)
b. nonparametric: learn normal model from input data
(histogram)
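A minimal sketch of two univariate statistical checks (the values are made up; the 1.5 × IQR fences and the |z| > 2 cutoff are common but arbitrary conventions):

```python
import numpy as np

values = np.array([12., 13., 12.5, 14., 13.5, 12.8, 13.2, 40.])   # made-up; 40 looks anomalous

# boxplot / IQR rule for univariate outliers: flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
iqr_outliers = values[(values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)]

# parametric (Gaussian) view: a small probability under the fitted model suggests an outlier
mu, sigma = values.mean(), values.std()
z = np.abs(values - mu) / sigma
gaussian_outliers = values[z > 2]          # assumed cutoff

print(iqr_outliers, gaussian_outliers)     # both flag 40
```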
11
1. proximity-based: an object is an outlier if its proximity to its nearest
neighbors deviates significantly from the proximity of most of
the other objects to their neighbors in the same data set;
effectiveness relies heavily on the proximity measure, proximity
or distance measures can't always be obtained easily, and it's challenging
to find a group of outliers which stay close to each other
a. distance-based: for an object o, examine the number of
other objects in its r-neighborhood (r: distance threshold,
π: fraction threshold, the min fraction of objects needed in the
neighborhood) (sketch at end of this section)
b. density-based: for an object o, examine its density relative
to the density of its local neighbors
i. local outlier factor (LOF): is computed in terms of the K-
NN of an object in comparison to its neighbors
2. cluster-based: normal data belong to large and dense
clusters, outliers belong to small or sparse clusters,
clustering is expensive, doesn't scale up well for large data
sets
3. classification-based: (one-class model: describe only the
normal class), learn the decision boundary of the normal class using
classification methods such as SVM; any samples that do not
belong to the normal class (not within the decision boundary)
are declared outliers; can detect new outliers; normal objects
may belong to multiple classes, (brute-force approach:
biased, can't detect unseen anomalies)
a. semi-supervised learning: combining classification-based
and clustering-based, strength: outlier detection is fast,
bottleneck: quality heavily depends on the availability
and quality of the training set (difficult to obtain high-
quality training data)
4. contextual outliers: ex. a customer of the same age and living in the
same area as others but showing different behavior is an outlier
5. collective outliers: challenging and advanced area
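A minimal sketch of one common distance-based DB(r, π) formulation: an object is flagged when the fraction of other objects within distance r of it is below π (the points, r, and π are made up):

```python
import numpy as np

def distance_outliers(X, r, pi):
    """Flag o as an outlier when fewer than a fraction pi of the other objects lie within distance r of o."""
    flags = []
    for o in X:
        d = np.linalg.norm(X - o, axis=1)
        in_neighborhood = np.sum(d <= r) - 1          # exclude o itself
        flags.append(in_neighborhood / (len(X) - 1) < pi)
    return np.array(flags)

X = np.array([[1., 1.], [1.1, 1.], [.9, 1.1], [1., .9], [9., 9.]])   # made-up; last point is isolated
print(distance_outliers(X, r=1.0, pi=0.2))   # only the isolated point is flagged
```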
12
1. business intelligence (BI): procedural and technical
infrastructure that collects, stores, and analyzes the data
produced by a company's activities, broad term that
encompasses data mining, process analysis, performance
benchmarking, and descriptive analytics, parses all data
generated by a business and presents easy-to-digest reports,
performance measures
2. big data: refers to a huge volume of data that can be structured,
semi-structured and unstructured.
3. volume: amount of data or size
4. variety: different types of data
5. velocity: how fast data is growing
6. veracity: uncertainty of data
7. value: data worth, and how we are getting benefit
8. machine learning: can look at patterns and learn from them to
adapt behavior for future incidents
9. data mining: doesn't learn and apply knowledge on its own
without human interaction, can't automatically see the
relationship between existing data like machine learning
10. web mining: comes under data mining but limited to web
related data
a. web content mining: discover different patterns that give a
significant insight
b. web structure mining: data from hyperlinks that lead to different
pages are gathered and prepared in order to discover a pattern
c. web usage mining: user's web activity through the
application logs are monitored and data mining is applied to it
11. data warehouse: environment where essential data from
sources is stored under single schema, used for reporting and
analysis (hierarchical data, generalization, ETL, OLAP)
12. data mining vs. data analysis: data mining works on mostly structured
data, data analysis can be done on structured, semi-structured or unstructured
data; data mining doesn't require a visualization tool, data analysis is always
accompanied by visualization of the results
12
1. both text mining & natural language processing extract
information from unstructured data
2. text mining: text documents, statistical and
probabilistic model
3. natural language: used for communication; natural language
processing (NLP): techniques for processing such data
to understand underlying meaning, data could be
speech, text, image, involve ML
4. data visualization: extracting and presenting data graphically so
information is conveyed efficiently and clearly, without
distortion, and without requiring any reading or writing
5. column chart: numerical comparisons between
categories, # of columns should not be too large (chart sketch at end of this section)
6. bar chart: # of bars can be relatively large
7. line chart: change of data over a continuous time
interval
8. area chart: filling the color can better highlight the
trend information
9. scatter plot: two variables in the form of points on
rectangular coordinate
10. bubble chart: multivariate chart, variant of scatter
plot, third value
11. funnel chart: series of steps and completion rate,
how something moves through different stages,
displays values as progressively decreasing
proportions amounting to 100 percent
12. pie chart: one static number divided into categories
13. gantt chart: the timing of the mission
14. box plot: distribution of data across groups, based on
the five-number summary
15. heatmap: relationship between two measures and
provides rating information displayed using colors
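A minimal matplotlib sketch (assuming matplotlib is installed; all the plotted data are random or made up) of a few of the chart types above:

```python
import matplotlib.pyplot as plt
import numpy as np

fig, axes = plt.subplots(1, 4, figsize=(14, 3))

axes[0].bar(["A", "B", "C"], [5, 9, 3])                    # column chart: compare categories
axes[1].plot(range(10), np.cumsum(np.random.rand(10)))     # line chart: change over a continuous interval
axes[2].scatter(np.random.rand(30), np.random.rand(30))    # scatter plot: two variables as points
axes[3].boxplot([np.random.normal(0, 1, 100),
                 np.random.normal(1, 2, 100)])             # box plot: five-number summary per group

for ax, title in zip(axes, ["column", "line", "scatter", "box"]):
    ax.set_title(title)
plt.tight_layout()
plt.show()
```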
