1. Data mining involves automated analysis of large datasets to extract patterns and make predictions. Common techniques include classification, clustering, association rule mining and anomaly detection.
2. Data preprocessing is an important part of any data mining project and involves cleaning, transforming, reducing and discretizing data. This prepares the raw data for analysis.
3. Common data quality issues include incomplete, inconsistent or redundant data. Data cleaning techniques such as binning, smoothing and regression are used to resolve these issues.
1. data mining: automated analysis of massive data and extraction of interesting patterns, also known as KDD (knowledge discovery in databases)
2. generalization: information integration and data warehouse construction (cleaning, transformation, integration); data cube technology (scalable methods for computing multidimensional aggregates); OLAP
3. association/correlation analysis: frequent patterns
4. classification: construct models based on training examples (decision tree, naive Bayesian); applications: credit card fraud detection, direct marketing (basket data analysis), medical (cluster analysis)
5. cluster analysis: unsupervised learning, group data into new categories, maximizing intra-class similarity and minimizing inter-class similarity
6. trend, time-series and deviation analysis: regression and value prediction
7. mining data streams: ordered, time-varying, potentially infinite
8. data object: represents an entity (ex. customers, students)
9. attribute: data field representing a characteristic or feature of a data object (ex. name, age, salary)
10. attribute (feature) vector: set of attributes describing an object
11. nominal: symbols/names, each value represents a category or state, also referred to as categorical, may be represented as numbers (ex. hair color, marital status, customer ID), qualitative
12. binary: nominal with only 2 values, boolean, qualitative
   a. symmetric: states equally valuable (same weight), ex. gender
   b. asymmetric: states are NOT equally important, ex. medical test
13. ordinal: qualitative, values have a meaningful order but the magnitude between successive values is unknown, ex. professional rank/grade; useful for data reduction of numerical attributes
14. numeric: quantitative
   a. interval-scaled: equal-size units, NO true zero point, values cannot be expressed as multiples, ex. temperature, year
   b. ratio-scaled: has a true zero point, values can be expressed as multiples, ex. years of experience, weight, salary

2,3
1. discrete vs. continuous: discrete attributes have a finite or countably infinite set of values (ex. hair color, smoker, medical test, ID); if an attribute is not discrete it is continuous (ex. height, weight, age)
2. mean: average of values, sensitive to outliers (trimmed mean = chop off extreme values at both ends)
3. median: middle value of an ordered set, expensive to compute for a large # of observations
4. mode: value that occurs most frequently
5. variance & SD: indicate how spread out a data distribution is
6. range: difference between the largest and smallest values
7. quantiles: points at regular intervals of a distribution, dividing it into (almost) equal-size sets (most common: percentiles and quartiles such as Q1)
8. boxplots: visualization technique for the five-number summary (see the sketch after this list)
9. whiskers: terminate at min & max OR at the most extreme observations
10. scatter plot: values treated as coordinates and plotted as points in the plane
11. statistical description: applies to attributes (ex. age, weight, color, grade)
12. similarity/dissimilarity: applies to objects (ex. customers), measures proximity
13. data matrix: object-by-attribute, n-by-p, AKA feature vectors
14. dissimilarity matrix: object-by-object, n-by-n
15. normalize: give all attributes equal weight (e.g. cm vs. meters, grams vs. kilos), or map values to the interval [-1,1] / [0,1]
16. the distance of an object to itself is 0; distance is a symmetric function
17. preprocessing: process data to satisfy the requirements of the intended use
18. accuracy: affected by faulty instruments, human/computer errors
19. completeness: affected by different design phases, optional attributes
20. consistency: semantics, data types, field formats
21. believability: how much the data are trusted by users
22. interpretability: how easily the data are understood
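A minimal Python sketch (not part of the original notes) of two ideas from the list above: the five-number summary behind a boxplot, and an n-by-n dissimilarity matrix built with Euclidean distance. The function names and sample ages are illustrative only.

```python
# Sketch: five-number summary (min, Q1, median, Q3, max) and an
# object-by-object dissimilarity matrix from an n-by-p data matrix.

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using simple interpolation."""
    s = sorted(values)
    def quantile(q):
        pos = q * (len(s) - 1)
        lo, hi = int(pos), min(int(pos) + 1, len(s) - 1)
        frac = pos - lo
        return s[lo] * (1 - frac) + s[hi] * frac
    return s[0], quantile(0.25), quantile(0.5), quantile(0.75), s[-1]

def euclidean(a, b):
    """Euclidean distance between two attribute (feature) vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def dissimilarity_matrix(data_matrix):
    """n-by-p data matrix -> n-by-n symmetric matrix with zero diagonal."""
    return [[euclidean(a, b) for b in data_matrix] for a in data_matrix]

ages = [22, 25, 25, 27, 30, 31, 35, 40, 61]
print(five_number_summary(ages))                    # (22, 25, 30, 35, 61)
print(dissimilarity_matrix([[1, 2], [4, 6], [1, 3]]))
```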
23. cleaning: filling in missing values, smoothing, identifying or removing outliers, resolving inconsistencies in the data
24. integration: data from multiple sources, map semantic concepts
25. reduction: reduced representation, smaller volume, same analytical results
26. discretization: raw values are replaced by ranges or higher concept levels
27. transformation: normalization

3,4
1. incomplete: lacking attribute values, or containing only aggregate data
2. inconsistent: containing discrepancies in codes and names
3. cleaning: a missing value may not imply an error
   a. ignore the tuple: NOT effective, unless it contains several missing values
   b. mean or median: mean for normal/symmetric data, median for skewed data; use samples belonging to the same class
   c. most probable value: using regression (Bayesian inference, decision tree)
4. noise: random error or variance in a measured variable
5. data smoothing: binning, regression, outlier analysis
6. binning: smooth a sorted data value by consulting its neighborhood (see the sketch after this list)
   a. local smoothing: sorted values are partitioned into a # of buckets/bins
   b. equal-frequency bins: each bin has the same # of values
   c. equal-width bins: the interval range of values per bin is constant
7. regression: conform data values to a function
   a. linear regression: find the best line to fit two attributes so that one attribute can be used to predict the other
8. Potter's Wheel: automated, interactive data cleaning tool
9. integration: reduce and avoid redundancies and inconsistencies in the resulting data set
   a. semantic heterogeneity: entity identification problem
   b. structure of data: data dependencies and referential constraints
   c. redundancy
10. metadata: help avoid errors in schema integration & data transformation
11. an attribute is redundant if it can be "derived" from another attribute
12. redundancy can be detected by correlation analysis, measuring how strongly one attribute implies the other (chi-square for nominal; correlation coefficient and covariance for numeric)
13. correlation DOESN'T imply causality
14. tuple duplication: the use of denormalized tables (often used to improve performance by avoiding joins) is another source of data redundancy
15. data value conflict: ex. a grading system expressed in two different ways

4
1. dimensionality reduction: reduce the number of attributes (wavelet transform, PCA, attribute subset selection)
   a. parametric: a model is used to estimate the data, only the model parameters are stored ("regression")
   b. nonparametric: store a reduced representation of the data (ex. histogram, clustering, sampling)
2. compression: transformations applied to obtain a "compressed" representation of the original data (lossless, lossy)
3. attribute subset selection: find a minimum set of attributes such that the resulting probability distribution of the data is as close as possible to the original distribution using all attributes; exhaustive search can be prohibitively expensive
4. attribute construction: e.g. an area attribute based on height and width attributes
5. regression: data modeled to fit a straight line
6. regression line equation: y = wx + b (w is the slope, b is the y-intercept)
7. histogram: partitions data into disjoint subsets (buckets/bins)
   a. single attribute, one value/frequency pair per bucket: singleton buckets
8. equal-width: the width of each bucket range is uniform
9. equal-frequency (depth): the frequency of each bucket is constant
10. sampling: a large set is represented by a smaller random data sample
11. SRSWOR (simple random sample without replacement): all tuples are equally likely to be sampled
12. smoothing: binning, regression
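A minimal Python sketch of the binning schemes described in item 6 above, together with smoothing by bin means. The price values and helper names are made up for illustration.

```python
# Sketch: equal-frequency vs. equal-width binning, plus smoothing by
# bin means (each value is replaced by its bin's mean).

def equal_frequency_bins(sorted_values, n_bins):
    """Each bin gets (roughly) the same number of values."""
    size = len(sorted_values) // n_bins
    bins = [sorted_values[i * size:(i + 1) * size] for i in range(n_bins - 1)]
    bins.append(sorted_values[(n_bins - 1) * size:])   # remainder in last bin
    return bins

def equal_width_bins(sorted_values, n_bins):
    """Each bin covers a value interval of the same width."""
    lo, hi = sorted_values[0], sorted_values[-1]
    width = (hi - lo) / n_bins
    bins = [[] for _ in range(n_bins)]
    for v in sorted_values:
        idx = min(int((v - lo) / width), n_bins - 1)   # clamp the max value
        bins[idx].append(v)
    return bins

def smooth_by_bin_means(bins):
    """Smoothing: every value in a bin is replaced by the bin mean."""
    return [[sum(b) / len(b)] * len(b) if b else [] for b in bins]

prices = sorted([4, 8, 15, 21, 21, 24, 25, 28, 34])
print(equal_frequency_bins(prices, 3))   # [[4, 8, 15], [21, 21, 24], [25, 28, 34]]
print(equal_width_bins(prices, 3))
print(smooth_by_bin_means(equal_frequency_bins(prices, 3)))
```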
13. concept hierarchy: e.g. street generalized to higher-level concepts (city or country); facilitates drilling and rolling to view data at multiple granularities; can be explicitly specified by domain experts
14. to help avoid dependence on the choice of measurement units, give all attributes equal weight using min-max or z-score normalization
15. min-max normalization: linear transformation
16. z-score: attribute value normalized based on the mean and SD
17. discretization: concept hierarchies can be automatically formed for both numeric and nominal data

5,6
1. items that are frequently associated are represented as association rules
2. support and confidence are measures of rule interestingness (see the sketch after this list)
3. if an itemset satisfies min_support then it is a frequent itemset
4. if a rule satisfies min_support & min_confidence then it is strong
5. all subsets of a frequent itemset MUST also be frequent
6. divide-and-conquer: to avoid costly candidate generation, compress the database of frequent items into a frequent-pattern (FP) tree
7. NOT all association rules are interesting, so we use correlation analysis
8. pattern evaluation methods: lift, chi-square (χ²)
9. classification: data analysis task where a model is constructed to predict class labels (categories)
   a. learning (training) step: construct the classification model
   b. classification step: predict class labels for given data (test set)
10. decision tree: a flowchart-like tree structure, can be binary or otherwise; NO domain knowledge required, NO parameter setting, handles multidimensional data, fast
   a. internal node: test on an attribute
   b. branch: test outcome
   c. leaf: class label
11. pure: a partition is pure if all its tuples belong to the same class
12. when an attribute is chosen to split the training data set, it is removed from the attribute list
13. terminating conditions:
   a. all tuples in D belong to the same class
   b. there are no remaining attributes on which the tuples may be further partitioned (majority voting = convert the node into a leaf and label it with the most common class in the data partition)
   c. there are no tuples for a given branch: a leaf is created with the majority class of the data partition

6,7
1. attribute selection measure: heuristic for selecting the splitting criterion that best splits a partition into smaller, mutually exclusive classes (measures: information gain, gain ratio)
2. information gain: minimize the expected # of tests needed to classify a tuple and guarantee a simple tree is found; it is the difference between the original information requirement and the new requirement after partitioning on the attribute
3. pruning: removes the least reliable branches
4. prepruning: statistically assess the goodness of a split before it takes place; hard to choose thresholds for statistical significance
5. postpruning: remove sub-trees from already constructed trees and replace them with a leaf node labeled with the most frequent class
6. rule conflicts: a tuple firing more than one rule with different class predictions
7. size ordering: the rule with the largest antecedent (toughest requirements) has the highest priority, fires and returns its class prediction
8. rule ordering: rules prioritized a priori according to
   a. class-based ordering: classes sorted by decreasing importance (most frequent/prevalent first)
   b. rule-based ordering: measures of rule quality (accuracy, size, domain expertise)
9. fallback rule: used when no rules are triggered
10. naive Bayesian classifier: statistical classifier that predicts the probability that a tuple belongs to a specific class; high accuracy and speed (class-conditional independence: attributes' effects on class determination are independent)
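A minimal Python sketch of the support and confidence measures from the association-rule items above. The transactions and thresholds are made up; a rule is reported as strong only when it meets both min_support and min_confidence.

```python
# Sketch: support / confidence over a toy transaction database.

transactions = [
    {"milk", "bread", "butter"},
    {"bread", "butter"},
    {"milk", "bread"},
    {"milk", "coffee"},
    {"bread", "butter", "coffee"},
]

def support(itemset):
    """Fraction of transactions containing every item in the itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """confidence(A -> B) = support(A union B) / support(A)."""
    return support(set(antecedent) | set(consequent)) / support(antecedent)

min_support, min_confidence = 0.4, 0.6
rule_a, rule_b = {"bread"}, {"butter"}
sup = support(rule_a | rule_b)          # 0.6
conf = confidence(rule_a, rule_b)       # 0.75
# A rule is "strong" when it satisfies both thresholds.
print(sup, conf, sup >= min_support and conf >= min_confidence)
```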
11. k-nearest neighbor: delays classification until new test data is available; uses a similarity measure to compute the distance between the test tuple and each of the training tuples (Euclidean, Manhattan); k stands for the # of closest neighbors according to the measured distance; majority voting of their class labels determines the class of the test tuple (see the sketch after this list)
12. linear regression: model a relationship between two sets of variables to make predictions about data (Y: dependent, x: independent)

8,9
1. holdout: randomly allocate 2/3 of the data for training and the remaining 1/3 for testing
2. random subsampling: repeat holdout k times and take the average accuracy
3. k-fold cross-validation: randomly partition the dataset into k mutually exclusive folds of approximately equal size
4. stratified k-fold cross-validation: the class distribution in each fold is the same as in the initial dataset (10-fold is recommended)
5. ensemble: a set of classifiers, each with a vote for a class label and each produced from a different partition; majority voting composes an aggregate classification
6. bootstrap: a sample of the same size as the dataset, drawn with replacement
7. cluster analysis: discovers unknown groups by partitioning a set of objects into subsets/clusters; objects in a cluster are similar to each other and dissimilar to objects in other clusters; clusters are implicit (ex. business intelligence, image recognition, web search, biology, security); also used for pre-processing and outlier detection
   a. partitioning: find mutually exclusive clusters of spherical shape; distance-based; use a mean/medoid to represent the cluster center; small-to-medium data sets (k-means: divide the dataset into k mutually exclusive clusters represented by centroids; a centroid is a cluster's center point, the mean of its points; assign each point to the centroid at minimum distance); to measure cluster quality, minimize the sum of squared errors
   b. hierarchical: multiple levels, CAN'T correct erroneous merges/splits, may consider object "linkages"
      i. agglomerative: bottom-up (merging); each object starts in its own cluster and the two closest clusters are iteratively merged into one
      ii. divisive: top-down (splitting); all objects start in one big cluster, which is recursively divided into subclusters
   c. density-based: finds arbitrarily shaped clusters as dense regions separated by low-density regions; each point must have a minimum # of points within its "neighborhood"; filters out outliers
      i. DBSCAN: find core objects (objects with dense neighborhoods); neighborhood density: # of objects in the neighborhood; MinPts: density threshold; core object: object whose ε-neighborhood contains at least MinPts objects; p is density-reachable from q if q is a core object and p is in the neighborhood of q; q & m are density-connected if there exists an object o such that both q and m are density-reachable from o
   d. grid-based: multi-resolution grid data structure, fast processing

10
1. evaluation of clustering
   a. assessing clustering tendency: determines whether a given data set has a non-random structure (Hopkins statistic: a statistical test for spatial randomness)
   b. measuring clustering quality:
      i. extrinsic: compare the clustering against ground truth (supervision) to capture:
         1. cluster homogeneity: the purer the clusters, the better they represent separate class labels
         2. cluster completeness: an object with a class label belongs to the cluster representing that class label
         3. rag bag: objects that can't be merged into clusters belong to a rag bag
         4. small cluster preservation: splitting a small category is more harmful than splitting a large category
         5. precision: how many objects in the same cluster belong to the same category as the object
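A minimal Python sketch of the k-nearest-neighbor classifier from item 11 above: Euclidean distance to every training tuple, followed by majority voting over the k closest labels. The training tuples and labels are made up.

```python
# Sketch: k-NN classification with majority voting.
from collections import Counter

def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def knn_predict(train_X, train_y, query, k=3):
    """Classify `query` by majority vote of its k nearest training tuples."""
    dists = sorted(zip((euclidean(query, x) for x in train_X), train_y))
    k_labels = [label for _, label in dists[:k]]
    return Counter(k_labels).most_common(1)[0][0]

train_X = [[1.0, 1.1], [1.2, 0.9], [5.0, 5.2], [5.1, 4.8], [0.9, 1.0]]
train_y = ["low", "low", "high", "high", "low"]
print(knn_predict(train_X, train_y, [1.1, 1.0], k=3))   # expected: "low"
```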
         6. recall: how many objects of the same category are assigned to the same cluster
      ii. intrinsic: measure how well the clusters are separated
         1. silhouette coefficient (see the sketch after this list): computed from the difference between:
            a. the average distance between object o and all other objects in its own cluster (captures cluster compactness); smaller is better (more compact)
            b. the minimum average distance from o to all clusters to which o does not belong (captures the degree of separation from other clusters); larger is better
            c. compute the average silhouette coefficient for all objects in a cluster or over the whole dataset
               i. positive: the clustering is good
               ii. negative: the clustering is bad

11
1. outlier: a data object that deviates significantly from the normal objects; different from noise (noise: random error/variance, which should be removed before outlier detection); applications: credit card fraud detection, telecom fraud detection
   a. global: deviates significantly from the rest of the dataset (ex. intrusion detection); the question is how to measure deviation
   b. contextual: deviates with respect to a context such as time or location (ex. temperature)
   c. collective: a subset of objects collectively deviates significantly from the dataset (ex. multiple order delays, DoS attacks)
2. challenges for outlier detection: models of normal data are challenging to build, the distinction between normality and anomaly is ambiguous, noise hides outliers, and the degree of an outlier must be specified
3. supervised: treated as a classification problem; outliers are rare, so recall is more important than accuracy; more effective
4. unsupervised: normal objects are clustered into multiple groups; can't detect collective outliers effectively (normal objects may not share strong patterns, while collective outliers may share high similarity); has a high false-alarm rate yet still misses many real outliers
5. semi-supervised: the labels could cover outliers, normal objects or both; a small number of labeled outliers may not cover the possible outliers well; used to improve the quality of models of normal objects learned by unsupervised methods
6. statistical: model-based; assume a stochastic model, and data not following the model are outliers; effectiveness depends heavily on whether the assumptions of the statistical model hold for the real data (parametric, non-parametric)
   a. parametric: assume normal data is generated by a distribution with parameter θ; the PDF yields the probability that a point is generated by the distribution (smaller means outlier); boxplots: for univariate outliers, parameters are the mean and IQR; χ²-statistic: for multivariate outliers, the parameter is the mean
   b. nonparametric: learn the normal model from the input data (histogram)

11
1. proximity-based: an object is an outlier if its nearest neighbors are far away, i.e. its proximity deviates significantly from the proximity of most other objects in the same data set; effectiveness relies heavily on the proximity measure; proximity or distance measures can't always be obtained easily; it is challenging to find a group of outliers that stay close to each other
   a. distance-based: for an object o, examine the number of other objects in its r-neighborhood (r: distance threshold, π: fraction threshold, the minimum # of objects needed in the neighborhood)
   b. density-based: for an object o, examine its density relative to the density of its local neighbors
      i. local outlier factor (LOF): computed in terms of the k-NN of an object in comparison to its neighbors
2. cluster-based: normal data belong to large and dense clusters, outliers belong to small or sparse clusters; clustering is expensive and doesn't scale up well for large data sets
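A minimal Python sketch of the silhouette coefficient described above, combining the two listed quantities with the standard formula s(o) = (b(o) - a(o)) / max(a(o), b(o)). The toy clusters are made up.

```python
# Sketch: silhouette coefficient for a single object.
# a(o): average distance to the other objects in o's own cluster (compactness)
# b(o): minimum average distance to the objects of any other cluster (separation)

def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def silhouette(point, own_cluster, other_clusters):
    others = [p for p in own_cluster if p is not point]
    a = sum(euclidean(point, p) for p in others) / len(others)
    b = min(sum(euclidean(point, p) for p in c) / len(c)
            for c in other_clusters)
    return (b - a) / max(a, b)

cluster1 = [[1.0, 1.0], [1.2, 0.8], [0.9, 1.1]]
cluster2 = [[5.0, 5.0], [5.2, 4.9]]
# Close to +1: well clustered; negative: likely assigned to the wrong cluster.
print(silhouette(cluster1[0], cluster1, [cluster2]))
```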
3. classification-based: one-class model (describe only the normal class); learn the decision boundary of the normal class using classification methods such as SVM (see the one-class sketch after this list); any samples that do not belong to the normal class (not within the decision boundary) are declared outliers; can detect new outliers; normal objects may belong to multiple classes (brute-force approach: biased, can't detect unseen anomalies)
   a. semi-supervised learning: combining classification-based and clustering-based methods; strength: outlier detection is fast; bottleneck: quality depends heavily on the availability and quality of the training set (it is difficult to obtain high-quality training data)
4. contextual outliers: ex. a customer of the same age, living in the same area, but with different behavior is an outlier
5. collective outliers: a challenging and advanced area

12
1. business intelligence (BI): procedural and technical infrastructure that collects, stores and analyzes the data produced by a company's activities; a broad term that encompasses data mining, process analysis, performance benchmarking and descriptive analytics; parses all data generated by a business and presents easy-to-digest reports and performance measures
2. big data: refers to a huge volume of data that can be structured, semi-structured or unstructured
3. volume: the amount or size of the data
4. variety: different types of data
5. velocity: how fast the data is growing
6. veracity: the uncertainty of the data
7. value: the data's worth and how we benefit from it
8. machine learning: can look at patterns and learn from them to adapt behavior for future incidents
9. data mining: doesn't learn and apply knowledge on its own without human interaction, and can't automatically see the relationships between existing data the way machine learning can
10. web mining: falls under data mining but is limited to web-related data
   a. web content mining: discover patterns that give significant insight
   b. web structure mining: data from hyperlinks that lead to different pages are gathered and prepared in order to discover a pattern
   c. web usage mining: the user's web activity is monitored through application logs and data mining is applied to it
11. data warehouse: an environment where essential data from multiple sources is stored under a single schema, used for reporting and analysis (hierarchical data, generalization, ETL, OLAP)
12. data mining vs. data analysis: data mining works mostly on structured data, while data analysis can be done on structured, semi-structured or unstructured data; data mining doesn't involve visualization tools, while data analysis is always accompanied by visualization of results

12
1. both text mining & natural language processing extract information from unstructured data
2. text mining: text documents, statistical and probabilistic models
3. natural language: communication; natural language processing (NLP): techniques for processing such data to understand the underlying meaning; the data could be speech, text or images, and may involve ML
4. data visualization: extracting and visualizing the data without any form of reading or writing, to convey information efficiently and clearly without distortion
5. column chart: numerical comparisons between categories; the # of columns should not be too large
6. bar chart: the # of bars can be relatively large
7. line chart: change of data over a continuous time interval
8. area chart: filling with color can better highlight trend information
9. scatter plot: two variables shown as points on rectangular coordinates
10. bubble chart: multivariate chart, a variant of the scatter plot with a third value
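A minimal sketch of the one-class idea from the classification-based item above. The notes mention SVM but no specific library; this sketch assumes scikit-learn's OneClassSVM, and the data points and parameters are illustrative only.

```python
# Sketch: one-class model for outlier detection (assumes scikit-learn).
# Train only on "normal" points; anything outside the learned decision
# boundary is flagged as an outlier (-1).
from sklearn.svm import OneClassSVM

normal_data = [[0.0, 0.1], [0.2, -0.1], [-0.1, 0.0], [0.1, 0.2],
               [0.0, -0.2], [-0.2, 0.1], [0.1, 0.0], [-0.1, -0.1]]
model = OneClassSVM(kernel="rbf", gamma="scale", nu=0.1).fit(normal_data)

# +1 = within the normal-class boundary, -1 = declared an outlier.
print(model.predict([[0.05, 0.0], [4.0, 4.0]]))   # expected roughly [ 1, -1]
```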
11. funnel chart: a series of steps and their completion rates, showing how something moves through different stages; displays values as progressively decreasing proportions amounting to 100 percent
12. pie chart: one static number divided into categories
13. Gantt chart: the timing of tasks in a mission
14. box plot: the distribution of data across groups based on the five-number summary (see the sketch after this list)
15. heatmap: the relationship between two measures, with rating information displayed using colors
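A minimal sketch of two of the chart types above (box plot and heatmap), assuming matplotlib is available; the notes do not name a plotting library, and the group values and ratings are made up.

```python
# Sketch: box plot (five-number summary per group) and heatmap (value -> color).
import matplotlib.pyplot as plt

groups = {"A": [2, 4, 5, 7, 8, 12], "B": [1, 3, 3, 4, 6, 20]}
ratings = [[0.2, 0.8, 0.5],
           [0.9, 0.1, 0.4]]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.boxplot(list(groups.values()))        # one box per group: A, B
ax1.set_title("Box plot (five-number summary)")
im = ax2.imshow(ratings, cmap="viridis")  # heatmap: higher value, brighter color
ax2.set_title("Heatmap")
fig.colorbar(im, ax=ax2)
plt.show()
```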