718h0408 Phạm Ngọc Uyên Vy Week4-2 s2122
Data mining starts with collecting data on customers or users. It is applied in personalized coupons,
in banks and financial institutions, in marketing and in large consulting firms. Some controversial uses of
data mining by businesses are by credit card companies, American Express and Target, which use customers’
data to infer personal information about customers. How is data mining used to take advantage of
those data? It depends on the personal profile. How does the software know what to offer? It uses data
mining in the background to look at your behaviors and compare them to others'. How is that
happening? By collecting data and comparing them to others' data to give a personalized offer. Pandora,
Songkick, Facebook, Google Ads, Netflix and other entertainment services also try to
personalize. Governments and other types of organizations have also applied data mining. For example,
it was applied in Obama's election campaign, by the IRS and by English police. Moving to digital, some examples
are Xtract and Foursquare. Other examples, such as online learning and research on electronic patient
records, have nothing to do with money, yet they contribute a lot to society.
In supervised learning, an outcome is predicted from a set of inputs, so performance means how well the
model predicts. There are 2 stages in using this idea of data partitioning and the holdout set. The
partitioning happens early on, when you have just obtained the data. Take a subset of the data, lock it up
in a drawer and do not open it until all the modeling is done. Once you have a final model, open that
secret drawer and apply the model to
these data that the model has never seen. What if the future looks different? Build model(s) - training
data → Evaluate model(s) - validation data → Reevaluate model(s) (optional) - test data →
Predict/classify using final model - new data. How will the model work? There's a blog post about one
reason that data mining projects fail: poor cross-validation. Evaluating performance doesn't
require a large data set, because there are also techniques for doing the same with smaller
samples. The holdout set is a critical part of performance evaluation.
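The partitioning step can be sketched in a few lines of Python. This is only a minimal illustration: the function name `partition`, the 50/30/20 proportions and the fixed seed are assumptions made for the example, not values given in the lecture.

```python
import random

def partition(records, train_frac=0.5, valid_frac=0.3, seed=1):
    """Shuffle the records and split them into training, validation and test sets."""
    if not 0 < train_frac + valid_frac < 1:
        raise ValueError("train_frac + valid_frac must leave room for a test set")
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)      # fixed seed so the split is repeatable
    n_train = round(len(shuffled) * train_frac)
    n_valid = round(len(shuffled) * valid_frac)
    train = shuffled[:n_train]                 # used to build the model(s)
    valid = shuffled[n_train:n_train + n_valid]  # used to evaluate and compare models
    test = shuffled[n_train + n_valid:]        # the "locked drawer", opened only at the end
    return train, valid, test

train, valid, test = partition(range(100))
print(len(train), len(valid), len(test))
```

The split is done once, before any modeling, so the test set really is data the model has never seen.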
Data visualization is a very important tool and a major component in business analytics. Spotfire,
Tableau, JMP, QlikView, XLMiner and MicroStrategy are easy to manipulate, interactive, intuitive and fast.
One of the biggest challenges for an auto dealership purchasing a used car at an auto auction is the risk
that the vehicle might have serious issues that prevent it from being sold to customers. The auto
community calls these unfortunate purchases “kicks”. Basic charts (bar chart, histogram, boxplot,
line chart and scatter plot) help discover patterns and exceptions. BI is the reporting and the visualization step;
BA is the data mining part. We're going to look at Spotfire, but you're free to explore others as well.
The data preparation stage: Collecting data: think very carefully about data origin. Data may come from
one place or from multiple sources, including data that are somehow outside, purchased through a
third party, publicly available, or taken from social media. Merging data sources needs a key, like some kind
of ID, that makes it possible to merge these records. Duplications and corruption are very common. In Spotfire, it's
easy to merge data from multiple files if you have this key. Tableau is good at supporting this type of
combination. Domain knowledge is integrated in every step of the analysis, including the data cleaning and
preparation stage. First, understand the meaning of each variable, like "reading the label"; second, check data
formatting; third, check the range of variables, for example by generating a summary statistics table or charts; fourth,
look for duplications; finally, look for extreme numbers and outliers. Preparing data: Choice of variables → Choice of scales
(continuous/categorical). Binning and “unbinning” → Missing values (assess extent and type of missingness;
drop observations? Drop variables? Replace with dummy?) → Imputation (mean, regression, more
advanced methods) → Explanatory vs predictive → Creating derived variables.
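A small sketch of three of these preparation steps (removing duplicates, merging on a key, and mean imputation) is given here in plain Python. The two toy sources, the field names `id`, `age` and `spend`, and the function `prepare` are all hypothetical and only serve the illustration.

```python
from statistics import mean

# Hypothetical toy data: two sources that share a customer "id" key.
profiles = [
    {"id": 1, "age": 34},
    {"id": 2, "age": None},   # missing value
    {"id": 3, "age": 52},
    {"id": 3, "age": 52},     # duplicated record
]
purchases = {1: 120.0, 2: 80.0, 3: 45.5}   # id -> total spend

def prepare(profiles, purchases):
    # 1. Remove duplicated ids, keeping the first occurrence.
    seen, unique = set(), []
    for row in profiles:
        if row["id"] not in seen:
            seen.add(row["id"])
            unique.append(dict(row))
    # 2. Merge the second source using the shared key.
    for row in unique:
        row["spend"] = purchases.get(row["id"])
    # 3. Mean imputation: replace missing ages with the mean of the known ages.
    fill = mean(r["age"] for r in unique if r["age"] is not None)
    for row in unique:
        if row["age"] is None:
            row["age"] = fill
    return unique

clean = prepare(profiles, purchases)
for row in clean:
    print(row)
```

Mean imputation is the simplest of the options listed; whether dropping records or a regression-based imputation is better depends on the extent and type of missingness.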
Cluster analysis (“data segmentation”) is an exploratory method for identifying homogeneous groups
(“clusters”) of records. Similar records should belong to the same cluster. Dissimilar records should
belong to different clusters. Examples are “Exploring Stars” & “Fitting the Troops”. Market Segmentation:
segment customers based on demographic info and transaction history in order to tailor marketing
strategy for each segment. Investment: Cluster securities based on financial performance info (return,
volatility, beta) and other info (industry and market capitalization) to create a balanced portfolio. Industry
Analysis: for a given industry, cluster firms based on {growth rate, profitability, market size, product range,
presence in various international markets}, to understand industry structure (determine competitors).
Example: UG Business Programs in the US. Simple Clustering: 1-2 variables: visual inspection of data
(histogram, scatterplot). Clustering with >2 variables: 2 approaches: Distance: compute multivariate
distance between records, and group “close” records. Homogeneity: group records to increase
within-group homogeneity. 2 types of clustering algorithms: Hierarchical methods - agglomerative: begin
with n clusters; sequentially merge similar clusters until 1 cluster is left, useful when goal is to arrange the
clusters into a natural hierarchy, requires specifying distance measure. Non-hierarchical methods:
pre-specified number of clusters, assign records to each of the clusters, requires specifying # clusters,
computationally cheap.
Hierarchical clustering: Start with individual records and then group them into clusters. A
general algorithm starts with n clusters (n = sample size, the number of records; each record is its own cluster). In
step 1, the 2 closest records are merged into 1 cluster. At every step, the pair of clusters with the smallest
distance is merged (either a single record is added to an existing cluster, or 2 existing clusters are combined).
This requires a definition of distance. Pairwise distance between records: Single measurement case: each
record has 1 value. Multiple measurement case: each record has multiple values.
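The agglomerative procedure can be sketched as follows. The notes do not fix a distance measure, so this example assumes Euclidean distance between records and single linkage (distance between two clusters = smallest pairwise record distance); the toy records and function names are hypothetical.

```python
import math

def euclidean(a, b):
    """Pairwise distance between two records with multiple measurements."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def single_linkage(c1, c2, records):
    """Assumed cluster distance: the smallest distance between any two member records."""
    return min(euclidean(records[i], records[j]) for i in c1 for j in c2)

def agglomerate(records):
    """Merge the two closest clusters until 1 cluster is left; return the merge history."""
    clusters = [[i] for i in range(len(records))]   # start with n clusters, one per record
    history = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = single_linkage(clusters[a], clusters[b], records)
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        history.append((clusters[a], clusters[b], d))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return history

records = [(1.0, 1.0), (1.5, 1.0), (5.0, 5.0), (5.0, 6.0)]
for left, right, dist in agglomerate(records):
    print(left, right, round(dist, 3))
```

The merge history (which clusters were joined, and at what distance) is exactly the information a dendrogram draws.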
The dendrogram is a tree-like diagram that summarizes the clustering process. The idea is to have
all the records marked on the x-axis, while the y-axis shows a similarity or distance measure. It's
also a very useful tool for showing other people the results of the analysis. In the example, some universities got
clustered early on, while a single university tagged along and helped join 1 big cluster. Northwestern and
University of Pennsylvania got clustered first. Using a different type of cluster analysis, the resulting tree
looks quite different from the one that came before. Example: in the US university system, the measures
do quite well at distinguishing private from public universities. Spotfire shows a heat map for each one
of the records. Clustering is an exploratory technique that allows us to look for natural groupings of records.
Look at the shapes to see which of these clusters are different from each other.
Example: personal loan offer. The KNN algorithm computes the distance between the to-be-classified record and
each record in the training set; finds the k shortest distances; and counts the votes of the k neighbors.
This is repeated for every record in the validation set. K-NN in XLMiner. Choosing k: under-smoothing vs.
over-smoothing; typically k<20; use the validation set to find the “best” k. Choosing a cutoff value. KNN: piecing
it together. Advantages of KNN: very flexible, data-driven; simple; good performance on large datasets;
useful for prediction. Disadvantages of KNN: no insight about the importance of each predictor; danger of
overfitting, need an extra test set; can be computationally intensive for large k; needs lots of data.
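The KNN steps above (distances, k nearest neighbors, votes, cutoff, and choosing k on the validation set) can be sketched like this. The toy data, the 0/1 labels and the function names are assumptions for the example; in practice the predictors would usually be put on a common scale first.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_propensity(train, new_record, k):
    """Share of the k nearest training records that belong to class 1."""
    nearest = sorted(train, key=lambda row: euclidean(row[0], new_record))[:k]
    return sum(label for _, label in nearest) / k

def knn_classify(train, new_record, k=3, cutoff=0.5):
    """Classify as 1 when the share of class-1 votes reaches the cutoff."""
    return 1 if knn_propensity(train, new_record, k) >= cutoff else 0

def best_k(train, valid, candidates):
    """Pick the candidate k with the lowest error rate on the validation set."""
    def error(k):
        return sum(knn_classify(train, x, k) != y for x, y in valid) / len(valid)
    return min(candidates, key=error)

# Hypothetical records: (predictor values, label), label 1 = accepts the offer.
train = [((1, 1), 0), ((1, 2), 0), ((2, 1), 0), ((6, 6), 1), ((6, 7), 1), ((7, 6), 1)]
valid = [((1.5, 1.5), 0), ((6.5, 6.5), 1)]

print(knn_classify(train, (2, 2), k=3))
print(knn_classify(train, (6, 5), k=3))
print(best_k(train, valid, [1, 3, 5]))
```

Because k is tuned on the validation set, an extra test set is needed for an honest final estimate of performance.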
Two common terms are ‘market basket analysis’ and ‘affinity analysis’ trying to study “what goes with
what”. Origin: study of customer transaction databases to determine dependencies between purchases
of different items. POS Transaction Data has a very large number of transaction records. Data is
collected by using bar-code scanners. Each record lists all items purchased by a customer on a single
purchase transaction. Example of beer and diapers: what would you do with this finding? Would it affect
the layout of your store? Would it change your inventory management and marketing? Would you make
certain offers based on it? Other uses are event-based databases. Online Recommender Systems on
flipkart.com: collaborative filtering. Association Rules (HyperCITY, CROSSWORD, ...) vs
Recommender Systems (flipkart.com, amazon.com, NETFLIX,...). Example: cellphone faceplates: store
managers would like to know what colors of faceplates customers are likely to purchase together.
Basic idea: examine all possible rules between items in “if-then” format and select only rules most likely to
indicate true dependence. Many rules are possible. Terminology: “IF” part = antecedent and “THEN”
part = consequent. Problem: computation time grows exponentially as # items increases. Rule Generation:
Problem: Generating all possible rules is exponential in the number of distinct items. Solution: Frequent
item sets: Consider only combinations that occur with higher frequency in the database. Criterion for
“frequent”: Support of an item set: % (or number) of transactions that contain it; for a rule, the
transactions in which both antecedent (IF) and consequent
(THEN) appear in the data. Generating frequent itemsets: The Apriori Algorithm: for k products…: 1.
Set minimum support criterion; 2. Generate list of one-item sets that meet the support criterion; 3. Use list
of one-item sets to generate list of two-item sets that meet support criterion; 4. Use list of two-item sets to
generate list of three-item sets that meet support criterion; 5. Continue up through k-item sets.
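The Apriori steps can be sketched as below. This is a simplified illustration, not an optimized implementation: support is taken as a raw transaction count, and the faceplate-style transactions are hypothetical toy data.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {item set: support count} for every item set meeting the support criterion."""
    transactions = [frozenset(t) for t in transactions]

    def count(candidates):
        counts = {c: sum(c <= t for t in transactions) for c in candidates}
        return {c: n for c, n in counts.items() if n >= min_support}

    items = {item for t in transactions for item in t}
    frequent = count({frozenset([i]) for i in items})   # frequent one-item sets
    result = dict(frequent)
    k = 2
    while frequent:
        prev = list(frequent)
        # Build k-item candidates from pairs of frequent (k-1)-item sets.
        candidates = {a | b for a, b in combinations(prev, 2) if len(a | b) == k}
        # Keep a candidate only if every (k-1)-item subset is itself frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent for s in combinations(c, k - 1))}
        frequent = count(candidates)
        result.update(frequent)
        k += 1
    return result

transactions = [
    {"red", "white", "green"}, {"white", "orange"}, {"white", "blue"},
    {"red", "white", "orange"}, {"red", "blue"}, {"white", "blue"},
    {"red", "white", "blue", "green"}, {"red", "white", "blue"},
]
frequent_sets = apriori(transactions, min_support=2)
for itemset, n in sorted(frequent_sets.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), n)
```

Item sets that fail the support criterion are never extended, which is what keeps the number of candidates manageable.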
Performance measure #1: Confidence: % of antecedent (IF) transactions that also have the
consequent (THEN) item set. Confidence = (#transactions with both antecedent & consequent item sets) /
(#transactions with antecedent item set). Example: detecting a flu outbreak. The weakness of confidence: if
antecedent and/or consequent have high support -> high confidence, even when they are independent.
Performance measure #2: Lift ratio = confidence / (benchmark confidence).
Benchmark assumes independence between antecedent and consequent: P(antecedent &
consequent)=P(antecedent) x P(consequent). Benchmark confidence: P(C|A) = P(C) = …
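Confidence and lift for a single rule can be computed directly from transaction counts. The sketch below uses hypothetical toy transactions and hypothetical function names; the benchmark confidence is taken as the share of all transactions that contain the consequent, i.e. P(C).

```python
def support_count(itemset, transactions):
    """Number of transactions that contain every item in the item set."""
    return sum(set(itemset) <= set(t) for t in transactions)

def rule_metrics(antecedent, consequent, transactions):
    """Return (confidence, lift ratio) for the rule IF antecedent THEN consequent."""
    both = support_count(set(antecedent) | set(consequent), transactions)
    confidence = both / support_count(antecedent, transactions)
    benchmark = support_count(consequent, transactions) / len(transactions)  # P(C)
    return confidence, confidence / benchmark

transactions = [
    {"red", "white", "green"}, {"white", "orange"}, {"white", "blue"},
    {"red", "white", "orange"}, {"red", "blue"}, {"white", "blue"},
    {"red", "white", "blue", "green"}, {"red", "white", "blue"},
]
confidence, lift = rule_metrics({"red", "white"}, {"green"}, transactions)
print(confidence, lift)
```

A lift ratio above 1 suggests the antecedent and consequent appear together more often than independence would predict. The sketch assumes the antecedent occurs at least once, otherwise the confidence is undefined.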
Regression models for predicting Y from predictors: Linear regression; Added flexibility by
transforming Y and/or X’s; Logistic regression. The idea: capture a complex relationship between output
and inputs by creating layers of derived variables. Neural Net Architecture: multi-layer feed-forward, fully
connected. “Node” = derived variable. Example: consumer acceptance of cheese; NN with no hidden
layers. S-shaped (sigmoidal) activation functions. Training the network (weight estimation).
Backpropagation is the most popular error minimization algorithm in NN software. It has 2 options (case
updating - XLMiner & batch updating). Stop when weights change very little from 1 iteration to the next,
the misclassification rate reaches a required threshold, or a limit on runs is reached. Danger: overfitting.
A neural net can have an advantage over regression-type methods when the relationship between output and inputs is complex.
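To make the "layers of derived variables" idea concrete, here is a minimal sketch of the forward pass through a fully connected feed-forward network with sigmoid nodes. The weights, biases and input values are made-up numbers for illustration only.

```python
import math

def sigmoid(z):
    """S-shaped activation function."""
    return 1.0 / (1.0 + math.exp(-z))

def layer(inputs, weights, biases):
    """One fully connected layer: each node is a sigmoid of a weighted sum of all inputs."""
    return [sigmoid(b + sum(w * x for w, x in zip(node_weights, inputs)))
            for node_weights, b in zip(weights, biases)]

def forward(inputs, network):
    """Feed the inputs through each layer in turn and return the output node values."""
    for weights, biases in network:
        inputs = layer(inputs, weights, biases)
    return inputs

# Hypothetical network: 2 inputs -> 2 hidden nodes -> 1 output node.
network = [
    ([[0.05, 0.01], [-0.01, 0.03]], [-0.3, 0.2]),   # hidden layer
    ([[0.01, 0.05]], [-0.015]),                     # output layer
]
print(forward([0.2, 0.9], network))
```

With no hidden layer and a single sigmoid output node, the same computation reduces to the form of a logistic regression.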
Required user input: #1: Choose predictors: NN highly dependent on quality of predictors → #2:
Pre-process data: prediction (transform, scale), classification (create 1 dummy variable) → #3: Specify
network architecture: number of hidden layers, nodes in hidden layers and output nodes. → #4:
Specify algorithm parameters: “Learning Rate” (I) - low values “down-weight” new information from
errors at each iteration; This slows learning, but reduces tendency to overfit to local structure.
“Momentum” - high values keep parameters changing in same direction as previous iteration; helps avoid
overfitting to local structure, but also slows learning. → #5: Determine cutoff value (for classification):
cutoff on probability to obtain binary classification. Advantages of Neural Networks: Capture highly
complex relationships; High tolerance to noisy data. Disadvantages of Neural Networks: Needs lots of
training data; ‘Black-box’; No variable selection; Danger of overfitting; Extrapolation is a problem; Weights
might converge to local optimum; Computational complexity (run time).
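The roles of the learning rate, case updating, the stopping rules and the cutoff can be illustrated on the simplest possible network: no hidden layer and one sigmoid output node. This sketch uses the standard delta rule (gradient descent on squared error), which is an assumption and not necessarily what XLMiner does internally; the data and parameter values are hypothetical, and momentum is left out.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_case_updating(data, learning_rate=0.5, max_runs=1000, tol=1e-6):
    """data: list of (inputs, target 0/1). Returns (weights, bias)."""
    weights, bias = [0.0] * len(data[0][0]), 0.0
    for _ in range(max_runs):                        # stop rule: limit on runs
        largest_change = 0.0
        for inputs, target in data:                  # case updating: update after each record
            out = sigmoid(bias + sum(w * x for w, x in zip(weights, inputs)))
            err = out * (1 - out) * (target - out)   # error term of a sigmoid output node
            for j, x in enumerate(inputs):
                delta = learning_rate * err * x      # low learning rate = small steps
                weights[j] += delta
                largest_change = max(largest_change, abs(delta))
            bias += learning_rate * err
            largest_change = max(largest_change, abs(learning_rate * err))
        if largest_change < tol:                     # stop rule: weights change very little
            break
    return weights, bias

def classify(inputs, weights, bias, cutoff=0.5):
    """Apply the cutoff to the output probability to obtain a binary class."""
    p = sigmoid(bias + sum(w * x for w, x in zip(weights, inputs)))
    return 1 if p >= cutoff else 0

data = [([0.0], 0), ([0.2], 0), ([0.8], 1), ([1.0], 1)]   # one predictor scaled to [0, 1]
weights, bias = train_case_updating(data)
print(classify([0.1], weights, bias), classify([0.9], weights, bias))
```

Batch updating would instead accumulate the changes over all records and apply them once per run.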