
BADM 1.1: Data Mining Applications

Data mining starts with collecting data on customers or users. It is applied in personalized coupons, in banks and financial institutions, in marketing, and in large consulting firms. Some controversial business uses of data mining involve credit card companies such as American Express, and retailers such as Target, using transaction data to infer personal information about customers. How does a business use data mining to take advantage of those data? By building a personal profile. How does the software know what to offer? It runs data mining in the background, looking at your behaviors and comparing them to those of other users. How does that happen? By collecting data and comparing them to other customers' data to generate a personalized offer. Examples include Pandora, Songkick, Facebook, Google Ads, and Netflix; other entertainment services also try to personalize. Governments and other types of organizations have applied data mining as well, for example in the Obama election campaign, at the IRS, and by the English police. On the digital side, examples include Xtract and Foursquare. Other applications, such as online learning and research on electronic patient records, have nothing to do with money, yet they contribute a lot to society.

BADM 1.2: Data Mining in a Nutshell

Statistics: macro-decisioning; explain/describe population relationships; small sample, few variables; find a well-fitting statistical model; confidence intervals, hypothesis tests, p-values. Data mining: micro-decisioning; predict values of new records; large sample, many variables; models/algorithms with high predictive power; predictive-power metrics and costs. Types of methods: in "supervised learning" we have the inputs (the worms) and the output (the egg), and the idea is that, given a set of inputs, we want to predict a certain output. Outputs are typically one of two types: prediction (numerical Y) or classification (categorical Y). Unsupervised learning involves a big set of measurements on numerous people. The goal is to reduce the dimension in terms of the observations or records, so instead of talking about 1 million customers, we segment them into a small number of segments: "what goes with what?". The data mining process: Define purpose → Obtain data → Explore & clean data → Determine DM task → Choose DM methods → Apply methods, select final model → Evaluate performance → Deploy.

BADM 1.3: The Holdout Set

In supervised learning, performance is evaluated on how well the model predicts an output from a set of inputs. There are two stages to using the idea of data partitioning and the holdout set. The partitioning happens early on, as soon as you have the data: take a subset of the data, lock it up in a drawer, and do not open it until all the modeling is done. Only when you have completely finished and have a final model do you open that secret drawer and apply the model to data it has never seen, which answers the question "what if the future looks different?". Build model(s) - training data → Evaluate model(s) - validation data → Re-evaluate model(s) (optional) - test data → Predict/classify using final model - new data. How will the model work? One blog post cites poor cross-validation as a reason that data mining projects fail. Evaluating performance does not require a large data set, because there are also techniques for doing the same with smaller samples. The holdout set is a critical part of performance evaluation.
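A minimal sketch of this partitioning idea in Python with scikit-learn (the data frame, its column names, and the 60/20/20 split proportions are assumptions for illustration, not prescribed by the course):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical data frame with two predictors and an outcome column 'y'
df = pd.DataFrame({"x1": range(100), "x2": range(100, 200), "y": [0, 1] * 50})

# First split off the test set ("lock it in a drawer") ...
train_val, test = train_test_split(df, test_size=0.2, random_state=1)
# ... then split the remainder into training and validation data
train, val = train_test_split(train_val, test_size=0.25, random_state=1)

print(len(train), len(val), len(test))  # 60 / 20 / 20 records
```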

BADM 2.1: Data Visualization

Data visualization is a very important tool and a major component in business analytics. Spotfire, Tableau, JMP, QlikView, XLMiner and MicroStrategy are easy to manipulate, interactive, intuitive and fast. One of the biggest challenges for an auto dealership purchasing a used car at an auto auction is the risk that the vehicle might have serious issues that prevent it from being sold to customers; the auto community calls these unfortunate purchases "kicks". Basic charts (bar chart, histogram, boxplot, line chart and scatter plot) help discover patterns and exceptions. We're going to look at Spotfire, but you're free to explore others as well, using an example data set from the Kaggle competition website; there's a link to this data set on the course's website. This part is called business intelligence, and there is a distinction between BI (business intelligence) and business analytics: BI is the reporting and visualization step, BA is the data mining part. Interaction: change variables, compare, sort, aggregate, add variables, re-scale, zoom, pan, filter, re-visualize, access details on demand, annotate. Specialized charts are for special data structures. Visualization for a data mining task: supervised learning focuses on the relationship between output and inputs (numerical vs. categorical output); unsupervised learning focuses on relationships between all variables.
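A minimal sketch of the basic chart types in Python with matplotlib (the "kicks" variables below are simulated for illustration and are not taken from the Kaggle data set):

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical used-car auction data: price, odometer reading, and a 0/1 "kick" flag
rng = np.random.default_rng(0)
price = rng.normal(7000, 1500, 300)
odometer = rng.normal(75000, 15000, 300)
is_kick = rng.integers(0, 2, 300)

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
axes[0].hist(price, bins=20)                                 # histogram: distribution of price
axes[0].set_title("price")
axes[1].boxplot([price[is_kick == 0], price[is_kick == 1]])  # boxplot: price for good cars vs. kicks
axes[1].set_title("price by outcome")
axes[2].scatter(odometer, price, c=is_kick, s=10)            # scatter: two numerical variables
axes[2].set_title("odometer vs. price")
plt.tight_layout()
plt.show()
```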

BADM 2.2: Data Preparation

The data preparation stage. Collecting data: think very carefully about data origin; data may come from one place or from multiple sources, including data that are external, purchased through a third party, publicly available, or drawn from social media. Merging data sources requires a key, such as some kind of ID, that enables the records to be matched. Duplication and corruption are very common. In Spotfire it is easy to merge data from multiple files if you have this key, and Tableau is also good at supporting this type of combination. Domain knowledge is integrated into every step of the analysis, including the data cleaning and preparation stage. First, understand the meaning of each variable ("read the label"); second, check data formatting; third, check the range of each variable, e.g. by generating a summary statistics table or charts; fourth, look for duplications; finally, look for extreme numbers and outliers. Preparing data: Choice of variables → Choice of scales (continuous/categorical), binning and "unbinning" → Missing values (assess extent and type of missingness; drop observations? drop variables? replace with a dummy?) → Imputation (mean, regression, more advanced methods) → Explanatory vs. predictive → Creating derived variables.
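A minimal sketch of a few of these steps in Python with pandas (the column names, bin edges, and toy values are assumptions chosen only to illustrate imputation, binning, dummy variables, and a derived variable):

```python
import numpy as np
import pandas as pd

# Hypothetical customer data with a missing value and mixed scales
df = pd.DataFrame({
    "cust_id": [1, 2, 3, 4],
    "income": [52000, np.nan, 87000, 61000],
    "age": [23, 35, 51, 44],
    "region": ["N", "S", "S", "W"],
})

print(df.describe(include="all"))                              # range check / summary statistics
df["income"] = df["income"].fillna(df["income"].mean())        # mean imputation of missing values
df["age_bin"] = pd.cut(df["age"], bins=[0, 30, 50, 100],       # binning a continuous variable
                       labels=["young", "middle", "older"])
df = pd.get_dummies(df, columns=["region"])                    # categorical -> dummy variables
df["income_per_year_of_age"] = df["income"] / df["age"]        # derived variable
df = df.drop_duplicates(subset="cust_id")                      # remove duplicate records by key
```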

BADM 5.1 Clustering Examples

Cluster analysis ("data segmentation") is an exploratory method for identifying homogeneous groups ("clusters") of records: similar records should belong to the same cluster, and dissimilar records should belong to different clusters. Examples are "Exploring Stars" and "Fitting the Troops". Market segmentation: segment customers based on demographic info and transaction history in order to tailor the marketing strategy for each segment. Investment: cluster securities based on financial performance info (return, volatility, beta) and other info (industry and market capitalization) to create a balanced portfolio. Industry analysis: for a given industry, cluster firms based on {growth rate, profitability, market size, product range, presence in various international markets} to understand industry structure (determine competitors). Example: UG business programs in the US. Simple clustering (1-2 variables): visual inspection of the data (histogram, scatterplot). Clustering with >2 variables: 2 approaches. Distance: compute a multivariate distance between records and group "close" records. Homogeneity: group records to increase within-group homogeneity. 2 types of clustering algorithms: Hierarchical methods (agglomerative): begin with n clusters; sequentially merge similar clusters until 1 cluster is left; useful when the goal is to arrange the clusters into a natural hierarchy; requires specifying a distance measure. Non-hierarchical methods: a pre-specified number of clusters; assign records to each of the clusters; requires specifying the number of clusters; computationally cheap.

BADM 5.2 Hierarchical Clustering Part 1

Hierarchical clustering: start with individual records and then group them into clusters. The general algorithm starts with n clusters (n = sample size, the number of records; each record is its own cluster). In step 1, the 2 closest records are merged into 1 cluster. At every subsequent step, the pair of clusters with the smallest distance is merged (either a single record is added to an existing cluster, or 2 existing clusters are combined). This requires a definition of distance. Pairwise distance between records: single measurement case: each record has 1 value. Multiple measurement case: each record has multiple values.

● dij = distance between records i and j
● Distance requirements: non-negativity (dij ≥ 0); dii = 0 (the distance from a record to itself is zero); symmetry (dij = dji); triangle inequality (dij + djk ≥ dik) ⇒ the distance between any pair cannot exceed the sum of the distances between the other 2 pairs.
● Notation: xi = (xi1, xi2, ..., xip), xj = (xj1, xj2, ..., xjp).
Next, Euclidean distance is used to compute a distance when there are multiple measurements: take the pairwise differences, square them, sum them up, and finally take a square root of the total just to bring the squares back to the original scale: dij = √[(xi1 - xj1)² + (xi2 - xj2)² + ... + (xip - xjp)²]. Standardize if there are multiple variables (p > 1): Euclidean distance is influenced by the units of the different measurements. Solution: standardize (= normalize) each variable before measuring distances. There are lots of other distance metrics: statistical (Mahalanobis) distance uses the correlation matrix. 'Manhattan distance' or 'city block distance': dij = |xi1 - xj1| + |xi2 - xj2| + ... + |xip - xjp|. Distances for binary data result from dummy variables; similarity-based metrics are based on a 2x2 table of counts. For >2 categories, distance = 0 only if both items have the same category, otherwise = 1. Distances for mixed (numerical + categorical) data are simple: standardize the numerical variables to [0,1], then use Euclidean distance for all. Gower's general dissimilarity coefficient: dij = Σk wijk dijk / Σk wijk, where dijk = the distance provided by the kth variable and wijk = usually 1 or 0 depending on whether the comparison is valid for the kth variable. Distances between clusters: 'single linkage' ('nearest neighbor') = minimum distance between members of the 2 clusters; 'complete linkage' ('farthest neighbor') = greatest distance between members of the 2 clusters; 'average linkage' = average of all distances between members of the 2 clusters; 'centroid linkage' = distance between their centroids (centers). After each merge, the distance matrix is re-computed; repeat until a single cluster is formed.
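A minimal sketch in Python of computing pairwise distances with and without standardization, using numpy and scipy (the small data matrix with deliberately mismatched units is an illustration, not course data):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Hypothetical records: 4 universities measured on 3 variables with very different units
X = np.array([[25000.0, 0.62, 12],
              [ 8000.0, 0.35, 25],
              [30000.0, 0.71,  9],
              [12000.0, 0.40, 20]])

# Raw Euclidean distance: dominated by the first (large-unit) variable
d_raw = squareform(pdist(X, metric="euclidean"))

# Standardize each column (z-scores), then recompute distances
Z = (X - X.mean(axis=0)) / X.std(axis=0)
d_std = squareform(pdist(Z, metric="euclidean"))
d_manhattan = squareform(pdist(Z, metric="cityblock"))   # Manhattan / city block distance

print(np.round(d_raw, 1), np.round(d_std, 2), np.round(d_manhattan, 2), sep="\n\n")
```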

BADM 5.3 Hierarchical Clustering Part 2

The dendrogram is a tree-like diagram that summarizes the clustering process. The idea is to have all the records marked on the x-axis, with the y-axis showing a similarity or distance measure. It is also a very useful tool for showing other people the results of an analysis. In the example, some universities got clustered early on, while a single university tagged along later and joined one big cluster; Northwestern and the University of Pennsylvania got clustered first. Using a different type of cluster analysis, the resulting tree looks quite different from the one that came before. Example: in the US university system, the measures do quite well at separating private from public universities. Spotfire shows a heat map for each one of the records. Clustering is an exploratory technique that lets you look for natural groupings of records, looking at the shapes to see which of these clusters are different from each other.
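A minimal sketch of agglomerative clustering and a dendrogram in Python using scipy (the random data matrix, the record labels, and the choice of average linkage are assumptions for illustration):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

# Hypothetical standardized measurements for 6 universities
rng = np.random.default_rng(1)
X = rng.normal(size=(6, 4))
labels = ["U1", "U2", "U3", "U4", "U5", "U6"]

# Agglomerative clustering with average linkage on Euclidean distances
merge_tree = linkage(X, method="average", metric="euclidean")

dendrogram(merge_tree, labels=labels)   # records on the x-axis, distance on the y-axis
plt.ylabel("distance")
plt.show()
```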

BADM 5.4 K-Means Clustering

Non-hierarchical clustering: K-means clustering: a pre-determined number (K) of non-overlapping clusters. Clusters should be homogeneous yet dissimilar to the other clusters, so we need measures of within-cluster similarity (homogeneity) and between-cluster similarity. There is no hierarchy; the end product is the final cluster memberships (no dendrogram). Iterative procedure: start from K initial clusters → each record is reassigned to the cluster with the "closest" centroid → stop when further reassignments make clusters less homogeneous. The algorithm minimizes within-cluster variance (heterogeneity). K-means algorithm: 1. For a user-specified value of K, partition the dataset into K initial clusters; 2. For each record, assign it to the cluster with the nearest centroid; 3. Recalculate centroids for the "losing" and "receiving" clusters. This can be done after the reassignment of each record, or after 1 complete pass through all records (cheaper); 4. Repeat steps 2-3 until no more reassignments occur. Initial partition into K clusters: initial partitions can be obtained from user-specified initial partitions, user-specified initial centroids (info from an external variable), or random partitions. Stability: run the algorithm with different initial partitions. Evaluating the usefulness of a clustering: What characterizes each cluster? Can you give a "name" to each cluster? Does this give us any insight? Selecting K: re-run the algorithm for different values of K. Tradeoff: simplicity (interpretation) vs. adequacy (within-cluster homogeneity). Elbow graph: within-cluster variability as a function of K; the choice is subjective. Convergence/robustness of K-means: the procedure might oscillate indefinitely, so a convergence criterion stops when a cluster centroid moves less than a percentage of the smallest distance between any of the centroids. There is an interesting link (http://www.clustan.com/k-means critique.html) to follow for some critique of K-means (some interesting points about outliers, different starting points, ...). Advantages of K-means: computationally fast for large datasets; useful when a certain K is needed. Disadvantages of K-means: can take long to terminate; the final solution is not guaranteed to be "globally optimal"; different initial partitions can lead to different solutions; must rerun the algorithm for different values of K; no dendrogram.
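A minimal sketch of K-means and an elbow graph in Python with scikit-learn (the simulated data, the range of K values, and the final choice of K = 3 are illustrative; within-cluster variability is read from the fitted model's inertia_ attribute):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical customer measurements, standardized before clustering
rng = np.random.default_rng(2)
X = StandardScaler().fit_transform(rng.normal(size=(200, 4)))

# Re-run the algorithm for several values of K and record within-cluster variability
ks = range(1, 9)
wss = []
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wss.append(km.inertia_)          # sum of squared distances to the nearest centroid

plt.plot(list(ks), wss, marker="o")  # elbow graph: look for the "bend" in the curve
plt.xlabel("K"); plt.ylabel("within-cluster sum of squares")
plt.show()

final = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(final.labels_[:10], final.cluster_centers_.shape)
```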

BADM 7.1 K-Nearest Neighbors

Example: a personal loan offer. The KNN algorithm computes the distance between the to-be-classified record and each record in the training set, finds the k shortest distances, and tallies the votes of the k neighbors; this is repeated for every record in the validation set. K-NN can be run in XLMiner. Choosing k: under-smoothing vs. over-smoothing; typically k < 20; use the validation set to find the "best" k. Then choose a cutoff value. KNN: piecing it together. Advantages of KNN: very flexible and data-driven; simple; good performance on large datasets; useful for prediction. Disadvantages of KNN: no insight about the importance of each predictor; danger of overfitting, so an extra test set is needed; can be computationally intensive for large k; needs lots of data.
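A minimal sketch of choosing k with a validation set in Python using scikit-learn (the personal-loan data here is simulated, and the candidate k values are arbitrary examples under 20):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical personal-loan data: two predictors and a 0/1 "accepted offer" outcome
rng = np.random.default_rng(3)
X = rng.normal(size=(500, 2))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=500) > 0).astype(int)

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.4, random_state=0)
scaler = StandardScaler().fit(X_train)          # distances need comparable scales
X_train, X_val = scaler.transform(X_train), scaler.transform(X_val)

# Try several k values and keep the one with the best validation accuracy
for k in [1, 3, 5, 7, 9, 15, 19]:
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    print(k, round(knn.score(X_val, y_val), 3))
```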

BADM 12.1 Association Rules Part 1

Two common terms are 'market basket analysis' and 'affinity analysis'; both try to study "what goes with what". Origin: the study of customer transaction databases to determine dependencies between purchases of different items. POS transaction data has a very large number of transaction records; the data are collected using bar-code scanners, and each record lists all items purchased by a customer in a single purchase transaction. Example of beer and diapers: what would you do with such a rule? Would it affect how you change the layout of your store, would it change your inventory management and marketing, would you make certain offers based on it? Other uses are event-based databases. Online recommender systems on flipkart.com: collaborative filtering. Association rules (HyperCITY, CROSSWORD, ...) vs. recommender systems (flipkart.com, amazon.com, NETFLIX, ...). Example: cellphone faceplates: store managers would like to know which colors of faceplates customers are likely to purchase together. Basic idea: examine all possible rules between items in "if-then" format and select only the rules most likely to indicate true dependence. Many rules are possible. Terminology: the "IF" part = antecedent and the "THEN" part = consequent. Problem: computation time grows exponentially as the number of items increases; generating all possible rules is exponential in the number of distinct items. Solution: frequent item sets: consider only combinations that occur with high frequency in the database. Criterion for "frequent": support of an item set = the % (or number) of transactions in which the antecedent (IF) and consequent (THEN) item sets appear in the data. Generating frequent item sets: the Apriori algorithm, for k products: 1. Set a minimum support criterion; 2. Generate the list of one-item sets that meet the support criterion; 3. Use the list of one-item sets to generate the list of two-item sets that meet the support criterion; 4. Use the list of two-item sets to generate the list of three-item sets that meet the support criterion; 5. Continue up through k-item sets.
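A minimal pure-Python sketch of the Apriori idea (not the course's XLMiner workflow; the toy faceplate transactions and the 0.2 minimum support threshold are made up for illustration):

```python
# Hypothetical faceplate transactions: each set is one customer's basket of colors
transactions = [
    {"red", "white", "green"}, {"white", "orange"}, {"white", "blue"},
    {"red", "white", "orange"}, {"red", "blue"}, {"white", "blue"},
    {"red", "blue"}, {"red", "white", "blue", "green"},
    {"red", "white", "blue"}, {"yellow"},
]

def support(itemset):
    """Fraction of transactions that contain every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

min_support = 0.2
items = sorted({item for t in transactions for item in t})

# Step 2: one-item sets that meet the support criterion
frequent = [frozenset([i]) for i in items if support({i}) >= min_support]
all_frequent = list(frequent)

# Steps 3-5: extend frequent (k-1)-item sets by one item and keep those still frequent
while frequent:
    candidates = {f | {i} for f in frequent for i in items if i not in f}
    frequent = [c for c in candidates if support(c) >= min_support]
    all_frequent.extend(frequent)

for s in all_frequent:
    print(sorted(s), round(support(s), 2))
```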

BADM 12.2 Association Rules Part 2

Performance measure #1: Confidence: the % of antecedent (IF) transactions that also contain the consequent (THEN) item set: confidence = (# transactions with both antecedent & consequent item sets) / (# transactions with the antecedent item set). Example: detecting a flu outbreak. The weakness of confidence: if the antecedent and/or consequent have high support, confidence will be high even without a real dependence. Performance measure #2: Lift ratio = confidence / benchmark confidence. The benchmark assumes independence between antecedent and consequent: P(antecedent & consequent) = P(antecedent) x P(consequent), so the benchmark confidence is P(consequent | antecedent) = P(consequent) = (# transactions with the consequent item set) / (total # transactions). Interpreting lift: lift > 1 indicates a rule that is useful for finding consequent item sets (i.e., more useful than selecting transactions randomly). Process of rule selection: generate all rules that meet the specified support & confidence: find frequent item sets (those with sufficient support), then from these item sets generate rules with sufficient confidence. Cellphone faceplates example. Alternate data format: binary matrix. For all rules (XLMiner). Interpretation revisited: the lift ratio shows how effective the rule is in finding consequents vs. random selection (useful if finding a particular consequent is important). Confidence shows the rate at which consequents will be found (useful in learning the costs of a promotion). Support measures the overall impact (% of transactions affected). Caution: the role of chance: random data can generate apparently interesting association rules, and the more rules you produce, the greater this danger; rules based on large numbers of records are less subject to it. Compressing rules: example of the Charles Book Club. XLMiner output: rules in order of lift; the information can be compressed.
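A minimal sketch computing support, confidence, and lift for one rule in Python (the toy transactions match the Apriori sketch above, and the rule {red} → {white} is chosen arbitrarily):

```python
# Toy transactions (each set is one basket); rule: IF {red} THEN {white}
transactions = [
    {"red", "white", "green"}, {"white", "orange"}, {"white", "blue"},
    {"red", "white", "orange"}, {"red", "blue"}, {"white", "blue"},
    {"red", "blue"}, {"red", "white", "blue", "green"},
    {"red", "white", "blue"}, {"yellow"},
]
antecedent, consequent = {"red"}, {"white"}

n = len(transactions)
n_ante = sum(antecedent <= t for t in transactions)
n_cons = sum(consequent <= t for t in transactions)
n_both = sum((antecedent | consequent) <= t for t in transactions)

support = n_both / n                 # % of transactions containing both item sets
confidence = n_both / n_ante         # P(consequent | antecedent)
benchmark = n_cons / n               # P(consequent): benchmark confidence under independence
lift = confidence / benchmark        # lift > 1 suggests a useful rule

print(round(support, 2), round(confidence, 2), round(lift, 2))
```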

Neural Networks: Part I

Regression models for predicting Y from predictors: linear regression; added flexibility by transforming Y and/or the X's; logistic regression. The idea: capture a complex relationship between output and inputs by creating layers of derived variables. Neural net architecture: multi-layer, feed-forward, fully connected; a "node" is a derived variable. Example: consumer acceptance of cheese; an NN with no hidden layers. S-shaped (sigmoidal) activation functions, e.g. g(s) = 1 / (1 + e^(-s)). Training the network means estimating the weights. Backpropagation is the most popular error-minimization algorithm in NN software; it has 2 options (case updating, used by XLMiner, and batch updating). Stop when the weights change very little from 1 iteration to the next, when the misclassification rate reaches a required threshold, or when a limit on the number of runs is reached. Danger: overfitting. A neural net will have an advantage over regression-type methods when the relationship between output and inputs is highly complex.
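A minimal numpy sketch of one forward pass through a tiny feed-forward, fully connected network with a sigmoid activation (the weights, biases, and the 2 → 3 → 1 architecture are arbitrary illustrations, not the course's cheese example):

```python
import numpy as np

def sigmoid(s):
    """S-shaped activation: squashes any input into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-s))

# One record with 2 inputs (e.g. two sensory scores), scaled to [0, 1]
x = np.array([0.4, 0.9])

# Arbitrary weights and biases for a 2 -> 3 -> 1 fully connected network
W1 = np.array([[0.1, -0.3], [0.5, 0.2], [-0.2, 0.4]])
b1 = np.array([0.05, -0.1, 0.0])
W2 = np.array([[0.3, -0.6, 0.8]])
b2 = np.array([0.1])

hidden = sigmoid(W1 @ x + b1)       # hidden nodes = derived variables
output = sigmoid(W2 @ hidden + b2)  # predicted probability (e.g. "consumer accepts the cheese")
print(hidden, output)
```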

Neural Networks: Part II

Required user input: #1: Choose predictors: an NN is highly dependent on the quality of its predictors. → #2: Pre-process the data: for prediction, transform and scale; for classification, create dummy variables. → #3: Specify the network architecture: number of hidden layers, nodes in the hidden layers, and output nodes. → #4: Specify algorithm parameters: "learning rate" - low values "down-weight" new information from errors at each iteration; this slows learning, but reduces the tendency to overfit to local structure. "Momentum" - high values keep the parameters changing in the same direction as the previous iteration; this also helps avoid overfitting to local structure, but slows learning as well. → #5: Determine the cutoff value (for classification): a cutoff on the probability to obtain a binary classification. Advantages of neural networks: capture highly complex relationships; high tolerance to noisy data. Disadvantages of neural networks: need lots of training data; 'black box'; no variable selection; danger of overfitting; extrapolation is a problem; weights might converge to a local optimum; computational complexity (run time).
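A minimal sketch of these user choices in Python using scikit-learn's MLPClassifier (a different tool from the course's XLMiner; the simulated data, the single 3-node hidden layer, the learning rate, the momentum, and the 0.5 cutoff are illustrative assumptions):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.neural_network import MLPClassifier

# Hypothetical data: 4 numeric predictors and a 0/1 outcome
rng = np.random.default_rng(4)
X = rng.normal(size=(400, 4))
y = (X[:, 0] - X[:, 1] + rng.normal(scale=0.5, size=400) > 0).astype(int)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

# Pre-process: scale predictors to [0, 1]
scaler = MinMaxScaler().fit(X_train)
X_train, X_val = scaler.transform(X_train), scaler.transform(X_val)

# Architecture and algorithm parameters: one hidden layer of 3 nodes,
# a small learning rate, and momentum (used with the SGD solver)
nn = MLPClassifier(hidden_layer_sizes=(3,), solver="sgd",
                   learning_rate_init=0.01, momentum=0.9,
                   max_iter=2000, random_state=0).fit(X_train, y_train)

# Cutoff on the predicted probability to obtain a binary classification
prob = nn.predict_proba(X_val)[:, 1]
pred = (prob >= 0.5).astype(int)
print("validation accuracy:", round((pred == y_val).mean(), 3))
```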
