
Access XLMiner

Access XLMiner through Citrix

Install the Citrix client on your laptop: go to http://www.mcmaster.ca/uts/vip/citrix/installation.html and follow the instructions.

Access XLMiner

Access the Citrix log-in page at https://macapps.mcmaster.ca and use your McMaster email and password to log in. Click on the RJC > Microsoft Office folder, then click on Excel Miner 3_3. The XLMiner menu appears in the Add-ins tab in Excel.

Access XLMiner

The folder that appears by default is My Documents. This is not the folder on your local computer, so you have to browse using the Look in option at the top of the dialog box to find your files. Your local drives will appear in a format similar to C$ on Client (V:). When you click on your drive, a new dialog box will appear asking "What access do you want to grant?". Select Full access, which lets you modify and save the files on your computer.

Data Preparation and Exploration

Sampling

Obtaining Data: Sampling


- Data mining typically deals with huge databases
- Algorithms and models are typically applied to a sample from the database, to produce statistically valid results
- XLMiner, e.g., limits the training partition to 10,000 records
- Once you develop and select a final model, you use it to score the observations in the larger database
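For readers working outside XLMiner, a minimal pandas sketch of drawing a training sample from a larger database (the file name big_database.csv is hypothetical):

```python
import pandas as pd

# Hypothetical file holding the full database
db = pd.read_csv("big_database.csv")

# Draw a random sample for model building; the 10,000-record cap
# mirrors XLMiner's training-partition limit
train_sample = db.sample(n=min(10000, len(db)), random_state=1)
```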

Rare event oversampling


- Often the event of interest is rare
- Examples: response to a mailing, tax fraud
- Sampling may yield too few interesting cases to effectively train a model
- A popular solution: oversample the rare cases to obtain a more balanced training set, as sketched below
- Later, results need to be adjusted for the oversampling
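A minimal sketch of oversampling in pandas, on toy data with an illustrative response column:

```python
import pandas as pd

# Toy data: 1 rare positive case among 10 records (names are illustrative)
df = pd.DataFrame({"x": range(10), "response": [0] * 9 + [1]})

rare = df[df["response"] == 1]
common = df[df["response"] == 0]

# Resample the rare cases with replacement until the classes balance
rare_up = rare.sample(n=len(common), replace=True, random_state=1)
balanced = pd.concat([common, rare_up]).sample(frac=1, random_state=1)  # shuffle

print(balanced["response"].value_counts())  # 9 records of each class
```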

Pre-processing Data

Types of Variables

The types of variables determine the pre-processing needed and the algorithms used. Main distinction: categorical vs. numeric.

- Numeric: continuous or integer
- Categorical: ordered (low, medium, high) or unordered (male, female)

Variable handling

Numeric

- Most algorithms in XLMiner can handle numeric data as-is
- May occasionally need to bin into categories

Categorical

- Naïve Bayes can use categorical variables as-is
- In most other algorithms, must create binary dummies (number of dummies = number of categories - 1), as in the sketch below
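A minimal sketch of dummy coding in pandas; drop_first drops the redundant category, giving the m - 1 dummies mentioned above:

```python
import pandas as pd

df = pd.DataFrame({"size": ["low", "medium", "high", "medium"]})

# 3 categories -> 2 binary dummies (number of categories - 1)
dummies = pd.get_dummies(df["size"], prefix="size", drop_first=True)
print(dummies)
```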

Detecting Outliers

- An outlier is an observation that is extreme, being distant from the rest of the data (the definition of "distant" is deliberately vague)
- Outliers can have a disproportionate influence on models (a problem if the outlier is spurious)
- An important step in data pre-processing is detecting outliers
- Once detected, domain knowledge is required to determine whether it is an error or truly extreme

Examples of Outliers: Guinness World Records


Svetlana Pankratova, the world's leggiest woman (legs 52" long), and He Pingping, the world's shortest man (at 2' 5.37").

Sultan Kosen, from Turkey, is 2 m 46.5 cm (8 ft 1 in) tall.

Detecting Outliers

In some contexts, finding outliers is the purpose of the DM exercise (e.g., airport security screening). This is called anomaly detection.
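One common screening rule flags points beyond 1.5 IQR of the quartiles (the same rule used for box plots later in these slides); a minimal sketch on toy data:

```python
import pandas as pd

x = pd.Series([22, 24, 21, 23, 25, 95])  # toy data with one extreme value

q1, q3 = x.quantile(0.25), x.quantile(0.75)
iqr = q3 - q1

# Flag values beyond 1.5 IQR of the quartiles; domain knowledge then
# decides whether each flagged point is an error or truly extreme
outliers = x[(x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)]
print(outliers)  # flags the 95
```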

Handling Missing Data

Most algorithms will not process records with missing values; the default is to drop those records.

Solution 1: Omission

- If a small number of records have missing values, they can be omitted
- If many records are missing values on a small set of variables, those variables can be dropped (or proxies used)
- If many records have missing values, omission is not practical

Solution 2: Imputation

- Replace missing values with reasonable substitutes
- Lets you keep the record and use the rest of its (non-missing) information
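Both solutions in a minimal pandas sketch (column names are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"rooms": [6.5, np.nan, 5.9, 7.1],
                   "tax":   [296, 242, np.nan, 222]})

df_omit = df.dropna()                # Solution 1: omit records with missing values
df_imputed = df.fillna(df.median())  # Solution 2: impute each column's median
```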

Normalizing (Standardizing) Data

- Used in some techniques, where variables with the largest scales would otherwise dominate and skew results
- Puts all variables on the same scale
- Normalizing function: subtract the mean and divide by the standard deviation (used in XLMiner)
- Alternative function: scale to 0-1 by subtracting the minimum and dividing by the range; useful when the data contain dummies along with numeric variables
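Both normalizing functions in a minimal sketch:

```python
import pandas as pd

x = pd.Series([70, 120, 70, 50, 110])

z_scored = (x - x.mean()) / x.std()            # subtract mean, divide by std (XLMiner)
min_max = (x - x.min()) / (x.max() - x.min())  # alternative: scale to 0-1
```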

Partitioning the Data

Partitioning the Data


Problem: How well will our model perform with new data?
Solution: Separate the data into two parts

- Training partition: used to develop the model
- Validation partition: used to implement the model and evaluate its performance on new data

This addresses the issue of overfitting.

Test Partition

- When a model is developed on training data, it can overfit the training data (hence the need to assess it on validation data)
- Assessing multiple models on the same validation data can overfit the validation data
- Some methods use the validation data to choose a parameter; this too can lead to overfitting the validation data
- Solution: the final selected model is applied to a test partition to give an unbiased estimate of its performance on new data (a three-way split is sketched below)
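A minimal sketch of a three-way split using scikit-learn (the proportions are illustrative):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({"x": range(100), "y": range(100)})  # stand-in data

# 60% training, 40% held out
train, holdout = train_test_split(df, test_size=0.4, random_state=1)
# Split the holdout into validation and test (20% of the data each)
valid, test = train_test_split(holdout, test_size=0.5, random_state=1)
```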

Using Excel and XLMiner for Data Mining

- Excel is limited in data capacity
- However, training and validation of DM models can be handled within the modest limits of Excel and XLMiner
- Models can then be used to score larger databases
- XLMiner has functions for interacting with various databases (taking samples from a database, and scoring a database from a developed model)

http://www.resample.com/xlminer/help/Index.htm

Boston House Median Value Prediction


CRIM: per capita crime rate by town
ZN: proportion of residential land zoned for lots over 25,000 sq. ft.
INDUS: proportion of non-retail business acres per town
CHAS: Charles River dummy variable (1 if tract bounds river; 0 otherwise)
NOX: nitric oxides concentration (parts per 10 million)
RM: average number of rooms per dwelling
AGE: proportion of owner-occupied units built prior to 1940
DIS: weighted distances to five Boston employment centres
RAD: index of accessibility to radial highways
TAX: full-value property-tax rate per $10,000
PTRATIO: pupil-teacher ratio by town
B: 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
LSTAT: % lower status of the population
MEDV: median value of owner-occupied homes in $1000s

Example: Linear Regression - Boston Housing Data


CRIM   ZN    INDUS  CHAS  NOX   RM    AGE   DIS   RAD  TAX  PTRATIO  B    LSTAT  MEDV  CAT. MEDV
0.006  18    2.31   0     0.54  6.58  65.2  4.09  1    296  15.3     397  5      24    0
0.027  0     7.07   0     0.47  6.42  78.9  4.97  2    242  17.8     397  9      21.6  0
0.027  0     7.07   0     0.47  7.19  61.1  4.97  2    242  17.8     393  4      34.7  1
0.032  0     2.18   0     0.46  7.00  45.8  6.06  3    222  18.7     395  3      33.4  1
0.069  0     2.18   0     0.46  7.15  54.2  6.06  3    222  18.7     397  5      36.2  1
0.030  0     2.18   0     0.46  6.43  58.7  6.06  3    222  18.7     394  5      28.7  0
0.088  12.5  7.87   0     0.52  6.01  66.6  5.56  5    311  15.2     396  12     22.9  0
0.145  12.5  7.87   0     0.52  6.17  96.1  5.95  5    311  15.2     397  19     27.1  0
0.211  12.5  7.87   0     0.52  5.63  100   6.08  5    311  15.2     387  30     16.5  0
0.170  12.5  7.87   0     0.52  6.00  85.9  6.59  5    311  15.2     387  17     18.9  0

Partitioning the data

Using XLMiner for Multiple Linear Regression: specify input and output variables

Specifying Output
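For reference, the same workflow sketched with scikit-learn instead of XLMiner (the file and column names are assumed to match the data dictionary above):

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

boston = pd.read_csv("BostonHousing.csv")       # hypothetical file name
X = boston.drop(columns=["MEDV", "CAT. MEDV"])  # input variables (names assumed)
y = boston["MEDV"]                              # output variable

X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.4,
                                                      random_state=1)

model = LinearRegression().fit(X_train, y_train)  # fit on the training partition
predicted = model.predict(X_valid)                # score the validation partition
```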

RMS error

- Error = actual - predicted
- RMS error = root-mean-squared error = square root of the average squared error
- In the previous example, the sizes of the training and validation sets differ, so only the RMS error and average error are comparable
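The two error measures, computed on a few illustrative actual/predicted pairs:

```python
import numpy as np

actual = np.array([24.0, 21.6, 34.7])
predicted = np.array([25.1, 20.9, 33.0])

errors = actual - predicted
rmse = np.sqrt(np.mean(errors ** 2))  # square root of the average squared error
avg_error = errors.mean()             # average error
```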

Summary

- Data mining consists of supervised methods (classification & prediction) and unsupervised methods (association rules, data reduction, data exploration & visualization)
- Before algorithms can be applied, data need to be characterized and pre-processed
- To evaluate performance and to avoid overfitting, data partitioning is used
- Data mining methods are usually applied to a sample from a large database, and then the best model is used to score the entire database

Data Exploration and Dimension Reduction

Exploring the data


Statistical summary of data: common metrics

- Average (sample mean)
- Median (the middle point)
- Minimum
- Maximum
- Standard deviation
- Counts & percentages

Boston House Data


CRIM: per capita crime rate by town
ZN: proportion of residential land zoned for lots over 25,000 sq. ft.
INDUS: proportion of non-retail business acres per town
CHAS: Charles River dummy variable (1 if tract bounds river; 0 otherwise)
NOX: nitric oxides concentration (parts per 10 million)
RM: average number of rooms per dwelling
AGE: proportion of owner-occupied units built prior to 1940
DIS: weighted distances to five Boston employment centres
RAD: index of accessibility to radial highways
TAX: full-value property-tax rate per $10,000
PTRATIO: pupil-teacher ratio by town
B: 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
LSTAT: % lower status of the population
MEDV: median value of owner-occupied homes in $1000s

Summary Statistics - Boston Housing

Correlations Between Pairs of Variables:


Correlation Matrix from Excel
         PTRATIO   B         LSTAT     MEDV
PTRATIO  1
B        -0.17738  1
LSTAT    0.374044  -0.36609  1
MEDV     -0.50779  0.333461  -0.73766  1

PTRATIO: pupil-teacher ratio by town
B: 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
LSTAT: % lower status of the population
MEDV: median value of owner-occupied homes in $1000's

Summarize Using Pivot Tables


Counts & percentages are useful for summarizing categorical data. Boston Housing example: 35 neighborhoods border the Charles River (CHAS = 1) and 471 do not (CHAS = 0).

Count of MEDV
CHAS         Total
0            471
1            35
Grand Total  506

Pivot Tables - cont.


Averages are useful for summarizing grouped numerical data

In Boston Housing example: Compare average home values in neighborhoods that border Charles River (1) and those that do not (0)

Average of MEDV
CHAS         Total
0            22.09
1            28.44
Grand Total  22.53

Pivot Tables, cont.


Group by multiple criteria: by number of rooms and location. E.g., neighborhoods on the Charles with 6-7 rooms have an average house value of 25.92 ($000). A pandas version is sketched after the table below.
Average of MEDV
RM           CHAS=0  CHAS=1  Grand Total
3-4          25.30   -       25.30
4-5          16.02   -       16.02
5-6          17.13   22.22   17.49
6-7          21.77   25.92   22.02
7-8          35.96   44.07   36.92
8-9          45.70   35.95   44.20
Grand Total  22.09   28.44   22.53
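The same summaries sketched in pandas (file and column names assumed as before):

```python
import pandas as pd

boston = pd.read_csv("BostonHousing.csv")  # hypothetical file name

# Counts per category (CHAS = 0 vs. 1)
print(boston["CHAS"].value_counts())

# Average MEDV by CHAS
print(boston.pivot_table(values="MEDV", index="CHAS", aggfunc="mean"))

# Two-way pivot: average MEDV by room bins and CHAS
boston["RM_bin"] = pd.cut(boston["RM"], bins=range(3, 10)).astype(str)
print(boston.pivot_table(values="MEDV", index="RM_bin",
                         columns="CHAS", aggfunc="mean", margins=True))
```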

Graphs

Histograms
Boston Housing example:
- Histogram shows the distribution of the outcome variable (median house value)
- Mode: the value with the highest frequency

[Figure: histogram of MEDV; x-axis MEDV in $000s, y-axis Frequency]

Box Plot
[Diagram: box plot anatomy, showing min, Quartile 1, median, mean, Quartile 3, max, and outliers]

Top outliers are defined as those above Q3 + 1.5(Q3 - Q1); "max" is the maximum of the non-outliers. Analogous definitions hold for bottom outliers and for "min". Details may differ across software.

Boxplots
Side-by-side boxplots are useful for comparing subgroups
Boston Housing example: display the distribution of the outcome variable (MEDV) for neighborhoods on the Charles (1) and not on the Charles (0).

[Figure: side-by-side box plots of MEDV by CHAS]
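Both graphs sketched with pandas/matplotlib (file name assumed as before):

```python
import matplotlib.pyplot as plt
import pandas as pd

boston = pd.read_csv("BostonHousing.csv")  # hypothetical file name

boston["MEDV"].plot(kind="hist", bins=20, title="Histogram of MEDV")
plt.show()

boston.boxplot(column="MEDV", by="CHAS")   # side-by-side box plots by river dummy
plt.show()
```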

Correlation Analysis

Below: correlation matrix for a portion of the Boston Housing data, showing the correlation between variable pairs.
       CRIM      ZN        INDUS     CHAS      NOX       RM
CRIM   1
ZN     -0.20047  1
INDUS  0.406583  -0.53383  1
CHAS   -0.05589  -0.0427   0.062938  1
NOX    0.420972  -0.5166   0.763651  0.091203  1
RM     -0.21925  0.311991  -0.39168  0.091251  -0.30219  1

Matrix Plot
Shows scatterplots for variable pairs
Example: scatterplots for 3 Boston Housing variables (CRIM, INDUS, ZN)

[Figure: matrix plot of pairwise scatterplots for CRIM, INDUS, and ZN]
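Both the correlation matrix and the matrix plot sketched in pandas:

```python
import matplotlib.pyplot as plt
import pandas as pd

boston = pd.read_csv("BostonHousing.csv")  # hypothetical file name

# Correlation matrix for a subset of variables
print(boston[["CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM"]].corr())

# Matrix plot: pairwise scatterplots for 3 variables
pd.plotting.scatter_matrix(boston[["CRIM", "INDUS", "ZN"]], figsize=(6, 6))
plt.show()
```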

Reducing Categories

- A single categorical variable with m categories is typically transformed into m - 1 dummy variables
- Each dummy variable takes the value 0 (no for the category) or 1 (yes)
- Problem: can end up with too many variables
- Solution: reduce the number by combining categories that are close to each other
- Use pivot tables to assess how sensitive the outcome variable is to the dummies
- Exception: Naïve Bayes can handle categorical variables without transforming them into dummies
- (A simple grouping heuristic is sketched below)
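One simple heuristic, grouping infrequent categories into an "Other" class before dummy coding (a sketch on toy data; in practice, categories should be combined based on domain knowledge and pivot-table checks):

```python
import pandas as pd

s = pd.Series(["K", "G", "G", "R", "P", "N", "Q", "G", "K"])  # toy categories

counts = s.value_counts()
rare = counts[counts < 2].index                  # categories seen fewer than 2 times
reduced = s.where(~s.isin(rare), other="Other")  # collapse them into "Other"

dummies = pd.get_dummies(reduced, drop_first=True)  # m - 1 dummies after reduction
print(dummies)
```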

Principal Components Analysis

Principal Components Analysis


Goal: reduce a set of numerical variables. The idea: remove the overlap of information between these variables. [Information is measured by the sum of the variances of the variables.] Final product: a smaller number of numerical variables that contain most of the information.

Principal Components Analysis


How does PCA do this?

- Creates new variables that are linear combinations of the original variables (i.e., weighted averages of the original variables)
- These linear combinations are uncorrelated (no information overlap), and only a few of them contain most of the original information
- The new variables are called principal components (PCs)

Example: Breakfast Cereals


name                       mfr  type  calories  protein  rating
100%_Bran                  N    C     70        4        68
100%_Natural_Bran          Q    C     120       3        34
All-Bran                   K    C     70        4        59
All-Bran_with_Extra_Fiber  K    C     50        4        94
Almond_Delight             R    C     110       2        34
Apple_Cinnamon_Cheerios    G    C     110       2        30
Apple_Jacks                K    C     110       2        33
Basic_4                    G    C     130       3        37
Bran_Chex                  R    C     90        2        49
Bran_Flakes                P    C     90        3        53
Cap'n'Crunch               Q    C     120       1        18
Cheerios                   G    C     110       6        51
Cinnamon_Toast_Crunch      G    C     120       1        20

Description of Variables

name: name of cereal
mfr: manufacturer
type: cold or hot
calories: calories per serving
protein: grams
fat: grams
sodium: mg
fiber: grams
carbo: grams of complex carbohydrates
sugars: grams
potass: mg
vitamins: % of FDA recommendation
shelf: display shelf
weight: oz. per serving
cups: cups per serving
rating: Consumer Reports rating

Consider calories & ratings


Covariance matrix:

          calories  ratings
calories  379.63    -189.68
ratings   -189.68   197.32

- Total variance (= information) is the sum of the individual variances: 379.63 + 197.32 = 576.95
- Calories accounts for 379.63/576.95 = 66% of the total variance

First & Second Principal Components


- z1 and z2 are two linear combinations of calories and rating
- z1 has the highest variation (spread of values)
- z2 has the lowest variation
[Figure: scatterplot of rating vs. calories with the principal component directions z1 and z2 overlaid]

PCA output for these 2 variables

Top: the weights used to project the original data onto z1 and z2; e.g., z1 scores are computed using the column-1 weights (-0.847, 0.532). Bottom: the variance reallocated to the new variables.

Variable   Component 1  Component 2
calories   -0.84705347  0.53150767
rating     0.53150767   0.84705347
Variance   498.0244751  78.932724
Variance%  86.31913757  13.68086338
Cum%       86.31913757  100
P-value    0            1
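A sketch reproducing this output with scikit-learn (the file name is hypothetical; assuming the same data, the variances should match the table up to rounding, and the weights up to sign):

```python
import pandas as pd
from sklearn.decomposition import PCA

cereals = pd.read_csv("Cereals.csv")  # hypothetical file name
X = cereals[["calories", "rating"]]

print(X.cov())                        # variances 379.63 and 197.32 in the text

pca = PCA(n_components=2).fit(X)
print(pca.components_)                # weights for z1 and z2 (signs may flip)
print(pca.explained_variance_)        # ~498 and ~79, as in the output above
```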

Properties of the resulting variables


New distribution of information:

- New variances: 498 (for z1) and 79 (for z2)
- Sum of the variances = sum of the variances of the original variables calories and ratings
- The new variable z1 has most of the total variance, and might be used as a proxy for both calories and ratings
- z1 and z2 have correlation of zero (no information overlap)

Generalization
- X1, X2, X3, ..., Xp: the original p variables
- Z1, Z2, Z3, ..., Zp: weighted averages of the original variables
- All pairs of Z variables have 0 correlation
- Order the Zs by variance (Z1 largest, Zp smallest)
- Usually the first few Z variables contain most of the information, so the rest can be dropped

PCA on full data set


Variable   Comp. 1      Comp. 2      Comp. 3      Comp. 4      Comp. 5      Comp. 6
calories   0.07624155   -0.01066097  0.61074823   -0.61706442  0.45754826   0.12601775
protein    -0.00146212  0.00873588   0.00050506   0.0019389    0.05533375   0.10379469
fat        -0.00013779  0.00271266   0.01596125   -0.02595884  -0.01839438  -0.12500292
sodium     0.98165619   0.12513085   -0.14073193  -0.00293341  0.01588042   0.02245871
fiber      -0.00479783  0.03077993   -0.01684542  0.02145976   0.00872434   0.271184
carbo      0.01486445   -0.01731863  0.01272501   0.02175146   0.35580006   -0.56089228
sugars     0.00398314   -0.00013545  0.09870714   -0.11555841  -0.29906386  0.62323487
potass     -0.119053    0.98861349   0.03619435   -0.042696    -0.04644227  -0.05091622
vitamins   0.10149482   0.01598651   0.7074821    0.69835609   -0.02556211  0.01341988
shelf      -0.00093911  0.00443601   0.01267395   0.00574066   -0.00823057  -0.05412053
weight     0.0005016    0.00098829   0.00369807   -0.0026621   0.00318591   0.00817035
cups       0.00047302   -0.00160279  0.00060208   0.00095916   0.00280366   -0.01087413
rating     -0.07615706  0.07254035   -0.30776858  0.33866307   0.75365263   0.41805118
Variance   7204.161133  4833.050293  498.4260864  357.2174377  72.47863007  4.33980322
Variance%  55.52834702  37.25226212  3.84177661   2.75336623   0.55865192   0.0334504
Cum%       55.52834702  92.78060913  96.62238312  99.37575531  99.93440247  99.96785736

- First 6 components shown
- First 2 capture 93% of the total variation


Note: data differ slightly from text

Normalizing data

- In these results, sodium dominates the first PC
- That is just because of the way it is measured (mg): its scale is greater than that of almost all the other variables
- Hence its variance is a dominant component of the total variance
- Normalize each variable to remove the scale effect

- Divide by the standard deviation (may subtract the mean first)
- Normalization (= standardization) is usually performed in PCA; otherwise, the measurement scale of each variable drives the results

PCA using standardized variables


Variable   Comp. 1      Comp. 2      Comp. 3      Comp. 4      Comp. 5      Comp. 6
calories   0.32422706   0.36006299   0.13210163   0.30780381   0.08924425   -0.20683768
protein    -0.30220962  0.16462311   0.2609871    0.43252215   0.14542894   0.15786675
fat        0.05846959   0.34051308   -0.21144024  0.37964511   0.44644874   0.40349057
sodium     0.20198308   0.12548573   0.37701431   -0.16090299  -0.33231756  0.6789462
fiber      -0.43971062  0.21760374   0.07857864   -0.10126047  -0.24595702  0.06016004
carbo      0.17192839   -0.18648526  0.56368077   0.20293142   0.12910619   -0.25979191
sugars     0.25019819   0.3434512    -0.34577203  -0.10401795  -0.27725372  -0.20437138
potass     -0.3834067   0.32790738   0.08459517   0.00463834   -0.16622125  0.022951
vitamins   0.13955688   0.16689315   0.38407779   -0.52358848  0.21541923   0.03514972
shelf      -0.13469705  0.27544045   0.01791886   -0.4340663   0.59693497   -0.12134896
weight     0.07780685   0.43545634   0.27536476   0.10600897   -0.26767638  -0.38367996
cups       0.27874646   -0.24295618  0.14065795   0.08945525   0.06306333   0.06609894
rating     -0.45326898  -0.22710647  0.18307236   0.06392702   0.03328028   -0.16606605
Variance   3.59530377   3.16411042   1.86585701   1.09171081   0.96962351   0.72342771
Variance%  27.65618324  24.3393116   14.35274601  8.39777565   7.45864248   5.5648284
Cum%       27.65618324  51.99549484  66.34824371  74.74601746  82.20465851  87.76948547

- The first component now accounts for a smaller part of the variance
- More components are needed to capture the same amount of information

PCA in Classification/Prediction

- Apply PCA to the training data
- Decide how many PCs to use
- Use the variable weights from those PCs with validation/new data
- This creates a new, reduced set of predictors in the validation/new data (see the sketch below)
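A sketch of this workflow with scikit-learn, standardizing and fitting PCA on the training partition only, then applying the same weights to validation data (the arrays here are random stand-ins):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_train = np.random.rand(100, 13)  # stand-in for training predictors
X_valid = np.random.rand(40, 13)   # stand-in for validation predictors

# Standardize, then keep the first 5 PCs (how many to keep is a modeling choice)
pca_pipe = make_pipeline(StandardScaler(), PCA(n_components=5))

train_pcs = pca_pipe.fit_transform(X_train)  # weights estimated from training data only
valid_pcs = pca_pipe.transform(X_valid)      # same weights applied to new data
```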

Summary

- Data summarization is important for data exploration
- Data summaries include numerical metrics (average, median, etc.) and graphical summaries
- Data reduction is useful for compressing the information in the data into a smaller set of variables
- Categorical variables can be reduced by combining similar categories
- Principal components analysis transforms an original set of numerical variables into a smaller set of weighted averages of the originals that contain most of the original information in fewer variables
