
IVY Professional School

Program:

KPO Training

Module:

Basic Statistics and Predictive Modeling

Session:

7 and 8

Copyright Ivy Professional School - 2009-10 (All Rights Reserved)

Outline

Descriptive Statistics
Frequencies and Crosstabs
Correlations
Multiple Linear Regression
Logistic Regression
Time Series
Principal Component Analysis
Factor Analysis
Cluster Analysis


Descriptive Statistics
R provides a wide range of functions for obtaining summary statistics. One method of
obtaining descriptive statistics is to use the sapply( ) function with a specified summary
statistic.
# get means for variables in data frame mydata
# excluding missing values
sapply(mydata, mean, na.rm=TRUE)
- Possible functions used in sapply include mean, sd, var, min, max, median, range, and
quantile.
# mean, median, 25th and 75th percentiles, min, max
summary(mydata)
library(Hmisc)
describe(mydata)
# n, nmiss, unique, mean, 5,10,25,50,75,90,95th percentiles
# 5 lowest and 5 highest scores
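
As a minimal sketch of the sapply( ) approach, the example below uses three columns of the built-in mtcars data set as a stand-in for mydata (an assumption made only for illustration) and sets one value to missing to show what na.rm=TRUE changes.

```r
# Illustrative data: three numeric columns of the built-in mtcars data set
mydata <- mtcars[, c("mpg", "hp", "wt")]
mydata$mpg[1] <- NA                                # introduce one missing value

means_raw  <- sapply(mydata, mean)                 # NA propagates for mpg
means_narm <- sapply(mydata, mean, na.rm = TRUE)   # missing value is dropped

print(means_raw)
print(means_narm)

# A function that returns several values yields a matrix,
# with one column per variable
q <- sapply(mydata, quantile, probs = c(0.25, 0.75), na.rm = TRUE)
print(q)
```

Extra arguments such as na.rm or probs are passed through sapply( ) to the summary function unchanged.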


Frequencies
R provides many methods for creating frequency and contingency tables. Three are
described below. In the following examples, assume that A, B, and C represent categorical
variables.
# 2-Way Frequency Table
attach(mydata)
mytable <- table(A,B) # A will be rows, B will be columns
mytable # print table
margin.table(mytable, 1) # A frequencies (summed over B)
margin.table(mytable, 2) # B frequencies (summed over A)
prop.table(mytable) # cell proportions (multiply by 100 for percentages)
prop.table(mytable, 1) # row proportions
prop.table(mytable, 2) # column proportions
table( ) can also generate multidimensional tables based on 3 or more categorical
variables. In this case, use the ftable( ) function to print the results more attractively.
# 3-Way Frequency Table
mytable <- table(A, B, C)
ftable(mytable)
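
The table functions above can be tried on real data. The sketch below assumes the built-in mtcars data set, with cyl and gear standing in for the categorical variables A and B, and avoids attach( ) by building the factors explicitly.

```r
# Illustrative categorical variables from the built-in mtcars data set
A <- factor(mtcars$cyl)    # number of cylinders
B <- factor(mtcars$gear)   # number of forward gears

mytable  <- table(A, B)              # A in rows, B in columns
row_prop <- prop.table(mytable, 1)   # each row sums to 1
col_prop <- prop.table(mytable, 2)   # each column sums to 1

print(mytable)
print(margin.table(mytable, 1))      # A frequencies (summed over B)
print(round(100 * row_prop, 1))      # row percentages
```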

Crosstabs
xtabs: The xtabs( ) function creates cross-tabulations from formula-style input.
# 3-Way Frequency Table
mytable <- xtabs(~A+B+C, data=mydata)
ftable(mytable) # print table
summary(mytable) # chi-square test of independence
CrossTable: The CrossTable( ) function in the gmodels package prints a cross-tabulation of two variables.
# 2-Way Cross Tabulation
library(gmodels)
CrossTable(mydata$myrowvar, mydata$mycolvar)

Note: table( ) ignores missing values by default. To include NA as a category in the counts, add the
table( ) option exclude=NULL if the variable is a vector. If the variable is a factor, create a new factor
using newfactor <- factor(oldfactor, exclude=NULL).
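
The handling of missing values can be checked with a small sketch. The vector x below is made up for illustration; useNA="ifany" is a standard table( ) argument shown here as an alternative to exclude=NULL.

```r
# Hypothetical character vector with two missing values
x <- c("a", "b", NA, "a", NA)

counts_default <- table(x)                   # NA values are dropped
counts_exclude <- table(x, exclude = NULL)   # NA counted as a category
counts_useNA   <- table(x, useNA = "ifany")  # same idea via useNA

# For a factor, rebuild it so that NA becomes an explicit level
f <- factor(x, exclude = NULL)

print(counts_default)
print(counts_exclude)
print(levels(f))
```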


Correlations

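Use the cor( ) function to produce correlations and the cov( ) function to produce covariances.
A simplified format is cor(x, use=, method= ) where
x is a matrix or data frame,
use specifies the handling of missing data ("all.obs", "complete.obs", or "pairwise.complete.obs"), and
method specifies the type of correlation ("pearson", "spearman", or "kendall").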

Example:
# Correlations/covariances among numeric variables in data frame mtcars.
# Use listwise deletion of missing data.
cor(mtcars, use="complete.obs", method="kendall")
cov(mtcars, use="complete.obs")

Use the format cor(X, Y) or rcorr(X, Y) to generate correlations between the columns of X and the
columns of Y (rcorr( ) is in the Hmisc package). This is similar to the VAR and WITH commands in SAS PROC CORR.
# Correlation matrix from mtcars with mpg, cyl, and disp as rows and hp, drat, and wt as columns
x <- mtcars[1:3]
y <- mtcars[4:6]
cor(x, y)
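
The difference between listwise and pairwise deletion is easiest to see with a missing value in place. The sketch below assumes three mtcars columns and one artificially introduced NA; it is an illustration, not part of the original example.

```r
# Three numeric columns of mtcars, with one hp value set to missing
x <- mtcars[, c("mpg", "wt", "hp")]
x$hp[2] <- NA

# Listwise deletion: row 2 is dropped for every pair of variables
r_complete <- cor(x, use = "complete.obs")

# Pairwise deletion: row 2 is dropped only for pairs that involve hp
r_pairwise <- cor(x, use = "pairwise.complete.obs")

print(round(r_complete, 3))
print(round(r_pairwise, 3))
```

With pairwise deletion the mpg-wt correlation still uses all 32 rows, so it differs slightly from the listwise result.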

Multiple (Linear) Regression

Fitting the Model


# Multiple Linear Regression Example
fit <- lm(y ~ x1 + x2 + x3, data=mydata)
summary(fit) # show results
# Other useful functions
coefficients(fit) # model coefficients
confint(fit, level=0.95) # CIs for model parameters
fitted(fit) # predicted values
residuals(fit) # residuals
anova(fit) # anova table
vcov(fit) # covariance matrix for model parameters
influence(fit) # regression diagnostics

Diagnostic plots provide checks for heteroscedasticity, normality, and influential observations.
# diagnostic plots
layout(matrix(c(1,2,3,4),2,2)) # optional 4 graphs/page
plot(fit)
Comparing Models: We can compare nested models with the anova( ) function. The following code
provides a simultaneous test that x3 and x4 add to linear prediction above and beyond x1 and x2.
# compare models
fit1 <- lm(y ~ x1 + x2 + x3 + x4, data=mydata)
fit2 <- lm(y ~ x1 + x2, data=mydata)
anova(fit1, fit2)
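
A runnable version of the nested-model comparison might look like the sketch below. It assumes the built-in mtcars data set, with mpg as the response and wt, hp, qsec, and drat as predictors; these variable choices are illustrative only.

```r
# Full and reduced models fitted to the same data
fit_full    <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)
fit_reduced <- lm(mpg ~ wt + hp, data = mtcars)

# F test of whether qsec and drat jointly improve on wt and hp
cmp <- anova(fit_reduced, fit_full)
print(cmp)

p_value <- cmp[["Pr(>F)"]][2]   # p value of the joint test
print(p_value)
```

Both models must be fitted to the same rows, which is why each call names the same data argument.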

Logistic Regression

Logistic regression is useful when you are predicting a binary outcome from a set of continuous
predictor variables.
# Logistic Regression where F is a binary factor and x1-x3 are continuous predictors
fit <- glm(F~x1+x2+x3,data=mydata,family=binomial())
summary(fit) # display results
confint(fit) # 95% CI for the coefficients
exp(coef(fit)) # exponentiated coefficients
exp(confint(fit)) # 95% CI for exponentiated coefficients
predict(fit, type="response") # predicted values
residuals(fit, type="deviance") # residuals

Note : One can use anova(fit1,fit2, test="Chisq") to compare nested models. Additionally, cdplot(F~x,
data=mydata) will display the conditional density plot of the binary outcome F on the continuous x variable.
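
As an illustration of the calls above, the sketch below fits a logistic regression to the built-in mtcars data set, predicting the binary transmission indicator am from wt and hp. The data set and variable choice are assumptions made for the example.

```r
# am is coded 0 = automatic, 1 = manual
fit <- glm(am ~ wt + hp, data = mtcars, family = binomial())

odds_ratios <- exp(coef(fit))              # multiplicative change in the odds
probs <- predict(fit, type = "response")   # fitted probabilities of am = 1

print(summary(fit)$coefficients)
print(round(odds_ratios, 4))
print(round(head(probs), 3))
```

An exponentiated coefficient below 1 means the odds of the outcome fall as that predictor increases, holding the other predictors fixed.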


Time Series

Creating a time series


The ts() function will convert a numeric vector into an R time series object. The format is ts(vector,
start=, end=, frequency=) where start and end are the times of the first and last observation and
frequency is the number of observations per unit time (1=annual, 4=quarterly, 12=monthly, etc.).
# save a numeric vector containing 72 monthly observations from Jan 2009 to Dec 2014
# as a time series object
myts <- ts(myvector, start=c(2009, 1), end=c(2014, 12), frequency=12)
# subset the time series (June 2014 to December 2014)
myts2 <- window(myts, start=c(2014, 6), end=c(2014, 12))

# plot series
plot(myts)
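
A self-contained sketch of the same steps follows. The vector myvector is simulated here (an assumption, since the slides do not supply data); six years of monthly data give 6 x 12 = 72 observations.

```r
set.seed(1)
myvector <- cumsum(rnorm(72))   # simulated series: 72 monthly values

# Monthly series from January 2009 to December 2014
myts <- ts(myvector, start = c(2009, 1), end = c(2014, 12), frequency = 12)

# Subset: June 2014 to December 2014 (7 observations)
myts2 <- window(myts, start = c(2014, 6), end = c(2014, 12))

print(length(myts))
print(myts2)
```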
Seasonal Decomposition
A time series with additive trend, seasonal, and irregular components can be decomposed using the
stl() function. Note that a series with multiplicative effects can often be transformed into a series with
additive effects through a log transformation (i.e., newts <- log(myts)).
# Seasonal decomposition
fit <- stl(myts, s.window="period")
plot(fit)
# additional plots
monthplot(myts)
library(forecast)
seasonplot(myts)
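
To make the log-transformation remark concrete, the sketch below decomposes the built-in AirPassengers series, whose seasonal swings grow with the level of the series. Using this data set is an assumption for illustration.

```r
# Log transform turns multiplicative effects into additive ones
logap <- log(AirPassengers)

fit <- stl(logap, s.window = "periodic")
components <- fit$time.series        # columns: seasonal, trend, remainder

# The three components add back up to the (logged) series
recon <- rowSums(components)

print(colnames(components))
print(head(round(components, 3)))
```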

Principal Component Analysis

The princomp( ) function produces an unrotated principal component analysis.


# Principal Components Analysis
# entering raw data and extracting PCs from the correlation matrix
fit <- princomp(mydata, cor=TRUE)
summary(fit) # print variance accounted for
loadings(fit) # pc loadings
plot(fit,type="lines") # scree plot
fit$scores # the principal components
biplot(fit)
Note: Use cor=FALSE to base the principal components on the covariance matrix. Use the covmat=
option to enter a correlation or covariance matrix directly. If entering a covariance matrix, include the
option n.obs=.
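
The variance accounted for by each component can also be computed by hand from the fitted object. The sketch below assumes the built-in USArrests data set as a stand-in for mydata.

```r
# PCA on the correlation matrix of the four USArrests variables
fit <- princomp(USArrests, cor = TRUE)

# Component variances are the squared standard deviations
var_explained <- fit$sdev^2 / sum(fit$sdev^2)
cum_explained <- cumsum(var_explained)

print(round(var_explained, 3))
print(round(cum_explained, 3))
print(dim(fit$scores))   # one row per observation, one column per component
```

With cor=TRUE the component variances sum to the number of variables, because each standardized variable contributes a variance of 1.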

The principal( ) function in the psych package can be used to extract and rotate principal
components.
# Varimax Rotated Principal Components
# retaining 5 components
library(psych)
fit <- principal(mydata, nfactors=5, rotate="varimax")
fit # print results
Note: mydata can be a raw data matrix or a covariance matrix. Pairwise deletion of missing data is
used. rotate can be "none", "varimax", "quartimax", "promax", "oblimin", "simplimax", or "cluster".

Factor Analysis

The factanal( ) function produces maximum likelihood factor analysis.


# Maximum Likelihood Factor Analysis entering raw data and extracting 3 factors, with varimax rotation
fit <- factanal(mydata, 3, rotation="varimax")
print(fit, digits=2, cutoff=.3, sort=TRUE)
# plot factor 1 by factor 2
load <- fit$loadings[,1:2]
plot(load,type="n") # set up plot
text(load,labels=names(mydata),cex=.7) # add variable names
Note: The rotation= options include "varimax", "promax", and "none". Add the option scores="regression"
or scores="Bartlett" to produce factor scores. Use the covmat= option to enter a correlation or covariance
matrix directly. If entering a covariance matrix, include the option n.obs=.
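
The covmat= route can be sketched with the built-in ability.cov object, a covariance list for six ability tests that already carries its own n.obs. Choosing this data set and two factors is an assumption for illustration.

```r
# Maximum likelihood factor analysis from a covariance matrix, 2 factors
fit <- factanal(factors = 2, covmat = ability.cov, rotation = "varimax")
print(fit, digits = 2, cutoff = 0.3, sort = TRUE)

L <- unclass(fit$loadings)     # plain 6 x 2 matrix of loadings
communality <- rowSums(L^2)    # variance of each test explained by the factors

print(round(communality, 2))
print(round(fit$uniquenesses, 2))
```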
The psych package offers a number of factor analysis related functions, including factor.pa( ) for
principal axis factoring.
# Principal Axis Factor Analysis
# (psych names the rotation argument rotate=; newer psych releases
# provide fa( ) with fm="pa" for the same method)
library(psych)
fit <- factor.pa(mydata, nfactors=3, rotate="varimax")
fit # print results
Determining the Number of Factors to Extract
# Determine Number of Factors to Extract
library(nFactors)
ev <- eigen(cor(mydata)) # get eigenvalues
ap <- parallel(subject=nrow(mydata),var=ncol(mydata),rep=100,cent=.05)
nS <- nScree(x=ev$values, aparallel=ap$eigen$qevpea)
plotnScree(nS)
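
As a rough cross-check that needs no extra package, the eigenvalues of the correlation matrix can be inspected directly. The sketch below applies the Kaiser rule (retain factors whose eigenvalue exceeds 1), a simpler heuristic than the parallel analysis above, swapped in here only for illustration; mtcars stands in for mydata.

```r
# Eigenvalues of the correlation matrix, returned in decreasing order
ev <- eigen(cor(mtcars))$values

n_kaiser <- sum(ev > 1)      # Kaiser rule: count eigenvalues above 1
prop_var <- ev / sum(ev)     # share of total variance per eigenvalue

print(round(ev, 2))
print(n_kaiser)
print(round(cumsum(prop_var), 3))
```

The Kaiser rule tends to be a coarse guide, so it is best read alongside the scree plot rather than in place of it.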

Cluster Analysis

Data Preparation
Prior to clustering data, you may want to remove or estimate missing data and rescale variables for
comparability.
# Prepare Data
mydata <- na.omit(mydata) # listwise deletion of missing
mydata <- scale(mydata) # standardize variables

Partitioning
K-means clustering is the most popular partitioning method. It requires the analyst to specify the
number of clusters to extract. A plot of the within groups sum of squares by number of clusters
extracted can help determine the appropriate number of clusters.
# Determine number of clusters
wss <- (nrow(mydata)-1)*sum(apply(mydata,2,var))
for (i in 2:15) wss[i] <- sum(kmeans(mydata, centers=i)$withinss)
plot(1:15, wss, type="b", xlab="Number of Clusters", ylab="Within groups sum of squares")
# K-Means Cluster Analysis
fit <- kmeans(mydata, 5) # 5 cluster solution
# get cluster means
aggregate(mydata,by=list(fit$cluster),FUN=mean)
# append cluster assignment
mydata <- data.frame(mydata, fit$cluster)
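
The partitioning steps can be run end to end as in the sketch below. It assumes the built-in USArrests data set, a 4-cluster solution, and a fixed random seed; nstart=10 is added so that each fit is the best of several random starts.

```r
set.seed(123)             # k-means starts from random centers
x <- scale(USArrests)     # standardize the four variables

# Within-groups sum of squares for 1 to 6 clusters
wss <- numeric(6)
wss[1] <- (nrow(x) - 1) * sum(apply(x, 2, var))
for (k in 2:6) wss[k] <- kmeans(x, centers = k, nstart = 10)$tot.withinss

# 4-cluster solution and its cluster means
fit <- kmeans(x, centers = 4, nstart = 10)
cluster_means <- aggregate(as.data.frame(x),
                           by = list(cluster = fit$cluster), FUN = mean)

print(round(wss, 1))
print(cluster_means)
```

Without a seed the cluster labels and sometimes the assignments change from run to run, which makes results hard to reproduce.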

Cluster Analysis

(continued..)

Hierarchical Agglomerative
# Ward Hierarchical Clustering
d <- dist(mydata, method = "euclidean") # distance matrix
fit <- hclust(d, method="ward.D") # "ward" was renamed "ward.D" in R 3.1.0
plot(fit) # display dendrogram
groups <- cutree(fit, k=5) # cut tree into 5 clusters
# draw dendrogram with red borders around the 5 clusters
rect.hclust(fit, k=5, border="red")
# Ward Hierarchical Clustering with Bootstrapped p values
# (pvclust clusters the columns of mydata, not the rows)
library(pvclust)
fit <- pvclust(mydata, method.hclust="ward",
method.dist="euclidean")
plot(fit) # dendrogram with p values
# add rectangles around groups highly supported by the data
pvrect(fit, alpha=.95)
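
A base-R sketch of the hierarchical workflow is given below. It assumes the built-in USArrests data set and four clusters, and uses method="ward.D2", which applies Ward's criterion to unsquared Euclidean distances; that choice differs from the "ward" setting above and is made here only for illustration.

```r
x <- scale(USArrests)               # standardize before computing distances
d <- dist(x, method = "euclidean")  # distance matrix between states

fit <- hclust(d, method = "ward.D2")
groups <- cutree(fit, k = 4)        # cluster membership for each state

print(table(groups))                # cluster sizes
print(head(groups))
```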


Cluster Analysis

(continued..)

Plotting Cluster Solutions


# K-Means Clustering with 5 clusters
fit <- kmeans(mydata, 5)
# Cluster Plot against 1st 2 principal components
# vary parameters for most readable graph
library(cluster)
clusplot(mydata, fit$cluster, color=TRUE, shade=TRUE, labels=2, lines=0)
# Centroid Plot against 1st 2 discriminant functions
library(fpc)
plotcluster(mydata, fit$cluster)


THANK YOU

