Professional Documents
Culture Documents
Basic Matrix Algebra and Implementing in R: Ax X A Ix
Basic Matrix Algebra and Implementing in R: Ax X A Ix
Basic Matrix algebra and implementing in R
• Addition , multiplication of matrices
• Trace, determinant of a matrix
• Inverse of a matrix – solution of a system of linear equations
• Linear (in)dependence of vectors
• Rank of a matrix
• Eigenvalue and eigenvectors:
22
1 3 12
x1 4 ; x2 12 ; x3 2 x2 3 x1 , x3 2 x1 10
2 6 14
1 2 1
y1 1 ; y2 1 ; y3 2
2 1 1
|| x || x ' x A x x ( A I )x 0
p p p 1 1 1 p 1
A x1 1 x1 A x2 1 x2
for any constant c, x2 cx1 is also an eigenvector of A corresponding to 1
x, y are said to be linearly related if there exists constants a and b such that
y ax b px qy r
23
1
30‐06‐2021
Practice with datasets
• Husband‐wife data
• US air pollution
• January temperature data in cities
• IRIS data: petal and sepal length and width of 3 specifies of IRIS flowers (50
each)
• Pottery
26
How to deal with missing data
With many variables being measured, there is increasing possibility of observations being
missing for some variables.
Option 1: Complete‐case analysis – Deleting any observations that have any
missing values for any variable
ok, if missing values are rare
Problem 1: With many variables being measured, this can lead to a lot of observations being deleted
and highly reduced sample size.
Problem 2: Can lead to biased estimates unless the missing data are missing completely at random.
Option 2: Imputation ‐ “fill in” missing values using `sound’ statistical principle
not a panacea
what is a ‘sound’ principle?
Do not employ analytics mechanically – need to understand methodology
Can overestimate impact of sample size
27
2
30‐06‐2021
Humans are good at discerning
subtle patterns that are really
there, but equally so at
Multivariate imagining them when they are
altogether absent.
Data Carl Sagan
Visualization
We see only what we like to see
– we hear what we like to hear ‐
the rest flows away.
28
29
3
30‐06‐2021
Scatter plot
The simplest such plot is the 2‐dimensional scatterplot.
A scatterplot can be enhanced with:
rug plots to show marginal distributions
labeling points with abbreviations to identify interesting observations
a regression line or curve overlain on the plot
Pollutions in US cities
30
31
4
30‐06‐2021
Scatterplot
matrix using
Pairs
32
33
5
30‐06‐2021
Bivariate Boxplot
Instead of “boxes,” it displays two ellipses centered at the same point (the bivariate center).
The smaller ellipse contains the “central 50%” of the data and the wider ellipse
contains all the data except the “outliers.”
Two regression lines are displayed (the regression of Y on X and that of X on Y ).
The intersection of the regression lines is the bivariate center of the data.
If the angle between the lines is highly acute, this indicates a strong correlation
between the two variables.
If the angle is almost a right angle, the correlation between the two variables is weak.
This plot can be done with the bivbox function (EH).
Options exist to use robust measures of center, spread, and association when constructing
the bivariate boxplot.
sy sx
m1 r ; m2 r m1m2 r 2
sx sy
34
Bivariate
Box‐plot
35
6
30‐06‐2021
Air pollution in US cities: impact of outliers
• Correlation between population and # manufacturing
enterprises(>=20 workers) increases from 0.7955 to 0.9552 because
of 4 cities: Chicago, Detroit, Cleveland and Philadelphia
36
Convex Hull
When examining a data set, we sometimes seek
robust measures which ignore outliers.
With multivariate data, we define the convex hull
of the data set as the points on the edge of the
smallest (convex) perimeter surrounding the data
Trimming off these outlying observations on the
boundary of convex hull may display the
relationship between the variables better.
Robust summary measurements (such as the
correlation) can be obtained from this trimmed
data set.
Finding the convex hull can be done with the chull
function in R.
37
7
30‐06‐2021
The chi‐plot is a graphical
method for judging whether
two variables are independent, Chi‐plot
based on sample data.
It is based on whether larger
values of one variable tend to
occur with larger (or smaller)
values of the other variable
This plot can be done with the
chiplot function (Everitt).
When the two variables are
truly independent, the points in
the chi‐plot will fall within a
central region.
If the points fall outside this
center region, this indicates the
two variables may be related.
38
Plots to Display > 2 Variables
A bubble plot can show three variables on a regular 2‐D plot.
Two variables are plotted as usual on a scatterplot.
Circular bubbles are drawn around each point, with the size of the bubble
representing the value of a third variable for that observation.
The bubbles can be added using the symbols functions in R.
A scatterplot matrix is a q × q array of individual 2‐D scatterplots.
This shows the relationship between all possible pairs of variables, but not any
3‐way associations, for example.
This can be done in R with the pairs function.
39
8
30‐06‐2021
Conditioning Plots / Trellis graphics
These plots show scatterplots of two variables, conditional on the
value of a third variable.
These are separate scatterplots for different values of the third
variable.
These are especially useful if the third variable is a categorical
(grouping) variable.
This can be done in R with the coplot function or with the xyplot
function in the lattice package.
40
Trellis
graphics
41
9
30‐06‐2021
pollution <‐ with(USpol, equal.count(SO2,4))
plot(cloud(precip ~ temp * wind | pollution,
panel.aspect = 0.9, data = USpol,
main="3D plots of temp, wind, and precip
conditioned on levels of SO2"))
42
Star Plots
In a star plot, the magnitudes of p variables can be represented graphically for
each observation.
There are p points on the star, and the length of each point represents the value
for that observation.
may be useful to scale or standardize the variables first.
This can be done in R with the stars function.
Alternatively, the stars could be placed on a 2‐D scatterplot, and (like with a
bubble plot), stars having p − 2 points could be placed over each observa on.
43
10
30‐06‐2021
44
45
11
30‐06‐2021
46
Star plot of
the air
pollution
data.
47
12
30‐06‐2021
48
Stalactite plot [EH page 54‐55]
49
13
30‐06‐2021
Three‐Dimensional Scatterplots
3‐D scatterplots can be drawn in R (using the cloud function in the
lattice package) , but they are often not easy to interpret visually.
Using “drop lines” may make such plots more interpretable.
50
Chernoff Faces
With Chernoff Faces, each multivariate observation is displayed with
a face.
p variables are represented using p specific characteristics of the
face (eye size, mouth, angle, nose size, etc.).
For example, a wide mouth might correspond to a large value for the
fifth variable, say.
Can be done in R with the faces function, in the aplpack/
TeachingDemos package.
51
14
30‐06‐2021
Profile Plots
Profile plots are a simple way to display several multivariate observations
graphically.
They show connected line segments where the height at each joining
point is the value of a variable for that observation.
Another form of profile plot uses a series of bars to represent the variable
values for each multivariate observation.
These work best when there are relatively few observations (and
variables) so that the plot looks “cleaner”.
Plotting the different observations in different colors (or line types) helps
distinguish the observations.
Profile plots for the same data set can appear different visually when the
variables are reordered (a possible weakness).
52
Andrews Plot
These convert each data vector into a Fourier series, and the resulting curves are
plotted together.
a1 , a2 , a3 , a4 , a5
For a data vector , the corresponding Fourier curve would be
a1
f (t ) a2 sin t a3 cos t a4 sin 2t a5 cos 2t t [ , ]
2
The order of the variables does affect the appearance of the plot.
Advantage: The distances between curves in an Andrews plot reflect correctly
the pairwise distances between observations.
Disadvantage: The roles of the individual variables are not as apparent in
Andrews plots as in the other types of plots.
53
15