
Assignment 1

Total Marks: 100; Weighting: 20%


Due: 14/4/20 11.55pm

Instructions:

• Submit only one file in pdf format to the link on the Study Desk.
• Assume that your report will be read by someone familiar with the data set but
with limited statistical knowledge. Fully explain plots and when stating statistics
or results explain what they mean statistically AND in context of the data.
• Presentation should be neat, consistent, spell-checked and proofread. All
questions should be clearly labelled, and all answers should clearly and concisely
address the questions.
• If you convert a Word document to pdf for submission check that all symbols,
equations etc. have converted correctly, i.e., proofread your work.
• All answers must be typed – do not include handwritten/scanned or stylus/tablet
written responses in your document.
• If you do not use knitr to compile your submission, where asked to provide R
code, paste relevant code within the assignment document and italicise (or
otherwise highlight or distinguish from other content). Do not include code in an
appendix.
• Do not include an appendix at all. Any work included in an appendix will not be
marked.
• Please note that referencing textbooks and other resources is not the goal of this
assessment. This work requires students to demonstrate their understanding of
the analysis and interpretation, not provide quotes from resources.
• When interpreting output, you are expected to do so in context of the data and
the method (i.e. ensure you comment on aspects of the method that affect your
interpretation with respect to the variables and sample).
• A maximum of 10 marks will be deducted from your total marks for poor
presentation.

Marks:
• Question 1: 25
• Question 2: 20
• Question 3: 30
• Question 4: 25

Data File:
The same data set will be used for all four questions in this assignment.
The data file ‘europegroup.txt’ contains data for the percentage of employment by
country (n=30 countries). The first variable identifies the region of the country (Group)
and the next nine variables represent different employment sectors: AGR=agriculture,
forests and fishing; MIN=mining; MAN=manufacturing; PS=power and water supplies;
CON=construction; SER=services; FIN=finance; SPS=social and personal services;
TC=transport and communication. Although you may not find these data to be MVN in
Question 1, you should proceed with all analysis requested in Questions 2 to 4 assuming
MVN, and comment on this limitation where relevant.
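
As a starting point, the data might be read into R along the following lines. This is a minimal sketch only: it assumes the file is whitespace-delimited, has a header row of variable names, and sits in the working directory, none of which is stated above. The helper name read_employment is hypothetical.

```r
# Hypothetical helper: reads the employment data and treats Group as categorical.
# Assumes a whitespace-delimited file with a header row; adjust if the file differs.
read_employment <- function(path) {
  dat <- read.table(path, header = TRUE)
  dat$Group <- factor(dat$Group)  # region is a label, not a quantity
  dat
}

# Assumed location: the current working directory.
if (file.exists("europegroup.txt")) {
  europe <- read_employment("europegroup.txt")
  str(europe)  # compact summary of variable names, types and first values
}
```

Storing Group as a factor keeps the region codes from being treated as numbers in later plots and models.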

Question 1 (25 marks):


Provide R code, output and written interpretation for parts a) to d) and part g) of this
question. Provide only output that is directly relevant to address each section.
Test for multivariate normality (MVN) by:

a) Describe the structure of the ‘europegroup.txt’ data. (1 mark total)


b) Produce (2 marks) and interpret (2 marks) univariate QQ plots and histograms and
univariate Shapiro-Wilk tests of normality for each of the nine employment
variables. Which are the most non-normally distributed variables (1 mark)? (5 marks
total)
c) Produce (1 mark) and interpret (1 mark) perspective and contour plots for the SER
and FIN variables. What is an inherent problem with using these plots to assess MVN
(1 mark)? (3 marks total)
d) Perform the analysis necessary to provide the results of the Mardia, Henze-Zirkler
and Royston tests of MVN based on all nine employment variables. Include in your
interpretation: (8 marks total)
i. The Chi-Square QQ plot (1 mark) and interpretation (1 mark)
ii. Describe how the QQ plot is constructed and its relationship to the univariate
normal QQ plots (2 marks).
iii. Output and interpretation for the 3 tests (3 marks).

iv. What is a key limitation of these MVN statistical tests (1 mark)?

e) One way to try and meet the MVN assumption could be to remove some of the
variables from the multivariate analysis (do not perform this analysis). Suggest three
additional ways that you might improve univariate and multivariate normality for
data sets in general. (3 marks total)

f) In part e) we suggested removing some variables to try and help the data approach
MVN. Suggest one other reason why reducing the number of variables used in
multivariate analysis may be important (this question does not relate specifically to
this particular data set)? (2 marks total)
g) In part e) we suggested removing some variables to try and help the data approach
MVN. Check to see if the data is MVN if only those variables that are univariate
normal (UVN) are used (1 mark). Is your result reasonable given your understanding
of the relationship between UVN and MVN (2 marks)? (3 marks total)

Question 2 (20 marks):


For all of question 2, do not use all nine employment variables – use only those 4
identified in Question 1 as UVN.
Provide R code, output and written interpretation for parts a) to e) of this question.

a) Produce a draftsman display for the employment variables. Use the function
scatterplotMatrix (from week 2) and check the help documentation
(?scatterplotMatrix) to help you produce a plot with observations grouped by regional
group using different colours and include the associated legend. Your plot should not
include smoothing, regression lines, or distribution curves in the diagonal panels of
the plot (1 mark). Interpret these plots, relating back to the original data where it
may add to the interpretation (2 marks). What are the y and x axes on plot [3,2] of
the draftsman plot (1 mark)? (4 marks total)
b) In the context of MANOVA, list the dependent and independent variables (1 mark)
and define the relationship that the MANOVA would test (1 mark). (2 marks total)
c) Using MANOVA in R, test for differences in ‘percentage of employment’ between the
four country regions. Include tests using all four test statistics covered in this course
(2 marks) and interpret output (3 marks). (5 marks total)
d) Explain how Wilks’ Lambda statistic is calculated and why a small statistic is likely to
indicate significant differences between at least some groups (2 marks). Which of the
four tests used in part c) would be the best to interpret if there are concerns about
multivariate normality or covariance equality (1 mark)? (3 marks total)
e) Produce output that specifically compares each of the regions (Group) with each
other (you should have 6 comparisons) using Hotelling’s T² test and a significance
level of 0.05 (2 marks). Determine the multiple test corrected significance level (1
mark). Do not provide R output; instead reproduce and complete the following table
for all comparisons and interpret. What were the sample sizes for each region and
how may sample sizes have affected these results and those in part c) (2 marks)?
Will deviation from MVN influence these results (1 mark)? (6 marks total)

Comparison            Hotelling’s   Significant   Significant after
Region 1   Region 2   p-value       (Y/N)         correction (Y/N)
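
The first two columns of this table can be laid out programmatically. The sketch below only enumerates the region pairs; it does not run any tests. The region labels are hypothetical placeholders, since the actual coding of Group in the data file is not given here.

```r
# Hypothetical region labels; the actual coding of Group in the data may differ.
regions <- c("1", "2", "3", "4")

# All unordered pairs of regions, one pair per row.
pairs <- t(combn(regions, 2))
colnames(pairs) <- c("Region 1", "Region 2")

# check.names = FALSE keeps the spaces in the column names.
comparisons <- data.frame(pairs, check.names = FALSE)
print(comparisons)
```

With four regions this gives choose(4, 2) rows, one per comparison.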

Question 3 (30 marks):


For all of question 3, do not use all nine employment variables – use only those 4
identified in Question 1 as UVN.
Provide R code, output and written interpretation for parts a) to e) of this question.

a) Produce (2 marks) and interpret (2 marks) the correlation and covariance matrices.
Explain the difference between these matrices in detail (i.e. explain
clearly how the values are adjusted mathematically and the effect of these changes)
(2 marks). Would using the covariance matrix in PCA on this data be appropriate (1
mark)? Why (1 mark)? (8 marks total)
b) Perform PCA on the 4 employment variables using the prcomp function.
Provide the eigenvalues (1 mark), % variation (1 mark) and scree plot (1 mark).
Interpret each of these results (3 marks) and discuss how they influence your
decision on how many PCs to interpret from this analysis (2 marks). Remember to
keep in mind the overall purpose of PCA (8 marks total).
c) Interpret (2 marks) the first PC. Include the Z equation (1 mark) and a plot of the
loadings on the first PC in your answer (1 mark). (4 marks total)
d) What is the correlation between the first and second PCs and what does this tell you?
(2 marks total)
e) Produce (1 mark) and interpret (2 marks) a biplot based on the first 2 PCs. In
particular, explain your interpretation of the employment variables in country 19
compared to country 9 (1 mark). Relate your interpretation back to the original data
(1 mark). (5 marks total)
f) Was this a useful analysis for this data set? Explain with specific reference to the
results of your prior analysis in this question. (3 marks total)

Question 4 (25 marks):
For all of question 4, do not use all nine employment variables – use only those 4
identified in Question 1 as UVN.
Provide R code, output and written interpretation for parts a) to e) of this question.

a) Perform a Factor Analysis using the factanal function. Initially use the number of
components you identified as informative in Question 3 (do not use parallel analysis
to help inform your decision here) and apply no rotation. You will get an error
message. In order to problem solve this issue and make further decisions about your
analysis you will need to have read the additional notes available in the Week 6 block
on the Study Desk called “notes on df limiting number of factors.pdf”. Provide your
initial line of code, subsequent error message and your final line of code that
successfully performs the factanal analysis (2 marks). What did you need to change
and why (2 marks)? (4 marks total)
b) From your successful factanal analysis in part a) provide output and interpretation for
(8 marks total):
• Variance explained (2 marks)
• Chi-square test (2 marks)
• Variable loadings (2 marks)
• Difference in uniqueness values for the variables FIN and SPS (2 marks)
c) How would your results change if you applied a rotation? Explain your reasoning. (4
marks total)
d) Perform parallel analysis using a seed value of 245 and 500 iterations. Produce the
scree plot for the PC results only (1 mark). Discuss how many PCs are recommended
by this analysis and use the plot to help you explain these results (2 marks). As part
of your explanation provide the values for the 95th percentile for components 1 and 2
(1 mark). (4 marks total)
e) Explain in your own words how the parallel analysis works. (5 marks total)
