
Homework 1

Homework must be presented clearly; scratch-paper notes will not be accepted. Submit the homework
solution as a .docx, .doc, or .pdf file. You may also add your code as part of the supplemental
material. You must include your code (if any), but do not include screenshots of the software
input/output. Raw output may contain information irrelevant to the question and can lead
to a wrong interpretation. To be considered for full credit, the homework must be submitted before the due date.
1. Indicate whether we would generally expect the performance of a flexible statistical learning method
to be better or worse than an inflexible method. Justify your answer.

(a) The sample size n is extremely large, and the number of predictors p is small.
(b) The number of predictors p is extremely large, and the number of observations n is small.
(c) The relationship between the predictors and response is highly non-linear.
(d) The variance of the error terms, i.e. σ² = Var(ϵ), is extremely high.

2. Explain carefully
(a) Describe three real-life applications in which classification might be useful. Describe the re-
sponse, as well as the predictors. Is the goal of each application inference or prediction?
Explain your answer.
(b) Describe three real-life applications in which regression might be useful. Describe the response,
as well as the predictors. Is the goal of each application inference or prediction? Explain your
answer.
(c) Describe three real-life applications in which cluster analysis might be useful.
3. The Boston data set is part of the ISLR2 library in R and the ISLP library in Python. You will need to
install the ISLR2 or ISLP package. To install ISLR2, use the install.packages() function in R.
Assuming you have Python 3 installed, type pip install ISLP; this also installs most other packages
needed in the labs.

(a) How many rows are in this data set? How many columns? What do the rows and columns
represent?
(b) Make some pairwise scatterplots of the predictors (columns) in this data set. Describe your
findings.
(c) Apply any other multivariate visualization technique. Describe your results.
(d) Are any of the predictors associated with per capita crime rate? If so, explain the relationship.
(e) Do any of the census tracts of Boston appear to have particularly high crime rates? Tax rates?
Pupil-teacher ratios? Comment on the range of each predictor.
(f) How many of the census tracts in this data set bound the Charles river?
(g) What is the median pupil-teacher ratio among the towns in this data set?

ESMA 6835 Topics of Statistics Dr. Almodóvar-Rivera

(h) Which census tract of Boston has the lowest median value of owner-occupied homes? What are
the values of the other predictors for that census tract, and how do those values compare to
the overall ranges for those predictors? Comment on your findings.
(i) In this data set, how many of the census tracts average more than seven rooms per dwelling?
More than eight rooms per dwelling? Comment on the census tracts that average more than
eight rooms per dwelling.
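
As a starting point for question 3 in Python, the sketch below shows one way to load the data and compute a few of the requested counts and medians with pandas. It is a minimal sketch under assumptions, not a model solution: the helper names `load_boston` and `basic_summary` are hypothetical, the column names `chas`, `ptratio`, `rm`, and `medv` are assumed to match the ISLP version of the data (check them with `df.columns`), and the small `toy` frame is made-up data used only to demonstrate the calls.

```python
import pandas as pd


def load_boston() -> pd.DataFrame:
    """Load the Boston data through ISLP (requires `pip install ISLP`)."""
    # Imported lazily so the helper below also works when ISLP is not installed.
    from ISLP import load_data
    return load_data("Boston")


def basic_summary(df: pd.DataFrame) -> dict:
    """Collect a few quantities of the kind asked about in question 3.

    Assumes columns named chas, ptratio, and rm exist in df.
    """
    return {
        "n_rows": df.shape[0],
        "n_cols": df.shape[1],
        "tracts_on_river": int((df["chas"] == 1).sum()),
        "median_ptratio": float(df["ptratio"].median()),
        "rooms_over_7": int((df["rm"] > 7).sum()),
        "rooms_over_8": int((df["rm"] > 8).sum()),
    }


# Tiny made-up frame (NOT the real Boston data), used only to show the calls.
toy = pd.DataFrame({
    "crim": [0.1, 3.2, 0.5, 8.9],
    "chas": [0, 1, 0, 1],
    "rm": [6.5, 7.2, 8.1, 5.9],
    "ptratio": [15.0, 18.0, 20.0, 21.0],
    "medv": [24.0, 30.1, 45.0, 10.2],
})

summary = basic_summary(toy)
# Row with the lowest median home value, with all of its other predictors.
lowest_medv_row = toy.loc[toy["medv"].idxmin()]
```

With ISLP installed, `basic_summary(load_boston())` should apply the same steps to the real data, and `df.describe()` gives the range of every predictor for comparison.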

4. Sales of Riding Mowers: Scatter Plots. A company that manufactures riding mowers wants to
identify the best sales prospects for an intensive sales campaign. In particular, the manufacturer
is interested in classifying households as prospective owners or nonowners on the basis of Income
(in $1000s) and Lot Size (in 1000 ft²). The marketing expert looked at a random sample of 24
households, given in the file RidingMowers.csv.

(a) Using R/Python, create a scatter plot of Lot Size vs. Income, color-coded by the outcome
variable owner/nonowner. Make sure to obtain a well-formatted plot (create legible labels and
a legend, etc.).
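
For the Python route, a color-coded scatter plot of this kind could be sketched with pandas and matplotlib as below. This is a sketch under assumptions: the column names `Income`, `Lot_Size`, and `Ownership` are guesses about RidingMowers.csv and should be checked against the file, `plot_mowers` is a hypothetical helper name, and the `toy` rows are made-up values rather than the real sample.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line to view plots on screen
import matplotlib.pyplot as plt
import pandas as pd


def plot_mowers(df, x="Income", y="Lot_Size", group="Ownership"):
    """Scatter plot of y against x with one color and legend entry per group."""
    fig, ax = plt.subplots(figsize=(6, 4))
    # Each call to scatter gets the next color in matplotlib's default cycle.
    for label, subset in df.groupby(group):
        ax.scatter(subset[x], subset[y], label=label)
    ax.set_xlabel("Income ($1000s)")
    ax.set_ylabel("Lot Size (1000 ft²)")
    ax.set_title("Riding mower ownership by income and lot size")
    ax.legend(title="Ownership")
    return fig, ax


# Made-up rows (NOT the real sample), only to demonstrate the call.
toy = pd.DataFrame({
    "Income": [60.0, 85.5, 43.2, 110.1],
    "Lot_Size": [18.4, 16.8, 20.4, 19.2],
    "Ownership": ["Owner", "Owner", "Nonowner", "Nonowner"],
})
fig, ax = plot_mowers(toy)
```

With the real file, `plot_mowers(pd.read_csv("RidingMowers.csv"))` followed by `fig.savefig("mowers.png")` would produce an image to include in the write-up, assuming the column names match.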

5. Laptop Sales at a London Computer Chain: Bar Charts and Boxplots. The file
LaptopSalesJanuary2008.csv contains data for all sales of laptops at a computer chain in London in January 2008.
This is a subset of the full dataset that includes data for the entire year.
(a) Create a bar chart, showing the average retail price by store. Which store has the highest
average? Which has the lowest?
(b) To better compare retail prices across stores, create side-by-side boxplots of retail price by
store. Now compare the prices in the two stores from (a). Does there seem to be a difference
between their price distributions?
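
One possible way to build both charts in Python is sketched below with pandas and matplotlib. The column names `Store Postcode` and `Retail Price` are assumptions about LaptopSalesJanuary2008.csv, `plot_price_by_store` is a hypothetical helper, and the store labels and prices in `toy` are invented for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line to view plots on screen
import matplotlib.pyplot as plt
import pandas as pd


def plot_price_by_store(df, store_col="Store Postcode", price_col="Retail Price"):
    """Bar chart of mean price per store next to side-by-side boxplots."""
    means = df.groupby(store_col)[price_col].mean().sort_values()
    stores = list(means.index)

    fig, (ax_bar, ax_box) = plt.subplots(1, 2, figsize=(10, 4))

    # Left panel: average retail price by store, sorted from lowest to highest.
    ax_bar.bar([str(s) for s in stores], means.to_numpy())
    ax_bar.set_xlabel("Store")
    ax_bar.set_ylabel("Average retail price")

    # Right panel: one boxplot per store, in the same order as the bars.
    data = [df.loc[df[store_col] == s, price_col].dropna().to_numpy() for s in stores]
    ax_box.boxplot(data)
    ax_box.set_xticks(range(1, len(stores) + 1))
    ax_box.set_xticklabels([str(s) for s in stores])
    ax_box.set_xlabel("Store")
    ax_box.set_ylabel("Retail price")

    fig.tight_layout()
    return means, fig


# Invented stores and prices (NOT the real data), only to demonstrate the call.
toy = pd.DataFrame({
    "Store Postcode": ["A", "A", "B", "B", "B", "C", "C"],
    "Retail Price": [400.0, 500.0, 450.0, 470.0, 490.0, 300.0, 350.0],
})
means, fig = plot_price_by_store(toy)
highest, lowest = means.idxmax(), means.idxmin()
```

Keeping the bars and boxes in the same sorted order makes it easier to compare the two stores identified in part (a) when looking at part (b).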
6. Laptop Sales at a London Computer Chain: Interactive Visualization. The next exercises are
designed for using an interactive visualization tool. The file LaptopSales.txt is a comma-separated
file with nearly 300,000 rows. ENBIS (the European Network for Business and Industrial Statistics)
provided these data as part of a contest organized in the fall of 2009. Scenario: Imagine that you
are a new analyst for a company called Acell (a company selling laptops). You have been provided
with data about products and sales. You need to help the company with their business goal of
planning a product strategy and pricing policies that will maximize Acell’s projected revenues in
2009. Using an interactive visualization tool, answer the following questions.
(a) Price Questions:
i. At what price are the laptops actually selling?
ii. Does price change with time? (Hint: Make sure that the date column is recognized as such.
The software should then enable different temporal aggregation choices, e.g., plotting the
data by weekly or monthly aggregates, or even by day of week.)
iii. Are prices consistent across retail outlets?
iv. How does price change with configuration?
(b) Location Questions:
i. Where are the stores and customers located?
ii. Which stores are selling the most?
iii. How far would customers travel to buy a laptop?
iv. Try an alternative way of looking at how far customers traveled. Do this by creating a
new data column that computes the distance between customer and store.


(c) Revenue Questions:
i. How does the sales volume in each store relate to Acell’s revenues?
ii. How does this relationship depend on the configuration?
(d) Configuration Questions:
i. What are the details of each configuration? How does this relate to price?
ii. Do all stores sell all configurations?
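
For the new distance column in question 6(b)iv, the calculation itself can be sketched outside an interactive tool. The example below uses only the Python standard library and assumes that customer and store locations are given as planar grid coordinates in the same units, so that straight-line (Euclidean) distance is meaningful. The column names `customer X`, `customer Y`, `store X`, and `store Y`, the helper `add_distance`, and the two sample rows are all hypothetical; check the header of LaptopSales.txt for the actual names.

```python
import csv
import io
import math

# Two made-up rows (NOT from LaptopSales.txt) in comma-separated form.
SAMPLE = """customer X,customer Y,store X,store Y
530000,180000,530300,180400
529000,181000,529000,181000
"""


def add_distance(rows, cx="customer X", cy="customer Y", sx="store X", sy="store Y"):
    """Return copies of the rows with a straight-line 'distance' column added."""
    out = []
    for row in rows:
        dx = float(row[cx]) - float(row[sx])
        dy = float(row[cy]) - float(row[sy])
        # math.hypot computes sqrt(dx**2 + dy**2).
        out.append({**row, "distance": math.hypot(dx, dy)})
    return out


rows = add_distance(csv.DictReader(io.StringIO(SAMPLE)))
```

For the real file, the same function could be applied to `csv.DictReader(open("LaptopSales.txt", newline=""))`. If the coordinates turn out to be latitude and longitude rather than grid coordinates, a great-circle formula would be more appropriate than the Euclidean distance used here.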
