

CS102 Final Exam


March 18, 2019

The dataset used by Problems 10-12 & 14 is provided at the end of the exam. Three copies are
included; feel free to tear off the last three pages.

The exam has a total of 16 problems with a varying number of points per problem, for a total of
120 points in 120 minutes.

You may refer to your three pages of prepared notes. No additional notes, books, phones,
tablets, or laptops are permitted.

Please write your answers in the spaces provided on the exam. Make sure your answers are
neat and clearly marked. You may use the blank areas and backs of the exam pages for
ungraded scratch work.

Name _______________________________________________

SUNet ID ______________________________ @stanford.edu

In accordance with both the letter and spirit of the Honor Code, I have neither given nor
received assistance on this examination.

Signature _____________________________________________


Problem 2 (10 points) - The right tool for the job

For each of the scenarios in the table below, specify the technique you think is most suitable for
the particular problem. Assume you have a large dataset available for analysis or to train a
machine learning algorithm.
For each scenario, choose exactly one from the following seven techniques. Since there are 10
scenarios and 7 techniques, you will need to list some techniques more than once (and it is not
necessary to use all of them). Please just write the technique name with no explanation.

Regression - Classification - Clustering
Frequent itemsets - Association rules - Network analysis - Text analysis

Scenario 1: Find groups of four or more students who often study together
Most suitable technique: ____________________

Scenario 2: Predict a student’s final exam score based on their midterm score
Most suitable technique: ____________________

Scenario 3: Guess whether a set of answers on a student survey are sincere or fabricated
Most suitable technique: ____________________

Scenario 4: Predict how far a rumor can spread through online friends
Most suitable technique: ____________________

Scenario 5: Estimate the quality of a professor’s teaching based on course evaluation write-in comments
Most suitable technique: ____________________

Scenario 6: Divide a class into ten project groups, where each group is comprised of students with similar backgrounds and skills
Most suitable technique: ____________________

Scenario 7: Predict which major a student will choose based on club memberships, nationality, and dormitory
Most suitable technique: ____________________

Scenario 8: Find the fewest number of flights a student needs to take to get from SFO (the San Francisco airport) to their hometown airport
Most suitable technique: ____________________

Scenario 9: When two courses are often taken together in the same quarter, recommend a third course to take with them
Most suitable technique: ____________________

Scenario 10: Predict how many inches of annual rainfall are expected at each of the Pac-12 colleges based on elevation and distance from the ocean
Most suitable technique: ____________________


Problem 3 (6 points) - r and R²

Consider the following three scatterplots in x-y coordinate space. For each one, suggest rough
estimate values for r (the Pearson or correlation coefficient) and for R² (the coefficient of
determination) for a linear regression. Guess each value to at most two decimal places. Rather
than exact numbers, we will be looking for the right basic idea with the estimates, and that your
estimates are consistent with each other with respect to the plots.
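A useful consistency check (a standard fact about simple linear regression, not a hint about any particular plot): with a single predictor, R² equals r², so each R² estimate should be the square of your r estimate for the same plot.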

Plot 1. r: R²:

Plot 2. r: R²:

Plot 3. r: R²:


Problem 4 (3 points) - Regression and R²

Consider the following scatterplot, which we looked at in class a few times:

Is it possible to get a “perfect fit” (R² = 1) of these points using regression? If so, briefly explain
how. If not, briefly explain why not.


Problem 5 (6 points) - Classification with k-nearest-neighbors

The following dataset contains eight items with features x and y, and label color. (The grid to
the right is explained below.) For this problem, you can think of each item as a colored point on
the x-y plane.

x  y  color
1  1  blue
1  3  red
2  5  yellow
3  5  red
4  1  yellow
4  4  blue
5  3  yellow
5  4  red

Using this data as the training set, run the k-nearest-neighbors classification algorithm
(manually) to decide the most likely color for a new item with x = 3 and y = 3. The distance
between points is the actual distance on the x-y plane (also called Euclidean distance).

Although not required, you may find it helpful to sketch an x-y scatterplot with colored/labeled
points. We’ve provided a grid for your use, above.

If you run the algorithm with k=1, what color is assigned to the new item?

If you run the algorithm with k=4, what color is assigned to the new item?
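Although the problem is meant to be done by hand, here is a minimal Python sketch of the same computation for checking your work afterward (the data and query point come from the problem; the rest is just one straightforward way to code k-nearest-neighbors):

from collections import Counter

# Training items from the table: (x, y, color)
data = [(1, 1, 'blue'), (1, 3, 'red'), (2, 5, 'yellow'), (3, 5, 'red'),
        (4, 1, 'yellow'), (4, 4, 'blue'), (5, 3, 'yellow'), (5, 4, 'red')]

def knn_color(x, y, k):
    # Sort training items by Euclidean distance to the new point (x, y)
    nearest = sorted(data, key=lambda p: ((p[0] - x)**2 + (p[1] - y)**2) ** 0.5)
    # Majority vote among the k nearest neighbors
    votes = Counter(color for _, _, color in nearest[:k])
    return votes.most_common(1)[0][0]

print(knn_color(3, 3, k=1))
print(knn_color(3, 3, k=4))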


Problem 6 (12 points) - Classification with Naive Bayes

The following dataset is the same as in Problem 5, except instead of thinking about the items as
points on the x-y plane, we treat the first two columns as generic feature values.

F1  F2  color
1   1   blue
1   3   red
2   5   yellow
3   5   red
4   1   yellow
4   4   blue
5   3   yellow
5   4   red

Using this data as the training set, we’ll step through the process of Naive Bayes classification.

Step 1 (2 points) - Calculate the probability (frequency) that an item falls into each of the three
categories: blue, red, yellow. Please use decimals, and if it’s helpful note that ⅜ = 0.375.

blue

red

yellow



Step 2 (6 points) - For each category (blue, red, yellow), calculate the probability (frequency)
of each of the different feature values within that category. Please organize and label your
numbers so we can understand them easily. Use decimals and assume ⅓ = 0.33.

blue

F1:

F2:

red

F1:

F2:

yellow

F1:

F2:

Step 3 (4 points) - With the probabilities from Steps 1 and 2 and using the Naive Bayes method,
predict the most likely color for a new item with F1 = 1 and F2 = 4. Show your calculation, and
circle the color that the method would predict. Arithmetic calculations you may find useful are
given below the box. (If none of them are useful, you’re probably doing something wrong!)

blue

red

yellow

0.25 × 0.33 × 0.33 = 0.027225
0.25 × 0.33 × 0.5 = 0.04125
0.25 × 0.5 × 0.5 = 0.0625
0.375 × 0.33 × 0.33 = 0.0408375
0.375 × 0.33 × 0.5 = 0.061875
0.375 × 0.5 × 0.5 = 0.09375
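For checking your arithmetic after the exam, a minimal Python sketch of the same three steps (using exact fractions rather than the rounded 0.33 values, and no smoothing):

from collections import Counter

# Training items from the table: (F1, F2, color)
data = [(1, 1, 'blue'), (1, 3, 'red'), (2, 5, 'yellow'), (3, 5, 'red'),
        (4, 1, 'yellow'), (4, 4, 'blue'), (5, 3, 'yellow'), (5, 4, 'red')]

def naive_bayes_scores(f1, f2):
    counts = Counter(color for _, _, color in data)
    scores = {}
    for color, n in counts.items():
        prior = n / len(data)  # Step 1: category frequency
        # Step 2: frequency of each feature value within the category
        p_f1 = sum(1 for a, _, c in data if c == color and a == f1) / n
        p_f2 = sum(1 for _, b, c in data if c == color and b == f2) / n
        scores[color] = prior * p_f1 * p_f2  # Step 3: multiply; largest score wins
    return scores

print(naive_bayes_scores(1, 4))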


Problem 7 (4 points) - Classification with decision trees

Continuing with the same dataset as in Problem 6, consider the following decision tree for using
features F1 and F2 to predict color:

If we test the decision tree on our sample data (below), what accuracy will we achieve in our
predictions? We’re looking for a number between 0 and 1, like the measurements we made
when we experimented with the machine learning packages in Python.

Accuracy:

F1  F2  color
1   1   blue
1   3   red
2   5   yellow
3   5   red
4   1   yellow
4   4   blue
5   3   yellow
5   4   red
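The tree itself is a figure and is not reproduced here, but the scoring works as in this Python sketch; the predict function below is a hypothetical stand-in for the pictured tree, not the actual one:

# Test items from the table: (F1, F2, color)
data = [(1, 1, 'blue'), (1, 3, 'red'), (2, 5, 'yellow'), (3, 5, 'red'),
        (4, 1, 'yellow'), (4, 4, 'blue'), (5, 3, 'yellow'), (5, 4, 'red')]

def predict(f1, f2):
    # HYPOTHETICAL decision tree, for illustration only
    if f2 >= 5:
        return 'red'
    return 'blue' if f1 <= 1 else 'yellow'

# Accuracy = fraction of items whose predicted color matches the true label
correct = sum(1 for f1, f2, color in data if predict(f1, f2) == color)
print(correct / len(data))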


Problem 8 (4 points) - Classification short answer

Consider performing classification with two features F1 and F2, and label L, as in the previous
two problems. Suppose we have a training set with 1000 items.

If we use k-nearest-neighbors and set k=1000, what effect does this setting have on
classification of new items? Please answer in one sentence.

Is it possible to create a similar effect if we use decision trees? If yes, explain how; if no, explain
why not. Again, please limit your answer to one sentence.

Problem 9 (8 points) - Clustering

The four plots below are based on the European cities data we used in class,
specifically the merged cities and countries data we often called CitiesExt. The relevant features
of the data for this problem are:

● longitude
● latitude
● temperature
● EU
● coastline

As we did frequently in class, the plots show a dot for each city based on its longitude (x-axis)
and latitude (y-axis), creating a virtual map.

Each plot shows a clustering of the data into either three or five clusters, using either one or two
of the five features listed above for the distance function. (For clustering, EU and coastline are
converted from yes/no to scaled numeric values.) Below each plot, state which feature(s) must
have been used for the clustering. There is one correct answer for each plot.

If you are color blind or otherwise having trouble seeing the colors, please let us know.


Plot 1: Three clusters based on one feature. The feature is:

Plot 2: Five clusters based on two features. The features are:

Plot 3: Three clusters based on two features. The features are:

Plot 4: Three clusters based on two features. The features are:
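For reference, plots like these can be generated with scikit-learn. A minimal sketch, assuming the data sits in a file CitiesExt.csv with the five columns listed above (the file name is an assumption, and EU and coastline are taken to be already converted to scaled numbers, as the problem notes):

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Assumed file with columns: city, longitude, latitude, temperature, EU, coastline
cities = pd.read_csv('CitiesExt.csv')

# Cluster into three groups using a single candidate feature, e.g. temperature
cities['cluster'] = KMeans(n_clusters=3, n_init=10).fit_predict(cities[['temperature']])

# Draw the virtual map: one dot per city, colored by cluster
cities.plot.scatter(x='longitude', y='latitude', c='cluster', colormap='viridis')
plt.show()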


The next few problems use a dataset about courses that is provided at the end of the exam.
Three copies of the dataset are on the last three sheets. Feel free to tear them off.

Problem 10 (6 points) - R understanding

Suppose the courses dataset from the end of the exam is loaded into an R dataframe called
took, and the following R code is run:

A <- took[took$quarter == 'fall', c('student','course')]
B <- took[took$course == '102', c('student', 'quarter')]
merge(A,B)

What is the result?
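If the R idioms are unfamiliar, the same computation in pandas looks like the sketch below; the rows are hypothetical stand-ins, since the real dataset is printed at the end of the exam:

import pandas as pd

# Hypothetical rows with the same columns as the real dataset
took = pd.DataFrame({'student': ['Amy', 'Ben', 'Amy'],
                     'course':  ['102', '102', '103'],
                     'quarter': ['fall', 'winter', 'fall']})

A = took.loc[took['quarter'] == 'fall', ['student', 'course']]
B = took.loc[took['course'] == '102', ['student', 'quarter']]
# R's merge(A, B) joins on the columns A and B share, here 'student'
print(pd.merge(A, B, on='student'))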

Problem 11 (8 points) - R coding

Suppose again the courses dataset from the end of the exam is loaded into an R dataframe
called took. Write R code to find all students who took a 106 course, returning the student and
the quarter in which they took the course (but not the course itself). The result should be
returned in reverse alphabetical order by student name.

● For full credit, your solution must work even if more 106 courses are added -- 106C,
106D, etc.

● Alternatively, partial credit will be given for solutions that only work for 106A and 106B.
(Hint: the “or” operator in R is a vertical bar.)
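For intuition only (the graded answer must be in R), the equivalent query in pandas might look like this sketch, again with hypothetical rows:

import pandas as pd

# Hypothetical rows with the same columns as the real dataset
took = pd.DataFrame({'student': ['Amy', 'Ben', 'Cal'],
                     'course':  ['106A', '106B', '102'],
                     'quarter': ['fall', 'winter', 'spring']})

# startswith('106') also matches 106C, 106D, ... if they are added later
result = took.loc[took['course'].str.startswith('106'), ['student', 'quarter']]
print(result.sort_values('student', ascending=False))  # reverse alphabetical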


Problem 12 (14 points) - Data mining algorithms

Using the courses dataset at the end of the exam, consider courses to be items and students to
define transactions, i.e., in data-mining terminology a “basket” is a set of courses taken by one
student. Column quarter is not used in this problem.

(a) List all frequent itemsets of size two or more with support > 0.5.

(b) List all association rules that have two items (courses) on the left-hand side and one item
(course) on the right-hand side, with support > 0.5 and confidence > 0.5. Hint: Your answer for
part (a) can be a starting point for this problem.
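To sanity-check answers of this kind, support can be computed directly from its definition; a minimal Python sketch with hypothetical baskets (the real baskets come from the dataset at the end of the exam):

from itertools import combinations

# Hypothetical baskets: the set of courses taken by each student
baskets = {'Amy': {'101', '102', '103'},
           'Ben': {'101', '102'},
           'Cal': {'102', '103'}}

def support(itemset):
    # Fraction of baskets that contain every item in the itemset
    return sum(1 for b in baskets.values() if itemset <= b) / len(baskets)

# (a) frequent itemsets of size two or more with support > 0.5
all_courses = sorted(set().union(*baskets.values()))
for size in range(2, len(all_courses) + 1):
    for combo in combinations(all_courses, size):
        if support(set(combo)) > 0.5:
            print(set(combo), support(set(combo)))

# (b) for a rule {a, b} -> c: confidence = support({a, b, c}) / support({a, b})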


Problem 13 (6 points) - Data mining short answer

(a) In association rule mining, the concept of lift in place of confidence for rules S → i addresses
a shortcoming with the confidence metric that occurs in which scenario? Fill in one bubble:
◯ Some i occur very rarely in the data
◯ Some i occur very frequently in the data
◯ Some S occur very rarely in the data
◯ Some S occur very frequently in the data

(b) When lift is significantly less than 1.0 for a rule S → i, what does that tell us? Fill in one
bubble:
◯ When S occurs, i is more likely to occur than it occurs overall in the data set
◯ When S occurs, i is less likely to occur than it occurs overall in the data set
◯ It is not possible to have confidence > 0 for rule S → i

(c) The A-priori algorithm can be used to speed up which data mining algorithm(s)? Fill in one
bubble:
◯ Frequent itemsets
◯ Association rules
◯ Both frequent itemsets and association rules
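As a reminder of the standard definitions (general facts, not a hint about any particular bubble): confidence(S → i) = support(S ∪ {i}) / support(S), and lift(S → i) = confidence(S → i) / support(i), so lift compares how often i follows S with how often i occurs overall.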

Problem 14 (14 points) - Network analysis

Using the courses dataset at the end of the exam, draw an undirected graph where the nodes
are the students (provided), and there is an edge between two students if they took the same
course in the same quarter at least once. Then answer the questions on the next page based on
your graph.


What is the diameter of the graph?

How many cliques are there with at least three nodes?

What is the density of the graph (ratio of edges to possible edges)? Please express as a fraction.

Which node has the highest betweenness centrality?

True or False: In general (not just in the example data), a clique with three nodes represents
three students taking the same course in the same quarter at least once. Fill in one bubble:
◯ True ◯ False

Suppose the graph has been loaded into Python using the networkx package and named G,
and the following code is run (still using networkx):

B = list(G.neighbors('Ben'))
C = list(G.neighbors('Cal'))
for n1 in B + C:
    for n2 in G.neighbors(n1):
        if n2 != 'Ben' and n2 != 'Cal': print(n2)

What does the program print? (don’t worry about output ordering)
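For reference, the graph quantities asked about above can all be computed with networkx, as in this sketch; the edges below are hypothetical, since the real edges come from the dataset:

import networkx as nx

# Hypothetical edges: pairs of students who took the same course in the same quarter
G = nx.Graph()
G.add_edges_from([('Amy', 'Ben'), ('Ben', 'Cal'), ('Amy', 'Cal'), ('Cal', 'Dot')])

print(nx.diameter(G))                  # longest shortest-path distance
print(list(nx.find_cliques(G)))        # maximal cliques
print(nx.density(G))                   # edges / possible edges
print(nx.betweenness_centrality(G))    # betweenness for each node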


Problem 15 (8 points) - Text Analysis

We took two sentences from the CS102 course description and put them into a CSV file with a
header “description”:
[the two description rows are not reproduced here]

Now suppose we load the descriptions into a pandas dataframe called T (with two rows and one
column), we import the re package for regular expressions, and we run the following code:

for i in range(len(T)):
    text = T.loc[i].description
    s = re.search('techniques(.*)(apply|hands)', text)
    print(text[s.start():s.end()])

What does the program print?
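The two descriptions are not reproduced above, but re.search behaves as in this toy example with a made-up sentence; note that the greedy (.*) extends the match to the last occurrence of apply or hands:

import re

# Made-up sentence standing in for a course description
text = 'learn techniques for analysis and apply them in hands-on work'
s = re.search('techniques(.*)(apply|hands)', text)
# s.start() and s.end() delimit the matched span within text
print(text[s.start():s.end()])   # greedy match runs through 'hands'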

Now suppose we run the following code on the second description (line 1), which tokenizes
(line 2), removes punctuation (lines 3-4) and stopwords (lines 5-6), and creates trigrams (line 7).

text = T.loc[1].description
tokens = nltk.wordpunct_tokenize(text)
punct = list(string.punctuation)
tokens = [word for word in tokens if word not in punct]
stop = stopwords.words('english')
words = [word for word in tokens if word not in stop]
trigrams = nltk.ngrams(words, 3)

How many trigrams are there? Write one number. (For this problem, assume any word with
three or fewer characters is a stopword.)
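For reference, a self-contained version of the pipeline with the imports the snippet assumes; the text here is a placeholder, not the real description, and the stopwords corpus needs a one-time download:

import string
import nltk
from nltk.corpus import stopwords
# nltk.download('stopwords')  # uncomment on first run

text = 'a placeholder sentence standing in for the second course description'
tokens = nltk.wordpunct_tokenize(text)
punct = list(string.punctuation)
tokens = [word for word in tokens if word not in punct]
stop = stopwords.words('english')
words = [word for word in tokens if word not in stop]
trigrams = list(nltk.ngrams(words, 3))
print(len(trigrams))   # a list of n remaining words yields n - 2 trigrams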


Problem 16 (3 points) - Image Analysis

Consider an image that is 20 pixels by 30 pixels, and the color of each pixel is encoded in RGB
format. How many different images are possible? Feel free to provide an arithmetic expression
rather than the final number.
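One way to reason about the count, assuming the standard 8 bits per R, G, and B channel (the problem does not fix the bit depth, so this is an assumption): each pixel independently takes one of 256³ = 2²⁴ colors, and there are 20 × 30 = 600 pixels.

# 600 pixels, each with 256 choices per channel (8-bit RGB assumption)
print((256 ** 3) ** (20 * 30))   # = 2 ** 14400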

Problem 17 (8 points) - Short answer from week 10

(a) What techniques are used to handle data at very large scale? Fill in one or more bubbles:

◯ Run on special hardware
◯ Use special programming languages
◯ Use parallel execution

(b) True or false: reinforcement learning requires labeled training data. Fill in one bubble:

◯ True
◯ False

(c) According to Vera, what was the key to the second-generation “AlphaGo Zero” system that
beat the world Go champion? Fill in one bubble:

◯ Lots of data from previous games played by humans
◯ The system playing the world champion many times and learning his strategies
◯ The system playing itself many, many times and learning what strategies work

(d) Which of the following simple prediction strategies beat quite a few (more than 20) of the
student submissions on the Project #2 leaderboard, for fractional ratings? Fill in one or more
bubbles:

◯ Always predicting 3
◯ Always predicting 4
◯ Always predicting the average rating for the movie
◯ Always predicting the average rating for the user
◯ Choosing a random number between 1 and 5


The following data is used for Problems 10-12 & 14. It contains (fictitious) information about
students and the CS courses they took in each quarter.

We’re including three copies of the same data for your convenience.
Feel free to tear off these three pages.

Took
[table of students, courses, and quarters not reproduced here]