Practice Exam - Gradescope Ver.
The exam has a total of 16 problems with a varying number of points per problem, for a total of
120 points in 120 minutes.
You may refer to your three pages of prepared notes. No additional notes, books, phones,
tablets, or laptops are permitted.
Please write your answers in the spaces provided on the exam. Make sure your answers are
neat and clearly marked. You may use the blank areas and backs of the exam pages for
ungraded scratch work.
Name _______________________________________________
In accordance with both the letter and spirit of the Honor Code, I have neither given nor
received assistance on this examination.
Signature _____________________________________________
For each of the scenarios in the table below, specify the technique you think is most suitable for
the particular problem. Assume you have a large dataset available for analysis or to train a
machine learning algorithm.
For each scenario, choose exactly one from the following seven techniques. Since there are 10
scenarios and 7 techniques, you will need to list some techniques more than once (and it is not
necessary to use all of them). Please just write the technique name with no explanation.
Consider the following three scatterplots in x-y coordinate space. For each one, suggest rough
estimate values for r (the Pearson correlation coefficient), and for R2 (the coefficient of
determination) for a linear regression. Guess each value to at most two decimal places. Rather
than exact numbers, we will be looking for the right basic idea with the estimates, and that your
estimates are consistent with each other with respect to the plots.
r:
R2 :
r:
R2 :
r:
R2 :
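A useful consistency check for these estimates: for simple linear regression, R2 is exactly r squared. The sketch below computes both quantities from scratch on a small set of hypothetical points (not taken from the plots, which are not reproduced here):

```python
from math import sqrt

def pearson_r(xs, ys):
    # Pearson correlation: covariance divided by the product
    # of the standard deviations.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical points with a loose upward trend.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
r = pearson_r(xs, ys)
print(round(r, 2))      # 0.77
print(round(r * r, 2))  # 0.6 -- for simple linear regression, R2 = r^2
```

So an estimate pair like r = 0.77, R2 = 0.30 would be internally inconsistent, whatever the plot looks like.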
Is it possible to get a “perfect fit” (R2 = 1) of these points using regression? If so, briefly explain
how. If not, briefly explain why not.
The following dataset contains eight items with features x and y, and label color. (The grid to
the right is explained below.) For this problem, you can think of each item as a colored point on
the x-y plane.
x y color
1 1 blue
1 3 red
2 5 yellow
3 5 red
4 1 yellow
4 4 blue
5 3 yellow
5 4 red
Using this data as the training set, run the k-nearest-neighbors classification algorithm
(manually) to decide the most likely color for a new item with x = 3 and y = 3. The distance
between points is the actual distance on the x-y plane (also called Euclidean distance).
Although not required, you may find it helpful to sketch an x-y scatterplot with colored/labeled
points. We’ve provided a grid for your use, above.
If you run the algorithm with k=1 what color is assigned to the new item?
If you run the algorithm with k=4 what color is assigned to the new item?
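The exam expects this to be done by hand, but the procedure can be checked with a short sketch over the dataset above (majority vote among the k closest training points by Euclidean distance):

```python
from collections import Counter
from math import dist  # Euclidean distance (Python 3.8+)

# The training set from the problem: (x, y) -> color.
train = [((1, 1), 'blue'), ((1, 3), 'red'), ((2, 5), 'yellow'),
         ((3, 5), 'red'), ((4, 1), 'yellow'), ((4, 4), 'blue'),
         ((5, 3), 'yellow'), ((5, 4), 'red')]

def knn_predict(point, k):
    # Sort training items by distance to the new point, then take
    # a majority vote among the k nearest labels.
    nearest = sorted(train, key=lambda item: dist(item[0], point))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

print(knn_predict((3, 3), 1))
print(knn_predict((3, 3), 4))
```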
The following dataset is the same as Problem 4, except instead of thinking about the items as
points on the x-y plane, we treat the first two columns as generic feature values.
F1 F2 color
1 1 blue
1 3 red
2 5 yellow
3 5 red
4 1 yellow
4 4 blue
5 3 yellow
5 4 red
Using this data as the training set, we’ll step through the process of Naive Bayes classification.
Step 1 (2 points) - Calculate the probability (frequency) that an item falls into each of the three
categories: blue, red, yellow. Please use decimals, and if it’s helpful note that ⅜ = 0.375.
blue
red
yellow
Step 2 (6 points) - For each category (blue, red, yellow), calculate the probability (frequency)
of each of the different feature values within that category. Please organize and label your
numbers so we can understand them easily. Use decimals and assume ⅓ = 0.33.
blue
F1:
F2:
red
F1:
F2:
yellow
F1:
F2:
Step 3 (4 points) - With the probabilities from Steps 1 and 2 and using the Naive Bayes method,
predict the most likely color for a new item with F1 = 1 and F2 = 4. Show your calculation, and
circle the color that the method would predict. Arithmetic calculations you may find useful are
given below the box. (If none of them are useful you’re probably doing something wrong!)
blue
red
yellow
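The three steps can be checked with a short sketch over the dataset above. It uses raw frequencies with no smoothing, as the problem does, and multiplies the prior by the per-feature conditional probabilities:

```python
# Training data from the problem: (F1, F2, color).
data = [(1, 1, 'blue'), (1, 3, 'red'), (2, 5, 'yellow'), (3, 5, 'red'),
        (4, 1, 'yellow'), (4, 4, 'blue'), (5, 3, 'yellow'), (5, 4, 'red')]

def naive_bayes_score(f1, f2, color):
    items = [(a, b) for a, b, c in data if c == color]
    prior = len(items) / len(data)                           # Step 1
    p_f1 = sum(1 for a, _ in items if a == f1) / len(items)  # Step 2
    p_f2 = sum(1 for _, b in items if b == f2) / len(items)
    return prior * p_f1 * p_f2                               # Step 3

scores = {c: naive_bayes_score(1, 4, c) for c in ('blue', 'red', 'yellow')}
print(scores)
print(max(scores, key=scores.get))  # the predicted color
```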
Continuing with the same dataset from Problem 5, consider the following decision tree for using
features F1 and F2 to predict color:
If we test the decision tree on our sample data (below), what accuracy will we achieve in our
predictions? We’re looking for a number between 0 and 1, like the measurements we made
when we experimented with the machine learning packages in Python.
Accuracy:
F1 F2 color
1 1 blue
1 3 red
2 5 yellow
3 5 red
4 1 yellow
4 4 blue
5 3 yellow
5 4 red
Consider performing classification with two features F1 and F2, and label L, as in the previous
two problems. Suppose we have a training set with 1000 items.
If we use k-nearest-neighbors and set k=1000, what effect does this setting have on
classification of new items? Please answer in one sentence.
Is it possible to create a similar effect if we use decision trees? If yes, explain how; if no, explain
why not. Again, please limit your answer to one sentence.
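To see the k = 1000 effect concretely: when k equals the size of the training set, the "k nearest" neighbors are all of the items, so the vote is the same for every query. A sketch with hypothetical labels:

```python
from collections import Counter

# Hypothetical training labels; the features no longer matter when k = n.
labels = ['red'] * 600 + ['blue'] * 400

def knn_all(training_labels, query):
    # With k = len(training set), every label votes, so the prediction
    # is the overall majority label, independent of the query point.
    return Counter(training_labels).most_common(1)[0][0]

print(knn_all(labels, (0, 0)))  # same answer, whatever the query
print(knn_all(labels, (9, 9)))
```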
The four plots on the next page are based on the European cities data we used in class,
specifically the merged cities and countries data we often called CitiesExt. The relevant features
of the data for this problem are:
● longitude
● latitude
● temperature
● EU
● coastline
As we did frequently in class, the plots show a dot for each city based on its longitude (x-axis)
and latitude (y-axis), creating a virtual map.
Each plot shows a clustering of the data into either three or five clusters, using either one or two
of the five features listed above for the distance function. (For clustering, EU and coastline are
converted from yes/no to scaled numeric values.) Below each plot, state which feature(s) must
have been used for the clustering. There is one correct answer for each plot.
If you are color blind or otherwise having trouble seeing the colors please let us know.
Upper-left plot - Three clusters based on one feature. The feature is:
Upper-right plot - Five clusters based on two features. The features are:
Lower-left plot - Three clusters based on two features. The features are:
Lower-right plot - Three clusters based on two features. The features are:
The next few problems use a dataset about courses that is provided at the end of the exam.
Three copies of the dataset are on the last three sheets. Feel free to tear them off.
Suppose the courses dataset from the end of the exam is loaded into an R dataframe called
took, and the following R code is run:
Suppose again the courses dataset from the end of the exam is loaded into an R dataframe
called took. Write R code to find all students who took a 106 course, returning the student and
the quarter in which they took the course (but not the course itself). The result should be
returned in reverse alphabetical order by student name.
● For full credit, your solution must work even if more 106 courses are added -- 106C,
106D, etc.
● Alternatively, partial credit will be given for solutions that only work for 106A and 106B.
(Hint: the “or” operator in R is a vertical bar.)
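The requested answer is R code, and the real Took data appears only in the tables at the end of the exam, but the logic of the query (filter rows by a course-prefix pattern, project away the course column, sort descending by student) can be illustrated in plain Python on a few hypothetical rows:

```python
import re

# Hypothetical rows in the shape of the took table (student, course, quarter);
# the real data is in the tables at the end of the exam.
took = [
    {'student': 'Ann', 'course': '106A', 'quarter': 'Fall'},
    {'student': 'Ben', 'course': '101',  'quarter': 'Fall'},
    {'student': 'Cal', 'course': '106B', 'quarter': 'Winter'},
]

# Match any course starting with 106 (so 106C, 106D, ... also work),
# keep only student and quarter, and sort by student name, descending.
result = [(row['student'], row['quarter'])
          for row in took if re.match(r'106', row['course'])]
result.sort(key=lambda r: r[0], reverse=True)
print(result)  # [('Cal', 'Winter'), ('Ann', 'Fall')]
```

The partial-credit version replaces the pattern match with an explicit test against 106A and 106B, which is where R's vertical-bar "or" operator comes in.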
Using the courses dataset at the end of the exam, consider courses to be items and students to
define transactions, i.e., in data-mining terminology a “basket” is a set of courses taken by one
student. Column quarter is not used in this problem.
(a) List all frequent itemsets of size two or more with support > 0.5.
(b) List all association rules that have two items (courses) on the left-hand side and one item
(course) on the right-hand side, with support > 0.5 and confidence > 0.5. Hint: Your answer for
part (a) can be a starting point for this problem.
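The real baskets come from the Took table at the end of the exam, but the support and confidence computations can be sketched by brute force on hypothetical baskets (support of an itemset = fraction of baskets containing it; confidence of S → i = support of S plus i, divided by support of S):

```python
from itertools import combinations

# Hypothetical baskets, one set of courses per student.
baskets = [{'106A', '106B'}, {'106A', '106B', '103'}, {'106A', '103'}]

def support(itemset):
    # Fraction of baskets that contain every item in the itemset.
    return sum(1 for b in baskets if itemset <= b) / len(baskets)

# (a) Frequent itemsets of size >= 2 with support > 0.5.
items = sorted(set().union(*baskets))
frequent = [set(c) for size in (2, 3)
            for c in combinations(items, size) if support(set(c)) > 0.5]
print(frequent)

# (b) Confidence of a rule S -> i.
def confidence(S, i):
    return support(S | {i}) / support(S)

print(round(confidence({'106A'}, '106B'), 2))  # 0.67
```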
(a) In association rule mining, the concept of lift in place of confidence for rules S → i addresses
a shortcoming with the confidence metric that occurs in which scenario? Fill in one bubble:
◯ Some i occur very rarely in the data
◯ Some i occur very frequently in the data
◯ Some S occur very rarely in the data
◯ Some S occur very frequently in the data
(b) When lift is significantly less than 1.0 for a rule S → i , what does that tell us? Fill in one
bubble:
◯ When S occurs, i is more likely to occur than it occurs overall in the data set
◯ When S occurs, i is less likely to occur than it occurs overall in the data set
◯ It is not possible to have confidence > 0 for rule S → i
(c) The A-priori algorithm can be used to speed up which data mining algorithm(s)? Fill in one
bubble:
◯ Frequent itemsets
◯ Association rules
◯ Both frequent itemsets and association rules
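For reference, lift relates a rule's confidence to how often i occurs overall: lift(S → i) = confidence(S → i) / support(i). A toy sketch with hypothetical baskets:

```python
# Hypothetical baskets for illustration.
baskets = [{'a', 'b'}, {'a', 'b'}, {'b'}, {'c'}]

def support(itemset):
    return sum(1 for bk in baskets if itemset <= bk) / len(baskets)

def confidence(S, i):
    return support(S | {i}) / support(S)

def lift(S, i):
    # lift > 1: S makes i more likely than its overall rate;
    # lift < 1: S makes i less likely.
    return confidence(S, i) / support({i})

print(lift({'a'}, 'b'))  # 'b' always follows 'a' here, but 'b' is common overall
print(lift({'b'}, 'c'))  # 0.0: 'c' never occurs with 'b'
```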
Using the courses dataset at the end of the exam, draw an undirected graph where the nodes
are the students (provided), and there is an edge between two students if they took the same
course in the same quarter at least once. Then answer the questions on the next page based on
your graph.
What is the density of the graph (the ratio of edges to possible edges)? Please express it as a fraction.
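Density can be computed directly from the counts: an undirected graph with n nodes has n(n−1)/2 possible edges. A sketch (networkx's nx.density uses the same formula for undirected graphs):

```python
def density(num_nodes, num_edges):
    # Ratio of actual edges to possible edges in an undirected graph.
    possible = num_nodes * (num_nodes - 1) / 2
    return num_edges / possible

# e.g. a hypothetical graph of 4 students with 3 edges among them:
print(density(4, 3))  # 3 / 6 = 0.5
```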
True or False: In general (not just in the example data), a clique with three nodes represents
three students taking the same course in the same quarter at least once. Fill in one bubble:
◯ True ◯ False
Suppose the graph has been loaded into Python using the networkx package and named G,
and the following code is run (still using networkx):
B = list(G.neighbors('Ben'))
C = list(G.neighbors('Cal'))
for n1 in B + C:
    for n2 in G.neighbors(n1):
        if n2 != 'Ben' and n2 != 'Cal': print(n2)
What does the program print? (don’t worry about output ordering)
We took two sentences from the CS102 course description and put them into a CSV file with a
header “description”:
Now suppose we load the descriptions into a pandas dataframe called T (with two rows and one
column), we import the re package for regular expressions, and we run the following code:
for i in range(len(T)):
    text = T.loc[i].description
    s = re.search('techniques(.*)(apply|hands)', text)
    print(text[s.start():s.end()])
Now suppose we run the following code on the second description (line 1), which tokenizes
(line 2), removes punctuation (lines 3-4) and stopwords (lines 5-6), and creates trigrams (line 7).
text = T.loc[1].description
tokens = nltk.wordpunct_tokenize(text)
punct = list(string.punctuation)
tokens = [word for word in tokens if word not in punct]
stop = stopwords.words('english')
words = [word for word in tokens if word not in stop]
trigrams = nltk.ngrams(words, 3)
How many trigrams are there? Write one number. (For this problem, assume any word with
three or fewer characters is a stopword.)
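The description text itself is not reproduced here, but the counting logic can be sketched with the standard library on a hypothetical sentence. The key fact is that nltk.ngrams(words, 3) yields one trigram per consecutive triple, i.e., len(words) − 2 trigrams:

```python
import string

# Hypothetical description text; the real one is in the problem.
text = "Students gain hands-on experience with data analysis techniques."

# Split on whitespace and strip punctuation (roughly what
# wordpunct_tokenize plus the punctuation filter accomplish).
tokens = [w.strip(string.punctuation) for w in text.split()]

# Per the problem, treat any word of three or fewer characters as a stopword.
words = [w for w in tokens if len(w) > 3]

# Consecutive triples, like nltk.ngrams(words, 3).
trigrams = [tuple(words[i:i + 3]) for i in range(len(words) - 2)]
print(len(trigrams))  # len(words) - 2
```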
Consider an image that is 20 pixels by 30 pixels, and the color of each pixel is encoded in RGB
format. How many different images are possible? Feel free to provide an arithmetic expression
rather than the final number.
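The reasoning: each pixel independently takes one of 256³ RGB values, and there are 20 × 30 = 600 pixels, so the count is (256³)^600. Python's arbitrary-precision integers can evaluate the expression directly:

```python
pixels = 20 * 30             # 600 pixels in a 20-by-30 image
colors_per_pixel = 256 ** 3  # 8-bit R, G, and B channels
num_images = colors_per_pixel ** pixels

print(num_images == 2 ** (24 * 600))  # True: same count, in powers of 2
print(len(str(num_images)))           # number of decimal digits
```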
(a) What techniques are used to handle data at very large scale? Fill in one or more bubbles:
(b) True or false: reinforcement learning requires labeled training data. Fill in one bubble:
◯ True
◯ False
(c) According to Vera, what was the key to the second-generation “AlphaGo Zero” system that
beat the world Go champion? Fill in one bubble:
(d) Which of the following simple prediction strategies beat quite a few (more than 20) of the
student submissions on the Project #2 leaderboard, for fractional ratings? Fill in one or more
bubbles:
◯ Always predicting 3
◯ Always predicting 4
◯ Always predicting the average rating for the movie
◯ Always predicting the average rating for the user
◯ Choosing a random number between 1 and 5
The following data is used for Problems 10-12 & 14. It contains (fictitious) information about
students and the CS courses they took in each quarter.
We’re including three copies of the same data for your convenience.
Feel free to tear off these three pages.
Took