Practice Exam - Gradescope Ver.
The exam has a total of 16 problems with a varying number of points per problem, for a total of
120 points in 120 minutes.
You may refer to your three pages of prepared notes. No additional notes, books, phones,
tablets, or laptops are permitted.
Please write your answers in the spaces provided on the exam. Make sure your answers are
neat and clearly marked. You may use the blank areas and backs of the exam pages for
ungraded scratch work.
Name _______________________________________________
In accordance with both the letter and spirit of the Honor Code, I have neither given nor
received assistance on this examination.
Signature _____________________________________________
For each of the scenarios in the table below, specify the technique you think is most suitable for
the particular problem. Assume you have a large dataset available for analysis or to train a
machine learning algorithm.
For each scenario, choose exactly one from the following seven techniques. Since there are 10
scenarios and 7 techniques, you will need to list some techniques more than once (and it is not
necessary to use all of them). Please just write the technique name with no explanation.
Consider the following three scatterplots in x-y coordinate space. For each one, suggest rough
estimate values for r (the Pearson correlation coefficient), and for R2 (the coefficient of
determination) for a linear regression. Guess each value to at most two decimal places. Rather
than exact numbers, we will be looking for the right basic idea with the estimates, and that your
estimates are consistent with each other with respect to the plots.
r:
R2 :
r:
R2 :
r:
R2 :
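A useful consistency check for these estimates: for simple linear regression, R2 is exactly r squared. The sketch below computes both quantities from scratch on a small set of hypothetical points (not taken from the plots, which are not reproduced here):

```python
from math import sqrt

def pearson_r(xs, ys):
    # Pearson correlation: covariance divided by the product
    # of the standard deviations.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical points with a loose upward trend.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
r = pearson_r(xs, ys)
print(round(r, 2))      # 0.77
print(round(r * r, 2))  # 0.6 -- for simple linear regression, R2 = r^2
```

So an estimate pair like r = 0.77, R2 = 0.30 would be internally inconsistent, whatever the plot looks like.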
Is it possible to get a “perfect fit” (R2 = 1) of these points using regression? If so, briefly explain
how. If not, briefly explain why not.
The following dataset contains eight items with features x and y, and label color. (The grid to
the right is explained below.) For this problem, you can think of each item as a colored point on
the x-y plane.
x y color
1 1 blue
1 3 red
2 5 yellow
3 5 red
4 1 yellow
4 4 blue
5 3 yellow
5 4 red
Using this data as the training set, run the k-nearest-neighbors classification algorithm
(manually) to decide the most likely color for a new item with x = 3 and y = 3. The distance
between points is the actual distance on the x-y plane (also called Euclidean distance).
Although not required, you may find it helpful to sketch an x-y scatterplot with colored/labeled
points. We’ve provided a grid for your use, above.
If you run the algorithm with k=1 what color is assigned to the new item?
If you run the algorithm with k=4 what color is assigned to the new item?
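The exam expects this to be done by hand, but the procedure can be checked with a short sketch over the dataset above (majority vote among the k closest training points by Euclidean distance):

```python
from collections import Counter
from math import dist  # Euclidean distance (Python 3.8+)

# The training set from the problem: (x, y) -> color.
train = [((1, 1), 'blue'), ((1, 3), 'red'), ((2, 5), 'yellow'),
         ((3, 5), 'red'), ((4, 1), 'yellow'), ((4, 4), 'blue'),
         ((5, 3), 'yellow'), ((5, 4), 'red')]

def knn_predict(point, k):
    # Sort training items by distance to the new point, then take
    # a majority vote among the k nearest labels.
    nearest = sorted(train, key=lambda item: dist(item[0], point))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

print(knn_predict((3, 3), 1))
print(knn_predict((3, 3), 4))
```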
The following dataset is the same as Problem 4, except instead of thinking about the items as
points on the x-y plane, we treat the first two columns as generic feature values.
F1 F2 color
1 1 blue
1 3 red
2 5 yellow
3 5 red
4 1 yellow
4 4 blue
5 3 yellow
5 4 red
Using this data as the training set, we’ll step through the process of Naive Bayes classification.
Step 1 (2 points) - Calculate the probability (frequency) that an item falls into each of the three
categories: blue, red, yellow. Please use decimals, and if it’s helpful note that ⅜ = 0.375.
blue
red
yellow
Step 2 (6 points) - For each category (blue, red, yellow), calculate the probability (frequency)
of each of the different feature values within that category. Please organize and label your
numbers so we can understand them easily. Use decimals and assume ⅓ = 0.33.
blue
F1:
F2:
red
F1:
F2:
yellow
F1:
F2:
Step 3 (4 points) - With the probabilities from Steps 1 and 2 and using the Naive Bayes method,
predict the most likely color for a new item with F1 = 1 and F2 = 4. Show your calculation, and
circle the color that the method would predict. Arithmetic calculations you may find useful are
given below the box. (If none of them are useful you’re probably doing something wrong!)
blue
red
yellow
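The three steps can be checked with a short sketch over the dataset above. It uses raw frequencies with no smoothing, as the problem does, and multiplies the prior by the per-feature conditional probabilities:

```python
# Training data from the problem: (F1, F2, color).
data = [(1, 1, 'blue'), (1, 3, 'red'), (2, 5, 'yellow'), (3, 5, 'red'),
        (4, 1, 'yellow'), (4, 4, 'blue'), (5, 3, 'yellow'), (5, 4, 'red')]

def naive_bayes_score(f1, f2, color):
    items = [(a, b) for a, b, c in data if c == color]
    prior = len(items) / len(data)                           # Step 1
    p_f1 = sum(1 for a, _ in items if a == f1) / len(items)  # Step 2
    p_f2 = sum(1 for _, b in items if b == f2) / len(items)
    return prior * p_f1 * p_f2                               # Step 3

scores = {c: naive_bayes_score(1, 4, c) for c in ('blue', 'red', 'yellow')}
print(scores)
print(max(scores, key=scores.get))  # the predicted color
```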
Continuing with the same dataset from Problem 5, consider the following decision tree for using
features F1 and F2 to predict color:
If we test the decision tree on our sample data (below), what accuracy will we achieve in our
predictions? We’re looking for a number between 0 and 1, like the measurements we made
when we experimented with the machine learning packages in Python.
Accuracy:
F1 F2 color
1 1 blue
1 3 red
2 5 yellow
3 5 red
4 1 yellow
4 4 blue
5 3 yellow
5 4 red
Consider performing classification with two features F1 and F2, and label L, as in the previous
two problems. Suppose we have a training set with 1000 items.
If we use k-nearest-neighbors and set k=1000, what effect does this setting have on
classification of new items? Please answer in one sentence.
Is it possible to create a similar effect if we use decision trees? If yes, explain how; if no, explain
why not. Again, please limit your answer to one sentence.
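To see the k = 1000 effect concretely: when k equals the size of the training set, the "k nearest" neighbors are all of the items, so the vote is the same for every query. A sketch with hypothetical labels:

```python
from collections import Counter

# Hypothetical training labels; the features no longer matter when k = n.
labels = ['red'] * 600 + ['blue'] * 400

def knn_all(training_labels, query):
    # With k = len(training set), every label votes, so the prediction
    # is the overall majority label, independent of the query point.
    return Counter(training_labels).most_common(1)[0][0]

print(knn_all(labels, (0, 0)))  # same answer, whatever the query
print(knn_all(labels, (9, 9)))
```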
The four plots on the next page are based on the European cities data we used in class,
specifically the merged cities and countries data we often called CitiesExt. The relevant features
of the data for this problem are:
● longitude
● latitude
● temperature
● EU
● coastline
As we did frequently in class, the plots show a dot for each city based on its longitude (x-axis)
and latitude (y-axis), creating a virtual map.
Each plot shows a clustering of the data into either three or five clusters, using either one or two
of the five features listed above for the distance function. (For clustering, EU and coastline are
converted from yes/no to scaled numeric values.) Below each plot, state which feature(s) must
have been used for the clustering. There is one correct answer for each plot.
If you are color blind or otherwise having trouble seeing the colors please let us know.
Upper-left plot - Three clusters based on one feature. The feature is:
Upper-right plot - Five clusters based on two features. The features are:
Lower-left plot - Three clusters based on two features. The features are:
Lower-right plot - Three clusters based on two features. The features are:
The next few problems use a dataset about courses that is provided at the end of the exam.
Three copies of the dataset are on the last three sheets. Feel free to tear them off.
Suppose the courses dataset from the end of the exam is loaded into an R dataframe called
took, and the following R code is run:
Suppose again the courses dataset from the end of the exam is loaded into an R dataframe
called took. Write R code to find all students who took a 106 course, returning the student and
the quarter in which they took the course (but not the course itself). The result should be
returned in reverse alphabetical order by student name.
● For full credit, your solution must work even if more 106 courses are added -- 106C,
106D, etc.
● Alternatively, partial credit will be given for solutions that only work for 106A and 106B.
(Hint: the “or” operator in R is a vertical bar.)
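The requested answer is R code, and the real Took data appears only in the tables at the end of the exam, but the logic of the query (filter rows by a course-prefix pattern, project away the course column, sort descending by student) can be illustrated in plain Python on a few hypothetical rows:

```python
import re

# Hypothetical rows in the shape of the took table (student, course, quarter);
# the real data is in the tables at the end of the exam.
took = [
    {'student': 'Ann', 'course': '106A', 'quarter': 'Fall'},
    {'student': 'Ben', 'course': '101',  'quarter': 'Fall'},
    {'student': 'Cal', 'course': '106B', 'quarter': 'Winter'},
]

# Match any course starting with 106 (so 106C, 106D, ... also work),
# keep only student and quarter, and sort by student name, descending.
result = [(row['student'], row['quarter'])
          for row in took if re.match(r'106', row['course'])]
result.sort(key=lambda r: r[0], reverse=True)
print(result)  # [('Cal', 'Winter'), ('Ann', 'Fall')]
```

The partial-credit version replaces the pattern match with an explicit test against 106A and 106B, which is where R's vertical-bar "or" operator comes in.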
Using the courses dataset at the end of the exam, consider courses to be items and students to
define transactions, i.e., in data-mining terminology a “basket” is a set of courses taken by one
student. Column quarter is not used in this problem.
(a) List all frequent itemsets of size two or more with support > 0.5.
(b) List all association rules that have two items (courses) on the left-hand side and one item
(course) on the right-hand side, with support > 0.5 and confidence > 0.5. Hint: Your answer for
part (a) can be a starting point for this problem.
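The real baskets come from the Took table at the end of the exam, but the support and confidence computations can be sketched by brute force on hypothetical baskets (support of an itemset = fraction of baskets containing it; confidence of S → i = support of S plus i, divided by support of S):

```python
from itertools import combinations

# Hypothetical baskets, one set of courses per student.
baskets = [{'106A', '106B'}, {'106A', '106B', '103'}, {'106A', '103'}]

def support(itemset):
    # Fraction of baskets that contain every item in the itemset.
    return sum(1 for b in baskets if itemset <= b) / len(baskets)

# (a) Frequent itemsets of size >= 2 with support > 0.5.
items = sorted(set().union(*baskets))
frequent = [set(c) for size in (2, 3)
            for c in combinations(items, size) if support(set(c)) > 0.5]
print(frequent)

# (b) Confidence of a rule S -> i.
def confidence(S, i):
    return support(S | {i}) / support(S)

print(round(confidence({'106A'}, '106B'), 2))  # 0.67
```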
(a) In association rule mining, the concept of lift in place of confidence for rules S → i addresses
a shortcoming with the confidence metric that occurs in which scenario? Fill in one bubble:
◯ Some i occur very rarely in the data
◯ Some i occur very frequently in the data
◯ Some S occur very rarely in the data
◯ Some S occur very frequently in the data
(b) When lift is significantly less than 1.0 for a rule S → i , what does that tell us? Fill in one
bubble:
◯ When S occurs, i is more likely to occur than it occurs overall in the data set
◯ When S occurs, i is less likely to occur than it occurs overall in the data set
◯ It is not possible to have confidence > 0 for rule S → i
(c) The A-priori algorithm can be used to speed up which data mining algorithm(s)? Fill in one
bubble:
◯ Frequent itemsets
◯ Association rules
◯ Both frequent itemsets and association rules
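For reference, lift relates a rule's confidence to how often i occurs overall: lift(S → i) = confidence(S → i) / support(i). A toy sketch with hypothetical baskets:

```python
# Hypothetical baskets for illustration.
baskets = [{'a', 'b'}, {'a', 'b'}, {'b'}, {'c'}]

def support(itemset):
    return sum(1 for bk in baskets if itemset <= bk) / len(baskets)

def confidence(S, i):
    return support(S | {i}) / support(S)

def lift(S, i):
    # lift > 1: S makes i more likely than its overall rate;
    # lift < 1: S makes i less likely.
    return confidence(S, i) / support({i})

print(lift({'a'}, 'b'))  # 'b' always follows 'a' here, but 'b' is common overall
print(lift({'b'}, 'c'))  # 0.0: 'c' never occurs with 'b'
```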
Using the courses dataset at the end of the exam, draw an undirected graph where the nodes
are the students (provided), and there is an edge between two students if they took the same
course in the same quarter at least once. Then answer the questions on the next page based on
your graph.
What is the density of the graph (the ratio of edges to possible edges)? Please express it as a fraction.
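Density can be computed directly from the counts: an undirected graph with n nodes has n(n−1)/2 possible edges. A sketch (networkx's nx.density uses the same formula for undirected graphs):

```python
def density(num_nodes, num_edges):
    # Ratio of actual edges to possible edges in an undirected graph.
    possible = num_nodes * (num_nodes - 1) / 2
    return num_edges / possible

# e.g. a hypothetical graph of 4 students with 3 edges among them:
print(density(4, 3))  # 3 / 6 = 0.5
```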
True or False: In general (not just in the example data), a clique with three nodes represents
three students taking the same course in the same quarter at least once. Fill in one bubble:
◯ True ◯ False
Suppose the graph has been loaded into Python using the networkx package and named G,
and the following code is run (still using networkx):
B = list(G.neighbors('Ben'))
C = list(G.neighbors('Cal'))
for n1 in B + C:
    for n2 in G.neighbors(n1):
        if n2 != 'Ben' and n2 != 'Cal': print(n2)
What does the program print? (don’t worry about output ordering)
We took two sentences from the CS102 course description and put them into a CSV file with a
header “description”:
Now suppose we load the descriptions into a pandas dataframe called T (with two rows and one
column), we import the re package for regular expressions, and we run the following code:
for i in range(len(T)):
    text = T.loc[i].description
    s = re.search('techniques(.*)(apply|hands)', text)
    print(text[s.start():s.end()])
Now suppose we run the following code on the second description (line 1), which tokenizes
(line 2), removes punctuation (lines 3-4) and stopwords (lines 5-6), and creates trigrams (line 7).
text = T.loc[1].description
tokens = nltk.wordpunct_tokenize(text)
punct = list(string.punctuation)
tokens = [word for word in tokens if word not in punct]
stop = stopwords.words('english')
words = [word for word in tokens if word not in stop]
trigrams = nltk.ngrams(words, 3)
How many trigrams are there? Write one number. (For this problem, assume any word with
three or fewer characters is a stopword.)
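The description text itself is not reproduced here, but the counting logic can be sketched with the standard library on a hypothetical sentence. The key fact is that nltk.ngrams(words, 3) yields one trigram per consecutive triple, i.e., len(words) − 2 trigrams:

```python
import string

# Hypothetical description text; the real one is in the problem.
text = "Students gain hands-on experience with data analysis techniques."

# Split on whitespace and strip punctuation (roughly what
# wordpunct_tokenize plus the punctuation filter accomplish).
tokens = [w.strip(string.punctuation) for w in text.split()]

# Per the problem, treat any word of three or fewer characters as a stopword.
words = [w for w in tokens if len(w) > 3]

# Consecutive triples, like nltk.ngrams(words, 3).
trigrams = [tuple(words[i:i + 3]) for i in range(len(words) - 2)]
print(len(trigrams))  # len(words) - 2
```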
Consider an image that is 20 pixels by 30 pixels, and the color of each pixel is encoded in RGB
format. How many different images are possible? Feel free to provide an arithmetic expression
rather than the final number.
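The reasoning: each pixel independently takes one of 256³ RGB values, and there are 20 × 30 = 600 pixels, so the count is (256³)^600. Python's arbitrary-precision integers can evaluate the expression directly:

```python
pixels = 20 * 30             # 600 pixels in a 20-by-30 image
colors_per_pixel = 256 ** 3  # 8-bit R, G, and B channels
num_images = colors_per_pixel ** pixels

print(num_images == 2 ** (24 * 600))  # True: same count, in powers of 2
print(len(str(num_images)))           # number of decimal digits
```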
(a) What techniques are used to handle data at very large scale? Fill in one or more bubbles:
(b) True or false: reinforcement learning requires labeled training data. Fill in one bubble:
◯ True
◯ False
(c) According to Vera, what was the key to the second-generation “AlphaGo Zero” system that
beat the world Go champion? Fill in one bubble:
(d) Which of the following simple prediction strategies beat quite a few (more than 20) of the
student submissions on the Project #2 leaderboard, for fractional ratings? Fill in one or more
bubbles:
◯ Always predicting 3
◯ Always predicting 4
◯ Always predicting the average rating for the movie
◯ Always predicting the average rating for the user
◯ Choosing a random number between 1 and 5
The following data is used for Problems 10-12 & 14. It contains (fictitious) information about
students and the CS courses they took in each quarter.
We’re including three copies of the same data for your convenience.
Feel free to tear off these three pages.
Took