Download as pdf or txt
Download as pdf or txt
You are on page 1of 1

RDD Assignment

The following exercises are based on the movies.tsv and movie-ratings.tsv files. The
columns in these files are delimited by a tab.

Each line in the movies.tsv file represents an actor who played in a movie. If a movie
has ten actors in it, then there will be ten rows for that particular movie.

1.Compute the number of movies produced in each year. The output should have two
columns: year and count. The output should be ordered by the count in descending
order.

2.Compute the number of movies each actor was in. The output should have two
columns: actor and count. The output should be ordered by the count in descending
order.

3.Compute the highest-rated movie per year and include all the actors in that movie.
The output should have only one movie per year, and it should contain four columns:
year, movie title, rating, and a semicolon-separated list of actor names. This question
will require joining the movies.tsv and movie-ratings.tsv files. There are two
approaches to this problem. The first one is to figure out the highest-rated movie per
year first and then join with the list of actors. The second one is to perform the join
first and then figure out the highest-rated movies per year along with a list of actors.
The result of each approach is different than the other one. Why do you think that is?

4.Determine which pair of actors worked together most. Working together is defined
as appearing in the same movie. The output should have three columns: actor 1, actor
2, and count. The output should be sorted by the count in descending order. The
solution to this question will require a self-join.

You might also like