IMDB Movie Database

1. How many actors are there in the database? How many movies?
I counted only people credited as actors or actresses in the cast_info table (role_id 1 or 2). For
movies, I counted only titles whose kind is movie or TV movie (kind_id 1 or 3).
numact = dbGetQuery(de, "SELECT COUNT(DISTINCT name)
FROM name, cast_info
WHERE cast_info.person_id = name.id
AND (role_id = 1 or role_id = 2)")
numtotal = dbGetQuery(de, "SELECT COUNT(*) FROM title")
nummov = dbGetQuery(de, "SELECT COUNT(*) FROM title WHERE (kind_id = 1 or kind_id = 3)")
Result: 3053802 actors, 999188 movies
2. What time period does the database cover?
dbGetQuery(de, "SELECT MIN(production_year), MAX(production_year) FROM aka_title")
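Result: 1875 to 2022. Note that this range comes from aka_title, the table of alternate titles, so
the range in the title table itself may differ.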
3. What proportion of the actors are female? male?
propf = dbGetQuery(de, "SELECT COUNT(gender) FROM name WHERE gender = 'f'")
propf[1,1]/numact[1,1]
propm = dbGetQuery(de, "SELECT COUNT(gender) FROM name WHERE gender = 'm'")
propm[1,1]/numact[1,1]
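Result: 40.53% female, 74.06% male. These sum to more than 100% because the numerators count
every person in the name table with that gender, while the denominator numact counts only actors
and actresses.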
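One way to keep the numerator and the denominator on the same population is to count each credited actor or actress once and then split that set by gender. The sketch below is only an illustration: it builds a tiny in-memory database with made-up rows (the connection con and the result by_gender are hypothetical names), and reuses only the table and column names from the queries above.

```r
library(DBI)

# Toy in-memory database; all rows are invented for illustration.
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "name", data.frame(
  id     = 1:5,
  name   = c("A", "B", "C", "D", "E"),
  gender = c("f", "m", "m", NA, "f")
))
dbWriteTable(con, "cast_info", data.frame(
  person_id = c(1L, 1L, 2L, 3L, 4L, 5L),
  role_id   = c(2L, 2L, 1L, 1L, 1L, 4L)  # person 5 is never credited as actor/actress
))

# Count each credited actor or actress once, then split by gender.
by_gender <- dbGetQuery(con, "
  SELECT gender, COUNT(*) AS n
  FROM (SELECT DISTINCT name.id, name.gender
        FROM name, cast_info
        WHERE cast_info.person_id = name.id
          AND (role_id = 1 OR role_id = 2))
  GROUP BY gender")

# Proportions over the same set of people, so they sum to 1.
by_gender$prop <- by_gender$n / sum(by_gender$n)
print(by_gender)
dbDisconnect(con)
```

People with a missing gender appear as their own group, which makes it visible how much of the cast has no gender recorded.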
4. What proportion of the entries in the movies table are actual movies and what
proportion are television series, etc.?
prop1 = nummov[1,1]/numtotal[1,1]
prop2 = 1-prop1
Result: 28.32% movies, 71.68% TV series and others
5. How many genres are there? What are their names/descriptions?
There are 32 genres; the screenshot shows the names of the first 15.
dbGetQuery(de, "SELECT COUNT(DISTINCT info) FROM movie_info WHERE info_type_id = 3")
dbGetQuery(de, "SELECT DISTINCT info FROM movie_info WHERE info_type_id = 3")

Result: 32 genres

6. List the 10 most common genres of movies, showing the number of movies in each of
these genres.
I joined movie_info, title, and kind_type so that only movies are counted in each genre. The SQL
code is shown below:
dbGetQuery(de, "SELECT info, COUNT(info) AS NumOfMovies
FROM movie_info, title, kind_type
WHERE info_type_id = 3
AND movie_info.movie_id = title.id
AND kind_type.id = title.kind_id
AND kind_type.kind = 'movie'
GROUP BY info ORDER BY NumOfMovies DESC LIMIT 10")
Result:

7. Find all movies with the keyword 'space'. How many are there? What are the years these
were released? And who were the top 5 actors in each of these movies?
# How many are there?
dbGetQuery(de, "SELECT COUNT(title) FROM title, movie_keyword, keyword, kind_type
WHERE movie_keyword.movie_id = title.id
AND movie_keyword.keyword_id = keyword.id
AND kind_type.id = title.kind_id
AND kind_type.kind = 'movie'
AND keyword.keyword = 'space'")
Result: 401 movies with the keyword 'space'
# What are the years?
dbGetQuery(de, "SELECT title, production_year
FROM title, movie_keyword, keyword, kind_type
WHERE movie_keyword.movie_id = title.id
AND movie_keyword.keyword_id = keyword.id
AND kind_type.id = title.kind_id
AND kind_type.kind = 'movie'
AND keyword.keyword = 'space' LIMIT 10")
Results:

# Who were the top 5 actors in each?
dbGetQuery(de, "SELECT DISTINCT name, nr_order, title, production_year
FROM title, movie_keyword, keyword, kind_type, cast_info, name
WHERE movie_keyword.movie_id = title.id
AND movie_keyword.keyword_id = keyword.id
AND cast_info.movie_id = title.id
AND cast_info.person_id = name.id
AND kind_type.id = title.kind_id
AND kind_type.kind = 'movie'
AND keyword.keyword = 'space'
AND nr_order < 6 AND nr_order > 0
ORDER BY title, nr_order LIMIT 20")
Results:

8. Has the number of movies in each genre changed over time? Plot the overall number of
movies in each year over time, and for each genre.
q8 = dbGetQuery(de, "SELECT production_year, info, COUNT(info) AS NumMovies
FROM title, kind_type, movie_info
WHERE movie_info.movie_id = title.id
AND movie_info.info_type_id = 3
AND kind_type.id = title.kind_id
AND kind_type.kind = 'movie'
AND (production_year > 1990 and production_year < 2016)
GROUP BY info, production_year
ORDER BY production_year")

Result: I apologize for the lack of color printing. Yes, the number of movies has changed for
certain genres. The graph shows that short films, dramas, animations, and documentaries are the
four major genres whose production increased over time (1991 to 2015, the years covered by the
query).
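The question also asks for the overall number of movies per year. Summing the per-genre counts would count a movie once for every genre it is listed under, so a separate query on the title table is a safer route. The following is a minimal sketch on a toy in-memory database with invented rows; con and per_year are hypothetical names, and the year filter simply mirrors the one used in q8.

```r
library(DBI)

# Toy in-memory database; all rows are invented for illustration.
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "kind_type",
             data.frame(id = 1:2, kind = c("movie", "tv series")))
dbWriteTable(con, "title", data.frame(
  id              = 1:6,
  title           = paste("Title", 1:6),
  kind_id         = c(1L, 1L, 1L, 2L, 1L, 1L),
  production_year = c(1991L, 1991L, 1992L, 1992L, 1992L, 2015L)
))

# One row per year, counting each movie exactly once regardless of genre.
per_year <- dbGetQuery(con, "
  SELECT production_year, COUNT(*) AS NumMovies
  FROM title, kind_type
  WHERE kind_type.id = title.kind_id
    AND kind_type.kind = 'movie'
    AND (production_year > 1990 AND production_year < 2016)
  GROUP BY production_year
  ORDER BY production_year")
print(per_year)

# Draw the overall trend only when run interactively.
if (interactive()) {
  plot(per_year$production_year, per_year$NumMovies, type = "b",
       xlab = "Production year", ylab = "Number of movies")
}
dbDisconnect(con)
```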

9. Who are the actors that have been in the most movies? List the top 20.

q9 = dbGetQuery(de, "SELECT name, COUNT(name) AS NumMovies
FROM title, name, cast_info, role_type, kind_type
WHERE cast_info.movie_id = title.id
AND cast_info.person_id = name.id
AND kind_type.id = title.kind_id
AND kind_type.kind = 'movie'
AND cast_info.role_id = role_type.id
AND (role_type.id = 1 or role_type.id = 2)
GROUP BY name ORDER BY NumMovies DESC LIMIT 20")
Results: (shown in the table to the right)

10. Who are the actors that have had the most number of movies with "top billing", i.e.,
billed as 1, 2 or 3? For each actor, also show the years these movies spanned?
q10 = dbGetQuery(de, "SELECT name, COUNT(name) AS NumMovies, MIN(production_year),
MAX(production_year)
FROM title, name, cast_info, role_type, kind_type
WHERE cast_info.movie_id = title.id
AND cast_info.person_id = name.id
AND kind_type.id = title.kind_id
AND kind_type.kind = 'movie'
AND cast_info.role_id = role_type.id
AND role_id < 3
AND (nr_order < 4 and nr_order > 0)
GROUP BY name ORDER BY NumMovies DESC LIMIT 20")
Results: It's pretty cool that the same guy is number 1 in both the previous table and this one.

11. Who are the 10 actors that performed in the most movies within any given year? What
are their names, the year they starred in these movies and the names of the movies?
For this problem, I restricted the query to production years after 1990 to show a manageable
section of the data. I could not figure out how to sort the results in descending order by number
of movies and then by production year within that. The code and the table it produced are shown
below, but I don't know where I went wrong.
q11 = dbGetQuery(de, "SELECT name, COUNT(name) AS NumMovies, production_year, title
FROM title, name, cast_info, role_type, kind_type
WHERE cast_info.movie_id = title.id
AND cast_info.person_id = name.id
AND kind_type.id = title.kind_id
AND kind_type.kind = 'movie'
AND cast_info.role_id = role_type.id
AND role_id < 3
AND production_year > 1990
GROUP BY name
HAVING NumMovies > 0
ORDER BY NumMovies DESC, production_year LIMIT 10")
Result:
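A likely cause of the problem is that the query groups by name only, so each actor gets a single row covering all years, and production_year and title are just values taken from one arbitrary row of that group. Grouping by both name and production_year gives one row per actor per year, which can then be sorted as intended. The sketch below illustrates this on a toy in-memory database with invented rows; con and q11_sketch are hypothetical names, and GROUP_CONCAT is SQLite's aggregate for joining the titles of a group into one string.

```r
library(DBI)

# Toy in-memory database; all rows are invented for illustration.
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "kind_type", data.frame(id = 1L, kind = "movie"))
dbWriteTable(con, "name",
             data.frame(id = 1:2, name = c("Actor A", "Actor B")))
dbWriteTable(con, "title", data.frame(
  id              = 1:5,
  title           = c("M1", "M2", "M3", "M4", "M5"),
  kind_id         = 1L,
  production_year = c(1995L, 1995L, 1995L, 1996L, 1996L)
))
dbWriteTable(con, "cast_info", data.frame(
  person_id = c(1L, 1L, 1L, 1L, 2L, 2L),
  movie_id  = c(1L, 2L, 3L, 4L, 4L, 5L),
  role_id   = c(1L, 1L, 1L, 1L, 2L, 2L)
))

# One row per (actor, year); titles of that year are collected into one string.
q11_sketch <- dbGetQuery(con, "
  SELECT name, production_year, COUNT(*) AS NumMovies,
         GROUP_CONCAT(title, '; ') AS Titles
  FROM title, name, cast_info, kind_type
  WHERE cast_info.movie_id = title.id
    AND cast_info.person_id = name.id
    AND kind_type.id = title.kind_id
    AND kind_type.kind = 'movie'
    AND role_id < 3
    AND production_year > 1990
  GROUP BY name, production_year
  ORDER BY NumMovies DESC, production_year
  LIMIT 10")
print(q11_sketch)
dbDisconnect(con)
```

With this grouping, ORDER BY NumMovies DESC, production_year sorts first by the per-year count and then by year among ties.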

12. Who are the 10 actors that have the most aliases (i.e., see the aka_name table)?
q12 = dbGetQuery(de, "SELECT name.name, COUNT(name.name) AS NumAliases
FROM aka_name, name
WHERE aka_name.person_id = name.id
GROUP BY name.name ORDER BY NumAliases DESC LIMIT 10")
Result: (see table at right)
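Grouping by name.name merges different people who happen to share a name, which can inflate their alias counts. A possible refinement is to group by the person's id instead. The sketch below shows the difference on a toy in-memory database with invented rows; con, by_name, and by_id are hypothetical names, and the toy aka_name table carries only the columns the query needs.

```r
library(DBI)

# Toy in-memory database; all rows are invented for illustration.
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "name", data.frame(
  id   = 1:3,
  name = c("John Smith", "John Smith", "Jane Doe")  # two different John Smiths
))
dbWriteTable(con, "aka_name", data.frame(
  person_id = c(1L, 1L, 2L, 2L, 3L, 3L, 3L),
  name      = paste("Alias", 1:7)
))

# Grouping by name pools the aliases of both John Smiths.
by_name <- dbGetQuery(con, "
  SELECT name.name, COUNT(*) AS NumAliases
  FROM aka_name, name
  WHERE aka_name.person_id = name.id
  GROUP BY name.name ORDER BY NumAliases DESC LIMIT 10")

# Grouping by id keeps each person separate.
by_id <- dbGetQuery(con, "
  SELECT name.id, name.name, COUNT(*) AS NumAliases
  FROM aka_name, name
  WHERE aka_name.person_id = name.id
  GROUP BY name.id ORDER BY NumAliases DESC LIMIT 10")

print(by_name)
print(by_id)
dbDisconnect(con)
```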

Code Appendix
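library(DBI)      # provides dbConnect(), dbGetQuery(), dbListTables()
library(RSQLite)  # provides the SQLite() driver
de = dbConnect(SQLite(), "lean_imdbpy.db")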
dbListTables(de)
# ===== Q1 =====
# Number of actors
numact = dbGetQuery(de, "SELECT COUNT(DISTINCT name)
FROM name, cast_info
WHERE cast_info.person_id = name.id
AND (role_id = 1 or role_id = 2)")
# Number of movies
dbGetQuery(de, "SELECT DISTINCT kind FROM kind_type")
numtotal = dbGetQuery(de, "SELECT COUNT(*) FROM title")
nummov = dbGetQuery(de, "SELECT COUNT(*) FROM title WHERE (kind_id = 1 or kind_id = 3)")
# ===== Q2 =====
# Time period that the database covers
dbGetQuery(de, "SELECT MIN(production_year), MAX(production_year) FROM aka_title")
# ===== Q3 =====
# What proportion of the actors are female?
dbGetQuery(de, "SELECT * FROM name LIMIT 5")
dbGetQuery(de, "SELECT DISTINCT gender FROM name")
propf = dbGetQuery(de, "SELECT COUNT(name) FROM name WHERE gender = 'f'")
propf[1,1]/numact[1,1]
# What proportion of the actors are male?
propm = dbGetQuery(de, "SELECT COUNT(name) FROM name WHERE gender = 'm'")
propm[1,1]/numact[1,1]
# ===== Q4 =====
# What proportion are actual movies and what proportion are television series, etc?
prop1 = nummov[1,1]/numtotal[1,1]
prop2 = 1-prop1
# ===== Q5 =====
# How many genres are there? And what are their names/descriptions?
dbGetQuery(de, "SELECT COUNT(DISTINCT info) FROM movie_info WHERE info_type_id = 3")
dbGetQuery(de, "SELECT DISTINCT info FROM movie_info WHERE info_type_id = 3")
# ===== Q6 =====

# List the 10 most common genres of movies with the number of movies in each
dbGetQuery(de, "SELECT info, COUNT(info) AS NumOfMovies
FROM movie_info, title, kind_type
WHERE info_type_id = 3
AND movie_info.movie_id = title.id
AND kind_type.id = title.kind_id
AND kind_type.kind = 'movie'
GROUP BY info ORDER BY NumOfMovies DESC LIMIT 10")
# ===== Q7 =====
# Find all movies with the keyword "space"
# How many are there?
dbGetQuery(de, "SELECT COUNT(title) FROM title, movie_keyword, keyword, kind_type
WHERE movie_keyword.movie_id = title.id
AND movie_keyword.keyword_id = keyword.id
AND kind_type.id = title.kind_id
AND kind_type.kind = 'movie'
AND keyword.keyword = 'space'")
# What are the years?
dbGetQuery(de, "SELECT title, production_year
FROM title, movie_keyword, keyword, kind_type
WHERE movie_keyword.movie_id = title.id
AND movie_keyword.keyword_id = keyword.id
AND kind_type.id = title.kind_id
AND kind_type.kind = 'movie'
AND keyword.keyword = 'space' LIMIT 10")
# Who were the top 5 actors in each?
dbGetQuery(de, "SELECT DISTINCT(name), nr_order, title, production_year
FROM title, movie_keyword, keyword, kind_type, cast_info, name
WHERE movie_keyword.movie_id = title.id
AND movie_keyword.keyword_id = keyword.id
AND cast_info.movie_id = title.id
AND cast_info.person_id = name.id
AND kind_type.id = title.kind_id
AND kind_type.kind = 'movie'
AND keyword.keyword = 'space'
AND nr_order < 6 AND nr_order > 0
ORDER BY title, nr_order LIMIT 20")
# ===== Q8 =====
# Has the number of movies in each genre changed over time?
q8 = dbGetQuery(de, "SELECT production_year, info, COUNT(info) AS NumMovies
FROM title, kind_type, movie_info
WHERE movie_info.movie_id = title.id
AND movie_info.info_type_id = 3
AND kind_type.id = title.kind_id
AND kind_type.kind = 'movie'
AND (production_year > 1990 and production_year < 2016)
GROUP BY info, production_year
ORDER BY production_year")
# Plot the overall number of movies in each year over time by genre
library(ggplot2)
# qplot() is deprecated in recent ggplot2 releases, so ggplot() is used instead
ggplot(q8, aes(production_year, NumMovies, colour = info)) + geom_point()
# ===== Q9 =====
# Who are the actors that have been in the most movies? List top 20.
q9 = dbGetQuery(de, "SELECT name, COUNT(name) AS NumMovies
FROM title, name, cast_info, role_type, kind_type
WHERE cast_info.movie_id = title.id
AND cast_info.person_id = name.id
AND kind_type.id = title.kind_id
AND kind_type.kind = 'movie'
AND cast_info.role_id = role_type.id
AND (role_type.id = 1 or role_type.id = 2)
GROUP BY name ORDER BY NumMovies DESC LIMIT 20")
# ===== Q10 =====
# Who are the actors in the most movies with "top billing"? Show range of years too.
q10 = dbGetQuery(de, "SELECT name, COUNT(name) AS NumMovies,
MIN(production_year), MAX(production_year)
FROM title, name, cast_info, role_type, kind_type
WHERE cast_info.movie_id = title.id
AND cast_info.person_id = name.id
AND kind_type.id = title.kind_id
AND kind_type.kind = 'movie'
AND cast_info.role_id = role_type.id
AND role_id < 3
AND (nr_order < 4 and nr_order > 0)
GROUP BY name ORDER BY NumMovies DESC LIMIT 20")
# ===== Q11 =====
# Who are the 10 actors that performed in the most movies in each year?
q11 = dbGetQuery(de, "SELECT name, COUNT(name) AS NumMovies, production_year, title
FROM title, name, cast_info, role_type, kind_type
WHERE cast_info.movie_id = title.id
AND cast_info.person_id = name.id
AND kind_type.id = title.kind_id
AND kind_type.kind = 'movie'
AND cast_info.role_id = role_type.id
AND role_id < 3
AND production_year > 1990
GROUP BY name
HAVING NumMovies > 0
ORDER BY NumMovies DESC, production_year LIMIT 10")
# ===== Q12 =====
# Who are the 10 actors with the most aliases?
q12 = dbGetQuery(de, "SELECT name.name, COUNT(name.name) AS NumAliases
FROM aka_name, name
WHERE aka_name.person_id = name.id
GROUP BY name.name ORDER BY NumAliases DESC LIMIT 10")
