Yelp Dataset

You might also like

Download as txt, pdf, or txt
Download as txt, pdf, or txt
You are on page 1of 9

Part 1: Yelp Dataset Profiling and Understanding

1. Profile the data by finding the total number of records for each of the tables
below:

i. Attribute table = 10000


ii. Business table = 10000
iii. Category table = 10000
iv. Checkin table = 10000
v. elite_years table = 10000
vi. friend table = 10000
vii. hours table = 10000
viii. photo table = 10000
ix. review table = 10000
x. tip table = 10000
xi. user table = 10000

code:
Select count(*)
From tables;

2. Find the total distinct records by either the foreign key or primary key for
each table. If two foreign keys are listed in the table, please specify which
foreign key.

i. Business = id (primary key)


number of distinct records: 10000

Select count(distinct id) From business;

ii. Hours = business_id (foreign key)


number of distinct records: 1562

Select count(distinct business_id) From hours;

iii. Category = business_id (foreign key)


number of distinct records: 2643

Select count(distinct business_id) From category;

iv. Attribute = business_id (foreign key)


number of distinct records: 1115

Select count(distinct business_id) From attribute;

v. Review = id(primary key), business_id(foreign key), user_id(foreign key)


number of distinct records: 10000, 8090, 9581

Select count(distinct id) as pri, count(distinct business_id) as foreign1,


count(distinct user_id) as foreign2 From review;

vi. Checkin = business_id (primary key)


number of distinct records: 493

Select count(distinct business_id) From checkin;

vii. Photo = id(primary key), business_id(foreign key)


number of distinct records: 10000, 6493
Select count(distinct id) as pri, count(distinct business_id) as foreign_key From
photo;

viii. Tip = user_id(foreign key), business_id(foreign key)


number of distinct records: 537, 3979

Select count(distinct user_id) as foreign1, count(distinct business_id) as foreign2


From tip;

ix. User = id(primary key)


number of distinct records: 10000

Select count(distinct id) From user;

x. Friend = user_id(foreign key)


number of distinct records: 11

Select count(distinct user_id) From friend;

xi. Elite_years = user_id(foreign key)


number of distinct records: 2780

Select count(distinct user_id) From elite_years;

3. Are there any columns with null values in the Users table? Indicate "yes," or
"no."

Answer: No

SQL code used to arrive at answer:

Select count(*) From user


Where
id IS NULL or
name IS NULL or
review_count IS NULL or
yelping_since IS NULL or
useful IS NULL or
funny IS NULL or
cool IS NULL or
fans IS NULL or
average_stars IS NULL or
compliment_hot IS NULL or
compliment_more IS NULL or
compliment_profile IS NULL or
compliment_cute IS NULL or
compliment_list IS NULL or
compliment_note IS NULL or
compliment_plain IS NULL or
compliment_cool IS NULL or
compliment_funny IS NULL or
compliment_writer IS NULL or
compliment_photos IS NULL;

4. For each table and column listed below, display the smallest (minimum), largest
(maximum), and average (mean) value for the following fields:
i. Table: Review, Column: Stars

min: 1 max: 5 avg: 3.7082

Select min(stars) as min, max(stars) as max, avg(stars) as avg From review;

ii. Table: Business, Column: Stars

min: 1 max: 5 avg: 3.6549

Select min(stars) as min, max(stars) as max, avg(stars) as avg From business;

iii. Table: Tip, Column: Likes

min: 0 max: 2 avg: 0.0144

Select min(likes) as min, max(likes) as max, avg(likes) as avg From tip;

iv. Table: Checkin, Column: Count

min: 1 max: 53 avg: 1.9414

Select min(count) as min, max(count) as max, avg(count) as avg From checkin;

v. Table: User, Column: Review_count

min: 0 max: 2000 avg: 24.2995

Select min(review_count) as min, max(review_count) as max, avg(review_count)


as avg From user;

5. List the cities with the most reviews in descending order:

SQL code used to arrive at answer:

Select city, sum(review_count) as review_num


From business
Group by city
Order by review_num DESC
Limit 1;

Copy and Paste the Result Below:

+-----------+------------+
| city | review_num |
+-----------+------------+
| Las Vegas | 82854 |
+-----------+------------+

6. Find the distribution of star ratings to the business in the following cities:

i. Avon

SQL code used to arrive at answer:


Select stars, sum(review_count) as count
From business
Where city = 'Avon'
Group by stars;

Copy and Paste the Resulting Table Below (2 columns – star rating and count):
+-------+-------+
| stars | count |
+-------+-------+
| 1.5 | 10 |
| 2.5 | 6 |
| 3.5 | 88 |
| 4.0 | 21 |
| 4.5 | 31 |
| 5.0 | 3 |
+-------+-------+

ii. Beachwood

SQL code used to arrive at answer:

Select stars, sum(review_count) as count


From business
Where city = 'Beachwood'
Group by stars;

Copy and Paste the Resulting Table Below (2 columns – star rating and count):

+-------+-------+
| stars | count |
+-------+-------+
| 2.0 | 8 |
| 2.5 | 3 |
| 3.0 | 11 |
| 3.5 | 6 |
| 4.0 | 69 |
| 4.5 | 17 |
| 5.0 | 23 |
+-------+-------+

7. Find the top 3 users based on their total number of reviews:

SQL code used to arrive at answer:

Select id, name, review_count


From user
Order by review_count DESC
Limit 3;

Copy and Paste the Result Below:

+------------------------+--------+--------------+
| id | name | review_count |
+------------------------+--------+--------------+
| -G7Zkl1wIWBBmD0KRy_sCw | Gerald | 2000 |
| -3s52C4zL_DHRK0ULG6qtg | Sara | 1629 |
| -8lbUNlXVSoXqaRRiHiSNg | Yuri | 1339 |
+------------------------+--------+--------------+

8. Does posing more reviews correlate with more fans?

Please explain your findings and interpretation of the results:

From the result below, we could see that posing more reviews does not correlate
with more fans.

Select name, review_count, fans


From user
Order by review_count DESC;
+-----------+--------------+------+
| name | review_count | fans |
+-----------+--------------+------+
| Gerald | 2000 | 253 |
| Sara | 1629 | 50 |
| Yuri | 1339 | 76 |
| .Hon | 1246 | 101 |
| William | 1215 | 126 |
| Harald | 1153 | 311 |
| eric | 1116 | 16 |
| Roanna | 1039 | 104 |
| Mimi | 968 | 497 |
| Christine | 930 | 173 |
| Ed | 904 | 38 |
| Nicole | 864 | 43 |
| Fran | 862 | 124 |
| Mark | 861 | 115 |
| Christina | 842 | 85 |
| Dominic | 836 | 37 |
| Lissa | 834 | 120 |
| Lisa | 813 | 159 |
| Alison | 775 | 61 |
| Sui | 754 | 78 |
| Tim | 702 | 35 |
| L | 696 | 10 |
| Angela | 694 | 101 |
| Crissy | 676 | 25 |
| Lyn | 675 | 45 |
+-----------+--------------+------+
(Output limit exceeded, 25 of 10000 total rows shown)

9. Are there more reviews with the word "love" or with the word "hate" in them?

Answer:

There were 1780 reviews with the word 'love' and 232 reviews with the word
'hate', indicating that there were more reviews with the word 'love'.

SQL code used to arrive at answer:

Select count(*) as love_num


From review
Where text Like '%love%';
+----------+
| love_num |
+----------+
| 1780 |
+----------+

Select count(*) as hate_num


From review
Where text Like '%hate%';
+----------+
| hate_num |
+----------+
| 232 |
+----------+

10. Find the top 10 users with the most fans:

SQL code used to arrive at answer:

Select id, name, fans


From user
Order by fans DESC
limit 10;

Copy and Paste the Result Below:

+------------------------+-----------+------+
| id | name | fans |
+------------------------+-----------+------+
| -9I98YbNQnLdAmcYfb324Q | Amy | 503 |
| -8EnCioUmDygAbsYZmTeRQ | Mimi | 497 |
| --2vR0DIsmQ6WfcSzKWigw | Harald | 311 |
| -G7Zkl1wIWBBmD0KRy_sCw | Gerald | 253 |
| -0IiMAZI2SsQ7VmyzJjokQ | Christine | 173 |
| -g3XIcCb2b-BD0QBCcq2Sw | Lisa | 159 |
| -9bbDysuiWeo2VShFJJtcw | Cat | 133 |
| -FZBTkAZEXoP7CYvRV2ZwQ | William | 126 |
| -9da1xk7zgnnfO1uTVYGkA | Fran | 124 |
| -lh59ko3dxChBSZ9U7LfUw | Lissa | 120 |
+------------------------+-----------+------+

Part 2: Inferences and Analysis

1. Pick one city and category of your choice and group the businesses in that city
or category by their overall star rating. Compare the businesses with 2-3 stars to
the businesses with 4-5 stars and answer the following questions. Include your
code.

i. Do the two groups you chose to analyze have a different distribution of hours?

The result below indicates they do have a different distribution of hours. 2-3
stars restaurants in Toronto are open from 10 a.m. to 4 p.m., while 4-5 stars
restaurants are open from 6 p.m. to 11 p.m. Restaurants with two stars have longer
hours of operation.

ii. Do the two groups you chose to analyze have a different number of reviews?

They do have a different number of reviews. The result below shows the restaurants
in Toronto with 2-3 stars have 34 reviews and the one with 4-5 stars have 89
reviews.

iii. Are you able to infer anything from the location data provided between these
two groups? Explain.

It is not possible to get more information from their location data because they
are in different locations.

SQL code used for analysis:

Select b.id, b.name,b.city, h.hours, c.category, b.review_count,


b.latitude ||','|| b.longitude as location, b.postal_code,
CASE
WHEN b.stars Between 2 And 3 THEN '2-3 stars'
WHEN b.stars Between 4 And 5 Then '4-5 stars'
END AS star_rate
From hours h
Join business b ON h.business_id = b.id
Join category c ON b.id = c.business_id
Where city = 'Toronto' AND category = 'Restaurants'
Group by star_rate;

+------------------------+-----------+---------+----------------------
+-------------+--------------+------------------+-------------+-----------+
| id | name | city | hours | category
| review_count | location | postal_code | star_rate |
+------------------------+-----------+---------+----------------------
+-------------+--------------+------------------+-------------+-----------+
| 1NyHpXJqSLHnvDCOW0nJDg | Pizzaiolo | Toronto | Saturday|10:00-4:00 | Restaurants
| 34 | 43.6479,-79.3901 | M5H 1X6 | 2-3 stars |
| 37kk0IW6jL7ZlxZF6k2QBg | Edulis | Toronto | Saturday|18:00-23:00 | Restaurants
| 89 | 43.6419,-79.4066 | M5V | 4-5 stars |
+------------------------+-----------+---------+----------------------
+-------------+--------------+------------------+-------------+-----------+

2. Group business based on the ones that are open and the ones that are closed.
What differences can you find between the ones that are still open and the ones
that are closed? List at least two differences and the SQL code you used to arrive
at your answer.

i. Difference 1: The number of distinct business id and city for the ones that are
open are far more than the ones that are closed, so the average of number of
reviews for the ones that are open is higher than the ones that are closed.

ii. Difference 2: There is no large difference in stars rating between the ones
that are open and the ones that are closed.

SQL code used for analysis:

Select count(distinct id), count(distinct city), avg(review_count), avg(stars),


is_open
From business
Group by is_open;

+--------------------+----------------------+-------------------+---------------
+---------+
| count(distinct id) | count(distinct city) | avg(review_count) | avg(stars) |
is_open |
+--------------------+----------------------+-------------------+---------------
+---------+
| 1520 | 144 | 23.1980263158 | 3.52039473684 |
0 |
| 8480 | 351 | 31.7570754717 | 3.67900943396 |
1 |
+--------------------+----------------------+-------------------+---------------
+---------+

3. For this last part of your analysis, you are going to choose the type of
analysis you want to conduct on the Yelp dataset and are going to prepare the data
for analysis.

Ideas for analysis include: Parsing out keywords and business attributes for
sentiment analysis, clustering businesses to find commonalities or anomalies
between them, predicting the overall star rating for a business, predicting the
number of fans a user will have, and so on. These are just a few examples to get
you started, so feel free to be creative and come up with your own problem you want
to solve. Provide answers, in-line, to all of the following:

i. Indicate the type of analysis you chose to do:

I choose to cluster the business in the restaurant category and filter out the
enterprises that are open to obtain the total number of reviews and operation hours
of business with an average rating of 4-5 stars.

ii. Write 1-2 brief paragraphs on the type of data you will need for your analysis
and why you chose that data:

In order to obtain all the information, I connected the three tables of business,
category and hours. Among them, I used the primary key business_id for the inner
join of table business and table category, and left join for the house table to
avoid the filtering of some businesses due to the lack of operation hours.

The variables I respectively choose from the business table are business name,
state, city, the average of stars and the sum of review counts. And I took hours
variable from hours table. Then I filtered businesses that are open as well as
businesses in the restaurant category and grouped them by business name. After
grouping, I kept businesses with an average rating of 4-5 stars and sorted them in
reverse order according to their rating and the total number of reviews.

iii. Output of your finished dataset:

+----------------------------------------+-------+----------------
+----------------------+-------+---------+
| business | state | city | hours
| stars | reviews |
+----------------------------------------+-------+----------------
+----------------------+-------+---------+
| Green Corner Restaurant | AZ | Mesa | Saturday|10:30-
22:00 | 5.0 | 1869 |
| Slyman's Restaurant | OH | Cleveland | Saturday|9:00-
13:00 | 4.5 | 2166 |
| Cabin Fever | ON | Toronto | Saturday|16:00-
2:00 | 4.5 | 182 |
| Sushi Osaka | ON | Toronto | Saturday|11:00-
23:00 | 4.5 | 56 |
| Big Wong Restaurant | NV | Las Vegas | Saturday|10:00-
23:00 | 4.0 | 5376 |
| Bootleggers Modern American Smokehouse | AZ | Phoenix | Saturday|11:00-
22:00 | 4.0 | 3017 |
| Hermanos Mexican Grill | ON | Mississauga | Saturday|12:00-
21:00 | 4.0 | 483 |
| Edulis | ON | Toronto | Saturday|18:00-
23:00 | 4.0 | 445 |
| Masamune Japanese Restaurant | ON | Mississauga | Saturday|12:00-
22:00 | 4.0 | 366 |
| C's Restaurant Bakery and Coffee Shop | WI | Middleton | Saturday|6:00-
14:00 | 4.0 | 259 |
| Miros Cantina Mexicana | EDH | Edinburgh | Saturday|12:00-
22:30 | 4.0 | 259 |
| Rise and Dine Cafe | OH | Chesterland | Saturday|6:00-
15:00 | 4.0 | 210 |
| Matt's Big Breakfast | AZ | Phoenix | None
| 4.0 | 188 |
| Cabin Club | OH | Westlake | None
| 4.0 | 105 |
| TWIISTED Burgers & Sushi | OH | Medina | None
| 4.0 | 94 |
| Naniwa-Taro | ON | Toronto | None
| 4.0 | 75 |
| Colossus Greek Taverna | ON | Oakville | None
| 4.0 | 55 |
| Seoul Garden Korean Restaurant | OH | Cuyahoga Falls | None
| 4.0 | 55 |
| Taqueria Y Cenaduria Culiacan | AZ | Tolleson | None
| 4.0 | 23 |
| Patiala House | ON | Brampton | None
| 4.0 | 10 |
+----------------------------------------+-------+----------------
+----------------------+-------+---------+

iv. Provide the SQL code you used to create your final dataset:

Select b.name as business, b.state, b.city, h.hours,


avg(b.stars) as stars, sum(b.review_count) as reviews
From business b
Join category c On b.id=c.business_id
left Join hours h On b.id=h.business_id
Where b.is_open = 1 AND category = 'Restaurants'
Group by b.name
Having stars Between 4.0 AND 5.0
Order by stars DESC, reviews DESC ;

You might also like