Professional Documents
Culture Documents
Yelp Dataset
Yelp Dataset
Yelp Dataset
1. Profile the data by finding the total number of records for each of the tables
below:
code:
Select count(*)
From tables;
2. Find the total distinct records by either the foreign key or primary key for
each table. If two foreign keys are listed in the table, please specify which
foreign key.
3. Are there any columns with null values in the Users table? Indicate "yes," or
"no."
Answer: No
4. For each table and column listed below, display the smallest (minimum), largest
(maximum), and average (mean) value for the following fields:
i. Table: Review, Column: Stars
+-----------+------------+
| city | review_num |
+-----------+------------+
| Las Vegas | 82854 |
+-----------+------------+
6. Find the distribution of star ratings to the business in the following cities:
i. Avon
Copy and Paste the Resulting Table Below (2 columns – star rating and count):
+-------+-------+
| stars | count |
+-------+-------+
| 1.5 | 10 |
| 2.5 | 6 |
| 3.5 | 88 |
| 4.0 | 21 |
| 4.5 | 31 |
| 5.0 | 3 |
+-------+-------+
ii. Beachwood
Copy and Paste the Resulting Table Below (2 columns – star rating and count):
+-------+-------+
| stars | count |
+-------+-------+
| 2.0 | 8 |
| 2.5 | 3 |
| 3.0 | 11 |
| 3.5 | 6 |
| 4.0 | 69 |
| 4.5 | 17 |
| 5.0 | 23 |
+-------+-------+
+------------------------+--------+--------------+
| id | name | review_count |
+------------------------+--------+--------------+
| -G7Zkl1wIWBBmD0KRy_sCw | Gerald | 2000 |
| -3s52C4zL_DHRK0ULG6qtg | Sara | 1629 |
| -8lbUNlXVSoXqaRRiHiSNg | Yuri | 1339 |
+------------------------+--------+--------------+
From the result below, we could see that posing more reviews does not correlate
with more fans.
9. Are there more reviews with the word "love" or with the word "hate" in them?
Answer:
There were 1780 reviews with the word 'love' and 232 reviews with the word
'hate', indicating that there were more reviews with the word 'love'.
+------------------------+-----------+------+
| id | name | fans |
+------------------------+-----------+------+
| -9I98YbNQnLdAmcYfb324Q | Amy | 503 |
| -8EnCioUmDygAbsYZmTeRQ | Mimi | 497 |
| --2vR0DIsmQ6WfcSzKWigw | Harald | 311 |
| -G7Zkl1wIWBBmD0KRy_sCw | Gerald | 253 |
| -0IiMAZI2SsQ7VmyzJjokQ | Christine | 173 |
| -g3XIcCb2b-BD0QBCcq2Sw | Lisa | 159 |
| -9bbDysuiWeo2VShFJJtcw | Cat | 133 |
| -FZBTkAZEXoP7CYvRV2ZwQ | William | 126 |
| -9da1xk7zgnnfO1uTVYGkA | Fran | 124 |
| -lh59ko3dxChBSZ9U7LfUw | Lissa | 120 |
+------------------------+-----------+------+
1. Pick one city and category of your choice and group the businesses in that city
or category by their overall star rating. Compare the businesses with 2-3 stars to
the businesses with 4-5 stars and answer the following questions. Include your
code.
i. Do the two groups you chose to analyze have a different distribution of hours?
The result below indicates they do have a different distribution of hours. 2-3
stars restaurants in Toronto are open from 10 a.m. to 4 p.m., while 4-5 stars
restaurants are open from 6 p.m. to 11 p.m. Restaurants with two stars have longer
hours of operation.
ii. Do the two groups you chose to analyze have a different number of reviews?
They do have a different number of reviews. The result below shows the restaurants
in Toronto with 2-3 stars have 34 reviews and the one with 4-5 stars have 89
reviews.
iii. Are you able to infer anything from the location data provided between these
two groups? Explain.
It is not possible to get more information from their location data because they
are in different locations.
+------------------------+-----------+---------+----------------------
+-------------+--------------+------------------+-------------+-----------+
| id | name | city | hours | category
| review_count | location | postal_code | star_rate |
+------------------------+-----------+---------+----------------------
+-------------+--------------+------------------+-------------+-----------+
| 1NyHpXJqSLHnvDCOW0nJDg | Pizzaiolo | Toronto | Saturday|10:00-4:00 | Restaurants
| 34 | 43.6479,-79.3901 | M5H 1X6 | 2-3 stars |
| 37kk0IW6jL7ZlxZF6k2QBg | Edulis | Toronto | Saturday|18:00-23:00 | Restaurants
| 89 | 43.6419,-79.4066 | M5V | 4-5 stars |
+------------------------+-----------+---------+----------------------
+-------------+--------------+------------------+-------------+-----------+
2. Group business based on the ones that are open and the ones that are closed.
What differences can you find between the ones that are still open and the ones
that are closed? List at least two differences and the SQL code you used to arrive
at your answer.
i. Difference 1: The number of distinct business id and city for the ones that are
open are far more than the ones that are closed, so the average of number of
reviews for the ones that are open is higher than the ones that are closed.
ii. Difference 2: There is no large difference in stars rating between the ones
that are open and the ones that are closed.
+--------------------+----------------------+-------------------+---------------
+---------+
| count(distinct id) | count(distinct city) | avg(review_count) | avg(stars) |
is_open |
+--------------------+----------------------+-------------------+---------------
+---------+
| 1520 | 144 | 23.1980263158 | 3.52039473684 |
0 |
| 8480 | 351 | 31.7570754717 | 3.67900943396 |
1 |
+--------------------+----------------------+-------------------+---------------
+---------+
3. For this last part of your analysis, you are going to choose the type of
analysis you want to conduct on the Yelp dataset and are going to prepare the data
for analysis.
Ideas for analysis include: Parsing out keywords and business attributes for
sentiment analysis, clustering businesses to find commonalities or anomalies
between them, predicting the overall star rating for a business, predicting the
number of fans a user will have, and so on. These are just a few examples to get
you started, so feel free to be creative and come up with your own problem you want
to solve. Provide answers, in-line, to all of the following:
I choose to cluster the business in the restaurant category and filter out the
enterprises that are open to obtain the total number of reviews and operation hours
of business with an average rating of 4-5 stars.
ii. Write 1-2 brief paragraphs on the type of data you will need for your analysis
and why you chose that data:
In order to obtain all the information, I connected the three tables of business,
category and hours. Among them, I used the primary key business_id for the inner
join of table business and table category, and left join for the house table to
avoid the filtering of some businesses due to the lack of operation hours.
The variables I respectively choose from the business table are business name,
state, city, the average of stars and the sum of review counts. And I took hours
variable from hours table. Then I filtered businesses that are open as well as
businesses in the restaurant category and grouped them by business name. After
grouping, I kept businesses with an average rating of 4-5 stars and sorted them in
reverse order according to their rating and the total number of reviews.
+----------------------------------------+-------+----------------
+----------------------+-------+---------+
| business | state | city | hours
| stars | reviews |
+----------------------------------------+-------+----------------
+----------------------+-------+---------+
| Green Corner Restaurant | AZ | Mesa | Saturday|10:30-
22:00 | 5.0 | 1869 |
| Slyman's Restaurant | OH | Cleveland | Saturday|9:00-
13:00 | 4.5 | 2166 |
| Cabin Fever | ON | Toronto | Saturday|16:00-
2:00 | 4.5 | 182 |
| Sushi Osaka | ON | Toronto | Saturday|11:00-
23:00 | 4.5 | 56 |
| Big Wong Restaurant | NV | Las Vegas | Saturday|10:00-
23:00 | 4.0 | 5376 |
| Bootleggers Modern American Smokehouse | AZ | Phoenix | Saturday|11:00-
22:00 | 4.0 | 3017 |
| Hermanos Mexican Grill | ON | Mississauga | Saturday|12:00-
21:00 | 4.0 | 483 |
| Edulis | ON | Toronto | Saturday|18:00-
23:00 | 4.0 | 445 |
| Masamune Japanese Restaurant | ON | Mississauga | Saturday|12:00-
22:00 | 4.0 | 366 |
| C's Restaurant Bakery and Coffee Shop | WI | Middleton | Saturday|6:00-
14:00 | 4.0 | 259 |
| Miros Cantina Mexicana | EDH | Edinburgh | Saturday|12:00-
22:30 | 4.0 | 259 |
| Rise and Dine Cafe | OH | Chesterland | Saturday|6:00-
15:00 | 4.0 | 210 |
| Matt's Big Breakfast | AZ | Phoenix | None
| 4.0 | 188 |
| Cabin Club | OH | Westlake | None
| 4.0 | 105 |
| TWIISTED Burgers & Sushi | OH | Medina | None
| 4.0 | 94 |
| Naniwa-Taro | ON | Toronto | None
| 4.0 | 75 |
| Colossus Greek Taverna | ON | Oakville | None
| 4.0 | 55 |
| Seoul Garden Korean Restaurant | OH | Cuyahoga Falls | None
| 4.0 | 55 |
| Taqueria Y Cenaduria Culiacan | AZ | Tolleson | None
| 4.0 | 23 |
| Patiala House | ON | Brampton | None
| 4.0 | 10 |
+----------------------------------------+-------+----------------
+----------------------+-------+---------+
iv. Provide the SQL code you used to create your final dataset: