Chapter 2
2. Application method
With the topic “ ”, our objective is to apply the data mining methods we have
learned to a real-world business problem. Specifically, given the Food &
Beverages dataset, we want to understand when a particular type of dish is
ordered and at what price.
To capture this complexity and collect more insights, our group uses the
Orange software to perform data selection, preprocessing, clustering, and
prediction.
2.1.2.2 K-means
2.1.3 Classification
Data classification plays a crucial role in extracting actionable insights from data,
enabling automated decision-making, and facilitating tasks such as spam filtering, fraud
detection, sentiment analysis, and more.
A support vector machine (SVM) is used for classification and regression tasks.
SVM aims to find the optimal hyperplane that separates data points into different classes
while maximizing the margin, which is the distance between the hyperplane and the
nearest data points (support vectors). SVM works by mapping input data into a high-
dimensional feature space and finding the hyperplane that best separates the classes. This
hyperplane is determined by support vectors, which are data points closest to the decision
boundary.
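The idea can be sketched with scikit-learn's `SVC` on synthetic data. This is an illustrative example only, assuming a two-class toy dataset; the actual analysis in this report was performed with Orange's SVM widget.

```python
# Minimal SVM classification sketch (illustrative only; the report's
# analysis used Orange's SVM widget, not this code).
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Two synthetic, well-separated classes of 2-D points
X, y = make_blobs(n_samples=200, centers=2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Linear kernel: find the maximum-margin hyperplane
clf = SVC(kernel="linear", C=1.0).fit(X_train, y_train)

# The support vectors are the training points closest to the boundary
print("support vectors per class:", clf.n_support_)
print("test accuracy:", clf.score(X_test, y_test))
```

The `support vectors per class` output shows that only a small subset of training points determines the hyperplane, which is what makes SVM robust to points far from the decision boundary.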
We preprocess the data using the software's preprocessing function to
remove all rows containing missing values.
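Outside Orange, the same row-wise removal of missing values can be sketched with pandas. The column names below are hypothetical stand-ins for the dataset's actual fields.

```python
# pandas equivalent of the Orange preprocessing step: drop every row
# that contains at least one missing value (column names hypothetical).
import pandas as pd

df = pd.DataFrame({
    "order_id": [1, 2, 3, 4],
    "price": [11.95, None, 17.95, 19.95],
    "item_name": ["soup", "salad", None, "tacos"],
})

clean = df.dropna()                    # keep only complete rows
print(len(df), "->", len(clean))       # 4 -> 2
```

Dropping rows is the simplest strategy; it is appropriate here because the missing rows are a small fraction of the dataset, so little information is lost.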
3. Data analysis
We have constructed a processing workflow in the Orange software as follows:
4. Data visualization
4.1 Data distributions
The visualization shows the relationship between the order_id component and the
price of each product.
The chart shows that the data are divided into two groups based on the total
ordered quantity per order_id; each type of product has a distinctive price,
scattered across all of the colors.
The first group represents order_ids whose total is below 1,000, while the
second group covers order_ids exceeding 1,000. The highest frequency is clearly
more than 300 orders, while the lowest is about 30 orders.
The first group represents order_ids whose total is less than 1,000.
Overall, orders are scattered across all order_ids. The largest number of
orders went to the meal priced at $17.95, while the $19.95 meal was the one
customers were most reluctant to buy. Prices from $11.95 to $15.5 were the
range customers preferred most.
Specifically, this group contained 2,243 orders, accounting for 53.97% of the
whole spectrum. The $17.95 meal accounted for the largest number of order_ids
with 364, while the $19.95 meal was the least chosen with merely 42. Meals
priced from $11.95 to $15.5 were chosen substantially by customers and formed
the most densely concentrated cluster. The respective numbers of order_ids for
the prices $11.95, $12.95, $13.95, $14.5, $14.95 and $15.5 are 149, 214, 231,
229, 123 and 108, together accounting for nearly 50%.
The second group represents order_ids whose total is more than 1,000.
Overall, orders are scattered across all order_ids. The largest number of
orders again went to the meal priced at $17.95, while the $19.95 meal remained
the one customers were most reluctant to buy. Prices from $11.95 to $14.5 were
the range customers preferred most.
Specifically, this group contained 1,913 orders, accounting for 46.03% of the
whole spectrum. The $17.95 meal accounted for the largest number of order_ids
with 316, while the $19.95 meal was the least chosen with merely 34. Meals
priced from $11.95 to $14.5 were chosen substantially by customers and formed
the most densely concentrated cluster.
The box plot represents the dispersion of the restaurant's sales for each dish.
The purpose of this analysis is to compare the disparity between ordered dishes
in light of each customer's preference, from which our team can make suitable
adjustments.
The scatter plot visualizes the four clusters of American, Asian, Italian and
Mexican food against their respective prices, ranging from $5.0 to $19.95.
From the scatter plot, our team observes that American food was priced
predominantly from $7.0 to $13.95, with meals at $10.5 to $12.95 chosen least
and the $7.0 meal receiving the strongest response. Asian food was spread more
evenly across virtually the whole price range: most customers ordered at prices
from $5.0 to $14.95, while the $16.5 and $17.95 prices were chosen less often.
Italian and Mexican food contrasted sharply, with the former priced from $14.5
to $19.95 while the latter was cheaper, priced from $5.0 to $14.95.
This analysis suggests that American food priced between $10.5 and $13.95
should be lowered in price to gain more customers, or its quality should be
enhanced. Asian food was well received by customers despite its price
differences. Italian food was expensive enough that its price should be
decreased. Most noticeably, Mexican food was chosen frequently at a reasonable
price, meaning its price could be raised to earn more profit.
Based on the analysis depicted in the figure, it can be concluded that in this
particular case, kNN is less suitable compared to other classification models.
Additionally, when considering the F1 score, which measures the accuracy of the test, it
is evident that the SVM and Logistic Regression models exhibit higher values. Similarly,
in terms of the AUC (Area under the ROC Curve) metric, the kNN model performs the
poorest.
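The comparison of kNN, SVM and Logistic Regression by F1 and AUC can be sketched in scikit-learn with cross-validation. This is an illustrative sketch on synthetic data; Orange's Test & Score widget computed these metrics on the actual dataset, and the model settings below are assumptions, not the report's configuration.

```python
# Sketch: compare kNN, SVM and Logistic Regression by cross-validated
# F1 and AUC (illustrative; the report used Orange's Test & Score widget).
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification problem stands in for the real data
X, y = make_classification(n_samples=300, random_state=0)

models = {
    "kNN": KNeighborsClassifier(),
    "SVM": SVC(probability=True),
    "Logistic Regression": LogisticRegression(max_iter=1000),
}
for name, model in models.items():
    # 5-fold cross-validation, scoring both F1 and ROC AUC
    scores = cross_validate(model, X, y, cv=5, scoring=["f1", "roc_auc"])
    print(name,
          "F1=%.3f" % scores["test_f1"].mean(),
          "AUC=%.3f" % scores["test_roc_auc"].mean())
```

Reporting both metrics matters: F1 balances precision and recall at a fixed threshold, while AUC summarizes ranking quality across all thresholds, so a model can lead on one and trail on the other.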
We employed the Confusion Matrix to analyze the Logistic Regression, aiming to
assess the number of mispredictions.
The confusion matrix shows 34.4% for the Logistic Regression model, so we
may choose this model for the final prediction. This is a part of the
predictions using the SVM model with the data we analyzed: we randomly
selected 100 customers (10% of the dataset) for prediction.
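The misprediction count can be read directly off a confusion matrix: diagonal entries are correct predictions, off-diagonal entries are errors. The sketch below uses hypothetical cuisine labels; the actual counts came from Orange's Confusion Matrix widget.

```python
# Confusion-matrix sketch for counting mispredictions (illustrative;
# the report used Orange's Confusion Matrix widget, and these labels
# and predictions are hypothetical).
from sklearn.metrics import confusion_matrix

y_true = ["American", "Asian", "Asian", "Italian", "Mexican", "Mexican"]
y_pred = ["American", "Asian", "Italian", "Italian", "Mexican", "Asian"]

labels = ["American", "Asian", "Italian", "Mexican"]
cm = confusion_matrix(y_true, y_pred, labels=labels)
print(cm)  # rows = actual class, columns = predicted class

# Off-diagonal entries are mispredictions
mispredicted = cm.sum() - cm.trace()
print("mispredicted:", mispredicted, "of", cm.sum())  # 2 of 6
```

Beyond the total error count, the off-diagonal cells show *which* classes are confused with each other, which is more actionable than a single accuracy number.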
5.3 Predictions
Picture 17: Actual data’s predictions