CHAPTER 2

2. Application method

2.1 Theoretical basis

With the topic “ ”, our objective is to apply the data mining methods we have
learned to address a real-world business problem. Specifically, given the Food &
Beverages dataset, we want to understand when a particular type of dish is ordered
and at what price.

In order to capture this complexity and collect more insights, our group uses
the Orange software to perform data selection, processing, and clustering, and to make
predictions.

2.1.1 Data collecting and preprocessing

Data collection involves systematically gathering raw data from unstructured
sources in order to form a structured dataset, while data preprocessing is the process of
cleaning, transforming, and organizing the collected data to make it suitable for analysis.
Both data collection and preprocessing are essential steps in extracting valuable insights
from raw data, ensuring its quality and usability for subsequent analysis tasks, as
illustrated below by the four components of data preprocessing.

Picture 1: Data preprocessing’s components
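As an illustration of these components, a minimal preprocessing sketch in Python with
pandas might look as follows; the file name "orders.csv" and the column "Price" are
placeholders for illustration only, not our actual dataset.

import pandas as pd

# Hypothetical file and column names, used only to illustrate the steps.
df = pd.read_csv("orders.csv")            # data collection: load the raw file
df = df.drop_duplicates()                 # cleaning: remove duplicate rows
df = df.dropna()                          # cleaning: drop rows with missing values
df["Price"] = df["Price"].astype(float)   # transformation: enforce a numeric type
print(df.describe())                      # quick check of the resulting structure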


2.1.2 Clustering

2.1.2.1 Data clustering

Data clustering is a technique used in unsupervised machine learning to group
similar data points together based on certain characteristics or features. The goal of
clustering is to identify inherent patterns or structures within a dataset without the need
for predefined labels. By partitioning the data into distinct clusters, where data points
within the same cluster are more similar to each other than to those in other clusters,
clustering enables data exploration, pattern recognition, and segmentation.

A good clustering method creates high-quality clusters:

• High similarity within clusters

• Low similarity between clusters (high dissimilarity); both criteria can be quantified together, as the silhouette sketch after this list shows

Typical applications include:

• Standalone data clustering tools.

• A preliminary preprocessing stage for other algorithms.
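The silhouette score measures both quality criteria at once: values near 1 indicate
tight, well-separated clusters. Below is a minimal Python sketch using scikit-learn on
synthetic data; it is an illustration, not part of our actual workflow.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Two well-separated synthetic blobs of 50 points each
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(silhouette_score(X, labels))  # close to 1.0 for well-separated clusters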

2.1.2.2 K-means

K-means is a popular unsupervised machine learning algorithm used for clustering
data points into distinct groups based on their similarities. The algorithm iteratively
partitions a dataset into K clusters by minimizing the sum of squared distances between
data points and their respective cluster centroids. Initially, K centroids are randomly
selected from the data points, and each data point is assigned to the nearest centroid.
Then, the centroids are recalculated as the mean of all data points assigned to each
cluster. This process iterates until convergence, when the centroids no longer change
significantly or a predefined number of iterations is reached.

Picture 2: K-means illustration
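To make the iteration concrete, here is a minimal NumPy sketch of the steps described
above (random initialization, assignment, recomputation, convergence check). It is a
simplified illustration, not Orange's implementation.

import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: randomly select K initial centroids from the data points
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 2: assign each point to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its assigned points
        new_centroids = np.array([X[labels == j].mean(axis=0)
                                  if np.any(labels == j) else centroids[j]
                                  for j in range(k)])
        # Step 4: stop when the centroids no longer change significantly
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids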

2.1.2.3 Interactive K-means


Interactive K-means is a variation of the traditional K-means algorithm that
incorporates user interaction into the clustering process. Unlike standard K-means, where
the number of clusters (K) is predetermined and centroids are initialized randomly,
interactive K-means allows users to provide feedback during the clustering process. This
feedback can include adjusting the number of clusters, specifying initial centroids, or
guiding the algorithm's iterations by manually moving centroids or reassigning data
points.
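One form of such feedback, supplying user-chosen starting centroids instead of random
ones, can be sketched with scikit-learn; the data points below are invented for
illustration.

import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 1], [1.2, 0.9], [5, 5], [5.1, 4.8], [9, 1], [8.8, 1.2]])
user_centroids = np.array([[1, 1], [5, 5], [9, 1]])  # user-supplied initial centroids

km = KMeans(n_clusters=3, init=user_centroids, n_init=1).fit(X)
print(km.labels_)  # cluster assignment of each point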

2.1.3 Classification

2.1.3.1 Data classification

Data classification is a fundamental task in data mining and machine learning,
involving the categorization of data points into predefined classes or categories based on
their features or attributes. The goal of classification is to develop a predictive model that
can accurately assign new, unseen data instances to the appropriate class labels based on
patterns learned from the training data. In data mining, classification algorithms analyze
historical data to identify patterns and relationships that can be used to classify future
observations.
For example, in email spam detection, the task is to classify incoming emails as
either "spam" or "not spam" based on their content and characteristics. By analyzing a
large dataset of labeled emails (spam or non-spam), a classification algorithm learns
patterns that distinguish between the two classes, such as specific keywords, sender
information, or email formatting. Another example is in medical diagnosis, where
classification algorithms can be used to predict the presence or absence of a disease based
on patient symptoms, test results, and demographic information.
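The spam example can be sketched in a few lines of Python; the tiny corpus and labels
below are invented purely to illustrate the idea, not real training data.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

emails = ["win a free prize now", "meeting at noon tomorrow",
          "free money click here", "project report attached"]
labels = [1, 0, 1, 0]  # 1 = spam, 0 = not spam

vec = CountVectorizer()  # turn words into count features
model = MultinomialNB().fit(vec.fit_transform(emails), labels)
print(model.predict(vec.transform(["free prize money"])))  # -> [1], i.e. spam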

Data classification plays a crucial role in extracting actionable insights from data,
enabling automated decision-making, and facilitating tasks such as spam filtering, fraud
detection, sentiment analysis, and more.

2.1.3.2 Logistic Regression

Logistic regression is a statistical method used for binary classification tasks,
where the goal is to predict the probability that an observation belongs to one of two
possible outcomes (usually labeled as 0 or 1). It models the relationship between one or
more independent variables (features) and the probability of the binary outcome using the
logistic function, which transforms the output into a probability between 0 and 1. During
prediction, logistic regression calculates the probability of the positive outcome and then
applies a decision threshold (usually 0.5) to classify observations into the appropriate
class.

Picture 3: Logistic regression's illustration
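The two steps above, mapping a linear score to a probability with the logistic function
and then thresholding at 0.5, can be sketched as follows; the one-feature toy data is
invented for illustration.

import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[1.0], [2.0], [3.0], [8.0], [9.0], [10.0]])  # toy feature values
y = np.array([0, 0, 0, 1, 1, 1])                           # binary outcomes

clf = LogisticRegression().fit(X, y)
proba = clf.predict_proba([[5.5]])[0, 1]   # P(y = 1) from the logistic function
print(proba, int(proba >= 0.5))            # probability, then thresholded class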


2.1.3.3 Support Vector Machine (SVM)

A support vector machine (SVM) is used for classification and regression tasks.
SVM aims to find the optimal hyperplane that separates data points into different classes
while maximizing the margin, which is the distance between the hyperplane and the
nearest data points (support vectors). SVM works by mapping input data into a high-
dimensional feature space and finding the hyperplane that best separates the classes. This
hyperplane is determined by support vectors, which are data points closest to the decision
boundary.

Picture 4: Support Vector Machine's illustration
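A minimal scikit-learn sketch of a linear SVM, again on invented data, shows how the
support vectors that define the maximum-margin hyperplane can be inspected.

import numpy as np
from sklearn.svm import SVC

X = np.array([[1, 2], [2, 3], [3, 3], [6, 5], [7, 8], [8, 8]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel="linear").fit(X, y)
print(clf.support_vectors_)   # points closest to the decision boundary
print(clf.predict([[4, 4]]))  # classify a new point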

2.1.3.4 K-Nearest Neighbors (KNN)

K-nearest neighbors (KNN) makes predictions by identifying the majority class or the
average of the K nearest neighbors in the feature space. The choice of K, a positive
integer, determines the number of neighbors considered when making predictions. For
classification tasks, the predicted class is typically determined by a majority vote among
the K nearest neighbors, while for regression tasks, the predicted value is the average of
the target values of the K nearest neighbors. KNN is non-parametric and instance-based,
meaning it doesn't learn an explicit model but memorizes the training dataset, making it
computationally efficient for small to moderate-sized datasets but potentially slower for
larger ones.
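For example, with K = 3 the predicted class is simply a majority vote among the three
nearest training points, as this invented-data sketch shows.

from sklearn.neighbors import KNeighborsClassifier

X = [[5.0], [7.0], [9.95], [12.95], [14.5], [17.95]]  # a single price-like feature
y = [0, 0, 0, 1, 1, 1]

knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)
print(knn.predict([[11.0]]))  # majority vote of the 3 nearest neighbors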

2.2 Actual data manifestation

2.2.1 Confusion matrix

A confusion matrix is a performance measurement tool used in machine learning to
evaluate the accuracy of a classification model. It visualizes the performance of a
classification algorithm by displaying the counts of true positive, true negative, false
positive, and false negative predictions made by the model on a set of test data. Each row
of the matrix represents the instances in an actual class, while each column represents the
instances in a predicted class.

Picture 5: Confusion matrix's illustration
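Building such a matrix from true and predicted labels takes a single call in
scikit-learn; the labels below are invented for illustration.

from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

print(confusion_matrix(y_true, y_pred))
# For binary labels ordered [0, 1] the layout is:
# [[TN FP]
#  [FN TP]]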


2.2.2 Actual data collecting and preprocessing

By extracting data from the file, our group examines the data's structure, the
number of rows in the file, and the basic condition of the data.

Picture 6: Actual data collection

The data we gathered from…specifies the number of orders at each respective
price in a restaurant in January. We made some edits to the initial data in order to
optimize the functionality of the Orange software, producing a final dataset of 4156 rows
and 8 columns.

We utilize the Feature Statistics widget in Orange to showcase the information
within each column of our dataset and assess its condition. The ‘Missing’ column
registers at 0%, indicating that our data table is fully complete and doesn’t require any
additional modifications.
Picture 7: Actual data's feature statistics

We preprocess the data using the software's preprocessing function in order to
remove any rows with missing values.

Picture 8: Actual data's preprocessing


2.2.3 Dataset illustration

The dataset consists of information collected from a restaurant, comprising 4156
rows and 8 columns, and provides elaborate insights into numerous aspects of the
restaurant's operations. The columns are listed below; a loading sketch follows the list.

 Order_details_id: From 01 to 4156
 Order_id: From 01 to 1845
 Order_date: From 01/01/2023 to 31/01/2023
 Order_time: From 11:15:00 a.m. to 11:00:00 p.m.
 Item_id: From 101 to 132
 Item_name
 Category: American, Italian, Asian and Mexican
 Price: From $5.00 to $19.95
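The same structure can be inspected outside Orange with pandas; the file name
"restaurant_orders.csv" below is a placeholder, while the column names follow the list
above.

import pandas as pd

df = pd.read_csv("restaurant_orders.csv")    # placeholder file name
print(df.shape)                              # expected: (4156, 8)
print(df["Category"].value_counts())         # American, Italian, Asian, Mexican
print(df["Price"].describe())                # prices between $5.00 and $19.95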

3. Data analysis
We have constructed a processing workflow in the Orange software, centered on K-means clustering, as follows:

Picture 9: Orange's workflow



4. Data visualization
4.1 Data distributions

Picture 10: Data distributions' visualization

The visualization shows the correlation between the order_id component and the price of
each product.

It is clear from the chart that the data fall into two groups based on the total number of
order_ids, with each type of product having a distinctive price, scattered across all of the
colors. The first group represents order_ids below 1000, while the second group covers
order_ids above 1000. The highest frequency exceeds 300 orders, while the lowest is
about 30 orders.

Picture 11: Data distributions' visualization of the first group

The first group represents order_ids below 1000.

Overall, orders are scattered across all of the order_ids. The largest number of orders
went to the meal priced at $17.95, while $19.95 was the price customers were most
reluctant to pay. Prices ranging from $11.95 to $15.50 attracted the most concentrated
customer preference.

Specifically, this group contained 2243 orders in total, accounting for 53.97% of the
whole dataset. The $17.95 meal accounted for the largest number of order_ids with 364,
while the $19.95 meal was chosen least, with merely 42. Meals priced from $11.95 to
$15.50 were chosen substantially by customers and formed the most densely concentrated
cluster. The respective numbers of order_ids for the prices of $11.95, $12.95, $13.95,
$14.50, $14.95, and $15.50 are 149, 214, 231, 229, 123, and 108, together accounting for
nearly 50%.

Picture 12: Data distributions' visualization of the second group

The second group represents order_ids above 1000.

Overall, orders are again scattered across all of the order_ids, with the largest number
of orders going to the meal priced at $17.95, while $19.95 remained the price customers
were most reluctant to pay. Prices ranging from $11.95 to $14.50 attracted the most
concentrated customer preference.

Specifically, this group contained 1913 orders in total, accounting for 46.03% of the
whole dataset. The $17.95 meal accounted for the largest number of order_ids with 316,
while the $19.95 meal was chosen least, with merely 34. Meals priced from $11.95 to
$14.50 were chosen substantially by customers and formed the most densely concentrated
cluster.

4.2 Data box plot

Picture 13: Data box plot's visualization

The box plot represents the dispersion of the restaurant's sales for each dish. The
purpose of this analysis is to compare the disparity between ordered dishes according to
each customer's preference, from which our team can propose suitable adjustments.

4.3 Data scatter plot


The data scatter plot aims to help the restaurant owner adjust the menu in order to
attract more customers based on their predilection toward each dish. The average line
divides the scatter plot into two areas: the former contains the dishes with less
preference, while the latter clusters the dishes with more customer response.

Picture 14: Data scatter plot's visualization

The scatter plot visualizes the four clusters of American, Asian, Italian and Mexican food
in relation to their respective prices, ranging from $5.00 to $19.95.

From the scatter plot, our team observes that the price of American food ranged
predominantly from $7.00 to $13.95, with dishes priced between $10.50 and $12.95
selected least often and the $7.00 meal receiving the most orders. As for Asian food,
prices varied across virtually the entire range: most customers were willing to spend
between $5.00 and $14.95, while the prices of $16.50 and $17.95 were less popular.
Italian and Mexican food stood in contrast to each other, with the former priced from
$14.50 to $19.95, while the latter was cheaper, priced from $5.00 to $14.95.
From this analysis, we suggest that the price of American food at $10.50 and $13.95
should be lowered to gain more customers, or its quality enhanced. Asian food was well
received by customers despite price differences. Italian food was quite expensive, so its
price should be decreased. Most noticeably, Mexican dishes were chosen significantly by
customers at a reasonable price, meaning that the owner could raise their prices in order
to earn more profit.

5. Results and prediction


5.1 Test and score

Picture 15: Actual data's test and score

Based on the analysis depicted in the figure, it can be concluded that in this
particular case, kNN is less suitable compared to other classification models.
Additionally, when considering the F1 score, which measures the accuracy of the test, it
is evident that the SVM and Logistic Regression models exhibit higher values. Similarly,
in terms of the AUC (Area under the ROC Curve) metric, the kNN model performs the
poorest.
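Orange's Test and Score widget performs this comparison internally; an equivalent
cross-validation sketch in scikit-learn, with synthetic stand-in data instead of our
dataset, might look as follows.

from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=300, random_state=0)  # stand-in data

for name, model in [("Logistic Regression", LogisticRegression(max_iter=1000)),
                    ("SVM", SVC()),
                    ("kNN", KNeighborsClassifier())]:
    auc = cross_val_score(model, X, y, cv=10, scoring="roc_auc").mean()
    f1 = cross_val_score(model, X, y, cv=10, scoring="f1").mean()
    print(f"{name}: AUC={auc:.3f}, F1={f1:.3f}")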
We employed the Confusion Matrix to analyze the Logistic Regression, aiming to
assess the number of mispredictions.

5.2 Confusion matrix

Picture 16: Actual data's confusion matrix

The confusion matrix shows 34.4% for the Logistic Regression model; we
may choose this model for the final prediction. This is a part of the predictions using the
SVM model with the data we analyzed. We randomly selected 100 customers (10% of the
dataset) for predictions.

5.3 Predictions
Picture 17: Actual data’s predictions

5.4 ROC analysis
