CHAPTER 2

2. Application method

2.1 Theoretical basis

With the topic “ ”, our objective is to apply the data mining methods we have
learned to address a real-world business problem. Specifically, given the Food &
Beverages dataset, we want to understand when a particular type of dish is ordered
and at what price.

In order to capture this complexity and collect more insights, our group uses
the Orange software to perform data selection, processing, and clustering, and to make
predictions.

2.1.1 Data collecting and preprocessing

Data collection involves systematically gathering raw data from unstructured
sources in order to form a structured dataset, while data preprocessing is the process of
cleaning, transforming, and organizing the collected data to make it suitable for analysis.
Both data collection and preprocessing are essential steps in extracting valuable insights
from raw data, ensuring its quality and usability for subsequent analysis tasks, as
illustrated below by the four components of data preprocessing.

Picture 1: Data preprocessing’s components
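As an illustration of these components, a minimal preprocessing sketch in Python with
pandas might look as follows; the file name "orders.csv" and the column "Price" are
placeholders for illustration only, not our actual dataset.

import pandas as pd

# Hypothetical file and column names, used only to illustrate the steps.
df = pd.read_csv("orders.csv")            # data collection: load the raw file
df = df.drop_duplicates()                 # cleaning: remove duplicate rows
df = df.dropna()                          # cleaning: drop rows with missing values
df["Price"] = df["Price"].astype(float)   # transformation: enforce a numeric type
print(df.describe())                      # quick check of the resulting structure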


2.1.2 Clustering

2.1.2.1 Data clustering

Data clustering is a technique used in unsupervised machine learning to group
similar data points together based on certain characteristics or features. The goal of
clustering is to identify inherent patterns or structures within a dataset without the need
for predefined labels. By partitioning the data into distinct clusters, where data points
within the same cluster are more similar to each other than to those in other clusters,
clustering enables data exploration, pattern recognition, and segmentation.

A good clustering method creates high-quality clusters:

• High similarity within clusters

• Low similarity between clusters (high dissimilarity); both criteria can be quantified together, as the silhouette sketch after this list shows

Typical applications include:

• Standalone data clustering tools.

• A preliminary preprocessing stage for other algorithms.
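The silhouette score measures both quality criteria at once: values near 1 indicate
tight, well-separated clusters. Below is a minimal Python sketch using scikit-learn on
synthetic data; it is an illustration, not part of our actual workflow.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Two well-separated synthetic blobs of 50 points each
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(silhouette_score(X, labels))  # close to 1.0 for well-separated clusters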

2.1.2.2 K-means

K-means is a popular unsupervised machine learning algorithm used for clustering
data points into distinct groups based on their similarities. The algorithm iteratively
partitions a dataset into K clusters by minimizing the sum of squared distances between
data points and their respective cluster centroids. Initially, K centroids are randomly
selected from the data points, and each data point is assigned to the nearest centroid.
Then, the centroids are recalculated as the mean of all data points assigned to each
cluster. This process iterates until convergence, when the centroids no longer change
significantly or a predefined number of iterations is reached.

Picture 2: K-means illustration
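To make the iteration concrete, here is a minimal NumPy sketch of the steps described
above (random initialization, assignment, recomputation, convergence check). It is a
simplified illustration, not Orange's implementation.

import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: randomly select K initial centroids from the data points
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 2: assign each point to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its assigned points
        new_centroids = np.array([X[labels == j].mean(axis=0)
                                  if np.any(labels == j) else centroids[j]
                                  for j in range(k)])
        # Step 4: stop when the centroids no longer change significantly
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids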

2.1.2.3 Interactive K-means


Interactive K-means is a variation of the traditional K-means algorithm that
incorporates user interaction into the clustering process. Unlike standard K-means, where
the number of clusters (K) is predetermined and centroids are initialized randomly,
interactive K-means allows users to provide feedback during the clustering process. This
feedback can include adjusting the number of clusters, specifying initial centroids, or
guiding the algorithm's iterations by manually moving centroids or reassigning data
points.
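One form of such feedback, supplying user-chosen starting centroids instead of random
ones, can be sketched with scikit-learn; the data points below are invented for
illustration.

import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 1], [1.2, 0.9], [5, 5], [5.1, 4.8], [9, 1], [8.8, 1.2]])
user_centroids = np.array([[1, 1], [5, 5], [9, 1]])  # user-supplied initial centroids

km = KMeans(n_clusters=3, init=user_centroids, n_init=1).fit(X)
print(km.labels_)  # cluster assignment of each point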

2.1.3 Classification

2.1.3.1 Data classification

Data classification is a fundamental task in data mining and machine learning,
involving the categorization of data points into predefined classes or categories based on
their features or attributes. The goal of classification is to develop a predictive model that
can accurately assign new, unseen data instances to the appropriate class labels based on
patterns learned from the training data. In data mining, classification algorithms analyze
historical data to identify patterns and relationships that can be used to classify future
observations.
For example, in email spam detection, the task is to classify incoming emails as
either "spam" or "not spam" based on their content and characteristics. By analyzing a
large dataset of labeled emails (spam or non-spam), a classification algorithm learns
patterns that distinguish between the two classes, such as specific keywords, sender
information, or email formatting. Another example is in medical diagnosis, where
classification algorithms can be used to predict the presence or absence of a disease based
on patient symptoms, test results, and demographic information.
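The spam example can be sketched in a few lines of Python; the tiny corpus and labels
below are invented purely to illustrate the idea, not real training data.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

emails = ["win a free prize now", "meeting at noon tomorrow",
          "free money click here", "project report attached"]
labels = [1, 0, 1, 0]  # 1 = spam, 0 = not spam

vec = CountVectorizer()  # turn words into count features
model = MultinomialNB().fit(vec.fit_transform(emails), labels)
print(model.predict(vec.transform(["free prize money"])))  # -> [1], i.e. spam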

Data classification plays a crucial role in extracting actionable insights from data,
enabling automated decision-making, and facilitating tasks such as spam filtering, fraud
detection, sentiment analysis, and more.

2.1.3.2 Logistic Regression

Logistic regression is a statistical method used for binary classification tasks,
where the goal is to predict the probability that an observation belongs to one of two
possible outcomes (usually labeled as 0 or 1). It models the relationship between one or
more independent variables (features) and the probability of the binary outcome using the
logistic function, which transforms the output into a probability between 0 and 1. During
prediction, logistic regression calculates the probability of the positive outcome and then
applies a decision threshold (usually 0.5) to classify observations into the appropriate
class.

Picture 3: Logistic regression's illustration
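The two steps above, mapping a linear score to a probability with the logistic function
and then thresholding at 0.5, can be sketched as follows; the one-feature toy data is
invented for illustration.

import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[1.0], [2.0], [3.0], [8.0], [9.0], [10.0]])  # toy feature values
y = np.array([0, 0, 0, 1, 1, 1])                           # binary outcomes

clf = LogisticRegression().fit(X, y)
proba = clf.predict_proba([[5.5]])[0, 1]   # P(y = 1) from the logistic function
print(proba, int(proba >= 0.5))            # probability, then thresholded class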


2.1.3.3 Support Vector Machine (SVM)

A support vector machine (SVM) is used for classification and regression tasks.
SVM aims to find the optimal hyperplane that separates data points into different classes
while maximizing the margin, which is the distance between the hyperplane and the
nearest data points (support vectors). SVM works by mapping input data into a high-
dimensional feature space and finding the hyperplane that best separates the classes. This
hyperplane is determined by support vectors, which are data points closest to the decision
boundary.

Picture 4: Support Vector Machine's illustration
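A minimal scikit-learn sketch of a linear SVM, again on invented data, shows how the
support vectors that define the maximum-margin hyperplane can be inspected.

import numpy as np
from sklearn.svm import SVC

X = np.array([[1, 2], [2, 3], [3, 3], [6, 5], [7, 8], [8, 8]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel="linear").fit(X, y)
print(clf.support_vectors_)   # points closest to the decision boundary
print(clf.predict([[4, 4]]))  # classify a new point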

2.1.3.4 K-Nearest Neighbors (KNN)

K-nearest neighbors (KNN) makes predictions by identifying the majority class or the
average of the K nearest neighbors in the feature space. The choice of K, a positive
integer, determines the number of neighbors considered when making predictions. For
classification tasks, the predicted class is typically determined by a majority vote among
the K nearest neighbors, while for regression tasks, the predicted value is the average of
the target values of the K nearest neighbors. KNN is non-parametric and instance-based,
meaning it doesn't learn an explicit model but memorizes the training dataset, making it
computationally efficient for small to moderate-sized datasets but potentially slower for
larger ones.
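For example, with K = 3 the predicted class is simply a majority vote among the three
nearest training points, as this invented-data sketch shows.

from sklearn.neighbors import KNeighborsClassifier

X = [[5.0], [7.0], [9.95], [12.95], [14.5], [17.95]]  # a single price-like feature
y = [0, 0, 0, 1, 1, 1]

knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)
print(knn.predict([[11.0]]))  # majority vote of the 3 nearest neighbors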

2.2 Actual data manifestation

2.2.1 Confusion matrix

A confusion matrix is a performance measurement tool used in machine learning to
evaluate the accuracy of a classification model. It visualizes the performance of a
classification algorithm by displaying the counts of true positive, true negative, false
positive, and false negative predictions made by the model on a set of test data. Each row
of the matrix represents the instances in an actual class, while each column represents the
instances in a predicted class.

Picture 5: Confusion matrix's illustration
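Building such a matrix from true and predicted labels takes a single call in
scikit-learn; the labels below are invented for illustration.

from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

print(confusion_matrix(y_true, y_pred))
# For binary labels ordered [0, 1] the layout is:
# [[TN FP]
#  [FN TP]]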


2.2.2 Actual data collecting and preprocessing

By extracting data from the file, our group examines the data's structure, the
number of rows in the file, and the basic condition of the data.

Picture 6: Actual data collection

The data we gathered from…specifies the number of orders at each respective
price in a restaurant in January. We made some edits to the initial data in order to
optimize the functionality of the Orange software, producing a final dataset of 4156 rows
and 8 columns.

We utilize the Feature Statistics widget in Orange to showcase the information
within each column of our dataset and assess its condition. The ‘Missing’ column
registers at 0%, indicating that our data table is fully complete and doesn’t require any
additional modifications.
Picture 7: Actual data's feature statistics

We preprocess the data using the software's preprocessing function in order to
remove any rows with missing values.

Picture 8: Actual data's preprocessing


2.2.3 Dataset illustration

The dataset consists of information collected from a restaurant, comprising 4156
rows and 8 columns, and provides elaborate insights into numerous aspects of the
restaurant's operations. The columns are listed below; a loading sketch follows the list.

 Order_details_id: From 01 to 4156
 Order_id: From 01 to 1845
 Order_date: From 01/01/2023 to 31/01/2023
 Order_time: From 11:15:00 a.m. to 11:00:00 p.m.
 Item_id: From 101 to 132
 Item_name
 Category: American, Italian, Asian and Mexican
 Price: From $5.00 to $19.95
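The same structure can be inspected outside Orange with pandas; the file name
"restaurant_orders.csv" below is a placeholder, while the column names follow the list
above.

import pandas as pd

df = pd.read_csv("restaurant_orders.csv")    # placeholder file name
print(df.shape)                              # expected: (4156, 8)
print(df["Category"].value_counts())         # American, Italian, Asian, Mexican
print(df["Price"].describe())                # prices between $5.00 and $19.95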

3. Data analysis
We have constructed a processing workflow in the Orange software, centered on K-means clustering, as follows:

Picture 9: Orange's workflow



4. Data visualization
4.1 Data distributions

Picture 10: Data distributions' visualization

The visualization shows the correlation between the order_id component and the price of
each product.

It is clear from the chart that the data fall into two groups based on the total number of
order_ids, with each type of product having a distinctive price, scattered across all of the
colors. The first group represents order_ids below 1000, while the second group covers
order_ids above 1000. The highest frequency exceeds 300 orders, while the lowest is
about 30 orders.

Picture 11: Data distributions' visualization of the first group

The first group represents order_ids below 1000.

Overall, orders are scattered across all of the order_ids. The largest number of orders
went to the meal priced at $17.95, while $19.95 was the price customers were most
reluctant to pay. Prices ranging from $11.95 to $15.50 attracted the most concentrated
customer preference.

Specifically, this group contained 2243 orders in total, accounting for 53.97% of the
whole dataset. The $17.95 meal accounted for the largest number of order_ids with 364,
while the $19.95 meal was chosen least, with merely 42. Meals priced from $11.95 to
$15.50 were chosen substantially by customers and formed the most densely concentrated
cluster. The respective numbers of order_ids for the prices of $11.95, $12.95, $13.95,
$14.50, $14.95, and $15.50 are 149, 214, 231, 229, 123, and 108, together accounting for
nearly 50%.

Picture 12: Data distributions' visualization of the second group

The second group represents order_ids above 1000.

Overall, orders are again scattered across all of the order_ids, with the largest number
of orders going to the meal priced at $17.95, while $19.95 remained the price customers
were most reluctant to pay. Prices ranging from $11.95 to $14.50 attracted the most
concentrated customer preference.

Specifically, this group contained 1913 orders in total, accounting for 46.03% of the
whole dataset. The $17.95 meal accounted for the largest number of order_ids with 316,
while the $19.95 meal was chosen least, with merely 34. Meals priced from $11.95 to
$14.50 were chosen substantially by customers and formed the most densely concentrated
cluster.

4.2 Data box plot

Picture 13: Data box plot's visualization

The box plot represents the dispersion of the restaurant's sales for each dish. The
purpose of this analysis is to compare the disparity between ordered dishes according to
each customer's preference, from which our team can propose suitable adjustments.

4.3 Data scatter plot


The data scatter plot aims to help the restaurant owner adjust the menu in order to
attract more customers based on their predilection toward each dish. The average line
divides the scatter plot into two areas: the former contains the dishes with less
preference, while the latter clusters the dishes with more customer response.

Picture 14: Data scatter plot's visualization

The scatter plot visualizes the four clusters of American, Asian, Italian and Mexican food
in relation to their respective prices, ranging from $5.00 to $19.95.

From the scatter plot, our team observes that the price of American food ranged
predominantly from $7.00 to $13.95, with dishes priced between $10.50 and $12.95
selected least often and the $7.00 meal receiving the most orders. As for Asian food,
prices varied across virtually the entire range: most customers were willing to spend
between $5.00 and $14.95, while the prices of $16.50 and $17.95 were less popular.
Italian and Mexican food stood in contrast to each other, with the former priced from
$14.50 to $19.95, while the latter was cheaper, priced from $5.00 to $14.95.
From this analysis, we suggest that the price of American food at $10.50 and $13.95
should be lowered to gain more customers, or its quality enhanced. Asian food was well
received by customers despite price differences. Italian food was quite expensive, so its
price should be decreased. Most noticeably, Mexican dishes were chosen significantly by
customers at a reasonable price, meaning that the owner could raise their prices in order
to earn more profit.

5. Results and prediction


5.1 Test and score

Picture 15: Actual data's test and score

Based on the analysis depicted in the figure, it can be concluded that in this
particular case, kNN is less suitable compared to other classification models.
Additionally, when considering the F1 score, which measures the accuracy of the test, it
is evident that the SVM and Logistic Regression models exhibit higher values. Similarly,
in terms of the AUC (Area under the ROC Curve) metric, the kNN model performs the
poorest.
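Orange's Test and Score widget performs this comparison internally; an equivalent
cross-validation sketch in scikit-learn, with synthetic stand-in data instead of our
dataset, might look as follows.

from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=300, random_state=0)  # stand-in data

for name, model in [("Logistic Regression", LogisticRegression(max_iter=1000)),
                    ("SVM", SVC()),
                    ("kNN", KNeighborsClassifier())]:
    auc = cross_val_score(model, X, y, cv=10, scoring="roc_auc").mean()
    f1 = cross_val_score(model, X, y, cv=10, scoring="f1").mean()
    print(f"{name}: AUC={auc:.3f}, F1={f1:.3f}")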
We employed the Confusion Matrix to analyze the Logistic Regression, aiming to
assess the number of mispredictions.

5.2 Confusion matrix

Picture 16: Actual data's confusion matrix

The confusion matrix shows 34.4% for the Logistic Regression model; we
may choose this model for the final prediction. This is a part of the predictions using the
SVM model with the data we analyzed. We randomly selected 100 customers (10% of the
dataset) for predictions.

5.3 Predictions
Picture 17: Actual data’s predictions

5.4 ROC analysis
