Article · January 2024

Machine Learning-Based Price Prediction for Laptops

Author: AKKOUH Lokmane (email: lokmane.akkouh@etu.uae.ac.ma)
Supervisor: P. KHAMJANE Aziz (email: akhamjane@uae.ac.ma)

Department of Mathematics & Computer Science, Abdelmalek Essaadi Univ., ENSAH College, Alhoceima, Morocco

Abstract: In the rapidly evolving digital age, the demand for laptops has surged, making their pricing a significant concern for consumers worldwide. This study addresses this issue by developing a prediction model for laptop prices based on various laptop features. The model employs machine learning techniques, with a focus on regression algorithms, to predict laptop prices accurately and efficiently. The data used for this model includes multiple variables such as processor type, RAM, storage capacity, and brand, among other features. Preliminary results suggest that the model provides reasonably accurate predictions for reference purposes, offering consumers a valuable tool for making informed purchasing decisions. However, further improvements and testing are necessary for commercial use. This study not only contributes to the field of machine learning applications in e-commerce but also provides a foundation for enhancing e-commerce platforms' user experience.

1. Introduction

The rapid advancement of technology has led to an increase in the variety and complexity of products available in the market. One such product category is laptops, which come in numerous configurations and price points. Predicting the price of a laptop based on its specifications is a challenging task due to the multitude of factors that can influence the price. However, pricing is a crucial aspect for both sellers, who need to price their products competitively, and buyers, who are looking for the best value for their money. Machine Learning (ML) offers promising solutions to this problem. ML algorithms can learn from historical data and make predictions on unseen data. In the context of laptop price prediction, ML models can be trained on a dataset of laptop specifications and their corresponding prices. The trained model can then predict the price of a laptop given its specifications.

In recent years, several studies have been conducted in this direction. Here are three notable ones:

Article [1] provides a practical understanding of the machine learning project lifecycle through a laptop price prediction project. The article discusses the various steps involved in building a machine learning project, including data cleaning, exploratory data analysis, feature engineering, and machine learning modeling.

Article [2] discusses the application of machine learning techniques to predict laptop prices. The article emphasizes the importance of understanding the data and performing data cleaning to get the correct column types.

Article [3] describes a supervised machine learning-based laptop price prediction system. The study used multiple linear regression and achieved 81% prediction precision.

These articles provide valuable insights into the application of machine learning for laptop price prediction. They highlight the importance of understanding the data, performing appropriate data cleaning and feature engineering, and choosing the right machine learning model for the task.

1.1. Objective

The primary objective of this project is to predict laptop prices based on various laptop features using machine learning regression models. This involves creating a model that can accurately determine the price of a laptop given its specifications, such as processor type, RAM, storage capacity, brand, and other relevant features. By doing so, the project aims to provide valuable insights for both sellers and buyers in the laptop market. For sellers, it can help in pricing their products competitively, while for buyers, it can assist in finding the best value for their money.
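The regression setup described above can be sketched in a few lines with scikit-learn; the feature values and prices below are toy numbers for illustration, not the project's data:

```python
# Minimal illustration of the idea: learn price from specifications.
# Toy data only; the real project uses scraped and Kaggle records.
from sklearn.linear_model import LinearRegression

# Each row: [RAM (GB), SSD (GB), CPU generation]; target: price (illustrative MAD).
X = [[8, 256, 10], [16, 512, 11], [4, 128, 8], [32, 512, 12]]
y = [6000, 9500, 3500, 15000]

model = LinearRegression().fit(X, y)
predicted = model.predict([[16, 256, 10]])  # estimate for an unseen laptop
print(round(predicted[0], 2))
```

Once trained on real records, the same `fit`/`predict` interface serves both sellers (pricing a new listing) and buyers (sanity-checking an asking price).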
Fig 1. Visual guide of the problem-solving workflow

2. Methodology

2.1. Data Collection and Description

The data for this project was collected from two primary sources. The first source was a Moroccan e-commerce platform, from which data was scraped on the 17th of December, 2023. This process yielded a total of 630 records, each containing detailed information about different laptop models.

However, to ensure a robust analysis, it was necessary to have a larger dataset. Therefore, an additional dataset was incorporated from Kaggle, a renowned open-source data platform. This supplementary data provided a more comprehensive set of records, enhancing the diversity and representativeness of the dataset.

2.1.1. Table 1: Laptop Specifications and Prices from the Moroccan E-commerce Platform

MARK           — The brand or manufacturer of the laptop
RAM            — The size of the laptop's random access memory
typeRam        — The type of RAM used in the laptop (DDR3/DDR4)
CPU_Brand      — The brand of the laptop's central processing unit (CPU)
CPU_Modifier   — The specific model of the CPU
CPU_Generation — The generation of the CPU
GPU            — The laptop's graphics processing unit
STOCKAGEHDD    — The size of the laptop's hard disk drive storage
STOCKAGESSD    — The size of the laptop's solid-state drive storage
TypeStockage   — The type of storage used in the laptop
Stockage       — The total storage capacity of the laptop
COULEUR        — The color of the laptop
POIDS          — The weight of the laptop
ECRAN          — The size of the laptop's screen
PRIX           — The price of the laptop
RATE           — The rating of the laptop

2.1.2. Table 2: Laptop Specifications and Prices from the Kaggle Dataset

Company          — The brand or manufacturer of the laptop
Cpu              — The laptop's central processing unit
Ram              — The size of the laptop's random access memory
Memory           — The total storage capacity of the laptop
Gpu              — The laptop's graphics processing unit
OpSys            — The operating system of the laptop
Weight           — The weight of the laptop
TypeName         — The type of the laptop
ScreenResolution — The resolution of the laptop's screen
Inches           — The size of the laptop's screen
Price            — The price of the laptop
2.2. Data Preprocessing and Exploratory Data Analysis (EDA)

The data preprocessing stage began with the first dataset, which was obtained from a Moroccan e-commerce platform. The initial step was to remove duplicate entries to ensure the uniqueness of each record. After consultation with experts, certain columns such as 'color', 'typeRam', and 'rate' were deemed irrelevant to the project's objective and were subsequently removed.

The remaining columns underwent further preprocessing. Non-digit characters were removed from the 'price' and 'RAM' columns. The 'i' character was deleted from the 'CPU_Modifier' column. The 'ECRAN' column, which represents the screen size, had non-digit characters removed, and its values were converted to inches. Any missing values in this column were filled using the mean of the existing values.

The second phase of preprocessing involved the dataset obtained from Kaggle. Initially, columns that were not compatible with the first dataset, such as 'opsys', 'screenresolution', and 'typeName', were removed. To align with the first dataset, the 'Memory' column was divided into three new columns: 'STOCKAGESSD', 'STOCKAGEHDD', and 'STOCKAGEFlash'. Non-digit characters were removed from these new columns as well as from the 'Weight' column. Finally, the 'Price' column, which was originally in rupees, was converted to Moroccan dirhams to maintain consistency with the first dataset.

For the 'GPU' column in both datasets, the names of GPUs were cleaned and grouped under as few types as possible, reducing them from 115 to 20. This step was crucial to simplify the analysis and improve the interpretability of the results.

This rigorous preprocessing stage ensured that the datasets were clean, consistent, and ready for the subsequent exploratory data analysis and modeling stages.

2.3. Data Encoding and Outlier Handling

After the initial data cleaning, the next step was to encode the categorical variables. For this project, the 'mark' and 'GPU' columns were identified as categorical variables. These were transformed using one-hot encoding, a popular method for handling categorical data. This process involved creating a new boolean column for each unique category in the original column.
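A minimal sketch of the cleaning and currency-conversion steps above; the raw strings are invented examples, and the rupee-to-dirham rate is illustrative only, since the paper does not state the exact rate used:

```python
# Sketch of the non-digit stripping and currency conversion described in 2.2.
# Raw values and the conversion rate below are assumptions for illustration.
import pandas as pd

df = pd.DataFrame({"PRIX": ["5 999 DH", "7499DH"], "RAM": ["8 Go", "16 Go"]})

# Remove non-digit characters, then cast to numeric types.
for col in ["PRIX", "RAM"]:
    df[col] = df[col].str.replace(r"\D", "", regex=True).astype(int)

INR_TO_MAD = 0.12  # illustrative rate, not the one used in the paper
kaggle_prices = pd.Series([50000, 75000])          # Kaggle prices in rupees
converted = (kaggle_prices * INR_TO_MAD).round(2)  # prices in dirhams
print(df["PRIX"].tolist(), converted.tolist())
```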
Fig 2. GPU values as one-hot columns
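The one-hot transformation behind Fig 2 can be reproduced with pandas; the brand and GPU values here are examples, not the dataset's actual categories:

```python
# One boolean column per unique category, as described in 2.3.
# Example categories only; the real data has 20 grouped GPU types.
import pandas as pd

df = pd.DataFrame({"MARK": ["hp", "dell", "hp"], "GPU": ["intel", "nvidia", "amd"]})

encoded = pd.get_dummies(df, columns=["MARK", "GPU"])
print(sorted(encoded.columns))
```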
Following the encoding, the focus shifted to handling outliers in the dataset. As observed in the data visualization, some records had weights exceeding 10 kg, which is unusual for laptops. These outliers were removed manually, without the need for more complex outlier detection methods such as Z-score or IQR. This comprehensive preprocessing ensured that the dataset was not only clean and consistent but also suitable for the application of machine learning algorithms.

Fig 3. Graph presenting the 'poids' and 'ecran' values

In terms of storage, it was noted from experience and expert consultation that modern laptops typically have storage capacities of at least 100 GB. Therefore, any values less than 100 in the storage columns were replaced with 100.

Lastly, missing values in the 'CPU_Modifier' and 'CPU_Generation' columns were filled using the mode of the respective columns. The mode, being the most frequently occurring value, is a reasonable estimate for missing data in these cases.

Fig 4. Graph presenting the 'poids' and 'ecran' values before and after outlier removal

2.4. Correlation Analysis

Correlation analysis is a statistical method used to evaluate the strength of the relationship between two quantitative variables. A high correlation means that two or more variables have a strong relationship with each other, while a low correlation means that the variables are hardly related.

The correlation coefficient is calculated using the following formula:

r = Σ_{i=1}^{n} (x_i − x̄)(y_i − ȳ) / √( Σ_{i=1}^{n} (x_i − x̄)² · Σ_{i=1}^{n} (y_i − ȳ)² )

Where:
• r is the correlation coefficient.
• x_i and y_i are the individual sample points indexed with i.
• x̄ and ȳ are the means of x and y, respectively.
• n is the total number of samples.

In our dataset, we calculated the correlation between each attribute and the target variable 'PRIX'. We found that:
• 'ECRAN' (the size of the laptop's screen) has a very low correlation with the price. This suggests that screen size does not significantly influence the price of the laptop.
• 'ryzen' (a specific model of CPU) has no correlation with the price. This is because all the values in this column are null.

Based on these findings, we decided to drop these columns from our dataset. Despite some columns showing small correlations, we decided to keep them due to the limited availability of data. These columns might still provide valuable insights when combined with other attributes.

3. Model Selection and Evaluation

In this section, we performed model selection for predicting laptop prices. We split the data into training and testing sets with a test size of 0.2. We then used GridSearchCV to find the best parameters for several models: Linear Regression, ElasticNet, Decision Tree, XGBRegressor, GradientBoostingRegressor, AdaBoostRegressor, and Lasso. For each model, we calculated the R-squared and Mean Squared Error (MSE) to evaluate their performance.

3.1. Linear Regression

Linear Regression is a simple yet powerful model that assumes a linear relationship between the input variables (x) and the single output variable (y). More specifically, it assumes that y can be calculated from a linear combination of the input variables (x). The mathematical representation is:

y = b_0 + b_1 * x_1 + b_2 * x_2 + ... + b_n * x_n + e

Where:
• y is the dependent variable.
• b_0 is the y-intercept.
• b_1 to b_n are the coefficients of the independent variables.
• x_1 to x_n are the independent variables.
• e is the error term.

The R-squared score for the training set was 0.76, and the MSE was 0.10. For the testing set, the R-squared score was 0.73, and the MSE was 0.11.

3.2. ElasticNet

ElasticNet is a linear regression model trained with both l1- and l2-norm regularization of the coefficients. This combination allows for learning a sparse model where few of the weights are non-zero, like Lasso, while still maintaining the regularization properties of Ridge.

The R-squared score for the training set was 0.99, and the MSE was 0.00. For the testing set, the R-squared score was 0.73, and the MSE was 0.11.

3.3. Decision Tree

A Decision Tree is a flowchart-like structure in which each internal node represents a feature (or attribute), each
branch represents a decision rule, and each leaf node represents an outcome. The topmost node in a decision tree is known as the root node. The tree learns to partition the data based on attribute values, and it partitions recursively in a manner called recursive partitioning.

The R-squared score for the training set was 0.44, and the MSE was 0.23. For the testing set, the R-squared score was 0.44, and the MSE was 0.22.

3.4. XGBRegressor

XGBoost is an optimized distributed gradient boosting library designed to be highly efficient, flexible, and portable. It implements machine learning algorithms under the Gradient Boosting framework. XGBoost provides parallel tree boosting (also known as GBDT or GBM) that solves many data science problems in a fast and accurate way.

The R-squared score for the training set was 0.98, and the MSE was 0.01. For the testing set, the R-squared score was 0.85, and the MSE was 0.06.

3.5. GradientBoostingRegressor

Gradient Boosting builds an additive model in a forward stage-wise fashion; it allows for the optimization of arbitrary differentiable loss functions. In each stage, a regression tree is fit on the negative gradient of the given loss function.

The R-squared score for the training set was 0.88, and the MSE was 0.05. For the testing set, the R-squared score was 0.84, and the MSE was 0.06.

3.6. AdaBoostRegressor

The core principle of AdaBoost is to fit a sequence of weak learners (i.e., models that are only slightly better than random guessing, such as small decision trees) on repeatedly modified versions of the data. The predictions from all of them are then combined through a weighted majority vote (or sum) to produce the final prediction.

The R-squared score for the training set was 0.72, and the MSE was 0.12. For the testing set, the R-squared score was 0.69, and the MSE was 0.12.

3.7. Lasso

Lasso (Least Absolute Shrinkage and Selection Operator) is a regression analysis method that performs both variable selection and regularization in order to enhance the prediction accuracy and interpretability of the statistical model it produces.

The R-squared score for the training set was 0.41, and the MSE was 0.24. For the testing set, the R-squared score was 0.41, and the MSE was 0.24.
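The selection procedure described in this section (an 80/20 split, a GridSearchCV per model, then R-squared and MSE on the held-out set) can be sketched as follows; the synthetic data, the subset of models, and the small parameter grids are stand-ins for the real dataset and the grids actually searched:

```python
# Hedged sketch of the model-selection loop from Section 3.
# Synthetic data and tiny grids replace the paper's dataset and search space.
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LinearRegression, Lasso
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import r2_score, mean_squared_error

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(200, 4))  # stand-in for the encoded specifications
y = X @ np.array([3.0, 1.5, 0.5, 2.0]) + rng.normal(0, 0.1, 200)

# Test size of 0.2, as in the study.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

candidates = {
    "LinearRegression": (LinearRegression(), {}),
    "Lasso": (Lasso(), {"alpha": [0.001, 0.01, 0.1]}),
    "GradientBoosting": (GradientBoostingRegressor(random_state=0),
                         {"n_estimators": [50, 100]}),
}

results = {}
for name, (model, grid) in candidates.items():
    search = GridSearchCV(model, grid, cv=3).fit(X_tr, y_tr)
    pred = search.predict(X_te)
    results[name] = (r2_score(y_te, pred), mean_squared_error(y_te, pred))
    print(name, results[name])
```

Each entry in `results` mirrors the per-model R-squared/MSE pairs reported above, making the comparison across candidates direct.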
4. Conclusion

In conclusion, our study on predicting laptop prices using various machine learning models yielded interesting results. Despite the limited amount of data, the models were able to capture key relationships between laptop specifications and their prices. The model with the highest R-squared score and the lowest Mean Squared Error (MSE) on the testing set was the GradientBoostingRegressor.

R2: [chart comparing the models' R-squared scores]

MSE: [chart comparing the models' MSE values]

However, it's important to note that the quantity of data plays a crucial role in the performance of machine learning models. With the current dataset, we were able to achieve a certain level of accuracy. But as we gather more data in the future, we anticipate that the performance of our models will improve, potentially leading to different results. This makes the field of machine learning dynamic and exciting, as models continue to evolve with more data.

Furthermore, the deployment of these models in a real-world setting, such as a platform for predicting laptop prices, could provide valuable insights for both sellers and buyers. It could help sellers price their laptops more competitively and help buyers make more informed decisions.

In the future, we plan to deploy this model, along with other models, on a platform. This would allow us to leverage the models' predictive capabilities in a practical, user-friendly manner. We also aim to continually update and refine our models as more data becomes available.

This study underscores the potential of machine learning in making sense of complex datasets and providing valuable predictions. As we continue to gather more data and refine our models, we look forward to uncovering more insights and improving the accuracy of our predictions.

References

[1] "Laptop Price Prediction in Machine Learning", Analytics Vidhya.
[2] "Laptop Price Prediction by Machine Learning", Medium.
[3] "Forecasting Laptop Prices: A Comparative Study of Machine Learning".
