USED CAR PRICE PREDICTOR

Major Project Report

DEPARTMENT OF COMPUTER SCIENCE AND ENGG.


RAJKIYA ENGINEERING COLLEGE, KANNAUJ

PROJECT GUIDE:
Dr. B.D.K PATRO
Associate Professor
Dept. of Computer Science and Engineering
Rajkiya Engineering College, Kannauj

PROJECT MEMBERS:
AJAY YADAV (1908390100007)
SAURABH KUMAR (1908390100056)

TABLE OF CONTENTS

1. Introduction
a) Motivation & Objective
b) Previous Work
c) Brief Description about work
2. Methodology
3. Dataset
4. Experiment
5. Result
6. Conclusion & Future Work

INTRODUCTION

MOTIVATION-

A significant part of the overall automotive market is derived from the used car
trade. Correctly determining used car market values helps achieve fairer trade
in many economies. In such markets, the accuracy of second-hand car price
evaluation largely determines whether the seller and the buyer get an efficient
trading experience. A price evaluation model based on big data analysis is
proposed, which takes advantage of widely circulated vehicle data and a large
number of vehicle transaction records to analyze the price data for each type
of vehicle using an optimized neural network algorithm. It aims to establish a
second-hand car price evaluation model that gives the price that best matches
the car.

OBJECTIVE -
The automotive industry is an essential part of the world economy, and it is
common for used car sales to surpass new car sales across the globe. The price
of a new car is fixed by the manufacturer, with some additional costs imposed by
the government in the form of taxes, so customers buying a new car can be
assured that the money they invest is worthwhile. However, because new car
prices keep rising and many buyers lack the funds to afford them, used car
sales are increasing globally. The main goal of this project is to build a model
for predicting the price of used cars by applying machine learning techniques
(Artificial Neural Network, Support Vector Machine, Random Forest). A car's
price is commonly determined based on its specifications, technology, and
performance. However, automotive manufacturers today frequently release a car
product or series with the latest specifications, and the price is adjusted to
those specifications. In the proposed system, when a user uploads a used car
together with its detailed report, a query is generated on the client side and
sent to the server side. An algorithm is run, and the generated output is sent
back to the client side to guide the user.

PREVIOUS WORK
We analyze the event time (time to delist) via survival analysis. Asking price is
certainly a factor in determining the time to delist. A car may be sold at a
certain price level for a given listing. It is worth noting that the owners of
the listings on an e-commerce site (i.e. sellers) may update the asking price at
any time. In other words, we can claim that a listing will stay on the
e-commerce site until the asking price is right for a potential buyer, or until
it is reduced to the willingness-to-pay level of a potential buyer. As an
extension, once the data from web listings are collected, we can develop a
decision support tool to analyze pricing data and build predictive models of the
sale event. This work also introduces a web-based decision support tool to help
buyers and sellers assess their cars' market values (asking prices). This tool
can also help determine the likelihood of selling an advertised car within a
predetermined time period (thirty days in our case) given the asking price and
the specifics of the car.

BRIEF DESCRIPTION ABOUT THE WORK –
In this proposed system, the main goal is to predict the price of used cars. When a user
uploads a used car together with its detailed report, a query is generated on the client side
and sent to the server side. An algorithm is run, and the generated output is sent back to the
client side to guide the user.

Several related works have been done previously on the subject of used car price prediction.
One study predicted the price of used cars in Mauritius using multiple linear regression,
k-nearest neighbors, naive Bayes and decision trees, although its results were not good for
prediction because of the small number of car observations. It also concluded that decision
trees and naive Bayes cannot be used for variables with continuous values. Another study used
multiple linear regression to predict vehicle prices. The authors performed variable selection
to find the most influential variables and eliminated the rest, so the data contained only the
selected variables used to form the linear regression model. The result was impressive, with
R-square = 98%. A further study evaluated the performance of neural networks in used car price
prediction. The predicted values, however, were not very close to the actual prices,
especially for cars with a higher price. That study concluded that support vector machine
regression slightly outperforms neural networks and linear regression in predicting used car
prices.

Another work applied an online used car price evaluation model using an optimized BP neural
network algorithm. The authors introduced a new optimization method called the Like
Block-Monte Carlo Method to optimize the hidden neurons. The results showed that the optimized
model yielded higher accuracy than the non-optimized model. Based on these related works, we
realized that none of them had implemented a gradient boosting technique for used car price
prediction. Thus, we decided to build a used car price evaluation model using gradient boosted
regression trees.

METHODOLOGY
We utilized several classic and state-of-the-art methods, including ensemble learning
techniques, with a 90% - 10% split for the training and test data. To reduce the time required
for training, we used 500 thousand examples from our dataset. The methods, described below,
include Linear Regression and Random Forest.
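
The 90% - 10% split described above can be sketched in plain Python. This is only a minimal illustration: the function name, the fixed seed and the synthetic (mileage, price) rows are assumptions made for the example, not the project's actual code or data.

```python
import random

def train_test_split(rows, test_fraction=0.10, seed=0):
    """Shuffle a copy of rows and split it into (train, test)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical stand-in for the dataset: (mileage, price) pairs.
rows = [(10_000 * i, 30_000 - 250 * i) for i in range(100)]
train, test = train_test_split(rows)
print(len(train), len(test))
```

Shuffling before splitting avoids a test set made only of the last rows of the file, which matters when listings are stored in some sorted order.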

Linear Regression
Regression is a statistical method for identifying how one feature affects
another. The affecting feature is the independent feature, while the other is
the dependent feature. Analyzing the features makes it possible to identify how
the relationships between features correlate with price. This correlation can
serve different objectives; the most important is prediction, which is one of
the typical uses of regression. Linear Regression was chosen as the first model
due to its simplicity and comparatively small training time. The features,
without any feature mapping, were used directly as the feature vectors. No
regularization was used since the results clearly showed low variance.
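
As a minimal sketch of the idea, the following fits an unregularized least-squares line with a single independent feature. The mileage and price values are hypothetical, and the project itself used many features rather than one.

```python
def fit_simple_linear_regression(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: price drops as mileage grows.
mileage = [10_000, 20_000, 30_000, 40_000]
price = [19_500, 19_000, 18_500, 18_000]

slope, intercept = fit_simple_linear_regression(mileage, price)
predicted = slope * 25_000 + intercept  # predicted price at 25,000 miles
print(slope, intercept, predicted)
```

Here mileage is the independent feature and price is the dependent feature; the slope says how much the price changes per additional mile.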

Random Forest
Random Forest is an ensemble learning based regression model. As the name
suggests, it uses multiple decision trees to generate an ensemble model that
collectively produces a prediction. The benefit of this model is that the trees
are produced in parallel and are relatively uncorrelated, thus producing good
results because each tree is not prone to the individual errors of the other
trees. This uncorrelated behavior is partly ensured by the use of Bootstrap
Aggregation, or bagging, which provides the randomness required to produce
robust and uncorrelated trees. This model was hence chosen to account for the
large number of features in the dataset and to compare a bagging technique with
the following gradient boosting method.
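
The bagging step can be illustrated with a small sketch. It assumes one-split regression trees (stumps) on a single feature and hypothetical age and price values. A real random forest also samples a random subset of features at each split, which this one-feature example cannot show.

```python
import random

def fit_stump(xs, ys):
    """Fit a depth-1 regression tree: one threshold and two leaf means."""
    best = None
    for t in sorted(set(xs))[:-1]:  # the largest x would leave the right leaf empty
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    if best is None:  # every x is identical: predict the mean on both sides
        mean = sum(ys) / len(ys)
        return (xs[0], mean, mean)
    return best[1:]

def predict_stump(stump, x):
    t, left_mean, right_mean = stump
    return left_mean if x <= t else right_mean

def fit_bagged_stumps(xs, ys, n_trees=25, seed=0):
    """Train each stump on its own bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    n = len(xs)
    stumps = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        stumps.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return stumps

def predict_bagged(stumps, x):
    """The ensemble prediction is the average over all trees."""
    return sum(predict_stump(s, x) for s in stumps) / len(stumps)

# Hypothetical data: car age in years, price in thousands of dollars.
ages = [1, 2, 3, 4, 5, 6, 7, 8]
prices = [20, 19, 18, 17, 10, 9, 8, 7]
forest = fit_bagged_stumps(ages, prices)
print(predict_bagged(forest, 1), predict_bagged(forest, 8))
```

Because every stump sees a different resample, the stumps pick different thresholds, and averaging them smooths out the error of any single tree.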

Gradient Boost
Gradient Boosting is another decision tree based method that is generally
described as “a method of transforming weak learners into strong learners”. This
means that, like a typical boosting method, observations are assigned different
weights based on certain metrics. The weights of difficult-to-predict
observations are increased, and the observations are then fed into another tree
to be trained. In this case the metric is the gradient of the loss function.
This model was chosen to account for non-linear relationships between the
features and the predicted price, by splitting the data into 100 regions.
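
A minimal sketch of gradient boosting for regression follows. It assumes a squared-error loss, for which the negative gradient is simply the residual, and uses one-split trees as the weak learners. The number of rounds, the learning rate and the data are illustrative assumptions, not the settings used in the project.

```python
def fit_stump(xs, ys):
    """Fit a depth-1 regression tree: one threshold and two leaf means."""
    best = None
    for t in sorted(set(xs))[:-1]:  # the largest x would leave the right leaf empty
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    if best is None:  # every x is identical: predict the mean on both sides
        mean = sum(ys) / len(ys)
        return (xs[0], mean, mean)
    return best[1:]

def predict_stump(stump, x):
    t, left_mean, right_mean = stump
    return left_mean if x <= t else right_mean

def fit_gradient_boosting(xs, ys, n_rounds=50, learning_rate=0.1):
    """Each round fits a stump to the current residuals and adds a scaled copy."""
    base = sum(ys) / len(ys)  # initial prediction: the mean price
    preds = [base] * len(xs)
    stumps = []
    for _ in range(n_rounds):
        # Negative gradient of 0.5 * (y - pred)^2 with respect to pred.
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * predict_stump(stump, x)
                 for p, x in zip(preds, xs)]
    return base, learning_rate, stumps

def predict_boosted(model, x):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(predict_stump(s, x) for s in stumps)

def mse(actual, predicted):
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

# Hypothetical data: car age in years, price in thousands of dollars.
ages = [1, 2, 3, 4, 5, 6, 7, 8]
prices = [20, 19, 18, 17, 10, 9, 8, 7]
model = fit_gradient_boosting(ages, prices)
baseline_mse = mse(prices, [sum(prices) / len(prices)] * len(prices))
boosted_mse = mse(prices, [predict_boosted(model, a) for a in ages])
print(baseline_mse, boosted_mse)
```

Unlike bagging, the trees here are built one after another, and each new tree concentrates on the observations the current ensemble still predicts poorly.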

Artificial Neural Network


An Artificial Neural Network is an interconnected architecture consisting of an input
layer where the input data is placed, one or more hidden layers where artificial neurons are
stacked on top of each other, and an output layer where the prediction or classification is made.

Forward Propagation
Forward propagation is the technique in which data moves sequentially through the input
layer, the hidden layers and the output layer.
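
Forward propagation through a tiny network can be sketched as follows. The layer sizes, the weights and the choice of a sigmoid hidden layer with a linear output are assumptions made for illustration only.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dense(inputs, weights, biases, activation):
    """One layer: activation(w . x + b), one output per row of weights."""
    return [activation(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]

def forward(x, hidden_w, hidden_b, out_w, out_b):
    """Input layer -> hidden layer -> output layer, in that order."""
    hidden = dense(x, hidden_w, hidden_b, sigmoid)
    output = dense(hidden, out_w, out_b, lambda z: z)  # linear output for regression
    return hidden, output

# Hypothetical weights: 2 inputs -> 2 hidden neurons -> 1 output.
hidden, output = forward(
    [1.0, 2.0],
    hidden_w=[[1.0, -0.5], [0.5, 0.25]],
    hidden_b=[0.0, 0.0],
    out_w=[[2.0, 1.0]],
    out_b=[0.5],
)
print(hidden, output)
```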

Back Propagation
Back-propagation runs in the opposite direction to forward propagation: the error is
propagated backward through the network, from the output layer toward the input layer. It is
used to adjust the weights of the neural network after the errors have been computed from the
output of forward propagation.
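
The weight adjustment can be sketched for the simplest possible case: a single sigmoid neuron trained on one example with a squared-error loss. The learning rate, the starting weights and the target value are assumptions for the example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_step(w, b, x, target, lr=0.5):
    """One forward pass, one backward pass, one gradient-descent update."""
    # Forward propagation.
    z = w * x + b
    a = sigmoid(z)
    loss = 0.5 * (a - target) ** 2
    # Back propagation (chain rule): dL/dw = dL/da * da/dz * dz/dw.
    dloss_da = a - target
    da_dz = a * (1.0 - a)  # derivative of the sigmoid
    grad_w = dloss_da * da_dz * x
    grad_b = dloss_da * da_dz
    return w - lr * grad_w, b - lr * grad_b, loss

w, b = 0.0, 0.0
losses = []
for _ in range(200):
    w, b, loss = train_step(w, b, x=1.0, target=0.9)
    losses.append(loss)
print(losses[0], losses[-1])
```

In a network with several layers the same chain rule is applied layer by layer, starting from the output and reusing the gradient already computed for the layer after it.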

Activation Functions
Activation functions are an important part of neural networks. They allow neural networks
to solve problems by generating non-linear functions. The three most commonly used
activation functions are Sigmoid, TanH and ReLU. Activation functions are used in both
forward and back propagation: in forward propagation the activation function produces the
output that is compared with the real value to calculate the loss, and in back propagation
it is used to update the parameters of the neural network. Figure 2 shows activation
functions commonly used in neural networks.

Figure 2: Common Activation Functions
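
The three activation functions named above have standard definitions, written here in plain Python as a short sketch.

```python
import math

def sigmoid(z):
    """Squashes any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def tanh(z):
    """Squashes any real number into the range (-1, 1)."""
    return math.tanh(z)

def relu(z):
    """Passes positive values through and clips negative values to zero."""
    return max(0.0, z)

for z in (-2.0, 0.0, 2.0):
    print(z, sigmoid(z), tanh(z), relu(z))
```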

CONVOLUTIONAL NEURAL NETWORKS


Convolutional Neural Networks apply Deep Learning to Computer Vision. A Convolutional
Neural Network is also known as a ConvNet or CNN. A CNN is useful for feature extraction and
classification of objects in an image, and is essentially a stack of different layers. It is
widely used in the healthcare industry; some of the applications are tumor or cancer
detection, drug discovery and disease diagnosis.
Basically, it has three layers:
a) Convolution Layer
b) Pooling Layer
c) Fully-Connected Layer
Figure 3: Convolutional Neural Network Architecture

Convolutional Layer
The linear operations used in convolutional neural networks are carried out in convolutional
layers. Each node in the hidden layer uses an image processing feature detector to extract a
different feature from the input image.
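
A feature detector sliding over an image can be sketched as below. The example assumes no padding and a stride of one, and uses a hypothetical 2x2 vertical-edge kernel on a tiny image. Like most deep learning libraries, it does not flip the kernel, so strictly speaking it computes a cross-correlation.

```python
def convolve2d(image, kernel):
    """Slide the kernel over the image and sum the element-wise products."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[i + di][j + dj] * kernel[di][dj]
                 for di in range(kh) for dj in range(kw))
             for j in range(out_w)]
            for i in range(out_h)]

# Hypothetical image: dark on the left (0), bright on the right (1).
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
# Hypothetical feature detector that responds to a dark-to-bright vertical edge.
vertical_edge = [[-1, 1],
                 [-1, 1]]

feature_map = convolve2d(image, vertical_edge)
print(feature_map)
```

The feature map is non-zero only where the window straddles the edge, which is what "extracting a feature" means in this context.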

Model Implementation and Evaluation:

The above diagram shows the flowchart of the processing steps in our project. In the
first step, data is collected for the project. For our project we have taken the
data related to car attributes in CSV form from kaggle.com, which is a data
science learning website.
We have considered 302 samples of car data in our research. In the next step,
data processing is done.

We have used the Python pandas module to extract the data in CSV form, stored in an
Excel sheet, and used it further in our project for building the required machine
learning models.

Then, using the data in the Excel sheet, training and testing of the models is done. We
have used regression machine learning algorithms for prediction in our
project. The three algorithms used in our project are Linear regression, Lasso
regression and Ridge regression.
With the SVM (Support Vector Machine) classifier, the data is divided into two
parts, i.e. 75% of the data is used for training and 25% of the data is used
for testing.
Then the user enters the car details on the interface and submits them to the system.
The system processes that input using the machine learning models, and in
the final step the output is produced and displayed based on the algorithm that
achieves the highest accuracy.

DATASET
Building deep learning models requires a lot of data. For this project, datasets were
researched and identified before any real work began. Since there is a heavy emphasis on
building models for this project, a key part of the project relies on the dataset. Prior to
coding, we had to ensure we had a good dataset to work with to build a model.
• We have to check for duplicates, null values and the dataset types.
• As the VIN is unique for every car, we filter the listings by VIN to eliminate the duplicate
values.
• All the data types are good except mileage and price, so we find the wrong data and correct
it.
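
The cleaning steps listed above can be sketched in plain Python. The dictionary keys `vin`, `mileage` and `price`, the sample rows, and the string formats being repaired are all assumptions for illustration; the project's real column names and its pandas-based code are not shown in this report. Rows that cannot be repaired are skipped in this sketch.

```python
def clean_listings(rows):
    """Drop rows with missing or duplicate VINs and coerce mileage/price to numbers."""
    seen = set()
    cleaned = []
    for row in rows:
        vin = row.get("vin")
        if not vin or vin in seen:  # null VIN or duplicate listing
            continue
        try:
            mileage = int(str(row["mileage"]).replace(",", "").replace(" mi", "").strip())
            price = float(str(row["price"]).replace("$", "").replace(",", "").strip())
        except (KeyError, ValueError):
            continue  # mileage or price missing or not parseable
        seen.add(vin)
        cleaned.append({**row, "mileage": mileage, "price": price})
    return cleaned

# Hypothetical raw listings.
rows = [
    {"vin": "VIN001", "mileage": "12,500 mi", "price": "$18,990"},
    {"vin": "VIN001", "mileage": "12,500 mi", "price": "$18,990"},  # duplicate VIN
    {"vin": "VIN002", "mileage": 40210, "price": 9995.0},
    {"vin": None, "mileage": "100", "price": "5000"},               # missing VIN
]
cleaned = clean_listings(rows)
print(cleaned)
```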

Exploratory Data Analysis :

After clear analysis of the data, it is observed that we have:
• Vehicles from 49 different states across the United States, with 41 different
car brands.
• Vehicle models with manufacture years from 1997 to 2022.
• Cars with a minimum price of 2995 and a maximum of up to 272990 dollars.
• 11 columns with numerical values, 6 columns with integer values and 1
column with float values.

Listings as per State: Texas stands in first position in the number of listings
with 2162 (21.8%), next is Florida with around 1532 (15.5%) car listings, and
California has 1187 (12%). The least number of cars posted online is from the
state of Maine, with just 3.

Listings as per Brand: Ford is the dominant brand on the used car market, and
Chevrolet, Toyota, Jeep and Nissan are the other top 4 makers. Together they
account for about 40% of the used car listings.

EXPERIMENT
TECHNOLOGY DECISION
Overview
In this section, we give the details about the technologies that were used for this project.
Although many tools exist in the market, we found that the tools outlined here perform
well for the problem that needs to be solved.
PYTHON
Python is a high-level language used for general-purpose programming. It is widely used
in scientific computing and can be used for a wide range of common tasks from data mining
to software development. Python is the main language used in this project.
GOOGLE COLAB

Google Colab is developed by Google Research. Colab allows anyone to write and run
arbitrary Python code through a browser and is particularly suited for machine learning, data
analysis, and training. Libraries like NumPy, Pandas, etc. are supported
by Google Colab.
Importing Python libraries such as NumPy, Pandas, and more makes it very easy to process data
and perform common and complex operations with one line of code.

Numpy
NumPy is a Python library that efficiently performs numerical calculations. This
library is optimized for solving math problems. NumPy can also perform mathematical
operations more efficiently than the Python math library.
Pandas
Pandas is a library in Python that, like NumPy, is also used for data preprocessing and
preparation. One of the key features of Pandas is the DataFrame and Series data structures.
These data structures are optimized and include convenient indexing that allows various
operations like reorganization, slicing, merging, concatenation, etc. Pandas and NumPy are
very efficient when used together to manage data.

OpenCV
Open Source Computer Vision (OpenCV) is a well-established computer vision library
written in C/C++ and abstracted for interoperability with C++, Python, and Java. It is a
powerful imaging tool that includes many tools for image processing, feature extraction, and
more.

DATA AUGMENTATION –
Data augmentation is a set of methods to artificially increase the amount of data by creating
new data points from existing data. This involves making small changes to the data to create
new data points. Data augmentation is useful for improving the performance and results of
machine learning models by generating new and good examples for the training data set.
When the data set of a machine learning model is large and sufficient, the model is more
accurate and performs better.
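
One common small change for image data is a horizontal flip. The sketch below assumes images stored as nested lists of pixel values; the function names and the tiny image are illustrative only.

```python
def horizontal_flip(image):
    """Mirror an image left to right, leaving the original unchanged."""
    return [list(reversed(row)) for row in image]

def augment(images):
    """Return the original images plus one flipped copy of each."""
    return images + [horizontal_flip(img) for img in images]

# Hypothetical 2x2 grayscale image.
images = [[[1, 2],
           [3, 4]]]
augmented = augment(images)
print(len(images), len(augmented))
```

The flipped copy keeps the same label as the original, so the training set doubles in size without any new data collection.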
The main options when compiling a model are the optimizer and the loss function.
OPTIMIZER
The optimizer is the technique used to optimize the cost function using gradient descent. The
Adam optimizer is used to optimize the CNN model and training process, and some
hyperparameters are also used. Adam (Adaptive Moment Estimation) is an optimization
algorithm for gradient descent. This method is very effective when solving large problems
with large amounts of data or parameters, and it requires less memory. Intuitively, it is a
combination of the "gradient descent with momentum" algorithm and the "RMSProp" algorithm.
The Adam optimizer thus combines two gradient descent methodologies:
a) Momentum
b) RMSProp (Root Mean Square Propagation)
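
The combination of the two ideas can be sketched for a single parameter. The update rule below is the standard Adam rule; the learning rate, the number of steps and the toy objective (x - 3)^2 are assumptions chosen for the example.

```python
import math

def adam_minimize(grad, x0, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=500):
    """Minimize a one-parameter function given its gradient, using Adam."""
    x, m, v = x0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad(x)
        m = beta1 * m + (1 - beta1) * g      # momentum: running mean of gradients
        v = beta2 * v + (1 - beta2) * g * g  # RMSProp: running mean of squared gradients
        m_hat = m / (1 - beta1 ** t)         # bias correction for the zero start
        v_hat = v / (1 - beta2 ** t)
        x -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return x

# Toy objective f(x) = (x - 3)^2, whose gradient is 2 * (x - 3).
x_min = adam_minimize(lambda x: 2 * (x - 3), x0=0.0)
print(x_min)
```

Dividing by the root of the squared-gradient average gives each parameter its own effective step size, while the momentum term smooths the direction of travel.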

LOSS FUNCTION
A loss function is used to track whether a model improves with training.
CATEGORICAL CROSS-ENTROPY LOSS FUNCTION

This is also known as logarithmic loss, log loss or logistic loss. Each predicted class
probability is compared to the actual desired class output, 0 or 1, and a loss is calculated that
penalizes the probability based on how far it is from the actual expected value. Categorical
cross-entropy is used for multi-class classification deep learning models. The aim is to reduce
the loss of the model.
Cross-entropy is defined as the negative sum, over all classes, of the true class label (0 or 1)
multiplied by the logarithm of the predicted probability for that class.
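
For a single sample, the standard categorical cross-entropy can be sketched as below. The one-hot labels and the predicted probabilities are hypothetical, and the small `eps` guard against log(0) is an implementation assumption.

```python
import math

def categorical_cross_entropy(y_true, y_pred, eps=1e-12):
    """Loss = -sum over classes of y_true[c] * log(y_pred[c])."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(y_true, y_pred))

# The true class is the second of three classes (one-hot encoded).
confident = categorical_cross_entropy([0, 1, 0], [0.05, 0.90, 0.05])
unsure = categorical_cross_entropy([0, 1, 0], [0.40, 0.20, 0.40])
print(confident, unsure)
```

The loss is small when the model puts high probability on the true class and grows quickly as that probability approaches zero.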

RESULT
The following results are evaluated using the testing data as input to multiple linear regression,
random forest regression, and gradient boosted regression trees. The results are then
compared using mean absolute error (MAE) as the criterion, for multiple
linear regression, random forest regression, and gradient boosted regression trees, in order.
Gradient boosted regression yields the best performance with an MAE of only 0.28. Random forest
regression is in second place with MAE = 0.35. Multiple linear regression has a relatively large
MAE of 0.55 when compared with the others.
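
Mean absolute error and mean squared error are computed differently, as the short sketch below shows. The actual and predicted values are hypothetical and are not taken from the project's results.

```python
def mean_absolute_error(actual, predicted):
    """Average of the absolute differences between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def mean_squared_error(actual, predicted):
    """Average of the squared differences; large errors are penalized more."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

# Hypothetical prices, e.g. in thousands of dollars.
actual = [10.0, 12.0, 8.0, 15.0]
predicted = [11.0, 11.5, 8.0, 13.0]

mae = mean_absolute_error(actual, predicted)
mse = mean_squared_error(actual, predicted)
print(mae, mse)
```

Because the two metrics are on different scales, models should be ranked using the same metric throughout a comparison.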

PREDICTIONS DONE BY MODEL

FUTURE WORK
For future work, we are interested in using the regression models to train and test on
datasets of other cars. We are also interested in finding other methods to predict a car's
price. We have developed an accurate web application for used car price prediction.

CONCLUSION
Car price prediction is a challenging task because of the high number of attributes that
should be considered for accurate prediction. The most important step in the
prediction process is the collection and preprocessing of the data. During the research,
car data collected from kaggle.com was converted into CSV form and used for building the
machine learning models. Three algorithms, namely Linear, Lasso and Ridge
Regression, were utilized in this project. With the SVM (Support Vector Machine) classifier,
the data was divided into two parts for training and testing, i.e. 75% of the data was used
for training and 25% of the data was used for testing. The
accuracies of the three machine learning models were checked and compared with one
another. The final result was predicted with the algorithm that achieved the highest
accuracy. The main drawback of this project was the small number of records that were
utilized. As future work, we expect to collect more data and to utilize more
advanced methods like Random Forest, ANN (Artificial Neural Network) and CNN
(Convolutional Neural Network), with a better user interface experience.

REFERENCES

Dataset: https://www.kaggle.com/datasets/mohamedhanyyy/chest-ctscan-images (accessed on
10 October 2022).

[1] Prajak Chertchom, Thongchai Kaewkiriya, Suwat Rungpheung, Sabir Buya,
Pitchayakit Boonpou, Nitis Monburinon, “Prediction of Prices for Used Car by Using
Regression Models”, IEEE, 2018.

[2] Nur Oktavin Idris, Aspian Achban Siti Andini Utiarahman, Fuad Pontoiyo, Jorry
Karim, “Predicting the Selling Price of Cars Using Business Intelligence with the
Feed-forward Backpropagation Algorithms”, 2015 IEEE 7th International Conference on Cloud
Computing Technology and Science (CloudCom), Vancouver, BC, 2015.

[3] Ayhan Demiriz, “Used Car Pricing and Beyond: A Survival Analysis Framework”,
Gebze Technical University, 41400, Kocaeli, Turkey, 41–62, 2018. [Online].
Available: IEEE.

APPENDIX
CODE
