
 Title: Car Price Prediction

 Abstract
- The purpose of this project is to train a machine learning model to predict the price of a
car. To do that, we obtained a dataset and edited it to match the format that the original
Python program expects. Every dataset is different, so each one has to be edited so that it
supplies as much relevant information as the model needs to predict the price, without
confusing it with irrelevant columns. In our case, we used only cars from the Audi brand.
The dataset contains about 10,000 Audi listings (10,668 rows). They differ in price, year,
mileage, engine size, model, fuel type, and fuel consumption.
The project is based on this link:
- https://thecleverprogrammer.com/2021/08/04/car-price-prediction-with-machine-learning/
The project was made on:
- https://colab.research.google.com

 Introduction (Why you choose this project?)


- We chose this project because we like cars and know them well, and because machine learning
models are a big part of technology now and will be in the future. The program could be built
into car sales websites to predict the price of the car a customer is looking for, so that the
customer does not get overcharged. In that way the project also benefits society.

 Summary of any one research paper


- The research paper shows how essential price prediction with machine learning is. It
includes statistics on used and new car prices and shows the differences between them.
It also describes how websites such as eBay need machine learning to predict product
prices, so that users can tell whether they are getting a good deal.
 Data Set Description
https://www.kaggle.com/datasets/rohitagrawal362/audi-car-price-prediction
model  year  price  transmission  mileage  fueltype  tax  highway mpg  enginesize
A1     2017  12500  Manual        15735    Petrol    150  55.4         1.4
A6     2016  16500  Automatic     36203    Diesel    20   64.2         2
A1     2016  11000  Manual        29946    Petrol    30   55.4         1.4
A4     2017  16800  Automatic     25952    Diesel    145  67.3         2
A3     2019  17300  Manual        1998     Petrol    145  49.6         1
Data Base 1: /content/audi.csv

- Table 1. The first 5 rows of the Audi dataset (audi.csv), which is the main database.

The number of rows: 10668 (as reported by audi.info())

The number of columns: 9
Attributes: Model, Year, Price, Transmission, Mileage, Fuel Type, Tax, Highway Mpg,
Engine Size
Attributes Explanation:
Model: The name of the car model.
Year: The year of the car's manufacture.
Price: The price of the car.
Transmission: The type of transmission the car has.
Mileage: The total distance the car has traveled in miles.
Fuel Type: The type of fuel the car uses.
Tax: The tax amount associated with the car.
Highway Mpg: The car's estimated miles per gallon on the highway.
Engine Size: The size of the car's engine in liters.

Data Base 2: https://raw.githubusercontent.com/amankharwal/Website-data/master/CarPrice.csv

Car_ID  Symbolling  Car Name                  Fuel Type  Aspiration  Doors Number  CarBody
1       3           alfa-romero giulia        gas        std         two           convertible
2       3           alfa-romero stelvio       gas        std         two           convertible
3       1           alfa-romero Quadrifoglio  gas        std         two           hatchback
4       2           audi 100 ls               gas        std         four          sedan
5       2           audi 100ls                gas        std         four          sedan
continuing… ↓

Drive wheel  Engine location  Wheelbase  Engine size  Fuel System  Bore ratio  Stroke  Compression ratio  Horsepower
RWD          front            88.6       130          mpfi         3.47        2.68    9                  111
RWD          front            88.6       130          mpfi         3.47        2.68    9                  111
RWD          front            94.5       152          mpfi         2.68        3.47    9                  154
FWD          front            99.8       109          mpfi         3.19        3.4     10                 102
4WD          front            99.4       136          mpfi         3.19        3.4     8                  115
continuing… ↓

Peak rpm  City mpg  Highway mpg  Price
5000      21        27           13495
5000      21        27           16500
5000      19        26           16500
5500      24        30           13950
5500      18        22           17450

- Table 2. The first 5 rows of the second database (selected columns).

Number of Rows: 205 & Number of Columns: 26 (as reported by DB2.info())


Attributes: car_ID: An identifier or index for each row in the table.
Symbolling: The insurance risk rating symbol associated with the car.
CarName: The name or model of the car.
Fuel type: The type of fuel the car uses
Aspiration: The type of aspiration or turbocharging the car has
Door number: The number of doors on the car.
Carbody: The type of car body or body style.
Drivewheel: The type of drivewheel or wheel drive configuration
Engine location: The location of the car's engine (e.g., front, rear).
Wheelbase: The distance between the centers of the front and rear wheels in inches.
Engine size: The size or displacement of the car's engine in cubic centimeters (CC).
Fuel system: The type of fuel delivery system used in the car
Bore ratio: The bore ratio of the car's engine.
Stroke: The stroke ratio of the car's engine.
Compression ratio: The compression ratio of the car's engine.
Horsepower: The power output of the car's engine in horsepower.
Peak rpm: The peak revolutions per minute (rpm) at which the car's engine generates its max
power.
Citympg: The car's estimated fuel efficiency in miles per gallon (mpg) during city driving.
Highway mpg: The car's estimated fuel efficiency in miles per gallon (mpg) during highway
driving.
Price: The price of the car.

 Algorithm
The algorithm for the project operates as follows:
1. The program needs input. The input is the database, which we load from a CSV file (a
spreadsheet export).
2. Edit the database so that it works with the program, because every database is different and
the program expects a specific layout.
3. Edit the program so that it works for its intended purpose. This includes, but is not limited
to, choosing which attributes of the car the program receives as input. Some databases include
the dimensions of the car, but the database we chose does not. In our opinion the model, engine,
year, and similar attributes matter more for predicting the price than the size of the car.
4. The program then outputs summary results for the database we supplied. The machine learning
model uses this information to learn how to predict the price of a car.
5. The user then asks the model about a specific car. For example, the user chooses a car with
the following details: Audi A5, 2014, Diesel, 2.0, 280000 KM.
6. The machine learning model outputs the price it predicts for the requested car, based on the
database it was given.
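
The flow described above can be sketched as follows. This is only a minimal illustration under our own assumptions, not the code used in the project: the training table `cars` is made up (it is not audi.csv), the text columns are turned into numbers with one-hot encoding, and the user's mileage figure is passed through as a plain number without any unit conversion.

```python
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# Made-up training table (NOT the real audi.csv); it only reuses the column names.
cars = pd.DataFrame({
    "model": ["A1", "A3", "A4", "A5", "A5", "A6"],
    "year": [2017, 2019, 2017, 2014, 2016, 2016],
    "mileage": [15735, 1998, 25952, 90000, 40000, 36203],
    "fueltype": ["Petrol", "Petrol", "Diesel", "Diesel", "Diesel", "Diesel"],
    "enginesize": [1.4, 1.0, 2.0, 2.0, 2.0, 2.0],
    "price": [12500, 17300, 16800, 11000, 18000, 16500],
})

# Prepare the input: text columns become 0/1 indicator columns the tree can use.
features = pd.get_dummies(cars.drop(columns=["price"]))
target = cars["price"]

# Train the model on the prepared table.
model = DecisionTreeRegressor(random_state=0)
model.fit(features, target)

# The car the user asks about, encoded the same way and aligned to the training columns.
request = pd.DataFrame([{
    "model": "A5", "year": 2014, "mileage": 280000,
    "fueltype": "Diesel", "enginesize": 2.0,
}])
request_encoded = pd.get_dummies(request).reindex(columns=features.columns, fill_value=0)

predicted_price = model.predict(request_encoded)[0]
print(predicted_price)
```

With so few rows the tree can only return one of the prices it has already seen, so the printed value is not meaningful; the point is the order of the steps.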
 Flowchart

Figure 1
 Experiment results (Entire code "change all the variable names", All outputs, All figures
outputs with explanations)
- The entire code of the first database with the variable names changed:
import seaborn as sns
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error

# Load the Audi dataset and inspect it
audi = pd.read_csv("/content/audi.csv")
audi.head()
audi.isnull().sum()
audi.info()
print(audi.describe())

# Distribution of the car prices
sns.set_style("whitegrid")
plt.figure(figsize=(15, 10))
sns.distplot(audi.price)
plt.show()

# Correlations between the numeric columns (pandas 2.0 and later need numeric_only=True)
correlations = audi.corr(numeric_only=True)
print(correlations)
plt.figure(figsize=(20, 15))
sns.heatmap(correlations, cmap="coolwarm", annot=True)
plt.show()

# Keep two features plus the target, then split into features (x) and target (y)
predict = "price"
audi = audi[["enginesize", "highwaympg", "price"]]
x = np.array(audi.drop(columns=[predict]))  # the notebook used data.drop([predict], 1)
y = np.array(audi[predict])
xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=0.2)

# Train the decision tree and evaluate it on the held-out rows
model = DecisionTreeRegressor()
model.fit(xtrain, ytrain)
predictions = model.predict(xtest)
# model.score(xtest, predictions) compares the predictions with themselves and always
# gives 1.0; scoring against ytest gives the actual R^2 on the test rows
model.score(xtest, ytest)
print(audi)
The entire code of database 1 explained in chunks (first a chunk of the code is shown,
then its output, and then an explanation):
import seaborn as sns
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
audi = pd.read_csv("/content/audi.csv")
audi.head()

- In these lines of code we import the modules the project needs, such as pandas. After the
imports, the program loads audi.csv and outputs the first 5 rows of the Audi database.
id model year price transmission mileage fueltype tax highwaympg enginesize
0 A1 2017 12500 Manual 15735 Petrol 150 55.4 1.4
1 A6 2016 16500 Automatic 36203 Diesel 20 64.2 2.0
2 A1 2016 11000 Manual 29946 Petrol 30 55.4 1.4
3 A4 2017 16800 Automatic 25952 Diesel 145 67.3 2.0
4 A3 2019 17300 Manual 1998 Petrol 145 49.6 1.0

- Table 3. The Python program has printed the first 5 rows of the database we loaded,
audi.csv. The leftmost column is the row index that pandas adds.

audi.isnull().sum()
model 0
year 0
price 0
transmission 0
mileage 0
fueltype 0
tax 0
highwaympg 0
enginesize 0
dtype: int64

- Table 4. isnull() is a pandas function that marks every empty (null) cell as True and
every filled cell as False. Adding .sum() counts the True values in each column, so the
output is the number of missing values per column. Every count here is 0, which means
the database has no missing values and can be used as it is.
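
To show what a non-zero count would look like, here is a small sketch on a made-up table (the `sample` DataFrame is our own illustration, not part of audi.csv) in which two cells are deliberately left empty.

```python
import numpy as np
import pandas as pd

# Made-up table with two missing cells, for illustration only.
sample = pd.DataFrame({
    "model": ["A1", "A6", None],
    "price": [12500, np.nan, 11000],
    "tax": [150, 20, 30],
})

# True marks a missing cell; summing the True values counts them per column.
missing_per_column = sample.isnull().sum()
print(missing_per_column)

# A single yes/no answer for the whole table.
has_missing = bool(sample.isnull().values.any())
print(has_missing)
```

Here the model and price columns each report 1 and tax reports 0, unlike the Audi database where every column reports 0.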
audi.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10668 entries, 0 to 10667
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 model 10668 non-null object
1 year 10668 non-null int64
2 price 10668 non-null int64
3 transmission 10668 non-null object
4 mileage 10668 non-null int64
5 fueltype 10668 non-null object
6 tax 10668 non-null int64
7 highwaympg 10668 non-null float64
8 enginesize 10668 non-null float64
dtypes: float64(2), int64(4), object(3)
memory usage: 750.2+ KB

- Table 5. audi.info() shows the technical information about the database, including the
number of entries, the name, non-null count, and data type of each column, and the memory usage.
print(audi.describe())

year price mileage tax highwaympg \
count 10668.000000 10668.000000 10668.000000 10668.000000 10668.000000
mean 2017.100675 22896.685039 24827.244001 126.011436 50.770022
std 2.167494 11714.841888 23505.257205 67.170294 12.949782
min 1997.000000 1490.000000 1.000000 0.000000 18.900000
25% 2016.000000 15130.750000 5968.750000 125.000000 40.900000
50% 2017.000000 20200.000000 19000.000000 145.000000 49.600000
75% 2019.000000 27990.000000 36464.500000 145.000000 58.900000
max 2020.000000 145000.000000 323000.000000 580.000000 188.300000

continuing below >>

enginesize
count 10668.000000
mean 1.930709
std 0.602957
min 0.000000
25% 1.500000
50% 2.000000
75% 2.000000
max 6.300000

- Table 6. describe() summarizes the numeric columns of the DataFrame. For each column it
shows the number of values (count), the average (mean), the standard deviation (std), which
measures how spread out the values are, the minimum and maximum, and the 25%, 50%, and
75% quartiles.
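
The mean and the standard deviation are two different rows of this output. As a small check, the sketch below runs describe() on just the five prices shown in Table 1 (so the numbers are for those five rows only, not for the whole database).

```python
import pandas as pd

# The five prices from the first rows of the Audi table (Table 1).
prices = pd.Series([12500, 16500, 11000, 16800, 17300], name="price")

summary = prices.describe()
print(summary)

# The same two statistics computed directly.
mean_price = prices.mean()   # average value
std_price = prices.std()     # sample standard deviation, as reported by describe()
```

The mean tells us the typical price, while the standard deviation tells us how far the prices are spread around it.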
sns.set_style("whitegrid")
plt.figure(figsize=(15, 10))
sns.distplot(audi.price)
plt.show()

<ipython-input-8-b24cc0cfc4f5>:3: UserWarning:
`distplot` is a deprecated function and will be removed in seaborn v0.14.0.
Please adapt your code to use either `displot` (a figure-level function with
similar flexibility) or `histplot` (an axes-level function for histograms).

For a guide to updating your code to use the new functions, please see
https://gist.github.com/mwaskom/de44147ed2974457ad6372750bbe5751

Figure 2. This code draws the distribution of car prices in the database. The graph changes
with the database we load, because the prices differ from one database to another. If we treat
the database as a sample of the market, the graph shows how prices are spread across that
market. In this figure most cars are priced at around 20,000.
# Correlations between the numeric columns (pandas 2.0 and later need numeric_only=True)
correlations = audi.corr(numeric_only=True)
print(correlations)
plt.figure(figsize=(20, 15))
sns.heatmap(correlations, cmap="coolwarm", annot=True)
plt.show()

<ipython-input-9-0bb2be9b5c0c>:1: FutureWarning: The default value of numeric_only in


DataFrame.corr is deprecated. In a future version, it will default to False. Select only valid
columns or specify the value of numeric_only to silence this warning.
print(audi.corr())
<ipython-input-9-0bb2be9b5c0c>:3: FutureWarning: The default value of numeric_only in
DataFrame.corr is deprecated. In a future version, it will default to False. Select only valid
columns or specify the value of numeric_only to silence this warning.
correlations = audi.corr()

year price mileage tax highwaympg enginesize


year 1.000000 0.592581 -0.789667 0.093066 -0.351281 -0.031582
price 0.592581 1.000000 -0.535357 0.356157 -0.600334 0.591262
mileage -0.789667 -0.535357 1.000000 -0.166547 0.395103 0.070710
tax 0.093066 0.356157 -0.166547 1.000000 -0.635909 0.393075
highwaympg -0.351281 -0.600334 0.395103 -0.635909 1.000000 -0.365621
enginesize -0.031582 0.591262 0.070710 0.393075 -0.365621 1.000000

- Table 7.

- The output begins with a warning. It is a pandas FutureWarning: in a future version corr()
will no longer skip non-numeric columns by default, so numeric_only should be set explicitly.
Below the warning is the correlation table, and below the table is the figure, which shows the
same values as a color-coded chart.

Figure 3. This chart shows how strongly each pair of numeric columns in the database is
correlated. The values of 1 on the diagonal are expected, because every column is perfectly
correlated with itself. The other cells differ, and the columns that correlate most strongly
with price are the ones most useful to the machine learning model for determining the price
of a car.
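
One way to read the correlation table is to rank the columns by how strongly they correlate with price. The sketch below does this with the price row printed in Table 7, and also shows the numeric_only=True call on a few rows copied from Table 1. The variable names are our own.

```python
import pandas as pd

# Correlation of each column with price, copied from the printed table (Table 7).
price_correlations = pd.Series({
    "year": 0.592581,
    "mileage": -0.535357,
    "tax": 0.356157,
    "highwaympg": -0.600334,
    "enginesize": 0.591262,
})

# Rank by strength, ignoring the sign (a strong negative link is still informative).
ranking = price_correlations.abs().sort_values(ascending=False)
print(ranking)

# corr(numeric_only=True) leaves out text columns such as "model".
sample = pd.DataFrame({
    "model": ["A1", "A6", "A1", "A4", "A3"],
    "year": [2017, 2016, 2016, 2017, 2019],
    "price": [12500, 16500, 11000, 16800, 17300],
    "mileage": [15735, 36203, 29946, 25952, 1998],
})
sample_correlations = sample.corr(numeric_only=True)
```

On the full database, highwaympg, year, and enginesize have the strongest links with price, and tax has the weakest.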
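# Keep two features plus the target, then split into features (x) and target (y)
predict = "price"
audi = audi[["enginesize", "highwaympg", "price"]]
x = np.array(audi.drop(columns=[predict]))  # the notebook used data.drop([predict], 1)
y = np.array(audi[predict])
from sklearn.model_selection import train_test_split
xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=0.2)
from sklearn.tree import DecisionTreeRegressor
model = DecisionTreeRegressor()
model.fit(xtrain, ytrain)
predictions = model.predict(xtest)
from sklearn.metrics import mean_absolute_error
# model.score(xtest, predictions) compares the predictions with themselves and always
# gives 1.0; scoring against ytest gives the actual R^2 on the test rows
model.score(xtest, ytest)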

<ipython-input-19-a1c592d6430c>:3: FutureWarning: In a future version of pandas all


arguments of DataFrame.drop except for the argument 'labels' will be keyword-only.
x = np.array(data.drop([predict], 1))
1.0

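- In this output we see a pandas warning followed by the number 1.0. The warning says that
the axis argument of drop() will have to be passed by keyword in future pandas versions. The
value 1.0 is the result of model.score(xtest, predictions): because the predictions are
compared with themselves, the score is always 1.0. It shows that the program ran to the end,
but it does not measure how accurate the model is. For that, the score has to be computed
against the true test prices, ytest.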
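
The difference between the two ways of scoring can be seen in the sketch below. It uses randomly generated data (our own stand-in, not the Audi database), so the exact numbers carry no meaning; only the contrast between the scores matters.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Stand-in data: two numeric features and a noisy price-like target.
rng = np.random.default_rng(0)
x = rng.uniform(1.0, 3.0, size=(200, 2))
y = 10000 * x[:, 0] + rng.normal(0, 3000, size=200)

xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=0.2, random_state=0)
model = DecisionTreeRegressor(random_state=0)
model.fit(xtrain, ytrain)
predictions = model.predict(xtest)

# Comparing the predictions with themselves: always a perfect score.
self_score = model.score(xtest, predictions)

# Comparing the predictions with the true test values: the real R^2.
real_score = model.score(xtest, ytest)

# Average size of the error, in the same units as the price.
mae = mean_absolute_error(ytest, predictions)

print(self_score, real_score, mae)
```

The mean_absolute_error import that is already in the program can be used in this way to report the typical error in price units.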
print(audi)

- enginesize highwaympg price


0 1.4 55.4 12500
1 2.0 64.2 16500
2 1.4 55.4 11000
3 2.0 67.3 16800
4 1.0 49.6 17300
... ... ... ...
10663 1.0 49.6 16999
10664 1.0 49.6 16999
10665 1.0 49.6 17199
10666 1.4 47.9 19499
10667 1.4 47.9 15999

[10668 rows x 3 columns]

- Table 8. This output shows the modified version of the database we inputted. It has been
reduced to only 3 columns (enginesize, highwaympg, and price), because that was how we got the
database to work with the program, which needs numeric feature columns.
 The entire code of the second database with the variable names changed:

import seaborn as sns
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error

# Load the second dataset and inspect it
DB2 = pd.read_csv("https://raw.githubusercontent.com/amankharwal/Website-data/master/CarPrice.csv")
DB2.head()
DB2.isnull().sum()
DB2.info()
print(DB2.describe())
DB2.CarName.unique()

# Distribution of the car prices
sns.set_style("whitegrid")
plt.figure(figsize=(15, 10))
sns.distplot(DB2.price)
plt.show()

# Correlations between the numeric columns (pandas 2.0 and later need numeric_only=True)
correlations = DB2.corr(numeric_only=True)
print(correlations)
plt.figure(figsize=(20, 15))
sns.heatmap(correlations, cmap="coolwarm", annot=True)
plt.show()

# Keep the numeric features plus the target, then split into features (x) and target (y)
predict = "price"
DB2 = DB2[["symboling", "wheelbase", "carlength",
"carwidth", "carheight", "curbweight",
"enginesize", "boreratio", "stroke",
"compressionratio", "horsepower", "peakrpm",
"citympg", "highwaympg", "price"]]
x = np.array(DB2.drop(columns=[predict]))  # the notebook used DB2.drop([predict], 1)
y = np.array(DB2[predict])
xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=0.2)

# Train the decision tree and evaluate it on the held-out rows
model = DecisionTreeRegressor()
model.fit(xtrain, ytrain)
predictions = model.predict(xtest)
# model.score(xtest, predictions) compares the predictions with themselves and always
# gives 1.0; scoring against ytest gives the actual R^2 on the test rows
model.score(xtest, ytest)
print(DB2)

 The entire output of second database with explanations:

Car_ID  Symbolling  Car Name                  Fuel Type  Aspiration  Doors Number  CarBody
1       3           alfa-romero giulia        gas        std         two           convertible
2       3           alfa-romero stelvio       gas        std         two           convertible
3       1           alfa-romero Quadrifoglio  gas        std         two           hatchback
4       2           audi 100 ls               gas        std         four          sedan
5       2           audi 100ls                gas        std         four          sedan
continuing… ↓

Drive wheel  Engine location  Wheelbase  Engine size  Fuel System  Bore ratio  Stroke  Compression ratio  Horsepower
RWD          front            88.6       130          mpfi         3.47        2.68    9                  111
RWD          front            88.6       130          mpfi         3.47        2.68    9                  111
RWD          front            94.5       152          mpfi         2.68        3.47    9                  154
FWD          front            99.8       109          mpfi         3.19        3.4     10                 102
4WD          front            99.4       136          mpfi         3.19        3.4     8                  115
continuing… ↓

Peak rpm  City mpg  Highway mpg  Price
5000      21        27           13495
5000      21        27           16500
5000      19        26           16500
5500      24        30           13950
5500      18        22           17450

- Table 9. This table simply shows the first rows and columns of the database we inputted,
as returned by DB2.head().
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 205 entries, 0 to 204
Data columns (total 26 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 car_ID 205 non-null int64
1 symboling 205 non-null int64
2 CarName 205 non-null object
3 fueltype 205 non-null object
4 aspiration 205 non-null object
5 doornumber 205 non-null object
6 carbody 205 non-null object
7 drivewheel 205 non-null object
8 enginelocation 205 non-null object
9 wheelbase 205 non-null float64
10 carlength 205 non-null float64
11 carwidth 205 non-null float64
12 carheight 205 non-null float64
13 curbweight 205 non-null int64
14 enginetype 205 non-null object
15 cylindernumber 205 non-null object
16 enginesize 205 non-null int64
17 fuelsystem 205 non-null object
18 boreratio 205 non-null float64
19 stroke 205 non-null float64
20 compressionratio 205 non-null float64
21 horsepower 205 non-null int64
22 peakrpm 205 non-null int64
23 citympg 205 non-null int64
24 highwaympg 205 non-null int64
25 price 205 non-null float64
dtypes: float64(8), int64(8), object(10)
memory usage: 41.8+ KB

- This output shows the non-null count and the data type of every column in the database.
It also shows how many columns there are of each data type and the memory usage.
car_ID symboling wheelbase carlength carwidth carheight \
count 205.000000 205.000000 205.000000 205.000000 205.000000 205.000000
mean 103.000000 0.834146 98.756585 174.049268 65.907805 53.724878
std 59.322565 1.245307 6.021776 12.337289 2.145204 2.443522
min 1.000000 -2.000000 86.600000 141.100000 60.300000 47.800000
25% 52.000000 0.000000 94.500000 166.300000 64.100000 52.000000
50% 103.000000 1.000000 97.000000 173.200000 65.500000 54.100000
75% 154.000000 2.000000 102.400000 183.100000 66.900000 55.500000
max 205.000000 3.000000 120.900000 208.100000 72.300000 59.800000

curbweight enginesize boreratio stroke compressionratio \


count 205.000000 205.000000 205.000000 205.000000 205.000000
mean 2555.565854 126.907317 3.329756 3.255415 10.142537
std 520.680204 41.642693 0.270844 0.313597 3.972040
min 1488.000000 61.000000 2.540000 2.070000 7.000000
25% 2145.000000 97.000000 3.150000 3.110000 8.600000
50% 2414.000000 120.000000 3.310000 3.290000 9.000000
75% 2935.000000 141.000000 3.580000 3.410000 9.400000
max 4066.000000 326.000000 3.940000 4.170000 23.000000

horsepower peakrpm citympg highwaympg price


count 205.000000 205.000000 205.000000 205.000000 205.000000
mean 104.117073 5125.121951 25.219512 30.751220 13276.710571
std 39.544167 476.985643 6.542142 6.886443 7988.852332
min 48.000000 4150.000000 13.000000 16.000000 5118.000000
25% 70.000000 4800.000000 19.000000 25.000000 7788.000000
50% 95.000000 5200.000000 24.000000 30.000000 10295.000000
75% 116.000000 5500.000000 30.000000 34.000000 16503.000000
max 288.000000 6600.000000 49.000000 54.000000 45400.000000

- describe() summarizes the numeric columns of the DataFrame. For each column it shows the
number of values (count), the average (mean), the standard deviation (std), which measures
how spread out the values are, the minimum and maximum, and the 25%, 50%, and 75% quartiles.
<ipython-input-2-3b6c97159ec3>:7: UserWarning:

`distplot` is a deprecated function and will be removed in seaborn v0.14.0.

Please adapt your code to use either `displot` (a figure-level function with
similar flexibility) or `histplot` (an axes-level function for histograms).

For a guide to updating your code to use the new functions, please see
https://gist.github.com/mwaskom/de44147ed2974457ad6372750bbe5751

<ipython-input-2-3b6c97159ec3>:9: FutureWarning: The default value of numeric_only in


DataFrame.corr is deprecated. In a future version, it will default to False. Select only valid
columns or specify the value of numeric_only to silence this warning.
print(data.corr())
<ipython-input-2-3b6c97159ec3>:11: FutureWarning: The default value of numeric_only in
DataFrame.corr is deprecated. In a future version, it will default to False. Select only valid
columns or specify the value of numeric_only to silence this warning.
correlations = data.corr()

- In this output we can see a few warnings. They say that distplot and the default behavior
of corr() are deprecated, so the program will need to be edited to keep working with future
versions of the libraries.
Figure 4.
car_ID symboling wheelbase carlength carwidth \
car_ID 1.000000 -0.151621 0.129729 0.170636 0.052387
symboling -0.151621 1.000000 -0.531954 -0.357612 -0.232919
wheelbase 0.129729 -0.531954 1.000000 0.874587 0.795144
carlength 0.170636 -0.357612 0.874587 1.000000 0.841118
carwidth 0.052387 -0.232919 0.795144 0.841118 1.000000
carheight 0.255960 -0.541038 0.589435 0.491029 0.279210
curbweight 0.071962 -0.227691 0.776386 0.877728 0.867032
enginesize -0.033930 -0.105790 0.569329 0.683360 0.735433
boreratio 0.260064 -0.130051 0.488750 0.606454 0.559150
stroke -0.160824 -0.008735 0.160959 0.129533 0.182942
compressionratio 0.150276 -0.178515 0.249786 0.158414 0.181129
horsepower -0.015006 0.070873 0.353294 0.552623 0.640732
peakrpm -0.203789 0.273606 -0.360469 -0.287242 -0.220012
citympg 0.015940 -0.035823 -0.470414 -0.670909 -0.642704
highwaympg 0.011255 0.034606 -0.544082 -0.704662 -0.677218
price -0.109093 -0.079978 0.577816 0.682920 0.759325

carheight curbweight enginesize boreratio stroke \


car_ID 0.255960 0.071962 -0.033930 0.260064 -0.160824
symboling -0.541038 -0.227691 -0.105790 -0.130051 -0.008735
wheelbase 0.589435 0.776386 0.569329 0.488750 0.160959
carlength 0.491029 0.877728 0.683360 0.606454 0.129533
carwidth 0.279210 0.867032 0.735433 0.559150 0.182942
carheight 1.000000 0.295572 0.067149 0.171071 -0.055307
curbweight 0.295572 1.000000 0.850594 0.648480 0.168790
enginesize 0.067149 0.850594 1.000000 0.583774 0.203129
boreratio 0.171071 0.648480 0.583774 1.000000 -0.055909
stroke -0.055307 0.168790 0.203129 -0.055909 1.000000
compressionratio 0.261214 0.151362 0.028971 0.005197 0.186110
horsepower -0.108802 0.750739 0.809769 0.573677 0.080940
peakrpm -0.320411 -0.266243 -0.244660 -0.254976 -0.067964
citympg -0.048640 -0.757414 -0.653658 -0.584532 -0.042145
highwaympg -0.107358 -0.797465 -0.677470 -0.587012 -0.043931
price 0.119336 0.835305 0.874145 0.553173 0.079443

compressionratio horsepower peakrpm citympg \


car_ID 0.150276 -0.015006 -0.203789 0.015940
symboling -0.178515 0.070873 0.273606 -0.035823
wheelbase 0.249786 0.353294 -0.360469 -0.470414
carlength 0.158414 0.552623 -0.287242 -0.670909
carwidth 0.181129 0.640732 -0.220012 -0.642704
carheight 0.261214 -0.108802 -0.320411 -0.048640
curbweight 0.151362 0.750739 -0.266243 -0.757414
enginesize 0.028971 0.809769 -0.244660 -0.653658
boreratio 0.005197 0.573677 -0.254976 -0.584532
stroke 0.186110 0.080940 -0.067964 -0.042145
compressionratio 1.000000 -0.204326 -0.435741 0.324701
horsepower -0.204326 1.000000 0.131073 -0.801456
peakrpm -0.435741 0.131073 1.000000 -0.113544
citympg 0.324701 -0.801456 -0.113544 1.000000
highwaympg 0.265201 -0.770544 -0.054275 0.971337
price 0.067984 0.808139 -0.085267 -0.685751

highwaympg price
car_ID 0.011255 -0.109093
symboling 0.034606 -0.079978
wheelbase -0.544082 0.577816
carlength -0.704662 0.682920
carwidth -0.677218 0.759325
carheight -0.107358 0.119336
curbweight -0.797465 0.835305
enginesize -0.677470 0.874145
boreratio -0.587012 0.553173
stroke -0.043931 0.079443
compressionratio 0.265201 0.067984
horsepower -0.770544 0.808139
peakrpm -0.054275 -0.085267
citympg 0.971337 -0.685751
highwaympg 1.000000 -0.697599
price -0.697599 1.000000

- This output holds the same information as the heatmap figure for the second database.
The chart shows how strongly each pair of numeric columns in the database is correlated. The
values of 1 on the diagonal are expected, since every column is perfectly correlated with
itself. The other cells differ, and the columns most strongly correlated with price are the
most useful to the machine learning model for determining the price of a car.
Compare a minimum 2 datasets with all outputs
 Conclusion
 Reference
Bukvić, L., Pašagić Škrinjar, J., Fratrović, T., & Abramović, B. (2022). Price Prediction
and Classification of Used-Vehicles Using Supervised Machine Learning. Sustainability, 14(24),
17034. https://doi.org/10.3390/su142417034

 All Outputs of DB1:
import seaborn as sns
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
audi = pd.read_csv("/content/audi.csv")
audi.head()

Figure1:

audi.isnull().sum()
audi.info()
sns.set_style("whitegrid")
plt.figure(figsize=(15, 10))
sns.distplot(audi.price)
plt.show()

<ipython-input-17-b24cc0cfc4f5>:3: UserWarning:

`distplot` is a deprecated function and will be removed in seaborn v0.14.0.

Please adapt your code to use either `displot` (a figure-level function with
similar flexibility) or `histplot` (an axes-level function for histograms).

For a guide to updating your code to use the new functions, please see

https://gist.github.com/mwaskom/de44147ed2974457ad6372750bbe5751

# Correlations between the numeric columns (pandas 2.0 and later need numeric_only=True)
correlations = audi.corr(numeric_only=True)
print(correlations)
plt.figure(figsize=(20, 15))
sns.heatmap(correlations, cmap="coolwarm", annot=True)
plt.show()

<ipython-input-18-0bb2be9b5c0c>:1: FutureWarning: The default value of numeric_only in


DataFrame.corr is deprecated. In a future version, it will default to False. Select only valid
columns or specify the value of numeric_only to silence this warning.
print(audi.corr())
<ipython-input-18-0bb2be9b5c0c>:3: FutureWarning: The default value of numeric_only in
DataFrame.corr is deprecated. In a future version, it will default to False. Select only valid
columns or specify the value of numeric_only to silence this warning.
correlations = audi.corr()
year price mileage tax highwaympg enginesize
year 1.000000 0.592581 -0.789667 0.093066 -0.351281 -0.031582
price 0.592581 1.000000 -0.535357 0.356157 -0.600334 0.591262
mileage -0.789667 -0.535357 1.000000 -0.166547 0.395103 0.070710
tax 0.093066 0.356157 -0.166547 1.000000 -0.635909 0.393075
highwaympg -0.351281 -0.600334 0.395103 -0.635909 1.000000 -0.365621
enginesize -0.031582 0.591262 0.070710 0.393075 -0.365621 1.000000
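# Keep two features plus the target, then split into features (x) and target (y)
predict = "price"
audi = audi[["enginesize", "highwaympg", "price"]]
x = np.array(audi.drop(columns=[predict]))  # the notebook used data.drop([predict], 1)
y = np.array(audi[predict])
from sklearn.model_selection import train_test_split
xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=0.2)
from sklearn.tree import DecisionTreeRegressor
model = DecisionTreeRegressor()
model.fit(xtrain, ytrain)
predictions = model.predict(xtest)
from sklearn.metrics import mean_absolute_error
# model.score(xtest, predictions) compares the predictions with themselves and always
# gives 1.0; scoring against ytest gives the actual R^2 on the test rows
model.score(xtest, ytest)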

<ipython-input-19-a1c592d6430c>:3: FutureWarning: In a future version of pandas all


arguments of DataFrame.drop except for the argument 'labels' will be keyword-only.
x = np.array(data.drop([predict], 1))
1.0
print(audi.describe())

 All Output Figures of DB1 Explained:

Figure 1
The chart generated by the program is based on the inputted data and displays the
distribution of cars according to their prices. From the chart, it is evident that the majority of cars
are priced around 20k, making it the most common price point in the dataset.

Figure 2
This figure illustrates the correlation between the numeric attributes, such as year,
mileage, and tax, and the price of the cars. The colors represent the correlation coefficient,
with red indicating a strong positive correlation and blue indicating a strong negative
correlation. The figure gives a visual summary of how strongly each attribute is related to the
price of the cars.

 Comparing the two databases’ outputs:
The output of input data from DB1:

The output of input data from DB2:

The output generated by the Python program is determined directly by the data supplied
through the database. The program uses the columns and rows of the CSV file as input data to
perform its calculations and generate the output. The differences observed in the output are
therefore a result of the differences between the two databases.

DB1 : DB2 :
- Here the Python program lists the non-null count and data type of every column in each
database. DB1 has 9 columns and 10668 entries, while DB2 has 26 columns and 205 entries. This
gives a comprehensive overview of each database, providing insights into the completeness and
type of data present in each column.
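
A comparison like this can also be produced by the program itself. The sketch below is only an illustration under our own assumptions: `summarize` is a helper we made up, and the two small tables stand in for the real DB1 and DB2.

```python
import pandas as pd

def summarize(frame):
    """Collect a few size and completeness figures for one table (hypothetical helper)."""
    return {
        "rows": len(frame),
        "columns": frame.shape[1],
        "missing_cells": int(frame.isnull().sum().sum()),
        "numeric_columns": frame.select_dtypes(include="number").shape[1],
    }

# Small stand-ins for the two databases (not the real files).
db1_sample = pd.DataFrame({
    "model": ["A1", "A6", "A1"],
    "year": [2017, 2016, 2016],
    "price": [12500, 16500, 11000],
    "highwaympg": [55.4, 64.2, 55.4],
})
db2_sample = pd.DataFrame({
    "CarName": ["audi 100 ls", "audi 100ls"],
    "wheelbase": [99.8, 99.4],
    "enginesize": [109, 136],
    "horsepower": [102, 115],
    "price": [13950, 17450],
})

# One column per database, one row per figure.
comparison = pd.DataFrame({"DB1": summarize(db1_sample), "DB2": summarize(db2_sample)})
print(comparison)
```

Passing the full audi and DB2 DataFrames to summarize would put the two databases side by side in one table.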
DB1 :

The difference here lies in how the Python program performed its calculations on each
database. The program was written to work with one specific database, so the results differ a
lot. In the DB2 output every count is the same number, 205, which is the number of rows in that
database. On the first database the program did not produce the same output as on the second
one, because it was originally made to work with the second database.

DB2 :
DB1 Figure 1:

DB2 Figure 1:

As we can see, the program drew a graph of the car prices and their density. In the first
database (DB1) most of the cars are priced at around 20k, while in the second database (DB2)
most of the cars are priced at around 9-10k.
DB1 Figure 2:

DB2 Figure 2:

The program has generated a correlation table showing how strongly each column is related
to the others, including the price. The differences between the two tables come from the second
database having many more numeric attributes (16 compared with 6), which leads to a larger
table and different values.
DB1 :

Here the program displayed the characteristics of the first 5 and last 5 cars from the
database we inputted, highlighting the differences in their features.
DB2:
 Conclusions from the comparative study:
In conclusion, we find that both of the databases we edited and inputted into the program
worked. They differed in the number of columns, or attributes, but the program worked because
we edited them. It also showed statistics for each database, as we see in the graphs. It was
effective at showing the distribution of prices of the cars we inputted (10,668 cars in DB1 and
205 in DB2) and at showing how the prices differ between databases. We could also use databases
from different markets as inputs, which would be an efficient way to compare the prices of car
models across markets in a graph. We also see a big difference in the colored correlation table,
because the second database has more columns, and therefore more attributes, than Database 1,
although it has far fewer rows. Both tables show similar colors where the values are similar,
but they do have their differences.
