Abstract
- The purpose of this project is to train a machine learning model to predict the price of a
car. To do that, we had to obtain a database and edit it so that it matched the format
expected by the original Python program. Every database is different, so each one has to
be edited to make sure we supply the information the model needs to predict the price,
without adding information that would confuse it. In our case, we used only cars from the
Audi car company. The database contains about 10,000 cars from the Audi brand. They all
differ in price, year, kilometers driven, engine size or type, model, and fuel
consumption.
The project is based on this link:
https://thecleverprogrammer.com/2021/08/04/car-price-prediction-with-machine-learning/
The project was made on:
- https://colab.research.google.com
- Table 1. The first 5 rows of the Audi database Excel sheet, which is the main database.
Car_ID  Symbolling  Car Name                  Fuel Type  Aspiration  Doors Number  CarBody
1       3           alfa-romero giulia        gas        std         two           convertible
2       3           alfa-romero stelvio       gas        std         two           convertible
3       1           alfa-romero Quadrifoglio  gas        std         two           hatchback
4       2           audi 100 ls               gas        std         four          sedan
5       2           audi 100ls                gas        std         four          sedan
continuing… ↓
Algorithm
The algorithm for the project operates as follows:
1. The program needs input. This input is the database, which we get from an Excel
document.
2. Edit the database so that it works with the program. Every database is different, and the
program requires its data in a specific layout.
3. Edit the program so that it works for its intended purpose. This includes, but is not limited
to, the types of input the program receives, meaning the information about the car. Some
databases had the dimensions of the car, but the database we chose does not have them. In our
opinion, the size of the car matters less for predicting the price than the model, engine,
year, and so on.
4. The program then outputs the results from the database we inputted. This output contains
the information that the machine learning model uses to predict the price of the car.
The machine learning model then outputs the information the user asks for. For
example, the user chooses a car with the following details: Audi
A5 2014 Diesel 2.0 280000KM
5. The machine learning model then outputs the price it predicts for that car, based on
the database given.
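The steps above can be sketched end to end as follows. This is only a minimal illustration under stated assumptions: the six rows are made up (they are not taken from audi.csv), the choice of feature columns is hypothetical, and the one-hot encoding of the text columns with pd.get_dummies is one possible way to handle them, not necessarily what the original program does.

```python
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# Hypothetical stand-in for the Audi database: made-up rows, real column names.
cars = pd.DataFrame({
    "model": ["A1", "A3", "A4", "A5", "A5", "A6"],
    "year": [2017, 2019, 2017, 2014, 2016, 2016],
    "mileage": [15735, 1998, 25952, 120000, 60000, 36203],
    "fueltype": ["Petrol", "Petrol", "Diesel", "Diesel", "Diesel", "Diesel"],
    "enginesize": [1.4, 1.0, 2.0, 2.0, 2.0, 2.0],
    "price": [12500, 17300, 16800, 9000, 14000, 16500],
})

# Edit the database: turn the text columns into numeric 0/1 columns (assumption).
features = pd.get_dummies(
    cars.drop(columns=["price"]), columns=["model", "fueltype"], dtype=float
)
target = cars["price"]

# Train the model on the edited database.
model = DecisionTreeRegressor(random_state=0)
model.fit(features, target)

# The car the user asks about: Audi A5 2014 Diesel 2.0 280000KM.
query = pd.DataFrame([{
    "model": "A5", "year": 2014, "mileage": 280000,
    "fueltype": "Diesel", "enginesize": 2.0,
}])
# Encode the query the same way and align it with the training columns.
query = pd.get_dummies(query, columns=["model", "fueltype"], dtype=float)
query = query.reindex(columns=features.columns, fill_value=0)

predicted_price = float(model.predict(query)[0])
print(predicted_price)
```

A decision tree can only return values it has seen in training (or averages of them), so the predicted price always lies between the cheapest and the most expensive car in the database.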
Flowchart
Figure 1
Experiment results (Entire code "change all the variable names", All outputs, All figures
outputs with explanations)
- The entire code of the first database with the variable names changed:
import seaborn as sns
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error

# Load the Audi database and inspect it
audi = pd.read_csv("/content/audi.csv")
audi.head()
audi.isnull().sum()
audi.info()
print(audi.describe())

# Plot the distribution of the car prices
sns.set_style("whitegrid")
plt.figure(figsize=(15, 10))
sns.distplot(audi.price)
plt.show()

# Print and plot the correlations between the columns
print(audi.corr())
plt.figure(figsize=(20, 15))
correlations = audi.corr()
sns.heatmap(correlations, cmap="coolwarm", annot=True)
plt.show()

# Keep only the columns used for training; "price" is the target
predict = "price"
audi = audi[["enginesize", "highwaympg", "price"]]
x = np.array(audi.drop(columns=[predict]))
y = np.array(audi[predict])

# Train a decision tree on 80% of the rows and predict the other 20%
xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=0.2)
model = DecisionTreeRegressor()
model.fit(xtrain, ytrain)
predictions = model.predict(xtest)

# Scoring the predictions against themselves always returns 1.0;
# pass ytest instead of predictions to measure the real accuracy
model.score(xtest, predictions)
print(audi)
The entire code of database 1 explained in chunks (first a chunk of the code is
shown, then the output of that chunk, and then the explanation):
import seaborn as sns
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
audi = pd.read_csv("/content/audi.csv")
audi.head()
- In these lines of code, we import the modules the project needs, such as pandas.
After the imports, the program loads audi.csv and outputs the first 5 rows of the Excel
database sheet of Audi.
id model year price transmission mileage fueltype tax highwaympg enginesize
0 A1 2017 12500 Manual 15735 Petrol 150 55.4 1.4
1 A6 2016 16500 Automatic 36203 Diesel 20 64.2 2.0
2 A1 2016 11000 Manual 29946 Petrol 30 55.4 1.4
3 A4 2017 16800 Automatic 25952 Diesel 145 67.3 2.0
4 A3 2019 17300 Manual 1998 Petrol 145 49.6 1.0
- Table 3. As we can see in this table, the Python program has printed the first
5 rows of the database we inputted, which is audi.csv.
audi.isnull().sum()
model 0
year 0
price 0
transmission 0
mileage 0
fueltype 0
tax 0
highwaympg 0
enginesize 0
dtype: int64
- Table 4. This table shows the output of isnull().sum(). isnull is a pandas function that
checks every cell of the Excel sheet and marks it True if the cell is blank or null and
False otherwise; sum() then counts the True values in each column. A column with missing
cells would show 1 or more instead of 0. Here every column shows 0, which means the
database we inputted has no missing values and is working fine.
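To make the meaning of these counts concrete, here is a small sketch on a made-up three-row sample (not the real audi.csv) in which one mileage value is deliberately missing. The column names follow the report; the values and the variable names are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: the second car has no mileage recorded.
sample = pd.DataFrame({
    "model": ["A1", "A6", "A3"],
    "mileage": [15735, np.nan, 1998],
    "price": [12500, 16500, 17300],
})

# isnull() marks each cell True/False; sum() counts the True cells per column.
missing_per_column = sample.isnull().sum()
print(missing_per_column)
```

In this sample the mileage column reports one missing value and the other columns report none, which is the pattern to look for before training on a new database.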
audi.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10668 entries, 0 to 10667
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 model 10668 non-null object
1 year 10668 non-null int64
2 price 10668 non-null int64
3 transmission 10668 non-null object
4 mileage 10668 non-null int64
5 fueltype 10668 non-null object
6 tax 10668 non-null int64
7 highwaympg 10668 non-null float64
8 enginesize 10668 non-null float64
dtypes: float64(2), int64(4), object(3)
memory usage: 750.2+ KB
- Table 5. This line of code shows the technical information we can get about the
database, including the number of entries, the columns with their data types, and the
memory usage.
print(audi.describe())
enginesize
count 10668.000000
mean 1.930709
std 0.602957
min 0.000000
25% 1.500000
50% 2.000000
75% 2.000000
max 6.300000
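- Table 6. This command describes the numerical columns of the dataframe in a technical
way. For each column it shows the count, the average value (mean), the standard deviation
(std), the minimum, the quartiles, and the maximum. In this example, the average engine
size is about 1.93 and the standard deviation is about 0.60.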
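The mean and the standard deviation are two different rows of the describe() output. The following sketch shows this on three made-up engine sizes; the values and variable names are assumptions chosen so the numbers are easy to check by hand.

```python
import pandas as pd

# Hypothetical engine sizes, chosen so the statistics are easy to verify.
engine_sizes = pd.Series([1.0, 2.0, 3.0], name="enginesize")

summary = engine_sizes.describe()
print(summary)

# The mean is the average value; std measures how far values spread around it.
mean_value = summary["mean"]
std_value = summary["std"]
```

Here the mean is 2.0 (the average of the three values), while the standard deviation is 1.0 (pandas uses the sample formula, dividing by n - 1).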
sns.set_style("whitegrid")
plt.figure(figsize=(15, 10))
sns.distplot(audi.price)
plt.show()
<ipython-input-8-b24cc0cfc4f5>:3: UserWarning:
`distplot` is a deprecated function and will be removed in seaborn v0.14.0.
Please adapt your code to use either `displot` (a figure-level function with
similar flexibility) or `histplot` (an axes-level function for histograms).
For a guide to updating your code to use the new functions, please see
https://gist.github.com/mwaskom/de44147ed2974457ad6372750bbe5751
Figure 2. As we can see, this code produces a graph of the distribution of the car prices.
The graph changes with the database we input, because the prices in the Excel sheet differ.
In other words, the graph reflects the market, if we consider the database a sample of the
market. In this picture, most cars have a price of around 20,000 dollars.
sns.distplot(audi.price)
print(audi.corr())
plt.figure(figsize=(20, 15))
correlations = audi.corr()
sns.heatmap(correlations, cmap="coolwarm", annot=True)
plt.show()
- Table 7.
- As we can see, the output shows a warning. It is a pandas warning that the code may stop
working in a future version. Below the warning is the table with the dataframe information,
and below the table is the figure, which shows the chart in different colors.
- In this output we see a pandas warning and after it only the number 1. This number is
most likely the value returned by model.score(xtest, predictions). Because the
predictions are scored against themselves, the result is always 1. It shows that the
program runs from start to finish, but it does not measure how accurate the predictions are.
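The difference between scoring predictions against themselves and scoring them against the real test prices can be sketched as follows. The data here is synthetic (price falling with mileage, plus random noise), and the variable names self_score, r2 and mae are hypothetical; this is an illustration, not the evaluation used in the project.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (assumption): price drops with mileage, with random noise.
rng = np.random.default_rng(0)
mileage = rng.uniform(1_000, 150_000, size=200)
price = 30_000 - 0.15 * mileage + rng.normal(0, 1_500, size=200)
x = mileage.reshape(-1, 1)
y = price

xtrain, xtest, ytrain, ytest = train_test_split(
    x, y, test_size=0.2, random_state=0
)
model = DecisionTreeRegressor(random_state=0).fit(xtrain, ytrain)
predictions = model.predict(xtest)

# Comparing the predictions with themselves: a perfect score by construction.
self_score = model.score(xtest, predictions)

# Comparing the predictions with the real test prices: the actual accuracy.
r2 = model.score(xtest, ytest)
mae = mean_absolute_error(ytest, predictions)

print(self_score, r2, mae)
```

The mean absolute error is expressed in the same unit as the price, so it can be read directly as "the prediction is off by this much on average".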
print(audi)
- Table 8. This output shows the modified version of the database we inputted. It has
been reduced to only 3 columns (enginesize, highwaympg, and price), because the program
trains on these columns only. The database needs to be modified in this way in order to
work with the program.
The entire code of the second database with the variable names changed:
Car_ID  Symbolling  Car Name                  Fuel Type  Aspiration  Doors Number  CarBody
1       3           alfa-romero giulia        gas        std         two           convertible
2       3           alfa-romero stelvio       gas        std         two           convertible
3       1           alfa-romero Quadrifoglio  gas        std         two           hatchback
4       2           audi 100 ls               gas        std         four          sedan
5       2           audi 100ls                gas        std         four          sedan
continuing… ↓
Drive wheel  Engine location  Wheelbase  Engine size  Fuel System  Bore ratio  Stroke  Compression ratio  Horsepower
RWD          front            88.6       130          mpfi         3.47        2.68    9                  111
RWD          front            88.6       130          mpfi         3.47        2.68    9                  111
RWD          front            94.5       152          mpfi         2.68        3.47    9                  154
FWD          front            99.8       109          mpfi         3.19        3.4     10                 102
4WD          front            99.4       136          mpfi         3.19        3.4     8                  115
continuing… ↓
Peak rpm City mpg Highway mpg Price
5000 21 27 13495
5000 21 27 16500
5000 19 26 16500
5500 24 30 13950
5500 18 22 17450
- Table 9. This table simply shows the input, which is the database we inputted. As
we can see, it shows the rows and the columns of the Excel file.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 205 entries, 0 to 204
Data columns (total 26 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 car_ID 205 non-null int64
1 symboling 205 non-null int64
2 CarName 205 non-null object
3 fueltype 205 non-null object
4 aspiration 205 non-null object
5 doornumber 205 non-null object
6 carbody 205 non-null object
7 drivewheel 205 non-null object
8 enginelocation 205 non-null object
9 wheelbase 205 non-null float64
10 carlength 205 non-null float64
11 carwidth 205 non-null float64
12 carheight 205 non-null float64
13 curbweight 205 non-null int64
14 enginetype 205 non-null object
15 cylindernumber 205 non-null object
16 enginesize 205 non-null int64
17 fuelsystem 205 non-null object
18 boreratio 205 non-null float64
19 stroke 205 non-null float64
20 compressionratio 205 non-null float64
21 horsepower 205 non-null int64
22 peakrpm 205 non-null int64
23 citympg 205 non-null int64
24 highwaympg 205 non-null int64
25 price 205 non-null float64
dtypes: float64(8), int64(8), object(10)
memory usage: 41.8+ KB
- As we can see, this output shows the non-null count and the data type of each column
in the database. It also shows how many columns there are of each data type, and
the memory usage.
car_ID symboling wheelbase carlength carwidth carheight \
count 205.000000 205.000000 205.000000 205.000000 205.000000 205.000000
mean 103.000000 0.834146 98.756585 174.049268 65.907805 53.724878
std 59.322565 1.245307 6.021776 12.337289 2.145204 2.443522
min 1.000000 -2.000000 86.600000 141.100000 60.300000 47.800000
25% 52.000000 0.000000 94.500000 166.300000 64.100000 52.000000
50% 103.000000 1.000000 97.000000 173.200000 65.500000 54.100000
75% 154.000000 2.000000 102.400000 183.100000 66.900000 55.500000
max 205.000000 3.000000 120.900000 208.100000 72.300000 59.800000
- This command describes the numerical columns of the dataframe in a technical way. For
each column it shows the count, the average value (mean), the standard deviation (std),
the minimum, the quartiles, and the maximum.
<ipython-input-2-3b6c97159ec3>:7: UserWarning:
Please adapt your code to use either `displot` (a figure-level function with
similar flexibility) or `histplot` (an axes-level function for histograms).
For a guide to updating your code to use the new functions, please see
https://gist.github.com/mwaskom/de44147ed2974457ad6372750bbe5751
- In this output we can see a warning. It tells us that distplot is deprecated, so the
program should be edited to use displot or histplot, because the current code may stop
working in a future seaborn version.
Figure 4.
car_ID symboling wheelbase carlength carwidth \
car_ID 1.000000 -0.151621 0.129729 0.170636 0.052387
symboling -0.151621 1.000000 -0.531954 -0.357612 -0.232919
wheelbase 0.129729 -0.531954 1.000000 0.874587 0.795144
carlength 0.170636 -0.357612 0.874587 1.000000 0.841118
carwidth 0.052387 -0.232919 0.795144 0.841118 1.000000
carheight 0.255960 -0.541038 0.589435 0.491029 0.279210
curbweight 0.071962 -0.227691 0.776386 0.877728 0.867032
enginesize -0.033930 -0.105790 0.569329 0.683360 0.735433
boreratio 0.260064 -0.130051 0.488750 0.606454 0.559150
stroke -0.160824 -0.008735 0.160959 0.129533 0.182942
compressionratio 0.150276 -0.178515 0.249786 0.158414 0.181129
horsepower -0.015006 0.070873 0.353294 0.552623 0.640732
peakrpm -0.203789 0.273606 -0.360469 -0.287242 -0.220012
citympg 0.015940 -0.035823 -0.470414 -0.670909 -0.642704
highwaympg 0.011255 0.034606 -0.544082 -0.704662 -0.677218
price -0.109093 -0.079978 0.577816 0.682920 0.759325
highwaympg price
car_ID 0.011255 -0.109093
symboling 0.034606 -0.079978
wheelbase -0.544082 0.577816
carlength -0.704662 0.682920
carwidth -0.677218 0.759325
carheight -0.107358 0.119336
curbweight -0.797465 0.835305
enginesize -0.677470 0.874145
boreratio -0.587012 0.553173
stroke -0.043931 0.079443
compressionratio 0.265201 0.067984
horsepower -0.770544 0.808139
peakrpm -0.054275 -0.085267
citympg 0.971337 -0.685751
highwaympg 1.000000 -0.697599
price -0.697599 1.000000
- This output contains the same information as the figure below, in numerical form.
Figure 3. This chart shows how strongly the different columns of the database we input are related to each other. It
is working as intended, since the diagonal contains only 1s: every column is perfectly correlated with itself. The
values in the other cells differ, and the machine learning model will use these relationships to determine the price
of the car.
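How such a correlation table can be read is sketched below on a made-up five-row sample. The values and the variable names are assumptions; select_dtypes is used here so that text columns such as the model name are left out before the correlation is computed.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with one text column and three numeric columns.
cars = pd.DataFrame({
    "model": ["A1", "A1", "A4", "A6", "Q7"],
    "enginesize": [1.0, 1.4, 2.0, 2.0, 3.0],
    "highwaympg": [60.0, 55.0, 50.0, 48.0, 35.0],
    "price": [11000, 12500, 16500, 17000, 30000],
})

# Correlate only the numeric columns.
correlations = cars.select_dtypes(include="number").corr()
print(correlations)

# Rank the other columns by how strongly they relate to the price.
price_ranking = (
    correlations["price"]
    .drop("price")
    .sort_values(key=lambda s: s.abs(), ascending=False)
)
print(price_ranking)
```

A value close to 1 means the column rises together with the price, a value close to -1 means it falls as the price rises, and a value near 0 means there is little linear relationship.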
Compare a minimum 2 datasets with all outputs
Conclusion
Reference
Bukvić, L., Pašagić Škrinjar, J., Fratrović, T., & Abramović, B. (2022). Price Prediction
and Classification of Used-Vehicles Using Supervised Machine Learning. Sustainability, 14(24),
17034. https://doi.org/10.3390/su142417034
All Outputs of DB1:
import seaborn as sns
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
audi = pd.read_csv("/content/audi.csv")
audi.head()
Figure1:
audi.isnull().sum()
audi.info()
sns.set_style("whitegrid")
plt.figure(figsize=(15, 10))
sns.distplot(audi.price)
plt.show()
<ipython-input-17-b24cc0cfc4f5>:3: UserWarning:
Please adapt your code to use either `displot` (a figure-level function with
similar flexibility) or `histplot` (an axes-level function for histograms).
For a guide to updating your code to use the new functions, please see
https://gist.github.com/mwaskom/de44147ed2974457ad6372750bbe5751
sns.distplot(audi.price)
print(audi.corr())
plt.figure(figsize=(20, 15))
correlations = audi.corr()
sns.heatmap(correlations, cmap="coolwarm", annot=True)
plt.show()
Figure 1
The chart generated by the program is based on the inputted data and displays the
distribution of cars according to their prices. From the chart, it is evident that the majority of cars
are priced around 20k, making it the most common price point in the dataset.
Figure 2
This figure illustrates the relationship between data attributes such as year,
mileage, and tax and the price of the cars. The colors on the figure represent the strength of
the correlation, with red indicating a higher value and blue indicating a lower value. The
figure provides a visual representation of how these attributes relate to the price of the cars.
Comparing the two databases’ outputs:
The output of input data from DB1:
The output generated by the Python program depends directly on the information
inputted through the database. The program uses the columns and rows from the Excel database
as input data to perform calculations and generate the desired output. The differences observed in
the output are a result of the variations in the data entered into the database, which in turn affect
the calculations performed by the program.
DB1 : DB2 :
- The difference here is that the Python program shows for every column in the database
non-null count and data type. This feature gives a comprehensive overview of the database,
providing insights into the completeness and type of data present in each column.
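A side-by-side view of the two databases can also be built in code. The sketch below is a hypothetical helper, not part of the original program: summarize and the two miniature dataframes are assumptions, with only a few columns and two rows each.

```python
import pandas as pd

def summarize(frame):
    """Return one row per column with its non-null count and data type."""
    return pd.DataFrame({
        "non_null": frame.notnull().sum(),
        "dtype": frame.dtypes.astype(str),
    })

# Hypothetical miniature versions of DB1 and DB2.
db1 = pd.DataFrame({
    "model": ["A1", "A6"], "year": [2017, 2016],
    "fueltype": ["Petrol", "Diesel"], "highwaympg": [55.4, 64.2],
    "enginesize": [1.4, 2.0], "price": [12500, 16500],
})
db2 = pd.DataFrame({
    "CarName": ["audi 100 ls", "audi 100ls"], "fueltype": ["gas", "gas"],
    "horsepower": [102, 115], "highwaympg": [30, 22],
    "enginesize": [109, 136], "price": [13950.0, 17450.0],
})

# One table with the summary of both databases; missing columns show as NaN.
comparison = summarize(db1).join(
    summarize(db2), how="outer", lsuffix="_db1", rsuffix="_db2"
)
# Columns that exist in both databases under the same name.
shared_columns = sorted(set(db1.columns) & set(db2.columns))

print(comparison)
print(shared_columns)
```

The list of shared columns shows which inputs a single program can read from both databases without being edited.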
DB1 :
The difference here lies in how the Python program made its calculations on the
database we inputted. The program was made to work with only 1 specific database, so the
results differ a lot. One thing we notice is that the output shows the same number, 205, for
every column; this is the number of rows in the second database and is not part of how the
program determines the price. On the first database, the program failed to produce the same
kind of output as it did on the second one. The program was originally made to work with the
second database, which is why we did not receive the same output.
DB2 :
DB1 Figure 1:
DB2 Figure 1:
As we can see, the difference here is that the program made a graph of how expensive
the cars were and at what density. In the first database (DB1), most of the cars we inputted
were in the price range of 20k, while in the second database (DB2) most of the cars were in
the price range of 9-10k.
DB1 Figure 2:
DB2 Figure 2:
The program has generated a comprehensive table illustrating how each column relates
to the car price. The differences between the tables can be attributed to the fact that the
second database has a significantly larger number of columns (26 instead of 9), leading to
variations in the results.
DB1 :