
ALIGARH COLLEGE OF ENGINEERING AND TECHNOLOGY, ALIGARH

(Approved By AICTE New Delhi, Affiliated to AKTU Lucknow)

PRACTICAL FILE
Submitted in partial fulfillment for the award of
BACHELOR OF TECHNOLOGY
(CSE), 3rd Year
Submitted by

RUPESH VARSHNEY
Roll No.: 2001090100045

Submitted To
Mr. Dewang Chaudhary
(INTERNSHIP ASSESSMENT: KCS-554)
In MACHINE LEARNING
[Session: 2022-2023]
CERTIFICATE
OF INTERNSHIP
IN
MACHINE LEARNING
This is to certify that Mr.

RUPESH VARSHNEY

From
Skillvoid
Has Successfully Completed the Internship Program
for the period from 09 October, 2022 to 09 November, 2022.

HEAD-HR
ALIGARH COLLEGE OF ENGINEERING & TECHNOLOGY
3 KM FROM SASNI GATE, MATHURA ROAD, ALIGARH-202001

INTERNSHIP CERTIFICATE

This is to certify that Rupesh Varshney, a student of ALIGARH COLLEGE OF
ENGINEERING AND TECHNOLOGY, B.Tech 3rd year, Computer Science and
Engineering branch, has undergone Internship Training in Machine Learning
from 09-Oct-2022 to 09-Nov-2022.

(Mr. Dewang Chaudhary)                        (Dr. Anand Sharma)
Faculty, Internship Assessment Lab            HOD, CSE
Abstract

Machine learning and Analytics with Python is designed for


practitioners in data science and data analytics in both
academic and business environments. The aim is to present the
reader with the main concepts used in Machine Learning using
tools developed in Python, such as scikit-learn, pandas,
NumPy, and others. The use of Python is of particular interest,
given its recent popularity in the Machine Learning community.
The book can be used by seasoned programmers and
newcomers alike. The book is organized in a way that individual
chapters are sufficiently independent from each other so that
the reader is comfortable using the contents as a reference. The
book discusses what data science and analytics are, from the
point of view of the process and results obtained.
Important features of Python are also covered, including a
Python primer. The basic elements of machine learning, pattern
recognition, and artificial intelligence that underpin the
algorithms and implementations used in the rest of the book also
appear in the first part of the book. Regression analysis using
Python, clustering techniques, and classification algorithms are
covered in the second part of the book. Hierarchical clustering,
decision trees, and ensemble techniques are also explored,
along with dimensionality reduction techniques and
recommendation systems. The support vector machine
algorithm and the Kernel trick are discussed in the last part of
the book.
ACKNOWLEDGEMENT

I would like to express my special thanks of gratitude to my teacher
Mr. Dewang Chaudhary, as well as our HOD Dr. Anand Sharma, who gave me
the golden opportunity to do this wonderful project on the topic
MACHINE LEARNING, which also helped me in doing a lot of research,
through which I came to know about so many new things. I am really
thankful to them.
Secondly, I would also like to thank my parents and friends who helped
me a lot in finalizing this project within the limited time frame.

Date: 15/11/2022
Rupesh Varshney
B.Tech (CSE)
CONTENTS

1. Introduction to Organisation
2. Software Training Work
3. Internship Training Work
4. Project Work
5. Result and Discussion
6. Conclusion
1. INTRODUCTION TO ORGANIZATION

Skillvoid Elearning Private Limited is an unlisted
private company incorporated on 17 February, 2021.
It is classified as a private limited company and is
located in Chennai, Tamil Nadu. Its authorised share
capital is INR 15.00 lakh and the total paid-up capital is
INR 1.00 lakh.

The current status of the Skillvoid Technologies LLP company is Active.


Skillvoid E Learning's vision is to promote technical excellence and offer fantastic
opportunities to the students and faculty of educational institutes and corporates,
enabling them to pursue insight into transformative technical trends with a view to
empowering and building a knowledge society.

Details of the last annual general meeting of Skillvoid Elearning Private Limited are not
available. The company is yet to submit its first full-year financial statements to the
registrar.

Skillvoid Elearning Private Limited has four directors, including Senthilkumar
Ramasamy Muniappa Gounder, Jeeva Mayandi Kalidass, and others.

Its Corporate Identification Number (CIN) is U80902TN2021PTC141464. The
registered office of Skillvoid Elearning Private Limited is at Jamnipur,
Dehradun, Uttarakhand 248002.
PARTNERS

NAME     DIN/PAN    DESIGNATION  DATE OF APPOINTMENT
Mohit                            17 September, 2021
SONIYA   09323487   Director     17 September, 2021

EXPLORE LEARNING NETWORK

COMPANY INFORMATION

CIN: AAY-6426
Company Status: Active
Date of Incorporation: 17 September 2021
Registration State: Uttarakhand
Partners: 0
Designated Partners: 2
Company Category: Limited Liability Partnership
Listing Status: Unlisted


2. SOFTWARE TRAINING WORK UNDERTAKEN

The main software I used in this project is Jupyter
Notebook, though there are many open-licence tools
like:
1. Jupyter Notebook
2. Spyder
3. Thonny
4. JupyterLab
5. PyCharm
6. Visual Code
7. Atom
PyCharm is the software I used most in Machine
Learning.
1. JUPYTER NOTEBOOK
Jupyter Notebook is a web-based interactive
computational environment for creating notebook
documents. Underneath the web interface, a notebook is
a JSON document, following a versioned schema, usually
ending with the ".ipynb" extension.
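
As a rough, illustrative sketch (my own addition, not from the
internship code): the nbformat package can create such a JSON document
programmatically. The file name "minimal.ipynb" is an arbitrary
placeholder.

import nbformat

# Build an empty notebook object following the version-4 schema.
nb = nbformat.v4.new_notebook()
# Add a single code cell.
nb.cells.append(nbformat.v4.new_code_cell("1 + 1"))
# Serialise it to disk as the JSON ".ipynb" document described above.
nbformat.write(nb, "minimal.ipynb")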
Jupyter notebooks are built upon a number
of popular open-source libraries:
● IPython
● NumPy
● pandas
● Bootstrap (front-end framework)
● Matplotlib

Jupyter Notebook can connect to many kernels to allow


programming in different languages. A Jupyter kernel is a
program responsible for handling various types of
requests (code execution, code completions, inspection),
and providing a reply. Kernels talk to the other
components of Jupyter using ZeroMQ, and thus can be on
the same or remote machines. Unlike many other
Notebook-like interfaces, in Jupyter, kernels are not aware
that they are attached to a specific document, and can be
connected to many clients at once. Usually kernels allow
execution of only a single language, but there are a couple
of exceptions. By default Jupyter Notebook ships with the
IPython kernel. As of the 2.3 release (October 2014),
there are 49 Jupyter-compatible kernels for many
programming languages, including Python, R, Julia and
Haskell.
A Jupyter Notebook can be converted to a number of open
standard output formats (HTML, presentation slides,
LaTeX, PDF, ReStructuredText, Markdown, Python) through
"Download As" in the web interface, via the nbconvert
library, or via the "jupyter nbconvert" command-line
interface in a shell. To simplify visualisation of Jupyter
notebook documents on the web, the nbconvert library is
provided as a service through NbViewer, which can take a
URL to any publicly available notebook document,
convert it to HTML on the fly and display it to the user.
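
For example (a minimal sketch of my own, assuming nbconvert is
installed and "notebook.ipynb" is a placeholder file name), the same
conversion can be driven from Python through the nbconvert API:

from nbconvert import HTMLExporter

# Convert a notebook file to an HTML page.
exporter = HTMLExporter()
body, resources = exporter.from_filename("notebook.ipynb")  # body is the HTML string
with open("notebook.html", "w", encoding="utf-8") as f:
    f.write(body)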
The notebook interface was added to IPython in the 0.12
release (December 2011), renamed to Jupyter notebook
in 2015 (IPython 4.0 is Jupyter 1.0). Jupyter Notebook is
similar to the notebook interface of other programs such
as Maple, Mathematica, and SageMath, a computational
interface style that originated with Mathematica in the
1980s. Jupyter interest overtook the popularity of the
Mathematica notebook interface in early 2018.
JupyterLab is a newer user interface for Project Jupyter.
It offers the building blocks of the classic Jupyter
Notebook (notebook, terminal, text editor, file browser,
rich outputs, etc.) in a flexible user interface. The first
stable release was announced on February 20, 2018.

2. Spyder
Scientific Python Development Environment (Spyder) is a
free and open-source Python IDE. It is lightweight and an
excellent Python IDE for data science & ML, used by a lot
of data analysts for real-time code analysis. Spyder has an
interactive code execution pattern which gives you the
option to run any single line, a section of the code, or
the whole code in one go.
You can find redundant variables, errors, and syntax
issues in your code without even running it, via Spyder's
static code analysis feature. It is also integrated with
many DS packages like NumPy, SciPy, pandas, IPython,
etc. to help you in doing data analytics.
You can control the execution flow of your source code
from the Spyder GUI (Graphical User Interface) via the
Spyder debugger. The history log page of Spyder records
all the commands used in the editor for further
references. You can also know about any built-in
function, method, class, etc. in Spyder via the Help Pane
of Spyder. It is an excellent tool for data science
enthusiasts.
3. Thonny
Thonny is an excellent Python IDE that runs on Windows,
Linux, and Mac. The debugger of Thonny helps in
debugging code line by line, which helps a lot for
beginners who are learning to code. The excellent GUI of
Thonny makes the installation of third-party packages
much easier.
Thonny autocompletes code according to its prediction
and inspects the code for bracket mismatches,
highlighting the errors, which is a great feature for
beginners. It is completely free to download. When you
call a function in Thonny, it is evaluated in a separate
window, which helps the user understand the local
variables & call stack of the function better. The package
manager of Thonny helps you in downloading packages
and extending the functionality of Python.

4. JupyterLab
It is a web-based python IDE for Machine Learning &
DS professionals. You can test your code as you write
via the interactive output system of JupyterLab. The
interface of JupyterLab is quite good as it provides you a
simultaneous view of the terminal, text editor, console,
and file directory.
Features like auto code completion, auto-formatting,
autosave, etc. make it one of the best free Python IDEs
for ML and DS professionals. There is a zen mode in
JupyterLab which allows users to minimise distractions
and unneeded screens and focus on the project at hand.
The files created in JupyterLab can be downloaded in
various formats like .py, .pdf, etc.
5. PyCharm
It is an excellent python IDE which has features like auto
code completion, auto code indentation, etc. It has a
smart debugger that analyses the code and highlights
errors. DS & ML professionals who are into web
development prefer PyCharm also because of its easy
navigation facility. You can search for any particular
symbol used in long codes via the navigation feature in
PyCharm. Interlinking multiple scripts is also easier in
PyCharm.
One can restructure code easily via PyCharm’s
refactoring feature, where you can change a method
signature, rename a file, or extract a method from the
code. ML professionals use integrated unit testing to
test their ML pipelines.
It helps in knowing the performance of any particular ML
model. PyCharm comes with inbuilt integrated unit
testing and one can see the results in a graphical layout. It
also has a version control system that helps in keeping
track of the changes made to any particular
file/application.
6. Visual Code
Visual Code is one of the most used Python IDEs among
ML & DS professionals. It works on Windows, Mac, and
Linux operating systems. VS Code supports many
languages besides Python like C, C#, JavaScript, HTML,
CSS, etc. Visual Code is a lightweight, open-source
Python IDE that has a free version as well as a paid
version for businesses/enterprises.
It is also a good platform for beginners as you will get
hints in the VS Code whenever you create functions or
classes. The auto code completion also helps users to save
time while coding. VS Code is also integrated with
PyLint which checks errors in the source code. You can
perform unit testing on your ML or DS models easily via
VS Code.
The REPL (read-evaluate-print loop) helps in seeing
quick results of any small python code in a separate
window. It helps a lot when one is experimenting
with any new API or function.
VS Code makes working with SQL, Unity, .NET,
Node.js, and many other tools easier. One can rename
files, extract methods, add imports, etc. in one's code
via the VS Code refactorings. VS Code is an excellent
IDE for ML & DS to optimise and debug code easily.
7. Atom
Atom is an excellent IDE for ML & DS professionals
which supports many other languages besides python
like C, C++, HTML, JavaScript, etc. You can use it on
Windows, Linux, and Mac. Atom supports MySQL,
PostgreSQL, Microsoft SQL Server which helps you in
writing and executing SQL queries/commands.
There are many useful packages in Atom, like the
atom-beautify package, which beautifies your code and
makes it more readable. The outline view feature of Atom
lets you see a tree-based view of your code so you can
cross-check your classes, functions, etc. easily. Atom
also provides many themes and templates from GitHub
to choose from.
ML & DS professionals also prefer Atom because of
its ability for cross-platform editing. It is one of the
best open-source free IDEs to use currently.
3. INTERNSHIP TRAINING WORK
UNDERTAKEN
Machine learning is a multidisciplinary field that uses
scientific inference and mathematical algorithms to
extract meaningful knowledge and insights from a large
amount of structured and unstructured data. These
algorithms are implemented via computer programs,
which are usually run on powerful hardware since they
require a significant amount of processing.

Machine learning is a combination of statistical


mathematics, Data Science, data analysis and
visualization, domain knowledge and computer science.
As is apparent from the name, the most important
component of Machine Learning is “Machine” itself. No
amount of algorithmic computation can draw meaningful
insights from improper data.
Machine learning is a subfield of artificial intelligence,
which is broadly defined as the capability of a machine
to imitate intelligent human behavior. This means
machines that can recognize a visual scene, understand
a text written in natural language, or perform an action
in the physical world.
I made a data science project on the topic of cars and performed many
operations on it: importing a .csv file; data cleaning (finding all the null
values in the dataset and, if any null is detected, filling it with the mean of
that column); checking the different types of make in the dataset and counting
the records of each make; filtering (showing all the records whose
Identification.Make is Audi or BMW); removing unwanted records (removing all
the records (rows) where Identification.Year is greater than 2011); and
applying a function to a column (increasing all the values of Dimensions.Width
by 3).
METHODOLOGY IN MACHINE LEARNING

In 1959, Arthur Samuel defined machine learning as a "field of study that gives
computers the ability to learn without being explicitly programmed".

"The goal of machine learning is to build computer systems that can adapt and learn
from their experience." - Tom Dietterich

Tom M. Mitchell provided a widely quoted, more formal definition: "A computer
program is said to learn from experience E with respect to some class of tasks T
and performance measure P, if its performance at tasks in T, as measured by P,
improves with experience E."

● Machine learning
A branch of artificial intelligence, concerned with the design and development
of algorithms that allow computers to evolve behaviours based on empirical
data. As intelligence requires knowledge, it is necessary for the computers to
acquire knowledge. Machine learning refers to a system capable of the
autonomous acquisition and integration of knowledge.

● Why is ML needed?

1. No human experts: industrial/manufacturing control, mass spectrometer
analysis, drug design, astronomical discovery.
2. Black-box human expertise: face/handwriting/speech recognition,
driving a car, flying a plane.
3. Rapidly changing phenomena: credit scoring, financial modelling,
diagnosis, fraud detection.
4. Customization/personalization: personalised news reader, movie/book
recommendation.

● LEARNING SYSTEM MODEL

● Types of Algorithm in ML

1. Supervised learning: prediction; classification (discrete labels) and
regression (real values).
2. Unsupervised learning: clustering, probability distribution estimation,
finding associations (in features), dimension reduction.
3. Semi-supervised learning.
4. Reinforcement learning: decision making (robot, chess machine).
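
To make the first two categories concrete, here is a minimal sketch of
my own (synthetic data, not part of the internship work) contrasting
supervised and unsupervised learning in scikit-learn:

from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

# Two well-separated groups of points.
X, y = make_blobs(n_samples=100, centers=2, random_state=0)

# Supervised learning: the labels y are given to the learner.
clf = LogisticRegression().fit(X, y)
print("supervised accuracy:", clf.score(X, y))

# Unsupervised learning: only X is given; KMeans finds the groups itself.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print("cluster assignments:", km.labels_[:10])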
● Regression and Classification
These are supervised learning algorithms. Both are used for prediction in
machine learning and work with labelled datasets, but they differ in the kinds
of machine learning problems they solve.

The main difference between regression and classification algorithms is that
regression algorithms are used to predict continuous values such as price,
salary, age, etc., while classification algorithms are used to predict/classify
discrete values such as Male or Female, True or False, Spam or Not Spam, etc.
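
A minimal sketch of this contrast (my own illustration with invented
numbers): the same feature is used once with a continuous target and
once with a discrete target.

import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y_cont = np.array([1.1, 1.9, 3.2, 3.9, 5.1])  # continuous target, e.g. a price
y_disc = np.array([0, 0, 0, 1, 1])            # discrete target, e.g. spam / not spam

# Regression predicts a real value.
reg = LinearRegression().fit(X, y_cont)
print(reg.predict([[6.0]]))  # a continuous value close to 6

# Classification predicts a discrete label.
clf = LogisticRegression().fit(X, y_disc)
print(clf.predict([[6.0]]))  # the label 1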
● Performance

There are several factors affecting the performance:

1. The type of training provided
2. The form and extent of any initial background knowledge
3. The type of feedback provided
4. The learning algorithms used

Two important factors:
1. Modelling
2. Optimization
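
As a toy illustration of these two factors (my own sketch, with
invented numbers): below, the modelling step assumes y ≈ w*x + b, and
the optimization step tunes w and b by gradient descent on the mean
squared error.

import numpy as np

# Invented data that roughly follows y = 2x.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 8.1])

# Modelling: assume a straight line y = w*x + b.
w, b, lr = 0.0, 0.0, 0.01

# Optimization: gradient descent on the mean squared error.
for _ in range(2000):
    err = (w * x + b) - y
    w -= lr * 2 * np.mean(err * x)  # dMSE/dw
    b -= lr * 2 * np.mean(err)      # dMSE/db

print(round(w, 2), round(b, 2))  # close to w = 2, b = 0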

● Application

1. Face detection.
2. Object detection and recognition.
3. Image segmentation.
4. Multimedia event detection.
5. Economic and commercial usage.
4. PROJECT WORK

1. Fake News Detection


Fake news is one of the biggest problems because it leads to a lot of misinformation in a
particular region. Most of the time, spreading false news about a community’s political and
religious beliefs can lead to riots and violence as you must have seen in the country where you
live. So, to detect fake news, we can find relationships between the fake news headlines so
that we can train a machine learning model that can tell us whether a particular piece of
information is fake or real by simply observing the headline in the news. So in the section
below, I’m going to introduce you to a machine learning project on fake news detection using
the Python programming language.

Fake News Detection using Python


The dataset I am using here for the fake news detection task has data about the news title,
news content, and a column known as label that shows whether the news is fake or real. So
we can use this dataset to find relationships between fake and real news headlines to
understand what type of headlines are in most fake news. So let’s import the necessary
Python libraries and the dataset that we need for this task:

Dataset: fake_or_real_news.csv (https://www.kaggle.com/hassanamin/textdb3)
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

data = pd.read_csv("news.csv")
print(data.head())

x = np.array(data["title"])
y = np.array(data["label"])
cv = CountVectorizer()
x = cv.fit_transform(x)

Now let’s separate the dataset into training and testing sets, and then I’ll use the Multinomial Naive
Bayes algorithm to train the fake news detection model:

xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=0.2, random_state=42)
model = MultinomialNB()
model.fit(xtrain, ytrain)
print(model.score(xtest, ytest))

Now let’s test this model. To test our trained model, I’ll first write down the title of a news
item found on Google News to see if our model predicts that the news is real or not:

news_headline = "CA Exams 2021: Supreme Court asks ICAI to extend opt-out option for July exams, final order tomorrow"
data = cv.transform([news_headline]).toarray()
print(model.predict(data))

Now I’m going to write a random fake news headline to see if the model predicts the news is
fake or not:

news_headline = "Cow dung can cure CoronaVirus"
data = cv.transform([news_headline]).toarray()
print(model.predict(data))

This news is Fake

● Summary

Fake news is one of the biggest problems because it leads to a lot of


misinformation in a particular region.
4. PROJECT WORK

2. Car Price Prediction - I made a Machine Learning project on the
topic of car price prediction. Here are the details of this project with code.

Project car details:

Car Dataset

Here, a dataset of different cars is given with their specifications. This data
is available in a csv file. We are going to analyse this data using the pandas
DataFrame.

Instructions to be performed on the dataset:

1. (Data cleaning) Find all the null values in the dataset and, if any null is
detected, fill it with the mean of that column.

2. (Value counts) Check the different types of make in our dataset and count
the records of each make.

3. (Filtering) Show all the records whose Identification.Make is Audi or BMW.

4. (Removing unwanted records) Remove all the records (rows) where
Identification.Year is greater than 2011.

5. (Applying a function to a column) Increase all the values of
Dimensions.Width by 3.
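
A compact sketch of these five operations (my own illustration;
"cars.csv" is a placeholder file name, and the column names follow the
dataset shown later in this section) might look like this:

import pandas as pd

car = pd.read_csv("cars.csv")  # placeholder file name

# 1. Data cleaning: fill nulls in numeric columns with the column mean.
num = car.select_dtypes("number").columns
car[num] = car[num].fillna(car[num].mean())

# 2. Value counts: how many records of each make?
print(car["Identification.Make"].value_counts())

# 3. Filtering: keep only the Audi and BMW records.
audi_bmw = car[car["Identification.Make"].isin(["Audi", "BMW"])]

# 4. Removing unwanted records: drop rows where the year is greater than 2011.
car = car[~(car["Identification.Year"] > 2011)]

# 5. Applying a function to a column: increase every width by 3.
car["Dimensions.Width"] = car["Dimensions.Width"].apply(lambda w: w + 3)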

One of the main areas of research in machine learning is the prediction of the
price of cars. It is based on finance and the marketing domain. It is a major
research topic in machine learning because the price of a car depends on
many factors. Some of the factors that contribute a lot to the price of a car are:

● Brand
● Model
● Horsepower
● Mileage
● Safety Features
● GPS and many more

If one ignores the brand of the car, a car manufacturer primarily fixes the price
of a car based on the features it can offer a customer. Later, the brand may
raise the price depending on its goodwill, but the most important factors are
what features a car gives you to add value to your life. So, in the section below,
I will walk you through the task of training a car price prediction model with
machine learning using the Python programming language.

Coding/Solution of these Car Price Prediction instructions:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

data = pd.read_csv("CarPrice.csv")
data.head()

1. data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 205 entries, 0 to 204
Data columns (total 26 columns):
 #   Column            Non-Null Count  Dtype
 0   car_ID            205 non-null    int64
 1   symboling         205 non-null    int64
 2   CarName           205 non-null    object
 3   fueltype          205 non-null    object
 4   aspiration        205 non-null    object
 5   doornumber        205 non-null    object
 6   carbody           205 non-null    object
 7   drivewheel        205 non-null    object
 8   enginelocation    205 non-null    object
 9   wheelbase         205 non-null    float64
 10  carlength         205 non-null    float64
 11  carwidth          205 non-null    float64
 12  carheight         205 non-null    float64
 13  curbweight        205 non-null    int64
 14  enginetype        205 non-null    object
 15  cylindernumber    205 non-null    object
 16  enginesize        205 non-null    int64
 17  fuelsystem        205 non-null    object
 18  boreratio         205 non-null    float64
 19  stroke            205 non-null    float64
 20  compressionratio  205 non-null    float64
 21  horsepower        205 non-null    int64
 22  peakrpm           205 non-null    int64
 23  citympg           205 non-null    int64
 24  highwaympg        205 non-null    int64
 25  price             205 non-null    float64
dtypes: float64(8), int64(8), object(10)
memory usage: 41.8+ KB

2. print(data.describe())

            car_ID   symboling   wheelbase  ...     citympg  highwaympg         price
count   205.000000  205.000000  205.000000  ...  205.000000  205.000000    205.000000
mean    103.000000    0.834146   98.756585  ...   25.219512   30.751220  13276.710571
std      59.322565    1.245307    6.021776  ...    6.542142    6.886443   7988.852332
min       1.000000   -2.000000   86.600000  ...   13.000000   16.000000   5118.000000
25%      52.000000    0.000000   94.500000  ...   19.000000   25.000000   7788.000000
50%     103.000000    1.000000   97.000000  ...   24.000000   30.000000  10295.000000
75%     154.000000    2.000000  102.400000  ...   30.000000   34.000000  16503.000000
max     205.000000    3.000000  120.900000  ...   49.000000   54.000000  45400.000000
3. data.CarName.unique()

array(['alfa-romero giulia', 'alfa-romero stelvio',


'alfa-romero Quadrifoglio', 'audi 100 ls', 'audi 100ls',
'audi fox', 'audi 5000', 'audi 4000', 'audi 5000s (diesel)',
'bmw 320i', 'bmw x1', 'bmw x3', 'bmw z4', 'bmw x4', 'bmw x5',
'chevrolet impala', 'chevrolet monte carlo', 'chevrolet vega 2300',
'dodge rampage', 'dodge challenger se', 'dodge d200',
'dodge monaco (sw)', 'dodge colt hardtop', 'dodge colt (sw)',
'dodge coronet custom', 'dodge dart custom',
'dodge coronet custom (sw)', 'honda civic', 'honda civic cvcc',
'honda accord cvcc', 'honda accord lx', 'honda civic 1500 gl',
'honda accord', 'honda civic 1300', 'honda prelude',
'honda civic (auto)', 'isuzu MU-X', 'isuzu D-Max ',
'isuzu D-Max V-Cross', 'jaguar xj', 'jaguar xf', 'jaguar xk',
'maxda rx3', 'maxda glc deluxe', 'mazda rx2 coupe', 'mazda rx-4',
'mazda glc deluxe', 'mazda 626', 'mazda glc', 'mazda rx-7 gs',
'mazda glc 4', 'mazda glc custom l', 'mazda glc custom',
'buick electra 225 custom', 'buick century luxus (sw)',
'buick century', 'buick skyhawk', 'buick opel isuzu deluxe',
'buick skylark', 'buick century special',
'buick regal sport coupe (turbo)', 'mercury cougar', 'mitsubishi
mirage', 'mitsubishi lancer', 'mitsubishi outlander', 'mitsubishi
g4', 'mitsubishi mirage g4', 'mitsubishi montero',
'mitsubishi pajero', 'Nissan versa', 'nissan gt-r', 'nissan rogue',
'nissan latio', 'nissan titan', 'nissan leaf', 'nissan juke',
'nissan note', 'nissan clipper', 'nissan nv200', 'nissan dayz',
'nissan fuga', 'nissan otti', 'nissan teana', 'nissan kicks',
'peugeot 504', 'peugeot 304', 'peugeot 504 (sw)', 'peugeot 604sl',
'peugeot 505s turbo diesel', 'plymouth fury iii',
'plymouth cricket', 'plymouth satellite custom (sw)',
'plymouth fury gran sedan', 'plymouth valiant', 'plymouth duster',
'porsche macan', 'porcshce panamera', 'porsche cayenne',
'porsche boxter', 'renault 12tl', 'renault 5 gtl', 'saab 99e',
'saab 99le', 'saab 99gle', 'subaru', 'subaru dl', 'subaru brz',
'subaru baja', 'subaru r1', 'subaru r2', 'subaru trezia',
'subaru tribeca', 'toyota corona mark ii', 'toyota corona',
'toyota corolla 1200', 'toyota corona hardtop',
'toyota corolla 1600 (sw)', 'toyota carina', 'toyota mark ii',
'toyota corolla', 'toyota corolla liftback',
'toyota celica gt liftback', 'toyota corolla tercel', 'toyota
corona liftback', 'toyota starlet', 'toyota tercel', 'toyota
cressida', 'toyota celica gt', 'toyouta tercel',
'vokswagen rabbit', 'volkswagen 1131 deluxe sedan',
'volkswagen model 111', 'volkswagen type 3', 'volkswagen 411 (sw)',
'volkswagen super beetle', 'volkswagen dasher', 'vw dasher',
'vw rabbit', 'volkswagen rabbit', 'volkswagen rabbit custom',
'volvo 145e (sw)', 'volvo 144ea', 'volvo 244dl', 'volvo 245',
'volvo 264gl', 'volvo diesel', 'volvo 246'], dtype=object)

The price column in this dataset is supposed to be the column whose values
we need to predict. So let’s see the distribution of the values of the price
column:
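
The plot itself is not reproduced in this file; a plausible sketch (my
own, with the imports re-stated for completeness) would be:

import matplotlib.pyplot as plt
import seaborn as sns

# Histogram of the target column with a density curve.
sns.histplot(data["price"], kde=True)
plt.title("Distribution of car prices")
plt.show()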

4. print(data.corr())

car_ID symboling ... highwaympg price


car_ID 1.000000 -0.151621 ... 0.011255 -0.109093
symboling -0.151621 1.000000 ... 0.034606 -0.079978
wheelbase 0.129729 -0.531954 ... -0.544082 0.577816
carlength 0.170636 -0.357612 ... -0.704662 0.682920
carwidth 0.052387 -0.232919 ... -0.677218 0.759325
carheight 0.255960 -0.541038 ... -0.107358 0.119336
curbweight 0.071962 -0.227691 ... -0.797465 0.835305
enginesize -0.033930 -0.105790 ... -0.677470 0.874145
boreratio 0.260064 -0.130051 ... -0.587012 0.553173
stroke -0.160824 -0.008735 ... -0.043931 0.079443
compressionratio 0.150276 -0.178515 ... 0.265201 0.067984
horsepower -0.015006 0.070873 ... -0.770544 0.808139
peakrpm -0.203789 0.273606 ... -0.054275 -0.085267
citympg 0.015940 -0.035823 ... 0.971337 -0.685751
highwaympg 0.011255 0.034606 ... 1.000000 -0.697599
price -0.109093 -0.079978 ... -0.697599 1.000000

predict = "price"
data = data[["symboling", "wheelbase", "carlength",
             "carwidth", "carheight", "curbweight",
             "enginesize", "boreratio", "stroke",
             "compressionratio", "horsepower", "peakrpm",
             "citympg", "highwaympg", "price"]]
x = np.array(data.drop(columns=[predict]))
y = np.array(data[predict])

xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=0.2)

model = DecisionTreeRegressor()
model.fit(xtrain, ytrain)
predictions = model.predict(xtest)

from sklearn.metrics import mean_absolute_error
print(mean_absolute_error(ytest, predictions))

● Summary
It is a major research topic in machine learning because the price of a car
depends on many factors.
1. Data cleaning query with output

In[5]: car.isnull().sum()

Out[5]: Dimensions.Height 0
Dimensions.Length 0
Dimensions.Width 0
Engine Information.Driveline 0
Engine Information.Engine Type 0
Engine Information.Hybrid 0
Engine Information.Number of Forward Gears 0
Engine Information.Transmission 0
Fuel Information.City mpg 0
Fuel Information.Fuel Type 0
Fuel Information.Highway mpg 0
Identification.Classification 0
Identification.ID 0
Identification.Make 0
Identification.Model Year 0
Identification.Year 0
Engine Information.Engine Statistics.Horsepower 0
Engine Information.Engine Statistics.Torque 0
dtype: int64

In[6]: car.head(2)

Out[6]:
   Dim.Height  Dim.Length  Dim.Width  Engine Info.Driveline  Iden.Make  Iden.Year  Horsepower  Fuel Type  Hybrid
0         140         143        202        All-wheel drive       Audi       2009         250   Gasoline    TRUE
1         140         143        202      Front-wheel drive       Audi       2009         200   Gasoline    TRUE
2. Value count function’s query with output


In[7]: car['Dimensions.Height'].value_counts()
193 114
195 109
188 105
...
108 1
63 1
78 1
159 1
241 1
Name: Dimensions.Height, Length: 198, dtype: int64

3. Filtering’s query with output


In[8]: car['Identification.Make'].isin(['Audi','BMW'])
Out[8]: 0 True
1 True
2 True
3 True
4 True
...
5071 False
5072 False
5073 False
5074 True
5075 True
Name: Identification.Make, Length: 5076, dtype: bool

In[9]: car.head()

Out[9]:
   Dim.Height  Dim.Length  Dim.Width  Engine Info.Driveline  Iden.Make  Iden.Year  Horsepower  Fuel Type  Hybrid
1         140         143        202      Front-wheel drive       Audi       2009         200   Gasoline    TRUE
2         140         143        202      Front-wheel drive       Audi       2009         200   Gasoline    TRUE
3         140         143        202        All-wheel drive       Audi       2009         200   Gasoline    TRUE
4         140         143        202        All-wheel drive       Audi       2009         200   Gasoline    TRUE
4. Removing unwanted record’s query with output


In[10]: car[~(car['Identification.Year']>2011)]

Out[10]:
      Dim.Height  Dim.Length  Dim.Width  Engine Info.Driveline  Iden.Make  Iden.Year  Horsepower  Fuel Type  Hybrid
0            140         143        202        All-wheel drive       Audi       2009         250   Gasoline    TRUE
1            140         143        202      Front-wheel drive       Audi       2009         200   Gasoline    TRUE
2            140         143        202      Front-wheel drive       Audi       2009         200   Gasoline    TRUE
3            140         143        202        All-wheel drive       Audi       2009         200   Gasoline    TRUE
...          ...         ...        ...                    ...        ...        ...         ...        ...     ...
4374          97         235        224       Rear-wheel drive      Dodge       2011         390   Gasoline    True
4375          97         235        224       Four-wheel drive      Dodge       2011         310   Gasoline    True

3861 rows × 18 columns


In[11]: car.head(2)

Out[11]:
   Dim.Height  Dim.Length  Dim.Width  Engine Info.Driveline  Iden.Make  Iden.Year  Horsepower  Fuel Type  Hybrid
0         140         143        202        All-wheel drive       Audi       2009         250   Gasoline    TRUE
1         140         143        202      Front-wheel drive       Audi       2009         200   Gasoline    TRUE

5. Applying function to column’s query with output


In[12]: car['Dimensions.Width'] = car['Dimensions.Width'].apply(lambda x: x + 3)

In[13]: car.head()

Out[13]:
   Dim.Height  Dim.Length  Dim.Width  Engine Info.Driveline  Iden.Make  Iden.Year  Horsepower  Fuel Type  Hybrid
0         140         143        205        All-wheel drive       Audi       2009         250   Gasoline    TRUE
1         140         143        205      Front-wheel drive       Audi       2009         200   Gasoline    TRUE
2         140         143        205      Front-wheel drive       Audi       2009         200   Gasoline    TRUE
3         140         143        205        All-wheel drive       Audi       2009         200   Gasoline    TRUE
4         140         143        205        All-wheel drive       Audi       2009         200   Gasoline    TRUE
5. RESULTS AND DISCUSSION

The head() function shows the first 5 rows of the dataset, and shape
gives the number of rows and columns.

head(2) shows the first two rows of the dataset from the .csv file.

The isnull() function is used to check whether there are any null
values or not. The value_counts() function counts the occurrences of
each unique value in a column.

The isin() function is used to check whether a particular type of data
is present in a specific column of the dataset or not. If it is
present, it returns 'True', otherwise 'False'.

The ~ operator negates a condition, so car[~(condition)] keeps only the
records where the condition does not hold; this is how the unwanted
records are deleted.

Here, the head() function is used just to see the changes made through
the code above. It shows the final state of the dataset after applying
the different instructions, conditions and code.
6. Conclusion

Machine learning education is well into its formative stages of development; it is


evolving into a self-supporting discipline and producing professionals with
distinct and complementary skills relative to professionals in the computer,
information, and statistical sciences. However, regardless of its potential
eventual disciplinary status, the evidence points to robust growth of data
science education that will indelibly shape the undergraduate students of the
future. In fact, fueled by growing student interest and industry demand, data
science education will likely become a staple of the undergraduate experience.
There will be an increase in the number of students majoring, minoring, earning
certificates, or just taking courses in data science as the value of data skills
becomes even more widely recognized. The adoption of a general education
requirement in data science for all undergraduates will endow future
generations of students with the basic understanding of Machine Learning that
they need to become responsible citizens. Continuing education programs such
as data science boot camps, career accelerators, summer schools, and
incubators will provide another stream of talent. This constitutes the emerging
watershed of data science education that feeds multiple streams of generalists
and specialists in society; citizens are empowered by their basic skills to
examine, interpret, and draw value from data.
