Data Mining Using Python

UNIVERSITY OF NORTHERN PHILIPPINES

MODULE 5

DATA MINING USING PYTHON

Python is a popular programming language used for data mining.

Prepared by:
ARPEE C. ARRUEJO, DIT

Python Intro
• What is Python?
• What can Python Do?
• Why Python?

Python Installation and Setup

Python Syntax and Modules
• Syntax
• NumPy
• Pandas

Machine Learning in Python
• Train/Test
• Linear Regression
• Decision Tree
• Sentiment Analysis

WH of Python?
What is Python?

Python is a popular programming language. It was created by Guido van Rossum and released in 1991.
It is used for:
• web development (server-side),
• software development,
• mathematics,
• system scripting.
WH of Python?
What can Python do?

• Python can be used on a server to create web applications.
• Python can be used alongside software to create workflows.
• Python can connect to database systems. It can also read and modify files.
• Python can be used to handle big data and perform complex mathematics.
• Python can be used for rapid prototyping, or for production-ready software development.

Python Installation

Many PCs and Macs already have Python installed.

To check whether Python is installed on a Windows PC, search for Python in the Start menu, or run the following on the Command Line (cmd.exe):

    python --version

If Python is not installed on your computer, you can download it for free from the following website:

https://www.python.org/

Install a code editor such as VS Code, Sublime Text, or Notepad for writing your code.


Python Syntax
As we learned on the previous page, Python syntax can be executed by writing directly in the Command Line.

It can also be executed by creating a Python file on the server, using the .py file extension, and running it in the Command Line.
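
As a minimal sketch of the second approach, the lines below could be saved in a file and run from the Command Line. The file name hello.py and the variable name message are assumptions chosen for illustration.

```python
# hello.py (hypothetical file name): run it with `python hello.py`
message = "Hello, World!"
print(message)
```

The same two lines can also be typed one at a time after starting the interpreter with the python command.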
Python Variables
In Python, a variable is created when you assign a value to it.

Python has no command for declaring a variable.

Example

Syntax:
    x = 5
    y = "Hello, World!"

    print(x)
    print(y)

Output:
    5
    Hello, World!

Checkpoint!

Exercise:

Insert the missing part of the code below to output "Hello World".

Python Variables

Variables
Variables are containers for storing data values.

Creating Variables
Python has no command for declaring a variable.

A variable is created the moment you first assign a value to it.

Variables do not need to be declared with any particular type, and can even change type after they have been set.

Example

Syntax:
    x = 5
    y = "John"
    print(x)
    print(y)

Output:
    5
    John
Output Variables
The Python print() function is often used to output variables.

Example

Syntax:
    x = "Python is awesome"
    print(x)

Output:
    Python is awesome

In the print() function, you can output multiple variables, separated by commas.
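
A short sketch of printing several variables at once. The variable names and values are illustrative assumptions; print() places a single space between the arguments by default.

```python
x = "Python"
y = "is"
z = "awesome"

# print() separates its arguments with one space by default
print(x, y, z)

# The same text built explicitly, for comparison
joined = " ".join([x, y, z])
print(joined)
```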
Global Variables
Variables that are created outside of a function (as in all of the examples above) are known as global variables.

Global variables can be used anywhere, both inside and outside of functions.

Example

Syntax:
    x = "awesome"

    def myfunc():
        print("Python is " + x)

    myfunc()

Output:
    Python is awesome

If you create a variable with the same name inside a function, this variable will be local, and can only be used inside the function. The global variable with the same name will remain as it was, global and with the original value.
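
The behavior described above can be sketched as follows. The value "fantastic" is an assumption chosen for illustration, and the function returns its text instead of printing it so that both results can be compared.

```python
x = "awesome"  # global variable

def myfunc():
    x = "fantastic"  # local variable; it hides the global x inside myfunc only
    return "Python is " + x

inside = myfunc()
outside = "Python is " + x

print(inside)   # built from the local x
print(outside)  # built from the global x, which is unchanged
```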
Modules

What is a Module?
Consider a module to be the same as a code library: a file containing a set of functions you want to include in your application.

Create a Module
To create a module, just save the code you want in a file with the file extension .py.

Example
Save this code in a file named mymodule.py:

Syntax:
    def greeting(name):
        print("Hello, " + name)

Use a Module
Now we can use the module we just created, by using the import statement.

Example

Syntax:
    import mymodule

    mymodule.greeting("Jonathan")

Output:
    Hello, Jonathan
Variables in Modules
The module can contain functions, as already described, but also variables of all types (arrays, dictionaries, objects, etc.).

Example
Save this code, which defines a dictionary named person1, in the file mymodule.py.

Example
Import the module named mymodule, and access the person1 dictionary:

Syntax:
    import mymodule

    a = mymodule.person1["age"]
    print(a)

Output:
    36
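
A self-contained sketch of the same idea, assuming a module file is written to a temporary folder first. The file name mymodule_demo.py and the "name" and "country" entries are assumptions; only the "age" value of 36 is implied by the output shown above.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Hypothetical module contents: a single dictionary variable
module_source = 'person1 = {"name": "John", "age": 36, "country": "Norway"}\n'

with tempfile.TemporaryDirectory() as tmp:
    # Save the code in a .py file, which makes it a module
    Path(tmp, "mymodule_demo.py").write_text(module_source)
    sys.path.insert(0, tmp)  # make the temporary folder importable
    try:
        mymodule_demo = importlib.import_module("mymodule_demo")
    finally:
        sys.path.remove(tmp)

# Access a variable that lives inside the module
a = mymodule_demo.person1["age"]
print(a)
```

In everyday use the module file simply sits next to your script, and a plain import statement is enough.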
Built-in Modules
There are several built-in modules in Python, which you can import whenever you like.

Example
Import and use the platform module:

Syntax:
    import platform

    x = platform.system()
    print(x)

Output (on a Windows PC):
    Windows
Python JSON
JSON is a syntax for storing and exchanging data.

JSON is text, written with JavaScript object notation.

Example
Import the json module:

Syntax:
    import json
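
A minimal sketch of what the built-in json module can do once imported. The sample text and its values are assumptions for illustration.

```python
import json

# A JSON document held in a Python string (sample values are illustrative)
text = '{"name": "John", "age": 30, "city": "New York"}'

# json.loads() parses JSON text into a Python dictionary
person = json.loads(text)
print(person["age"])

# json.dumps() converts a Python object back into JSON text
round_trip = json.dumps(person)
print(round_trip)
```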
NumPy
NumPy is a Python library.

NumPy is used for working with arrays.

NumPy is short for "Numerical Python".

WH of NumPy
What is NumPy?
• NumPy is a Python library used for working with arrays.
• It also has functions for working in the domains of linear algebra, Fourier transforms, and matrices.
• NumPy was created in 2005 by Travis Oliphant. It is an open source project and you can use it freely.
• NumPy stands for Numerical Python.

Why Use NumPy?
• In Python we have lists that serve the purpose of arrays, but they are slow to process.
• NumPy aims to provide an array object that is up to 50x faster than traditional Python lists.
• The array object in NumPy is called ndarray. It provides many supporting functions that make working with ndarray very easy.
• Arrays are very frequently used in data science, where speed and resources are very important.
Installation of NumPy
If you have Python and PIP already installed on a system, then installation of NumPy is very easy.

Install it using this command:

    pip install numpy

If this command fails, then use a Python distribution that already has NumPy installed, such as Anaconda or Spyder.

Import NumPy
Once NumPy is installed, import it in your applications by adding the import keyword:

    import numpy

Now NumPy is imported and ready to use.
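
Once imported, NumPy can be tried out with a small array. This is a sketch under the assumption that NumPy is installed; the alias np and the sample values are illustrative.

```python
import numpy as np

# Create an ndarray from a Python list
arr = np.array([1, 2, 3, 4, 5])

print(type(arr).__name__)  # name of the array type
print(arr * 2)             # arithmetic is applied to every element at once
print(arr.mean())          # one of the built-in supporting functions
```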


Pandas
Pandas is a Python library used for working with data sets.

It has functions for analyzing, cleaning, exploring, and manipulating data.

The name "Pandas" refers to both "Panel Data" and "Python Data Analysis". It was created by Wes McKinney in 2008.
WH of Pandas
What is Pandas?
• Pandas is a Python library used for working with data sets.
• It has functions for analyzing, cleaning, exploring, and manipulating data.
• The name "Pandas" refers to both "Panel Data" and "Python Data Analysis". It was created by Wes McKinney in 2008.

Why Use Pandas?
• Pandas allows us to analyze big data and make conclusions based on statistical theories.
• Pandas can clean messy data sets, and make them readable and relevant.
• Relevant data is very important in data science.

What Can Pandas Do?
Pandas gives you answers about the data, such as:
• Is there a correlation between two or more columns?
• What is the average value?
• What is the max value?
• What is the min value?

Pandas is also able to delete rows that are not relevant, or that contain wrong values, like empty or NULL values. This is called cleaning the data.
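
The questions above can be sketched in a few lines. The data set here is hypothetical (made up for illustration), and it assumes Pandas is installed; None marks an empty value.

```python
import pandas as pd

# Hypothetical workout data; None marks a missing (NULL) value
df = pd.DataFrame({
    "Duration": [60, 60, 45, 45, 30],
    "Calories": [400.0, 410.0, 300.0, None, 200.0],
})

# Cleaning the data: drop rows that contain empty values
clean = df.dropna()

print(clean["Calories"].mean())  # average value
print(clean["Calories"].max())   # max value
print(clean["Calories"].min())   # min value
print(clean.corr())              # correlation between the columns
```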
Installation of Pandas
If you have Python and PIP already installed on a system, then installation of Pandas is very easy.

Install it using this command:

    pip install pandas

If this command fails, then use a Python distribution that already has Pandas installed, such as Anaconda or Spyder.

Import Pandas
Once Pandas is installed, import it in your applications by adding the import keyword:

    import pandas

Now Pandas is imported and ready to use.

Example

Syntax:
    import pandas as pd

    mydataset = {
        'cars': ["BMW", "Volvo", "Ford"],
        'passings': [3, 7, 2]
    }

    myvar = pd.DataFrame(mydataset)

    print(myvar)

Output:
        cars  passings
    0    BMW         3
    1  Volvo         7
    2   Ford         2


Pandas Series

What is a Series?
A Pandas Series is like a column in a table.

It is a one-dimensional array holding data of any type.

Example

Syntax:
    import pandas as pd

    a = [1, 7, 2]

    myvar = pd.Series(a)

    print(myvar)

Output:
    0    1
    1    7
    2    2
    dtype: int64
Labels
If nothing else is specified, the values are labeled with their index number. The first value has index 0, the second value has index 1, etc.

This label can be used to access a specified value.

Example

Syntax:
    import pandas as pd

    a = [1, 7, 2]

    myvar = pd.Series(a)

    print(myvar[0])

Output:
    1

Create Labels
With the index argument, you can name your own labels.
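
A minimal sketch of the index argument, assuming Pandas is installed. The labels "x", "y", and "z" are illustrative choices.

```python
import pandas as pd

a = [1, 7, 2]

# Name the labels with the index argument
myvar = pd.Series(a, index=["x", "y", "z"])

print(myvar)

# A value can now be accessed by referring to its label
second = myvar["y"]
print(second)
```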
Pandas Read CSV

Read CSV Files
A simple way to store big data sets is to use CSV files (comma-separated files).

CSV files contain plain text in a well-known format that can be read by everyone, including Pandas.

In our examples we will be using a CSV file called 'data.csv'.

Example: Load the CSV into a DataFrame.

Syntax:
    import pandas as pd

    df = pd.read_csv('data.csv')

    print(df.to_string())
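
If 'data.csv' is not at hand, the same call can be tried on CSV text held in memory. This is a sketch: the text below is a stand-in for the file, with its rows borrowed from the Duration/Pulse/Maxpulse/Calories dictionary used later in this module.

```python
import io
import pandas as pd

# Stand-in for the contents of a CSV file (three sample rows)
csv_text = """Duration,Pulse,Maxpulse,Calories
60,110,130,409
60,117,145,479
60,103,135,340
"""

# read_csv() accepts a file path or any file-like object
df = pd.read_csv(io.StringIO(csv_text))

# to_string() prints the entire DataFrame
print(df.to_string())
```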
Pandas Read JSON

Read JSON
Big data sets are often stored, or extracted, as JSON.

JSON is plain text, but has the format of an object, and is well known in the world of programming, including Pandas.

In our examples we will be using a JSON file called 'data.json'.

Example: Load the JSON file into a DataFrame.

Syntax:
    import pandas as pd

    df = pd.read_json('data.json')

    print(df.to_string())

JSON = Python Dictionary

JSON objects have the same format as Python dictionaries.

If your JSON code is not in a file, but in a Python dictionary, you can load it into a DataFrame directly:

Syntax:
    import pandas as pd

    data = {
        "Duration": {
            "0": 60,
            "1": 60,
            "2": 60,
            "3": 45,
            "4": 45,
            "5": 60
        },
        "Pulse": {
            "0": 110,
            "1": 117,
            "2": 103,
            "3": 109,
            "4": 117,
            "5": 102
        },
        "Maxpulse": {
            "0": 130,
            "1": 145,
            "2": 135,
            "3": 175,
            "4": 148,
            "5": 127
        },
        "Calories": {
            "0": 409,
            "1": 479,
            "2": 340,
            "3": 282,
            "4": 406,
            "5": 300
        }
    }

    df = pd.DataFrame(data)

    print(df)
Django
Django is a back-end, server-side web framework.

Django is free, open source, and written in Python.

Django makes it easier to build web pages using Python.

WH of Django
What is Django?
• Django is a Python framework that makes it easier to create web sites using Python.
• Django takes care of the difficult stuff so that you can concentrate on building your web applications.
• Django emphasizes reusability of components, also referred to as DRY (Don't Repeat Yourself), and comes with ready-to-use features like a login system, database connection, and CRUD operations (Create, Read, Update, Delete).

How does Django Work?
Django follows the MVT design pattern (Model View Template).
• Model - The data you want to present, usually data from a database.
• View - A request handler that returns the relevant template and content, based on the request from the user.
• Template - A text file (like an HTML file) containing the layout of the web page, with logic on how to display the data.
Machine Learning

Train/Test

Evaluate Your Model
In Machine Learning we create models to predict the outcome of certain events, like in the previous chapter where we predicted the CO2 emission of a car when we knew the weight and engine size.

To measure if the model is good enough, we can use a method called Train/Test.

What is Train/Test?
Train/Test is a method to measure the accuracy of your model.

It is called Train/Test because you split the data set into two sets: a training set and a testing set.

You train the model using the training set.

You test the model using the testing set.
Start With a Data Set
Start with a data set you want to test.

Our data set illustrates 100 customers in a shop, and their shopping habits.

Syntax:
    import numpy
    import matplotlib.pyplot as plt

    # Fix the random seed so the same data set is generated on every run
    numpy.random.seed(2)

    # 100 random x values, and 100 y values that depend on x
    x = numpy.random.normal(3, 1, 100)
    y = numpy.random.normal(150, 40, 100) / x

    # Draw the data set as a scatter plot
    plt.scatter(x, y)
    plt.show()
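
The split itself can be sketched with plain slicing. The 80/20 proportion is an assumption for illustration; the text above only says that the data set is divided into a training set and a testing set.

```python
import numpy

numpy.random.seed(2)

# Same 100-customer data set as above
x = numpy.random.normal(3, 1, 100)
y = numpy.random.normal(150, 40, 100) / x

# Assumed split: the first 80 samples for training, the last 20 for testing
train_x, test_x = x[:80], x[80:]
train_y, test_y = y[:80], y[80:]

print(len(train_x), len(test_x))
```

A model would then be fitted on train_x and train_y only, and its accuracy checked on test_x and test_y, which it has not seen.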
Decision Tree
A Decision Tree is a flow chart that can help you make decisions based on previous experience.
Example
In the example, a person will try to decide whether he/she should go to a comedy show or not.

Luckily, our example person has registered every time there was a comedy show in town, recorded some information about the comedian, and also recorded whether he/she went or not.

Now, based on this data set, Python can create a decision tree that can be used to decide if any new shows are worth attending.
How does it Work?
First, read the dataset with pandas.

To make a decision tree, all data has to be numerical.

We have to convert the non-numerical columns 'Nationality' and 'Go' into numerical values.

Pandas has a map() method that takes a dictionary with information on how to convert the values.

{'UK': 0, 'USA': 1, 'N': 2}

This means: convert the value 'UK' to 0, 'USA' to 1, and 'N' to 2.
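
A minimal sketch of the conversion step. The rows are hypothetical (the original data set is not reproduced here), and the YES/NO encoding of the 'Go' column is an assumption; only the column names and the 'Nationality' mapping come from the text.

```python
import pandas as pd

# Hypothetical rows for illustration only
df = pd.DataFrame({
    "Nationality": ["UK", "USA", "N", "USA"],
    "Go": ["NO", "NO", "YES", "YES"],
})

# map() replaces each value using the dictionary it is given
df["Nationality"] = df["Nationality"].map({"UK": 0, "USA": 1, "N": 2})
df["Go"] = df["Go"].map({"YES": 1, "NO": 0})  # assumed encoding for 'Go'

print(df)
```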


Change string values into numerical values.

Then we have to separate the feature columns from the target column.

The feature columns are the columns that we try to predict from, and the target column is the column with the values we try to predict.

X is the feature columns, y is the target column.

Now we can create the actual decision tree and fit it with our details. Start by importing the modules we need.

Create and display a Decision Tree.
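
The remaining steps can be sketched as follows, assuming scikit-learn and Pandas are installed. The six rows are hypothetical and already numerical, only the 'Rank' and 'Nationality' features are used, and the tree is shown as text with export_text rather than drawn as a figure.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical data (Nationality: UK=0, USA=1, N=2; Go: 1 = went, 0 = did not go)
df = pd.DataFrame({
    "Rank":        [9, 8, 7, 4, 3, 2],
    "Nationality": [0, 1, 2, 0, 1, 2],
    "Go":          [1, 1, 1, 0, 0, 0],
})

features = ["Rank", "Nationality"]
X = df[features]  # feature columns
y = df["Go"]      # target column

# Create the decision tree and fit it with our details
dtree = DecisionTreeClassifier(random_state=0)
dtree.fit(X, y)

# Display the fitted tree as indented text
print(export_text(dtree, feature_names=features))

# Ask the tree about a new comedian with Rank 8 and Nationality 1 (USA)
new_show = pd.DataFrame({"Rank": [8], "Nationality": [1]})
prediction = dtree.predict(new_show)[0]
print(prediction)
```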


Result Explained
The decision tree uses your earlier decisions to calculate the odds of you wanting to go see a comedian or not.

Let us read the different aspects of the decision tree:

Rank
Rank <= 6.5 means that every comedian with a rank of 6.5 or lower will follow the True arrow (to the left), and the rest will follow the False arrow (to the right).

gini = 0.497 refers to the quality of the split, and is always a number between 0.0 and 0.5, where 0.0 would mean all of the samples got the same result, and 0.5 would mean that the split is done exactly in the middle.

samples = 13 means that there are 13 comedians left at this point in the decision, which is all of them since this is the first step.

value = [6, 7] means that of these 13 comedians, 6 will get a "NO", and 7 will get a "GO".
Gini
There are many ways to split the samples; we use the Gini method in this tutorial.

The Gini method uses this formula:

Gini = 1 - (x/n)^2 - (y/n)^2

where x is the number of positive answers ("GO"), n is the number of samples, and y is the number of negative answers ("NO"), which gives us this calculation:

1 - (7/13)^2 - (6/13)^2 = 0.497
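
The calculation can be checked with a few lines of Python. The function name gini is an illustrative choice, not part of any library.

```python
def gini(x, y):
    """Gini value for x positive ("GO") and y negative ("NO") answers."""
    n = x + y  # total number of samples
    return 1 - (x / n) ** 2 - (y / n) ** 2

# First step of the tree: 7 "GO" and 6 "NO" out of 13 samples
root = gini(7, 6)
print(round(root, 3))
```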


Text Mining
Text Mining Method
References:
https://www.w3schools.com/python/default.asp


To God be
all the Glory!
