Data Mining Using Python
MODULE 5
DM USING PYTHON
DATA MINING USING PYTHON
Prepared by:
ARPEE C. ARRUEJO, DIT
Python Intro
What is Python?
What can Python Do?
Why Python?
Python Syntax and Modules
Syntax
NumPy
Pandas
Python Installation
To check if you have Python installed on a Windows PC, search in the start bar for Python or run the following on the Command Line (cmd.exe):
python --version
If you find that you do not have Python installed on your computer, then you can download it for free from the following website:
https://www.python.org/
Python Syntax
As we learned on the previous page, Python syntax can be executed by writing directly in the Command Line:
Example
Syntax
x = 5
y = "Hello, World!"
print(x)
print(y)
Output
5
Hello, World!
Python Variables
Variables
Variables are containers for storing data values.
Creating Variables
Python has no command for declaring a variable.
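A minimal sketch of this point: a variable comes into existence the moment a value is first assigned to it. The names and values below are only illustrative.

```python
# A variable is created the first time a value is assigned to it;
# no separate declaration is needed.
x = 5
y = "John"

print(x)
print(y)

# The same name can later be bound to a value of another type.
x = "Sally"
print(type(x).__name__)
```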
If you create a variable with the same name
inside a function, this variable will be local, and
can only be used inside the function. The global
variable with the same name will remain as it
was, global and with the original value.
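The behavior described above can be sketched as follows. The function name myfunc and the string values are assumptions chosen for illustration.

```python
x = "awesome"  # global variable


def myfunc():
    # This assignment creates a new local x; the global x is untouched.
    x = "fantastic"
    return "Python is " + x


local_result = myfunc()
global_result = "Python is " + x

print(local_result)
print(global_result)
```

The first line printed uses the local value and the second uses the global one, which keeps its original value.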
Use a Module
Now we can use the module we just created by using the import statement.
Example
Import the module named mymodule, and access the person1 dictionary:
Syntax
import mymodule
a = mymodule.person1["age"]
print(a)
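The slides do not show the contents of mymodule, so the sketch below writes a hypothetical mymodule.py into a temporary folder before importing it. The dictionary values (name, age, country) are assumptions made only so the example can run on its own.

```python
import importlib
import os
import sys
import tempfile

# Hypothetical module contents; the real mymodule.py is not shown in the slides.
module_source = 'person1 = {"name": "John", "age": 36, "country": "Norway"}\n'

# Write mymodule.py into a temporary folder and make that folder importable.
module_dir = tempfile.mkdtemp()
with open(os.path.join(module_dir, "mymodule.py"), "w") as f:
    f.write(module_source)
sys.path.insert(0, module_dir)
importlib.invalidate_caches()

import mymodule

# Access one key of the dictionary defined inside the module.
a = mymodule.person1["age"]
print(a)
```

In everyday use the module file would simply sit in the same folder as the script, and only the last three statements would be needed.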
Built-in Modules
There are several built-in modules in Python, which you can import whenever you like.
Example
Import and use the platform module:
Syntax
import platform
x = platform.system()
print(x)
Output
Windows
Python JSON
JSON is a syntax for storing and exchanging data.
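As a small illustration, Python's built-in json module can convert between JSON text and Python objects. The sample record below is made up.

```python
import json

# A JSON document is plain text (the values here are made up).
text = '{"name": "John", "age": 30, "city": "New York"}'

# Parse JSON text into a Python dictionary.
person = json.loads(text)
print(person["age"])

# Convert a Python object back into JSON text.
round_trip = json.dumps(person)
print(round_trip)
```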
WH of NumPy
What is NumPy?
• NumPy is a Python library used for working with arrays.
• It also has functions for working in the domains of linear algebra, Fourier transforms, and matrices.
• NumPy was created in 2005 by Travis Oliphant. It is an open source project and you can use it freely.
• NumPy stands for Numerical Python.
Why Use NumPy?
• In Python we have lists that serve the purpose of arrays, but they are slow to process.
• NumPy aims to provide an array object that is up to 50x faster than traditional Python lists.
• The array object in NumPy is called ndarray. It provides many supporting functions that make working with ndarray very easy.
• Arrays are very frequently used in data science, where speed and resources are very important.
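A brief sketch of the difference between a list and an ndarray: with a list, element-wise work needs an explicit loop, while an ndarray applies the operation to every element at once. The sample values are arbitrary, and the example does not measure speed.

```python
import numpy as np

values = [1, 2, 3, 4, 5]

# With a plain list, element-wise work needs an explicit loop.
doubled_list = [v * 2 for v in values]

# With an ndarray, the same operation is applied to the whole array at once.
arr = np.array(values)
doubled_arr = arr * 2

print(type(arr).__name__)
print(doubled_arr.tolist())
```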
Installation of NumPy
If you have Python and PIP already installed on a system, then installation of NumPy is very easy.
Install it using this command:
pip install numpy
If this command fails, then use a Python distribution that already has NumPy installed, like Anaconda, Spyder, etc.
Import NumPy
Once NumPy is installed, import it in your applications by adding the import keyword:
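The import step can be sketched as follows. The alias np is a widespread convention rather than something the slides specify, and the array contents are arbitrary.

```python
import numpy as np  # np is a conventional alias, not a requirement

# Confirm that the import worked by printing the installed version.
print(np.__version__)

# Build a small two-dimensional array and inspect its shape.
matrix = np.array([[1, 2, 3], [4, 5, 6]])
print(matrix.shape)
```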
Why Use Pandas?
• Pandas allows us to analyze big data and make conclusions based on statistical theories.
• Pandas can clean messy data sets, and make them readable and relevant.
• Relevant data is very important in data science.
Pandas can also delete rows that are not relevant or that contain wrong values, such as empty or NULL values. This is called cleaning the data.
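One way to sketch this kind of cleaning is with the DataFrame method dropna(), which removes rows that contain empty values. The column names and numbers below are hypothetical.

```python
import pandas as pd

# Hypothetical measurements; None marks a missing value.
df = pd.DataFrame({
    "Duration": [60, 45, None, 60],
    "Pulse": [110, None, 103, 102],
})

# dropna() returns a new DataFrame without the rows that contain empty values.
clean_df = df.dropna()

print(clean_df)
```

Only the rows with a value in every column survive; the original df is left unchanged.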
Installation of Pandas
If you have Python and PIP already installed on a system, then installation of Pandas is very easy.
Install it using this command:
pip install pandas
If this command fails, then use a Python distribution that already has Pandas installed, like Anaconda, Spyder, etc.
Example
Syntax
import pandas as pd
mydataset = {
  'cars': ["BMW", "Volvo", "Ford"],
  'passings': [3, 7, 2]
}
myvar = pd.DataFrame(mydataset)
print(myvar)
Output
    cars  passings
0    BMW         3
1  Volvo         7
2   Ford         2
What is a Series?
A Pandas Series is like a column in a table.
Syntax
import pandas as pd
a = [1, 7, 2]
myvar = pd.Series(a)
print(myvar)
Output
0    1
1    7
2    2
dtype: int64
If nothing else is specified, the values are labeled with their index number. The first value has index 0, the second value has index 1, etc.
This label can be used to access a specified value.
Syntax
import pandas as pd
a = [1, 7, 2]
myvar = pd.Series(a)
print(myvar[0])
Output
1
Pandas Read CSV
A simple way to store big data sets is to use CSV files (comma separated files).
CSV files contain plain text and are a well-known format that can be read by everyone, including Pandas.
In our examples we will be using a CSV file called 'data.csv'.
Example: Load the CSV into a DataFrame.
import pandas as pd
df = pd.read_csv('data.csv')
print(df.to_string())
Pandas Read JSON
Big data sets are often stored or extracted as JSON.
JSON is plain text, but has the format of an object, and is well known in the world of programming, including Pandas.
In our examples we will be using a JSON file called 'data.json'.
import pandas as pd
df = pd.read_json('data.json')
print(df.to_string())
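Both readers can be tried without the data.csv and data.json files by passing in-memory text instead of a file name. The sketch below assumes small stand-in contents; the column names and rows are sample values, not the real files.

```python
import io

import pandas as pd

# Stand-in for data.csv (sample rows; the real file is not reproduced here).
csv_text = "Duration,Pulse,Maxpulse,Calories\n60,110,130,409\n60,117,145,479\n"
df_csv = pd.read_csv(io.StringIO(csv_text))

# Stand-in for data.json, with one object per column.
json_text = '{"Duration": {"0": 60, "1": 60}, "Pulse": {"0": 110, "1": 117}}'
df_json = pd.read_json(io.StringIO(json_text))

print(df_csv.to_string())
print(df_json.to_string())
```

Replacing the io.StringIO(...) argument with a file name such as 'data.csv' reads from disk instead.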
import pandas as pd
data = {
"Duration":{
"0":60,
"1":60,
"2":60,
"3":45,
"4":45,
"5":60
},
"Pulse":{
"0":110,
"1":117,
"2":103,
"3":109,
"4":117,
"5":102
},
"Maxpulse":{
"0":130,
"1":145,
"2":135,
"3":175,
"4":148,
"5":127
},
"Calories":{
"0":409,
"1":479,
"2":340,
"3":282,
"4":406,
"5":300
}
}
df = pd.DataFrame(data)
print(df)
Django
Django is a back-end server side web framework.
Django is free, open source and written in Python.
Machine Learning
In Machine Learning we create models to predict the outcome of certain events, such as the CO2 emission of a car when we know its weight and other attributes.
What is Train/Test?
Train/Test is a method to measure the accuracy of your model.
It is called Train/Test because you split the data set into two sets: a training set and a testing
set.
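import numpy
import matplotlib.pyplot as plt

# Fix the random seed so the same data set is generated on every run
numpy.random.seed(2)

# 100 x values centered on 3, and 100 y values that depend on x
x = numpy.random.normal(3, 1, 100)
y = numpy.random.normal(150, 40, 100) / x

# Show the data set as a scatter plot
plt.scatter(x, y)
plt.show()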
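The split into a training set and a testing set can be sketched with plain slicing. The 80/20 ratio is an assumption made for illustration; the slides do not state which ratio to use.

```python
import numpy

# Recreate the same data set as in the example above.
numpy.random.seed(2)
x = numpy.random.normal(3, 1, 100)
y = numpy.random.normal(150, 40, 100) / x

# Assumed split: the first 80 points for training, the remaining 20 for testing.
train_x, test_x = x[:80], x[80:]
train_y, test_y = y[:80], y[80:]

print(len(train_x), len(test_x))
```

The model is then fitted on train_x and train_y only, and its accuracy is measured on test_x and test_y, which it has never seen.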
Luckily, our example person has registered every time there was a comedy show in town, recorded some information about the comedian, and noted whether he or she went.
Now, based on this data set, Python can create a decision tree that can be used to decide if any new shows are worth attending.
How does it Work?
First, read the dataset with pandas:
We have to convert the non-numerical columns 'Nationality' and 'Go' into numerical values.
Pandas has a map() method that takes a dictionary with information on how to convert the values.
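The conversion can be sketched as follows. The rows, the text values ("UK", "USA", "N", "YES", "NO"), and the numeric codes are all assumptions, since the data set itself is not reproduced in the slides.

```python
import pandas as pd

# Hypothetical rows standing in for the comedy-show data set.
df = pd.DataFrame({
    "Nationality": ["UK", "USA", "N", "USA"],
    "Go": ["NO", "NO", "YES", "YES"],
})

# Each dictionary says which number replaces which text value (assumed codes).
nationality_codes = {"UK": 0, "USA": 1, "N": 2}
go_codes = {"YES": 1, "NO": 0}

# map() looks up every value of the column in the dictionary.
df["Nationality"] = df["Nationality"].map(nationality_codes)
df["Go"] = df["Go"].map(go_codes)

print(df)
```

Any value missing from the dictionary becomes NaN, so the dictionary should cover every value that occurs in the column.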
Rank
Rank <= 6.5 means that every comedian with a rank of 6.5 or lower will follow the True arrow (to the left), and
the rest will follow the False arrow (to the right).
gini = 0.497 refers to the quality of the split, and is always a number between 0.0 and 0.5, where 0.0 would
mean all of the samples got the same result, and 0.5 would mean that the split is done exactly in the middle.
samples = 13 means that there are 13 comedians left at this point in the decision, which is all of them since this
is the first step.
value = [6, 7] means that of these 13 comedians, 6 will get a "NO", and 7 will get a "GO".
Gini
There are many ways to split the samples; we use the GINI method in this tutorial.
The Gini method uses this formula:
Gini = 1 - (x/n)^2 - (y/n)^2
Where x is the number of positive answers ("GO"), n is the number of samples, and y is the number of negative answers ("NO"), which gives us this calculation:
1 - (7/13)^2 - (6/13)^2 = 0.497
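The calculation can be checked with a short function. This sketch assumes the standard two-class Gini impurity, 1 - (x/n)^2 - (y/n)^2, and uses the root-node counts quoted earlier (value = [6, 7]); the function name gini is hypothetical.

```python
def gini(positive, negative):
    """Two-class Gini impurity: 1 - (x/n)^2 - (y/n)^2."""
    n = positive + negative
    return 1 - (positive / n) ** 2 - (negative / n) ** 2


# Root node from the text: value = [6, 7], i.e. 6 "NO" and 7 "GO".
root_gini = gini(positive=7, negative=6)
print(round(root_gini, 3))  # 0.497
```

A node where every sample has the same answer gives 0.0, and an even split gives 0.5, which matches the range described for the gini value.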
References:
https://www.w3schools.com/python/default.asp
To God be
all the Glory!