Numerical Computing with Python (NumPy, Matplotlib)

In [1]: n=int(input("enter the number:"))
m=int(input("enter the number:"))

enter the number:34
enter the number:35

In [2]: import numpy as np
a=np.array(['d','h','r','u','v','i'])
print("numpy array in python:",a)

numpy array in python: ['d' 'h' 'r' 'u' 'v' 'i']

In [3]: n=int(input("square root number:"))
print(n,n**2)

square root number:45
45 2025

In [6]: import numpy as np

[1. 6. 7. 2. 4.]

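In [7]: import numpy as np

num1 = 12
num2 = 14
num = np.add(num1, num2)  # assumed: the three assignment lines did not survive extraction
print("1st number:",num1)
print("2nd number:",num2)
print("output number after addition:",num)

1st number: 12
2nd number: 14
output number after addition: 26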
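The same addition extends to whole arrays. The sketch below is only an illustration: the scalars come from the cell output above, while the arrays `a` and `b` are made-up values, not data from the notebook.

```python
import numpy as np

# Scalars taken from the cell output above
num1, num2 = 12, 14
total = np.add(num1, num2)

# Hypothetical arrays, chosen only to show element-wise addition
a = np.array([1, 2, 3])
b = np.array([10, 20, 30])
elementwise = np.add(a, b)  # same result as a + b

print(total, elementwise)
```

`np.add` and the `+` operator give the same result here; the function form is mainly useful when an `out=` array or broadcasting options are needed.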

In [8]: import numpy as np

In [10]: pip install matplotlib

In [11]: import matplotlib.pyplot as plt


fig = plt.figure(figsize=(5,2))
# the plotting call is truncated in the source; only its arguments survive: (X, Y, color="green")

plt.ylabel("NO OF STUDENT")
plt.title("STUDENT OF CLASS")
In [14]:


fig = plt.figure(figsize=(5,3))

plt.title("marks of student")

In [18]:

import matplotlib.pyplot as plt

In [22]: Y=(15,12,45,55,34,43)

plt.scatter(X,Y, color="red")

plt.ylabel("no of students")
plt.title("student of class")

import numpy as np
y=2*x + np.random.randn(200)

In [30]: x1=[89,43,36,36,95,10,66,34,38,20]

plt.scatter(x1,y1,c ="grey",linewidths=2,marker="x",edgecolor="red",s=150)


C:\Users\DHRUVI\AppData\Local\Temp\ipykernel_21880\ UserWarning: You passed a edgecolor/edgecolors ('red') for an unfilled marker ('x'). Matplotlib is ignoring t
plt.scatter(x1,y1,c ="grey",linewidths=2,marker="x",edgecolor="red",s=150)
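The warning above appears because `"x"` is an unfilled marker: it has no separate edge, so Matplotlib draws it in a single colour and ignores `edgecolor`. A minimal sketch of one way to avoid the warning is to pass the desired colour through `c` alone. The `y1` values below are hypothetical, since the notebook's own `y1` did not survive extraction.

```python
import matplotlib
matplotlib.use("Agg")  # off-screen backend so the sketch runs without a display
import matplotlib.pyplot as plt

x1 = [89, 43, 36, 36, 95, 10, 66, 34, 38, 20]
# Hypothetical y values: the notebook's y1 is not available
y1 = [21, 46, 3, 35, 67, 95, 53, 72, 58, 10]

fig, ax = plt.subplots(figsize=(5, 3))
# For an unfilled marker such as "x", the colour is set through c only
points = ax.scatter(x1, y1, c="red", linewidths=2, marker="x", s=150)
ax.set_xlabel("x1")
ax.set_ylabel("y1")
```

If a two-colour marker is really wanted, a filled marker such as `"o"` or `"X"` accepts both `c` and `edgecolor`.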

Introduction to pandas for data import and export (Excel, CSV, etc.)

In [1]: import pandas as pd
df=pd.read_csv("PMData.csv")

In [5]:

    Excel Sample Data        Unnamed: 1               Unnamed: 2   Unnamed: 3  Unnamed: 4     Unnamed: 5  Unnamed: 6
0   NaN                      NaN                      NaN          NaN         NaN            NaN         NaN
1   Project Management Data  NaN                      NaN          NaN         NaN            NaN         NaN
2   NaN                      NaN                      NaN          NaN         NaN            NaN         NaN
3   Project Name             Task Name                Assigned to  Start Date  Days Required  End Date    Progress
4   Marketing                Market Research          Alice        01-01-2024  13             14-01-2024  78%
5   Marketing                Content Creation         Bob          14-01-2024  14             28-01-2024  100%
6   Marketing                Social Media Planning    Charlie      28-01-2024  22             19-02-2024  45%
7   Marketing                Campaign Analysis        Daisy        18-02-2024  25             14-03-2024  0%
8   Product Dev              Prototype Development    Ethan        02-01-2024  18             20-01-2024  100%
9   Product Dev              Quality Assurance        Fiona        20-01-2024  10             30-01-2024  78%
10  Product Dev              User Interface Design    Gabriel      04-02-2024  25             29-02-2024  0%
11  Customer Svc             Service Improvement      Hannah       01-02-2024  22             23-02-2024  100%
12  Customer Svc             Ticket Resolution        Ian          24-02-2024  25             20-03-2024  100%
13  Customer Svc             Customer Feedback        Julia        21-03-2024  30             20-04-2024  0%
14  Financial                Budget Analysis          Kevin        02-02-2024  22             24-02-2024  10%

In [6]: df.head(15)

    Excel Sample Data        Unnamed: 1               Unnamed: 2   Unnamed: 3  Unnamed: 4     Unnamed: 5  Unnamed: 6
0   NaN                      NaN                      NaN          NaN         NaN            NaN         NaN
1   Project Management Data  NaN                      NaN          NaN         NaN            NaN         NaN
2   NaN                      NaN                      NaN          NaN         NaN            NaN         NaN
3   Project Name             Task Name                Assigned to  Start Date  Days Required  End Date    Progress
4   Marketing                Market Research          Alice        01-01-2024  13             14-01-2024  78%
5   Marketing                Content Creation         Bob          14-01-2024  14             28-01-2024  100%
6   Marketing                Social Media Planning    Charlie      28-01-2024  22             19-02-2024  45%
7   Marketing                Campaign Analysis        Daisy        18-02-2024  25             14-03-2024  0%
8   Product Dev              Prototype Development    Ethan        02-01-2024  18             20-01-2024  100%
9   Product Dev              Quality Assurance        Fiona        20-01-2024  10             30-01-2024  78%
10  Product Dev              User Interface Design    Gabriel      04-02-2024  25             29-02-2024  0%
11  Customer Svc             Service Improvement      Hannah       01-02-2024  22             23-02-2024  100%
12  Customer Svc             Ticket Resolution        Ian          24-02-2024  25             20-03-2024  100%
13  Customer Svc             Customer Feedback        Julia        21-03-2024  30             20-04-2024  0%
14  Financial                Budget Analysis          Kevin        02-02-2024  22             24-02-2024  10%

In [7]: df.tail()

    Excel Sample Data        Unnamed: 1               Unnamed: 2   Unnamed: 3  Unnamed: 4     Unnamed: 5  Unnamed: 6
45  Logistics                Transportation Planning  Patricia     29-01-2024  30             28-02-2024  100%
46  Logistics                Inventory Optimization   Quentin      29-03-2024  20             18-04-2024  0%
47  Engineering              Product Design           Rachel       02-01-2024  25             27-01-2024  20%
48  Engineering              System Integration       Sam          02-02-2024  22             24-02-2024  0%
49  Engineering              Prototype Testing        Tom          23-02-2024  27             21-03-2024  0%

In [8]: df.tail(5)

    Excel Sample Data        Unnamed: 1               Unnamed: 2   Unnamed: 3  Unnamed: 4     Unnamed: 5  Unnamed: 6
45  Logistics                Transportation Planning  Patricia     29-01-2024  30             28-02-2024  100%
46  Logistics                Inventory Optimization   Quentin      29-03-2024  20             18-04-2024  0%
47  Engineering              Product Design           Rachel       02-01-2024  25             27-01-2024  20%
48  Engineering              System Integration       Sam          02-02-2024  22             24-02-2024  0%
49  Engineering              Prototype Testing        Tom          23-02-2024  27             21-03-2024  0%

In [9]: df.shape

(50, 7)

In [12]: name=['dhruvi','yamini','vishal','jay','meet']
deploma=['IT','CSE','IT-D','BIO','CS']  # assumed: rebuilt from the displayed table, the original line was lost
score=[40,23,48,34,45]  # assumed: rebuilt from the displayed table
dict={'name':name,'deploma':deploma,'score':score}  # assumed: rebuilt from the column names
df =pd.DataFrame(dict)

     name deploma  score
0  dhruvi      IT     40
1  yamini     CSE     23
2  vishal    IT-D     48
3     jay     BIO     34
4    meet      CS     45

In [13]: df.to_csv("Dhruvi.csv")
df.to_excel("dhruvi.xlsx")

Out[13]:
     name deploma  score
0  dhruvi      IT     40
1  yamini     CSE     23
2  vishal    IT-D     48
3     jay     BIO     34
4    meet      CS     45

In [14]:

     name deploma  score
0  dhruvi      IT     40
1  yamini     CSE     23
2  vishal    IT-D     48
3     jay     BIO     34
4    meet      CS     45

In [15]: df.shape

(5, 3)

In [16]: df.values
array([['dhruvi', 'IT', 40],
       ['yamini', 'CSE', 23],
       ['vishal', 'IT-D', 48],
       ['jay', 'BIO', 34],
       ['meet', 'CS', 45]], dtype=object)

In [17]: df.describe()

           score
count   5.000000
mean   38.000000
std     9.924717
min    23.000000
25%    34.000000
50%    40.000000
75%    45.000000
max    48.000000

In [18]: df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype
---  ------   --------------  -----
 0   name     5 non-null      object
 1   deploma  5 non-null      object
 2   score    5 non-null      int64
dtypes: int64(1), object(2)
memory usage: 248.0+ bytes
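As a rough sketch of the export and import round trip described in this section, the block below writes the small student table to CSV and reads it back. The file name `students.csv` and the temporary directory are assumptions made for the example; the Excel path is left out because `to_excel` needs an extra engine such as openpyxl.

```python
import os
import tempfile

import pandas as pd

# Same columns and values as the DataFrame displayed above
records = {
    "name": ["dhruvi", "yamini", "vishal", "jay", "meet"],
    "deploma": ["IT", "CSE", "IT-D", "BIO", "CS"],
    "score": [40, 23, 48, 34, 45],
}
students = pd.DataFrame(records)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "students.csv")  # hypothetical file name
    students.to_csv(path, index=False)  # index=False avoids an extra "Unnamed: 0" column
    restored = pd.read_csv(path)

print(restored.shape)
print(restored["score"].mean())
```

Writing without `index=False` and reading the file back is a common source of stray `Unnamed: 0` columns, which is why the option is set explicitly here.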
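Practical 3

Basic introduction to scikit-learn

In [23]: from sklearn.datasets import load_iris
iris = load_iris()  # assumed: the loading lines were lost in extraction; x and y are used by later cells
x = iris.data
y = iris.target
print(x, y)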

[[5.1 3.5 1.4 0.2]

[4.9 3. 1.4 0.2]
[4.7 3.2 1.3 0.2]
[4.6 3.1 1.5 0.2]
[5. 3.6 1.4 0.2]
[5.4 3.9 1.7 0.4]
[4.6 3.4 1.4 0.3]
[5. 3.4 1.5 0.2]
[4.4 2.9 1.4 0.2]
[4.9 3.1 1.5 0.1]
[5.4 3.7 1.5 0.2]
[4.8 3.4 1.6 0.2]
[4.8 3. 1.4 0.1]
[4.3 3. 1.1 0.1]
[5.8 4. 1.2 0.2]
[5.7 4.4 1.5 0.4]
[5.4 3.9 1.3 0.4]
[5.1 3.5 1.4 0.3]
[5.7 3.8 1.7 0.3]
[5.1 3.8 1.5 0.3]
[5.4 3.4 1.7 0.2]
[5.1 3.7 1.5 0.4]
[4.6 3.6 1. 0.2]
[5.1 3.3 1.7 0.5]
[4.8 3.4 1.9 0.2]
[5. 3. 1.6 0.2]
[5. 3.4 1.6 0.4]
[5.2 3.5 1.5 0.2]
[5.2 3.4 1.4 0.2]
[4.7 3.2 1.6 0.2]
[4.8 3.1 1.6 0.2]
[5.4 3.4 1.5 0.4]
[5.2 4.1 1.5 0.1]
[5.5 4.2 1.4 0.2]
[4.9 3.1 1.5 0.2]
[5. 3.2 1.2 0.2]
[5.5 3.5 1.3 0.2]
[4.9 3.6 1.4 0.1]
[4.4 3. 1.3 0.2]
[5.1 3.4 1.5 0.2]
[5. 3.5 1.3 0.3]
[4.5 2.3 1.3 0.3]
[4.4 3.2 1.3 0.2]
[5. 3.5 1.6 0.6]
[5.1 3.8 1.9 0.4]
[4.8 3. 1.4 0.3]
[5.1 3.8 1.6 0.2]
[4.6 3.2 1.4 0.2]
[5.3 3.7 1.5 0.2]
[5. 3.3 1.4 0.2]
[7. 3.2 4.7 1.4]
[6.4 3.2 4.5 1.5]
[6.9 3.1 4.9 1.5]
[5.5 2.3 4. 1.3]
[6.5 2.8 4.6 1.5]
[5.7 2.8 4.5 1.3]
[6.3 3.3 4.7 1.6]
[4.9 2.4 3.3 1. ]
[6.6 2.9 4.6 1.3]
[5.2 2.7 3.9 1.4]
[5. 2. 3.5 1. ]
[5.9 3. 4.2 1.5]
[6. 2.2 4. 1. ]
[6.1 2.9 4.7 1.4]
[5.6 2.9 3.6 1.3]
[6.7 3.1 4.4 1.4]
[5.6 3. 4.5 1.5]
[5.8 2.7 4.1 1. ]
[6.2 2.2 4.5 1.5]
[5.6 2.5 3.9 1.1]
[5.9 3.2 4.8 1.8]
[6.1 2.8 4. 1.3]
[6.3 2.5 4.9 1.5]
[6.1 2.8 4.7 1.2]
[6.4 2.9 4.3 1.3]
[6.6 3. 4.4 1.4]
[6.8 2.8 4.8 1.4]
[6.7 3. 5. 1.7]
[6. 2.9 4.5 1.5]
[5.7 2.6 3.5 1. ]
[5.5 2.4 3.8 1.1]
[5.5 2.4 3.7 1. ]
[5.8 2.7 3.9 1.2]
[6. 2.7 5.1 1.6]
[5.4 3. 4.5 1.5]
[6. 3.4 4.5 1.6]
[6.7 3.1 4.7 1.5]
[6.3 2.3 4.4 1.3]
[5.6 3. 4.1 1.3]
[5.5 2.5 4. 1.3]
[5.5 2.6 4.4 1.2]
[6.1 3. 4.6 1.4]
[5.8 2.6 4. 1.2]
[5. 2.3 3.3 1. ]
[5.6 2.7 4.2 1.3]
[5.7 3. 4.2 1.2]
[5.7 2.9 4.2 1.3]
[6.2 2.9 4.3 1.3]
[5.1 2.5 3. 1.1]
[5.7 2.8 4.1 1.3]
[6.3 3.3 6. 2.5]
[5.8 2.7 5.1 1.9]
[7.1 3. 5.9 2.1]
[6.3 2.9 5.6 1.8]
[6.5 3. 5.8 2.2]
[7.6 3. 6.6 2.1]
[4.9 2.5 4.5 1.7]
[7.3 2.9 6.3 1.8]
[6.7 2.5 5.8 1.8]
[7.2 3.6 6.1 2.5]
[6.5 3.2 5.1 2. ]
[6.4 2.7 5.3 1.9]
[6.8 3. 5.5 2.1]
[5.7 2.5 5. 2. ]
[5.8 2.8 5.1 2.4]
[6.4 3.2 5.3 2.3]
[6.5 3. 5.5 1.8]
[7.7 3.8 6.7 2.2]
[7.7 2.6 6.9 2.3]
[6. 2.2 5. 1.5]
[6.9 3.2 5.7 2.3]
[5.6 2.8 4.9 2. ]
[7.7 2.8 6.7 2. ]
[6.3 2.7 4.9 1.8]
[6.7 3.3 5.7 2.1]
[7.2 3.2 6. 1.8]
[6.2 2.8 4.8 1.8]
[6.1 3. 4.9 1.8]
[6.4 2.8 5.6 2.1]
[7.2 3. 5.8 1.6]
[7.4 2.8 6.1 1.9]
[7.9 3.8 6.4 2. ]
[6.4 2.8 5.6 2.2]
[6.3 2.8 5.1 1.5]
[6.1 2.6 5.6 1.4]
[7.7 3. 6.1 2.3]
[6.3 3.4 5.6 2.4]
[6.4 3.1 5.5 1.8]
[6. 3. 4.8 1.8]
[6.9 3.1 5.4 2.1]
[6.7 3.1 5.6 2.4]
[6.9 3.1 5.1 2.3]
[5.8 2.7 5.1 1.9]
[6.8 3.2 5.9 2.3]
[6.7 3.3 5.7 2.5]
[6.7 3. 5.2 2.3]
[6.3 2.5 5. 1.9]
[6.5 3. 5.2 2. ]
[6.2 3.4 5.4 2.3]
[5.9 3. 5.1 1.8]] [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
2 2]

In [25]: from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.5, random_state=50)
print(x_train)

[[4.6 3.2 1.4 0.2]

[6.3 2.3 4.4 1.3]
[5.2 4.1 1.5 0.1]
[5.5 2.5 4. 1.3]
[6.9 3.1 4.9 1.5]
[4.7 3.2 1.6 0.2]
[4.9 3.1 1.5 0.1]
[5.9 3. 4.2 1.5]
[4.9 3. 1.4 0.2]
[6. 2.7 5.1 1.6]
[4.8 3. 1.4 0.3]
[5.5 2.6 4.4 1.2]
[6.1 3. 4.9 1.8]
[7.2 3. 5.8 1.6]
[7.7 3. 6.1 2.3]
[6.6 2.9 4.6 1.3]
[6.3 2.7 4.9 1.8]
[5.5 3.5 1.3 0.2]
[5.8 2.7 5.1 1.9]
[4.3 3. 1.1 0.1]
[6. 2.2 4. 1. ]
[5.1 3.8 1.6 0.2]
[6.3 3.4 5.6 2.4]
[4.8 3.4 1.9 0.2]
[5.2 3.4 1.4 0.2]
[6. 3. 4.8 1.8]
[5.9 3. 5.1 1.8]
[6.9 3.2 5.7 2.3]
[6.7 3.3 5.7 2.1]
[4.8 3.4 1.6 0.2]
[6.2 3.4 5.4 2.3]
[5.6 2.7 4.2 1.3]
[6.7 2.5 5.8 1.8]
[5. 2.3 3.3 1. ]
[5.1 3.5 1.4 0.2]
[6.4 3.2 4.5 1.5]
[6.5 3.2 5.1 2. ]
[5.4 3.7 1.5 0.2]
[6.2 2.8 4.8 1.8]
[5.8 2.7 4.1 1. ]
[5.7 2.9 4.2 1.3]
[6.8 2.8 4.8 1.4]
[5.6 3. 4.5 1.5]
[5.6 2.8 4.9 2. ]
[5. 2. 3.5 1. ]
[5. 3.4 1.6 0.4]
[6.4 3.2 5.3 2.3]
[5. 3.2 1.2 0.2]
[7.6 3. 6.6 2.1]
[4.8 3.1 1.6 0.2]
[5.7 2.6 3.5 1. ]
[6.9 3.1 5.1 2.3]
[5.1 3.8 1.5 0.3]
[4.6 3.4 1.4 0.3]
[5.6 2.9 3.6 1.3]
[4.9 2.5 4.5 1.7]
[6. 3.4 4.5 1.6]
[5. 3.3 1.4 0.2]
[5.4 3.4 1.5 0.4]
[5. 3.5 1.6 0.6]
[6.1 2.6 5.6 1.4]
[6.1 3. 4.6 1.4]
[5.8 2.6 4. 1.2]
[6.4 2.7 5.3 1.9]
[6.1 2.8 4. 1.3]
[5.7 3. 4.2 1.2]
[4.7 3.2 1.3 0.2]
[6.3 2.8 5.1 1.5]
[4.6 3.6 1. 0.2]
[6.7 3. 5.2 2.3]
[5.9 3.2 4.8 1.8]
[6.4 2.8 5.6 2.2]
[5.5 4.2 1.4 0.2]
[7.2 3.6 6.1 2.5]
[6.9 3.1 5.4 2.1]]

In [26]: from sklearn.linear_model import LogisticRegression
model = LogisticRegression().fit(x_train, y_train)
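The cell above stops once the model is fitted. A minimal, self-contained sketch of the full load, split, fit and score sequence might look like the following. `max_iter=1000` is an assumption added to reduce the chance of a convergence warning; the notebook itself uses the default.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

x, y = load_iris(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.5, random_state=50
)

# max_iter is raised as a precaution (an assumption, not the notebook's setting)
model = LogisticRegression(max_iter=1000).fit(x_train, y_train)
test_accuracy = model.score(x_test, y_test)  # mean accuracy on the held-out half
print("test accuracy:", test_accuracy)
```

`model.score` is shorthand for predicting on `x_test` and passing the result to `accuracy_score`.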


Practical 5

Import the Pima Indian diabetes data. Apply SelectKBest with chi2 for feature selection and identify the best features.

In [28]: import pandas as pd

import numpy as np
from sklearn.feature_selection import SelectKBest, chi2
url = ""
names = ['preg', 'Glucose', 'pres', 'skin', 'Insulin', 'BMI', 'Pedi', 'age', 'Outcome']
dataset = pd.read_csv(url, names=names)
x = dataset.iloc[:, :-1]
y = dataset.iloc[:, -1]
kbest = SelectKBest(score_func=chi2, k=5)
x_new = kbest.fit_transform(x,y)
mask = kbest.get_support()
best_feature = x.columns[mask]
print(best_feature)

Index(['preg', 'Glucose', 'Insulin', 'BMI', 'age'], dtype='object')
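Because the data URL is blank in the cell above, the following sketch applies the same `SelectKBest` and `chi2` steps to the Iris data bundled with scikit-learn instead. This is a substitute dataset chosen only so the example runs on its own; it also prints the per-feature scores, which the notebook cell does not show.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

iris = load_iris()
features = pd.DataFrame(iris.data, columns=iris.feature_names)

# chi2 requires non-negative feature values, which holds for these measurements
selector = SelectKBest(score_func=chi2, k=2).fit(features, iris.target)

# One chi-squared score per column; higher means stronger dependence on the class
scores = pd.Series(selector.scores_, index=features.columns).sort_values(ascending=False)
selected = list(features.columns[selector.get_support()])

print(scores)
print(selected)
```

Inspecting `scores_` alongside `get_support()` helps judge how close the first rejected feature was to the cut-off set by `k`.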

Practical 6

Write a program to learn a decision tree and use it to predict class labels of test data. Training and test data will be explicitly provided by the instructor. Tree pruning should not be performed.

In [30]: from sklearn.tree import DecisionTreeClassifier

from sklearn.metrics import accuracy_score
X_train = [[1, 2], [3, 4], [4, 3], [3, 4], [1, 2], [1, 4], [1, 2]]
y_train = [1, 0, 1, 1, 0, 0, 1]
X_test = [[2, 2], [4, 3], [5, 5], [6, 2]]
y_test = [0, 1, 0, 1]
clf = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
Accuracy: 0.75
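To confirm that the tree is grown without pruning, it can help to look at the learned rules. The sketch below refits the same training data and prints the tree as text; the feature names `f0` and `f1` are placeholders invented for the example.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

X_train = [[1, 2], [3, 4], [4, 3], [3, 4], [1, 2], [1, 4], [1, 2]]
y_train = [1, 0, 1, 1, 0, 0, 1]

# Default settings (ccp_alpha=0.0, max_depth=None) grow the tree fully, with no pruning
clf = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

# "f0" and "f1" are placeholder feature names chosen for this sketch
rules = export_text(clf, feature_names=["f0", "f1"])
print(rules)
print("depth:", clf.get_depth(), "leaves:", clf.get_n_leaves())
```

Note that the training set contains identical points with different labels, so some leaves cannot be pure no matter how deep the tree grows.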

Practical 7

ML Project. Use the following dataset as music.csv | a. Store the file as music.csv and import it into Python using pandas | b. Prepare the data by splitting it into input (age, gender) and output (genre) data sets | c. Use a decision tree model from sklearn to predict the genre for people of various age groups | d. Calculate the accuracy of the model | e. Vary the training and test sizes to check the different accuracy values the model achieves.

In [31]: import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
data = [
["age", "genre", "gender"],
[20, "Rock", "M"],
[60, "Jazz", "F"],
[23, "Pop", "F"],
[30, "Classical", "M"],
[34, "Electronic", "F"],
[56, "Rock", "F"],
[45, "Hip-Hop", "M"],
[23, "Classical", "F"],
[56, "Pop", "M"],
[45, "Electronic", "M"]
]
df_music = pd.DataFrame(data)
0 1 2
0 age genre gender
1 20 Rock M
2 60 Jazz F
3 23 Pop F
4 30 Classical M
5 34 Electronic F
6 56 Rock F
7 45 Hip-Hop M
8 23 Classical F
9 56 Pop M
10 45 Electronic M

In [32]: df_music.to_csv("music.csv")

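In [33]:

0 1 2
0 age genre gender
1 20 Rock M
2 60 Jazz F
3 23 Pop F
4 30 Classical M
5 34 Electronic F
6 56 Rock F
7 45 Hip-Hop M
8 23 Classical F
9 56 Pop M
10 45 Electronic M

In [34]: df_music.head()

0 1 2
0 age genre gender
1 20 Rock M
2 60 Jazz F
3 23 Pop F
4 30 Classical M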

In [35]: import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
df = pd.read_csv("music.csv")
X = df.iloc[:, :-1].values
y = df.iloc[:, -1].values
X[:, 1] = pd.factorize(X[:, 1])[0]
X[:, 2] = pd.factorize(X[:, 2])[0]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
clf = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
y_pred = clf.predict(X_test)
accuracy_decision = accuracy_score(y_test, y_pred)
print("accuracy:", accuracy_decision)

accuracy: 0.25
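The cell above takes the last column (gender) as the target and keeps the header row as data, so it does not quite match steps b to e of the task. One possible variant, given here only as a sketch, names the columns explicitly, uses age and gender as inputs and genre as output, and loops over several test sizes. The three test sizes and the `random_state` are arbitrary choices for the example.

```python
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rows = [
    [20, "Rock", "M"], [60, "Jazz", "F"], [23, "Pop", "F"],
    [30, "Classical", "M"], [34, "Electronic", "F"], [56, "Rock", "F"],
    [45, "Hip-Hop", "M"], [23, "Classical", "F"], [56, "Pop", "M"],
    [45, "Electronic", "M"],
]
music = pd.DataFrame(rows, columns=["age", "genre", "gender"])

# Inputs: age and a numeric code for gender; output: genre
X = pd.DataFrame({"age": music["age"], "gender": pd.factorize(music["gender"])[0]})
y = music["genre"]

results = {}
for test_size in (0.2, 0.3, 0.5):  # arbitrary sizes for step e
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=42
    )
    clf = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
    results[test_size] = accuracy_score(y_test, clf.predict(X_test))

print(results)
```

With only ten rows and six genres, the accuracy values are expected to be low and to swing widely between splits, so they say little about the model itself.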

Practical 8

Write a program to use a k-nearest neighbor classifier to predict class labels of test data. Training and test data must be provided explicitly.

In [37]: from sklearn.neighbors import KNeighborsClassifier

from sklearn.metrics import accuracy_score
X_train = [[1, 4], [4, 2], [6, 2], [3, 4], [4, 5], [4, 4], [5, 3]]
y_train = [1, 0, 1, 1, 0, 0, 1]
X_test = [[2, 2], [4, 3], [5, 5], [6, 2]]
y_test = [0, 1, 0, 1]
clf = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

Accuracy: 0.5
D:\anaconda\lib\site-packages\sklearn\neighbors\ FutureWarning: Unlike other reduction functions (e.g. `skew`, `kurtosis`), the default behavior of `mode`
mode, _ = stats.mode(_y[neigh_ind, k], axis=1)

Practical 9

Import vgsales.csv from the Kaggle platform. | a. Find rows and columns in the dataset

In [38]: dg_vgsales = pd.read_csv("vgsales.csv")
dg_vgsales.head()


   Rank  Name             Platform  Year  Genre     Publisher   NA_Sales  EU_Sales  JP_Sales  Other_Sales  Global_Sales
0  259   Asteroids        2600      1980  Shooter   Atari       4.00      0.26      0.0       0.05         4.31
1  545   Missile Command  2600      1980  Shooter   Atari       2.56      0.17      0.0       0.03         2.76
2  1768  Kaboom!          2600      1980  Misc      Activision  1.07      0.07      0.0       0.01         1.15
3  1971  Defender         2600      1980  Misc      Atari       0.99      0.05      0.0       0.01         1.05
4  2671  Boxing           2600      1980  Fighting  Activision  0.72      0.04      0.0       0.01         0.77

In [39]: dg_vgsales.tail()

Out[39]:
       Rank   Name                                                Platform  Year  Genre       Publisher             NA_Sales  EU_Sales  JP_Sales  Other_Sales  Global_Sales
16319  16565  Mighty No. 9                                        XOne      2016  Platform    Deep Silver           0.01      0.00      0.00      0.0          0.01
16320  16572  Resident Evil 4 HD                                  XOne      2016  Shooter     Capcom                0.01      0.00      0.00      0.0          0.01
16321  16573  Farming 2017 - The Simulation                       PS4       2016  Simulation  UIG Entertainment     0.00      0.01      0.00      0.0          0.01
16322  16579  Rugby Challenge 3                                   XOne      2016  Sports      Alternative Software  0.00      0.01      0.00      0.0          0.01
16323  16592  Chou Ezaru wa Akai Hana: Koi wa Tsuki ni Shiru...   PSV       2016  Action      dramatic create       0.00      0.00      0.01      0.0          0.01

In [41]: dg_vgsales.shape

(16324, 11)

b. Find basic information regarding the dataset using the describe command.

In [42]: dg_vgsales.describe()

       Rank          Year          NA_Sales      EU_Sales      JP_Sales      Other_Sales   Global_Sales
count  16324.000000  16324.000000  16324.000000  16324.000000  16324.000000  16324.000000  16324.000000
mean   8291.508270   2006.404251   0.265464      0.147581      0.078673      0.048334      0.540328
std    4792.043734   5.826744      0.821658      0.508809      0.311584      0.189902      1.565860
min    1.000000      1980.000000   0.000000      0.000000      0.000000      0.000000      0.010000
25%    4135.750000   2003.000000   0.000000      0.000000      0.000000      0.000000      0.060000
50%    8293.500000   2007.000000   0.080000      0.020000      0.000000      0.010000      0.170000
75%    12439.250000  2010.000000   0.240000      0.110000      0.040000      0.040000      0.480000
max    16600.000000  2016.000000   41.490000     29.020000     10.220000     10.570000     82.740000

In [43]: dg_vgsales.values
array([[259, 'Asteroids', '2600', ..., 0.0, 0.05, 4.31],
       [545, 'Missile Command', '2600', ..., 0.0, 0.03, 2.76],
       [1768, 'Kaboom!', '2600', ..., 0.0, 0.01, 1.15],
       ...,
       [16573, 'Farming 2017 - The Simulation', 'PS4', ..., 0.0, 0.0, 0.01],
       [16579, 'Rugby Challenge 3', 'XOne', ..., 0.0, 0.0, 0.01],
       [16592, 'Chou Ezaru wa Akai Hana: Koi wa Tsuki ni Shirube Kareru',
        'PSV', ..., 0.01, 0.0, 0.01]], dtype=object)

Practical 10

Project on regression | a. Import home_data.csv from Kaggle using pandas

In [45]: dh_home = pd.read_csv("home_data.csv")

b. Understand data by running head, info and describe command

In [46]: dh_home.head()

   id          date             price   bedrooms  bathrooms  sqft_living  sqft_lot  floors  waterfront  view  ...  grade  sqft_above  sqft_basement  yr_built  yr_renovated  zipcode  lat      long      sqft_living15  sqft_lot15
0  7129300520  20141013T000000  221900  3         1.00       1180         5650      1.0     0           0     ...  7      1180        0              1955      0             98178    47.5112  -122.257  1340           5650
1  6414100192  20141209T000000  538000  3         2.25       2570         7242      2.0     0           0     ...  7      2170        400            1951      1991          98125    47.7210  -122.319  1690           7639
2  5631500400  20150225T000000  180000  2         1.00       770          10000     1.0     0           0     ...  6      770         0              1933      0             98028    47.7379  -122.233  2720           8062
3  2487200875  20141209T000000  604000  4         3.00       1960         5000      1.0     0           0     ...  7      1050        910            1965      0             98136    47.5208  -122.393  1360           5000
4  1954400510  20150218T000000  510000  3         2.00       1680         8080      1.0     0           0     ...  8      1680        0              1987      0             98074    47.6168  -122.045  1800           7503

5 rows × 21 columns

In [47]:

       id          date             price   bedrooms  bathrooms  sqft_living  sqft_lot  floors  waterfront  view  ...  grade  sqft_above  sqft_basement  yr_built  yr_renovated  zipcode  lat      long      sqft_living15  sqft_lot15
21609  6600060120  20150223T000000  400000  4         2.50       2310         5813      2.0     0           0     ...  8      2310        0              2014      0             98146    47.5107  -122.362  1830           7200
21610  1523300141  20140623T000000  402101  2         0.75       1020         1350      2.0     0           0     ...  7      1020        0              2009      0             98144    47.5944  -122.299  1020           2007
21611  291310100   20150116T000000  400000  3         2.50       1600         2388      2.0     0           0     ...  8      1600        0              2004      0             98027    47.5345  -122.069  1410           1287
21612  1523300157  20141015T000000  325000  2         0.75       1020         1076      2.0     0           0     ...  7      1020        0              2008      0             98144    47.5941  -122.299  1020           1357

4 rows × 21 columns

In [48]: dh_home.shape

(21613, 21)

In [49]: dh_home.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21613 entries, 0 to 21612
Data columns (total 21 columns):
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0   id             21613 non-null  int64
 1   date           21613 non-null  object
 2   price          21613 non-null  int64
 3   bedrooms       21613 non-null  int64
 4   bathrooms      21613 non-null  float64
 5   sqft_living    21613 non-null  int64
 6   sqft_lot       21613 non-null  int64
 7   floors         21613 non-null  float64
 8   waterfront     21613 non-null  int64
 9   view           21613 non-null  int64
 10  condition      21613 non-null  int64
 11  grade          21613 non-null  int64
 12  sqft_above     21613 non-null  int64
 13  sqft_basement  21613 non-null  int64
 14  yr_built       21613 non-null  int64
 15  yr_renovated   21613 non-null  int64
 16  zipcode        21613 non-null  int64
 17  lat            21613 non-null  float64
 18  long           21613 non-null  float64
 19  sqft_living15  21613 non-null  int64
 20  sqft_lot15     21613 non-null  int64
dtypes: float64(4), int64(16), object(1)
memory usage: 3.5+ MB

In [50]: dh_home.describe().T

               count    mean           std           min            25%            50%            75%            max
id             21613.0  4.580302e+09   2.876566e+09  1.000102e+06   2.123049e+09   3.904930e+09   7.308900e+09   9.900000e+09
price          21613.0  5.400881e+05   3.671272e+05  7.500000e+04   3.219500e+05   4.500000e+05   6.450000e+05   7.700000e+06
bedrooms       21613.0  3.370842e+00   9.300618e-01  0.000000e+00   3.000000e+00   3.000000e+00   4.000000e+00   3.300000e+01
bathrooms      21613.0  2.114757e+00   7.701632e-01  0.000000e+00   1.750000e+00   2.250000e+00   2.500000e+00   8.000000e+00
sqft_living    21613.0  2.079900e+03   9.184409e+02  2.900000e+02   1.427000e+03   1.910000e+03   2.550000e+03   1.354000e+04
sqft_lot       21613.0  1.510697e+04   4.142051e+04  5.200000e+02   5.040000e+03   7.618000e+03   1.068800e+04   1.651359e+06
floors         21613.0  1.494309e+00   5.399889e-01  1.000000e+00   1.000000e+00   1.500000e+00   2.000000e+00   3.500000e+00
waterfront     21613.0  7.541757e-03   8.651720e-02  0.000000e+00   0.000000e+00   0.000000e+00   0.000000e+00   1.000000e+00
view           21613.0  2.343034e-01   7.663176e-01  0.000000e+00   0.000000e+00   0.000000e+00   0.000000e+00   4.000000e+00
condition      21613.0  3.409430e+00   6.507430e-01  1.000000e+00   3.000000e+00   3.000000e+00   4.000000e+00   5.000000e+00
grade          21613.0  7.656873e+00   1.175459e+00  1.000000e+00   7.000000e+00   7.000000e+00   8.000000e+00   1.300000e+01
sqft_above     21613.0  1.788391e+03   8.280910e+02  2.900000e+02   1.190000e+03   1.560000e+03   2.210000e+03   9.410000e+03
sqft_basement  21613.0  2.915090e+02   4.425750e+02  0.000000e+00   0.000000e+00   0.000000e+00   5.600000e+02   4.820000e+03
yr_built       21613.0  1.971005e+03   2.937341e+01  1.900000e+03   1.951000e+03   1.975000e+03   1.997000e+03   2.015000e+03
yr_renovated   21613.0  8.440226e+01   4.016792e+02  0.000000e+00   0.000000e+00   0.000000e+00   0.000000e+00   2.015000e+03
zipcode        21613.0  9.807794e+04   5.350503e+01  9.800100e+04   9.803300e+04   9.806500e+04   9.811800e+04   9.819900e+04
lat            21613.0  4.756005e+01   1.385637e-01  4.715590e+01   4.747100e+01   4.757180e+01   4.767800e+01   4.777760e+01
long           21613.0  -1.222139e+02  1.408283e-01  -1.225190e+02  -1.223280e+02  -1.222300e+02  -1.221250e+02  -1.213150e+02
sqft_living15  21613.0  1.986552e+03   6.853913e+02  3.990000e+02   1.490000e+03   1.840000e+03   2.360000e+03   6.210000e+03
sqft_lot15     21613.0  1.276846e+04   2.730418e+04  6.510000e+02   5.100000e+03   7.620000e+03   1.008300e+04   8.712000e+05

In [51]: dh_home.values
array([[7129300520, '20141013T000000', 221900, ..., -122.257, 1340, 5650],
       [6414100192, '20141209T000000', 538000, ..., -122.319, 1690, 7639],
       [5631500400, '20150225T000000', 180000, ..., -122.233, 2720, 8062],
       ...,
       [1523300141, '20140623T000000', 402101, ..., -122.299, 1020, 2007],
       [291310100, '20150116T000000', 400000, ..., -122.069, 1410, 1287],
       [1523300157, '20141015T000000', 325000, ..., -122.299, 1020, 1357]],
      dtype=object)

In [52]: import matplotlib.pyplot as plt

plt.scatter(dh_home['sqft_living'], dh_home['price'])

d. Apply Linear Regression model to predict the price

In [53]: from sklearn.linear_model import LinearRegression

m = LinearRegression().fit(dh_home[['sqft_living']], dh_home['price'])
pred_price = m.predict([[2000]])

D:\anaconda\lib\site-packages\sklearn\ UserWarning: X does not have valid feature names, but LinearRegression was fitted with feature names
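The warning above is raised because the model was fitted on a DataFrame with a named column but asked to predict from a plain nested list. A minimal sketch of one way around it is to predict from a DataFrame carrying the same column name. The four rows below are made-up, exactly linear values (price = 150 * sqft_living + 50000) so the example runs without home_data.csv.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical, exactly linear data: price = 150 * sqft_living + 50000
homes = pd.DataFrame({"sqft_living": [1000, 1500, 2000, 2500]})
homes["price"] = 150 * homes["sqft_living"] + 50000

m = LinearRegression().fit(homes[["sqft_living"]], homes["price"])

# Predict with a DataFrame that has the same column name used during fitting
query = pd.DataFrame({"sqft_living": [1800]})
pred_price = m.predict(query)
print(pred_price[0])
```

On the real data the same pattern would be `m.predict(pd.DataFrame({"sqft_living": [2000]}))`.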

Practical 11

Write a program to cluster a set of points using K-means. Training and test data must be provided explicitly.

In [56]: pip install threadpoolctl==3.1.0

Requirement already satisfied: threadpoolctl==3.1.0 in d:\anaconda\lib\site-packages (3.1.0)
Note: you may need to restart the kernel to use updated packages.

In [57]: import numpy as np

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split
x_train = np.array([[2, 4], [4, 5], [5, 2], [6, 4], [5, 5], [4, 2], [5, 2]])
x_test = np.array([[2, 2], [4, 3], [5, 5], [6, 2]])
km = KMeans(n_clusters=3).fit(x_train)
y_pred = km.predict(x_test)
plt.scatter(x_train[:, 0], x_train[:, 1], c=km.labels_)
plt.scatter(x_test[:, 0], x_test[:, 1], marker="x", s=150, linewidths=1, c=y_pred)

Practical 12

Import Iris

In [59]: di_iris = pd.read_csv("Iris.csv")

a. Find rows and columns using the shape command

In [60]: di_iris.shape

(150, 6)

b. Print first 30 instances using the head command

In [61]: di_iris.head(10)

   Id  SepalLengthCm  SepalWidthCm  PetalLengthCm  PetalWidthCm      Species
0   1            5.1           3.5            1.4           0.2  Iris-setosa
1   2            4.9           3.0            1.4           0.2  Iris-setosa
2   3            4.7           3.2            1.3           0.2  Iris-setosa
3   4            4.6           3.1            1.5           0.2  Iris-setosa
4   5            5.0           3.6            1.4           0.2  Iris-setosa
5   6            5.4           3.9            1.7           0.4  Iris-setosa
6   7            4.6           3.4            1.4           0.3  Iris-setosa
7   8            5.0           3.4            1.5           0.2  Iris-setosa
8   9            4.4           2.9            1.4           0.2  Iris-setosa
9  10            4.9           3.1            1.5           0.1  Iris-setosa

c. Find out the data instances in each class

In [62]: dd = di_iris.groupby('Species').size()
print(dd)

Iris-setosa 50
Iris-versicolor 50
Iris-virginica 50
dtype: int64

d. Plot the univariate graphs (box plot and histograms)

In [63]: di_iris.boxplot(column='SepalLengthCm', by='Species')
plt.title("Box Plot of Sepal Length")

In [64]: di_iris.hist(column='PetalWidthCm', by='Species')
plt.suptitle("Histogram of Petal Width")

e. Plot the multivariate plot (scatter matrix)

In [65]: from pandas.plotting import scatter_matrix

scatter_matrix(di_iris, alpha=0.5, figsize=(10,10), diagonal='hist')

f. Split data to train model by 80% data values.

In [66]: from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(di_iris.iloc[:,:-1], di_iris.iloc[:, -1], test_size=10)
print("X Training Shape", x_train.shape)
print("X Testing Shape", x_test.shape)
print("Y Training Shape", y_train.shape)
print("Y Testing Shape", y_test.shape)

X Training Shape (140, 5)
X Testing Shape (10, 5)
Y Training Shape (140,)
Y Testing Shape (10,)
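Step f asks for 80% of the data for training, while the cell above passes `test_size=10`, which scikit-learn reads as ten samples rather than a fraction. A short sketch of an 80/20 split follows; it uses the Iris data bundled with scikit-learn instead of Iris.csv, and `random_state=0` and `stratify=y` are choices made for the example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

x, y = load_iris(return_X_y=True)

# A float test_size is a fraction of the data; an int is an absolute sample count
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=0, stratify=y
)
print(x_train.shape, x_test.shape)
```

`stratify=y` keeps the three species equally represented in both parts, which matters for small test sets.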

Apply K-NN and k-means clustering to check accuracy and decide which is better

In [67]: from sklearn.neighbors import KNeighborsClassifier

from sklearn.cluster import KMeans
from sklearn.metrics import accuracy_score
knn = KNeighborsClassifier(n_neighbors=5).fit(x_train, y_train)
knn_pred = knn.predict(x_test)
knn_acc = accuracy_score(y_test, knn_pred)
kmeans = KMeans(n_clusters=4, random_state=50).fit(x_train)
kmeans_pred = kmeans.predict(x_test)
kmeans_acc = accuracy_score(y_test, knn_pred)  # scores knn_pred, not kmeans_pred, so this repeats the K-NN accuracy
print("KNN Acc: ", knn_acc)
print("KMEANS Acc: ", kmeans_acc)

KNN Acc: 1.0
KMEANS Acc: 1.0
D:\anaconda\lib\site-packages\sklearn\neighbors\ FutureWarning: Unlike other reduction functions (e.g. `skew`, `kurtosis`), the default beha
mode, _ = stats.mode(_y[neigh_ind, k], axis=1)
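K-means returns arbitrary cluster ids rather than class labels, so its output cannot be scored against `y_test` directly. One common workaround, sketched below under stated assumptions, maps each cluster to the majority training label inside it before computing accuracy. The example uses scikit-learn's bundled Iris data, three clusters (one per species), an 80/20 split and `n_init=10`; none of these settings come from the notebook.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

x, y = load_iris(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=50, stratify=y
)

# Supervised baseline
knn = KNeighborsClassifier(n_neighbors=5).fit(x_train, y_train)
knn_acc = accuracy_score(y_test, knn.predict(x_test))

# Unsupervised clustering, fitted without labels
kmeans = KMeans(n_clusters=3, n_init=10, random_state=50).fit(x_train)

# Map each cluster id to the most common training label inside that cluster
cluster_to_label = {
    c: np.bincount(y_train[kmeans.labels_ == c]).argmax() for c in range(3)
}
kmeans_pred = np.array([cluster_to_label[c] for c in kmeans.predict(x_test)])
kmeans_acc = accuracy_score(y_test, kmeans_pred)

print("KNN Acc:", knn_acc)
print("KMEANS Acc:", kmeans_acc)
```

The mapping step uses the training labels, so the k-means score is only a rough indication of how well the clusters line up with the species.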

In [ ]:

