

Assignment 1: SET-A

Q.1: Write a python program to create a data frame containing columns named Name, Age, and Percentage. Add 10 rows to the data frame. View the data frame.
import pandas as pd
df=pd.DataFrame(columns=['Name','Age','Percentage'])
df.loc[0]=['subhan',20,78]
df.loc[1]=['tofik',22,45]
df.loc[2]=['rayyan',21,15]
df.loc[3]=['sharif',21,65]
df.loc[4]=['alim',88,99]
df.loc[5]=['shoib',18,97]
df.loc[6]=['danish',19,49]
df.loc[7]=['mustakim',25,6]
df.loc[8]=['mosin',20,78]
df.loc[9]=['arbaz',22,15]
print(df)

Output:
Name Age Percentage
0 subhan 20 78
1 tofik 22 45
2 rayyan 21 15
3 sharif 21 65
4 alim 88 99
5 shoib 18 97
6 danish 19 49
7 mustakim 25 6
8 mosin 20 78
9 arbaz 22 15
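
An equivalent construction, sketched here only as an alternative, builds the same frame in one step from a dict of lists; built this way the Age and Percentage columns keep numeric dtypes instead of the object dtype seen in the next question's output.

import pandas as pd

records = {
    'Name': ['subhan','tofik','rayyan','sharif','alim','shoib','danish','mustakim','mosin','arbaz'],
    'Age': [20, 22, 21, 21, 88, 18, 19, 25, 20, 22],
    'Percentage': [78, 45, 15, 65, 99, 97, 49, 6, 78, 15],
}
df = pd.DataFrame(records)   # columns get int64 dtype instead of object
print(df)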

Q.2: Write a python program to print the shape, number of rows and columns, data type, feature names, and description of the data.
import pandas as pd
df=pd.DataFrame(columns=['Name','Age','Percentage'])
df.loc[0]=['alim',20,78]
df.loc[1]=['tofik',22,45]
df.loc[2]=['rayyan',21,15]
df.loc[3]=['sharif',21,65]
df.loc[4]=['subhan',20,12]
df.loc[5]=['shoib',18,97]
df.loc[6]=['danish',19,49]
df.loc[7]=['mustakim',25,6]
df.loc[8]=['mosin',20,78]
df.loc[9]=['arbaz',22,15]
print(df.shape)
print(df.size)
print(df.dtypes)
print(df.columns)
print(df.describe())

Output:
(10, 3)
30
Name object
Age object
Percentage object
dtype: object
Index(['Name', 'Age', 'Percentage'], dtype='object')
<bound method NDFrame.describe of Name Age Percentage
0 alim 20 78
1 tofik 22 45
2 rayyan 21 15
3 sharif 21 65
4 subhan 20 12
5 shoib 18 97
6 danish 19 49
7 mustakim 25 6
8 mosin 20 78
9 arbaz 22 15>

Q.3: Write a python program to view basic statistical detail of data.

import pandas as pd
df=pd.DataFrame(columns=['Name','Age','Percentage'])
df.loc[0]=['subhan',20,78]
df.loc[1]=['tofik',22,45]
df.loc[2]=['rayyan',21,15]
df.loc[3]=['sharif',21,65]
df.loc[4]=['alim',20,12]
df.loc[5]=['shoib',18,97]
df.loc[6]=['danish',19,49]
df.loc[7]=['mustakim',25,6]
df.loc[8]=['mosin',20,78]
df.loc[9]=['arbaz',22,15]
print(df.describe())

Output:
Name Age Percentage
count 10 10 10
unique 10 6 8
top arbaz 20 15
freq 1 3 2
Q.4: Write a python program to add 5 rows with duplicate values and missing values. Add a column Remarks with empty values.

import pandas as pd
df=pd.DataFrame(columns=['Name','Age','Percentage'])
df.loc[0]=['alim',20,78]
df.loc[1]=['tofik',22,45]
df.loc[2]=['subhan',21,15]
df.loc[3]=[None,21,65]
df.loc[4]=['subhan',20,12]
df['Remarks']=None
print(df

Output:
Name Age Percentage Remarks
0 alim 20 78 None
1 tofik 22 45 None
2 subhan 21 15 None
3 NaN 21 65 None
4 subhan 20 12 None

Q.5: Write a python program to get the number of observations, missing values, and duplicate values.

import pandas as pd
df=pd.DataFrame(columns=['Name','Age','Percentage'])
df.loc[0]=['alim',20,15]
df.loc[1]=['tofik',22,45]
df.loc[2]=['subhan',20,15]
df.loc[3]=[None,21,65]
df.loc[4]=['subhan',20,15]
print(df['Name'].size)
missing=df.isnull()
print(missing)
dup=df.duplicated()
print(dup)

Output:
5
Name Age Percentage
0 False False False
1 False False False
2 False False False
3 True False False
4 False False False
0 False
1 False
2 False
3 False
4 True
dtype: bool
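
The program above prints the full boolean masks; a minimal follow-up sketch (rebuilding the same five-row frame) reports the counts the question asks for:

import pandas as pd

df = pd.DataFrame({'Name': ['alim','tofik','subhan',None,'subhan'],
                   'Age': [20, 22, 20, 21, 20],
                   'Percentage': [15, 45, 15, 65, 15]})
print("Observations:", len(df))                        # number of rows
print("Missing values per column:\n", df.isnull().sum())
print("Total missing values:", df.isnull().sum().sum())
print("Duplicate rows:", df.duplicated().sum())        # count instead of mask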
Q.6: Write a python program to drop the 'Remarks' column from the dataframe. Also drop all null and empty values. Print the modified data.

import pandas as pd
df=pd.DataFrame(columns=['Name','Age','Percentage'])
df.loc[0]=['ALIM',20,15]
df.loc[1]=['SHOAIB',22,45]
df.loc[2]=['SHARIF',20,15]
df.loc[3]=[None,21,65]
df.loc[4]=['ALIM',20,15]
df['Remarks']=None
print(df)
df.drop(labels=['Remarks'],axis=1,inplace=True)
print(df)
df.dropna(axis=0,inplace=True)
print(df)

Output:
Name Age Percentage Remarks
0 ALIM 20 15 None
1 SHOAIB 22 45 None
2 SHARIF 20 15 None
3 NaN 21 65 None
4 ALIM 20 15 None
Name Age Percentage
0 ALIM 20 15
1 SHOAIB 22 45
2 SHARIF 20 15
3 NaN 21 65
4 ALIM 20 15
Name Age Percentage
0 ALIM 20 15
1 SHOAIB 22 45
2 SHARIF 20 15
4 ALIM 20 15

Q.7: Write a python program to generate a line plot of name vs percentage.

import pandas as pd
import matplotlib.pyplot as plt
df=pd.DataFrame(columns=['Name','age','percentage'])
df.loc[0]=['kashish',19,95]
df.loc[1]=['Ramiza',20,91]
df.loc[2]=['naki',7,90]
df.loc[3]=['Faisal',18,85]
df.loc[4]=['Aman',23,80]
df.loc[5]=['Anas',24,75]
df.loc[6]=['Fazil',21,70]
df.loc[7]=['Mustaqim',22,65]
df.loc[8]=['Alfiya',20,89]
df.loc[9]=['Aqsa',21,86]
print(df)
df.plot(x="Name",y="percentage")
plt.title('Line plot name vs percentage')
plt.xlabel('name of student')
plt.ylabel('percentage')
plt.show()
print(df)

Output:
Name age percentage
0 kashish 19 95
1 Ramiza 20 91
2 naki 7 90
3 Faisal 18 85
4 Aman 23 80
5 Anas 24 75
6 Fazil 21 70
7 Mustaqim 22 65
8 Alfiya 20 89
9 Aqsa 21 86

Q.8: Write a python program to generate a scatter plot of name vs percentage.

import pandas as pd
import matplotlib.pyplot as plt
df=pd.DataFrame(columns=['Name','age','percentage'])
df.loc[0]=['kashish',19,95]
df.loc[1]=['Ramiza',20,91]
df.loc[2]=['naki',7,90]
df.loc[3]=['Faisal',18,85]
df.loc[4]=['Aman',23,80]
df.loc[5]=['Anas',24,75]
df.loc[6]=['Fazil',21,70]
df.loc[7]=['Mustaqim',22,65]
df.loc[8]=['Alfiya',20,89]
df.loc[9]=['Aqsa',21,86]
print(df)
plt.scatter(x=df["Name"],y=df["percentage"])
plt.title('Scatter plot name vs percentage')
plt.xlabel('name of student')
plt.ylabel('percentage')
plt.show()
print(df)

Output:
Name age percentage
0 kashish 19 95
1 Ramiza 20 91
2 naki 7 90
3 Faisal 18 85
4 Aman 23 80
5 Anas 24 75
6 Fazil 21 70
7 Mustaqim 22 65
8 Alfiya 20 89
9 Aqsa 21 86
Assignment 1: SET-B
Q1) Download the heights and weights dataset and load it from the given CSV file into a dataframe. Print the first 10 rows, the last 10 rows, and 20 random rows.

import pandas as pd
data=pd.read_csv('HeightWeight.csv')
df=pd.DataFrame(data)
print('\nfirst 10 rows')
print(df.head(10))

print('\nlast 10 rows')
print(df.tail(10))

print('\nrandom 20 rows')
print(df.sample(20))

Output:
first 10 rows
Index Height(Inches) Weight(Pounds)
0 1 65.78331 112.9925
1 2 71.51521 136.4873
2 3 69.39874 153.0269
3 4 68.21660 142.3354
4 5 67.78781 144.2971
5 6 68.69784 123.3024
6 7 69.80204 141.4947
7 8 70.01472 136.4623
8 9 67.90265 112.3723
9 10 66.78236 120.6672
last 10 rows
Index Height(Inches) Weight(Pounds)
24990 24991 69.97767 125.3672
24991 24992 71.91656 128.2840
24992 24993 70.96218 146.1936
24993 24994 66.19462 118.7974
24994 24995 67.21126 127.6603
24995 24996 69.50215 118.0312
24996 24997 64.54826 120.1932
24997 24998 64.69855 118.2655
24998 24999 67.52918 132.2682
24999 25000 68.87761 124.8742
random 20 rows
Index Height(Inches) Weight(Pounds)
18515 18516 71.15912 143.7729
3550 3551 65.95300 130.0755
16400 16401 67.20032 130.9151
10718 10719 70.79804 125.7816
4830 4831 66.24238 121.9611
9121 9122 68.05361 137.0546
11516 11517 68.67632 115.3375
3126 3127 67.59507 126.1888
15670 15671 68.55083 137.6187
20293 20294 65.96939 139.4453
5842 5843 68.92916 129.1092
21409 21410 69.27081 124.6497
15365 15366 67.18395 137.6251
16889 16890 69.07788 131.0112
12382 12383 69.50005 135.9850
14220 14221 68.49492 132.0698
2701 2702 69.25709 142.5795
4578 4579 68.91069 103.7011
11468 11469 65.93881 125.8178
19312 19313 68.47651 117.6580

Q2) Write a python program to find the shape, size, and datatypes of the dataframe object.

import pandas as pd
data=pd.read_csv('HeightWeight.csv')
df=pd.DataFrame(data)
print('\n shape of dataframe',df.shape)
print('\n size of dataframe',df.size)
print('\n datatype of dataframe',df.dtypes)

Output:
 shape of dataframe (25000, 3)
 size of dataframe 75000
 datatype of dataframe Index int64
Height(Inches) float64
Weight(Pounds) float64
dtype: object

Q3) Write a python program to view basic statistical details of the data.
import pandas as pd
data=pd.read_csv('HeightWeight.csv')
df=pd.DataFrame(data)
print("\n basic statistical details of the data:\n",df.describe())

Output:
basic statistical details of the data:
Index Height(Inches) Weight(Pounds)
count 25000.000000 25000.000000 25000.000000
mean 12500.500000 67.993114 127.079421
std 7217.022701 1.901679 11.660898
min 1.000000 60.278360 78.014760
25% 6250.750000 66.704397 119.308675
50% 12500.500000 67.995700 127.157750
75% 18750.250000 69.272958 134.892850
max 25000.000000 75.152800 170.924000
Q4) Write a python program to get the number of observations, missing values and NaN values.

import pandas as pd
import numpy as np
data=pd.read_csv('HeightWeight.csv')
df=pd.DataFrame(data)
print('\n number of observations:',df['Index'].size)
missing=df.isnull()
print("\n missing values:",missing.size)
nan_values=np.isnan(df)
print("\n nan values:",nan_values.size)

Output:
 number of observations: 25000

 missing values: 75000

 nan values: 75000
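
A short follow-up sketch (assuming the same HeightWeight.csv file) that counts the missing/NaN entries instead of reporting the total array size:

import pandas as pd
import numpy as np

df = pd.read_csv('HeightWeight.csv')
print("Observations:", len(df))
print("Missing values:", df.isnull().sum().sum())       # count of nulls
print("NaN values:", np.isnan(df.to_numpy()).sum())     # count of NaNs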

Q5) Write a python program to add a column "BMI" to the dataframe, calculated as: weight/height^2.

import pandas as pd
data=pd.read_csv('HeightWeight.csv')
df=pd.DataFrame(data)
df['BMI']=(df['Weight(Pounds)']/df['Height(Inches)']**2)
print("after adding column:\n",df)

Output:
after adding column:
 Index Height(Inches) Weight(Pounds) BMI
0 1 65.78331 112.9925 2.950311
1 2 71.51521 136.4873 3.642400
2 3 69.39874 153.0269 4.862195
3 4 68.21660 142.3354 4.353572
4 5 67.78781 144.2971 4.531187
... ... ... ... ...
24995 24996 69.50215 118.0312 2.884013
24996 24997 64.54826 120.1932 3.467294
24997 24998 64.69855 118.2655 3.341389
24998 24999 67.52918 132.2682 3.836436
24999 25000 68.87761 124.8742 3.286921

[25000 rows x 4 columns]


Q6) Write a python program to find the maximum and minimum BMI.
import pandas as pd
import numpy as np
data=pd.read_csv('HeightWeight.csv')
df=pd.DataFrame(data)
df['BMI']=((df['Weight(Pounds)']/df['Height(Inches)'])**2)
print("\n Maximum of BMI = ",max(df['BMI']))
print("\n Minimum of BMI = ",min(df['BMI']))

Output:
Maximum of BMI = 5.933879009339526

Minimum of BMI = 1.531593289721334

Q7) Write a python program to generate a scatter plot of height vs weight.

import pandas as pd
import matplotlib.pyplot as plt
data = pd.read_csv('HeightWeight.csv')
df = pd.DataFrame(data)

plt.scatter(x=df['Height(Inches)'],y=df['Weight(Pounds)'],c='blue')
plt.title("Scatter Plot")
plt.xlabel("Height(Inches)")
plt.ylabel("Weight(Pounds)")
plt.show()

Output:

Assignment 2: SET-A
Q1) Create an array using numpy and display mean and median.
import numpy as np
demo = np.array([[30,75,70],[80,90,20],[50,95,60]])
print(demo)
print('\n')
print(np.mean(demo))
print('\n')
print(np.median(demo))
print('\n')

Output:
[[30 75 70]
 [80 90 20]
 [50 95 60]]

63.333333333333336

70.0

Q2) Create a data frame as follows and print df.sum().


import pandas as pd
import numpy as np
d= {'Name':pd.Series(['Ram','Sham','Meena','Seeta','Geeta','Rakesh','Madhav']),
'Age':pd.Series([25,26,25,23,30,29,23]),
'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8])}
df=pd.DataFrame(d)
print(df.sum())

Output:
Name RamShamMeenaSeetaGeetaRakeshMadhav
Age 181
Rating 25.61
dtype: object

Q3) For the above data display statistical details:

import pandas as pd
import numpy as np
md={'Name':pd.Series(['Ram','Sham','Meena','Seeta','Geeta','Rakesh','Madhav']),
'Age':pd.Series([25,26,25,23,30,29,23]),
'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8])}
df=pd.DataFrame(md)
print(df.describe())

Output:
Age Rating
count 7.000000 7.000000
mean 25.857143 3.658571
std 2.734262 0.698628
min 23.000000 2.560000
25% 24.000000 3.220000
50% 25.000000 3.800000
75% 27.500000 4.105000
max 30.000000 4.600000

Q4) Consider the array [13,52,44,32,30,0,36,45]. Calculate the standard deviation.

import numpy as np
data=np.array([13,52,44,32,30,0,36,45])
print("Standard Deviation of sample is %s"%(np.std(data)))

Output:
Standard Deviation of sample is 16.263455967290593

Q5) Create a data frame as follows:

Virat   Rohit
92      89
97      87
85      67
74      55
71      47
55      72
85      76
63      79
42      44
32      92
71      99
55      47

Display the mean of every player :

import pandas as pd
import scipy.stats as s
score={'Virat':[92,97,85,74,71,55,85,63,42,32,71,55],'Rohit':
[89,87,67,55,47,72,76,79,44,92,99,47]}
df=pd.DataFrame(score)
print(df)
print("\nArithmetic Mean Values")
print("Score 1",s.tmean(df["Virat"]).round(2))
print("Score 2",s.tmean(df["Rohit"]).round(2))

Output:
Virat Rohit
0 92 89
1 97 87
2 85 67
3 74 55
4 71 47
5 55 72
6 85 76
7 63 79
8 42 44
9 32 92
10 71 99
11 55 47

Arithmetic Mean Values


Score 1 68.5
Score 2 71.17
Q6) Consider the array [24,29,20,22,24,26,27,30,20,31,26,38,44,47]. Calculate the IQR.

import numpy as np
mydata=np.array([24,29,20,22,24,26,27,30,20,31,26,38,44,47])
q3,q1=np.percentile(mydata,[75,25])
iqrvalue=q3-q1
print(iqrvalue)

Output:
6.75

Q7) Write a python program to find the maximum and minimum values of a given flattened array:

import numpy as np
arr=np.array([[25,26,45],[12,36,42],[8,50,65]])
print("\n Original flattened Array:\n",arr)
flat = arr.flatten()
max_val = np.max(flat)
print("\n Maximum value of flattened array:\n", max_val)
min_val = np.min(flat)
print("\n Minimum value of flattened array:\n", min_val)

Output:
Original flattened Array:
[[25 26 45]
[12 36 42]
[ 8 50 65]]

Maximum value of flattened array:


65

Minimum value of flattened array:


8

Q8) Write a python program to compute the Euclidean distance between two data points in a dataset.

import numpy as np
point1= np.array((1,2,3))
point2= np.array((1,1,1))
dist = np.linalg.norm(point1 - point2)
print("Euclidean distance between two points: ",dist)

Output:
Euclidean distance between two points:  2.23606797749979

Q.9: Create one dataframe of data values. Find out the mean, range, and IQR for this data.
import pandas as pd
import numpy as np
import scipy.stats as s
d = [32,36,46,47,56,69,75,79,79,88,89,91,92,93,96,97,101,105,112,116]
data=pd.DataFrame(d)
print("\n DataFrame:\n",data)
print("\n Mean of dataframe : ",s.tmean(data))
data_range = np.max(data)-np.min(data)
print("\n Range of dataframe : ",data_range)
Q1 = np.median(data[:10])
Q3 = np.median(data[10:])
IQR = Q3 - Q1
print("\n Inter Quartile Range (IQR) of dataframe : ",IQR)

Output:
DataFrame:
0
0 32
1 36
2 46
3 47
4 56
5 69
6 75
7 79
8 79
9 88
10 89
11 91
12 92
13 93
14 96
15 97
16 101
17 105
18 112
19 116

Mean of dataframe : 79.95

Range of dataframe : 0 84
dtype: int64

Inter Quartile Range (IQR) of dataframe : 34.0
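
For comparison, a sketch of the same statistics computed with np.percentile, as in Assignment 2 SET-A Q6; note that the default percentile interpolation can give a slightly different IQR than the median-of-halves method used above.

import numpy as np

d = [32,36,46,47,56,69,75,79,79,88,89,91,92,93,96,97,101,105,112,116]
q3, q1 = np.percentile(d, [75, 25])
print("Mean :", np.mean(d))              # 79.95
print("Range:", np.max(d) - np.min(d))   # 84
print("IQR  :", q3 - q1)                 # may differ slightly from the 34.0 above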

Q10) Write a python program to compute the sum of Manhattan distances between all pairs of points:
x = [2,5,7]
y = [4,6,8]
n = len(x)
def distance_sum(x, y, n):
    total = 0
    for i in range(n):
        for j in range(i+1, n):
            total += abs(x[i] - x[j]) + abs(y[i] - y[j])
    return total
print("\n sum of Manhattan distance between all pairs of points is =", distance_sum(x, y, n))

OUTPUT:
sum of Manhattan distance between all pairs of points is = 18
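
An equivalent vectorized sketch (assuming SciPy is available) that pairs the coordinates into points and sums the pairwise city-block distances with scipy.spatial.distance.pdist:

import numpy as np
from scipy.spatial.distance import pdist

points = np.column_stack(([2, 5, 7], [4, 6, 8]))    # (x, y) pairs
print("Sum of Manhattan distances =", pdist(points, metric='cityblock').sum())   # 18.0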

Q.12: Create a DataFrame of students' information such as name, graduation percentage, and age. Display the average age of students and the average graduation percentage. Also describe all basic statistics of the data. (Hint: use describe().)
import pandas as pd
import scipy.stats as s
data={'Name':['sharif','shoaib','nafisa','alim'],'Age':[20,22,23,21],
'perc':[65.2,78.4,78.6,74.5]}
df=pd.DataFrame(data)
print("\n Average Age :",sum(df['Age']/len(df['Age'])))
print("\n Average Percentage : ",sum(df['perc']/len(df['perc'])))
print("\n Basic Statistics of data :\n",df.describe())

Output:
Average Age : 21.5
Average Percentage : 74.17500000000001

Basic Statistics of data :


Age perc
count 4.000000 4.000000
mean 21.500000 74.175000
std 1.290994 6.273954
min 20.000000 65.200000
25% 20.750000 72.175000
50% 21.500000 76.450000
75% 22.250000 78.450000
max 23.000000 78.600000

Q.11: Write a NumPy program to compute the histogram of nums against bins.
Sample Output:
nums: [0.5 0.7 1.0 1.2 1.3 2.1]
bins: [0 1 2 3]
import numpy as np
import matplotlib.pyplot as plt
nums = np.array([0.5,0.7,1.0,1.2,1.3,2.1])
bins = np.array([0,1,2,3])
print("nums: ",nums)
print("bins: ",bins)
print("Result:", np.histogram(nums, bins))
plt.hist(nums, bins=bins)
plt.show()

Output:
nums: [0.5 0.7 1. 1.2 1.3 2.1]
bins: [0 1 2 3]
Result: (array([2, 3, 1]), array([0, 1, 2, 3]))
Assignment 2: SET-B

Q1) Download the iris dataset file. Read this csv file using the read_csv() function. Take a sample from the entire dataset. Display the maximum and minimum values of all numeric attributes.

import pandas as pd
data=pd.read_csv("Iris.csv")
df=pd.DataFrame(data)
sample=df.sample()
print("\n Sample from Dataset:\n",sample)
print("\n Maximum of sepal Length:",max(df['SepalLengthCm']))
print("\n Minimum of sepal Length:",min(df['SepalLengthCm']))
print("\n Maximum of sepal Width:",max(df['SepalWidthCm']))
print("\n Minimum of sepal Width:",min(df['SepalWidthCm']))

print("\n Maximum of Petal Length:",max(df['PetalLengthCm']))


print("\n Minimum of Petal Length:",min(df['PetalLengthCm']))
print("\n Maximum of Petal Width:",max(df['PetalWidthCm']))
print("\n Minimum of Petal Width:",min(df['PetalWidthCm']))

Output:

Maximum of sepal Length: 7.9

Minimum of sepal Length: 4.3

Maximum of sepal Width: 4.4

Minimum of sepal Width: 2.0

Maximum of Petal Length: 6.9

Minimum of Petal Length: 1.0

Maximum of Petal Width: 2.5


Minimum of Petal Width: 0.1

Q.2: Continue with the above dataset; find the number of records for each distinct value of the class attribute. Consider the entire dataset and not the sample.

import pandas as pd
data=pd.read_csv("Iris.csv")
df=pd.DataFrame(data)
print("\n DataFrame:\n",df)
cnt=df['Species'].value_counts()
print("\n number of records for each distinct value of species attribute:\n",cnt)

Output:

number of records for each distinct value of species attribute:


Iris-setosa 50
Iris-versicolor 50
Iris-virginica 50
Name: Species, dtype: int64

Q.3: Display column-wise mean and median for the iris dataset. (Hint: Use the mean() and median() functions of the pandas dataframe.)

import pandas as pd
import scipy.stats as s
import statistics as st
data=pd.read_csv('Iris.csv')
df=pd.DataFrame(data)
print("\n DataFrame :\n",df)
print("\n Mean of sepal Length:", s.tmean(df['SepalLengthCm']))
print("\n Median of sepal Length:", st.median(df['SepalLengthCm']))
print("\n Mean of sepal Width:", s.tmean(df['SepalWidthCm']))
print("\n Median of sepal Width:", st.median(df['SepalWidthCm']))

print("\n Mean of Petal Length:", s.tmean(df['PetalLengthCm']))


print("\n Median of Petal Length:", st.median(df['PetalLengthCm']))
print("\n Mean of Petal Width:", s.tmean(df['PetalWidthCm']))
print("\n Median of Petal Width:", st.median(df['PetalWidthCm']))

OUTPUT:
DataFrame :
Id SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm \
0 1 5.1 3.5 1.4 0.2
1 2 4.9 3.0 1.4 0.2
2 3 4.7 3.2 1.3 0.2
3 4 4.6 3.1 1.5 0.2
4 5 5.0 3.6 1.4 0.2
.. ... ... ... ... ...
145 146 6.7 3.0 5.2 2.3
146 147 6.3 2.5 5.0 1.9
147 148 6.5 3.0 5.2 2.0
148 149 6.2 3.4 5.4 2.3
149 150 5.9 3.0 5.1 1.8

Species
0 Iris-setosa
1 Iris-setosa
2 Iris-setosa
3 Iris-setosa
4 Iris-setosa
.. ...
145 Iris-virginica
146 Iris-virginica
147 Iris-virginica
148 Iris-virginica
149 Iris-virginica

[150 rows x 6 columns]

Mean of sepal Length: 5.843333333333334

Median of sepal Length: 5.8

Mean of sepal Width: 3.0540000000000003

Median of sepal Width: 3.0

Mean of Petal Length: 3.758666666666666

Median of Petal Length: 4.35

Mean of Petal Width: 1.1986666666666668

Median of Petal Width: 1.3
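
Following the hint, a sketch that uses pandas' own mean() and median() on all numeric columns at once (assuming the same Iris.csv file):

import pandas as pd

df = pd.read_csv('Iris.csv')
print("Column-wise mean:\n", df.mean(numeric_only=True))
print("\nColumn-wise median:\n", df.median(numeric_only=True))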


ASSIGNMENT 3: SET-A

Create your own dataset and do simple preprocessing


Dataset Name: Data.csv (save the following data in Excel and save it with a .csv extension)

Country   Age   Salary   Purchased
France    44    72000    No
Spain     27    48000    Yes
Germany   30    54000    No
Spain     38    61000    No
Germany   40             Yes
France    35    58000    Yes
Spain           52000    No
France    48    79000    Yes
Germany   50    83000    No
France    37    67000    Yes

Q.1: Write a program in python to perform the following tasks.

1. Import the dataset and do the following:
a) Describe the dataset
b) Shape of the dataset
c) Display the first 3 rows from the dataset

import pandas as pd
data=pd.read_csv("Data.csv")
df=pd.DataFrame(data)
print("\n Describing Dataset:\n",df.describe())
print("\n Shape of Dataset:\n",df.shape)
print("\n First three rows of Dataset:\n",df.head(3))

OUTPUT:
Describing Dataset:
Age Salary
count 9.000000 9.000000
mean 38.777778 63777.777778
std 7.693793 12265.579662
min 27.000000 48000.000000
25% 35.000000 54000.000000
50% 38.000000 61000.000000
75% 44.000000 72000.000000
max 50.000000 83000.000000
Shape of Dataset:
(10, 4)

First three rows of Dataset:


Country Age Salary Purchased
0 France 44.0 72000.0 No
1 Spain 27.0 48000.0 Yes
2 Germany 30.0 54000.0 No

Q.2: Handling missing values:

a) Replace missing values of the Salary and Age columns with the mean of that column.

import pandas as pd
data=pd.read_csv("Data.csv")
df=pd.DataFrame(data)
print("\n Displaying Dataset:\n",df)
data['Salary']= data['Salary'].fillna(data['Salary'].mean())
data['Age']= data['Age'].fillna(data['Age'].mean())
print("\n ****** Modified Dataset ******\n",df)

OUTPUT:
Displaying Dataset:
Country Age Salary Purchased
0 France 44.0 72000.0 No
1 Spain 27.0 48000.0 Yes
2 Germany 30.0 54000.0 No
3 Spain 38.0 61000.0 No
4 Germany 40.0 NaN Yes
5 France 35.0 58000.0 Yes
6 Spain NaN 52000.0 No
7 France 48.0 79000.0 Yes
8 Germany 50.0 83000.0 No
9 France 37.0 67000.0 Yes

****** Modified Dataset ******


Country Age Salary Purchased
0 France 44.000000 72000.000000 No
1 Spain 27.000000 48000.000000 Yes
2 Germany 30.000000 54000.000000 No
3 Spain 38.000000 61000.000000 No
4 Germany 40.000000 63777.777778 Yes
5 France 35.000000 58000.000000 Yes
6 Spain 38.777778 52000.000000 No
7 France 48.000000 79000.000000 Yes
8 Germany 50.000000 83000.000000 No
9 France 37.000000 67000.000000 Yes
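
An alternative sketch that performs the same mean replacement with scikit-learn's SimpleImputer (assuming the same Data.csv file):

import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.read_csv('Data.csv')
imputer = SimpleImputer(strategy='mean')              # replace NaN with the column mean
df[['Age', 'Salary']] = imputer.fit_transform(df[['Age', 'Salary']])
print(df)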

Q3) Data.csv has two categorical columns (the Country column and the Purchased column).
a. Apply one-hot encoding on the Country column.
b. Apply label encoding on the Purchased column.

import pandas as pd
from sklearn import preprocessing
data=pd.read_csv("Data.csv")
df=pd.DataFrame(data)
print("\n Describing Dataset:\n",df)
one_hot_encoded_data = pd.get_dummies(data, columns = ['Country'])
print("\n *******After applying OneHot coding on Country*******\n",one_hot_encoded_data)
label_encoder = preprocessing.LabelEncoder()
df['Purchased']= label_encoder.fit_transform(df['Purchased'])
df['Purchased'].unique()
print("\n *******After applying Label encoding on Purchased*******\n",df)

OUTPUT:
Describing Dataset:
Country Age Salary Purchased
0 France 44.0 72000.0 No
1 Spain 27.0 48000.0 Yes
2 Germany 30.0 54000.0 No
3 Spain 38.0 61000.0 No
4 Germany 40.0 NaN Yes
5 France 35.0 58000.0 Yes
6 Spain NaN 52000.0 No
7 France 48.0 79000.0 Yes
8 Germany 50.0 83000.0 No
9 France 37.0 67000.0 Yes

*******After applying OneHot coding on Country*******


Age Salary Purchased Country_France Country_Germany Country_Spain
0 44.0 72000.0 No 1 0 0
1 27.0 48000.0 Yes 0 0 1
2 30.0 54000.0 No 0 1 0
3 38.0 61000.0 No 0 0 1
4 40.0 NaN Yes 0 1 0
5 35.0 58000.0 Yes 1 0 0
6 NaN 52000.0 No 0 0 1
7 48.0 79000.0 Yes 1 0 0
8 50.0 83000.0 No 0 1 0
9 37.0 67000.0 Yes 1 0 0

*******After applying Label encoding on Purchased*******


Country Age Salary Purchased
0 France 44.0 72000.0 0
1 Spain 27.0 48000.0 1
2 Germany 30.0 54000.0 0
3 Spain 38.0 61000.0 0
4 Germany 40.0 NaN 1
5 France 35.0 58000.0 1
6 Spain NaN 52000.0 0
7 France 48.0 79000.0 1
8 Germany 50.0 83000.0 0
9 France 37.0 67000.0 1
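
For comparison, a sketch of the same one-hot encoding using scikit-learn's OneHotEncoder class instead of pd.get_dummies; the sparse_output parameter assumes scikit-learn 1.2 or newer (older releases use sparse=False).

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.read_csv('Data.csv')
encoder = OneHotEncoder(sparse_output=False)          # use sparse=False on older versions
encoded = encoder.fit_transform(df[['Country']])
encoded_df = pd.DataFrame(encoded, columns=encoder.get_feature_names_out(['Country']))
print(pd.concat([df.drop(columns='Country'), encoded_df], axis=1))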
ASSIGNMENT 3: SET-B

Q.1: Import a standard dataset and use transformation techniques

Dataset Name: winequality-red.csv

Write a program in python to perform the following tasks.

1. Import the dataset from the above link.

import pandas as pd
data=pd.read_csv("winequality_red.csv")
df=pd.DataFrame(data)
print("\n Dataset is : \n",df)

OUTPUT:

Dataset is :
fixed acidity volatile acidity citric acid residual sugar chlorides \
0 7.4 0.700 0.00 1.9 0.076
1 7.8 0.880 0.00 2.6 0.098
2 7.8 0.760 0.04 2.3 0.092
3 11.2 0.280 0.56 1.9 0.075
4 7.4 0.700 0.00 1.9 0.076
... ... ... ... ... ...
1594 6.2 0.600 0.08 2.0 0.090
1595 5.9 0.550 0.10 2.2 0.062
1596 6.3 0.510 0.13 2.3 0.076
1597 5.9 0.645 0.12 2.0 0.075
1598 6.0 0.310 0.47 3.6 0.067

free sulfur dioxide total sulfur dioxide density pH sulphates \


0 11.0 34.0 0.99780 3.51 0.56
1 25.0 67.0 0.99680 3.20 0.68
2 15.0 54.0 0.99700 3.26 0.65
3 17.0 60.0 0.99800 3.16 0.58
4 11.0 34.0 0.99780 3.51 0.56
... ... ... ... ... ...
1594 32.0 44.0 0.99490 3.45 0.58
1595 39.0 51.0 0.99512 3.52 0.76
1596 29.0 40.0 0.99574 3.42 0.75
1597 32.0 44.0 0.99547 3.57 0.71
1598 18.0 42.0 0.99549 3.39 0.66

alcohol quality
0 9.4 5
1 9.8 5
2 9.8 5
3 9.8 6
4 9.4 5
... ... ...
1594 10.5 5
1595 11.2 6
1596 11.0 6
1597 10.2 5
1598 11.0 6

[1599 rows x 12 columns]

Q.2: Rescaling: Normalize the dataset using the MinMaxScaler class


import pandas as pd
from sklearn.preprocessing import MinMaxScaler
data=pd.read_csv("winequality-red.csv",sep=';')
df=pd.DataFrame(data)
print(df)
scaler = MinMaxScaler()
scaled = scaler.fit_transform(df)
print(scaled)

OUTPUT:

fixed acidity volatile acidity citric acid residual sugar chlorides \


0 7.4 0.700 0.00 1.9 0.076
1 7.8 0.880 0.00 2.6 0.098
2 7.8 0.760 0.04 2.3 0.092
3 11.2 0.280 0.56 1.9 0.075
4 7.4 0.700 0.00 1.9 0.076
... ... ... ... ... ...
1594 6.2 0.600 0.08 2.0 0.090
1595 5.9 0.550 0.10 2.2 0.062
1596 6.3 0.510 0.13 2.3 0.076
1597 5.9 0.645 0.12 2.0 0.075
1598 6.0 0.310 0.47 3.6 0.067

free sulfur dioxide total sulfur dioxide density pH sulphates \


0 11.0 34.0 0.99780 3.51 0.56
1 25.0 67.0 0.99680 3.20 0.68
2 15.0 54.0 0.99700 3.26 0.65
3 17.0 60.0 0.99800 3.16 0.58
4 11.0 34.0 0.99780 3.51 0.56
... ... ... ... ... ...
1594 32.0 44.0 0.99490 3.45 0.58
1595 39.0 51.0 0.99512 3.52 0.76
1596 29.0 40.0 0.99574 3.42 0.75
1597 32.0 44.0 0.99547 3.57 0.71
1598 18.0 42.0 0.99549 3.39 0.66

alcohol quality
0 9.4 5
1 9.8 5
2 9.8 5
3 9.8 6
4 9.4 5
... ... ...
1594 10.5 5
1595 11.2 6
1596 11.0 6
1597 10.2 5
1598 11.0 6

[1599 rows x 12 columns]


[[0.24778761 0.39726027 0. ... 0.13772455 0.15384615 0.4 ]
[0.28318584 0.52054795 0. ... 0.20958084 0.21538462 0.4 ]
[0.28318584 0.43835616 0.04 ... 0.19161677 0.21538462 0.4 ]
...
[0.15044248 0.26712329 0.13 ... 0.25149701 0.4 0.6 ]
[0.11504425 0.35958904 0.12 ... 0.22754491 0.27692308 0.4 ]
[0.12389381 0.13013699 0.47 ... 0.19760479 0.4 0.6 ]]
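
For reference, a sketch of the per-column formula MinMaxScaler applies, x_scaled = (x - x_min) / (x_max - x_min), written directly in pandas:

import pandas as pd

df = pd.read_csv("winequality-red.csv", sep=';')
manual = (df - df.min()) / (df.max() - df.min())      # same rescaling of each column to [0, 1]
print(manual.head())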

Q.3: Standardizing Data (transform them into a standard Gaussian distribution with a mean of 0 and a standard deviation of 1)

import pandas as pd
from sklearn import preprocessing
data=pd.read_csv("winequality-red.csv",sep=';')
df=pd.DataFrame(data)
print("\n Dataset is : \n",df)
standard = preprocessing.scale(df)
print("\n *********Standardized Data*********\n",standard)

OUTPUT:

Dataset is :
fixed acidity volatile acidity citric acid residual sugar chlorides \
0 7.4 0.700 0.00 1.9 0.076
1 7.8 0.880 0.00 2.6 0.098
2 7.8 0.760 0.04 2.3 0.092
3 11.2 0.280 0.56 1.9 0.075
4 7.4 0.700 0.00 1.9 0.076
... ... ... ... ... ...
1594 6.2 0.600 0.08 2.0 0.090
1595 5.9 0.550 0.10 2.2 0.062
1596 6.3 0.510 0.13 2.3 0.076
1597 5.9 0.645 0.12 2.0 0.075
1598 6.0 0.310 0.47 3.6 0.067

free sulfur dioxide total sulfur dioxide density pH sulphates \


0 11.0 34.0 0.99780 3.51 0.56
1 25.0 67.0 0.99680 3.20 0.68
2 15.0 54.0 0.99700 3.26 0.65
3 17.0 60.0 0.99800 3.16 0.58
4 11.0 34.0 0.99780 3.51 0.56
... ... ... ... ... ...
1594 32.0 44.0 0.99490 3.45 0.58
1595 39.0 51.0 0.99512 3.52 0.76
1596 29.0 40.0 0.99574 3.42 0.75
1597 32.0 44.0 0.99547 3.57 0.71
1598 18.0 42.0 0.99549 3.39 0.66

alcohol quality
0 9.4 5
1 9.8 5
2 9.8 5
3 9.8 6
4 9.4 5
... ... ...
1594 10.5 5
1595 11.2 6
1596 11.0 6
1597 10.2 5
1598 11.0 6

[1599 rows x 12 columns]

*********Standardized Data*********
[[-0.52835961 0.96187667 -1.39147228 ... -0.57920652 -0.96024611
-0.78782264]
[-0.29854743 1.96744245 -1.39147228 ... 0.1289504 -0.58477711
-0.78782264]
[-0.29854743 1.29706527 -1.18607043 ... -0.04808883 -0.58477711
-0.78782264]
...
[-1.1603431 -0.09955388 -0.72391627 ... 0.54204194 0.54162988
0.45084835]
[-1.39015528 0.65462046 -0.77526673 ... 0.30598963 -0.20930812
-0.78782264]
[-1.33270223 -1.21684919 1.02199944 ... 0.01092425 0.54162988
0.45084835]]
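
An equivalent sketch using the StandardScaler class instead of the preprocessing.scale() function; both standardize each column to mean 0 and standard deviation 1.

import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("winequality-red.csv", sep=';')
scaler = StandardScaler()
print(scaler.fit_transform(df))                       # same standardized values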

Q.4: Normalizing Data (rescale each observation to a length of 1, a unit norm. For this, use the Normalizer class.)

import pandas as pd
import numpy as np
from sklearn import preprocessing
data=pd.read_csv("winequality-red.csv",sep=';')
df=pd.DataFrame(data)
print("\n Dataset is : \n",df)
normalized = preprocessing.normalize(df,norm='l2')
print("\n***********Normalized Data***********\n",normalized)

OUTPUT:
Dataset is :
fixed acidity volatile acidity citric acid residual sugar chlorides \
0 7.4 0.700 0.00 1.9 0.076
1 7.8 0.880 0.00 2.6 0.098
2 7.8 0.760 0.04 2.3 0.092
3 11.2 0.280 0.56 1.9 0.075
4 7.4 0.700 0.00 1.9 0.076
... ... ... ... ... ...
1594 6.2 0.600 0.08 2.0 0.090
1595 5.9 0.550 0.10 2.2 0.062
1596 6.3 0.510 0.13 2.3 0.076
1597 5.9 0.645 0.12 2.0 0.075
1598 6.0 0.310 0.47 3.6 0.067

free sulfur dioxide total sulfur dioxide density pH sulphates \


0 11.0 34.0 0.99780 3.51 0.56
1 25.0 67.0 0.99680 3.20 0.68
2 15.0 54.0 0.99700 3.26 0.65
3 17.0 60.0 0.99800 3.16 0.58
4 11.0 34.0 0.99780 3.51 0.56
... ... ... ... ... ...
1594 32.0 44.0 0.99490 3.45 0.58
1595 39.0 51.0 0.99512 3.52 0.76
1596 29.0 40.0 0.99574 3.42 0.75
1597 32.0 44.0 0.99547 3.57 0.71
1598 18.0 42.0 0.99549 3.39 0.66

alcohol quality
0 9.4 5
1 9.8 5
2 9.8 5
3 9.8 6
4 9.4 5
... ... ...
1594 10.5 5
1595 11.2 6
1596 11.0 6
1597 10.2 5
1598 11.0 6

[1599 rows x 12 columns]

***********Normalized Data***********
[[0.19347777 0.01830195 0. ... 0.01464156 0.24576906 0.13072822]
[0.10698874 0.01207052 0. ... 0.00932722 0.13442175 0.06858252]
[0.13494887 0.01314886 0.00069205 ... 0.01124574 0.16955114 0.08650569]
...
[0.1222319 0.00989496 0.00252225 ... 0.01455142 0.21342078 0.11641133]
[0.10524769 0.01150589 0.00214063 ... 0.0126654 0.18195363 0.08919296]
[0.12491328 0.00645385 0.00978487 ... 0.01374046 0.22900768 0.12491328]]
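
Since the question refers to the Normalizer class, a sketch that uses it directly; it produces the same L2 unit-norm rows as preprocessing.normalize() above.

import pandas as pd
from sklearn.preprocessing import Normalizer

df = pd.read_csv("winequality-red.csv", sep=';')
normalizer = Normalizer(norm='l2')                    # rescale each row to unit length
print(normalizer.fit_transform(df))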

Q.5: Binarizing Data using the Binarizer class (using a binary threshold, it is possible to transform our data by marking the values above it as 1 and those equal to or below it as 0).
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn import preprocessing
from sklearn.preprocessing import Binarizer
data_set = pd.read_csv('Data.csv')

data_set.head()
age = data_set['Age'].values
salary = data_set['Salary'].values
print("\n Original age data values : \n",age)
print("\n Original salary data values : \n",salary)
x = age.reshape(1,-1)          # Binarizer expects a 2-D array
y = salary.reshape(1,-1)
binarizer_age = Binarizer(threshold=35)        # ages strictly above 35 -> 1
binarizer_salary = Binarizer(threshold=61000)  # salaries strictly above 61000 -> 1

print("\n Binarized age : \n", binarizer_age.fit_transform(x))

print("\n Binarized salary : \n", binarizer_salary.fit_transform(y))

OUTPUT:
Original age data values :
[44 27 30 38 40 35 58 48 50 37]

Original salary data values :


[72000 48000 54000 61000 55000 58000 52000 79000 83000 67000]

Binarized age :
[[1 0 0 1 1 0 1 1 1 1]]

Binarized salary :
[[1 0 0 0 0 0 0 1 1 1]]
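
A pandas-only sketch of the same thresholding, where values strictly above the threshold become 1 and the rest 0 (missing entries, if any, compare as False and become 0):

import pandas as pd

df = pd.read_csv('Data.csv')
print("Binarized age   :", (df['Age'] > 35).astype(int).tolist())
print("Binarized salary:", (df['Salary'] > 61000).astype(int).tolist())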
