MAJOR PROJECT (Sanket Patil) PDF

In [6]: # import python libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt # visualizing data
%matplotlib inline
import seaborn as sns
In [7]: # import csv file

df = pd.read_csv('Diwali Sales Data.csv', encoding= 'unicode_escape')
In [8]: df.shape
Out[8]: (11251, 15)
In [5]: df.head()
Out[5]:
   User_ID  Cust_name  Product_ID  Gender  Age Group  Age  Marital_Status  State           Zone      Occupation       Product_Category  Orders  Amount   Status
0  1002903  Sanskriti  P00125942   F       26-35      28   0               Maharashtra     Western   Healthcare       Auto              1       23952.0  NaN
1  1000732  Kartik     P00110942   F       26-35      35   1               Andhra Pradesh  Southern  Govt             Auto              3       23934.0  NaN
2  1001990  Bindu      P00118542   F       26-35      35   1               Uttar Pradesh   Central   Automobile       Auto              3       23924.0  NaN
3  1001425  Sudevi     P00237842   M       0-17       16   0               Karnataka       Southern  Construction     Auto              2       23912.0  NaN
4  1000588  Joni       P00057942   M       26-35      28   1               Gujarat         Western   Food Processing  Auto              2       23877.0  NaN
In [5]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11251 entries, 0 to 11250
Data columns (total 15 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 User_ID 11251 non-null int64
1 Cust_name 11251 non-null object
2 Product_ID 11251 non-null object
3 Gender 11251 non-null object
4 Age Group 11251 non-null object
5 Age 11251 non-null int64
6 Marital_Status 11251 non-null int64
7 State 11251 non-null object
8 Zone 11251 non-null object
9 Occupation 11251 non-null object
10 Product_Category 11251 non-null object
11 Orders 11251 non-null int64
12 Amount 11239 non-null float64
13 Status 0 non-null float64
14 unnamed1 0 non-null float64
dtypes: float64(3), int64(4), object(8)
memory usage: 1.3+ MB
In [6]: #drop unrelated/blank columns

df.drop(['Status', 'unnamed1'], axis=1, inplace=True)
In [7]: #check for null values

pd.isnull(df).sum()
Out[7]: User_ID 0
Cust_name 0
Product_ID 0
Gender 0
Age Group 0
Age 0
Marital_Status 0
State 0
Zone 0
Occupation 0
Product_Category 0
Orders 0
Amount 12
dtype: int64
In [8]: # drop null values

df.dropna(inplace=True)
In [9]: # change data type

df['Amount'] = df['Amount'].astype('int')
In [10]: df['Amount'].dtypes
Out[10]: dtype('int32')
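The cleaning steps above (drop the empty columns, drop rows with a missing Amount, cast Amount to an integer) can be tried on a tiny frame. This is only a sketch: the values below are made up, and only the column names come from the notebook. The int32 in Out[10] is likely the default integer size on this Windows setup; other platforms typically report int64.

```python
import numpy as np
import pandas as pd

# Toy data: values are invented, column names follow the notebook
toy = pd.DataFrame({
    "Cust_name": ["A", "B", "C"],
    "Amount": [100.5, np.nan, 250.0],
    "Status": [np.nan, np.nan, np.nan],   # entirely empty column
})

# Drop the all-null column first, then drop rows that still contain nulls
toy = toy.drop(columns=["Status"]).dropna()

# astype('int') truncates toward zero; it does not round
toy["Amount"] = toy["Amount"].astype("int")

print(toy)
```

Dropping the empty columns before calling dropna() matters: with an all-null column still present, dropna() would remove every row.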
In [11]: df.columns
Out[11]: Index(['User_ID', 'Cust_name', 'Product_ID', 'Gender', 'Age Group', 'Age',
       'Marital_Status', 'State', 'Zone', 'Occupation', 'Product_Category',
       'Orders', 'Amount'],
      dtype='object')
In [12]: # rename column (returns a renamed copy; df keeps 'Marital_Status' unless the result is assigned back)

df.rename(columns= {'Marital_Status':'Shaadi'})
Out[12]:
       User_ID  Cust_name    Product_ID  Gender  Age Group  Age  Shaadi  State           Zone      Occupation       Product_Category  Orders  Amount
0      1002903  Sanskriti    P00125942   F       26-35      28   0       Maharashtra     Western   Healthcare       Auto              1       23952
1      1000732  Kartik       P00110942   F       26-35      35   1       Andhra Pradesh  Southern  Govt             Auto              3       23934
2      1001990  Bindu        P00118542   F       26-35      35   1       Uttar Pradesh   Central   Automobile       Auto              3       23924
3      1001425  Sudevi       P00237842   M       0-17       16   0       Karnataka       Southern  Construction     Auto              2       23912
4      1000588  Joni         P00057942   M       26-35      28   1       Gujarat         Western   Food Processing  Auto              2       23877
...    ...      ...          ...         ...     ...        ...  ...     ...             ...       ...              ...               ...     ...
11246  1000695  Manning      P00296942   M       18-25      19   1       Maharashtra     Western   Chemical         Office            4       370
11247  1004089  Reichenbach  P00171342   M       26-35      33   0       Haryana         Northern  Healthcare       Veterinary        3       367
11248  1001209  Oshin        P00201342   F       36-45      40   0       Madhya Pradesh  Central   Textile          Office            4       213
11249  1004023  Noonan       P00059442   M       36-45      37   0       Karnataka       Southern  Agriculture      Office            3       206
11250  1002744  Brumley      P00281742   F       18-25      19   0       Maharashtra     Western   Healthcare       Office            3       188
11239 rows × 13 columns
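The rename call in cell 12 only displays a renamed copy, and the later cells still use Marital_Status, which is consistent with the result never being stored. A minimal sketch of that behaviour on a made-up two-row frame:

```python
import pandas as pd

# Toy frame; the values are invented for illustration
toy = pd.DataFrame({"Marital_Status": [0, 1], "Amount": [10, 20]})

# rename() returns a new DataFrame and leaves `toy` untouched
renamed = toy.rename(columns={"Marital_Status": "Shaadi"})

print(list(toy.columns))      # original column names
print(list(renamed.columns))  # renamed copy
```

To keep the new name, the result would have to be assigned back (df = df.rename(...)), and every later cell would then need to use the new column name.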
In [13]: # describe() method returns description of the data in the DataFrame (i.e. count, mean, std, etc)
df.describe()
Out[13]: User_ID Age Marital_Status Orders Amount
count 1.123900e+04 11239.000000 11239.000000 11239.000000 11239.000000
mean 1.003004e+06 35.410357 0.420055 2.489634 9453.610553
std 1.716039e+03 12.753866 0.493589 1.114967 5222.355168
min 1.000001e+06 12.000000 0.000000 1.000000 188.000000
25% 1.001492e+06 27.000000 0.000000 2.000000 5443.000000
50% 1.003064e+06 33.000000 0.000000 2.000000 8109.000000
75% 1.004426e+06 43.000000 1.000000 3.000000 12675.000000
max 1.006040e+06 92.000000 1.000000 4.000000 23952.000000
In [14]: # use describe() for specific columns

df[['Age', 'Orders', 'Amount']].describe()
Out[14]: Age Orders Amount
count 11239.000000 11239.000000 11239.000000
mean 35.410357 2.489634 9453.610553
std 12.753866 1.114967 5222.355168
min 12.000000 1.000000 188.000000
25% 27.000000 2.000000 5443.000000
50% 33.000000 2.000000 8109.000000
75% 43.000000 3.000000 12675.000000
max 92.000000 4.000000 23952.000000
Exploratory Data Analysis

Gender
In [15]: # plotting a bar chart for Gender and its count
ax = sns.countplot(x = 'Gender',data = df)
for bars in ax.containers:
    ax.bar_label(bars)
In [16]: # plotting a bar chart for gender vs total amount
sales_gen = df.groupby(['Gender'], as_index=False)['Amount'].sum().sort_values(by='Amount', ascending=False)
sns.barplot(x = 'Gender',y= 'Amount' ,data = sales_gen)
Out[16]: <Axes: xlabel='Gender', ylabel='Amount'>
From the graphs above, most buyers are female, and the total amount spent by female buyers is also higher than the total spent by male buyers.
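The two charts show a row count and a total amount per gender. Total spend depends on how many buyers there are, so the average per order row is a useful extra check before talking about purchasing power. A small sketch with invented numbers (only the column names match the notebook):

```python
import pandas as pd

# Toy data: three rows for F, two for M (values are made up)
toy = pd.DataFrame({
    "Gender": ["F", "F", "F", "M", "M"],
    "Amount": [100, 200, 300, 250, 300],
})

# Count, total and average in one pass using named aggregation
summary = (
    toy.groupby("Gender")["Amount"]
       .agg(rows="count", total="sum", average="mean")
       .sort_values("total", ascending=False)
)
print(summary)
```

In this toy data F has the larger total but M has the larger average, which is why it helps to look at both figures.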
Age
In [17]: ax = sns.countplot(data = df, x = 'Age Group', hue = 'Gender')
for bars in ax.containers:
    ax.bar_label(bars)
In [6]: # Total Amount vs Age Group

sales_age = df.groupby(['Age Group'], as_index=False)['Amount'].sum().sort_values(by='Amount', ascending=False)
sns.barplot(x = 'Age Group',y= 'Amount' ,data = sales_age)
Out[6]: <Axes: xlabel='Age Group', ylabel='Amount'>
From the graphs above, most buyers are women in the 26-35 age group.
State
In [19]: # total number of orders from top 10 states
sales_state = df.groupby(['State'], as_index=False)['Orders'].sum().sort_values(by='Orders', ascending=False).head(10)
sns.set(rc={'figure.figsize':(15,5)})
sns.barplot(data = sales_state, x = 'State',y= 'Orders')
Out[19]: <Axes: xlabel='State', ylabel='Orders'>
In [20]: # total amount/sales from top 10 states
sales_state = df.groupby(['State'], as_index=False)['Amount'].sum().sort_values(by='Amount', ascending=False).head(10)
sns.barplot(data = sales_state, x = 'State',y= 'Amount')
Out[20]: <Axes: xlabel='State', ylabel='Amount'>
From the graphs above, most orders and the highest total sales amount come from Uttar Pradesh, Maharashtra and Karnataka, in that order.
Marital Status
In [21]: ax = sns.countplot(data = df, x = 'Marital_Status')
for bars in ax.containers:
    ax.bar_label(bars)
In [22]: sales_state = df.groupby(['Marital_Status', 'Gender'], as_index=False)['Amount'].sum().sort_values(by='Amount', ascending=False)
sns.barplot(data = sales_state, x = 'Marital_Status',y= 'Amount', hue='Gender')
Out[22]: <Axes: xlabel='Marital_Status', ylabel='Amount'>
From the graphs above, most buyers are married women, and they also account for the highest total purchase amount.
Occupation
In [23]: sns.set(rc={'figure.figsize':(20,5)})
ax = sns.countplot(data = df, x = 'Occupation')
for bars in ax.containers:
    ax.bar_label(bars)
In [24]: sales_state = df.groupby(['Occupation'], as_index=False)['Amount'].sum().sort_values(by='Amount', ascending=False)
sns.barplot(data = sales_state, x = 'Occupation',y= 'Amount')
Out[24]: <Axes: xlabel='Occupation', ylabel='Amount'>
From the graphs above, most buyers work in the IT, Healthcare and Aviation sectors.
Product Category
In [25]: sns.set(rc={'figure.figsize':(20,5)})
ax = sns.countplot(data = df, x = 'Product_Category')
for bars in ax.containers:
    ax.bar_label(bars)
In [26]: sales_state = df.groupby(['Product_Category'], as_index=False)['Amount'].sum().sort_values(by='Amount', ascending=False).head(10)
sns.barplot(data = sales_state, x = 'Product_Category',y= 'Amount')
Out[26]: <Axes: xlabel='Product_Category', ylabel='Amount'>
From the graphs above, most of the products sold are from the Food, Clothing and Electronics categories.
In [27]: sales_state = df.groupby(['Product_ID'], as_index=False)['Orders'].sum().sort_values(by='Orders', ascending=False).head(10)
sns.barplot(data = sales_state, x = 'Product_ID',y= 'Orders')
Out[27]: <Axes: xlabel='Product_ID', ylabel='Orders'>
In [28]: # top 10 most sold products (same thing as above)
fig1, ax1 = plt.subplots(figsize=(12,7))

df.groupby('Product_ID')['Orders'].sum().nlargest(10).sort_values(ascending=False).plot(kind='bar')
Out[28]: <Axes: xlabel='Product_ID'>
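Cells 27 and 28 reach the same top-10 list by two routes: sorting the grouped totals and taking head(10), or calling nlargest(10). A short sketch of the equivalence on made-up product IDs, using a top 2 instead of a top 10:

```python
import pandas as pd

# Toy orders; product IDs and quantities are invented
toy = pd.DataFrame({
    "Product_ID": ["P1", "P2", "P1", "P3", "P2", "P1"],
    "Orders": [1, 4, 2, 3, 2, 2],
})

totals = toy.groupby("Product_ID")["Orders"].sum()

# Route 1: sort everything, then keep the first rows
top_sorted = totals.sort_values(ascending=False).head(2)

# Route 2: ask directly for the largest values (already in descending order)
top_nlargest = totals.nlargest(2)

print(top_nlargest)
```

Because nlargest already returns values in descending order, the extra sort_values call in cell 28 does not change the result.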
Regressor
In [15]: import pandas as pd
In [16]: df=pd.read_csv('Boston.csv')
In [17]: df
Out[17]: Unnamed: 0 crim zn indus chas nox rm age dis rad tax ptratio black lstat medv
0 1 0.00632 18.0 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 396.90 4.98 24.0
1 2 0.02731 0.0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 396.90 9.14 21.6
2 3 0.02729 0.0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 392.83 4.03 34.7
3 4 0.03237 0.0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 394.63 2.94 33.4
4 5 0.06905 0.0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 396.90 5.33 36.2
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
501 502 0.06263 0.0 11.93 0 0.573 6.593 69.1 2.4786 1 273 21.0 391.99 9.67 22.4
502 503 0.04527 0.0 11.93 0 0.573 6.120 76.7 2.2875 1 273 21.0 396.90 9.08 20.6
503 504 0.06076 0.0 11.93 0 0.573 6.976 91.0 2.1675 1 273 21.0 396.90 5.64 23.9
504 505 0.10959 0.0 11.93 0 0.573 6.794 89.3 2.3889 1 273 21.0 393.45 6.48 22.0
505 506 0.04741 0.0 11.93 0 0.573 6.030 80.8 2.5050 1 273 21.0 396.90 7.88 11.9
In [18]: df.head()
Out[18]: Unnamed: 0 crim zn indus chas nox rm age dis rad tax ptratio black lstat medv
0 1 0.00632 18.0 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 396.90 4.98 24.0
1 2 0.02731 0.0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 396.90 9.14 21.6
2 3 0.02729 0.0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 392.83 4.03 34.7
3 4 0.03237 0.0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 394.63 2.94 33.4
4 5 0.06905 0.0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 396.90 5.33 36.2
In [19]: #input data

x=df.drop('medv',axis=1)
#output data
y=df['medv']
In [21]: x.shape
Out[21]: (506, 14)
In [23]: import sklearn as sk
In [24]: from sklearn.model_selection import train_test_split
In [43]: x_train,x_test,y_train,y_test=train_test_split(x,y,random_state=0,test_size=0.25)
In [44]: x_train
Out[44]: Unnamed: 0 crim zn indus chas nox rm age dis rad tax ptratio black lstat
245 246 0.19133 22.0 5.86 0 0.431 5.605 70.2 7.9549 7 330 19.1 389.13 18.46
59 60 0.10328 25.0 5.13 0 0.453 5.927 47.2 6.9320 8 284 19.7 396.90 9.22
276 277 0.10469 40.0 6.41 1 0.447 7.267 49.0 4.7872 4 254 17.6 389.25 6.05
395 396 8.71675 0.0 18.10 0 0.693 6.471 98.8 1.7257 24 666 20.2 391.98 17.12
416 417 10.83420 0.0 18.10 0 0.679 6.782 90.8 1.8195 24 666 20.2 21.57 25.79
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
323 324 0.28392 0.0 7.38 0 0.493 5.708 74.3 4.7211 5 287 19.6 391.13 11.74
192 193 0.08664 45.0 3.44 0 0.437 7.178 26.3 6.4798 5 398 15.2 390.49 2.87
117 118 0.15098 0.0 10.01 0 0.547 6.021 82.6 2.7474 6 432 17.8 394.51 10.30
47 48 0.22927 0.0 6.91 0 0.448 6.030 85.5 5.6894 3 233 17.9 392.74 18.80
172 173 0.13914 0.0 4.05 0 0.510 5.572 88.5 2.5961 5 296 16.6 396.90 14.69
In [45]: x_train.head()
Out[45]: Unnamed: 0 crim zn indus chas nox rm age dis rad tax ptratio black lstat
245 246 0.19133 22.0 5.86 0 0.431 5.605 70.2 7.9549 7 330 19.1 389.13 18.46
59 60 0.10328 25.0 5.13 0 0.453 5.927 47.2 6.9320 8 284 19.7 396.90 9.22
276 277 0.10469 40.0 6.41 1 0.447 7.267 49.0 4.7872 4 254 17.6 389.25 6.05
395 396 8.71675 0.0 18.10 0 0.693 6.471 98.8 1.7257 24 666 20.2 391.98 17.12
416 417 10.83420 0.0 18.10 0 0.679 6.782 90.8 1.8195 24 666 20.2 21.57 25.79
In [46]: x_train.shape
Out[46]: (379, 14)
In [47]: x_test.shape
Out[47]: (127, 14)
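The 379/127 split follows from test_size=0.25 on 506 rows: scikit-learn rounds the test share up and gives the remainder to the training set. A quick sketch with placeholder arrays of the same length (the array contents are arbitrary):

```python
import math

import numpy as np
from sklearn.model_selection import train_test_split

n_samples = 506  # same row count as the notebook's Boston data

# Placeholder features and target; only the lengths matter here
X = np.arange(n_samples * 2).reshape(n_samples, 2)
y = np.arange(n_samples)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# 0.25 * 506 = 126.5, which is rounded up for the test set
expected_test = math.ceil(0.25 * n_samples)
print(len(X_train), len(X_test), expected_test)
```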
In [49]: #import the class
from sklearn.linear_model import LinearRegression

#create the object
regressor=LinearRegression()
In [50]: regressor.fit(x_train,y_train)
Out[50]: ▾ LinearRegression
LinearRegression()
In [51]: regressor.coef_
Out[51]: array([-6.86626589e-04, -1.18114875e-01, 4.45457552e-02, -5.73095686e-03,
       2.40076954e+00, -1.55582202e+01, 3.77695509e+00, -7.50684159e-03,
       -1.43843147e+00, 2.45150451e-01, -1.10818418e-02, -9.85916565e-01,
       8.44873594e-03, -4.99309080e-01])
In [52]: regressor.intercept_
Out[52]: 36.950771141093725
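coef_ holds one weight per input column (14 here) and intercept_ is the constant term, so a prediction is the dot product of a row with coef_ plus intercept_. A minimal sketch on synthetic, noise-free data (the data and the true weights are made up):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data generated from known weights, so the fit can be checked
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + 0.5 * X[:, 2] + 4.0

model = LinearRegression().fit(X, y)

# Rebuild the predictions by hand from the fitted parameters
manual = X @ model.coef_ + model.intercept_

print(model.coef_, model.intercept_)
```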
In [53]: #predictions
y_pred=regressor.predict(x_test)
In [54]: y_pred.shape
Out[54]: (127,)
In [55]: result=pd.DataFrame({'Actual':y_test,'Predicted':y_pred})
In [56]: result
Out[56]:
     Actual  Predicted
329  22.6    24.888928
371  50.0    23.651784
219  23.0    29.171382
403  8.3     11.960815
78   21.2    21.421473
...  ...     ...
49   19.4    17.587109
498  21.2    21.311314
309  20.3    23.534179
124  18.8    20.269959
306  33.4    35.110222
In [57]: residual_errors=abs(y_test-y_pred)
In [58]: residual_errors
Out[58]: 329 2.288928
371 26.348216
219 6.171382
403 3.660815
78 0.221473
...
49 1.812891
498 0.111314
309 3.234179
124 1.469959
306 1.710222
Name: medv, Length: 127, dtype: float64
In [59]: residual_errors;
In [60]: # mean absolute error

sum(residual_errors)/len(residual_errors)
Out[60]: 3.660052718913954
In [61]: from sklearn.metrics import mean_absolute_percentage_error
In [62]: mean_absolute_percentage_error(y_test,y_pred)
Out[62]: 0.175030596645347
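The mean absolute error in cell 60 is computed by hand, while the percentage error comes from scikit-learn. Both can be checked against each other. The sketch below uses only the first five actual and predicted values shown in the result table, so its numbers differ from the full 127-row test set:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_absolute_percentage_error

# First five rows of the notebook's result table
actual = np.array([22.6, 50.0, 23.0, 8.3, 21.2])
predicted = np.array([24.888928, 23.651784, 29.171382, 11.960815, 21.421473])

abs_errors = np.abs(actual - predicted)

# MAE: average absolute error, in the units of medv
mae_manual = abs_errors.sum() / len(abs_errors)

# MAPE: average absolute error relative to the actual value (a fraction, not a percent)
mape_manual = np.mean(abs_errors / np.abs(actual))

print(mae_manual, mean_absolute_error(actual, predicted))
print(mape_manual, mean_absolute_percentage_error(actual, predicted))
```

scikit-learn returns MAPE as a fraction, so the 0.175 in Out[62] corresponds to roughly 17.5 percent.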
In [63]: regressor.score(x_test,y_test)
Out[63]: 0.6367909663749035
In [64]: from sklearn.metrics import r2_score

r2_score(y_test,y_pred)
Out[64]: 0.6367909663749035
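regressor.score and r2_score return the same number because both compute the coefficient of determination, R² = 1 - SS_res / SS_tot. A short sketch of that formula, again using just the first five rows of the result table as sample values:

```python
import numpy as np
from sklearn.metrics import r2_score

# First five rows of the notebook's result table
actual = np.array([22.6, 50.0, 23.0, 8.3, 21.2])
predicted = np.array([24.888928, 23.651784, 29.171382, 11.960815, 21.421473])

# Residual sum of squares: error left after the model's predictions
ss_res = np.sum((actual - predicted) ** 2)

# Total sum of squares: error of always predicting the mean
ss_tot = np.sum((actual - actual.mean()) ** 2)

r2_manual = 1.0 - ss_res / ss_tot
print(r2_manual, r2_score(actual, predicted))
```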
In [69]: new=[[0.7258,0,8.64,0,0.538,5.727,69.6,3.7965,4,307,22,391.95,11.28,23.65]]
In [70]: new
Out[70]: [[0.7258,
0,
8.64,
0,
0.538,
5.727,
69.6,
3.7965,
4,
307,
22,
391.95,
11.28,
23.65]]
In [72]: regressor.predict(new)
C:\Users\Dell\AppData\Local\Programs\Python\Python310\lib\site-packages\sklearn\base.py:439: UserWarning: X does not have valid feature names, but LinearRegression was fitted with feature names
  warnings.warn(
Out[72]: array([-116.50728407])
Classifier
In [86]: import pandas as pd
In [90]: df=pd.read_csv('Social_Network_Ads.csv')
In [91]: df
Out[91]: User ID Gender Age EstimatedSalary Purchased
0 15624510 Male 19 19000 0
1 15810944 Male 35 20000 0
2 15668575 Female 26 43000 0
3 15603246 Female 27 57000 0
4 15804002 Male 19 76000 0
... ... ... ... ... ...
395 15691863 Female 46 41000 1
396 15706071 Male 51 23000 1
397 15654296 Female 50 20000 1
398 15755018 Male 36 33000 0
399 15594041 Female 49 36000 1
In [92]: #input data

x=df[['Age','EstimatedSalary']]
#output data
y=df['Purchased']
In [93]: from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
x_scaled = scaler.fit_transform(x)
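MinMaxScaler maps each column to the range 0 to 1 with (x - min) / (max - min), computed per column. A small sketch comparing the scaler with that formula; the three rows below are invented Age and EstimatedSalary values:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up rows: [Age, EstimatedSalary]
X = np.array([
    [18.0, 15000.0],
    [39.0, 82500.0],
    [60.0, 150000.0],
])

scaler = MinMaxScaler()
scaled = scaler.fit_transform(X)

# Same transformation written out by hand, column by column
manual = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

print(scaled)
```

The notebook fits the scaler on all rows before splitting. A common alternative is to fit it on the training rows only and then apply transform to the test rows, so that the test set does not influence the scaling.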
In [83]: from sklearn.linear_model import LogisticRegression
In [95]: # train/test split (hold-out validation)

from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x_scaled,y,random_state=0,test_size=0.25)
In [84]: #create the object

classifier= LogisticRegression()
In [96]: x_train
Out[96]: array([[0.61904762, 0.17777778],

[0.33333333, 0.77777778],
[0.47619048, 0.25925926],
[0.33333333, 0.88888889],
[0.80952381, 0.04444444],
[0.83333333, 0.65925926],
[0.5 , 0.2 ],
[0.47619048, 0.34074074],
[0.42857143, 0.25925926],
[0.42857143, 0.35555556],
[0.4047619 , 0.07407407],
[0.4047619 , 0.25925926],
[0.57142857, 0.42962963],
[0.69047619, 0.25185185],
[0.97619048, 0.1037037 ],
[0.73809524, 0.37037037],
[0.64285714, 0.85925926],
[0.30952381, 0.54814815],
[0.66666667, 0.4962963 ],
[0.69047619, 0.26666667],
[0.19047619, 0. ],
[1. , 0.64444444],
[0.47619048, 0.71851852],
[0.52380952, 0.68148148],
[0.57142857, 0.28148148],
[0.4047619 , 0.32592593],
[0.71428571, 0.19259259],
[0.71428571, 0.88148148],
[0.47619048, 0.72592593],
[0.26190476, 0.98518519],
[0.19047619, 0. ],
[1. , 0.2 ],
[0.14285714, 0.02962963],
[0.57142857, 0.99259259],
[0.66666667, 0.6 ],
[0.23809524, 0.32592593],
[0.5 , 0.6 ],
[0.23809524, 0.54814815],
[0.54761905, 0.42222222],
[0.64285714, 0.08148148],
[0.35714286, 0.4 ],
[0.04761905, 0.4962963 ],
[0.30952381, 0.43703704],
[0.57142857, 0.48148148],
[0.4047619 , 0.42222222],
[0.35714286, 0.99259259],
[0.52380952, 0.41481481],
[0.78571429, 0.97037037],
[0.66666667, 0.47407407],
[0.4047619 , 0.44444444],
[0.47619048, 0.26666667],
[0.42857143, 0.44444444],
[0.45238095, 0.46666667],
[0.47619048, 0.34074074],
[1. , 0.68888889],
[0.04761905, 0.4962963 ],
[0.92857143, 0.43703704],
[0.57142857, 0.37037037],
[0.19047619, 0.48148148],
[0.66666667, 0.75555556],
[0.4047619 , 0.34074074],
[0.07142857, 0.39259259],
[0.23809524, 0.21481481],
[0.54761905, 0.53333333],
[0.45238095, 0.13333333],
[0.21428571, 0.55555556],
[0.5 , 0.2 ],
[0.23809524, 0.8 ],
[0.30952381, 0.76296296],
[0.16666667, 0.53333333],
[0.4047619 , 0.41481481],
[0.45238095, 0.40740741],
[0.4047619 , 0.17777778],
[0.69047619, 0.05925926],
[0.4047619 , 0.97777778],
[0.71428571, 0.91111111],
[0.19047619, 0.52592593],
[0.16666667, 0.47407407],
[0.80952381, 0.91111111],
[0.78571429, 0.05925926],
[0.4047619 , 0.33333333],
[0.35714286, 0.72592593],
[0.28571429, 0.68148148],
[0.71428571, 0.13333333],
[0.54761905, 0.48148148],
[0.71428571, 0.6 ],
[0.30952381, 0.02222222],
[0.30952381, 0.41481481],
[0.5952381 , 0.84444444],
[0.97619048, 0.45185185],
[0. , 0.21481481],
[0.42857143, 0.76296296],
[0.57142857, 0.55555556],
[0.69047619, 0.11111111],
[0.19047619, 0.20740741],
[0.52380952, 0.46666667],
[0.66666667, 0.32592593],
[0.97619048, 0.2 ],
[0.66666667, 0.43703704],
[0.4047619 , 0.56296296],
[0.23809524, 0.32592593],
[0.52380952, 0.31111111],
[0.97619048, 0.94814815],
[0.92857143, 0.08148148],
[0.80952381, 0.17037037],
[0.69047619, 0.72592593],
[0.83333333, 0.94814815],
[0.4047619 , 0.08888889],
[0.95238095, 0.63703704],
[0.64285714, 0.22222222],
[0.11904762, 0.4962963 ],
[0.66666667, 0.05925926],
[0.57142857, 0.37037037],
[0.23809524, 0.51111111],
[0.47619048, 0.32592593],
[0.19047619, 0.51111111],
[0.26190476, 0.0962963 ],
[0.45238095, 0.41481481],
[0.0952381 , 0.2962963 ],
[0.71428571, 0.14814815],
[0.73809524, 0.0962963 ],
[0.47619048, 0.37037037],
[0.21428571, 0.01481481],
[0.66666667, 0.0962963 ],
[0.71428571, 0.93333333],
[0.19047619, 0.01481481],
[0.4047619 , 0.60740741],
[0.5 , 0.32592593],
[0.14285714, 0.08888889],
[0.33333333, 0.02222222],
[0.66666667, 0.54074074],
[0.4047619 , 0.31851852],
[0.9047619 , 0.33333333],
[0.69047619, 0.14074074],
[0.52380952, 0.42222222],
[0.33333333, 0.62962963],
[0.02380952, 0.04444444],
[0.16666667, 0.55555556],
[0.4047619 , 0.54074074],
[0.23809524, 0.12592593],
[0.76190476, 0.03703704],
[0.52380952, 0.32592593],
[0.76190476, 0.21481481],
[0.4047619 , 0.42222222],
[0.52380952, 0.94074074],
[0.66666667, 0.12592593],
[0.5 , 0.41481481],
[0.04761905, 0.43703704],
[0.26190476, 0.44444444],
[0.30952381, 0.45185185],
[0.69047619, 0.07407407],
[0.52380952, 0.34074074],
[0.38095238, 0.71851852],
[0.47619048, 0.48148148],
[0.57142857, 0.44444444],
[0.69047619, 0.23703704],
[0.5 , 0.44444444],
[0.02380952, 0.07407407],
[0.45238095, 0.48148148],
[0.42857143, 0.33333333],
[0.54761905, 0.27407407],
[0.42857143, 0.81481481],
[0.71428571, 0.1037037 ],
[0.42857143, 0.82222222],
[0.78571429, 0.88148148],
[0.21428571, 0.31111111],
[0.47619048, 0.41481481],
[0.5 , 0.34074074],
[0.0952381 , 0.08888889],
[0.35714286, 0.33333333],
[0.71428571, 0.43703704],
[0.95238095, 0.05925926],
[0.83333333, 0.42222222],
[0.33333333, 0.75555556],
[0.85714286, 0.40740741],
[0.28571429, 0.48148148],
[0.95238095, 0.59259259],
[0.19047619, 0.27407407],
[0.64285714, 0.47407407],
[0.14285714, 0.2962963 ],
[0.52380952, 0.44444444],
[0.35714286, 0.0962963 ],
[0.61904762, 0.91851852],
[0.0952381 , 0.02222222],
[0.35714286, 0.26666667],
[0.5952381 , 0.87407407],
[0.14285714, 0.12592593],
[0.66666667, 0.05185185],
[0.4047619 , 0.2962963 ],
[0.85714286, 0.65925926],
[0.71428571, 0.77037037],
[0.4047619 , 0.28148148],
[0.45238095, 0.95555556],
[0.11904762, 0.37777778],
[0.45238095, 0.9037037 ],
[0.30952381, 0.31851852],
[0.35714286, 0.19259259],
[0.64285714, 0.05185185],
[0.28571429, 0. ],
[0.02380952, 0.02962963],
[0.73809524, 0.43703704],
[0.5 , 0.79259259],
[0.4047619 , 0.42962963],
[0.5 , 0.41481481],
[0.14285714, 0.05925926],
[0.54761905, 0.42222222],
[0.26190476, 0.5037037 ],
[0.85714286, 0.08148148],
[0.4047619 , 0.21481481],
[0.45238095, 0.44444444],
[0.26190476, 0.23703704],
[0.30952381, 0.39259259],
[0.57142857, 0.28888889],
[0.28571429, 0.88888889],
[0.80952381, 0.73333333],
[0.76190476, 0.15555556],
[0.9047619 , 0.87407407],
[0.26190476, 0.34074074],
[0.28571429, 0.54814815],
[0.19047619, 0.00740741],
[0.35714286, 0.11851852],
[0.54761905, 0.42222222],
[0.42857143, 0.13333333],
[0.88095238, 0.81481481],
[0.71428571, 0.85925926],
[0.54761905, 0.41481481],
[0.28571429, 0.34814815],
[0.45238095, 0.42222222],
[0.54761905, 0.35555556],
[0.95238095, 0.23703704],
[0.28571429, 0.74814815],
[0.04761905, 0.25185185],
[0.45238095, 0.43703704],
[0.54761905, 0.32592593],
[0.73809524, 0.54814815],
[0.23809524, 0.47407407],
[0.83333333, 0.4962963 ],
[0.52380952, 0.31111111],
[1. , 0.14074074],
[0.4047619 , 0.68888889],
[0.07142857, 0.42222222],
[0.47619048, 0.41481481],
[0.5 , 0.67407407],
[0.45238095, 0.31111111],
[0.19047619, 0.42222222],
[0.4047619 , 0.05925926],
[0.85714286, 0.68888889],
[0.28571429, 0.01481481],
[0.5 , 0.88148148],
[0.26190476, 0.20740741],
[0.35714286, 0.20740741],
[0.4047619 , 0.17037037],
[0.54761905, 0.22222222],
[0.54761905, 0.42222222],
[0.5 , 0.88148148],
[0.21428571, 0.9037037 ],
[0.07142857, 0.00740741],
[0.19047619, 0.12592593],
[0.30952381, 0.37777778],
[0.5 , 0.42962963],
[0.54761905, 0.47407407],
[0.69047619, 0.25925926],
[0.54761905, 0.11111111],
[0.45238095, 0.57777778],
[1. , 0.22962963],
[0.16666667, 0.05185185],
[0.23809524, 0.16296296],
[0.47619048, 0.2962963 ],
[0.42857143, 0.28888889],
[0.04761905, 0.15555556],
[0.9047619 , 0.65925926],
[0.52380952, 0.31111111],
[0.57142857, 0.68888889],
[0.04761905, 0.05925926],
[0.52380952, 0.37037037],
[0.69047619, 0.03703704],
[0. , 0.52592593],
[0.4047619 , 0.47407407],
[0.92857143, 0.13333333],
[0.38095238, 0.42222222],
[0.73809524, 0.17777778],
[0.21428571, 0.11851852],
[0.02380952, 0.40740741],
[0.5 , 0.47407407],
[0.19047619, 0.48888889],
[0.16666667, 0.48148148],
[0.23809524, 0.51851852],
[0.88095238, 0.17777778],
[0.76190476, 0.54074074],
[0.73809524, 0.54074074],
[0.80952381, 1. ],
[0.4047619 , 0.37037037],
[0.57142857, 0.28888889],
[0.38095238, 0.20740741],
[0.45238095, 0.27407407],
[0.71428571, 0.11111111],
[0.26190476, 0.20740741],
[0.42857143, 0.27407407],
[0.21428571, 0.28888889],
[0.19047619, 0.76296296]])
In [97]: y_train
Out[97]: 250 0
63 1
312 0
159 1
283 1
..
323 1
192 0
117 0
47 0
172 0
Name: Purchased, Length: 300, dtype: int64
In [98]: from sklearn.linear_model import LogisticRegression
In [99]: #create the object

classifier = LogisticRegression()
In [100… classifier.fit(x_train,y_train)
Out[100]: ▾ LogisticRegression
LogisticRegression()
In [102… #prediction
y_pred = classifier.predict(x_test)
In [101… y_train.shape
Out[101]: (300,)
In [103… x_train.shape
Out[103]: (300, 2)
In [104… y_pred
Out[104]: array([0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1,
0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1,
0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1,
0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1], dtype=int64)
In [105… y_test
Out[105]: 132 0
309 0
341 0
196 0
246 0
..
146 1
135 0
390 1
264 1
364 1
Name: Purchased, Length: 100, dtype: int64
In [106… from sklearn.metrics import accuracy_score

accuracy_score(y_test,y_pred)
Out[106]: 0.89
In [108… from sklearn.metrics import classification_report
In [109… print(classification_report(y_test,y_pred))
              precision    recall  f1-score   support

           0       0.87      0.99      0.92        68
           1       0.96      0.69      0.80        32

    accuracy                           0.89       100
   macro avg       0.91      0.84      0.86       100
weighted avg       0.90      0.89      0.88       100
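The figures in the report can be reproduced from four counts. The counts below are not notebook output: they are inferred from the rounded report (support of 68 and 32, accuracy of 0.89, recall of 0.69 for class 1) and appear to be the only whole numbers that fit.

```python
# Inferred confusion-matrix counts (an assumption, not shown in the notebook)
tn, fp = 67, 1    # actual class 0: predicted 0, predicted 1
fn, tp = 10, 22   # actual class 1: predicted 0, predicted 1

# Class 1 (purchased)
precision_1 = tp / (tp + fp)
recall_1 = tp / (tp + fn)
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

# Class 0 (not purchased)
precision_0 = tn / (tn + fn)
recall_0 = tn / (tn + fp)

accuracy = (tp + tn) / (tp + tn + fp + fn)

print(round(precision_1, 2), round(recall_1, 2), round(f1_1, 2))
print(round(precision_0, 2), round(recall_0, 2), round(accuracy, 2))
```

Read this way, the model rarely flags a non-buyer as a buyer (1 of 68) but misses about a third of the actual buyers (10 of 32).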
In [110… new1=[[26,34000]]
new2=[[57,138000]]
In [111… classifier.predict(scaler.transform(new1))

e names, but MinMaxScaler was fitted with feature names
warnings.warn(
Out[111]: array([0], dtype=int64)
In [112… classifier.predict(scaler.transform(new2))

e names, but MinMaxScaler was fitted with feature names
warnings.warn(
Out[112]: array([1], dtype=int64)
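The two predictions above scale the new rows by hand and trigger a feature-name warning because plain lists are passed to a scaler fitted on a DataFrame. One way to keep scaling and prediction together is a scikit-learn pipeline fed with DataFrames. This is a sketch on eight made-up rows, not the notebook's Social_Network_Ads.csv, and the names toy, clf and new_rows are illustrative only:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Made-up training rows with the notebook's column names
toy = pd.DataFrame({
    "Age": [19, 25, 30, 35, 45, 50, 55, 60],
    "EstimatedSalary": [19000, 30000, 43000, 57000, 90000, 110000, 130000, 150000],
    "Purchased": [0, 0, 0, 0, 1, 1, 1, 1],
})
features = ["Age", "EstimatedSalary"]

# The pipeline scales the inputs and then classifies them in one call
clf = make_pipeline(MinMaxScaler(), LogisticRegression())
clf.fit(toy[features], toy["Purchased"])

# New samples as a DataFrame with the same column names
new_rows = pd.DataFrame([[26, 34000], [57, 138000]], columns=features)
labels = clf.predict(new_rows)
print(labels)
```

With a pipeline, the scaler is fitted only on whatever data is passed to fit, so the same object can be used safely with a train/test split.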
In [ ]:
