
9/7/23, 1:49 PM Assign 4-Samana Tatheer 20U00323 .ipynb - Colaboratory

Name: Samana Tatheer ID: 20U00323 Assign 4

import pandas as pd

Q1. Import Dataset

df323 = pd.read_csv("/content/train.csv")
df323

        User_ID  Product_ID  Gender    Age  Occupation  City_Category  Stay_In_Current_Ci
0       1000001   P00069042       F   0-17          10              A
1       1000001   P00248942       F   0-17          10              A
2       1000001   P00087842       F   0-17          10              A
3       1000001   P00085442       F   0-17          10              A
4       1000002   P00285442       M    55+          16              C
...         ...         ...     ...    ...         ...            ...
550063  1006033   P00372445       M  51-55          13              B
550064  1006035   P00375436       F  26-35           1              C
550065  1006036   P00375436       F  26-35          15              B
550066  1006038   P00375436       F    55+           1              C
550067  1006039   P00371644       F  46-50           0              B

550068 rows × 12 columns
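As a small illustration of what the import step produces, the sketch below reads a two-row CSV from an in-memory string instead of /content/train.csv. The rows reuse values visible in the output above; the `csv_text` and `toy` names are only assumptions for the example. It shows that age bins such as 0-17 come in as text, not numbers.

```python
import io
import pandas as pd

# In-memory stand-in for train.csv (only five of the twelve columns).
csv_text = """User_ID,Product_ID,Gender,Age,Occupation
1000001,P00069042,F,0-17,10
1000002,P00285442,M,55+,16
"""

toy = pd.read_csv(io.StringIO(csv_text))

print(toy.shape)    # (rows, columns)
print(toy.dtypes)   # Age is read as text because of values like "0-17"
```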

Q2. Data Profiling: summarizes the dataset, including the number of variables it contains
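To give a rough idea of what a profile's overview contains, the headline counts can also be computed by hand with pandas. This is only a sketch on a made-up three-column frame (`toy` and `overview` are hypothetical names), not how ydata_profiling works internally.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real dataset (values are made up for illustration).
toy = pd.DataFrame({
    "Gender": ["F", "M", "M", "F"],
    "Product_Category_2": [np.nan, 6.0, np.nan, 14.0],
    "Purchase": [8370, 15200, 1422, 1057],
})

n_obs, n_vars = toy.shape                        # rows, columns
missing_cells = int(toy.isnull().sum().sum())    # NaN count over the whole frame
missing_pct = 100 * missing_cells / (n_obs * n_vars)
duplicate_rows = int(toy.duplicated().sum())     # fully repeated rows

overview = {
    "Number of variables": n_vars,
    "Number of observations": n_obs,
    "Missing cells": missing_cells,
    "Missing cells (%)": round(missing_pct, 1),
    "Duplicate rows": duplicate_rows,
}
print(overview)
```

The same arithmetic applies to the full dataset: 556885 missing cells out of 550068 × 12 = 6600816 cells is about 8.4%.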

pip install ydata_profiling

Collecting ydata_profiling
Downloading ydata_profiling-4.5.1-py2.py3-none-any.whl (357 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 357.3/357.3 kB 5.1 MB/s eta 0:00:00
Requirement already satisfied: scipy<1.12,>=1.4.1 in /usr/local/lib/python3.10/dist-packages (from ydata_profili
Requirement already satisfied: pandas!=1.4.0,<2.1,>1.1 in /usr/local/lib/python3.10/dist-packages (from ydata_pr
Requirement already satisfied: matplotlib<4,>=3.2 in /usr/local/lib/python3.10/dist-packages (from ydata_profili
Collecting pydantic<2,>=1.8.1 (from ydata_profiling)
Downloading pydantic-1.10.12-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.1 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 3.1/3.1 MB 43.3 MB/s eta 0:00:00
Requirement already satisfied: PyYAML<6.1,>=5.0.0 in /usr/local/lib/python3.10/dist-packages (from ydata_profili
Requirement already satisfied: jinja2<3.2,>=2.11.1 in /usr/local/lib/python3.10/dist-packages (from ydata_profil
Collecting visions[type_image_path]==0.7.5 (from ydata_profiling)
Downloading visions-0.7.5-py3-none-any.whl (102 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 102.7/102.7 kB 10.4 MB/s eta 0:00:00
Requirement already satisfied: numpy<1.24,>=1.16.0 in /usr/local/lib/python3.10/dist-packages (from ydata_profil
Collecting htmlmin==0.1.12 (from ydata_profiling)
Downloading htmlmin-0.1.12.tar.gz (19 kB)
Preparing metadata (setup.py) ... done

Collecting phik<0.13,>=0.11.1 (from ydata_profiling)
Downloading phik-0.12.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (679 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 679.5/679.5 kB 50.7 MB/s eta 0:00:00
Requirement already satisfied: requests<3,>=2.24.0 in /usr/local/lib/python3.10/dist-packages (from ydata_profil
Requirement already satisfied: tqdm<5,>=4.48.2 in /usr/local/lib/python3.10/dist-packages (from ydata_profiling)
Requirement already satisfied: seaborn<0.13,>=0.10.1 in /usr/local/lib/python3.10/dist-packages (from ydata_prof
Collecting multimethod<2,>=1.4 (from ydata_profiling)
Downloading multimethod-1.9.1-py3-none-any.whl (10 kB)
Requirement already satisfied: statsmodels<1,>=0.13.2 in /usr/local/lib/python3.10/dist-packages (from ydata_pro
Collecting typeguard<3,>=2.13.2 (from ydata_profiling)
Downloading typeguard-2.13.3-py3-none-any.whl (17 kB)
Collecting imagehash==4.3.1 (from ydata_profiling)
Downloading ImageHash-4.3.1-py2.py3-none-any.whl (296 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 296.5/296.5 kB 25.4 MB/s eta 0:00:00
Requirement already satisfied: wordcloud>=1.9.1 in /usr/local/lib/python3.10/dist-packages (from ydata_profiling
Collecting dacite>=1.8 (from ydata_profiling)
Downloading dacite-1.8.1-py3-none-any.whl (14 kB)
Requirement already satisfied: PyWavelets in /usr/local/lib/python3.10/dist-packages (from imagehash==4.3.1->yda
Requirement already satisfied: pillow in /usr/local/lib/python3.10/dist-packages (from imagehash==4.3.1->ydata_p
Requirement already satisfied: attrs>=19.3.0 in /usr/local/lib/python3.10/dist-packages (from visions[type_image
Requirement already satisfied: networkx>=2.4 in /usr/local/lib/python3.10/dist-packages (from visions[type_image
Collecting tangled-up-in-unicode>=0.0.4 (from visions[type_image_path]==0.7.5->ydata_profiling)
Downloading tangled_up_in_unicode-0.2.0-py3-none-any.whl (4.7 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 4.7/4.7 MB 85.2 MB/s eta 0:00:00
Requirement already satisfied: MarkupSafe>=2.0 in /usr/local/lib/python3.10/dist-packages (from jinja2<3.2,>=2.1
Requirement already satisfied: contourpy>=1.0.1 in /usr/local/lib/python3.10/dist-packages (from matplotlib<4,>=
Requirement already satisfied: cycler>=0.10 in /usr/local/lib/python3.10/dist-packages (from matplotlib<4,>=3.2-
Requirement already satisfied: fonttools>=4.22.0 in /usr/local/lib/python3.10/dist-packages (from matplotlib<4,>
Requirement already satisfied: kiwisolver>=1.0.1 in /usr/local/lib/python3.10/dist-packages (from matplotlib<4,>
Requirement already satisfied: packaging>=20.0 in /usr/local/lib/python3.10/dist-packages (from matplotlib<4,>=3
Requirement already satisfied: pyparsing>=2.3.1 in /usr/local/lib/python3.10/dist-packages (from matplotlib<4,>=
Requirement already satisfied: python-dateutil>=2.7 in /usr/local/lib/python3.10/dist-packages (from matplotlib<
Requirement already satisfied: pytz>=2020.1 in /usr/local/lib/python3.10/dist-packages (from pandas!=1.4.0,<2.1,
Requirement already satisfied: joblib>=0.14.1 in /usr/local/lib/python3.10/dist-packages (from phik<0.13,>=0.11.
Requirement already satisfied: typing-extensions>=4.2.0 in /usr/local/lib/python3.10/dist-packages (from pydanti
Requirement already satisfied: charset-normalizer<4,>=2 in /usr/local/lib/python3.10/dist-packages (from request
Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.10/dist-packages (from requests<3,>=2.24.0
Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.10/dist-packages (from requests<3,>=
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.10/dist-packages (from requests<3,>=

from ydata_profiling import ProfileReport

profile = ProfileReport(df323, title="Data profile")


profile


Summarize dataset: 100% 58/58 [00:34<00:00, 1.70it/s, Completed]
Generate report structure: 100% 1/1 [00:09<00:00, 9.52s/it]
Render HTML: 100% 1/1 [00:01<00:00, 1.81s/it]

Overview

Dataset statistics
Number of variables 12
Number of observations 550068
Missing cells 556885
Missing cells (%) 8.4%
Duplicate rows 0
Duplicate rows (%) 0.0%
Total size in memory 50.4 MiB
Average record size in memory 96.0 B

Variable types
Numeric 6
Text 1
Categorical 5

Alerts
Product_Category_2 is highly overall correlated with Product_Category_3 (High correlation)

There are 12 variables in the dataset

df323.dtypes

User_ID                         int64
Product_ID                     object
Gender                         object
Age                            object
Occupation                      int64
City_Category                  object
Stay_In_Current_City_Years     object
Marital_Status                  int64
Product_Category_1              int64
Product_Category_2            float64
Product_Category_3            float64
Purchase                        int64
dtype: object

Q3. Convert the variables to categorical and numerical data types

Making a list of the columns to convert

cols=['User_ID', 'Product_ID', 'Gender', 'Age', 'Occupation', 'City_Category','Stay_In_Current_City_Years','Marital_Stat

Converting all the listed variables to categorical

df323[cols] = df323[cols].astype("category")
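A minimal sketch of the same conversion on a toy frame may help show its effect. The column subset `cat_cols` and the values are assumptions for illustration only; the pattern is the same `astype("category")` call used above.

```python
import pandas as pd

# Toy frame with made-up values.
toy = pd.DataFrame({
    "Gender": ["F", "M", "M", "F"],
    "Occupation": [10, 16, 10, 1],
    "Purchase": [8370, 15200, 1422, 1057],
})

cat_cols = ["Gender", "Occupation"]               # hypothetical subset of columns
toy[cat_cols] = toy[cat_cols].astype("category")  # each column gets category dtype
toy["Purchase"] = toy["Purchase"].astype(float)   # numeric target stays numeric

print(toy.dtypes)
print(toy["Occupation"].cat.categories.tolist())  # distinct levels, sorted
```

Note that an integer code such as Occupation becomes a set of labels after conversion, so numeric summaries like the mean no longer apply to it.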

converting the purchase variable into float

df323["Purchase"]=df323["Purchase"].astype(float)

Q4. Identification of outliers


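import numpy as np
# Quartiles and interquartile range (IQR) of Purchase
Q1, Q3 = np.percentile(df323["Purchase"], [25, 75])
IQR = Q3 - Q1
IQR

6231.0

# Row positions of values beyond the 1.5 * IQR fences
upper = np.where(df323["Purchase"] > (Q3 + 1.5 * IQR))
lower = np.where(df323["Purchase"] < (Q1 - 1.5 * IQR))

Replace outliers with missing values

# Set the outlying rows to NaN by position; replace() would match values, not row positions
outlier_rows = df323.index[np.concatenate([upper[0], lower[0]])]
df323.loc[outlier_rows, "Purchase"] = np.nan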
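The 1.5 × IQR rule can be checked on a small Series where the outlier is obvious. The sketch below uses made-up values and a boolean mask instead of positional indices; the names `s`, `is_outlier` and `cleaned` are assumptions for the example.

```python
import numpy as np
import pandas as pd

# Made-up purchases with one obviously extreme value.
s = pd.Series([10.0, 12.0, 11.0, 13.0, 12.0, 95.0], name="Purchase")

q1, q3 = np.percentile(s, [25, 75])
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# True for values outside the fences on either side.
is_outlier = (s < lower_fence) | (s > upper_fence)

# mask() replaces the flagged rows with NaN and leaves the rest unchanged.
cleaned = s.mask(is_outlier)

print(iqr, lower_fence, upper_fence)
print(cleaned)
```

A mask selects rows, whereas `Series.replace` looks for matching values, which is why row positions should not be passed to `replace`.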

df323.isnull().sum

<bound method NDFrame._add_numeric_operations.<locals>.sum of
        User_ID  Product_ID  Gender    Age  Occupation  City_Category  \
0         False       False   False  False       False          False
1         False       False   False  False       False          False
2         False       False   False  False       False          False
3         False       False   False  False       False          False
4         False       False   False  False       False          False
...         ...         ...     ...    ...         ...            ...
550063    False       False   False  False       False          False
550064    False       False   False  False       False          False
550065    False       False   False  False       False          False
550066    False       False   False  False       False          False
550067    False       False   False  False       False          False

        Stay_In_Current_City_Years  Marital_Status  Product_Category_1  \
0                            False           False               False
1                            False           False               False
2                            False           False               False
3                            False           False               False
4                            False           False               False
...                            ...             ...                 ...
550063                       False           False               False
550064                       False           False               False
550065                       False           False               False
550066                       False           False               False
550067                       False           False               False

        Product_Category_2  Product_Category_3  Purchase
0                     True                True     False
1                    False               False     False
2                     True                True     False
3                    False                True     False
4                     True                True     False
...                    ...                 ...       ...
550063                True                True     False
550064                True                True     False
550065                True                True     False
550066                True                True     False
550067                True                True     False

[550068 rows x 12 columns]>
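The output above is the printed form of a bound method, because `sum` was referenced without parentheses and therefore never called. A minimal sketch on a toy frame (made-up values, hypothetical names) shows the difference between the two forms.

```python
import numpy as np
import pandas as pd

# Toy frame with a few missing cells.
toy = pd.DataFrame({
    "Product_Category_2": [np.nan, 6.0, np.nan],
    "Purchase": [8370.0, np.nan, 1422.0],
})

method = toy.isnull().sum     # no parentheses: a method object, nothing is computed
counts = toy.isnull().sum()   # with parentheses: number of NaN values per column

print(callable(method))
print(counts)
```

Calling `df323.isnull().sum()` in the same way would give one missing-value count per column instead of the full True/False table.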

Q5. Dropping Variables

df323b = df323.drop(["User_ID","Product_ID"], axis=1)
df323b.shape

(550068, 10)

Q6. Replacing missing values with the median (Purchase) and the mode (Product_Category_2, Product_Category_3)

df323b["Purchase"]=df323b["Purchase"].fillna(df323b["Purchase"].median())

df323b["Product_Category_2"]=df323b["Product_Category_2"].fillna(df323b["Product_Category_2"].mode()[0])

df323b["Product_Category_3"]=df323b["Product_Category_3"].fillna(df323b["Product_Category_3"].mode()[0])

df323b.isnull().sum()

Gender 0
Age 0
Occupation 0
City_Category 0
Stay_In_Current_City_Years 0
Marital_Status 0
Product_Category_1 0
Product_Category_2 0
Product_Category_3 0
Purchase 0
dtype: int64
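The imputation choices above can be illustrated on a toy frame: the median for a skewed numeric column and the most frequent value for a category code. The values and the name `toy` are made up for the example.

```python
import numpy as np
import pandas as pd

# Toy frame: one skewed numeric column and one category code, each with a gap.
toy = pd.DataFrame({
    "Product_Category_2": [8.0, np.nan, 8.0, 14.0],
    "Purchase": [100.0, np.nan, 300.0, 5000.0],
})

purchase_median = toy["Purchase"].median()            # NaN is ignored
category_mode = toy["Product_Category_2"].mode()[0]   # first most frequent value

toy["Purchase"] = toy["Purchase"].fillna(purchase_median)
toy["Product_Category_2"] = toy["Product_Category_2"].fillna(category_mode)

print(toy)
print(toy.isnull().sum())
```

In this toy column the mean of the observed purchases is 1800 while the median is 300, which shows why the median is less affected by a single large value.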

Q7. Descriptive Statistics

df323b.describe(include='all')

        Gender     Age  Occupation  City_Category  Stay_In_Current_City_Years  Marital_
count   550068  550068    550068.0         550068                      550068  55
unique       2       7        21.0              3                           5
top          M   26-35         4.0              B                           1
freq    414259  219587     72308.0         231173                      193821  32
mean       NaN     NaN         NaN            NaN                         NaN
std        NaN     NaN         NaN            NaN                         NaN
min        NaN     NaN         NaN            NaN                         NaN
25%        NaN     NaN         NaN            NaN                         NaN
50%        NaN     NaN         NaN            NaN                         NaN
75%        NaN     NaN         NaN            NaN                         NaN
max        NaN     NaN         NaN            NaN                         NaN
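The NaN entries in the table above follow from how `describe(include='all')` combines two kinds of summary: categorical columns get count, unique, top and freq, while numeric columns get mean, std and the quantiles. A minimal sketch on a toy frame (made-up values, hypothetical names):

```python
import pandas as pd

# One categorical and one numeric column.
toy = pd.DataFrame({
    "Gender": pd.Categorical(["M", "F", "M", "M"]),
    "Purchase": [100.0, 200.0, 300.0, 400.0],
})

summary = toy.describe(include="all")

print(summary)
# Rows that do not apply to a column's type are filled with NaN.
print(summary.loc[["top", "freq", "mean"]])
```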
