Download as pdf or txt
Download as pdf or txt
You are on page 1of 48

Regression Analysis

Problem Statement
Google Play Store team is about to launch a new feature wherein, certain apps that are
promising are boosted in visibility. The boost will manifest in multiple ways including higher
priority in recommendations sections (“Similar apps”, “You might also like”, “New and updated
games”). These will also get a boost in search results visibility. This feature will help bring more
attention to newer apps that have the potential.

Analysis to be done:
The problem is to identify the apps that are going to be good for Google to promote. App
ratings, which are provided by the customers, are always great indicators of the goodness of the
app. The problem reduces to: predict which apps will have high ratings.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.filterwarnings('ignore')
%matplotlib inline

df = pd.read_csv('googleplaystore.csv')

df.shape

(10841, 13)

df.head(3)

App Category
Rating \
0 Photo Editor & Candy Camera & Grid & ScrapBook ART_AND_DESIGN
4.1
1 Coloring book moana ART_AND_DESIGN
3.9
2 U Launcher Lite – FREE Live Cool Themes, Hide ... ART_AND_DESIGN
4.7

Reviews Size Installs Type Price Content_Rating \


0 159 19M 10,000+ Free 0 Everyone
1 967 14M 500,000+ Free 0 Everyone
2 87510 8.7M 5,000,000+ Free 0 Everyone

Genres Last Updated Current Ver Android


Ver
0 Art_&_Design January 7, 2018 1.0.0 4.0.3 and
up
1 Art_&_Design_Pretend_Play January 15, 2018 2.0.0 4.0.3 and
up
2 Art_&_Design August 1, 2018 1.2.4 4.0.3 and
up

# Data Information

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10841 entries, 0 to 10840
Data columns (total 13 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 App 10841 non-null object
1 Category 10841 non-null object
2 Rating 9367 non-null float64
3 Reviews 10841 non-null object
4 Size 10841 non-null object
5 Installs 10841 non-null object
6 Type 10840 non-null object
7 Price 10841 non-null object
8 Content_Rating 10840 non-null object
9 Genres 10841 non-null object
10 Last Updated 10841 non-null object
11 Current Ver 10833 non-null object
12 Android Ver 10838 non-null object
dtypes: float64(1), object(12)
memory usage: 1.1+ MB

# Find And Treat Missing Values

df.isnull().sum()

App 0
Category 0
Rating 1474
Reviews 0
Size 0
Installs 0
Type 1
Price 0
Content_Rating 1
Genres 0
Last Updated 0
Current Ver 8
Android Ver 3
dtype: int64
print("Missing Value %age :" ,
(df.isnull().sum().sum()/df.shape[0])*100)

Missing Value %age : 13.716446822248871

# Dropping missing Values row

df.dropna(how='any', inplace= True)

df.shape

(9360, 13)

df.isnull().sum()

App 0
Category 0
Rating 0
Reviews 0
Size 0
Installs 0
Type 0
Price 0
Content_Rating 0
Genres 0
Last Updated 0
Current Ver 0
Android Ver 0
dtype: int64

Data Wrangling
# Check for Duplicate Data
df[df.duplicated()]

App Category
Rating \
229 Quick PDF Scanner + OCR FREE BUSINESS
4.2
236 Box BUSINESS
4.2
239 Google My Business BUSINESS
4.4
256 ZOOM Cloud Meetings BUSINESS
4.4
261 join.me - Simple Meetings BUSINESS
4.0
... ... ...
...
8643 Wunderlist: To-Do List & Tasks PRODUCTIVITY
4.6
8654 TickTick: To Do List with Reminder, Day Planner PRODUCTIVITY
4.6
8658 ColorNote Notepad Notes PRODUCTIVITY
4.6
10049 Airway Ex - Intubate. Anesthetize. Train. MEDICAL
4.3
10768 AAFP MEDICAL
3.8

Reviews Size Installs Type Price


Content_Rating \
229 80805 Varies with device 5,000,000+ Free 0
Everyone
236 159872 Varies with device 10,000,000+ Free 0
Everyone
239 70991 Varies with device 5,000,000+ Free 0
Everyone
256 31614 37M 10,000,000+ Free 0
Everyone
261 6989 Varies with device 1,000,000+ Free 0
Everyone
... ... ... ... ... ...
...
8643 404610 Varies with device 10,000,000+ Free 0
Everyone
8654 25370 Varies with device 1,000,000+ Free 0
Everyone
8658 2401017 Varies with device 100,000,000+ Free 0
Everyone
10049 123 86M 10,000+ Free 0
Everyone
10768 63 24M 10,000+ Free 0
Everyone

Genres Last Updated Current Ver


Android Ver
229 Business February 26, 2018 Varies with device
4.0.3 and up
236 Business July 31, 2018 Varies with device Varies
with device
239 Business July 24, 2018 2.19.0.204537701
4.4 and up
256 Business July 20, 2018 4.1.28165.0716
4.0 and up
261 Business July 16, 2018 4.3.0.508
4.4 and up
... ... ... ...
...
8643 Productivity April 6, 2018 Varies with device Varies
with device
8654 Productivity August 6, 2018 Varies with device Varies
with device
8658 Productivity June 27, 2018 Varies with device Varies
with device
10049 Medical June 1, 2018 0.6.88
5.0 and up
10768 Medical June 22, 2018 2.3.1
5.0 and up

[474 rows x 13 columns]

# Drop The duplicates


df = df.drop_duplicates(keep='last')

# Price

df.Price.value_counts()

0 8275
$2.99 110
$0.99 104
$4.99 68
$1.99 59
...
$1.29 1
$299.99 1
$379.99 1
$39.99 1
$1.20 1
Name: Price, Length: 73, dtype: int64

float('$299.99'[1:])

299.99

float('$299.99'.replace('$', ''))

299.99

df['Price'] = df.Price.map(lambda x: 0 if x=='0' else float(x[1:]))

df.Price.value_counts()

0.00 8275
2.99 110
0.99 104
4.99 68
1.99 59
...
1.29 1
299.99 1
379.99 1
39.99 1
1.20 1
Name: Price, Length: 73, dtype: int64

df.Price.info()

<class 'pandas.core.series.Series'>
Int64Index: 8886 entries, 0 to 10840
Series name: Price
Non-Null Count Dtype
-------------- -----
8886 non-null float64
dtypes: float64(1)
memory usage: 138.8 KB

df.head()

App Category
Rating \
0 Photo Editor & Candy Camera & Grid & ScrapBook ART_AND_DESIGN
4.1
1 Coloring book moana ART_AND_DESIGN
3.9
2 U Launcher Lite – FREE Live Cool Themes, Hide ... ART_AND_DESIGN
4.7
3 Sketch - Draw & Paint ART_AND_DESIGN
4.5
4 Pixel Draw - Number Art Coloring Book ART_AND_DESIGN
4.3

Reviews Size Installs Type Price Content_Rating \


0 159 19M 10,000+ Free 0.0 Everyone
1 967 14M 500,000+ Free 0.0 Everyone
2 87510 8.7M 5,000,000+ Free 0.0 Everyone
3 215644 25M 50,000,000+ Free 0.0 Teen
4 967 2.8M 100,000+ Free 0.0 Everyone

Genres Last Updated Current Ver \


0 Art_&_Design January 7, 2018 1.0.0
1 Art_&_Design_Pretend_Play January 15, 2018 2.0.0
2 Art_&_Design August 1, 2018 1.2.4
3 Art_&_Design June 8, 2018 Varies with device
4 Art_&_Design_Creativity June 20, 2018 1.1

Android Ver
0 4.0.3 and up
1 4.0.3 and up
2 4.0.3 and up
3 4.2 and up
4 4.4 and up

df.Reviews.value_counts()

2 82
3 76
4 74
5 74
1 67
..
70189 1
10859051 1
111066 1
129272 1
398307 1
Name: Reviews, Length: 5990, dtype: int64

# Convert the datatype of reviews to int

df.Reviews = df.Reviews.astype('int')

df.Reviews.info()

<class 'pandas.core.series.Series'>
Int64Index: 8886 entries, 0 to 10840
Series name: Reviews
Non-Null Count Dtype
-------------- -----
8886 non-null int64
dtypes: int64(1)
memory usage: 138.8 KB

df.Size.value_counts()

Varies with device 1468


14M 153
13M 152
12M 151
11M 149
...
454k 1
812k 1
442k 1
842k 1
619k 1
Name: Size, Length: 413, dtype: int64

def change_size(size):
if 'M' in size:
x = size[:-1]
x= float(x)*1000
return x
elif 'k' in size:
x = float(size[:-1])
return x
else:
return None

df['Size'] = df.Size.map(change_size)

df.Size.info()

<class 'pandas.core.series.Series'>
Int64Index: 8886 entries, 0 to 10840
Series name: Size
Non-Null Count Dtype
-------------- -----
7418 non-null float64
dtypes: float64(1)
memory usage: 138.8 KB

df.Size.isnull().sum()

1468

df.tail(10)

App
Category \
10828 Manga-FR - Anime Vostfr
COMICS
10829 Bulgarian French Dictionary Fr
BOOKS_AND_REFERENCE
10830 News Minecraft.fr
NEWS_AND_MAGAZINES
10832 FR Tides
WEATHER
10833 Chemin (fr)
BOOKS_AND_REFERENCE
10834 FR Calculator
FAMILY
10836 Sya9a Maroc - FR
FAMILY
10837 Fr. Mike Schmitz Audio Teachings
FAMILY
10839 The SCP Foundation DB fr nn5n
BOOKS_AND_REFERENCE
10840 iHoroscope - 2018 Daily Horoscope & Astrology
LIFESTYLE
Rating Reviews Size Installs Type Price
Content_Rating \
10828 3.4 291 13000.0 10,000+ Free 0.0
Everyone
10829 4.6 603 7400.0 10,000+ Free 0.0
Everyone
10830 3.8 881 2300.0 100,000+ Free 0.0
Everyone
10832 3.8 1195 582.0 100,000+ Free 0.0
Everyone
10833 4.8 44 619.0 1,000+ Free 0.0
Everyone
10834 4.0 7 2600.0 500+ Free 0.0
Everyone
10836 4.5 38 53000.0 5,000+ Free 0.0
Everyone
10837 5.0 4 3600.0 100+ Free 0.0
Everyone
10839 4.5 114 NaN 1,000+ Free 0.0
Mature_17+
10840 4.5 398307 19000.0 10,000,000+ Free 0.0
Everyone

Genres Last Updated Current Ver \


10828 Comics May 15, 2017 2.0.1
10829 Books_&_Reference June 19, 2016 2.96
10830 News_&_Magazines January 20, 2014 1.5
10832 Weather February 16, 2014 6
10833 Books_&_Reference March 23, 2014 0.8
10834 Education June 18, 2017 1.0.0
10836 Education July 25, 2017 1.48
10837 Education July 6, 2018 1
10839 Books_&_Reference January 19, 2015 Varies with device
10840 Lifestyle July 25, 2018 Varies with device

Android Ver
10828 4.0 and up
10829 4.1 and up
10830 1.6 and up
10832 2.1 and up
10833 2.2 and up
10834 4.1 and up
10836 4.1 and up
10837 4.1 and up
10839 Varies with device
10840 Varies with device

df.Size.describe()
count 7418.000000
mean 22760.828862
std 23439.210125
min 8.500000
25% 5100.000000
50% 14000.000000
75% 33000.000000
max 100000.000000
Name: Size, dtype: float64

# Use forward fill method of fillna to impute missing values created


in size
df.Size.fillna(method = 'ffill', inplace = True)

df.Size.isnull().sum()

df.Size.describe()

count 8886.000000
mean 22982.979912
std 23343.630842
min 8.500000
25% 5300.000000
50% 14000.000000
75% 33000.000000
max 100000.000000
Name: Size, dtype: float64

df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 8886 entries, 0 to 10840
Data columns (total 13 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 App 8886 non-null object
1 Category 8886 non-null object
2 Rating 8886 non-null float64
3 Reviews 8886 non-null int64
4 Size 8886 non-null float64
5 Installs 8886 non-null object
6 Type 8886 non-null object
7 Price 8886 non-null float64
8 Content_Rating 8886 non-null object
9 Genres 8886 non-null object
10 Last Updated 8886 non-null object
11 Current Ver 8886 non-null object
12 Android Ver 8886 non-null object
dtypes: float64(3), int64(1), object(9)
memory usage: 971.9+ KB

df.Installs

0 10,000+
1 500,000+
2 5,000,000+
3 50,000,000+
4 100,000+
...
10834 500+
10836 5,000+
10837 100+
10839 1,000+
10840 10,000,000+
Name: Installs, Length: 8886, dtype: object

df['Installs'] = df.Installs.map(lambda x:
int(x.replace(',','').replace('+','')))

df.Installs

0 10000
1 500000
2 5000000
3 50000000
4 100000
...
10834 500
10836 5000
10837 100
10839 1000
10840 10000000
Name: Installs, Length: 8886, dtype: int64

df.head()

App Category
Rating \
0 Photo Editor & Candy Camera & Grid & ScrapBook ART_AND_DESIGN
4.1
1 Coloring book moana ART_AND_DESIGN
3.9
2 U Launcher Lite – FREE Live Cool Themes, Hide ... ART_AND_DESIGN
4.7
3 Sketch - Draw & Paint ART_AND_DESIGN
4.5
4 Pixel Draw - Number Art Coloring Book ART_AND_DESIGN
4.3
Reviews Size Installs Type Price Content_Rating \
0 159 19000.0 10000 Free 0.0 Everyone
1 967 14000.0 500000 Free 0.0 Everyone
2 87510 8700.0 5000000 Free 0.0 Everyone
3 215644 25000.0 50000000 Free 0.0 Teen
4 967 2800.0 100000 Free 0.0 Everyone

Genres Last Updated Current Ver \


0 Art_&_Design January 7, 2018 1.0.0
1 Art_&_Design_Pretend_Play January 15, 2018 2.0.0
2 Art_&_Design August 1, 2018 1.2.4
3 Art_&_Design June 8, 2018 Varies with device
4 Art_&_Design_Creativity June 20, 2018 1.1

Android Ver
0 4.0.3 and up
1 4.0.3 and up
2 4.0.3 and up
3 4.2 and up
4 4.4 and up

# Sanity Checks
# 1. Ratings must be between 1 and 5
# 2. No.of Reviews cannot be more than installs
# 3. For Free app the price must be 0
# 4. Paid App the price should be greater than 0

df.Rating.describe()

count 8886.000000
mean 4.187959
std 0.522428
min 1.000000
25% 4.000000
50% 4.300000
75% 4.500000
max 5.000000
Name: Rating, dtype: float64

len(df[df.Reviews>df.Installs])

df = df[df.Reviews<=df.Installs].copy() # With copy() you can create a


shallow copy of the dataframe

df.shape

(8879, 13)

df[(df.Type == 'Free') & (df.Price>0)]


Empty DataFrame
Columns: [App, Category, Rating, Reviews, Size, Installs, Type, Price,
Content_Rating, Genres, Last Updated, Current Ver, Android Ver]
Index: []

df[(df.Type == 'Paid') & (df.Price == 0)]

Empty DataFrame
Columns: [App, Category, Rating, Reviews, Size, Installs, Type, Price,
Content_Rating, Genres, Last Updated, Current Ver, Android Ver]
Index: []

# EDA - Exploratory Data Analysis

# Outlier Detection
sns.boxplot(x = 'Price', data = df)
plt.show()

sns.boxplot(x = 'Reviews', data = df)


plt.show()
# Distribution in Ratings

sns.histplot(x = 'Rating', data = df)


plt.show()

sns.histplot(x = 'Size', data = df)


plt.show()
sns.pairplot(df)
plt.show()
# Outlier Treatment:

len(df[df.Price>200])

15

df[df.Price>200]

App Category Rating Reviews


Size \
4197 most expensive app (H) FAMILY 4.3 6
1500.0
4362 💎 I'm rich LIFESTYLE 3.8 718
26000.0
4367 I'm Rich - Trump Edition LIFESTYLE 3.6 275
7300.0
5351 I am rich LIFESTYLE 3.8 3547
1800.0
5354 I am Rich Plus FAMILY 4.0 856
8700.0
5355 I am rich VIP LIFESTYLE 3.8 411
2600.0
5356 I Am Rich Premium FINANCE 4.1 1867
4700.0
5357 I am extremely Rich LIFESTYLE 2.9 41
2900.0
5358 I am Rich! FINANCE 3.8 93
22000.0
5359 I am rich(premium) FINANCE 3.5 472
965.0
5362 I Am Rich Pro FAMILY 4.4 201
2700.0
5364 I am rich (Most expensive app) FINANCE 4.1 129
2700.0
5366 I Am Rich FAMILY 3.6 217
4900.0
5369 I am Rich FINANCE 4.3 180
3800.0
5373 I AM RICH PRO PLUS FINANCE 4.0 36
41000.0

Installs Type Price Content_Rating Genres Last


Updated \
4197 100 Paid 399.99 Everyone Entertainment July
16, 2018
4362 10000 Paid 399.99 Everyone Lifestyle March
11, 2018
4367 10000 Paid 400.00 Everyone Lifestyle May
3, 2018
5351 100000 Paid 399.99 Everyone Lifestyle January
12, 2018
5354 10000 Paid 399.99 Everyone Entertainment May
19, 2018
5355 10000 Paid 299.99 Everyone Lifestyle July
21, 2018
5356 50000 Paid 399.99 Everyone Finance November
12, 2017
5357 1000 Paid 379.99 Everyone Lifestyle July
1, 2018
5358 1000 Paid 399.99 Everyone Finance December
11, 2017
5359 5000 Paid 399.99 Everyone Finance May
1, 2017
5362 5000 Paid 399.99 Everyone Entertainment May
30, 2017
5364 1000 Paid 399.99 Teen Finance December
6, 2017
5366 10000 Paid 389.99 Everyone Entertainment June
22, 2018
5369 5000 Paid 399.99 Everyone Finance March
22, 2018
5373 1000 Paid 399.99 Everyone Finance June
25, 2018

Current Ver Android Ver


4197 1 7.0 and up
4362 1.0.0 4.4 and up
4367 1.0.1 4.1 and up
5351 2 4.0.3 and up
5354 3 4.4 and up
5355 1.1.1 4.3 and up
5356 1.6 4.0 and up
5357 1 4.0 and up
5358 1 4.1 and up
5359 3.4 4.4 and up
5362 1.54 1.6 and up
5364 2 4.0.3 and up
5366 1.5 4.2 and up
5369 1 4.2 and up
5373 1.0.2 4.1 and up

df.drop(df[df.Price>200].index, axis =0, inplace= True)

df.shape

(8864, 13)

len(df[df.Reviews>25000000])

20

df.drop(df[df.Reviews>25000000].index, axis =0, inplace= True)

for cols in ['Price', 'Reviews', 'Installs', 'Size']:


sns.boxplot(x=cols, data = df)
plt.show()
# Do not drop more than 5% of the data for outlier removal

df.Installs.quantile([0.1,0.25,0.4,0.5,0.7,0.75,0.9,0.95,0.98,0.99])

0.10 1000.0
0.25 10000.0
0.40 100000.0
0.50 500000.0
0.70 1000000.0
0.75 5000000.0
0.90 10000000.0
0.95 100000000.0
0.98 100000000.0
0.99 500000000.0
Name: Installs, dtype: float64

len(df[df.Installs>500000000])

33

df.drop(df[df.Installs>500000000].index, axis=0, inplace = True)

df.shape

(8811, 13)

# Bivariate Analysis

# Joint plot between Price, Reviews, Size & Rating

for col in ['Price', 'Reviews', 'Installs', 'Size']:


sns.jointplot(x = col, y= 'Rating', data= df, )
plt.show()
for cols in ['Content_Rating', 'Category']:
plt.figure(figsize = (10,8))
sns.boxplot(x = cols, y= 'Rating', data= df)
plt.xticks(rotation =90)
plt.show()
df.head()

App Category
Rating \
0 Photo Editor & Candy Camera & Grid & ScrapBook ART_AND_DESIGN
4.1
1 Coloring book moana ART_AND_DESIGN
3.9
2 U Launcher Lite – FREE Live Cool Themes, Hide ... ART_AND_DESIGN
4.7
3 Sketch - Draw & Paint ART_AND_DESIGN
4.5
4 Pixel Draw - Number Art Coloring Book ART_AND_DESIGN
4.3

Reviews Size Installs Type Price Content_Rating \


0 159 19000.0 10000 Free 0.0 Everyone
1 967 14000.0 500000 Free 0.0 Everyone
2 87510 8700.0 5000000 Free 0.0 Everyone
3 215644 25000.0 50000000 Free 0.0 Teen
4 967 2800.0 100000 Free 0.0 Everyone

Genres Last Updated Current Ver \


0 Art_&_Design January 7, 2018 1.0.0
1 Art_&_Design_Pretend_Play January 15, 2018 2.0.0
2 Art_&_Design August 1, 2018 1.2.4
3 Art_&_Design June 8, 2018 Varies with device
4 Art_&_Design_Creativity June 20, 2018 1.1

Android Ver
0 4.0.3 and up
1 4.0.3 and up
2 4.0.3 and up
3 4.2 and up
4 4.4 and up

df.nunique()

App 8146
Category 33
Rating 39
Reviews 5933
Size 410
Installs 17
Type 2
Price 68
Content_Rating 6
Genres 115
Last Updated 1299
Current Ver 2590
Android Ver 31
dtype: int64

df.drop(columns= ['App','Last Updated','Current Ver', 'Android Ver'],


inplace= True)

df.shape

(8811, 9)

# Data - Preprocessing

df.skew()
Rating -1.825597
Reviews 8.388563
Size 1.424330
Installs 8.657071
Price 16.445954
dtype: float64

df.kurtosis()

Rating 5.586197
Reviews 90.948655
Size 1.419297
Installs 86.881887
Price 469.767060
dtype: float64

df.columns

Index(['Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type',


'Price',
'Content_Rating', 'Genres'],
dtype='object')

for col in ['Reviews', 'Size', 'Installs', 'Price']:


sns.kdeplot(x = col, data = df)
plt.show()
df.describe()

Rating Reviews Size Installs


Price
count 8811.000000 8.811000e+03 8811.000000 8.811000e+03
8811.000000
mean 4.187277 3.448017e+05 22943.343264 1.098901e+07
0.304967
std 0.523442 1.504349e+06 23283.356000 4.571031e+07
1.885203
min 1.000000 1.000000e+00 8.500000 5.000000e+00
0.000000
25% 4.000000 1.625000e+02 5350.000000 1.000000e+04
0.000000
50% 4.300000 4.600000e+03 14000.000000 5.000000e+05
0.000000
75% 4.500000 6.897400e+04 33000.000000 5.000000e+06
0.000000
max 5.000000 2.490100e+07 100000.000000 5.000000e+08
79.990000

for col in ['Price', 'Reviews', 'Installs', 'Size']:


sns.kdeplot(x = col, data= df, )
plt.show()
newdf = df.copy()

newdf.Installs = newdf.Installs.apply(np.log1p)

newdf.Reviews = newdf.Reviews.apply(np.log1p)
df.Installs.value_counts()

1000000 1485
10000000 1132
100000 1109
10000 982
1000 693
5000000 683
500000 515
50000 460
5000 422
100000000 366
100 302
50000000 272
500 199
10 67
500000000 60
50 56
5 8
Name: Installs, dtype: int64

newdf.skew()

Rating -1.825597
Reviews -0.039850
Size 1.424330
Installs -0.312952
Price 16.445954
dtype: float64

newdf.kurtosis()

Rating 5.586197
Reviews -0.922165
Size 1.419297
Installs -0.622031
Price 469.767060
dtype: float64

newdf

Category Rating Reviews Size Installs


Type \
0 ART_AND_DESIGN 4.1 5.075174 19000.0 9.210440
Free
1 ART_AND_DESIGN 3.9 6.875232 14000.0 13.122365
Free
2 ART_AND_DESIGN 4.7 11.379520 8700.0 15.424949
Free
3 ART_AND_DESIGN 4.5 12.281389 25000.0 17.727534
Free
4 ART_AND_DESIGN 4.3 6.875232 2800.0 11.512935
Free
... ... ... ... ... ... ..
.
10834 FAMILY 4.0 2.079442 2600.0 6.216606
Free
10836 FAMILY 4.5 3.663562 53000.0 8.517393
Free
10837 FAMILY 5.0 1.609438 3600.0 4.615121
Free
10839 BOOKS_AND_REFERENCE 4.5 4.744932 3600.0 6.908755
Free
10840 LIFESTYLE 4.5 12.894981 19000.0 16.118096
Free

Price Content_Rating Genres


0 0.0 Everyone Art_&_Design
1 0.0 Everyone Art_&_Design_Pretend_Play
2 0.0 Everyone Art_&_Design
3 0.0 Teen Art_&_Design
4 0.0 Everyone Art_&_Design_Creativity
... ... ... ...
10834 0.0 Everyone Education
10836 0.0 Everyone Education
10837 0.0 Everyone Education
10839 0.0 Mature_17+ Books_&_Reference
10840 0.0 Everyone Lifestyle

[8811 rows x 9 columns]

newdf2 = pd.get_dummies(newdf)

newdf2.shape

(8811, 161)

# Train Test Split

from sklearn.model_selection import train_test_split

newdf2.columns

Index(['Rating', 'Reviews', 'Size', 'Installs', 'Price',


'Category_ART_AND_DESIGN', 'Category_AUTO_AND_VEHICLES',
'Category_BEAUTY', 'Category_BOOKS_AND_REFERENCE',
'Category_BUSINESS',
...
'Genres_Tools', 'Genres_Tools_Education',
'Genres_Travel_&_Local',
'Genres_Travel_&_Local_Action_&_Adventure', 'Genres_Trivia',
'Genres_Video_Players_&_Editors',
'Genres_Video_Players_&_Editors_Creativity',
'Genres_Video_Players_&_Editors_Music_&_Video',
'Genres_Weather',
'Genres_Word'],
dtype='object', length=161)

X = newdf2.iloc[:,1:] # newdf.iloc[1:]
Y = newdf2['Rating']

X.shape

(8811, 160)

Y.shape

(8811,)

x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size =


0.3, random_state = 25)

x_train.shape

(6167, 160)

# Regression Algorithms (Linear)


# Simple Linear Regression
# Multiple Linear Regression
# Polynomial Regression
# Ridge Regression
# Lasso Regression
# ElasticNet Regression

# if the target variable is continuous numerical value, then we use


regression

# Simple Linear Regression - It is Regression method when we have only


one Predictor Variable for the target
# Multiple Linear Regression - It is Regression method when we have
Multiple Predictor Variables for the target
# Polynomial Regression - it is applied when data is not formed in
stright line. used to fit lineal model to non linear data

# Evaluation metrics

# Total Sum of Squares

# TSS = SSR(variation from mean) + SSE(variation from predicted )

# r2score - used to evaluate the performance of lr model


# proprortion of SSR with SSE - data points inside the regression
equation line

import statsmodels.api as sm
model = sm.OLS(y_train, x_train.astype(float))

model = model.fit()

model.summary()

<class 'statsmodels.iolib.summary.Summary'>
"""
OLS Regression Results

======================================================================
========
Dep. Variable: Rating R-squared:
0.175
Model: OLS Adj. R-squared:
0.159
Method: Least Squares F-statistic:
10.54
Date: Sat, 16 Dec 2023 Prob (F-statistic):
5.91e-172
Time: 07:25:18 Log-Likelihood:
-4093.6
No. Observations: 6167 AIC:
8433.
Df Residuals: 6044 BIC:
9261.
Df Model: 122

Covariance Type: nonrobust

======================================================================
==========================================
coef std err
t P>|t| [0.025 0.975]
----------------------------------------------------------------------
------------------------------------------
Reviews 0.1724 0.006
28.418 0.000 0.160 0.184
Size -3.571e-07 3.27e-07
-1.092 0.275 -9.98e-07 2.84e-07
Installs -0.1488 0.006
-24.360 0.000 -0.161 -0.137
Price -0.0046 0.004
-1.152 0.250 -0.012 0.003
Category_ART_AND_DESIGN 0.1894 0.164
1.157 0.247 -0.131 0.510
Category_AUTO_AND_VEHICLES 0.1690 0.037
4.602 0.000 0.097 0.241
Category_BEAUTY 0.2483 0.043
5.840 0.000 0.165 0.332
Category_BOOKS_AND_REFERENCE 0.2305 0.024
9.699 0.000 0.184 0.277
Category_BUSINESS 0.1468 0.020
7.278 0.000 0.107 0.186
Category_COMICS 0.4136 0.156
2.654 0.008 0.108 0.719
Category_COMMUNICATION 0.0941 0.020
4.771 0.000 0.055 0.133
Category_DATING 0.0330 0.030
1.093 0.274 -0.026 0.092
Category_EDUCATION 0.2499 0.064
3.928 0.000 0.125 0.375
Category_ENTERTAINMENT 0.2300 0.065
3.535 0.000 0.102 0.358
Category_EVENTS 0.2927 0.039
7.539 0.000 0.217 0.369
Category_FAMILY 0.3064 0.036
8.617 0.000 0.237 0.376
Category_FINANCE 0.1025 0.019
5.292 0.000 0.065 0.141
Category_FOOD_AND_DRINK 0.0828 0.030
2.738 0.006 0.024 0.142
Category_GAME 0.3896 0.055
7.059 0.000 0.281 0.498
Category_HEALTH_AND_FITNESS 0.1801 0.021
8.656 0.000 0.139 0.221
Category_HOUSE_AND_HOME 0.1583 0.035
4.480 0.000 0.089 0.227
Category_LIBRARIES_AND_DEMO 0.1928 0.038
5.132 0.000 0.119 0.266
Category_LIFESTYLE 0.0949 0.155
0.614 0.539 -0.208 0.398
Category_MAPS_AND_NAVIGATION 0.0646 0.026
2.490 0.013 0.014 0.115
Category_MEDICAL 0.1716 0.020
8.462 0.000 0.132 0.211
Category_NEWS_AND_MAGAZINES 0.0957 0.023
4.238 0.000 0.051 0.140
Category_PARENTING 0.2524 0.118
2.147 0.032 0.022 0.483
Category_PERSONALIZATION 0.2026 0.019
10.560 0.000 0.165 0.240
Category_PHOTOGRAPHY 0.1148 0.019
5.891 0.000 0.077 0.153
Category_PRODUCTIVITY 0.1440 0.019
7.630 0.000 0.107 0.181
Category_SHOPPING 0.1491 0.022
6.737 0.000 0.106 0.192
Category_SOCIAL 0.1324 0.021
6.174 0.000 0.090 0.174
Category_SPORTS 0.4164 0.161
2.591 0.010 0.101 0.731
Category_TOOLS 0.1847 0.154
1.198 0.231 -0.117 0.487
Category_TRAVEL_AND_LOCAL 0.1792 0.155
1.160 0.246 -0.124 0.482
Category_VIDEO_PLAYERS 0.0862 0.025
3.414 0.001 0.037 0.136
Category_WEATHER 0.1344 0.035
3.841 0.000 0.066 0.203
Type_Free 3.0865 0.055
56.271 0.000 2.979 3.194
Type_Paid 3.0461 0.056
53.928 0.000 2.935 3.157
Content_Rating_Adults_only_18+ 1.2008 0.294
4.083 0.000 0.624 1.777
Content_Rating_Everyone 1.2399 0.053
23.439 0.000 1.136 1.344
Content_Rating_Everyone_10+ 1.2376 0.060
20.772 0.000 1.121 1.354
Content_Rating_Mature_17+ 1.2226 0.060
20.397 0.000 1.105 1.340
Content_Rating_Teen 1.2317 0.054
22.674 0.000 1.125 1.338
Content_Rating_Unrated 4.115e-15 9.57e-15
0.430 0.667 -1.47e-14 2.29e-14
Genres_Action -0.1515 0.064
-2.370 0.018 -0.277 -0.026
Genres_Action_Action_&_Adventure 0.0301 0.152
0.198 0.843 -0.268 0.328
Genres_Adventure -0.1986 0.086
-2.299 0.022 -0.368 -0.029
Genres_Adventure_Action_&_Adventure -0.0408 0.148
-0.275 0.783 -0.331 0.249
Genres_Adventure_Brain_Games 0.2074 0.470
0.441 0.659 -0.715 1.129
Genres_Adventure_Education -0.2690 0.334
-0.807 0.420 -0.923 0.385
Genres_Arcade -0.1114 0.069
-1.624 0.104 -0.246 0.023
Genres_Arcade_Action_&_Adventure -0.1383 0.161
-0.861 0.389 -0.453 0.177
Genres_Arcade_Pretend_Play -4.66e-16 1.27e-15
-0.366 0.715 -2.96e-15 2.03e-15
Genres_Art_&_Design 0.3627 0.178
2.037 0.042 0.014 0.712
Genres_Art_&_Design_Creativity -0.3297 0.375
-0.879 0.379 -1.065 0.405
Genres_Art_&_Design_Pretend_Play 0.1564 0.375
0.417 0.677 -0.579 0.892
Genres_Auto_&_Vehicles 0.1690 0.037
4.602 0.000 0.097 0.241
Genres_Beauty 0.2483 0.043
5.840 0.000 0.165 0.332
Genres_Board -0.0749 0.105
-0.711 0.477 -0.281 0.132
Genres_Board_Action_&_Adventure -0.2419 0.333
-0.726 0.468 -0.895 0.411
Genres_Board_Brain_Games 0.0839 0.146
0.576 0.565 -0.202 0.369
Genres_Board_Pretend_Play 0.6364 0.470
1.353 0.176 -0.286 1.559
Genres_Books_&_Reference 0.2305 0.024
9.699 0.000 0.184 0.277
Genres_Books_&_Reference_Education -0.0438 0.333
-0.132 0.895 -0.697 0.609
Genres_Business 0.1468 0.020
7.278 0.000 0.107 0.186
Genres_Card -0.2819 0.098
-2.885 0.004 -0.473 -0.090
Genres_Card_Action_&_Adventure 1.706e-16 2.46e-16
0.694 0.488 -3.11e-16 6.52e-16
Genres_Card_Brain_Games 0.3636 0.470
0.774 0.439 -0.557 1.285
Genres_Casino -0.0752 0.107
-0.704 0.481 -0.284 0.134
Genres_Casual -0.1864 0.053
-3.524 0.000 -0.290 -0.083
Genres_Casual_Action_&_Adventure -0.1409 0.139
-1.010 0.312 -0.414 0.133
Genres_Casual_Brain_Games 0.3627 0.169
2.148 0.032 0.032 0.694
Genres_Casual_Creativity 0.0838 0.194
0.432 0.666 -0.296 0.464
Genres_Casual_Education 0.0501 0.333
0.150 0.880 -0.603 0.703
Genres_Casual_Music_&_Video 0.0543 0.470
0.115 0.908 -0.867 0.975
Genres_Casual_Pretend_Play -0.0974 0.100
-0.977 0.328 -0.293 0.098
Genres_Comics -0.1835 0.169
-1.087 0.277 -0.514 0.147
Genres_Comics_Creativity 0.5971 0.315
1.894 0.058 -0.021 1.215
Genres_Communication 0.0941 0.020
4.771 0.000 0.055 0.133
Genres_Communication_Creativity 1.622e-16 1.75e-16
0.929 0.353 -1.8e-16 5.04e-16
Genres_Dating 0.0330 0.030
1.093 0.274 -0.026 0.092
Genres_Education 0.1673 0.045
3.690 0.000 0.078 0.256
Genres_Education_Action_&_Adventure 0.2533 0.237
1.070 0.285 -0.211 0.718
Genres_Education_Brain_Games 0.0522 0.240
0.218 0.828 -0.419 0.523
Genres_Education_Creativity 0.4886 0.237
2.058 0.040 0.023 0.954
Genres_Education_Education 0.0966 0.096
1.011 0.312 -0.091 0.284
Genres_Education_Music_&_Video -0.3034 0.470
-0.646 0.518 -1.224 0.617
Genres_Education_Pretend_Play 0.2577 0.132
1.957 0.050 -0.000 0.516
Genres_Educational -0.1788 0.108
-1.656 0.098 -0.390 0.033
Genres_Educational_Action_&_Adventure 0.0094 0.333
0.028 0.978 -0.644 0.662
Genres_Educational_Brain_Games 0.0853 0.212
0.401 0.688 -0.331 0.502
Genres_Educational_Creativity -0.1173 0.333
-0.352 0.725 -0.770 0.536
Genres_Educational_Education 0.0945 0.110
0.857 0.391 -0.122 0.311
Genres_Educational_Pretend_Play 0.0416 0.135
0.309 0.757 -0.222 0.305
Genres_Entertainment -0.0735 0.044
-1.660 0.097 -0.160 0.013
Genres_Entertainment_Action_&_Adventure 0.1184 0.333
0.356 0.722 -0.535 0.771
Genres_Entertainment_Brain_Games 0.0791 0.181
0.437 0.662 -0.276 0.434
Genres_Entertainment_Creativity 0.3138 0.273
1.148 0.251 -0.222 0.849
Genres_Entertainment_Education -2.344e-16 1.78e-16
-1.316 0.188 -5.84e-16 1.15e-16
Genres_Entertainment_Music_&_Video -0.0525 0.123
-0.427 0.669 -0.293 0.188
Genres_Entertainment_Pretend_Play -0.3276 0.333
-0.983 0.325 -0.981 0.325
Genres_Events 0.2927 0.039
7.539 0.000 0.217 0.369
Genres_Finance 0.1025 0.019
5.292 0.000 0.065 0.141
Genres_Food_&_Drink 0.0828 0.030
2.738 0.006 0.024 0.142
Genres_Health_&_Fitness 0.1801 0.021
8.656 0.000 0.139 0.221
Genres_Health_&_Fitness_Action_&_Adventure -0.4306 0.470
-0.916 0.360 -1.352 0.491
Genres_Health_&_Fitness_Education 0.2147 0.470
0.457 0.648 -0.706 1.136
Genres_House_&_Home 0.1583 0.035
4.480 0.000 0.089 0.227
Genres_Libraries_&_Demo 0.1928 0.038
5.132 0.000 0.119 0.266
Genres_Lifestyle 0.1262 0.163
0.775 0.438 -0.193 0.445
Genres_Lifestyle_Education -7.666e-17 6.36e-17
-1.205 0.228 -2.01e-16 4.8e-17
Genres_Lifestyle_Pretend_Play -0.0312 0.315
-0.099 0.921 -0.648 0.586
Genres_Maps_&_Navigation 0.0646 0.026
2.490 0.013 0.014 0.115
Genres_Medical 0.1716 0.020
8.462 0.000 0.132 0.211
Genres_Music -0.2048 0.141
-1.449 0.147 -0.482 0.072
Genres_Music_&_Audio_Music_&_Video 0.3781 0.470
0.805 0.421 -0.543 1.299
Genres_Music_Music_&_Video 0.2404 0.333
0.722 0.470 -0.412 0.893
Genres_News_&_Magazines 0.0957 0.023
4.238 0.000 0.051 0.140
Genres_Parenting 0.2384 0.138
1.726 0.084 -0.032 0.509
Genres_Parenting_Brain_Games -0.1344 0.386
-0.348 0.728 -0.892 0.623
Genres_Parenting_Education -0.0719 0.244
-0.294 0.769 -0.551 0.407
Genres_Parenting_Music_&_Video 0.2203 0.220
1.001 0.317 -0.211 0.652
Genres_Personalization 0.2026 0.019
10.560 0.000 0.165 0.240
Genres_Photography 0.1148 0.019
5.891 0.000 0.077 0.153
Genres_Productivity 0.1440 0.019
7.630 0.000 0.107 0.181
Genres_Puzzle 0.0741 0.062
1.186 0.236 -0.048 0.197
Genres_Puzzle_Action_&_Adventure 0.0544 0.273
0.200 0.842 -0.480 0.589
Genres_Puzzle_Brain_Games 0.0895 0.139
0.642 0.521 -0.184 0.363
Genres_Puzzle_Creativity 0.0742 0.333
0.223 0.824 -0.579 0.727
Genres_Puzzle_Education 0.5283 0.470
1.125 0.261 -0.393 1.449
Genres_Racing -0.1413 0.081
-1.741 0.082 -0.300 0.018
Genres_Racing_Action_&_Adventure 0.1638 0.126
1.301 0.193 -0.083 0.411
Genres_Racing_Pretend_Play 0.6221 0.470
1.323 0.186 -0.299 1.544
Genres_Role_Playing -0.0791 0.068
-1.161 0.246 -0.213 0.054
Genres_Role_Playing_Action_&_Adventure 0.2145 0.471
0.455 0.649 -0.709 1.138
Genres_Role_Playing_Brain_Games 0.0286 0.470
0.061 0.951 -0.892 0.950
Genres_Role_Playing_Pretend_Play -0.1079 0.273
-0.395 0.693 -0.643 0.427
Genres_Shopping 0.1491 0.022
6.737 0.000 0.106 0.192
Genres_Simulation -0.1484 0.054
-2.764 0.006 -0.254 -0.043
Genres_Simulation_Action_&_Adventure 0.1695 0.169
1.004 0.315 -0.161 0.501
Genres_Simulation_Education -0.0395 0.273
-0.145 0.885 -0.575 0.495
Genres_Simulation_Pretend_Play -0.1765 0.333
-0.530 0.596 -0.830 0.477
Genres_Social 0.1324 0.021
6.174 0.000 0.090 0.174
Genres_Sports -0.1774 0.164
-1.078 0.281 -0.500 0.145
Genres_Sports_Action_&_Adventure -0.0561 0.273
-0.205 0.837 -0.591 0.479
Genres_Strategy -0.1698 0.065
-2.596 0.009 -0.298 -0.042
Genres_Strategy_Action_&_Adventure 0.1935 0.333
0.580 0.562 -0.460 0.847
Genres_Strategy_Creativity -0.1839 0.470
-0.391 0.696 -1.105 0.737
Genres_Strategy_Education 0.5190 0.470
1.105 0.269 -0.402 1.440
Genres_Tools -0.0069 0.162
-0.043 0.966 -0.324 0.310
Genres_Tools_Education 0.1916 0.314
0.610 0.542 -0.424 0.808
Genres_Travel_&_Local 0.0400 0.163
0.245 0.806 -0.280 0.360
Genres_Travel_&_Local_Action_&_Adventure 0.1392 0.314
0.443 0.658 -0.477 0.756
Genres_Trivia -0.3725 0.113
-3.290 0.001 -0.594 -0.151
Genres_Video_Players_&_Editors 0.0862 0.025
3.414 0.001 0.037 0.136
Genres_Video_Players_&_Editors_Creativity -0.2947 0.470
-0.627 0.530 -1.216 0.626
Genres_Video_Players_&_Editors_Music_&_Video -0.1985 0.333
-0.595 0.552 -0.852 0.455
Genres_Weather 0.1344 0.035
3.841 0.000 0.066 0.203
Genres_Word -0.0164 0.121
-0.135 0.892 -0.254 0.221
======================================================================
========
Omnibus: 2383.378 Durbin-Watson:
2.014
Prob(Omnibus): 0.000 Jarque-Bera (JB):
15000.327
Skew: -1.721 Prob(JB):
0.00
Kurtosis: 9.821 Cond. No.
1.87e+21
======================================================================
========

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is
correctly specified.
[2] The smallest eigenvalue is 1.81e-30. This might indicate that
there are
strong multicollinearity problems or that the design matrix is
singular.
"""

F- Statistic: It given through F- test, whether the lr model provides a better fit to the data in
comparison to a model that contains no independent variable

from sklearn.metrics import mean_squared_error, r2_score

y_test_predicted = model.predict(x_test)

RMSE = np.sqrt(mean_squared_error(y_test, y_test_predicted))


print("RMSE is {}".format(RMSE))

RMSE is 0.5019999148264612

r2_score(y_test, y_test_predicted)

0.12557936638825973
data = {'coef':model.params,'std err': model.bse,'t': model.tvalues,
'P>|t|': model.pvalues,
'[0.025': model.conf_int()[0],'0.975]': model.conf_int()[0]}

olssum = pd.DataFrame(data).round(3)

olssum

coef std err t


P>|t| \
Reviews 0.172 0.006 28.418
0.000
Size -0.000 0.000 -1.092
0.275
Installs -0.149 0.006 -24.360
0.000
Price -0.005 0.004 -1.152
0.250
Category_ART_AND_DESIGN 0.189 0.164 1.157
0.247
... ... ... ...
...
Genres_Video_Players_&_Editors 0.086 0.025 3.414
0.001
Genres_Video_Players_&_Editors_Creativity -0.295 0.470 -0.627
0.530
Genres_Video_Players_&_Editors_Music_&_Video -0.198 0.333 -0.595
0.552
Genres_Weather 0.134 0.035 3.841
0.000
Genres_Word -0.016 0.121 -0.135
0.892

[0.025 0.975]
Reviews 0.160 0.160
Size -0.000 -0.000
Installs -0.161 -0.161
Price -0.012 -0.012
Category_ART_AND_DESIGN -0.131 -0.131
... ... ...
Genres_Video_Players_&_Editors 0.037 0.037
Genres_Video_Players_&_Editors_Creativity -1.216 -1.216
Genres_Video_Players_&_Editors_Music_&_Video -0.852 -0.852
Genres_Weather 0.066 0.066
Genres_Word -0.254 -0.254

[160 rows x 6 columns]

olssum[olssum['P>|t|']<0.05].index
Index(['Reviews', 'Installs', 'Category_AUTO_AND_VEHICLES',
'Category_BEAUTY',
'Category_BOOKS_AND_REFERENCE', 'Category_BUSINESS',
'Category_COMICS',
'Category_COMMUNICATION', 'Category_EDUCATION',
'Category_ENTERTAINMENT', 'Category_EVENTS', 'Category_FAMILY',
'Category_FINANCE', 'Category_FOOD_AND_DRINK', 'Category_GAME',
'Category_HEALTH_AND_FITNESS', 'Category_HOUSE_AND_HOME',
'Category_LIBRARIES_AND_DEMO', 'Category_MAPS_AND_NAVIGATION',
'Category_MEDICAL', 'Category_NEWS_AND_MAGAZINES',
'Category_PARENTING',
'Category_PERSONALIZATION', 'Category_PHOTOGRAPHY',
'Category_PRODUCTIVITY', 'Category_SHOPPING',
'Category_SOCIAL',
'Category_SPORTS', 'Category_VIDEO_PLAYERS',
'Category_WEATHER',
'Type_Free', 'Type_Paid', 'Content_Rating_Adults_only_18+',
'Content_Rating_Everyone', 'Content_Rating_Everyone_10+',
'Content_Rating_Mature_17+', 'Content_Rating_Teen',
'Genres_Action',
'Genres_Adventure', 'Genres_Art_&_Design',
'Genres_Auto_&_Vehicles',
'Genres_Beauty', 'Genres_Books_&_Reference', 'Genres_Business',
'Genres_Card', 'Genres_Casual', 'Genres_Casual_Brain_Games',
'Genres_Communication', 'Genres_Education',
'Genres_Education_Creativity', 'Genres_Events',
'Genres_Finance',
'Genres_Food_&_Drink', 'Genres_Health_&_Fitness',
'Genres_House_&_Home',
'Genres_Libraries_&_Demo', 'Genres_Maps_&_Navigation',
'Genres_Medical',
'Genres_News_&_Magazines', 'Genres_Personalization',
'Genres_Photography', 'Genres_Productivity', 'Genres_Shopping',
'Genres_Simulation', 'Genres_Social', 'Genres_Strategy',
'Genres_Trivia', 'Genres_Video_Players_&_Editors',
'Genres_Weather'],
dtype='object')

df3 = newdf2[['Rating','Reviews', 'Installs',


'Category_AUTO_AND_VEHICLES', 'Category_BEAUTY',
'Category_BOOKS_AND_REFERENCE', 'Category_BUSINESS',
'Category_COMICS',
'Category_COMMUNICATION', 'Category_EDUCATION',
'Category_ENTERTAINMENT', 'Category_EVENTS', 'Category_FAMILY',
'Category_FINANCE', 'Category_FOOD_AND_DRINK', 'Category_GAME',
'Category_HEALTH_AND_FITNESS', 'Category_HOUSE_AND_HOME',
'Category_LIBRARIES_AND_DEMO', 'Category_MAPS_AND_NAVIGATION',
'Category_MEDICAL', 'Category_NEWS_AND_MAGAZINES',
'Category_PARENTING',
'Category_PERSONALIZATION', 'Category_PHOTOGRAPHY',
'Category_PRODUCTIVITY', 'Category_SHOPPING',
'Category_SOCIAL',
'Category_SPORTS', 'Category_VIDEO_PLAYERS',
'Category_WEATHER',
'Type_Free', 'Type_Paid', 'Content_Rating_Adults_only_18+',
'Content_Rating_Everyone', 'Content_Rating_Everyone_10+',
'Content_Rating_Mature_17+', 'Content_Rating_Teen',
'Genres_Action',
'Genres_Adventure', 'Genres_Art_&_Design',
'Genres_Auto_&_Vehicles',
'Genres_Beauty', 'Genres_Books_&_Reference', 'Genres_Business',
'Genres_Card', 'Genres_Casual', 'Genres_Casual_Brain_Games',
'Genres_Communication', 'Genres_Education',
'Genres_Education_Creativity', 'Genres_Events',
'Genres_Finance',
'Genres_Food_&_Drink', 'Genres_Health_&_Fitness',
'Genres_House_&_Home',
'Genres_Libraries_&_Demo', 'Genres_Maps_&_Navigation',
'Genres_Medical',
'Genres_News_&_Magazines', 'Genres_Personalization',
'Genres_Photography', 'Genres_Productivity', 'Genres_Shopping',
'Genres_Simulation', 'Genres_Social', 'Genres_Strategy',
'Genres_Trivia', 'Genres_Video_Players_&_Editors',
'Genres_Weather']]

df3.shape

(8811, 70)

You might also like