Data Cleaning and Exploratory Data Analysis with Pandas on Trending YouTube Video Statistics


This article is written as a log of our work for our Data Science project. Our project was aimed at analyzing the daily YouTube trending video data sets that are
available here:

YouTube maintains a list of the top trending videos on the platform. This data set includes several months of data on daily trending YouTube videos, with data for
the USA, Great Britain, Germany, Canada, France, Russia, Mexico, South Korea, India and Japan. The trending list is determined from user interactions such as
views, comments and likes, which identify the videos users prefer and surface them on the trending page. These videos are ranked on a ratio of views, likes,
comments and shares so that the best videos appear at the top of the page. We used the available data sets and began with data cleaning, followed by
exploratory data analysis (EDA). This article is divided into two sections as follows:

Data Cleaning

Exploratory Data Analysis

Data Cleaning
Importing Libraries
import numpy as np
import re
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from matplotlib import cm
from datetime import datetime
import glob
import os
import json
import pickle
import six
sns.set()
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
pd.options.mode.chained_assignment = None

Importing all the CSV files


AllCSV = [i for i in glob.glob('*.csv')]  # collecting the names of all CSV files in the working directory
AllCSV

Reading all CSV files


all_dataframes = []  # list to store each data frame separately
for csv in AllCSV:
    df = pd.read_csv(csv)
    df['country'] = csv[0:2]  # adding a 'country' column so that each dataset can be identified uniquely
    all_dataframes.append(df)

all_dataframes[0].head()  # indices 0 to 9 correspond to the [CA, DE, FR, GB, IN, JP, KR, MX, RU, US] datasets
Fixing Data Types
The first part of the data cleaning process was to fix the data types of all the columns, to make them easier to manipulate and more manageable. It should be
noted that although several columns were converted to strings, they show up as object when the data types are displayed, since strings are stored as objects in
pandas.
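As a quick standalone illustration (not part of the original notebook), a column cast to str still reports the generic object dtype:

example = pd.Series(['funny', 'music', 'news']).astype('str')
example.dtype  # object -- pandas stores plain Python strings as generic objects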
for df in all_dataframes:
    # video_id
    df['video_id'] = df['video_id'].astype('str')

    # trending_date (stored as a two-digit year, day and month separated by '.')
    df['trending_date'] = df['trending_date'].astype('str')
    date_pieces = df['trending_date'].str.split('.')
    df['Year'] = date_pieces.str[0].astype(int)
    df['Day'] = date_pieces.str[1].astype(int)
    df['Month'] = date_pieces.str[2].astype(int)

    # converting the two-digit years to four-digit years
    updatedyear = []
    for i in range(len(df)):
        y = df.loc[i, "Year"]
        newy = y + 2000
        updatedyear.append(newy)
    for i in range(len(df)):
        newy = updatedyear[i]
        tr = df.loc[i, "Year"]
        df['Year'].replace(to_replace=tr, value=newy, inplace=True)

    # rebuilding 'trending_date' as a proper datetime column
    del df['trending_date']
    df['trending_date'] = pd.to_datetime(df[['Year', 'Month', 'Day']])
    del df['Year']
    del df['Day']
    del df['Month']

    # title
    df['title'] = df['title'].astype('str')

    # channel_title
    df['channel_title'] = df['channel_title'].astype('str')

    # category_id
    df['category_id'] = df['category_id'].astype(str)

    # tags
    df['tags'] = df['tags'].astype('str')

    # views, likes, dislikes, comment_count are already in the correct data type, i.e. int64

    # thumbnail_link
    df['thumbnail_link'] = df['thumbnail_link'].astype('str')

    # description
    df['description'] = df['description'].astype('str')

    # changing comments_disabled, ratings_disabled, video_error_or_removed from bool to categorical
    df['comments_disabled'] = df['comments_disabled'].astype('category')
    df['ratings_disabled'] = df['ratings_disabled'].astype('category')
    df['video_error_or_removed'] = df['video_error_or_removed'].astype('category')

    # publish_time
    df['publish_time'] = pd.to_datetime(df['publish_time'], errors='coerce', format='%Y-%m-%dT%H:%M:%S.%fZ')

Separating ‘publish_time’ into ‘publish_date’ and ‘publish_time’


for df in all_dataframes:
    df.insert(4, 'publish_date', df['publish_time'].dt.date)  # loc, column name, values for the column to be inserted
    df['publish_time'] = df['publish_time'].dt.time

# Changing data type for 'publish_date' from object to 'datetime64[ns]'
for df in all_dataframes:
    df['publish_date'] = pd.to_datetime(df['publish_date'], format="%Y-%m-%d")

Now a quick look at the updated data types!


# We can use any index from 0 to 9 inclusive (one for each of the 10 dataframes)
all_dataframes[1].dtypes

For the index, we chose video_id.
for df in all_dataframes:
    df.set_index('video_id', inplace=True)

Examining Missing Values


Missing values need to be handled because they can reduce the statistical power of the data set and introduce bias, leading to invalid conclusions. Missing data
can be handled either by removing the affected rows, which is often done for non-numeric data, or by imputing the missing values, for example with the mean or
median of the column.
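Had any values been missing, a minimal sketch of both strategies might look like the following (the columns chosen here are purely illustrative):

# Hypothetical handling of missing values -- our data sets did not require this step
for df in all_dataframes:
    # option 1: drop rows whose non-numeric data is missing, e.g. 'description'
    df.dropna(subset=['description'], inplace=True)
    # option 2: impute a numeric column, e.g. 'views', with its mean (or median)
    df['views'] = df['views'].fillna(df['views'].mean())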

We checked this using a heat-map, in which any missing value in a column would appear as an orange square against the black background of the heat-map. As
the heat-maps generated below show, no data set had any missing values, so no handling was necessary.
for df in all_dataframes:
    sns.heatmap(df.isnull(), cbar=False)
    plt.figure()

Combining Every Dataframe Into One Huge Dataframe


We combined all the cleaned data sets into one large data set for the EDA, so that every subsequent operation can be performed on a single unified, clean data
set.
combined_df = pd.concat(all_dataframes)

Next, we refined the data further by sorting the entries of the data set by trending_date, moving the latest trending videos to the top. This was done so that we
can view the current trends of the trending videos in each country, as these are the most relevant to our project.

Before we did so, however, we created a duplicate copy of our data frame. We did this as a safety precaution, keeping a copy of the original data frame at hand,
since we also decided to remove duplicate video entries from the other data frame while sorting.
# Making a copy of the original dataframe
backup_df = combined_df.reset_index().sort_values('trending_date', ascending=False).set_index('video_id')

# Sorting according to the latest trending date and keeping only the most recent entry for each video
combined_df = combined_df.reset_index().sort_values('trending_date', ascending=False).drop_duplicates('video_id', keep='first').set_index('video_id')

# Sorting each per-country dataframe the same way (reassigning into the list so the changes persist)
for i, df in enumerate(all_dataframes):
    all_dataframes[i] = df.reset_index().sort_values('trending_date', ascending=False).set_index('video_id')

# Printing results -- the latest publication and trending information is now at the top
combined_df[['publish_date', 'publish_time', 'trending_date', 'country']].head()

Inserting Category Column


One of our final data cleaning steps was to check the JSON files that came with the data sets, to see whether they contained any useful data. As there were
multiple files, we read two of them at random to check whether they all contained the same data or differed from one another.
# read file
with open('US_category_id.json', 'r') as f:  # reading one randomly selected JSON file to make sense of its contents
    data = f.read()

# parse file
obj = json.loads(data)

# printing
obj

The other randomly selected JSON file contained similar data. Each JSON file holds ids ranging from 1 to 44 (both inclusive), and each id comes with its
category and other information such as title, kind, etc. Hence, we can use any one of the JSON files to map the category_id values in our data frames to their
categories.
category_id = {}

with open('DE_category_id.json', 'r') as f:
    d = json.load(f)
    for category in d['items']:
        category_id[category['id']] = category['snippet']['title']

combined_df.insert(2, 'category', combined_df['category_id'].map(category_id))
backup_df.insert(2, 'category', backup_df['category_id'].map(category_id))

for df in all_dataframes:
    df.insert(2, 'category', df['category_id'].map(category_id))

# Printing the cleaned combined dataframe
combined_df.head(3)
combined_df['category'].unique()

Thus, we cleaned up and refined our data sets into a finalized data frame, ready to be used for the upcoming EDA section of the project. We pickled both the
finalized data frame and a copy of the original cleaned data frame into files, ready for use.
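The pickling step itself is not shown above; a minimal sketch, assuming pandas' built-in to_pickle and illustrative file names (the pickle module imported earlier could equally be used), would be:

# Illustrative file names -- the original notebook may have used different ones
combined_df.to_pickle('combined_df.pickle')  # finalized, de-duplicated data frame
backup_df.to_pickle('backup_df.pickle')      # copy of the original cleaned data frame

# Later, in the EDA notebook, the frames can be loaded back with pd.read_pickle('combined_df.pickle')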
