Data Cleaning and Exploratory Data Analysis with Pandas on Trending YouTube Video Statistics


This article is written as a log of our work for our Data Science project. Our project was aimed at analyzing the daily YouTube trending video data sets that are
available here:

YouTube maintains a list of the top trending videos on the platform. This data set includes several months of data on daily trending YouTube videos, with data for
the USA, Great Britain, Germany, Canada, France, Russia, Mexico, South Korea, India and Japan. The trending list is determined from user interactions such as
views, comments and likes, which identify the videos users prefer and surface them on the trending page. These videos are ranked on a ratio of views, likes,
comments and shares so that the best videos appear at the top of the page. We used the available data sets and began with data cleaning, followed by
exploratory data analysis (EDA). This article is divided into two sections as follows:

Data Cleaning

Exploratory Data Analysis

Data Cleaning
Importing Libraries
import numpy as np
import re
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from matplotlib import cm
from datetime import datetime
import glob
import os
import json
import pickle
import six
sns.set()
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
pd.options.mode.chained_assignment = None

Importing all the CSV files


AllCSV = [i for i in glob.glob('*.csv')]  # collecting the names of all CSV files in the working directory
AllCSV

Reading all CSV files


all_dataframes = []  # list to store each data frame separately
for csv in AllCSV:
    df = pd.read_csv(csv)
    df['country'] = csv[0:2]  # adding a 'country' column so that each dataset can be identified uniquely
    all_dataframes.append(df)

all_dataframes[0].head()  # indices 0 to 9 correspond to the [CA, DE, FR, GB, IN, JP, KR, MX, RU, US] datasets
Fixing Data Types
The first part of the data cleaning process was to fix the data types of all the columns, to make them easier to manipulate and more manageable. It should be
noted that although several columns were converted to strings, they show up as object when the data types are displayed, since strings are stored as objects in
pandas.
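As a quick standalone illustration (not part of the original notebook), a column cast to str still reports the generic object dtype:

example = pd.Series(['funny', 'music', 'news']).astype('str')
example.dtype  # object -- pandas stores plain Python strings as generic objects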
for df in all_dataframes:
    # video_id
    df['video_id'] = df['video_id'].astype('str')

    # trending_date (stored as a two-digit year, day and month separated by '.')
    df['trending_date'] = df['trending_date'].astype('str')
    date_pieces = df['trending_date'].str.split('.')
    df['Year'] = date_pieces.str[0].astype(int)
    df['Day'] = date_pieces.str[1].astype(int)
    df['Month'] = date_pieces.str[2].astype(int)

    # converting the two-digit years to four-digit years
    updatedyear = []
    for i in range(len(df)):
        y = df.loc[i, "Year"]
        newy = y + 2000
        updatedyear.append(newy)
    for i in range(len(df)):
        newy = updatedyear[i]
        tr = df.loc[i, "Year"]
        df['Year'].replace(to_replace=tr, value=newy, inplace=True)

    # rebuilding 'trending_date' as a proper datetime column
    del df['trending_date']
    df['trending_date'] = pd.to_datetime(df[['Year', 'Month', 'Day']])
    del df['Year']
    del df['Day']
    del df['Month']

    # title
    df['title'] = df['title'].astype('str')

    # channel_title
    df['channel_title'] = df['channel_title'].astype('str')

    # category_id
    df['category_id'] = df['category_id'].astype(str)

    # tags
    df['tags'] = df['tags'].astype('str')

    # views, likes, dislikes, comment_count are already in the correct data type, i.e. int64

    # thumbnail_link
    df['thumbnail_link'] = df['thumbnail_link'].astype('str')

    # description
    df['description'] = df['description'].astype('str')

    # changing comments_disabled, ratings_disabled, video_error_or_removed from bool to categorical
    df['comments_disabled'] = df['comments_disabled'].astype('category')
    df['ratings_disabled'] = df['ratings_disabled'].astype('category')
    df['video_error_or_removed'] = df['video_error_or_removed'].astype('category')

    # publish_time
    df['publish_time'] = pd.to_datetime(df['publish_time'], errors='coerce', format='%Y-%m-%dT%H:%M:%S.%fZ')

Separating ‘publish_time’ into ‘publish_date’ and ‘publish_time’


for df in all_dataframes:
    df.insert(4, 'publish_date', df['publish_time'].dt.date)  # loc, column name, values for the column to be inserted
    df['publish_time'] = df['publish_time'].dt.time

# Changing data type for 'publish_date' from object to 'datetime64[ns]'
for df in all_dataframes:
    df['publish_date'] = pd.to_datetime(df['publish_date'], format="%Y-%m-%d")

Now a quick look at the updated data types!


# We can use any index from 0 to 9 inclusive (one for each of the 10 dataframes)
all_dataframes[1].dtypes

For the index, we chose video_id.
for df in all_dataframes:
    df.set_index('video_id', inplace=True)

Examining Missing Values


Missing values need to be handled because they can reduce the statistical power of the data set and introduce bias, leading to invalid conclusions. Missing data
can be handled either by removing the affected rows, which is often done for non-numeric data, or by imputing the missing values, for example with the mean or
median of the column.
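Had any values been missing, a minimal sketch of both strategies might look like the following (the columns chosen here are purely illustrative):

# Hypothetical handling of missing values -- our data sets did not require this step
for df in all_dataframes:
    # option 1: drop rows whose non-numeric data is missing, e.g. 'description'
    df.dropna(subset=['description'], inplace=True)
    # option 2: impute a numeric column, e.g. 'views', with its mean (or median)
    df['views'] = df['views'].fillna(df['views'].mean())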

We checked this using a heat-map, in which any missing value in a column would appear as an orange square against the black background of the heat-map. As
the heat-maps generated below show, no data set had any missing values, so no handling was necessary.
for df in all_dataframes:
    sns.heatmap(df.isnull(), cbar=False)
    plt.figure()

Combining Every Dataframe Into One Huge Dataframe


We combined all the cleaned data sets into one large data set for the EDA, so that every subsequent operation can be performed on a single unified, clean data
set.
combined_df = pd.concat(all_dataframes)

Next, we refined the data further by sorting the entries of the data set by trending_date, moving the latest trending videos to the top. This was done so that we
can view the current trends of the trending videos in each country, as these are the most relevant to our project.

Before we did so, however, we created a duplicate copy of our data frame. We did this as a safety precaution, keeping a copy of the original data frame at hand,
since we also decided to remove duplicate video entries from the other data frame while sorting.
# Making a copy of the original dataframe
backup_df = combined_df.reset_index().sort_values('trending_date', ascending=False).set_index('video_id')

# Sorting according to the latest trending date and keeping only the most recent entry for each video
combined_df = combined_df.reset_index().sort_values('trending_date', ascending=False).drop_duplicates('video_id', keep='first').set_index('video_id')

# Sorting each per-country dataframe the same way (reassigning into the list so the changes persist)
for i, df in enumerate(all_dataframes):
    all_dataframes[i] = df.reset_index().sort_values('trending_date', ascending=False).set_index('video_id')

# Printing results -- the latest publication and trending information is now at the top
combined_df[['publish_date', 'publish_time', 'trending_date', 'country']].head()

Inserting Category Column


One of our final data cleaning steps was to check the JSON files that came with the data sets, to see whether they contained any useful data. As there were
multiple files, we read two of them at random to check whether they all contained the same data or differed from one another.
# read file
with open('US_category_id.json', 'r') as f:  # reading one randomly selected JSON file to make sense of its contents
    data = f.read()

# parse file
obj = json.loads(data)

# printing
obj

The other randomly selected JSON file contained similar data. Each JSON file holds ids ranging from 1 to 44 (both inclusive), and each id comes with its
category and other information such as title, kind, etc. Hence, we can use any one of the JSON files to map the category_id values in our data frames to their
categories.
category_id = {}

with open('DE_category_id.json', 'r') as f:
    d = json.load(f)
    for category in d['items']:
        category_id[category['id']] = category['snippet']['title']

combined_df.insert(2, 'category', combined_df['category_id'].map(category_id))
backup_df.insert(2, 'category', backup_df['category_id'].map(category_id))

for df in all_dataframes:
    df.insert(2, 'category', df['category_id'].map(category_id))

# Printing the cleaned combined dataframe
combined_df.head(3)
combined_df['category'].unique()

Thus, we cleaned up and refined our data sets into a finalized data frame, ready to be used for the upcoming EDA section of the project. We pickled both the
finalized data frame and a copy of the original cleaned data frame into files, ready for use.
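The pickling step itself is not shown above; a minimal sketch, assuming pandas' built-in to_pickle and illustrative file names (the pickle module imported earlier could equally be used), would be:

# Illustrative file names -- the original notebook may have used different ones
combined_df.to_pickle('combined_df.pickle')  # finalized, de-duplicated data frame
backup_df.to_pickle('backup_df.pickle')      # copy of the original cleaned data frame

# Later, in the EDA notebook, the frames can be loaded back with pd.read_pickle('combined_df.pickle')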
