Data Cleaning and Exploratory Data Analysis With Pandas On Trending Youtube Video Statistics
YouTube maintains a list of the top trending videos on the platform. This data set includes several months of data on daily trending YouTube videos for the USA, Great Britain, Germany, Canada, France, Russia, Mexico, South Korea, India and Japan. The trending list is determined from user interactions such as views, comments and likes, and the videos are ranked by a ratio of views, likes, comments and shares so that the most popular videos appear at the top of the trending page. We took the available data sets through data cleaning followed by exploratory data analysis (EDA). This article is accordingly divided into two sections:
Data Cleaning
Exploratory Data Analysis (EDA)
Importing Libraries
import numpy as np
import re
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from matplotlib import cm
from datetime import datetime
import glob
import os
import json
import pickle
import six
sns.set()
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
pd.options.mode.chained_assignment = None
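Before the cleaning steps below, the per-country CSV files need to be read into dataframes. A minimal sketch of that loading step follows; the file-name pattern (`*videos.csv`, with the first two letters encoding the country, as in `USvideos.csv`) and the `latin-1` encoding are assumptions based on the usual layout of this Kaggle dataset, not confirmed by the article:

```python
import glob
import os
import pandas as pd

def load_country_csvs(pattern='*videos.csv'):
    """Read every per-country CSV (e.g. USvideos.csv) into a list of dataframes."""
    frames = []
    for path in sorted(glob.glob(pattern)):
        country = os.path.basename(path)[:2]   # first two letters encode the country
        df = pd.read_csv(path, index_col='video_id', encoding='latin-1')
        df['country'] = country
        frames.append(df)
    return frames
```

The resulting list plays the role of `all_dataframes` used throughout the rest of this section.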
# trending_date
# The raw trending_date strings are formatted as "YY.DD.MM"
df['trending_date'] = df['trending_date'].astype('str')
date_pieces = df['trending_date'].str.split('.')
df['Year'] = date_pieces.str[0].astype(int) + 2000   # two-digit year -> four-digit year
df['Day'] = date_pieces.str[1].astype(int)
df['Month'] = date_pieces.str[2].astype(int)
del df['trending_date']
df['trending_date'] = pd.to_datetime(df[['Year', 'Month', 'Day']])
del df['Year']
del df['Day']
del df['Month']
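The same conversion can also be expressed in a single `pd.to_datetime` call, assuming the `trending_date` strings follow the dataset's `YY.DD.MM` layout (the sample strings below are illustrative, not from the actual data):

```python
import pandas as pd

# Sample trending_date strings in the "YY.DD.MM" layout
s = pd.Series(['17.14.11', '18.01.03'])
trending = pd.to_datetime(s, format='%y.%d.%m')
# %y maps the two-digit year into the 2000s, so '17.14.11' -> 2017-11-14
```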
# title
df['title'] = df['title'].astype('str')
# channel_title
df['channel_title'] = df['channel_title'].astype('str')
# category_id
df['category_id'] = df['category_id'].astype(str)
# tags
df['tags'] = df['tags'].astype('str')
# views, likes, dislikes, comment_count are already in the correct data type (int64)
# thumbnail_link
df['thumbnail_link'] = df['thumbnail_link'].astype('str')
# description
df['description'] = df['description'].astype('str')
# publish_time
df['publish_time'] = pd.to_datetime(df['publish_time'], errors='coerce', format='%Y-%m-%dT%H:%M:%S.%fZ')
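As a small illustration of why `errors='coerce'` is used in the `publish_time` conversion: any malformed timestamp becomes `NaT` instead of raising an exception (the sample values here are made up for demonstration):

```python
import pandas as pd

raw = pd.Series(['2017-11-13T17:13:01.000Z', 'not-a-timestamp'])
parsed = pd.to_datetime(raw, errors='coerce', format='%Y-%m-%dT%H:%M:%S.%fZ')
# parsed[0] is a valid Timestamp; parsed[1] is NaT rather than an error
```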
We checked for missing values using a heat-map, where any missing value in a column appears as an orange square against the black background of the heat-map. As one of the notebook screenshots below shows, no data set had any missing values, so no handling was necessary.
for df in all_dataframes:
    plt.figure()   # start a fresh figure so each country's heat-map renders separately
    sns.heatmap(df.isnull(), cbar=False)
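The same missing-value check can be done numerically, without a plot, by summing the boolean mask that `isnull()` produces (the toy frame below stands in for one of the country dataframes):

```python
import pandas as pd

# A toy frame with one missing value, standing in for a country's dataframe
toy = pd.DataFrame({'views': [10, 20], 'likes': [1, None]})
missing_counts = toy.isnull().sum()
# missing_counts['views'] == 0, missing_counts['likes'] == 1
```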
Next, we further cleaned and refined the data by sorting the entries of each data set by trending_date, which moves the latest trending videos to the top. This lets us view the current trends in each country's trending videos, which are the most relevant to our project.
Before doing so, however, we created a duplicate copy of our data frame. This was a safety precaution: since we also decided to remove duplicate video entries while sorting, we wanted to keep a copy of the original data frame at hand.
# Making a copy of the original dataframe
backup_df = combined_df.reset_index().sort_values('trending_date', ascending=False).set_index('video_id')
# Sorting by latest trending date and keeping only the most recent entry per video
combined_df = combined_df.reset_index().sort_values('trending_date', ascending=False).drop_duplicates('video_id', keep='first').set_index('video_id')
# Sorting each per-country dataframe; reassign into the list, since rebinding the loop variable would not update all_dataframes
all_dataframes = [df.reset_index().sort_values('trending_date', ascending=False).set_index('video_id') for df in all_dataframes]
# Printing results: the latest publication and trending information is now at the top
combined_df[['publish_date', 'publish_time', 'trending_date', 'country']].head()
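The sort-then-deduplicate pattern above can be seen on a toy frame: after sorting by `trending_date` descending, `drop_duplicates(..., keep='first')` retains only each video's most recent trending row (the ids and dates here are made up for illustration):

```python
import pandas as pd

toy = pd.DataFrame({
    'video_id': ['a', 'a', 'b'],
    'trending_date': pd.to_datetime(['2018-01-01', '2018-01-03', '2018-01-02']),
}).set_index('video_id')

latest = (toy.reset_index()
             .sort_values('trending_date', ascending=False)
             .drop_duplicates('video_id', keep='first')
             .set_index('video_id'))
# Video 'a' keeps only its most recent (2018-01-03) trending row
```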
Another randomly selected JSON file had similar data. Each JSON file contains ids ranging from 1 to 44 (both inclusive), and each id is accompanied by its category title and other information such as kind. Hence, we can use any one of the JSON files to map categories to category ids in our data frame.
category_id = {}
with open('DE_category_id.json', 'r') as f:
    d = json.load(f)
    for category in d['items']:
        category_id[category['id']] = category['snippet']['title']

combined_df.insert(2, 'category', combined_df['category_id'].map(category_id))
backup_df.insert(2, 'category', backup_df['category_id'].map(category_id))
for df in all_dataframes:
    df.insert(2, 'category', df['category_id'].map(category_id))

# Printing the cleaned combined dataframe
combined_df.head(3)
combined_df['category'].unique()
Thus, we cleaned and refined our data sets into a finalized data frame, ready for the upcoming EDA section of the project. We pickled both the finalized data frame and a copy of the original cleaned data frame into files for later use.
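The pickling step can be done with pandas' built-in serialization; a minimal sketch follows, where the file name `combined_df.pickle` is an assumption for illustration, not the article's actual file name:

```python
import pandas as pd

df = pd.DataFrame({'views': [100], 'likes': [5]})
df.to_pickle('combined_df.pickle')          # assumed file name
restored = pd.read_pickle('combined_df.pickle')
# restored is an exact copy of df, including dtypes and index
```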