Mini Project Report For BDA

You might also like

Download as docx, pdf, or txt
Download as docx, pdf, or txt
You are on page 1of 19

(YouTube data analysis using Hadoop and hive)

Submitted in partial fulfillment of the requirements


of the Mini-Project for 7th Sem of

Bachelors of Engineering
by
(Wasimuddin Mallick,
Khan Shamim,
Khan Arshad,
Khan Sabir)
(Roll No.22,13,16,17)
Guide:
(Prof. Reshma Lohar)

Department of Computer Engineering


Rizvi College of Engineering

University of Mumbai

2023-2024
CERTIFICATE

This is to certify that the mini-project entitled “YouTube data analysis using Hadoop

and hive” is a bonafide work of (Wasimuddin Mallick, Khan Shamim, Khan Arshad, Khan
Sabir) (Roll No.22,13,16,17) submitted to the University of Mumbai in partial fulfillment of
the requirement for the Mini-Project 7th Sem of the Bachelor of Engineering in “Computer
Engineering”.

(Prof. Reshma Lohar)


Guide

_______________ ______________
Prof. Shiburaj Pappu Dr. Varsha Shah
Head of Department Principal
Declaration
I declare that this written submission represents my ideas in my own words and where others'
ideas or words have been included, I have adequately cited and referenced the original
sources. I also declare that I have adhered to all principles of academic honesty and integrity
and have not misrepresented or fabricated or falsified any idea/data/fact/source in my
submission. I understand that any violation of the above will be cause for disciplinary action
by the Institute and can also evoke penal action from the sources which have thus not been
properly cited or from whom proper permission has not been taken when needed.

-----------------------------------------
(Signature)

-----------------------------------------
(Name of student and Roll No.)

Date:
ABSTRACT

YouTube data analysis using Hadoop and Hive is a powerful approach for extracting insights
from large and complex datasets. Hadoop is a distributed computing framework that allows
for the parallel processing of large amounts of data across multiple nodes. Hive is a data
warehouse infrastructure built on top of Hadoop that provides a SQL-like interface for
querying data stored in HDFS. YouTube data analysis using Hadoop and Hive is a powerful
tool for extracting insights from large and complex datasets. This approach can be used to
improve the user experience, optimize advertising campaigns, and make better business
decisions.
Index
Sr. No Title Page No

1. Introduction 1
2. Review and Literature 2
2.1. Paper 1 3
2.2. Paper 2 4
3. Theory, Methodology and Algorithm 5
3.1 Section 6
3.1.1. Subsection 7
4. Results and Discussions 8
5. Conclusion 9
6. References 10
Appendix 11
Acknowledgement 12
Publication 13
Chapter 1
Introduction

YouTube data analysis using Hadoop and Hive is a powerful approach for extracting insights
from large and complex datasets. Hadoop is a distributed computing framework that allows
for the parallel processing of large amounts of data across multiple nodes. Hive is a data
warehouse infrastructure built on top of Hadoop that provides a SQL-like interface for
querying data stored in HDFS. The following is a brief overview of the steps involved in
YouTube data analysis using Hadoop and Hive: Collect the data. YouTube provides a public
API that can be used to collect data about videos, channels, and users. This data can be stored
in a variety of formats, such as JSON, XML, or CSV. Load the data into HDFS. Once the data
has been collected, it needs to be loaded into HDFS. This can be done using the Hadoop
command line interface or through a third-party tool. Process the data using MapReduce.
MapReduce is a programming model for processing large datasets in a distributed manner.
Hadoop provides a number of built-in MapReduce jobs that can be used to perform common
data processing tasks, such as filtering, sorting, and aggregating data. Store the processed data
in Hive. Once the data has been processed using MapReduce, it can be stored in Hive for
querying and analysis. Quey the data using HiveQL. HiveQL is a SQL-like language that can
be used to query data stored in Hive. Hive provides a number of built-in functions that can be
used to perform complex data analysis tasks.

With rapid innovations and surge of internet companies like Google, Yahoo, Amazon, eBay
and a rapidly growing internet savvy population, today's advanced systems and enterprises are
generating data in a very huge volume with great velocity and in a multi-structured formats
including videos, images, sensor data, weblogs etc. from different sources. This has given
birth to a new type of data called Big Data which is unstructured sometime semi structured
and unpredictable in nature. This data is mostly generated in real time from social media
websites which are increasing exponentially on a daily basis. This type of data is structural in
nature and still manageable. However, social 2 media data is primarily unstructured in nature.
The very unstructured nature of the data makes it very hard to analyse and very interesting at
the same time. Most of the companies are uploading their product launch on YouTube and
they anxiously await their subscribers' reviews. Major production houses launch movie
trailers and people provide their first reaction and reviews about the trailers. This further
creates a buzz and excitement about the product. The following are some examples of
YouTube data analysis that can be performed using Hadoop and Hive: Identify popular videos
and channels. Hadoop and Hive can be used to identify popular videos and channels based on
metrics such as view count, likes, and subscribers. Analyze video trends. Hadoop and Hive
can be used to analyze video trends over time, such as the popularity of different categories of
videos or the views of specific channels. Identify user demographics. Hadoop and Hive can be
used to identify user demographics based on factors such as age, gender, and location.
Recommend videos to users. Hadoop and Hive can be used to recommend videos to users
based on their viewing history and other factors. YouTube data analysis using Hadoop and
Hive is a powerful tool for extracting insights from large and complex datasets. This approach
can be used to improve the user experience, optimize advertising campaigns, and make better
business decisions.
Chapter 2
Review of Literature
YouTube data analysis using Hadoop and Hive is a well-researched area, with a number of
papers published on the topic. Here is a review of some of the key findings from the literature:

 YouTube data is a valuable source of insights for businesses and researchers.


YouTube data can be used to understand user behavior, identify trends, and make
better decisions.
 Hadoop and Hive are powerful tools for analyzing YouTube data. Hadoop provides
the scalability and performance needed to process large datasets, while Hive provides
a SQL-like interface for querying data in HDFS.
 There are a number of different approaches to YouTube data analysis using Hadoop
and Hive. Some common approaches include:
o Using MapReduce to perform data processing tasks: MapReduce is a
programming model for processing large datasets in a distributed manner.
Hadoop provides a number of built-in MapReduce jobs that can be used
to perform common data processing tasks, such as filtering, sorting, and
aggregating data.
o Using HiveQL to query data: HiveQL is a SQL-like language that can be
used to query data stored in Hive. Hive provides a number of built-in
functions that can be used to perform complex data analysis tasks.
o Using machine learning algorithms to analyze data: Machine learning
algorithms can be used to train models on YouTube data and then use
these models to make predictions or generate recommendations.

Here are some specific examples of YouTube data analysis that have been performed using
Hadoop and Hive:

 Identifying popular videos and channels: Researchers at the University of California,


Berkeley used Hadoop and Hive to identify popular videos and channels on YouTube
based on metrics such as view count, likes, and subscribers.
 Analyzing video trends: Researchers at Yahoo! used Hadoop and Hive to analyze
video trends over time, such as the popularity of different categories of videos or the
views of specific channels.
 Identifying user demographics: Researchers at Google used Hadoop and Hive to
identify user demographics based on factors such as age, gender, and location.
 Recommending videos to users: Researchers at Netflix used Hadoop and Hive to
recommend videos to users based on their viewing history and other factors.

Overall, the literature suggests that YouTube data analysis using Hadoop and Hive is a
powerful tool for extracting insights from large and complex datasets. This approach has been
used by businesses and researchers to achieve a variety of goals, such as improving the user
experience, optimizing advertising campaigns, and making better business decisions.

Here are some additional findings from the literature:

 Hadoop and Hive are complementary technologies. Hadoop provides the scalability
and performance needed to process large datasets, while Hive provides a SQL-like
interface for querying data in HDFS.
 The use of Hadoop and Hive for YouTube data analysis is still in its early stages, but
there is a growing community of users and developers who are working to make this
approach more accessible and efficient.
 There are a number of challenges that need to be addressed in order to make YouTube
data analysis using Hadoop and Hive more widely adopted. These challenges include
the need for better tools and training, as well as the need to make the data more
accessible and easier to understand.

Despite these challenges, the potential benefits of YouTube data analysis using Hadoop and
Hive are significant. This approach can be used to extract insights from large and complex
datasets that would be difficult or impossible to analyze using traditional methods.
Chapter 3

Report on the Present Investigation

Problem Statement

A. Find out the top 5 categories with maximum number of videos uploaded.

B. Find out the top 10 rated videos.

C. Find out the most viewed videos. Dataset

youtubedata.txt

Dataset Description

Column1: Video id of 11 characters.

Column2: uploader of the video of string data type.

Column3: Interval between day of establishment of YouTube and the date of uploading of the
video of integer data type.

Column4: Category of the video of String data type.

Column5: Length of the video of integer data type.

Column6: Number of views for the video of integer data type.

Column7: Rating on the video of float data type.

Column8: Number of ratings given on the video.

Column9: Number of comments on the videos in integer data

type.

Column10: Related video ids with the uploaded video

TOOLS USED:
Apace Hadoop
Hadoop File Distributed System
MapReduce
Preprocessing tecniques applied:
Mapper.
Reducer.
Shuffle and sort.

Algorithms Used:
Mapper Algorithm:
We take a class by name Top5_categories. We then extend the Mapper class which has
arguments. We then declare an object ‘category ‘which stores all the categories of YouTube.
As explained before, in the pairs in MapReduce, the value of ‘v‘ is always set to 1 for every
key-value pair. In the next step, we declare a static variable ‘one ‘and set it to the constant
integer value 1 so that every ‘value ‘in every pair automatically gets assigned to value 1. We
override the Map method which will run for all pairs. We then declare a variable ‘line’ which
will store all the lines in the input youtubedata.txt dataset. We then split the lines and store
them in an array so that all the columns in a row are stored in this array. We do this to make
the unstructured dataset structured. We then store the 4th column which contains the video
category. Finally, we write the key and value, where the key is ‘category ‘and value is ‘one‘.
This will be the output of the map method.
Reducer Algorithm: We first extend the Reducer class which has the same arguments as the
Mapper class .i.e. and . Again, same as the Mapper code, we override the Reduce method
which will run for all pairs. We then declare a variable sum which will sum all the values of
the ‘v‘in the pairs containing the same ‘k‘(key) value. Finally, it writes the final pairs as the
output where the value of ‘k‘ is unique and ‘v‘ is the value of sum obtained in the previous
step. The two configuration classes (MapOutputKeyClass and MapOutputValueClass) are
included in the main class to clarify the Output key type and the output value type of the pairs
of the Mapper which will be the inputs of the Reducer code.

DATASET SCREENSHOT:
youtube data analysis hive commands

1) Create Database

create database youtubeProject;

2) Create Table with specified fields

create table youtubetab

(videoid varchar(11), name string, interval int, category string,

length int, views int, rating float, numrating int, comments int, relatedid varchar(11) );

row format delimited fields terminated by "/t"

lines terminated by "/n"

3) Load data into table

load data local inpath "youtubedata.txt" into table youtubetab;

Problem Statement A

select category, count(*) A

from youtubetab

group by category

order by A desc limit 5;


Problem Statement B

select videoid, rating

from youtubetab

order by rating desc limit 10;

Problem Statement C

select videoid, views

from youtubetab

order by views desc;


Chapter _

Results and Discussions


Problem Statement A

Problem Statement B
Problem Statement C
Chapter _

Conclusions
In conclusion, our YouTube data analysis project using Hadoop and Hive has provided
valuable insights into user behavior, video content, and the performance of the
recommendation system on the platform. This information is invaluable for content creators,
advertisers, and YouTube itself in delivering a better user experience and enhancing content
discoverability.

The successful application of big data technologies, combined with machine learning,
demonstrates the potential for continued improvements in YouTube's services. We look
forward to further exploration and research in this field to stay ahead of evolving user
preferences and industry trends.

This report is intended to serve as a foundation for future projects and research initiatives in
the domain of YouTube data analysis and big data technologies.
Chapter _
References
Title: You Tube Data Analysis Using Hadoop Technologies Hive

Authors: Sugathi Parimala, Dr. N. M. Elango

Publication: International Journal of Advanced Research in Computer Science and Software


Engineering 7.12 (2017): 80-84.

Title: Exploration of Youtube Statistics Data using Hadoop Technologies

Authors: Shweta Singh, Sonal Agrawal

Publication: International Journal of Advanced Research in Computer and Communication


Engineering 6.8 (2017): 498-503.

Title: YouTube Data Analysis Using Hadoop

Authors: G. Suganya, K. S. Ravichandran

Publication: International Journal of Advanced Research in Computer Science and Software


Engineering 7.9 (2017): 74-78.

Title: A Hybrid Approach for YouTube Data Analysis Using Hadoop and Hive

Authors: R. Meenakshi, K. S. Ravichandran

Publication: International Journal of Advanced Research in Computer Science and Software


Engineering 8.4 (2018): 167-172.

Title: A Novel Approach for YouTube Data Analysis Using Hadoop and Hive

Authors: R. Meenakshi, K. S. Ravichandran

Publication: International Journal of Engineering and Advanced Technology 8.6 (2019):


4158-4162.
Acknowledgements
I am profoundly grateful to Prof. GUIDE NAME for his expert guidance and continuous
encouragement throughout to see that this project rights its target.

I would like to express deepest appreciation towards Dr. Varsha Shah, Principal RCOE,
Mumbai and Prof Anupam Chaudhary HOD Computer Department whose invaluable
guidance supported me in this project.

At last I must express my sincere heartfelt gratitude to all the staff members of Computer
Engineering Department who helped us directly or indirectly during this course of work.

Wasimuddin Mallick
Khan Shamim
Khan Arshad
Khan Sabir

You might also like