Mini Project Report For BDA
Bachelor of Engineering
by
(Wasimuddin Mallick,
Khan Shamim,
Khan Arshad,
Khan Sabir)
(Roll No.22,13,16,17)
Guide:
(Prof. Reshma Lohar)
University of Mumbai
2023-2024
CERTIFICATE
This is to certify that the mini-project entitled “YouTube data analysis using Hadoop
and Hive” is a bona fide work of (Wasimuddin Mallick, Khan Shamim, Khan Arshad, Khan
Sabir) (Roll No. 22, 13, 16, 17) submitted to the University of Mumbai in partial fulfillment of
the requirement for the Mini-Project, 7th Sem, of the Bachelor of Engineering in “Computer
Engineering”.
_______________ ______________
Prof. Shiburaj Pappu Dr. Varsha Shah
Head of Department Principal
Declaration
I declare that this written submission represents my ideas in my own words and where others'
ideas or words have been included, I have adequately cited and referenced the original
sources. I also declare that I have adhered to all principles of academic honesty and integrity
and have not misrepresented or fabricated or falsified any idea/data/fact/source in my
submission. I understand that any violation of the above will be cause for disciplinary action
by the Institute and can also evoke penal action from the sources which have thus not been
properly cited or from whom proper permission has not been taken when needed.
-----------------------------------------
(Signature)
-----------------------------------------
(Name of student and Roll No.)
Date:
ABSTRACT
YouTube data analysis using Hadoop and Hive is an approach for extracting insights from
large and complex datasets. Hadoop is a distributed computing framework that allows for the
parallel processing of large amounts of data across multiple nodes. Hive is a data warehouse
infrastructure built on top of Hadoop that provides a SQL-like interface for querying data
stored in HDFS. The insights obtained in this way can be used to improve the user
experience, optimize advertising campaigns, and make better business decisions.
Index
Sr. No  Title
1. Introduction
2. Review of Literature
2.1. Paper 1
2.2. Paper 2
3. Theory, Methodology and Algorithm
3.1. Section
3.1.1. Subsection
4. Results and Discussions
5. Conclusion
6. References
Appendix
Acknowledgement
Publication
Chapter 1
Introduction
YouTube data analysis using Hadoop and Hive is an approach for extracting insights from
large and complex datasets. Hadoop is a distributed computing framework that allows for the
parallel processing of large amounts of data across multiple nodes. Hive is a data warehouse
infrastructure built on top of Hadoop that provides a SQL-like interface for querying data
stored in HDFS. In brief, YouTube data analysis using Hadoop and Hive involves the
following steps:
1) Collect the data. YouTube provides a public API that can be used to collect data about
videos, channels, and users. This data can be stored in a variety of formats, such as JSON,
XML, or CSV.
2) Load the data into HDFS. Once the data has been collected, it is loaded into HDFS, either
with the Hadoop command line interface or through a third-party tool.
3) Process the data using MapReduce. MapReduce is a programming model for processing large
datasets in a distributed manner. MapReduce jobs can perform common data processing tasks
such as filtering, sorting, and aggregating data.
4) Store the processed data in Hive. Once the data has been processed using MapReduce, it
can be stored in Hive for querying and analysis.
5) Query the data using HiveQL. HiveQL is a SQL-like language for querying data stored in
Hive. Hive provides a number of built-in functions that can be used to perform complex data
analysis tasks.
With rapid innovation, the surge of internet companies such as Google, Yahoo, Amazon and
eBay, and a rapidly growing internet-savvy population, today's systems and enterprises
generate data in huge volumes, at great velocity, and in multi-structured formats including
videos, images, sensor data and weblogs from different sources. This has given rise to a new
type of data called Big Data, which is unstructured, sometimes semi-structured, and
unpredictable in nature. Much of this data is generated in real time by social media
websites and grows exponentially on a daily basis. Part of this data is structured and still
manageable. However, social media data is primarily unstructured. The unstructured nature of
the data makes it hard to analyse and interesting at the same time. Many companies upload
their product launches to YouTube and anxiously await their subscribers' reviews. Major
production houses launch movie trailers, and people post their first reactions and reviews.
This creates further buzz and excitement about the product.
The following are some examples of YouTube data analysis that can be performed using Hadoop
and Hive:
1) Identify popular videos and channels, based on metrics such as view count, likes, and
subscribers.
2) Analyze video trends over time, such as the popularity of different categories of videos
or the views of specific channels.
3) Identify user demographics, based on factors such as age, gender, and location.
4) Recommend videos to users, based on their viewing history and other factors.
Analyses of this kind can be used to improve the user experience, optimize advertising
campaigns, and make better business decisions.
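As a small illustration of the first kind of analysis mentioned above (ranking videos by view count), the following Python sketch shows only the ranking logic on a single machine. It is not part of the project code. The record layout, the function name most_viewed and the sample values are hypothetical, and on a real cluster the same ranking would be carried out by Hadoop or Hive over data in HDFS.

```python
import heapq
from typing import Dict, List

def most_viewed(videos: List[Dict], n: int = 3) -> List[Dict]:
    """Return the n records with the highest view count (hypothetical helper)."""
    return heapq.nlargest(n, videos, key=lambda v: v["views"])

# Hypothetical sample records, not taken from the project dataset
videos = [
    {"videoid": "v1", "views": 120},
    {"videoid": "v2", "views": 300},
    {"videoid": "v3", "views": 50},
    {"videoid": "v4", "views": 75},
]
print([v["videoid"] for v in most_viewed(videos, n=2)])  # ['v2', 'v1']
```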
Chapter 2
Review of Literature
YouTube data analysis using Hadoop and Hive has been examined in a number of published
papers. The key findings from the literature are as follows:
1) Hadoop and Hive are complementary technologies. Hadoop provides the scalability and
performance needed to process large datasets, while Hive provides a SQL-like interface for
querying data in HDFS.
2) The use of Hadoop and Hive for YouTube data analysis is still in its early stages, but
there is a growing community of users and developers who are working to make this approach
more accessible and efficient.
3) A number of challenges need to be addressed before YouTube data analysis using Hadoop and
Hive is more widely adopted. These include the need for better tools and training, and the
need to make the data more accessible and easier to understand.
Overall, the literature suggests that YouTube data analysis using Hadoop and Hive is an
effective way of extracting insights from large and complex datasets. This approach has been
used by businesses and researchers to achieve a variety of goals, such as improving the user
experience, optimizing advertising campaigns, and making better business decisions.
Despite the challenges listed above, the potential benefits are significant. The approach
can extract insights from large and complex datasets that would be difficult or impossible
to analyze using traditional methods.
Chapter 3
Problem Statement
A. Find the top 5 categories with the maximum number of videos uploaded.
Dataset: youtubedata.txt
Dataset Description
Column 3: Interval between the day YouTube was established and the date the video was
uploaded (integer data type).
TOOLS USED:
Apache Hadoop
Hadoop Distributed File System (HDFS)
MapReduce
Preprocessing techniques applied:
Mapper.
Shuffle and sort.
Reducer.
Algorithms Used:
Mapper Algorithm:
We take a class named Top5_categories. We then extend the Mapper class, which takes type
arguments for its input and output key-value pairs. We then declare an object ‘category’,
which stores the category of a YouTube video. In this job, the value ‘v’ in every <k, v>
pair emitted by the mapper is always set to 1. To achieve this, we declare a static variable
‘one’ and set it to the constant integer value 1, so that the ‘value’ in every pair is 1. We
override the map method, which runs once for every input <k, v> pair. We then declare a
variable ‘line’, which stores the current line of the input dataset youtubedata.txt. We
split the line and store the pieces in an array, so that all the columns of a row are held
in this array. We do this to give the raw text a structure. We then store the 4th column,
which contains the video category. Finally, we write the key and value, where the key is
‘category’ and the value is ‘one’. This is the output of the map method.
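The map step described above can be sketched in a few lines of Python. This is only a single-machine illustration of the logic, not the project's Java Mapper class. It assumes that the fields of youtubedata.txt are separated by tabs (the report does not state the delimiter); the function name map_line, the guard against short rows and the sample rows are hypothetical.

```python
from typing import Iterator, Tuple

ONE = 1  # constant value emitted with every key, as in the description above

def map_line(line: str, delimiter: str = "\t") -> Iterator[Tuple[str, int]]:
    """Emit (category, 1) for one input line; rows without a 4th column are skipped."""
    columns = line.rstrip("\n").split(delimiter)
    if len(columns) > 3:        # assumed guard against short or malformed rows
        category = columns[3]   # the 4th column holds the video category
        yield (category, ONE)

# Hypothetical sample rows, not taken from youtubedata.txt
sample = [
    "vid001\tuserA\t1135\tMusic\t215",
    "vid002\tuserB\t1140\tComedy\t98",
    "bad row",
]
pairs = [pair for line in sample for pair in map_line(line)]
print(pairs)  # [('Music', 1), ('Comedy', 1)]
```

Skipping short rows is a design choice of this sketch: without it, a malformed line would raise an index error and stop the whole run.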
Reducer Algorithm:
We first extend the Reducer class, which, like the Mapper class, takes type arguments for
its input and output <k, v> pairs. As in the Mapper code, we override the reduce method,
which runs once for every key together with all the values grouped under that key. We then
declare a variable ‘sum’, which adds up the values ‘v’ of all the pairs that share the same
key ‘k’. Finally, the reducer writes the final <k, v> pairs as output, where each ‘k’ is
unique and ‘v’ is the sum obtained in the previous step. Two configuration settings in the
main class (setMapOutputKeyClass and setMapOutputValueClass) declare the output key type and
the output value type of the Mapper's <k, v> pairs, which are the inputs to the Reducer.
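The shuffle-and-sort and reduce steps described above can be sketched in Python as follows. This is a single-machine illustration under stated assumptions, not the project's Java Reducer. The function names shuffle, reduce_counts and top_categories are hypothetical, and the final top-5 selection is assumed to happen after the reduce step, since the report does not say where it is done.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group values by key, as shuffle and sort does between map and reduce."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(sorted(grouped.items()))  # keys reach the reducer in sorted order

def reduce_counts(key: str, values: Iterable[int]) -> Tuple[str, int]:
    """Sum all values that share one key and emit a single (k, sum) pair."""
    return (key, sum(values))

def top_categories(pairs: Iterable[Tuple[str, int]], n: int = 5) -> List[Tuple[str, int]]:
    """Run shuffle and reduce, then keep the n largest counts (assumed final step)."""
    totals = [reduce_counts(k, vs) for k, vs in shuffle(pairs).items()]
    # Sort by count, largest first; ties are broken alphabetically by category
    return sorted(totals, key=lambda kv: (-kv[1], kv[0]))[:n]

# Hypothetical mapper output
mapped = [("Music", 1), ("Comedy", 1), ("Music", 1),
          ("Sports", 1), ("Music", 1), ("Comedy", 1)]
print(top_categories(mapped))  # [('Music', 3), ('Comedy', 2), ('Sports', 1)]
```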
DATASET SCREENSHOT:
YouTube Data Analysis: Hive Commands
1) Create Database
length int, views int, rating float, numrating int, comments int, relatedid varchar(11) );
Problem Statement A
from youtubetab
group by category
from youtubetab
Problem Statement C
from youtubetab
Problem Statement B
Problem Statement C
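The query fragments that survive for Problem Statement A (from youtubetab, group by category) point to a grouped count over the category column. The sketch below illustrates that idea in Python, with the standard-library sqlite3 module standing in for Hive so that it runs without a cluster. It is not the project's HiveQL. Only the table name youtubetab and the column names category and views come from this report; the videoid column, the alias total and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory SQLite stands in for Hive here
conn.execute("CREATE TABLE youtubetab (videoid TEXT, category TEXT, views INTEGER)")

# Hypothetical sample rows, not taken from the project dataset
rows = [
    ("v1", "Music", 120), ("v2", "Music", 300), ("v3", "Comedy", 50),
    ("v4", "Sports", 75), ("v5", "Comedy", 10), ("v6", "Music", 5),
]
conn.executemany("INSERT INTO youtubetab VALUES (?, ?, ?)", rows)

# Count videos per category and keep the five largest groups
top5 = conn.execute(
    """
    SELECT category, COUNT(*) AS total
    FROM youtubetab
    GROUP BY category
    ORDER BY total DESC, category
    LIMIT 5
    """
).fetchall()
print(top5)  # [('Music', 3), ('Comedy', 2), ('Sports', 1)]
conn.close()
```

The grouped count gives the same result as the Mapper and Reducer described earlier, which emit a 1 per video and sum the values per category.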
Chapter 5
Conclusions
In conclusion, our YouTube data analysis project using Hadoop and Hive has provided
valuable insights into user behavior, video content, and the performance of the
recommendation system on the platform. This information is invaluable for content creators,
advertisers, and YouTube itself in delivering a better user experience and enhancing content
discoverability.
The successful application of big data technologies, combined with machine learning,
demonstrates the potential for continued improvements in YouTube's services. We look
forward to further exploration and research in this field to stay ahead of evolving user
preferences and industry trends.
This report is intended to serve as a foundation for future projects and research initiatives in
the domain of YouTube data analysis and big data technologies.
Acknowledgement
We would like to express our deepest appreciation to Dr. Varsha Shah, Principal, RCOE,
Mumbai, and Prof. Anupam Chaudhary, HOD, Computer Department, whose invaluable guidance
supported us in this project.
Finally, we express our sincere and heartfelt gratitude to all the staff members of the
Computer Engineering Department who helped us directly or indirectly during the course of
this work.
Wasimuddin Mallick
Khan Shamim
Khan Arshad
Khan Sabir