Industrial Internship Report
Submitted by:
December
2022
Declaration
I hereby declare that the internship report entitled “YouTube Adview Prediction” is
my own work and that, to the best of my knowledge and belief, it contains no material
previously published or written by another person, nor material which to a substantial
extent has been accepted for the award of any degree of the university or another
institute of higher learning.
Date: 13/12/2022
Department of Computer Science & Engineering
Certificate of Approval
This is to certify that we have examined the training report entitled “YouTube Adview
Prediction” submitted by Niharika Patnaik (Regd. No. 1901227448), CGU,
Bhubaneswar. We hereby accord our approval of the training work, carried out and
presented in the manner required for its acceptance as per the academic regulations, in
partial fulfillment of the requirements for the 7th Semester in Computer Science & Engineering.
This training has fulfilled all the requirements as per the regulations of the university.
It gives me immense pleasure to express my sincere gratitude to our faculty coordinator, Prof.
Monalisa Mishra, for her support and advice in securing and completing the internship in the
above-mentioned organization.
I extend my sincere thanks to our HOD, Dr. R. Priyadarshini, for her immeasurable support
throughout my internship.
I would also like to acknowledge the contribution of the other faculty members of the Department
of CSE for their cooperation and kind assistance in the successful completion of this internship.
Machine learning is a method of data analysis that automates analytical model building. It is a
branch of artificial intelligence based on the idea that systems can learn from data, identify
patterns and make decisions with minimal human intervention. Machine learning is behind
chatbots and predictive text, language translation apps, the shows Netflix suggests, and how
the social media feeds are presented. It powers autonomous vehicles and machines that can
diagnose medical conditions based on images.
Machine learning starts with data: numbers, photos, or text, such as bank transactions, pictures
of people or even bakery items, repair records, time series data from sensors, or sales reports.
The data is gathered and prepared to be used as training data, that is, the information the machine
learning model will be trained on. In general, the more representative data is available, the
better the resulting model.
From there, programmers choose a machine learning model to use, supply the data, and let the
computer model train itself to find patterns or make predictions. Over time the human
programmer can also tweak the model, including changing its parameters, to help push it
toward more accurate results.
The project completed during the course of the internship was titled “YouTube Adview
Prediction”. In it, the number of adviews for YouTube videos is predicted using different
machine learning models, namely Linear Regression (LR), Support Vector Machine (SVM),
Decision Tree (DT), and Random Forest (RF), and a deep learning model, an Artificial Neural
Network (ANN). Finally, a comparative analysis is done based on the experimental results
acquired from the different models.
Contents
DECLARATION
CERTIFICATE OF APPROVAL
INTERNSHIP CERTIFICATE
ACKNOWLEDGEMENT
ABSTRACT
CONTENTS
WEEKLY OVERVIEW
INTRODUCTION
OVERVIEW
BACKGROUND AND MOTIVATION
LEARNING OBJECTIVE
METHODOLOGY
RESULT/LEARNING OUTCOME
CONCLUSION
WEEKLY OVERVIEW OF INTERNSHIP ACTIVITIES
1.1 Overview:
YouTube is a world-famous interactive video-sharing platform that allows its users to rate,
share, save, comment on, and upload content. Unlike popular videos, which have already
accumulated a large number of likes and views by the time they are labelled popular, YouTube
trending videos represent content that is gaining viewership over a certain time period and has
the potential to become popular.
YouTube advertisers pay content creators based on adviews and clicks. For the products, goods,
and services being marketed, they want to estimate the adviews from other metrics such as
comments, likes, and dislikes. Analyzing this information manually is tedious and
time-consuming, and the results are unlikely to be accurate. Inaccurate predictions in turn
lead to a decline in profit for the products.
The aim of this project is to train various models and choose the best one to predict the number
of ad views. The data, which is based on different attributes, needs to be refined and cleaned
before being fed to the algorithms in order to get better results. Techniques like Linear
Regression (LR), Support Vector Machine (SVM), Decision Tree (DT), Random Forest (RF), and
Artificial Neural Network (ANN) are used, and a comparative analysis is done based on their
results. This will help influencers get an idea of the expected view count before making and
finalizing a video.
1.2 Background and Motivation:
YouTube is the largest online video sharing platform in the world. Launched in May 2005,
YouTube allows billions of people around the world to discover, watch, and share originally
created videos. YouTube allows individuals all around the world to interact, educate, and
inspire one another, and acts as a distribution platform for original content creators and
advertisers, both large and small. YouTube offers interactive video features for the public and
for content creators, such as Views, which denotes the total viewership gathered by a
particular video to date. The video view count is an important metric for determining a
video's popularity or "user engagement," as well as the parameter by which YouTube
compensates content creators. Whenever a video gains popularity, it is made available to
a large number of viewers for free and attracts mass attention for a while. It is hard to keep
track of which content might start trending or become popular in the near future; hence,
predictive analysis using Machine Learning is introduced.
Content creators, or YouTubers as they are called, also generate revenue from their videos.
YouTube is the sole source of income for many YouTubers, and this study will help creators
analyse the life cycle of their content and make improvements in the required areas. Feedback
from viewers is a very important aspect for YouTubers, as it shows how their content is being
received by people. This study helps YouTube and YouTubers understand how the interactive
features affect a video's performance on the platform.
1.3 Objective:
To analyse and compare different Machine Learning regression algorithms trained on criteria
and metrics like comments, dislikes, likes, etc. to predict the number of adviews for a video.
Implementation:
The entire methodology for YouTube Adview Prediction is shown in detail in the figure below.
The proposed approach for YouTube adview prediction and its implementation are explained
step by step in this section. The steps involved are:
Data Description:
The dataset used is the data.csv file, which includes the metrics and a few other details of
15000 YouTube videos. The metric criteria consist of views, comments, likes, dislikes,
duration, published date, and category. The attributes in the dataset are vidid, likes, views,
adviews, dislikes, published, comments, category, and duration.
Attribute Information:-
Initially, Python libraries such as numpy, pandas, matplotlib, and seaborn were imported and
used for data cleaning and visualization. The dataset in CSV format was then loaded into a
pandas DataFrame, and the number of features and samples in the data was explored.
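The loading and exploration step might look roughly like the sketch below. The three sample rows, their values, and the exact column spellings are invented stand-ins for data.csv, and the in-memory string replaces the real file only so that the example runs on its own.

```python
import io

import pandas as pd

# Hypothetical stand-in for data.csv: rows and values are invented for illustration.
csv_text = """vidid,adviews,views,likes,dislikes,comments,published,duration,category
VID_1,40,1031602,8523,363,1095,2016-09-14,PT7M37S,F
VID_2,2,1707,56,2,6,2016-10-01,PT9M30S,D
VID_3,1,2023,25,0,2,2016-07-02,PT2M16S,C
"""

# With the real file this would be pd.read_csv("data.csv").
df = pd.read_csv(io.StringIO(csv_text))

# Number of samples (rows) and features (columns).
n_samples, n_features = df.shape
print(n_samples, n_features)

# Column types and a first look at the rows.
print(df.dtypes)
print(df.head())
```

Checking `df.dtypes` early is useful here because columns that should be numeric are often read as text when a few entries are malformed.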
The seaborn and matplotlib libraries were used for plotting. The individual features were
plotted (as shown in fig. 2 and fig. 3) and the distribution of the data was analyzed. This was
used to spot outliers (if any) in the data, which in turn helped the models train better. A
heatmap was also plotted (as shown in fig. 4) using the seaborn library to visualize the
correlations between the features.
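As a rough illustration of what the heatmap and the outlier check rely on, the sketch below computes the correlation matrix that a heatmap would display and flags extreme adview values. The data is synthetic, and the 99th-percentile cut-off is an assumption made for the example, not the rule used in the project.

```python
import numpy as np
import pandas as pd

# Synthetic data for illustration only; the real project used data.csv.
rng = np.random.default_rng(0)
views = rng.integers(1_000, 1_000_000, size=200)
likes = np.clip(views * 0.01 + rng.normal(0, 500, size=200), 0, None)
df = pd.DataFrame({
    "views": views,
    "likes": likes,
    "adviews": rng.integers(1, 100, size=200),
})
df.loc[0, "adviews"] = 5_000_000  # one injected extreme value

# Pairwise Pearson correlations: this is the matrix a heatmap visualizes.
corr = df.corr()
print(corr.round(2))

# Assumed outlier rule: anything above the 99th percentile of adviews.
threshold = df["adviews"].quantile(0.99)
outliers = df[df["adviews"] > threshold]
print(len(outliers), "rows flagged as outliers")
```

Passing `corr` to a plotting call such as seaborn's `heatmap` would give a figure of the kind referred to as fig. 4.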
Data cleaning:
Cleaning the dataset is one of the vital steps in dealing with a machine learning problem.
Here, the dataset was cleaned by removing missing values (nulls) and other miscellaneous or
invalid entries so that they do not interfere with the later steps.
• Drop null values and unnecessary data.
• Rearrange the columns so that the data is easy to split for training.
The categorical data and the data in other formats were converted into numerical form.
Date and time functions and a label encoder were used for this. This process is also known as
feature engineering.
Further, the data was converted to float for the later processing and evaluation steps, the
duration was converted into seconds, and the date was converted into numeric form and split
into year, month, and day for further analysis.
• Convert the views, likes, dislikes, and comments data into numeric form using
pandas.to_numeric() with errors="coerce", so that any value that cannot be converted
becomes NaN.
• Convert the published date into numeric form and split it into year, month, and day.
• Convert the duration into seconds.
• Label-encode the category for faster and easier analysis.
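The conversions listed above can be sketched as follows. The sample rows are invented, and the duration format (ISO 8601 strings such as PT7M37S) is an assumption about the dataset; `duration_to_seconds` is a hypothetical helper written for this example.

```python
import re

import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Invented sample rows; the "F" in views mimics a malformed entry.
df = pd.DataFrame({
    "views": ["1031602", "1707", "F"],
    "likes": ["8523", "56", "25"],
    "dislikes": ["363", "2", "0"],
    "comments": ["1095", "6", "2"],
    "published": ["2016-09-14", "2016-10-01", "2016-07-02"],
    "duration": ["PT7M37S", "PT1H2M", "PT45S"],
    "category": ["F", "D", "F"],
})

# Coerce metric columns to numbers; unparseable values become NaN and are dropped.
metric_cols = ["views", "likes", "dislikes", "comments"]
for col in metric_cols:
    df[col] = pd.to_numeric(df[col], errors="coerce")
df = df.dropna(subset=metric_cols).reset_index(drop=True)

# Split the published date into year, month, and day.
published = pd.to_datetime(df["published"])
df["year"] = published.dt.year
df["month"] = published.dt.month
df["day"] = published.dt.day
df = df.drop(columns=["published"])


def duration_to_seconds(text):
    """Convert an assumed ISO 8601 duration such as 'PT1H2M3S' to seconds."""
    match = re.fullmatch(r"PT(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?", text)
    if match is None:
        return float("nan")
    hours, minutes, seconds = (int(g) if g else 0 for g in match.groups())
    return hours * 3600 + minutes * 60 + seconds


df["duration"] = df["duration"].map(duration_to_seconds)

# Map category labels to integer codes (classes are sorted, so D -> 0, F -> 1).
df["category"] = LabelEncoder().fit_transform(df["category"])
print(df)
```

Dropping the coerced NaN rows right after conversion keeps the cleaning and the type conversion in one place, so no text values reach the models.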
The dataset was split into training and testing data in the ratio 80:20. Then, normalization
was done using MinMaxScaler (which transforms each variable into the range 0 to 1), so that
all the features are on a comparable scale and are appropriately weighted in the training stage.
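A minimal sketch of the split and scaling step is given below, using synthetic arrays in place of the cleaned dataset. Fitting the scaler on the training portion only is a common precaution assumed here; the report does not state how the scaler was fitted.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic features and target standing in for the cleaned dataset.
rng = np.random.default_rng(42)
X = rng.uniform(0, 1000, size=(100, 4))
y = rng.uniform(0, 50, size=100)

# 80:20 split; random_state is fixed only to make the example repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Learn each feature's min and max from the training data, then reuse them
# for the test data so that no test information leaks into training.
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.min(axis=0), X_train_scaled.max(axis=0))
```

With this approach the scaled test values can fall slightly outside 0 to 1 when a test sample exceeds the training range, which is expected.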
Several machine learning models, namely Linear Regression (LR), Support Vector Regressor
(SVR), Decision Tree Regressor (DTR), and Random Forest Regressor (RFR), were trained on
the data. These models were imported from the scikit-learn library and trained on the labeled
data with the necessary hyperparameters.
Each model was trained in turn and its errors were noted. A deep learning model, an
Artificial Neural Network (ANN), was also trained to compare its performance with the
above-mentioned machine learning models, so as to obtain a more accurate prediction of adviews.
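Training the four scikit-learn regressors and noting their errors could be sketched as below. The data is synthetic and the hyperparameters are library defaults or arbitrary choices, so the resulting numbers say nothing about the project's actual results.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression problem standing in for the scaled adview data.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(300, 4))
y = 3 * X[:, 0] + 2 * X[:, 1] ** 2 + rng.normal(0, 0.1, size=300)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Hyperparameters here are illustrative, not those used in the project.
models = {
    "Linear Regression": LinearRegression(),
    "Support Vector Regressor": SVR(),
    "Decision Tree Regressor": DecisionTreeRegressor(random_state=0),
    "Random Forest Regressor": RandomForestRegressor(n_estimators=50, random_state=0),
}

# Fit each model and record its RMSE on the held-out data.
rmse = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    rmse[name] = float(np.sqrt(mean_squared_error(y_test, predictions)))
    print(f"{name}: RMSE = {rmse[name]:.4f}")

# The model with the smallest RMSE is the candidate for testing.
best_model_name = min(rmse, key=rmse.get)
print("Lowest RMSE:", best_model_name)
```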
Training the model using ANN:
Training an ANN is an iterative process in which training data examples are presented to the
network one by one, and the values of the weights are adjusted each time. After all examples
get run through the network, one training epoch is finished and the process often starts again.
Initially, the model architecture was defined, including the layers, number of neurons,
activation function, and cost function. The model was then trained for different numbers of
epochs using Keras, which improved its performance.
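To make the idea of per-example weight updates and epochs concrete, here is a small NumPy sketch of a one-hidden-layer network trained by stochastic gradient descent. It is not the Keras model used in the project: the layer size, learning rate, ReLU activation, squared-error cost, and synthetic data are all assumptions chosen for illustration.

```python
import numpy as np

# Synthetic regression data (illustrative only).
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + 0.3

# One hidden layer with ReLU; sizes and learning rate are arbitrary choices.
n_hidden = 8
params = {
    "W1": rng.normal(0.0, 0.5, size=(3, n_hidden)),
    "b1": np.zeros(n_hidden),
    "W2": rng.normal(0.0, 0.5, size=n_hidden),
    "b2": 0.0,
}
learning_rate = 0.01


def forward(p, x):
    """Return pre-activation, hidden activation, and prediction."""
    z = x @ p["W1"] + p["b1"]
    h = np.maximum(z, 0.0)
    return z, h, h @ p["W2"] + p["b2"]


def mse(p, inputs, targets):
    """Mean squared error over a whole dataset."""
    _, _, preds = forward(p, inputs)
    return float(np.mean((preds - targets) ** 2))


history = [mse(params, X, y)]
for epoch in range(20):
    # One epoch: every training example is presented once, in random order.
    for i in rng.permutation(len(X)):
        z, h, pred = forward(params, X[i])
        err = pred - y[i]  # gradient of 0.5 * err**2 with respect to pred

        # Backpropagate through the two layers (all gradients before updates).
        grad_W2 = err * h
        grad_b2 = err
        grad_z = (err * params["W2"]) * (z > 0)
        grad_W1 = np.outer(X[i], grad_z)
        grad_b1 = grad_z

        # Adjust the weights after this single example.
        params["W2"] -= learning_rate * grad_W2
        params["b2"] -= learning_rate * grad_b2
        params["W1"] -= learning_rate * grad_W1
        params["b1"] -= learning_rate * grad_b1
    history.append(mse(params, X, y))

print(f"MSE before training: {history[0]:.4f}, after: {history[-1]:.4f}")
```

In practice a library such as Keras performs these updates in mini-batches rather than one example at a time, but the epoch structure is the same.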
The results were evaluated in terms of Root Mean Squared Error (RMSE). The machine learning
model with the minimum error and the ANN model were selected for testing. Both models
were saved using Keras and scikit-learn. Finally, the test data was used to predict
YouTube ad views with the chosen models.
Chapter 3
RESULTS/ OUTCOME
Results:
The deep learning and machine learning models were applied to the dataset in order to examine
the performance of the algorithms. For every model used, namely Linear Regression, Decision
Tree Regressor, Random Forest Regressor, Support Vector Regressor, and finally the Artificial
Neural Network, the mean absolute error, mean squared error, root mean squared error,
variance score, and R2 score were calculated. Based on these metrics, the model with the
minimum root mean squared error and the highest R2 score was selected.
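The metrics named above can be computed with scikit-learn as in the sketch below. The four true and predicted values are made up for the example, and `explained_variance_score` is assumed to be what the report calls the variance score.

```python
import numpy as np
from sklearn.metrics import (
    explained_variance_score,
    mean_absolute_error,
    mean_squared_error,
    r2_score,
)

# Made-up targets and predictions, chosen so the arithmetic is easy to follow.
y_true = np.array([10.0, 20.0, 30.0, 40.0])
y_pred = np.array([12.0, 18.0, 33.0, 37.0])

mae = mean_absolute_error(y_true, y_pred)   # (2 + 2 + 3 + 3) / 4 = 2.5
mse = mean_squared_error(y_true, y_pred)    # (4 + 4 + 9 + 9) / 4 = 6.5
rmse = float(np.sqrt(mse))                  # sqrt(6.5), about 2.55
variance_score = explained_variance_score(y_true, y_pred)
r2 = r2_score(y_true, y_pred)               # 1 - 26 / 500 = 0.948

print(f"MAE={mae}, MSE={mse}, RMSE={rmse:.4f}")
print(f"Explained variance={variance_score:.3f}, R2={r2:.3f}")
```

On a fixed test set, a lower RMSE always corresponds to a higher R2, so ranking the models by either metric gives the same order.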
[Table: error metrics per model: Linear Regression, Decision Tree Regressor, Random Forest Regressor, Artificial Neural Network]
Implementing all the algorithms produced the above values. Based on these values, the
Random Forest Regressor was chosen, since it had the minimum Root Mean Squared Error
(RMSE) and the highest R2 score.
Finally, after a model has been trained, it is desirable to have a way to persist it for future
use without having to retrain. So, the Random Forest Regressor model was saved using the
joblib.dump() method and then tested on the test.csv data.
It then predicts the number of adviews for the test data and saves them in the prediction.csv file.
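Persisting the model and writing out predictions might be sketched as follows. The training data and the unseen rows are synthetic, the model file name is hypothetical, and a temporary directory is used so that the example leaves no files behind; in the project the inputs would come from test.csv.

```python
import os
import tempfile

import joblib
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic training data and unseen rows (stand-ins for the real datasets).
rng = np.random.default_rng(0)
X_train = rng.uniform(0, 1, size=(100, 4))
y_train = 100 * X_train[:, 0] + rng.normal(0, 1, size=100)
X_new = rng.uniform(0, 1, size=(10, 4))

model = RandomForestRegressor(n_estimators=20, random_state=0)
model.fit(X_train, y_train)

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical file name; the report does not give one.
    model_path = os.path.join(tmp_dir, "random_forest_adview.pkl")
    joblib.dump(model, model_path)

    # Later, the saved model can be reloaded without retraining.
    restored = joblib.load(model_path)
    predictions = restored.predict(X_new)

    # Save the predicted adviews, one row per input sample.
    out_path = os.path.join(tmp_dir, "prediction.csv")
    pd.DataFrame({"adviews": predictions}).to_csv(out_path, index=False)
    saved = pd.read_csv(out_path)

print(saved.head())
```

The reloaded model gives the same predictions as the original, which is the point of persisting it.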
This project focuses on predicting ad views for YouTube videos using Deep Learning and
Machine Learning techniques. Linear Regression (LR), Support Vector Regressor (SVR),
Decision Tree Regressor (DTR), Random Forest Regressor (RFR), and an Artificial Neural
Network (ANN) were trained and compared. The Random Forest algorithm obtained the
minimum RMSE value of 24594.19, while the ANN obtained an RMSE value of 28794.886.
Finally, the Random Forest Regressor was saved and used to predict the ad views for the
test data.