Industrial Internship Training Report

YouTube Ad view Prediction

Submitted by:

Name - Niharika Patnaik


Regd. No.- 1901227448
Semester- 7th Sem
Branch - CSE

Under Supervision of:

Mr. Kashish Kumar


INTERNSHIP STUDIO
(Duration: 05th January, 2022 - 09th February, 2022)

Department of Computer Science & Engineering


C. V. RAMAN GLOBAL UNIVERSITY,
BHUBANESWAR, ODISHA

December
2022
Declaration

I hereby declare that the internship report entitled “Youtube Adview Prediction” is
my own work and that, to the best of my knowledge and belief, it contains no material
previously published or written by another person nor material which to substantial
extent has been accepted for the award of any degree of the university or another
institute of higher learning.

Name of the Student: Niharika Patnaik

Regn No.: 1901227448

Date: 13/12/2022
Department of Computer Science & Engineering

C. V. RAMAN GLOBAL UNIVERSITY

Certificate of Approval

This is to certify that we have examined the training report entitled “YouTube Adview
Prediction” submitted by, Niharika Patnaik (Regd No.-1901227448), CGU,
Bhubaneswar. We hereby accord our approval of the training work carried out and
presented in a manner required for its acceptance as per the academic regulation, for the
partial fulfillment for the 7th Semester in Computer Science & Engineering. This
training has fulfilled all the requirements as per the regulations of the university.

Prof. M.Mishra Dr. R. Priyadarshini


(Internship Coordinator) (H.O.D, CSE)
Acknowledgement

It gives me immense pleasure to express my sincere gratitude to our faculty coordinator, Prof.
Monalisa Mishra, for her support and advice in securing and completing the internship at the
above-mentioned organization.

I extend my sincere thanks to our HOD Dr. R. Priyadarshini for her immeasurable support
throughout my internship.

I would also like to acknowledge the contribution of the other faculty members of the Department of
CSE for their cooperation and kind assistance in the successful completion of this internship.

13 December 2022 Niharika Patnaik (1901227448)


Abstract

Machine learning is a method of data analysis that automates analytical model building. It is a
branch of artificial intelligence based on the idea that systems can learn from data, identify
patterns and make decisions with minimal human intervention. Machine learning is behind
chatbots and predictive text, language translation apps, the shows Netflix suggests, and how
the social media feeds are presented. It powers autonomous vehicles and machines that can
diagnose medical conditions based on images.

Machine learning starts with data — numbers, photos, or text, like bank transactions, pictures
of people or even bakery items, repair records, time series data from sensors, or sales reports.
The data is gathered and prepared to be used as training data, or the information the machine
learning model will be trained on. The more data, the better the program.

From there, programmers choose a machine learning model to use, supply the data, and let the
computer model train itself to find patterns or make predictions. Over time the human
programmer can also tweak the model, including changing its parameters, to help push it
toward more accurate results.

The project completed during the internship was titled “Youtube Adview Prediction”. In it, the
number of adviews for YouTube videos is predicted using different machine learning models,
namely Linear Regression (LR), Support Vector Machine (SVM), Decision Tree (DT), and Random
Forest (RF), together with a deep learning model, an Artificial Neural Network (ANN). Finally,
a comparative analysis is done based on the experimental results obtained from the different
models.
Contents

DECLARATION
CERTIFICATE OF APPROVAL
INTERNSHIP CERTIFICATE
ACKNOWLEDGEMENT
ABSTRACT
CONTENTS
WEEKLY OVERVIEW
INTRODUCTION
    OVERVIEW
    BACKGROUND AND MOTIVATION
    LEARNING OBJECTIVE
METHODOLOGY
RESULT/LEARNING OUTCOME
CONCLUSION
WEEKLY OVERVIEW OF INTERNSHIP ACTIVITIES

Week      Date        Day        Name of the Topic/Module Completed

1st week  05/01/2022  Wednesday  Introduction to Python programming and introduction to Jupyter Notebook.
          06/01/2022  Thursday   Conditional-control statements.
          07/01/2022  Friday     Understanding loops.
          08/01/2022  Saturday   Understanding functions and dictionary.

2nd week  10/01/2022  Monday     Introduction to statistics, confidence interval and probability distribution.
          11/01/2022  Tuesday    Basics of NumPy.
          12/01/2022  Wednesday  Indexing and slicing.
          13/01/2022  Thursday   Introduction to Pandas.
          14/01/2022  Friday     Understanding Series in pandas.
          15/01/2022  Saturday   Understanding DataFrames in pandas.

3rd week  17/01/2022  Monday     Converting DataFrame and Series into NumPy array.
          18/01/2022  Tuesday    Functions in Pandas.
          19/01/2022  Wednesday  Introduction to Matplotlib. Graphical representation of data 2: Pyplot API.
          20/01/2022  Thursday   Introduction to Machine Learning and different ML techniques.
          21/01/2022  Friday     Supervised and Unsupervised Learning.
          22/01/2022  Saturday   Reinforcement Learning, steps to solve an ML problem.

4th week  24/01/2022  Monday     Introduction to Scikit-Learn and programming practice on the IRIS data set.
          25/01/2022  Tuesday    Introduction to Regression and Multivariable Linear Regression.
          26/01/2022  Wednesday  Introduction to Logistic Regression and understanding Cross-Validation, ROC Curve and Confusion Matrix.
          27/01/2022  Thursday   Understanding decision tree, random forest, ensemble learning, bagging and boosting.
          28/01/2022  Friday     Understanding Naïve Bayes, introduction to unsupervised learning, different clustering techniques.
          29/01/2022  Saturday   Understanding dimension reduction and PCA.

5th week  31/01/2022  Monday     Introduction to Deep Learning, working details and resources required.
          01/02/2022  Tuesday    Project problem statement and description.
          2nd Feb to 9th Feb (Wednesday to Wednesday)  Project completion and submission.
Chapter 1
INTRODUCTION

1.1 Overview:

YouTube is a world-famous interactive video-sharing platform that allows its users to rate,
share, save, comment on, and upload content. Popular videos have already accumulated a large
number of likes and views by the time they are labelled popular, whereas YouTube trending videos
represent content that is gaining viewership over a certain time period and has the potential to
become popular. YouTube advertisers pay content creators based on adviews and clicks. They want
to estimate the adviews from other metrics such as comments, likes, and dislikes for the
products, goods, and services being marketed. Analyzing this information manually is tedious and
time-consuming, and the results are unlikely to be accurate. Inaccurate predictions can in turn
reduce the profit earned from the advertised products.

The aim of this project is to train various models and choose the best one to predict the number
of ad views (adviews) of a video. The data for the different attributes needs to be filtered and
cleaned before it is fed to the algorithms in order to get better results. Techniques like
Linear Regression (LR), Support Vector Machine (SVM), Decision Tree (DT), Random Forest (RF),
and Artificial Neural Network (ANN) are used, and a comparative analysis is done based on their
results. This will help influencers get an idea of the likely view count before making and
finalizing a video.

1.2 Background and Motivation:

YouTube is the largest online video sharing platform in the world. Launched in May 2005,
YouTube allows billions of people around the world to discover, watch, and share originally
created videos. YouTube allows individuals all around the world to interact, educate, and
inspire one another, and acts as a distribution platform for original content creators and
advertisers, both large and small. YouTube offers interactive video features for the public and
for content creators, such as Views, which denotes the total viewership gathered by a
particular video to date. The video view count is an important metric for determining a
video's popularity or "user engagement," as well as the parameter by which YouTube
compensates content creators. Whenever a video gains popularity, it is made available to
a large number of viewers for free and it gains mass attention for a while. It is hard to keep
track of which content might trend in the near future or become popular, hence
predictive analysis using Machine Learning is introduced.

Content creators, or YouTubers as they are called, also generate revenue from their videos.
YouTube is the sole source of income for many YouTubers, and this study will help creators
analyse their content's life cycle and make improvements in the required areas. Feedback
from viewers is a very important aspect for YouTubers because it shows them how their
content is being received, and this study helps YouTube and YouTubers understand
how the interactive features affect their videos' performance on the platform.

1.3 Objective:

To analyse and compare different Machine Learning regression algorithms trained on criteria
and metrics like comments, dislikes, likes, etc. to predict the number of adviews for a video.

The objectives are:


• Model identification – best model selection through comparative study
• Model fitting – fitting all the models on the training data
• Model evaluation – evaluating results with significance testing
Chapter 2
METHODOLOGY

Implementation:

The detailed representation of the entire methodology for Youtube Adview Prediction is shown
in the figure below.

Figure-1: Flow Diagram of Methodology

The proposed approach for YouTube adview prediction and its implementation are explained step
by step in this section. The steps involved are:
Data Description:

The dataset used is the data.csv file, which includes the metrics and a few other details of
15000 YouTube videos. The metrics consist of views, comments, likes, dislikes,
duration, published date, and category. The attributes in the dataset are vidid, adview,
views, likes, dislikes, comment, published, duration, and category.

Attribute Information:-

• 'vidid' : Unique identification ID for each video.
• 'adview' : The number of adviews for each video.
• 'views' : The number of unique views for each video.
• 'likes' : The number of likes for each video.
• 'dislikes' : The number of dislikes for each video.
• 'comment' : The number of unique comments for each video.
• 'published' : The date of uploading the video.
• 'duration' : The duration of the video (in min. and seconds).
• 'category' : Category niche of each video.

Importing the dataset and libraries:

Initially, pre-installed Python libraries such as numpy, pandas, matplotlib, and
seaborn were imported for data cleaning and visualization. Then the dataset in csv
format was loaded with pandas as a DataFrame. The number of features and samples
in the data was explored.
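
The loading step described above can be sketched as follows. This is a minimal illustration, not the project's actual notebook: the three rows in SAMPLE_CSV are invented stand-ins for data.csv, and the helper name load_dataset is hypothetical. Only the column names come from the attribute list.

```python
import io

import pandas as pd

# Hypothetical three-row stand-in for data.csv; the values are invented.
SAMPLE_CSV = """vidid,adview,views,likes,dislikes,comment,published,duration,category
VID_1,40,1031602,8523,363,1095,2016-09-14,PT7M37S,F
VID_2,2,1707,56,2,6,2016-10-01,PT9M30S,D
VID_3,1,2023,25,0,2,2016-07-02,PT2M16S,C
"""


def load_dataset(source):
    """Read the CSV into a DataFrame; `source` is a path or a file-like object."""
    return pd.read_csv(source)


# With the real file this would be load_dataset("data.csv").
df = load_dataset(io.StringIO(SAMPLE_CSV))

# Explore the number of samples (rows) and columns, and the column types.
n_samples, n_columns = df.shape
print(f"{n_samples} samples, {n_columns} columns")
print(df.dtypes)
```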

Applying data visualization techniques:

The seaborn and matplotlib libraries were used for plotting. The individual features were
plotted (as shown in fig. 2 and fig. 3) and the distribution of the data was analyzed. This was
used to spot outliers (if any) in the data so that they could be handled, which helped the
model to train better. A heatmap was also plotted (as shown in fig. 4) using the seaborn
library, which helped to visualize the correlations between the features.
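
The numbers behind such plots can be sketched without drawing anything. The example below is an illustration on synthetic data, not the report's dataset: it computes the correlation matrix that a heatmap would display and flags outliers with the common 1.5 x IQR rule. The IQR rule and the helper name iqr_outliers are assumptions; the report does not say which outlier criterion was used.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the numeric columns; all values are invented.
rng = np.random.default_rng(0)
views = rng.integers(100, 100_000, size=200)
df = pd.DataFrame({
    "views": views,
    "likes": (views * 0.02 + rng.normal(0, 50, size=200)).round(),
    "adview": rng.integers(1, 50, size=200),
})
df.loc[0, "adview"] = 5_000_000  # one planted outlier

# Pairwise Pearson correlations; seaborn.heatmap(corr) would render this matrix.
corr = df.corr()


def iqr_outliers(series, k=1.5):
    """Return a boolean mask of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)


mask = iqr_outliers(df["adview"])
print(corr.round(2))
print("outlier rows:", list(df.index[mask]))
```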

Figure-2: Individual Plot of Categories

Figure-3: Individual Plot of Adviews


Figure-4: Heatmap of the data

Data cleaning:

Cleaning the dataset is one of the vital steps when working on a machine
learning problem. Here, the dataset is cleaned by removing missing values (such as nulls)
and any other miscellaneous or unnecessary data, so that they do not interfere with the
later steps.
• Drop or remove null characters and unnecessary data.
• Rearrange the columns so that it is easy to split while training the data.
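
The two cleaning steps above can be sketched as follows, assuming pandas and a tiny invented frame. Moving the target column 'adview' to the end is one possible way to make the later feature/target split easy; the report does not state the exact column order it used.

```python
import numpy as np
import pandas as pd

# Invented rows with a few missing values.
df = pd.DataFrame({
    "vidid": ["VID_1", "VID_2", "VID_3", "VID_4"],
    "views": [1200, np.nan, 560, 89],
    "adview": [3, 1, np.nan, 7],
    "likes": [40, 5, 12, 2],
})

# Drop every row that contains at least one null value.
clean = df.dropna().reset_index(drop=True)

# Rearrange the columns so the target ('adview') comes last.
feature_cols = [c for c in clean.columns if c != "adview"]
clean = clean[feature_cols + ["adview"]]

print(clean)
```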

Making necessary transformations:

The categorical data, and data which were in other formats, were converted into numerical form.
The date, time, and label encoder functions were used for this. This process is also known as
feature engineering.
The data is then converted into float for further processing and evaluation: the duration is
converted into seconds, and the date is converted into numeric format and split into year,
month and day for further analysis.

• Convert the views, likes, dislikes, and comment data into numeric form using pandas.to_numeric()
with errors="coerce", so that any value that cannot be converted becomes NaN (null).
• Convert the published date into numeric form and split it into year, month, and day.
• Convert the duration into seconds.
• Label-encode the category for faster and easier analysis.
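
The transformations listed above can be sketched as follows. The input formats are assumptions, since the report does not show raw values: durations are taken to be ISO-8601-style strings such as "PT7M37S", dates to be "YYYY-MM-DD", and categories to be single-letter labels. The helper duration_to_seconds is hypothetical.

```python
import re

import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Invented rows; "F" in the views column mimics a non-numeric entry.
df = pd.DataFrame({
    "views": ["1031602", "1707", "F"],
    "likes": ["8523", "56", "25"],
    "published": ["2016-09-14", "2016-10-01", "2016-07-02"],
    "duration": ["PT7M37S", "PT1H2M", "PT45S"],
    "category": ["F", "D", "F"],
})

# Non-convertible entries become NaN instead of raising an error.
for col in ["views", "likes"]:
    df[col] = pd.to_numeric(df[col], errors="coerce")

# Split the published date into numeric year, month, and day columns.
published = pd.to_datetime(df["published"])
df["year"] = published.dt.year
df["month"] = published.dt.month
df["day"] = published.dt.day

# Assumed duration format: PT<hours>H<minutes>M<seconds>S, each part optional.
_DURATION = re.compile(r"PT(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?")


def duration_to_seconds(text):
    """Convert a duration string such as 'PT7M37S' to a number of seconds."""
    match = _DURATION.fullmatch(text)
    if match is None:
        raise ValueError(f"unrecognised duration: {text!r}")
    hours, minutes, seconds = (int(g) if g else 0 for g in match.groups())
    return hours * 3600 + minutes * 60 + seconds


df["duration"] = df["duration"].map(duration_to_seconds)

# Map category labels to integer codes (assigned in sorted label order).
df["category"] = LabelEncoder().fit_transform(df["category"])

print(df)
```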

Splitting and normalizing the data:

The dataset was split into training and testing data in the ratio 80:20. Normalization was
then done using MinMaxScaler (which transforms each variable into the range 0 to 1), so that
all the features carry comparable weight in the training stage.

• Separate the data into training and test data.
• Normalise the features.
• Develop a function to calculate mean absolute error, mean squared error, and root mean
squared error.
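
A minimal sketch of these steps with scikit-learn is given below. The data is synthetic, the random seed is arbitrary, and the function name regression_errors is hypothetical. Fitting the scaler on the training split only is a common precaution against leaking test information; the report does not say how the scaler was fitted.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic features and target standing in for the cleaned dataset.
rng = np.random.default_rng(42)
X = rng.uniform(0, 1000, size=(100, 4))
y = X @ np.array([0.5, 0.1, 0.0, 2.0]) + rng.normal(0, 5, size=100)

# 80:20 split into training and test data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Scale each feature to [0, 1] using the training data's min and max.
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)


def regression_errors(y_true, y_pred):
    """Return MAE, MSE, and RMSE for a set of predictions."""
    mae = mean_absolute_error(y_true, y_pred)
    mse = mean_squared_error(y_true, y_pred)
    rmse = float(np.sqrt(mse))
    return {"MAE": mae, "MSE": mse, "RMSE": rmse}


errors = regression_errors([1, 2, 3], [1, 2, 5])
print(errors)
```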

Training the model :

Several machine learning models, namely Linear Regression (LR), Support Vector Regressor
(SVR), Decision Tree Regressor (DTR), and Random Forest Regressor (RFR), are fitted to the
training data. The scikit-learn library was used to import these models and train them,
providing the labeled data and the necessary hyperparameters.

Each model is trained in turn and its errors are noted. In addition, a deep learning model,
an Artificial Neural Network (ANN), is trained to compare its performance with the
above-mentioned machine learning models, so as to get a more accurate prediction of adviews.
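
The training loop for the four scikit-learn regressors might look like the sketch below. The data is synthetic and the hyperparameters (default settings, n_estimators=50, fixed seeds) are assumptions, since the report does not list the values it used.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

# Synthetic, already-scaled features with a mildly non-linear target.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(300, 4))
y = 10 * X[:, 0] + 5 * X[:, 1] ** 2 + rng.normal(0, 0.1, size=300)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

models = {
    "Linear Regression": LinearRegression(),
    "Support Vector Regressor": SVR(),
    "Decision Tree Regressor": DecisionTreeRegressor(random_state=0),
    "Random Forest Regressor": RandomForestRegressor(n_estimators=50, random_state=0),
}

# Fit each model on the training split and record its test RMSE.
rmse = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    rmse[name] = float(np.sqrt(mean_squared_error(y_test, predictions)))

# The model with the smallest RMSE is the candidate to keep.
best = min(rmse, key=rmse.get)
for name, value in rmse.items():
    print(f"{name}: RMSE = {value:.3f}")
print("lowest RMSE:", best)
```

Which model comes out best on this toy data says nothing about the real dataset; the point is only the fit, predict, and compare pattern.
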
Training the model using ANN:

Training an ANN is an iterative process in which training data examples are presented to the
network one by one, and the values of the weights are adjusted each time. After all examples
get run through the network, one training epoch is finished and the process often starts again.

Initially, the model architecture was defined including layers, number of neurons, activation
function, and cost function. Then the model was trained for different epochs using keras, which
resulted in the improvement of the model.
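
The epoch-wise training described above can be illustrated with a small network. The report used keras; to keep the example lightweight, this sketch swaps in scikit-learn's MLPRegressor, so it is not the project's model. The layer sizes, activation, learning rate, and epoch count are all assumptions, and the data is synthetic.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.neural_network import MLPRegressor

# Synthetic, already-scaled features with a simple linear target.
rng = np.random.default_rng(1)
X = rng.uniform(0, 1, size=(400, 3))
y = 3 * X[:, 0] - 2 * X[:, 1] + X[:, 2]

# Two hidden layers with ReLU activations. For the 'adam' solver, max_iter
# is the maximum number of epochs (passes over the training data).
net = MLPRegressor(
    hidden_layer_sizes=(16, 8),
    activation="relu",
    solver="adam",
    learning_rate_init=0.01,
    max_iter=200,
    random_state=1,
)
net.fit(X, y)  # may emit a ConvergenceWarning if 200 epochs are not enough

# loss_curve_ holds the training loss after each epoch.
first_loss, last_loss = net.loss_curve_[0], net.loss_curve_[-1]
train_rmse = float(np.sqrt(mean_squared_error(y, net.predict(X))))
print(f"epochs run: {len(net.loss_curve_)}")
print(f"loss: {first_loss:.4f} -> {last_loss:.4f}, training RMSE: {train_rmse:.4f}")
```

Unlike the one-example-at-a-time description above, MLPRegressor updates the weights on mini-batches, which is the usual practical compromise.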

Analyzing the results:

The results obtained were in the form of Root Mean Squared Error (RMSE). The machine
learning model with the minimum error and the ANN model were selected for testing. Both
models were saved using keras and scikit-learn. Finally, the test data was used for the
prediction of YouTube ad views from the chosen models.
Chapter 3
RESULTS/ OUTCOME

Results:

The deep learning and machine learning models were applied to the dataset in order to examine
the performance of the algorithms. For every model used, namely Linear Regression, Decision
Tree Regressor, Random Forest Regressor, Support Vector Regressor, and finally the Artificial
Neural Network, the mean absolute error, mean squared error, root mean squared error,
variance score, and R2 score were calculated. Based on these metrics, the model with the
minimum root mean squared error, or equivalently the highest R2 score, is selected.
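
For reference, the usual definitions of these metrics are given below. The report does not spell them out, so the notation is an assumption: y_i is the true adview count of test video i, \hat{y}_i the prediction, \bar{y} the mean of the true values, and n the number of test samples.

```latex
\begin{align}
\mathrm{MAE}  &= \frac{1}{n}\sum_{i=1}^{n} \lvert y_i - \hat{y}_i \rvert \\
\mathrm{MSE}  &= \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \\
\mathrm{RMSE} &= \sqrt{\mathrm{MSE}} \\
R^2 &= 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} \\
\text{Explained variance} &= 1 - \frac{\operatorname{Var}(y - \hat{y})}{\operatorname{Var}(y)}
\end{align}
```

On a fixed test set the denominator of R^2 is the same for every model, so the model with the lowest RMSE is also the one with the highest R^2.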

Models                       Metrics for Evaluation

Support Vector Regressor
Linear Regression
Decision Tree Regressor
Random Forest Regressor
Artificial Neural Network

Table-1: Models and evaluation metrics values

Implementing all the algorithms produced the values in Table-1. Based on these values, the
Random Forest Regressor was chosen since it had the minimum Root Mean Squared Error
(RMSE) and, correspondingly, the highest R2 score.

Finally, after training a model, it is desirable to have a way to persist it for future
use without having to retrain. So, the Random Forest Regressor model is saved using the
joblib.dump() method, and the model is then tested on the test.csv data.

It predicts the number of adviews for the test data and saves them in the prediction.csv file.
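
The save, reload, and predict cycle can be sketched as follows. This is an illustration on synthetic data written to a temporary directory; the file name random_forest.pkl, the output column name, and the model settings are assumptions, and X_new stands in for the preprocessed contents of test.csv.

```python
import os
import tempfile

import joblib
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic training data and unseen rows; all values are invented.
rng = np.random.default_rng(0)
X_train = rng.uniform(0, 1, size=(100, 3))
y_train = X_train.sum(axis=1)
X_new = rng.uniform(0, 1, size=(5, 3))

model = RandomForestRegressor(n_estimators=20, random_state=0).fit(X_train, y_train)

with tempfile.TemporaryDirectory() as tmp:
    # Persist the fitted model, then load it back without retraining.
    model_path = os.path.join(tmp, "random_forest.pkl")
    joblib.dump(model, model_path)
    restored = joblib.load(model_path)

    # Predict with the restored model and write the results to a CSV file.
    predictions = restored.predict(X_new)
    out_path = os.path.join(tmp, "prediction.csv")
    pd.DataFrame({"adview": predictions}).to_csv(out_path, index=False)
    saved = pd.read_csv(out_path)

print(saved)
```

New data must go through the same cleaning, encoding, and scaling as the training data before it is passed to predict().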

Figure-5: Adview Prediction


Chapter 4
CONCLUSION

This project focuses on predicting ad views for YouTube videos using Machine Learning and
Deep Learning. Linear Regression (LR), Support Vector Regressor (SVR), Decision Tree
Regressor (DTR), Random Forest Regressor (RFR), and an Artificial Neural Network (ANN)
were trained on the data. The Random Forest algorithm obtained the minimum RMSE value of
24594.19, while the ANN obtained an RMSE value of 28794.886. Finally, the Random Forest
Regressor was saved and used to predict the ad views for the test data.
