
Sultan Qaboos University

College of Science
Department of Computer Science

Final report
Data Exploration and Visualization Project COMP3602 – Fall 2021
Udemy courses
Done by : Maryam Abdullah Al-kathiri
Id : 126922
Date: 01/06/2020
Introduction :
Data analysis and visualization have become very important in the last few years.
They do not only display data; they also aim to explore it and to summarize
information using algorithms. Here, data on the courses of the Udemy website
will be analysed and 4 research questions will be answered.

Dataset Description:
This dataset belongs to the Udemy website, one of the popular platforms that provide
online courses in many different fields. It consists of courses that relate to 4 subjects:
Business Finance, Graphic Design, Musical Instruments and Web Development. The attributes of
the dataset are: course_id (unique), course_title, url, is_paid (boolean), price,
num_subscribers, num_lectures, level, content_duration, published_timestamp (date),
subject.

Fig(1) : Reading dataset code.

Fig(2) : Udemy courses dataset.


The dataset has 3678 rows (courses) and 12 columns. It is a multivariate dataset,
saved as a csv file and downloaded from Kaggle. The table above represents the
dataset.
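The reading code referred to in Fig(1) is not reproduced in this text. A minimal sketch of that step, assuming pandas is used, might look like the following. The two inline rows are made up so that the snippet runs on its own, and the function name load_courses is hypothetical.

```python
import io

import pandas as pd

# Two made-up rows standing in for the real Kaggle file (illustrative values only).
SAMPLE_CSV = """\
course_id,course_title,url,is_paid,price,num_subscribers,num_lectures,level,content_duration,published_timestamp,subject
1,Sample course A,https://example.com/a,True,20,100,10,All Levels,1.5,2016-01-01T00:00:00Z,Web Development
2,Sample course B,https://example.com/b,False,0,500,5,Beginner Level,0.5,2015-06-01T00:00:00Z,Graphic Design
"""


def load_courses(source):
    """Read the courses table from a file path or a file-like object."""
    return pd.read_csv(source)


df = load_courses(io.StringIO(SAMPLE_CSV))

# Rows are courses, columns are attributes.
print(df.shape)
print(df.head())
```

With the real file, the same function would be called with the path of the downloaded csv file instead of the inline sample.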

Tools and Methods :


The methods used in the project are time series and correlation, which I will use
to describe and analyze the dataset. Then, to assess the ability of prediction and
its accuracy, I will use logistic regression.

Answering the Research Questions :


The research questions are written in order to explore and visualize the data. I will
answer each one and present the result, followed by the observations.

1. Which course has the maximum number of subscribers?


First the index of the maximum will be found, and then the course at that index
will be shown.

The name of the course is shown above.
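The two steps described for this question (find the index of the maximum, then show the course at that index) can be sketched with pandas as follows. The rows are made up for illustration; only the column names come from the dataset description.

```python
import pandas as pd

# Made-up rows for illustration; column names follow the dataset description.
df = pd.DataFrame({
    "course_title": ["Course A", "Course B", "Course C"],
    "num_subscribers": [120, 950, 430],
})

# Step 1: idxmax returns the index label of the row with the largest value.
top_index = df["num_subscribers"].idxmax()

# Step 2: look up the full row stored at that label.
top_course = df.loc[top_index]
print(top_course["course_title"], top_course["num_subscribers"])
```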


2. Is there a correlation between: number of lectures and content
duration, price and number of lectures, price and number of subscribers?
Correlation code (using the seaborn library) is used to plot the relationships.

There is no correlation between the price and the number of subscribers, as the
graph above shows: the correlation coefficient equals 0.05. However, we can
see that the free courses have a large number of subscribers. Moreover, some
expensive courses have a high number of subscribers, which can be explained
by the fact that Udemy frequently runs sales to attract customers.
The next correlation is between the number of subscribers and the content
duration. There is a slight correlation, but it is very weak because the correlation
coefficient equals 0.16, so we can consider the content duration, whether long or
short, as only a weak factor in attracting customers.

We can see that the last correlation is stronger than the others: the coefficient
between the number of lectures and the content duration is high, so when the
content duration increases, the number of lectures also increases.
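The coefficients quoted above are Pearson correlation coefficients. As a rough sketch of how such values can be computed, assuming pandas and using made-up numbers (so the printed coefficients will not match the 0.05 and 0.16 reported for the real data):

```python
import pandas as pd

# Made-up values for illustration; column names follow the dataset description.
df = pd.DataFrame({
    "price": [0, 20, 50, 100, 200],
    "num_subscribers": [900, 150, 400, 80, 300],
    "num_lectures": [5, 12, 30, 45, 80],
    "content_duration": [0.5, 1.0, 3.0, 4.5, 9.0],
})

# Pairwise Pearson correlation between all numeric columns.
corr = df.corr(method="pearson")

# The three pairs named in research question 2.
pairs = [
    ("num_lectures", "content_duration"),
    ("price", "num_lectures"),
    ("price", "num_subscribers"),
]
for a, b in pairs:
    print(f"{a} vs {b}: r = {corr.loc[a, b]:.2f}")
```

A coefficient close to 0 indicates almost no linear relationship, while a value close to 1 indicates a strong positive one. The matrix in corr could also be passed to seaborn.heatmap to plot it.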
3. How does the number of subscribers differ year by year in each subject?
In order to answer this question, a column for the year will be added by using this
code.
Then a bar chart is plotted for each subject, showing how the number of subscribers
for its courses differs year by year, and this is what we got:


Overall, all the subjects have decreased in number of subscribers in the last two
years, and all the subjects have a high number in 2011 except Business Finance. If
we look more closely, we will see a sharp difference in the number of
subscribers between Business Finance and Web Development: the maximum of the
first, in 2013, did not reach 4000, while the minimum of the second, in
2017, was not less than 4000. This means web courses are in high demand
because of the growth of technology.
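The code that adds the year column is not reproduced in this text. One possible sketch, assuming pandas and made-up rows, is shown here. Summing the subscribers per subject and year is an assumption; the report does not state whether totals or averages were plotted.

```python
import pandas as pd

# Made-up rows for illustration; column names follow the dataset description.
df = pd.DataFrame({
    "subject": ["Web Development", "Web Development",
                "Business Finance", "Business Finance"],
    "num_subscribers": [5000, 4200, 1500, 900],
    "published_timestamp": ["2015-03-01T10:00:00Z", "2016-07-15T08:30:00Z",
                            "2015-05-20T12:00:00Z", "2016-01-10T09:00:00Z"],
})

# Parse the timestamp strings and keep only the year as a new column.
df["year"] = pd.to_datetime(df["published_timestamp"], utc=True).dt.year

# Assumption: total subscribers per subject and year (one row per subject,
# one column per year).
yearly = (
    df.groupby(["subject", "year"])["num_subscribers"]
    .sum()
    .unstack("year")
)
print(yearly)
```

Calling yearly.T.plot(kind="bar") would then draw the years on the x-axis with one bar per subject, provided matplotlib is installed.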
4. If we want to predict the prices for next year (2018) based on the content duration and
the number of lectures, what is the accuracy of our prediction as a percentage?
We will use logistic regression and split the dataset into two parts: 80% of
the dataset will be the training set and the remaining 20% will be the test set. Then
we will fit the model on the x and y of the training set and find its accuracy on the test set.

The accuracy is approximately 25%, which is very weak, so we cannot predict the
price based on these attributes only, because the correlation between the two
attributes and the price is not strong. We need to find a better attribute, but
almost no attribute is strongly related to the price; it just changes within a
fixed range based on other factors.
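The procedure described for this question (80/20 split, fit on the training set, measure accuracy on the test set) can be sketched with scikit-learn as below. This is only an illustration under stated assumptions: the data is synthetic, the price labels are drawn independently of the predictors, and each distinct price is treated as a class label because logistic regression is a classifier. The accuracy it prints is therefore not the 25% reported above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 200

# Synthetic stand-ins for the two predictors (not the real Udemy data).
num_lectures = rng.integers(5, 100, size=n)
content_duration = num_lectures * 0.1 + rng.normal(0, 0.5, size=n)
X = np.column_stack([content_duration, num_lectures])

# Synthetic price labels, drawn independently of X; each price is one class.
y = rng.choice([0, 20, 50, 100, 200], size=n)

# 80% of the rows for training, the remaining 20% for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = LogisticRegression(max_iter=5000)
model.fit(X_train, y_train)

# Fraction of test rows whose price class is predicted correctly.
accuracy = model.score(X_test, y_test)
print(f"test accuracy: {accuracy:.2f}")
```

Because the labels here carry no information about the predictors, the accuracy should stay near chance level, which mirrors the situation described above where the predictors are only weakly related to the price.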

Conclusion :
Data analysis and visualization is an important field nowadays, and a
wide one at the same time. First, I explored the dataset through the
relationships between the attributes and how the data points change.
I found that the relationships between the attributes are weak, although I
wrote about just 3 here. The strong one is between the number of reviews
and subscribers, which is predictable and normal; the other
strong relationship is between the number of lectures and the content
duration. Then I wanted to see what happens year by year to the number of
subscribers in all the subjects, and the bar chart gave us an idea about
that. Next I wanted to predict the prices, but first I checked the
accuracy by testing on part of my dataset, and I discovered that I
cannot predict them because of the weak accuracy. Maybe we can increase
the accuracy by adding more attributes, specifically
attributes like the subject and the level.
References:
1. https://realpython.com/numpy-scipy-pandas-correlation-python/
2. https://www.investopedia.com/ask/answers/032515/what-does-it-mean-if-correlation-coefficient-positive-negative-or-zero.asp
3. Abdulhamit Subasi, Practical Machine Learning for Data Analysis Using Python, 2020.
