
INTRODUCTION TO MACHINE LEARNING

Undoubtedly, Machine Learning is the most in-demand technology in today's market. Its applications range from self-driving cars to predicting deadly diseases such as ALS. The high demand for Machine Learning skills is the motivation behind this blog. In this blog on Introduction To Machine Learning, you will understand all the basic concepts of Machine Learning and see practical implementations along the way.

Algorithm: A Machine Learning algorithm is a set of rules and statistical techniques used
to learn patterns from data and draw significant information from it. It is the logic
behind a Machine Learning model. An example of a Machine Learning algorithm is the
Linear Regression algorithm.

Model: A model is the main component of Machine Learning. A model is trained by using a Machine Learning algorithm. An algorithm maps all the decisions that a model is supposed to take based on the given input, in order to get the correct output.

Predictor Variable: It is a feature (or set of features) of the data that can be used to predict the output.

Response Variable: It is the feature or the output variable that needs to be predicted by
using the predictor variable(s).

Training Data: The Machine Learning model is built using the training data. The training
data helps the model to identify key trends and patterns essential to predict the output.

Testing Data: After the model is trained, it must be tested to evaluate how accurately it
can predict an outcome. This is done by the testing data set.
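
To make these terms concrete, here is a minimal sketch (not part of the original text) using scikit-learn's train_test_split on a small made-up table; the column names hours_studied and passed are purely hypothetical.

import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical data: hours_studied is the predictor variable, passed is the response variable.
data = pd.DataFrame({
    "hours_studied": [1, 2, 3, 4, 5, 6, 7, 8],
    "passed":        [0, 0, 0, 1, 0, 1, 1, 1],
})

X = data[["hours_studied"]]   # predictor(s)
y = data["passed"]            # response

# Training data is used to fit the model; testing data is held back to evaluate it.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
print(len(X_train), "training rows,", len(X_test), "testing rows")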

Need For Machine Learning


Ever since the technological revolution, we've been generating an immeasurable amount of data. As per research, we generate around 2.5 quintillion bytes of data every single day! It was estimated that by 2020, 1.7 MB of data would be created every second for every person on earth.

With the availability of so much data, it is finally possible to build predictive models that
can study and analyze complex data to find useful insights and deliver more accurate
results.

Top Tier companies such as Netflix and Amazon build such Machine Learning models by
using tons of data in order to identify profitable opportunities and avoid unwanted
risks.

Here’s a list of reasons why Machine Learning is so important:

 Increase in Data Generation: Due to excessive production of data, we need a method that can be used to structure, analyze and draw useful insights from data. This is where Machine Learning comes in. It uses data to solve problems and find solutions to the most complex tasks faced by organizations.

 Improve Decision Making: By making use of various algorithms, Machine Learning can be used to make better business decisions. For example, Machine Learning is used to forecast sales, predict downfalls in the stock market, identify risks and anomalies, etc.
Introduction to Scikit-learn
A machine learning (ML) library for the Python programming language, Scikit-learn has
a large number of algorithms that can be readily deployed by programmers and data
scientists in machine learning models.
What Is Scikit-learn?
Scikit-learn is a popular and robust machine learning library that has a vast assortment
of algorithms, as well as tools for ML visualizations, preprocessing, model fitting,
selection, and evaluation.
Building on NumPy, SciPy, and matplotlib, Scikit-learn features a number of efficient
algorithms for classification, regression, and clustering. These include support vector
machines, random forests, gradient boosting, k-means, and DBSCAN.
Scikit-learn boasts relative ease-of-development owing to its consistent and efficiently
designed APIs, extensive documentation for most algorithms, and numerous online
tutorials.
Current releases are available for popular platforms including Linux, MacOS, and
Windows.
Why Scikit-learn?
The Scikit-learn API has become the de facto standard for machine learning
implementations thanks to its relative ease of use, thoughtful design, and enthusiastic
community.
Scikit-learn provides the following modules for ML model building, fitting, and evaluation:

 Preprocessing refers to Scikit-learn tools useful in feature extraction and normalization during data analysis.
 Classification refers to a set of tools that identify the category associated with data in a machine learning model. These tools can be used to categorize email messages as either valid or as spam, for example. Essentially, classification identifies to which category an object belongs.
 Regression refers to the creation of an ML model that tries to understand the relationship between input and output data, such as the behavior of stock prices. Regression predicts a continuous-valued attribute associated with an object.
 Clustering tools in Scikit-learn automatically group data with similar characteristics into sets, such as customer data arranged in sets based on physical location.
 Dimensionality reduction reduces the number of random variables for analysis. For example, to increase the efficiency of visualizations, outlying data may be left out.
 Model selection refers to tools that compare, validate, and select the optimal parameters and models for data science machine learning projects (a short sketch follows this list).
 Pipeline refers to utilities for building a model workflow.
 Visualizations for machine learning allow for quick plotting and visual adjustments.
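
As a minimal sketch of the model selection tools mentioned above (my own illustrative example, using the Iris dataset that ships with Scikit-learn):

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Compare candidate parameter settings with 5-fold cross-validation
# and keep the combination that scores best.
search = GridSearchCV(
    SVC(),
    param_grid={"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]},
    cv=5,
)
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))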

How Does Scikit-learn Work?


Scikit-learn is written primarily in Python and uses NumPy for high-performance linear
algebra, as well as for array operations. Some core Scikit-learn algorithms are written
in Cython to boost overall performance.
As a higher-level library that includes several implementations of various machine
learning algorithms, Scikit-learn lets users build, train, and evaluate a model in a few
lines of code.
Scikit-learn provides a uniform set of high-level APIs for building ML pipelines or
workflows.

You use a Scikit-learn ML Pipeline to pass the data through transformers to extract the features and an estimator to produce the model, and then evaluate predictions to measure the accuracy of the model.
 Transformer: This is an algorithm that transforms or imputes the data for pre-processing.
 Estimator: This is a machine learning algorithm that trains or fits the data to build a model, which can be used for predictions.
 Pipeline: A pipeline chains Transformers and Estimators together to specify an ML workflow (a minimal sketch follows).
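
Here is a minimal sketch of those three pieces working together, assuming the built-in Iris dataset; StandardScaler plays the role of the transformer and LogisticRegression the estimator.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Transformer (scaling) chained with an estimator (classifier) into one workflow.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])

pipe.fit(X_train, y_train)            # fit the whole pipeline on the training data
print(pipe.score(X_test, y_test))     # accuracy of the predictions on the test set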

Overfitting and Underfitting

Overfitting and Underfitting Principles

A lot of articles have been written about overfitting, but almost all of
them are simply a list of tools. “How to handle overfitting — top 10 tools” or
“best techniques to prevent overfitting”. It’s like being shown nails without
explaining how to hammer them. It can be very confusing for people who
are trying to figure out how overfitting works. Also, these articles often do
not consider underfitting, as if it does not exist at all.

In this article, I would like to list the basic principles (exactly principles)
for improving the quality of your model and, accordingly, preventing
underfitting and overfitting on a particular example. This is a very general
issue that can apply to all algorithms and models, so it is very difficult to
fully describe it. But I want to try to give you an understanding of why
underfitting and overfitting occur and why one or another particular
technique should be used.

Underfitting and Overfitting and the Bias/Variance Trade-off

Although I’m not describing all the concepts you need to know here (for
example, quality metrics or cross-validation), I think it’s important to
explain to you (or just remind you) what underfitting/overfitting is.

To figure this out, let’s create some dataset, split it into train and test sets,
and then train three models on it — simple, good and complex (I will not use
a validation set in this example to simplify it, but I will tell about it later). All
code is available in this GitLab repo.

Generated dataset. Image by Author
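
The author's full code lives in the linked repo; below is an independent, minimal sketch of the same experiment, assuming synthetic noisy cubic data and polynomial models of degree 1 (simple), 3 (good), and 13 (complex).

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.RandomState(0)
X = np.sort(rng.uniform(-3, 3, 60)).reshape(-1, 1)
y = X.ravel() ** 3 - 2 * X.ravel() + rng.normal(0, 2, 60)   # noisy cubic data

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for degree in (1, 3, 13):   # simple, good, complex
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    train_err = mean_squared_error(y_train, model.predict(X_train))
    test_err = mean_squared_error(y_test, model.predict(X_test))
    print(f"degree {degree}: train MSE {train_err:.2f}, test MSE {test_err:.2f}")

Typically the degree-1 model shows large errors on both sets (underfitting), the degree-13 model shows a tiny train error but a much larger test error (overfitting), and the degree-3 model sits in between.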

Underfitting is a situation when your model is too simple for your data.
More formally, your hypothesis about data distribution is wrong and too
simple — for example, your data is quadratic and your model is linear. This
situation is also called high bias. This means that your algorithm makes consistent predictions, but the initial assumption about the data is incorrect.
Underfitting. The linear model trained on cubic data. Image by Author

Conversely, overfitting is a situation when your model is too complex for your data. More formally, your hypothesis about the data distribution is wrong and too complex — for example, your data is linear and your model is a high-degree polynomial. This situation is also called high variance. This means that your algorithm cannot make stable predictions: changing the input data only a little changes the model output a lot.

Overfitting. The 13-degree polynomial model trained on cubic data. Image by Author

These are two extremes of the same problem and the optimal solution
always lies somewhere in the middle.
Good model. The cubic model trained on cubic data.

I will not talk much about bias/variance trade-off (you can read the basics in this article), but let me briefly mention the possible options:

 low bias, low variance — a good result, just right.
 low bias, high variance — overfitting — the algorithm outputs very different predictions for similar data.
 high bias, low variance — underfitting — the algorithm outputs similar predictions for similar data, but the predictions are wrong (the algorithm consistently "misses").
 high bias, high variance — a very bad algorithm. You will most likely never see this.
Bias and Variance options on four plots. Image by Author

All these cases can be placed on the same plot. It is a bit less clear than the
previous one but more compact.

Bias and Variance options on one plot. Image by Author


Underfitting means that your model makes consistent but systematically wrong predictions (the initial assumption about the data is incorrect). In this case, the train error is large and the val/test error is large too.

Overfitting means that your model does not make accurate predictions on new data. In this case, the train error is very small and the val/test error is large.

When you find a good model, train error is small (but larger than in
the case of overfitting), and val/test error is small too.

In the case above, the test error and validation error are approximately the
same. This happens when everything is fine, and your train, validation, and
test data have the same distributions. If validation and test error are very
different, then you need to get more data similar to test data and make sure
that you split the data correctly.

So, the conclusion is that getting more data helps only with overfitting (not underfitting), and only if your model is not TOO complex.
Linear regression
Linear regression analysis is used to predict the value of a variable based on
the value of another variable. The variable you want to predict is called the
dependent variable. The variable you are using to predict the other variable's
value is called the independent variable.

The term regression is used when you try to find the relationship between
variables.

In Machine Learning and in statistical modeling, that relationship is used to predict the outcome of events. In this module, we will cover the following questions:

 Can we conclude that Average_Pulse and Duration are related to Calorie_Burnage?
 Can we use Average_Pulse and Duration to predict Calorie_Burnage?

Least Square Method


Linear regression uses the least square method.

The concept is to draw a line through all the plotted data points. The line is
positioned in a way that it minimizes the distance to all of the data points.

These distances are called "residuals" or "errors".

The red dashed lines represent the distance from the data points to the drawn mathematical function.

Linear Regression Using One Explanatory Variable


In this example, we will try to predict Calorie_Burnage with Average_Pulse
using Linear Regression:
Example
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats

full_health_data = pd.read_csv("data.csv", header=0, sep=",")

x = full_health_data["Average_Pulse"]
y = full_health_data["Calorie_Burnage"]

slope, intercept, r, p, std_err = stats.linregress(x, y)

def myfunc(x):
  return slope * x + intercept

mymodel = list(map(myfunc, x))

plt.scatter(x, y)
plt.plot(x, mymodel)
plt.ylim(ymin=0, ymax=2000)
plt.xlim(xmin=0, xmax=200)
plt.xlabel("Average_Pulse")
plt.ylabel("Calorie_Burnage")
plt.show()

Example Explained:
 Import the modules you need: Pandas, Matplotlib and SciPy
 Isolate Average_Pulse as x. Isolate Calorie_burnage as y
 Get important key values with: slope, intercept, r, p, std_err =
stats.linregress(x, y)
 Create a function that uses the slope and intercept values to return a
new value. This new value represents where on the y-axis the
corresponding x value will be placed
 Run each value of the x array through the function. This will result in a
new array with new values for the y-axis: mymodel =
list(map(myfunc, x))
 Draw the original scatter plot: plt.scatter(x, y)
 Draw the line of linear regression: plt.plot(x, mymodel)
 Define maximum and minimum values of the axis
 Label the axis: "Average_Pulse" and "Calorie_Burnage"
Output: a scatter plot of Average_Pulse against Calorie_Burnage with the fitted regression line drawn through it.
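
As a small addition to the example (not in the original), you can also read the fitted relationship numerically by printing the values returned by linregress and by predicting for a hypothetical pulse value:

# Continuing the example above: inspect the fitted coefficients.
print("slope:", slope)
print("intercept:", intercept)
print("r (correlation coefficient):", r)

# Hypothetical input: predicted Calorie_Burnage at an Average_Pulse of 120.
print("Prediction at pulse 120:", myfunc(120))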

Logistic regression

Logistic regression is a statistical analysis method to predict a binary outcome, such as yes or no, based on prior observations of a data set.

A logistic regression model predicts a dependent data variable by analyzing the relationship between one or more existing independent variables. For example, a logistic regression could be used to predict whether a political candidate will win or lose an election, or whether a high school student will be admitted to a particular college.
These binary outcomes allow straightforward decisions between two alternatives.

A logistic regression model can take into consideration multiple input criteria. In the case
of college acceptance, the logistic function could consider factors such as the student's
grade point average, SAT score and number of extracurricular activities. Based on
historical data about earlier outcomes involving the same input criteria, it then scores
new cases on their probability of falling into one of two outcome categories.

Logistic regression has become an important tool in the discipline of machine learning. It
allows algorithms used in machine learning applications to classify incoming data based
on historical data. As additional relevant data comes in, the algorithms get better at
predicting classifications within data sets.
Logistic regression can also play a role in data preparation activities by allowing data
sets to be put into specifically predefined buckets during the extract, transform, load
(ETL) process in order to stage the information for analysis.

What is the purpose of logistic regression?

Logistic regression streamlines the mathematics for measuring the impact of multiple variables (e.g., age, gender, ad placement) on a given outcome (e.g., click-through or ignore). The resulting models can help tease apart the relative effectiveness of various interventions for different categories of people, such as young/old or male/female.

Logistic models can also transform raw data streams to create features for other types of
AI and machine learning techniques. In fact, logistic regression is one of the commonly
used algorithms in machine learning for binary classification problems, which are
problems with two class values, including predictions such as "this or that," "yes or no,"
and "A or B."

Logistic regression can also estimate the probabilities of events, including determining a relationship between features and the probabilities of outcomes. For example, it can be used for classification by creating a model that correlates the hours studied with the likelihood that a student passes or fails. The same model could then be used to predict whether a particular student will pass or fail when the number of hours studied is provided as a feature and the response variable has two values: pass and fail.
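
A minimal sketch of that pass/fail example, using scikit-learn with made-up hours-studied data (the numbers are purely illustrative):

import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: hours studied and whether the student passed (1) or failed (0).
hours = np.array([[0.5], [1.0], [1.5], [2.0], [2.5], [3.0], [3.5], [4.0], [4.5], [5.0]])
passed = np.array([0, 0, 0, 0, 1, 0, 1, 1, 1, 1])

model = LogisticRegression()
model.fit(hours, passed)

# Estimated probability of failing/passing after 2.75 hours of study, then the class prediction.
print(model.predict_proba([[2.75]]))   # [[P(fail), P(pass)]]
print(model.predict([[2.75]]))         # 0 or 1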

Support Vector Machine


Support Vector Machine or SVM is one of the most popular Supervised Learning algorithms, and it is used for Classification as well as Regression problems. However, it is primarily used for Classification problems in Machine Learning.

The goal of the SVM algorithm is to create the best line or decision boundary
that can segregate n-dimensional space into classes so that we can easily put
the new data point in the correct category in the future. This best decision
boundary is called a hyperplane.

SVM chooses the extreme points/vectors that help in creating the hyperplane. These extreme cases are called support vectors, and hence the algorithm is termed Support Vector Machine. Consider the below diagram, in which two different categories are classified using a decision boundary or hyperplane:
Example: SVM can be understood with the example that we used in the KNN classifier. Suppose we see a strange cat that also has some features of dogs. If we want a model that can accurately identify whether it is a cat or a dog, such a model can be created by using the SVM algorithm. We first train our model with lots of images of cats and dogs so that it can learn their different features, and then we test it with this strange creature. The SVM creates a decision boundary between the two classes (cat and dog) and chooses the extreme cases (support vectors) of each class. On the basis of the support vectors, it will classify the creature as a cat. Consider the below diagram:

The SVM algorithm can be used for face detection, image classification, text categorization, etc.

Types of SVM
SVM can be of two types:

 Linear SVM: Linear SVM is used for linearly separable data, which means that if a dataset can be classified into two classes by using a single straight line, then such data is termed linearly separable data, and the classifier used is called a Linear SVM classifier.
 Non-linear SVM: Non-linear SVM is used for non-linearly separable data, which means that if a dataset cannot be classified by using a straight line, then such data is termed non-linear data, and the classifier used is called a Non-linear SVM classifier. A short sketch contrasting the two classifiers follows.
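
Here is a small sketch contrasting the two, using scikit-learn's SVC on synthetic, non-linearly separable data (the two-moons dataset); the exact scores will vary, but the RBF kernel usually separates the curved classes better than the linear one.

from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic data that a single straight line cannot separate well.
X, y = make_moons(n_samples=300, noise=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

linear_svm = SVC(kernel="linear").fit(X_train, y_train)
nonlinear_svm = SVC(kernel="rbf").fit(X_train, y_train)   # kernel trick handles the curved boundary

print("Linear SVM accuracy:    ", linear_svm.score(X_test, y_test))
print("Non-linear SVM accuracy:", nonlinear_svm.score(X_test, y_test))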

Face Recognition using OpenCV


Introduction
In this article, we are going to see what face recognition is and how it differs from face detection. We will briefly cover the theory of face recognition and then jump to the coding section. At the end of this article, you will be able to develop a face recognition program for recognizing faces in images.

Agenda of this Article


1. Overview of Face Detection

2. Overview of Face Recognition

3. Understand what is OpenCV

4. Implementation using Python

Overview of Face Detection


What if the machine is able to detect objects automatically in an image without
human involvement? Let us see: Face detection can be such a problem where we
detect human faces in an image. There might be slight differences in human faces,
but after all, it is safe to say that there are specific features that are associated with
all human faces. Various face detection algorithms exist, but the Viola-Jones algorithm is one of the oldest methods still used today.

Face detection is generally the first step towards many face-related applications like face recognition or face verification, but it also has very useful applications of its own. One of the most successful applications of face detection is probably "photo-taking".

Example: When you take a photo of your friends, the camera's built-in face detection algorithm detects where the faces are and adjusts the focus accordingly.
Overview of Face Recognition
Now we have seen our algorithms can detect faces but can we also recognize
whose faces are there? And what if an algorithm is able to recognize faces?

Generally, Face Recognition is a method of identifying or verifying the identity of an individual by using their face. Various algorithms exist for face recognition, but their accuracy might vary. Here I am going to discuss how we can do face recognition using deep learning.

Now let us understand how we can recognize faces using deep learning. Here we
use face embeddings in which every face is converted into a vector. The technique
of converting the face into a vector is called deep metric learning. Let me divide
this process into three simple steps for better and easy understanding:
1. Face Detection:

The first task that we perform is detecting faces in the image (photograph) or video stream. Once we know the exact coordinates/location of the face, we extract this face for further processing.
2. Feature Extraction:

Now that we have cropped the face out of the image, we extract specific features from it. Here we are going to see how to use face embeddings to extract these features of the face. A neural network takes an image of the person's face as input and outputs a vector that represents the most important features of the face. In machine learning, this vector is called an embedding, and hence we call this vector a face embedding. Now, how does this help in recognizing the faces of different people?

When we train the neural network, the network learns to output similar vectors for faces that look similar. For example, if I have multiple images of the same face taken at different times, it is obvious that some features may change, but not too much. So in this problem, the vectors associated with those faces are similar, or we can say they are very close in the vector space.

Up to this point, we know how this network works; let us see how to use it on our own data. Here we pass all the images in our data to this pre-trained network to get the respective embeddings and save these embeddings in a file for the next step.
3. Comparing Faces:

We have face embeddings for each face in our data saved in a file; the next step is to recognize a new image that is not in our data. The first step is to compute the face embedding for the new image using the same network we used earlier, and then compare this embedding with the rest of the embeddings we have. We recognize the face if the generated embedding is close or similar to any other embedding.
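
The comparison itself boils down to a distance between vectors. A toy sketch (my own, with short made-up vectors; real face embeddings are typically 128-dimensional):

import numpy as np

# Toy "embeddings"; in practice these come from the embedding network described above.
known_face = np.array([0.10, 0.80, 0.30, 0.50])
new_face   = np.array([0.12, 0.79, 0.31, 0.48])

distance = np.linalg.norm(known_face - new_face)   # Euclidean distance in embedding space
threshold = 0.6                                    # illustrative cut-off; tune it for your data

print("distance:", round(float(distance), 3), "-> same person?", distance < threshold)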
Understand What is OpenCV
In the Artificial Intelligence field, Computer Vision is one of the most interesting
and challenging tasks. Computer Vision acts as a bridge between Computer
Software and visualizations. Computer Vision allows computer software to
understand and learn about the visualizations in the surroundings.

Let us consider an example: identifying a fruit based on its shape, color, and size. This task is very easy for the human brain, but in a Computer Vision pipeline, first we need to gather the data, then we perform the data processing operations, and then we train and teach the model how to distinguish between fruits based on their size, shape, and color.

Nowadays, various packages are available for machine learning, deep learning, and computer vision problems, and OpenCV is one of the best-known libraries for computer vision tasks. OpenCV is an open-source library. It is supported by different programming languages such as R and Python, and it runs on most platforms, including Windows, Linux, and macOS.
Advantages of OpenCV:
1. OpenCV is free of cost and open source.

2. OpenCV is fast, as its core is written in C/C++.

3. OpenCV works well even with limited system RAM.

4. It supports most operating systems, like Windows, Linux, and macOS.

Implementation
In this section, we are going to implement face recognition using OpenCV and
Python.

First, let us see what libraries we will need and how to install them:

1. OpenCV

2. dlib

3. Face_recognition

OpenCV is a video and image processing library and it is used for image and
video analysis, like facial detection, license plate reading, photo editing, advanced
robotic vision, and many more.

The dlib library contains an implementation of "deep metric learning", which is used to construct the face embeddings used for the actual recognition process.

The face_recognition library is super easy to work with, and we will be using it in our code. Remember to install the dlib library before you install face_recognition.
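
With those libraries installed, a minimal recognition sketch looks roughly like the following; the image file names are hypothetical placeholders for your own photos.

import face_recognition

# A reference photo of a known person and a new photo to check.
known_image = face_recognition.load_image_file("known_person.jpg")     # hypothetical file
unknown_image = face_recognition.load_image_file("new_photo.jpg")      # hypothetical file

known_encoding = face_recognition.face_encodings(known_image)[0]       # 128-d face embedding
unknown_encodings = face_recognition.face_encodings(unknown_image)     # one embedding per detected face

for encoding in unknown_encodings:
    match = face_recognition.compare_faces([known_encoding], encoding)[0]
    print("Match found!" if match else "Unknown face")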

Time Series Analysis


Time series analysis is a specific way of analyzing a sequence of data points collected over an interval
of time. In time series analysis, analysts record data points at consistent intervals over a set period of
time rather than just recording the data points intermittently or randomly. However, this type of
analysis is not merely the act of collecting data over time.
What sets time series data apart from other data is that the analysis can show how variables change
over time. In other words, time is a crucial variable because it shows how the data adjusts over the
course of the data points as well as the final results. It provides an additional source of information
and a set order of dependencies between the data.
Time series analysis typically requires a large number of data points to ensure consistency and
reliability. An extensive data set ensures you have a representative sample size and that analysis can
cut through noisy data. It also ensures that any trends or patterns discovered are not outliers and can
account for seasonal variance. Additionally, time series data can be used for forecasting—predicting
future data based on historical data.

Time series analysis examples

Time series analysis is used for non-stationary data—things that are constantly fluctuating
over time or are affected by time. Industries like finance, retail, and economics frequently use
time series analysis because currency and sales are always changing. Stock market analysis is
an excellent example of time series analysis in action, especially with automated trading
algorithms. Likewise, time series analysis is ideal for forecasting weather changes, helping
meteorologists predict everything from tomorrow’s weather report to future years of climate
change. Examples of time series analysis in action include:

 Weather data

 Rainfall measurements

 Temperature readings

 Heart rate monitoring (EKG)

 Brain monitoring (EEG)

 Quarterly sales

 Stock prices

 Automated stock trading

 Industry forecasts

 Interest rates

Models of time series analysis include:

 Classification: Identifies and assigns categories to the data.

 Curve fitting: Plots the data along a curve to study the relationships of variables within the
data.
 Descriptive analysis: Identifies patterns in time series data, like trends, cycles, or seasonal
variation.

 Explanative analysis: Attempts to understand the data and the relationships within it, as well
as cause and effect.

 Exploratory analysis: Highlights the main characteristics of the time series data, usually in a
visual format.

 Forecasting: Predicts future data. This type is based on historical trends. It uses the historical
data as a model for future data, predicting scenarios that could happen along future plot
points.

 Intervention analysis: Studies how an event can change the data.

 Segmentation: Splits the data into segments to show the underlying properties of the source
information
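
As a small illustration of the descriptive and segmentation ideas above, here is a pandas sketch on a made-up daily series (the data is synthetic, purely for demonstration):

import numpy as np
import pandas as pd

# Hypothetical daily observations over two years: trend + yearly seasonality + noise.
dates = pd.date_range("2021-01-01", periods=730, freq="D")
values = (0.002 * np.arange(730)
          + np.sin(np.arange(730) * 2 * np.pi / 365)
          + np.random.normal(0, 0.2, 730))
series = pd.Series(values, index=dates)

# Descriptive analysis: a 30-day rolling mean smooths the noise and exposes trend/seasonality.
trend = series.rolling(window=30).mean()

# Segmentation: average the daily values per quarter.
quarterly = series.groupby(series.index.to_period("Q")).mean()
print(quarterly.head())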

Forecasting

Forecasting is predicting or estimating a future event or trend. For businesses and analysts, forecasting means determining what is going to happen in the future by analyzing what happened in the past and what is going on now.

Let’s think about how forecasting is applicable and where it’s required. Forecasting can be key when deciding whether to build a dam or a power generation plant in the next few years, based on forecasts of future demand. I’ve used forecasting at one of my former workplaces to help schedule staff in a call center for the coming week based on call volumes at certain days and times. Another area where I’ve applied forecasting is telling a business when to stock up and what they should stock; this was purely based on demand for the product.

I bet you’ve used forecasting before you read this, maybe not in the same cases as mine, but I’m 100% sure you have. Have you ever looked at the weather and realized you’re overdressed or underdressed? That is forecasting!
Forecasting using FB Prophet
Prophet is a procedure for forecasting time series data based on an additive
model where non-linear trends are fit with yearly, weekly, and daily
seasonality, plus holiday effects. It works best with time series that have
strong seasonal effects and several seasons of historical data. Prophet is
robust to missing data and shifts in the trend, and typically handles outliers
well.
Prophet is open source software released by Facebook’s Core Data Science
team. It is available for download on CRAN and PyPI.
Accurate and fast.
Prophet is used in many applications across Facebook for producing
reliable forecasts for planning and goal setting. We’ve found it to perform
better than any other approach in the majority of cases. We fit models
in Stan so that you get forecasts in just a few seconds.
Fully automatic.
Get a reasonable forecast on messy data with no manual effort. Prophet is
robust to outliers, missing data, and dramatic changes in your time series.
Tunable forecasts.
The Prophet procedure includes many possibilities for users to tweak and
adjust forecasts. You can use human-interpretable parameters to improve
your forecast by adding your domain knowledge.
Available in R or Python.
We’ve implemented the Prophet procedure in R and Python, but they share
the same underlying Stan code for fitting. Use whatever language you’re
comfortable with to get forecasts.
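
A minimal Prophet sketch in Python (the data here is hypothetical; in newer releases the package is imported as prophet, while older releases used fbprophet):

import pandas as pd
from prophet import Prophet   # older versions: from fbprophet import Prophet

# Prophet expects a dataframe with two columns: ds (dates) and y (observed values).
# Hypothetical daily history; in practice you would load your own data, e.g. from a CSV.
df = pd.DataFrame({
    "ds": pd.date_range("2022-01-01", periods=365, freq="D"),
    "y": [i + (i % 7) * 3 for i in range(365)],   # made-up trend with a weekly pattern
})

m = Prophet()                 # weekly/yearly seasonality handled automatically
m.fit(df)

future = m.make_future_dataframe(periods=90)   # extend 90 days beyond the history
forecast = m.predict(future)

print(forecast[["ds", "yhat", "yhat_lower", "yhat_upper"]].tail())
fig = m.plot(forecast)        # history plus forecast with uncertainty intervals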

Tableau

Tableau provides several options to augment and create new data fields.
You can perform arithmetic, logical, spatial, and predictive modeling
functions using calculated fields. Tableau is a powerful Business
Intelligence (BI) tool, but there are limitations; that's where Python
language comes to the rescue.

Python is a popular programming language among the data community. You can use it to extract, clean, process, and apply complex statistical functions to the data. It provides machine learning frameworks, data orchestration, multiprocessing, and rich libraries to perform almost any task.
Python is a multipurpose language, and using it with Tableau gives us the
freedom to perform highly complex tasks. In this tutorial, we are going to
use Python for extracting and cleaning the data. Then, we will be using
clean data to create data visualization on Tableau.

We will not be using Tabpy to create a Tableau Python server and execute
Python scripts within Tableau. Instead, we will first extract and clean the
data in Python (Jupyter Notebook) and then use Tableau to create
interactive visualization.

Goodreads Data Viz | Tableau Public

This is a code-based, step-by-step tutorial on the Goodreads API and creating complex visualizations in Tableau. Check out the links below to access the code and the Tableau dashboard.

 DataCamp Workspace
 Tableau Public
Data Ingestion and Processing with Python

In the first part of the tutorial, we will learn to use Goodreads API to access
public data. In our case, we will be focusing on the user profile and
converting it into a readable Pandas dataframe. Furthermore, we will clean
the data and export it into CSV file format.

Getting Started
We will be using DataCamp’s Workspace for running the Python code. It
comes with the necessary Python packages for data science tasks.

If you are new to Python and want to set up the environment on your local
machine, install Anaconda. It will install Python, Jupyter Notebook, and
necessary Python Packages.

Before we start writing the code, we have to install the xmltodict package
as it is not part of the Workspace or Anaconda data stack. We will use `pip`
to install the missing Python package.

Note: The `!` symbol only works in Jupyter Notebooks. It lets us access the
terminal within the Jupyter code cell.

!pip install xmltodict

>>> Collecting xmltodict

>>> Using cached xmltodict-0.13.0-py2.py3-none-any.whl (10.0 kB)

>>> Installing collected packages: xmltodict

>>> Successfully installed xmltodict-0.13.0

In the next step, we will import the necessary packages.


import pandas as pd

import xmltodict

import urllib.request

Parsing the Profile Link


To extract user data, we need both user id and user name. In this section,
we will parse the user (Abid) profile link.

Note: You can use your friend’s profile or your own profile link, and run this script.

 It extracts user_id by filtering the digits within the link and returns “73376016”.
 To extract user_name, we split the string on user_id and then split it on “-” to get the user. After replacing “-” with a space, we get the user name “abid”.
 Finally, we concatenate user_id with user_name. This unique id will be used in the next section to access user data.
Goodread_profile = "https://www.goodreads.com/user/show/73376016-abid"

user_id = ''.join(filter(lambda i: i.isdigit(), Goodread_profile))

user_name = Goodread_profile.split(user_id, 1)[1].split('-', 1)[1].replace('-', ' ')

user_id_name = user_id+'-'+user_name

print(user_id_name)

>>> 73376016-abid
Goodreads Data Extraction
At the end of 2020, Goodreads stopped providing its developer API. You can read the full report here. To overcome this issue, we will be using API keys from old projects such as streamlit_goodreads_app. The project explains how to access Goodreads user data using the API.

Goodreads also provides you the option to download the data in CSV file
format without an API key, but it is limited to a user, and it doesn't give us
the freedom to extract real-time data.

In this section, we will create a function that takes user_id_name, version, shelf, per_page, and apiKey.

 apiKey: gives access to the public data.
 version: specifies the latest data version.
 shelf: There are multiple shelves in the user profile, mainly read, to-read, and currently-reading.
 per_page: the number of book entries per page.
The function takes user inputs to prepare the URL and then downloads the
data using urllib.request. Finally, we get the data in XML format.

apiKey = "ZRnySx6awjQuExO9tKEJXw"

version = "2"

shelf = "read"

per_page = "200"

def get_user_data(user_id, apiKey, version, shelf, per_page):
    api_url_base = "https://www.goodreads.com/review/list/"
    final_url = (
        api_url_base
        + user_id
        + ".xml?key="
        + apiKey
        + "&v="
        + version
        + "&shelf="
        + shelf
        + "&per_page="
        + per_page
    )
    contents = urllib.request.urlopen(final_url).read()
    return contents

contents = get_user_data(user_id_name,apiKey,version, shelf, per_page)

print(contents[0:100])

>>> b'<?xml version="1.0" encoding="UTF-8"?>\n<GoodreadsResponse>\n <Request>\n <authentication>true</aut'
Converting XML to JSON
Our initial data is in XML format, and there is no direct way to convert it into
a structured database. So, we will transform it into JSON using
the xmltodict Python package.

The XML data is converted into nested JSON format, and to display the
first entry in book reviews data, we will use square brackets to access
encapsulated data.

You can experiment with the metadata and explore more options, but in this tutorial, we will be focusing on the user's review data.

contents_json = xmltodict.parse(contents)

print(contents_json["GoodreadsResponse"]["reviews"]["review"][:1])

>>> [{'id': '4626706284', 'book': {'id': {'@type': 'integer', '#text': '57771224'}, 'isbn': '1250809606',
'isbn13': '9781250809605', 'text_reviews_count': {'@type': 'integer', '#text': '150'}, 'uri':
'kca://book/amzn1.gr.book.v3.tcNoY0o7ErAhczdQ', 'title': 'Good Intentions', 'title_without_series':
'Good Intentions', 'image_url': .........

Converting JSON to Pandas Dataframe


To convert JSON data type to Pandas dataframe, we will use the
json_normalize function. The review data is present at the third level, and
to access it, we will access GoodreadsResponse, reviews, and review.

Before we display the dataframe, we will filter out irrelevant data by dropping the books with a missing date_updated value.

Learn various ways to ingest CSV files, spreadsheets, JSON, SQL databases, and APIs using Pandas by taking the Streamlined Data Ingestion with pandas course.
df = pd.json_normalize(contents_json["GoodreadsResponse"]["reviews"]["review"])

df = df[df["date_updated"].notnull()]

df.head()

Data Cleaning
The raw dataframe looks reasonably clean, but we still need to reduce the
number of columns.

As we can see, there are 61 columns.

df.shape

(200, 61)

Let’s drop the empty ones.


df.dropna(axis=1, how='all', inplace=True)

df.shape

(200, 58)

We have successfully dropped 3 completely empty columns.

We will now check all column names by using `df.columns` and select the
most useful columns.

final_df = df[[
    "rating",
    "started_at",
    "read_at",
    "date_added",
    "book.title",
    "book.average_rating",
    "book.ratings_count",
    "book.publication_year",
    "book.authors.author.name",
]]

final_df.head()
As we can observe, the final dataframe looks clean with the relevant data
fields.

Exporting CSV File


In the last section, we will export the dataframe into a CSV file that is compatible with Tableau. In the to_csv function, add the name of the file with the extension and drop the index by setting the index argument to False.

final_df.to_csv("abid_goodreads_clean_data.csv",index=False)

The CSV file will appear in the current directory.


Goodreads Clean CSV File

You can also check out the Python Jupyter Notebook: Data Ingestion using Goodreads API. It will help you debug your code, and if you want to skip the Python programming part, you can simply get the file by clicking on the Copy & Edit button and running the script.

Data Visualization in Tableau

In the second part, we will use clean data and create simple and complex
data visualization in Tableau. Our goal is to plot interactive charts which will
help us understand the user’s book reading behavior.

Connecting the Data


We will connect the CSV file by choosing the Text file option and selecting the abid_goodreads_clean_data.csv file. After that, we will change the Started At, Read At, and Date Added data fields to Date & Time, as shown below.

Note: It is a good practice to modify your data fields at the start.


Connecting Data and Modifying Data Types

Creating Rating Histogram


In this section, we will create the user book rating histogram.

 First, drag and drop the Rating field to the Rows shelf.

User Rating Histogram Part 1

 Click on the Show Me drop-down button to access the visualization templates. We will convert the bar chart to a histogram by clicking on the Histogram option.
User Rating Histogram Part 2

 The Rating axis has 0.5 interval tick marks. Change the tick marks by
right-clicking on the bottom axis and selecting Edit Axis. After that
click on the Tick Marks tab and change the Major Tick
Marks to Fixed. Make sure the Tick Origin is 0 and the Tick Interval is
1.

User Rating Histogram Part 3

 We will customize the histogram by cleaning the axis labels, changing the colors and borders of the bars, and adding mark labels. You can do all of this by accessing the options on the Marks panel, which you can find in the middle-left section.
User Rating Histogram Part 4

The user typically gave ratings between 3 and 4. The zero ratings are books that were not rated.

Line Plot
To plot line chart:

1. Drag and drop the Book.Publication Year field to the Rows and Columns shelves.
2. Change the Rows data field to count by right-clicking on it and
selecting Measure > Count.
3. Change the Columns data field to dimensions by right-clicking on it
and selecting Dimensions.
4. Go to the Marks panel, click the Automatic dropdown option, and
change it to Line.
5. Clean the axis label, customize the chart, add title, and remove null
values.

Book Publication Year Line Plot


The user has read some old books, but they are particularly interested in books published between 2015 and 2020.

If you are feeling overwhelmed and want to learn the fundamentals of Tableau, you might find the Tableau Tutorial for Beginners by Eugenia Anello helpful.

Box Plot
To plot a box and whisker plot:

1. Drag and drop the Book.Ratings Count data field to the Rows shelf. Change it from Discrete to Continuous.
2. Drag and drop the Book.Ratings Count data field to the Detail option on the Marks panel. Change it from Measure to Dimension.

Creating a Bin Field
1. Right-click on the recently created Read Duration (bin) field and select the Read Duration in Days option.
2. Change the Size of Bins to 10. It will create multiple smaller chunks of data, which will help us create a more refined version of the packed bubbles plot.

Editing Bins

We have crossed the hard path, and now it's time to see the fruits of our
labors.
1. To create the simple visualization, click on Show Me and select
the packed bubbles option. You will see unicolor circles of different
sizes.
2. To add some colors, we will drag and drop the Read Duration field
onto the Color option in the Marks panel.
3. Change the color field to Count (Distinct) by right-clicking on the field
and selecting Measure > Count (Distinct). It will give a unique color to
each bin or label.

Unicolor Packed Bubbles Plot

1. Click on the Color option in the Marks panel and select Edit
Colors… > Sunrise-Sunset Diverging. You can pick any gradient
color that suits your taste.
2. The last part is all about customizing and making sure your
visualization is appealing and conveys the right message.

Packed Bubbles Plot

It took the user less than a day to finish most of the books. You can also
see a few outliers above 300.
We can also create a Tableau dashboard by combining these
visualizations. Learn how to create a Tableau dashboard by following
the Tutorial.

Conclusion

After reading this article, you have learned the basic concepts of machine learning, why it is important, and the different libraries involved, such as Scikit-learn and OpenCV. You have also walked through regression, classification with logistic regression and SVM, face recognition, time series forecasting with Prophet, and building visualizations with Python and Tableau.
