Data Science
Algorithm: A Machine Learning algorithm is a set of rules and statistical techniques used
to learn patterns from data and draw significant information from it. It is the logic
behind a Machine Learning model. An example of a Machine Learning algorithm is the
Linear Regression algorithm.
Predictor Variable: It is a feature (or set of features) of the data that can be used to predict the output.
Response Variable: It is the feature or the output variable that needs to be predicted by
using the predictor variable(s).
Training Data: The Machine Learning model is built using the training data. The training
data helps the model to identify key trends and patterns essential to predict the output.
Testing Data: After the model is trained, it must be tested to evaluate how accurately it
can predict an outcome. This is done by the testing data set.
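As a quick illustration, the train/test split described above can be sketched with scikit-learn (the array values here are toy data, not from any real dataset):

```python
# A minimal sketch of splitting data into training and testing sets.
from sklearn.model_selection import train_test_split
import numpy as np

X = np.arange(20).reshape(10, 2)   # predictor variables
y = np.arange(10)                  # response variable

# Hold out 20% of the rows as testing data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)  # (8, 2) (2, 2)
```

The model is fit on `X_train`/`y_train` and evaluated on the held-out `X_test`/`y_test`.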
With the availability of so much data, it is finally possible to build predictive models that
can study and analyze complex data to find useful insights and deliver more accurate
results.
Top-tier companies such as Netflix and Amazon build such Machine Learning models
using tons of data in order to identify profitable opportunities and avoid unwanted
risks.
You use a Scikit-learn ML Pipeline to pass the data through transformers to extract the
features and an estimator to produce the model, and then evaluate predictions to
measure the accuracy of the model.
Transformer: This is an algorithm that transforms the input data, typically for pre-processing (for example, scaling or feature extraction).
Estimator: This is a machine learning algorithm that trains on, or fits, the data to build
a model, which can then be used for predictions.
Pipeline: A pipeline chains Transformers and Estimators together to specify an
ML workflow.
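The three concepts above can be sketched in a few lines of scikit-learn; the transformer, estimator, and dataset chosen here are illustrative, not the only options:

```python
# A scikit-learn Pipeline chaining a transformer (StandardScaler)
# with an estimator (LogisticRegression) on a synthetic dataset.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=200, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipe = Pipeline([
    ("scaler", StandardScaler()),      # transformer: standardize the features
    ("clf", LogisticRegression()),     # estimator: fit a model
])
pipe.fit(X_train, y_train)             # runs both steps in order
accuracy = pipe.score(X_test, y_test)  # evaluate predictions on held-out data
print(round(accuracy, 2))
```

Calling `fit` on the pipeline fits each transformer in turn and then the final estimator, so the whole workflow is a single object.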
A lot of articles have been written about overfitting, but almost all of
them are simply a list of tools. “How to handle overfitting — top 10 tools” or
“best techniques to prevent overfitting”. It’s like being shown nails without
explaining how to hammer them. It can be very confusing for people who
are trying to figure out how overfitting works. Also, these articles often do
not consider underfitting, as if it does not exist at all.
In this article, I would like to lay out the basic principles (principles, not tools)
for improving the quality of your model and, accordingly, preventing
underfitting and overfitting, using a particular example. This is a very general
issue that applies to all algorithms and models, so it is difficult to
describe fully. But I want to try to give you an understanding of why
underfitting and overfitting occur and why one or another particular
technique should be used.
Although I’m not describing all the concepts you need to know here (for
example, quality metrics or cross-validation), I think it’s important to
explain to you (or just remind you) what underfitting/overfitting is.
To figure this out, let's create a dataset, split it into train and test sets,
and then train three models on it: simple, good, and complex (I will not use
a validation set in this example to keep it simple, but I will say more about it later). All
the code is available in this GitLab repo.
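Since the repo itself is not reproduced here, the following is a minimal sketch of that experiment; the exact function, noise level, and degrees are my assumptions, chosen to mirror the cubic data used in the figures:

```python
# Cubic data fit with degree-1 (simple), degree-3 (good), and
# degree-13 (complex) polynomial regression models.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(-3, 3, 100)).reshape(-1, 1)
y = X.ravel() ** 3 - 2 * X.ravel() + rng.normal(0, 2, 100)  # cubic data plus noise

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

results = {}
for degree in (1, 3, 13):  # simple, good, complex
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    results[degree] = (
        mean_squared_error(y_train, model.predict(X_train)),  # train error
        mean_squared_error(y_test, model.predict(X_test)),    # test error
    )

for degree, (train_err, test_err) in results.items():
    print(f"degree={degree:2d}  train MSE={train_err:.2f}  test MSE={test_err:.2f}")
```

The linear model shows a large error on both sets (underfitting), while the high-degree model drives the train error down at the expense of the test error.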
Underfitting is a situation when your model is too simple for your data.
More formally, your hypothesis about the data distribution is wrong and too
simple; for example, your data is quadratic and your model is linear. This
situation is also called high bias. It means that your algorithm cannot make
accurate predictions, because the initial assumption about the data is incorrect.
Underfitting. The linear model trained on cubic data. Image by Author
Overfitting. The 13-degree polynomial model trained on cubic data. Image by Author
These are two extremes of the same problem and the optimal solution
always lies somewhere in the middle.
Good model. The cubic model trained on cubic data.
I will not talk much about bias/variance trade-off (you can read the basics in
this article), but let me briefly mention possible options:
All these cases can be placed on the same plot. It is a bit less clear than the
previous one but more compact.
Overfitting means that your model does not make accurate predictions on new
data. In this case, the train error is very small and the val/test error is large.
When you find a good model, the train error is small (though larger than in
the case of overfitting), and the val/test error is small too.
In the case above, the test error and validation error are approximately the
same. This happens when everything is fine, and your train, validation, and
test data have the same distributions. If validation and test error are very
different, then you need to get more data similar to test data and make sure
that you split the data correctly.
So, the conclusion is: getting more data can help only with
overfitting (not underfitting), and only if your model is not TOO
complex.
Linear regression
Linear regression analysis is used to predict the value of a variable based on
the value of another variable. The variable you want to predict is called the
dependent variable. The variable you are using to predict the other variable's
value is called the independent variable.
The term regression is used when you try to find the relationship between
variables.
The concept is to draw a line through all the plotted data points. The line is
positioned in a way that it minimizes the distance to all of the data points.
The red dashed lines represent the distance from the data points to the
drawn mathematical function.
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats

full_health_data = pd.read_csv("data.csv", header=0, sep=",")  # health dataset (file name assumed)

x = full_health_data["Average_Pulse"]
y = full_health_data["Calorie_Burnage"]

slope, intercept, r, p, std_err = stats.linregress(x, y)

def myfunc(x):
  return slope * x + intercept

mymodel = list(map(myfunc, x))

plt.scatter(x, y)
plt.plot(x, mymodel)
plt.ylim(ymin=0, ymax=2000)
plt.xlim(xmin=0, xmax=200)
plt.xlabel("Average_Pulse")
plt.ylabel("Calorie_Burnage")
plt.show()
Example Explained:
1. Import the modules you need: pandas, matplotlib, and scipy.
2. Isolate Average_Pulse as x and Calorie_Burnage as y.
3. Get the important key values with: slope, intercept, r, p, std_err = stats.linregress(x, y)
4. Create a function that uses the slope and intercept values to return a new value. This new value represents where on the y-axis the corresponding x value will be placed.
5. Run each value of the x array through the function. This will result in a new array with new values for the y-axis: mymodel = list(map(myfunc, x))
6. Draw the original scatter plot: plt.scatter(x, y)
7. Draw the line of linear regression: plt.plot(x, mymodel)
8. Define the maximum and minimum values of the axes.
9. Label the axes: "Average_Pulse" and "Calorie_Burnage".
Output:
Logistic regression
A logistic regression model can take into consideration multiple input criteria. In the case
of college acceptance, the logistic function could consider factors such as the student's
grade point average, SAT score and number of extracurricular activities. Based on
historical data about earlier outcomes involving the same input criteria, it then scores
new cases on their probability of falling into one of two outcome categories.
Logistic regression has become an important tool in the discipline of machine learning. It
allows algorithms used in machine learning applications to classify incoming data based
on historical data. As additional relevant data comes in, the algorithms get better at
predicting classifications within data sets.
Logistic regression can also play a role in data preparation activities by allowing data
sets to be put into specifically predefined buckets during the extract, transform, load
(ETL) process in order to stage the information for analysis.
Logistic regression streamlines the mathematics for measuring the impact of multiple
variables (e.g., age, gender, ad placement) with a given outcome (e.g., click-through or
ignore). The resulting models can help tease apart the relative effectiveness of various
interventions for different categories of people, such as young/old or male/female.
Logistic models can also transform raw data streams to create features for other types of
AI and machine learning techniques. In fact, logistic regression is one of the commonly
used algorithms in machine learning for binary classification problems, which are
problems with two class values, including predictions such as "this or that," "yes or no,"
and "A or B."
Logistic regression can also estimate the probabilities of events, including determining a
relationship between features and the probabilities of outcomes. That is, it can be used
for classification by creating a model that correlates the hours studied with the likelihood
the student passes or fails. On the flip side, the same model can predict
whether a particular student will pass or fail when the number of hours studied is
provided as a feature and the response variable has two values: pass and fail.
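The pass/fail example above can be sketched with scikit-learn's LogisticRegression; the study-hours data here is made up for illustration:

```python
# Logistic regression relating hours studied to a pass/fail outcome.
import numpy as np
from sklearn.linear_model import LogisticRegression

hours = np.array([0.5, 1, 1.5, 2, 2.5, 3, 3.5, 4, 4.5, 5]).reshape(-1, 1)
passed = np.array([0, 0, 0, 0, 1, 0, 1, 1, 1, 1])  # 0 = fail, 1 = pass

model = LogisticRegression()
model.fit(hours, passed)

# The same fitted model gives both a probability and a class label.
prob = model.predict_proba([[4.0]])[0, 1]   # probability of passing after 4 hours
label = model.predict([[4.0]])[0]           # predicted class for 4 hours
print(round(prob, 2), label)
```

This shows the two uses described above: `predict_proba` estimates the probability of the outcome, while `predict` performs the binary classification.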
The goal of the SVM algorithm is to create the best line or decision boundary
that can segregate n-dimensional space into classes so that we can easily put
the new data point in the correct category in the future. This best decision
boundary is called a hyperplane.
SVM chooses the extreme points/vectors that help in creating the hyperplane.
These extreme cases are called support vectors, and hence the algorithm is
termed a Support Vector Machine. Consider the diagram below, in which two
different categories are classified using a decision boundary or
hyperplane:
Example: SVM can be understood with the example that we used in the KNN
classifier. Suppose we see a strange cat that also has some features of dogs, and
we want a model that can accurately identify whether it is a cat or a dog. Such
a model can be created using the SVM algorithm. We first train our
model with lots of images of cats and dogs so that it can learn their different
features, and then we test it with this strange creature. The SVM creates a
decision boundary between the two classes (cat and dog) and chooses the extreme
cases (support vectors) of each. On the basis of the support vectors, it will
classify the new case as a cat. Consider the diagram below:
The SVM algorithm can be used for face detection, image classification, text
categorization, etc.
Types of SVM
SVM can be of two types:
Linear SVM: Linear SVM is used for linearly separable data. If a dataset can be
classified into two classes using a single straight line, then such data is
termed linearly separable data, and the classifier used is called a Linear SVM
classifier.
Non-linear SVM: Non-linear SVM is used for non-linearly separable data. If a
dataset cannot be classified using a straight line, then such data is termed
non-linear data, and the classifier used is called a Non-linear SVM classifier.
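The two types can be illustrated with scikit-learn's SVC; the datasets below are synthetic:

```python
# Linear vs non-linear SVM classifiers on synthetic data.
import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_circles

# Linearly separable data: two well-separated clusters.
X_lin = np.array([[1, 1], [2, 2], [1, 2], [6, 6], [7, 7], [6, 7]])
y_lin = np.array([0, 0, 0, 1, 1, 1])
linear_svm = SVC(kernel="linear").fit(X_lin, y_lin)
print(linear_svm.score(X_lin, y_lin))  # separable, so 1.0

# Non-linearly separable data: concentric circles, where no straight
# line works, so an RBF kernel is used instead.
X_circ, y_circ = make_circles(n_samples=200, factor=0.3, noise=0.05,
                              random_state=0)
nonlinear_svm = SVC(kernel="rbf").fit(X_circ, y_circ)
print(nonlinear_svm.score(X_circ, y_circ))
```

The kernel choice is exactly the linear/non-linear split described above: `kernel="linear"` draws a straight-line boundary, while `kernel="rbf"` can bend the boundary around the inner circle.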
Face detection is generally the first step towards many face-related applications
like face recognition or face verification, but it also has very useful
applications of its own. One of the most successful applications of face detection
is probably photo-taking.
Example: When you take a photo of your friends, a camera with a built-in face
detection algorithm detects where the faces are and adjusts the focus
accordingly.
Overview of Face Recognition
Now we have seen that our algorithms can detect faces, but can we also recognize
whose faces they are? And what if an algorithm is able to recognize faces?
Now let us understand how we can recognize faces using deep learning. Here we
use face embeddings, in which every face is converted into a vector. The technique
of converting a face into a vector is called deep metric learning. Let me divide
this process into three simple steps for easier understanding:
1. Face Detection:
The first task that we perform is detecting faces in the image (photograph) or video
stream. Once we know the exact coordinates/location of the face, we extract
it for further processing.
2. Feature Extraction:
Now that we have cropped the face out of the image, we extract specific
features from it. Here we are going to see how to use face embeddings to extract
these features. A neural network takes an image of a person's face
as input and outputs a vector that represents the most important
features of that face. In machine learning, this vector is
called an embedding, and hence we call this vector a face embedding. How will
this help in recognizing the faces of different people?
When we train the neural network, the network learns to output similar vectors for
faces that look similar. For example, if I have multiple images of the same face
taken at different times, it's obvious that some features may change, but
not too much. So the vectors associated with those faces are similar,
or we can say they are very close in the vector space.
Now that we know how this network works, let us see how to use
it on our own data. We pass all the images in our data to this pre-trained
network to get the respective embeddings and save these embeddings in a
file for the next step.
3. Comparing Faces:
We have the face embeddings for each face in our data saved in a file; the next step
is to recognize a new image that is not in our data. We first compute
the face embedding for the new image using the same network we used earlier, and then
compare this embedding with the rest of the embeddings we have. We
recognize the face if the generated embedding is close or similar to any known
embedding.
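The comparison step above boils down to measuring distances between vectors. Here is a simplified sketch with NumPy; the 3-dimensional embeddings and the 0.6 threshold are toy values (real face embeddings, such as dlib's, are 128-dimensional):

```python
# Recognizing a face by comparing its embedding to known embeddings:
# "similar" means a small Euclidean distance in the vector space.
import numpy as np

known_embeddings = {
    "alice": np.array([0.1, 0.9, 0.3]),
    "bob":   np.array([0.8, 0.2, 0.5]),
}
new_embedding = np.array([0.12, 0.88, 0.31])

def recognize(embedding, known, threshold=0.6):
    # Find the closest known face; report unknown if nothing is close enough.
    name, dist = min(
        ((n, np.linalg.norm(embedding - e)) for n, e in known.items()),
        key=lambda t: t[1],
    )
    return name if dist < threshold else "unknown"

print(recognize(new_embedding, known_embeddings))  # alice
```

The face_recognition library performs essentially this comparison internally when matching a new face against the saved encodings.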
Understand What is OpenCV
In the Artificial Intelligence field, Computer Vision is one of the most interesting
and challenging tasks. Computer Vision acts as a bridge between Computer
Software and visualizations. Computer Vision allows computer software to
understand and learn about the visualizations in the surroundings.
Let us consider an example: identifying a fruit based on its shape, color, and size.
This task is very easy for the human brain, but in a Computer Vision
pipeline we first need to gather the data, then perform the data processing
operations, and then train the model to distinguish
between fruits based on their size, shape, and color.
Implementation
In this section, we are going to implement face recognition using OpenCV and
Python.
First, let us see what libraries we will need and how to install them:
1. OpenCV
2. dlib
3. Face_recognition
OpenCV is a video and image processing library and it is used for image and
video analysis, like facial detection, license plate reading, photo editing, advanced
robotic vision, and many more.
The dlib library contains an implementation of deep metric learning, which is
used to construct the face embeddings used in the actual recognition process.
The face_recognition library is super easy to work with, and we will be using it
in our code. Remember to install the dlib library before you install
face_recognition.
Time series analysis is used for non-stationary data—things that are constantly fluctuating
over time or are affected by time. Industries like finance, retail, and economics frequently use
time series analysis because currency and sales are always changing. Stock market analysis is
an excellent example of time series analysis in action, especially with automated trading
algorithms. Likewise, time series analysis is ideal for forecasting weather changes, helping
meteorologists predict everything from tomorrow’s weather report to future years of climate
change. Examples of time series analysis in action include:
Weather data
Rainfall measurements
Temperature readings
Quarterly sales
Stock prices
Industry forecasts
Interest rates
Curve fitting: Plots the data along a curve to study the relationships of variables within the
data.
Descriptive analysis: Identifies patterns in time series data, like trends, cycles, or seasonal
variation.
Explanative analysis: Attempts to understand the data and the relationships within it, as well
as cause and effect.
Exploratory analysis: Highlights the main characteristics of the time series data, usually in a
visual format.
Forecasting: Predicts future data. This type is based on historical trends. It uses the historical
data as a model for future data, predicting scenarios that could happen along future plot
points.
Segmentation: Splits the data into segments to show the underlying properties of the source
information.
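As a small illustration of the first type, curve fitting, here is a quadratic trend fitted to a toy series with NumPy (the data is synthetic):

```python
# Curve fitting on a toy time series: fit a degree-2 curve to
# noisy observations of a quadratic trend.
import numpy as np

t = np.arange(12)                  # 12 time steps (e.g. months)
trend = 0.5 * t**2 - 2 * t + 3     # underlying quadratic trend
noisy = trend + np.random.default_rng(1).normal(0, 1, 12)

coeffs = np.polyfit(t, noisy, deg=2)   # estimate the curve's coefficients
fitted = np.polyval(coeffs, t)         # evaluate the fitted curve
print(np.round(coeffs, 1))
```

The recovered leading coefficient should sit close to the true value of 0.5, which is the sense in which the curve "studies the relationships of variables within the data."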
Forecasting
Let’s think about how forecasting is applicable and where it’s required.
Forecasting can be key when deciding whether to build a dam or a power
generation plant in the next few years based on forecasts of future demand.
I’ve used forecasting at one of my former workplaces to help schedule
staff in a call center for the coming week based on call volumes at certain
days and times. Another area where I’ve applied forecasting is telling a
business when to stock up and what to stock, based purely on demand for
the product.
I bet you’ve used forecasting before you read this, maybe not in the same
cases as mine, but I’m 100% sure you have. Have you ever looked at the
weather forecast and realized you’re overdressed or underdressed? That is
forecasting!
Forecasting using FB Prophet
Prophet is a procedure for forecasting time series data based on an additive
model where non-linear trends are fit with yearly, weekly, and daily
seasonality, plus holiday effects. It works best with time series that have
strong seasonal effects and several seasons of historical data. Prophet is
robust to missing data and shifts in the trend, and typically handles outliers
well.
Prophet is open source software released by Facebook’s Core Data Science
team. It is available for download on CRAN and PyPI.
Accurate and fast.
Prophet is used in many applications across Facebook for producing
reliable forecasts for planning and goal setting. We’ve found it to perform
better than any other approach in the majority of cases. We fit models
in Stan so that you get forecasts in just a few seconds.
Fully automatic.
Get a reasonable forecast on messy data with no manual effort. Prophet is
robust to outliers, missing data, and dramatic changes in your time series.
Tunable forecasts.
The Prophet procedure includes many possibilities for users to tweak and
adjust forecasts. You can use human-interpretable parameters to improve
your forecast by adding your domain knowledge.
Available in R or Python.
We’ve implemented the Prophet procedure in R and Python, but they share
the same underlying Stan code for fitting. Use whatever language you’re
comfortable with to get forecasts.
Tableau
Tableau provides several options to augment and create new data fields.
You can perform arithmetic, logical, spatial, and predictive modeling
functions using calculated fields. Tableau is a powerful Business
Intelligence (BI) tool, but there are limitations; that's where Python
language comes to the rescue.
Python is a popular programming language among the data community. You can use it
to extract, clean, process, and apply complex statistical functions to the
data. It provides you with machine learning frameworks, data
orchestrations, multiprocessing, and rich libraries to perform almost any
task possible.
Python is a multipurpose language, and using it with Tableau gives us the
freedom to perform highly complex tasks. In this tutorial, we are going to
use Python for extracting and cleaning the data. Then, we will be using
clean data to create data visualization on Tableau.
We will not be using Tabpy to create a Tableau Python server and execute
Python scripts within Tableau. Instead, we will first extract and clean the
data in Python (Jupyter Notebook) and then use Tableau to create
interactive visualization.
DataCamp Workspace
Tableau Public
Data Ingestion and Processing with Python
In the first part of the tutorial, we will learn to use Goodreads API to access
public data. In our case, we will be focusing on the user profile and
converting it into a readable Pandas dataframe. Furthermore, we will clean
the data and export it into CSV file format.
Getting Started
We will be using DataCamp’s Workspace for running the Python code. It
comes with the necessary Python packages for data science tasks.
If you are new to Python and want to set up the environment on your local
machine, install Anaconda. It will install Python, Jupyter Notebook, and
necessary Python Packages.
Before we start writing the code, we have to install the xmltodict package,
as it is not part of the Workspace or Anaconda data stack. We will use `pip`
to install the missing Python package.
!pip install xmltodict
Note: The `!` symbol only works in Jupyter Notebooks. It lets us access the
terminal within a Jupyter code cell.
import xmltodict
import urllib.request
Note: You can use your friend’s profile or your own profile link, and run this
script.
user_id = "73376016"
user_name = "abid"
user_id_name = user_id + '-' + user_name
print(user_id_name)
>>> 73376016-abid
Goodreads Data Extraction
At the end of 2020, Goodreads stopped providing its developer API; you can
read the full report here. To overcome this issue, we will be using API keys
from older projects such as streamlit_goodreads_app. That project explains
how to access Goodreads user data using the API.
Goodreads also provides you the option to download the data in CSV file
format without an API key, but it is limited to a user, and it doesn't give us
the freedom to extract real-time data.
apiKey = "ZRnySx6awjQuExO9tKEJXw"
version = "2"
shelf = "read"
per_page = "200"
api_url_base = "https://www.goodreads.com/review/list/"

# Wrap the request in a helper (the function name is assumed; the original
# defines one around this code, as the `return` implies).
def get_user_data(user_id):
    final_url = (
        api_url_base
        + user_id
        + ".xml?key="
        + apiKey
        + "&v="
        + version
        + "&shelf="
        + shelf
        + "&per_page="
        + per_page
    )
    contents = urllib.request.urlopen(final_url).read()
    return contents

contents = get_user_data(user_id)
print(contents[0:100])
The XML data is converted into nested JSON format, and to display the
first entry in book reviews data, we will use square brackets to access
encapsulated data.
You can experiment with the metadata and explore more options, but in this
tutorial we will be focusing on the user's review data.
contents_json = xmltodict.parse(contents)
print(contents_json["GoodreadsResponse"]["reviews"]["review"][:1])
>>> [{'id': '4626706284', 'book': {'id': {'@type': 'integer', '#text': '57771224'}, 'isbn': '1250809606',
'isbn13': '9781250809605', 'text_reviews_count': {'@type': 'integer', '#text': '150'}, 'uri':
'kca://book/amzn1.gr.book.v3.tcNoY0o7ErAhczdQ', 'title': 'Good Intentions', 'title_without_series':
'Good Intentions', 'image_url': .........
df = df[df["date_updated"].notnull()]
df.head()
Data Cleaning
The raw dataframe looks reasonably clean, but we still need to reduce the
number of columns.
df.shape
>>> (200, 61)
df.shape
>>> (200, 58)
We will now check all the column names using `df.columns` and select the
most useful ones.
final_df = df[[
    "rating",
    "started_at",
    "read_at",
    "date_added",
    "book.title",
    "book.average_rating",
    "book.ratings_count",
    "book.publication_year",
    "book.authors.author.name",
]]
final_df.head()
As we can observe, the final dataframe looks clean with the relevant data
fields.
final_df.to_csv("abid_goodreads_clean_data.csv",index=False)
You can also check out the Python Jupyter Notebook: Data Ingestion
using Goodreads API. It will help you debug your code, and if you want to
skip the Python programming part, you can simply download the file by
clicking on the Copy & Edit button and running the script.
In the second part, we will use clean data and create simple and complex
data visualization in Tableau. Our goal is to plot interactive charts which will
help us understand the user’s book reading behavior.
First, drag and drop the Rating field to the Rows shelf.
The Rating axis has 0.5 interval tick marks. Change the tick marks by
right-clicking on the bottom axis and selecting Edit Axis. After that
click on the Tick Marks tab and change the Major Tick
Marks to Fixed. Make sure the Tick Origin is 0 and the Tick Interval is
1.
The user has typically given ratings between 3 and 4. The zero ratings are
the books that were not rated.
Line Plot
To plot line chart:
Box Plot
To plot box and whisker plot:
1. Drag and drop the Books.Ratings Count data field to the Rows shelf. Change
it from Discrete to Continuous.
2. Drag and drop the Books.Ratings Count data field to the Detail option
on the Marks panel. Change it from Measure to Dimension.
Creating a Bin Field
1. Right-click on the recently created Read Duration (bin) field and
select the Read Duration in Days option.
2. Change the Size of Bins to 10. It will create multiple smaller chunks
of data which will help us create a more refined version of the packed
bubbles plot.
Editing Bins
We have crossed the hard path, and now it's time to see the fruits of our
labors.
1. To create the simple visualization, click on Show Me and select
the packed bubbles option. You will see unicolor circles of different
sizes.
2. To add some colors, we will drag and drop the Read Duration field
onto the Color option in the Marks panel.
3. Change the color field to Count (Distinct) by right-clicking on the field
and selecting Measure > Count (Distinct). It will give a unique color to
each bin or label.
4. Click on the Color option in the Marks panel and select Edit
Colors… > Sunrise-Sunset Diverging. You can pick any gradient
color that suits your taste.
5. The last part is all about customizing and making sure your
visualization is appealing and conveys the right message.
It took the user less than a day to finish most of the books. You can also
see a few outliers above 300.
We can also create a Tableau dashboard by combining these
visualizations. Learn how to create a Tableau dashboard by following
the Tutorial.
Conclusion
After reading this Data Science with Python article, you have learned what
data science is, why it is important, and the different libraries involved in
data science. You also learned about the different skills needed in data
science, such as exploratory data analysis, data wrangling, and model
building.