Unit 1 Notes


UNIT 1: FUNDAMENTALS OF MACHINE LEARNING

[Introduction to Machine Learning: What is Machine Learning?, Why Use Machine Learning?,
Types of Machine Learning Systems, Main Challenges of Machine Learning, Applications of
Machine Learning. Why Python, scikit-learn, Essential Libraries and Tools.]

WHAT IS MACHINE LEARNING?


Machine Learning is the science (and art) of programming computers so they can learn from data.
Example: Your spam filter is a Machine Learning program that, given examples of spam emails
(e.g., flagged by users) and examples of regular (nonspam, also called “ham”) emails, can learn to
flag spam. The examples that the system uses to learn are called the training set. Each training
example is called a training instance (or sample).
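
The spam filter described above can be sketched in a few lines. This is only an illustration under stated assumptions: the six emails are a made-up training set, and the choice of scikit-learn's CountVectorizer with a Naive Bayes classifier is ours, not something the notes prescribe.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy training set: each string is one training instance
train_texts = [
    "win a free prize now",
    "cheap loans click here",
    "free money win big",
    "meeting agenda for monday",
    "lunch with the project team",
    "please review the attached report",
]
train_labels = ["spam", "spam", "spam", "ham", "ham", "ham"]

# Turn each email into a vector of word counts
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_texts)

# Learn to flag spam from the labeled examples
spam_filter = MultinomialNB()
spam_filter.fit(X_train, train_labels)

# Classify emails that were not part of the training set
new_emails = ["free prize money", "agenda for the team meeting"]
predictions = spam_filter.predict(vectorizer.transform(new_emails))
print(predictions)
```

A real filter would need far more training instances than this toy set.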

WHY USE MACHINE LEARNING?


Machine Learning is great for:
• Problems for which existing solutions require a lot of fine-tuning or long lists of rules: one
Machine Learning algorithm can often simplify code and perform better than the traditional
approach.
• Complex problems for which using a traditional approach yields no good solution: the best
Machine Learning techniques can perhaps find a solution.
• Fluctuating environments: a Machine Learning system can adapt to new data.
• Getting insights about complex problems and large amounts of data.

TYPES OF MACHINE LEARNING SYSTEMS


There are so many different types of Machine Learning systems that it is useful to
classify them in broad categories, based on the following criteria:
• Whether or not they are trained with human supervision (supervised, unsupervised,
semisupervised, and Reinforcement Learning)
• Whether or not they can learn incrementally on the fly (online versus batch learning)
• Whether they work by simply comparing new data points to known data points, or instead
by detecting patterns in the training data and building a predictive model, much like
scientists do (instance-based versus model-based learning)
Supervised/Unsupervised Learning:
Machine Learning systems can be classified according to the amount and type of supervision they
get during training. There are four major categories: supervised learning, unsupervised learning,
semisupervised learning, and Reinforcement Learning.
Supervised learning:
In supervised learning, the training set you feed to the algorithm includes the desired solutions,
called labels.

Here are some of the most important supervised learning algorithms:


• k-Nearest Neighbors
• Linear Regression
• Logistic Regression
• Support Vector Machines (SVMs)
• Decision Trees and Random Forests
• Neural networks
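
As a minimal sketch of supervised learning, the code below trains the first algorithm in the list, k-Nearest Neighbors, on labeled data. The iris dataset, the train/test split and n_neighbors=3 are our assumptions for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# X holds the measurements, y holds the labels (the desired solutions)
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Train on labeled examples
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)

# Evaluate on examples the model has never seen
accuracy = knn.score(X_test, y_test)
print("Test accuracy: {:.2f}".format(accuracy))
```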
Unsupervised learning
In unsupervised learning, the training data is unlabeled. The system tries to learn without a teacher.

Here are some of the most important unsupervised learning algorithms:


• Clustering
— K-Means
— DBSCAN
— Hierarchical Cluster Analysis (HCA)
• Anomaly detection and novelty detection
— One-class SVM
— Isolation Forest
• Visualization and dimensionality reduction
— Principal Component Analysis (PCA)
— Kernel PCA
— Locally Linear Embedding (LLE)
— t-Distributed Stochastic Neighbor Embedding (t-SNE)
• Association rule learning
— Apriori
— Eclat
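
To illustrate learning without labels, here is a small K-Means sketch. The two groups of synthetic 2D points are an assumption made for the example; the algorithm is never told which point belongs to which group.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical unlabeled data: two well-separated groups of 2D points
rng = np.random.RandomState(0)
group_a = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(50, 2))
group_b = rng.normal(loc=[5.0, 5.0], scale=0.5, size=(50, 2))
X = np.vstack([group_a, group_b])

# Ask K-Means to find two clusters; no labels are passed in
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(X)

print("Cluster of first point in group A:", cluster_ids[0])
print("Cluster of first point in group B:", cluster_ids[50])
```

The cluster numbers themselves are arbitrary; what matters is that points from the same group end up with the same number.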
Semisupervised learning:
Since labeling data is usually time-consuming and costly, you will often have plenty of unlabeled
instances, and few labeled instances. Some algorithms can deal with data that’s partially labeled.
This is called semisupervised learning.
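
A rough sketch of semisupervised learning, assuming synthetic data with only one labeled instance per class. We use scikit-learn's LabelPropagation here as one possible algorithm; in scikit-learn, unlabeled instances are marked with -1.

```python
import numpy as np
from sklearn.semi_supervised import LabelPropagation

# Hypothetical data: two groups of points, 100 instances in total
rng = np.random.RandomState(0)
X = np.vstack([
    rng.normal(loc=[0.0, 0.0], scale=0.5, size=(50, 2)),
    rng.normal(loc=[5.0, 5.0], scale=0.5, size=(50, 2)),
])
true_labels = np.array([0] * 50 + [1] * 50)

# Only two instances are labeled; -1 means "no label available"
partial_labels = np.full(100, -1)
partial_labels[0] = 0
partial_labels[50] = 1

# Spread the two known labels to nearby unlabeled instances
model = LabelPropagation()
model.fit(X, partial_labels)
inferred = model.transduction_

accuracy = (inferred == true_labels).mean()
print("Share of instances labeled correctly: {:.2f}".format(accuracy))
```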
Reinforcement Learning:
Reinforcement Learning is a very different beast. The learning system, called an agent in this
context, can observe the environment, select and perform actions, and get rewards in return (or
penalties in the form of negative rewards, as shown in the figure). It must then learn by itself what is
the best strategy, called a policy, to get the most reward over time. A policy defines what action
the agent should choose when it is in a given situation.
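
The agent, reward and policy ideas can be sketched with tabular Q-learning. Everything here is a hypothetical toy: the five-position corridor, the reward values and the learning parameters are assumptions chosen only to keep the example short.

```python
import numpy as np

N_STATES = 5          # positions 0..4; position 4 is the goal
ACTIONS = [-1, +1]    # action 0 moves left, action 1 moves right

def step(state, action):
    """Hypothetical environment: returns (next_state, reward, done)."""
    next_state = min(max(state + ACTIONS[action], 0), N_STATES - 1)
    if next_state == N_STATES - 1:
        return next_state, 1.0, True     # reward for reaching the goal
    return next_state, -0.1, False       # penalty (negative reward) otherwise

rng = np.random.RandomState(0)
q_table = np.zeros((N_STATES, len(ACTIONS)))
alpha, gamma, epsilon = 0.5, 0.9, 0.2    # learning rate, discount, exploration

for episode in range(200):
    state, done = 0, False
    while not done:
        # Explore with probability epsilon, otherwise use what was learned
        if rng.rand() < epsilon:
            action = rng.randint(len(ACTIONS))
        else:
            action = int(np.argmax(q_table[state]))
        next_state, reward, done = step(state, action)
        # Q-learning update toward reward plus discounted future value
        target = reward + (0.0 if done else gamma * q_table[next_state].max())
        q_table[state, action] += alpha * (target - q_table[state, action])
        state = next_state

# The policy: the best known action for each non-goal position
policy = q_table.argmax(axis=1)
print(policy[:-1])
```

The learned policy maps each situation (position) to an action, which is exactly the role of a policy described above.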

Batch and Online Learning:


Another criterion used to classify Machine Learning systems is whether or not the system can learn
incrementally from a stream of incoming data.
Batch learning:
In batch learning, the system is incapable of learning incrementally: it must be trained using
all the available data. This will generally take a lot of time and computing resources, so it
is typically done offline. First the system is trained, and then it is launched into production
and runs without learning anymore; it just applies what it has learned. This is called offline
learning.
Online learning:
In online learning, you train the system incrementally by feeding it data instances
sequentially, either individually or in small groups called mini-batches. Each learning step
is fast and cheap, so the system can learn about new data on the fly, as it arrives.
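
A minimal sketch of online learning, assuming a made-up data stream. scikit-learn's SGDClassifier supports incremental training through partial_fit, so the model is updated one mini-batch at a time instead of being retrained on all the data.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.RandomState(0)

def next_mini_batch(size=20):
    """Hypothetical data stream: the label is 1 when x0 + x1 > 0."""
    X = rng.uniform(-1, 1, size=(size, 2))
    y = (X[:, 0] + X[:, 1] > 0).astype(int)
    return X, y

model = SGDClassifier(random_state=0)
classes = np.array([0, 1])   # partial_fit must know all classes on the first call

# Each learning step sees only one small mini-batch
for _ in range(100):
    X_batch, y_batch = next_mini_batch()
    model.partial_fit(X_batch, y_batch, classes=classes)

# Check the model on data that arrives later
X_new, y_new = next_mini_batch(size=500)
accuracy = model.score(X_new, y_new)
print("Accuracy on new data: {:.2f}".format(accuracy))
```
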
Instance-Based Versus Model-Based Learning:
One more way to categorize Machine Learning systems is by how they generalize.
Instance-based learning:
The system learns the examples by heart, then generalizes to new cases by using a similarity
measure to compare them to the learned examples (or a subset of them). For example, in Figure 1-15
the new instance would be classified as a triangle because the majority of the most similar
instances belong to that class.
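
The majority vote among the most similar instances can be written out directly. This is a sketch with hypothetical points, using Euclidean distance as the similarity measure and k=3 as an assumed setting.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    """Classify x_new by majority vote among its k closest training examples."""
    distances = np.linalg.norm(X_train - x_new, axis=1)   # Euclidean distance
    nearest = np.argsort(distances)[:k]                   # indices of k closest
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical memorized examples
X_train = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.5],
                    [6.0, 6.0], [6.5, 7.0], [7.0, 6.5]])
y_train = np.array(["square", "square", "square",
                    "triangle", "triangle", "triangle"])

prediction = knn_predict(X_train, y_train, np.array([6.2, 6.4]))
print(prediction)
```

Note that there is no training step: the system only stores the examples and compares new cases against them.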

Model-based learning:
Another way to generalize from a set of examples is to build a model of these examples and then
use that model to make predictions. This is called model-based learning (Figure 1-16).
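
As a sketch of model-based learning, the code below fits a linear model to examples and then uses only the learned parameters to predict. The synthetic data (a line with slope 3 and intercept 2 plus noise) is an assumption for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical examples generated from y = 3x + 2 with some noise
rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 3.0 * X[:, 0] + 2.0 + rng.normal(scale=0.5, size=100)

# Build a model of the examples
model = LinearRegression()
model.fit(X, y)
slope, intercept = model.coef_[0], model.intercept_
print("Learned slope: {:.2f}, intercept: {:.2f}".format(slope, intercept))

# Use the model, not the stored examples, to make a prediction
prediction = model.predict([[4.0]])[0]
print("Prediction for x = 4: {:.2f}".format(prediction))
```
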
MAIN CHALLENGES OF MACHINE LEARNING:
Poor Quality of Data
Data plays a significant role in the machine learning process. One of the significant issues that
machine learning professionals face is the absence of good-quality data. Unclean and noisy data
makes the whole process exhausting and leads the algorithm to make inaccurate or faulty
predictions, so the quality of data is essential to the quality of the output. Therefore, we need to
ensure that data preprocessing, which includes removing outliers, handling missing values, and
removing unwanted features, is done carefully.
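
The three preprocessing steps can be sketched with pandas. The small table and the 1.5 x IQR rule for outliers are our assumptions; other rules for detecting outliers are equally valid.

```python
import numpy as np
import pandas as pd

# Hypothetical raw data with a missing value, an outlier, and an unwanted column
raw = pd.DataFrame({
    "height_cm": [170.0, 165.0, np.nan, 180.0, 175.0, 999.0],
    "weight_kg": [65.0, 60.0, 70.0, 80.0, 72.0, 68.0],
    "record_id": [1, 2, 3, 4, 5, 6],
})

# 1. Filter missing values: drop rows that contain NaN
clean = raw.dropna()

# 2. Remove outliers: keep heights within 1.5 * IQR of the quartiles
q1, q3 = clean["height_cm"].quantile([0.25, 0.75])
iqr = q3 - q1
in_range = clean["height_cm"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
clean = clean[in_range]

# 3. Remove unwanted features: the record ID carries no predictive information
clean = clean.drop(columns=["record_id"])
print(clean)
```
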
Underfitting of Training Data
Underfitting occurs when a model is unable to establish an accurate relationship between input and
output variables. It is like trying to fit into undersized jeans. It signifies that the model is too simple
to capture the structure of the data. To overcome this issue:

• Increase the training time of the model
• Enhance the complexity of the model
• Add more features to the data
• Reduce the regularization parameters
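
A short sketch of underfitting and one of the remedies above (enhancing model complexity). The quadratic synthetic data is an assumption: a straight line is too simple for it, while a model with a squared feature fits well.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical data with a curved (quadratic) relationship
rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = X[:, 0] ** 2 + rng.normal(scale=0.3, size=200)

# Too simple: a straight line cannot follow the curve
simple_model = LinearRegression().fit(X, y)

# More complex: add the squared feature before fitting
richer_model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression()).fit(X, y)

simple_r2 = simple_model.score(X, y)
richer_r2 = richer_model.score(X, y)
print("R^2 of straight line: {:.2f}".format(simple_r2))
print("R^2 with squared feature: {:.2f}".format(richer_r2))
```

An underfit model scores poorly even on its own training data, which is what the first score shows.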

Overfitting of Training Data

Overfitting occurs when a model fits its training data so closely that it also learns the noise and
bias in that data, which negatively affects its performance on new data. It is like trying to fit into
oversized jeans. Unfortunately, this is one of the significant issues faced by machine learning
professionals. It often arises when the algorithm is trained with noisy or biased data. Let’s
understand this with the help of an example. Consider a model trained to differentiate between
a cat, a rabbit, a dog, and a tiger. The training data contains 1000 cats, 1000 dogs, 1000 tigers, and
4000 rabbits. Then there is a considerable probability that it will identify a cat as a rabbit. In
this example, we had a vast amount of data, but it was biased; hence the prediction was negatively
affected.

We can tackle this issue by:

• Analyzing the data carefully
• Using data augmentation techniques
• Removing outliers in the training set
• Selecting a model with fewer features
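
Overfitting usually shows up as a gap between training and test accuracy. The sketch below assumes synthetic data with 20% of the labels flipped as noise, and compares an unrestricted decision tree with a deliberately simple one.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical noisy data: the true rule is x0 > 0, but 20% of labels are wrong
rng = np.random.RandomState(0)
X = rng.uniform(-1, 1, size=(400, 2))
y = (X[:, 0] > 0).astype(int)
flip = rng.rand(400) < 0.2
y[flip] = 1 - y[flip]

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# An unrestricted tree can memorize every training example, noise included
deep_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
# A tree limited to one split is forced to learn only the main pattern
shallow_tree = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X_train, y_train)

deep_train = deep_tree.score(X_train, y_train)
deep_test = deep_tree.score(X_test, y_test)
shallow_train = shallow_tree.score(X_train, y_train)
shallow_test = shallow_tree.score(X_test, y_test)
print("Deep tree    train/test: {:.2f} / {:.2f}".format(deep_train, deep_test))
print("Shallow tree train/test: {:.2f} / {:.2f}".format(shallow_train, shallow_test))
```

The deep tree scores perfectly on the training set but much worse on the test set; the simpler tree typically shows a far smaller gap.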

Machine Learning is a Complex Process

The machine learning industry is young and is continuously changing. Rapid trial-and-error
experiments are being carried out. The process is still evolving, and hence there is a high chance
of error, which makes learning complex. It includes analyzing the data, removing data bias,
training models, applying complex mathematical calculations, and a lot more. Hence it is a really
complicated process, which is another big challenge for machine learning professionals.

Lack of Training Data

The most important task in the machine learning process is to train the model on data to
achieve an accurate output. Too little training data will produce inaccurate or overly biased
predictions. Let us understand this with the help of an example. Think of training a machine
learning algorithm as similar to teaching a child. One day you decide to explain to a child how to
distinguish between an apple and a watermelon. You take an apple and a watermelon and show him
the difference between them based on their color, shape, and taste. In this way, he will soon become
very good at telling the two apart. A machine learning algorithm, on the other hand,
needs a lot of data to make the same distinction. For complex problems, it may even require
millions of examples to be trained. Therefore, we need to ensure that machine learning algorithms
are trained with sufficient amounts of data.

Slow Implementation

This is one of the common issues faced by machine learning professionals. Machine learning
models can provide accurate results, but producing them often takes a tremendous amount of time.
Slow programs, data overload, and excessive requirements all add to the time needed to obtain
accurate results. Further, a model requires constant monitoring and maintenance to deliver the best
output.

Imperfections in the Algorithm When Data Grows

So you have found quality data, trained a model on it well, and the predictions are really precise and
accurate. Yay, you have learned how to create a machine learning model! But wait, there is a
twist: the model may become useless in the future as data grows. The best model of the present
may become inaccurate in the future and require further adjustment. So you need
regular monitoring and maintenance to keep the model working. This is one of the most
exhausting issues faced by machine learning professionals.

APPLICATIONS OF MACHINE LEARNING:


Image Recognition:

Image recognition is one of the most common applications of machine learning. It is used to
identify objects, persons, places, etc. in digital images. A popular use case of image recognition
and face detection is automatic friend tagging suggestion:

Facebook provides a feature of automatic friend tagging suggestions. Whenever we upload a photo
with our Facebook friends, we automatically get a tagging suggestion with names, and the
technology behind this is machine learning's face detection and recognition algorithm.

Speech Recognition

While using Google, we get an option to "Search by voice". It comes under speech recognition,
and it is a popular application of machine learning.

Speech recognition is the process of converting voice instructions into text, and it is also known as
"speech to text" or "computer speech recognition". At present, machine learning algorithms
are widely used by various applications of speech recognition. Google Assistant, Siri, Cortana,
and Alexa use speech recognition technology to follow voice instructions.

Traffic prediction:

If we want to visit a new place, we take the help of Google Maps, which shows us the correct path
with the shortest route and predicts the traffic conditions.

It predicts traffic conditions, such as whether traffic is clear, slow-moving, or heavily
congested, in two ways:

• Real-time location of the vehicle from the Google Maps app and sensors
• Average time taken on past days at the same time of day

Everyone who uses Google Maps is helping to make the app better. It takes information from
the user and sends it back to its database to improve performance.

Product recommendations:

Machine learning is widely used by various e-commerce and entertainment companies such as
Amazon, Netflix, etc., for product recommendation to the user. Whenever we search for some
product on Amazon, we start getting advertisements for the same product while surfing the
internet on the same browser, and this is because of machine learning.

Google understands the user's interests using various machine learning algorithms and suggests
products according to those interests.

Similarly, when we use Netflix, we find recommendations for entertainment series, movies,
etc., and this is also done with the help of machine learning.

Self-driving cars:

One of the most exciting applications of machine learning is self-driving cars, in which machine
learning plays a significant role. Tesla, a well-known car manufacturer, is working on self-driving
cars. It uses machine learning methods to train the car models to detect people and objects while
driving.

Email Spam and Malware Filtering:

Whenever we receive a new email, it is filtered automatically as important, normal, or spam. We
receive important mail in our inbox marked with the important symbol and spam emails in our
spam box, and the technology behind this is machine learning. Below are some spam filters used
by Gmail:

• Content filter
• Header filter
• General blacklists filter
• Rules-based filters
• Permission filters

Some machine learning algorithms such as Multi-Layer Perceptron, Decision tree, and Naïve
Bayes classifier are used for email spam filtering and malware detection.

Virtual Personal Assistant:

We have various virtual personal assistants such as Google Assistant, Alexa, Cortana, and Siri. As
the name suggests, they help us find information using our voice instructions. These
assistants can help us in various ways just by our voice instructions, such as playing music, calling
someone, opening an email, scheduling an appointment, etc.

Machine learning algorithms are an important part of these virtual assistants. The assistants
record our voice instructions, send them to a server in the cloud, decode them using ML algorithms,
and act accordingly.

Online Fraud Detection:

Machine learning is making our online transactions safe and secure by detecting fraudulent
transactions. Whenever we perform an online transaction, there are various ways that a fraudulent
transaction can take place, such as fake accounts, fake IDs, and stealing money in the middle of a
transaction. To detect this, a feed-forward neural network helps us by checking whether it is
a genuine transaction or a fraudulent one.

For each genuine transaction, the output is converted into some hash values, and these values
become the input for the next round. Each genuine transaction follows a specific pattern, which
changes for a fraudulent transaction; hence, the network detects it and makes our online
transactions more secure.

Stock Market trading:

Machine learning is widely used in stock market trading. In the stock market, there is always a risk
of ups and downs in share prices, so a long short-term memory (LSTM) neural
network is used for the prediction of stock market trends.

Medical Diagnosis:

In medical science, machine learning is used for disease diagnosis. With this, medical technology
is growing very fast and is able to build 3D models that can predict the exact position of lesions in
the brain. It helps in finding brain tumors and other brain-related diseases easily.

Automatic Language Translation:

Nowadays, if we visit a new place and do not know the language, it is not a problem at
all, because machine learning helps us by converting the text into a language we know.
Google's GNMT (Google Neural Machine Translation) provides this feature. It is a neural
machine translation system that translates text into our familiar language, and this is called
automatic translation.

The technology behind automatic translation is a sequence-to-sequence learning algorithm,
which is used with image recognition and translates text from one language to another.

WHY PYTHON:
Python combines the power of general-purpose programming languages with the ease of use of
domain-specific scripting languages like MATLAB or R. Python has libraries for data loading,
visualization, statistics, natural language processing, image processing, and more. This vast
toolbox provides data scientists with a large array of general- and special-purpose functionality.
One of the main advantages of using Python is the ability to interact directly with the code, using
a terminal or other tools like the Jupyter Notebook, etc.

As a general-purpose programming language, Python also allows for the creation of complex
graphical user interfaces (GUIs) and web services, and for integration into existing systems.

SCIKIT-LEARN:

scikit-learn is an open source project, meaning that it is free to use and distribute, and anyone can
easily obtain the source code to see what is going on behind the scenes. The scikit-learn project is
constantly being developed and improved, and it has a very active user community. It contains a
number of state-of-the-art machine learning algorithms, as well as comprehensive documentation
about each algorithm.

Installing scikit-learn

scikit-learn depends on two other Python packages, NumPy and SciPy. For plotting and
interactive development, you should also install matplotlib, IPython, and the Jupyter Notebook.

Prepackaged Python distributions will provide the necessary packages:

Anaconda

A Python distribution made for large-scale data processing, predictive analytics, and scientific
computing. Anaconda comes with NumPy, SciPy, matplotlib, pandas, IPython, Jupyter Notebook,
and scikit-learn. Available on Mac OS, Windows, and Linux, it is a very convenient solution and
is the one we suggest for people without an existing installation of the scientific Python packages.
Anaconda now also includes the commercial Intel MKL library for free. Using MKL (which is
done automatically when Anaconda is installed) can give significant speed improvements for many
algorithms in scikit-learn.

Enthought Canopy

Another Python distribution for scientific computing. This comes with NumPy, SciPy, matplotlib,
pandas, and IPython, but the free version does not come with scikit-learn. If you are part of an
academic, degree-granting institution, you can request an academic license and get free access to
the paid subscription version of Enthought Canopy. Enthought Canopy is available for Python
2.7.x, and works on Mac OS, Windows, and Linux.

Python(x,y)

A free Python distribution for scientific computing, specifically for Windows. Python(x,y) comes
with NumPy, SciPy, matplotlib, pandas, IPython, and scikit-learn.

If you already have a Python installation set up, you can use pip to install all of these packages:
$ pip install numpy scipy matplotlib ipython scikit-learn pandas

ESSENTIAL LIBRARIES AND TOOLS:

Jupyter Notebook

The Jupyter Notebook is an interactive environment for running code in the browser. It is a great
tool for exploratory data analysis and is widely used by data scientists. While the Jupyter Notebook
supports many programming languages, we only need the Python support.

NumPy

NumPy is one of the fundamental packages for scientific computing in Python. It contains
functionality for multidimensional arrays, high-level mathematical functions such as linear algebra
operations and the Fourier transform, and pseudorandom number generators.

In scikit-learn, the NumPy array is the fundamental data structure. scikit-learn takes in data in the
form of NumPy arrays. Any data you’re using will have to be converted to a NumPy array. The
core functionality of NumPy is the ndarray class, a multidimensional (n-dimensional) array. All
elements of the array must be of the same type. A NumPy array looks like this:

import numpy as np

x = np.array([[1, 2, 3], [4, 5, 6]])
print("x:\n{}".format(x))

Output:

x:
[[1 2 3]
 [4 5 6]]
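
A small follow-up sketch, with variable names of our own choosing, shows the two properties stated above: the array is n-dimensional, and all elements share one type (mixed inputs are converted to a common type).

```python
import numpy as np

x = np.array([[1, 2, 3], [4, 5, 6]])

# The ndarray knows its dimensions: 2 rows and 3 columns
print("shape:", x.shape)
print("number of dimensions:", x.ndim)

# Mixing integers and a float produces an array where every element is a float
mixed = np.array([1, 2.5, 3])
print("dtype of mixed array:", mixed.dtype)
```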

SciPy

SciPy is a collection of functions for scientific computing in Python. It provides, among other
functionality, advanced linear algebra routines, mathematical function optimization, signal
processing, special mathematical functions, and statistical distributions. scikit-learn draws from
SciPy’s collection of functions for implementing its algorithms. The most important part of SciPy
for us is scipy.sparse: this provides sparse matrices, which are another representation that is used
for data in scikit-learn. Sparse matrices are used whenever we want to store a 2D array that contains
mostly zeros:

from scipy import sparse

# Create a 2D NumPy array with a diagonal of ones, and zeros everywhere else
eye = np.eye(4)
print("NumPy array:\n{}".format(eye))

Output:

NumPy array:
[[ 1. 0. 0. 0.]
 [ 0. 1. 0. 0.]
 [ 0. 0. 1. 0.]
 [ 0. 0. 0. 1.]]
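
The snippet above only builds the dense array. As a sketch of the sparse representation itself, the code below converts it with scipy.sparse; the choice of the CSR and COO formats here is ours, and SciPy offers several others.

```python
import numpy as np
from scipy import sparse

eye = np.eye(4)

# Convert the NumPy array to a SciPy sparse matrix in CSR format.
# Only the nonzero entries are stored.
sparse_matrix = sparse.csr_matrix(eye)
print("Stored nonzero entries:", sparse_matrix.nnz)

# A sparse matrix can also be built directly from values and their positions
# (COO format), without ever creating the dense array
data = np.ones(4)
row_indices = np.arange(4)
col_indices = np.arange(4)
eye_coo = sparse.coo_matrix((data, (row_indices, col_indices)))
print("COO matrix as dense array:\n{}".format(eye_coo.toarray()))
```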

matplotlib

matplotlib is the primary scientific plotting library in Python. It provides functions for making
publication-quality visualizations such as line charts, histograms, scatter plots, and so on.
Visualizing your data and different aspects of your analysis can give you important insights, and
we will be using matplotlib for all our visualizations. When working inside the Jupyter Notebook,
you can show figures directly in the browser by using the %matplotlib notebook and %matplotlib
inline commands.

%matplotlib inline
import matplotlib.pyplot as plt

# Generate a sequence of numbers from -10 to 10 with 100 steps in between
x = np.linspace(-10, 10, 100)
# Create a second array using sine
y = np.sin(x)
# The plot function makes a line chart of one array against another
plt.plot(x, y, marker="x")

pandas

pandas is a Python library for data wrangling and analysis. It is built around a data structure called
the DataFrame that is modeled after the R DataFrame. Simply put, a pandas DataFrame is a table,
similar to an Excel spreadsheet. pandas provides a great range of methods to modify and operate
on this table; in particular, it allows SQL-like queries and joins of tables. In contrast to NumPy,
which requires that all entries in an array be of the same type, pandas allows each column to have
a separate type (for example, integers, dates, floating-point numbers, and strings). Another
valuable tool provided by pandas is its ability to ingest from a great variety of file formats and
databases, like SQL, Excel files, and comma-separated values (CSV) files.

import pandas as pd
from IPython.display import display

# create a simple dataset of people
data = {'Name': ["John", "Anna", "Peter", "Linda"],
        'Location': ["New York", "Paris", "Berlin", "London"],
        'Age': [24, 13, 53, 33]
        }

data_pandas = pd.DataFrame(data)
# IPython.display allows "pretty printing" of dataframes
# in the Jupyter notebook
display(data_pandas)

This produces the following output:

    Name  Location  Age
0   John  New York   24
1   Anna     Paris   13
2  Peter    Berlin   53
3  Linda    London   33
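
The SQL-like querying mentioned above can be sketched on the same people table. The variable name older_than_30 is our own; the example rebuilds the table so that it runs on its own.

```python
import pandas as pd

data = {'Name': ["John", "Anna", "Peter", "Linda"],
        'Location': ["New York", "Paris", "Berlin", "London"],
        'Age': [24, 13, 53, 33]}
data_pandas = pd.DataFrame(data)

# Select all rows that have an Age column greater than 30
older_than_30 = data_pandas[data_pandas.Age > 30]
print(older_than_30)

# Each column keeps its own type, unlike a NumPy array
print(data_pandas.dtypes)
```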

mglearn

mglearn is a library of utility functions. If you see a call to mglearn in the code, it is usually a way
to make a pretty picture quickly, or to get our hands on some interesting data.

IMPORTANT QUESTIONS:
1. Why is Python preferred for Machine Learning and what are the essential libraries and tools used in
Python for Machine Learning?
2. What are the different types of Machine Learning systems and their key characteristics?
3. What are some popular applications of Machine Learning and how do they work?
4. What are the main challenges of machine learning?
5. Explain essential libraries and tools in Python.
