
MACHINE LEARNING

UNIT -1

What is machine learning?


Machine learning enables a machine to automatically learn from data, improve
performance from experiences, and predict things without being explicitly
programmed.
Machine Learning is defined as the study of computer algorithms for
automatically constructing computer software through past experience and training
data.

How it works?
A Machine Learning system learns from historical data, builds the prediction
models, and whenever it receives new data, predicts the output for it.
Application:

We use machine learning in our daily life, often without knowing it, in products
such as Google Maps, Google Assistant, and Alexa. These are some of the most
popular real-world applications of machine learning.
Machine learning is also used in self-driving cars, cyber fraud detection, face
recognition, friend suggestions on Facebook, etc.

TYPES OF MACHINE LEARNING:


Three important types of Machine Learning Algorithms are:
 Supervised Learning
 Unsupervised Learning
 Reinforcement Learning

1.SUPERVISED LEARNING:

 Pre-categorized data
 Learning in the presence of an expert/teacher
 Data is labeled with a class or variable.
 A model that is able to predict with the help of a labeled dataset. A labeled
dataset is one where you already know the target answer.

TYPES OF SUPERVISED LEARNING:


 Classification: the label is a discrete category
 Regression: the label is a continuous value

1.1 CLASSIFICATION:

Classification is used when the output variable is categorical, i.e. with 2 or
more classes. For example, yes or no, male or female, true or false, etc.

Analyze the data and separate it into classes.

Three classification types: binary, multi-class, and multi-label.
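As an illustration of binary classification, here is a minimal sketch of a nearest-centroid classifier. The data set, the feature meanings (hours studied, hours slept), and the function names are all hypothetical and chosen only for this example; real projects would normally use a library classifier.

```python
# Toy binary classifier: nearest centroid (illustrative only).
# Each sample is ((hours_studied, hours_slept), label); the data is made up.
train = [
    ((8.0, 7.0), "pass"),
    ((7.0, 8.0), "pass"),
    ((2.0, 4.0), "fail"),
    ((1.0, 5.0), "fail"),
]

def centroids(samples):
    """Average the feature vectors of each class."""
    sums, counts = {}, {}
    for (x, y), label in samples:
        sx, sy = sums.get(label, (0.0, 0.0))
        sums[label] = (sx + x, sy + y)
        counts[label] = counts.get(label, 0) + 1
    return {lab: (sx / counts[lab], sy / counts[lab])
            for lab, (sx, sy) in sums.items()}

def predict(point, cents):
    """Return the label whose centroid is closest to the point."""
    def dist2(c):
        return (point[0] - c[0]) ** 2 + (point[1] - c[1]) ** 2
    return min(cents, key=lambda lab: dist2(cents[lab]))

cents = centroids(train)
prediction = predict((6.5, 7.5), cents)  # a new, unlabeled sample
```

The labeled samples play the role of the "teacher": the model only learns where each class lies because the training labels are known.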

1.2 REGRESSION:

Regression is a supervised machine learning technique used to predict
continuous output values based on the input.

Analyze the data that we have and predict future values.

Three common types of regression algorithms are simple linear regression,
multiple linear regression, and polynomial regression.
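Simple linear regression can be sketched with the standard least-squares formulas for slope and intercept. The data below is made up (it follows y = 2x + 1 exactly) and `fit_line` is a hypothetical helper name.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]          # made-up data on the line y = 2x + 1
slope, intercept = fit_line(xs, ys)
predicted = slope * 5.0 + intercept  # continuous prediction for a new input
```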

2. UNSUPERVISED LEARNING:
 Unsupervised learning is a type of machine learning in which models are
trained using unlabeled dataset and are allowed to act on that data without
any supervision.
 Unlabelled data
 No knowledge of output class or value/self guided learning algorithm.
 Data is unlabelled and value is unknown.
 Group and interpret data based only on input data.

Types of unsupervised learning:

 Clustering
 Association
 Dimensionality reduction

2.1 CLUSTERING:

Clustering is a method of grouping objects into clusters such that
objects with the most similarities remain in one group and have few or no
similarities with the objects of another group.

Example: An online shop can group customers with similar buying habits into
clusters and suggest products that other customers in the same cluster bought.
(The pattern "a person who buys bread also buys butter" is an association rule.)
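A minimal sketch of clustering is k-means on one-dimensional data. The monthly spend values, the starting centers, and the name `kmeans_1d` are assumptions made for this example only.

```python
def kmeans_1d(values, centers, iterations=10):
    """Very small k-means for 1-D data with given starting centers."""
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        # Assignment step: put each value in the cluster of its nearest center.
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Update step: move each center to the mean of its cluster.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

spend = [5, 7, 6, 50, 55, 52]   # made-up customer spend values
centers, clusters = kmeans_1d(spend, centers=[5.0, 50.0])
```

No labels are given; the two groups (low and high spenders) emerge from the input data alone.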

2.2 ASSOCIATION:
Association rule learning is an unsupervised learning method used for
finding relationships between variables in a large database. It determines the
sets of items that occur together in the dataset.
Example: Association rules make marketing strategy more effective. For example,
people who buy item X (say, bread) also tend to purchase item Y (butter or jam).
A typical application of association rules is Market Basket Analysis.
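Two common measures for an association rule are support and confidence. The sketch below computes them for the rule bread -> butter on a small made-up list of baskets; the helper names are hypothetical.

```python
baskets = [
    {"bread", "butter"},
    {"bread", "butter", "jam"},
    {"bread", "milk"},
    {"milk", "eggs"},
    {"bread", "butter", "milk"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Of the transactions containing the antecedent, the share that also contain the consequent."""
    return (support(antecedent | consequent, transactions)
            / support(antecedent, transactions))

rule_support = support({"bread", "butter"}, baskets)            # 3 of 5 baskets
rule_confidence = confidence({"bread"}, {"butter"}, baskets)    # 3 of the 4 bread baskets
```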

2.3 DIMENSIONALITY REDUCTION:

 It is a way of converting a higher-dimensional dataset into a
lower-dimensional dataset while ensuring that it still provides similar
information.

 It is commonly used in fields that deal with high-dimensional data, such
as speech recognition, signal processing, bioinformatics, etc. It can also be
used for data visualization, noise reduction, cluster analysis, etc.
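One widely used dimensionality reduction technique is principal component analysis (PCA), which the notes above do not name; it is used here only as an illustration. The sketch reduces 2-D points to 1-D by projecting them onto the direction of largest variance. The function name and the data are assumptions.

```python
import math

def pca_2d_to_1d(points):
    """Project 2-D points onto their first principal axis."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    centered = [(x - mx, y - my) for x, y in points]
    sxx = sum(x * x for x, _ in centered) / n
    syy = sum(y * y for _, y in centered) / n
    sxy = sum(x * y for x, y in centered) / n
    # Angle of the direction with the largest variance (valid for the 2-D case).
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
    ux, uy = math.cos(theta), math.sin(theta)
    return [x * ux + y * uy for x, y in centered]

points = [(1.0, 1.0), (2.0, 2.0), (3.0, 3.0), (4.0, 4.0)]  # made-up data on y = x
reduced = pca_2d_to_1d(points)   # one number per point instead of two
```

Because these points lie exactly on a line, the single remaining coordinate keeps all of the spread in the data.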

3. REINFORCEMENT LEARNING:
 Reinforcement Learning is a feedback-based Machine learning
technique in which an agent learns to behave in an environment by
performing the actions and seeing the results of actions.
 No predefined data.
 An agent interacts with its environment by performing actions and
learning from errors.
 It is less common and much more complex, but it has generated
incredible results. It doesn’t use labels as such, and instead uses
rewards to learn.
Example: An agent receives a reward or a penalty as feedback after each action
and adjusts its behaviour to collect more reward.
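The reward-driven loop can be sketched with a very small epsilon-greedy agent choosing between two actions. The fixed reward values, the exploration rate, and the variable names are all assumptions for illustration; this is not a full reinforcement learning algorithm.

```python
import random

random.seed(0)
rewards = [0.2, 1.0]      # hypothetical environment: fixed reward for each action
estimates = [0.0, 0.0]    # the agent's current estimate of each action's value
counts = [0, 0]
epsilon = 0.2             # share of steps spent exploring at random

for step in range(500):
    if random.random() < epsilon:
        action = random.randrange(2)                         # explore
    else:
        action = max(range(2), key=lambda a: estimates[a])   # exploit best known
    reward = rewards[action]                                 # feedback from the environment
    counts[action] += 1
    # Running average of the rewards seen so far for this action.
    estimates[action] += (reward - estimates[action]) / counts[action]

best_action = max(range(2), key=lambda a: estimates[a])
```

There is no labeled data set here; the agent learns which action is better only from the rewards it receives.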

PROCESS:

Data Collection: Collect the data that the algorithm will learn from.

Data Preparation: Format and engineer the data into the optimal format,
extracting important features and performing dimensionality reduction.

Training: Also known as the fitting stage, this is where the Machine
Learning algorithm actually learns from the data that has been
collected and prepared.

Evaluation: Test the model to see how well it performs.


Tuning: Fine-tune the model to maximise its performance.

PROBLEMS NOT TO BE SOLVED:

Some limitations that machine learning has not yet solved are as follows:

Limitation 1 — Ethics:

If my self-driving car kills someone on the road, whose fault is it?

Limitation 2 — Deterministic Problems:

A neural network does not understand Newton’s second law, or that density
cannot be negative — there are no physical constraints.

Limitation 3 — Data:

Lack of Good Data: A good example of this is a neural network. Neural
networks are data-eating machines that require copious amounts of training data.
The larger the architecture, the more data is needed to produce viable results.
Reusing data is a bad idea, and data augmentation is useful to some extent, but
having more data is always the preferred solution.

Limitation 4 — Misapplication:

 P-hacking
 Scope of the analysis

Limitation 5 — Interpretability

LANGUAGE AND TOOLS:

Here is a list of languages that support ML development:


Python
R
Matlab
Octave
Julia
C++

Here is a list of IDEs which support ML development:


RStudio
PyCharm
IPython/Jupyter Notebook
Julia
Spyder
Anaconda
Rodeo
Google Colab

Here is a list of platforms on which ML applications can be deployed:


IBM
Microsoft Azure
Google Cloud
Amazon
MLflow

TRADITIONAL PROGRAMMING:

 Develop an algorithm (x = a + b)
 Implement the algorithm in code

Input parameters -> Implemented algorithm -> Result

Machine Learning:

 Data collection and preparation


 Experimenting with different algorithms to build a better model.

Data -> ML model -> Result
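The contrast between the two approaches can be sketched as follows. In the traditional version the rule x = a + b is written by hand; in the machine learning version a simple linear model recovers the same rule from examples by gradient descent. The example data, learning rate, and number of iterations are assumptions chosen so that this toy problem converges.

```python
# Traditional programming: the rule is written by the programmer.
def add(a, b):
    return a + b

# Machine learning: the rule is estimated from (input, result) examples.
examples = [((1.0, 2.0), 3.0), ((2.0, 1.0), 3.0),
            ((3.0, 5.0), 8.0), ((4.0, 2.0), 6.0)]

w1, w2 = 0.0, 0.0          # model: prediction = w1 * a + w2 * b
learning_rate = 0.01
for _ in range(5000):
    g1 = g2 = 0.0
    for (a, b), target in examples:
        error = (w1 * a + w2 * b) - target
        g1 += 2 * error * a / len(examples)   # gradient of mean squared error
        g2 += 2 * error * b / len(examples)
    w1 -= learning_rate * g1
    w2 -= learning_rate * g2

learned = w1 * 10.0 + w2 * 20.0   # prediction for inputs the model never saw
```

After training, both weights are close to 1, so the learned model behaves like `add` even though the addition rule was never coded explicitly.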


COMMON ISSUES IN MACHINE LEARNING:

1. Inadequate Training Data:

 Noisy data
 Incorrect data
 Poor generalization of output data

2. Poor quality of data:

Noisy data, incomplete data, inaccurate data, and unclean data lead to less
accuracy in classification and low-quality results.
3. Non-representative training data:

To make sure our model generalizes well, we have to ensure that the sample
training data is representative of the new cases that we need to generalize to.
4. Overfitting and Underfitting:

Overfitting: When a machine learning model fits its training data too
closely, it starts capturing the noise and inaccurate entries in the training data
set as if they were real patterns. It negatively affects the performance of the
model on new data.

A common reason behind overfitting is using a model that is too flexible for
the amount of data available, such as highly non-linear or non-parametric
methods, which can build unrealistic models of the data. Simpler models, such as
linear and parametric algorithms, are less prone to overfitting.

Methods to reduce overfitting:

 Increase training data in a dataset.


 Early stopping during the training phase
 Reduce the noise
 Reduce the number of attributes in training data.
 Constraining the model.

Underfitting: Underfitting is just the opposite of overfitting. It occurs when a
machine learning model is too simple, or is trained with too little data or for
too short a time, to capture the underlying pattern. As a result, it gives
inaccurate predictions on both the training data and new data.

Methods to reduce Underfitting:

 Increase model complexity


 Remove noise from the data
 Train on more and better features
 Reduce the constraints
 Increase the number of epochs to get better results.

5. Data Bias:
Data bias occurs when certain elements of the dataset are weighted more heavily
or represented more often than others. Biased data leads to inaccurate results,
skewed outcomes, and other analytical errors.

Methods to remove Data Bias:

 Research more for customer segmentation.


 Be aware of your general use cases and potential outliers.
 Combine inputs from multiple sources to ensure data diversity.
 Include bias testing in the development process.
 Analyze data regularly and keep tracking errors to resolve them easily.
 Review the collected and annotated data.
 Use multi-pass annotation for tasks such as sentiment analysis, content
moderation, and intent recognition.

Some other issues are as follows:

 Monitoring and maintenance


 Getting bad recommendations
 Lack of skilled resources
 Process Complexity of Machine Learning
 Lack of Explainability
 Slow implementations and results
 Irrelevant features

PREPARATION OF MODEL:
MACHINE LEARNING ACTIVITIES:

Algorithm: A Machine Learning algorithm is a set of rules and statistical
techniques used to learn patterns from data and draw significant information from
it. It is the logic behind a Machine Learning model. An example of a Machine
Learning algorithm is the Linear Regression algorithm.

Model: A model is the main component of Machine Learning. A model is trained
by using a Machine Learning Algorithm. An algorithm maps all the decisions that
a model is supposed to take based on the given input, in order to get the correct
output.

Predictor Variable: It is a feature(s) of the data that can be used to predict the
output.

Response Variable: It is the feature or the output variable that needs to be
predicted by using the predictor variable(s).

Training Data: The Machine Learning model is built using the training data. The
training data helps the model to identify key trends and patterns essential to predict
the output.

Testing Data: After the model is trained, it must be tested to evaluate how
accurately it can predict an outcome. This is done by the testing data set.

Machine Learning Process:


 Define the objective of the Problem Statement
 Data Gathering
 Data Preparation
 Exploratory Data Analysis
 Building a Machine Learning Model
 Model Evaluation & Optimization
 Predictions

Define the objective of the Problem Statement:


To build a good model, we must first understand the problem in detail:
what needs to be predicted and how the prediction will be used. A clear objective
also avoids wasted effort in the later stages.
It is also essential to take mental notes on what kind of data can be used to
solve this problem or the type of approach you must follow to get to the solution.

Data Gathering:

At this stage, you must be asking questions such as,


 What kind of data is needed to solve this problem?
 Is the data available?
 How can I get the data?

Data collection can be done manually or by web scraping. There are thousands of
data resources on the web; you can just download a data set and get going.
Careful data collection also helps to reduce bias in the ML model; hence always
analyze the collected data and ensure that the data set was collected from
diverse people, geographical areas, and perspectives. For example, the data
needed for weather forecasting includes measures such as humidity level,
temperature, pressure, locality, whether or not you live in a hill station, etc.
Such data must be collected and stored for analysis.

Data Preparation:

The data you collect is almost never in the right format. You will encounter a
lot of inconsistencies in the data set, such as missing values, redundant variables,
duplicate values, etc. Removing such inconsistencies is essential because they
might lead to wrong computations and predictions. Therefore, at this stage, you
scan the data set for any inconsistencies and fix them then and there.
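A minimal sketch of this cleaning step, using made-up weather rows: exact duplicate rows are dropped, and a missing temperature is filled with the mean of the known values. Filling with the mean is only one possible choice, and the field names are hypothetical.

```python
rows = [
    {"temperature": 30.0, "humidity": 70.0},
    {"temperature": None, "humidity": 65.0},   # missing value
    {"temperature": 30.0, "humidity": 70.0},   # duplicate of the first row
    {"temperature": 24.0, "humidity": 80.0},
]

# Drop exact duplicates while keeping the original order.
seen = set()
unique_rows = []
for row in rows:
    key = tuple(sorted(row.items()))
    if key not in seen:
        seen.add(key)
        unique_rows.append(row)

# Fill missing temperatures with the mean of the known temperatures.
known = [r["temperature"] for r in unique_rows if r["temperature"] is not None]
mean_temp = sum(known) / len(known)
cleaned = [
    {**r, "temperature": mean_temp if r["temperature"] is None else r["temperature"]}
    for r in unique_rows
]
```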

Exploratory Data Analysis:

Exploratory Data Analysis is the brainstorming stage of Machine Learning.
Data Exploration involves understanding the patterns and trends in the data. At this
stage, all the useful insights are drawn and correlations between the variables are
understood.

For example, in the case of predicting rainfall, we know that there is a strong
possibility of rain if the temperature has fallen low. Such correlations must be
understood and mapped at this stage.

Building a Machine Learning Model:

All the insights and patterns derived during Data Exploration are used to
build the Machine Learning Model. This stage always begins by splitting the data
set into two parts, training data, and testing data. The training data will be used to
build and analyze the model. The logic of the model is based on the Machine
Learning Algorithm that is being implemented.

In the case of predicting rainfall, since the output will be in the form of
True (if it will rain tomorrow) or False (no rain tomorrow), we can use a
Classification Algorithm such as Logistic Regression.
Choosing the right algorithm depends on the type of problem you’re trying
to solve, the data set and the level of complexity of the problem. In the upcoming
sections, we will discuss the different types of problems that can be solved by
using Machine Learning.
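The split into training data and testing data mentioned at the start of this stage can be sketched as below. The 25% test fraction, the fixed seed, and the function name are assumptions made for this example.

```python
import random

def train_test_split(rows, test_fraction=0.25, seed=42):
    """Shuffle a copy of the rows and split it into (train, test)."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)   # seeded so the split is repeatable
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

data = list(range(20))          # stand-in for 20 data rows
train, test = train_test_split(data)
```

Shuffling before splitting helps to avoid a test set that contains only the last rows collected.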

Model Evaluation & Optimization:

After building a model by using the training data set, it is finally time to put
the model to a test. The testing data set is used to check the efficiency of the model
and how accurately it can predict the outcome. Once the accuracy is calculated,
any further improvements in the model can be implemented at this stage. Methods
like parameter tuning and cross-validation can be used to improve the performance
of the model.
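Accuracy and cross-validation can be sketched as follows. The helper names and the toy labels are hypothetical; the second function only produces the index splits for k-fold cross-validation and leaves model training out.

```python
def accuracy(y_true, y_pred):
    """Share of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k roughly equal folds."""
    indices = list(range(n_samples))
    fold_size = n_samples // k
    for i in range(k):
        start = i * fold_size
        stop = start + fold_size if i < k - 1 else n_samples
        yield indices[:start] + indices[stop:], indices[start:stop]

score = accuracy([True, False, True, True], [True, False, False, True])
folds = list(k_fold_indices(10, 5))
```

In cross-validation the model is trained and scored once per fold, and the scores are averaged, so every sample is used for testing exactly once.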

Predictions:

Once the model is evaluated and improved, it is finally used to make
predictions. The final output can be a Categorical variable (e.g. True or False) or it
can be a Continuous Quantity (e.g. the predicted value of a stock).

DATA TYPES IN MACHINE LEARNING:

Data types are a way of classification that specifies which type of value a
variable can store and which mathematical, relational, or logical operations
can be applied to the variable without causing an error.
In machine learning, it is very important to know the appropriate data types of
the independent and dependent variables.
1. NUMERICAL / QUANTITATIVE DATA TYPE:

This data type consists of numerical values, i.e. anything which is
measured by numbers.

E.g., profit, quantity sold, height, weight, temperature, etc.

This is again of two types:

1.1 Discrete Data Type: Numeric data which has discrete values or whole
numbers. A value of this type has no proper meaning if expressed in decimal
format. Its values can be counted.

E.g.: number of cars you have, number of marbles in a container, students in a
class, etc.
1.2 Continuous Data Type: Numerical measures which can take any value
within a certain range. A value of this type has true meaning when expressed
in decimal format. Its values cannot be counted but are measured. The number
of possible values is infinite.
E.g.: height, weight, time, area, distance, measurement of rainfall, etc.

2. Qualitative Data Type: These are the data types that cannot be expressed in
numbers. This describes categories or groups and is hence known as the categorical
data type.

A. Structured Data:
This type of data is either numbers or words. It can take numerical values, but
mathematical operations cannot be meaningfully performed on them. This type of
data is expressed in tabular format.

E.g.: Sunny=1, Cloudy=2, Windy=3, or binary-form data like 0 or 1, good or
bad, etc.
B. Unstructured Data: This type of data does not have a proper format and
is therefore known as unstructured data. This comprises textual data, sounds,
images, videos, etc.

There are also other types, referred to as data type preliminaries or data measures:

 Nominal
 Ordinal
 Interval
 Ratio

These are also referred to as different scales of measurement.


I. Nominal Data Type: This is used to express names or labels which are
not ordered or measurable. E.g., male or female (gender), race, country, etc.

Fig: Gender (Female, Male), an example of nominal data type

II. Ordinal Data Type: This is also a categorical data type like nominal data but
has some natural ordering associated with it.
E.g., rating scale, shirt sizes, ranks, grades, etc.

Fig: Rating (Good, Average, Poor), an example of ordinal data type

III. Interval Data Type:

This is numeric data which has a proper order and meaningful differences
between values, but no true zero. Here zero does not mean a complete absence;
it is just another value on the scale. This is the local scale. E.g., temperature
measured in degrees Celsius, time, SAT score, credit score, pH, etc. In this
case, there is no absolute zero.

Fig: Temperature, an example of interval data type


IV. Ratio Data Type: This quantitative data type is the same as the interval data
type but has the absolute zero. Here zero means complete absence and the scale
starts from zero. This is the global scale.
E.g., Temperature in Kelvin, Height, Weight
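The scales of measurement affect how values are prepared for a model. The sketch below shows one common treatment, under the assumption that ordinal values are mapped to ordered integers and nominal values are one-hot encoded; the category names and the helper `one_hot` are chosen for this example only. It also shows why ratios are meaningful on the Kelvin scale but not on the Celsius scale.

```python
# Ordinal data: keep the natural order by mapping to increasing integers.
rating_order = {"Poor": 0, "Average": 1, "Good": 2}
ratings = ["Good", "Poor", "Average"]
encoded_ratings = [rating_order[r] for r in ratings]

# Nominal data: no order, so use one column per category (one-hot encoding).
def one_hot(values):
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return rows, categories

genders = ["Female", "Male", "Female"]
encoded_genders, categories = one_hot(genders)

# Interval vs ratio: 20 degrees Celsius is not "twice as hot" as 10 degrees,
# because Celsius has no absolute zero; the ratio is only meaningful in Kelvin.
celsius_a, celsius_b = 10.0, 20.0
kelvin_ratio = (celsius_b + 273.15) / (celsius_a + 273.15)
```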

EXPLORING STRUCTURE OF DATA:

DATA STRUCTURE:

The data structure is defined as the basic building block of computer
programming that helps us to organize, manage and store data for efficient search
and retrieval.

Types of Data Structure:

A data structure arranges data in a defined way and builds on basic data
types such as Integer, String, Boolean, etc.

There are two different types of data structures: Linear and Non-linear data
structures.

1. Linear Data structure:

The linear data structure is a special type of data structure that helps to
organize and manage data in a specific order where the elements are attached
adjacently. There are mainly 4 types of linear data structure as follows:
1.1 Array: An array is a collection of data of similar types. We will use arrays
constantly in machine learning, whether it's:

 To convert the column of a data frame into a list format in pre-processing
analysis
 To order the frequency of words present in datasets.
 Using a list of tokenized words to begin clustering topics.
 In word embedding, by creating multi-dimensional matrices.

Python Array methods (these are the methods of Python's built-in list):

Method      Description
append()    Adds an element at the end of the list.
clear()     Removes all elements from the list.
copy()      Returns a copy of the list.
count()     Returns the number of elements with the specified value.
extend()    Adds the elements of a list to the end of the current list.
index()     Returns the index of the first element with the specified value.
insert()    Adds an element at a specific position using an index number.
pop()       Removes and returns the element at a specified position using an
            index number (the last element if no index is given).
remove()    Removes the first element with the specified value.
reverse()   Reverses the order of the list in place.
sort()      Sorts the list in place.
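The list methods above can be tried on a small made-up list of words. The comments show the state of the list after each call.

```python
words = ["data", "model"]
words.append("train")               # ['data', 'model', 'train']
words.extend(["test", "data"])      # ['data', 'model', 'train', 'test', 'data']
words.insert(1, "feature")          # ['data', 'feature', 'model', 'train', 'test', 'data']
data_count = words.count("data")    # 2 occurrences of 'data'
model_index = words.index("model")  # 'model' is at index 2
last = words.pop()                  # removes and returns the last element, 'data'
words.remove("feature")             # ['data', 'model', 'train', 'test']
words.sort()                        # ['data', 'model', 'test', 'train']
words.reverse()                     # ['train', 'test', 'model', 'data']
snapshot = words.copy()             # independent copy of the current list
words.clear()                       # []
```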


1.2 Stacks: Stacks are based on the concept of LIFO (Last In First Out) or FILO
(First In Last Out).

Stacks enable the undo and redo buttons on your computer, as they
function like a stack of blog posts. There is no sense in adding a post at the
bottom of the stack, and we can only check the most recent one that has been
added. Addition and removal occur at the top of the stack.
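In Python, a plain list can serve as a stack: `append` pushes onto the top and `pop` removes from the top. The undo history below is a made-up example.

```python
history = []                      # the stack; the end of the list is the top
history.append("type 'a'")        # push
history.append("type 'b'")
history.append("delete 'b'")

latest = history[-1]              # peek at the most recent action
undone = history.pop()            # undo: remove the most recent action
```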

1.3 Linked List: A linked list is a collection of several separately
allocated nodes. Each node consists of a value and a pointer that points to the
next node in the list.
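A singly linked list can be sketched with a small node class. The names `Node` and `to_list` are chosen for this example.

```python
class Node:
    """One node of a singly linked list: a value plus a pointer to the next node."""
    def __init__(self, value, next_node=None):
        self.value = value
        self.next = next_node

# Build the list 1 -> 2 -> 3
head = Node(1, Node(2, Node(3)))

def to_list(node):
    """Follow the next pointers and collect the values in order."""
    values = []
    while node is not None:
        values.append(node.value)
        node = node.next
    return values

values = to_list(head)
```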

1.4 Queue: A queue follows the "FIFO" (first in, first out) principle. It is
useful for modelling a queuing scenario in real-time programs, such as people
waiting in line to withdraw cash at the bank. Hence, the queue is significant in a
program where multiple tasks need to be processed in order.
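The bank queue can be sketched with `collections.deque` from the Python standard library, which supports fast removal from the front. The customer names are made up.

```python
from collections import deque

line = deque()
line.append("Asha")       # joins the back of the queue
line.append("Ravi")
line.append("Meena")

served = line.popleft()   # the first person in is the first one served
```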
2. Non-linear Data Structures:
In non-linear data structures, elements are not arranged in a single
sequence. Instead, the elements are arranged and linked with each other in a
hierarchical manner, where one element can be linked with one or more elements.
1. Trees:
Binary Tree: The concept of a binary tree is very similar to a linked list;
the only difference lies in the nodes and their pointers. In a linked list, each node
contains a data value with a pointer that points to the next node in the list,
whereas in a binary tree, each node has two pointers to subsequent nodes instead
of just one.
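A binary tree node can be sketched in the same style as a linked list node, with two pointers instead of one. The class name `TreeNode` and the small example tree are assumptions for illustration.

```python
class TreeNode:
    """One node of a binary tree: a value plus left and right child pointers."""
    def __init__(self, value, left=None, right=None):
        self.value = value
        self.left = left
        self.right = right

#       4
#      / \
#     2   6
#    / \
#   1   3
root = TreeNode(4, TreeNode(2, TreeNode(1), TreeNode(3)), TreeNode(6))

def in_order(node):
    """Visit the left subtree, then the node, then the right subtree."""
    if node is None:
        return []
    return in_order(node.left) + [node.value] + in_order(node.right)

visited = in_order(root)
```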

2. Graphs:
A graph data structure is also very useful in machine learning, for example
for link prediction. A graph consists of nodes and edges; the edges are ordered
pairs of nodes in a directed graph and unordered pairs in an undirected graph.
Hence, you must have good exposure to the graph data
structure for machine learning and deep learning.
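An undirected graph can be stored as a dictionary that maps each node to the set of its neighbours. The sketch below uses the number of common neighbours as a very simple link prediction score; the friendship data and the choice of this score are assumptions for illustration.

```python
# Undirected friendship graph as an adjacency dictionary (made-up data).
friends = {
    "A": {"B", "C"},
    "B": {"A", "C", "D"},
    "C": {"A", "B", "D"},
    "D": {"B", "C"},
}

def common_neighbors(graph, u, v):
    """Nodes that are linked to both u and v."""
    return graph[u] & graph[v]

# A and D are not linked yet, but they share two friends,
# so a link between them could be suggested.
shared = common_neighbors(friends, "A", "D")
score = len(shared)
```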
3. Maps:

Maps are a popular data structure in the programming world, which are
mostly useful for reducing the running time of algorithms and for fast searching
of data. A map stores data in the form of (key, value) pairs, where the key must
be unique; however, the value can be duplicated. Each key corresponds to or maps
to a value; hence it is named a Map.

In different programming languages, core libraries have built-in maps or,
rather, HashMaps with different names for each implementation.
 In Java: Maps
 In Python: Dictionaries
 C++: hash_map, unordered_map, etc.

Python Dictionaries are very useful in machine learning and data science as
various functions and algorithms return the dictionary as an output. Dictionaries
are also much used for implementing sparse matrices, which is very common in
Machine Learning.
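A sparse matrix can be sketched with a dictionary that stores only the non-zero entries, keyed by (row, column). The matrix values and the helper names are made up for this example.

```python
# 2 x 3 matrix with only two non-zero entries:
# [[0.0, 0.0, 3.0],
#  [4.5, 0.0, 0.0]]
sparse = {(0, 2): 3.0, (1, 0): 4.5}

def get(matrix, row, col):
    """Entries that are not stored are treated as zero."""
    return matrix.get((row, col), 0.0)

def matvec(matrix, vector, n_rows):
    """Multiply the sparse matrix by a dense vector, touching only stored entries."""
    result = [0.0] * n_rows
    for (row, col), value in matrix.items():
        result[row] += value * vector[col]
    return result

product = matvec(sparse, [1.0, 1.0, 2.0], n_rows=2)
```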

4. Heap data structure:


Heap is a hierarchically ordered data structure. The heap data structure is
very similar to a tree, but it consists of vertical ordering instead of horizontal
ordering.

Ordering in a heap is applied along the hierarchy but not across it. In a
max-heap, the value of the parent node is always greater than or equal to that of
its child nodes on either the left or the right side.

Here, the insertion and deletion operations are performed on the basis of
promotion. It means, firstly, the element is inserted at the first available
position at the bottom of the heap. After that, it gets compared with its parent
and promoted until it reaches the correct ranking position. Most heap data
structures can be stored in an array, with the relationships between the elements
given by their positions.
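Insertion by promotion can be sketched for a max-heap stored in a Python list, where the parent of the element at index i sits at index (i - 1) // 2. The function name `heap_insert` and the inserted values are chosen for this example. (The standard library module `heapq` provides a min-heap instead.)

```python
def heap_insert(heap, value):
    """Insert into a max-heap stored in a list, promoting the new element as needed."""
    heap.append(value)                 # place at the first free position
    i = len(heap) - 1
    while i > 0:
        parent = (i - 1) // 2
        if heap[i] <= heap[parent]:    # parent is already at least as large: done
            break
        heap[i], heap[parent] = heap[parent], heap[i]   # promote
        i = parent

heap = []
for v in [3, 9, 5, 12, 7]:
    heap_insert(heap, v)
```

After these insertions the largest value is at index 0, and every parent is at least as large as its children.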
