ML Unit1
UNIT -1
How it works?
A Machine Learning system learns from historical data, builds the prediction
models, and whenever it receives new data, predicts the output for it.
Application:
We use machine learning in our daily life, often without knowing it, through
services such as Google Maps, Google Assistant, and Alexa.
Machine learning is also used in self-driving cars, cyber fraud detection, face
recognition, and friend suggestions on Facebook.
1.SUPERVISED LEARNING:
Pre-categorized data
Learning in the presence of an expert/teacher
Data is labeled with a class or variable.
A supervised model learns to predict with the help of a labeled dataset. A labeled
dataset is one where you already know the target answer.
1.1 CLASSIFICATION:
1.2 REGRESSION:
Analyze the data we already have and use it to predict a future (continuous) value.
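As a rough illustration of regression, the sketch below fits a straight line to past data by ordinary least squares and uses it to predict a new value. The experience and salary figures, and the helper name fit_line, are made up for this example; they are not part of the original notes.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums around the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical historical data: years of experience vs. salary (in thousands).
years = [1, 2, 3, 4, 5]
salary = [30, 35, 40, 45, 50]

slope, intercept = fit_line(years, salary)

# Predict the output for an input the model has not seen before.
predicted = slope * 6 + intercept
```

The labeled pairs (years, salary) play the role of the training data; the prediction for 6 years is the "future" value.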
2. UNSUPERVISED LEARNING:
Unsupervised learning is a type of machine learning in which models are
trained using an unlabeled dataset and are allowed to act on that data without
any supervision.
Unlabelled data
No knowledge of the output class or value; a self-guided learning algorithm.
Data is unlabelled and value is unknown.
Group and interpret data based only on input data.
Clustering
Association
Dimensionality reduction
2.1 CLUSTERING:
Example: When we buy a product online, the site suggests other products to
buy along with it. A person who buys bread often also buys butter.
2.2 ASSOCIATION:
Association rule learning is an unsupervised learning method used for
finding relationships between variables in a large database. It determines the
sets of items that occur together in the dataset.
Example: Association rules make marketing strategy more effective. For instance,
people who buy item X (say, bread) also tend to purchase item Y (butter or jam).
A typical example of association rules is Market Basket Analysis.
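The bread-and-butter example can be made concrete with the two standard measures used in market basket analysis: support (how often the items occur together) and confidence (how often Y is bought given that X is bought). The sketch below computes both on a handful of made-up transactions; the data and function names are illustrative assumptions.

```python
# Hypothetical shopping baskets, one set of items per transaction.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "eggs"},
    {"bread", "butter", "jam"},
]


def support(itemset, baskets):
    """Fraction of baskets that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for basket in baskets if itemset <= basket)
    return hits / len(baskets)


def confidence(antecedent, consequent, baskets):
    """Estimated probability of the consequent given the antecedent."""
    both = set(antecedent) | set(consequent)
    return support(both, baskets) / support(antecedent, baskets)


# Bread and butter appear together in 3 of the 5 baskets.
bread_butter_support = support({"bread", "butter"}, transactions)

# Of the 4 baskets containing bread, 3 also contain butter.
bread_to_butter_confidence = confidence({"bread"}, {"butter"}, transactions)
```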
Dimensionality reduction is commonly used in fields that deal with
high-dimensional data, such as speech recognition, signal processing, and
bioinformatics. It can also be used for data visualization, noise reduction,
cluster analysis, etc.
3. REINFORCEMENT LEARNING:
Reinforcement Learning is a feedback-based Machine learning
technique in which an agent learns to behave in an environment by
performing the actions and seeing the results of actions.
No predefined data.
An agent interacts with its environment by performing actions and
learning from errors.
It is less common and much more complex, but it has generated
incredible results. It doesn’t use labels as such, and instead uses
rewards to learn.
Example: feedback.
PROCESS:
Data Collection: Collect the data that the algorithm will learn from.
Data Preparation: Format and engineer the data into the optimal format,
extracting important features and performing dimensionality reduction.
Training: Also known as the fitting stage, this is where the Machine
Learning algorithm actually learns, by being shown the data that has been
collected and prepared.
Some limitations of machine learning that remain unsolved are as follows:
Limitation 1 — Ethics:
A neural network does not understand Newton’s second law, or that density
cannot be negative — there are no physical constraints.
Limitation 3 — Data:
Limitation 4 — Misapplication:
P-hacking
Scope of the analysis
Limitation 5 — Interpretability
TRADITIONAL PROGRAMMING:
Machine Learning:
Noisy Data
Incorrect data
Generalizing of output data
Noisy data, incomplete data, inaccurate data, and unclean data lead to less
accuracy in classification and low-quality results.
3. Non-representative training data:
5. Data Bias:
These errors arise when certain elements of the dataset are weighted more heavily,
or given more importance, than others. Biased data leads to inaccurate results,
skewed outcomes, and other analytical errors.
PREPARATION OF MODEL:
MACHINE LEARNING ACTIVITIES:
Predictor Variable: A feature (or set of features) of the data that can be used to
predict the output.
Training Data: The Machine Learning model is built using the training data. The
training data helps the model to identify key trends and patterns essential to predict
the output.
Testing Data: After the model is trained, it must be tested to evaluate how
accurately it can predict an outcome. This is done by the testing data set.
Data Gathering:
Data collection can be done manually or by web scraping. There are thousands of
data resources on the web, so you can often just download a data set and get going.
Careful data collection helps reduce bias in the ML model; hence, while collecting
data, always analyze it and ensure that the data set comes from diverse people,
geographical areas, and perspectives. The data needed for weather forecasting
includes measures such as humidity level, temperature, pressure, locality,
whether or not you live in a hill station, etc. Such data must be collected and
stored for analysis.
Data Preparation:
The data you collect is almost never in the right format. You will encounter many
inconsistencies in the data set, such as missing values, redundant variables, and
duplicate values. Removing such inconsistencies is essential because they might
lead to wrong computations and predictions. Therefore, at this stage, you scan the
data set for inconsistencies and fix them then and there.
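A minimal sketch of this clean-up step is shown below, assuming the data set is a list of records in which a missing value is stored as None. The weather records and the function name clean are hypothetical.

```python
# Hypothetical weather records; None marks a missing value.
raw_rows = [
    {"temperature": 21.5, "humidity": 80, "rain": True},
    {"temperature": None, "humidity": 65, "rain": False},  # missing value
    {"temperature": 21.5, "humidity": 80, "rain": True},   # duplicate of the first row
    {"temperature": 30.0, "humidity": 40, "rain": False},
]


def clean(rows):
    """Drop rows with missing values, then drop exact duplicates (order kept)."""
    seen = set()
    cleaned = []
    for row in rows:
        if any(value is None for value in row.values()):
            continue  # skip incomplete records
        key = tuple(sorted(row.items()))  # hashable fingerprint of the row
        if key in seen:
            continue  # skip records we have already kept
        seen.add(key)
        cleaned.append(row)
    return cleaned


clean_rows = clean(raw_rows)
```

Dropping incomplete rows is only one option; filling missing values with a typical value is another common choice, depending on how much data can be spared.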
For example, in the case of predicting rainfall, we know that there is a strong
possibility of rain if the temperature has fallen low. Such correlations must be
understood and mapped at this stage.
All the insights and patterns derived during Data Exploration are used to
build the Machine Learning Model. This stage always begins by splitting the data
set into two parts, training data, and testing data. The training data will be used to
build and analyze the model. The logic of the model is based on the Machine
Learning Algorithm that is being implemented.
In the case of predicting rainfall, since the output will be in the form of
True (if it will rain tomorrow) or False (no rain tomorrow), we can use a
Classification Algorithm such as Logistic Regression.
Choosing the right algorithm depends on the type of problem you’re trying
to solve, the data set and the level of complexity of the problem. In the upcoming
sections, we will discuss the different types of problems that can be solved by
using Machine Learning.
After building a model by using the training data set, it is finally time to put
the model to a test. The testing data set is used to check the efficiency of the model
and how accurately it can predict the outcome. Once the accuracy is calculated,
any further improvements in the model can be implemented at this stage. Methods
like parameter tuning and cross-validation can be used to improve the performance
of the model.
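The build-then-test cycle can be sketched as follows. To keep the example short and dependency-free, a simple humidity threshold rule stands in for a real classifier such as Logistic Regression; the rule, the records, and all function names are assumptions made for illustration.

```python
import random

# Hypothetical (humidity %, rained the next day) records.
records = [(90, True), (85, True), (80, True), (75, True), (70, True),
           (60, False), (55, False), (50, False), (45, False), (40, False)]


def train_test_split(rows, test_fraction, seed):
    """Shuffle a copy of the rows and split it into (training, testing) parts."""
    rows = rows[:]
    random.Random(seed).shuffle(rows)
    n_test = int(len(rows) * test_fraction)
    return rows[n_test:], rows[:n_test]


def fit_threshold(train_rows):
    """Learn a cut-off: the midpoint between mean rainy and mean dry humidity."""
    rainy = [h for h, rained in train_rows if rained]
    dry = [h for h, rained in train_rows if not rained]
    return (sum(rainy) / len(rainy) + sum(dry) / len(dry)) / 2


def accuracy(threshold, rows):
    """Fraction of rows where 'humidity above threshold' matches the label."""
    correct = sum(1 for h, rained in rows if (h > threshold) == rained)
    return correct / len(rows)


train_rows, test_rows = train_test_split(records, test_fraction=0.3, seed=0)
threshold = fit_threshold(train_rows)            # built from training data only
test_accuracy = accuracy(threshold, test_rows)   # evaluated on unseen data
```

The key point is that the threshold is learned from the training rows alone, so the accuracy on the testing rows is an honest estimate of how the model behaves on new data.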
Predictions:
Data types are a way of classification that specifies which type of value a
variable can store and which mathematical, relational, or logical operations
can be applied to the variable without causing an error. In machine learning,
it is very important to know the appropriate data types of the independent and
dependent variables.
1.NUMERICAL / QUANTITATIVE DATA TYPE:
1.1 Discrete Data Type: Numeric data that takes discrete values or whole
numbers. A value of this type has no proper meaning if expressed in decimal
format. Such values can be counted.
E.g.: No. of cars you have, no. of marbles in containers, students in a Class, Etc.
1.2 Continuous Data Type: Numerical measures that can take any value
within a certain range. A value of this type keeps its meaning when expressed
in decimal format. Such values cannot be counted, only measured, and a
continuous variable can take infinitely many values within its range.
E.g.: height, weight, time, area, distance, measurement of rainfall, etc.
2. Qualitative Data Type: These are the data types that cannot be expressed in
numbers. This describes categories or groups and is hence known as the categorical
data type.
A. Structured Data:
This type of data consists of either numbers or words. It can take numerical values,
but mathematical operations cannot be meaningfully performed on them. This type of
data is expressed in tabular format.
E.g.: Sunny=1, Cloudy=2, Windy=3, or binary data such as 0 or 1, Good or
Bad, etc.
B. Unstructured Data: This type of data does not have a proper format and is
therefore known as unstructured data. It comprises textual data, sounds, images,
videos, etc.
There are also other types, referred to as data type preliminaries or data measures:
Nominal
Ordinal
Interval
Ratio
II. Ordinal Data Type: This is also a categorical data type like nominal data, but
it has a natural ordering associated with it.
E.g.: rating scales, shirt sizes, ranks, grades, etc.
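Because ordinal categories carry an order, any numeric codes assigned to them should preserve that order. The sketch below encodes shirt sizes by rank; the size list and variable names are assumptions chosen for illustration.

```python
# Assumed ordering of the categories, from smallest to largest.
SHIRT_SIZE_ORDER = ["S", "M", "L", "XL"]

# Map each category to its rank so that comparisons follow the natural order.
size_to_rank = {size: rank for rank, size in enumerate(SHIRT_SIZE_ORDER)}

sizes = ["M", "XL", "S", "L", "M"]

# Replace each label by its rank: S=0, M=1, L=2, XL=3.
encoded = [size_to_rank[size] for size in sizes]

# Sorting by rank orders the labels by size rather than alphabetically.
sorted_sizes = sorted(sizes, key=size_to_rank.get)
```

For nominal data there is no such order, so rank codes like these would suggest a relationship that does not exist.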
DATA STRUCTURE:
A data structure is an organized arrangement of data. Together with data types
such as Integer, String, and Boolean, it tells the compiler how a programmer
intends to use the data.
There are two different types of data structures: Linear and Non-linear data
structures.
A linear data structure organizes and manages data in a specific order, with
the elements attached adjacently. There are mainly 4 types of linear data
structure, as follows:
1.1 Array: An array is a collection of data items of a similar type. We use arrays
constantly in machine learning.
Method      Description
count()     Returns, as an integer, the number of elements with the specified value.
extend()    Adds the elements of a list to the end of the current list.
index()     Returns the index of the first element with the specified value.
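The three methods in the table are methods of Python's built-in list, which is commonly used as an array. A short sketch with made-up values:

```python
scores = [7, 3, 7, 9]

# count(): how many elements are equal to the given value.
sevens = scores.count(7)

# extend(): append every element of another list to the end of this one.
scores.extend([4, 7])

# index(): position of the first element equal to the given value.
first_nine = scores.index(9)
```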
1.2 Stack: Stacks enable the undo and redo buttons on your computer, as they
function like a stack of blog posts. There is no sense in adding a post at the
bottom of the stack, and we can only check the most recent one that has been
added. Addition and removal both occur at the top of the stack.
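A Python list can act as a stack, with append() adding to the top and pop() removing from it. The sketch below mimics an undo history; the action strings are invented for the example.

```python
undo_stack = []

# Each new action is pushed onto the top of the stack.
undo_stack.append("type 'a'")
undo_stack.append("type 'b'")
undo_stack.append("delete 'b'")

# Only the most recent action can be inspected...
latest = undo_stack[-1]

# ...and it is also the first one removed (last in, first out).
undone = undo_stack.pop()
```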
1.3 Linked List: A linked list is a collection of separately allocated nodes.
Each node consists of a value and a pointer that points to the next node in
the list.
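A minimal sketch of a singly linked list, in which each node stores a value and a reference to the next node. The class and method names are illustrative choices, not a standard API.

```python
class Node:
    """One separately allocated element: a value plus a pointer to the next node."""

    def __init__(self, value, next_node=None):
        self.value = value
        self.next = next_node


class LinkedList:
    def __init__(self):
        self.head = None  # an empty list has no first node

    def push_front(self, value):
        """Insert a new node before the current head."""
        self.head = Node(value, self.head)

    def to_list(self):
        """Follow the next pointers from the head and collect the values."""
        values = []
        node = self.head
        while node is not None:
            values.append(node.value)
            node = node.next
        return values


numbers = LinkedList()
for value in (3, 2, 1):
    numbers.push_front(value)
```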
1.4 Queue: A queue follows the "FIFO" (first in, first out) principle. It is useful
for modeling a queuing scenario in real-time programs, such as people waiting in
line to withdraw cash at the bank. Hence, the queue is significant in a program
where multiple lists of code need to be processed in order.
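The bank-line scenario can be sketched with collections.deque from the Python standard library, which supports efficient removal from the front. The customer names are made up.

```python
from collections import deque

bank_line = deque()

# Customers join at the back of the line.
bank_line.append("Asha")
bank_line.append("Ben")
bank_line.append("Chitra")

# The customer who arrived first is served first (first in, first out).
first_served = bank_line.popleft()

still_waiting = list(bank_line)
```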
2. Non-linear Data Structures:
In non-linear data structures, elements are not arranged sequentially.
Instead, the elements are arranged and linked with each other in a hierarchical
manner, where one element can be linked with one or more elements.
1.Trees:
Binary Tree: The concept of a binary tree is very similar to that of a linked list;
the only difference lies in the nodes and their pointers. In a linked list, each
node contains a data value with a pointer that points to the next node in the list,
whereas in a binary tree, each node has two pointers to subsequent nodes instead
of just one.
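The sketch below shows a node with two pointers, left and right. To give the insertions a rule to follow, it assumes binary search tree ordering (smaller values go left), which is one common kind of binary tree and not something the notes specify.

```python
class TreeNode:
    """A node with a value and two pointers to subsequent nodes."""

    def __init__(self, value):
        self.value = value
        self.left = None
        self.right = None


def insert(node, value):
    """Insert value using binary search tree ordering (an assumption here)."""
    if node is None:
        return TreeNode(value)
    if value < node.value:
        node.left = insert(node.left, value)
    else:
        node.right = insert(node.right, value)
    return node


def in_order(node):
    """Visit the left subtree, then the node, then the right subtree."""
    if node is None:
        return []
    return in_order(node.left) + [node.value] + in_order(node.right)


root = None
for value in [8, 3, 10, 1, 6]:
    root = insert(root, value)
```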
2. Graphs:
A graph data structure is also very useful in machine learning for
link prediction. A graph consists of nodes and edges; the edges are ordered pairs
of nodes in a directed graph and unordered pairs in an undirected graph. Hence,
you should have good exposure to the graph data structure for machine learning
and deep learning.
3. Maps:
Maps are a popular data structure in the programming world, mostly useful
for reducing the running time of algorithms and for fast searching of data. A map
stores data in the form of (key, value) pairs, where each key must be unique;
however, values can be duplicated. Each key corresponds to, or maps to, a value;
hence it is named a Map.
Python dictionaries are very useful in machine learning and data science, as
various functions and algorithms return a dictionary as output. Dictionaries
are also widely used for implementing sparse matrices, which are very common in
Machine Learning.
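One simple way to hold a sparse matrix in a dictionary is to store only the non-zero entries, keyed by (row, column). The sketch below takes that approach; the helper names and the sample values are assumptions for illustration.

```python
def get_entry(matrix, row, col):
    """Return the stored value, or 0.0 for any entry that is not stored."""
    return matrix.get((row, col), 0.0)


def to_dense(matrix, n_rows, n_cols):
    """Expand the dictionary into a full list-of-lists matrix."""
    return [[get_entry(matrix, r, c) for c in range(n_cols)]
            for r in range(n_rows)]


# Only two of the nine entries of this 3 x 3 matrix are non-zero.
sparse = {(0, 2): 5.0, (2, 1): 1.5}

dense = to_dense(sparse, 3, 3)
```

Each (row, column) key is unique, while the same value may appear under several keys, matching the map rules described above.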
Ordering in a heap data structure is applied along the hierarchy but not across
it: in a max-heap, the value of a parent node is never less than that of its
child nodes, whether on the left or the right side.
Here, the insertion and deletion operations are performed on the basis of
promotion. That is, the element is first inserted at the highest available
position (the first free slot). It is then compared with its parent and promoted
until it reaches its correct ranking position. Most heap data structures can be
stored in an array, with the relationships between the elements given by their
positions.
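The insert-then-promote procedure can be sketched on an array-backed max-heap, where the parent of the element at index i sits at index (i - 1) // 2. The function name heap_insert is a hypothetical choice; note that Python's built-in heapq module implements a min-heap instead.

```python
def heap_insert(heap, value):
    """Insert value into a max-heap stored as a list, promoting it as needed."""
    heap.append(value)            # place it at the first free position
    child = len(heap) - 1
    while child > 0:
        parent = (child - 1) // 2
        if heap[child] <= heap[parent]:
            break                 # parent is at least as large: position found
        # Promote: swap the element with its smaller parent and continue upward.
        heap[child], heap[parent] = heap[parent], heap[child]
        child = parent


heap = []
for value in [3, 9, 5, 10, 1]:
    heap_insert(heap, value)
```

After these insertions the largest value sits at index 0, and every parent is at least as large as its children, although siblings are not ordered relative to each other.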