
1. Discuss the use of a decision tree for classification purposes with an example.

Introduction: Decision Trees are a type of Supervised Machine Learning (that is, the training data specifies both the input and the corresponding output) in which the data is continuously split according to a certain parameter. The tree can be explained by two entities, namely decision nodes and leaves. The leaves are the decisions or the final outcomes, and the decision nodes are where the data is split.

An example of a decision tree can be explained using a simple binary tree. Let's say you want to predict whether a person is fit given information such as age, eating habits, and physical activity. The decision nodes here are questions like 'What is the age?', 'Does the person exercise?', 'Does the person eat a lot of pizza?', and the leaves are outcomes such as 'fit' or 'unfit'. In this case it is a binary classification problem (a yes/no type problem). There are two main types of Decision Trees:

1. Classification trees (Yes/No types)

What we've seen above is an example of a classification tree, where the outcome was a variable like 'fit' or 'unfit'. Here the decision variable is categorical.

2. Regression trees (Continuous data types)

Here the decision or the outcome variable is continuous, e.g. a number like 123.

Working: Now that we know what a Decision Tree is, we'll see how it works internally. There are many algorithms that construct Decision Trees, but one of the best known is the ID3 algorithm. ID3 stands for Iterative Dichotomiser 3. Before discussing the ID3 algorithm, we'll go through a few definitions.

 Entropy:

Entropy, also called Shannon entropy, is denoted by H(S) for a finite set S and is the measure of the amount of uncertainty or randomness in data:

H(S) = - Σ p(x) log2 p(x)

where the sum runs over the possible outcomes x and p(x) is the probability of outcome x. Intuitively, it tells us about the predictability of a certain event. For example, consider a coin toss whose probability of heads is 0.5 and probability of tails is 0.5. Here the entropy is the highest possible (1 bit), since there is no way of determining what the outcome will be. Alternatively, consider a coin which has heads on both sides; the outcome of such a toss can be predicted perfectly, since we know beforehand that it will always be heads. In other words, this event has no randomness, hence its entropy is zero. In general, lower values imply less uncertainty while higher values imply more uncertainty.
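To make the definition concrete, here is a small illustrative Python snippet (the helper function name is our own, not part of any library) that computes Shannon entropy for the two coins described above:

import math

def shannon_entropy(probabilities):
    # H = -sum(p * log2(p)) over outcomes with non-zero probability
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

print(shannon_entropy([0.5, 0.5]))   # fair coin: 1.0 (maximum uncertainty)
print(shannon_entropy([1.0, 0.0]))   # two-headed coin: 0.0 (no randomness)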

 Information Gain:

Information gain, also called Kullback-Leibler divergence and denoted by IG(S, A) for a set S, is the effective change in entropy after deciding on a particular attribute A. It measures the relative change in entropy with respect to the independent variables:

IG(S, A) = H(S) - H(S | A)

Alternatively,

IG(S, A) = H(S) - Σ P(x) * H(x)

where IG(S, A) is the information gain obtained by applying feature A, H(S) is the entropy of the entire set, and the second term calculates the entropy after applying the feature A, with the sum running over the values x of A and P(x) the probability of event x.

Let's understand this with the help of an example. Consider a piece of data collected over the course of 14 days, where the features are Outlook, Temperature, Humidity, and Wind, and the outcome variable is whether golf was played on the day. Now, our job is to build a predictive model which takes the above 4 parameters and predicts whether golf will be played on the day. We'll build a decision tree to do that using the ID3 algorithm.

Day Outlook Temperature Humidity Wind Play Golf


D1 Sunny Hot High Weak No
D2 Sunny Hot High Strong No
D3 Overcast Hot High Weak Yes
D4 Rain Mild High Weak Yes
D5 Rain Cool Normal Weak Yes
D6 Rain Cool Normal Strong No
D7 Overcast Cool Normal Strong Yes
D8 Sunny Mild High Weak No
D9 Sunny Cool Normal Weak Yes
D10 Rain Mild Normal Weak Yes
D11 Sunny Mild Normal Strong Yes
D12 Overcast Mild High Strong Yes
D13 Overcast Hot Normal Weak Yes
D14 Rain Mild High Strong No
The ID3 algorithm performs the following tasks recursively:

1. Create a root node for the tree.
2. If all examples are positive, return the leaf node 'positive'.
3. Else if all examples are negative, return the leaf node 'negative'.
4. Calculate the entropy of the current state, H(S).
5. For each attribute x, calculate the entropy with respect to that attribute, denoted by H(S, x).
6. Select the attribute which has the maximum value of IG(S, x).
7. Remove the attribute that offers the highest IG from the set of attributes.
8. Repeat until we run out of attributes, or the decision tree consists entirely of leaf nodes.
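The entropy and information-gain steps above translate directly into code. Below is a minimal, illustrative Python sketch (the function names and the dictionary-based data layout are our own choices, not from any particular library) using a few rows of the golf table:

import math
from collections import Counter

def entropy(labels):
    # Shannon entropy H(S) of a list of class labels
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(examples, attribute, target):
    # IG(S, attribute) = H(S) - sum over values v of |Sv|/|S| * H(Sv)
    base = entropy([row[target] for row in examples])
    total = len(examples)
    remainder = 0.0
    for value in set(row[attribute] for row in examples):
        subset = [row[target] for row in examples if row[attribute] == value]
        remainder += (len(subset) / total) * entropy(subset)
    return base - remainder

# A few rows of the golf data set from the table above
data = [
    {"Outlook": "Sunny", "Wind": "Weak", "PlayGolf": "No"},
    {"Outlook": "Sunny", "Wind": "Strong", "PlayGolf": "No"},
    {"Outlook": "Overcast", "Wind": "Weak", "PlayGolf": "Yes"},
    {"Outlook": "Rain", "Wind": "Weak", "PlayGolf": "Yes"},
]

print(entropy([row["PlayGolf"] for row in data]))      # H(S) for these four rows
print(information_gain(data, "Outlook", "PlayGolf"))   # IG(S, Outlook) for these four rows

The same two functions, applied to the full 14-row table, reproduce the numbers computed by hand in the rest of this section.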
Now, let's go ahead and grow the decision tree. The initial step is to calculate H(S), the Entropy of the
current state. In the above example, we can see in total there are 5 No’s and 9 Yes’s.

Yes No Total
9 5 14

Remember that the entropy is 0 if all members belong to the same class, and 1 when half of them belong to one class and the other half to the other class, i.e. perfect randomness. Here

H(S) = -(9/14) log2(9/14) - (5/14) log2(5/14) ≈ 0.94,

which means the distribution is fairly random. Now, the next step is to choose the attribute that gives us the highest possible information gain, and that attribute becomes the root node. Let's start with 'Wind':

IG(S, Wind) = H(S) - Σ P(x) * H(x),

where x ranges over the possible values of the attribute. Here, the attribute 'Wind' takes two possible values in the sample data, hence x = {Weak, Strong}. We'll have to calculate H(S_weak) and H(S_strong). Amongst all the 14 examples we have 8 places where the wind is Weak and 6 where the wind is Strong.

Wind = Weak   Wind = Strong   Total
8             6               14
Now, out of the 8 Weak examples, 6 of them were 'Yes' for Play Golf and 2 of them were 'No'. So we have

H(S_weak) = -(6/8) log2(6/8) - (2/8) log2(2/8) ≈ 0.811.

Similarly, out of the 6 Strong examples, we have 3 examples where the outcome was 'Yes' for Play Golf and 3 where it was 'No':

H(S_strong) = -(3/6) log2(3/6) - (3/6) log2(3/6) = 1.0.

Remember, here half the items belong to one class while the other half belong to the other, hence we have perfect randomness. Now we have all the pieces required to calculate the Information Gain:

IG(S, Wind) = H(S) - (8/14) H(S_weak) - (6/14) H(S_strong) = 0.940 - (8/14)(0.811) - (6/14)(1.0) ≈ 0.048.

This is the information gain obtained by considering 'Wind' as the feature. Now we must similarly calculate the information gain for all the features; doing so gives approximately IG(S, Outlook) = 0.246, IG(S, Temperature) = 0.029, IG(S, Humidity) = 0.151 and IG(S, Wind) = 0.048. We can clearly see that IG(S, Outlook) has the highest information gain of 0.246, hence we choose the Outlook attribute as the root node. At this point, the decision tree is just a root node 'Outlook' with three branches: Sunny, Overcast and Rain.

Here we observe that whenever the outlook is Overcast, Play Golf is always 'Yes'. This is no coincidence: the simple subtree results because Outlook gives the highest information gain. Now how do we proceed from this point? We simply apply recursion; you may want to look back at the algorithm steps described earlier. Now that we've used Outlook, we have three attributes remaining: Humidity, Temperature, and Wind. And we had three possible values of Outlook: Sunny, Overcast, Rain. Since the Overcast node already ended up as the leaf node 'Yes', we're left with two subtrees to compute: Sunny and Rain. The table where the value of Outlook is Sunny looks like:

Temperature Humidity Wind Play Golf


Hot High Weak No
Hot High Strong No
Mild High Weak No
Cool Normal Weak Yes
Mild Normal Strong Yes

In a similar fashion, we compute the information-gain values for this Sunny subset, and we see that the highest information gain is given by Humidity. Proceeding in the same way with the Rain subset gives us Wind as the attribute with the highest information gain. The final decision tree looks something like this.
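Based on the table above, that final tree can be summarized in text form as:

Outlook = Overcast -> Yes
Outlook = Sunny    -> check Humidity (High -> No, Normal -> Yes)
Outlook = Rain     -> check Wind (Weak -> Yes, Strong -> No)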

Code: Let’s see an example in Python

import pydotplus
from sklearn.datasets import load_iris
from sklearn import tree
from IPython.display import Image, display

__author__ = "Mayur Kulkarni <mayur.kulkarni@xoriant.com>"


def load_data_set():
    """
    Loads the iris data set

    :return: data set instance
    """
    iris = load_iris()
    return iris


def train_model(iris):
    """
    Train decision tree classifier

    :param iris: iris data set instance
    :return: classifier instance
    """
    clf = tree.DecisionTreeClassifier()
    clf = clf.fit(iris.data, iris.target)
    return clf


def display_image(clf, iris):
    """
    Displays the decision tree image

    :param clf: classifier instance
    :param iris: iris data set instance
    """
    dot_data = tree.export_graphviz(clf, out_file=None,
                                    feature_names=iris.feature_names,
                                    class_names=iris.target_names,
                                    filled=True, rounded=True)
    graph = pydotplus.graph_from_dot_data(dot_data)
    display(Image(data=graph.create_png()))


if __name__ == '__main__':
    iris_data = load_data_set()
    decision_tree_classifier = train_model(iris_data)
    display_image(clf=decision_tree_classifier, iris=iris_data)
Conclusion: Below is a summary of what we've studied in this blog:

1. Entropy measures the discriminatory power of an attribute for the classification task. It quantifies the amount of randomness in an attribute; minimal entropy means the attribute is concentrated in (close to) one class and therefore has good discriminatory power for classification.
2. Information gain is used to rank attributes for splitting at a given node in the tree. Attributes are ranked by information gain in decreasing order.
3. The recursive ID3 algorithm creates the decision tree.

Explain inductive machine learning with an example.

Inductive Inference

Inductive inference is the process of reaching a general conclusion from specific examples.

The general conclusion should apply to unseen examples.

Inductive Learning Hypothesis: any hypothesis found to approximate the target function well over a
sufficiently large set of training examples will also approximate the target function well over other
unobserved examples.

Example:

Identified relevant attributes: x, y, z

x y z
2 3 5
4 6 10
5 2 7

Model 1:

x+y=z

Prediction: if x = 0 and z = 0, then y = 0.

Model 2:

if x = 2 and z = 5, then y = 3.
if x = 4 and z = 10, then y = 6.
if x = 5 and z = 7, then y = 2.
otherwise y = 1.

Model 2 is likely overfitting.
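To make the contrast concrete, here is a small illustrative Python sketch (the function names are ours); Model 1 generalizes from the rule it induced, while Model 2 merely memorizes the training examples and falls back to an arbitrary default:

def model_1(x, z):
    # General rule induced from the data: x + y = z, so y = z - x
    return z - x

def model_2(x, z):
    # Memorized training examples with an arbitrary default of 1
    table = {(2, 5): 3, (4, 10): 6, (5, 7): 2}
    return table.get((x, z), 1)

# Both models agree on the training data...
print(model_1(2, 5), model_2(2, 5))    # 3 3
# ...but only Model 1 gives a justified answer on an unseen example.
print(model_1(3, 9), model_2(3, 9))    # 6 1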

Good: completely consistent with the data.
Bad: no justification in the data for the prediction that y = 1 in all other cases; not in the class of algebraic functions (but nothing was said about the class of descriptions).

Inductive bias: explicit or implicit assumption(s) about what kind of model is wanted.

Typical inductive bias:

 prefer models that can be written in a concise way.


o Select the shortest one.

Example:

 The decision tree ID3 algorithm searches the complete hypothesis space, and there is no
restriction on the number of hypotheses that could eventually be enumerated. However, this
algorithm searches incompletely through the set of possible hypotheses and preferentially selects
those hypotheses that lead to a smaller decision tree. This type of bias is called
a preference (or search) bias.
 In contrast, the version space candidate-elimination algorithm searches through only a subset of
the possible hypotheses (an incomplete hypothesis space), yet searches this space completely.
This type of bias is called a restriction (or language) bias, because the number of possible
hypotheses considered is restricted.
 A preference bias is generally more desirable than a restriction bias, because an algorithm with
this bias is allowed to search through the complete hypothesis space, which is guaranteed to
contain the target function.
o Restricting the hypothesis space being searched (a restriction bias) is less desirable
because the target function may not be within the set of hypotheses considered.

Some languages of interest:


 Conjunctive normal form for Boolean variables A, B, C.
o e.g., ABC
 Disjunctive normal form.
o e.g., A(~B)(~C) + A(~B)C + AB(~C)
 Algebraic expressions.

Positive and Negative Examples

Positive Examples

 Are all true.

x y z
2 3 5
2 5 7
4 6 10

general                             x, y, z ∈ I
more specific                       x, y, z ∈ I+
more specific than the first two    x, y, z ∈ I, 1 < x, y, z < 11
even more specific model            x + y = z

Negative Examples

 Constrain the set of models consistent with the examples.

x y z Decision
2 3 5 Y
2 5 7 Y
4 6 10 Y
2 2 5 N

Search for Description

Description keeps getting larger or longer.

Finite language - algorithm terminates.

Infinite language - algorithm runs

 Forever.
 Until out of memory.
 Until a final answer is reached.

X = example space/instance space (all possible examples)

D = description space (set of descriptions defined as a language L)

 Each description l ∈ L corresponds to a set of examples Xl.

Success Criterion

Look for a description l ∈ L such that l is consistent with all observed examples.

 description can be relaxed to match all instances.


 inductive bias can be expressed in the success criterion.

Example:

L = {x op y = z}, op = {+, -, *, /}

Given a precise specification of language and data, write a program to test descriptions one by one against
the examples.
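A minimal brute-force sketch of such a program (the representation of descriptions below is our own choice):

examples = [(2, 3, 5), (4, 6, 10), (5, 2, 7)]   # (x, y, z) triples from the earlier table

# The finite language L = {x op y = z} with op in {+, -, *, /}
operators = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
    "/": lambda a, b: a / b if b != 0 else None,
}

def consistent(op):
    # Test one description 'x op y = z' against every observed example
    return all(operators[op](x, y) == z for x, y, z in examples)

# Test the |L| descriptions one by one against the |X| examples
for op in operators:
    if consistent(op):
        print(f"Consistent description found: x {op} y = z")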

 finite language: size = |L|


 finite number of examples: size = |X|
 (|L| * |X|)

Why is Machine Learning Hard (Slow)?

It is very difficult to specify a small finite language that contains a description of the examples.

e.g., the set of algebraic expressions on 3 variables is an infinite language

Decision Tree: ID3 Algorithm Maths behind the Algorithm

Decision Tree Algorithm:

The Decision Tree algorithm belongs to the family of supervised learning algorithms. Unlike many other supervised learning algorithms, the decision tree algorithm can be used for solving both regression and classification problems.

Introduction of Terminology:

We will not go into all the details of the terminology here, assuming terms such as root node, decision node, leaf node and branch are already familiar at this point.

How Decision Tree Works:

Decision trees use multiple algorithms to decide to split a node into two or more sub-nodes. The creation
of sub-nodes increases the homogeneity of resultant sub-nodes. In other words, we can say that the purity
of the node increases with respect to the target variable. The decision tree splits the nodes on all available
variables and then selects the split which results in most homogeneous sub-nodes.

The algorithm selection is also based on the type of target variables. Let us look at some algorithms used
in Decision Trees:

ID3 → Iterative Dichotomiser 3

C4.5 → (successor of ID3)

CART → (Classification And Regression Tree)

CHAID → (Chi-square Automatic Interaction Detection; performs multi-level splits when computing classification trees)

MARS → (Multivariate Adaptive Regression Splines)

Here, we look only at the ID3 algorithm:


ID3 Algorithms Intuition:
The ID3 algorithm builds decision trees using a top-down greedy search approach through the space of
possible branches with no backtracking. A greedy algorithm, as the name suggests, always makes the
choice that seems to be the best at that moment.

Steps of the ID3 Algorithm:

1. Select the root node S based on the lowest entropy and highest information gain.

2. On each iteration, the algorithm calculates the entropy and information gain of every attribute that has not been used yet.

3. Select the attribute with the lowest entropy, or equivalently the highest information gain.

4. Then split the set S on that attribute to produce subsets of the data.

5. The algorithm continues to recurse on each subset, making sure that only unused attributes are considered, and thus creates the decision tree.

Before going deeper into the algorithm, let us understand some statistical terms involved in it:

1. Entropy:

A decision tree is built top-down from a root node and involves partitioning the data into subsets that contain instances with similar values (homogeneous). The ID3 algorithm uses entropy to calculate the homogeneity of a sample. If the entropy E(S) = 0, the node is completely homogeneous; we can say it is a leaf node of the tree, so it cannot be divided further. ID3 uses the attribute with the lowest entropy for splitting.

Formula:

E(S) = - Σ p_i log2(p_i)

Here, p_i is the probability of class i.

Some useful relations between probability and entropy (consider binary classification):

1. If both classes have probability 0.5, the entropy is 1.

2. If one class has probability 0 (and the other probability 1), the entropy is 0; such a node is called a leaf node and we stop splitting there.

 Plotting entropy against the class probability p shows that the entropy is 1 at p = 0.5 on the x-axis, and as p moves away from 0.5 (towards 0 or 1) the entropy decreases, reaching 0 at a leaf node.
2. Information Gain:

The information gain is based on the decrease in entropy after a data set is split on an attribute. Constructing a decision tree is all about finding the attribute that returns the highest information gain (i.e., the most homogeneous branches).

3. Weighted Entropy and Information Gain:

Gain(S, A) = Entropy(S) - Σ_v (|Sv| / |S|) × Entropy(Sv)

The second term is a weighted average over all child nodes, where:

1. |Sv| = number of examples in the subset Sv (the child for value v of attribute A)

2. |S| = number of examples in the set S

3. Entropy(Sv) = entropy of the current child node

Example:

Now that we understand all the concepts used in the ID3 algorithm, it is time to apply them to an example. In this example there are 4 features, and based on them the derived tree predicts whether the animal is a Mammal or a Reptile.

1. Find the information gain of all features and select as the root node the feature whose gain is largest among all.

The information gain is computed with the formula above: for each feature in our example (for instance, 'toothed') we calculate the entropy of the whole set H(S), the entropy of each subset produced by the feature, and then the information gain.

1. Calculated information gain:

As per the above results, we determine that the tree selects Hair as the root node and splits further.

2. Split the tree further:

Consider the situation where Hair = True. If we compute the entropy of the subset where Hair is True, it comes out to 0, so we can say that whenever Hair is True the tree predicts Mammal.

Now, see the tree.

3. Let's consider the Hair = False scenario:

Entropy when "Hair = False":

Based on this data, ID3 finds the information gain for all remaining features (excluding Hair). From the information gains obtained, it selects Legs as the child node of the parent Hair and splits the tree accordingly, as we see in the image below.

4. Now consider the Legs branch: find the information gain and split again; note that Hair is not considered any further.

Entropy of the current node Legs:

The information gain is given by the same formula. Considering the highest information gain, the algorithm selects Toothed as the child node, with True classified as Mammal and False as Reptile.

The final tree is given below:


Data Mining Bayesian Classifiers
In numerous applications, the connection between the attribute set and the class variable is non-deterministic. In other words, the class label of a test record cannot be predicted with certainty even if its attribute set is identical to that of some of the training examples. These circumstances may emerge due to noisy data or the presence of certain confounding factors that influence classification but are not included in the analysis. For example, consider the task of predicting whether an individual is at risk of liver illness based on the individual's eating habits and working efficiency. Although most people who eat healthily and exercise consistently have a lower probability of liver disease, they may still develop it due to other factors, for example the consumption of high-calorie street food or alcohol abuse. Determining whether an individual's eating routine is healthy or their workout efficiency is sufficient is also subject to interpretation, which in turn may introduce uncertainties into the learning problem.

Bayesian classification uses Bayes' theorem to predict the occurrence of an event. Bayesian classifiers are statistical classifiers based on the Bayesian understanding of probability, in which a level of belief is expressed as a probability.

Bayes' theorem is named after Thomas Bayes, who first utilized conditional probability to provide an algorithm that uses evidence to calculate limits on an unknown parameter.

Bayes' theorem is expressed mathematically by the following equation:

P(X|Y) = P(Y|X) × P(X) / P(Y)

where X and Y are events and P(Y) ≠ 0.

P(X|Y) is the conditional probability of event X given that Y is true.

P(Y|X) is the conditional probability of event Y given that X is true.

P(X) and P(Y) are the probabilities of observing X and Y independently of each other; these are known as the marginal probabilities.

Bayesian interpretation:

In the Bayesian interpretation, probability measures a "degree of belief," and Bayes' theorem connects the degree of belief in a hypothesis before and after accounting for evidence. For example, consider a coin. If we toss a fair coin, we get either heads or tails, each occurring 50% of the time. If the coin is flipped a number of times and the outcomes observed, the degree of belief may rise, fall, or remain the same depending on the outcomes.

For proposition X and evidence Y,

o P(X), the prior, is the initial degree of belief in X.

o P(X|Y), the posterior, is the degree of belief having accounted for Y.

o The quotient P(Y|X) / P(Y) represents the support Y provides for X.


Bayes' theorem can be derived from the definition of conditional probability:

P(X|Y) = P(X⋂Y) / P(Y)   and   P(Y|X) = P(X⋂Y) / P(X),

where P(X⋂Y) is the joint probability of both X and Y being true. Eliminating P(X⋂Y) between the two equations gives Bayes' theorem.
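As a small, self-contained numeric illustration in Python (the numbers are made up for the example), here is the theorem applied to a diagnostic-test style calculation:

# Hypothetical numbers: 1% of people have a condition (X); a test (Y) detects it
# 95% of the time and gives a false positive 10% of the time.
p_x = 0.01                     # prior P(X)
p_y_given_x = 0.95             # likelihood P(Y|X)
p_y_given_not_x = 0.10         # P(Y|not X)

# Marginal probability of a positive test, P(Y)
p_y = p_y_given_x * p_x + p_y_given_not_x * (1 - p_x)

# Bayes' theorem: posterior P(X|Y)
p_x_given_y = p_y_given_x * p_x / p_y
print(round(p_x_given_y, 3))   # ~0.088: the posterior degree of belief in X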

Bayesian network:

A Bayesian Network falls under the category of Probabilistic Graphical Models (PGM), a class of techniques used to compute uncertainties by utilizing the concept of probability. Generally known as Belief Networks, Bayesian Networks represent uncertainty using Directed Acyclic Graphs (DAG).

A Directed Acyclic Graph is used to show a Bayesian Network, and like any other statistical graph, a DAG consists of a set of nodes and links, where the links signify the connection between the nodes.

The nodes here represent random variables, and the edges define the relationship between these variables. A DAG models the uncertainty of an event taking place based on the Conditional Probability Distribution (CPD) of each random variable. A Conditional Probability Table (CPT) is used to represent the CPD of each variable in the network.
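A minimal sketch of how CPTs define a joint distribution over a two-node DAG Rain → WetGrass (the network, variable names and probabilities below are invented purely for illustration):

# CPT for the root node Rain
p_rain = {True: 0.2, False: 0.8}

# CPT for WetGrass conditioned on its parent Rain
p_wet_given_rain = {
    True:  {True: 0.9, False: 0.1},   # P(WetGrass | Rain=True)
    False: {True: 0.1, False: 0.9},   # P(WetGrass | Rain=False)
}

def joint(rain, wet):
    # The joint probability factorizes along the DAG: P(Rain) * P(WetGrass | Rain)
    return p_rain[rain] * p_wet_given_rain[rain][wet]

# P(Rain=True | WetGrass=True) via Bayes' theorem over the network
p_wet = joint(True, True) + joint(False, True)
print(round(joint(True, True) / p_wet, 3))   # ~0.692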

What is a kernel?

The kernel is the essential foundation of a computer's operating system (OS). It is the core that provides
basic services for all other parts of the OS. It is the main layer between the OS and underlying computer
hardware, and it helps with tasks such as process and memory management, file systems, device control
and networking.

During normal system startup, a computer's basic input/output system, or BIOS, completes a hardware
bootstrap or initialization. It then runs a bootloader which loads the kernel from a storage device -- such
as a hard drive -- into a protected memory space. Once the kernel is loaded into computer memory, the
BIOS transfers control to the kernel. It then loads other OS components to complete the system startup
and make control available to users through a desktop or other user interface.

If the kernel is damaged or cannot load successfully, the computer will be unable to start completely -- if
at all. This will require service to correct hardware damage or restore the operating system kernel to a
working version.

A kernel serves as the bridge between the operating system and hardware.
What is the purpose of the kernel?

In broad terms, an OS kernel performs three primary jobs.

1. It provides the interfaces needed for users and applications to interact with the computer.

2. It launches and manages applications.

3. It manages the underlying system hardware devices.

In more granular terms, accomplishing these three kernel functions involves a range of computer tasks,
including the following:

 loading and managing less-critical OS components, such as device drivers;

 organizing and managing threads and the various processes spawned by running applications;

 scheduling which applications can access and use the kernel, and supervising that use when the
scheduled time occurs;

 deciding which nonprotected user memory space each application process uses;

 handling conflicts and errors in memory allocation and management;

 managing and optimizing hardware resources and dependencies, such as central processing unit (CPU)
and cache use, file system operation and network transport mechanisms;

 managing and accessing input/output devices such as keyboards, mice, disk drives, USB ports, network
adapters and displays; and

 handling device and application system calls using various mechanisms such as hardware interrupts or
device drivers.

Scheduling and management are central to the kernel's operation. Computer hardware can only do one
thing at a time. However, a computer's OS components and applications can spawn dozens and even
hundreds of processes that the computer must host. It's impossible for all of those processes to use the
computer's hardware -- such as a memory address or CPU instruction pipeline -- at the same time. The
kernel is the central manager of these processes. It knows which hardware resources are available and
which processes need them. It then allocates time for each process to use those resources.
The kernel is critical to a computer's operation, and it requires careful protection within the system's
memory. The kernel space it loads into is a protected area of memory. That protected memory space
ensures other applications and data don't overwrite or impair the kernel, causing performance problems,
instability or other negative consequences. Instead, applications are loaded and executed in a generally
available user memory space.

A kernel is often contrasted with a shell, which is the outermost part of an OS that interacts with user
commands. Kernel and shell are terms used more frequently in Unix OSes than in IBM mainframe and
Microsoft Windows systems.

A kernel is not to be confused with a BIOS, which is an independent program stored on a chip within a
computer's circuit board.

Device drivers

A key part of kernel operation is communication with hardware devices inside and outside of the physical
computer. However, it is impractical to write an OS capable of interacting with every possible device in
existence. Instead, kernels rely on the ability of device drivers to add kernel support for specialized
devices, such as printers and graphics adapters.

When an OS is installed on a computer, the installation adds device drivers for any specific devices
detected within the computer. This helps tailor the OS installation to the specific system with just enough
components to support the devices present. When a new or better device replaces an existing device,
the device driver is updated or replaced.

There are several types of device drivers. Each addresses a different data transfer type. The following are
some of the main driver types:

 Character device drivers implement, open, close, read and write data, as well as grant data stream
access for the user space.

 Block device drivers provide device access for hardware that transfers randomly accessible data in
fixed blocks.

 Network device drivers transmit data packets for hardware interfaces that connect to external systems.
Device drivers are classified as kernel or user. A kernel mode device driver is a generic driver that is
loaded along with the OS. These drivers are often suited to small categories of major hardware devices,
such as CPU and motherboard device drivers.

User mode device drivers encompass an array of ad hoc drivers used for aftermarket, user-added devices,
such as printers, graphics adapters, mice, advanced sound systems and other plug-and-play devices.

The OS needs the code that makes up the kernel. Consequently, the kernel code is usually loaded into an
area in the computer storage that is protected so that it will not be overlaid with less frequently used parts
of the OS.

Kernel mode vs. user mode

Computer designers have long understood the importance of security and the need to protect critical
aspects of the computer's behavior. Long before the internet, or even the emergence of networks,
designers carefully managed how software components accessed system hardware and resources.
Processors were developed to support two operating modes: kernel mode and user mode.

Kernel mode

Kernel mode refers to the processor mode that enables software to have full and unrestricted access to the
system and its resources. The OS kernel and kernel drivers, such as the file system driver, are loaded into
protected memory space and operate in this highly privileged kernel mode.

User mode

User mode refers to the processor mode that enables user-based applications, such as a word processor or
video game, to load and execute. The kernel prepares the memory space and resources for that
application's use and launches the application within that user memory space.

User mode applications are less privileged and cannot access system resources directly. Instead, an
application running in user mode must make system calls to the kernel to access system resources. The
kernel then acts as a manager, scheduler and gatekeeper for those resources and works to prevent
conflicting resource requests.
The processor switches to kernel mode as the kernel processes its system calls and then switches back to
user mode to continue operating the application(s).
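As a concrete illustration of this boundary, even a simple file read from a user-mode program is a sequence of system calls into the kernel. A minimal Python sketch (Python's os module wraps the underlying open/read/close system calls; the file path is an arbitrary example and assumes a Unix-like system):

import os

# Each of these calls crosses from user mode into kernel mode and back:
fd = os.open("/etc/hostname", os.O_RDONLY)   # open() system call
data = os.read(fd, 128)                      # read() system call
os.close(fd)                                 # close() system call

print(data.decode().strip())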

It's worth noting that kernel and user modes are processor states and have nothing to do with actual solid-
state memory. There is nothing intrinsically safe or protected about the memory used for kernel mode.
Kernel driver crashes and memory failures within the kernel memory space can still crash the OS and the
computer.

Types of kernels

Kernels fall into three architectures: monolithic, microkernel and hybrid. The main difference between these types is the number of address spaces they support.

 A microkernel runs user processes and services and kernel services in separate address spaces.

 A monolithic kernel implements services in the same address space.

 A hybrid kernel, such as the Microsoft Windows NT and Apple XNU kernels, attempts to combine the
behaviors and benefits of microkernel and monolithic kernel architectures.

Overall, these kernel implementations present a tradeoff: microkernels offer modularity, customization and fault isolation, while monolithic kernels offer simpler and faster communication between services at the cost of that flexibility.

Some specific differences among the three kernel types include the following:

Microkernels

Microkernels run most of their services outside the kernel address space, in user space. For their communication protocol, microkernels use message passing, which sends data packets, signals and functions to the correct processes. Microkernels also provide greater flexibility than monolithic kernels; to add a new service, admins modify the user address space for a microkernel.

Because of their isolated nature, microkernels are more secure than monolithic kernels. They remain unaffected if one service within the address space fails.
Monolithic kernels

Monolithic kernels are larger than microkernels, because they house both kernel and user services in the
same address space. Monolithic kernels use a faster system call communication protocol than
microkernels to execute processes between the hardware and software. They are less flexible than
microkernels and require more work; admins must reconstruct the entire kernel to support a new service.

Monolithic kernels pose a greater security risk to systems than microkernels because, if a service fails,
then the entire system shuts down. Monolithic kernels also don't require as much source code as a
microkernel, which means they are less susceptible to bugs and need less debugging.

The Linux kernel is a monolithic kernel that is constantly growing; it had 20 million lines of code in
2018. From a foundational level, it is layered into a variety of subsystems. These main groups include a
system call interface, process management, network stack, memory management, virtual file system and
device drivers.

Administrators can port the Linux kernel into their OSes and run live updates. These features, along with
the fact that Linux is open source, make it more suitable for server systems and environments that require
real-time maintenance.

Hybrid kernels

Apple developed the XNU OS kernel in 1996 as a hybrid of the Mach and Berkeley Software Distribution
(BSD) kernels and paired it with an Objective-C application programming interface or API. Because it is
a combination of the monolithic kernel and microkernel, it has increased modularity, and parts of the OS
gain memory protection.

History and development of the kernel

Before the kernel, developers coded actions directly to the processor, instead of relying on an OS to
complete interactions between hardware and software.

The first attempt to create an OS that used a kernel to pass messages was in 1969 with the RC 4000
Multiprogramming System. Programmer Per Brinch Hansen discovered it was easier to create a nucleus
and then build up an OS, instead of converting existing OSes to be compatible with new hardware. This
nucleus -- or kernel -- contained all source code to facilitate communications and support systems,
eliminating the need to directly program on the CPU.

After RC 4000, Bell Labs researchers started work on Unix, which radically changed OS development
and kernel development and integration. The goal of Unix was to create smaller utilities that do specific
tasks well instead of having system utilities try to multitask. From a user standpoint, this simplifies
creating shell scripts that combine simple tools.
As Unix adoption increased, the market started to see a variety of Unix-like computer OSes, including
BSD, NeXTSTEP and Linux. Unix's structure perpetuated the idea that it was easier to build a kernel on
top of an OS that reused software and had consistent hardware, instead of relying on a time-shared system
that didn't require an OS.

Unix brought OSes to more individual systems, but researchers at Carnegie Mellon expanded kernel
technology. From 1985 to 1994, they expanded work on the Mach kernel. Unlike BSD, the Mach kernel
is OS-agnostic and supports multiple processor architectures. Researchers made it binary-compatible with
existing BSD software, enabling it to be available for immediate use and continued experimentation.

The Mach kernel's original goal was to be a cleaner version of Unix and a more portable version of
Carnegie Mellon's Accent interprocessor communications (IPC) kernel. Over time, the kernel brought
new features, such as ports and IPC-based programs, and ultimately evolved into a microkernel.

Shortly after the Mach kernel, in 1986, Vrije Universiteit Amsterdam developer Andrew Tanenbaum
released MINIX (mini-Unix) for educational and research uses. This distribution contained a microkernel-
based structure, multitasking, protected mode, extended memory support and an American National
Standards Institute C compiler.

The next major advancement in kernel technology came in 1992, with the release of the Linux kernel.
Founder Linus Torvalds developed it as a hobby, but he still licensed the kernel under general public
license, making it open source. It was first released with 176,250 lines of code.

The majority of OSes -- and their kernels -- can be traced back to Unix, but there is one outlier: Windows.
With the popularity of DOS- and IBM-compatible PCs, Microsoft developed the NT kernel and based its
OS on DOS. That is why writing commands for Windows differs from Unix-based systems.

What is bias in machine learning?


Bias is a phenomenon that skews the result of an algorithm in favor or against an idea.

Bias is considered a systematic error that occurs in the machine learning model itself due to incorrect
assumptions in the ML process.

Technically, we can define bias as the error between average model prediction and the ground truth.
Moreover, it describes how well the model matches the training data set:
 A model with a higher bias would not match the data set closely.
 A low bias model will closely match the training data set.

Characteristics of a high bias model include:

 Failure to capture proper data trends


 Potential towards underfitting
 More generalized/overly simplified
 High error rate

What is variance in machine learning?


Variance refers to the changes in the model when using different portions of the training data set.
Simply stated, variance is the variability in the model prediction—how much the ML function can adjust
depending on the given data set. Variance comes from highly complex models with a large number of
features.

 Models with high bias will have low variance.


 Models with high variance will have a low bias.

All of these contribute to the flexibility of the model. For instance, a high-bias model that does not match the data set closely will be an inflexible model with low variance, which results in a suboptimal machine learning model.

Characteristics of a high variance model include:

 Noise in the data set


 Potential towards overfitting
 Complex models
 Trying to put all data points as close as possible


Underfitting & overfitting


The terms underfitting and overfitting refer to how the model fails to match the data. The fitting of a
model directly correlates to whether it will return accurate predictions from a given data set.

 Underfitting occurs when the model is unable to match the input data to the target data. This happens
when the model is not complex enough to match all the available data and performs poorly with the
training dataset.
 Overfitting relates to instances where the model tries to match non-existent data. This occurs when
dealing with highly complex models where the model will match almost all the given data points and
perform well in training datasets. However, the model would not be able to generalize the data point in
the test data set to predict the outcome accurately.

Bias vs variance: A trade-off


Bias and variance are inversely connected. It is impossible to have an ML model with a low bias and a
low variance.

When a data engineer modifies the ML algorithm to better fit a given data set, it will lead to low bias—
but it will increase variance. This way, the model will fit with the data set while increasing the chances of
inaccurate predictions.
The same applies when creating a low variance model with a higher bias. While it will reduce the risk of
inaccurate predictions, the model will not properly match the data set.
It's a delicate balance between bias and variance. Importantly, however, having a higher variance does not by itself indicate a bad ML algorithm; machine learning algorithms should be able to handle some variance.

We can tackle the trade-off in multiple ways.

Increasing the complexity of the model can decrease the overall bias while increasing the variance to an acceptable level. This aligns the model with the training dataset without incurring significant variance errors.

Increasing the training data set can also help to balance this trade-off, to some extent. This is the preferred method when dealing with overfitting models, and with a large data set it allows users to increase the complexity of the model without the variance errors that would otherwise pollute it.

A large data set offers more data points for the algorithm to generalize from easily. However, the catch is that underfitting (high bias) models are not very sensitive to the size of the training data set. Therefore, increasing the amount of data is the preferred solution mainly when dealing with high variance models.

This table lists common algorithms and their expected behavior regarding bias and variance:

Algorithm           Bias        Variance
Linear Regression   High Bias   Less Variance
Decision Tree       Low Bias    High Variance
Bagging             Low Bias    High Variance (less than Decision Tree)
Random Forest       Low Bias    High Variance (less than Decision Tree and Bagging)


Bias & variance calculation example


Let’s put these concepts into practice—we’ll calculate bias and variance using Python.
The simplest way to do this would be to use a library called mlxtend (machine learning extension), which
is targeted for data science tasks. This library offers a function called bias_variance_decomp that we can
use to calculate bias and variance.
We will be using the Iris data set included in mlxtend as the base data set and carry out the bias_variance_decomp using two algorithms: Decision Tree and Bagging.

Decision tree example



from mlxtend.evaluate import bias_variance_decomp
from sklearn.tree import DecisionTreeClassifier
from mlxtend.data import iris_data
from sklearn.model_selection import train_test_split

# Get Data Set
X, y = iris_data()
X_train_ds, X_test_ds, y_train_ds, y_test_ds = train_test_split(X, y,
                                                                test_size=0.3,
                                                                random_state=123,
                                                                shuffle=True,
                                                                stratify=y)

# Define Algorithm
tree = DecisionTreeClassifier(random_state=123)

# Get Bias and Variance - bias_variance_decomp function
avg_expected_loss, avg_bias, avg_var = bias_variance_decomp(
    tree, X_train_ds, y_train_ds, X_test_ds, y_test_ds,
    loss='0-1_loss',
    random_seed=123,
    num_rounds=1000)

# Display Bias and Variance
print(f'Average Expected Loss: {round(avg_expected_loss, 4)}\n')
print(f'Average Bias: {round(avg_bias, 4)}')
print(f'Average Variance: {round(avg_var, 4)}')
Result:
Bagging example

from mlxtend.evaluate import bias_variance_decomp
from sklearn.tree import DecisionTreeClassifier
from mlxtend.data import iris_data
from sklearn.model_selection import train_test_split
from sklearn.ensemble import BaggingClassifier

# Get Data Set
X, y = iris_data()
X_train_ds, X_test_ds, y_train_ds, y_test_ds = train_test_split(X, y,
                                                                test_size=0.3,
                                                                random_state=123,
                                                                shuffle=True,
                                                                stratify=y)

# Define Algorithm
tree = DecisionTreeClassifier(random_state=123)
bag = BaggingClassifier(base_estimator=tree,
                        n_estimators=100,
                        random_state=123)

# Get Bias and Variance - bias_variance_decomp function
avg_expected_loss, avg_bias, avg_var = bias_variance_decomp(
    bag, X_train_ds, y_train_ds, X_test_ds, y_test_ds,
    loss='0-1_loss',
    random_seed=123,
    num_rounds=1000)

# Display Bias and Variance
print(f'Average Expected Loss: {round(avg_expected_loss, 4)}\n')
print(f'Average Bias: {round(avg_bias, 4)}')
print(f'Average Variance: {round(avg_var, 4)}')
Result:
Linear Discriminant Analysis (LDA) in Machine Learning

Linear Discriminant Analysis (LDA) is one of the commonly used dimensionality reduction techniques
in machine learning to solve more than two-class classification problems. It is also known as Normal
Discriminant Analysis (NDA) or Discriminant Function Analysis (DFA).

LDA can be used to project the features of a higher-dimensional space into a lower-dimensional space in order to reduce resources and dimensional costs. In this topic, "Linear Discriminant Analysis (LDA) in machine learning", we will discuss the LDA algorithm for classification predictive modeling problems, the limitations of logistic regression, the representation of the linear Discriminant analysis model, how to make a prediction using LDA, how to prepare data for LDA, extensions to LDA, and much more. So, let's start with a quick introduction to Linear Discriminant Analysis (LDA) in machine learning.
Note: Before starting this topic, it is recommended to learn the basics of Logistic Regression algorithms
and a basic understanding of classification problems in machine learning as a prerequisite

What is Linear Discriminant Analysis (LDA)?

Whereas the logistic regression algorithm is limited to two-class problems, linear Discriminant analysis is applicable to classification problems with more than two classes.

Linear Discriminant analysis is one of the most popular dimensionality reduction techniques used for
supervised classification problems in machine learning. It is also considered a pre-processing step for
modeling differences in ML and applications of pattern classification.

Whenever there is a requirement to separate two or more classes having multiple features efficiently, the Linear Discriminant Analysis model is considered the most common technique to solve such classification problems. For example, if we have two classes with multiple features and need to separate them efficiently, classifying them using a single feature alone may result in overlapping.

To overcome the overlapping issue in the classification process, we must keep increasing the number of features used.

Example:

Let's assume we have to classify two different classes having two sets of data points in a 2-dimensional plane, as shown in the image below. It may be impossible to draw a straight line in the 2-D plane that separates these data points efficiently, but using linear Discriminant analysis we can reduce the 2-D plane to a 1-D line. Using this technique, we can also maximize the separability between multiple classes.

How Linear Discriminant Analysis (LDA) works?

Linear Discriminant analysis is used as a dimensionality reduction technique in machine learning, using
which we can easily transform a 2-D and 3-D graph into a 1-dimensional plane.

Let's consider an example where we have two classes in a 2-D plane having an X-Y axis, and we need to classify them efficiently. LDA uses the X-Y axis to create a new axis, separating the classes with a straight line and projecting the data onto that new axis.

Hence, we can maximize the separation between these classes and reduce the 2-D plane into 1-D.
To create a new axis, Linear Discriminant Analysis uses the following criteria:

o It maximizes the distance between means of two classes.


o It minimizes the variance within the individual class.

Using the above two conditions, LDA generates a new axis in such a way that it can maximize the
distance between the means of the two classes and minimizes the variation within each class.

In other words, we can say that the new axis will increase the separation between the data points of the
two classes and plot them onto the new axis.
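As a short illustration using scikit-learn (here on the Iris data set, which has three classes and four features, projected down to two discriminant axes; the parameter choices are ours):

from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)

# Fit LDA both as a classifier and as a supervised dimensionality reducer
lda = LinearDiscriminantAnalysis(n_components=2)
X_projected = lda.fit_transform(X, y)     # 4-D features -> 2-D discriminant space

print(X_projected.shape)                  # (150, 2)
print(lda.score(X, y))                    # training accuracy of the LDA classifier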

Why LDA?
o Logistic Regression is one of the most popular classification algorithms that perform well for
binary classification but falls short in the case of multiple classification problems with well-
separated classes. At the same time, LDA handles these quite efficiently.
o LDA can also be used in data pre-processing to reduce the number of features, just as PCA,
which reduces the computing cost significantly.
o LDA is also used in face detection algorithms. In Fisherfaces, LDA is used to extract useful data
from different faces. Coupled with eigenfaces, it produces effective results.

Drawbacks of Linear Discriminant Analysis (LDA)

Although LDA is specifically used to solve supervised classification problems with two or more classes, which standard logistic regression cannot handle, LDA fails in some cases as well, for example when the means of the class distributions are shared. In that case, LDA cannot find a new axis that makes both classes linearly separable.
To overcome such problems, we use non-linear Discriminant analysis in machine learning.

Extension to Linear Discriminant Analysis (LDA)

Linear Discriminant analysis is one of the most simple and effective methods to solve classification
problems in machine learning. It has so many extensions and variations as follows:

1. Quadratic Discriminant Analysis (QDA): for multiple input variables, each class deploys its own estimate of variance.
2. Flexible Discriminant Analysis (FDA): used when non-linear combinations of inputs, such as splines, are used.
3. Regularized Discriminant Analysis (RDA): this introduces regularization into the estimate of the variance (actually covariance) and hence moderates the influence of different variables on LDA.

Real-world Applications of LDA

Some of the common real-world applications of Linear discriminant Analysis are given below:

o Face Recognition
Face recognition is the popular application of computer vision, where each face is represented as
the combination of a number of pixel values. In this case, LDA is used to minimize the number of
features to a manageable number before going through the classification process. It generates a
new template in which each dimension consists of a linear combination of pixel values. If a linear
combination is generated using Fisher's linear discriminant, then it is called Fisher's face.
o Medical
In the medical field, LDA has a great application in classifying the patient disease on the basis of
various parameters of patient health and the medical treatment which is going on. On such
parameters, it classifies disease as mild, moderate, or severe. This classification helps the doctors
in either increasing or decreasing the pace of the treatment.
o Customer Identification
In customer identification, LDA is applied to identify and select the features that characterize the group of customers who are likely to purchase a specific product in a shopping mall. This can be helpful when we want to identify the group of customers who most often purchase a particular product.
o For Predictions
LDA can also be used for making predictions and hence in decision making. For example, "will you buy this product?" will give a predicted result of one of two possible classes: buying or not buying.
o In Learning
Nowadays, robots are being trained for learning and talking to simulate human work, and it can
also be considered a classification problem. In this case, LDA builds similar groups on the basis
of different parameters, including pitches, frequencies, sound, tunes, etc.

Difference between Linear Discriminant Analysis and PCA

Below are some basic differences between LDA and PCA:


o PCA is an unsupervised algorithm that does not care about classes and labels and only aims to
find the principal components to maximize the variance in the given dataset. At the same time,
LDA is a supervised algorithm that aims to find the linear discriminants to represent the axes that
maximize separation between different classes of data.
o LDA is much more suitable for multi-class classification tasks than PCA. However, PCA is assumed to perform comparatively well for small sample sizes.
o Both LDA and PCA are used as dimensionality reduction techniques; in practice, PCA is often applied first, followed by LDA.

How to Prepare Data for LDA

Below are some suggestions that one should always consider while preparing the data to build the LDA
model:

o Classification Problems: LDA is mainly applied for classification problems to classify the
categorical output variable. It is suitable for both binary and multi-class classification problems.
o Gaussian Distribution: The standard LDA model applies the Gaussian Distribution of the input
variables. One should review the univariate distribution of each attribute and transform them into
more Gaussian-looking distributions. For e.g., use log and root for exponential distributions and
Box-Cox for skewed distributions.
o Remove Outliers: It is good to firstly remove the outliers from your data because these outliers
can skew the basic statistics used to separate classes in LDA, such as the mean and the standard
deviation.
o Same Variance: As LDA always assumes that all the input variables have the same variance, it is always better to standardize the data before implementing an LDA model. After standardization, each variable has a mean of 0 and a standard deviation of 1.

Pruning in Machine Learning

Introduction

Pruning is a technique in machine learning that involves reducing the size of a trained model by eliminating some of its parameters. The objective of pruning is to obtain a smaller, faster, and more efficient model while maintaining its accuracy. Pruning can be especially useful for huge and complex models, where reducing their size can lead to significant improvements in speed and efficiency.

Types of Pruning Techniques:

There are two principal types of pruning techniques: unstructured and structured pruning. Unstructured
pruning involves eliminating individual parameters or connections from the model, resulting in a smaller
and sparser model. Structured pruning involves eliminating groups of parameters, such as whole filters,
channels, or neurons.
Structured Pruning:

Structured pruning involves eliminating whole structures or groups of parameters from the model, such as whole neurons, channels, or filters. This sort of pruning preserves the underlying structure of the model, meaning that the pruned model will have the same overall architecture as the original model, but with fewer parameters.

Structured pruning is suitable for models with a structured architecture, such as convolutional neural networks (CNNs), where the parameters are organized into filters, channels, and layers. It is also easier to carry out than unstructured pruning since it preserves the structure of the model.

Unstructured Pruning:

Unstructured pruning involves eliminating individual parameters from the model without regard to their location in the model. This sort of pruning does not preserve the underlying structure of the model, meaning that the pruned model will have a different architecture from the original model. Unstructured pruning is suitable for models without a structured architecture, such as fully connected neural networks, where the parameters can be treated as a single matrix. It can be more effective than structured pruning since it allows for more fine-grained pruning; however, it can also be more difficult to implement.

Criteria for Selecting a Pruning Technique:

The decision of which pruning technique to use depends on several factors, such as the type of model, the availability of computing resources, and the degree of accuracy desired. For instance, structured pruning is more suitable for convolutional neural networks, while unstructured pruning is more pertinent for fully connected networks. The decision to prune should also consider the trade-off between model size and accuracy. Other factors to consider include the complexity of the model, the size of the training data, and the performance metrics of the model.

Pruning in Neural Networks:

Neural networks are a kind of machine learning model that can benefit greatly from pruning. The objective of pruning in neural networks is to reduce the number of parameters in the network, thereby producing a smaller and faster model without sacrificing accuracy.

There are several types of pruning techniques that can be applied to neural networks, including weight pruning, neuron pruning, channel pruning, and filter pruning.

1. Weight Pruning

Weight pruning is the most common pruning technique used in neural networks. It involves setting some of the weights in the network to zero or eliminating them. This results in a sparser network that is faster and more efficient than the original network. Weight pruning can be done in more than one way, including magnitude-based pruning, which removes the smallest-magnitude weights, and iterative pruning, which removes weights gradually during training. A short code sketch of magnitude-based weight pruning follows the list of pruning types below.
2. Neuron Pruning

Neuron pruning involves eliminating whole neurons from the network. This can be useful for reducing the size of the network and improving its speed and efficiency. Neuron pruning can be done in more ways than one, including threshold-based pruning, which removes neurons with small activation values, and sensitivity-based pruning, which removes neurons that only slightly affect the result.

3. Channel Pruning

Channel pruning is a technique used in convolutional neural networks (CNNs) that involves eliminating whole channels from the network. A channel in a CNN corresponds to a group of filters that learn to detect a specific feature. Eliminating unnecessary channels can decrease the size of the network and improve its speed and efficiency without sacrificing accuracy.

4. Filter Pruning

Filter pruning involves eliminating whole filters from the network. A filter in a CNN corresponds to a set of weights that learn to detect a specific feature. Eliminating unnecessary filters can decrease the size of the network and improve its speed and efficiency without sacrificing accuracy.
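As an illustration of the magnitude-based weight pruning mentioned above, here is a small sketch using PyTorch's pruning utilities (the layer sizes and the pruning amount are arbitrary choices for the example):

import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

# A toy fully connected layer
layer = nn.Linear(in_features=64, out_features=32)

# Magnitude-based (L1) unstructured pruning: zero out the 30% smallest weights
prune.l1_unstructured(layer, name="weight", amount=0.3)

# The pruning mask is applied on the fly; make it permanent if desired
prune.remove(layer, "weight")

sparsity = (layer.weight == 0).float().mean().item()
print(f"Weight sparsity after pruning: {sparsity:.0%}")   # roughly 30%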

Pruning in Decision Trees:

Pruning can also be applied to decision trees, which are a kind of machine learning model that learns a series of binary decisions based on the input features. Decision trees can become very large and complex, leading to overfitting and reduced generalisation ability. Pruning can be used to eliminate unnecessary branches and nodes from the decision tree, resulting in a smaller and simpler model that is less likely to overfit.
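For example, scikit-learn's decision trees support cost-complexity (post-)pruning through the ccp_alpha parameter; below is a small illustrative sketch (the alpha value is an arbitrary choice for the example):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# An unpruned tree versus a cost-complexity pruned tree
full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
pruned_tree = DecisionTreeClassifier(random_state=0, ccp_alpha=0.02).fit(X_train, y_train)

print("Unpruned:", full_tree.get_n_leaves(), "leaves, test accuracy", full_tree.score(X_test, y_test))
print("Pruned:  ", pruned_tree.get_n_leaves(), "leaves, test accuracy", pruned_tree.score(X_test, y_test))

The pruned tree typically has fewer leaves while keeping comparable test accuracy.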

Pruning in Support Vector Machines:

Pruning can also be applied to support vector machines (SVMs), which are a kind of machine learning model that separates data points into classes using a hyperplane. SVMs can become very large and complicated, resulting in slow and wasteful predictions. Pruning can be used to eliminate unnecessary support vectors from the model, resulting in a smaller and faster model that is still accurate.

Advantages
o Decreased model size and complexity. Pruning can significantly reduce the number of parameters in a machine learning model, leading to a smaller and simpler model that is easier to train and deploy.
o Faster inference. Pruning can decrease the computational cost of making predictions, leading to faster and more efficient predictions.
o Improved generalization. Pruning can prevent overfitting and improve the generalization ability of the model by reducing its complexity.
o Increased interpretability. Pruning can result in a simpler and more interpretable model, making it easier to understand and explain the model's decisions.
Disadvantages
o Possible loss of accuracy. Pruning can sometimes result in a loss of accuracy, especially if too many parameters are pruned or if pruning is not done carefully.
o Increased training time. Pruning can increase the training time of the model, especially if it is done iteratively during training.
o Difficulty in choosing the right pruning technique. Choosing the right pruning technique can be challenging and may require domain expertise and experimentation.
o Risk of over-pruning. Over-pruning can lead to an overly simplified model that is not accurate enough for the task.

Pruning vs Other Regularization Techniques:


1. Pruning is one of numerous regularization techniques used in machine learning to prevent overfitting and improve the generalization ability of the model.
2. Other popular regularization techniques include L1 and L2 regularization, dropout, and early stopping.
3. Compared with other regularization techniques, pruning has the advantage of reducing the model size and complexity, leading to faster inference and improved interpretability.
4. However, pruning can also have a higher computational cost during training, and its effect on the model's performance can be less predictable than that of other regularization techniques.

Practical Considerations for Pruning


o Choose the right pruning technique:

The choice of pruning technique depends on the specific characteristics of the model and the task at hand. Structured pruning is suitable for models with a structured architecture, while unstructured pruning is suitable for models without a structured architecture.

o Decide the pruning rate:

The pruning rate determines the proportion of parameters to be pruned. It should be chosen carefully to balance the reduction in model size against the loss of accuracy.

o Evaluate the effect on the model's performance:

The effect of pruning on the model's accuracy should be evaluated using appropriate metrics, such as validation accuracy or test accuracy.

o Consider iterative pruning:

Iterative pruning involves pruning the model several times during training, which can lead to better results than a single pruning at the end of training.

o Combine pruning with other regularization techniques:

Pruning can be combined with other regularization techniques, such as L1 and L2 regularization or dropout, to improve the model's performance further.

o Beware of over-pruning:

Over-pruning can lead to an overly simplified model that is not accurate enough for the task. Careful attention should be given to choosing the right pruning rate and assessing the effect on the model's accuracy.

Conclusion:

Pruning is a useful technique in machine learning for decreasing the size and complexity of trained models. There are various types of pruning techniques, and selecting the right one depends on various factors. Pruning should be done carefully to achieve the desired balance between model size and accuracy, and it should be evaluated using suitable metrics. Overall, pruning can be an effective method for making smaller, faster, and more efficient models without sacrificing accuracy.
