Mathematics For Machine Learning

Multi-variable Calculus
Applications & theory in Machine learning

Introduction
• Multi-variable calculus is simply calculus involving more than one variable.
• To understand it and apply it to real-life problems, one needs some linear algebra as well. For instance:
  - The derivative of a function of several variables is a linear transformation. If one doesn't know what a linear transformation is, one can't understand the derivative.
  - The chain rule is best understood in terms of products of the matrices which represent the various derivatives.
  - Concepts involving multiple integrals incorporate determinants.
  - An understandable version of the second derivative test involves eigenvalues.
• So in this presentation, before going to the main topic, i.e. multi-variable calculus, we present Basic Linear Algebra: Fundamentals & Applications.
Basic Linear Algebra- Fundamentals & Application
• Under basic linear algebra, the following sections are covered:

Basic Linear Algebra- Contents to be covered
• Fundamentals & Basic notations
• Matrices & Linear Transformation
• Determinants

Fundamentals- Outcomes
• After completing the fundamentals of basic linear algebra, one would be able to:
  - Describe R^n and do algebra with vectors in R^n.
  - Represent a line in 3-space by a vector parameterization, a set of scalar parametric equations, or symmetric form.
  - Find a parameterization of a line given information about
(a) a point of the line and the direction of the line.
(b) two points contained in the line.
  - Determine the direction of a line given its parameterization.

Fundamentals- R^n description
• The notation R^n refers to the collection of ordered lists of n real numbers. More precisely:
• R^n ≡ {(x1,···,xn) : xj ∈ R for j = 1,···,n}, and (x1,···,xn) = (y1,···,yn) if and only if xj = yj for all j = 1,···,n. When (x1,···,xn) ∈ R^n, it is conventional to denote (x1,···,xn) by the single boldface letter x.
• The numbers xj are called the coordinates. The set {(0,···,0,t,0,···,0) : t ∈ R}, with t in the ith slot, is called the ith coordinate axis, or the xi axis for short. The point 0 ≡ (0,···,0) is called the origin. E.g. (1,2,4) ∈ R^3 and (2,1,4) ∈ R^3, but (1,2,4) ≠ (2,1,4) because, even though the same numbers are involved, they don't match up. In particular, the first entries are not equal.
• Why would one be interested in such a thing?
• Consider a hot object which is cooling, and suppose you want to describe the temperature of this object. How many coordinates would be needed? You would need one for the temperature, three for the position of the point in the object, and one more for the time. Thus you would be working in R^5.
• Many other examples can be given.
 Sometimes n is very large. This is often the case in applications to business when they are trying to maximize
profit subject to constraints.
• It also occurs in numerical analysis when people try to solve hard problems on a computer.

Fundamentals- Algebra in R^n
• There are two algebraic operations done with elements of R^n. One is addition and the other is multiplication by numbers, called scalars.
• If x ∈ R^n and a is a number, also called a scalar, then ax ∈ R^n is defined by ax = a(x1,···,xn) ≡ (ax1,···,axn). This is known as scalar multiplication.
• If x, y ∈ R^n then x+y ∈ R^n is defined by x+y = (x1,···,xn)+(y1,···,yn) ≡ (x1+y1,···,xn+yn). This is known as vector addition.
• An element x ≡ (x1,···,xn) of R^n is often called a vector.
• For vectors v, w, z in R^n and scalars (real numbers) α, β, the following hold:
  - the commutative law of addition, v+w = w+v
  - the associative law for addition, (v+w)+z = v+(w+z)
  - the existence of an additive identity, v+0 = v
  - the existence of an additive inverse, v+(−v) = 0
  - also α(v+w) = αv+αw, (α+β)v = αv+βv, α(βv) = (αβ)v, 1v = v.
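The laws above can be spot-checked numerically. The following is a minimal sketch, assuming vectors are stored as Python tuples; the helper names `add` and `scale` and the sample vectors are illustrative choices, not part of the original slides.

```python
def add(x, y):
    """Vector addition in R^n: add corresponding coordinates."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same number of coordinates")
    return tuple(a + b for a, b in zip(x, y))


def scale(a, x):
    """Scalar multiplication in R^n: multiply every coordinate by a."""
    return tuple(a * xi for xi in x)


# Sample data (hypothetical): three vectors in R^3 and two scalars.
v, w, z = (1, 2, 4), (2, 1, 4), (0, -3, 5)
alpha, beta = 3, -2
zero = (0, 0, 0)

checks = {
    "commutative law": add(v, w) == add(w, v),
    "associative law": add(add(v, w), z) == add(v, add(w, z)),
    "additive identity": add(v, zero) == v,
    "additive inverse": add(v, scale(-1, v)) == zero,
    "a(v+w) = av+aw": scale(alpha, add(v, w)) == add(scale(alpha, v), scale(alpha, w)),
    "(a+b)v = av+bv": scale(alpha + beta, v) == add(scale(alpha, v), scale(beta, v)),
    "a(bv) = (ab)v": scale(alpha, scale(beta, v)) == scale(alpha * beta, v),
    "1v = v": scale(1, v) == v,
}
for name, holds in checks.items():
    print(name, holds)
```

Integer coordinates are used so that every comparison is exact; with floating point values the equalities would only hold up to rounding error.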
Fundamentals- Geometric meaning of vector addition in R^n
• It was explained earlier that an element of R^n is an n-tuple of numbers, and that this can be used to determine a point in three-dimensional space in the case where n = 3 and in two-dimensional space in the case where n = 2. This point is specified relative to some coordinate axes.
• Thus the geometric significance of (d,e,f) + (a,b,c) = (d+a,e+b,f+c) is as follows:
  - Start with the position vector of the point (d,e,f), whose point is at (d,e,f).
  - Place the vector determined by (a,b,c) with its tail at (d,e,f).
  - Then the point of this last vector will be (d+a,e+b,f+c).
[Figure: the vectors u = (d,e,f), v = (a,b,c) and u+v drawn relative to the x1, x2 and x3 axes.]

Fundamentals- Lines & distances
• To begin with, consider the cases n = 1, 2.
• In the case where n = 1, the only line is just R^1 = R. Therefore, if x1 and x2 are two different points in R, consider
x = x1 + t(x2 − x1),
where t ∈ R; the totality of all such points gives R. You can always solve the above equation for t, showing that every point of R is of this form.
• Now consider the plane. Let (x1,y1) and (x2,y2) be two different points in R^2 which are contained in a line l. If x1 ≠ x2, the point slope form of the line is y − y1 = ((y2 − y1)/(x2 − x1))(x − x1), and setting t = (x − x1)/(x2 − x1) gives
(x,y) = (x1,y1) + t(x2 − x1, y2 − y1).
If x1 = x2, then in place of the point slope form above, the line is x = x1. Since the two given points are different, y1 ≠ y2, and so you still obtain the above formula for the line. Because of this, a line in R^n containing two different points p and q is defined as the collection of points of the form x = p + t(q − p), where t ∈ R.
• How is the distance between two points in R^n defined?
• Let x = (x1,···,xn) and y = (y1,···,yn) be two points in R^n. Then |x−y| indicates the distance between these points and is defined as
distance between x and y ≡ |x−y| ≡ sqrt((x1 − y1)^2 + ··· + (xn − yn)^2).
This is called the distance formula.
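As a small illustration of the distance formula and of the parameterization of a line through two different points, here is a sketch in Python. The function names `distance` and `point_on_line` and the two sample points are assumptions made for this example.

```python
import math


def distance(x, y):
    """Distance formula in R^n: square root of the sum of squared coordinate differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def point_on_line(p, q, t):
    """The point p + t(q - p) on the line through two different points p and q."""
    return tuple(a + t * (b - a) for a, b in zip(p, q))


# Sample points in R^3 (hypothetical values).
p, q = (1.0, 2.0, 4.0), (2.0, 1.0, 4.0)

print(distance(p, q))            # equals sqrt(2), since the differences are (1, -1, 0)
print(point_on_line(p, q, 0.0))  # t = 0 gives p
print(point_on_line(p, q, 1.0))  # t = 1 gives q

# The midpoint (t = 1/2) is at half the distance from p.
mid = point_on_line(p, q, 0.5)
print(distance(p, mid), distance(p, q) / 2)
```

Values of t between 0 and 1 give points on the segment from p to q; other values of t give the rest of the line.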
Fundamentals- Geometric Meaning Of Scalar Multiplication In R^3
• As discussed earlier, x = (x1,x2,x3) determines a vector: draw the line from 0 to x, placing the point of the vector on x.
• The length of this vector is defined to equal |x|. Thus the length of x equals sqrt(x1^2 + x2^2 + x3^2).
• When you multiply x by a scalar α, you get (αx1,αx2,αx3), and the length of this vector is
sqrt((αx1)^2 + (αx2)^2 + (αx3)^2) = |α| sqrt(x1^2 + x2^2 + x3^2).
Thus the following holds:
|αx| = |α||x|.
• In other words, multiplication by a scalar magnifies or shrinks the length of the vector by the factor |α|.
• If α is negative, the resulting vector points in the opposite direction, while if α > 0 the direction the vector points is preserved.
• One way to see this is to first observe that if α ≠ 1, then x and αx are both points on the same line through the origin.
Basic Linear Algebra- Matrices And Linear Transformations

Matrices And Linear Transformations- Outcomes
• After completing Matrices and Linear Transformations, one would be able to:
  - Perform the basic matrix operations of matrix addition, scalar multiplication, transposition and matrix multiplication. Identify when these operations are not defined. Represent the basic operations in terms of double subscript notation.
  - Recall and prove algebraic properties for matrix addition, scalar multiplication, transposition, and matrix multiplication. Apply these properties to manipulate an algebraic expression involving matrices.
  - Evaluate the inverse of a matrix using Gauss Jordan elimination.
  - Recall the cancellation laws for matrix multiplication. Demonstrate when cancellation laws do not apply.
  - Recall and prove identities involving matrix inverses.
  - Understand the relationship between linear transformations and matrices.

Matrices And Linear Transformations- Matrix Arithmetic
• When people speak of vectors and matrices, it is common to refer to numbers as scalars.
• A matrix is a rectangular array of numbers. Several of them are referred to as matrices. For example, here is a matrix.
1 2 3 4 5
1 2 3 4 5
1 2 3 4 5
• The size or dimension of a matrix is defined as m×n, where m is the number of rows and n is the number of columns. The entry in the ith row and the jth column of a matrix A is denoted by aij.
• There are various operations which are done on matrices. Matrices can be added, multiplied by a scalar, and multiplied by other matrices.
• To illustrate scalar multiplication: the new matrix is obtained by multiplying every entry of the original matrix by the given scalar. If A is an m×n matrix, −A is defined to equal (−1)A.
By definition (Scalar Multiplication): if A = (aij) and k is a scalar, then kA = (kaij).
• To illustrate matrix addition: two matrices must be the same size to be added. The sum of two matrices is a matrix which is obtained by adding the corresponding entries.
By definition (Addition): if A = (aij) and B = (bij) are two m×n matrices, then A+B = C where C = (cij) for cij = aij + bij.
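The two entrywise definitions above translate directly into code. This is a minimal sketch, assuming a matrix is stored as a list of rows; the names `mat_add` and `mat_scale` and the sample matrices are hypothetical.

```python
def mat_add(A, B):
    """A + B, defined only when A and B have the same size: c_ij = a_ij + b_ij."""
    if len(A) != len(B) or any(len(ra) != len(rb) for ra, rb in zip(A, B)):
        raise ValueError("matrices must be the same size to be added")
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]


def mat_scale(k, A):
    """kA: multiply every entry of A by the scalar k."""
    return [[k * a for a in row] for row in A]


# Two sample 2x3 matrices (hypothetical values).
A = [[1, 2, 3], [4, 5, 6]]
B = [[0, -1, 2], [7, 0, -3]]

print(mat_add(A, B))                 # [[1, 1, 5], [11, 5, 3]]
print(mat_scale(3, A))               # [[3, 6, 9], [12, 15, 18]]
print(mat_add(A, mat_scale(-1, A)))  # A + (-1)A is the zero matrix
```

Adding a 2×3 matrix to a 3×2 matrix raises an error here, mirroring the rule that the operation is not defined for matrices of different sizes.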
 An m×n matrix can be used to transform vectors in Rn to vectors in Rm through the use of matrix multiplication.
 In general mathematically, Let T : Rn → Rm be a function. Thus for each x ∈ Rn,Tx ∈ Rm. Then T is a linear
transformation if, whenever α, β are scalars and x1 and x2 are vectors in R^n,
T(αx1 + βx2) = αTx1 + βTx2.
In words, linear transformations distribute across + and allow you to factor out scalars.
• A linear transformation is called one to one (often written as 1−1) if it never takes two different vectors to the same vector. Thus T is one to one if whenever x ≠ y, Tx ≠ Ty. Equivalently, if T(x) = T(y), then x = y.
• A linear transformation mapping R^n to R^m is called onto if whenever y ∈ R^m there exists x ∈ R^n such that T(x) = y. Thus T is onto if everything in R^m gets hit. In the case that a linear transformation comes from matrix multiplication, it is common to refer to the matrix as onto when the linear transformation it determines is onto.

Matrices And Linear Transformations- Constructing The Matrix Of A Linear Transformation
• It turns out that if T is any linear transformation which maps R^n to R^m, there is always an m×n matrix A with the property that Ax = Tx for all x ∈ R^n. Here is why. Suppose T : R^n → R^m is a linear transformation and you want to find the matrix defined by this linear transformation. Let ei denote the vector in R^n with 1 in the ith slot and zeros elsewhere. Then if x ∈ R^n it follows that
x = x1e1 + ··· + xnen, so Tx = x1T(e1) + ··· + xnT(en),
and so you see that the matrix desired is obtained by letting the ith column equal T(ei).
• Theorem: Let T be a linear transformation from R^n to R^m. Then the matrix A satisfying Ax = Tx is the m×n matrix whose ith column is T(ei).
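The construction above (the ith column of the matrix is T(ei)) can be sketched as follows. The function names and the particular map `T` from R^3 to R^2 are assumptions chosen for illustration only.

```python
def matrix_of(T, n):
    """Matrix of a linear map T on R^n, built column by column from T(e_i)."""
    columns = []
    for i in range(n):
        e_i = [1 if j == i else 0 for j in range(n)]  # ith standard basis vector
        columns.append(T(e_i))
    m = len(columns[0])
    # Convert the list of columns into a list of rows.
    return [[columns[j][i] for j in range(n)] for i in range(m)]


def mat_vec(A, x):
    """Matrix-vector product Ax."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]


def T(x):
    """A hypothetical linear map R^3 -> R^2 used only as an example."""
    x1, x2, x3 = x
    return [x1 + 2 * x2, 3 * x3 - x1]


A = matrix_of(T, 3)
print(A)  # [[1, 2, 0], [-1, 0, 3]]

x = [5, -1, 2]
print(mat_vec(A, x), T(x))  # both are [3, 1]
```

Because T is linear, agreeing with A on the basis vectors is enough for Ax = Tx to hold for every x, which is what the comparison on the last line illustrates for one vector.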
Basic Linear Algebra- Determinants

Determinants- Outcomes
• After completing Determinants, one would be able to:
  - Evaluate the determinant of a square matrix by applying (a) the cofactor formula or (b) row operations.
  - Recall the general properties of determinants.
  - Recall that the determinant of a product of matrices is the product of the determinants. Recall that the determinant of a matrix is equal to the determinant of its transpose.
  - Apply Cramer’s Rule to solve a 2×2 or a 3×3 linear system.
  - Use determinants to determine whether a matrix has an inverse.
  - Evaluate the inverse of a matrix using cofactors.

Determinants- Overview
• Cofactors And 2×2 Determinants
• Let A be an n×n matrix. The determinant of A, denoted det(A), is a number. If the matrix is a 2×2 matrix with rows (a, b) and (c, d), this number is very easy to find: det(A) = ad − bc.
• Minor: the ijth minor of A is the determinant of the (n−1)×(n−1) matrix which results from deleting the ith row and the jth column of A.
• Co-factor: the ijth cofactor of A is (−1)^(i+j) times the ijth minor.

Determinants- Properties
• Properties Of Determinants
• There are many properties satisfied by determinants. Some of these properties have to do with row operations, which are described below.
• The row operations consist of the following:
1. Switch two rows.
2. Multiply a row by a nonzero number.
3. Replace a row by a multiple of another row added to itself.
• Let A be an n×n matrix and let A1 be a matrix which results from multiplying some row of A by a scalar c. Then c det(A) = det(A1).
• Let A be an n×n matrix and let A1 be a matrix which results from switching two rows of A. Then det(A) = −det(A1). Also, if one row of A is a multiple of another row of A, then det(A) = 0.
• Let A be an n×n matrix and let A1 be a matrix which results from applying row operation 3, that is, replacing some row by a multiple of another row added to itself. Then det(A) = det(A1).
• In all of the above theorems you can replace the word “row” with the word “column”.
• There are two other major properties of determinants which do not involve row operations. Let A and B be two n×n matrices. Then
det(AB) = det(A)det(B).
det(A) = det(A^T), where A^T denotes the transpose of A.
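The properties above can be checked on small examples. The sketch below evaluates determinants by cofactor (Laplace) expansion along the first row; the helper names and the sample matrices are assumptions, and the recursive expansion is meant for illustration on small matrices, not for efficiency.

```python
def minor(A, i, j):
    """Matrix obtained from A by deleting row i and column j (0-based)."""
    return [row[:j] + row[j + 1:] for k, row in enumerate(A) if k != i]


def det(A):
    """Determinant by cofactor expansion along the first row."""
    if len(A) == 1:
        return A[0][0]
    return sum((-1) ** j * A[0][j] * det(minor(A, 0, j)) for j in range(len(A)))


def mat_mul(A, B):
    """Matrix product AB."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


# Sample 3x3 matrices (hypothetical values).
A = [[1, 2, 3], [0, 4, 5], [1, 0, 6]]
B = [[2, 0, 1], [1, 3, 0], [0, 1, 1]]

print(det(A), det(B))                     # 22 7
print(det(mat_mul(A, B)))                 # 154 = det(A) * det(B)
print(det(transpose(A)))                  # 22 = det(A)

swapped = [A[1], A[0], A[2]]              # row operation 1: switch two rows
scaled = [[5 * a for a in A[0]], A[1], A[2]]                  # row operation 2
added = [A[0], A[1], [c + 2 * a for a, c in zip(A[0], A[2])]]  # row operation 3
print(det(swapped), det(scaled), det(added))  # -22 110 22
```

For larger matrices, row reduction is usually the more practical way to evaluate a determinant, since cofactor expansion grows factorially in cost.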
 A Formula For The Inverse
 The definition of the determinant in terms of Laplace expansion along a row or column also provides a
way to give a formula for the inverse of a matrix.
Determinants- Application
• Recall the cofactor matrix of A: the matrix which results from replacing the ijth entry of A with the ijth cofactor.
• To find the inverse, take the transpose of the cofactor matrix and divide by the determinant.
• The transpose of the cofactor matrix is called the adjugate, or sometimes the classical adjoint, of the matrix A.
• In other words, when det(A) ≠ 0, A^−1 = (1/det(A)) adj(A), where adj(A) denotes the adjugate matrix of A.
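The inverse formula described above (transpose of the cofactor matrix divided by the determinant) can be sketched in Python with exact fractions. The helper names and the sample matrix are assumptions; the sketch handles square matrices of size 2 or larger.

```python
from fractions import Fraction


def minor(A, i, j):
    """Matrix obtained from A by deleting row i and column j (0-based)."""
    return [row[:j] + row[j + 1:] for k, row in enumerate(A) if k != i]


def det(A):
    """Determinant by cofactor expansion along the first row."""
    if len(A) == 1:
        return A[0][0]
    return sum((-1) ** j * A[0][j] * det(minor(A, 0, j)) for j in range(len(A)))


def inverse(A):
    """Inverse of an integer matrix (n >= 2) as adjugate divided by determinant."""
    n = len(A)
    d = det(A)
    if d == 0:
        raise ValueError("determinant is zero, so the matrix has no inverse")
    cofactor = [[(-1) ** (i + j) * det(minor(A, i, j)) for j in range(n)]
                for i in range(n)]
    # The adjugate is the transpose of the cofactor matrix.
    return [[Fraction(cofactor[j][i], d) for j in range(n)] for i in range(n)]


def mat_mul(A, B):
    """Matrix product AB."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


A = [[1, 2], [3, 4]]       # sample matrix (hypothetical), det(A) = -2
A_inv = inverse(A)
print(A_inv)               # entries -2, 1, 3/2, -1/2 as Fraction objects
print(mat_mul(A, A_inv))   # the identity matrix
```

Using `Fraction` keeps the arithmetic exact, so the product with the original matrix is exactly the identity rather than an approximation.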
 Cramer’s Rule
 This formula for the inverse also implies a famous procedure known as Cramer’s rule. Cramer’s rule
gives a formula for the solutions x to a system of equations Ax = y in the special case that A is a square matrix with det(A) ≠ 0: the ith component of the solution is xi = det(Ai)/det(A), where Ai is the matrix obtained from A by replacing its ith column with y.
Note: this rule does not apply if you have a system of equations in which there is a different number of equations than variables.
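Cramer's rule for a small square system can be sketched as below: each unknown is the determinant of the matrix with one column replaced by the right-hand side, divided by det(A). The function names and the sample system are assumptions for illustration.

```python
from fractions import Fraction


def minor(A, i, j):
    """Matrix obtained from A by deleting row i and column j (0-based)."""
    return [row[:j] + row[j + 1:] for k, row in enumerate(A) if k != i]


def det(A):
    """Determinant by cofactor expansion along the first row."""
    if len(A) == 1:
        return A[0][0]
    return sum((-1) ** j * A[0][j] * det(minor(A, 0, j)) for j in range(len(A)))


def cramer(A, y):
    """Solve Ax = y for a square integer matrix A with nonzero determinant."""
    if any(len(row) != len(A) for row in A) or len(y) != len(A):
        raise ValueError("Cramer's rule needs as many equations as variables")
    d = det(A)
    if d == 0:
        raise ValueError("det(A) = 0, so Cramer's rule does not apply")
    solution = []
    for i in range(len(A)):
        # A_i is A with its ith column replaced by y.
        A_i = [row[:i] + [y[k]] + row[i + 1:] for k, row in enumerate(A)]
        solution.append(Fraction(det(A_i), d))
    return solution


# Sample system (hypothetical): 2x + y = 3, x + 3y = 5.
A = [[2, 1], [1, 3]]
y = [3, 5]
x = cramer(A, y)
print(x)  # the solution x = 4/5, y = 7/5 as Fraction objects
```

Substituting back confirms the result: 2(4/5) + 7/5 = 3 and 4/5 + 3(7/5) = 5.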
Vector calculus in many variables
• Under vector calculus in many variables, the following areas will be explained:

Vector calculus in many variables- Contents to be covered
• Functions of many variables
• Partial Derivatives
• Directional Derivatives
• Gradient & Optimization
Vector calculus in many variables- Functions in many variables

Functions in many variables- Outcomes
• After completing the fundamentals of functions in many variables, one would be able to:
  - Represent a function of two variables by level curves.
  - Identify the characteristics of a function from a graph of its level curves.
  - Recall and use the concept of limit point.
  - Describe the geometrical significance of a directional derivative.
  - Give the relationship between partial derivatives and directional derivatives.
  - Compute partial derivatives and directional derivatives from their definitions.
  - Evaluate higher order partial derivatives.
  - State conditions under which mixed partial derivatives are equal.
  - Describe the gradient of a scalar valued function and use it to compute the directional derivative.
  - Explain why the directional derivative is maximized in the direction of the gradient and minimized in the direction of minus the gradient.

Functions in many variables- The Graph Of A Function Of Two Variables
• In calculus, we are permitted and even required to think in a meaningful way about things which cannot be drawn.
• However, it is certainly interesting to consider some things which can be visualized, and this helps to formulate and understand more general notions which make sense in contexts which cannot be visualized. One of these is the concept of a scalar valued function of two variables.
• Let f(x,y) denote a scalar valued function of two variables evaluated at the point (x,y). Its graph consists of the set of points (x,y,z) such that z = f(x,y).
• How does one go about depicting such a graph?
  - The usual way is to fix one of the variables, say x, and consider the function z = f(x,y) where y is allowed to vary and x is fixed.
  - Graphing this gives a curve which lies in the surface to be depicted. Doing the same thing for other values of x, the result depicts the desired graph.
• [Figure: graph of the function z = cos(x)sin(2x+y), drawn using Maple, a computer algebra system.]
• [Figure: contour map.] This is in two dimensions, and the different lines in two dimensions correspond to points on the three dimensional graph which have the same z value.
 The Graph Of A Function Of Multiple Variables
• A scalar function of three variables cannot be visualized directly because four dimensions are required.
• However, some people like to try to visualize even these examples. This is done by looking at level surfaces in R^3, which are defined as surfaces where the function assumes a constant value.
• They play the role of contour lines for a function of two variables.
• As a simple example, consider f(x,y,z) = x^2 + y^2 + z^2. The level surfaces of this function are concentric spheres centered at 0.
• Another way to visualize objects in higher dimensions involves the use of color and animation.
• So much for art. However, the concept of level curves is quite useful because these can be drawn.
• In the picture, the steepest places are where the contour lines are close together, because the contour lines correspond to various values of the function. You can look at the picture and see where they are close and where they are far apart. This is the advantage of a contour map.

Review Of Limits
• Recall the concept of the limit of a function of many variables. When f : D(f) → R^q, one can only consider limits in a meaningful way at limit points of the set D(f).
• The condition that x must be a limit point of D(f) if you are to take a limit at x is what makes the limit well defined.
Vector calculus in many variables- Directional Derivatives

The Directional Derivative
• The directional derivative is just what its name suggests: it is the derivative of a function in a particular direction.
• [Figure: the situation in the case of a function of two variables.] In this picture, v ≡ (v1,v2) is a unit vector in the xy plane and x0 ≡ (x0,y0) is a point in the xy plane. When (x,y) moves in the direction of v, this results in a change in z = f(x,y), as shown in the picture.
• The directional derivative in this direction is defined as
lim_{t→0} [f(x0 + tv1, y0 + tv2) − f(x0,y0)] / t.
• By definition, for f defined near x ∈ R^n and v a unit vector,
D_v f(x) ≡ lim_{t→0} [f(x + tv) − f(x)] / t.

Vector calculus in many variables- Partial Derivatives

Partial Derivatives
• There are some special unit vectors which come to mind immediately. These are the vectors ei, where
ei = (0,···,0,1,0,···,0)
and the 1 is in the ith position.
• Thus, in the case of a function of two variables, the directional derivative in the direction i = e1 is the slope of the indicated straight line in the following picture. [Figure]
• As in the case of a general directional derivative, you fix y and take the derivative of the function x → f(x,y).
• More generally, even in situations which cannot be drawn, the partial derivative of f with respect to xi at x is defined as the directional derivative in the direction ei:
∂f/∂xi (x) ≡ lim_{t→0} [f(x + t ei) − f(x)] / t.
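Directional and partial derivatives can be estimated numerically from their limit definitions. The sketch below is an approximation under stated assumptions: it uses a symmetric (central) difference quotient with a small fixed step rather than the one-sided limit, and the function `f`, the point `x0` and the step size are hypothetical choices.

```python
import math


def directional_derivative(f, x0, v, t=1e-6):
    """Central-difference estimate of the derivative of f at x0 in the unit direction v."""
    forward = [a + t * b for a, b in zip(x0, v)]
    backward = [a - t * b for a, b in zip(x0, v)]
    return (f(forward) - f(backward)) / (2 * t)


def f(p):
    """Example function f(x, y) = x^2 y + sin(y) (an arbitrary smooth choice)."""
    x, y = p
    return x ** 2 * y + math.sin(y)


x0 = [1.0, 2.0]
e1, e2 = [1.0, 0.0], [0.0, 1.0]

# Partial derivatives are directional derivatives along e1 and e2.
fx = directional_derivative(f, x0, e1)  # exact value is 2xy = 4
fy = directional_derivative(f, x0, e2)  # exact value is x^2 + cos(y) = 1 + cos(2)

# A general unit direction v = (3/5, 4/5).
v = [3 / 5, 4 / 5]
dv = directional_derivative(f, x0, v)

print(fx, fy, dv)
print(fx * v[0] + fy * v[1])  # agrees with dv up to numerical error
```

The last line illustrates that, for a differentiable function, the directional derivative is the combination of the partial derivatives weighted by the components of v.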
Vector calculus in many variables- Gradient & Optimization

Gradient & Optimization- Outcome
• After completing Gradient & Optimization, one would be able to:
  - Interpret the gradient of a function as a normal to a level curve or a level surface.
  - Find the normal line and tangent plane to a smooth surface at a given point.
  - Find the angles between curves and surfaces.
  - Define what is meant by a local extreme point.
  - Find candidates for local extrema using the gradient.
  - Find the local extreme values and saddle points of a C2 function.
  - Use the second derivative test to identify the nature of a singular point.
  - Find the extreme values of a function defined on a closed and bounded region.
  - Solve word problems involving maximum and minimum values.
  - Use the method of Lagrange to determine the extreme values of a function subject to a constraint.
  - Solve word problems using the method of Lagrange multipliers.

Gradient & Optimization- Fundamental Properties
• Let f : U → R where U is an open subset of R^n, and suppose f is differentiable on U. Thus if x ∈ U,
f(x + v) = f(x) + Σ_{j=1}^{n} ∂f/∂xj (x) vj + o(v).
• If f is differentiable at x, then for v a unit vector,
D_v f(x) = Σ_{j=1}^{n} ∂f/∂xj (x) vj.
• When f is differentiable, define
∇f(x) ≡ (∂f/∂x1 (x), ···, ∂f/∂xn (x)).
This defines the gradient for a differentiable scalar valued function. There are ways to define the gradient for vector valued functions. It follows immediately that
D_v f(x) = ∇f(x) · v.
• Let f : U → R be a differentiable function and let x ∈ U. Then
max{D_v f(x) : |v| = 1} = |∇f(x)| and min{D_v f(x) : |v| = 1} = −|∇f(x)|.
When ∇f(x) ≠ 0, the maximum occurs at v = ∇f(x)/|∇f(x)| and the minimum at v = −∇f(x)/|∇f(x)|.

Gradient & Optimization- Tangent Planes
• The gradient has fundamental geometric significance, illustrated by the following picture. [Figure]

Gradient & Optimization- Local Extrema
• Definition:
• A point x∈D(f)⊆R^n is called a local minimum if f(x) ≤ f(y) for all y∈D(f) sufficiently close to x. A point
x∈D(f) is called a local maximum if f(x) ≥ f(y) for all y∈D(f) sufficiently close to x. A local extremum is a point of D(f) which is either a local minimum or a local maximum. The plural of extremum is extrema, the plural of minimum is minima, and the plural of maximum is maxima.
• Procedure:
• To find candidates for local extrema which are interior points of D(f), where f is a differentiable function, you simply identify those points where ∇f equals the zero vector. To justify this, note that the graph of f is the level surface
F(x,z) ≡ z − f(x) = 0
and the local extrema at such interior points must have horizontal tangent planes. Therefore, a normal vector at such points must be a multiple of (0,···,0,1). Thus ∇F at such points must be a multiple of this vector. That is, if x is such a point,
k(0,···,0,1) = (−fx1(x),···,−fxn(x),1).
Thus ∇f(x) = 0.
• A singular point for f is a point x where ∇f(x) = 0. This is also called a critical point.

Gradient & Optimization- The Second Derivative Test
• There is a version of the second derivative test in the case that the function and its first and second partial derivatives are all continuous.
• Definition: The matrix H(x) whose ijth entry at the point x is ∂²f/∂xi∂xj (x) is called the Hessian matrix. The eigenvalues of H(x) are the solutions λ to the equation det(λI − H(x)) = 0.
• If all the eigenvalues of the Hessian matrix at a critical point are positive, then the critical point is a local minimum.
• If all the eigenvalues of the Hessian matrix at a critical point are negative, then the critical point is a local maximum.
• Finally, if some of the eigenvalues of the Hessian matrix at the critical point are positive and some are negative, then the critical point is a saddle point. [Figure: illustration of the situation.]
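For a function of two variables the Hessian is a symmetric 2×2 matrix, so its eigenvalues have a closed form and the test above can be sketched in a few lines. The function names and the example functions are assumptions for illustration; the "test inconclusive" branch for a zero eigenvalue is the usual convention and is not stated in the slides.

```python
import math


def hessian_eigenvalues(fxx, fxy, fyy):
    """Eigenvalues of the symmetric 2x2 Hessian [[fxx, fxy], [fxy, fyy]]."""
    mean = (fxx + fyy) / 2
    radius = math.sqrt(((fxx - fyy) / 2) ** 2 + fxy ** 2)
    return mean - radius, mean + radius


def classify(fxx, fxy, fyy):
    """Second derivative test at a critical point, from the Hessian entries there."""
    smallest, largest = hessian_eigenvalues(fxx, fxy, fyy)
    if smallest > 0:
        return "local minimum"
    if largest < 0:
        return "local maximum"
    if smallest < 0 < largest:
        return "saddle point"
    return "test inconclusive"  # assumption: a zero eigenvalue decides nothing


# Each example has a critical point at (0, 0); the Hessians are computed by hand.
print(classify(2, 0, 2))    # f = x^2 + y^2   -> local minimum
print(classify(-2, 0, -2))  # f = -x^2 - y^2  -> local maximum
print(classify(2, 0, -2))   # f = x^2 - y^2   -> saddle point
print(classify(0, 1, 0))    # f = x*y         -> saddle point
```

The eigenvalues here come from solving det(λI − H) = 0 for a 2×2 symmetric matrix; for more variables one would normally rely on a numerical eigenvalue routine.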
 Lagrange Multipliers- overview necessity
• Lagrange multipliers are used to solve extremum problems for a function defined on a level set of another function.
• For example, suppose you want to maximize xy given that x + y = 4. This is not too hard to do using methods developed earlier. Solve for one of the variables, say y, in the constraint equation x + y = 4 to find y = 4 − x. Then the function to maximize is f(x) = x(4 − x), and the answer is clearly x = 2. Thus the two numbers are x = y = 2. This was easy because you could easily solve the constraint equation for one of the variables in terms of the other.
• Now what if you wanted to maximize f(x,y,z) = xyz subject to the constraint x^2 + y^2 + z^2 = 4? You could still solve for one of the variables in the constraint equation, say z, substitute it into f, and then find where the partial derivatives equal zero to find candidates for the extremum. However, you might encounter many cases, and it does look a little fussy. Moreover, sometimes you can't solve the constraint equation for one variable in terms of the others. Also, what if you had many constraints? What if you wanted to maximize f(x,y,z) subject to the constraints x^2 + y^2 = 4 and z = 2x + 3y^2? Things are clearly getting more involved and messy.
 Lagrange Multipliers- definition

• The relation between ∇f and ∇g at an extremum can be seen geometrically, as in the following picture. [Figure]
• In the picture, the surface represents a piece of the level surface g(x,y,z) = 0, and f(x,y,z) is the function of three variables which is being maximized or minimized on the level surface. Suppose the extremum of f occurs at the point (x0,y0,z0). As shown above, ∇g(x0,y0,z0) is perpendicular to the surface, or more precisely to the tangent plane. However, if x(t) = (x(t),y(t),z(t)) is a point on a smooth curve in the surface which passes through (x0,y0,z0) when t = t0, then the function h(t) = f(x(t),y(t),z(t)) must have either a maximum or a minimum at the point t = t0. Therefore, h′(t0) = 0. But this means 0 = h′(t0) = ∇f(x0,y0,z0) · x′(t0), so ∇f(x0,y0,z0) is perpendicular to every such curve and hence, like ∇g(x0,y0,z0), to the tangent plane. Assuming ∇g(x0,y0,z0) ≠ 0, it follows that ∇f(x0,y0,z0) = λ∇g(x0,y0,z0) for some scalar λ, called a Lagrange multiplier.
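As a worked illustration, the earlier example of maximizing xy subject to x + y = 4 can be redone with a Lagrange multiplier. The derivation below is a sketch that assumes the standard condition ∇f = λ∇g at a constrained extremum, with the constraint written as g(x,y) = x + y − 4 = 0; the symbols f, g and λ are notation chosen for this example.

```latex
\begin{aligned}
f(x,y) &= xy, \qquad g(x,y) = x + y - 4 = 0,\\
\nabla f &= (y,\; x), \qquad \nabla g = (1,\; 1),\\
\nabla f = \lambda \nabla g \;&\Longrightarrow\; y = \lambda,\quad x = \lambda,\\
x + y = 4 \;&\Longrightarrow\; 2\lambda = 4 \;\Longrightarrow\; \lambda = 2,\\
x &= y = 2, \qquad f(2,2) = 4 .
\end{aligned}
```

This agrees with the answer x = y = 2 obtained by substituting y = 4 − x, without having to solve the constraint for one variable first.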
A brief Intro to Probability and Statistics for Machine Learning
• Machine Learning is an interdisciplinary field that uses statistics, probability, and algorithms to learn from data and provide insights which can be used to build intelligent applications.
• Probability and statistics are related areas of mathematics which concern themselves with analyzing the relative frequency of events.
Probability
• Probability deals with predicting the likelihood of future events, while statistics involves the analysis of the frequency of past events.
• Most people have an intuitive understanding of degrees of probability, which is why we use words like “probably” and “unlikely” in daily conversation. Here we talk about how to make quantitative claims about those degrees.
• In probability theory, an event is a set of outcomes of an experiment to which a probability is assigned. If E represents an event, then P(E) represents the probability that E will occur. A situation where E might happen (success) or might not happen (failure) is called a trial.
• The experiment can be anything like tossing a coin, rolling a die, or pulling a ball out of a bag. In these examples the outcome is random, so the variable that represents the outcome is called a random variable.
• Let us consider a basic example of tossing a coin. If the coin is fair, then it is just as likely to come up heads as tails. In other words, if we were to toss the coin many times, we would expect about half of the tosses to be heads and half to be tails. In this case, we say that the probability of getting a head is 1/2 or 0.5.
• The empirical probability of an event is given by the number of times the event occurs divided by the total number of trials observed. If in n trials we observe s successes, the empirical probability of success is s/n. In the above example, any particular sequence of coin tosses may have more or fewer than exactly 50% heads.
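The empirical probability s/n of a head can be simulated. This is a minimal sketch assuming a fair coin modelled with Python's pseudo-random generator; the function name, the seed and the numbers of tosses are arbitrary choices.

```python
import random


def empirical_probability(n_tosses, rng):
    """Fraction of heads (s/n) observed in n_tosses simulated fair coin tosses."""
    heads = sum(1 for _ in range(n_tosses) if rng.random() < 0.5)
    return heads / n_tosses


rng = random.Random(0)  # fixed seed so that runs are repeatable
for n in (10, 100, 10_000):
    print(n, empirical_probability(n, rng))
```

Short sequences can land noticeably away from 0.5, while longer ones tend to settle close to it, which is the sense in which a fair coin has probability 1/2 of coming up heads.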
 Theoretical probability on the other hand is given by the numbers of ways the
event can occur divided by the total number of possible outcomes. For a coin toss, a head can occur in one way and there are two possible outcomes (head, tail), so the true (theoretical) probability of a head is 1/2.
• Joint Probability
• The probability of events A and B, denoted P(A ∩ B), is the probability that events A and B both occur. When A and B are independent,
P(A ∩ B) = P(A) · P(B).
• This formula only applies if A and B are independent, which means that the occurrence of A doesn't change the probability of B, and vice versa.
• Conditional Probability
• Now suppose A and B are not independent, for example because the probability of B is higher if A occurred. When A and B are not independent, it is often useful to compute the conditional probability P(A|B), which is the probability of A given that B occurred.
• The probability of an event A conditioned on an event B is denoted P(A|B) and defined, for P(B) > 0, by
P(A|B) = P(A ∩ B) / P(B).
• We can write the joint probability of A and B as
P(A ∩ B) = P(A) · P(B|A),
which means: “The chance of both things happening is the chance that the first one happens, and then the second one given the first happened.”
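Joint and conditional probabilities can be computed exactly by enumerating a small sample space. The sketch below assumes two fair six-sided dice; the events A, B, C and D are hypothetical examples chosen to show one dependent pair and one independent pair.

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely outcomes of rolling two fair dice.
outcomes = list(product(range(1, 7), repeat=2))


def prob(event):
    """Theoretical probability: favourable outcomes divided by all outcomes."""
    return Fraction(sum(1 for o in outcomes if event(o)), len(outcomes))


def A(o):
    return o[0] + o[1] == 8   # the sum is 8


def B(o):
    return o[0] == 6          # the first die shows 6


def both(o):
    return A(o) and B(o)


p_a, p_b, p_ab = prob(A), prob(B), prob(both)
p_a_given_b = p_ab / p_b      # P(A|B) = P(A and B) / P(B)
p_b_given_a = p_ab / p_a      # P(B|A) = P(A and B) / P(A)

print(p_a, p_b, p_ab)             # 5/36 1/6 1/36
print(p_a_given_b, p_b_given_a)   # 1/6 1/5
print(p_ab == p_a * p_b_given_a)  # True: joint = P(A) * P(B|A)
print(p_ab == p_a * p_b)          # False: A and B are not independent


def C(o):
    return o[0] % 2 == 0      # the first die is even


def D(o):
    return o[1] == 6          # the second die shows 6


print(prob(lambda o: C(o) and D(o)) == prob(C) * prob(D))  # True: independent
```

The product rule P(A) · P(B) only matches the joint probability for the independent pair, while P(A) · P(B|A) matches it in both cases.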
Bayes’ Theorem
• Bayes’ theorem is a relationship between the conditional probabilities of two events. For example, if we want to find the probability of selling ice cream on a hot and sunny day, Bayes’ theorem gives us the tools to use prior knowledge about the likelihood of selling ice cream on any other type of day (rainy, windy, snowy etc.).
P(H|E) = P(E|H) · P(H) / P(E)
• Here H and E are events, and P(H|E) is the probability that event H occurs given that E has already occurred. The probability P(H) in the equation is basically frequency analysis: given our prior data, what is the probability of the event occurring? P(E|H) in the equation is called the likelihood and is essentially the probability that the evidence is correct, given the information from the frequency analysis. P(E) is the probability that the actual evidence is true.
Example:
• Let H represent the event that we sell ice cream and E the event of the weather. Then we might ask: what is the probability of selling ice cream on any given day, given the type of weather? Mathematically this is written as P(H = ice cream sale | E = type of weather), which is the left hand side of the equation.
• P(H) on the right hand side is known as the prior, because we might already know the marginal probability of the sale of ice cream. In this example this is P(H = ice cream sale), i.e. the probability of selling ice cream regardless of the type of weather outside. For example, I could look at data that said 30 people out of a potential 100 actually bought ice cream at some shop somewhere. So P(H = ice cream sale) = 30/100 = 0.3, prior to knowing anything about the weather. This is how Bayes’ theorem allows us to incorporate prior information.
• A classic use of Bayes’ theorem is in the interpretation of clinical tests. Suppose that during a routine medical examination, your doctor informs you that you have tested positive for a rare disease. You are also aware that there is some uncertainty in the results of these tests. Assume the test has a sensitivity (also called the true positive rate) of 95%, i.e. it is positive for 95% of the patients with the disease, and a specificity (also called the true negative rate) of 95%, i.e. it is negative for 95% of the healthy patients.
• If we let “+” and “−” denote a positive and negative test result, respectively, then the test accuracies are the conditional probabilities
P(+|disease) = 0.95, P(−|healthy) = 0.95.
• In Bayesian terms, we want to compute the probability of disease given a positive test, P(disease|+):
P(disease|+) = P(+|disease) · P(disease) / P(+).
• How do we evaluate P(+), the probability of all positive cases? We have to consider two possibilities, P(+|disease) and P(+|healthy), so that P(+) = P(+|disease) · P(disease) + P(+|healthy) · P(healthy). The probability P(+|healthy) is the complement of P(−|healthy). Thus P(+|healthy) = 0.05.
• Importantly, Bayes’ theorem reveals that in order to compute the conditional probability that you have the disease given the test was positive, you need to know the “prior” probability that you have the disease, P(disease), given no information at all.
• That is, you need to know the overall incidence of the disease in the population to which you belong. Assume these tests are applied to a population where the actual incidence of the disease is 0.5%, so P(disease) = 0.005, which means P(healthy) = 0.995.
• So, P(disease|+) = 0.95 * 0.005 / (0.95 * 0.005 + 0.05 * 0.995) ≈ 0.087.
• In other words, despite the apparent reliability of the test, the probability that you actually have the disease is still less than 9%. Getting a positive result increases the probability that you have the disease, but it is incorrect to interpret the 95% test accuracy as the probability that you have the disease.
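The clinical test calculation can be reproduced in a few lines. This is a sketch using the numbers quoted in the text (sensitivity 0.95, specificity 0.95, prior 0.005); the function name `posterior` and its parameter names are assumptions.

```python
def posterior(prior, sensitivity, specificity):
    """P(disease | +) from Bayes' theorem, given the prior and the test accuracies."""
    p_pos_given_disease = sensitivity
    p_pos_given_healthy = 1 - specificity
    # Total probability of a positive result over both groups.
    p_pos = p_pos_given_disease * prior + p_pos_given_healthy * (1 - prior)
    return p_pos_given_disease * prior / p_pos


p = posterior(prior=0.005, sensitivity=0.95, specificity=0.95)
print(round(p, 4))  # 0.0872

# The same test applied where the disease is more common gives a very different answer.
print(posterior(prior=0.2, sensitivity=0.95, specificity=0.95))
```

The second call shows how strongly the result depends on the prior: the test accuracies are unchanged, yet the probability of disease after a positive result rises sharply when the disease is less rare.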
Descriptive Statistics
 Descriptive statistics refers to methods for summarizing and organizing the information in a data
set. We will use below table to describe some of the statistical concepts.
 Elements: The entities for which information is collected are called the elements. In the above
table, the elements are the 10 applicants. Elements are also called cases or subjects.
 Variables: The characteristic of an element is called a variable. It can take different values for
different elements.e.g., marital status, mortgage, income, rank, year, and risk. Variables are also
called attributes.
 Variables can be either qualitative or quantitative.
 Qualitative: A qualitative variable enables the elements to be classified or categorized according
to some characteristic. The qualitative variables are marital status, mortgage, rank and risk.
Qualitative variables are also called categorical variables.
 Quantitative: A quantitative variable takes numeric values and allows arithmetic to be
meaningfully performed on it. The quantitative variables are income and year. Quantitative
variables are also called numerical variables.
 Discrete Variable: A numerical variable that can take either a finite or a countable number of
values is a discrete variable, for which each value can be graphed as a separate point, with space
between each point. ‘year’ is an example of a discrete variable.
 Continuous Variable: A numerical variable that can take any value within an interval is a
continuous variable. Its possible values form an interval on the number line, with no space between
the points. ‘income’ is an example of a continuous variable.
 Population: A population is the set of all elements of interest for a particular problem. A
parameter is a characteristic of a population.
 Random sample: A random sample is a sample for which each element of the population has an
equal chance of being selected.
 Measures of Center: Mean, Median, Mode, Mid-range
 Mean: The mean is the arithmetic average of a data set. To calculate the mean, add up the values
and divide by the number of values. The sample mean is the arithmetic average of a sample, and
is denoted x̄ (“x-bar”). The population mean is the arithmetic average of a population, and is
denoted 𝜇 (“mu”, the Greek letter for m).
 Median: The median is the middle data value, when there is an odd number of data values and the
data have been sorted into ascending order. If there is an even number, the median is the mean of
the two middle data values. When the income data are sorted into ascending order, the two middle
values are $32,100 and $32,200, the mean of which is the median income, $32,150.
 Mode: The mode is the data value that occurs with the greatest frequency. Both quantitative and
categorical variables can have modes, but only quantitative variables can have means or medians.
Each income value occurs only once, so there is no mode. The mode for year is 2010, with a
frequency of 4.
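 Mid-range: The mid-range is the average of the maximum and minimum values in a data set:
mid-range = (maximum + minimum) / 2. The mid-range income is therefore the average of the largest
and smallest incomes in the table.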
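The four measures of center can be computed with Python's standard `statistics` module. This is a minimal sketch; the list `values` is hypothetical illustration data and is not the applicant table referred to in the text.

```python
import statistics

# Hypothetical data for illustration only (not the applicant table).
values = [24, 27, 30, 30, 32, 35, 38, 40]

mean = statistics.mean(values)               # sum of values / number of values
median = statistics.median(values)           # mean of the two middle values (even n)
modes = statistics.multimode(values)         # all values with the greatest frequency
mid_range = (max(values) + min(values)) / 2  # average of the extremes

print(mean, median, modes, mid_range)
```

`multimode` returns a list, so a data set in which several values tie for the highest frequency is reported with all of them.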
Measures of Variability: Range, Variance, Standard Deviation
 Quantify the amount of variation, spread or dispersion present in the data.
 Range: The range of a variable equals the difference between the maximum and minimum
values: range = maximum − minimum.
 The range only reflects the difference between the largest and smallest observations; it says
nothing about how the rest of the data are distributed between those extremes.
 Variance: The population variance is the average of the squared differences from the
population mean, and is denoted σ² (“sigma-squared”): σ² = Σ (x − 𝜇)² / N, where the sum runs
over all N values in the population.
 A larger variance means the data are more spread out.
 The sample variance s² is approximately the mean of the squared deviations, with N replaced by
n − 1: s² = Σ (x − x̄)² / (n − 1). The divisor n − 1 is used because the deviations are measured from
the sample mean x̄, which is only an approximation of the true population mean.
 Standard Deviation: The standard deviation (sd) of a set of numbers tells you how much the
individual numbers tend to differ from the mean.
 The sample standard deviation is the square root of the sample variance:
 s = square root (s²).
 For example, incomes typically deviate from their mean by about $7,201.
 The population standard deviation is the square root of the population variance.
 The smaller the standard deviation, the narrower the peak and the closer the data points are to
the mean. The further the data points are from the mean, the greater the standard deviation.
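The difference between the population and sample formulas is easiest to see side by side. The sketch below computes both by hand and compares them with the standard `statistics` module; the list `values` is hypothetical illustration data.

```python
import math
import statistics

# Hypothetical data for illustration only.
values = [2, 4, 4, 4, 5, 5, 7, 9]

n = len(values)
mean = sum(values) / n
squared_deviations = sum((x - mean) ** 2 for x in values)

pop_var = squared_deviations / n          # population variance: divide by N
samp_var = squared_deviations / (n - 1)   # sample variance: divide by n - 1
pop_sd = math.sqrt(pop_var)
samp_sd = math.sqrt(samp_var)

# The standard library provides the same four quantities.
assert math.isclose(pop_var, statistics.pvariance(values))
assert math.isclose(samp_var, statistics.variance(values))
assert math.isclose(pop_sd, statistics.pstdev(values))
assert math.isclose(samp_sd, statistics.stdev(values))

print(pop_var, pop_sd, samp_var, samp_sd)
```

The sample values are always slightly larger than the population values for the same data, because the divisor n − 1 is smaller than n.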
Measures of Position: Percentile, Z-score, Quartiles
 Percentile: The pth percentile of a data set is the data value such that p percent of the
values in the data set are at or below this value. The 50th percentile is the median. For example,
the median income is $32,150, and 50% of the data values lie at or below this value.
 Percentile rank: The percentile rank of a data value equals the percentage of values in the
data set that are at or below that value. For example, the percentile rank of Applicant 1’s income
of $38,000 is 90%, since that is the percentage of incomes equal to or less than $38,000.
 Interquartile Range (IQR): The first quartile (Q1) is the 25th percentile of a data set; the
second quartile (Q2) is the 50th percentile (median); and the third quartile (Q3) is the 75th
percentile.
 The IQR measures the difference between the 75th and 25th percentiles using the formula:
IQR = Q3 − Q1.
 A data value x is an outlier if either x ≤ Q1 − 1.5(IQR) or x ≥ Q3 + 1.5(IQR).
 Z-score: The Z-score for a particular data value represents how many standard deviations the
data value lies above or below the mean: Z-score = (data value − mean) / (standard deviation).
 If z is positive, the value is above the mean; if z is negative, the value is below the mean.
For Applicant 6, the Z-score is (24,000 − 32,540) / 7201 ≈ −1.2, which means the income of
Applicant 6 lies 1.2 standard deviations below the mean.
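The measures of position above can be sketched in Python as follows. The list `values` is hypothetical illustration data, `percentile_rank` is an illustrative helper name, and the 1.5 × IQR outlier rule is assumed here as the usual convention. Quartile values depend on the interpolation method chosen; this sketch uses the `"inclusive"` method of `statistics.quantiles`, and other software may return slightly different quartiles.

```python
import statistics

# Hypothetical data for illustration only.
values = [10, 12, 13, 15, 16, 18, 19, 21, 60]


def percentile_rank(data, x):
    """Percentage of values in data that are at or below x."""
    return 100.0 * sum(v <= x for v in data) / len(data)


# Quartiles and interquartile range.
q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
iqr = q3 - q1

# Assumed outlier rule: at least 1.5 * IQR below Q1 or above Q3.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in values if x <= lower_fence or x >= upper_fence]

# Z-scores: distance from the mean in units of the sample standard deviation.
mean = statistics.mean(values)
sd = statistics.stdev(values)
z_scores = [(x - mean) / sd for x in values]

print(q1, q2, q3, iqr, outliers)
print(percentile_rank(values, 16))
```

The Z-scores of values below the mean are negative and those above it are positive, matching the interpretation given for Applicant 6.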
 Uni-variate Descriptive Statistics: Patterns in uni-variate data can be described by measures of
central tendency (mean, mode and median) and measures of dispersion (range, variance, maximum,
minimum, quartiles and standard deviation).
 The plots typically used to visualize uni-variate data are bar charts, histograms and pie charts.
 Bi-variate Descriptive Statistics: Bi-variate analysis involves the analysis of two variables for the
purpose of determining the empirical relationship between them. The plots typically used to visualize
bi-variate data are scatter plots and box plots.
 Scatter Plots: A scatter plot is the simplest way to visualize the relationship between two
quantitative variables, x and y, and is the usual graph for two continuous variables. Each (x, y) point
is graphed on a Cartesian plane, with the x axis on the horizontal and the y axis on the vertical.
Scatter plots are sometimes called correlation plots because they show how two variables are correlated.
 Correlation: A correlation is a statistic intended to quantify the strength of the relationship
between two variables. The correlation coefficient r quantifies the strength and direction of the
linear relationship between two quantitative variables. The correlation coefficient is defined as:
r = Σ (x − x̄)(y − ȳ) / ((n − 1) sx sy)
 Here sx and sy represent the standard deviation of the x-variable and the y-variable, respectively.
The value of r always satisfies −1 ≤ r ≤ 1.
 If r is positive and significant, we say that x and y are positively correlated. An increase in x is
associated with an increase in y.
 If r is negative and significant, we say that x and y are negatively correlated. An increase in x is
associated with a decrease in y.
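The sign and size of r can be explored with a short Python sketch. It assumes the usual sample form of the correlation coefficient, r = Σ (x − x̄)(y − ȳ) / ((n − 1) sx sy), with sx and sy the sample standard deviations; the function name `pearson_r` and the three data lists are hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Sample correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sample standard deviations (divisor n - 1).
    s_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / (n - 1))
    s_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / (n - 1))
    # Sum of products of paired deviations from the means.
    cross = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return cross / ((n - 1) * s_x * s_y)


# Hypothetical data for illustration only.
xs = [1, 2, 3, 4, 5]
ys_up = [2, 4, 5, 4, 5]      # tends to increase with x
ys_down = [10, 8, 6, 4, 2]   # decreases exactly linearly with x

r_up = pearson_r(xs, ys_up)
r_down = pearson_r(xs, ys_down)
print(round(r_up, 3), round(r_down, 3))
```

A perfectly linear decreasing relationship gives r = −1, while the noisier increasing data give a positive r below 1.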
 Box Plots: A box plot is also called a box and whisker plot and it’s used to picture the distribution
of values. When one variable is categorical and the other continuous, a box-plot is commonly
used. When you use a box plot you divide the data values into four parts called quartiles. You start
by finding the median or middle value. The median splits the data values into halves. Finding the
median of each half splits the data values into four parts, the quartiles.
 Each box on the plot shows the range of values from the median of the lower half of the values
(Q1) at the bottom of the box to the median of the upper half of the values (Q3) at the top of the
box. A line in the middle of the box occurs at the median of all the data values. The whiskers then
point to the largest and smallest values in the data that are not outliers.
 The five-number summary of a data set consists of the minimum, Q1, the median, Q3, and the
maximum.
 Box plots are especially useful for indicating whether a distribution is skewed and whether
there are potential unusual observations (outliers) in the data set.
 The left whisker extends down to the minimum value which is not an outlier. The right whisker
extends up to the maximum value that is not an outlier. When the left whisker is longer than the
right whisker, then the distribution is left-skewed and vice versa. When the whiskers are about
equal in length, the distribution is symmetric.
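The numbers that a box plot draws can be computed without any plotting library. The sketch below follows the median-of-halves description of quartiles given above and assumes the 1.5 × IQR rule for deciding which points count as outliers. The function names and the list `values` are hypothetical.

```python
import statistics


def five_number_summary(data):
    """Return (minimum, Q1, median, Q3, maximum) using medians of the two halves."""
    s = sorted(data)
    half = len(s) // 2
    lower = s[:half]
    # For an odd number of values the overall median is left out of both halves.
    upper = s[half + 1:] if len(s) % 2 else s[half:]
    return s[0], statistics.median(lower), statistics.median(s), statistics.median(upper), s[-1]


def whisker_ends(data):
    """Smallest and largest values that are not outliers (assumed 1.5 * IQR rule)."""
    _, q1, _, q3, _ = five_number_summary(data)
    iqr = q3 - q1
    inside = [x for x in data if q1 - 1.5 * iqr < x < q3 + 1.5 * iqr]
    return min(inside), max(inside)


# Hypothetical data for illustration only.
values = [1, 2, 2, 3, 3, 4, 5, 7, 9, 12, 30]

minimum, q1, median, q3, maximum = five_number_summary(values)
low_whisker, high_whisker = whisker_ends(values)

left_length = q1 - low_whisker
right_length = high_whisker - q3
print((minimum, q1, median, q3, maximum), (low_whisker, high_whisker))
print("right-skewed" if right_length > left_length else "left-skewed or symmetric")
```

Here the value 30 lies beyond the upper fence, so the right whisker stops at 12 and 30 would be drawn as a separate point.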
Conclusion
 Basic concepts of probability and statistics are a must-have for anyone interested in machine
learning. We briefly covered some of the essential concepts that are most often used in machine
learning.
