Mathematics For Machine Learning
Contents to be covered:
Basic Linear Algebra- Fundamentals & Basic notations
Matrices & Determinants
Linear Transformation
Multi-variable Calculus
Basic Linear Algebra- Fundamentals & Application
Represent a line in 3 space by a vector parameterization, a set of scalar parametric equations or using
symmetric form.
Find a parameterization of a line given information about
(a) a point of the line and the direction of the line.
(b) two points contained in the line.
Determine the direction of a line given its parameterization.
Fundamentals- Description of $\mathbb{R}^n$
The notation $\mathbb{R}^n$ refers to the collection of ordered lists of $n$ real numbers. More precisely, consider the following definition.
Mathematically we may say,
$\mathbb{R}^n \equiv \{(x_1,\cdots,x_n) : x_j \in \mathbb{R} \text{ for } j = 1,\cdots,n\}$.
$(x_1,\cdots,x_n) = (y_1,\cdots,y_n)$ if and only if $x_j = y_j$ for all $j = 1,\cdots,n$. When $(x_1,\cdots,x_n) \in \mathbb{R}^n$, it is conventional to denote $(x_1,\cdots,x_n)$ by the single boldface letter $\mathbf{x}$.
The numbers $x_j$ are called the coordinates. The set $\{(0,\cdots,0,t,0,\cdots,0) : t \in \mathbb{R}\}$, with $t$ in the $i$th slot, is called the $i$th coordinate axis, the $x_i$ axis for short. The point $\mathbf{0} \equiv (0,\cdots,0)$ is called the origin. E.g. $(1,2,4) \in \mathbb{R}^3$ and $(2,1,4) \in \mathbb{R}^3$ but $(1,2,4) \neq (2,1,4)$ because, even though the same numbers are involved, they don’t match up. In particular, the first entries are not equal.
Why would one be interested in such a thing?
Consider a hot object which is cooling, and suppose you want the temperature of this object. How many coordinates would be needed? You would need one for the temperature, three for the position of the point in the object and one more for the time. Thus you would need to be considering $\mathbb{R}^5$.
Many other examples can be given.
Sometimes $n$ is very large. This is often the case in applications to business when they are trying to maximize profit subject to constraints.
It also occurs in numerical analysis when people try to solve hard problems on a computer.
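Fundamentals- Algebra in $\mathbb{R}^n$
There are two algebraic operations done with elements of $\mathbb{R}^n$. One is addition and the other is multiplication by numbers, called scalars.
Mathematically,
If $\mathbf{x} \in \mathbb{R}^n$ and $a$ is a number, also called a scalar, then $a\mathbf{x} \in \mathbb{R}^n$ is defined by $a\mathbf{x} = a(x_1,\cdots,x_n) \equiv (ax_1,\cdots,ax_n)$.
This is known as scalar multiplication. If $\mathbf{x},\mathbf{y} \in \mathbb{R}^n$ then $\mathbf{x}+\mathbf{y} \in \mathbb{R}^n$ is defined by $\mathbf{x}+\mathbf{y} = (x_1,\cdots,x_n)+(y_1,\cdots,y_n) \equiv (x_1+y_1,\cdots,x_n+y_n)$.
An element of $\mathbb{R}^n$, $\mathbf{x} \equiv (x_1,\cdots,x_n)$, is often called a vector. The above definition is known as vector addition.
For $\mathbf{v},\mathbf{w},\mathbf{z}$ vectors in $\mathbb{R}^n$ and $\alpha,\beta$ scalars (real numbers), the following hold.
the commutative law of addition,
$\mathbf{v}+\mathbf{w} = \mathbf{w}+\mathbf{v}$
the associative law for addition,
$(\mathbf{v}+\mathbf{w})+\mathbf{z} = \mathbf{v}+(\mathbf{w}+\mathbf{z})$
the existence of an additive identity,
$\mathbf{v}+\mathbf{0} = \mathbf{v}$
the existence of an additive inverse,
$\mathbf{v}+(-\mathbf{v}) = \mathbf{0}$
Also
$\alpha(\mathbf{v}+\mathbf{w}) = \alpha\mathbf{v}+\alpha\mathbf{w}$, $(\alpha+\beta)\mathbf{v} = \alpha\mathbf{v}+\beta\mathbf{v}$, $\alpha(\beta\mathbf{v}) = (\alpha\beta)\mathbf{v}$, $1\mathbf{v} = \mathbf{v}$.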
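As a minimal sketch of the two operations and the laws listed above, the following Python snippet represents vectors as tuples of numbers. The helper names `add` and `scale` and the sample vectors are illustrative choices, not part of the original slides.

```python
from typing import Tuple

Vector = Tuple[float, ...]


def add(x: Vector, y: Vector) -> Vector:
    """Vector addition: add corresponding coordinates."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same number of coordinates")
    return tuple(a + b for a, b in zip(x, y))


def scale(a: float, x: Vector) -> Vector:
    """Scalar multiplication: multiply every coordinate by a."""
    return tuple(a * c for c in x)


# Sample vectors and scalars (illustrative values only).
v, w, z = (1.0, 2.0, 4.0), (2.0, 1.0, 4.0), (0.5, -1.0, 3.0)
alpha, beta = 2.0, -3.0
zero = (0.0, 0.0, 0.0)

commutative = add(v, w) == add(w, v)
associative = add(add(v, w), z) == add(v, add(w, z))
identity = add(v, zero) == v
inverse = add(v, scale(-1.0, v)) == zero
distributive = scale(alpha, add(v, w)) == add(scale(alpha, v), scale(alpha, w))
```

The comparisons are exact here only because the sample values are exactly representable as floats; with arbitrary floating-point data a tolerance-based comparison would be the safer choice.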
Fundamentals- Geometric meaning of vector addition in $\mathbb{R}^n$
It was explained earlier that an element of $\mathbb{R}^n$ is an $n$-tuple of numbers, and that this can be used to determine a point in three-dimensional space in the case where $n = 3$ and in two-dimensional space in the case where $n = 2$. This point was specified relative to some coordinate axes.
Thus the geometric significance of $(d,e,f)+(a,b,c) = (d+a,e+b,f+c)$ is as follows.
You start with the position vector of the point $(d,e,f)$, and at its point you place the vector determined by $(a,b,c)$ with its tail at $(d,e,f)$.
Then the point of this last vector will be $(d+a,e+b,f+c)$.
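[Figure: vector addition in $\mathbb{R}^3$, showing $\mathbf{u} = (d,e,f)$, $\mathbf{v} = (a,b,c)$ and $\mathbf{u}+\mathbf{v}$ on the axes $x_1$, $x_2$, $x_3$]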
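Fundamentals- Lines & distances
To begin with, consider the cases $n = 1, 2$.
In the case where $n = 1$, the only line is just $\mathbb{R}^1 = \mathbb{R}$. Therefore, if $x_1$ and $x_2$ are two different points in $\mathbb{R}$, consider
$x = x_1 + t(x_2 - x_1)$
where $t \in \mathbb{R}$. The totality of all such points gives $\mathbb{R}$. You can always solve the above equation for $t$, showing that every point of $\mathbb{R}$ is of this form.
Now consider the plane.
Let $(x_1,y_1)$ and $(x_2,y_2)$ be two different points in $\mathbb{R}^2$ which are contained in a line $l$.
The line may be written as $(x,y) = (x_1,y_1) + t(x_2 - x_1, y_2 - y_1)$ with $t \in \mathbb{R}$. If $x_1 = x_2$, then in place of the point-slope form the line is $x = x_1$. Since the two given points are different, $y_1 \neq y_2$, and so you still obtain the above formula for the line. Because of this, the following is the definition of a line in $\mathbb{R}^n$.
A line in $\mathbb{R}^n$ containing the two different points $\mathbf{p}$ and $\mathbf{q}$ is the collection of points of the form $\mathbf{x} = \mathbf{p} + t(\mathbf{q} - \mathbf{p})$, where $t \in \mathbb{R}$. The vector $\mathbf{q} - \mathbf{p}$ is a direction vector of the line.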
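The parameterization of a line through two points can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `line_through` and the sample points are hypothetical, and points are represented as tuples.

```python
from typing import Callable, Tuple

Point = Tuple[float, ...]


def line_through(p: Point, q: Point) -> Tuple[Point, Callable[[float], Point]]:
    """Return a direction vector q - p and the map t -> p + t(q - p)."""
    if len(p) != len(q):
        raise ValueError("points must have the same number of coordinates")
    if p == q:
        raise ValueError("two different points are needed to determine a line")
    direction = tuple(b - a for a, b in zip(p, q))

    def point_at(t: float) -> Point:
        # Each coordinate is a scalar parametric equation x_i = p_i + t * d_i.
        return tuple(a + t * d for a, d in zip(p, direction))

    return direction, point_at


# Two sample points in 3-space (illustrative values only).
p, q = (1.0, 2.0, 0.0), (3.0, 1.0, 4.0)
direction, point_at = line_through(p, q)
midpoint = point_at(0.5)
```

With this convention `point_at(0)` returns the first point and `point_at(1)` the second, and the direction of the line can be read off directly from the parameterization.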
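How is the distance between two points in $\mathbb{R}^n$ defined?
Let $\mathbf{x} = (x_1,\cdots,x_n)$ and $\mathbf{y} = (y_1,\cdots,y_n)$ be two points in $\mathbb{R}^n$. Then $|\mathbf{x}-\mathbf{y}|$ indicates the distance between these points and is defined as
distance between $\mathbf{x}$ and $\mathbf{y}$ $\equiv |\mathbf{x}-\mathbf{y}| \equiv \left(\sum_{k=1}^{n} |x_k - y_k|^2\right)^{1/2}$.
This is called the distance formula.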
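The distance formula translates directly into code. The sketch below uses only the standard library; the names `distance` and `length` are illustrative, and `length` is simply the distance from the origin.

```python
import math
from typing import Sequence


def distance(x: Sequence[float], y: Sequence[float]) -> float:
    """Euclidean distance: square root of the sum of squared coordinate differences."""
    if len(x) != len(y):
        raise ValueError("points must have the same number of coordinates")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def length(x: Sequence[float]) -> float:
    """Length of the vector x, i.e. its distance from the origin."""
    return distance(x, [0.0] * len(x))


d = distance((1, 2, 4), (2, 1, 4))  # the two points from the earlier example
```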
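Fundamentals- Geometric Meaning Of Scalar Multiplication In $\mathbb{R}^3$
As discussed earlier, $\mathbf{x} = (x_1,x_2,x_3)$ determines a vector: you draw the line from $\mathbf{0}$ to $\mathbf{x}$, placing the point of the vector on $\mathbf{x}$.
The length of this vector is defined to equal $|\mathbf{x}|$. Thus the length of $\mathbf{x}$ equals $\sqrt{x_1^2+x_2^2+x_3^2}$.
When you multiply $\mathbf{x}$ by a scalar $\alpha$, you get $(\alpha x_1,\alpha x_2,\alpha x_3)$ and the length of this vector is
$\sqrt{(\alpha x_1)^2+(\alpha x_2)^2+(\alpha x_3)^2} = |\alpha|\sqrt{x_1^2+x_2^2+x_3^2}$.
Thus the following holds.
$|\alpha\mathbf{x}| = |\alpha||\mathbf{x}|$.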
The size or dimension of a matrix is defined as $m \times n$ where $m$ is the number of rows and $n$ is the number of columns. The entry in the $i$th row and the $j$th column of this matrix is denoted by $a_{ij}$.
There are various operations which are done on matrices. Matrices can be added, multiplied by a scalar, and multiplied by other matrices.
To multiply a matrix by a scalar, multiply every entry of the original matrix by the given scalar. If $A$ is an $m \times n$ matrix, $-A$ is defined to equal $(-1)A$.
By definition (Scalar Multiplication), if $A = (a_{ij})$ and $k$ is a scalar, then $kA = (ka_{ij})$.
An $m \times n$ matrix can be used to transform vectors in $\mathbb{R}^n$ to vectors in $\mathbb{R}^m$ through the use of matrix multiplication.
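A small sketch of scalar multiplication and of a matrix acting on a vector, with matrices stored as lists of rows. The function names and the sample matrix are assumptions made for illustration.

```python
from typing import List

Matrix = List[List[float]]


def scalar_multiply(k: float, A: Matrix) -> Matrix:
    """kA: multiply every entry a_ij by the scalar k."""
    return [[k * a for a in row] for row in A]


def mat_vec(A: Matrix, x: List[float]) -> List[float]:
    """Multiply an m x n matrix by a vector in R^n, giving a vector in R^m."""
    if any(len(row) != len(x) for row in A):
        raise ValueError("each row of A must have as many entries as x")
    return [sum(a * c for a, c in zip(row, x)) for row in A]


# A 2 x 3 matrix maps vectors in R^3 to vectors in R^2 (illustrative values).
A = [[1, 2, 3],
     [4, 5, 6]]
minus_A = scalar_multiply(-1, A)
image = mat_vec(A, [1, 0, -1])
```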
Matrices And Linear Transformations- Linear Transformation overview
In general, let $T : \mathbb{R}^n \to \mathbb{R}^m$ be a function. Thus for each $\mathbf{x} \in \mathbb{R}^n$, $T\mathbf{x} \in \mathbb{R}^m$. Then $T$ is a linear transformation if whenever $\alpha,\beta$ are scalars and $\mathbf{x}_1$ and $\mathbf{x}_2$ are vectors in $\mathbb{R}^n$,
$T(\alpha\mathbf{x}_1 + \beta\mathbf{x}_2) = \alpha T\mathbf{x}_1 + \beta T\mathbf{x}_2$.
In words, linear transformations distribute across + and allow you to factor out scalars.
A linear transformation is called one to one (often written as 1−1) if it never takes two different vectors to the same vector. Thus $T$ is one to one if whenever $\mathbf{x} \neq \mathbf{y}$, $T\mathbf{x} \neq T\mathbf{y}$. Equivalently, if $T(\mathbf{x}) = T(\mathbf{y})$, then $\mathbf{x} = \mathbf{y}$.
A linear transformation mapping $\mathbb{R}^n$ to $\mathbb{R}^m$ is called onto if whenever $\mathbf{y} \in \mathbb{R}^m$ there exists $\mathbf{x} \in \mathbb{R}^n$ such that $T(\mathbf{x}) = \mathbf{y}$. Thus $T$ is onto if everything in $\mathbb{R}^m$ gets hit. In the case that a linear transformation comes from matrix multiplication, it is common to refer to the matrix as onto when the linear transformation it determines is onto.
Basic Linear Algebra- Matrices And Linear Transformations
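Matrices And Linear Transformations- Constructing The Matrix Of A Linear Transformation
It turns out that if $T$ is any linear transformation which maps $\mathbb{R}^n$ to $\mathbb{R}^m$, there is always an $m \times n$ matrix $A$ with the property that
$A\mathbf{x} = T\mathbf{x}$
for all $\mathbf{x} \in \mathbb{R}^n$. Here is why. Suppose $T : \mathbb{R}^n \to \mathbb{R}^m$ is a linear transformation and you want to find the matrix defined by this linear transformation. Write $\mathbf{x} = \sum_{i=1}^{n} x_i\mathbf{e}_i$, where $\mathbf{e}_i$ is the vector which has a 1 in the $i$th slot and zeros elsewhere. Since $T$ is linear,
$T\mathbf{x} = \sum_{i=1}^{n} x_i T(\mathbf{e}_i) = \begin{pmatrix} T(\mathbf{e}_1) & \cdots & T(\mathbf{e}_n) \end{pmatrix}\mathbf{x}$,
and so you see that the matrix desired is obtained from letting the $i$th column equal $T(\mathbf{e}_i)$. That is, $A = \begin{pmatrix} T(\mathbf{e}_1) & \cdots & T(\mathbf{e}_n) \end{pmatrix}$.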
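The column-by-column construction can be sketched as follows. The helper `matrix_of` and the sample map `T` are hypothetical; `T` is assumed to be linear, which the code does not verify.

```python
from typing import Callable, List

Vector = List[float]
Matrix = List[List[float]]


def matrix_of(T: Callable[[Vector], Vector], n: int) -> Matrix:
    """Build the matrix whose i-th column is T(e_i), for a map T defined on R^n."""
    columns = []
    for i in range(n):
        e_i = [1 if j == i else 0 for j in range(n)]  # 1 in the i-th slot
        columns.append(list(T(e_i)))
    m = len(columns[0])
    # Turn the list of columns into a list of rows.
    return [[columns[j][i] for j in range(n)] for i in range(m)]


def mat_vec(A: Matrix, x: Vector) -> Vector:
    """Matrix-vector product Ax."""
    return [sum(a * c for a, c in zip(row, x)) for row in A]


def T(x: Vector) -> Vector:
    """A sample linear map from R^3 to R^2 (illustrative only)."""
    return [x[0] + 2 * x[1], 3 * x[2]]


A = matrix_of(T, 3)
```

Comparing `mat_vec(A, x)` with `T(x)` on a few vectors is a quick sanity check that the matrix reproduces the transformation.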
Basic Linear Algebra- Determinants
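Minor : The $ij$th minor of an $n \times n$ matrix $A$, denoted $\operatorname{minor}(A)_{ij}$, is the determinant of the $(n-1) \times (n-1)$ matrix which results from deleting the $i$th row and the $j$th column of $A$.
Co-factor : The $ij$th cofactor of $A$ is $\operatorname{cof}(A)_{ij} = (-1)^{i+j}\operatorname{minor}(A)_{ij}$.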
Properties Of Determinants
There are many properties satisfied by determinants. Some of these properties have to do with row
operations which are described below.
There are two other major properties of determinants which do not involve row operations.
Let $A$ and $B$ be two $n \times n$ matrices. Then
$\det(AB) = \det(A)\det(B)$.
Also,
$\det(A) = \det(A^T)$.
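The two properties can be checked numerically on small examples. The sketch below computes the determinant by Laplace (cofactor) expansion along the first row, where the minor for entry (i, j) is the determinant of the matrix with row i and column j deleted. All function names and the sample matrices are illustrative assumptions, and the recursive expansion is only practical for small matrices.

```python
from typing import List

Matrix = List[List[int]]


def minor_matrix(A: Matrix, i: int, j: int) -> Matrix:
    """Delete row i and column j of A."""
    return [row[:j] + row[j + 1:] for k, row in enumerate(A) if k != i]


def det(A: Matrix) -> int:
    """Determinant by cofactor expansion along the first row."""
    n = len(A)
    if any(len(row) != n for row in A):
        raise ValueError("matrix must be square")
    if n == 1:
        return A[0][0]
    return sum((-1) ** j * A[0][j] * det(minor_matrix(A, 0, j)) for j in range(n))


def transpose(A: Matrix) -> Matrix:
    return [list(col) for col in zip(*A)]


def mat_mul(A: Matrix, B: Matrix) -> Matrix:
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


# Sample integer matrices (illustrative values only).
A = [[1, 2, 3], [0, 1, 4], [5, 6, 0]]
B = [[2, 0, 1], [1, 3, 2], [1, 1, 2]]

product_rule_holds = det(mat_mul(A, B)) == det(A) * det(B)
transpose_rule_holds = det(transpose(A)) == det(A)
```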
Determinants- Application: A Formula For The Inverse
The definition of the determinant in terms of Laplace expansion along a row or column also provides a way to give a formula for the inverse of a matrix.
Recall the cofactor matrix of $A$. This cofactor matrix was just the matrix which results from replacing the $ij$th entry of the matrix with the $ij$th cofactor.
To find the inverse, take the transpose of the cofactor matrix and divide by the determinant.
The transpose of the cofactor matrix is called the adjugate or sometimes the classical adjoint of the matrix $A$.
In other words, $A^{-1}$ is equal to one divided by the determinant of $A$ times the adjugate matrix of $A$:
$A^{-1} = \dfrac{1}{\det(A)}\operatorname{adj}(A)$, provided $\det(A) \neq 0$.
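The adjugate formula can be sketched directly in Python. This is an illustration rather than a practical algorithm: it assumes integer entries so that `fractions.Fraction` keeps the result exact, and the function names are hypothetical.

```python
from fractions import Fraction
from typing import List

Matrix = List[List[int]]


def minor_matrix(A: Matrix, i: int, j: int) -> Matrix:
    """Delete row i and column j of A."""
    return [row[:j] + row[j + 1:] for k, row in enumerate(A) if k != i]


def det(A: Matrix) -> int:
    """Determinant by cofactor expansion along the first row."""
    if len(A) == 1:
        return A[0][0]
    return sum((-1) ** j * A[0][j] * det(minor_matrix(A, 0, j)) for j in range(len(A)))


def inverse(A: Matrix) -> List[List[Fraction]]:
    """A^{-1} = adj(A) / det(A), where adj(A) is the transposed cofactor matrix."""
    n = len(A)
    if n < 2 or any(len(row) != n for row in A):
        raise ValueError("expected a square matrix of size at least 2")
    d = det(A)
    if d == 0:
        raise ValueError("matrix is not invertible (determinant is zero)")
    cof = [[(-1) ** (i + j) * det(minor_matrix(A, i, j)) for j in range(n)]
           for i in range(n)]
    # Entry (i, j) of the adjugate is cofactor (j, i).
    return [[Fraction(cof[j][i], d) for j in range(n)] for i in range(n)]


A = [[1, 2], [3, 4]]  # illustrative matrix with determinant -2
A_inv = inverse(A)
```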
Determinants- Application: Cramer’s Rule
This formula for the inverse also implies a famous procedure known as Cramer’s rule. Cramer’s rule gives a formula for the solutions, $\mathbf{x}$, to a system of equations, $A\mathbf{x} = \mathbf{y}$, in the special case that $A$ is a square matrix.
Note: this rule does not apply if you have a system of equations in which there is a different number of equations than variables.
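In its usual statement, Cramer’s rule gives $x_i = \det(A_i)/\det(A)$, where $A_i$ is $A$ with its $i$th column replaced by $\mathbf{y}$, and it requires $\det(A) \neq 0$. The following is a small sketch under the assumption of integer entries; the name `cramer_solve` and the sample system are illustrative.

```python
from fractions import Fraction
from typing import List

Matrix = List[List[int]]


def det(A: Matrix) -> int:
    """Determinant by cofactor expansion along the first row."""
    if len(A) == 1:
        return A[0][0]
    return sum(
        (-1) ** j * A[0][j] * det([row[:j] + row[j + 1:] for row in A[1:]])
        for j in range(len(A))
    )


def cramer_solve(A: Matrix, y: List[int]) -> List[Fraction]:
    """Solve Ax = y for a square matrix A with nonzero determinant."""
    n = len(A)
    if any(len(row) != n for row in A) or len(y) != n:
        raise ValueError("Cramer's rule needs as many equations as variables")
    d = det(A)
    if d == 0:
        raise ValueError("determinant is zero, so the rule does not apply")
    solution = []
    for i in range(n):
        # Replace column i of A by the right-hand side y.
        A_i = [row[:i] + [y[k]] + row[i + 1:] for k, row in enumerate(A)]
        solution.append(Fraction(det(A_i), d))
    return solution


# Sample system: 2x + y = 3, x + 3y = 5 (illustrative values only).
x = cramer_solve([[2, 1], [1, 3]], [3, 5])
```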
Vector calculus in many variables
Contents to be covered:
Functions of many variables
Partial Derivatives
Directional Derivatives
Gradient & Optimization
Vector calculus in many variables- Functions in many variables
Functions in many variables- Review Of Limits
Recall the concept of the limit of a function of many variables. When $f : D(f) \to \mathbb{R}^q$, one can only consider in a meaningful way limits at limit points of the set $D(f)$.
The condition that $\mathbf{x}$ must be a limit point of $D(f)$ if you are to take a limit at $\mathbf{x}$ is what makes the limit well defined.
Vector calculus in many variables- Directional Derivatives
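In this picture, $\mathbf{v} \equiv (v_1,v_2)$ is a unit vector in the $xy$ plane and $\mathbf{x}_0 \equiv (x_0,y_0)$ is a point in the $xy$ plane.
When $(x,y)$ moves in the direction of $\mathbf{v}$, this results in a change in $z = f(x,y)$ as shown in the picture.
The directional derivative in this direction is defined as
$\lim_{t \to 0} \dfrac{f(x_0 + tv_1, y_0 + tv_2) - f(x_0,y_0)}{t}$.
Partial Derivatives
When $\mathbf{v}$ is one of the coordinate directions, the directional derivative is called a partial derivative. For the partial derivative with respect to $x$, as in the case of a general directional derivative, you fix $y$ and take the derivative of the function $x \to f(x,y)$.
More generally, even in situations which cannot be drawn, the definition of a partial derivative is as follows.
$\dfrac{\partial f}{\partial x_i}(\mathbf{x}) \equiv \lim_{t \to 0} \dfrac{f(\mathbf{x} + t\mathbf{e}_i) - f(\mathbf{x})}{t}$, where $\mathbf{e}_i$ is the vector with a 1 in the $i$th slot and zeros elsewhere.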
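The limits above can be approximated numerically with a small step. The sketch below uses a central difference, which is one common approximation and not the definition itself; the function names, the step size `h` and the sample function are assumptions chosen for illustration.

```python
import math
from typing import Callable, Sequence


def directional_derivative(f: Callable[[Sequence[float]], float],
                           point: Sequence[float],
                           v: Sequence[float],
                           h: float = 1e-6) -> float:
    """Central-difference estimate of the derivative of f at point along v."""
    norm = math.sqrt(sum(c * c for c in v))
    unit = [c / norm for c in v]  # the definition uses a unit vector
    forward = [p + h * u for p, u in zip(point, unit)]
    backward = [p - h * u for p, u in zip(point, unit)]
    return (f(forward) - f(backward)) / (2 * h)


def partial_derivative(f: Callable[[Sequence[float]], float],
                       point: Sequence[float],
                       i: int,
                       h: float = 1e-6) -> float:
    """Partial derivative: the directional derivative along e_i."""
    e_i = [1.0 if j == i else 0.0 for j in range(len(point))]
    return directional_derivative(f, point, e_i, h)


def f(p: Sequence[float]) -> float:
    """Sample function f(x, y) = x^2 * y (illustrative only)."""
    x, y = p
    return x * x * y


df_dx = partial_derivative(f, [1.0, 2.0], 0)  # exact value is 2xy = 4
df_dy = partial_derivative(f, [1.0, 2.0], 1)  # exact value is x^2 = 1
along_v = directional_derivative(f, [1.0, 2.0], [3.0, 4.0])
```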
Vector calculus in many variables- Gradient & Optimization
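Fundamental Properties
Let $f : U \to \mathbb{R}$ where $U$ is an open subset of $\mathbb{R}^n$ and suppose $f$ is differentiable on $U$. Thus if $\mathbf{x} \in U$,
$\nabla f(\mathbf{x}) \equiv \left(\dfrac{\partial f}{\partial x_1}(\mathbf{x}),\cdots,\dfrac{\partial f}{\partial x_n}(\mathbf{x})\right)^T$.
This defines the gradient for a differentiable scalar valued function. There are ways to define the gradient for vector valued functions. It follows immediately that the directional derivative in the direction of a unit vector $\mathbf{v}$ is $D_{\mathbf{v}}f(\mathbf{x}) = \nabla f(\mathbf{x}) \cdot \mathbf{v}$.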
Gradient & Optimization- Local Extrema
Procedure:
To find candidates for local extrema which are interior points of $D(f)$ where $f$ is a differentiable function, you simply identify those points where $\nabla f$ equals the zero vector. To justify this, note that the graph of $f$ is the level surface
$F(\mathbf{x},z) \equiv z - f(\mathbf{x}) = 0$
and the local extrema at such interior points must have horizontal tangent planes. Therefore, a normal vector at such points must be a multiple of $(0,\cdots,0,1)$. Thus $\nabla F$ at such points must be a multiple of this vector. That is, if $\mathbf{x}$ is such a point,
$k(0,\cdots,0,1) = (-f_{x_1}(\mathbf{x}),\cdots,-f_{x_n}(\mathbf{x}),1)$.
Thus $\nabla f(\mathbf{x}) = \mathbf{0}$.
A singular point for $f$ is a point $\mathbf{x}$ where $\nabla f(\mathbf{x}) = \mathbf{0}$. This is also called a critical point.
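Gradient & Optimization- The Second Derivative Test
There is a version of the second derivative test in the case that the function and its first and second partial derivatives are all continuous.
Definition: The matrix $H(\mathbf{x})$ whose $ij$th entry at the point $\mathbf{x}$ is $\dfrac{\partial^2 f}{\partial x_i \partial x_j}(\mathbf{x})$ is called the Hessian matrix. The eigenvalues of $H(\mathbf{x})$ are the solutions $\lambda$ to the equation $\det(\lambda I - H(\mathbf{x})) = 0$.
If all the eigenvalues of the Hessian matrix at a critical point are positive, then the critical point is a local minimum. If they are all negative, it is a local maximum, and if there are both positive and negative eigenvalues, it is a saddle point.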
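For a function of two variables the test can be sketched with closed-form eigenvalues of the symmetric 2×2 Hessian. The sample function $f(x,y) = x^2 + xy + y^2 - 3x$, its hand-computed gradient and Hessian, and the helper names are assumptions chosen for illustration; they do not come from the slides.

```python
import math
from typing import List, Tuple


def grad_f(x: float, y: float) -> Tuple[float, float]:
    """Gradient of the sample function f(x, y) = x^2 + x*y + y^2 - 3x."""
    return (2 * x + y - 3, x + 2 * y)


def symmetric_2x2_eigenvalues(h: List[List[float]]) -> Tuple[float, float]:
    """Roots of det(lambda*I - H) = 0 for a symmetric matrix [[a, b], [b, c]]."""
    a, b, c = h[0][0], h[0][1], h[1][1]
    mean = (a + c) / 2
    radius = math.sqrt(((a - c) / 2) ** 2 + b * b)
    return mean - radius, mean + radius


def classify(h: List[List[float]]) -> str:
    """Second derivative test based on the signs of the Hessian eigenvalues."""
    low, high = symmetric_2x2_eigenvalues(h)
    if low > 0:
        return "local minimum"
    if high < 0:
        return "local maximum"
    if low < 0 < high:
        return "saddle point"
    return "inconclusive"  # a zero eigenvalue: the test gives no answer


critical_point = (2.0, -1.0)       # solves grad_f = 0 for the sample function
hessian = [[2.0, 1.0], [1.0, 2.0]]  # constant Hessian of the sample function
verdict = classify(hessian)
```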
Now what if you wanted to maximize $f(x,y,z) = xyz$ subject to the constraint that $x^2 + y^2 + z^2 = 4$?
You could solve for one of the variables in the constraint equation, say $z$, substitute it into $f$, and then find where the partial derivatives equal zero to find candidates for the extremum. However, it seems you might encounter many cases and it does look a little fussy. Moreover, sometimes you can’t solve the constraint equation for one variable in terms of the others. Also, what if you had many constraints? What if you wanted to maximize $f(x,y,z)$ subject to the constraints $x^2 + y^2 = 4$ and $z = 2x + 3y^2$? Things are clearly getting more involved and messy.
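In the picture, the surface represents a piece of the level surface $g(x,y,z) = 0$, and $f(x,y,z)$ is the function of three variables which is being maximized or minimized on the level surface. Suppose the extremum of $f$ occurs at the point $(x_0,y_0,z_0)$. As shown above, $\nabla g(x_0,y_0,z_0)$ is perpendicular to the surface, or more precisely to the tangent plane. However, if $\mathbf{x}(t) = (x(t),y(t),z(t))$ is a point on a smooth curve in the surface which passes through $(x_0,y_0,z_0)$ when $t = t_0$, then the function $h(t) = f(x(t),y(t),z(t))$ must have either a maximum or a minimum at the point $t = t_0$. Therefore, $h'(t_0) = 0$. But this means
$0 = h'(t_0) = \nabla f(x_0,y_0,z_0) \cdot \mathbf{x}'(t_0)$,
and since this holds for every such curve, $\nabla f(x_0,y_0,z_0)$ is also perpendicular to the surface. Hence $\nabla f(x_0,y_0,z_0) = \lambda\nabla g(x_0,y_0,z_0)$ for some scalar $\lambda$, assuming $\nabla g(x_0,y_0,z_0) \neq \mathbf{0}$.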
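As a worked check of the condition that the two gradients are parallel, consider the earlier problem of maximizing $f = xyz$ on $x^2 + y^2 + z^2 = 4$. The candidate $x = y = z = 2/\sqrt{3}$ used below is obtained by solving the multiplier equations by hand and is an assumption of this sketch, not something stated on the slides; the code only verifies that it satisfies the constraint and that $\nabla f = \lambda\nabla g$ there.

```python
import math
from typing import Tuple

Vec3 = Tuple[float, float, float]


def f(x: float, y: float, z: float) -> float:
    return x * y * z


def grad_f(x: float, y: float, z: float) -> Vec3:
    return (y * z, x * z, x * y)


def g(x: float, y: float, z: float) -> float:
    """Constraint written as g = 0."""
    return x * x + y * y + z * z - 4.0


def grad_g(x: float, y: float, z: float) -> Vec3:
    return (2 * x, 2 * y, 2 * z)


s = 2.0 / math.sqrt(3.0)
candidate = (s, s, s)

# If grad f = lambda * grad g, the component-wise ratios must all agree.
multipliers = [df / dg for df, dg in zip(grad_f(*candidate), grad_g(*candidate))]
value = f(*candidate)
```

All three ratios agree (they equal $1/\sqrt{3}$), so the gradients are parallel at this point. By symmetry, points obtained by flipping the signs of two coordinates give the same value of $f$.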
A brief Intro to Probability and Statistics for Machine Learning
Machine Learning is an interdisciplinary field that uses statistics, probability, algorithms to learn from data and provide insights which can be used to build intelligent applications.
Probability and statistics are related areas of mathematics which concern themselves with analyzing the relative frequency of events.
Probability
Probability deals with predicting the likelihood of future events,
while statistics involves the analysis of the frequency of past events.
Most people have an intuitive understanding of degrees of probability, which is why we
use words like “probably” and “unlikely” in our daily conversation, but we will talk
about how to make quantitative claims about those degrees.
In probability theory, an event is a set of outcomes of an experiment to which a probability is assigned. If E represents an event, then P(E) represents the probability that E will occur. A situation where E might happen (success) or might not happen (failure) is called a trial.
The experiment can be anything like tossing a coin, rolling a die or pulling a ball out of a bag. In these examples the outcome is random, so the variable that represents the outcome is called a random variable.
Let us consider a basic example of tossing a coin. If the coin is fair, then it is just as likely to come up heads as it is to come up tails. In other words, if we were to toss the coin many times, we would expect about half of the tosses to be heads and half to be tails. In this case, we say that the probability of getting a head is 1/2 or 0.5.
The empirical probability of an event is given by the number of times the event occurs divided by the total number of trials observed. If in n trials we observe s successes, the empirical probability of success is s/n. In the above example, any particular sequence of coin tosses may have more or less than exactly 50% heads.
Theoretical probability, on the other hand, is given by the number of ways the event can occur divided by the total number of equally likely possible outcomes. A head can occur in one way and there are two possible outcomes (head, tail), so the true (theoretical) probability of a head is 1/2.
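The difference between empirical and theoretical probability can be illustrated with a short simulation. This sketch assumes a fair coin modelled with Python's `random` module; the function name, the seed and the number of tosses are arbitrary choices.

```python
import random


def empirical_probability_of_heads(n_tosses: int, seed: int = 0) -> float:
    """Toss a simulated fair coin n_tosses times and return s/n for heads."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    heads = sum(1 for _ in range(n_tosses) if rng.random() < 0.5)
    return heads / n_tosses


theoretical = 1 / 2  # one favourable outcome out of two equally likely ones
few = empirical_probability_of_heads(10)
many = empirical_probability_of_heads(100_000)
```

With only a handful of tosses the empirical value can be far from 0.5, while with many tosses it is expected to settle close to the theoretical value.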
Joint Probability
The probability of events A and B, denoted by P(A ∩ B), is the probability that events A and B both occur. When A and B are independent,
P(A ∩ B) = P(A) · P(B).
This product rule only applies if A and B are independent, which means that if A occurred, that doesn’t change the probability of B, and vice versa.
Conditional Probability
Now suppose A and B are not independent, for example because if A occurred, the probability of B is higher. When A and B are not independent, it is often useful to compute the conditional probability, P(A|B), which is the probability of A given that B occurred:
P(A|B) = P(A ∩ B) / P(B).
How do we evaluate P(+), the probability of all positive cases? We have to consider two possibilities, P(+|disease) and P(+|healthy). The probability P(+|healthy) is the complement of P(-|healthy). Thus P(+|healthy) = 0.05. Weighting each case by how common it is gives P(+) = P(+|disease) P(disease) + P(+|healthy) P(healthy).
Importantly, Bayes’ theorem, P(disease|+) = P(+|disease) P(disease) / P(+), reveals that in order to compute the conditional probability that you have the disease given the test was positive, you need to know the “prior” probability you have the disease, P(disease), given no information at all.
That is, you need to know the overall incidence of the disease in the population to which you belong. Assuming these tests are applied to a population where the actual incidence of the disease is found to be 0.5%, P(disease) = 0.005, which means P(healthy) = 0.995.
So, P(disease|+) = 0.95 * 0.005 / (0.95 * 0.005 + 0.05 * 0.995) ≈ 0.087
In other words, despite the apparent reliability of the test, the probability that you actually have the disease is still less than 9%. Getting a positive result increases the probability you have the disease. But it is incorrect to interpret the 95% test accuracy as the probability you have the disease.
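The calculation above can be reproduced in a few lines. The parameter names `sensitivity` (for P(+|disease)) and `false_positive_rate` (for P(+|healthy)) are labels introduced here for readability; the numbers are the ones used in the text.

```python
def posterior(prior: float, sensitivity: float, false_positive_rate: float) -> float:
    """P(disease | +) from Bayes' theorem."""
    # Total probability of a positive test over the diseased and healthy groups.
    p_positive = sensitivity * prior + false_positive_rate * (1 - prior)
    return sensitivity * prior / p_positive


# P(disease) = 0.005, P(+|disease) = 0.95, P(+|healthy) = 0.05
p_disease_given_positive = posterior(0.005, 0.95, 0.05)
```

Changing the prior shows how strongly the answer depends on the incidence of the disease: with a prior of 0.5 the same test gives a posterior of 0.95.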
Descriptive Statistics
Descriptive statistics refers to methods for summarizing and organizing the information in a data set. We will use the table below to describe some of the statistical concepts.
Elements: The entities for which information is collected are called the elements. In the above table, the elements are the 10 applicants. Elements are also called cases or subjects.
Variables: The characteristic of an element is called a variable. It can take different values for different elements, e.g., marital status, mortgage, income, rank, year, and risk. Variables are also called attributes.
Variables can be either qualitative or quantitative.
Qualitative: A qualitative variable enables the elements to be classified or categorized according
to some characteristic. The qualitative variables are marital status, mortgage, rank and risk.
Qualitative variables are also called categorical variables.
Quantitative: A quantitative variable takes numeric values and allows arithmetic to be
meaningfully performed on it. The quantitative variables are income and year. Quantitative
variables are also called numerical variables.
Discrete Variable: A numerical variable that can take either a finite or a countable number of
values is a discrete variable, for which each value can be graphed as a separate point, with space
between each point. ‘year’ is an example of a discrete variable.
Continuous Variable: A numerical variable that can take infinitely many values is a continuous
variable, whose possible values form an interval on the number line, with no space between the
points. ‘income’ is an example of a continuous variable.
Population: A population is the set of all elements of interest for a particular problem. A parameter is a characteristic of a population.
Random sample: A sample is a subset of the population. A random sample is a sample for which each element has an equal chance of being selected.
Measures of Center: Mean, Median, Mode, Mid-range
Mean: The mean is the arithmetic average of a data set. To calculate the mean, add up the values and divide by the number of values. The sample mean is the arithmetic average of a sample, and is denoted x̄ (“x-bar”). The population mean is the arithmetic average of a population, and is denoted μ (“mu”, the Greek letter for m).
Median: The median is the middle data value, when there is an odd number of data values and the
data have been sorted into ascending order. If there is an even number, the median is the mean of
the two middle data values. When the income data are sorted into ascending order, the two middle
values are $32,100 and $32,200, the mean of which is the median income, $32,150.
Mode: The mode is the data value that occurs with the greatest frequency. Both quantitative and
categorical variables can have modes, but only quantitative variables can have means or medians.
Each income value occurs only once, so there is no mode. The mode for year is 2010, with a
frequency of 4.
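Mid-range: The mid-range is the average of the maximum and minimum values in a data set, that is, (maximum + minimum) / 2. The mid-range income is obtained by applying this to the largest and smallest incomes in the table.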
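The four measures of center can be computed with Python's standard `statistics` module. The data below are made-up values chosen so the results are easy to check by hand; they are not the applicant table referred to in the text, and `mid_range` is a helper defined here because the module has no such function.

```python
import statistics
from typing import Sequence


def mid_range(values: Sequence[float]) -> float:
    """Average of the largest and smallest value."""
    return (max(values) + min(values)) / 2


data = [2, 3, 3, 5, 7, 10]  # hypothetical data set

mean = statistics.mean(data)      # (2 + 3 + 3 + 5 + 7 + 10) / 6
median = statistics.median(data)  # even count: mean of the two middle values
mode = statistics.mode(data)      # most frequent value
center = mid_range(data)
```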
Measures of Variability: Range, Variance, Standard Deviation
Range: The range only reflects the difference between the largest and smallest observations; it fails to reflect how the data are concentrated around the center.
Variance: Population variance is defined as the average of the squared differences from the mean, denoted $\sigma^2$ (sigma-squared):
$\sigma^2 = \dfrac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2$.
A larger variance means the data are more spread out.
The sample variance $s^2$ is approximately the mean of the squared deviations, with $N$ replaced by $n-1$:
$s^2 = \dfrac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$.
This difference occurs because the sample mean is used as an approximation of the true population mean.
Standard Deviation: The standard deviation (sd) of a set of numbers tells you how much the individual numbers tend to differ from the mean.
The sample standard deviation is the square root of the sample variance: $s = \sqrt{s^2}$.
For example, incomes typically deviate from their mean by about $7,201.
The population standard deviation is the square root of the population variance: $\sigma = \sqrt{\sigma^2}$.
The smaller the standard deviation, the narrower the peak and the closer the data points are to the mean. The further the data points are from the mean, the greater the standard deviation.
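The difference between dividing by N and by n − 1 is easy to see in code. The sketch below implements both variances by hand and uses a made-up data set (not the applicant incomes); the function names are illustrative.

```python
import math
from typing import Sequence


def population_variance(values: Sequence[float]) -> float:
    """Average squared deviation from the mean (divide by N)."""
    mu = sum(values) / len(values)
    return sum((x - mu) ** 2 for x in values) / len(values)


def sample_variance(values: Sequence[float]) -> float:
    """Squared deviations from the sample mean, divided by n - 1."""
    if len(values) < 2:
        raise ValueError("at least two values are needed")
    x_bar = sum(values) / len(values)
    return sum((x - x_bar) ** 2 for x in values) / (len(values) - 1)


data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical data with mean 5

sigma_squared = population_variance(data)
sigma = math.sqrt(sigma_squared)
s_squared = sample_variance(data)
s = math.sqrt(s_squared)
```

The standard library functions `statistics.pvariance`, `statistics.variance`, `statistics.pstdev` and `statistics.stdev` compute the same four quantities.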
Measures of Position: Percentile, Z-score, Quartiles
The Z-score of a data value x is the number of standard deviations it lies from the mean: z = (x − x̄) / s. So, if z is positive, the value is above the average, and if z is negative, it is below. For Applicant 6, the Z-score is (24,000 − 32,540) / 7201 ≈ −1.2, which means the income of Applicant 6 lies 1.2 standard deviations below the mean.
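The Z-score computation for Applicant 6 can be reproduced as follows, using the income, mean and standard deviation quoted in the text. The function name `z_score` is an illustrative choice.

```python
def z_score(value: float, mean: float, sd: float) -> float:
    """Number of standard deviations that value lies above (+) or below (-) the mean."""
    if sd <= 0:
        raise ValueError("standard deviation must be positive")
    return (value - mean) / sd


z_applicant_6 = z_score(24_000, 32_540, 7_201)
```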
Uni-variate Descriptive Statistics: Different ways you can describe patterns found in uni-variate data include central tendency (mean, mode and median) and dispersion (range, variance, maximum, minimum, quartiles, and standard deviation).
The various plots used to visualize uni-variate data typically are bar charts, histograms, pie charts, etc.
Bi-variate Descriptive Statistics: Bi-variate analysis involves the analysis of two variables for the purpose of determining the empirical relationship between them. The various plots used to visualize bi-variate data typically are scatter plots and box plots.
Scatter Plots: The simplest way to visualize the relationship between two quantitative variables , x and
y. For two continuous variables, a scatter-plot is a common graph. Each (x, y) point is graphed on a
Cartesian plane, with the x axis on the horizontal and the y axis on the vertical. Scatter plots are
sometimes called correlation plots because they show how two variables are correlated.
Correlation: A correlation is a statistic intended to quantify the strength of the relationship between two variables. The correlation coefficient r quantifies the strength and direction of the linear relationship between two quantitative variables. The correlation coefficient is defined as:
$r = \dfrac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{(n-1)\,s_x s_y}$
where $s_x$ and $s_y$ represent the sample standard deviations of the x-variable and the y-variable, respectively, and $-1 \le r \le 1$.
If r is positive and significant, we say that x and y are positively correlated. An increase in x is associated with an increase in y.
If r is negative and significant, we say that x and y are negatively correlated. An increase in x is associated with a decrease in y.
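A minimal sketch of the correlation coefficient, written to mirror the usual sample formula (sum of products of deviations divided by (n − 1) times the two sample standard deviations). The function name and the sample data are illustrative assumptions.

```python
import statistics
from typing import Sequence


def correlation(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Sample correlation coefficient r of paired observations."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long sequences with at least two values")
    n = len(xs)
    x_bar, y_bar = statistics.mean(xs), statistics.mean(ys)
    s_x, s_y = statistics.stdev(xs), statistics.stdev(ys)
    if s_x == 0 or s_y == 0:
        raise ValueError("r is undefined when one variable is constant")
    cross = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return cross / ((n - 1) * s_x * s_y)


xs = [1, 2, 3, 4, 5]
r_up = correlation(xs, [2, 4, 6, 8, 10])    # y rises exactly linearly with x
r_down = correlation(xs, [10, 8, 6, 4, 2])  # y falls exactly linearly with x
r_noisy = correlation(xs, [2, 1, 4, 3, 5])  # roughly increasing
```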
Box Plots: A box plot is also called a box and whisker plot and it’s used to picture the distribution
of values. When one variable is categorical and the other continuous, a box-plot is commonly
used. When you use a box plot you divide the data values into four parts called quartiles. You start
by finding the median or middle value. The median splits the data values into halves. Finding the
median of each half splits the data values into four parts, the quartiles.
Each box on the plot shows the range of values from the median of the lower half of the values at
the bottom of the box to the median of the upper half of the values at the top of the box. A line in
the middle of the box occurs at the median of all the data values. The whiskers then point to the
largest and smallest values in the data.
The five-number summary of a data set consists of the minimum, Q1, the median, Q3, and the maximum.
Box plots are especially useful for indicating whether a distribution is skewed and whether
there are potential unusual observations (outliers) in the data set.
The left whisker extends down to the minimum value which is not an outlier. The right whisker
extends up to the maximum value that is not an outlier. When the left whisker is longer than the
right whisker, then the distribution is left-skewed and vice versa. When the whiskers are about
equal in length, the distribution is symmetric.
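The quartile construction described above (split the sorted data at the median, then take the median of each half) can be sketched as follows. Quartile conventions differ between textbooks and software; this version excludes the overall median from both halves when the number of values is odd, which is one common choice and an assumption of this sketch.

```python
import statistics
from typing import Sequence, Tuple


def five_number_summary(values: Sequence[float]) -> Tuple[float, float, float, float, float]:
    """Return (minimum, Q1, median, Q3, maximum) using medians of the two halves."""
    data = sorted(values)
    n = len(data)
    if n < 2:
        raise ValueError("at least two values are needed")
    half = n // 2
    lower = data[:half]            # values below the median position
    upper = data[half + (n % 2):]  # values above it (skip the median if n is odd)
    return (
        data[0],
        statistics.median(lower),
        statistics.median(data),
        statistics.median(upper),
        data[-1],
    )


summary = five_number_summary([7, 1, 3, 5, 2, 8, 4, 6])  # hypothetical data
```

The box of a box plot spans Q1 to Q3 with a line at the median, so these five numbers are the values such a plot is drawn from when there are no outliers.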
Conclusion