Deep Networks Basics

Syllabus

Linear Algebra : Scalars - Vectors - Matrices and tensors; Probability Distributions - Gradient-based Optimization - Machine Learning Basics : Capacity - Overfitting and underfitting - Hyperparameters and validation sets - Estimators - Bias and variance - Stochastic gradient descent - Challenges motivating deep learning; Deep Networks : Deep feedforward networks; Regularization - Optimization.

Contents

1.1 Linear Algebra
1.2 Scalars, Vectors, Matrices and Tensors
1.3 Probability Distributions
1.4 Gradient based Optimization
1.5 Machine Learning Basics
1.6 Capacity, Overfitting and Underfitting
1.7 Hyperparameters and Validation Sets
1.8 Bias and Variance
1.9 Challenges Motivating Deep Learning
1.10 Deep Networks
1.11 Deep Feedforward Networks
1.12 Gradient-Based Learning
1.13 Regularization
1.14 Bagging
1.15 Semi-Supervised Learning
1.16 Multi-Task Learning
1.17 Optimization
1.18 Two Marks Questions with Answers

Deep Learning : Deep Networks Basics

1.1 Linear Algebra

* Linear algebra is the study of linear combinations. It is the study of vector spaces, lines and planes and the mappings that are required to perform linear transformations. It includes vectors, matrices and linear functions. It is also the study of linear sets of equations and their transformation properties.
* Linear algebra is about linear combinations. That is, using arithmetic on columns of numbers called vectors and arrays of numbers called matrices, to create new columns and arrays of numbers.
* The general linear equation is represented as,
  $a_1 x_1 + \dots + a_n x_n = b$
  where,
  a = Represents the coefficients
  x = Represents the unknowns
  b = Represents the constant
* Formally, a vector space is a set of vectors which is closed under addition and multiplication by real numbers.
* A subspace is a subset of a vector space which is a vector space itself, e.g. the plane z = 0 is a subspace of $\mathbb{R}^3$.
* If all vectors in a vector space may be expressed as linear combinations of $v_1, \dots, v_k$, then $v_1, \dots, v_k$ span the space.
* A basis is a set of linearly independent vectors which span the space. The dimension of a space is the number of "degrees of freedom" of the space; it is the number of vectors in any basis for the space. A basis is a maximal set of linearly independent vectors and a minimal set of spanning vectors.
* Two vectors are orthogonal if their dot product is 0. An orthogonal basis consists of orthogonal vectors. An orthonormal basis consists of orthogonal vectors of unit length.
* Functions of several variables are often presented in one line such as, $f(x, y) = 3x + 5y$

Vector addition :
* Numbers : Both 3 and 5 are numbers and so is 3 + 5.
* Polynomials : If $p(x) = 1 + x - 2x^2 + 3x^3$ and $q(x) = x + 3x^2 - 3x^3 + x^4$ then their sum $p(x) + q(x)$ is the new polynomial $1 + 2x + x^2 + x^4$.
* Power series : If $f(x) = 1 + x + \frac{1}{2!}x^2 + \frac{1}{3!}x^3 + \dots$ and $g(x) = 1 - x + \frac{1}{2!}x^2 - \frac{1}{3!}x^3 + \dots$ then $f(x) + g(x) = 2 + \frac{2}{2!}x^2 + \frac{2}{4!}x^4 + \dots$ is also a power series.
* Functions : If $f(x) = e^x$ and $g(x) = e^{-x}$ then their sum $f(x) + g(x)$ is the new function $2 \cosh x$.

TECHNICAL PUBLICATIONS® - an up-thrust for knowledge

1.2 Scalars, Vectors, Matrices and Tensors

* Scalars, vectors, matrices and tensors are the most important mathematical concepts of linear algebra.

1. Scalar :
* A scalar is a physical quantity that is represented by a dimensional number at a particular point in space and time. Examples are hydrostatic pressure and temperature. A scalar has magnitude but no direction. An example is pressure p. The coordinates x, y and z of Cartesian space are scalars.
* A scalar is only a single number, unlike most other objects studied in linear algebra, which are usually arrays of multiple numbers. We write scalars in italics.
* Scalar variables are denoted by ordinary lower-case letters (e.g. x, y, z).
* The continuous space of real-valued scalars is denoted by $\mathbb{R}$. For a scalar variable, the expression $x \in \mathbb{R}$ denotes that x is a real-valued scalar.

2. Vector :
* A vector is a bookkeeping tool to keep track of two pieces of information (typically magnitude and direction) for a physical quantity. Examples are position, force and velocity. The vector has three components.
* Let $\hat{i}, \hat{j}, \hat{k}$ denote unit vectors in the x, y and z directions. The hat denotes a magnitude of unity.
* A vector is an array of numbers. The numbers are listed in order. We can identify each individual number by its index in that order. Typically, we give vectors lowercase names written in bold type, such as **x**.

3. Tensors :
* What happens when we need to keep track of three pieces of information for a given physical quantity ? We need a tensor. Examples are stress and strain. The tensor has nine components.
* In some cases, we will need an array with more than two axes. In the general case, an array of numbers arranged on a regular grid with a varying number of axes is called a tensor. We denote a tensor named "A" with a distinct typeface : $\mathsf{A}$.

4. Matrix :
* A matrix is a 2D array of numbers, so each element is identified by two subscripts instead of just one. We usually give matrices uppercase variable names in bold type, such as **A**.
* We usually identify the elements of a matrix by using its name in italics but not in bold, and the subscripts are listed with separating commas.
* Figure shows scalars, vectors, matrices and tensors.

Fig. : Scalars, vectors, matrices and tensors

Transposes and Inner Products

* A collection of variables may be treated as a single entity by writing them as a vector. For example, the three variables $x_1$, $x_2$ and $x_3$ may be written as the vector
  $x = \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix}$
* Vectors can be written as column vectors, where the variables go down the page, or as row vectors, where the variables go across the page.
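The distinction between scalars, vectors, matrices and tensors comes down to how many indices are needed to reach a single number. The following is a minimal sketch using plain Python nested lists; the helper name `shape` and the sample values are illustrative assumptions, not part of the original text.

```python
def shape(array):
    """Return the length of each axis of a regularly nested list."""
    dims = []
    while isinstance(array, list):
        dims.append(len(array))
        array = array[0]  # descend one axis (assumes a regular grid)
    return tuple(dims)

scalar = 3.5                      # zero axes: a single number
vector = [1.0, 2.0, 3.0]          # one axis: one index per element
matrix = [[1, 2, 3],
          [4, 5, 6]]              # two axes: two subscripts per element
tensor = [[[1, 2], [3, 4]],
          [[5, 6], [7, 8]],
          [[9, 10], [11, 12]]]    # three axes

print(shape(scalar))   # ()
print(shape(vector))   # (3,)
print(shape(matrix))   # (2, 3)
print(shape(tensor))   # (3, 2, 2)

# Element A_{2,3} of the matrix in one-based notation (Python is zero-based).
print(matrix[1][2])    # 6
```

The number of entries in the returned tuple is the number of axes, which is what separates a vector (one axis) from a matrix (two) and from a general tensor (any number).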
* To turn a column vector into a row vector we use the transpose operator :
  $x^T = [x_1, x_2, x_3]$
  The transpose operator also turns row vectors into column vectors.
* We now define the inner product of two vectors :
  $x^T y = [x_1 \; x_2 \; x_3] \begin{bmatrix} y_1 \\ y_2 \\ y_3 \end{bmatrix} = x_1 y_1 + x_2 y_2 + x_3 y_3$
  which is seen to be a scalar.
* The outer product of two vectors produces a matrix :
  $x y^T = \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix} [y_1 \; y_2 \; y_3] = \begin{bmatrix} x_1 y_1 & x_1 y_2 & x_1 y_3 \\ x_2 y_1 & x_2 y_2 & x_2 y_3 \\ x_3 y_1 & x_3 y_2 & x_3 y_3 \end{bmatrix}$
* An N × M matrix has N rows and M columns. The $ij$-th entry of a matrix is the entry in the $j$-th column of the $i$-th row. Given a matrix A, the $ij$-th entry is written as $A_{ij}$. When applying the transpose operator to a matrix, the $i$-th row becomes the $i$-th column. That is, if
  $A = \begin{bmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{bmatrix}$
  then
  $A^T = \begin{bmatrix} a_{11} & a_{21} & a_{31} \\ a_{12} & a_{22} & a_{32} \\ a_{13} & a_{23} & a_{33} \end{bmatrix}$
* A matrix is symmetric if $A_{ij} = A_{ji}$. Another way to say this is that, for symmetric matrices, $A = A^T$.
* Two matrices can be multiplied if the number of columns in the first matrix equals the number of rows in the second. Multiplying A, an N × M matrix, by B, an M × K matrix, results in C, an N × K matrix. The $ij$-th entry in C is the inner product between the $i$-th row in A and the $j$-th column in B.
* Example : [matrix entries illegible in the source]
* Given two matrices A and B we note that $(AB)^T = B^T A^T$.

Solution : Here the matrix C will also be 2 × 2, with
  $c_{11} = 18 + (-7) = 11$,
  $c_{12} = 12 + 2 = 14$,
  $c_{21} = -12 + 35 = 23$,
  $c_{22} = -8 + (-10) = -18$,
  so that
  $C = \begin{bmatrix} c_{11} & c_{12} \\ c_{21} & c_{22} \end{bmatrix} = \begin{bmatrix} 11 & 14 \\ 23 & -18 \end{bmatrix}$

1.3 Probability Distributions

* A probability distribution is a description of how likely a random variable or set of random variables is to take on each of its possible states. The way we describe probability distributions depends on whether the variables are discrete or continuous.
* A random variable, usually written X, is a variable whose possible values are numerical outcomes of a random phenomenon.
Discrete Variables and Probability Mass Functions

* A probability distribution over discrete variables may be described using a Probability Mass Function (PMF).
* A random variable is called a discrete random variable if it is defined over a sample space having a finite or a countably infinite number of sample points. In this case, the random variable takes on discrete values and it is possible to enumerate all the values it may assume.
* A discrete random variable can only take a specific (finite or countable) set of numerical values. For example, a variable whose possible values are {1, 4, 9, 16, 25, 36} is a discrete random variable.
* In the case of a sample space having an uncountably infinite number of sample points, the associated random variable is called a continuous random variable, with its values distributed over one or more continuous intervals on the real line.
* The probability distribution of a discrete random variable is a list of probabilities associated with each of its possible values. It is also sometimes called the probability function or the probability mass function.
* The probability mass function maps from a state of a random variable to the probability of that random variable taking on that state. The probability that $\mathrm{x} = x$ is denoted as $P(x)$, with a probability of 1 indicating that $\mathrm{x} = x$ is certain and a probability of 0 indicating that $\mathrm{x} = x$ is impossible.
* Probability mass functions can act on many variables at the same time. Such a probability distribution over many variables is known as a joint probability distribution. $P(\mathrm{x} = x, \mathrm{y} = y)$ denotes the probability that $\mathrm{x} = x$ and $\mathrm{y} = y$ simultaneously.
* The probability mass function (pmf) of a random variable X is the set of probabilities for $X = x_k$. Mathematically it is written as,
  $P_X(x_k) = P[X = x_k]$
* The probability mass function is non-negative and is defined on a countable number of values of x.
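The defining properties of a probability mass function can be checked mechanically. The sketch below assumes a hypothetical random variable X equal to the square of a fair six-sided die roll, so that X takes the values {1, 4, 9, 16, 25, 36} used as the example above; the fair-die assumption and the helper names `prob` and `is_valid_pmf` are illustrative, not from the original text.

```python
from fractions import Fraction

# Assumed example: X = (fair die roll)^2, each value has probability 1/6.
pmf = {k * k: Fraction(1, 6) for k in range(1, 7)}

def prob(pmf, x):
    """P(X = x); states that are not listed are impossible (probability 0)."""
    return pmf.get(x, Fraction(0))

def is_valid_pmf(pmf):
    """A PMF is non-negative everywhere and sums to one over all states."""
    return all(p >= 0 for p in pmf.values()) and sum(pmf.values()) == 1

# Joint PMF of two independent copies of X: P(x, y) = P(x) * P(y).
joint = {(x, y): px * py for x, px in pmf.items() for y, py in pmf.items()}

print(prob(pmf, 9))            # 1/6
print(prob(pmf, 10))           # 0
print(is_valid_pmf(pmf))       # True
print(joint[(1, 4)])           # 1/36
print(sum(joint.values()))     # 1
```

Exact fractions are used instead of floats so that the "sums to one" check is not affected by rounding.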
* Normally the probability mass function (pmf) is defined for discrete random variables and the probability density function (pdf) is defined for continuous random variables.
* Relation between probability mass function (pmf) and cumulative distribution function (cdf) : The CDF is defined in the next section as $F_X(x)$. The relationship is given by,
  $F_X(x) = \sum_{x_k \le x} P_X(x_k)$
* The sum of the pmf over all values $x_k$, k = 1, 2, ..., is equal to unity, i.e.,
  $\sum_k P_X(x_k) = 1$

Continuous Variables and Probability Density Functions

* A continuous random variable is one which takes an infinite number of possible values. Continuous random variables are usually measurements. Examples include height, weight, the amount of sugar in an orange and the time required to run a mile.
* A continuous random variable is not defined at specific values. Instead, it is defined over an interval of values and is represented by the area under a curve. The probability of observing any single value is equal to 0, since the number of values which may be assumed by the random variable is infinite.
* A continuous random variable is one having a continuous range of values. It cannot be produced from a discrete sample space because of the requirement that all random variables be single-valued functions of all sample space points.
* When working with continuous random variables, we describe probability distributions using a probability density function (PDF) rather than a probability mass function.
* A probability density function p(x) does not give the probability of a specific state directly; instead the probability of landing inside an infinitesimal region with volume $\delta x$ is given by $p(x)\,\delta x$.

1.4 Gradient based Optimization

* Optimization refers to the task of either minimizing or maximizing some function f(x) by altering x. The function we want to minimize or maximize is called the objective function or criterion.
When we are minimizing it, we may also call it the cost function, loss function or error function.

Gradient Descent

* Goal : Solving nonlinear minimization problems using derivative information.
* First and second derivatives of the objective function or the constraints play an important role in optimization. The first order derivatives are called the gradient and the second order derivatives are called the Hessian matrix.
* Derivative based optimization is also called nonlinear optimization. It is capable of determining "search directions" according to an objective function's derivative information.
* Derivative based optimization methods are used for :
  1. Optimization of nonlinear neuro-fuzzy models.
  2. Neural network learning.
  3. Regression analysis in nonlinear models.
* Basic descent methods are as follows :
  1. Steepest descent
  2. Newton-Raphson method

Gradient Descent :
* Gradient descent is a first-order optimization algorithm. To find a local minimum of a function using gradient descent, one takes steps proportional to the negative of the gradient of the function at the current point.
* Gradient descent is popular for very large-scale optimization problems because it is easy to implement, can handle black box functions and each iteration is cheap.
* Given a differentiable scalar field f(x) and an initial guess $x_1$, gradient descent iteratively moves the guess toward lower values of f by taking steps in the direction of the negative gradient $-\nabla f(x)$.
* Locally, the negated gradient is the steepest descent direction, i.e., the direction in which x would need to move in order to decrease f the fastest. The algorithm typically converges to a local minimum, but may rarely reach a saddle point, or not move at all if $x_1$ lies at a local maximum.
* The gradient gives the slope of the curve at x, and its direction points toward an increase in the function.
So we change x in the opposite direction to lower the function value :
  $x_{k+1} = x_k - \lambda \nabla f(x_k)$
* Here $\lambda > 0$ is a small number that forces the algorithm to make small jumps.

Limitations of Gradient Descent :
* Gradient descent is relatively slow close to the minimum : Technically, its asymptotic rate of convergence is inferior to many other methods.
* For poorly conditioned convex problems, gradient descent increasingly "zigzags" as the gradients point nearly orthogonally to the shortest direction to a minimum point.

Steepest Descent :
* Steepest descent is also known as the gradient method.
* This method is based on a first order Taylor series approximation of the objective function. This method is also called the saddle point method. Fig. 1.4.1 shows the steepest descent method.

Fig. 1.4.1 Steepest descent method

* Steepest descent is the simplest of the gradient methods. The chosen direction is the one in which f decreases most quickly, which is the direction opposite to the gradient. The search starts at an arbitrary point and then goes down the gradient until it reaches a point close to the solution.
* The method of steepest descent is the discrete analogue of gradient descent, but the best move is computed using a local minimization rather than computing a gradient. It is typically able to converge in few steps but it is unable to escape local minima or plateaus in the objective function.
* The gradient is everywhere perpendicular to the contour lines.
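The update rule $x_{k+1} = x_k - \lambda \nabla f(x_k)$ and the zigzag behaviour on poorly conditioned problems can both be seen in a few lines of code. This is a minimal sketch under stated assumptions: the test function $f(x, y) = x^2 + 10y^2$, the starting point and the two step sizes are chosen only for illustration, and the name `gradient_descent` is hypothetical.

```python
def gradient_descent(grad, start, step, iterations):
    """Repeat x <- x - step * grad(x) and return every iterate."""
    x = list(start)
    path = [x]
    for _ in range(iterations):
        g = grad(x)
        x = [xi - step * gi for xi, gi in zip(x, g)]
        path.append(x)
    return path

# Assumed test function f(x, y) = x^2 + 10 y^2 with its minimum at the origin.
# Its gradient is (2x, 20y); the curvatures differ by a factor of ten, so the
# problem is mildly ill-conditioned.
def grad_f(point):
    x, y = point
    return [2.0 * x, 20.0 * y]

# A small step (lambda = 0.05) converges toward the minimum at (0, 0).
path = gradient_descent(grad_f, [5.0, 1.0], step=0.05, iterations=200)
minimum = path[-1]
print(minimum)

# A step close to the stability limit (lambda = 0.09) still converges, but the
# y coordinate overshoots and changes sign at every iteration: the zigzag.
zigzag = gradient_descent(grad_f, [5.0, 1.0], step=0.09, iterations=4)
y_signs = [p[1] > 0 for p in zigzag]
print(y_signs)   # [True, False, True, False, True]
```

For this particular function any step of 0.1 or more makes the y coordinate stop converging, which is why $\lambda$ has to be kept small.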
* After each line minimization the new gradient is always orthogonal to the previous step direction. Consequently, the iterates tend to zig-zag down the valley in a very inefficient manner.
* The method of steepest descent is simple, easy to apply and each iteration is fast. It is also very stable; if the minimum points exist, the method is guaranteed to locate them, although this may take an infinite number of iterations.

Jacobian and Hessian Matrices

1. Jacobian Matrices
* The Jacobian is the matrix of all first-order partial derivatives of a vector- or scalar-valued function with respect to another vector. The Jacobian of a function describes the orientation of a tangent plane to the function at a given point. Likewise, the Jacobian can also be thought of as describing the amount of "stretching" that a transformation imposes.
* The Jacobian determinant at a given point gives important information about the behavior of F near that point. For instance, the continuously differentiable function F is invertible near a point p if the Jacobian determinant at p is non-zero. This is the inverse function theorem.
* Furthermore, if the Jacobian determinant at p is positive, then F preserves orientation near p. If it is negative, F reverses orientation.
* The absolute value of the Jacobian determinant at p gives us the factor by which the function F expands or shrinks volumes near p.

2. Hessian Matrices
* The Hessian is the square matrix of second-order partial derivatives of a function. It describes the local curvature of a function of many variables. If all second partial derivatives of f exist, then the Hessian matrix of f is the matrix,
  $H(f)_{ij}(x) = \frac{\partial^2 f}{\partial x_i \, \partial x_j}(x)$
* If the gradient of f is zero at some point x, then f has a critical point (or stationary point) at x. The determinant of the Hessian at x is then called the discriminant.
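The gradient, the Hessian and the discriminant at a critical point can be estimated numerically with finite differences. The following is a sketch under stated assumptions: the function $f(x, y) = x^2 - y^2$ (which has a critical point at the origin), the step sizes `h` and the helper names `gradient` and `hessian` are illustrative choices, not part of the original text.

```python
def gradient(f, x, h=1e-5):
    """Central-difference estimate of the gradient of f at x."""
    grad = []
    for i in range(len(x)):
        up, down = list(x), list(x)
        up[i] += h
        down[i] -= h
        grad.append((f(up) - f(down)) / (2 * h))
    return grad

def hessian(f, x, h=1e-4):
    """Central-difference estimate of the matrix of second partial derivatives."""
    n = len(x)
    H = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            pp, pm, mp, mm = list(x), list(x), list(x), list(x)
            pp[i] += h; pp[j] += h
            pm[i] += h; pm[j] -= h
            mp[i] -= h; mp[j] += h
            mm[i] -= h; mm[j] -= h
            H[i][j] = (f(pp) - f(pm) - f(mp) + f(mm)) / (4 * h * h)
    return H

# Assumed example: f(x, y) = x^2 - y^2 has zero gradient at the origin.
def f(point):
    x, y = point
    return x * x - y * y

origin = [0.0, 0.0]
g = gradient(f, origin)
H = hessian(f, origin)

# For a 2 x 2 Hessian the discriminant is its determinant.
discriminant = H[0][0] * H[1][1] - H[0][1] * H[1][0]
print(g)             # close to [0, 0]: the origin is a critical point
print(H)             # close to [[2, 0], [0, -2]]
print(discriminant)  # close to -4
```

In practice an automatic differentiation library would replace the finite differences; they are used here only to keep the sketch free of dependencies.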
* If this determinant is zero then x is called a degenerate critical point of f; this is also called a non-Morse critical point of f. Otherwise it is non-degenerate; this is called a Morse critical point of f.
* Optimization algorithms that use only the gradient, such as gradient descent, are called first-order optimization algorithms. Optimization algorithms that also use the Hessian matrix, such as Newton's method, are called second-order optimization algorithms.

1.5 Machine Learning Basics

* Definition : A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.
* A (machine learning) problem is well-posed if a solution to it exists, if that solution is unique and if that solution depends on the data / experience but is not sensitive to (reasonably small) changes in the data / experience.
* The three features to identify are as follows :
  1. Class of tasks
  2. Measure of performance to be improved
  3. Source of experience
* What are T, P, E ? How do we formulate a machine learning problem ?
* A robot driving learning problem
  1. Task T : Driving on a public, 4-lane highway using vision sensors.
  2. Performance measure P : Average distance traveled before an error (as judged by a human overseer).
  3. Training experience E : A sequence of images and steering commands recorded while observing a human driver.
* A handwriting recognition learning problem
  1. Task T : Recognizing and classifying handwritten words within images.
  2. Performance measure P : Percent of words correctly classified.
  3. Training experience E : A database of handwritten words with given classifications.
* Text categorization problem
  1. Task T : Assign a document to its content category.
  2. Performance measure P : Precision and recall.
  3. Training experience E : Example pre-classified documents.

The Task (T)

* Machine learning allows us to tackle tasks that are too difficult to solve with fixed programs written and designed by human beings. Machine learning tasks are usually described in terms of how the machine learning system should process an example.
* Some of the most common machine learning tasks include the following :

1. Classification :
* Classification predicts categorical labels (classes), whereas prediction models continuous-valued functions. Classification is considered to be supervised learning.
* Classification builds a model from the training set and the values in a classifying attribute and uses it to classify new data. Prediction models continuous-valued functions, i.e. it predicts unknown or missing values.
* Preprocessing of the data in preparation for classification and prediction can involve data cleaning to reduce noise or handle missing values, relevance analysis to remove irrelevant or redundant attributes and data transformation, such as generalizing the data to higher level concepts or normalizing data.
* Classification is the process of finding a model that describes and distinguishes data classes or concepts, for the purpose of being able to use the model to predict the class of objects whose class label is unknown. The derived model is based on the analysis of a set of training data.
* Classification with missing inputs : Classification becomes more challenging if the computer program is not guaranteed that every measurement in its input vector will always be provided.
* In order to solve the classification task, the learning algorithm only has to define a single function mapping from a vector input to a categorical output. When some of the inputs may be missing, rather than providing a single classification function, the learning algorithm must learn a set of functions.
* Each function corresponds to classifying x with a different subset of its inputs missing. This kind of situation arises frequently in medical diagnosis, because many kinds of medical tests are expensive or invasive.

2. Regression :
* For an input x, if the output is continuous, this is called a regression problem. For example, based on historical information about the demand for toothpaste in your supermarket, you are asked to predict the demand for the next month.
* Regression is concerned with the prediction of continuous quantities. Linear regression is the oldest and most widely used predictive model in the field of machine learning. The goal is to minimize the sum of the squared errors to fit a straight line to a set of data points.
* For regression tasks, the typical accuracy metrics are Root Mean Square Error (RMSE) and Mean Absolute Percentage Error (MAPE). These metrics measure the distance between the predicted numeric target and the actual numeric answer.

3. Transcription :
* In this type of task, the machine learning system is asked to observe a relatively unstructured representation of some kind of data and transcribe it into discrete, textual form.
* For example, in optical character recognition, the computer program is shown a photograph containing an image of text and is asked to return this text in the form of a sequence of characters.

4. Machine translation :
* In a machine translation task, the input already consists of a sequence of symbols in some language and the computer program must convert this into a sequence of symbols in another language. This is commonly applied to natural languages.

5. Structured output :
* Structured output tasks involve any task where the output is a vector with important relationships between the different elements.
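The two regression metrics named in the task list above, RMSE and MAPE, are simple to compute from paired actual and predicted values. The sketch below uses their standard definitions; the sample demand figures and the function names `rmse` and `mape` are made up for illustration.

```python
import math

def rmse(actual, predicted):
    """Root Mean Square Error: square root of the mean squared difference."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))

def mape(actual, predicted):
    """Mean Absolute Percentage Error, in percent (actual values must be non-zero)."""
    ratios = [abs((a - p) / a) for a, p in zip(actual, predicted)]
    return 100.0 * sum(ratios) / len(ratios)

# Hypothetical monthly demand figures and a model's predictions for them.
actual = [100.0, 200.0, 400.0]
predicted = [110.0, 190.0, 400.0]

print(rmse(actual, predicted))   # about 8.16 (same units as the target)
print(mape(actual, predicted))   # about 5.0 (percent)
```

RMSE is expressed in the units of the target and penalizes large errors more heavily, while MAPE is scale-free but undefined when an actual value is zero.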
The Performance Measure (P)

* In order to evaluate the abilities of a machine learning algorithm, we must design a quantitative measure of its performance. Usually this performance measure P is specific to the task T being carried out by the system.
* Accuracy and error rate are used to measure the performance. For tasks such as density estimation, it does not make sense to measure accuracy, error rate or any other kind of 0-1 loss.
* The choice of performance measure may seem straightforward and objective, but it is often difficult to choose a performance measure that corresponds well to the desired behavior of the system.

The Experience (E)

* Machine learning algorithms are classified as supervised or unsupervised by the kind of experience they are allowed to have during the learning process.

Supervised learning :
* In supervised learning, a teacher provides a category label or cost for each pattern in a training set and users seek to reduce the sum of the costs for these patterns. The training data consists of a set of training examples.
* The aim is to learn a target function that can be used to predict the values of a discrete class attribute, e.g. approved or not-approved and high-risk or low-risk. The task is commonly called : Supervised learning, classification or inductive learning.
* Training data includes both the input and the desired results. For some examples the correct results (targets) are known and are given as input to the model during the learning process. The construction of proper training, validation and test sets is crucial. These methods are usually fast and accurate.

Unsupervised learning :
* The model is not provided with the correct results during the training. It can be used to cluster the input data in classes on the basis of their statistical properties only. Cluster significance and labeling.
* The labeling can be carried out even if the labels are only available for a small number of objects representative of the desired classes.
* All similar input patterns are grouped together as clusters.
* If a matching pattern is not found, a new cluster is formed. There is no error feedback.
* An external teacher is not used and learning is based upon only local information. It is also referred to as self-organization.
* These methods are called unsupervised because they do not need a teacher or supervisor to label a set of training examples. Only the original data is required to start the analysis.
* The term supervised learning originates from the view of the target y being provided by an instructor or teacher who shows the machine learning system what to do. In unsupervised learning, there is no instructor or teacher and the algorithm must learn to make sense of the data without this guide.
* Traditionally, people refer to regression, classification and structured output problems as supervised learning. Density estimation in support of other tasks is usually considered unsupervised learning.

Reinforcement Learning :
* Reinforcement learning is the training of machine learning models to make a sequence of decisions.
* Both supervised and reinforcement learning use a mapping between input and output. Unlike supervised learning, where the feedback provided to the agent is the correct set of actions for performing a task, reinforcement learning uses rewards and punishment as signals for positive and negative behaviour.
* Reinforcement learning elements are as follows :
  1. Policy
  2. Reward function
  3. Value function
  4. Model of the environment
* Policy : The policy defines the learning agent's behavior for a given time period. It is a mapping from perceived states of the environment to actions to be taken when in those states.
* Reward function : The reward function is used to define a goal in a reinforcement learning problem. It maps each perceived state of the environment to a single number.
* Value function : Value functions specify what is good in the long run.
The value of a state is the total amount of reward an agent can expect to accumulate over the future, starting from that state.
* Model of the environment : Models are used for planning.
* With reinforcement learning algorithms an agent can improve its performance by using the feedback it gets from the environment. This environment feedback is called the reward signal.
* Based on accumulated experience, the agent needs to learn which action to take in a given situation in order to obtain a desired long term goal. Essentially, actions that lead to long term rewards need to be reinforced. Reinforcement learning has connections with control theory, Markov decision processes and game theory.
* Example of reinforcement learning : A mobile robot decides whether it should enter a new room in search of more trash to collect or start trying to find its way back to its battery recharging station. It makes its decision based on how quickly and easily it has been able to find the recharger in the past.

Supervised learning :
* Every input pattern is associated with an output pattern.
* Trying to predict a function from labeled data.
* Supervised learning requires that the target variable is well defined and that a sufficient number of its values are given.
* Example : Optical character recognition.
* We can test our model.
* Supervised learning is also called classification.

Unsupervised learning :
* It is possible to learn larger and more complex models with unsupervised learning.
* For unsupervised learning, typically either the target variable is unknown or has only been recorded for too small a number of cases.
Difference between Supervised, Unsupervised and Reinforcement Learning

Supervised learning | Unsupervised learning | Reinforcement learning
Supervised learning deals with two tasks : regression and classification. | Unsupervised learning deals with clustering and associative rule mining problems. | [illegible in source]
The input data in supervised learning is labelled data. | Unsupervised learning uses unlabelled data. | The data is not predefined in reinforcement learning.
Learns by using labelled data. | Trained using unlabelled data without any guidance. | Works on interacting with the environment.
Maps the labelled inputs to the known outputs. | Understands patterns and discovers the output. | Follows the trial and error method.

Linear Regression Models

* Linear regression is a statistical method that allows us to summarize and study relationships between two continuous (quantitative) variables.
* The objective of a linear regression model is to find a relationship between the input variables and a target variable.
  1. One variable, denoted x, is regarded as the predictor, explanatory or independent variable.
  2. The other variable, denoted y, is regarded as the response, outcome or dependent variable.
* Regression models predict a continuous variable, such as the sales made on a day or the temperature of a city. Imagine that we fit a line to the training points that we have. If we then add another data point, the existing model has to change in order to fit it.
* This will happen with each data point that we add to the model; hence, linear regression is not good for classification models.
• Regression estimates are used to explain the relationship between one dependent variable and one or more independent variables. Classification predicts categorical labels (classes), whereas prediction models continuous-valued functions. Classification is considered to be supervised learning.
• Classification classifies data based on the training set and the values in a classifying attribute, and uses the result to classify new data. Prediction models continuous-valued functions, i.e. it predicts unknown or missing values.
• The regression line gives the average relationship between the two variables in mathematical form.
• For two variables X and Y, there are always two lines of regression.
• Regression line of X on Y : Gives the best estimate for the value of X for any specific given value of Y :
$X = a + bY$
where a = X-intercept, b = Slope of the line, X = Dependent variable, Y = Independent variable.
• Regression line of Y on X : Gives the best estimate for the value of Y for any specific given value of X :
$Y = a + bX$
where a = Y-intercept, b = Slope of the line, Y = Dependent variable, X = Independent variable.
• By using the least squares method (a procedure that minimizes the vertical deviations of plotted points surrounding a straight line) we are able to construct a best fitting straight line to the scatter diagram points and then formulate a regression equation in the form of :
$\hat{y} = a + bX$, or equivalently $\hat{y} = \bar{y} + b(x - \bar{x})$
[Figure : linear regression unit with a bias term and an input vector $x_1, x_2, \ldots, x_d$]
• Regression analysis is the art and science of fitting straight lines to patterns of data. In a linear regression model, the variable of interest ("dependent" variable) is predicted from k other variables ("independent" variables) using a linear equation. If Y denotes the dependent variable and $X_1, \ldots, X_k$ are the independent variables, then the assumption is that the value of Y at time t in the data sample is determined by the linear equation :
$Y_t = \beta_0 + \beta_1 X_{1t} + \beta_2 X_{2t} + \ldots + \beta_k X_{kt} + \varepsilon_t$
where the betas are constants and the epsilons are independent and identically distributed normal random variables with mean zero.
• In tree-based regression models, at each split point the "error" between the predicted value and the actual values is squared to get a "Sum of Squared Errors (SSE)". The split point errors across the variables are compared and the variable/point yielding the lowest SSE is chosen as the root node/split point. This process is continued recursively.
• The error function measures how much our predictions deviate from the desired answers.
Mean-squared error : $J_n = \frac{1}{n} \sum_{i=1}^{n} (y_i - f(x_i))^2$
Advantages :
a. Training a linear regression model is usually much faster than methods such as neural networks.
b. Linear regression models are simple and require minimum memory to implement.
c. By examining the magnitude and sign of the regression coefficients one can infer how predictor variables affect the target outcome.

1.6 Capacity, Overfitting and Underfitting
• The ability to perform well on previously unobserved inputs is called generalization. The generalization error is defined as the expected value of the error on a new input. Typically, when training a machine learning model, we have access to a training set and can compute some error measure on it, called the training error. We typically estimate the generalization error of a machine learning model by measuring its performance on a test set of examples that were collected separately from the training set.
• The train and test data are generated by a probability distribution over datasets called the data generating process. The assumptions are that the examples in each dataset are independent from each other and that the train set and test set are identically distributed, drawn from the same probability distribution as each other.
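The relation between training error and test error can be illustrated with a small simulation. The sketch below assumes a hypothetical data generating process, y = 2x + 1 plus Gaussian noise, draws a training set and a test set independently from it, fits a line on the training set only and reports the mean-squared error on both. The process, sample sizes and helper names are assumptions made for the example.

```python
import random


def sample(rng, n):
    """Draw n i.i.d. examples from a hypothetical process y = 2x + 1 + noise."""
    xs = [rng.uniform(0.0, 1.0) for _ in range(n)]
    ys = [2.0 * x + 1.0 + rng.gauss(0.0, 0.1) for x in xs]
    return xs, ys


def fit_line(xs, ys):
    """Least-squares fit of y = a + b*x; returns (a, b)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    return mean_y - b * mean_x, b


def mse(xs, ys, a, b):
    """Mean-squared error of the line a + b*x on the given examples."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys)) / len(xs)


rng = random.Random(0)                  # fixed seed so the run is repeatable
train_x, train_y = sample(rng, 20)      # training set
test_x, test_y = sample(rng, 1000)      # test set, same distribution

a, b = fit_line(train_x, train_y)       # parameters chosen on the training set only
train_error = mse(train_x, train_y, a, b)
test_error = mse(test_x, test_y, a, b)
print("training error:", train_error)
print("test error    :", test_error)
```

The test error here serves as the estimate of the generalization error, since the test examples were not used to choose `a` and `b`.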
• This assumption allows us to describe the data generating process with a probability distribution over a single example. The same distribution is then used to generate every train example and every test example. The key observation relating training and test error is that the expected training error of a randomly selected model is equal to the expected test error of that model.
• In practice we sample the training set, then use it to choose the parameters to reduce training set error, then sample the test set. Under this process, the expected test error is greater than or equal to the expected value of training error. The factors determining how well a machine learning algorithm will perform are its ability to :
1. Make the training error small.
2. Make the gap between training and test error small.
• These two factors correspond to the two central challenges in machine learning : Underfitting and overfitting.
• Underfitting occurs when the model is not able to obtain a sufficiently low error value on the training set. Overfitting occurs when the gap between the training error and test error is too large.
• Training error can be reduced by making the hypothesis more sensitive to training data, but this may lead to overfitting and poor generalization.
• Overfitting occurs when a statistical model describes random error or noise instead of the underlying relationship.
• Overfitting is when a classifier fits the training data too tightly. Such a classifier works well on the training data but not on independent test data. It is a general problem that plagues all machine learning methods. Overfitting generally occurs when a model is excessively complex.
[Figure : underfit, balanced and overfit models]
Reasons for overfitting :
1. Noisy data
2. Training set is too small
3. Large number of features.
• To prevent overfitting we have several options :
1. Restrict the number of adjustable parameters the network has, e.g. by reducing the number of hidden units or by forcing connections to share the same weight values.
2. Stop the training early, before it has time to learn the training data too well.
3. Add some form of regularization term to the error/cost function to encourage smoother network mappings.
4. Add noise to the training patterns to smear out the data points.
• Often several heuristics are developed in order to avoid overfitting. For example, when designing neural networks one may :
1. Limit the number of hidden nodes.
2. Stop training early to avoid a perfect explanation of the training set and
3. Apply weight decay to limit the size of the weights and thus of the function class implemented by the network.
• Definition : Given a hypothesis space H, a hypothesis h ∈ H is said to overfit the training data if there exists some alternative hypothesis h' ∈ H, such that h has smaller error than h' over the training examples, but h' has a smaller error than h over the entire distribution of instances.
• Occam's Razor states : Given two different explanations that account equally well for the same observations, preference should be given to the simpler explanation. This reduces the number of falsifiable assumptions on which the hypothesis relies, thereby keeping the hypothesis robust.
• Applied to machine learning, this means preferring a less complex model on the training dataset so that the prediction error on the testing sample is as low as possible. In fact one should optimise the average over several testing datasets by way of a cross-validation applied to multiple train-test splits.
• Statistical learning theory provides various means of quantifying model capacity. Among these, the most well-known is the Vapnik-Chervonenkis (VC) dimension. The VC dimension provides a measure of the complexity of a space of functions and allows the probably approximately correct framework to be extended to spaces containing an infinite number of functions. The Vapnik-Chervonenkis dimension is a measure of the complexity or capacity of a class of functions f(α).
The VC dimension measures the largest number of examples that can be explained by the family f(α). The Vapnik-Chervonenkis dimension, VC(H), of hypothesis space H defined over instance space X is the size of the largest finite subset of X shattered by H. If arbitrarily large finite sets of X can be shattered by H, then VC(H) = ∞.
• The basic argument is that high capacity and generalization properties are at odds.
1. If the family f(α) has enough capacity to explain every possible dataset, we should not expect these functions to generalize very well.
2. On the other hand, if functions f(α) have small capacity but they are able to explain our particular dataset, we have stronger reasons to believe that they will also work well on unseen data.
• The problem of determining the capacity of a deep learning model is especially difficult because the effective capacity is limited by the capabilities of the optimization algorithm, and we have little theoretical understanding of the very general non-convex optimization problems involved in deep learning.
• Training and generalization error vary as the size of the training set varies. Expected generalization error can never increase as the number of training examples increases. For non-parametric models, more data yields better generalization until the best possible error is achieved.

The No Free Lunch Theorem
• The no free lunch theorems state that any one algorithm that searches for an optimal cost or fitness solution is not universally superior to any other algorithm.
• The no free lunch theorem for search and optimization applies to finite spaces and algorithms that do not resample points.
• All algorithms that search for an extremum of a cost function perform exactly the same when averaged over all possible cost functions.
So, for any search/optimization algorithm, any elevated performance over one class of problems is exactly paid for in performance over another class. The no free lunch theorem implies that we must design our machine learning algorithms to perform well on a specific task.
• Regularization is any modification we make to a learning algorithm that is intended to reduce its generalization error but not its training error. Regularization is one of the central concerns of the field of machine learning, rivaled in its importance only by optimization.

1.7 Hyperparameters and Validation Sets
• Hyperparameters are parameters whose values control the learning process and determine the values of model parameters that a learning algorithm ends up learning.
• While designing a machine learning model, one always has multiple choices for the architectural design of the model. This creates uncertainty about which design to choose on the basis of its optimality, and so defining a good machine learning model always involves trials.
• The parameters that are used to define these machine learning models are known as hyperparameters, and the rigorous search for the values of these parameters that give an optimized model is known as hyperparameter tuning.
• Hyperparameters are not model parameters, which can be directly trained from data. Model parameters usually specify the way to transform the input into the required output, whereas hyperparameters define the actual structure of the model.
• If learned on the training set, such hyperparameters would always choose the maximum possible model capacity, resulting in overfitting. For example, we can always fit the training set better with a higher degree polynomial and a weight decay setting of λ = 0 than we could with a lower degree polynomial and a positive weight decay setting.
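The weight decay half of this example can be checked numerically. The sketch below assumes a one-parameter model y = w·x trained by minimising the squared error plus a penalty λ·w² (a simplification chosen for the example; the data and function names are hypothetical). The training error is lowest at λ = 0 and grows with λ, so choosing λ on the training set would always select λ = 0.

```python
def ridge_slope(xs, ys, lam):
    """Minimiser of sum((y - w*x)**2) + lam * w**2 for the model y = w*x."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)


def train_mse(xs, ys, w):
    """Mean-squared error of y = w*x on the training examples."""
    return sum((y - w * x) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Hypothetical training data
xs = [0.5, 1.0, 1.5, 2.0, 2.5]
ys = [1.1, 1.9, 3.2, 3.9, 5.1]

lambdas = (0.0, 0.1, 1.0, 10.0)
errors = [train_mse(xs, ys, ridge_slope(xs, ys, lam)) for lam in lambdas]
for lam, err in zip(lambdas, errors):
    print(f"lambda = {lam:5.1f}  training MSE = {err:.4f}")
```

The penalty shrinks `w` away from the unpenalised least-squares value, which can only raise the training error; whether it lowers the error on unseen data has to be judged on examples the training algorithm did not see.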
To solve this problem, we need a validation set of examples that the training algorithm does not observe.

Cross-Validation
• Cross-validation is a technique for evaluating and estimating performance by training several machine learning models on subsets of the available input data and evaluating them on the complementary subset of the data. Use cross-validation to detect overfitting, i.e., failing to generalize a pattern.
• In general, machine learning involves deriving models from data, with the aim of achieving some kind of desired behaviour, e.g., prediction or classification.
• Fig. 1.7.1 shows cross-validation.
Fig. 1.7.1 Cross validation [dataset split into training and testing sets (holdout method); with cross validation or, data permitting, into training, validation and testing sets]
• But this generic task is broken down into a number of special cases. When training is done, the data that was removed can be used to test the performance of the learned model on "new" data. This is the basic idea for a whole class of model evaluation methods called cross validation.
• Types of cross validation methods are holdout, K-fold and leave-one-out.
• The holdout method is the simplest kind of cross validation. The data set is separated into two sets, called the training set and the testing set. The function approximator fits a function using the training set only.
• K-fold cross validation is one way to improve over the holdout method. The data set is divided into k subsets and the holdout method is repeated k times.
• Leave-one-out cross validation is K-fold cross validation taken to its logical extreme, with K equal to N, the number of data points in the set.
K-fold cross-validation :
• Cross-validation ensures non-overlapping test sets.
• In this technique, k − 1 folds are used for training and the remaining one is used for testing as shown in Fig. 1.7.2.
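The K-fold procedure can be sketched as follows. This is a minimal illustration that splits the data into consecutive folds without shuffling and uses a deliberately trivial model (predict the mean of the training targets); the function names and data are hypothetical. Setting k equal to the number of examples gives leave-one-out cross validation.

```python
def k_fold_indices(n, k):
    """Split indices 0..n-1 into k consecutive folds.

    Yields (train_idx, test_idx); every index appears in exactly one test fold.
    """
    start = 0
    for fold in range(k):
        size = n // k + (1 if fold < n % k else 0)  # spread the remainder
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n))
        yield train_idx, test_idx
        start += size


def cross_val_mse(ys, k):
    """Average test MSE of a 'predict the training mean' model over k folds."""
    fold_errors = []
    for train_idx, test_idx in k_fold_indices(len(ys), k):
        mean = sum(ys[i] for i in train_idx) / len(train_idx)  # "training"
        fold_errors.append(
            sum((ys[i] - mean) ** 2 for i in test_idx) / len(test_idx))
    return sum(fold_errors) / k  # error rate = average over the k experiments


ys = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0]  # hypothetical targets
print("3-fold error       :", cross_val_mse(ys, 3))
print("leave-one-out error:", cross_val_mse(ys, len(ys)))
```

In practice the examples are usually shuffled (or stratified by class) before the folds are formed; that step is omitted here to keep the index logic visible.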
Fig. 1.7.2 K-fold cross validation [the total set of examples is split into folds; in each of Experiments 1 to 4 a different fold supplies the test examples]
• The advantage is that the entire data is used for training and testing. The error rate of the model is the average of the error rates of the iterations.
• This technique can also be called a form of the repeated hold-out method. The error rate could be improved by using the stratification technique.

1.8 Bias and Variance

Point Estimation
• Point estimation is the attempt to provide the single best prediction of some quantity of interest. Point estimation can also refer to the estimation of the relationship between input and target variables. We refer to these types of point estimates as function estimators.
• We have :
$\mathrm{error}_S(h) = \frac{r}{n}$, $\mathrm{error}_D(h) = p$
where n = Number of instances in the sample S, r = Number of instances from S misclassified by h, p = The probability of misclassifying a single instance drawn from D.
• Is $\mathrm{error}_S(h)$ an unbiased estimator for $\mathrm{error}_D(h)$ ? Yes, because for a Binomial distribution the expected value of r is equal to np. It follows, given that n is a constant, that the expected value of r/n is p.

Bias-Variance Estimator
• Bias and variance measure two different sources of error in an estimator. Bias measures the expected deviation from the true value of the function or parameter. Variance provides a measure of the deviation from the expected estimator value that any particular sampling of the data is likely to cause.

Bias Variance Trade-Off
• In experimental practice we observe an important phenomenon called the bias variance dilemma.
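For squared-error loss the dilemma is commonly written as a decomposition of the expected prediction error. The expression below assumes a data model $y = f(x) + \varepsilon$ with zero-mean noise of variance $\sigma^2$ and an estimator $\hat{f}$ trained on a randomly drawn training set; these symbols are introduced here only for illustration.

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible noise}}
```

The expectation is taken over training sets and noise. Only the first two terms depend on the choice of model, which is why reducing one of them at the expense of the other is a trade-off.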
• The prediction errors of a machine learning model can be of two types, errors due to 'bias' and errors due to 'variance'.
• Fig. 1.8.1 shows the bias-variance trade off.
• Given two classes of hypothesis to fit to some training data set, we observe that the more flexible hypothesis class has a low bias term but a higher variance term. If we have a parametric family of hypotheses, then we can increase the flexibility of the hypothesis, but we still observe the increase of variance.
Fig. 1.8.1 Bias-variance trade off [panels : low and high variance against low and high bias]
• The bias-variance dilemma is the problem of simultaneously minimizing two sources of error that prevent supervised learning algorithms from generalizing beyond their training set :
1. The bias is error from erroneous assumptions in the learning algorithm. High bias can cause an algorithm to miss the relevant relations between features and target outputs.
2. The variance is error from sensitivity to small fluctuations in the training set. High variance can cause overfitting : modeling the random noise in the training data, rather than the intended outputs.
• In order to reduce the model error, the designer can aim at reducing either the bias or the variance, as the noise component is irreducible.
• As the model increases in complexity, its bias is likely to diminish. However, as the number of training examples is kept fixed, the parametric identification of the model may strongly vary from one training set $D_N$ to another. This will increase the variance term.
• At one stage, the decrease in bias will be inferior to the increase in variance, warning that the model should not be too complex.
Conversely, to decrease the variance term, the designer has to simplify the model so that it is less sensitive to a specific training set. This simplification will lead to a higher bias.
Underfitting (High bias and low variance) :
• A statistical model or a machine learning algorithm is said to underfit when it cannot capture the underlying trend of the data. It usually happens when we have too little data to build an accurate model, and also when we try to fit a linear model to non-linear data.
Fig. 1.8.3 [Price against size for three fits : $\theta_0 + \theta_1 x$ (high bias, underfit); $\theta_0 + \theta_1 x + \theta_2 x^2$; $\theta_0 + \theta_1 x + \theta_2 x^2 + \theta_3 x^3 + \theta_4 x^4$ (high variance, overfit)]
• In such cases the rules of the machine learning model are too simple to describe the data, and therefore the model will probably make a lot of wrong predictions.
• Underfitting can be reduced by using more data together with a more expressive model or more informative features.
Overfitting (High variance and low bias) :
• A statistical model is said to be overfitted when it fits the training data, including its noise, too closely.
• When a model has enough freedom to fit every detail of its training data, it starts learning from the noise and inaccurate data entries in the data set.
• Then the model does not categorize new data correctly, because of too many details and noise.
• Typical causes of overfitting are non-parametric and non-linear methods, because these types of machine learning algorithms have more freedom in building the model from the dataset and can therefore build unrealistic models.
• A solution to avoid overfitting is to use a linear algorithm if we have linear data, or to constrain parameters such as the maximal depth if we are using decision trees.

1.9 Challenges Motivating Deep Learning
• The development of deep learning was motivated in part by the failure of traditional algorithms to generalize well on AI tasks such as recognizing speech or recognizing objects.
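One reason traditional methods struggle on such tasks is the way the input space grows with the number of dimensions, which the next subsection discusses. As a rough sketch under assumed settings (each axis cut into four intervals, 100 uniformly random points, a fixed seed), the code below counts the regions and the fraction of them that contain at least one data point; the function name and parameters are hypothetical.

```python
import random


def occupied_fraction(num_points, dims, bins_per_axis=4, seed=0):
    """Fraction of grid cells containing at least one uniformly random point."""
    rng = random.Random(seed)
    cells = set()
    for _ in range(num_points):
        # rng.random() lies in [0, 1), so each coordinate maps to a bin 0..bins-1
        cell = tuple(int(rng.random() * bins_per_axis) for _ in range(dims))
        cells.add(cell)
    return len(cells) / bins_per_axis ** dims


for d in (1, 2, 3, 6, 10):
    regions = 4 ** d  # 4, 16, 64, ... regions as the dimension grows
    print(f"{d:2d}D: {regions:8d} regions, "
          f"occupied fraction = {occupied_fraction(100, d):.6f}")
```

With the number of points held fixed, the occupied fraction can be at most 100 / 4^d, so it collapses quickly as d grows: the same data becomes sparse.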
The Curse of Dimensionality
• Many machine learning problems become exceedingly difficult when the number of dimensions in the data is high. This phenomenon is known as the curse of dimensionality. The curse of dimensionality refers to the phenomena that occur when classifying, organizing and analyzing high dimensional data and that do not occur in low dimensional spaces, specifically the issues of data sparsity and "closeness" of data.
• The volume of the space represented grows so quickly that the data cannot keep up and thus becomes sparse, as shown in Fig. 1.9.1. The sparsity issue is a major one for anyone whose goal has some statistical significance.
(a) 1D - 4 regions (b) 2D - 16 regions (c) 3D - 64 regions
Fig. 1.9.1 As the number of relevant dimensions of the data increases
• As the data space seen above moves from one dimension to two dimensions and finally to three dimensions, the given data fills less and less of the data space. In order to maintain an accurate representation of the space, the amount of data needed for analysis grows exponentially.
• The second issue that arises is related to sorting or classifying the data. In low dimensional spaces, data may seem very similar, but the higher the dimension, the further apart these data points may seem to be.

Local Constancy and Smoothness Regularization
• In order to generalize well, machine learning algorithms need to be guided by prior beliefs about what kind of function they should learn. Among the most widely used priors is the smoothness or local constancy prior.
• There are many different ways to implicitly or explicitly express a prior belief that the learned function should be smooth or locally constant.
All of these different methods are designed to encourage the learning process to learn a function $f^*$ that satisfies the condition $f^*(x) \approx f^*(x + \epsilon)$. If we know a good answer for an input x, then that answer is probably good in the neighborhood of x. If we have several good answers in some neighborhood, we would combine them to produce an answer that agrees with as many of them as possible.
• An extreme example of the local constancy approach is the k-nearest neighbors family of learning algorithms.
• The k-nearest neighbors algorithm copies the output from nearby training examples, whereas most kernel machines interpolate between training set outputs associated with nearby training examples.
• An important class of kernels is the family of local kernels, where k(u, v) is large when u = v and decreases as u and v grow farther apart from each other.
• A local kernel can be thought of as a similarity function that performs template matching, by measuring how closely a test example x resembles each training example $x^{(i)}$.
• Decision trees also suffer from the limitations of exclusively smoothness-based learning because they break the input space into as many regions as there are leaves and use a separate parameter in each region.

Manifold Learning
• Manifold learning is an approach to non-linear dimensionality reduction. Algorithms for this task are based on the idea that the dimensionality of many data sets is only artificially high.
• Manifold learning was introduced in the case of continuous-valued data and the unsupervised learning setting, although this probability concentration idea can be generalized to both discrete data and the supervised learning setting : The key assumption remains that probability mass is highly concentrated.
• High-dimensional datasets can be very difficult to visualize.
While data in two or three dimensions can be plotted to show the inherent structure of the data, equivalent high-dimensional plots are much less intuitive. To aid visualization of the structure of a dataset, the dimension must be reduced in some way.
• The simplest way to accomplish this dimensionality reduction is by taking a random projection of the data. Though this allows some degree of visualization of the data structure, the randomness of the choice leaves much to be desired. In a random projection, it is likely that the more interesting structure within the data will be lost.
• When the data lies on a low-dimensional manifold, it can be most natural for machine learning algorithms to represent the data in terms of coordinates on the manifold, rather than in terms of coordinates in $\mathbb{R}^n$.

1.10 Deep Networks
• The term "deep" usually refers to the number of hidden layers in the neural network.
• Deep learning is a subset of machine learning, which is predicated on the idea of learning from example. In machine learning, instead of teaching a computer a massive list of rules to solve the problem, we give it a model with which it can evaluate examples, and a small set of instructions to modify the model when it makes a mistake.
• The basic idea of deep learning is that repeated composition of functions can often reduce the requirements on the number of base functions (computational units) by a factor that is exponentially related to the number of layers in the network.
• Deep learning eliminates some of the data pre-processing that is typically involved with machine learning.
• Fig. 1.10.1 shows the relation between AI, ML and deep learning.
Fig. 1.10.1 Relation between AI, ML and Deep learning [regions labelled Artificial intelligence, Machine learning, Data science and Deep learning]
• For example, let's say that we had a set of photos of different pets, and we wanted to categorize them by "cat" and "dog".
Deep learning algorithms can determine which features (e.g. ears) are most important to distinguish each animal from another. In machine learning, this hierarchy of features is established manually by a human expert.
• In deep learning, a computer model learns to perform classification tasks directly from images, text or sound. Deep learning models can achieve state-of-the-art accuracy, sometimes exceeding human-level performance. Models are trained by using a large set of labeled data and neural network architectures that contain many layers.
• Deep learning classifies information through layers of neural networks, which have a set of inputs that receive raw data. For example, if a neural network is trained with images of birds, it can be used to recognize images of birds. More layers enable more precise results, such as distinguishing a crow from a raven as compared to distinguishing a crow from a chicken.
• Deep learning consists of the following methods and their variations :
a) Unsupervised learning systems such as Boltzmann machines for preliminary training, auto-encoders and generative adversarial networks.
b) Supervised learning such as convolutional neural networks, which brought the technology of pattern recognition to a new level.
c) Recurrent neural networks, allowing training on processes in time.
d) Recursive neural networks, allowing feedback between circuit elements and chains to be included.

Reasons for using Deep Learning
1. Analyzing unstructured data : Deep learning algorithms can be trained to look at text data by analyzing social media posts, news and surveys to provide valuable business and customer insights.
2. Data labelling : Deep learning requires labeled data for training. Once trained, it can label new data and identify different types of data on its own.
3. Feature engineering : A deep learning algorithm can save time because it does not require humans to extract features manually from raw data.
4. Efficiency : When a deep learning algorithm is properly trained, it can perform thousands of tasks over and over again, faster than humans.
5. Training : The neural networks used in deep learning have the ability to be applied to many different data types and applications. Additionally, a deep learning model can adapt by retraining it with new data.

Application of Deep Learning
1. Aerospace and defense : Deep learning is utilized extensively to help satellites identify specific objects or areas of interest and classify them as safe or unsafe for soldiers.
2. Financial services : Financial institutions regularly use predictive analytics to drive algorithmic trading of stocks, assess business risks for loan approvals, detect fraud and help manage credit and investment portfolios for clients.
3. Medical research : The medical research field uses deep learning extensively. For example, in ongoing cancer research, deep learning is used to detect the presence of cancer cells automatically.
4. Industrial automation : The heavy machinery sector is one that requires a large number of safety measures. Deep learning helps with the improvement of worker safety in such environments by detecting any person or object that comes within the unsafe radius of a heavy machine.
5. Facial recognition : This feature utilizing deep learning is being used not just for a range of security purposes but will soon enable purchases at stores. Facial recognition is already being extensively used in airports to enable seamless, paperless check-ins.

Sr. No. | Machine learning | Deep learning
1. | Machine learning uses algorithms to parse data, learn from that data and make informed decisions. | Deep learning structures algorithms in layers to create an artificial neural network.
2. | Machine learning gives lesser accuracy. | Deep learning gives more accuracy.
3. | Machine learning requires less time for training. | Deep learning requires more time for training.
4. | Needs accurately identified features by human intervention. | It can create new features.
5. | Machine learning models mostly require data in a structured form. | Deep learning models can work with both structured and unstructured data as they rely on the layers of the artificial neural network.
6. | Algorithms are directed by data analysts to examine specific variables in data sets. | Algorithms are largely self-directed on data analysis once they are put into production.
7. | Machine learning can work on low-end machines. | A deep learning model needs a huge amount of data to work efficiently, so it needs GPUs and hence a high-end machine.
[Figure : input, feature extraction and classification stages in machine learning and in deep learning]

Sr. No. | Machine Learning | Artificial Intelligence | Data Science
1. | Focuses on providing a means for algorithms and systems to learn from experience with data and use that experience to improve over time. | Focuses on giving machines cognitive and intellectual capabilities similar to those of humans. | Focuses on extracting information needles from data haystacks to aid in decision-making and planning.
2. | Machine learning uses statistical models. | Artificial intelligence uses logic and decision trees. | Data science deals with structured data.
3. | A form of analysis in which software programs learn about data and find patterns. | Development of computerized applications that simulate human intelligence and interaction. | The process of using advanced analytics to extract relevant information from data.
4. | Objective is to maximize accuracy. | Objective is to maximize the chance of success. | Objective is to extract actionable insights from the data.
5. | ML can be done through supervised, unsupervised or reinforcement learning approaches. | AI encompasses a range of intelligence concepts, including elements of perception and planning. | Data science draws on mathematics, analytics, machine learning and related methods to answer analytical questions.
6. | ML is concerned with knowledge accumulation. | AI is concerned with knowledge dissemination. | Data science is concerned with data engineering.

Artificial Intelligence | Machine Learning | Deep Learning
AI aims towards building machines that are capable of thinking like humans. | ML aims to learn through data to solve the problem. | DL aims to build neural networks that automatically discover patterns for feature detection.
… | ML is a subset of AI and data science. | DL is a subset of ML, AI and data science.
All systems of artificial intelligence fall into three categories : a) Artificial Narrow Intelligence b) Artificial General Intelligence c) Artificial Super Intelligence | ML algorithms can be broadly classified into three categories : a) Supervised learning b) Unsupervised learning c) Reinforcement learning | Deep learning architectures are as follows : a) Convolutional Neural Networks b) Recurrent Neural Networks c) Recursive Neural Networks
Making machines intelligent may or may not need high performance computers. | These algorithms can work easily on normal low performance computers without GPUs. | Algorithms are dependent on high performance hardware components like GPUs.

Advantages and Disadvantages of Deep Learning
Advantages of Deep Learning
• No need for feature engineering.
• DL solves the problem on an end-to-end basis.
• Deep learning gives more accuracy.
Disadvantages of Deep Learning
• DL needs high-performance hardware.
• DL needs much more time to train.
• It is very difficult to assess its performance in real world applications.
• It is very hard to understand.

1.11 Deep Feedforward Networks
• Deep feedforward networks are also called feedforward neural networks or multilayer perceptrons. These models are called feedforward because information flows through the function being evaluated from x, through the intermediate computations used to define f and finally to the output y. There are no feedback connections in which outputs of the model are fed back into itself.
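The flow from x through the intermediate computations to y can be sketched as a forward pass through two fully connected layers. The layer sizes, weights, biases and activation choices below are hypothetical values picked only for illustration; nothing is fed back, so the same input always produces the same output.

```python
import math


def dense(inputs, weights, biases, activation):
    """One fully connected layer: activation(W x + b), one row of W per unit."""
    return [activation(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(x):
    """x -> hidden layer -> output y, with no feedback connections."""
    # Hypothetical parameters: 2 inputs, 2 hidden units, 1 output unit
    hidden = dense(x, [[0.5, -0.4], [0.3, 0.8]], [0.1, -0.2], math.tanh)
    output = dense(hidden, [[1.2, -0.7]], [0.05], sigmoid)
    return output[0]


y = forward([0.6, 0.9])
print("output y:", y)
```

In a trained network the weights and biases would be learned from data rather than written by hand.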
• When feedforward neural networks are extended to include feedback connections, they are called recurrent neural networks.

Feed Forward Neural Network

• A feed forward neural network is an artificial neural network in which the connections between nodes do not form a cycle. The feed forward model is the simplest form of neural network, as information is only processed in one direction. While the data may pass through multiple hidden nodes, it always moves in one direction and never backwards.
• These networks are called feed forward because information only travels forward in the network (no loops): first through the input nodes, then through the hidden nodes (if present) and finally through the output nodes.
• Feed-forward networks tend to be simple networks that associate inputs with outputs. They can be used in pattern recognition. This type of organization is represented as bottom-up or top-down.
• Fig. 1.11.1 shows the basic structure of a Feed Forward (FF) neural network.

[Figure: input layer, hidden layer (hidden nodes) and output layer (output nodes)]
Fig. 1.11.1 Basic structure of a feed forward neural network

• Input layer: contains one or more input nodes. For example, suppose we want to predict whether it will rain tomorrow and base our decision on humidity and wind speed. In that case, our first input would be the value for humidity and the second input would be the value for wind speed.
• Hidden layer: this layer contains an activation function.
• Output layer: contains one or more output nodes.
• Feed forward neural networks are primarily used for supervised learning in cases where the data to be learned is neither sequential nor time-dependent.
• Feedforward networks have the following characteristics:
1. Perceptrons are arranged in layers, with the first layer taking in inputs and the last layer producing outputs. The middle layers have no connection with the external world, and hence are called hidden layers.
2. Each perceptron in one layer is connected to every perceptron in the next layer. Hence information is constantly "fed forward" from one layer to the next, which explains why these networks are called feed-forward networks.
3. There is no connection among perceptrons in the same layer.

Activation Function

• Activation functions, also known as transfer functions, map the input of a node to its output. The activation function is the key factor in a neural network that decides whether a neuron is activated and its output transferred to the next layer.
• Many activation functions (for example sigmoid and tanh) normalize the output to the range 0 to 1 or −1 to 1. Because they are differentiable, they also support backpropagation: during backpropagation the weights are updated to reduce the loss, and a differentiable activation function lets gradient descent move toward a local minimum.
• In any neural network, the activation function decides whether the received input is relevant or irrelevant. A multilayer network has greater representational power than a single-layer network only when the activation function introduces non-linearity.

XOR Function

• The XOR function is an operation on two binary values $x_1$ and $x_2$. When exactly one of these binary values is equal to 1, the XOR function returns 1. Otherwise, it returns 0.
• The XOR function provides the target function $y = f^*(x)$ that we want to learn. Our model provides a function $y = f(x; \theta)$ and our learning algorithm will adapt the parameters $\theta$ to make $f$ as similar as possible to $f^*$.
• Neural networks can be used to classify Boolean functions depending on their desired outputs.
• The XOR problem is not linearly separable.
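The statement that XOR is not linearly separable can be illustrated with a small search. The sketch below is not from the text: it assumes a single threshold neuron that fires when $w_1 x_1 + w_2 x_2 + b \ge 0$ and tries every combination of weights and bias on a hypothetical grid. It finds a setting that reproduces AND (a linearly separable function) but none that reproduces XOR. A grid search is only an illustration, not a proof.

```python
from itertools import product

# Truth tables: inputs (x1, x2) mapped to the target bit.
XOR = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}
AND = {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 1}


def threshold_unit(w1, w2, b, x1, x2):
    """Single threshold neuron: returns 1 when w1*x1 + w2*x2 + b >= 0, else 0."""
    return 1 if w1 * x1 + w2 * x2 + b >= 0 else 0


def find_single_unit(table, grid):
    """Return the first (w1, w2, b) on the grid that reproduces the table, or None."""
    for w1, w2, b in product(grid, repeat=3):
        if all(threshold_unit(w1, w2, b, x1, x2) == t
               for (x1, x2), t in table.items()):
            return (w1, w2, b)
    return None


# Hypothetical search grid (an assumption): -2.0, -1.5, ..., 2.0 per parameter.
GRID = [k / 2 for k in range(-4, 5)]

and_solution = find_single_unit(AND, GRID)  # some (w1, w2, b) exists
xor_solution = find_single_unit(XOR, GRID)  # no single unit works
print("AND:", and_solution)
print("XOR:", xor_solution)
```

The XOR search fails for any grid, because the four constraints contradict each other: $b < 0$, $w_1 + b \ge 0$ and $w_2 + b \ge 0$ together imply $w_1 + w_2 + b > 0$, while the input (1, 1) requires $w_1 + w_2 + b < 0$.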
• We cannot use a single-layer perceptron to construct a straight line that partitions the two-dimensional input space into two regions, each containing only data points of the same class.
• However, we may solve the XOR problem by using a single hidden layer with two neurons, as in Fig. 1.11.2.

[Figure: input layer, hidden layer (Neuron 1, Neuron 2) and output layer]
Fig. 1.11.2 Architectural graph of network for solving the XOR problem

• The top neuron, labeled "Neuron 1" in the hidden layer, is characterized as $w_{11} = w_{12} = +1$. The slope of the decision boundary constructed by this hidden neuron is equal to $-1$.
• The bottom neuron, labeled "Neuron 2" in the hidden layer, is characterized as $w_{21} = w_{22} = +1$.
• The output neuron, labeled "Neuron 3", is characterized as $w_{31} = -2$, with a positive weight $w_{32}$ from the bottom hidden neuron.
• The following assumptions are made here:
a) Each neuron is represented by a McCulloch-Pitts model, which uses a threshold function for its activation function.
b) Bits 0 and 1 are represented by the levels 0 and +1, respectively.
• The function of the output neuron is to construct a linear combination of the decision boundaries formed by the two hidden neurons. The bottom hidden neuron has an excitatory (positive) connection to the output neuron, whereas the top hidden neuron has an inhibitory (negative) connection to the output neuron.
• When both hidden neurons are off, which occurs when the input pattern is (0, 0), the output neuron remains off.
• When both hidden neurons are on, which occurs when the input pattern is (1, 1), the output neuron is switched off again, because the inhibitory effect of the larger negative weight connected to the top hidden neuron overpowers the excitatory effect of the positive weight connected to the bottom hidden neuron.
• When the top hidden neuron is off and the bottom hidden neuron is on, which occurs when the input pattern is (0, 1) or (1, 0), the output neuron is switched on because of the excitatory effect of the positive weight connected to the bottom hidden neuron.

1.12 Gradient-Based Learning

• Designing and training a neural network is not much different from training any other machine learning model with gradient descent. The design choices for gradient-based learning are as follows:
a) We must choose a cost function.
b) We must choose how to represent the output of the model.
We now visit these design considerations.
• Neural networks are usually trained by using iterative, gradient-based optimizers. Gradient-based learning draws on the fact that it is generally much easier to minimize a reasonably smooth, continuous function than a discrete function.
• The loss function can be minimized by estimating the impact of small variations of the parameter values on the loss function. Convex optimization converges starting from any initial parameters. Stochastic gradient descent applied to non-convex loss functions has no such convergence guarantee and is sensitive to the values of the initial parameters.
• For feedforward neural networks, it is important to initialize all weights to small random values. The biases may be initialized to zero or to small positive values. Iterative gradient-based optimization algorithms are used to train feedforward networks and almost all other deep models.

Cost Function

• An important aspect of the design of deep neural networks is the cost function. The cost functions for neural networks are similar to those for parametric models such as linear models.
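As a concrete illustration of iterative gradient-based optimization on a cost of the kind used for linear models, the sketch below fits a one-variable linear model by full-batch gradient descent on the mean squared error. Everything specific here is an assumption made for the example: the toy data, the learning rate, the number of steps and the random seed. Because this cost is convex, the iteration converges from the small random starting weight; a non-convex neural network cost carries no such guarantee.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical toy data generated from y = 2*x + 1 (no noise).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * x + 1.0 for x in xs]
n = len(xs)

# Initialization in the spirit of the text: small random weight, zero bias.
w = random.uniform(-0.01, 0.01)
b = 0.0

learning_rate = 0.05  # assumed step size


def mse(w, b):
    """Mean squared error cost of the linear model y_hat = w*x + b."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / n


initial_cost = mse(w, b)
for step in range(2000):
    # Gradients of the MSE cost with respect to w and b over the full batch.
    grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
    grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
    # Move a small step against the gradient.
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b
final_cost = mse(w, b)

print("cost before:", initial_cost, "cost after:", final_cost)
print("w:", w, "b:", b)
```

The learned parameters approach the values used to generate the data (w close to 2, b close to 1). Replacing the full-batch sums with sums over a random subset of the examples would turn this into stochastic gradient descent.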
• In most cases, the parametric model defines a distribution $p(y \mid x; \theta)$ and we simply use the principle of maximum likelihood.
• This means we use the cross-entropy between the training data and the model's predictions as the cost function. Most modern neural networks are trained using maximum likelihood.
• The cost function is given by
$$J(\theta) = -\mathbb{E}_{x, y \sim \hat{p}_{\text{data}}} \log p_{\text{model}}(y \mid x)$$
• This means the cost is simply the negative log-likelihood, equivalently described as the cross-entropy between the training set and the model distribution. The specific form of the cost function changes from model to model, depending on the form of $\log p_{\text{model}}$.
• Cost function with a Gaussian model: if $p_{\text{model}}(y \mid x) = \mathcal{N}(y; f(x; \theta), I)$, then using maximum likelihood the mean squared error cost is
$$J(\theta) = \frac{1}{2} \mathbb{E}_{x, y \sim \hat{p}_{\text{data}}} \lVert y - f(x; \theta) \rVert^2 + \text{const}$$
where "const" depends on the variance of the Gaussian.
• An advantage of deriving the cost from maximum likelihood is that it removes the burden of designing cost functions for each model.
• Desirable property of the gradient: the gradient of the cost function must be large and predictable enough to serve as a good guide to the learning algorithm.
• Cross-entropy and regularization: a property of the cross-entropy cost used for MLE is that it usually does not have a minimum value. For discrete output variables, most models cannot represent a probability of zero or one, but can come arbitrarily close. Logistic regression is an example.
• For real-valued output variables, it becomes possible to assign extremely high density to the correct training set outputs, e.g. by learning the variance parameter of a Gaussian output, and the resulting cross-entropy approaches negative infinity.
• Learning conditional statistics: instead of learning a full probability distribution, we often want to learn just one conditional statistic of $y$ given $x$.
• Learning a function: if we have a sufficiently powerful neural network, we can think of it as being powerful enough to represent any function $f$.
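Two of the claims about the negative log-likelihood cost in this section can be checked numerically. The sketch below assumes a univariate Gaussian and made-up targets and predictions. It shows that with unit variance the average negative log-likelihood equals half the mean squared error plus a constant, and that for a perfectly predicted target the negative log-likelihood keeps decreasing as the variance shrinks.

```python
import math

# Hypothetical targets and model predictions f(x; theta) for four examples.
targets = [1.0, 2.0, 0.5, -1.0]
predictions = [0.8, 2.5, 0.0, -1.2]
n = len(targets)


def gaussian_nll(y, mean, variance=1.0):
    """Negative log density of y under a univariate Gaussian N(y; mean, variance)."""
    return 0.5 * math.log(2 * math.pi * variance) + (y - mean) ** 2 / (2 * variance)


# Cost J(theta): average negative log-likelihood with unit variance.
nll = sum(gaussian_nll(y, m) for y, m in zip(targets, predictions)) / n
# Mean squared error of the same predictions.
mse = sum((y - m) ** 2 for y, m in zip(targets, predictions)) / n
# With unit variance the two differ by a factor 1/2 and an additive constant.
const = 0.5 * math.log(2 * math.pi)
print("NLL:", nll, "0.5 * MSE + const:", 0.5 * mse + const)

# A perfectly predicted target: the NLL decreases without bound as variance -> 0.
shrinking = [gaussian_nll(1.0, 1.0, v) for v in (1.0, 1e-2, 1e-4, 1e-8)]
print("NLL as variance shrinks:", shrinking)
```

The second part is why a cost based on the cross-entropy has no minimum value once the variance is a learnable parameter, which is the motivation for regularization given above.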
• The function $f$ is limited only by boundedness and continuity. From this point of view, the cost is a functional rather than a function: a functional is a mapping from functions to real numbers.
• We can think of learning as a task of choosing a function rather than a set of parameters. We can design our cost functional to have its minimum occur at a specific function we desire. For example, we can design the cost functional to have its minimum lie on the function that maps $x$ to the expected value of $y$ given $x$.
• Solving an optimization problem with respect to a function requires a mathematical tool called calculus of variations.
• Mean squared error and mean absolute error often lead to poor results when used with gradient-based optimization. Some output units that saturate produce very small gradients when combined with these cost functions. This is one reason the cross-entropy cost is more popular than mean squared error and mean absolute error.

Output Units

• The choice of cost function is tightly coupled with the choice of output unit. Most of the time, we use the cross-entropy between the data distribution and the model distribution. The choice of how to represent the output then determines the form of the cross-entropy function. In logistic regression, the output is binary-valued. Any kind of neural network unit that may be used as an output can also be used as a hidden unit.
• The role of the output layer is to provide some additional transformation from the features to complete the task that the network must perform. One simple kind of output unit is based on an affine transformation with no nonlinearity. These are often just called linear units.
• Types of output units:
1. Linear units for the mean of a Gaussian output
2. Sigmoid units for Bernoulli output distributions
3. Softmax units for Multinoulli output distributions
4. Other output types

1. Linear units for Gaussian output distributions

• Linear unit: a simple output based on an affine transformation with no nonlinearity. Given features $h$, a layer of linear output units produces a vector $\hat{y} = W^\top h + b$.
• Linear units are often used to produce the mean $\hat{y}$ of a conditional Gaussian distribution $p(y \mid x) = \mathcal{N}(y; \hat{y}, I)$.
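The affine transformation performed by a layer of linear output units can be written out directly. The sketch below is a minimal pure-Python version. The feature vector `h`, the weight matrix `W` and the bias `b` are made-up values chosen only to show the computation, and the function name `linear_output` is hypothetical.

```python
def linear_output(h, W, b):
    """Linear output layer: y_hat = W^T h + b.

    W has one row per feature and one column per output unit, so
    y_hat[j] = sum_i W[i][j] * h[i] + b[j]. No nonlinearity is applied.
    """
    return [sum(W[i][j] * h[i] for i in range(len(h))) + b[j]
            for j in range(len(b))]


# Hypothetical feature vector (3 features) and parameters for 2 output units.
h = [1.0, -2.0, 0.5]
W = [[0.2, -0.1],
     [0.4, 0.3],
     [-0.6, 0.8]]
b = [0.1, -0.2]

# y_hat is interpreted as the mean of the Gaussian N(y; y_hat, I).
y_hat = linear_output(h, W, b)
print(y_hat)
```

Because no squashing function is applied, the outputs are unbounded, which suits the mean of a Gaussian over real-valued targets.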
