3.1 MACHINE LEARNING

Machine Learning (ML) is an interesting and popular subfield of Artificial Intelligence (AI). The pivotal goal of machine learning is to understand the structure of data, so as to fit the data into models that can be leveraged, understood and deployed by people for various applications. Two famous definitions that draw interest are given below:

Arthur Samuel defined Machine Learning as "the field of study that gives computers the ability to learn without being explicitly programmed".

Tom Mitchell defined Machine Learning as follows: a computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.

Example of a Machine Learning problem: classification of spam mails
• Task (T): Classifying the unnecessary mails as spam
• Performance (P): Percentage of spam mails that are correctly sent to the spam folder
• Training / Experience (E): A dataset of mails in which the spam mails are labelled

3.1.1 Relation between AI and Machine Learning

Fig 3.1: Relation between ML, AI and Deep Learning

The domains of AI, ML and Deep Learning (DL) are very closely coupled. Primarily, AI focusses on non-biological systems that are more likely to exhibit human-like behaviour, while ML algorithms learn the models, trends, patterns and rules automatically from the underlying data representations. DL is an emerging technology whose algorithms parametrize multilayer neural networks to learn the data representation.

3.1.2 Traditional Programming vs Machine Learning

Fig 3.2: Difference between traditional programming and machine learning

Traditional Programming                            | Machine Learning
Programmer develops a program, executes it and     | Programmer allows the ML model to learn complex
gets the output.                                   | associations between the input and output; the
                                                   | learned program then processes the new data.

Fig 3.3: Types of Machine Learning

ML algorithms are broadly classified into:
• Supervised algorithms
• Unsupervised algorithms
• Semi-supervised algorithms
• Reinforcement learning algorithms

Supervised algorithms
• These are also known as predictive algorithms, and they are used to classify or predict the outcomes of new input with prior knowledge acquired from previous inputs.
• The learning process in supervised algorithms is guided by the inputs and their corresponding outputs.
• The mapping between the inputs and their outputs is the key for the algorithm to learn.
• The final ML model is developed after iterative learning of the training examples.
• The supervised algorithms can be further classified into:
  > Regression algorithms: The output variable of these algorithms is a real or continuous value. For instance, the prediction of the next day's temperature is done using a regression algorithm.
  > Classification algorithms: The output variable of these algorithms is a class or category. For instance, the prediction of the next day's overall weather is done using a classification algorithm, since the outcome may be one among {sunny, overcast, rainy, cloudy}.
• Examples: linear regression, Support Vector Machines (SVM), regression trees, logistic regression etc.

Unsupervised algorithms
• Learning in unsupervised algorithms happens without labelled data, that is, the training examples are presented to the algorithm without target or output.
• The algorithm learns the underlying patterns and similarities between the input data to discover the hidden knowledge. This process is referred to as knowledge discovery.
• The unsupervised algorithms are further categorized as:
  > Clustering algorithms: The knowledge discovery in these types of algorithms happens by uncovering the inherent similarities among the training data.
  > Association: These are algorithms that extract rules that can possibly describe large classes of data.
• Examples: Fuzzy logic, K-means clustering, K-Nearest Neighbours etc.

Semi-supervised algorithms
• The learning process in semi-supervised algorithms uses a small quantity of labelled data together with a large quantity of unlabelled data; the models are trained on this partially labelled data.
• Labelling data is expensive, and the cost incurred to produce a fully labelled dataset is the motivation for the use of semi-supervised algorithms.
• Semi-supervised algorithms work by clustering related examples using unsupervised methods and then using the available labelled examples to label the remaining unlabelled examples.

Reinforcement learning algorithms
• Reinforcement learning takes place in an environment in which agents are iteratively rewarded, using a feedback signal, based on their behaviour. This feedback is the central factor in guiding the agent to decide on the next step, and it helps the agent learn the environment.
• The outcome of the learning in these algorithms is an optimal policy that maximises the performance of the agent.
• Reinforcement learning is commonly used in the field of robotics.
• Examples: Adversarial networks and Q-learning.

3.2 LINEAR REGRESSION MODELS
• Linear regression is a supervised and statistical ML method that is used for predictive analysis. It uses the relationship between the data points to draw a straight line through them. Linear regression makes predictions for continuous, real or numeric variables such as sales, age, price, income etc.
• The linear regression algorithm models a linear relationship between a dependent (y) variable and one or more independent (x) variables.
• Regression finds how the value of the dependent variable changes according to the value of the independent variable.
• The model provides a sloped straight line representing the relationship between the variables.

The mathematical equation of simple linear regression is given below:

      Yi = β0 + β1 Xi + εi

Here Yi is the predicted output for the instance i and is the dependent or explained variable; β0 is the intercept of the line while β1 is the slope or scaling factor for the input; Xi is the independent variable (also called the explanatory variable, predictor or feature) that governs the learning process; and εi is the error component.

3.2.1 Least Squares Regression
• The least squares method is a statistical procedure to find the best fit for a set of data points by minimizing the sum of the offsets or residuals of the points from the plotted curve. It is used to predict the behaviour of the dependent variable.
• The least squares method represents the relationship between the data points of an x- and y-axis graph as a regression line that minimizes the vertical distance from the data points to the line.
• The term "least squares" indicates the smallest sum of squared errors, otherwise known as the variance.
• The least squares method is often applied in data fitting. The best-fit result is assumed to reduce the sum of squared errors or residuals, which are the differences between the observed or experimental values and the corresponding fitted values given by the model.
• There are two basic categories of least-squares problems:
  > Ordinary or linear least squares: the given data points are fitted with a linear model, and the sum of squared offsets of each point from the line is minimized.
  > Nonlinear least squares: an iterative method in which the residuals are reduced with each iteration until the fit converges.

Advantages
• The least-squares method of regression analysis is best suited for prediction models and trend analysis.
• It is best used in the fields of economics, finance and stock markets, wherein the value of any future variable is predicted with the help of existing variables and the relationship between them.
• The least-squares method provides the closest relationship between the variables.
• The difference between the sums of squares of residuals to the line of best fit is minimal under this method.
• The computation mechanism is simple and easy to apply.

Disadvantages
• This method relies on establishing the closest relationship between a given set of variables.
• The computation mechanism is sensitive to the data, and in case of any outliers the results may be affected severely.
• More exhaustive computation mechanisms are needed for non-linear problems.

Least Squares Algorithm:
1. For each (x, y) point calculate x² and xy.
2. Sum all x, y, x² and xy, which gives Σx, Σy, Σx² and Σxy.
3. Calculate the slope b:
      b = (Σxy − (Σx Σy)/n) / (Σx² − (Σx)²/n)
   where n is the number of points.
4. Calculate the intercept a:
      a = (Σy − b Σx) / n
5. Assemble the equation of the line:
      Y = bx + a

Example 1: The table below gives the number of hours of rainfall in Chennai and the number of French fries sold in a canteen on a week from Monday to Friday. Predict the number of French fries to be prepared on Saturday if a rainfall of 8 hours is expected.

      Hours of Rain (x) | No. of French Fries sold (y)
              2         |              4
              3         |              5
              5         |              7
              7         |             10
              9         |             15

Solution:
Find Σx, Σy, Σx² and Σxy:
      Σx = 26, Σy = 41, Σx² = 168, Σxy = 263, n = 5

Find the slope (b):
      b = (Σxy − (Σx Σy)/n) / (Σx² − (Σx)²/n)
        = (263 − (26 × 41)/5) / (168 − (26 × 26)/5) = 1.5182

Calculate the intercept (a):
      a = (Σy − b Σx)/n = (41 − (1.5182 × 26))/5 = 0.3049

Form the equation of the line:
      Y = 1.5182x + 0.3049

Computing the error:

      x | y  | Y = 1.5182x + 0.3049 | Error (y − Y)
      2 |  4 |        3.3413        |    0.6587
      3 |  5 |        4.8595        |    0.1405
      5 |  7 |        7.8959        |   −0.8959
      7 | 10 |       10.9323        |   −0.9323
      9 | 15 |       13.9687        |    1.0313

Visualizing the line of fit:

Fig 3.5: Line of fit (y = 1.5182x + 0.3049)

Number of French fries to be prepared if it rains for 8 hours on Saturday:
Substitute x = 8 in Y = 1.5182x + 0.3049, giving Y = 12.45.
So, approximately 13 French fries will be sold on Saturday.
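As a quick numerical check of the algorithm above, the following minimal Python/NumPy sketch recomputes the slope, intercept and Saturday prediction for the Example 1 data; the variable names are illustrative.

import numpy as np

x = np.array([2.0, 3.0, 5.0, 7.0, 9.0])    # hours of rain
y = np.array([4.0, 5.0, 7.0, 10.0, 15.0])  # French fries sold
n = len(x)

# Slope b = (n*Sxy - Sx*Sy) / (n*Sx^2 - (Sx)^2), intercept a = (Sy - b*Sx)/n
b = (n * np.sum(x * y) - np.sum(x) * np.sum(y)) / (n * np.sum(x ** 2) - np.sum(x) ** 2)
a = (np.sum(y) - b * np.sum(x)) / n

print(f"y = {b:.4f}x + {a:.4f}")                       # y = 1.5183x + 0.3049
print(f"prediction for 8 hours of rain: {b * 8 + a:.2f}")  # about 12.45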
3.2.2 Single and Multiple Variables
• Simple or single linear regression performs regression analysis with two variables; the single independent variable determines the slope of the regression line.
• Multiple regression is a broader class of regressions that encompasses linear and nonlinear regressions with multiple explanatory variables.
• Each independent variable in multiple regression has its own coefficient, to ensure each variable is weighted appropriately to establish complex connections between variables.
• Two main operations are done in multiple variable regression:
  i) Determine the dependent variable based on multiple independent variables.
  ii) Determine how strong the relationship is between each variable.
• Multiple regression assumes there is not a strong relationship between the independent variables themselves. It does assume there is a correlation between each independent variable and the single dependent variable; these relationships are weighted so that the more important independent variables drive the dependent value with a unique contribution.
• More complex relationships than simple linear regression can be exhibited through multiple variables: the outcome of a single dependent variable is predicted from several explanatory variables using the equation

      Y = b0 + b1x1 + b2x2 + ... + bnxn

  In the above equation b1, b2, ..., bn are the slopes for the individual variables.

Differences between single and multiple variable regression

Single variable regression                        | Multiple variable regression
The dependent variable Y is predicted from a      | The dependent variable Y is predicted from
single explanatory variable x.                    | multiple explanatory variables.
It has a single regression coefficient or slope.  | It has multiple regression coefficients.
Example: Predicting BMI from age.                 | Example: Predicting BMI from age, height, gender etc.

Assumptions of Linear Regression
Linear regression has some fundamental assumptions:
• Linearity: There must be a linear relationship between the dependent and independent variables.
• Homoscedasticity: The residuals must have a constant variance.
• Normality: The errors must be normally distributed.
• No multicollinearity: There must be no high correlation between the independent variables.

Fig 3.6: Linear vs multivariate regression — a) Simple linear regression: only temperature is used to predict the weather; b) Multivariate linear regression: three variables are used to predict the weather.

3.2.3 Bayesian Regression
• Unlike least squares regression, Bayesian regression uses probability distributions rather than point estimates. This method is useful when the available data is very scanty.
• Bayesian linear regression pushes the idea of the parameter prior a step further: it does not even attempt to compute a point estimate of the parameters, but instead takes the full posterior distribution over the parameters into account when making predictions.
• This means it does not fit a single set of parameters, but computes a mean over all plausible parameter settings (according to the posterior).

Fig 3.7: Schematics of Bayesian regression, where m0 is the mean, S0 is the variance and σ is the random noise.

• This allows a prior to be placed on the coefficients and on the noise, so that in the absence of data the priors can take over the regression process.
• Bayesian linear regression can also indicate which predictions are made with high confidence and which are not, since the output is assumed to be drawn from a probability distribution.
• The aim of Bayesian linear regression is to find the posterior probability of the model parameters given the inputs and outputs, as in the equation below:

      P(β | y, X) = P(y | β, X) P(β | X) / P(y | X)

  The term P(β | y, X) is the posterior probability distribution of the model parameters given the inputs and outputs. It is equal to the likelihood of the data, P(y | β, X), multiplied by the prior probability of the parameters, P(β | X), and divided by a normalization constant. This is the Bayesian inference rule:

      Posterior = (Likelihood × Prior) / Normalization

• Likelihood describes the probability of the target values given the data and the parameters.
• Prior describes the initial knowledge about which parameter values are likely and unlikely.
• Evidence or normalization describes the joint probability of the data and the targets.
• Posterior describes the probability of the parameters given the observed data and targets.
• The posterior distribution for the model parameters is proportional to the likelihood of the data multiplied by the prior probability of the parameters.
• Priors: These are included in the model based on domain knowledge, or on a rough guess of what the model parameters should be. This contrasts with the least squares approach, which commits to a single point estimate computed from the data alone.
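To make the posterior formula concrete, here is a minimal NumPy sketch of Bayesian linear regression for the conjugate Gaussian case (Gaussian prior on the coefficients, Gaussian noise). The noise variance and prior variance values are illustrative assumptions, and the Example 1 data is reused only for concreteness.

import numpy as np

x = np.array([2.0, 3.0, 5.0, 7.0, 9.0])
y = np.array([4.0, 5.0, 7.0, 10.0, 15.0])
X = np.column_stack([np.ones_like(x), x])   # design matrix with an intercept column

sigma2 = 1.0   # assumed noise variance
tau2 = 10.0    # assumed prior variance on the coefficients

# For this conjugate model the posterior over the coefficients is Gaussian;
# its covariance and mean follow directly from posterior ~ likelihood x prior.
post_cov = np.linalg.inv(X.T @ X / sigma2 + np.eye(2) / tau2)
post_mean = post_cov @ X.T @ y / sigma2

print("posterior mean of [intercept, slope]:", post_mean)
print("posterior standard deviations:", np.sqrt(np.diag(post_cov)))

# Predictions average over plausible parameter settings instead of using one
# point estimate; the predictive mean at x = 8 uses the posterior mean.
x_new = np.array([1.0, 8.0])
print("predictive mean at x = 8:", x_new @ post_mean)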
3.3 LINEAR DISCRIMINANT FUNCTIONS
• A linear discriminant function is a linear combination of the inputs, g(x) = wᵀx + w0, where w is the weight vector and w0 is the bias or threshold weight; g(x) is positive if x is on the positive side of the decision boundary and negative otherwise.
• The problem of finding a linear discriminant function is formulated as the problem of minimizing a criterion function.
• The obvious criterion function for classification purposes is the sample risk, or training error: the average loss incurred in classifying the set of training samples.
• It is difficult to derive the minimum-risk linear discriminant directly, hence it is essential to investigate several related criterion functions that are analytically more tractable.

Types of discriminant functions:

Two-class category: A two-category classifier implements the following decision rule: decide w1 if g(x) > 0 and w2 if g(x) < 0. If g(x) = 0, x can ordinarily be assigned to either class, or the decision can be left undefined. The equation g(x) = 0 defines the decision surface that separates points assigned to w1 from points assigned to w2. When g(x) is linear, this decision surface is a hyperplane. If x1 and x2 are both on the decision surface, then wᵀx1 + w0 = wᵀx2 + w0, which shows that w is normal to any vector lying in the hyperplane.

Fig 3.13 a): Two-class discriminant function

In general, the hyperplane H divides the feature space into two half-spaces: decision region R1 for w1 and region R2 for w2. Because g(x) > 0 if x is in R1, it follows that the normal vector w points into R1; it is sometimes said that any x in R1 is on the positive side of H.

Multi-class category: There is more than one way to devise multicategory classifiers employing linear discriminant functions, for example by reducing the problem to two-class problems — one discriminant per class, or one for every pair of classes. A more feasible approach is to define c linear discriminant functions

      gi(x) = wiᵀx + wi0,   i = 1, ..., c

and assign x to class wi if gi(x) > gj(x) for all j ≠ i; in the case of ties the classification is left undefined. The resulting classifier is a linear machine that divides the feature space into c decision regions, with gi(x) being the largest discriminant if x is in region Ri. If Ri and Rj are contiguous, the boundary between them is a portion of the hyperplane Hij defined by

      gi(x) = gj(x),  i.e.  (wi − wj)ᵀx + (wi0 − wj0) = 0

Fig 3.13 b): Multi-class category showing the different hyperplanes
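A minimal sketch of the two decision rules above is given below; the weight vectors and bias terms are illustrative values rather than parameters learned from data.

import numpy as np

def g(x, w, w0):
    """Linear discriminant g(x) = w^T x + w0."""
    return np.dot(w, x) + w0

# Two-class rule: decide w1 if g(x) > 0, w2 if g(x) < 0.
w, w0 = np.array([1.0, -2.0]), 0.5
x = np.array([3.0, 1.0])
print("class:", "w1" if g(x, w, w0) > 0 else "w2")

# Multi-class linear machine: assign x to the class with the largest gi(x).
W = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])  # one weight vector per class
W0 = np.array([0.0, -0.5, 1.0])                        # one bias per class
scores = W @ x + W0
print("assigned class index:", int(np.argmax(scores)))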
3.4 PROBABILISTIC DISCRIMINANT FUNCTIONS

Probabilistic discriminant functions are discriminant functions with the ability to handle more complex data. They are widely used for recognition, similarity checking, feature extraction and verification, and are more flexible in finding the optimal direction of data projections.
• The Probabilistic Linear Discriminant Analysis (PLDA) function is a probabilistic version of the linear discriminant. In a Gaussian model each class is described by its center, and only a finite set of classes is seen during training; a prior distribution over the class centers is therefore needed for handling new classes.
• A robust model should learn this prior (which models the differences between classes) along with the common variance of the class-conditional distributions (which models the differences between examples of the same class).
• PLDA is a more principled method of combining different features, so that the more discriminative features have more impact on recognition.
• This is very useful in "one-shot learning", where a single example of a previously unseen class can be used to build a model of that class. Multiple examples can be combined to obtain a better representation of the class.

Let x = {x1, x2, ..., xn} be the D-dimensional observations or data samples. PLDA assumes that the given data samples are generated from a distribution, and works by finding the parameters of the model that best describe the training data. The choice of the distribution from which the data is assumed to be generated is based on two factors:
  > It should represent different types of data.
  > Computation of its parameters should be simple and fast.

      P(x | y) = N(x | y, Φw)

Here y is a latent (hidden) class variable which represents the mean of the class. From the class variable, the probability of generating a data sample x can be computed; Φw represents the within-class covariance. This implies that once the class parameters of the Gaussian are known, samples of that class can be generated.

The class variable y is itself assumed to be generated from a prior distribution; the probability of generating a particular class center from the assumed distribution is

      P(y) = N(y | m, Φb)

with mean m and between-class covariance Φb. Since y takes continuous values, a latent variable y can be generated for each class, including previously unseen classes.

Fig 3.14: Probabilistic linear discriminant

Advantages of PLDA
• Generate a class center using a continuous non-linear function, even from a single example of an unseen class.
• Compare two examples from previously unseen class(es) to determine whether they belong to the same class.
• Perform clustering of samples from unseen classes.
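The two-stage Gaussian model above can be illustrated by sampling from it. In the short sketch below, a class centre y is drawn from N(m, Φb) and examples of that class are drawn from N(y, Φw); all parameter values are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
D = 2
m = np.zeros(D)                 # overall mean
Phi_b = np.diag([4.0, 4.0])     # between-class covariance (assumed)
Phi_w = np.diag([0.5, 0.5])     # within-class covariance (assumed)

# Generate a previously unseen class centre, then a few samples of that class.
y = rng.multivariate_normal(m, Phi_b)
samples = rng.multivariate_normal(y, Phi_w, size=5)

print("class centre y:", y)
print("five examples generated from this class:\n", samples)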
Logistic Regression
• Logistic regression estimates the probability of an event occurring based on a given dataset of independent variables. Since the outcome is a probability, the dependent variable is bounded between 0 and 1.
• A logit transformation is applied to the odds — that is, the probability of success divided by the probability of failure. This is called the log of odds, or the odds ratio:

      Odds ratio (s) = p / (1 − p)

• The equation of linear regression gives y = β0 + β1x1 + β2x2 + ... + βkxk. The sigmoid function is used in logistic regression to find the class, and is given as

      p = 1 / (1 + e^(−y))

• Substituting the linear regression equation into the sigmoid function:

      p = 1 / (1 + e^(−(β0 + β1x1 + β2x2 + ... + βkxk)))

• Replacing p in the odds ratio:

      s = p / (1 − p) = e^(β0 + β1x1 + β2x2 + ... + βkxk)

• Taking the log on both sides:

      ln(s) = β0 + β1x1 + β2x2 + β3x3 + ... + βkxk

  This has the same form as linear regression. Data is fit into a linear regression model, which is then acted upon by a logistic function predicting the target categorical dependent variable.
• To predict which class a data point belongs to, a threshold can be set. Based upon this threshold, the obtained estimated probability is classified into classes. This is called the decision boundary. A decision boundary can be linear or non-linear; the polynomial order can be increased to obtain a more complex decision boundary.
• In the logistic regression equation, logit(p) is the dependent or response variable and x is the independent variable.
• The beta parameter, or coefficient, in this model is estimated via Maximum Likelihood Estimation (MLE).
• This method tests different values of beta through multiple iterations to optimize for the best fit of log odds. These iterations produce the log-likelihood function, and logistic regression seeks to maximize this function to find the best parameter estimates. Once the optimal coefficients are found, the conditional probabilities for each observation can be calculated, logged, and summed to yield a predicted probability.
• For binary classification, a probability less than 0.5 predicts 0, while a probability greater than 0.5 predicts 1. After the model has been computed, it is best practice to evaluate how well the model predicts the dependent variable, which is termed goodness of fit.

Types of Logistic Regression:
• Binary Logistic Regression: The response has only two possible outcomes. Example: cancer or non-cancer.
• Multinomial Logistic Regression: Three or more categories without ordering. Example: predicting which cuisine a food item belongs to — South Indian, North Indian, Continental etc.
• Ordinal Logistic Regression: Three or more categories with ordering. Example: grading a student's performance — pass, average, distinction.

Fig 3.15: Sigmoid curve that maps the logit value to the range 0 to 1

Differences between linear regression and logistic regression

Linear Regression                                   | Logistic Regression
Predicts a continuous dependent target from the     | Predicts a categorical dependent target from the
independent input variables.                        | independent input variables.
The output will be a discrete or continuous number. | The output will be a binary value.
Least squares estimation is used for assessing      | Maximum Likelihood Estimation is used for
the accuracy.                                       | assessing the accuracy.
The best fit is a straight line.                    | The best fit is an S-shaped (sigmoid) curve.

Example 2: Find the class for the regression equation derived in Example 1, for x = 5.
Regression equation from Example 1: Y = 1.5182x + 0.3049.
Passing the resulting value through the sigmoid function gives an estimated probability of about 0.386. Using the threshold value of 0.5, the given example therefore belongs to class 0, with a probability of 38%.
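A minimal sketch of the sigmoid mapping and the 0.5 decision threshold is shown below; the coefficients β0 and β1 are assumed illustrative values, not coefficients fitted by MLE.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

beta0, beta1 = -4.0, 0.8          # assumed coefficients of the linear part
x = np.array([2.0, 5.0, 8.0])

p = sigmoid(beta0 + beta1 * x)            # estimated probability of class 1
predicted_class = (p >= 0.5).astype(int)  # decision boundary at p = 0.5

for xi, pi, ci in zip(x, p, predicted_class):
    print(f"x={xi}: p={pi:.3f} -> class {ci}")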
3.5 PROBABILISTIC GENERATIVE MODEL

A set of multivariate data, D, is explained in terms of a set of underlying causes, a. The factors may instantiate highly non-linear interactions among the causes, or between the causes and the data. These models can generate new data instances and are widely used in unsupervised tasks such as probability and likelihood estimation, modelling data points, and distinguishing between classes using Bayes' theorem to find the joint probability.

Fig 3.16: Generative model

The mode of perception: inferring the best set of causes to explain a given piece of data Di involves computing the posterior over a, or computing its mean. The denominator is dropped here since it is constant for a given Di. All quantities are conditioned on the model M, which specifies the overall architecture within which the causes a are defined:

      â = argmax_a P(a | Di, M) = argmax_a P(Di | a, M) P(a | M)

Learning or adaptation: The model M specifies the set of potential causes, their prior probabilities, and the generative process by which they give rise to the data. Learning the model that best accounts for all the data is accomplished by maximizing the posterior distribution over the models, in accordance with Bayes' rule:

      P(M | D) ∝ P(D | M) P(M)

The total probability of all the data under the model:

      P(D | M) = P(D1 | M) × P(D2 | M) × ... × P(Dn | M) = Πi P(Di | M)

The probability of an individual data item is obtained by summing over all the possible causes for the data:

      P(Di | M) = Σa P(Di | a, M) P(a | M)

Differences between discriminative and generative models

Discriminative models                        | Generative models
Discriminative models draw boundaries in     | Generative models describe how data is placed
the data space.                              | throughout the space.
Measured by the misclassification cost.      | Measured by the likelihood function.
The models are generally robust to outliers. | Outliers affect the model performance.
These models take more time to train.        | The conditional independence and stronger
                                             | assumptions of these models demand less
                                             | training time.
Missing data can be handled only with        | They can generalize well over missing data.
extensive training.                          |
Relationships among the variables are        | Better explanatory power.
not clear.                                   |

Naive Bayes

The Naive Bayes classification algorithm is a probabilistic classifier based on probability models that incorporate strong independence assumptions. These independence assumptions often do not reflect reality, hence they are considered naive.

Bayes' theorem is used for calculating conditional probabilities. Conditional probability is a measure of the probability of an event occurring given that another event has occurred. It describes how often A happens given that B happens — the posterior probability P(A|B) — when we know how often B happens given that A happens, P(B|A), how likely A is on its own, P(A), and how likely B is on its own, P(B):

      P(A | B) = P(B | A) P(A) / P(B)

The fundamental assumption in Naive Bayes is that each feature makes an independent and equal contribution to the outcome; all features must be treated equally.

Rewriting the theorem in terms of the features of a machine learning dataset, where X = (x1, x2, ..., xn) are the independent features and y is the predicted class:

      P(y | X) = P(X | y) P(y) / P(X)

By the chain rule of probability and the independence assumption, the likelihood factorizes into a product of individual probabilities:

      P(y | x1, ..., xn) = P(y) Πi P(xi | y) / (P(x1) P(x2) ... P(xn))

As the denominator remains the same for every class, it can be eliminated:

      P(y | x1, ..., xn) ∝ P(y) Πi P(xi | y)

To find the class with maximum probability, argmax is used:

      y = argmax_y P(y) Πi P(xi | y)

Types of Naive Bayes classifier:
• Multinomial Naive Bayes: The features/predictors used by the classifier label the instances in discrete categories.
• Bernoulli Naive Bayes: The predictors used here are boolean variables; the parameters take only yes or no values.
• Gaussian Naive Bayes: When the predictors take a continuous value and are not discrete, they are assumed to be sampled from a Gaussian distribution.

Advantages of Naive Bayes classifier:
• It is a fast and easy ML algorithm for predicting the class of a dataset.
• It can be used for binary as well as multi-class classification.
• It performs better in multi-class predictions.

Disadvantages of Naive Bayes classifier:
• Naive Bayes assumes that all features are independent or unrelated, so it cannot learn the relationship between features.
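A minimal sketch of the decision rule y = argmax_y P(y) Πi P(xi | y) is given below, using the (rounded) class priors and conditional probabilities that are worked out in Example 3 which follows.

priors = {"yes": 0.64, "no": 0.36}
likelihoods = {
    "yes": {"age<=30": 0.22, "income=medium": 0.44, "student=yes": 0.67, "credit=fair": 0.67},
    "no":  {"age<=30": 0.60, "income=medium": 0.40, "student=yes": 0.20, "credit=fair": 0.40},
}

evidence = ["age<=30", "income=medium", "student=yes", "credit=fair"]

scores = {}
for cls, prior in priors.items():
    score = prior
    for feature_value in evidence:
        score *= likelihoods[cls][feature_value]   # independence assumption
    scores[cls] = score

print(scores)                                # roughly 0.028 for "yes" vs 0.007 for "no"
print("predicted:", max(scores, key=scores.get))   # predicted: yes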
Example 3: Given the training data in the table below, find the result of Naive Bayes classification for the example: age ≤ 30, income = medium, student = yes, credit_rating = fair.

RID | Age      | Income | Student | Credit_rating | buys_computer
  1 | <=30     | High   | No      | Fair          | No
  2 | <=30     | High   | No      | Excellent     | No
  3 | 31 to 40 | High   | No      | Fair          | Yes
  4 | >40      | Medium | No      | Fair          | Yes
  5 | >40      | Low    | Yes     | Fair          | Yes
  6 | >40      | Low    | Yes     | Excellent     | No
  7 | 31 to 40 | Low    | Yes     | Excellent     | Yes
  8 | <=30     | Medium | No      | Fair          | No
  9 | <=30     | Low    | Yes     | Fair          | Yes
 10 | >40      | Medium | Yes     | Fair          | Yes
 11 | <=30     | Medium | Yes     | Excellent     | Yes
 12 | 31 to 40 | Medium | No      | Excellent     | Yes
 13 | 31 to 40 | High   | Yes     | Fair          | Yes
 14 | >40      | Medium | No      | Excellent     | No

E = (age ≤ 30, income = medium, student = yes, credit_rating = fair), where E1 is age ≤ 30, E2 is income = medium, E3 is student = yes and E4 is credit_rating = fair. We need to compute P(yes|E) and P(no|E) and compare them.

      P(yes | E) = P(E1|yes) P(E2|yes) P(E3|yes) P(E4|yes) P(yes) / P(E)

P(yes) = 9/14 = 0.643               P(no) = 5/14 = 0.357
P(E1|yes) = 2/9 = 0.222             P(E1|no) = 3/5 = 0.600
P(E2|yes) = 4/9 = 0.444             P(E2|no) = 2/5 = 0.400
P(E3|yes) = 6/9 = 0.667             P(E3|no) = 1/5 = 0.200
P(E4|yes) = 6/9 = 0.667             P(E4|no) = 2/5 = 0.400

P(yes | E) = (0.222 × 0.444 × 0.667 × 0.667 × 0.643) / P(E) = 0.028 / P(E)
P(no | E)  = (0.600 × 0.400 × 0.200 × 0.400 × 0.357) / P(E) = 0.007 / P(E)

Since 0.028 > 0.007, the Naive Bayes classifier predicts buys_computer = yes.

3.6 MAXIMUM MARGIN CLASSIFIER

For two linearly separable classes with a binary response variable, the data define a p-dimensional hyperplane that separates them. An expression for estimating the hyperplane is

      f(X) = β0 + β1X1 + β2X2 + ... + βpXp

The hyperplane has the property that the observations of each class fall on opposite sides of it. If β is constrained to be a unit vector, ||β|| = Σj βj² = 1, then f(X) gives the positive or negative perpendicular distance of a point from the hyperplane. The margin is the smallest of the perpendicular distances between the decision boundary and the closest of the data points; it is defined as the hyperplane margin M such that

      yi (xiᵀβ + β0) ≥ M   for every training observation i

Maximizing the margin leads to a particular choice of decision boundary. The maximal margin classifier is the hyperplane with the maximum margin:

      max M   subject to   ||β|| = 1

Fig 3.17: Maximum margin classifier — the support vectors determine the maximum margin decision hyperplane.
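The maximal margin idea can be sketched with scikit-learn's linear SVC on a small linearly separable toy set; a very large value of the cost parameter C approximates the hard maximal margin classifier, and the toy points are illustrative.

import numpy as np
from sklearn.svm import SVC

X = np.array([[1, 1], [2, 1], [1, 2],     # class -1
              [4, 4], [5, 4], [4, 5]])    # class +1
y = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)   # very large C ~ hard margin

w = clf.coef_[0]
b = clf.intercept_[0]
print("hyperplane: %.2f*x1 + %.2f*x2 + %.2f = 0" % (w[0], w[1], b))
print("support vectors:\n", clf.support_vectors_)
print("margin width 2/||w|| =", 2 / np.linalg.norm(w))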
3.6.1 Support Vector Machines

The Support Vector Classifier is an extension of the maximal margin classifier which is less sensitive to individual data points. It allows a few data points to be misclassified, and is hence called a Soft Margin Classifier.

In Support Vector Machines (SVM), a margin passes perpendicularly through the nearest points from each class to the hyperplane; these nearest points are called Support Vectors. The main goal of SVM is to find a hyperplane in an N-dimensional space, where N indicates the number of features, that distinctly classifies the data points.
• The SVM classifier constructs a hyperplane in an N-dimensional space that divides the data points belonging to different classes. Data points falling on either side of the hyperplane can be attributed to different classes.
• This hyperplane is chosen based on the margin, as the hyperplane providing the maximum margin between the two classes is selected. These margins are calculated using the data points called Support Vectors.
• Support vectors are near to the hyperplane and help in orienting it; they influence the position and orientation of the hyperplane, and using them the margin of the classifier is maximized.
• Multiple hyperplanes exist that separate the two classes of data points; the objective is to find the plane with maximum margin. In other words, the distance between the data points of both classes and the hyperplane should be maximum.
• The dimension of the hyperplane depends on the number of input features. If the number of input features is 2, the hyperplane is just a line; if the number of input features is 3, the hyperplane becomes a two-dimensional plane. It becomes difficult to imagine when the number of features exceeds 3.

Fig 3.18: Support vectors and maximum margin

The hyperplane H dividing the classes is given by

      H: wᵀx + b = 0

Here b is the bias and x is a data point. The distance between a point (x0, y0) and a line ax + by + c = 0 is

      d = |ax0 + by0 + c| / √(a² + b²)

Similarly, the distance of a data point x0 from the hyperplane can be written as

      dH(x0) = |wᵀx0 + b| / ||w||

where the denominator is the Euclidean norm of the weight vector. For a labelled point (x, y) with y in {+1, −1}, the quantity y(wᵀx + b) is ≥ 0 if the prediction is correct and < 0 if it is incorrect.

Steps in the SVM classifier:
1. The SVM algorithm predicts the classes; one of the classes is identified as +1 (positive class) and the other as −1 (negative class).
2. In the SVM classifier, the hinge loss function is used to find the maximum margin: the classifier predicts the positive class when wᵀx ≥ 0 and the negative class otherwise, and if all training points are correctly predicted the cost function is 0.
3. The problem with this is that there is a trade-off between maximizing the margin and the loss generated if the margin is maximized to a very large extent. A regularization parameter is therefore added to the loss function.
4. The weights are optimized by calculating the gradients. The gradients are updated using only the regularization parameter when there is no error in the classification, while the loss function is also used when misclassification happens.

Fig 3.19: Hinge loss in SVM

Advantages of Support Vector Machines:
• SVM can ignore outliers and find the hyperplane that has the maximum margin.
• SVM works well when there is a clear margin of separation between classes.
• SVM is more effective in high-dimensional spaces.
• SVM is memory efficient.

Disadvantages of Support Vector Machines:
• The SVM algorithm is not suitable for large data sets.
• SVM does not perform very well when the data set has more noise, i.e. when the target classes are overlapping.
• As the support vector classifier works by putting data points above and below the classifying hyperplane, there is no probabilistic explanation for the classification.

Kernel-based SVM

Classifying the data points using hyperplanes is not always possible, as the data may not be linearly separable in 2D, or no hyperplane may exist to separate them. In these cases kernel-based SVM is used.

Steps in classification using kernels:
• Map the lower-dimensional set of data points, using a mapping function, to a higher dimension where they are separable.
• Fit a line or hyperplane as required to separate those points.
• Project the same data points back to the lower dimension.

Advantages of kernel Support Vector Machines:
• The kernel trick is the real strength of SVM. With an appropriate kernel function, any complex problem can be solved.
• It scales relatively well to high-dimensional data.
• The risk of overfitting is less.

Disadvantages of kernel Support Vector Machines:
• Choosing a good kernel function is not easy.
• Longer training time for large datasets.
• It is difficult to understand and interpret the final model.
• It is highly computationally intensive.
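A short sketch of the kernel trick in practice is shown below: two concentric circles are not linearly separable in 2D, but an RBF-kernel SVM separates them. The dataset and the gamma value are illustrative choices.

from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

linear_svm = SVC(kernel="linear").fit(X, y)
rbf_svm = SVC(kernel="rbf", gamma=2.0).fit(X, y)   # gamma chosen for illustration

print("linear kernel training accuracy:", linear_svm.score(X, y))  # poor
print("RBF kernel training accuracy:  ", rbf_svm.score(X, y))      # close to 1.0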
Decision Trees
• A decision tree is a hierarchical data structure that is implemented through a divide-and-conquer strategy. It is a nonparametric method which can be used for both classification and regression.
• A decision tree is a hierarchical model for supervised learning whereby the local region is identified in a sequence of recursive splits in a small number of steps.
• A decision tree is composed of internal decision nodes and terminal leaves. Each decision node m implements a test function fm(x) with discrete outcomes labeling the branches.
• Given an input, at each node a test is applied and one of the branches is taken depending on the outcome. This process starts at the root and is repeated recursively until a leaf node is hit, at which point the value written in the leaf constitutes the output.
• A decision tree does not assume any parametric form for the class densities, and the tree structure is not fixed a priori. As the tree grows, branches and leaves are added during the learning process, depending on the complexity of the problem inherent in the data.
• Each discriminant function fm(x) defines a discriminant in the d-dimensional input space, dividing it into smaller regions that are subdivided as the tree is spawned from the root down.
• Each leaf node is associated with an output label; in the case of regression, the leaves are labelled with numeric values.
• Each leaf node defines a localized region in the input space in which the instances falling in that region have the same label (in classification) or very similar numeric outputs (in regression).

Univariate trees
• In a univariate tree, the test in each internal decision node uses only one of the input dimensions. If the used input dimension xj is discrete, taking one of n possible values, the decision node checks the value of xj and takes the corresponding branch; such a decision node has n discrete branches.
• If xj is numeric (ordered), the test is a comparison

      fm(x): xj > wm0

  where wm0 is the threshold value. This value divides the input space into two: Lm = {x | xj > wm0} and Rm = {x | xj ≤ wm0}. This is called a binary split. Successive splits further divide these regions using other attributes, with splits orthogonal to each other, so the leaf nodes define hyperrectangles in the input space.

Fig 3.20: Structure of a decision tree

Tree induction is the construction of the tree given a training sample. For a given training set there exist many trees that code it with no error; tree size is measured as the number of nodes in the tree and the complexity of the decision nodes. Finding the smallest tree is NP-complete.

Classification Trees
• In classification trees, the goodness of a split is quantified by an impurity measure. A split is pure if, after the split, for all branches, all the instances choosing a branch belong to the same class.
• For a node m, let Nm be the number of training instances reaching node m (for the root node it is N), and let N^i_m of the Nm instances belong to class Ci, with Σi N^i_m = Nm. Given that an instance reaches node m, the estimate for the probability of class Ci is

      p^i_m = N^i_m / Nm

• Node m is pure if p^i_m for all i are either 0 or 1: p^i_m is 0 when none of the instances reaching node m are of class Ci, and 1 if all such instances are of Ci.

Impurity measures in classification trees:

Entropy: This is a common impurity measure, given by the following expression:

      I_m = − Σi p^i_m log2 p^i_m

If node m is not pure, then the instances should be split to decrease impurity, and there are multiple possible attributes on which the split can be made. For a numeric attribute, multiple split positions are possible. Among all of them, the tree looks for the split that minimizes impurity after the split, because it is important to generate the smallest tree: if the subsets after the split are closer to pure, fewer splits (if any) will be needed afterward. This process is locally optimal, and there is no guarantee of finding the smallest decision tree.

Gini index:

      Gini(t) = 1 − Σj Pj²

Here j runs over the classes in the label, and Pj represents the proportion of class j at node t. The Gini index computes the degree of probability of a particular instance being wrongly classified when chosen at random. The value of the Gini index varies between 0 and 1, where 0 signifies that all the elements belong to a single class and 1 signifies that the elements are randomly distributed across the various classes.

Regression Trees
• A regression tree is constructed in a similar manner to a classification tree, except that the impurity measure appropriate for classification is replaced by a measure appropriate for regression. In regression, the goodness of a split is measured by the mean square error from the estimated value. Let gm be the estimated value in node m:

      Em = (1/Nm) Σt (r^t − gm)² bm(x^t)

  where bm(x^t) is 1 if the instance x^t reaches node m and 0 otherwise.
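The two impurity measures can be evaluated with a few lines of Python; the sketch below computes them for a node holding 9 instances of one class and 5 of another, which is the class distribution of the buys_computer data used in the next example.

import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                      # treat 0*log(0) as 0
    return -np.sum(p * np.log2(p))

def gini(p):
    p = np.asarray(p, dtype=float)
    return 1.0 - np.sum(p ** 2)

p = [9 / 14, 5 / 14]
print("entropy:", round(entropy(p), 3))   # about 0.940
print("gini   :", round(gini(p), 3))      # about 0.459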
Pruning
• In a decision tree, a node is not split further if the number of training instances reaching it is smaller than a certain percentage of the training set.
• Any decision based on too few instances causes variance and induces generalization error.
• The process of stopping tree construction early, before the tree is full, is called prepruning.
• In postpruning, a decision tree is generated and grown further without backtracking: trees are grown until all leaves are pure and there is no training error, and then the subtrees that cause overfitting are found and pruned.

Rule extraction from trees
• A decision tree does its own feature extraction: after the tree is built, certain features may not be used at all, and features closer to the root are more important. Feature extraction can be done by building a tree and then taking only the features used by the tree as inputs to another learning method.
• Each path from the root to a leaf corresponds to a conjunction of tests, as all the conditions along the path should be satisfied to reach the leaf. The paths can be converted into IF-THEN rules, and the collection of such rules is called a rule base. For example:
      IF Input1 THEN Output y1
      IF Input1 AND Input2 THEN Output y2

Fig 3.20: Rules extraction from decision trees

Multivariate trees
• In a multivariate tree, at a decision node all input dimensions can be used, and thus it is more general. When all inputs are numeric, a binary linear multivariate node is defined as

      fm(x): wmᵀx + wm0 > 0

• Multivariate decision trees alleviate the replication problems of univariate decision trees. In a multivariate decision tree, each test can be based on one or more of the input dimensions.

Example 4: Given the training data in Example 3, build a decision tree and classify the following new example: age ≤ 30, income = medium, student = yes, credit_rating = fair.

Solution: Check which attribute provides the highest information gain, in order to split the training set based on that attribute. Calculate the mutual information of the two classes and the entropy of each attribute; the information gain is the mutual information minus the entropy after the split.

I(S_yes, S_no) = I(9,5) = −9/14 log2(9/14) − 5/14 log2(5/14) = 0.94

Age has three values: age≤30 (2 yes and 3 no), age31–40 (4 yes and 0 no) and age>40 (3 yes and 2 no).
Entropy(age) = 5/14 (−2/5 log(2/5) − 3/5 log(3/5)) + 4/14 (0) + 5/14 (−3/5 log(3/5) − 2/5 log(2/5))
             = 5/14 (0.9710) + 0 + 5/14 (0.9710) = 0.6935
Gain(age) = 0.94 − 0.6935 = 0.2465

Income has three values: income high (2 yes and 2 no), income medium (4 yes and 2 no) and income low (3 yes and 1 no).
Entropy(income) = 4/14 (1) + 6/14 (0.918) + 4/14 (0.811) = 0.2857 + 0.3934 + 0.2317 = 0.9108
Gain(income) = 0.94 − 0.9108 = 0.0292

Student has two values: student yes (6 yes and 1 no) and student no (3 yes and 4 no).
Entropy(student) = 7/14 (−6/7 log(6/7) − 1/7 log(1/7)) + 7/14 (−3/7 log(3/7) − 4/7 log(4/7))
                 = 7/14 (0.5916) + 7/14 (0.9852) = 0.2958 + 0.4926 = 0.7884
Gain(student) = 0.94 − 0.7884 = 0.1516

Credit_rating has two values: credit fair (6 yes and 2 no) and credit excellent (3 yes and 3 no).
Entropy(credit_rating) = 8/14 (−6/8 log(6/8) − 2/8 log(2/8)) + 6/14 (−3/6 log(3/6) − 3/6 log(3/6))
                       = 8/14 (0.8112) + 6/14 (1) = 0.4635 + 0.4286 = 0.8921
Gain(credit_rating) = 0.94 − 0.8921 = 0.0479

Age has the highest information gain, so start splitting the dataset using the age attribute. The data is partitioned into three branches: age ≤ 30, age 31–40 and age > 40; the age 31–40 branch contains only "yes" instances and becomes a leaf node. The same process of splitting has to happen for the two remaining branches.
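The information gains computed above can be reproduced with a short sketch that encodes the 14-row table from Example 3; the helper functions below are illustrative.

import numpy as np
from collections import Counter

def entropy(labels):
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def info_gain(rows, labels, attr_index):
    total = entropy(labels)
    n = len(labels)
    weighted = 0.0
    for value in set(r[attr_index] for r in rows):
        subset = [lab for r, lab in zip(rows, labels) if r[attr_index] == value]
        weighted += len(subset) / n * entropy(subset)
    return total - weighted

# columns: age, income, student, credit_rating (from the Example 3 table)
rows = [("<=30","high","no","fair"), ("<=30","high","no","excellent"),
        ("31to40","high","no","fair"), (">40","medium","no","fair"),
        (">40","low","yes","fair"), (">40","low","yes","excellent"),
        ("31to40","low","yes","excellent"), ("<=30","medium","no","fair"),
        ("<=30","low","yes","fair"), (">40","medium","yes","fair"),
        ("<=30","medium","yes","excellent"), ("31to40","medium","no","excellent"),
        ("31to40","high","yes","fair"), (">40","medium","no","excellent")]
labels = ["no","no","yes","yes","yes","no","yes","no","yes","yes","yes","yes","yes","no"]

for i, name in enumerate(["age", "income", "student", "credit_rating"]):
    print(name, round(info_gain(rows, labels, i), 3))
# age has the highest gain (about 0.247), so it becomes the root split.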
For the branch age ≤ 30, the attributes income, student and credit_rating can be applied. For this branch the mutual information is
I(S_yes, S_no) = I(2,3) = −2/5 log2(2/5) − 3/5 log2(3/5) = 0.97

Income has three values: income high (0 yes and 2 no), income medium (1 yes and 1 no) and income low (1 yes and 0 no).
Entropy(income) = 2/5 (0) + 2/5 (−1/2 log(1/2) − 1/2 log(1/2)) + 1/5 (0) = 2/5 (1) = 0.4
Gain(income) = 0.97 − 0.4 = 0.57

Student has two values: student yes (2 yes and 0 no) and student no (0 yes and 3 no).
Entropy(student) = 2/5 (0) + 3/5 (0) = 0
Gain(student) = 0.97 − 0 = 0.97

Student has the highest information gain for this branch, so the branch is split on the student attribute. On checking, the instances in the two new branches are from distinct classes, so they are made leaf nodes with their respective class labels.

Again, the same process is needed for the other branch of age (age > 40). The mutual information is
I(S_yes, S_no) = I(3,2) = −3/5 log2(3/5) − 2/5 log2(2/5) = 0.97

Income has two values: income medium (2 yes and 1 no) and income low (1 yes and 1 no).
Entropy(income) = 3/5 (−2/3 log(2/3) − 1/3 log(1/3)) + 2/5 (−1/2 log(1/2) − 1/2 log(1/2)) = 3/5 (0.9182) + 2/5 (1) = 0.55 + 0.4 = 0.95
Gain(income) = 0.97 − 0.95 = 0.02

Student has two values: student yes (2 yes and 1 no) and student no (1 yes and 1 no).
Entropy(student) = 3/5 (−2/3 log(2/3) − 1/3 log(1/3)) + 2/5 (−1/2 log(1/2) − 1/2 log(1/2)) = 0.55 + 0.4 = 0.95
Gain(student) = 0.97 − 0.95 = 0.02

Credit_rating has two values: credit fair (3 yes and 0 no) and credit excellent (0 yes and 2 no), so Entropy(credit_rating) = 0 and Gain(credit_rating) = 0.97, the highest for this branch. The age > 40 branch is therefore split on credit_rating, and its two sub-branches become pure leaf nodes, which completes the tree. Following the tree for the new example (age ≤ 30, student = yes), the prediction is buys_computer = yes.

Random Forest

The basic principle behind random forest is ensemble learning, which is the process of combining multiple homogeneous or heterogeneous classifiers to solve a complex problem and to improve the performance of the model. A random forest algorithm consists of many decision trees. The forest generated by the random forest algorithm is trained through bagging or bootstrap aggregating. In bagging, a random sample of data in the training set is selected with replacement (i.e. the individual data points can be chosen more than once).

The random forest algorithm establishes the outcome based on the predictions of the decision trees: it predicts by taking the average or mean of the output from the various trees, and increasing the number of trees increases the precision of the outcome. Random forest uses both bagging and boosting:
• Bagging: Creates a different training subset from the sample training data with replacement; the final output is based on majority voting.
• Boosting: Combines weak learners into strong learners by creating sequential models such that the final model has the highest accuracy.

Steps to construct a Random Forest:

Fig 3.22: Random Forest with 600 individual trees

Random forest constructs multiple decision trees during training time and gives the average (or majority) of the classes as the final prediction. The following are the steps:
• Pick at random k data points from the training set.
• Build a decision tree associated to these k data points.
• Choose the number N of trees you want to build and repeat the above steps.
• For a new data point, make each of the N trees predict the value of y for the data point, and assign the new data point the average across all of the predicted y values (for regression); in the case of a classification task, a majority voting classifier is used to aggregate the results of the individual trees.

Regression in random forests: Random forest regression follows the concept of simple regression. In a random forest regression, each tree in the ensemble produces a specific prediction, and the mean prediction of the individual trees is the output of the regression. This is contrary to random forest classification, whose output is determined by the mode (the most frequent class) of the decision trees' classes.
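A brief sketch of this construction — bootstrap samples, many trees, majority voting — using scikit-learn's RandomForestClassifier is shown below; the bundled iris dataset stands in for any training set, and the hyperparameter values are illustrative.

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

# n_estimators is the number N of trees; each tree is trained on a bootstrap
# sample and considers a random subset of features at every split.
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
forest.fit(X, y)

print("training accuracy:", forest.score(X, y))
print("prediction for one new point:", forest.predict(X[:1]))  # majority vote of the trees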
Features of Random Forest
• Diversity: Not all attributes/variables/features are considered while making an individual tree; each tree is different.
• Immune to the curse of dimensionality: Since each tree does not consider all the features, the feature space seen by any one tree is reduced.
• Stability: The result is stable because it is based on majority voting/averaging.

Differences between decision trees and Random Forest

Decision trees                                     | Random Forest
Decision trees suffer from overfitting if grown    | As the output of a random forest is based on
without any control.                               | averaging or majority ranking, there is no
                                                   | overfitting.
A single decision tree is faster in computation.   | It is comparatively slower, as aggregating the
                                                   | results of the individual trees may consume time.
When a data set with features is taken as input    | Random forest randomly selects observations,
by a decision tree, it formulates a set of rules   | builds a decision tree and takes the average
to predict the outcome.                            | result. It is not rule-based.

Advantages of random forest
• It can be used in classification as well as regression problems.
• It solves the problem of overfitting, as the output is based on majority voting or averaging.
• It performs well even if the data contains null/missing values.
• Each decision tree created is independent of the others, so it shows the property of parallelization.
• It is highly stable, as the average of the answers given by a large number of trees is taken.
• It maintains diversity, as not all the attributes are considered while making each decision tree (though this is not true in all cases).
• It is immune to the curse of dimensionality.
• No train/test split is needed, as there will always be about 30% of the data which is not seen by a given decision tree.

Disadvantages
• It is highly complex compared to decision trees, where decisions can be made by following the path of the tree.
• Training time is longer than for other models due to its complexity: whenever a prediction has to be made, each decision tree has to generate output for the given input data, which requires more resources.
