3. Explain in detail about Bayesian Network

Bayesian Network
3.1 Bayesian Network
3.2 Joint probability distribution
3.3 Constructing Bayesian Network
3.4 Example
3.5 The semantics of Bayesian Network
3.6 Applications of Bayesian networks in AI

3.1 Bayesian Network
"A Bayesian network is a probabilistic graphical model which represents a set of variables and their conditional dependencies using a directed acyclic graph." It is also called a Bayes network, belief network, decision network, or Bayesian model.
* A Bayesian network can be used for building models from data and expert opinions, and it consists of two parts:
  o Directed Acyclic Graph
  o Table of conditional probabilities
* The generalized form of a Bayesian network that represents and solves decision problems under uncertain knowledge is known as an Influence diagram.
* It is used to represent conditional dependencies. It can also be used in various tasks including prediction, anomaly detection, diagnostics, automated insight, reasoning, time series prediction, and decision making under uncertainty.
* A Bayesian network graph is made up of nodes and arcs (directed links).

Figure 2.1 - Example for Bayesian Network

* Each node corresponds to a random variable, and a variable can be continuous or discrete.
* Arcs or directed arrows represent the causal relationship or conditional probabilities between random variables. These directed links or arrows connect pairs of nodes in the graph. A link represents that one node directly influences the other node; if there is no directed link, the nodes are independent of each other.

Example
In figure 2.1, A, B, C, and D are random variables represented by the nodes of the network graph.
* Considering node B, which is connected with node A by a directed arrow, node A is called the parent of node B.
* Node C is independent of node A.
* The Bayesian network graph does not contain any cycles. Hence, it is known as a directed acyclic graph or DAG.

The Bayesian network has mainly two components:
1. Causal Component
2. Actual numbers

Each node in the Bayesian network has a conditional probability distribution P(Xi | Parents(Xi)), which determines the effect of the parents on that node. A Bayesian network is based on the joint probability distribution and conditional probability.

3.2 Joint probability distribution:
* If the variables are x1, x2, x3, ..., xn, then the probabilities of the different combinations of x1, x2, x3, ..., xn are known as the joint probability distribution.
* P[x1, x2, x3, ..., xn] can be written in the following way in terms of conditional probabilities (the chain rule):
  P[x1, x2, ..., xn] = P[x1 | x2, x3, ..., xn] P[x2, x3, ..., xn]
                     = P[x1 | x2, ..., xn] P[x2 | x3, ..., xn] ... P[xn-1 | xn] P[xn]
* In a Bayesian network, for each variable Xi:
  P(Xi | Xi-1, ..., X1) = P(Xi | Parents(Xi))
  so the joint distribution factorizes into a product of the conditional probability tables (a small illustrative sketch is given further below).

Advantages of the Decision Tree
* It is simple to understand as it follows the same process which a human follows while making any decision in real life.
* It can be very useful for solving decision-related problems.
* It helps to think about all the possible outcomes for a problem.
* There is less requirement of data cleaning compared to other algorithms.

Disadvantages of the Decision Tree
* The decision tree contains lots of layers, which makes it complex.
* It may have an overfitting issue, which can be resolved using the Random Forest algorithm.
* For more class labels, the computational complexity of the decision tree may increase.
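To make the factorization in Section 3.2 concrete, here is a minimal Python sketch (not part of the original notes). The network structure A → B, A → C, (B, C) → D and all probability values are hypothetical, chosen only to illustrate the chain-rule product of conditional probability tables.

```python
# A minimal sketch of the factorization P(X1,...,Xn) = prod_i P(Xi | Parents(Xi)).
# The network structure and all probability values below are hypothetical.

parents = {"A": [], "B": ["A"], "C": ["A"], "D": ["B", "C"]}

# Conditional probability tables: P(node = True | values of its parents).
cpt = {
    "A": {(): 0.3},
    "B": {(True,): 0.8, (False,): 0.1},
    "C": {(True,): 0.4, (False,): 0.7},
    "D": {(True, True): 0.9, (True, False): 0.6,
          (False, True): 0.5, (False, False): 0.05},
}

def prob(node, value, assignment):
    """P(node = value | its parents' values in the assignment)."""
    key = tuple(assignment[p] for p in parents[node])
    p_true = cpt[node][key]
    return p_true if value else 1.0 - p_true

def joint(assignment):
    """Joint probability of a full assignment via the chain-rule factorization."""
    result = 1.0
    for node, value in assignment.items():
        result *= prob(node, value, assignment)
    return result

print(joint({"A": True, "B": True, "C": False, "D": True}))
# 0.3 * 0.8 * (1 - 0.4) * 0.6 = 0.0864
```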
8. Elaborate in detail about Random Forest in Supervised Learning.

RANDOM FOREST
Random Forest
Steps in the working process of Random Forest
Need for Random Forest
Example
Important Features of Random Forest
Applications of Random Forest
Advantages of Random Forest
Disadvantages of Random Forest
Difference between Decision Tree & Random Forest

Random Forest
* Random Forest is a classifier that contains a number of decision trees built on various subsets of the given dataset and takes the average to improve the predictive accuracy of that dataset.
* Random Forest is a popular machine learning algorithm that belongs to the supervised learning technique.
* It can be used for both Classification and Regression problems in ML.
* It is based on the concept of ensemble learning, which is a process of combining multiple classifiers to solve a complex problem and to improve the performance of the model.
* A greater number of trees in the forest leads to higher accuracy and prevents the problem of overfitting.

Steps in the working process of Random Forest
The working process can be explained in the below steps and diagram (see also the code sketch below):
Step 1: In Random Forest, n random records are taken from the dataset having k records.
Step 2: Individual decision trees are constructed for each sample.
Step 3: Each decision tree will generate an output.
Step 4: The final output is considered based on Majority Voting or Averaging for Classification and Regression respectively.

Need for Random Forest
* It takes less training time as compared to other algorithms.
* It predicts output with high accuracy; even for a large dataset it runs efficiently.
* It can also maintain accuracy when a large proportion of data is missing.

Example:
* Suppose there is a dataset that contains multiple fruit images. This dataset is given to the Random Forest classifier. The dataset is divided into subsets and given to each decision tree. During the training phase, each decision tree produces a prediction result, and when a new data point occurs, then based on the majority of results, the Random Forest classifier predicts the final decision.

Figure 3.16 - Example for Random Forest
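The following is a minimal, hypothetical sketch of the workflow in Steps 1-4 above: a small forest is built by hand from scikit-learn decision trees on bootstrap samples and combined by majority voting. The dataset is synthetic, and in practice scikit-learn's ready-made RandomForestClassifier would normally be used instead.

```python
# A minimal sketch of the Random Forest workflow (bootstrap samples, one tree
# per sample, majority vote). Data and hyperparameters are hypothetical.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=8, random_state=0)

rng = np.random.default_rng(0)
trees = []
for _ in range(25):                                   # Steps 1-2: bootstrap sample + one tree per sample
    idx = rng.integers(0, len(X), len(X))
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=0)
    tree.fit(X[idx], y[idx])
    trees.append(tree)

# Steps 3-4: collect each tree's output and take a majority vote per sample.
votes = np.array([t.predict(X) for t in trees])       # shape: (n_trees, n_samples)
majority = (votes.mean(axis=0) >= 0.5).astype(int)    # majority vote for binary labels 0/1
print("training accuracy of the ensemble:", (majority == y).mean())
```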
3. Explain in detail about Linear Classification Models - Discriminant function

LINEAR CLASSIFICATION MODELS - DISCRIMINANT FUNCTION
Linear Classification Models
Types of ML Classification Algorithms
Discriminant function

Linear Classification Models
* The Classification algorithm is a Supervised Learning technique that is used to identify the category of new observations on the basis of training data.
* In Classification, a program learns from the given dataset or observations and then classifies new observations into a number of classes or groups, such as Yes or No, 0 or 1, Spam or Not Spam, cat or dog, etc.
* Classes can be called targets/labels or categories.
* The output variable of Classification is a category, not a value, such as "Green or Blue", "fruit or animal", etc.
* Since the Classification algorithm is a supervised learning technique, it takes labeled input data, which means it contains the input with the corresponding output.
* In a classification algorithm, a discrete output function (y) is mapped to the input variable(s): y = f(x), where y = categorical output.
* The best example of an ML classification algorithm is the Email Spam Detector.
* The goal of the classification algorithm is to:
  o Take a D-dimensional input vector x
  o Assign it to one of K discrete classes Ck, k = 1, ..., K
* In the most common scenario, the classes are taken to be disjoint and each input is assigned to one and only one class.
* The input space is divided into decision regions.
* The boundaries of the decision regions are called decision boundaries or decision surfaces.
* With linear models for classification, the decision surfaces are linear functions. Classes that can be separated well by linear surfaces are linearly separable.
* In figure 3.5, there are two classes, class A and class B. These classes have features that are similar to each other and dissimilar to the other class.

Figure 3.5 - Example of Classification

* The algorithm which implements the classification on a dataset is known as a classifier.
* There are two types of classification problems:
  o Two-class problems (binary representation or Binary Classifier):
    - If the classification problem has only two possible outcomes, then it is called a Binary Classifier.
    - There is a single target variable t ∈ {0, 1}.
    - t = 1 represents class C1, and t = 0 represents class C2.
    - Examples: YES or NO, MALE or FEMALE, SPAM or NOT SPAM, CAT or DOG, etc.
  o Multi-class problems:
    - If a classification problem has more than two outcomes, then it is called a Multi-class Classifier.
    - Example: classification of types of crops, classification of types of music.
    - 1-of-K coding scheme: there is a K-long target vector t such that, if the class is Cj, all elements tk of t are zero for k ≠ j and one for k = j; tk can be interpreted as the probability that the class is Ck. For example, if K = 6 and the class is C4, then t = (0, 0, 0, 1, 0, 0)T.
* The simplest approach to classification problems is through construction of a discriminant function that directly assigns each vector x to a specific class.

Types of ML Classification Algorithms:
* Logistic Regression
* K-Nearest Neighbors
* Support Vector Machines
* Kernel SVM
* Naive Bayes
* Decision Tree Classification
* Random Forest Classification

Discriminant function
* A function of a set of variables that is evaluated for samples of events or objects and used as an aid in discriminating between or classifying them (a minimal sketch is given below).
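As a small illustration of a two-class linear discriminant of the kind described above, here is a hypothetical sketch. The weight vector and bias are made-up constants (in practice they would be learned from data), and the decision boundary y(x) = 0 is a linear decision surface.

```python
# A minimal sketch of a linear discriminant function y(x) = w^T x + w0 for a
# two-class problem. The weight vector and bias are hypothetical fixed values.

import numpy as np

w = np.array([1.5, -2.0])   # hypothetical weight vector
w0 = 0.5                    # hypothetical bias

def classify(x):
    """Assign x to class C1 if y(x) > 0, otherwise to class C2."""
    y = w @ x + w0
    return "C1" if y > 0 else "C2"

print(classify(np.array([2.0, 0.5])))   # y = 1.5*2 - 2*0.5 + 0.5 = 2.5  -> C1
print(classify(np.array([0.0, 1.0])))   # y = -2.0 + 0.5 = -1.5          -> C2
```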
6. Explain in detail about Unsupervised Learning

* In supervised learning, the aim is to learn a mapping from the input to an output whose correct values are provided by a supervisor. In unsupervised learning, there is no such supervisor and we have only input data.
* The aim is to find the regularities in the input. There is a structure to the input space such that certain patterns occur more often than others, and we want to see what generally happens and what does not. In statistics, this is called density estimation.
* One method for density estimation is clustering, where the aim is to find clusters or groupings of the input.
* Unsupervised learning is a type of machine learning in which models are trained using an unlabeled dataset and are allowed to act on that data without any supervision.

Clustering
* Given a set of objects, place them in groups such that the objects in a group are similar to one another and different from the objects in other groups.
* Cluster analysis can be a powerful data-mining tool for any organization.
* A cluster is a group of objects that belong to the same class.
* Clustering is a process of partitioning a set of data into meaningful subclasses.

Figure 4.4 Clustering

Clustering Methods:
* Density-Based Methods: These methods consider the clusters as the dense region having some similarities and differences from the lower-dense region of the space. These methods have good accuracy and the ability to merge two clusters. Examples: DBSCAN (Density-Based Spatial Clustering of Applications with Noise), OPTICS (Ordering Points to Identify Clustering Structure), etc.
* Hierarchical Based Methods: The clusters formed in this method form a tree-type structure based on the hierarchy. New clusters are formed using the previously formed ones. It is divided into two categories:
  o Agglomerative (bottom-up approach)
  o Divisive (top-down approach)

Unsupervised Learning: K-means
* K-Means Clustering is an Unsupervised Learning algorithm which groups the unlabeled dataset into different clusters.
* Here K defines the number of pre-defined clusters that need to be created in the process; if K = 2, there will be two clusters, for K = 3, there will be three clusters, and so on.
* It is an iterative algorithm that divides the unlabeled dataset into k different clusters in such a way that each data point belongs to only one group that has similar properties.
* It allows us to cluster the data into different groups and is a convenient way to discover the categories of groups in the unlabeled dataset on its own, without the need for any training, as in Figure 4.4.
* It is a centroid-based algorithm, where each cluster is associated with a centroid. The main aim of this algorithm is to minimize the sum of distances between the data points and their corresponding cluster centroids.
* The algorithm takes the unlabeled dataset as input, divides the dataset into k clusters, and repeats the process until it cannot find better clusters. The value of k should be predetermined in this algorithm.
* The k-means clustering algorithm mainly performs two tasks (see the sketch below):
  o Determines the best value for the K center points or centroids by an iterative process.
  o Assigns each data point to its closest k-center. Those data points which are near to a particular k-center create a cluster. Hence each cluster has data points with some commonalities, and it is away from other clusters.

Figure 4.5 Explains the working of the K-means Clustering Algorithm
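A minimal NumPy sketch of the two alternating K-means tasks described above (assigning each point to its closest centroid, then recomputing the centroids). The data, the value of k and the number of iterations are hypothetical, chosen only for illustration.

```python
# A minimal sketch of the K-means assignment and update steps on synthetic data.

import numpy as np

rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])  # two synthetic blobs
k = 2

centroids = X[rng.choice(len(X), k, replace=False)]    # initial centers picked at random
for _ in range(10):                                     # fixed number of iterations for simplicity
    # Assignment step: index of the nearest centroid for every point.
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # Update step: each centroid becomes the mean of the points assigned to it.
    centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])

print("final centroids:\n", centroids)
```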
Instance based learning: KNN
* K-Nearest Neighbour is one of the simplest Machine Learning algorithms, based on the Supervised Learning technique.
* The K-NN algorithm assumes the similarity between the new case/data and the available cases and puts the new case into the category that is most similar to the available categories.
* The K-NN algorithm stores all the available data and classifies a new data point based on similarity. This means that when new data appears, it can be easily classified into a well-suited category by using the K-NN algorithm.
* The K-NN algorithm can be used for Regression as well as for Classification, but mostly it is used for Classification problems.
* K-NN is a non-parametric algorithm, which means it does not make any assumption about the underlying data.
* It is also called a lazy learner algorithm because it does not learn from the training set immediately; instead it stores the dataset and, at the time of classification, it performs an action on the dataset.
* The KNN algorithm at the training phase just stores the dataset, and when it gets new data, it classifies that data into a category that is most similar to the new data.

Why do we need a K-NN Algorithm?
Suppose there are two categories, i.e., Category A and Category B, and we have a new data point x1; in which of these categories will this data point lie? To solve this type of problem, we need a K-NN algorithm. With the help of K-NN, we can easily identify the category or class of a particular data point. Consider the below diagram:

Figure 4.6 Explains the working of the K-NN Algorithm

How does K-NN work?
The K-NN working can be explained on the basis of the below algorithm:
* Step-1: Select the number K of the neighbors.
* Step-2: Calculate the Euclidean distance of K number of neighbors.
* Step-3: Take the K nearest neighbors as per the calculated Euclidean distance.
* Step-4: Among these K neighbors, count the number of the data points in each category.
* Step-5: Assign the new data point to that category for which the number of neighbors is maximum.
* Step-6: Our model is ready.

Suppose we have a new data point and we need to put it in the required category. Consider the below image:

7. Explain in detail about Gaussian Mixture models and Expectation Maximization.

Gaussian Mixture Model
* This model is a soft probabilistic clustering model that allows us to describe the membership of points to a set of clusters using a mixture of Gaussian densities.
* It is a soft classification (in contrast to a hard one) because it assigns probabilities of belonging to a specific class instead of a definitive choice. In essence, each observation will belong to every class but with different probabilities.
* Because mixture models are more flexible, they can be more accurate than K-means. K-means is typically faster to converge and so may be preferred in cases where the runtime is an important consideration.
* In general, K-means will be faster and more accurate when the dataset is large and the clusters are well separated. Gaussian mixture models will be more accurate when the dataset is small or the clusters are not well separated.
* Gaussian mixture models take into account the variance of the data, whereas K-means does not.
* Gaussian mixture models are more flexible in terms of the shape of the clusters, whereas K-means is limited to spherical clusters.
* Gaussian mixture models can handle missing data, whereas K-means cannot. This difference can make Gaussian mixture models more effective in certain applications, such as data with a lot of noise or data that is not well defined.
* In the mixture model we write the density as a weighted sum of component densities (a small numerical sketch of this density appears below):
  p(x) = Σi P(Gi) p(x|Gi)
  where P(Gi) are the mixture proportions and p(x|Gi) are the component densities.
* For example, in Gaussian mixtures we have p(x|Gi) ~ N(μi, Σi), and defining πi = P(Gi), we have the parameter vector Φ = {πi, μi, Σi} for i = 1, ..., k that we need to learn from data.

Figure 4.10 The generative graphical representation of a Gaussian mixture model.

The EM algorithm is a maximum likelihood procedure:
  ΦML = argmaxΦ log p(X|Φ)
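Before turning to the Bayesian (MAP) variant below, here is a small numerical sketch of the mixture density p(x) = Σi P(Gi) p(x|Gi) and of the soft memberships it induces. The mixture proportions, means and standard deviations are hypothetical, one-dimensional values chosen only for illustration.

```python
# A minimal sketch evaluating a one-dimensional Gaussian mixture density
# p(x) = sum_i P(G_i) * N(x | mu_i, sigma_i^2). All parameter values are hypothetical.

import math

components = [
    # (P(G_i), mu_i, sigma_i)
    (0.6, 0.0, 1.0),
    (0.4, 4.0, 1.5),
]

def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def mixture_density(x):
    return sum(p * normal_pdf(x, mu, sigma) for p, mu, sigma in components)

def responsibilities(x):
    """Posterior P(G_i | x): the soft memberships that make this a soft clustering model."""
    weighted = [p * normal_pdf(x, mu, sigma) for p, mu, sigma in components]
    total = sum(weighted)
    return [w / total for w in weighted]

print(mixture_density(2.0), responsibilities(2.0))
```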
If we have a prior distribution p(Φ), we can devise a Bayesian approach. For example, the MAP estimator is
  ΦMAP = argmaxΦ log p(Φ|X) = argmaxΦ [log p(X|Φ) + log p(Φ)]

For the means and precision (inverse covariance) matrices, we can use a Dirichlet and normal-Wishart prior:
  p(Φ) = p(π) Πi p(μi, Λi) = Dirichlet(π|α) Πi normal-Wishart(μi, Λi | μ0, κ0, ν0, V0)

Expectation-Maximization Algorithm
* In k-means, clustering is the problem of finding codebook vectors that minimize the total reconstruction error.
* Here the approach is probabilistic and we look for the component density parameters that maximize the likelihood of the sample.
* Using the mixture model equation above, the log likelihood given the sample X = {x^t}t is
  L(Φ|X) = Σt log p(x^t|Φ) = Σt log Σi p(x^t|Gi) P(Gi)
* Here Φ includes the priors P(Gi) and also the sufficient statistics of the component densities p(x^t|Gi).
* Unfortunately, we cannot solve for the parameters analytically and need to resort to iterative optimization.
* The expectation-maximization algorithm (Dempster, Laird, and Rubin 1977; Redner and Walker 1984) is used in maximum likelihood estimation where the problem involves two sets of random variables, of which one, X, is observable and the other, Z, is hidden.
* The goal of the algorithm is to find the parameter vector Φ that maximizes the likelihood of the observed values of X, L(Φ|X).
* But in cases where this is not feasible, we associate the extra hidden variables Z and express the underlying model using both, to maximize the likelihood of the joint distribution of X and Z, the complete likelihood Lc(Φ|X, Z).
* Since the Z values are not observed, we cannot work directly with the complete data likelihood Lc; instead, we work with its expectation, Q, given X and the current parameter values Φ^l, where l indexes the iteration.
* This is the expectation (E) step of the algorithm. Then in the maximization (M) step, we look for the new parameter values, Φ^(l+1), that maximize this. Thus
  E-step: Q(Φ|Φ^l) = E[Lc(Φ|X, Z) | X, Φ^l]
  M-step: Φ^(l+1) = argmaxΦ Q(Φ|Φ^l)
* In the E-step we estimate these labels given our current knowledge of the components, and in the M-step we update our component knowledge given the labels estimated in the E-step.
* These two steps are the same as the two steps of k-means: calculation of the estimated cluster labels (E-step) and re-estimation of the means mi (M-step).
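The following is a minimal sketch of these E and M steps for a one-dimensional Gaussian mixture. The synthetic data, the number of components and the initial parameter values are hypothetical, and a real implementation would also monitor the log likelihood for convergence.

```python
# A minimal sketch of EM for a one-dimensional Gaussian mixture, following the
# E-step / M-step structure described above. Data and initial values are hypothetical.

import numpy as np

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0, 1, 100), rng.normal(5, 1, 100)])  # synthetic sample

k = 2
pi = np.full(k, 1.0 / k)          # mixture proportions P(G_i)
mu = np.array([-1.0, 1.0])        # initial means
var = np.ones(k)                  # initial variances

def normal_pdf(x, mu, var):
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

for _ in range(20):
    # E-step: responsibilities h[t, i] = P(G_i | x^t) under the current parameters.
    weighted = np.stack([pi[i] * normal_pdf(x, mu[i], var[i]) for i in range(k)], axis=1)
    h = weighted / weighted.sum(axis=1, keepdims=True)
    # M-step: re-estimate proportions, means and variances from the soft labels.
    Nk = h.sum(axis=0)
    pi = Nk / len(x)
    mu = (h * x[:, None]).sum(axis=0) / Nk
    var = (h * (x[:, None] - mu) ** 2).sum(axis=0) / Nk

print("estimated proportions:", pi)
print("estimated means:", mu)
```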
7. Explain in detail about error backpropagation.

* Backpropagation is one of the important concepts of a neural network. For a single training example, the backpropagation algorithm calculates the gradient of the error function.
* Backpropagation algorithms are a set of methods used to efficiently train artificial neural networks following a gradient descent approach which exploits the chain rule.
* The main feature of backpropagation is the iterative, recursive and efficient method through which it calculates the updated weights to improve the network until it is able to perform the task for which it is being trained.
* The derivatives of the activation function must be known at network design time for backpropagation.

How the Backpropagation Algorithm Works
* The backpropagation algorithm in a neural network computes the gradient of the loss function for a single weight by the chain rule. It efficiently computes one layer at a time, unlike a naive direct computation. It computes the gradient, but it does not define how the gradient is used.
* It generalizes the computation in the delta rule (see Figure 5.16).

Figure 5.16 Back propagation neural network

The steps are as follows (a minimal sketch is given at the end of this answer):
1. Inputs X arrive through the preconnected path.
2. The input is modeled using real weights W. The weights are usually randomly selected.
3. Calculate the output for every neuron from the input layer, through the hidden layers, to the output layer.
4. Calculate the error in the outputs: Error = Actual Output - Desired Output.
5. Travel back from the output layer to the hidden layer to adjust the weights such that the error is decreased.
6. Keep repeating the process until the desired output is achieved.

Types of Backpropagation Networks
* Two types of backpropagation networks are:
  1. Static back-propagation
  2. Recurrent backpropagation
* Static back-propagation: one kind of backpropagation network which produces a mapping of a static input to a static output. It is useful for solving static classification problems like optical character recognition.
* Recurrent backpropagation: the input is fed forward until a fixed value is achieved. After that, the error is computed and propagated backward.
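A minimal sketch of Steps 1-6 above for a tiny network with one hidden layer, a sigmoid activation and a squared-error loss, trained on a single example. The architecture, the data and the learning rate are hypothetical, chosen only for illustration.

```python
# A minimal sketch of forward and backward passes for a tiny sigmoid network.

import numpy as np

rng = np.random.default_rng(1)
x = np.array([0.5, -0.2])          # Step 1: input
t = np.array([1.0])                # desired output
W1 = rng.normal(0, 0.5, (2, 2))    # Step 2: randomly selected weights
W2 = rng.normal(0, 0.5, (1, 2))
lr = 0.5

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for _ in range(100):
    # Step 3: forward pass through the hidden layer and the output layer.
    h = sigmoid(W1 @ x)
    y = sigmoid(W2 @ h)
    # Step 4: error at the output (derivative of the squared-error loss).
    err = y - t
    # Step 5: propagate the error backward (chain rule) and adjust the weights.
    delta_out = err * y * (1 - y)                 # dE/d(net input of the output neuron)
    delta_hid = (W2.T @ delta_out) * h * (1 - h)  # dE/d(net input of the hidden neurons)
    W2 -= lr * np.outer(delta_out, h)
    W1 -= lr * np.outer(delta_hid, x)

# Step 6: after repeating, the output approaches the desired value.
print("final output:", sigmoid(W2 @ sigmoid(W1 @ x)))
```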
8. Explain in detail about Unit saturation (aka the vanishing gradient problem).

* The vanishing gradient problem is an issue that sometimes arises when training machine learning algorithms through gradient descent. It most often occurs in neural networks that have several layers, such as in a deep learning system, but also occurs in recurrent neural networks.
* The key point is that the calculated partial derivatives used to compute the gradient become smaller as one goes deeper into the network. Since the gradients control how much the network learns during training, if the gradients are very small or zero then little to no training can take place, leading to poor predictive performance.

The problem:
* As more layers using certain activation functions are added to neural networks, the gradients of the loss function approach zero, making the network hard to train.

Why:
* Certain activation functions, like the sigmoid function, squash a large input space into a small output space between 0 and 1. Therefore, a large change in the input of the sigmoid function will cause a small change in the output. Hence, the derivative becomes small.

The sigmoid function and its derivative
* As an example, Figure 5.17 shows the sigmoid function and its derivative. Note how, when the input of the sigmoid function becomes larger or smaller (when |x| becomes bigger), the derivative becomes close to zero.

Figure 5.17 The sigmoid function and its derivative

Why it's significant:
* For a shallow network with only a few layers that use these activations, this isn't a big problem. However, when more layers are used, it can cause the gradient to be too small for training to work effectively.
* Gradients of neural networks are found using backpropagation. Simply put, backpropagation finds the derivatives of the network by moving layer by layer from the final layer to the initial one. By the chain rule, the derivatives of each layer are multiplied down the network (from the final layer to the initial one) to compute the derivatives of the initial layers.
* However, when n hidden layers use an activation like the sigmoid function, n small derivatives are multiplied together. Thus, the gradient decreases exponentially as we propagate down to the initial layers. A small gradient means that the weights and biases of the initial layers will not be updated effectively with each training session. Since these initial layers are often crucial to recognizing the core elements of the input data, this can lead to overall inaccuracy of the whole network.

Solution:
* The simplest solution is to use other activation functions, such as ReLU, which doesn't cause a small derivative.
* Residual networks are another solution, as they provide residual connections straight to earlier layers.
* The residual connection directly adds the value at the beginning of the block, x, to the end of the block (F(x) + x). This residual connection doesn't go through activation functions that "squash" the derivatives, resulting in a higher overall derivative of the block.
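A small numerical check of the argument above (the depth and input values are hypothetical): the sigmoid derivative is at most 0.25, so multiplying it across many layers shrinks the gradient exponentially, whereas the ReLU derivative is 1 in its active region and does not shrink the product.

```python
# A minimal sketch comparing sigmoid and ReLU derivatives, and the effect of
# multiplying per-layer derivatives through a deep chain (chain rule).

import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1.0 - s)          # at most 0.25, and near 0 for large |x|

def relu_derivative(x):
    return 1.0 if x > 0 else 0.0

print(sigmoid_derivative(0.0))    # 0.25 (its maximum value)
print(sigmoid_derivative(6.0))    # ~0.0025, already close to zero

depth = 20                         # hypothetical number of layers
print("sigmoid path:", sigmoid_derivative(0.0) ** depth)   # ~9e-13, vanishes
print("relu path:   ", relu_derivative(1.0) ** depth)      # 1.0, does not vanish
```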
