UNIT III: Probabilistic Models, Unsupervised Learning, and Reinforcement Learning
Prepared by Mrs. Jagruti S. Raut

Statistical Learning

The key concepts of statistical learning are data and hypotheses. The data are evidence, that is, instantiations of some or all of the random variables describing the domain. The hypotheses are probabilistic theories of how the domain works, including logical theories as a special case.

Consider a simple example. Our favorite Surprise candy comes in two flavors: cherry (yum) and lime (ugh). The manufacturer has a peculiar sense of humor and wraps each piece of candy in the same opaque wrapper, regardless of flavor. The candy is sold in very large bags, of which there are known to be five kinds, again indistinguishable from the outside:

* h1: 100% cherry
* h2: 75% cherry + 25% lime
* h3: 50% cherry + 50% lime
* h4: 25% cherry + 75% lime
* h5: 100% lime

Given a new bag of candy, the random variable H (for hypothesis) denotes the type of the bag, with possible values h1 through h5. H is not directly observable, of course. As the pieces of candy are opened and inspected, data are revealed: D1, D2, ..., DN, where each Di is a random variable with possible values cherry and lime. The basic task faced by the agent is to predict the flavor of the next piece of candy.

Bayesian learning simply calculates the probability of each hypothesis, given the data, and makes predictions on that basis. That is, the predictions are made by using all the hypotheses, weighted by their probabilities, rather than by using just a single "best" hypothesis. In this way, learning is reduced to probabilistic inference. Let D represent all the data, with observed value d; then the probability of each hypothesis is obtained by Bayes' rule:

P(hi | d) = α P(d | hi) P(hi)

Now, suppose we want to make a prediction about an unknown quantity X. Then we have:

P(X | d) = Σi P(X | d, hi) P(hi | d) = Σi P(X | hi) P(hi | d)

where we have assumed that each hypothesis determines a probability distribution over X. This equation shows that predictions are weighted averages over the predictions of the individual hypotheses. The hypotheses themselves are essentially "intermediaries" between the raw data and the predictions.

The key quantities in the Bayesian approach are the hypothesis prior, P(hi), and the likelihood of the data under each hypothesis, P(d | hi). For our candy example, we will assume for the time being that the prior distribution over h1, ..., h5 is given by (0.1, 0.2, 0.4, 0.2, 0.1), as advertised by the manufacturer. The likelihood of the data is calculated under the assumption that the observations are i.i.d., so that

P(d | hi) = Πj P(dj | hi)

[Figure 20.1: (a) Posterior probabilities P(hi | d1, ..., dN) as the number of observations N ranges from 1 to 10, when each observation is a lime candy. (b) The corresponding Bayesian prediction that the next candy is lime, P(dN+1 = lime | d1, ..., dN).]

A very common approximation, one that is usually adopted in science, is to make predictions based on a single most probable hypothesis, that is, an hi that maximizes P(hi | d). This is often called a maximum a posteriori or MAP hypothesis. Predictions made according to the MAP hypothesis hMAP are approximately Bayesian to the extent that P(X | d) ≈ P(X | hMAP). In our candy example, hMAP = h5 after three lime candies in a row, so the MAP learner then predicts that the fourth candy is lime with probability 1.0, a much more dangerous prediction than the Bayesian prediction. As more data arrive, the MAP and Bayesian predictions become closer, because the competitors to the MAP hypothesis become less and less probable.
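The Bayesian update and the MAP approximation just described can be sketched directly for the candy example. This is a minimal illustrative sketch, not from the source text: the priors and the per-hypothesis lime probabilities come from the example above, while the function and variable names are our own.

# Minimal sketch of Bayesian learning for the Surprise candy example.
priors = [0.1, 0.2, 0.4, 0.2, 0.1]        # P(h1)..P(h5), as advertised
p_lime = [0.0, 0.25, 0.5, 0.75, 1.0]      # P(lime | hi) for each bag type

def posterior(observations, priors=priors, p_lime=p_lime):
    """Return P(hi | d) after a sequence of 'lime'/'cherry' observations."""
    post = list(priors)
    for obs in observations:
        likelihoods = [p if obs == "lime" else 1.0 - p for p in p_lime]
        post = [pr * lk for pr, lk in zip(post, likelihoods)]
        total = sum(post)                 # the alpha normalization constant
        post = [p / total for p in post]
    return post

def predict_lime(post, p_lime=p_lime):
    """Bayesian prediction P(next = lime | d): weighted average over hypotheses."""
    return sum(p * w for p, w in zip(p_lime, post))

data = ["lime", "lime", "lime"]
post = posterior(data)
print("posterior:", [round(p, 3) for p in post])
print("Bayesian P(lime):", round(predict_lime(post), 3))
print("MAP hypothesis: h%d" % (post.index(max(post)) + 1))

Running this with three lime observations reproduces the behavior described above: the MAP hypothesis is h5, while the Bayesian prediction for the next candy being lime stays below 1.0.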
LEARNING WITH COMPLETE DATA

The general task of learning a probability model, given data that are assumed to be generated from that model, is called density estimation. Data are complete when each data point contains values for every variable in the probability model being learned. We focus on parameter learning, that is, finding the numerical parameters for a probability model whose structure is fixed.

Maximum-likelihood parameter learning: Discrete models

Suppose we buy a bag of lime and cherry candy from a new manufacturer whose lime-cherry proportions are completely unknown; the fraction could be anywhere between 0 and 1. In that case, we have a continuum of hypotheses. The parameter in this case, which we call θ, is the proportion of cherry candies, and the hypothesis is hθ. If we assume that all proportions are equally likely a priori, then a maximum-likelihood approach is reasonable. If we model the situation with a Bayesian network, we need just one random variable, Flavor. It has values cherry and lime, where the probability of cherry is θ.

Now suppose we unwrap N candies, of which c are cherries and ℓ = N − c are limes. Using the i.i.d. assumption P(d | hθ) = Πj P(dj | hθ), the likelihood of this particular data set is

P(d | hθ) = Πj P(dj | hθ) = θ^c (1 − θ)^ℓ

The maximum-likelihood hypothesis is given by the value of θ that maximizes this expression. The same value is obtained by maximizing the log likelihood,

L(d | hθ) = log P(d | hθ) = Σj log P(dj | hθ) = c log θ + ℓ log(1 − θ)

To find the maximum-likelihood value of θ, we differentiate L with respect to θ and set the resulting expression to zero:

dL(d | hθ)/dθ = c/θ − ℓ/(1 − θ) = 0, which gives θ = c/(c + ℓ) = c/N

In English, then, the maximum-likelihood hypothesis hML asserts that the actual proportion of cherries in the bag is equal to the observed proportion in the candies unwrapped so far.

Naive Bayes models

* In this model, the "class" variable C (which is to be predicted) is the root and the "attribute" variables Xi are the leaves.
* The model is "naive" because it assumes that the attributes are conditionally independent of each other, given the class.
* Assuming Boolean variables, the parameters are

θ = P(C = true),  θi1 = P(Xi = true | C = true),  θi2 = P(Xi = true | C = false)

* Once the model has been trained in this way, it can be used to classify new examples for which the class variable C is unobserved. With observed attribute values x1, ..., xn, the probability of each class is given by

P(C | x1, ..., xn) = α P(C) Πi P(xi | C)

* A deterministic prediction can be obtained by choosing the most likely class.
* The learning curve for this method, when applied to the restaurant problem, shows that naive Bayes learning does surprisingly well in a wide range of applications; the boosted version is one of the most effective general-purpose learning algorithms.
* Naive Bayes learning scales well to very large problems: with n Boolean attributes, there are just 2n + 1 parameters, and no search is required to find hML, the maximum-likelihood naive Bayes hypothesis.
* Finally, naive Bayes learning systems have no difficulty with noisy or missing data and can give probabilistic predictions when appropriate.
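A minimal sketch of the maximum-likelihood naive Bayes classifier described above. The tiny Boolean training set and all names are illustrative assumptions; only the counting estimates (θ = c/N) and the prediction formula P(C | x1..xn) = α P(C) Π P(xi | C) come from the text.

# Naive Bayes sketch over Boolean attributes, with maximum-likelihood counts.
from collections import defaultdict

# Each example: (tuple of Boolean attribute values, Boolean class label).
train = [
    ((True, False), True), ((True, True), True), ((False, True), True),
    ((False, False), False), ((True, False), False), ((False, False), False),
]

def fit(examples):
    n_attrs = len(examples[0][0])
    class_count = defaultdict(int)
    attr_true = defaultdict(lambda: [0] * n_attrs)   # counts of Xi = true per class
    for attrs, label in examples:
        class_count[label] += 1
        for i, v in enumerate(attrs):
            if v:
                attr_true[label][i] += 1
    prior = {c: class_count[c] / len(examples) for c in class_count}      # P(C)
    cond = {c: [attr_true[c][i] / class_count[c] for i in range(n_attrs)]  # P(Xi=true | C)
            for c in class_count}
    return prior, cond

def predict(prior, cond, attrs):
    scores = {}
    for c in prior:
        p = prior[c]
        for i, v in enumerate(attrs):
            p *= cond[c][i] if v else 1.0 - cond[c][i]
        scores[c] = p
    total = sum(scores.values())                      # alpha normalization
    return {c: s / total for c, s in scores.items()}

prior, cond = fit(train)
print(predict(prior, cond, (True, False)))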
LEARNING WITH HIDDEN VARIABLES: THE EM ALGORITHM

Many real-world problems have hidden variables (sometimes called latent variables), which are not observable in the data that are available for learning. For example, medical records often include the observed symptoms, the physician's diagnosis, the treatment applied, and perhaps the outcome of the treatment, but they seldom contain a direct observation of the disease itself. Consider a diagnosis network with three observable predisposing factors and three observable symptoms, where each variable has three possible values (e.g., none, moderate, and severe). Removing the hidden disease variable from the network increases the total number of parameters from 78 to 708. Thus, latent variables can dramatically reduce the number of parameters required to specify a Bayesian network. This, in turn, can dramatically reduce the amount of data needed to learn the parameters. Hidden variables are important, but they do complicate the learning problem.

Unsupervised clustering: Learning mixtures of Gaussians

Unsupervised clustering is the problem of discerning multiple categories in a collection of objects. The problem is unsupervised because the category labels are not given. For example, suppose we record the spectra of a hundred thousand stars; are there different types of stars revealed by the spectra, and, if so, how many types and what are their characteristics? The stars do not carry labels, so astronomers have to perform unsupervised clustering to identify the categories.

Unsupervised clustering begins with data. Imagine 500 data points, each of which specifies the values of two continuous attributes; the data points might correspond to stars, and the attributes might correspond to spectral intensities at two particular frequencies. Clustering presumes that the data are generated from a mixture distribution P. Such a distribution has k components, each of which is a distribution in its own right. A data point is generated by first choosing a component and then generating a sample from that component. Let the random variable C denote the component, with values 1, ..., k; then the mixture distribution is given by

P(x) = Σi P(C = i) P(x | C = i)

where x refers to the values of the attributes for a data point. For continuous data, a natural choice for the component distributions is the multivariate Gaussian, which gives the so-called mixture of Gaussians family of distributions. The parameters of a mixture of Gaussians are wi = P(C = i) (the weight of each component), μi (the mean of each component), and Σi (the covariance of each component). The unsupervised clustering problem, then, is to recover a mixture model like this from the raw data.
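The generative story above (first choose a component, then sample from it) can be written down directly. This is a small illustrative sketch: only the two-step sampling procedure comes from the text, while the weights, means, and covariances are assumed values.

# Sketch: generating data from a mixture of Gaussians (illustrative parameters).
import numpy as np

rng = np.random.default_rng(0)

weights = [0.5, 0.3, 0.2]                         # wi = P(C = i)
means = [np.array([0.0, 0.0]), np.array([3.0, 3.0]), np.array([0.0, 4.0])]
covs = [np.eye(2), 0.5 * np.eye(2), np.array([[1.0, 0.6], [0.6, 1.0]])]

def sample(n):
    points = []
    for _ in range(n):
        i = rng.choice(len(weights), p=weights)                        # choose a component
        points.append(rng.multivariate_normal(means[i], covs[i]))      # sample from it
    return np.array(points)

data = sample(500)
print(data.shape)    # (500, 2), like the two-attribute star example above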
The basic idea of EM in this context is to pretend that we know the parameters of the model and then to infer the probability that each data point belongs to each component. After that, we refit the components to the data, where each component is fitted to the entire data set, with each point weighted by the probability that it belongs to that component. The process iterates until convergence. For the mixture of Gaussians, we initialize the mixture-model parameters arbitrarily and then iterate the following two steps:

1. E-step: Compute the probabilities pij = P(C = i | xj), the probability that datum xj was generated by component i. By Bayes' rule, we have pij = α P(xj | C = i) P(C = i). The term P(xj | C = i) is just the probability at xj of the i-th Gaussian, and the term P(C = i) is just the weight parameter for the i-th Gaussian.

2. M-step: Compute the new means, covariances, and component weights using

μi ← Σj pij xj / ni
Σi ← Σj pij (xj − μi)(xj − μi)^T / ni
wi ← ni / N

where ni = Σj pij is the effective number of data points currently assigned to component i, and N is the total number of data points.

The E-step, or expectation step, can be viewed as computing the expected values pij of the hidden indicator variables Zij, where Zij is 1 if datum xj was generated by the i-th component and 0 otherwise. The M-step, or maximization step, finds the new values of the parameters that maximize the log likelihood of the data, given the expected values of the hidden indicator variables.
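A compact sketch of these two steps for a two-component, one-dimensional mixture. The synthetic data, initial values, and iteration count are assumptions for illustration, and the covariance update is shown for the scalar variance case.

# Sketch of EM for a 1-D mixture of two Gaussians (illustrative data and inits).
import numpy as np

rng = np.random.default_rng(1)
data = np.concatenate([rng.normal(-2.0, 1.0, 300), rng.normal(3.0, 0.7, 200)])
N, k = len(data), 2

w = np.array([0.5, 0.5])          # component weights wi
mu = np.array([-1.0, 1.0])        # means, initialized arbitrarily
var = np.array([1.0, 1.0])        # variances

def gaussian(x, m, v):
    return np.exp(-(x - m) ** 2 / (2 * v)) / np.sqrt(2 * np.pi * v)

for _ in range(50):
    # E-step: pij = alpha * P(xj | C = i) * P(C = i)
    p = np.array([w[i] * gaussian(data, mu[i], var[i]) for i in range(k)])
    p /= p.sum(axis=0)                       # normalize over components
    n = p.sum(axis=1)                        # ni = sum_j pij
    # M-step: refit each component with points weighted by pij
    mu = (p @ data) / n
    var = np.array([(p[i] * (data - mu[i]) ** 2).sum() / n[i] for i in range(k)])
    w = n / N

print("weights:", w.round(2), "means:", mu.round(2), "vars:", var.round(2))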
What is Reinforcement Learning?

* Reinforcement learning is a feedback-based machine learning technique in which an agent learns to behave in an environment by performing actions and seeing the results of those actions.
* For each good action, the agent gets positive feedback, and for each bad action, the agent gets negative feedback or a penalty.
* Reinforcement learning differs from supervised learning: in supervised learning the training data carries the answer key, so the model is trained with the correct answers, whereas in reinforcement learning there is no answer key and the agent decides what to do to perform the given task. In the absence of a training dataset, it is bound to learn from its own experience.
* The agent interacts with the environment and explores it by itself. The primary goal of an agent in reinforcement learning is to improve its performance by collecting the maximum positive reward.
* The agent learns by trial and error, and based on this experience it learns to perform the task in a better way. Hence, we can say that "reinforcement learning is a type of machine learning method where an intelligent agent interacts with the environment and learns how to act within it."

[Diagram: the agent takes actions in the environment; the environment returns the new state and a reward.]

Terms used in Reinforcement Learning

* Agent: an entity that can perceive/explore the environment and act upon it.
* Environment: the situation in which the agent is present or by which it is surrounded. In RL, we assume a stochastic environment, which means it is random in nature.
* Action: the moves taken by the agent within the environment.
* State: the situation returned by the environment after each action taken by the agent.
* Reward: feedback returned to the agent from the environment to evaluate the agent's action.
* Policy: the strategy applied by the agent to choose the next action based on the current state.
* Value: the expected long-term return with the discount factor, as opposed to the short-term reward.
* Q-value: mostly similar to Value, but it takes one additional parameter, the current action a.

PASSIVE REINFORCEMENT LEARNING

In passive learning, the agent's policy π is fixed: in state s, it always executes the action π(s). Its goal is simply to learn how good the policy is, that is, to learn the utility function Uπ(s). Consider a fixed policy for the maze world that the agent needs to explore, together with the corresponding utilities. The passive learning task is similar to the policy evaluation task, part of the policy iteration algorithm. The main difference is that the passive learning agent does not know the transition model P(s' | s, a), which specifies the probability of reaching state s' from state s after doing action a; nor does it know the reward function R(s), which specifies the reward for each state.

The agent executes a set of trials in the environment using its policy π. In each trial, the agent starts in state (1,1) and experiences a sequence of state transitions until it reaches one of the terminal states, (4,2) or (4,3). Its percepts supply both the current state and the reward received in that state. The objective is to use the information about rewards to learn the expected utility Uπ(s) associated with each nonterminal state s. The utility is defined to be the expected sum of (discounted) rewards obtained if policy π is followed.

[Figure 21.4: A passive reinforcement learning agent (PASSIVE-TD-AGENT) that learns utility estimates using temporal differences. Its input is a percept indicating the current state s' and reward signal r'; the step-size function α(n) is chosen to ensure convergence, as described in the text.]

ACTIVE REINFORCEMENT LEARNING

A passive learning agent has a fixed policy that determines its behavior. An active agent must decide what actions to take. First, the agent will need to learn a complete model with outcome probabilities for all actions, rather than just the model for the fixed policy. The simple learning mechanism used by PASSIVE-ADP-AGENT will do just fine for this. Next, we need to take into account the fact that the agent has a choice of actions. The utilities it needs to learn are those defined by the optimal policy; they obey the Bellman equations. The Bellman equation is a way of calculating value functions in dynamic programming, and it underlies modern reinforcement learning:

U(s) = R(s) + γ max_a Σ_{s'} P(s' | s, a) U(s')

where U(s) is the value calculated at a particular state, R(s) is the reward at state s, γ is the discount factor, and U(s') is the value of the successor state s'.

The max over actions is taken because the agent always tries to find the optimal solution. An agent must make a trade-off between exploitation, to maximize its reward as reflected in its current utility estimates, and exploration, to maximize its long-term well-being. Pure exploitation risks getting stuck in a rut. Pure exploration to improve one's knowledge is of no use if one never puts that knowledge into practice. In the real world, one constantly has to decide between continuing in a comfortable existence and striking out into the unknown in the hopes of discovering a new and better life. With greater understanding, less exploration is necessary.
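Before turning to applications, here is a minimal sketch of the temporal-difference idea used by the passive TD agent of Figure 21.4: the utility estimate is nudged toward r + γ U(s') after each observed transition. The short recorded trials, the decaying step size, and all names are illustrative assumptions, not the original agent program.

# Sketch: passive TD(0) utility learning from recorded trials (illustrative data).
from collections import defaultdict

# Each trial is a list of (state, reward) pairs observed while following
# the fixed policy pi; these short trials are made up for illustration.
trials = [
    [((1, 1), -0.04), ((1, 2), -0.04), ((1, 3), -0.04), ((2, 3), -0.04),
     ((3, 3), -0.04), ((4, 3), +1.0)],
    [((1, 1), -0.04), ((2, 1), -0.04), ((3, 1), -0.04), ((3, 2), -0.04),
     ((4, 2), -1.0)],
]

gamma = 1.0                      # discount factor
U = defaultdict(float)           # utility estimates
N = defaultdict(int)             # visit counts, used for the step size alpha(n)

def alpha(n):
    return 60.0 / (59.0 + n)     # a decaying step size (one common choice)

for trial in trials:
    for (s, r), (s2, _) in zip(trial, trial[1:]):
        N[s] += 1
        # TD(0) update: move U(s) toward r + gamma * U(s')
        U[s] += alpha(N[s]) * (r + gamma * U[s2] - U[s])
    s_term, r_term = trial[-1]   # a terminal state's utility is just its reward
    U[s_term] = r_term

print({s: round(u, 3) for s, u in U.items()})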
APPLICATIONS OF REINFORCEMENT LEARNING

1] Applications to game playing

Gerry Tesauro's backgammon program TD-GAMMON (1992) forcefully illustrates the potential of reinforcement learning techniques. In earlier work, Tesauro tried learning a neural network representation of Q(s, a) directly from examples of moves labeled with relative values by a human expert. This approach proved extremely tedious for the expert. It resulted in a program, called NEUROGAMMON, that was strong by computer standards but not competitive with human experts. The TD-GAMMON project was an attempt to learn from self-play alone. The only reward signal was given at the end of each game.

The evaluation function was represented by a fully connected neural network with a single hidden layer containing 40 nodes. Simply by repeated application of the TD update equation, TD-GAMMON learned to play considerably better than NEUROGAMMON, even though the input representation contained just the raw board position with no computed features. This took about 200,000 training games and two weeks of computer time. Although that may seem like a lot of games, it is only a vanishingly small fraction of the state space. When precomputed features were added to the input representation, a network with 80 hidden nodes was able, after 300,000 training games, to reach a standard of play comparable to that of the top three human players worldwide.

2] Application to robot control

The setup for the famous cart-pole balancing problem, also known as the inverted pendulum, consists of a long pole balanced on top of a moving cart; the cart can be jerked left or right by a controller that observes the cart position x, the pole angle θ, and their rates of change. The problem is to control the position x of the cart so that the pole stays roughly upright (θ ≈ π/2) while the cart stays within the limits of the track. The cart-pole problem differs from the problems described earlier in that the state variables x, θ, and their time derivatives are continuous. The actions are usually discrete: jerk left or jerk right, the so-called bang-bang control regime.

The BOXES algorithm was able to balance the pole for over an hour after only about 30 trials. Moreover, unlike many subsequent systems, BOXES was implemented with a real cart and pole, not a simulation. The algorithm first discretized the four-dimensional state space into boxes, hence the name. It then ran trials until the pole fell over or the cart hit the end of the track. Negative reinforcement was associated with the final action in the final box and then propagated back through the sequence.
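A small sketch of the state-discretization idea behind BOXES: the four continuous state variables are mapped to a single "box" index that a tabular learner can use. The bin boundaries and ranges here are arbitrary illustrative choices, not the original BOXES design.

# Sketch: discretizing the continuous cart-pole state (x, theta, x_dot, theta_dot)
# into a box index, in the spirit of the BOXES approach described above.
import bisect

BINS = {
    "x": [-1.0, 1.0],                 # cart position (illustrative bin edges)
    "theta": [-0.1, 0.0, 0.1],        # pole angle relative to upright (rad)
    "x_dot": [-0.5, 0.5],             # cart velocity
    "theta_dot": [-0.5, 0.5],         # pole angular velocity
}

def box_index(x, theta, x_dot, theta_dot):
    """Map a continuous state to a discrete box (tuple of bin indices)."""
    state = {"x": x, "theta": theta, "x_dot": x_dot, "theta_dot": theta_dot}
    return tuple(bisect.bisect(BINS[k], state[k]) for k in BINS)

print(box_index(0.2, 0.05, -0.1, 0.3))   # e.g. (1, 2, 1, 1)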
Q-Learning

* Q-learning is a machine learning approach that enables a model to learn iteratively and improve over time by taking the correct actions. Q-learning is a type of reinforcement learning.
* Q-learning takes an off-policy approach to reinforcement learning. A Q-learning approach aims to determine the optimal action based on the current state. It can accomplish this by either developing its own set of rules or deviating from the prescribed policy. Because Q-learning may deviate from the given policy, a defined policy is not needed.
* It learns the value function Q(s, a), which measures how good it is to take action a in a particular state s.
* The working of Q-learning can be summarized as a loop: initialize the Q-table, choose an action, perform the selected action, measure the reward, and update the Q-table, repeating until learning stops.
* Q-learning is a popular model-free reinforcement learning algorithm based on the Bellman equation.
* The main objective of Q-learning is to learn the policy that can inform the agent what actions should be taken to maximize the reward under given circumstances. It is an off-policy RL algorithm that attempts to find the best action to take in the current state.
* The goal of the agent in Q-learning is to maximize the value of Q. The value of Q-learning can be derived from the Bellman equation given below:

U(s) = R(s) + γ max_a Σ_{s'} P(s' | s, a) U(s')

where U(s) is the value calculated at a particular state, R(s) is the reward at state s, γ is the discount factor, and U(s') is the value of the successor state s'.

What is a Q-table?

* The Q-table includes columns and rows with lists of rewards for the best actions of each state in a specific environment. A Q-table helps an agent understand what actions are likely to lead to positive outcomes in different situations.
* The table rows represent different situations the agent might encounter, and the columns represent the actions it can take. As the agent interacts with the environment and receives feedback in the form of rewards or penalties, the values in the Q-table are updated to reflect what the model has learned.
* The goal of reinforcement learning is to gradually improve performance. The agent uses the Q-table to help choose actions, and with more feedback the Q-table becomes more accurate, so the agent can make better decisions and achieve optimal results.
* The Q-table is directly related to the concept of the Q-function. The Q-function is a mathematical equation that takes the state of the environment and the action under consideration as inputs, and outputs the expected future reward for that action in that specific state. The Q-table lets the agent look up the expected future reward for any state-action pair and move toward an optimized state.

The steps involved in the Q-learning algorithm process are the following (a code sketch of this loop appears after the advantages list below):

1. Q-table initialization. The first step is to create the Q-table as a place to track each action in each state and the associated progress.
2. Observation. The agent needs to observe the current state of the environment.
3. Action. The agent chooses an action to perform in the environment. Upon completion of the action, the model observes whether the action was beneficial in the environment.
4. Update. After the action has been taken, it is time to update the Q-table with the results.
5. Repeat. Repeat steps 2-4 until the model reaches a termination state for the desired objective.

What are the advantages of Q-learning?

The Q-learning approach to reinforcement learning can be advantageous for several reasons, including the following:

* Model-free. The model-free approach is the foundation of Q-learning and one of its biggest potential advantages for some uses. Rather than requiring prior knowledge about an environment, the Q-learning agent can learn about the environment as it trains. The model-free approach is particularly beneficial for scenarios where the underlying dynamics of an environment are difficult to model or completely unknown.
* Off-policy optimization. The model can optimize to get the best possible result without being strictly tethered to a policy that might not enable the same degree of optimization.
* Flexibility. The model-free, off-policy approach gives Q-learning the flexibility to work across a variety of problems and environments.
* Offline training. A Q-learning model can be deployed on pre-collected, offline data sets.
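A minimal sketch of the tabular loop described in the steps above, using the standard update Q(s, a) ← Q(s, a) + α [r + γ max_a' Q(s', a') − Q(s, a)]. The toy chain environment, the ε-greedy choice, and all constants are illustrative assumptions.

# Sketch: tabular Q-learning on a tiny made-up chain environment.
# States 0..4; action 0 = left, 1 = right; reaching state 4 gives reward +1.
import random

N_STATES, ACTIONS = 5, (0, 1)
alpha, gamma, epsilon = 0.1, 0.9, 0.1

# Step 1: Q-table initialization
Q = [[0.0 for _ in ACTIONS] for _ in range(N_STATES)]

def step(s, a):
    """Illustrative environment: move left/right along the chain."""
    s2 = max(0, s - 1) if a == 0 else min(N_STATES - 1, s + 1)
    reward = 1.0 if s2 == N_STATES - 1 else 0.0
    return s2, reward, s2 == N_STATES - 1      # next state, reward, done?

for episode in range(200):
    s = 0                                       # Step 2: observe the current state
    done = False
    while not done:
        # Step 3: choose and perform an action (epsilon-greedy over the Q-table)
        a = random.choice(ACTIONS) if random.random() < epsilon else Q[s].index(max(Q[s]))
        s2, r, done = step(s, a)
        # Step 4: update the Q-table toward r + gamma * max_a' Q(s', a')
        Q[s][a] += alpha * (r + gamma * max(Q[s2]) - Q[s][a])
        s = s2                                  # Step 5: repeat until termination

print([[round(q, 2) for q in row] for row in Q])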
What are the disadvantages of Q-learning?

The Q-learning approach to reinforcement learning also has some disadvantages, such as the following:

* Exploration vs. exploitation trade-off. It can be hard for a Q-learning model to find the right balance between trying new actions and sticking with what is already known. This dilemma is commonly referred to as the exploration vs. exploitation trade-off in reinforcement learning.
* Curse of dimensionality. Q-learning can potentially face a machine learning risk known as the curse of dimensionality: with high-dimensional data, the amount of data required to represent the distribution increases exponentially. This can lead to computational challenges and decreased accuracy.
* Overestimation. A Q-learning model can sometimes be too optimistic and overestimate how good a particular action or strategy is.
* Performance. A Q-learning model can take a long time to find the best method if there are several ways to approach a problem.

Hidden Markov Model

* A Hidden Markov Model (HMM) is a statistical model that is used to describe the probabilistic relationship between a sequence of observations and a sequence of hidden states.
* It is often used in situations where the underlying system or process that generates the observations is unknown or hidden; hence the name "Hidden Markov Model."
* It is used to predict future observations or classify sequences, based on the underlying hidden process that generates the data.
* An HMM consists of two types of variables: hidden states and observations. The hidden states are the underlying variables that generate the observed data, but they are not directly observable. The observations are the variables that are measured and observed.
* The relationship between the hidden states and the observations is modeled using a probability distribution. The HMM describes this relationship using two sets of probabilities: the transition probabilities and the emission probabilities.
* The transition probabilities describe the probability of transitioning from one hidden state to another.
* The emission probabilities describe the probability of observing an output given a hidden state.

Hidden Markov Model Algorithm

The Hidden Markov Model (HMM) algorithm can be implemented using the following steps (a small decoding sketch follows the list):

* Step 1: Define the state space and observation space. The state space is the set of all possible hidden states, and the observation space is the set of all possible observations.
* Step 2: Define the initial state distribution. This is the probability distribution over the initial state.
* Step 3: Define the state transition probabilities. These are the probabilities of transitioning from one state to another. They form the transition matrix, which describes the probability of moving from one state to another.
* Step 4: Define the observation likelihoods. These are the probabilities of generating each observation from each state. They form the emission matrix, which describes the probability of generating each observation from each state.
* Step 5: Train the model. The parameters of the state transition probabilities and the observation likelihoods are estimated using the Baum-Welch algorithm (the forward-backward algorithm). This is done by iteratively updating the parameters until convergence.
* Step 6: Decode the most likely sequence of hidden states. Given the observed data, the Viterbi algorithm is used to compute the most likely sequence of hidden states. This can be used to predict future observations, classify sequences, or detect patterns in sequential data.
* Step 7: Evaluate the model. The performance of the HMM can be evaluated using various metrics, such as accuracy, precision, recall, or F1 score.

To summarize, the HMM algorithm involves defining the state space, observation space, and parameters of the state transition probabilities and observation likelihoods, training the model using the Baum-Welch (forward-backward) algorithm, decoding the most likely sequence of hidden states using the Viterbi algorithm, and evaluating the performance of the model.
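A minimal sketch of the decoding step (Step 6) above: a small Viterbi implementation over a two-state HMM. The state names, observation symbols, and all probability values are made-up illustrations.

# Sketch: Viterbi decoding for a tiny two-state HMM (all numbers illustrative).
states = ["Rainy", "Sunny"]

initial = {"Rainy": 0.6, "Sunny": 0.4}                          # initial state distribution
transition = {"Rainy": {"Rainy": 0.7, "Sunny": 0.3},            # transition matrix
              "Sunny": {"Rainy": 0.4, "Sunny": 0.6}}
emission = {"Rainy": {"walk": 0.1, "shop": 0.4, "clean": 0.5},  # emission matrix
            "Sunny": {"walk": 0.6, "shop": 0.3, "clean": 0.1}}

def viterbi(obs):
    # best[t][s] = probability of the best hidden-state path ending in s at time t
    best = [{s: initial[s] * emission[s][obs[0]] for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        best.append({})
        back.append({})
        for s in states:
            prob, prev = max((best[t - 1][p] * transition[p][s] * emission[s][obs[t]], p)
                             for p in states)
            best[t][s], back[t][s] = prob, prev
    # backtrack from the most probable final state
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.insert(0, back[t][path[0]])
    return path

print(viterbi(["walk", "shop", "clean"]))   # most likely hidden state sequence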
Unsupervised Learning

* Unsupervised learning is the training of a machine using information that is neither classified nor labeled, allowing the algorithm to act on that information without guidance. Here the task of the machine is to group unsorted information according to similarities, patterns, and differences without any prior training on the data.
* Unsupervised learning is also called "learning without a teacher": no training is given to the machine, so it is restricted to finding the hidden structure in unlabeled data by itself.
* For example, suppose the machine is given an image containing both dogs and cats that it has never seen before. The machine has no idea about the features of dogs and cats, so it cannot categorize the image as "dogs and cats." But it can categorize the animals according to their similarities, patterns, and differences; that is, it can easily split the picture into two parts, the first containing all the images of dogs and the second containing all the images of cats. Here nothing was learned beforehand, which means there is no training data or examples.
* Unsupervised learning allows the model to work on its own to discover patterns and information that were previously undetected. It mainly deals with unlabeled data.

Unsupervised learning is classified into two categories of algorithms (a small clustering sketch follows this list):

* Clustering: a clustering problem is where you want to discover the inherent groupings in the data, such as grouping customers by purchasing behavior.
* Association: an association rule learning problem is where you want to discover rules that describe large portions of your data, such as "people that buy X also tend to buy Y."
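As a concrete illustration of the clustering category just described, here is a small sketch that groups unlabeled two-dimensional points with k-means. It assumes scikit-learn is available, and the data are made up.

# Sketch: clustering unlabeled points into two groups (illustrative data).
import numpy as np
from sklearn.cluster import KMeans

# Unlabeled data: two loose groups of 2-D points (e.g., customers described
# by two purchasing-behavior features).
rng = np.random.default_rng(0)
data = np.vstack([rng.normal([0, 0], 0.5, size=(50, 2)),
                  rng.normal([3, 3], 0.5, size=(50, 2))])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(data)
print("cluster sizes:", np.bincount(kmeans.labels_))
print("cluster centers:", kmeans.cluster_centers_.round(2))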
Advantages of unsupervised learning:

* It does not require the training data to be labeled.
* Dimensionality reduction can be easily accomplished using unsupervised learning.
* It is capable of finding previously unknown patterns in data.
* Flexibility: unsupervised learning can be applied to a wide variety of problems, including clustering, anomaly detection, and association rule mining.
* Exploration: unsupervised learning allows for the exploration of data and the discovery of novel and potentially useful patterns that may not be apparent from the outset.
* Low cost: unsupervised learning is often less expensive than supervised learning because it does not require labeled data, which can be time-consuming and costly to obtain.

Disadvantages of unsupervised learning:

* It is difficult to measure accuracy or effectiveness due to the lack of predefined answers during training.
* The results often have lower accuracy.
* The user needs to spend time interpreting and labeling the classes that result from the grouping.
* Lack of guidance: unsupervised learning lacks the guidance and feedback provided by labeled data, which can make it difficult to know whether the discovered patterns are relevant or useful.
* Sensitivity to data quality: unsupervised learning can be sensitive to data quality, including missing values, outliers, and noisy data.
* Scalability: unsupervised learning can be computationally expensive, particularly for large datasets or complex algorithms, which can limit its scalability.

Supervised vs. Unsupervised Machine Learning:

Parameters | Supervised machine learning | Unsupervised machine learning
Input data | Algorithms are trained using labeled data | Algorithms are used against data that is not labeled
Computational complexity | Simpler method | Computationally complex
Accuracy | Highly accurate | Less accurate
Number of classes | Number of classes is known | Number of classes is not known
Data analysis | Uses offline analysis | Uses real-time analysis of data
Algorithms used | Linear and logistic regression, random forest, support vector machine, neural network, etc. | K-means clustering, hierarchical clustering, Apriori algorithm, etc.
Output | Desired output is given | Desired output is not given
Training data | Uses training data to infer the model | No training data is used
Model complexity | It is not possible to learn larger and more complex models than with supervised learning | It is possible to learn larger and more complex models with unsupervised learning
Model testing | We can test our model | We cannot test our model
Also called | Supervised learning is also called classification | Unsupervised learning is also called clustering
Example | Optical character recognition | Finding faces in an image

Introduction to Association Rule Mining

* Association rule mining, often known as "market basket" analysis, is a very effective technique for finding the association of the sale of item X with item Y. In simple words, market basket analysis consists of examining the items in the baskets of shoppers checking out at a market to see what types of items "go together."
* It would be useful to know, when people make a trip to the store, what kinds of items they tend to buy during that same shopping trip. Association rule mining is used to identify groups of items which are frequently purchased together (customers' purchasing behavior). For example, "IF one buys bread and milk, THEN he/she also buys eggs with high probability." This information is useful to the store manager for better planning of stocking items in the store to improve its sales and efficiency.
* Suppose the store manager receives customer complaints about heavy rush in his store and its consequent slow working. He may then decide to place associated items such as bread and milk together, so that customers can buy the items more easily and quickly than if these were placed at a distance. It also improves the sale of each product.
* Nowadays, recommendations given by online stores like Amazon and Flipkart also make use of association mining to recommend products related to your purchase; the system offers a list of products that others often buy together with the product you have just purchased.
* Besides the examples from market basket analysis given above, association rules are used today in many application areas such as intrusion detection, web usage mining, bioinformatics, and continuous production. Programmers use association rules to build programs capable of machine learning. This is commonly known as market basket data analysis.
* Association rule mining can also be used in applications like web mining, medicine, adaptive marketing, customer segmentation, learning, finance, bioinformatics, etc.

Defining Association Rule Mining

* Association rule mining can be defined as the identification of frequent patterns, correlations, associations, or causal structures among sets of objects or items in transactional databases, relational databases, and other information repositories.
* Association rules are generally if/then statements that help in discovering relationships between seemingly unrelated data in a relational database or other information repository. For example, "If a customer buys a dozen eggs, he is 80% likely to also purchase milk."
* An association rule consists of two parts, i.e., an antecedent (if) and a consequent (then). An antecedent is an object or item found in the data, while a consequent is an object or item found in combination with the antecedent.
* Association rules are often written as X → Y, meaning that whenever X appears, Y also tends to appear. X and Y may be single items or sets of items. Here, X is referred to as the rule's antecedent and Y as its consequent.
* For example, a rule found in the sales data of a supermarket could specify that if a customer buys onions and potatoes together, he or she will also be likely to buy burgers. This rule is represented as onions, potatoes → burger.
* The concept of association rules was formulated by two Indian scientists, Dr Rakesh Agrawal and Dr R. Srikant.

Representations of Items for Association Mining

* Let n be the number of items the shop stocks. In Table 9.1 there are 6 items, thus n = 6 for this shop.
* The item list is represented by I and its items are represented by {i1, i2, ..., in}.
* The number of transactions is represented by N; here N = 5 for the shop given in Table 9.1.

Table 9.1 Sale database
TID | Items
1 | {Bread, Milk}
2 | {Bread, Diapers, Beer, Eggs}
3 | {Milk, Diapers, Beer, Cola}
4 | {Bread, Milk, Diapers, Beer}
5 | {Bread, Milk, Diapers, Cola}

* Each transaction is denoted by T = {t1, t2, ..., tN}, each with a unique identifier (TID). Each transaction consists of a subset of items (possibly a small subset) purchased by one customer.
* Let each transaction of m items be {i1, i2, ..., im}, where m <= n (the number of items in a transaction should be less than or equal to the total number of items in the shop). Typically, transactions differ in the number of items.
* As shown in Table 9.1, the first transaction is represented as T1 and it has two items, i.e. m = 2, with i1 = Bread and i2 = Milk. Similarly, transaction T4 has four items, i.e. m = 4, with i1 = Bread, i2 = Milk, i3 = Diapers and i4 = Beer.
The Metrics to Evaluate the Strength of Association Rules

The metrics used to judge the strength and accuracy of a rule are as follows:
* Support
* Confidence
* Lift

Support

* Let N be the total number of transactions.
* The support of X is the number of transactions in which X appears divided by N, while the support for X and Y together is the number of transactions in which they appear together divided by N:

Support(X) = (number of transactions containing X) / N = P(X)
Support(XY) = (number of transactions containing both X and Y) / N = P(X ∩ Y)

Thus, the support of X is the probability of X, while the support of XY is the probability of X ∩ Y.

Table 9.2 Sale database
TID | Items
1 | {Bread, Milk}
2 | {Bread, Diapers, Beer, Eggs}
3 | {Milk, Diapers, Beer, Cola}
4 | {Bread, Milk, Diapers, Beer}
5 | {Bread, Milk, Diapers, Cola}

For the database in Table 9.2:
Support(Bread) = number of times Bread appears / total number of transactions = 4/5 = P(Bread)
Support(Milk) = 4/5 = P(Milk)
Support(Diapers) = 4/5 = P(Diapers)
Support(Beer) = 3/5 = P(Beer)
Support(Eggs) = 1/5 = P(Eggs)
Support(Cola) = 2/5 = P(Cola)
Support(Bread, Milk) = number of times Bread and Milk appear together / total number of transactions = 3/5 = P(Bread ∩ Milk)

* A high level of support indicates that the rule is frequent enough for the business to take interest in it.
* Support is a very important metric because if a rule has low support, it may be that the rule occurs only by chance, and it will not be logical to promote items that customers seldom buy together.
* But if a rule has high support, then that association becomes very important and, if implemented properly, will result in increased revenue, efficiency and customer satisfaction.

Confidence

* Suppose that the support for XY is 80%; this means that XY is very frequent and there is an 80% chance that X and Y will appear together in a transaction. This would be of interest to the sales manager.
* Suppose we have another pair of items (A and B) and the support for A → B is 50%. It is not as frequent as XY, but if whenever A appears there is a 90% chance that B also appears, then of course it would be of great interest. Thus, not only does the probability that A and B appear together matter; the conditional probability of B given that A has already occurred also plays a significant role. This conditional probability, that B will follow when A has already occurred, is what the confidence of the rule measures. Thus, support and confidence are both important metrics for judging the quality of an association mining rule.
* The confidence of X → Y is defined as the ratio of the support for X and Y together to the support for X. Therefore, if X appears much more frequently than X and Y appearing together, the confidence will be low.

Confidence(X → Y) = Support(XY) / Support(X) = P(X ∩ Y) / P(X) = P(Y | X)

P(Y | X) is the probability of Y once X has taken place, also called the conditional probability of Y given X.
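A small sketch that computes these two metrics directly from the Table 9.1/9.2 transactions; the helper function names are our own.

# Sketch: computing support and confidence from the Table 9.1/9.2 transactions.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diapers", "Beer", "Eggs"},
    {"Milk", "Diapers", "Beer", "Cola"},
    {"Bread", "Milk", "Diapers", "Beer"},
    {"Bread", "Milk", "Diapers", "Cola"},
]
N = len(transactions)

def support(itemset):
    """Fraction of transactions containing every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / N

def confidence(antecedent, consequent):
    """Confidence(X -> Y) = Support(X and Y) / Support(X) = P(Y | X)."""
    return support(set(antecedent) | set(consequent)) / support(antecedent)

print(support({"Bread"}))               # 0.8  (4/5)
print(support({"Bread", "Milk"}))       # 0.6  (3/5)
print(confidence({"Bread"}, {"Milk"}))  # 0.75 (3/4)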
Lift

* Lift is the ratio of the conditional probability of Y when X is given to the unconditional probability of Y in the dataset. In simple words, it is the confidence of X → Y divided by the unconditional probability of Y:

Lift(X → Y) = P(Y | X) / P(Y) = Confidence(X → Y) / P(Y) = P(X ∩ Y) / (P(X) P(Y))

* Thus, lift can be computed by dividing the confidence of the rule by the unconditional probability of the consequent Y.

Table 9.6 Dataset
Antecedent | Consequent
A | 0
A | 0
A | 1
A | 0
B | 1
B | 0
B | 1

Lift for Rule 1, i.e., A → 0: P(0 | A) / P(0) = (P(A ∩ 0) / P(A)) / P(0) = (3/4) / (4/7) = 1.3125
Lift for Rule 2, i.e., B → 1: P(1 | B) / P(1) = (P(B ∩ 1) / P(B)) / P(1) = (2/3) / (3/7) = 1.55

As discussed earlier, the confidence of Rule 1 is 3/4 = 0.75 and the confidence of Rule 2 is 2/3 ≈ 0.66. It should be observed that although Rule 1 has higher confidence than Rule 2, it has lower lift. Naturally, it would appear that Rule 1 is more valuable because of its higher confidence: it appears more accurate (better supported). But the accuracy of a rule can be misleading if it is judged independently of the dataset. Lift is an important metric because it considers both the confidence of the rule and the overall dataset.
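The lift calculations for Table 9.6 can be checked with a few lines of code; the row representation below is an illustrative assumption.

# Sketch: verifying the lift values for the rules in Table 9.6.
rows = [("A", 0), ("A", 0), ("A", 1), ("A", 0), ("B", 1), ("B", 0), ("B", 1)]
N = len(rows)

def prob(pred):
    return sum(pred(a, c) for a, c in rows) / N

def lift(antecedent, consequent):
    # Lift(X -> Y) = P(Y | X) / P(Y) = P(X and Y) / (P(X) * P(Y))
    p_xy = prob(lambda a, c: a == antecedent and c == consequent)
    p_x = prob(lambda a, c: a == antecedent)
    p_y = prob(lambda a, c: c == consequent)
    return p_xy / (p_x * p_y)

print(round(lift("A", 0), 4))   # 1.3125
print(round(lift("B", 1), 4))   # 1.5556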
