CS 391L: Machine Learning
Introduction

Raymond J. Mooney
University of Texas at Austin


What is Learning?

• Herbert Simon: “Learning is any process by which a system improves performance from experience.”
• What is the task?
– Classification
– Problem solving / planning / control

Classification

• Assign object/event to one of a given finite set of categories.
– Medical diagnosis
– Credit card applications or transactions
– Fraud detection in e-commerce
– Worm detection in network packets
– Spam filtering in email
– Recommended articles in a newspaper
– Recommended books, movies, music, or jokes
– Financial investments
– DNA sequences
– Spoken words
– Handwritten letters
– Astronomical images

Problem Solving / Planning / Control

• Performing actions in an environment in order to achieve a goal.
– Solving calculus problems
– Playing checkers, chess, or backgammon
– Balancing a pole
– Driving a car or a jeep
– Flying a plane, helicopter, or rocket
– Controlling an elevator
– Controlling a character in a video game
– Controlling a mobile robot

Measuring Performance

• Classification accuracy
• Solution correctness
• Solution quality (length, efficiency)
• Speed of performance

Why Study Machine Learning?
Engineering Better Computing Systems

• Develop systems that are too difficult/expensive to construct manually because they require specific detailed skills or knowledge tuned to a specific task (the knowledge engineering bottleneck).
• Develop systems that can automatically adapt and customize themselves to individual users.
– Personalized news or mail filter
– Personalized tutoring
• Discover new knowledge from large databases (data mining).
– Market basket analysis (e.g. diapers and beer)
– Medical text mining (e.g. migraines to calcium channel blockers to magnesium)

Why Study Machine Learning?
Cognitive Science

• Computational studies of learning may help us understand learning in humans and other biological organisms.
– Hebbian neural learning
• “Neurons that fire together, wire together.”
– Humans’ relative difficulty of learning disjunctive concepts vs. conjunctive ones.
– Power law of practice
[Plot: log(perf. time) vs. log(# training trials)]

Why Study Machine Learning?
The Time is Ripe

• Many basic effective and efficient algorithms available.
• Large amounts of on-line data available.
• Large amounts of computational resources available.

Related Disciplines

• Artificial Intelligence
• Data Mining
• Probability and Statistics
• Information theory
• Numerical optimization
• Computational complexity theory
• Control theory (adaptive)
• Psychology (developmental, cognitive)
• Neurobiology
• Linguistics
• Philosophy

Defining the Learning Task

Improve on task, T, with respect to performance metric, P, based on experience, E.

T: Playing checkers
P: Percentage of games won against an arbitrary opponent
E: Playing practice games against itself

T: Recognizing hand-written words
P: Percentage of words correctly classified
E: Database of human-labeled images of handwritten words

T: Driving on four-lane highways using vision sensors
P: Average distance traveled before a human-judged error
E: A sequence of images and steering commands recorded while observing a human driver.

T: Categorize email messages as spam or legitimate.
P: Percentage of email messages correctly classified.
E: Database of emails, some with human-given labels

Designing a Learning System

• Choose the training experience
• Choose exactly what is to be learned, i.e. the target function.
• Choose how to represent the target function.
• Choose a learning algorithm to infer the target function from the experience.

[Diagram: Environment/Experience → Learner → Knowledge → Performance Element]

Sample Learning Problem

• Learn to play checkers from self-play.
• We will develop an approach analogous to that used in the first machine learning system, developed by Arthur Samuel at IBM in 1959.
Training Experience

• Direct experience: Given sample input and output pairs for a useful target function.
– Checker boards labeled with the correct move, e.g. extracted from a record of expert play
• Indirect experience: Given feedback which is not direct I/O pairs for a useful target function.
– Potentially arbitrary sequences of game moves and their final game results.
• Credit/Blame Assignment Problem: How to assign credit or blame to individual moves given only indirect feedback?

Source of Training Data

• Provided random examples outside of the learner’s control.
– Negative examples available, or only positive?
• Good training examples selected by a “benevolent teacher.”
– “Near miss” examples
• Learner can query an oracle about the class of an unlabeled example in the environment.
• Learner can construct an arbitrary example and query an oracle for its label.
• Learner can design and run experiments directly in the environment without any human guidance.

Training vs. Test Distribution

• Generally assume that the training and test examples are independently drawn from the same overall distribution of data.
– IID: Independently and identically distributed
• If examples are not independent, requires collective classification.
• If test distribution is different, requires transfer learning.

Choosing a Target Function

• What function is to be learned and how will it be used by the performance system?
• For checkers, assume we are given a function for generating the legal moves for a given board position and want to decide the best move.
– Could learn a function:
ChooseMove(board, legal-moves) → best-move
– Or could learn an evaluation function, V(board) → R, that gives each board position a score for how favorable it is. V can be used to pick a move by applying each legal move, scoring the resulting board position, and choosing the move that results in the highest scoring board position.
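
A minimal sketch (in Python, not from the original slides) of the second option: once an evaluation function V is learned, move selection is a one-line search over the legal moves. The helpers legal_moves(board) and apply_move(board, move) are hypothetical stand-ins for a real checkers engine.

def choose_move(board, V):
    # Hypothetical helpers: legal_moves() enumerates the legal moves and
    # apply_move() returns the successor board. Pick the move whose
    # resulting board position the learned evaluation function V scores highest.
    return max(legal_moves(board), key=lambda m: V(apply_move(board, m)))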

Ideal Definition of V(b)

• If b is a final winning board, then V(b) = 100
• If b is a final losing board, then V(b) = –100
• If b is a final draw board, then V(b) = 0
• Otherwise, V(b) = V(b′), where b′ is the highest scoring final board position that is achieved starting from b and playing optimally until the end of the game (assuming the opponent plays optimally as well).
– Can be computed using complete mini-max search of the finite game tree.

Approximating V(b)

• Computing V(b) is intractable since it involves searching the complete exponential game tree.
• Therefore, this definition is said to be non-operational.
• An operational definition can be computed in reasonable (polynomial) time.
• Need to learn an operational approximation to the ideal evaluation function.
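
For concreteness, a sketch of the ideal (non-operational) definition as complete minimax search; is_final, final_score (returning 100, –100, or 0), successors, and black_to_move are hypothetical helpers assumed here.

def ideal_V(b):
    # Non-operational: recurses over the complete game tree.
    if is_final(b):
        return final_score(b)  # 100 for a win, -100 for a loss, 0 for a draw
    values = [ideal_V(succ) for succ in successors(b)]
    # Each side is assumed to play optimally for itself.
    return max(values) if black_to_move(b) else min(values)

The recursion expands the entire (exponentially large) game tree, which is exactly why only an operational approximation of V can be used in practice.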

Representing the Target Function

• Target function can be represented in many ways: lookup table, symbolic rules, numerical function, neural network.
• There is a trade-off between the expressiveness of a representation and the ease of learning.
• The more expressive a representation, the better it will be at approximating an arbitrary function; however, the more examples will be needed to learn an accurate function.

Linear Function for Representing V(b)

• In checkers, use a linear approximation of the evaluation function:

V̂(b) = w0 + w1·bp(b) + w2·rp(b) + w3·bk(b) + w4·rk(b) + w5·bt(b) + w6·rt(b)

– bp(b): number of black pieces on board b
– rp(b): number of red pieces on board b
– bk(b): number of black kings on board b
– rk(b): number of red kings on board b
– bt(b): number of black pieces threatened (i.e. which can be immediately taken by red on its next turn)
– rt(b): number of red pieces threatened
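
As a sketch, this linear form is just a dot product of the weight vector with a feature vector; features(b) is a hypothetical helper returning [1, bp(b), rp(b), bk(b), rk(b), bt(b), rt(b)] for board b.

def V_hat(b, w):
    # V̂(b) = w0*1 + w1*bp(b) + ... + w6*rt(b)
    return sum(wi * fi for wi, fi in zip(w, features(b)))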

Obtaining Training Values

• Direct supervision may be available for the target function.
– ⟨⟨bp=3, rp=0, bk=1, rk=0, bt=0, rt=0⟩, 100⟩ (win for black)
• With indirect feedback, training values can be estimated using temporal difference learning (used in reinforcement learning, where supervision is delayed reward).

Temporal Difference Learning

• Estimate training values for intermediate (non-terminal) board positions by the estimated value of their successor in an actual game trace:

Vtrain(b) = V̂(successor(b))

where successor(b) is the next board position where it is the program’s move in actual play.
• Values towards the end of the game are initially more accurate, and continued training slowly “backs up” accurate values to earlier board positions.
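
A minimal sketch of this estimation over one game, assuming trace is the list of board positions at which it was the program’s move and outcome is the final score (100, –100, or 0); V_hat and w are the linear evaluation function and weights sketched above.

def td_training_values(trace, outcome, V_hat, w):
    # Vtrain(b) = V̂(successor(b)) for each non-terminal board in the trace.
    examples = [(b, b_next_value) for b, b_next_value in
                ((b, V_hat(b_next, w)) for b, b_next in zip(trace, trace[1:]))]
    examples.append((trace[-1], outcome))  # the final board gets its true score
    return examples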

Learning Algorithm

• Uses training values for the target function to induce a hypothesized definition that fits these examples and hopefully generalizes to unseen examples.
• In statistics, learning to approximate a continuous function is called regression.
• Attempts to minimize some measure of error (loss function) such as the mean squared error:

E = Σb∈B [Vtrain(b) − V̂(b)]² / |B|

Least Mean Squares (LMS) Algorithm

• A gradient descent algorithm that incrementally updates the weights of a linear function in an attempt to minimize the mean squared error.

Until weights converge:
   For each training example b do:
      1) Compute the error: error(b) = Vtrain(b) − V̂(b)
      2) For each board feature fi, update its weight wi:
         wi ← wi + c · fi · error(b), for some small constant (learning rate) c
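
A sketch of one LMS pass, reusing the hypothetical features(b) helper from above; examples is a list of (board, Vtrain) pairs, such as those produced by td_training_values, and c is the learning rate.

def lms_epoch(examples, w, c=0.1):
    for b, v_train in examples:
        f = features(b)
        error = v_train - sum(wi * fi for wi, fi in zip(w, f))  # Vtrain(b) - V̂(b)
        w = [wi + c * fi * error for wi, fi in zip(w, f)]       # wi ← wi + c·fi·error(b)
    return w

Repeating lms_epoch until the weights stop changing gives the “until weights converge” loop above.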

LMS Discussion

• Intuitively, LMS executes the following rules:
– If the output for an example is correct, make no change.
– If the output is too high, lower the weights in proportion to the values of their corresponding features, so the overall output decreases.
– If the output is too low, increase the weights in proportion to the values of their corresponding features, so the overall output increases.
• Under the proper weak assumptions, LMS can be proven to eventually converge to a set of weights that minimizes the mean squared error.

Lessons Learned about Learning

• Learning can be viewed as using direct or indirect experience to approximate a chosen target function.
• Function approximation can be viewed as a search through a space of hypotheses (representations of functions) for one that best fits a set of training data.
• Different learning methods assume different hypothesis spaces (representation languages) and/or employ different search techniques.

Various Function Representations

• Numerical functions
– Linear regression
– Neural networks
– Support vector machines
• Symbolic functions
– Decision trees
– Rules in propositional logic
– Rules in first-order predicate logic
• Instance-based functions
– Nearest-neighbor
– Case-based
• Probabilistic Graphical Models
– Naïve Bayes
– Bayesian networks
– Hidden Markov Models (HMMs)
– Probabilistic Context-Free Grammars (PCFGs)
– Markov networks

Various Search Algorithms

• Gradient descent
– Perceptron
– Backpropagation
• Dynamic Programming
– HMM Learning
– PCFG Learning
• Divide and Conquer
– Decision tree induction
– Rule learning
• Evolutionary Computation
– Genetic Algorithms (GAs)
– Genetic Programming (GP)
– Neuro-evolution

Evaluation of Learning Systems

• Experimental
– Conduct controlled cross-validation experiments to compare various methods on a variety of benchmark datasets.
– Gather data on their performance, e.g. test accuracy, training time, testing time.
– Analyze differences for statistical significance.
• Theoretical
– Analyze algorithms mathematically and prove theorems about their:
• Computational complexity
• Ability to fit training data
• Sample complexity (number of training examples needed to learn an accurate function)
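
To make this concrete, a minimal sketch of such an experiment with scikit-learn and SciPy; the dataset and the two learners here are illustrative choices only, not ones prescribed by the slides.

from scipy.stats import ttest_rel
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
# 10-fold cross-validated test accuracy for each method on the same folds.
acc_tree = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=10)
acc_nb = cross_val_score(GaussianNB(), X, y, cv=10)
# Paired t-test over the per-fold accuracies checks statistical significance.
t_stat, p_value = ttest_rel(acc_tree, acc_nb)
print(f"tree={acc_tree.mean():.3f}  nb={acc_nb.mean():.3f}  p={p_value:.3f}")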

History of Machine Learning

• 1950s:
– Samuel’s checker player
– Selfridge’s Pandemonium
• 1960s:
– Neural networks: Perceptron
– Pattern recognition
– Learning in the limit theory
– Minsky and Papert prove limitations of Perceptron
• 1970s:
– Symbolic concept induction
– Winston’s arch learner
– Expert systems and the knowledge acquisition bottleneck
– Quinlan’s ID3
– Michalski’s AQ and soybean diagnosis
– Scientific discovery with BACON
– Mathematical discovery with AM

History of Machine Learning (cont.)

• 1980s:
– Advanced decision tree and rule learning
– Explanation-based Learning (EBL)
– Learning and planning and problem solving
– Utility problem
– Analogy
– Cognitive architectures
– Resurgence of neural networks (connectionism, backpropagation)
– Valiant’s PAC Learning Theory
– Focus on experimental methodology
• 1990s:
– Data mining
– Adaptive software agents and web applications
– Text learning
– Reinforcement learning (RL)
– Inductive Logic Programming (ILP)
– Ensembles: Bagging, Boosting, and Stacking
– Bayes Net learning

History of Machine Learning (cont.)

• 2000s:
– Support vector machines
– Kernel methods
– Graphical models
– Statistical relational learning
– Transfer learning
– Sequence labeling
– Collective classification and structured outputs
– Computer Systems Applications
• Compilers
• Debugging
• Graphics
• Security (intrusion, virus, and worm detection)
– Email management
– Personalized assistants that learn
– Learning in robotics and vision
