Making Money Using Math
Erik Meijer
"If Google were created from scratch today, much of it would be learned, not
coded." —Jeff Dean, Google Senior Fellow, Systems and Infrastructure Group
Machine learning, or ML, is all the rage today, and there are good reasons for
that. Models created by machine-learning algorithms for problems such as
spam filtering, speech and image recognition, language translation, and text
understanding have many advantages over code written by human developers.
Machine learning, however, is not as magical as it sounds at first.
Making Money Using Math - ACM Queue https://queue.acm.org/detail.cfm?id=3055303
A big difference between human-written code and learned models is that the
latter are usually not represented by text and hence are not understandable by
human developers or manipulable by existing tools. The consequence is that
none of the traditional software engineering techniques for conventional
programs (such as code reviews, source control, and debugging) are applicable
anymore. Since incomprehensibility is not unique to learned code, these
aspects are not of concern here.
As it turns out, Bayes's rule is exactly what the doctor ordered when it comes
to bridging the gap between ML and contemporary programming languages.
Probability Distributions
First let's explore what probability distributions ℙ(A) are. The Wikipedia
definition, "a probability distribution is a mathematical description of a random
phenomenon in terms of the probabilities of events," is rather confusing from a
developer perspective. If you click around for a bit, however, it turns out that a
discrete distribution is just a generic list of pairs of values and probabilities
ℙ(A)=[A↦ℝ] such that the probabilities add up to 1. This is the Bayesian
representation for distributions. Isomorphically, you can use the frequentist
representation of distributions as infinite lists of type dist∈[A] such that, as n gets larger, sampling n values from the collection and counting the frequencies of each element converges to the probabilities of the Bayesian representation.
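The two representations can be sketched in Python; the skinny/obese weight distribution below is an illustrative assumption in the spirit of the CDC example used later:

```python
import random
from collections import Counter

# Bayesian representation: a list of value/probability pairs summing to 1
# (an assumed CDC-style weight distribution, skinny 0.6 / obese 0.4).
cdc = [("skinny", 0.6), ("obese", 0.4)]

def frequentist(dist):
    """Frequentist representation: an infinite stream of samples from dist."""
    values = [v for v, _ in dist]
    weights = [p for _, p in dist]
    while True:
        yield random.choices(values, weights)[0]

# Counting frequencies over a long prefix of the stream recovers the
# Bayesian representation approximately.
stream = frequentist(cdc)
n = 10_000
counts = Counter(next(stream) for _ in range(n))
estimate = {v: c / n for v, c in counts.items()}
```

The longer the prefix we count over, the closer the estimated probabilities get to the Bayesian pairs we started from.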
Because the values we care about are usually not even comparable, we will
also avoid cumulative distributions. One reason that mathematicians like
standard continuous distributions—such as Gaussian, beta, binomial, and uniform—is because of their nice algebraic properties, called conjugate priors.2 For example, a uniform prior combined with a binomial likelihood results in a beta posterior.
CDC∈ℙ(Weight)
Perhaps the most obvious method stacks the columns for skinny and obese on top of each other, picks a random number p between 0 and 1, and then checks: p ≤ 0.4 yields obese, and otherwise yields skinny. In general, this search is linear in the number of values in the distribution, but using tricks like binary search trees can speed things up. Mathematicians call this the inverse transform method.
The last method pads the lower probabilities by borrowing from the higher
probabilities. Amazingly, it is always possible to do this in a way such that
every column represents the probabilities for, at most, two values, so we
need only one comparison to pick the right value. This comparison can be
implemented using a second index table, and hence mathematicians call
this sampling algorithm the alias method.
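A Python sketch of the alias method (the standard Vose formulation, not code from the article); building the tables is linear, and each sample then costs one uniform pick plus one comparison:

```python
import random
from collections import Counter

def build_alias(dist):
    """Preprocess a Bayesian distribution [(value, prob)] into alias tables.

    Each column i holds at most two values: values[i] with probability
    prob[i], and values[alias[i]] with the remaining mass."""
    n = len(dist)
    values = [v for v, _ in dist]
    scaled = [p * n for _, p in dist]
    prob, alias = [1.0] * n, list(range(n))
    small = [i for i in range(n) if scaled[i] < 1.0]
    large = [i for i in range(n) if scaled[i] >= 1.0]
    while small and large:
        s, l = small.pop(), large.pop()
        prob[s], alias[s] = scaled[s], l   # column s: value s, padded by l
        scaled[l] -= 1.0 - scaled[s]       # l lent (1 - scaled[s]) of its mass
        (small if scaled[l] < 1.0 else large).append(l)
    return values, prob, alias

def sample_alias(values, prob, alias):
    """One uniform column pick plus one comparison."""
    i = random.randrange(len(values))
    return values[i] if random.random() < prob[i] else values[alias[i]]

values, prob, alias = build_alias([("a", 0.5), ("b", 0.3), ("c", 0.2)])
n = 20_000
freq = Counter(sample_alias(values, prob, alias) for _ in range(n))
```

Sampling 20,000 times reproduces the original probabilities up to sampling noise.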
Doctor∈ℙ(Food|Weight) = Weight➝ℙ(Food)
Now that you know that conditional probabilities are probabilistic functions, things are starting to get interesting, since this means that multiplication (*) combines a likelihood and a prior into a joint distribution:

ℙ(B|A)*ℙ(A)∈ℙ(A&B)

likelihood*prior =
  from a↦p in prior
  from b↦q in likelihood(a)
  select (a,b)↦p*q
Applying this definition to compute the result of Doctor*CDC, we obtain the table
shown in figure 3 for the joint probability distribution ℙ(Food&Weight).
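With dictionaries standing in for finite Bayesian distributions, (*) is a short Python sketch; the CDC and Doctor probabilities below are assumptions for illustration:

```python
# Assumed illustrative tables: P(Weight) and P(Food|Weight).
cdc = {"skinny": 0.6, "obese": 0.4}
doctor = {"skinny": {"burger": 0.3, "celery": 0.7},
          "obese":  {"burger": 0.9, "celery": 0.1}}

def mul(likelihood, prior):
    """(*) : P(B|A) * P(A) -> P(A&B), pairing values, multiplying probabilities."""
    return {(a, b): p * q
            for a, p in prior.items()
            for b, q in likelihood[a].items()}

joint = mul(doctor, cdc)  # P(Food & Weight), keyed as (weight, food)
```

The resulting dictionary is exactly the joint-probability table: each cell holds prior probability times likelihood probability, and all cells still sum to 1.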
Because the distributions for ℙ(Weight) and ℙ(Food) appear in the margins of this
table, mathematicians call them marginal probabilities, and similarly the
process of summing up the columns/rows is called marginalization. When
computing a joint distribution using (*), mathematicians often use the name
likelihood for the function and prior for the argument.
Then sampling from the infinite collection Doctor*CDC, which results from applying the prior to the likelihood, will have a ratio (obese,burger):(obese,celery):(skinny,burger):(skinny,celery) = 36:4:18:42.
The keen reader will note that (*) is a slight variation of the well-known
monadic bind operator, which, depending on your favorite programming
language, is known under the names (>>=), SelectMany, or flatMap. Indeed,
probability distributions form a monad. Mathematicians call it the Giry monad,
but Reverend Bayes beat everyone to it by nearly two centuries.
Note that as formulated, Bayes's rule has a type error that went unnoticed for
centuries. The left-hand side returns a distribution of pairs ℙ(A&B), while the
right-hand side returns a distribution of pairs ℙ(B&A). Not a big deal for
mathematicians since & is commutative. For brevity we'll be sloppy about this
as well. Since we often want to convert from ℙ(A&B) to ℙ(A) or ℙ(B) by dropping
one side of the pair, we prefer the C# variant of SelectMany that takes a
combiner function A⊕B∈C to post-process the pair of samples from the prior and
likelihood:
likelihood*prior =
  from a↦p in prior
  from b↦q in likelihood(a)
  select a⊕b↦p*q
Now that we know that (*) is monadic bind, we can start using syntactic sugar
such as LINQ queries or for/monad comprehensions. All that is really saying is
that it is safe to drop the explicit tracking of probabilities from any query
written over distributions (i.e., the code on the left in figure 4 is simply sugar
for the code on the right, which itself can be alternatively implemented with
the frequentist approach using sampling).
Another way of saying this is that we can use query comprehensions as a DSL
(domain-specific language) for specifying probabilistic functions. This opens
the road to explore other standard query operators besides application that
can work over distributions and that can be added to our repertoire. The first
one that comes to mind is filtering, or conditioning as the mathematicians prefer to call it.
Given a predicate (A➝𝔹), we can drop all values in a distribution for which the
predicate does not hold using the division operator (÷):
ℙ(A)÷(A➝𝔹)∈ℙ(A)
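A minimal Python sketch of (÷) over a finite Bayesian distribution: drop the values where the predicate fails, then renormalize so the surviving probabilities again sum to 1:

```python
def div(dist, pred):
    """P(A) ÷ (A -> bool) -> P(A): condition dist on the predicate."""
    kept = {a: p for a, p in dist.items() if pred(a)}
    total = sum(kept.values())  # probability mass of the surviving evidence
    return {a: p / total for a, p in kept.items()}

# e.g. conditioning a three-valued distribution on "not c"
d = div({"a": 0.5, "b": 0.3, "c": 0.2}, lambda x: x != "c")
```

The surviving mass 0.8 becomes the new denominator, so "a" goes from 0.5 to 0.625 and "b" from 0.3 to 0.375.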
joint ÷ evidence =
  λb. from (a,b) in joint
      from b' in evidence
      where b = b'
      select (a,b)
We can show that (f*d)÷d = f. Applying the latter version to Bayes's rule results
in the following equivalence:
posterior∈ℙ(C|B)=B➝ℙ(C)
posterior(b) =
  from a in prior
  from b' in likelihood(a)
  where b = b'
  select a⊕b
Whichever way you spin it, this is incredibly cool! Bayes's rule shows how to
invert a probabilistic function of type ℙ(B|A) = A➝ℙ(B) into a probabilistic function
of type ℙ(A|B) = B➝ℙ(A) using conditioning.
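The inversion can be sketched end to end in Python; the CDC and Doctor numbers are the same illustrative assumptions as before, and the combiner ⊕ simply keeps the weight:

```python
cdc = {"skinny": 0.6, "obese": 0.4}                  # assumed prior P(Weight)
doctor = {"skinny": {"burger": 0.3, "celery": 0.7},  # assumed likelihood P(Food|Weight)
          "obese":  {"burger": 0.9, "celery": 0.1}}

def invert(likelihood, prior):
    """Bayes's rule: turn P(B|A) = A -> P(B) into P(A|B) = B -> P(A)."""
    def posterior(b):
        unnorm = {a: p * likelihood[a][b] for a, p in prior.items()}
        total = sum(unnorm.values())  # the evidence P(B = b)
        return {a: q / total for a, q in unnorm.items()}
    return posterior

predict_weight_from_food = invert(doctor, cdc)
after_burger = predict_weight_from_food("burger")
```

Seeing a burger shifts the weight distribution toward obese: 0.36 out of the 0.54 total burger mass, i.e. two-thirds.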
PredictWeightFromFood∈Food➝ℙ(Weight)
In practice, most monads have an unsafe run function of type ℙ(A)➝M(A) that
teleports you out of the monad into some concrete container M. Mathematicians
call this the forgetful functor. For distributions dist∈ℙ(A), a common way to exit
the monad is by picking the value a∈A with the highest probability in dist.
Mathematicians use the higher-order function arg max for this, and call it MLE
(maximum likelihood estimator) or MAP (maximum a posteriori). In practice it
is often more convenient to return the pair a↦p from dist with the highest
probability.
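Exiting the monad via arg max is a one-liner in Python, returning the pair a↦p rather than just the value, as the text suggests:

```python
def map_estimate(dist):
    """MAP/MLE: the value (with its probability) of highest mass in dist."""
    return max(dist.items(), key=lambda pair: pair[1])

best = map_estimate({"skinny": 0.6, "obese": 0.4})
```

This is the forgetful step: the rest of the distribution, and with it all uncertainty information, is thrown away.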
A simple way to find the value with the maximal likelihood from a frequentist
representation of a distribution is to blow up the source distribution ℙ(A) into a
distribution of distributions ℙ(ℙ(A)), where the outer distribution is an infinite
frequentist list of inner Bayesian distributions [A↦ℝ], computed by grouping and
summing, that over time will converge to the true underlying distribution. Then
you can select the nth inner distribution and take its most probable value.
WeightFromFood∈Food➝[Weight↦ℝ]
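A Python sketch of this blow-up: each element of the outer stream is an inner Bayesian distribution [A↦ℝ], obtained by grouping and summing the samples seen so far:

```python
from collections import Counter

def running_estimates(stream):
    """Turn a frequentist sample stream into a stream of Bayesian
    distributions that converges to the true underlying distribution."""
    counts = Counter()
    for n, a in enumerate(stream, start=1):
        counts[a] += 1
        yield {v: c / n for v, c in counts.items()}

# e.g. four observed samples give four successively refined estimates
estimates = list(running_estimates(iter(["a", "a", "b", "a"])))
```

Selecting the nth inner distribution and taking its arg max then gives the maximum-likelihood value at that point in the stream.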
In WebPPL the model and the predict function look roughly like this (CDC and Doctor as in the earlier figures; the probabilities shown are the illustrative ones):

var Doctor = function(weight) {
  if ("obese" == weight) return Categorical({vs: ["burger", "celery"], ps: [0.9, 0.1]});
  if ("skinny" == weight) return Categorical({vs: ["burger", "celery"], ps: [0.3, 0.7]});
};

var predict = function(food_) {
  var weight = sample(CDC);
  var food = sample(Doctor(weight));
  condition(food == food_);
  return weight;
};

The statement var a = sample(prior) corresponds to the query clause from a in prior and randomly picks a value a∈A from a distribution prior∈ℙ(A). The condition(p) statement discards all program runs for which the predicate p does not hold, just like the division operator (÷).
To "run" this program, we pass the predict function into the WebPPL inference engine, for example:

Infer({method: 'rejection', samples: 1000}, function() { return predict("burger"); })
This samples from the distribution described by the program using the Infer
function with the specified sampling method (which includes enumerate, rejection,
and MCMC) that reifies the resulting distribution into a Bayesian representation.
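Rejection sampling, one of the listed methods, can be mimicked outside WebPPL; a Python sketch with the same assumed CDC and Doctor tables as before:

```python
import random
from collections import Counter

cdc = [("skinny", 0.6), ("obese", 0.4)]                  # assumed P(Weight)
doctor = {"skinny": [("burger", 0.3), ("celery", 0.7)],  # assumed P(Food|Weight)
          "obese":  [("burger", 0.9), ("celery", 0.1)]}

def sample(dist):
    values = [v for v, _ in dist]
    weights = [p for _, p in dist]
    return random.choices(values, weights)[0]

def predict(food_):
    """Rejection sampling: rerun the program until the condition holds."""
    while True:
        weight = sample(cdc)
        food = sample(doctor[weight])
        if food == food_:
            return weight

n = 20_000
counts = Counter(predict("burger") for _ in range(n))
posterior = {w: c / n for w, c in counts.items()}  # reified P(Weight|burger)
```

Counting the surviving runs reifies the posterior: conditioned on burger, roughly two-thirds of the mass ends up on obese, matching the exact Bayesian computation.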
For games such as AlphaGo,10 the agent code is often a neural network, but if
we abstract the pattern to apply to applications as a whole, it is likely a
combination of ML learned models and regular imperative code. This hybrid
situation is true even today, where things such as ad placement and search-
result ranking are probabilistic but opaquely embedded into imperative code.
Probabilistic programming and machine learning will allow developers to create
applications that are highly specialized for each user.
type information. For example, if the user types ppl, the JetBrains Rider IDE
shows all the properties of the string type as potential completions, as shown in
figure 6.
In this case, we create the model shown in figure 7, representing the set of all users as a probabilistic function:
user∈ℙ(Click|Title)
Note we do not want to make any a priori assumptions about the underlying
distributions other than the frequentist stream of clicks received, given the
frequentist stream of titles served to the users.
The agent in this case wants to find out over time which possible title for a
story will generate the most clicks from the users, and hence we will model the
agent by the higher-order function that takes the users and from that creates a
distribution of titles:
agent∈(Title➝ℙ(Click))➝ℙ(Title)
that returns a distribution of pairs of titles and clicks using run as explained
earlier (this corresponds to the beta distribution part of the algorithm. We do
not track the "uncertainty" about ℙ(Click), but we can easily compute that
together with the click probability if that is useful). A small tweak is needed in
that we are interested only in clicks that are true, and not in those that are false.
This allows us to observe how the probability that the user will click on each
title evolves over time as we see more clicks from the users. Whenever we
need to produce a new title, we use the Title for which the most recent
Title&Click↦ℝ has the highest probability (this is the Thompson sampling part of
the algorithm). In other words, the Bayesian bandit is essentially a merge sort
over the reified underlying probability distributions of the clicks from the users.
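A Python sketch of the Bayesian bandit under stated assumptions: each title's click rate gets a Beta posterior starting from a uniform Beta(1,1) prior, and Thompson sampling draws one click-rate sample per posterior and shows the arg max title. The user click rates below are made up for the demo:

```python
import random

def bayesian_bandit(true_click_rates, rounds=5000, seed=1):
    """Thompson sampling with Beta posteriors over per-title click rates."""
    rng = random.Random(seed)
    stats = {t: [1, 1] for t in true_click_rates}  # Beta(successes+1, failures+1)
    shown = {t: 0 for t in true_click_rates}
    for _ in range(rounds):
        # draw one click-rate sample per title, show the arg max title
        draws = {t: rng.betavariate(a, b) for t, (a, b) in stats.items()}
        title = max(draws, key=draws.get)
        shown[title] += 1
        clicked = rng.random() < true_click_rates[title]  # the simulated users
        stats[title][0 if clicked else 1] += 1
    return shown

shown = bayesian_bandit({"title A": 0.1, "title B": 0.4})
```

Early on, wide posteriors make both titles get shown; as evidence accumulates, the posterior for the better title tightens and it wins almost every round.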
Conclusion
References
1. Agrawal, S., Goyal, N. 2012. Analysis of Thompson sampling for the multi-
armed bandit problem. Journal of Machine Learning Research: Workshop and
Conference Proceedings 23; http://jmlr.org/proceedings/papers/v23/agrawal12/agrawal12.pdf.
5. Parr, T., Vinju, J. 2016. Towards a universal code formatter through machine
learning. Proceedings of the ACM SIGPLAN International Conference on
Software Language Engineering; http://dl.acm.org/citation.cfm?id=2997383.
6. Paulos, J. A. 2011. The mathematics of changing your mind. New York Times
Sunday Book Review; http://www.nytimes.com/2011/08/07/books/review/the-
theory-that-would-not-die-by-sharon-bertsch-mcgrayne-book-review.html.
7. Raychev, V., Bielik, P., Vechev, M. 2016. Probabilistic model for code with
decision trees. Proceedings of the ACM SIGPLAN International Conference on
Object-oriented Programming, Systems, Languages, and Applications;
http://dl.acm.org/citation.cfm?doid=2983990.2984041.
10. Silver, D., et al. 2016. Mastering the game of Go with deep neural networks
and tree search; https://gogameguru.com/i/2016/03/deepmind-mastering-
go.pdf.
Related Articles
Information Extraction
Distilling structured data from unstructured text
- Andrew McCallum
http://queue.acm.org/detail.cfm?id=1105679
Erik Meijer has been working on "democratizing the cloud" for the past 15
years. He is perhaps best known for his work on, amongst others, the Haskell,
C#, Visual Basic, and Dart programming languages, as well as for his
contributions to LINQ and the Reactive Framework (Rx).