
Building Industrial-scale Real-world Recommender Systems

September 11, 2012

Xavier Amatriain
Personalization Science and Engineering - Netflix
@xamat
Outline
1. Anatomy of Netflix Personalization
2. Data & Models
3. Consumer (Data) Science
4. Architectures
Anatomy of Netflix Personalization

Everything is a Recommendation
Everything is personalized
Ranking
Rows

Note: Recommendations are per household, not individual user
Top 10
Personalization awareness: the Top 10 row adapts to who is watching (e.g. Dad, Mom, Daughter, Son, or the whole household)
Diversity
Support for Recommendations
Social Support

Watch again & Continue Watching

Genres
Genre rows
Personalized genre rows focus on user interest
Also provide context and evidence
Important for member satisfaction: moving personalized rows to the top on devices increased retention
How are they generated?
Implicit: based on the user's recent plays, ratings, & other interactions
Explicit taste preferences
Hybrid: combine the above
Also take into account:
Freshness - has this been shown before?
Diversity - avoid repeating tags and genres, limit number of TV genres, etc.
Genres - personalization

Genres - explanations

Genres - user involvement
Similars
Displayed in many different contexts
In response to user actions/context (search, queue add)
"More like..." rows
Anatomy of Netflix Personalization - Recap
Everything is a recommendation: not only rating prediction, but also ranking, row selection, similarity
We strive to make it easy for the user, but
We want the user to be aware of and involved in the recommendation process
Deal with implicit/explicit and hybrid feedback
Add support/explanations for recommendations
Consider issues such as diversity or freshness
Data & Models

Big Data
Plays
Behavior
Geo-Information
Time
Ratings
Searches
Impressions
Device info
Metadata
Social
Big Data @Netflix
25M+ subscribers
Ratings: 4M/day
Searches: 3M/day
Plays: 30M/day
2B hours streamed in Q4 2011
1B hours in June 2012
Models
Logistic/linear regression
Elastic nets
Matrix Factorization
Markov Chains
Clustering
LDA
Association Rules
Gradient Boosted Decision Trees

Rating Prediction

2007 Progress Prize
KorBell team (AT&T) improved by 8.43%
Spent ~2,000 hours
Combined 107 prediction algorithms with a linear equation
Gave us the source code

2007 Progress Prize
Top 2 algorithms
SVD - Prize RMSE: 0.8914
RBM - Prize RMSE: 0.8990
Linear blend - Prize RMSE: 0.88
Limitations
Designed for 100M ratings, we have 5B ratings
Not adaptable as users add ratings
Performance issues
Currently in use as part of Netflix's rating prediction component
SVD
X[m x n] = U[m x r] S[r x r] (V[n x r])^T

X: m x n matrix (e.g., m users, n videos)
U: m x r matrix (m users, r concepts)
S: r x r diagonal matrix (strength of each concept; r: rank of the matrix)
V: n x r matrix (n videos, r concepts)


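To make the decomposition concrete, here is a minimal sketch (not from the deck) of a rank-r truncated SVD of a tiny ratings matrix using NumPy; the matrix X, the value of r, and the variable names simply mirror the notation above and are purely illustrative.

```python
# Minimal truncated-SVD sketch (illustrative only, not Netflix code).
# X is an m x n ratings matrix (m users, n videos); r is the number of "concepts" kept.
import numpy as np

X = np.array([[5, 4, 0, 1],
              [4, 5, 1, 0],
              [1, 0, 5, 4],
              [0, 1, 4, 5]], dtype=float)

r = 2  # number of latent concepts to keep
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Truncate to rank r: U[m x r], S[r x r], V[n x r]
U_r, S_r, V_r = U[:, :r], np.diag(s[:r]), Vt[:r, :].T

X_approx = U_r @ S_r @ V_r.T  # rank-r reconstruction of X
print(np.round(X_approx, 2))
```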
Simon Funk's SVD
One of the most interesting findings during the Netflix Prize came out of a blog post
Incremental, iterative, and approximate way to compute the SVD using gradient descent
http://sifter.org/~simon/journal/20061211.html
SVD for Rating Prediction
Associate each user with a user-factors vector $p_u \in \mathbb{R}^f$
Associate each item with an item-factors vector $q_v \in \mathbb{R}^f$
Define a baseline estimate $b_{uv} = \mu + b_u + b_v$ to account for user and item deviation from the average
Predict rating using the rule
$\hat{r}_{uv} = b_{uv} + p_u^T q_v$
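The following is a minimal sketch (not from the deck) of the prediction rule above trained with stochastic gradient descent, in the spirit of Funk's incremental approach; the toy ratings, factor dimension f, learning rate, and regularization values are all made-up assumptions.

```python
# Sketch of SGD-trained matrix factorization with baselines (illustrative only).
# Implements r_hat(u, v) = mu + b_u + b_v + p_u . q_v on a toy dataset.
import numpy as np

ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (1, 2, 1.0), (2, 1, 2.0)]  # (user, item, rating)
n_users, n_items, f = 3, 3, 10          # f: number of latent factors (illustrative)
lr, reg, epochs = 0.01, 0.02, 50        # learning rate, L2 regularization, passes over the data

rng = np.random.default_rng(0)
mu = np.mean([r for _, _, r in ratings])           # global average
b_u, b_v = np.zeros(n_users), np.zeros(n_items)    # user and item biases
P = rng.normal(0, 0.1, (n_users, f))               # user factors p_u
Q = rng.normal(0, 0.1, (n_items, f))               # item factors q_v

def predict(u, v):
    return mu + b_u[u] + b_v[v] + P[u] @ Q[v]

for _ in range(epochs):
    for u, v, r in ratings:
        err = r - predict(u, v)
        b_u[u] += lr * (err - reg * b_u[u])
        b_v[v] += lr * (err - reg * b_v[v])
        # simultaneous update of user and item factors
        P[u], Q[v] = P[u] + lr * (err * Q[v] - reg * P[u]), Q[v] + lr * (err * P[u] - reg * Q[v])

print(round(predict(0, 2), 2))  # predicted rating for an unseen (user, item) pair
```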
SVD++
Koren et al. proposed an asymmetric variation that includes implicit feedback:

$$\hat{r}_{uv} = b_{uv} + q_v^T \left( |R(u)|^{-\frac{1}{2}} \sum_{j \in R(u)} (r_{uj} - b_{uj})\, x_j \; + \; |N(u)|^{-\frac{1}{2}} \sum_{j \in N(u)} y_j \right)$$

Where
$q_v, x_v, y_v \in \mathbb{R}^f$ are three item factor vectors
Users are not parametrized, but rather represented by:
R(u): items rated by user u
N(u): items for which the user has given an implicit preference (e.g. rated vs. not rated)
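A minimal sketch of how the asymmetric prediction above could be evaluated once the three item-factor sets are available; the factor matrices, the baseline values, and the toy R(u)/N(u) sets are hypothetical placeholders, not trained parameters.

```python
# Sketch of the asymmetric (implicit-feedback) prediction rule above; all data is illustrative.
import numpy as np

f, n_items = 8, 5
rng = np.random.default_rng(1)
Q, X, Y = (rng.normal(0, 0.1, (n_items, f)) for _ in range(3))  # q_v, x_j, y_j item factors

def predict(v, rated, implicit, baseline):
    """rated: {item j: (r_uj, b_uj)}; implicit: set N(u); baseline: b_uv."""
    explicit_part = sum((r - b) * X[j] for j, (r, b) in rated.items()) / np.sqrt(len(rated))
    implicit_part = sum(Y[j] for j in implicit) / np.sqrt(len(implicit))
    return baseline + Q[v] @ (explicit_part + implicit_part)

rated = {0: (4.0, 3.5), 2: (2.0, 3.1)}   # R(u): items rated by user u, with their baselines
implicit = {0, 1, 2, 4}                  # N(u): items with implicit preference (e.g. played)
print(round(predict(3, rated, implicit, baseline=3.4), 2))
```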
RBM
First generation neural networks (~60s)
Perceptrons (~1960)
Single layer of hand-coded features
Linear activation function
Fundamentally limited in what they can learn to do.
(Diagram: output units - class labels; non-adaptive hand-coded features; input units - features)
Second generation neural networks (~80s)
Compare output to correct answer to compute error signal
Back-propagate error signal to get derivatives for learning
Non-linear activation function
(Diagram: outputs; hidden layers; input features)
Belief Networks (~90s)
Directed acyclic graph composed of stochastic variables with weighted connections.
Can observe some of the variables
Solve two problems:
Inference: Infer the states of the unobserved variables.
Learning: Adjust the interactions between variables to make the network more likely to generate the observed data.
(Diagram: stochastic hidden cause; visible effect)
Restricted Boltzmann Machine
Restrict the connectivity to make learning easier.
Only one layer of hidden units.
Although multiple layers are possible
No connections between hidden units.
Hidden units are independent given the visible states.
So we can quickly get an unbiased sample from the posterior distribution over hidden causes when given a data-vector
RBMs can be stacked to form Deep Belief Nets (DBN)
(Diagram: hidden units j; visible units i)
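A minimal sketch (not from the deck) of the key property mentioned above: because hidden units are conditionally independent given the visibles, their states can be sampled in one vectorized step. The weights, sizes, and the visible data-vector are illustrative assumptions.

```python
# Sketch: sampling RBM hidden units given a visible vector (illustrative weights and sizes).
# Conditional independence means one matrix product gives all hidden activation probabilities.
import numpy as np

n_visible, n_hidden = 6, 4
rng = np.random.default_rng(2)
W = rng.normal(0, 0.1, (n_visible, n_hidden))   # visible-to-hidden weights
b_hidden = np.zeros(n_hidden)                   # hidden biases

def sample_hidden(v):
    p_h = 1.0 / (1.0 + np.exp(-(v @ W + b_hidden)))     # sigmoid activation probabilities
    return (rng.random(n_hidden) < p_h).astype(float), p_h

v = np.array([1, 0, 1, 1, 0, 0], dtype=float)   # a binary visible data-vector
h, p_h = sample_hidden(v)
print(h, np.round(p_h, 2))
```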
RBM for the Netflix Prize

What about the final prize ensembles?
Our offline studies showed they were too computationally intensive to scale
Expected improvement not worth the engineering effort
Plus, focus had already shifted to other issues that had more impact than rating prediction...
Ranking
Key algorithm, sorts titles in most contexts

Ranking
Ranking = Scoring + Sorting + Filtering bags of movies for presentation to a user
Goal: Find the best possible ordering of a set of videos for a user within a specific context in real-time
Objective: maximize consumption
Aspirations: Played & enjoyed titles have best score
Akin to CTR forecast for ads/search results
Factors: Accuracy, Novelty, Diversity, Freshness, Scalability
Ranking
Popularity is the obvious baseline
Ratings prediction is a clear secondary data input that allows for personalization
We have added many other features (and tried many more that have not proved useful)
What about the weights?
Based on A/B testing
Machine-learned
Example: Two features, linear model
(Figure: candidate videos plotted by Popularity vs. Predicted Rating, numbered 1-5 by their Final Ranking)
Linear Model: frank(u,v) = w1 p(v) + w2 r(u,v) + b
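A minimal sketch of the two-feature linear ranker above, scoring and sorting a small candidate list; the weights w1, w2, b and the candidate data are made up here, whereas the slide notes that in practice the weights come from A/B testing or are machine-learned.

```python
# Sketch of the two-feature linear ranking function frank(u,v) = w1*p(v) + w2*r(u,v) + b.
# The weights and candidate data are illustrative placeholders.
def frank(popularity, predicted_rating, w1=0.4, w2=0.6, b=0.0):
    return w1 * popularity + w2 * predicted_rating + b

# Candidate videos as (title, normalized popularity p(v), predicted rating r(u, v))
candidates = [("A", 0.9, 3.1), ("B", 0.4, 4.7), ("C", 0.7, 4.0)]
ranked = sorted(candidates, key=lambda c: frank(c[1], c[2]), reverse=True)
print([title for title, _, _ in ranked])
```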
Results

Learning to rank
Machine learning problem: goal is to construct a ranking model from training data
Training data can have partial order or binary judgments (relevant/not relevant)
Resulting order of the items typically induced from a numerical score
Learning to rank is a key element for personalization
You can treat the problem as a standard supervised classification problem
Learning to Rank Approaches
1. Pointwise
Ranking function minimizes a loss function defined on individual relevance judgments
Ranking score based on regression or classification
Ordinal regression, Logistic regression, SVM, GBDT, ...
2. Pairwise
Loss function is defined on pair-wise preferences
Goal: minimize number of inversions in ranking
Ranking problem is then transformed into a binary classification problem (a minimal sketch follows below)
RankSVM, RankBoost, RankNet, FRank
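A minimal sketch of the pairwise idea, using a RankNet-style logistic loss on the score difference of one preference pair; this is only an illustration of the general approach, not a claim about the specific loss used in any of the methods listed above.

```python
# Sketch of a pairwise (RankNet-style) loss on one preference pair; purely illustrative.
import numpy as np

def pairwise_logistic_loss(score_preferred, score_other):
    # "The preferred item should score higher" becomes binary classification on the score gap.
    return np.log(1.0 + np.exp(-(score_preferred - score_other)))

print(round(pairwise_logistic_loss(2.0, 0.5), 3))  # small loss: ordering is correct
print(round(pairwise_logistic_loss(0.5, 2.0), 3))  # large loss: pair is inverted
```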
Learning to rank - metrics
Quality of ranking measured using metrics such as:
Normalized Discounted Cumulative Gain (NDCG)
$$NDCG = \frac{DCG}{IDCG} \quad \text{where} \quad DCG = relevance_1 + \sum_{i=2}^{n} \frac{relevance_i}{\log_2 i} \quad \text{and IDCG is the DCG of the ideal ranking}$$
Mean Reciprocal Rank (MRR)
$$MRR = \frac{1}{|H|} \sum_{h_i \in H} \frac{1}{rank(h_i)} \quad \text{where } h_i \text{ are the positive hits from the user}$$
Mean Average Precision (MAP)
$$MAP = \frac{\sum_{n=1}^{N} AveP(n)}{N} \quad \text{where } N \text{ can be the number of users or items, and } P = \frac{tp}{tp + fp}$$
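A minimal sketch computing NDCG and MRR exactly as defined above, on a toy ranked list; the relevance values and hit ranks are made-up examples.

```python
# Sketch computing NDCG and MRR as defined above, on toy data (illustrative only).
import math

def dcg(relevances):
    # relevance_1 + sum over i>=2 of relevance_i / log2(i), with i the 1-based rank
    return relevances[0] + sum(rel / math.log2(i) for i, rel in enumerate(relevances[1:], start=2))

def ndcg(relevances):
    ideal = sorted(relevances, reverse=True)   # IDCG: DCG of the ideal ranking
    return dcg(relevances) / dcg(ideal)

def mrr(hit_ranks):
    # hit_ranks: 1-based ranks of the user's positive hits H, following the slide's formula
    return sum(1.0 / r for r in hit_ranks) / len(hit_ranks)

print(round(ndcg([3, 2, 3, 0, 1]), 3))
print(round(mrr([1, 3, 2]), 3))
```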
Learning to rank - metrics
Quality of ranking measured using metrics such as:
Fraction of Concordant Pairs (FCP)
Given items $x_i$ and $x_j$, a user preference $P$ and a ranking method $R$, a concordant pair (CP) is $\{x_i, x_j\}$ s.t. $P(x_i) > P(x_j) \wedge R(x_i) < R(x_j)$
$$FCP = \frac{\sum_{i \neq j} CP(x_i, x_j)}{\frac{n(n-1)}{2}}$$
Others...
But it is hard to optimize machine-learned models directly on these measures
They are not differentiable
Recent research on models that directly optimize ranking measures
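A minimal sketch computing FCP per the definition above on toy data; the preference scores and rank positions are invented for illustration, and ties in preference are assumed away.

```python
# Sketch computing the Fraction of Concordant Pairs (FCP) on toy data (assumes no preference ties).
from itertools import combinations

preference = {"A": 4.5, "B": 3.0, "C": 2.0, "D": 5.0}   # P(x): user preference (higher = better)
rank_position = {"A": 2, "B": 4, "C": 3, "D": 1}        # R(x): rank assigned by the method (1 = top)

def fcp(P, R):
    pairs = list(combinations(P, 2))
    concordant = sum(1 for x, y in pairs if (P[x] > P[y]) == (R[x] < R[y]))
    return concordant / len(pairs)   # len(pairs) == n(n-1)/2

print(round(fcp(preference, rank_position), 2))
```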
Learning to Rank Approaches
3. Listwise
a. Directly optimizing IR measures (difficult since they are not differentiable)
Directly optimize IR measures through Genetic Programming
Directly optimize measures with Simulated Annealing
Gradient descent on a smoothed version of the objective function
SVM-MAP relaxes the MAP metric by adding it to the SVM constraints
AdaRank uses boosting to optimize NDCG
b. Indirect Loss Function
RankCosine uses similarity between the ranking list and the ground truth as loss function
ListNet uses KL-divergence as loss function by defining a probability distribution
Problem: optimization of the listwise loss function does not necessarily optimize IR metrics
Similars
Different similarities computed from different sources: metadata, ratings, viewing data
Similarities can be treated as data/features
Machine Learned models improve our concept of similarity
Data & Models - Recap
All sorts of feedback from the user can help generate better recommendations
Need to design systems that capture and take advantage of all this data
The right model is as important as the right data
It is important to come up with new theoretical models, but also need to think about application to a domain, and practical issues
Rating prediction models are only part of the solution to recommendation (think about ranking, similarity)
Consumer (Data) Science

Consumer Science
Main goal is to effectively innovate for customers
Innovation goals
"If you want to increase your success rate, double your failure rate." - Thomas Watson, Sr., founder of IBM
"The only real failure is the failure to innovate"
Fail cheaply
Know why you failed/succeeded
Consumer (Data) Science
1. Start with a hypothesis: Algorithm/feature/design X will increase member engagement with our service, and ultimately member retention
2. Design a test: Develop a solution or prototype. Think about dependent & independent variables, control, significance
3. Execute the test
4. Let data speak for itself
Offline/Online testing process
(Flow: Offline testing [days] --[success]--> Online A/B testing [weeks to months] --[success]--> Rollout feature to all users; [fail] loops back)
Offline testing process
(Flow: Initial Hypothesis -> Model Prototype -> Train Model offline -> Test offline -> Hypothesis validated offline? If no, try a different model or reformulate the hypothesis; if yes, run an Online A/B test -> Wait for results -> Analyze test results -> Significant improvement on users? If yes, roll out the feature to all users; if no, reformulate the hypothesis)
Offline testing
Optimize algorithms offline
Measure model performance using metrics such as: Mean Reciprocal Rank, Normalized Discounted Cumulative Gain, Fraction of Concordant Pairs, Precision/Recall & F-measures, AUC, RMSE, Diversity...
Offline performance used as an indication to make informed decisions on follow-up A/B tests
A critical (and unsolved) issue is how offline metrics correlate with A/B test results
Extremely important to define a coherent offline evaluation framework (e.g. how to create training/testing datasets is not trivial)
Online A/B testing process
(Flow: Design A/B test -> Choose control group -> Online A/B testing -> Wait for results -> Analyze test results -> Significant improvement on users? If yes, roll out the feature to all users; if no, reformulate the hypothesis or try a different model and return to offline prototyping)
Executing A/B tests
Many different metrics, but ultimately trust user engagement (e.g. hours of play and customer retention)
Think about significance and hypothesis testing (a minimal sketch follows below); our tests usually have thousands of members and 2-20 cells
A/B tests allow you to try radical ideas or test many approaches at the same time; we typically have hundreds of customer A/B tests running
Decisions on the product are always data-driven

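A minimal sketch of one way to check significance for a retention-style metric between a control and a test cell (a two-proportion z-test); the counts are invented, and real experiments would also need to account for multiple cells and multiple metrics.

```python
# Sketch: two-proportion z-test for a retention metric, one control cell vs. one test cell.
# Counts are made up; this is an illustration of hypothesis testing, not Netflix's actual tooling.
from math import sqrt, erf

def two_proportion_z_test(success_a, n_a, success_b, n_b):
    p_a, p_b = success_a / n_a, success_b / n_b
    p_pool = (success_a + success_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))   # two-sided p-value from the normal CDF
    return z, p_value

z, p = two_proportion_z_test(success_a=4120, n_a=5000, success_b=4210, n_b=5000)
print(f"z = {z:.2f}, p = {p:.4f}")  # reject the null at the 0.05 level if p < 0.05
```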
What to measure
OEC: Overall Evaluation Criteria
In an A/B test framework, the measure of success is key
Short-term metrics do not always align with long-term goals
E.g. CTR: generating more clicks might mean that our recommendations are actually worse
Use long-term metrics such as LTV (lifetime value) whenever possible
At Netflix, we use member retention
What to measure
Short-term metrics can sometimes be informative, and may allow for faster decision-making
At Netflix we use many, such as hours streamed by users or % of hours from a given algorithm
But be aware of several caveats of using early decision mechanisms
Initial effects appear to trend. See "Trustworthy Online Controlled Experiments: Five Puzzling Outcomes Explained" [Kohavi et al., KDD 12]
Consumer Data Science - Recap
Consumer Data Science aims to innovate for the customer by running experiments and letting data speak
This is mainly done through online A/B testing
However, we can speed up innovation by experimenting offline
But, both for online and offline experimentation, it is important to choose the right metric and experimental framework
Architectures

Technology
http://techblog.netflix.com
Event & Data Distribution
Event & Data Distribution
UI devices should broadcast many different kinds of user events: clicks, presentations, browsing events, ...
Events vs. data
Some events only need to be propagated and trigger an action (low latency, low information per event)
Others need to be processed and turned into data (higher latency, higher information quality)
And there are many in between
Real-time event flow managed through an internal tool (Manhattan)
Data flow mostly managed through Hadoop
Offline Jobs

Offline Jobs
Two kinds of offline jobs:
Model training
Batch offline computation of recommendations/intermediate results
Offline queries either in Hive or Pig
Need a publishing mechanism that solves several issues:
Notify readers when the result of a query is ready
Support different repositories (S3, Cassandra)
Handle errors, monitoring
We do this through Hermes
Computation

Computation
Two ways of computing personalized results: batch/offline and online
Each approach has pros/cons
Offline
+ Allows more complex computations
+ Can use more data
- Cannot react to quick changes
- May result in staleness
Online
+ Can respond quickly to events
+ Can use most recent data
- May fail because of SLA
- Cannot deal with complex computations
It's not an either/or decision: both approaches can be combined
Signals & Models

Signals & Models
Both offline and online algorithms are based on three different inputs:
Models: previously trained from existing data
(Offline) Data: previously processed and stored information
Signals: fresh data obtained from live services
User-related data
Context data (session, date, time)
Results

Results
Recommendations can be serviced from:
Previously computed lists
Online algorithms
A combination of both
The decision on where to service the recommendation from can respond to many factors, including context
Also important to think about the fallbacks (what if plan A fails?)
Previously computed lists/intermediate results can be stored in a variety of ways:
Cache
Cassandra
Relational DB
Alerts and Monitoring
A non-trivial concern in large-scale recommender systems
Monitoring: continuously observe the quality of the system
Alert: fast notification if the quality of the system goes below a certain pre-defined threshold
Questions:
What do we need to monitor?
How do we know something is bad enough to alert?
What to monitor
Staleness
Monitor time since last data update (a minimal check is sketched below)
(Figure annotation: "Did something go wrong here?")
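A minimal sketch of a staleness check that alerts when the time since the last data update exceeds a threshold; the threshold value and the alert hook (a plain print) are placeholder assumptions, not part of the deck.

```python
# Sketch of a staleness monitor: alert when time since the last data update exceeds a threshold.
# The threshold and the alert hook (here just a print) are illustrative placeholders.
from datetime import datetime, timedelta, timezone

STALENESS_THRESHOLD = timedelta(hours=6)   # illustrative; tune per dataset

def check_staleness(last_update, now=None):
    now = now or datetime.now(timezone.utc)
    age = now - last_update
    if age > STALENESS_THRESHOLD:
        print(f"ALERT: data is stale ({age} since last update)")   # a real system would page/notify
        return True
    return False

check_staleness(datetime.now(timezone.utc) - timedelta(hours=8))   # triggers the alert
```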
What to monitor
Algorithmic quality
Monitor different metrics by comparing what users do and what your algorithm predicted they would do
(Figure annotation: "Did something go wrong here?")
What to monitor
Algorithmic source for users
Monitor how users interact with different algorithms
(Figure: users on Algorithm X vs. the new version; annotation: "Did something go wrong here?")
When to alert
Alerting thresholds are hard to tune:
Avoid unnecessary alerts (the learn-to-ignore problem)
Avoid important issues being noticed before the alert happens
Rules of thumb:
Alert on anything that will impact user experience significantly
Alert on issues that are actionable
If a noticeable event happens without an alert, add a new alert for next time
Conclusions

The Personalization Problem
The Netflix Prize simplified the recommendation problem to predicting ratings
But...
User ratings are only one of the many data inputs we have
Rating predictions are only part of our solution
Other algorithms such as ranking or similarity are very important
We can reformulate the recommendation problem
Function to optimize: probability a user chooses something and enjoys it enough to come back to the service
More to Recsys than Algorithms
Not only is there more to algorithms than rating prediction
There is more to Recsys than algorithms:
User Interface & Feedback
Data
A/B Testing
Systems & Architectures
More data +
Better models +
More accurate metrics +
Better approaches & architectures
Lots of room for improvement!
We're hiring!

Xavier Amatriain (@xamat)


xamatriain@netflix.com
