Managing Crowdsourced Human Computation

Panos Ipeirotis, New York University

Slides from the WWW2011 tutorial, 29 March 2011
Outline
• Introduction: Human computation and crowdsourcing
• Managing quality for simple tasks
• Complex tasks using workflows
• Task optimization
• Incentivizing the crowd
• Market design
• Behavioral aspects and cognitive biases
• Game design
• Case studies
Human Computation, Round 1
• Humans were the first “computers,” used for math computations

Grier, When computers were human, 2005
Grier, IEEE Annals 1998
Human Computation, Round 1
• Humans were the first “computers,” used for math computations

• Organized computation:
– Clairaut, astronomy, 1758: Computed Halley’s comet orbit (three‐body problem) by dividing the labor of numeric computations across 3 astronomers
Grier, When computers were human, 2005
Grier, IEEE Annals 1998
Human Computation, Round 1
• Organized computation:
– Maskelyne, astronomical almanac with moon positions, used for navigation, 1760. Quality assurance by doing calculations twice, compared by a third verifier.

– De Prony, 1794, hired hairdressers (unemployed after the French Revolution; knew only addition and subtraction) to create logarithmic and trigonometric tables. He managed the process by splitting the work into very detailed workflows. (Hairdressers better than mathematicians in arithmetic!)

Grier, When computers were human, 2005
Grier, IEEE Annals 1998
Human Computation, Round 1
• Organized computation:
– Clairaut, astronomy, 1758
– Maskelyne, 1760
– De Prony, log/trig tables, 1794
– Galton, biology, 1893
– Pearson, biology, 1899
– …
– Cowles, stock market, 1929
– Math Tables Project, unskilled labor, 1938

Grier, When computers were human, 2005
Grier, IEEE Annals 1998
Human Computation, Round 1

• Patterns emerging
– Division of labor
– Mass production
– Professional managers

• Then we got the “automatic computers”
Human Computation, Round 2
• Now we need humans again for the “AI‐complete” tasks

– Tag images [ESP Game: von Ahn and Dabbish 2004, ImageNet]
– Determine if a page is relevant [Alonso et al., 2011]
– Determine song genre
– Check a page for offensive content
– …

ImageNet: http://www.image‐net.org/about‐publication
Focus of the tutorial
Examine cases where humans interact with computers in order to solve a computational problem (usually too hard to be solved by computers alone)
Crowdsourcing and human computation

• Crowdsourcing: From macro to micro
– Netflix, Innocentive
– Quirky, Threadless
– oDesk, Guru, eLance, vWorker
– Wikipedia et al.
– ESP Game, FoldIt, Phylo, …
– Mechanical Turk, CloudCrowd, …

• Crowdsourcing greatly facilitates human computation (but they are not equivalent)
Micro‐Crowdsourcing Example:
Labeling Images using the ESP Game
Luis von Ahn, MacArthur Fellowship ("genius grant")

• Two player online game
• Partners don’t know each other and can’t communicate
• Object of the game: type the same word
• The only thing in common is an image

PLAYER 1 guesses: CAR, HAT, KID
PLAYER 2 guesses: BOY, CAR
SUCCESS! YOU AGREE ON CAR
Paid Crowdsourcing: Amazon Mechanical Turk
Demographics of MTurk workers
http://bit.ly/mturk‐demographics

Country of residence
• United States: 46.80%
• India: 34.00%
• Miscellaneous: 19.20%
Outline
• Introduction: Human computation and crowdsourcing
• Managing quality for simple tasks
• Complex tasks using workflows
• Task optimization
• Incentivizing the crowd
• Market design
• Behavioral aspects and cognitive biases
• Game design
• Case studies
Managing quality for simple tasks
• Quality through redundancy: Combining votes
– Majority vote
– Quality‐adjusted vote
– Managing dependencies
• Quality through gold data
• Estimating worker quality (Redundancy + Gold)
• Joint estimation of worker quality and difficulty
• Active data collection
Example: Build an “Adult Web Site” Classifier

• Need a large number of hand‐labeled sites
• Get people to look at sites and classify them as:
G (general audience), PG (parental guidance), R (restricted), X (porn)

Cost/Speed Statistics
• Undergrad intern: 200 websites/hr, cost: $15/hr
• MTurk: 2500 websites/hr, cost: $12/hr
Bad news: Spammers! 

Worker ATAMRO447HWJQ
labeled X (porn) sites as G (general audience)
Majority Voting and Label Quality
Ask multiple labelers, keep majority label as “true” label
Quality is probability of being correct

[Figure: Quality of the majority vote vs. number of labelers (1–13), for individual labeler accuracy p = 0.4 to 1.0; binary classification. p is the probability of an individual labeler being correct.]
Kuncheva et al., PA&A, 2003
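A minimal sketch (not from the slides) that reproduces the curve summarized above: the probability that a simple majority of n independent labelers, each correct with probability p, returns the correct binary label.

```python
from math import comb

def majority_vote_quality(p: float, n: int) -> float:
    """Probability that the majority of n independent labelers, each correct
    with probability p, yields the correct binary label (ties split 50/50)."""
    q = sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n // 2 + 1, n + 1))
    if n % 2 == 0:                      # even n: an exact tie is broken at random
        k = n // 2
        q += 0.5 * comb(n, k) * p**k * (1 - p)**(n - k)
    return q

for p in (0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0):
    print(p, [round(majority_vote_quality(p, n), 3) for n in (1, 3, 5, 7, 9, 11, 13)])
```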
What if qualities of workers are different?

3 workers, qualities: p‐d, p, p+d

[Figure: region where the majority vote beats the single best worker]

• Majority vote works best when workers have similar quality
• Otherwise better to just pick the vote of the best worker
• …or model worker qualities and combine [coming next]
Combining votes with different quality

Clemen and Winkler, 1990
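This is not the cited Clemen and Winkler model itself; just a naive log‐odds combination, assuming conditionally independent workers with known accuracies, to illustrate the "model worker qualities and combine" idea.

```python
import math

def combine_votes(votes, accuracies, prior=0.5):
    """Posterior probability that the true binary label is +1, combining votes
    (+1 / -1) from workers with known accuracies, assuming the workers are
    conditionally independent given the true label."""
    log_odds = math.log(prior / (1 - prior))
    for vote, p in zip(votes, accuracies):
        weight = math.log(p / (1 - p))      # higher-quality workers carry more weight
        log_odds += weight if vote == +1 else -weight
    return 1 / (1 + math.exp(-log_odds))

# One strong worker can outvote two mediocre ones:
print(combine_votes([+1, -1, -1], [0.95, 0.6, 0.6]))   # > 0.5 despite being outnumbered
```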
What happens if we have dependencies?

Clemen and Winkler, 1985

Positive dependencies decrease the number of effective labelers

What happens if we have dependencies?

Yule’s Q: a measure of correlation
Kuncheva et al., PA&A, 2003

Positive dependencies decrease the number of effective labelers
Negative dependencies can improve results (unlikely that both workers are wrong at the same time)
Vote combination: Meta‐studies
• Simple averages tend to work well

• Complex models slightly better but less robust
[Clemen and Winkler, 1999, Ariely et al. 2000]


From aggregate labels to worker quality

Look at our spammer friend ATAMRO447HWJQ, together with 9 other workers

After aggregation, we compute a confusion matrix for each worker

After majority vote, confusion matrix for ATAMRO447HWJQ


P[G → G]=100% P[G → X]=0%
P[X → G]=100% P[X → X]=0%
Algorithm of Dawid & Skene, 1979

Iterative process to estimate worker error rates
1. Initialize by aggregating labels for each object (e.g., use majority vote)
2. Estimate the confusion matrix for each worker (using the aggregate labels)
3. Estimate aggregate labels (using the confusion matrices)
• Keep labels for “gold data” unchanged
4. Go to Step 2 and iterate until convergence

Confusion matrix for ATAMRO447HWJQ
P[G → G]=99.947%  P[G → X]=0.053%
P[X → G]=99.153%  P[X → X]=0.847%

Our friend ATAMRO447HWJQ marked almost all sites as G. Seems like a spammer…
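A minimal NumPy sketch of the iteration above (function name and input format are made up for illustration; this is not the reference implementation, and gold labels, priors, and smoothing are handled only crudely).

```python
import numpy as np

def dawid_skene(labels, n_classes, n_iter=50):
    """Simplified Dawid & Skene (1979) style estimator.
    labels: dict object_id -> list of (worker_id, label) pairs, labels in 0..n_classes-1.
    Returns (soft_labels, confusion) with confusion[w][j, k] ~ P[worker w says k | true j]."""
    objects = list(labels)
    workers = sorted({w for o in objects for w, _ in labels[o]})
    # Step 1: initialize per-object class estimates with a (soft) majority vote
    T = {o: np.bincount([l for _, l in labels[o]], minlength=n_classes).astype(float)
         for o in objects}
    for o in objects:
        T[o] /= T[o].sum()
    for _ in range(n_iter):
        # Step 2: estimate each worker's confusion matrix from the current labels
        conf = {w: np.full((n_classes, n_classes), 1e-6) for w in workers}
        for o in objects:
            for w, l in labels[o]:
                conf[w][:, l] += T[o]
        for w in workers:
            conf[w] /= conf[w].sum(axis=1, keepdims=True)
        prior = sum(T.values()) / len(objects)
        # Step 3: re-estimate the labels using the confusion matrices
        # (gold-labeled objects would simply be kept fixed here)
        for o in objects:
            post = prior.copy()
            for w, l in labels[o]:
                post *= conf[w][:, l]
            T[o] = post / post.sum()
    return T, conf

votes = {"site1": [("w1", 0), ("w2", 0), ("spammer", 0)],
         "site2": [("w1", 1), ("w2", 1), ("spammer", 0)]}
soft_labels, confusion = dawid_skene(votes, n_classes=2)
```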
And many variations…
• van der Linden et al., 1997: Item‐Response Theory
• Uebersax, Biostatistics 1993: Ordered categories
• Uebersax, JASA 1993: Ordered categories, with worker expertise and bias, item difficulty
• Carpenter, 2008: Hierarchical Bayesian versions

And more recently at NIPS:
• Whitehill et al., 2009: Adding item difficulty
• Welinder et al., 2010: Adding worker expertise
Challenge: From Confusion Matrices to Quality Scores

All the algorithms will generate “confusion matrices” for workers

Confusion matrix for ATAMRO447HWJQ
P[X → X]=0.847%  P[X → G]=99.153%
P[G → X]=0.053%  P[G → G]=99.947%

How to check if a worker is a spammer using the confusion matrix?
(hint: error rate not enough)
Challenge 1: 
Spammers are lazy and smart!

Confusion matrix for spammer:
P[X → X]=0%  P[X → G]=100%
P[G → X]=0%  P[G → G]=100%

Confusion matrix for good worker:
P[X → X]=80%  P[X → G]=20%
P[G → X]=20%  P[G → G]=80%

• Spammers figure out how to fly under the radar…

• In reality, we have 85% G sites and 15% X sites

• Errors of spammer = 0% * 85% + 100% * 15% = 15%
• Error rate of good worker = 20% * 85% + 20% * 15% = 20%

False negatives: Spam workers pass as legitimate
Challenge 2: 
Humans are biased!

Error rates for CEO of AdSafe
P[G → G]=20.0%  P[G → P]=80.0%  P[G → R]=0.0%    P[G → X]=0.0%
P[P → G]=0.0%   P[P → P]=0.0%   P[P → R]=100.0%  P[P → X]=0.0%
P[R → G]=0.0%   P[R → P]=0.0%   P[R → R]=100.0%  P[R → X]=0.0%
P[X → G]=0.0%   P[X → P]=0.0%   P[X → R]=0.0%    P[X → X]=100.0%

In reality, we have 85% G sites, 5% P sites, 5% R sites, 5% X sites

Errors of spammer (all in G) = 0% * 85% + 100% * 15% = 15%


Error rate of biased worker = 80% * 85% + 100% * 5% = 73%

False positives: Legitimate workers appear to be spammers


Solution: Reverse errors first, compute error rate afterwards

Error Rates for CEO of AdSafe
P[G → G]=20.0%  P[G → P]=80.0%  P[G → R]=0.0%    P[G → X]=0.0%
P[P → G]=0.0%   P[P → P]=0.0%   P[P → R]=100.0%  P[P → X]=0.0%
P[R → G]=0.0%   P[R → P]=0.0%   P[R → R]=100.0%  P[R → X]=0.0%
P[X → G]=0.0%   P[X → P]=0.0%   P[X → R]=0.0%    P[X → X]=100.0%

• When the biased worker says G, it is 100% G
• When the biased worker says P, it is 100% G
• When the biased worker says R, it is 50% P, 50% R
• When the biased worker says X, it is 100% X

Small ambiguity for “R‐rated” votes but other than that, fine!
Solution: Reverse errors first, compute error rate afterwards

Error Rates for spammer ATAMRO447HWJQ
P[G → G]=100.0%  P[G → P]=0.0%  P[G → R]=0.0%  P[G → X]=0.0%
P[P → G]=100.0%  P[P → P]=0.0%  P[P → R]=0.0%  P[P → X]=0.0%
P[R → G]=100.0%  P[R → P]=0.0%  P[R → R]=0.0%  P[R → X]=0.0%
P[X → G]=100.0%  P[X → P]=0.0%  P[X → R]=0.0%  P[X → X]=0.0%

• When spammer says G, it is 25% G, 25% P, 25% R, 25% X
• When spammer says P, it is 25% G, 25% P, 25% R, 25% X
• When spammer says R, it is 25% G, 25% P, 25% R, 25% X
• When spammer says X, it is 25% G, 25% P, 25% R, 25% X
[note: assume equal priors]

The results are highly ambiguous. No information provided!
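A small sketch of the "reverse the errors" step via Bayes' rule (my own illustration, not the tutorial's code): given a worker's confusion matrix and the class priors, compute what the worker's vote actually tells us.

```python
import numpy as np

def soft_label(confusion, priors, assigned):
    """P[true class | worker assigned this label], given the worker's confusion
    matrix (confusion[true, assigned] = P[assigned | true]) and class priors."""
    post = priors * confusion[:, assigned]
    return post / post.sum()

classes = ["G", "P", "R", "X"]
priors = np.array([0.25, 0.25, 0.25, 0.25])   # equal priors, as assumed on the slide
spammer = np.zeros((4, 4))
spammer[:, 0] = 1.0                           # always says G, whatever the true class
print(soft_label(spammer, priors, assigned=0))  # -> [0.25 0.25 0.25 0.25]: no information
```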
Quality Scores

• High cost when “soft” labels have probability spread across classes
• Low cost when “soft” labels have probability mass concentrated in one class

Assigned Label | “Soft” Label                     | Cost
G              | <G: 25%, P: 25%, R: 25%, X: 25%> | 0.75
G              | <G: 99%, P: 1%, R: 0%, X: 0%>    | 0.0198

[Assume equal misclassification costs]

Ipeirotis, Provost, Wang, HCOMP 2010
Quality Score
• A spammer is a worker who always assigns labels randomly, 
regardless of what the true class is.

QualityScore = 1 ‐ ExpCost(Worker)/ExpCost(Spammer)

• QualityScore is useful for the purpose of blocking bad workers and rewarding good ones
• Essentially a multi‐class, cost‐sensitive AUC metric
• AUC = area under the ROC curve
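One way to reproduce the numbers in the table and the QualityScore formula above (a sketch under the slide's assumptions of equal misclassification costs; the spammer baseline is a soft label equal to the class priors).

```python
import numpy as np

def expected_cost(soft, cost=None):
    """Expected cost of a soft label: pick a class according to the soft label,
    score it against a true class drawn from the same soft label."""
    p = np.asarray(soft, dtype=float)
    if cost is None:
        cost = 1 - np.eye(len(p))          # equal misclassification costs
    return float(p @ cost @ p)

def quality_score(worker_soft_labels, priors):
    """QualityScore = 1 - ExpCost(Worker) / ExpCost(Spammer)."""
    worker_cost = np.mean([expected_cost(s) for s in worker_soft_labels])
    return 1 - worker_cost / expected_cost(priors)

print(expected_cost([0.25, 0.25, 0.25, 0.25]))   # 0.75, as in the table
print(expected_cost([0.99, 0.01, 0.0, 0.0]))     # 0.0198
print(quality_score([[0.99, 0.01, 0.0, 0.0]], [0.25, 0.25, 0.25, 0.25]))
```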
What about Gold testing?

Naturally integrated into the latent class model
1. Initialize by aggregating labels for each object (e.g., use majority vote)
2. Estimate error rates for workers (using aggregate labels)
3. Estimate aggregate labels (using error rates, weight worker votes according to quality)
• Keep labels for “gold data” unchanged
4. Go to Step 2 and iterate until convergence
Gold Testing
• 3 labels per example
• 2 categories, 50/50
• Quality range: 0.55:0.05:1.0
• 200 labelers

No significant advantage under “good conditions” (balanced datasets, good worker quality)

http://bit.ly/gold‐or‐repeated
Wang, Ipeirotis, Provost, WCBI 2011
Gold Testing
• 5 labels per example
• 2 categories, 50/50
• Quality range: 0.55:1.0
• 200 labelers

No significant advantage under “good conditions” (balanced datasets, good worker quality)
Gold Testing
• 10 labels per example
• 2 categories, 50/50
• Quality range: 0.55:1.0
• 200 labelers

No significant advantage under “good conditions” (balanced datasets, good worker quality)
Gold Testing
• 10 labels per example
• 2 categories, 90/10
• Quality range: 0.55:1.0
• 200 labelers

Advantage under imbalanced datasets
Gold Testing
• 5 labels per example
• 2 categories, 50/50
• Quality range: 0.55:0.65
• 200 labelers

Advantage with bad worker quality
Gold Testing?
• 10 labels per example
• 2 categories, 90/10
• Quality range: 0.55:0.65
• 200 labelers

Significant advantage under “bad conditions” (imbalanced datasets, bad worker quality)
Testing workers
• An exploration‐exploitation scheme:
– Explore: Learn about the quality of the workers
– Exploit: Label new examples using the quality

Testing workers
• An exploration‐exploitation scheme:
– Assign gold labels when the benefit of learning a better quality estimate for the worker outweighs the loss of labeling a gold (known‐label) example [Wang et al, WCBI 2011]
– Assign an already labeled example (by other workers) and see if it agrees with the majority [Donmez et al., KDD 2009]
– If worker quality changes over time, assume accuracy given by an HMM with φ(τ) = φ(τ‐1) + Δ [Donmez et al., SDM 2010]
Example: Build an “Adult Web Site” Classifier

Get people to look at sites and classify them as:
G (general audience), PG (parental guidance), R (restricted), X (porn)

But we are not going to label the whole Internet…
Expensive
Slow
Integrating with Machine Learning
• Crowdsourcing is cheap but not free
– Cannot scale to the web without help

• Solution: Build automatic classification models using crowdsourced data
Simple solution

• Humans label training data
• Use training data to build model

[Diagram: Data from existing crowdsourced answers → Automatic Model (through machine learning); New Case → Automatic Model → Automatic Answer]
Quality and Classification Performance

Noisy labels lead to degraded task performance
Labeling quality increases → classification quality increases

[Figure: AUC vs. number of examples (“Mushroom” data set), for single‐labeler quality (probability of assigning a binary label correctly) of 50%, 60%, 80%, 100%]
http://bit.ly/gold‐or‐repeated
Sheng, Provost, Ipeirotis, KDD 2008
Tradeoffs for Machine Learning Models
• Get more data → Improve model accuracy
• Improve data quality → Improve classification

[Figure: Accuracy vs. number of examples (Mushroom data set), for data quality of 50%, 60%, 80%, 100%]
Tradeoffs for Machine Learning Models
• Get more data: Active Learning, select which unlabeled example to label [Settles, http://active‐learning.net/]

• Improve data quality: Repeated Labeling, label again an already labeled example [Sheng et al. 2008, Ipeirotis et al, 2010]
Scaling Crowdsourcing: Iterative training
• Use model when confident, humans otherwise
• Retrain with new human input → improve model → reduce need for humans

[Diagram: New Case → Automatic Model (through machine learning) → Automatic Answer when confident; otherwise get human(s) to answer; human answers join the existing crowdsourced data used to retrain the model]
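A schematic of the loop in the diagram (all names here, model.predict_with_confidence, ask_humans, and model.retrain, are hypothetical stand‐ins, not a real API).

```python
def triage(new_cases, model, ask_humans, threshold=0.9):
    """Use the model when it is confident, humans otherwise;
    feed the human answers back as new training data."""
    answers, new_training_data = {}, []
    for case in new_cases:
        label, confidence = model.predict_with_confidence(case)   # hypothetical API
        if confidence >= threshold:
            answers[case] = label                # cheap automatic answer
        else:
            human_label = ask_humans(case)       # expensive human answer
            answers[case] = human_label
            new_training_data.append((case, human_label))
    model.retrain(new_training_data)             # model improves, fewer humans needed next round
    return answers
```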
Rule of Thumb Results
• With high quality labelers (80% and above): One worker per case (more data better)

• With low quality labelers (~60%): Multiple workers per case (to improve quality)

[Sheng et al, KDD 2008; Kumar and Lease, CSDM 2011]
Dawid & Skene meets a Classifier
• [Raykar et al. JMLR 2010]: Use the Dawid & Skene scheme but add a classifier as an additional worker

• The classifier in each iteration learns from the consensus labeling
Selective Repeated‐Labeling
• We do not need to label everything the same number of times
• Key observation: we have additional information to guide selection of data for repeated labeling: the current multiset of labels
• Example: {+,‐,+,‐,‐,+} vs. {+,+,+,+,+,+}
Label Uncertainty: Focus on uncertainty

• If we know worker qualities, we can estimate the log‐odds for each example

• Assign labels first to examples that are most uncertain (log‐odds close to 0 for the binary case)
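A tiny sketch (assuming, for simplicity, a common labeler quality p and a uniform prior) of how the label multiset translates into log‐odds, and how the most uncertain examples would be picked for relabeling first.

```python
import math

def label_log_odds(label_multiset, p):
    """Log-odds that the true binary label is '+', given the multiset of labels
    and a common labeler quality p (probability of being correct)."""
    weight = math.log(p / (1 - p))
    return weight * (label_multiset.count("+") - label_multiset.count("-"))

examples = {"A": list("+-+--+"), "B": list("++++++"), "C": list("++-+++")}
# Relabel the most uncertain examples first: log-odds closest to 0
for name in sorted(examples, key=lambda n: abs(label_log_odds(examples[n], 0.7))):
    print(name, round(label_log_odds(examples[name], 0.7), 2))   # A first, then C, then B
```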
Model Uncertainty (MU)

[Figure: examples labeled + / ‐ scattered around a model decision boundary]
• Learning models of the data provides an alternative source of information about label certainty

• Model uncertainty: get more labels for instances that cause model uncertainty
• Intuition?
– for modeling: why improve training data quality if the model is already certain there?
– for data quality, low‐certainty “regions” may be due to incorrect labeling of corresponding instances

“Self‐healing” process [Brodley et al, JAIR 1999; Ipeirotis et al, NYU 2010]
Adult content classification

[Figure: Round Robin vs. Selective labeling]
Too much theory?
Open source implementation available at:
http://code.google.com/p/get‐another‐label/
• Input: 
– Labels from Mechanical Turk
– Cost of incorrect labelings (e.g., X→G costlier than G→X)
• Output: 
– Corrected labels
– Worker error rates
– Ranking of workers according to their quality
Learning from imperfect data
• With inherently noisy data, it is good to have learning algorithms that are robust to noise.

[Figure: Accuracy vs. number of examples (Mushroom data set)]

• Or use techniques designed to explicitly handle noisy data
[Lugosi 1992; Smyth, 1995, 1996]
Outline
• Introduction: Human computation and crowdsourcing
• Managing quality for simple tasks
• Complex tasks using workflows
• Task optimization
• Incentivizing the crowd
• Market design
• Behavioral aspects and cognitive biases
• Game design
• Case studies
How to handle free‐form answers?
• Q: “My task does not have discrete answers….”
• A: Break into two HITs: 
– “Create” HIT
– “Vote” HIT

Creation HIT (e.g. find a URL about a topic) → Voting HIT: Correct or not?

• Vote HIT controls quality of Creation HIT
• Redundancy controls quality of Voting HIT

• Catch: If “creation” very good, in voting workers just vote “yes”
– Solution: Add some random noise (e.g. add typos)

Example: Collect URLs

But my free‐form answer is not just right or wrong… (e.g., “Describe this image”)

• “Create” HIT
• “Improve” HIT
• “Compare” HIT

Creation HIT (e.g. describe the image) → Improve HIT (e.g. improve description) → Compare HIT (voting): Which is better?
TurkIt toolkit [Little et al., UIST 2010]: http://groups.csail.mit.edu/uid/turkit/
version 1:
A parial view of a pocket calculator together with 
some coins and a pen.
version 2:
A view of personal items a calculator, and some gold and 
copper coins, and a round tip pen, these are all pocket
and wallet sized item used for business, writting, calculating 
prices or solving math problems and purchasing items.
version 3:
A close‐up photograph of the following items:  A CASIO 
multi‐function calculator. A ball point pen, uncapped. 
Various coins, apparently European, both copper and gold. 
Seems to be a theme illustration for a brochure or document cover treating finance, probably personal finance.
version 4:
…Various British coins; two of £1 value, three of 20p value and one of 1p value. …

version 8: 
“A close‐up photograph of the following items: A 
CASIO multi‐function, solar powered scientific 
calculator. A blue ball point pen with a blue rubber 
grip and the tip extended. Six British coins; two of £1 
value, three of 20p value and one of 1p value. Seems 
to be a  theme illustration for a brochure or 
document cover treating finance ‐ probably personal 
finance."
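A generic sketch of the create → improve → compare loop behind the versions above (not TurKit's actual API; the three callables are stand‐ins for posting the corresponding HITs).

```python
import random

def iterative_improvement(task, create, improve, compare, iterations=8, voters=3):
    """Create → Improve → Compare loop: each round one worker improves the
    current best text and a few voters decide whether to keep the improvement."""
    best = create(task)                                    # "Create" HIT
    for _ in range(iterations):
        candidate = improve(task, best)                    # "Improve" HIT
        votes = sum(compare(task, best, candidate) for _ in range(voters))  # "Compare" HITs
        if votes > voters / 2:                             # majority prefers the candidate
            best = candidate
    return best

# Toy stand-ins; in practice these would post tasks to workers
create = lambda task: "a calculator and some coins"
improve = lambda task, text: text + ", and a pen"
compare = lambda task, old, new: random.random() < 0.7     # voters usually prefer the new text
print(iterative_improvement("describe the image", create, improve, compare))
```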
Independence or Not?

• Building iteratively (lack of independence) allows better outcomes for the image description task…
• In the FoldIt game, workers built on each other’s results

[Little et al, HCOMP 2010]
Independence or Not?

• But lack of independence may cause high dependence on starting conditions and create groupthink
• …but also prevents disasters
[Little et al, HCOMP 2010]
Independence or Not?
Collective Problem Solving

• Exploration / exploitation tradeoff 
– Can accelerate learning, by sharing good solutions
– But can lead to premature convergence on a suboptimal solution

[Mason and Watts, submitted to Science, 2011]
Individual search strategy affects group success

• More players copying each other (i.e., fewer exploring) in the current round
→ Lower probability of finding the peak in the next round
The role of Communication Networks
• Examine various “neighbor” structures (who talks to whom about the oil levels)

Network structure affects individual search strategy

• Higher clustering 
→ Higher probability of neighbors guessing in an identical location

• More neighbors guessing in an identical location
→ Higher probability of copying
Diffusion of Best Solution
Individual search strategy affects group success

• No significant differences in the % of games in which the peak was found

• Network affects willingness to explore

Network structure affects group success
TurKontrol: Decision‐Theoretic Modeling

• Optimizing workflow execution using decision‐theoretic approaches [Dai et al, AAAI 2010; Kern et al. 2010]
• Significant work in control theory [Montgomery, 2007]

http://www.workflowpatterns.com

Common Workflow Patterns

Basic Control Flow
• Sequence
• Parallel Split
• Synchronization
• Exclusive Choice
• Simple Merge

Iteration
• Arbitrary Cycles (goto)
• Structured Loop (for, while, repeat)
• Recursion
Soylent
• Word processor with crowd embedded [Bernstein et al, UIST 2010]

• “Proofread paper”: Ask workers to proofread each paragraph
– Lazy Turker: Fixes the minimum possible (e.g., single typo)
– Eager Beaver: Fixes way beyond the necessary but adds extra errors (e.g., inline suggestions on writing style)

• Find‐Fix‐Verify pattern
– Separating Find and Fix does not allow the Lazy Turker
– Separating Fix and Verify ensures quality
Find: “Identify at least one area that can be shortened without changing the meaning of the paragraph.”
Independent agreement to identify patches

Fix: “Edit the highlighted section to shorten its length without changing the meaning of the paragraph.”
Randomize order of suggestions

Verify: “Choose at least one rewrite that has style errors, and at least one rewrite that changes the meaning of the sentence.”
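A generic sketch of the Find‐Fix‐Verify pattern (not Soylent's implementation; post_find, post_fix, and post_verify are hypothetical stand‐ins for posting the three HIT types).

```python
from collections import Counter

def find_fix_verify(paragraph, post_find, post_fix, post_verify,
                    n_find=5, min_agree=2, n_fix=3, n_verify=3):
    """Separate Find and Fix so a Lazy Turker cannot pick the single easiest fix;
    use Verify votes to filter out Eager Beaver rewrites."""
    # Find: independent workers mark patches; keep only patches several workers agree on
    patches = Counter(p for _ in range(n_find) for p in post_find(paragraph))
    agreed = [p for p, count in patches.items() if count >= min_agree]
    result = paragraph
    for patch in agreed:
        rewrites = [post_fix(paragraph, patch) for _ in range(n_fix)]        # Fix
        flags = {r: sum(post_verify(paragraph, patch, r) for _ in range(n_verify))
                 for r in rewrites}                                          # Verify
        best, n_flags = min(flags.items(), key=lambda kv: kv[1])
        if n_flags == 0:                       # apply only rewrites nobody flagged
            result = result.replace(patch, best)
    return result
```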
Crowd‐created Workflows: CrowdForge
• Map‐Reduce framework for crowds [Kittur et al, CHI 2011]

– Identify sights worth checking out (one tip per worker)
• Vote and rank
– Brief tips for each monument (one tip per worker)
• Vote and rank
– Aggregate tips into a meaningful summary
• Iterate to improve…

My Boss is a Robot (mybossisarobot.com),  Nikki Kittur (CMU) + Jim Giles (New Scientist)
Crowd‐created Workflows: TurkoMatic
• Crowd creates workflows

• Turkomatic [Kulkarni et al, CHI 2011]:

1. Ask workers to decompose the task into steps (Map)
2. Can the step be completed within 10 minutes?
   1. Yes: solve it.
   2. No: decompose further (recursion)
3. Given all partial solutions, solve the big problem (Reduce)
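A hedged sketch of the recursive decompose‐and‐solve idea above (the callables are hypothetical stand‐ins for the corresponding worker tasks, not Turkomatic's actual interface).

```python
def decompose_and_solve(task, fits_in_10_minutes, ask_solve, ask_decompose):
    """Workers either solve a small-enough step or split it into sub-steps (Map);
    the partial solutions are merged afterwards (Reduce)."""
    if fits_in_10_minutes(task):
        return [ask_solve(task)]                 # base case: a worker solves the step
    partial_solutions = []
    for step in ask_decompose(task):             # workers break the task into steps
        partial_solutions.extend(
            decompose_and_solve(step, fits_in_10_minutes, ask_solve, ask_decompose))
    return partial_solutions                     # to be aggregated by a final "Reduce" HIT
```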
Crowdsourcing Patterns

Creation
• Generate / Create
• Find
• Improve / Edit / Fix

Quality Control
• Vote for accept‐reject
• Vote up, vote down, to generate rank
• Vote for best / select top‐k

Flow Control
• Split task
• Aggregate
Outline
• Introduction: Human computation and crowdsourcing
• Managing quality for simple tasks
• Complex tasks using workflows
• Task optimization
• Incentivizing the crowd
• Market design
• Behavioral aspects and cognitive biases
• Game design
• Case studies
Defining Task Parameters
Three main goals:

• Minimize Cost (cheap)
• Maximize Quality (good)
• Minimize Completion Time (fast)
Effect of Payment: Quality
• Cost does not affect quality [Mason and Watts, 2009, AdSafe]
• Similar results for bigger tasks [Ariely et al, 2009]

[Figure: Error rate vs. number of labelers, for payments of 2, 5, and 10 cents]
Effect of Payment: #Tasks
• Payment incentives increase speed, though

[Mason and Watts, 2009]
Predicting Completion Time
• Model timing of individual tasks [Yan, Kumar, Ganesan, 2010]
– Assume a rate of task completion λ
– Exponential distribution for a single task
– Erlang distribution for sequential tasks (see the sketch below)
– On‐the‐fly estimation of λ for parallel tasks
• Optimize using early acceptance/termination 
– Sequential experiment setting
– Stop early if confident
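A small sketch of that timing model (my own illustration, with a simulated workload): exponential single‐task times with rate λ, the Erlang distribution for k sequential tasks, and an on‐the‐fly estimate of λ.

```python
import math, random

def erlang_cdf(t, k, lam):
    """P[k sequential tasks finish within time t] when each task completes at
    rate lam (i.e., the Erlang(k, lam) CDF)."""
    return 1 - math.exp(-lam * t) * sum((lam * t) ** n / math.factorial(n) for n in range(k))

def estimate_rate(durations):
    """On-the-fly maximum-likelihood estimate of lam from observed task durations."""
    return len(durations) / sum(durations)

durations = [random.expovariate(2.0) for _ in range(50)]   # simulated tasks, true lam = 2 per hour
lam_hat = estimate_rate(durations)
print(round(lam_hat, 2), round(erlang_cdf(t=6.0, k=10, lam=lam_hat), 3))  # P[10 tasks done in 6 hours]
```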
Predicting Completion Time

• For Freebase, workers take a log‐normal time to complete a task [Kochhar et al, HCOMP 2010]
Predicting Completion Time
• Exponential assumption usually not realistic
• Heavy‐tailed distribution [Ipeirotis, XRDS 2010]
Effect of #HITs: Monotonic, but sublinear

h(t) = 0.998^#HITs

• 10 HITs → 2% slower than 1 HIT
• 100 HITs → 19% slower than 1 HIT 
• 1000 HITs → 87% slower than 1 HIT 
or, 1 group of 1000 → 7 times faster than 1000 sequential groups of 1
[Wang et al, CSDM 2011]
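Plugging the fitted relation above into a couple of lines (a sketch; reading h as a completion‐hazard multiplier is my interpretation of the slide).

```python
def hazard_multiplier(n_hits, base=0.998):
    """Relative completion hazard of a HIT posted in a group of n_hits,
    using the fitted relation h = 0.998 ** #HITs from the slide."""
    return base ** n_hits

for n in (10, 100, 1000):
    print(n, f"{1 - hazard_multiplier(n):.0%} slower than a single HIT")
# prints roughly 2%, 18%, 86% - close to the 2% / 19% / 87% figures quoted above
```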
HIT Topics
topic 1 : cw castingwords  podcast  transcribe  english  mp3  edit  confirm  snippet  grade

topic 2:  data  collection  search  image  entry  listings  website  review  survey  opinion

topic 3:  categorization  product  video  page  smartsheet web  comment  website  opinion

topic 4:  easy  quick  survey  money  research  fast  simple  form  answers  link

topic 5:  question  answer  nanonano dinkle article  write  writing  review  blog  articles

topic 6:  writing  answer  article  question  opinion  short  advice  editing  rewriting  paul

topic 7:  transcribe  transcription  improve  retranscribe edit  answerly voicemail  answer

[Wang et al, CSDM 2011]
Effect of Topic: The CastingWords Effect

topic 1 : cw castingwords  podcast  transcribe  english  mp3  edit  confirm  snippet  grade
topic 2:  data  collection  search  image  entry  listings  website  review  survey  opinion
topic 3:  categorization  product  video  page  smartsheet web  comment  website  opinion
topic 4:  easy  quick  survey  money  research  fast  simple  form  answers  link
topic 5:  question  answer  nanonano dinkle article  write  writing  review  blog  articles
topic 6:  writing  answer  article  question  opinion  short  advice  editing  rewriting  paul
topic 7:  transcribe  transcription  improve  retranscribe edit  answerly voicemail  query  question  answer

[Wang et al, CSDM 2011]
Effect of Topic: Surveys=fast (even with redundancy!)

topic 1 : cw castingwords  podcast  transcribe  english  mp3  edit  confirm  snippet  grade
topic 2:  data  collection  search  image  entry  listings  website  review  survey  opinion
topic 3:  categorization  product  video  page  smartsheet web  comment  website  opinion
topic 4:  easy  quick  survey  money  research  fast  simple  form  answers  link
topic 5:  question  answer  nanonano dinkle article  write  writing  review  blog  articles
topic 6:  writing  answer  article  question  opinion  short  advice  editing  rewriting  paul
topic 7:  transcribe  transcription  improve  retranscribe edit  answerly voicemail  query  question  answer

[Wang et al, CSDM 2011]
Effect of Topic: Writing takes time

topic 1 : cw castingwords  podcast  transcribe  english  mp3  edit  confirm  snippet  grade
topic 2:  data  collection  search  image  entry  listings  website  review  survey  opinion
topic 3:  categorization  product  video  page  smartsheet web  comment  website  opinion
topic 4:  easy  quick  survey  money  research  fast  simple  form  answers  link
topic 5:  question  answer  nanonano dinkle article  write  writing  review  blog  articles
topic 6:  writing  answer  article  question  opinion  short  advice  editing  rewriting  paul
topic 7:  transcribe  transcription  improve  retranscribe edit  answerly voicemail  query  question  answer

[Wang et al, CSDM 2011]
Optimizing Completion Time
• Workers pick tasks that have a large number of HITs or are recent [Chilton et al., HCOMP 2010]
• VizWiz optimizations [Bigham, UIST 2011]:

– Posts HITs continuously (to be recent) 
– Makes big HIT groups (to be large)
– HITs are “external HITs” (i.e., IFRAME hosted)
– HITs populated when the worker accepts them
Optimizing Completion Time
• Completion rate varies with time of day, depending on the audience location (India vs US vs Middle East)

• Quality tends to remain the same, independent of completion time 
[Huang et al., HCOMP 2010]
Other Optimizations
• Qurk [Marcus et al., CIDR 2011] and CrowdDB [Franklin et al., SIGMOD 2011]: Treat humans as uncertain UDFs + apply relational optimization, plus the “GoodEnough” and “StopAfter” operators.

• CrowdFlow [Quinn et al.]: Integrate crowd with machine learning to reach a balance of speed, quality, cost

• Ask humans for directions in a graph: [Parameswaran et al., VLDB 2011]. See also [Kleinberg, Nature 2000; Mitzenmacher, XRDS 2010; Deng, ECCV 2010]
Outline
• Introduction: Human computation and crowdsourcing
• Managing quality for simple tasks
• Complex tasks using workflows
• Task optimization
• Incentivizing the crowd
• Market design
• Behavioral aspects and cognitive biases
• Game design
• Case studies
Incentives
• Monetary
• Self‐serving
• Altruistic
Incentives: Money
• Money does not improve quality but (generally) increases participation [Ariely, 2009; Mason & Watts, 2009]

• But workers may be “target earners” (stop after reaching their daily goal) [Horton & Chilton, 2010 for MTurk; Camerer et al. 1997, Farber 2008, for taxi drivers; Fehr and Goette 2007]
Incentives: Money and Trouble
• Careful: Paying a little is often worse than paying nothing! 
– “Pay enough or not at all” [Gneezy et al, 2000]
– Small pay now locks future pay
– Payment replaces internal motivation (paying kids to collect donations decreased enthusiasm; spam classification; “thanks for dinner, here is $100”)
– Lesson: Be the Tom Sawyer (“how I like painting the fence”), not the scrooge‐y boss…

• Paying a lot is a counter‐incentive: 
– People focus on the reward and not on the task
– On MTurk spammers routinely attack highly‐paying tasks
Incentives
• Monetary
• Self‐serving
• Altruistic
Incentives: Leaderboards
• Leaderboards (“top participants”) are a frequent motivator
– Should motivate correct behavior, not just measurable behavior
– Newcomers should have hope of reaching the top
– Whatever is measured, workers will optimize for this (e.g., Orkut country leaderboard; complaints for quality score drops)
– Design guideline: Christmas‐tree dashboard (Green / Red lights only)

[Farmer and Glass, 2010]
Incentives: Purpose of Work
• Contrafreeloading: Rats and other animals prefer to “earn” their food

• Destroying work after production demotivates workers [Ariely et al, 2008]

• Showing the result of a “completed task” improves satisfaction
Incentives: Purpose of Work
• Workers enjoy learning new skills (oft‐cited reason for MTurk participation)

• Design tasks to be educational
– DuoLingo: Translate while learning a new language [von Ahn et al, duolingo.com]
– Galaxy Zoo, Clickworkers: Classify astronomical objects [Raddick et al, 2010; http://en.wikipedia.org/wiki/Clickworkers]
– Citizen Science: Learn about biology [http://www.birds.cornell.edu/citsci/]
– National Geographic “Field Expedition: Mongolia”: tag potential archeological sites, learn about archeology
Incentives: Credit and Participation
• Public credit contributes to a sense of participation
• Credit is also a form of reputation

• (The anonymity of MTurk‐like settings discourages this factor)
Incentives
• Monetary
• Self‐serving
• Altruistic
Incentive: Altruism
• Contributing back (tit for tat): Early reviewers wrote reviews because they had read other useful reviews

• Effect amplified in social networks: “If all my friends do it…” or “Since all my friends will see this…”

• Contributing to a shared goal
Incentives: Altruism and Purpose
• On MTurk [Chandler and Kapelner, 2010]
– Americans [older, more leisure‐driven] work harder for “meaningful work”
– Indians [more income‐driven] were not affected 
– Quality unchanged for both groups
Incentives: Fair share
• Anecdote: Same HIT (spam classification)
– Case 1: Requester doing it as a side‐project, to “clean the market”; would be an out‐of‐pocket expense, no pay to workers
– Case 2: Requester is a researcher at a university, spam classification now a university research project, payment to workers

What setting worked best?
Incentives: FUN!
• Game‐ify the task (design details later)
• Examples
– ESP Game: Given an image, type the same word (generated image descriptions)
– Phylo: align colored blocks (used for genome alignment)
– FoldIt: fold structures to optimize energy (protein folding)

• Fun factors [Malone 1980, 1982]:
– timed response
– score keeping
– player skill level
– high‐score lists
– and randomness
Outline
• Introduction: Human computation and crowdsourcing
• Managing quality for simple tasks
• Complex tasks using workflows
• Task optimization
• Incentivizing the crowd
• Market design
• Behavioral aspects and cognitive biases
• Game design
• Case studies
Market Design Organizes the Crowd
• Reputation Mechanisms 
– Seller‐side: Ensure worker quality 
– Buyer‐side: Ensure employer trustworthiness

• Task organization for task discovery (worker finds employer/task)

• Worker expertise recording for task assignment 
(employer/task finds worker)
Lack of Reputation and Market for Lemons
• “When the quality of a sold good is uncertain and hidden before the transaction, the price goes to the value of the lowest‐valued good” [Akerlof, 1970; Nobel prize winner]

Market evolution steps:
1. Employer pays $10 to a good worker, $0.1 to a bad worker
2. 50% good workers, 50% bad; indistinguishable from each other
3. Employer offers a price in the middle: $5
4. Some good workers leave the market (pay too low)
5. Employer revises prices downwards as the % of bad workers increases
6. More good workers leave the market… death spiral 

http://en.wikipedia.org/wiki/The_Market_for_Lemons
Lack of Reputation and Market for Lemons
• Market for lemons also on the employer side:
– Workers distrust (good) newcomer employers: charge a risk premium, or work only for a little bit. Good newcomers get disappointed
– Bad newcomers have no downside (will not pay), continue to offer work.
– Market floods with bad employers
• TurkOpticon, external reputation system
• “Mechanical Turk: Now with 40.92% spam” http://bit.ly/ew6vg4 

• Gresham's Law: the bad drives out the good
• No‐trade equilibrium: no good employer offers work in a market with bad workers, no good worker wants to work for bad employers…
• In reality, we need to take into consideration that this is a repeated game (but participation follows a heavy tail…)
http://en.wikipedia.org/wiki/The_Market_for_Lemons
Reputation systems
• Significant number of reputation mechanisms [Dellarocas et al, 2007]

• Link analysis techniques [TrustRank, EigenTrust, NodeRanking, NetProbe, Snare] often applicable
Challenges in the Design of Reputation Systems

• Insufficient participation

• Overwhelmingly positive feedback

• Dishonest reports

• Identity changes

• Value imbalance exploitation (“milking the reputation”)
Insufficient Participation

• Free‐riding: feedback constitutes a public good. Once available, everyone can costlessly benefit from it.
• Disadvantage of early evaluators: provision of feedback presupposes that the rater will assume the risks of transacting with the ratee (competitive advantage to others).

• [Avery et al. 1999] propose a mechanism whereby early evaluators are paid to provide information and later evaluators pay to balance the budget.
Overwhelmingly Positive Feedback (I)

More than 99% of all feedback posted on eBay is positive. However, Internet auctions accounted for 16% of all consumer fraud complaints received by the Federal Trade Commission in 2004. (http://www.consumer.gov/sentinel/)
Reporting Bias

The perils of reciprocity:
• Reciprocity: Seller evaluates buyer, buyer evaluates seller
• Exchange of courtesies
• Positive reciprocity: positive ratings are given in the hope 
of getting a positive rating in return
• Negative reciprocity: negative ratings are avoided because 
of fear of retaliation from the other party
Overwhelmingly Positive Feedback (II)

“The sound of silence”: No news, bad news…
• [Dellarocas and Wood 2008] Explore the frequency of 
different feedback patterns and use the non‐reports to 
compensate for reporting bias.
• eBay traders are more likely to post feedback when satisfied than when dissatisfied
• Support presence of positive and negative reciprocation 
among eBay traders.
Dishonest Reports

• “Ballot stuffing” (unfairly high ratings): a seller colludes with a group of buyers in order to be given unfairly high ratings by them.
• “Bad‐mouthing” (unfairly low ratings): Sellers can collude with buyers in order to “bad‐mouth” other sellers that they want to drive out of the market.

• Design incentive‐compatible mechanisms to elicit honest feedback
[Jurca and Faltings 2003: pay the rater if the report matches the next one; 
Miller et al. 2005: use a proper scoring rule to price the value of a report;
Papaioannou and Stamoulis 2005: delay the next transaction over time]
• Use the “latent class” models described earlier in the tutorial (reputation systems are a form of crowdsourcing after all…)
Identity Changes

• “Cheap pseudonyms”: easy to disappear and re‐register under a new identity with almost zero cost. [Friedman and Resnick 2001]

• Introduce opportunities to misbehave without paying reputational consequences.  

• Increase the difficulty of online identity changes
• Impose upfront costs to new entrants: allow new identities 
(forget the past) but make it costly to create them
Value Imbalance Exploitation

Three men attempted to sell a fake painting on eBay for US$135,805. The sale was abandoned just prior to purchase when the buyer became suspicious. (http://news.cnet.com/2100‐1017‐253848.html)

• Reputation can be seen as an asset, not only to promote oneself, but also as something that can be cashed in through a fraudulent transaction with high gain.

“The Market for Evaluations”
The Market for Positive Feedback

A selling strategy where eBay users actually use the feedback market for gains in other markets:
“Riddle for a PENNY! No shipping ‐ Positive Feedback”

• 29‐cent loss even in the event of a successful sale
• Price low, to speed feedback accumulation

Possible solutions:
• Make the details of the transaction (besides the feedback itself) visible to other users
• Transaction‐weighted reputational statistics 

[Brown 2006]
Challenges for Crowdsourcing Markets (I)
• Two‐sided opportunistic behavior
• Reciprocal systems are worse than one‐sided evaluation. In e‐commerce markets, only sellers are likely to behave opportunistically. No need for reciprocal evaluation!
• In crowdsourcing markets, both sides can be fraudulent. Reciprocal systems are fraught with problems, though!

• Imperfect monitoring and heavy‐tailed participation
• In e‐commerce markets, buyers can assess product quality directly upon receiving it.
• In crowdsourcing markets, verifying the answers is sometimes as costly as providing them.
• Sampling often does not work, due to the heavy‐tailed participation distribution (lognormal, according to self‐reported surveys)
Challenges for Crowdsourcing Markets (II)

• Constrained capacity of workers
• In e‐commerce markets, sellers usually have an unlimited supply of products.
• In crowdsourcing, workers have constrained capacity (cannot be recommended continuously)

• No “price premium” for high‐quality workers
• In e‐commerce markets, sellers with high reputation can sell their products at a relatively high price (premium).
• In crowdsourcing, it is the requester who sets the prices, which are generally the same for all the workers.
Market Design Organizes the Crowd
• Reputation Mechanisms 
– Seller‐side: Ensure worker quality 
– Buyer‐side: Ensure employer trustworthiness

• Task organization for task discovery (worker finds employer/task)

• Worker expertise recording for task assignment 
(employer/task finds worker)
The Importance of Task Discovery

• Heavy‐tailed distribution of completion times. Why?
[Ipeirotis, “Analyzing the Amazon Mechanical Turk marketplace”, XRDS 2010]
The Importance and Danger of Priorities
• [Barabasi, Nature 2005] showed that human actions have power‐law completion times
– Mainly a result of prioritization
– When tasks are ranked by priorities, power‐law results

• [Cobham, 1954] If a queuing system completes tasks with two priority queues, and λ=μ, then power‐law completion times

• [Chilton et al., HCOMP 2010] Workers on Turk pick tasks from “most HITs” or “most recent” queues
The UI hurts the market!

• Practitioners know that HITs on the 3rd page and after are not picked by workers. 
• Many such HITs are left to expire after months, never completed.

• A badly designed task discovery interface hurts every participant in the market! (and the reason for scientific modeling…)
• Better modeling as a queuing system may 
demonstrate other such improvements
Market Design Organizes the Crowd
• Reputation Mechanisms 
– Seller‐side: Ensure worker quality 
– Buyer‐side: Ensure employer trustworthiness

• Task organization for task discovery (worker finds employer/task)

• Worker expertise recording for task assignment 
(employer/task finds worker)
Expert Search
• Find the best worker for a task (or within a task)

• For a task:
– Significant amount of research in the topic of expert search [TREC track; Macdonald and Ounis, 2006]
– Check quality of workers across tasks
http://url‐annotator.appspot.com/Admin/WorkersReport

• Within a task: [Donmez et al., 2009; Welinder, 2010]
Directions for future research
• Optimize allocation of tasks to workers based on completion time and expected quality

• Explicitly take into consideration competition in the market and switch tasks for a worker only when the benefit outweighs the switching overhead (cf. task switching in a CPU by the O/S)

• Recommender system for tasks (“workers like you performed well in…”)

• Create a market with dynamic pricing for tasks, following the pricing model of the stock market (prices increase for a task when work supply is low, and vice versa)
Outline
• Introduction: Human computation and crowdsourcing
• Managing quality for simple tasks
• Complex tasks using workflows
• Task optimization
• Incentivizing the crowd
• Market design
• Behavioral aspects and cognitive biases
• Game design
• Case studies
Human Computation
• Humans are not perfect mathematical models

• They exhibit noisy, stochastic behavior…

• And exhibit common and systematic biases
Score the following from 1 to 10 
1: not particularly bad or wrong
10: extremely evil 

a) Stealing a towel from a hotel 
b) Keeping a dime you find on the ground 
c) Poisoning a barking dog

[Parducci, 1968]
Score the following from 1 to 10 
1: not particularly bad or wrong
10: extremely evil 

a) Testifying falsely for pay
b) Using guns on striking workers
c) Poisoning a barking dog
c) Poisoning a barking dog

[Parducci, 1968]
Anchoring
• “Humans start with a first approximation (anchor) and then make adjustments to that number based on additional information.” [Tversky & Kahneman, 1974]

• [Paolacci et al, 2010]
– Q1a: More or less than 65 African countries in the UN?
– Q1b: More or less than 12 African countries in the UN?

– Q2: How many countries in Africa?
– Group A mean: 42.6
– Group B mean: 18.5
Anchoring
• Participants write down the last digit of their social security number before placing a bid for wine bottles. Users with lower SSN digits bid lower…

• In the Netflix contest, users with high ratings early on were biased towards higher ratings later in a session…

• Crowdsourcing tasks can be affected by anchoring. [Mozer et al, NIPS 2010] describe techniques for removing these effects
Priming
• Exposure to one stimulus influences the response to another
• Stereotypes: 
– Asian‐Americans perform better in math
– Women perform worse in math

• [Shih et al., 1999] asked Asian‐American women:
– Questions about race: They did better on a math test
– Questions about gender: They did worse on a math test
Exposure Effect
• Familiarity leads to liking...

• [Stone and Alonso, 2010]: Evaluators of the Bing search engine increase their ratings of relevance over time, for the same results
Framing
• Presenting the same option in different formats leads to different decisions. People avert options that imply loss [Tversky and Kahneman (1981)]
Kahneman (1981)]
Framing: 
600 people affected by a deadly disease

Room 1
a) save 200 people's lives
b) 33% chance of saving all 600 people and a 66% chance of saving no one

• 72% of participants chose option A
• 28% of participants chose option B

Room 2
c) 400 people die
d) 33% chance that no people will die; a 66% chance that all 600 will die

• 78% of participants chose option D (equivalent to option B)
• 22% of participants chose option C (equivalent to option A)

People avert options that imply loss 
Very long list of cognitive biases…
• http://en.wikipedia.org/wiki/List_of_cognitive_biases

• [Mozer et al., 2010] try to learn and remove sequential effects 
from human computation data…
Outline
• Introduction: Human computation and crowdsourcing
• Managing quality for simple tasks
• Complex tasks using workflows
• Task optimization
• Incentivizing the crowd
• Market design
• Behavioral aspects and cognitive biases
• Game design
• Case studies
Games with a Purpose
[Luis von Ahn and Laura Dabbish, CACM 2008]

Three generic game structures

• Output agreement: 
– Type the same output
• Input agreement: 
– Decide if you have the same input
• Inversion problem: 
– P1 generates output from input
– P2 looks at P1's output and guesses P1's input
Output Agreement: ESP Game
• Players look at common input
• Need to agree on output
Improvements
• Game‐theoretic analysis indicates that players will converge to easy words [Jain and Parkes]
• Solution 1: Add “Taboo words” to prevent guessing easy words
• Solution 2: KissKissBan, a third player tries to guess (and block) the agreement
Input Agreement: TagATune
• Sometimes difficult to type identical output (e.g., “describe this song”)
• Show the same or different inputs, let users describe them, ask players if they have the same input
Inversion Problem: Peekaboom
• Non‐symmetric players
• Input: Image with a word
• Player 1 slowly reveals the picture
• Player 2 tries to guess the word

[Screenshot: Peekaboom round, with hints; the word to guess is BUSH]
Protein folding
• Protein folding: Proteins fold from long chains into small balls, each in a very specific shape

• The shape is the lower‐energy setting, which is the most stable

• Fold shape is very important to understand interactions with other molecules

• Extremely expensive computationally! (too many 
degrees of freedom)
FoldIt Game
• Humans are very good at reducing the search space

• Humans try to fold the protein into a minimal 
energy state. 

• Can leave protein unfinished and let others try 
from there…
Outline
• Introduction: Human computation and crowdsourcing
• Managing quality for simple tasks
• Complex tasks using workflows
• Task optimization
• Incentivizing the crowd
• Market design
• Behavioral aspects and cognitive biases
• Game design
• Case studies
Case Study: Freebase

Praveen Paritosh, Google

Crowdsourcing Case Study: AdSafe
A few of the tasks in the past
• Detect pages that discuss swine flu
– Pharmaceutical firm had a drug “treating” (off-label) swine flu
– FDA prohibited the pharmaceutical company from displaying the drug ad on pages about swine flu
– Two days to build and go live

• Big fast-food chain does not want its ad to appear:
– In pages that discuss the brand (99% negative sentiment)
– In pages discussing obesity
– Three days to build and go live
Need to build models fast

• Traditionally, modeling teams have invested substantial internal resources in data formulation, information extraction, cleaning, and other preprocessing
No time for such things…
• However, now we can outsource preprocessing tasks, such as labeling, feature extraction, verifying information extraction, etc.
– using Mechanical Turk, oDesk, etc.
– quality may be lower than expert labeling (much?) 
– but low costs can allow massive scale
AdSafe workflow
• Find URLs for a given topic (hate speech, gambling, alcohol abuse, guns, bombs, celebrity gossip, etc.)
http://url‐collector.appspot.com/allTopics.jsp
• Classify URLs into appropriate categories
http://url‐annotator.appspot.com/AdminFiles/Categories.jsp
• Measure quality of the labelers and remove spammers (see the sketch after this list)
http://qmturk.appspot.com/
• Get humans to “beat” the classifier by providing cases where 
the classifier fails
http://adsafe‐beatthemachine.appspot.com/

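A minimal Python sketch of one way to flag likely spammers: score each worker by how often they agree with the per-task majority vote. This is an illustrative assumption for the “measure quality and remove spammers” step, not the actual logic behind qmturk.appspot.com; the 0.5 threshold, the data layout, and all names are made up for the example.

# Score workers by agreement with the majority vote; low scores suggest spammers.
# Threshold and data layout are illustrative assumptions.

from collections import Counter, defaultdict

def majority_votes(labels):
    """labels: dict task_id -> list of (worker_id, label)."""
    return {task: Counter(l for _, l in votes).most_common(1)[0][0]
            for task, votes in labels.items()}

def worker_agreement(labels):
    majority = majority_votes(labels)
    agree, total = defaultdict(int), defaultdict(int)
    for task, votes in labels.items():
        for worker, label in votes:
            total[worker] += 1
            agree[worker] += int(label == majority[task])
    return {w: agree[w] / total[w] for w in total}

labels = {
    "url1": [("w1", "gambling"), ("w2", "gambling"), ("w3", "news")],
    "url2": [("w1", "alcohol"), ("w2", "alcohol"), ("w3", "news")],
}
scores = worker_agreement(labels)
suspected_spammers = [w for w, s in scores.items() if s < 0.5]  # assumed cutoff
print(scores, suspected_spammers)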
Case Study: OCR and ReCAPTCHA
Scaling Crowdsourcing: Use Machine Learning

Need to scale crowdsourcing


Basic idea: Build a machine learning model and use it
instead of humans

[Diagram: New case → Automatic Model (built through machine learning) → Automatic Answer; Existing data (collected through crowdsourcing) is used to train the model]
Scaling Crowdsourcing: Iterative training
Triage:
– machine answers when confident
– humans answer when the model is not confident
Retrain using the new human input
→ improve the model
→ reduce the need for human input (see the sketch below)

[Diagram: New case → Automatic Model (built through machine learning) → Automatic Answer when confident; otherwise get human(s) to answer; the human answers join the data from existing crowdsourced answers and the model is retrained]
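A minimal Python sketch of this triage-and-retrain loop, assuming a scikit-learn-style classifier; the confidence threshold, the toy features, and get_human_label() are illustrative assumptions, not part of the tutorial.

# Confidence-based triage: machine answers when confident, otherwise route to
# humans and retrain on their answers. Threshold, model, and get_human_label()
# are illustrative assumptions; any classifier exposing predict_proba() would do.

import numpy as np
from sklearn.linear_model import LogisticRegression

CONFIDENCE_THRESHOLD = 0.9  # assumed; tune for the quality/cost trade-off

def get_human_label(x):
    """Placeholder for posting the case to a crowdsourcing platform."""
    return int(x.sum() > 2.5)  # stand-in for a real human answer

# Seed data "from existing crowdsourced answers" (synthetic here)
X_train = np.random.rand(200, 5)
y_train = (X_train.sum(axis=1) > 2.5).astype(int)
model = LogisticRegression().fit(X_train, y_train)

def answer(x):
    global model, X_train, y_train
    proba = model.predict_proba(x.reshape(1, -1))[0]
    if proba.max() >= CONFIDENCE_THRESHOLD:
        return int(proba.argmax())          # machine answers when confident
    label = get_human_label(x)              # otherwise route to a human
    X_train = np.vstack([X_train, x])       # add the human input to the data
    y_train = np.append(y_train, label)
    model = LogisticRegression().fit(X_train, y_train)  # retrain the model
    return label

for case in np.random.rand(10, 5):
    print(answer(case))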
Scaling Crowdsourcing: Iterative training, with noise
Machine when confident, humans otherwise
Ask as many humans as necessary to ensure quality

[Diagram: New case → Automatic Model (built through machine learning) → Automatic Answer when confident about quality; when not confident, get human(s) to answer until confident about quality; the answers join the data from existing crowdsourced answers]
Scaling Crowdsourcing: Iterative training, with noise
Machine when confident, humans otherwise
Ask as many humans as necessary to ensure quality
– Or even get other machines to answer… (see the sketch below)

[Diagram: New case → Automatic Model (built through machine learning) → Automatic Answer when confident about quality; when not confident, get human(s) or other machines to answer until confident about quality; the answers join the data from existing crowdsourced answers]
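A minimal Python sketch of the “ask as many humans (or machines) as necessary to ensure quality” step: keep collecting labels until the majority label reaches an assumed agreement threshold or a per-case budget is exhausted. The threshold, budget, and ask_one_worker() are illustrative assumptions.

# Repeated labeling until confident about quality, or until budget runs out.
# Threshold, budget, and ask_one_worker() are illustrative assumptions;
# ask_one_worker() could call a human or another machine classifier.

import random
from collections import Counter

AGREEMENT_THRESHOLD = 0.8   # assumed fraction of votes for the top label
MAX_LABELS = 7              # assumed per-case budget

def ask_one_worker(case):
    """Placeholder for requesting one more human (or machine) label."""
    return random.choice(["relevant", "relevant", "not_relevant"])

def label_until_confident(case):
    votes = []
    while len(votes) < MAX_LABELS:
        votes.append(ask_one_worker(case))
        top_label, top_count = Counter(votes).most_common(1)[0]
        if len(votes) >= 3 and top_count / len(votes) >= AGREEMENT_THRESHOLD:
            return top_label, votes        # confident about quality
    return Counter(votes).most_common(1)[0][0], votes  # budget exhausted

print(label_until_confident("http://example.com/page"))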
Example: ReCAPTCHA + Google Books

[reCAPTCHA example image showing the words “portion” and “distinguished”]

Fixes errors of Optical Character Recognition (OCR: ~1% error rate, rising to 20%–30% for 18th‐ and 19th‐century books, according to today’s NY Times article)
Further improves the OCR algorithm, reducing the error rate
“40 million ReCAPTCHAs every day” (2008): fixing 40,000 books a day
– [Unofficial quote from Luis]: 400M/day (2010)
– All books ever written:100 million books (~12yrs??)
Thank you!
References
• David Alan Grier: When Computers Were Human, Princeton University Press, 2005
• [ESP Game] Luis von Ahn and Laura Dabbish, Labeling images with a computer game, CHI 2004
• [ImageNet] http://www.image‐net.org/about‐publication
• O. Alonso, G. Kazai, and S. Mizzaro. Crowdsourcing for search engine evaluation, Springer 2011
• Kuncheva et al., Limits on the majority vote accuracy in classifier fusion, Pattern Analysis and Applications, 2003
• Robert T. Clemen, Robert L. Winkler, Limits for the Precision and Value of Information from Dependent Sources, Operations Research, 1985
• Robert T. Clemen and Robert L. Winkler, Unanimity and Compromise among Probability Forecasters, 
Management Science, 1990
• J. Whitehill, P. Ruvolo, T. Wu, J. Bergsma, and J. Movellan. Whose vote should count more: Optimal integration of labels from labelers of unknown expertise. In NIPS, 2009
• Peter Welinder, Steve Branson, Serge Belongie, Pietro Perona, The Multidimensional Wisdom of Crowds, In 
NIPS, 2010
• Wim J. van der Linden, Ronald K. Hambleton, Handbook of modern item response theory, Springer 1997
• Uebersax, A latent trait finite mixture model for the analysis of rating agreement, Biometrics 1993
• Uebersax, Statistical Modeling of Expert Ratings on Medical Treatment Appropriateness, JASA 1993
• Wang, Ipeirotis, Provost, Managing Crowdsourced Workers, Workshop on Business Intelligence 2011
• Donmez et al., Efficiently Learning the Accuracy of Labeling Sources for Selective Sampling, KDD 2009
• Donmez et al., A Probabilistic Framework to Learn from Multiple Annotators with Time‐Varying Accuracy, SDM 
2010
References
• Ariely et al., The Effects of Averaging Subjective Probability Estimates Between and Within Judges, Journal of Experimental Psychology 2000
• Clemen and Winkler Combining probability distributions from experts in risk analysis, Risk Analysis 1999
• Burr Settles, Active Learning Literature Survey, Computer Science Technical Report 1648, University of 
Wisconsin‐Madison
• Kumar and Lease, Modeling Annotator Accuracies for Supervised Learning, CSDM 2011
• Ipeirotis et al., Repeated Labeling Using Multiple Noisy Labelers, NYU Working Paper CeDER‐10‐03
• Carla E. Brodley ,  Mark A. Friedl, Identifying Mislabeled Training Data JAIR 1999
• Little, Chilton, Goldman, Miller: TurKit: human computation algorithms on mechanical turk. UIST 2010
• Dai, Mausam, Weld, Decision‐Theoretic Control of Crowd‐Sourced Workflows, AAAI 2010
• Kern, Thies, and Satzger, Statistical Quality Control for Human‐based Electronic Services, ICSOC 2010
• Montgomery, Introduction to statistical quality control, Wiley 2007
• Aniket Kittur, Boris Smus, Robert E. Kraut, CrowdForge: Crowdsourcing Complex Work, CHI 2011
• Anand Kulkarni, Matthew Can, Turkomatic: Automatic Recursive Task Design for Mechanical Turk, 
CHI2011
• Haoqi Zhang, Eric Horvitz, Rob C. Miller, David C. Parkes, Crowdsourcing General Computation, CHI crowdsourcing workshop 2011
• Eric Huang, Haoqi Zhang, David Parkes, Krzysztof Gajos, and Yiling Chen, Toward Automatic Task Design: 
A Progress Report, HCOMP 2010
References
• Mason and Watts, Financial Incentives and the “Performance of Crowds”, HCOMP 2009
• Yan, Kumar, Ganesan, “CrowdSearch: Exploiting Crowds for Accurate Real‐time Image Search on Mobile Phones”, MobiSys 2010
• Ipeirotis, Analyzing the Mechanical Turk Marketplace, XRDS 2010
• Wang, Faridani, Ipeirotis, Estimating Completion Time for Crowdsourced Tasks Using Survival Analysis 
Models, CSDM 2010
• Chilton et al, Task search in a human computation market, HCOMP 2010
• Bingham et al, VizWiz: nearly real‐time answers to visual questions, UIST 2011
• Horton and Chilton: The labor economics of paid crowdsourcing. EC 2010
• Ariely et al., Large stakes and big mistakes, Review of Economic Studies, 2009
• Huang et al., Toward Automatic Task Design: A Progress Report, HCOMP 2010
• Quinn, Bederson, Yeh, Lin.: CrowdFlow: Integrating Machine Learning with Mechanical Turk for Speed‐
Cost‐Quality Flexibility
• Franklin et al.: Answering Queries with Crowdsourcing, SIGMOD 2011
• Parameswaran et al.: Human‐assisted Graph Search: It's Okay to Ask Questions, VLDB 2011
• Mitzenmacher, An introduction to human‐guided search, XRDS 2010
• Marcus et al., Crowdsourced Databases: Query Processing with People, CIDR 2011
• Deng, Berg, Li and Fei‐Fei. What does classifying more than 10,000 image categories tell us? ECCV 2010
• Kleinberg, Navigation in a small world, Nature 2000
References
• Lugosi, Learning with an unreliable teacher. Pattern Recognition 1992
• Smyth, Learning with probabilistic supervision. Selecting Good Models, MIT Press, Apr. 1995
• Smyth, Bounds on the mean classification error rate of multiple experts, Pattern Recognition Let. 1996
• Gneezy and Rustichini, Pay enough or don't pay at all, The Quarterly Journal of Economics, 2000
• Camerer et al., Labor supply of New York City cabdrivers: One day at a time. The Quarterly Journal of Economics, 1997
• Farber. Reference‐dependent preferences and labor supply: The case of New York City taxi drivers. 
American Economic Review,  2008.
• Fehr and Goette. Do workers work more if wages are high?: Evidence from a randomized field experiment. American Economic Review, 2007
• Chandler and Kepelner, Breaking Monotony with Meaning: Motivation in Crowdsourcing Markets , 2010
• Ariely, Kamenica, Prelec. Man’s Search for Meaning: The Case of Legos. Journal of Economic Behavior & 
Organization, 2008
• Farmer and Glass, Building Web Reputation Systems, O’Reilly 2010
• Raykar, Yu, Zhao, Valadez, Florin, Bogoni, and Moy. Learning from crowds. JMLR 2010.
• Mason and Watts, Collective problem solving in networks, 2011
• Dellarocas, Dini and Spagnolo. Designing Reputation Mechanisms. Chapter 18 in Handbook of 
Procurement, Cambridge University Press, 2007
References
• Mozer et al, Decontaminating Human Judgments by Removing Sequential Dependencies, NIPS 2010
• Tversky, & Kahneman, Judgment under uncertainty: Heuristics and biases. Science, 1974
• Parducci. The relativism of absolute judgment. Scientific American, 1968
• Paolacci,  Chandler, Ipeirotis. Running Experiments using Amazon Mechanical Turk. Judgment and 
Decision Making, 2010
• Shih, Pittinsky, Ambady, Stereotype susceptibility: Identity salience and shifts in quantitative 
performance, Psychological Science, 1999
• Bernstein et al, Soylent: A Word Processor with a Crowd Inside, UIST 2010
• Malone, Heuristics for designing enjoyable user interfaces: Lessons from computer games, 1982
• Malone, What makes things fun to learn? Heuristics for designing instructional computer games, 1980
• Luis von Ahn, Laura Dabbish, Designing Games with a Purpose, CACM 2008
• Jain and Parkes, A Game‐Theoretic Analysis of the ESP Game, Working Paper 
• Cooper et al., Predicting protein structures with a multiplayer online game, Nature, August 2010
Books, Courses, and Surveys
Thanks to Matt Lease http://ir.ischool.utexas.edu/crowd/
• Omar Alonso, Gabriella Kazai, and Stefano Mizzaro. Crowdsourcing for Search Engine 
Evaluation: Why and How. To be published by Springer, 2011.
• Edith Law and Luis von Ahn. Human Computation: An Integrated Approach to Learning
from the Crowd. To be published in Morgan & Claypool Synthesis Lectures on Artificial 
Intelligence and Machine Learning, 2011. http://www.morganclaypool.com/toc/aim/1/1

• Deepak Ganesan. CS691CS: Crowdsourcing ‐ Opportunities & Challenges (Fall 2010). UMass Amherst. http://www.cs.umass.edu/~dganesan/courses/fall10
• Matt Lease. CS395T/INF385T: Crowdsourcing: Theory, Methods, and Applications (Spring 
2011). UT Austin. http://courses.ischool.utexas.edu/Lease_Matt/2011/Spring/INF385T

• Yuen, Chen, King: A Survey of Human Computation Systems, SCA 2009
• Quinn, Bederson: A Taxonomy of Distributed Human Computation, CHI 2011
• Doan, Ramakrishnan,  Halevy: Crowdsourcing Systems on the World‐Wide Web, CACM 
2011
Tutorials
Thanks to Matt Lease http://ir.ischool.utexas.edu/crowd/
• AAAI 2011 (w HCOMP 2011): Human Computation: Core Research Questions and State 
of the Art (Edith Law and Luis von Ahn)
• WSDM 2011: Crowdsourcing 101: Putting the WSDM of Crowds to Work for You (Omar
Alonso and Matthew Lease) http://ir.ischool.utexas.edu/wsdm2011_tutorial.pdf
• LREC 2010 Tutorial: Statistical Models of the Annotation Process (Bob Carpenter and 
Massimo Poesio) http://lingpipe‐blog.com/2010/05/17/lrec‐2010‐tutorial‐modeling‐
data‐annotation/
• ECIR 2010: Crowdsourcing for Relevance Evaluation. (Omar Alonso) 
http://wwwcsif.cs.ucdavis.edu/~alonsoom/crowdsourcing.html
• CVPR 2010: Mechanical Turk for Computer Vision. (Alex Sorokin and Fei‐Fei Li) 
http://sites.google.com/site/turkforvision/
• CIKM 2008: Crowdsourcing for Relevance Evaluation (Daniel Rose) 
http://videolectures.net/cikm08_rose_cfre/
