Snorkel: Rapid Training Data Creation With Weak Supervision
Figure 2: An overview of the Snorkel system. (1) SME users write labeling functions (LFs) that express
weak supervision sources like distant supervision, patterns, and heuristics. (2) Snorkel applies the LFs over
unlabeled data and learns a generative model to combine the LFs’ outputs into probabilistic labels. (3)
Snorkel uses these labels to train a discriminative classification model, such as a deep neural network.
source datasets that are representative of other Snorkel deployments, including bioinformatics, medical image analysis, and crowdsourcing; on which Snorkel beats heuristics by an average 153% and comes within an average 3.60% of the predictive performance of large hand-curated training sets.

2. SNORKEL ARCHITECTURE

Snorkel's workflow is designed around data programming [5, 38], a fundamentally new paradigm for training machine learning models using weak supervision, and proceeds in three main stages (Figure 2):

1. Writing Labeling Functions: Rather than hand-labeling training data, users of Snorkel write labeling functions, which allow them to express various weak supervision sources such as patterns, heuristics, external knowledge bases, and more. This was the component most informed by early interactions (and mistakes) with users over the last year of deployment, and we present a flexible interface and supporting data model.

2. Modeling Accuracies and Correlations: Next, Snorkel automatically learns a generative model over the labeling functions, which allows it to estimate their accuracies and correlations. This step uses no ground-truth data, learning instead from the agreements and disagreements of the labeling functions. We observe that this step improves end predictive performance by 5.81% over Snorkel with unweighted label combination, and anecdotally that it streamlines the user development experience by providing actionable feedback about labeling function quality.

3. Training a Discriminative Model: The output of Snorkel is a set of probabilistic labels that can be used to train a wide variety of state-of-the-art machine learning models, such as popular deep learning models. While the generative model is essentially a re-weighted combination of the user-provided labeling functions—which tend to be precise but low-coverage—modern discriminative models can retain this precision while learning to generalize beyond the labeling functions, increasing coverage and robustness on unseen data.

Next we set up the problem Snorkel addresses and describe its main components and design decisions.

Setup: Our goal is to learn a parameterized classification model hθ that, given a data point x ∈ X, predicts its label y ∈ Y, where the set of possible labels Y is discrete. For simplicity, we focus on the binary setting Y = {−1, 1}, though we include a multi-class application in our experiments. For example, x might be a medical image, and y a label indicating normal versus abnormal. In the relation extraction examples we look at, we often refer to x as a candidate. In a traditional supervised learning setup, we would learn hθ by fitting it to a training set of labeled data points. However, in our setting, we assume that we only have access to unlabeled data for training. We do assume access to a small set of labeled data used during development, called the development set, and a blind, held-out labeled test set for evaluation. These sets can be orders of magnitude smaller than a training set, making them economical to obtain.

The user of Snorkel aims to generate training labels by providing a set of labeling functions, which are black-box functions, λ : X → Y ∪ {∅}, that take in a data point and output a label, where we use ∅ to denote that the labeling function abstains. Given m unlabeled data points and n labeling functions, Snorkel applies the labeling functions over the unlabeled data to produce a matrix of labeling function outputs Λ ∈ (Y ∪ {∅})^(m×n). The goal of the remaining Snorkel pipeline is to synthesize this label matrix Λ—which may contain overlapping and conflicting labels for each data point—into a single vector of probabilistic training labels Ỹ = (ỹ1, ..., ỹm), where ỹi ∈ [0, 1]. These training labels can then be used to train a discriminative model.

Next, we introduce the running example of a text relation extraction task as a proxy for many real-world knowledge base construction and data analysis tasks:

Example 2.1. Consider the task of extracting mentions of adverse chemical-disease relations from the biomedical literature (see CDR task, Section 4.1). Given documents with mentions of chemicals and diseases tagged, we refer to each co-occurring (chemical, disease) mention pair as a candidate extraction, which we view as a data point to be classified as either true or false. For example, in Figure 2, we would have two candidates with true labels y1 = True and y2 = False:

x1 = Causes("magnesium", "quadriplegic")
x2 = Causes("magnesium", "preeclampsia")
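To make the setup concrete, the following is a minimal sketch—our own illustration, not Snorkel's actual API—of what a labeling function and the resulting label matrix Λ look like in Python. The Candidate class, its field names, and the helper functions below are hypothetical:

from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Candidate:                       # hypothetical stand-in for a Snorkel Candidate
    chemical: str
    disease: str
    sentence_words: List[str]

ABSTAIN = None                         # corresponds to the empty label ∅

def LF_example(x: Candidate) -> Optional[bool]:
    # A labeling function: returns True, False, or abstains.
    if "causes" in x.sentence_words:
        return True
    return ABSTAIN

def build_label_matrix(candidates: List[Candidate], lfs) -> List[List[Optional[bool]]]:
    # Λ is an m x n matrix of labels in {True, False, None}.
    return [[lf(x) for lf in lfs] for x in candidates]

The rest of the pipeline consumes only this matrix Λ, together with the unlabeled data, to produce the probabilistic labels Ỹ.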
Figure 3: Labeling functions take as input a Candidate object, representing a data point to be classified. Each Candidate is a tuple of Context objects, which are part of a hierarchy representing the local context of the Candidate.

Data Model: A design challenge is managing complex, unstructured data in a way that enables SMEs to write labeling functions over it. In Snorkel, input data is stored in a context hierarchy. It is made up of context types connected by parent/child relationships, which are stored in a relational database and made available via an object-relational mapping (ORM) layer built with SQLAlchemy (https://www.sqlalchemy.org/). Each context type represents a conceptual component of data to be processed by the system or used when writing labeling functions; for example a document, an image, a paragraph, a sentence, or an embedded table. Candidates—i.e., data points x—are then defined as tuples of contexts (Figure 3).

Example 2.2. In our running CDR example, the input documents can be represented in Snorkel as a hierarchy consisting of Documents, each containing one or more Sentences, each containing one or more Spans of text. These Spans may also be tagged with metadata, such as Entity markers identifying them as chemical or disease mentions (Figure 3). A candidate is then a tuple of two Spans.

2.1 A Language for Weak Supervision

Snorkel uses the core abstraction of a labeling function to allow users to specify a wide range of weak supervision sources such as patterns, heuristics, external knowledge bases, crowdsourced labels, and more. This higher-level, less precise input is more efficient to provide (see Section 4.2), and can be automatically denoised and synthesized, as described in subsequent sections.

In this section, we describe our design choices in building an interface for writing labeling functions, which we envision as a unifying programming language for weak supervision. These choices were informed to a large degree by our interactions—primarily through weekly office hours—with Snorkel users in bioinformatics, defense, industry, and other areas over the past year (see http://snorkel.stanford.edu#users). For example, while we initially intended to have a more complex structure for labeling functions, with manually specified types and correlation structure, we quickly found that simplicity in this respect was critical to usability (and not empirically detrimental to our ability to model their outputs). We also quickly discovered that users wanted either far more expressivity or far less of it, compared to our first library of function templates. We thus trade off expressivity and efficiency by allowing users to write labeling functions at two levels of abstraction: custom Python functions and declarative operators.

Hand-Defined Labeling Functions: In its most general form, a labeling function is just an arbitrary snippet of code, usually written in Python, which accepts as input a Candidate object and either outputs a label or abstains. Often these functions are similar to extract-transform-load scripts, expressing basic patterns or heuristics, but may use supporting code or resources and be arbitrarily complex. Writing labeling functions by hand is supported by the ORM layer, which maps the context hierarchy and associated metadata to an object-oriented syntax, allowing the user to easily traverse the structure of the input data.

Example 2.3. In our running example, we can write a labeling function that checks if the word "causes" appears between the chemical and disease mentions. If it does, it outputs True if the chemical mention is first and False if the disease mention is first. If "causes" does not appear, it outputs None, indicating abstention:

def LF_causes(x):
    cs, ce = x.chemical.get_word_range()
    ds, de = x.disease.get_word_range()
    if ce < ds and "causes" in x.parent.words[ce+1:ds]:
        return True
    if de < cs and "causes" in x.parent.words[de+1:cs]:
        return False
    return None

We could also write this with Snorkel's declarative interface:

LF_causes = lf_search("{{1}}.*\Wcauses\W.*{{2}}", reverse_args=False)

Declarative Labeling Functions: Snorkel includes a library of declarative operators that encode the most common weak supervision function types, based on our experience with users over the last year. These functions capture a range of common forms of weak supervision, for example:

• Pattern-based: Pattern-based heuristics embody the motivation of soliciting higher information density input from SMEs. For example, pattern-based heuristics encompass feature annotations [51] and pattern-bootstrapping approaches [18, 20] (Example 2.3).

• Distant supervision: Distant supervision generates training labels by heuristically aligning data points with an external knowledge base, and is one of the most popular forms of weak supervision [4, 22, 32].

• Weak classifiers: Classifiers that are insufficient for our task—e.g., limited coverage, noisy, biased, and/or trained on a different dataset—can be used as labeling functions.

• Labeling function generators: One higher-level abstraction that we can build on top of labeling functions in Snorkel is labeling function generators, which generate multiple labeling functions from a single resource, such as crowdsourced labels and distant supervision from structured knowledge bases (Example 2.4).

Example 2.4. A challenge in traditional distant supervision is that different subsets of knowledge bases have different levels of accuracy and coverage. In our running example, we can use the Comparative Toxicogenomics Database (CTD, http://ctdbase.org/) as distant supervision, separately modeling different subsets of it with separate labeling functions. For example,
we might write one labeling function to label a candidate True if it occurs in the "Causes" subset, and another to label it False if it occurs in the "Treats" subset. We can write this using a labeling function generator,

LFs_CTD = Ontology(ctd, {"Causes": True, "Treats": False})

which creates two labeling functions. In this way, generators can be connected to large resources and create hundreds of labeling functions with a line of code.
2.2 Generative Model

The core operation of Snorkel is modeling and integrating the noisy signals provided by a set of labeling functions. Using the recently proposed approach of data programming [5, 38], we model the true class label for a data point as a latent variable in a probabilistic model. In the simplest case, we model each labeling function as a noisy "voter" which is independent—i.e., makes errors that are uncorrelated with the other labeling functions. This defines a generative model of the votes of the labeling functions as noisy signals about the true label.

We can also model statistical dependencies between the labeling functions to improve predictive performance. For example, if two labeling functions express similar heuristics, we can include this dependency in the model and avoid a "double counting" problem. We observe that such pairwise correlations are the most common, so we focus on them in this paper (though handling higher order dependencies is straightforward). We use our structure learning method for generative models [5] to select a set C of labeling function pairs (j, k) to model as correlated (see Section 3.2).

Now we can construct the full generative model as a factor graph. We first apply all the labeling functions to the unlabeled data points, resulting in a label matrix Λ, where Λ_{i,j} = λ_j(x_i). We then encode the generative model p_w(Λ, Y) using three factor types, representing the labeling propensity, accuracy, and pairwise correlations of labeling functions:

φ^{Lab}_{i,j}(Λ, Y) = 1{Λ_{i,j} ≠ ∅}
φ^{Acc}_{i,j}(Λ, Y) = 1{Λ_{i,j} = y_i}
φ^{Corr}_{i,j,k}(Λ, Y) = 1{Λ_{i,j} = Λ_{i,k}},  (j, k) ∈ C

For a given data point x_i, we define the concatenated vector of these factors for all the labeling functions j = 1, ..., n and potential correlations C as φ_i(Λ, Y), and the corresponding vector of parameters w ∈ R^{2n+|C|}. This defines our model:

p_w(Λ, Y) = Z_w^{-1} exp( Σ_{i=1}^{m} w^T φ_i(Λ, y_i) ),

where Z_w is a normalizing constant. To learn this model without access to the true labels Y, we minimize the negative log marginal likelihood given the observed label matrix Λ:

ŵ = argmin_w − log Σ_Y p_w(Λ, Y).

We optimize this objective by interleaving stochastic gradient descent steps with Gibbs sampling ones, similar to contrastive divergence [21]; for more details, see [5, 38]. We use the Numbskull library (https://github.com/HazyResearch/numbskull), a Python NUMBA-based Gibbs sampler. We then use the predictions, Ỹ = p_ŵ(Y | Λ), as probabilistic training labels.

2.3 Discriminative Model

The end goal in Snorkel is to train a model that generalizes beyond the information expressed in the labeling functions. We train a discriminative model hθ on our probabilistic labels Ỹ by minimizing a noise-aware variant of the loss l(hθ(x_i), y), i.e., the expected loss with respect to Ỹ:

θ̂ = argmin_θ Σ_{i=1}^{m} E_{y∼Ỹ}[ l(hθ(x_i), y) ].

A formal analysis shows that as we increase the amount of unlabeled data, the generalization error of discriminative models trained with Snorkel will decrease at the same asymptotic rate as traditional supervised learning models do with additional hand-labeled data [38], allowing us to increase predictive performance by adding more unlabeled data. Intuitively, this property holds because as more data is provided, the discriminative model sees more features that co-occur with the heuristics encoded in the labeling functions.

Example 2.5. The CDR data contains the sentence, "Myasthenia gravis presenting as weakness after magnesium administration." None of the 33 labeling functions we developed vote on the corresponding Causes(magnesium, myasthenia gravis) candidate, i.e., they all abstain. However, a deep neural network trained on probabilistic training labels from Snorkel correctly identifies it as a true mention.

Snorkel provides connectors for popular machine learning libraries such as TensorFlow [2], allowing users to exploit commodity models like deep neural networks that do not require hand-engineering of features and have robust predictive performance across a wide range of tasks.

3. WEAK SUPERVISION TRADEOFFS

We study the fundamental question of when—and at what level of complexity—we should expect Snorkel's generative model to yield the greatest predictive performance gains. Understanding these performance regimes can help guide users, and introduces a tradeoff space between predictive performance and speed. We characterize this space in two parts: first, by analyzing when the generative model can be approximated by an unweighted majority vote, and second, by automatically selecting the complexity of the correlation structure to model. We then introduce a two-stage, rule-based optimizer to support fast development cycles.

3.1 Modeling Accuracies

The natural first question when studying systems for weak supervision is, "When does modeling the accuracies of sources improve end-to-end predictive performance?" We study that question in this subsection and propose a heuristic to identify settings in which this modeling step is most beneficial.

3.1.1 Tradeoff Space

We start by considering the label density dΛ of the label matrix Λ, defined as the mean number of non-abstention labels per data point. In the low-density setting, sparsity of labels will mean that there is limited room for even an optimal weighting of the labeling functions to diverge much from the majority vote. Conversely, as the label density
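Stepping back to the noise-aware training objective of Section 2.3: the sketch below is a minimal, self-contained illustration of that expected loss for the binary setting, assuming a logistic loss and real-valued scores from whatever discriminative model hθ is being trained. Snorkel itself passes Ỹ to standard libraries such as TensorFlow rather than using code like this, and the implementation below is not numerically hardened:

import math

def logistic_loss(score, y):
    # l(hθ(x), y) for y in {-1, +1}, where score is the model's real-valued output.
    return math.log1p(math.exp(-y * score))

def noise_aware_loss(scores, soft_labels):
    # Expected loss with respect to the probabilistic labels Ỹ:
    # ỹ_i = P(y_i = +1), so E_{y~Ỹ}[l] = ỹ_i * l(s_i, +1) + (1 - ỹ_i) * l(s_i, -1).
    total = 0.0
    for s, p in zip(scores, soft_labels):
        total += p * logistic_loss(s, +1) + (1 - p) * logistic_loss(s, -1)
    return total

Minimizing this quantity over the model parameters is exactly the objective θ̂ = argmin_θ Σ_i E_{y∼Ỹ}[l(hθ(x_i), y)] stated above.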
Table 1: Modeling advantage Aw attained using a generative model for several applications in Snorkel (Section 4.1), the upper bound Ã∗ used by our optimizer, and the modeling strategy selected by the optimizer—either majority vote (MV) or generative model (GM). [Table body omitted in this extraction.]

[Plot omitted. Legend: Low-Density Bound, Optimizer (Ã∗), Optimal (A∗), Gen. Model (Aw); y-axis: Modeling Advantage.]
Φ(Λ_i, y) = 1{c_y(Λ_i) w_max > c_{−y}(Λ_i) w_min}

Ã∗(Λ) = (1/m) Σ_{i=1}^{m} Σ_{y∈±1} 1{y f_1(Λ_i) ≤ 0} Φ(Λ_i, y) σ(2 f_{w̄}(Λ_i) y)

where σ(·) is the sigmoid function, f_{w̄} is majority vote with all weights set to the mean w̄, and Ã∗(Λ) is the predicted modeling advantage used by our optimizer. Essentially, we are taking the expected counts of instances in which a weighted majority vote could possibly flip the incorrect predictions of unweighted majority vote under best case conditions, which is an upper bound for the expected advantage:

Proposition 2. (Optimizer Upper Bound) Assume that the labeling functions have accuracy parameters (log-odds weights) w_j ∈ [w_min, w_max], and have E[w] = w̄. Then:

E_{y,w∗}[A∗ | Λ] ≤ Ã∗(Λ)    (3)

Proof Sketch: We upper-bound the modeling advantage by the expected number of instances in which WMV* is correct and MV is incorrect. We then upper-bound this by using the best-case probability of the weighted majority vote being correct given (w_min, w_max).

We fix these parameters at defaults of (w_min, w̄, w_max) = (0.5, 1.0, 1.5), which corresponds to assuming labeling functions have accuracies between 62% and 82%, and an average accuracy of 73%. We apply Ã∗ to a synthetic dataset and plot in Figure 6. Next, we compute Ã∗ for the labeling matrices from experiments in Section 4.1, and compare with the empirical advantage of the trained generative models (Table 1). We see that our approximate quantity Ã∗ serves as a correct guide in all cases for determining which modeling strategy to select, which for the mature applications reported on is indeed most often the generative model. However, we see that while EHR and Chem have equivalent label densities, our optimizer correctly predicts that Chem can be modeled with majority vote, speeding up each pipeline execution by 1.8×. We find in our applications that the optimizer can save execution time especially during the initial stages of iterative development (see full version).

3.2 Modeling Structure

Modeling such dependencies is important because they affect our estimates of the true labels. Consider the extreme case in which not accounting for dependencies is catastrophic:

Example 3.1. Consider a set of 10 labeling functions, where 5 are perfectly correlated, i.e., they vote the same way on every data point, and 5 are conditionally independent given the true label. If the correlated labeling functions have accuracy α = 50% and the uncorrelated ones have accuracy β = 99%, then the maximum likelihood estimate of their accuracies according to the independent model is α̂ = 100% and β̂ = 50%.

Specifying a generative model to account for such dependencies by hand is impractical for three reasons. First, it is difficult for non-expert users to specify these dependencies. Second, as users iterate on their labeling functions, their dependency structure can change rapidly, like when a user relaxes a labeling function to label many more candidates. Third, the dependency structure can be dataset specific, making it impossible to specify a priori, such as when a corpus contains many strings that match multiple regular expressions used in different labeling functions. We observed users of earlier versions of Snorkel struggling for these reasons to construct accurate and efficient generative models with dependencies. We therefore seek a method that can quickly identify an appropriate dependency structure from the labeling function outputs Λ alone.

Naively, we could include all dependencies of interest, such as all pairwise correlations, in the generative model and perform parameter estimation. However, this approach is impractical. For 100 labeling functions and 10,000 data points, estimating parameters with all possible correlations takes roughly 45 minutes. When multiplied over repeated runs of hyperparameter searching and development cycles, this cost greatly inhibits labeling function development. We therefore turn to our method for automatically selecting which dependencies to model without access to ground truth [5]. It uses a pseudolikelihood estimator, which does not require any sampling or other approximations to compute the objective gradient exactly. It is much faster than maximum likelihood estimation, taking 15 seconds to select pairwise correlations to be modeled among 100 labeling functions with 10,000 data points. However, this approach relies on a selection threshold hyperparameter ε which induces a tradeoff space between predictive performance and computational cost.
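The actual selection step is the pseudolikelihood-based structure learning estimator of [5]. As a rough intuition for what the threshold ε controls—smaller ε admits more labeling function pairs into the correlation set C—the toy stand-in below simply scores each pair by how often they emit the same non-abstaining label and keeps pairs above ε. This is our own simplification for illustration only, matching the qualitative behavior but not the real selection criterion:

def select_correlated_pairs(L, eps):
    # L: list of rows, one per data point; each row lists labels in {+1, -1, None}.
    # Returns pairs (j, k) whose empirical agreement on co-labeled points exceeds eps.
    n = len(L[0])
    pairs = []
    for j in range(n):
        for k in range(j + 1, n):
            both = [(row[j], row[k]) for row in L
                    if row[j] is not None and row[k] is not None]
            if not both:
                continue
            agreement = sum(a == b for a, b in both) / len(both)
            if agreement > eps:
                pairs.append((j, k))
    return pairs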
Figure 5: Predictive performance of the generative model and number of learned correlations versus the correlation threshold ε. The selected elbow point achieves a good tradeoff between predictive performance and computational cost (linear in the number of correlations). Left: simulation of structure learning correcting the generative model. Middle: the CDR task. Right: all user study labeling functions for the Spouses task.
3.2.1 Tradeoff Space

Such structure learning methods, whether pseudolikelihood or likelihood-based, crucially depend on a selection threshold ε for deciding which dependencies to add to the generative model. Fundamentally, the choice of ε determines the complexity of the generative model. (Specifically, ε is both the coefficient of the ℓ1 regularization term used to induce sparsity, and the minimum absolute weight in log scale that a dependency must have to be selected.) We study the tradeoff between predictive performance and computational cost that this induces. We find that generally there is an "elbow point" beyond which the number of correlations selected—and thus the computational cost—explodes, and that this point is a safe tradeoff point between predictive performance and computation time.

Predictive Performance: At one extreme, a very large value of ε will not include any correlations in the generative model, making it identical to the independent model. As ε is decreased, correlations will be added. At first, when ε is still high, only the strongest correlations will be included. As these correlations are added, we observe that the generative model's predictive performance tends to improve. Figure 5, left, shows the result of varying ε in a simulation where more than half the labeling functions are correlated. After adding a few key dependencies, the generative model resolves the discrepancies among the labeling functions. Figure 5, middle, shows the effect of varying ε for the CDR task. Predictive performance improves as ε decreases until the model overfits. Finally, we consider a large number of labeling functions that are likely to be correlated. In our user study (described in Section 4.2), participants wrote labeling functions for the Spouses task. We combined all 125 of their functions and studied the effect of varying ε. Here, we expect there to be many correlations since it is likely that users wrote redundant functions. We see in Figure 5, right, that structure learning surpasses the best performing individual's generative model (50.0 F1).

Computational Cost: Computational cost is correlated with model complexity. Since learning in Snorkel is done with a Gibbs sampler, the overhead of modeling additional correlations is linear in the number of correlations. The dashed lines in Figure 5 show the number of correlations included in each model versus ε. For example, on the Spouses task, fitting the parameters of the generative model at ε = 0.5 takes 4 minutes, and fitting its parameters with ε = 0.02 takes 57 minutes. Further, parameter estimation is often run repeatedly during development for two reasons: (i) fitting generative model hyperparameters using a development set requires repeated runs, and (ii) as users iterate on their labeling functions, they must re-estimate the generative model to evaluate them.

3.2.2 Automatically Choosing a Model

Based on our observations, we seek to automatically choose a value of ε that trades off between predictive performance and computational cost using the labeling functions' outputs Λ alone. Including ε as a hyperparameter in a grid search over a development set is generally not feasible because of its large effect on running time. We therefore want to choose ε before other hyperparameters, without performing any parameter estimation. We propose using the number of correlations selected at each value of ε as an inexpensive indicator. The dashed lines in Figure 5 show that as ε decreases, the number of selected correlations follows a pattern. Generally, the number of correlations grows slowly at first, then hits an "elbow point" beyond which the number explodes, which fits the assumption that the correlation structure is sparse. In all three cases, setting ε to this elbow point is a safe tradeoff between predictive performance and computational cost. In cases where performance grows consistently (left and right), the elbow point achieves most of the predictive performance gains at a small fraction of the computational cost. For example, on Spouses (right), choosing ε = 0.08 achieves a score of 56.6 F1—within one point of the best score—but only takes 8 minutes for parameter estimation. In cases where predictive performance eventually degrades (middle), the elbow point also selects a relatively small number of correlations, giving a 0.7 F1 point improvement and avoiding overfitting.

Performing structure learning for many settings of ε is inexpensive, especially since the search needs to be performed only once before tuning the other hyperparameters. On the large number of labeling functions in the Spouses task, structure learning for 25 values of ε takes 14 minutes. On CDR, with a smaller number of labeling functions, it takes 30 seconds. Further, if the search is started at a low value of ε and increased, it can often be terminated early, when the number of selected correlations reaches a low value. Selecting the elbow point itself is straightforward. We use the point with greatest absolute difference from its neighbors, but more sophisticated schemes can also be applied [43]. Our full optimization algorithm for choosing a modeling strategy and (if necessary) correlations is shown in Algorithm 1.
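Before turning to the full optimizer (Algorithm 1, below), the sketch below gives one reasonable reading of the elbow rule just quoted—pick the ε whose correlation count differs most from its neighbors—assuming the (ε, |C|) pairs produced during the structure-learning sweep. It is an illustration, not Snorkel's SelectElbowPoint implementation:

def select_elbow_point(structures):
    # structures: list of (eps, num_correlations) pairs, sorted by eps.
    # Returns the eps whose correlation count has the greatest absolute
    # difference from its neighbors (a simple elbow heuristic); needs >= 3 points.
    best_eps, best_diff = None, -1.0
    for i in range(1, len(structures) - 1):
        _, prev_count = structures[i - 1]
        eps, count = structures[i]
        _, next_count = structures[i + 1]
        diff = abs(count - prev_count) + abs(next_count - count)
        if diff > best_diff:
            best_eps, best_diff = eps, diff
    return best_eps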
Algorithm 1 Modeling Strategy Optimizer

Input: Label matrix Λ ∈ (Y ∪ {∅})^(m×n), advantage tolerance γ, structure search resolution η
Output: Modeling strategy

if Ã∗(Λ) < γ then
    return MV
Structures ← [ ]
for i from 1 to 1/(2η) do
    ε ← i · η
    C ← LearnStructure(Λ, ε)
    Structures.append((|C|, ε))
ε ← SelectElbowPoint(Structures)
return GM

4. EVALUATION

We evaluate Snorkel by drawing on deployments developed in collaboration with users. We report on two real-world deployments and four tasks on open-source data sets representative of other deployments. Our evaluation is designed to support the following three main claims:

• Snorkel outperforms distant supervision baselines. In distant supervision [32], one of the most popular forms of weak supervision used in practice, an external knowledge base is heuristically aligned with input data to serve as noisy training labels. By allowing users to easily incorporate a broader, more heterogeneous set of weak supervision sources, Snorkel exceeds models trained via distant supervision by an average of 132%.

• Snorkel approaches hand supervision. We see that by writing tens of labeling functions, we were able to approach or match results using hand-labeled training data which took weeks or months to assemble, coming within 2.11% of the F1 score of hand supervision on relation extraction tasks and an average 5.08% accuracy or AUC on cross-modal tasks, for an average 3.60% across all tasks.

• Snorkel enables a new interaction paradigm. We measure Snorkel's efficiency and ease-of-use by reporting on a user study of biomedical researchers from across the U.S. These participants learned to write labeling functions to extract relations from news articles as part of a two-day workshop on learning to use Snorkel, and matched or outperformed models trained on hand-labeled training data, showing the efficiency of Snorkel's process even for first-time users.

We now describe our results in detail. First, we describe the six applications that validate our claims. We then show that Snorkel's generative modeling stage helps to improve the predictive performance of the discriminative model, demonstrating that it is 5.81% more accurate when trained on Snorkel's probabilistic labels versus labels produced by an unweighted average of labeling functions. We also validate that the ability to incorporate many different types of weak supervision incrementally improves results with an ablation study. Finally, we describe the protocol and results of our user study.

4.1 Applications

To evaluate the effectiveness of Snorkel, we consider several real-world deployments and tasks on open-source datasets that are representative of other deployments in information extraction, medical image classification, and crowdsourced sentiment analysis. Summary statistics of the tasks are provided in Table 2.

Table 2: Number of labeling functions, fraction of positive labels (for binary classification tasks), number of training documents, and number of training candidates for each task.

Task        # LFs   % Pos.   # Docs    # Candidates
Chem        16      4.1      1,753     65,398
EHR         24      36.8     47,827    225,607
CDR         33      24.6     900       8,272
Spouses     11      8.3      2,073     22,195
Radiology   18      36.0     3,851     3,851
Crowd       102     -        505       505

Discriminative Models: One of the key bets in Snorkel's design is that the trend of increasingly powerful, open-source machine learning tools (e.g., models, pre-trained word embeddings and initial layers, automatic tuners, etc.) will only continue to accelerate. To best take advantage of this, Snorkel creates probabilistic training labels for any discriminative model with a standard loss function.

In the following experiments, we control for end model selection by using currently popular, standard choices across all settings. For text modalities, we choose a bidirectional long short term memory (LSTM) sequence model [17], and for the medical image classification task we use a 50-layer ResNet [19] pre-trained on the ImageNet object classification dataset [14]. Both models are implemented in TensorFlow [2] and trained using the Adam optimizer [24], with hyperparameters selected via random grid search using a small labeled development set. Final scores are reported on a held-out labeled test set. See full version for details.

A key takeaway of the following results is that the discriminative model generalizes beyond the heuristics encoded in the labeling functions (as in Example 2.5). In Section 4.1.1, we see that on relation extraction applications the discriminative model improves performance over the generative model primarily by increasing recall by 43.15% on average. In Section 4.1.2, the discriminative model classifies entirely new modalities of data to which the labeling functions cannot be applied.

4.1.1 Relation Extraction from Text

We first focus on four relation extraction tasks on text data, as it is a challenging and common class of problems that are well studied and for which distant supervision is often considered. Predictive performance is summarized in Table 3. We briefly describe each task.

Scientific Articles (Chem): With modern online repositories of scientific literature, such as PubMed (https://www.ncbi.nlm.nih.gov/pubmed/) for biomedical articles, research results are more accessible than ever before. However, actually extracting fine-grained pieces of information in a structured format and using this data to answer specific questions at scale remains a significant open challenge for researchers. To address this challenge in the
Table 3: Evaluation of Snorkel on relation extraction tasks from text. Snorkel's generative and discriminative models consistently improve over distant supervision, measured in F1, the harmonic mean of precision (P) and recall (R). We compare with hand-labeled data when available, coming within an average of 1 F1 point.

           Distant Supervision    Snorkel (Gen.)              Snorkel (Disc.)             Hand Supervision
Task       P     R     F1         P     R     F1     Lift     P     R     F1     Lift     P     R     F1
Chem       11.2  41.2  17.6       78.6  21.6  33.8   +16.2    87.0  39.2  54.1   +36.5    -     -     -
EHR        81.4  64.8  72.2       77.1  72.9  74.9   +2.7     80.2  82.6  81.4   +9.2     -     -     -
CDR        25.5  34.8  29.4       52.3  30.4  38.5   +9.1     38.8  54.3  45.3   +15.9    39.9  58.1  47.3
Spouses    9.9   34.8  15.4       53.5  62.1  57.4   +42.0    48.4  61.6  54.2   +38.8    47.8  62.5  54.2
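As a quick arithmetic check of how the F1 columns relate to the P and R columns (F1 being the harmonic mean of precision and recall), consider the Snorkel (Disc.) row for CDR:

precision, recall = 38.8, 54.3
f1 = 2 * precision * recall / (precision + recall)
# f1 ≈ 45.3, matching the reported Snorkel (Disc.) F1 for CDR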
is incorrect (note that for tie votes, we simply upper bound by trivially assuming an expected advantage of one):

E_{w∗,y}[A∗(Λ, y) | Λ]
= E_{w∗, y∼P(·|Λ,w∗)}[A_{w∗}(Λ, y)]
≤ E_{w∗, y∼P(·|Λ,w∗)}[ (1/m) Σ_{i=1}^{m} 1{y_i ≠ y_i′} 1{y_i′ f_{w∗}(Λ_i) ≤ 0} ]
= E_{w∗}[ (1/m) Σ_{i=1}^{m} E_{y∼P(·|Λ_i,w∗)}[ 1{y_i ≠ y_i′} 1{y_i′ f_{w∗}(Λ_i) ≤ 0} ] ]
= (1/m) Σ_{i=1}^{m} E_{w∗}[ P(y_i ≠ y_i′ | Λ_i, w∗) 1{y_i′ f_{w∗}(Λ_i) ≤ 0} ]

Next, define:

Φ(Λ_i, y″) = 1{c_{y″}(Λ_i) w_max > c_{−y″}(Λ_i) w_min}

i.e. this is an indicator for whether WMV* could possibly output y″ as a prediction under best-case circumstances. We use this in turn to upper-bound the expected modeling advantage again:

E_{w∗, y∼P(·|Λ,w∗)}[A_{w∗}(Λ, y)]
≤ (1/m) Σ_{i=1}^{m} E_{w∗}[ P(y_i ≠ y_i′ | Λ_i, w∗) Φ(Λ_i, −y_i′) ]
= (1/m) Σ_{i=1}^{m} Φ(Λ_i, −y_i′) E_{w∗}[ P(y_i ≠ y_i′ | Λ_i, w∗) ]
≤ (1/m) Σ_{i=1}^{m} Φ(Λ_i, −y_i′) P(y_i ≠ y_i′ | Λ_i, w̄)

P(y_i = y′ | Λ_i, w) = P(y_i = y′, Λ_i | w) / Σ_{y″∈±1} P(y_i = y″, Λ_i | w)
= exp(w^T φ_i(Λ_i, y_i = y′)) / Σ_{y″∈±1} exp(w^T φ_i(Λ_i, y_i = y″))
= σ(2 y′ f_w(Λ_i))

where σ(·) is the sigmoid function. Note that we are considering a simplified independent generative model with only accuracy factors; however, in this discriminative formulation the labeling propensity factors would drop out anyway since they do not depend on y, so their omission is just for notational simplicity.

Putting this all together by removing the y_i′ placeholder, simplifying notation to match the main body of the paper, we have:

E_{w∗,y}[A∗(Λ, y) | Λ]
≤ (1/m) Σ_{i=1}^{m} Σ_{y∈±1} 1{y f_1(Λ_i) ≤ 0} Φ(Λ_i, y) σ(2 y f_{w̄}(Λ_i))
= Ã∗(Λ).

Figure 6: The advantage of using the generative labeling model (GM) over majority vote (MV) as predicted by our optimizer (Ã∗), and empirically (Aw), on the CDR application as the number of LFs is increased. We see that the optimizer correctly chooses MV during early development stages, and then GM in later ones.

A.5 Modeling Advantage Notes

In Figure 6, we measure the modeling advantage of the generative model versus a majority vote of the labeling functions on random subsets of the CDR labeling functions of different sizes. We see that the modeling advantage grows as the number of labeling functions increases, indicating that the optimizer can save execution time especially during the initial stages of iterative development.

Note that in Section 4, due to known negative class imbalance in relation extraction problems, we count instances in which the generative model emits no label—i.e., a 0 label—as negatives, as is common practice (essentially, we are giving the generative model the benefit of the doubt given the known class imbalance). Thus our reported F1 score metric hides instances in which the generative model learns to apply a -1 label where majority vote applied 0. In computing the empirical modeling advantage, however, we do count such instances as improvements over majority vote, as these instances do have an effect on the training of the end discriminative model.

B. ADDITIONAL EVALUATION DETAILS

B.1 Data Set Details

Additional information about the sizes of the datasets is included in Table 7. Specifically, we report the size of the (unlabeled) training set and hand-labeled development and test sets, in terms of number of candidates. Note that the development and test sets can be orders of magnitude smaller than the training sets. Labeled development and test sets were either used when already available as part of a benchmark dataset, or labeled with the help of our SME collaborators, limited to several hours of labeling time maximum.

Table 7: Number of candidates in the training, development, and test splits for each dataset.

Task        # Train.   # Dev.   # Test
Chem        65,398     1,292    1,232
EHR         225,607    913      604
CDR         8,272      888      4,620
Spouses     22,195     2,796    2,697
Radiology   3,851      385      385
Crowd       505        63       64

B.2 User Study

Figures 7 and 8 show the distribution of scores by participant, and broken down by participant background, compared against the baseline models trained with hand-labeled data. Figure 8 shows descriptive statistics of user factors broken down by their end model's predictive performance.

Figure 7: Predictive performance of our 14 user study participants. The majority of users matched or exceeded the performance of a model trained on 7 hours (2500 instances) of hand-labeled data.

C. IMPLEMENTATION DETAILS

Note that all code is open source and available—with tutorials, blog posts, workshop lectures, and other material—at snorkel.stanford.edu.

only the closure of the labeling functions and the resulting labels need to be communicated to and from the workers. This is particularly helpful in Snorkel's iterative workflow. Distributing a large unstructured data set across a cluster is relatively expensive, but only has to be performed once. Then, as users refine their labeling functions, they can be rerun efficiently.

This same execution model is supported for preprocessing utilities—such as natural language processing for text and candidate extraction—via a common class interface. Snorkel provides wrappers for Stanford CoreNLP (https://stanfordnlp.github.io/CoreNLP/) and SpaCy (https://spacy.io/) for text preprocessing, and supports automatically defining candidates using their named-entity recognition features.
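As an illustration of the execution model described above—shipping only the labeling function closures and their output labels between workers—here is a small sketch using Python's standard multiprocessing module. It is not Snorkel's actual distributed implementation (the text refers to a cluster setting); it simply shows the same idea on local worker processes:

from multiprocessing import Pool

def apply_lfs_to_partition(args):
    # Each worker receives only the labeling functions and its data partition,
    # and returns just the resulting label rows.
    lfs, partition = args
    return [[lf(x) for lf in lfs] for x in partition]

def apply_lfs_parallel(candidates, lfs, num_workers=4):
    # Partition the data once; as labeling functions are refined, only this
    # cheap map step is rerun. Note: lfs must be picklable (e.g., module-level
    # functions), and on some platforms this should run under
    # if __name__ == "__main__": to avoid re-import issues.
    chunk = max(1, (len(candidates) + num_workers - 1) // num_workers)
    chunks = [candidates[i:i + chunk] for i in range(0, len(candidates), chunk)]
    with Pool(num_workers) as pool:
        results = pool.map(apply_lfs_to_partition, [(lfs, c) for c in chunks])
    return [row for part in results for row in part]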