e.g. "Tripping over pets causes more than 86,000 falls each year in the United States
The span makes general claims about the world or contains common knowledge
e.g. "Most major cities have activities and entertainment for people of all ages."
The span contains advice and instructions, e.g., "To make pour over coffee, first
The span contains phrases that structure the response, including phrases that
introduce a list, summaries of other parts of the response, or conclusions that follow from
e.g., for the response "Benefits of coffee include: 1) Brain function: Coffee can improve
The span only contains links - We shouldn't verify whether the URLs exist or contain
The span only contains fictional or creative content that the system made up
NOTE: If fictional content is mixed with factual content, or referred to in spans mainly
containing factual information, consider the span as conveying factual information, e.g.:
there is a mix of creative content and factual assertions: "Jack time-traveled back to
the span contains non-fictional facts written in artistic styles, e.g., generated poems
the span contains verifiable facts about established fictional works, e.g., plot points in
If the span is deemed not to contain factual information, select No factual information as the
rating and move on to the next span; if it does contain factual information, select one of the
Step 2
Factuality Ratings
If the span has factual claim
Use this label when there is any contradicting URL and no supporting URLs for that
claim.
Some claims contradict common knowledge (e.g. "white is a dark color") or are simple to
disprove (e.g., "'dream' is a 4-letter word"). In such cases, rate as "Inaccurate" even when
"Unsupported": At least one claim in the span is likely made up. If you have spent a reasonable
amount of time (e.g., 10 minutes) researching and found no supporting or contradicting URLs,
and the claim is not clearly correct or incorrect, use this label.
"Disputed" (e.g., controversial / mixed opinion / pros and cons): At least one claim has both
supporting and contradicting evidence that cannot be reconciled and does not have a universal
consensus.
Important: Not all subjective claims and opinions should be labeled as "Disputed". If the
statement is agreed upon by virtually all experts ("the earth is round"), mark the claim as
The main difference between "Unsupported" and "Disputed" is that for "Disputed" you
must have found equally acceptable URLs both supporting and contradicting the span;
whereas for "Unsupported", you must have found NO evidence whatsoever supporting or
Don't use Unsupported or Disputed when you yourself are unsure about the answer
Some claims are common knowledge (e.g. "black is a dark color") or simple to verify
The sentence contains claims that require a high level of expertise on the matter to
verify their accuracy, or are too complex (e.g. high-level mathematical operations,
coding, etc.).
The sentence contains text in a language other than your assigned language.
Please note!
assess.
enough to clearly indicate that the claim is wrong (e.g. example above about the latest
"Unsupported": you would normally be able to find the information on the web for similar
claims, but in this case the search results are not enough to determine whether the claim
is true or false.
"Can't confidently assess": you don't know where to find the right information because of
the complexity of the claim, or when it is not "searchable" information (e.g. the result of a
complex mathematical operation that you can't solve by your own means).
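The URL-based rules stated in this step can be sketched as a small function. This is only an illustration, not part of the rating tool: the function name `label_claim` and its parameters are hypothetical, and the "Accurate" branch (supporting URLs only) is an assumption based on the examples in these guidelines. The sketch also leaves out the judgment calls the text requires, such as whether the evidence can be reconciled, whether a reasonable amount of time was spent researching, and whether the claim is common knowledge.

```python
def label_claim(supporting_urls, contradicting_urls):
    """Suggest a label for ONE claim from the URLs a rater found.

    Hypothetical helper: it mirrors only the URL-count rules in the text.
    """
    has_support = bool(supporting_urls)
    has_contra = bool(contradicting_urls)

    if has_contra and not has_support:
        # Any contradicting URL and no supporting URLs.
        return "Inaccurate"
    if has_support and has_contra:
        # Both kinds of evidence were found (assumed irreconcilable here).
        return "Disputed"
    if has_support:
        # Assumption: supporting evidence only means the claim is accurate.
        return "Accurate"
    # No evidence either way after a reasonable search.
    return "Unsupported"


if __name__ == "__main__":
    print(label_claim([], ["https://example.com/contradicts"]))
    print(label_claim(["https://example.com/a"], ["https://example.com/b"]))
    print(label_claim([], []))
```

"Can't confidently assess" is deliberately not modeled, since it depends on whether the claim is searchable at all rather than on what the search returned.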
Important: Note that because one span can contain several claims, we have a hierarchy of
which label to assign if the labels are different for the different claims. The hierarchy is as follows:
2. Otherwise, if one claim in the span is “Unsupported”, the overall label is “Unsupported”.
3. Otherwise, if one claim in the span is “Disputed”, the overall label is “Disputed”.
4. Otherwise (i.e. all claims in the span are “Accurate”), the overall label is “Accurate”.
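The hierarchy above can be sketched as a priority lookup over the per-claim labels. This is a minimal illustration under a stated assumption: the first item of the hierarchy is not shown in the list above, and the sketch assumes it is "Inaccurate". The names `PRIORITY` and `span_label` are hypothetical, and "Can't confidently assess" is not covered because the list does not place it.

```python
# Highest-priority label first.
# ASSUMPTION: "Inaccurate" is the first item; items 2-4 come from the list above.
PRIORITY = ["Inaccurate", "Unsupported", "Disputed", "Accurate"]


def span_label(claim_labels):
    """Return the overall span label given one label per claim."""
    if not claim_labels:
        raise ValueError("a span with factual information needs at least one claim")
    unknown = set(claim_labels) - set(PRIORITY)
    if unknown:
        raise ValueError(f"labels not covered by this sketch: {sorted(unknown)}")
    # The first label in priority order that appears among the claims wins.
    for label in PRIORITY:
        if label in claim_labels:
            return label


if __name__ == "__main__":
    # One contradicted claim and one correct claim in the same span.
    print(span_label(["Inaccurate", "Accurate"]))
    print(span_label(["Accurate", "Disputed", "Unsupported"]))
```

A span is rated "Accurate" only when every claim in it is "Accurate"; any other label on a single claim overrides it.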
When you choose "Inaccurate" as the rating for a span, a new question assessing the severity of the
inaccuracy will appear: Are the inaccurate claims in this sentence providing misleading
information either (1) directly related to addressing the prompt or (2) potentially harmful
Examples:
The concept of "harmful" is debatable, but if you're in doubt mark the response as "True" and
Example 1
The source found that the official website of the US Bureau of Labor Statistics (a trustworthy,
official source), explicitly states that there is an anticipated decline in job outlook: this contradicts
inaccurate information in a sentence that responds directly to the main question in the prompt.
Example 2
From the response, we see that this bulleted item appears in the context of "a list of acting
credits for Christopher Daniel Barnes". The claims should be interpreted with respect to this
context.
Note that two claims are being made in the span: 1) Christopher Daniel Barnes played Spider-Man
in Ultimate Spider-Man; 2) Ultimate Spider-Man ran from 2012 to 2017. The first claim has
contradicting evidence, as shown by the URL provided. The role of Spider-Man was voiced by
Drake Bell, whereas Christopher Daniel Barnes voiced a different character (Electro), which
The second claim (‘Ultimate Spider-Man ran from 2012 to 2017’) is actually correct if you look at
since not all claims are supported, we do not need to provide this URL.
Regarding the severe inaccuracy question, this is rated as "False" because, although the information is
inaccurate, it is neither harmful nor a direct response to the question in the prompt.
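The severity question applied in these examples can be sketched as a simple check. The function name `severe_inaccuracy` and its parameters are hypothetical; using `None` to represent a rater who is in doubt about harm is an assumption made for this illustration, following the guidance to mark "True" when in doubt.

```python
from typing import Optional


def severe_inaccuracy(addresses_prompt: bool, harmful: Optional[bool]) -> bool:
    """Answer the severity question for a span already rated "Inaccurate".

    addresses_prompt: the misleading claim directly addresses the prompt.
    harmful: True/False, or None when the rater is in doubt (assumption).
    """
    if harmful is None:
        # When in doubt about harm, the guidance is to mark "True".
        return True
    return addresses_prompt or harmful


if __name__ == "__main__":
    # Example 2: inaccurate, but neither harmful nor answering the prompt.
    print(severe_inaccuracy(addresses_prompt=False, harmful=False))
    # Example 1: inaccurate information in a sentence answering the main question.
    print(severe_inaccuracy(addresses_prompt=True, harmful=False))
```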
Accurate
Consideration: Mark a span as “Accurate” if all claims within it are factually correct and
Here is an example of a correct claim with supporting evidence from the Web:
In this example, we find a trustworthy source (Wikipedia) that explicitly states that Elizabeth
Warren was a Republican from 1991 to 1996. Therefore, we can state that the claims in the span
are "Accurate."
Unsupported
Consideration: Use “Unsupported” if a claim seems fabricated or if no supporting or
"hallucinated" by the chatbot: we can find sources on the issue but we do not find explicit
A web search looking for movies/TV shows filmed in the restaurant Peter Luger Steak House
does not explicitly mention anywhere that "When Harry Met Sally" or "Seinfeld" were filmed in
this restaurant. Therefore, we can conclude that the claims in the span are hallucinated. URLs
researched for hallucinations should NOT be put in the "Supporting URLs" field. Feel free to note
Disputed
Consideration: Label a span as “Disputed” if it contains a claim with both supporting and
Revolution … ended with the formation of the French Consulate in November 1799.")
and contradicting evidence ("The French Revolution began in 1789 and lasted until
1794."), depending on which event should mark the end of the period.
e.g., "X was a good president." or "X University has a strong academic reputation."
trustworthy evidence for the consensus (e.g., scientific findings or recognition awards),
e.g., "Harvard University has a strong academic reputation" (Accurate). In such cases, it
is likely that either the supporting or contradicting URLs fields would be empty.
possibilities open for other circumstances. Supporting URLs should be provided in this
case.
Require specific expertise to verify or are too complex to assess with available
resources
Are HARD punts (see next page for details on hard vs. soft punts)
medical claim that is beyond general knowledge and requires expert verification.
Hard punts: The model replies with a very short answer, usually no more than one or
two short sentences, stating that it cannot reply to the question. Examples:
“I'm still learning how to answer this question. In the meantime, try Google Search.”
“I'm sorry, but I can't answer that question. It's a very sensitive topic, and I don't want to
offend anyone. Perhaps you could ask a historian or a religious scholar instead.”
Soft punts: The model actively refuses to answer the question or ignores it, but provides
a longer answer than a hard punt, which may include, for instance, more detail on the reasons
why it’s refusing to answer, or additional information about topics related to the question
(without providing a definitive answer). Responding with a soft punt is also known
as "hedging".
Cheat Sheet
Please use this picture as a cheat sheet.
You can also find the link to the summary of the instructions here.