Training - Interactive Preference Collection


Interactive Preference Collection


Team Training Session

Prepared by: Julie De Alvarenga, Program Manager at e2f


23 April 2024 (v3)


Main Goal:
The goal of Interactive Preference Collection is to collect high-quality multi-turn conversations in which two different Agents each propose a response.

● You begin or continue the conversation with a prompt you write based on the
provided instructions.
● One Human prompt plus one Agent response counts as one turn. Both Agents
provide a response in every turn.
● Evaluate both responses, determine which response is better and by how much,
then provide an overall quality score for each response (see the sketch below).

The use of ChatGPT or any other LLM or AI Assistant is completely prohibited in this project.
Anyone found using ChatGPT will be removed immediately.
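To make the turn structure concrete, here is a minimal sketch in Python of the data each turn produces. This is illustrative only; the field names and types are assumptions, not the platform's actual schema.

    from dataclasses import dataclass

    # Illustrative sketch only; names are assumptions, not the platform schema.
    @dataclass
    class Turn:
        human_prompt: str       # the prompt you write for this turn
        response_a: str         # candidate response from Agent A
        response_b: str         # candidate response from Agent B
        preferred: str          # which response is better: "A", "B", or "tie"
        quality_score_a: int    # overall quality score, 1 (Terrible) to 7 (Great)
        quality_score_b: int    # overall quality score, 1 (Terrible) to 7 (Great)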

Accessing the UI Platform




Accessing the Platform


On the annotation/review/QA main page, the full guidelines and an FAQ spreadsheet are attached for your reference.

● Log on to https://dashboard.e2f.ai/ with your Gmail account.
● Please set that account as your default the first time you log in, and clear all
other accounts and cookies before logging in.
● Click the e2f logo to open the dashboard's side menu.
● Select Annotations.
● Select <INTERACTIVE PREFERENCE COLLECTION>.

Who to Contact

Platform support (e2f Support):
● Via email: vendorsupport@e2f.com
● Via chat: https://jobs.e2f.io/

For any questions or concerns related to context:
● PMs: Ngoc & Julie
● Email: annotation@e2f.com

For questions about rates/contracts:
● Chi (kig@e2f.com)
● Michelle (mlt@e2f.com)
● Email: vendors@e2f.com

Simplified Instructions

Read the instructions and begin or continue the conversation. In each turn, evaluate
which Agent response is better and by how much, then provide an overall quality
score for both responses.

Note:

● Agent responses should be helpful, factually accurate, and not contain any potentially
sensitive or harmful content.

● 1 Human prompt + 1 Agent response = 1 Turn

● If both responses shown convey the necessary information, are factually correct, and are not
harmful, the better response is the one that best satisfies the original request, has better style
or wording, or is more concise.

Response Quality Score



Quality Scoring

In the Response Quality Score evaluation step, quality scores range from 1 to 7, where 7 is Great and 1 is Terrible, according to the definitions and requirements in the Response Quality Score table.

In general, responses with higher scores should be helpful, relevant, engaging, and factually correct. Responses that convey incorrect information, are off-topic, or are nonsensical should receive lower scores.

7 (Great)
Truthful, non-toxic, helpful, neutral, comprehensive, detailed, and reaches beyond the surface level. Zero spelling, grammar, or punctuation errors.

5 (Mediocre)
Truthful, non-toxic, helpful, and neutral in tone. Although it does not fully answer the question, it is still relevant, factually correct, and helpful. Zero spelling, grammar, or punctuation errors. Could be a little more comprehensive, but is still helpful and fully satisfies the Human's request.

3 (Bad)
Does not completely fulfill the ask or adhere to the instructions. Is unhelpful or factually incorrect. Contains grammatical or stylistic errors. A response with a 3-rating has at least one of the following violations:
● At least one (1) spelling or grammar error.
● Does not contain a disclaimer, if one should have been included.
● Does not meet all parameters.
● Provides false information or advice.
● Is not helpful to the Human or does not adhere to instructions.

1 (Terrible)
Is irrelevant to the dialog history, or nonsensical. Contains sexual, violent, or harmful content, or personal data. Assign a 1-rating automatically if the response:
● Is empty.
● Is nonsensical.
● Is irrelevant.
● Contains any sexual, violent, harmful, or personal info.
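As a concrete reference, the scale's anchor labels could be captured in a few lines of Python. This is an illustrative sketch only, not part of the project tooling.

    # Anchor labels for the 1-7 Response Quality Score scale (illustrative).
    QUALITY_ANCHORS = {7: "Great", 5: "Mediocre", 3: "Bad", 1: "Terrible"}

    def validate_score(score: int) -> None:
        # Scores must be whole numbers from 1 (Terrible) to 7 (Great);
        # even scores (2, 4, 6) fall between the labeled anchors.
        if score not in range(1, 8):
            raise ValueError(f"quality score must be 1-7, got {score}")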

Comparing Responses in a Task



What to look for…

Consider how your perception of the Agent would be influenced by each of the two responses, either positively or negatively, due to the quality of the response.

Quality can be impacted by several factors, including helpfulness, clarity, informativeness, and whether the response makes you want to continue interacting with the Agent.

Consider the following:
● Spelling, grammar, and sentence structure.
● How factually accurate the response is.
● How disruptive the response is to the stream of conversation.
● Is the response harmless, helpful, and honest?
● Does the response go beyond the surface, and does it provide a comprehensive, complete response?


Overall Ranking Tenets


The best response addresses all of the specifications or requests in your utterance, and is the most relevant,
informative, thoughtful, logical, and well-formed.

● The best response should be ranked the highest, and the worst response the lowest. If two
responses are similar in quality, you can select a tie (one way such a ranking could be recorded is sketched after this list).
● Agents sometimes generate factually incorrect responses. Try to fact-check the information in the response
with a quick Google search. If a response has factually incorrect information, rank it lower.
● Ties are allowed: occasionally two responses will be equal in quality. That said, ties should not
be frequent occurrences.
● In a scenario where both responses are of similar quality, rank them based on which response fully answers or
satisfies the prompt in the most helpful, well-formed, clear, logical, and natural manner. Keep in mind, as
stated above, some responses may be ranked the same.
● If a response is truncated, meaning it appears to stop in the middle of a sentence or word, rank
it as normal, judging its quality against the rest of the responses.
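The ranking itself boils down to a winner plus a margin. Here is a minimal Python sketch, assuming magnitude labels that mirror the wording of the examples later in this deck ("barely", "slightly", "much" better); the actual labels in the platform UI may differ.

    from dataclasses import dataclass

    # Illustrative only; the label set and field names are assumptions.
    MAGNITUDES = ("tie", "barely better", "slightly better", "much better")

    @dataclass
    class Comparison:
        winner: str     # "Response 1", "Response 2", or "tie"
        margin: str     # one of MAGNITUDES; "tie" when winner is "tie"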

Harmless, Helpful, Honest Tenets

Harmless:
1. Does not encourage hate, intolerance, misinformation, or insults.
2. Does not provide responses to requests that the Agent is not qualified to answer.
3. Does not respond with subjective opinions or preferences.

Helpful:
1. Do not make assumptions about the Human.
2. Responses are comprehensive and US-centric.
3. All parameters of a request must be answered, addressed, or fulfilled, and all requirements of a prompt must be followed or met.

Honest:
1. The information in the response is accurate and details can be proven.
2. The information is from a trustworthy source and is provided in an unbiased manner.
3. If an answer cannot be found, the response must state that while providing some information.

Examples

Example 1:
Prompt: "How many miles away is the Earth from the Moon?"

Response 1: "On average, the Earth is approximately 238,855 miles from the Moon."

Response 2: "On average, the Earth is approximately 238,855 miles from the Moon. At its closest point, it is 225,623 miles away, and at its furthest, 252,088 miles away."

Which is better? Response 2 is barely better. It is the more accurate and comprehensive response that goes beyond a surface-level answer.

Example 2:
Prompt: "How many miles away is the Earth from the Moon?"

Response 1: "The earth is far from moon."

Response 2: "Mercury is not near the Earth."

Which is better? Response 1 is slightly better. Although neither response answers the question or satisfies the Human request, Response 1 provides some relevant information.

Example 3:
Prompt: "How many miles away is the Earth from the Moon?"

Response 1: "On average, the Earth is approximately 238,855 miles from the Moon. At its closest point, it is 225,623 miles away, and at its furthest, 252,088 miles away."

Response 2: "Mercury is not near the Earth."

Which is better? Response 1 is much better. Response 2 does not answer or address the question. Response 1 completely and comprehensively answers the question.

Writing Realistic Human Prompts



Writing Tips

Realistic and diverse Human behavior. Create realistic, meaningful, challenging Human utterances that you can imagine a real Human saying during a conversation. Some potential Human behaviors that can be used (do not limit yourself to these):

● Dive deeper. Ask increasingly more difficult questions.
● Repeat. If the Agent fails to understand a question, a real Human is likely to repeat themselves.
● Request changes. Request specific changes to the existing response.
● Follow up. Ask about a related topic or question.
● Refer to a previous turn. Refer to something that was discussed in previous turns (for example: "You mentioned X, can you explain that in more detail?").
● Shift topics. Ideally this should be done in a natural way.

Think of other Human behaviors. How would you engage with the Agent in conversation? The ultimate goal is to generate realistic dialogs.

Don't repeat any past utterances or phrases, and try to avoid repeating utterances across different conversations you have on the project.

Dialogue Example:
Human: I'd like to talk about Taylor Swift
Agent: Sure, I've heard that she's a very talented singer and songwriter. What would you like to know about her?

Good:
● Human: What are some of Taylor Swift's most popular songs?
● Human: I love her song Blank Space! Are there other songs she has that are like that?
● Human: Has Taylor Swift won any awards for her music?

Bad:
● Human: Tell me more
● Human: Who is she?
● Human: Is she playing any concerts nearby?

Very Bad:
● Human: Who is she currently dating?
● Human: I want to learn to play guitar

Reviewing and QA Checks




Quality Assurance
After the initial Annotator's work has been completed, the task is sent to another party, who acts as a
Reviewer. The Reviewer checks the first Annotator's ratings and provides feedback or adjusts the
ratings. When that step is complete, the task is reviewed once more by a QA Checker.

Annotator → Reviewer → QA Checker



Review and QA Instructions

Review:
● Ensure the file has good quality, is free of errors, and is ready for delivery. If the Annotators' answers do not meet the requirements, Reviewers must update and/or rewrite them accordingly.
● Evaluate the Annotators' answers. The focus is on high quality and adherence to the guidelines.

QA Check:
● Ensure the file has good quality, is free of errors, and is ready for delivery.
● Evaluate the Annotators' work and provide feedback. The focus is on high quality and adherence to the guidelines. Fix any errors.
● Evaluate the Reviewer's work and provide feedback with regard to quality, accuracy against guidelines, and constructive ratings.

Thank you
