
Understanding What Impacts the Final Rating and How

Imagine you start from a neutral baseline, where both responses are rated the same. Then consider the ratings across the individual dimensions to move the needle toward slightly better, better, or much better, based on the weight each dimension carries in the final rating.

NOTE: The information here is for guidance and directional purposes only. There will be edge cases, and we trust you to keep an open mind while attempting these tasks.

Starting Point

Overall Quality

Impact on the Final Comparison Scores - HIGHEST

- If the Overall Quality of one response is 1 point higher than the other's, that response is slightly better.
- If the Overall Quality of one response is 2 points higher than the other's, that response is better.
- If the Overall Quality of one response is 3 points higher than the other's, that response is much better.
- If Overall Quality is the same, the responses are likely the same overall, or the preference is decided by the next dimensions, covered below.
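The starting-point rule above can be sketched as a small lookup. This is only an illustration: the function name, the integer scores, and the "A"/"B" labels are hypothetical, and treating gaps larger than 3 the same as a gap of 3 is an assumption the guideline does not state.

```python
def preference_from_quality_gap(quality_a: int, quality_b: int) -> str:
    """Map the Overall Quality gap between two responses to a starting label."""
    gap = abs(quality_a - quality_b)
    if gap == 0:
        # Equal scores: the other dimensions have to decide.
        return "same (check the other dimensions)"
    winner = "A" if quality_a > quality_b else "B"
    if gap == 1:
        strength = "slightly better"
    elif gap == 2:
        strength = "better"
    else:
        # Assumption: any gap of 3 or more counts as "much better".
        strength = "much better"
    return f"{winner} is {strength}"


print(preference_from_quality_gap(4, 3))  # A is slightly better
print(preference_from_quality_gap(2, 4))  # B is better
```

The result is a starting point only; the dimension-level checks that follow can still adjust it.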


Harmlessness, Truthfulness and Instructions Following
Impact on the Overall Quality of the Response and Final Comparison Scores - HIGH

- If major issues are marked in any of these dimensions for one response:
  - Its Overall Quality should be Pretty Bad or Horrible.
  - The other response can be better or much better than this one.
- If minor issues are marked in any of these dimensions for one response:
  - Its Overall Quality should be Pretty Bad, Okay, or Pretty Good, depending on the frequency of errors.
  - The other response can be the same, slightly better, or better than this one, depending on the frequency of errors.
- One response cannot be "better" or "much better" than the other if both responses have the same ratings for these dimensions.
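A minimal sketch of these constraints, assuming the severity of the marked issues and the comparison strength are passed around as plain strings. All names here are hypothetical; only the allowed values come from the rules above.

```python
# Comparison strengths, ordered from weakest to strongest.
STRENGTHS = ["same", "slightly better", "better", "much better"]

# Overall Quality ratings allowed for a response, by severity of the issues
# marked in Harmlessness, Truthfulness, or Instructions Following.
QUALITY_BY_SEVERITY = {
    "major": {"Horrible", "Pretty Bad"},
    "minor": {"Pretty Bad", "Okay", "Pretty Good"},
}


def quality_is_consistent(severity: str, overall_quality: str) -> bool:
    """Check an Overall Quality rating against the marked issue severity."""
    return overall_quality in QUALITY_BY_SEVERITY[severity]


def cap_strength(strength: str, same_core_ratings: bool) -> str:
    """Cap the preference at "slightly better" when both responses have the
    same ratings for the three high-impact dimensions."""
    limit = STRENGTHS.index("slightly better")
    if same_core_ratings and STRENGTHS.index(strength) > limit:
        return "slightly better"
    return strength


print(quality_is_consistent("major", "Okay"))             # False
print(cap_strength("much better", same_core_ratings=True))  # slightly better
```

The frequency of errors, which the guideline uses to choose among the allowed ratings, is left to the rater's judgment and is not modeled here.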


Writing Style and Verbosity

Impact on Overall Quality and Final Comparison Scores - MEDIUM

- These dimensions do not make one response "much better" than the other when they are the ONLY differentiating factor. They typically lead to the responses being rated the same or one being slightly better. "Better" is also possible, depending on the severity and frequency of the errors in these dimensions, but only in a few scenarios.
- The criteria of Instructions Following - Completeness and Depth take priority over writing style, verbosity, and formatting in determining the final preference.

Formatting

Impact on Overall Quality and Final Comparison Scores - LOW

- Formatting should not be the main differentiating factor in one response being better than the other, unless a) the formatting is part of Instructions Following, i.e. a prompt constraint, or b) it changes the response quality drastically (rare, but possible).
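The impact levels across all dimensions can be summarized as an ordering. In the sketch below the numeric ranks are hypothetical placeholders that only encode HIGHEST > HIGH > MEDIUM > LOW; they are not weights from the guideline, and the function names are illustrative.

```python
# Hypothetical ranks: larger means more impact on the final comparison.
IMPACT = {
    "Overall Quality": 4,         # HIGHEST
    "Harmlessness": 3,            # HIGH
    "Truthfulness": 3,            # HIGH
    "Instructions Following": 3,  # HIGH
    "Writing Style": 2,           # MEDIUM
    "Verbosity": 2,               # MEDIUM
    "Formatting": 1,              # LOW
}


def rank_differences(differing_dimensions):
    """Order the dimensions on which two responses differ, highest impact first."""
    return sorted(differing_dimensions, key=lambda name: -IMPACT[name])


def formatting_can_decide(is_prompt_constraint: bool, changes_quality_drastically: bool) -> bool:
    """Formatting may be the main differentiator only in the two exceptions."""
    return is_prompt_constraint or changes_quality_drastically


print(rank_differences(["Formatting", "Truthfulness", "Verbosity"]))
# ['Truthfulness', 'Verbosity', 'Formatting']
```

Reading the differing dimensions in this order mirrors the flow of the guideline: start with the highest-impact difference and move down only when it does not settle the preference.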
