12/10/2018 Indic_TEST_SET

**This document is confidential, do not redistribute**

Indic Written Domain


Conventions
You will listen to some audio files in your language and transcribe exactly what you hear. In
general, we ask that you transcribe audio according to your language's grammar and this
documentation. It is essential that you read through this entire document before beginning
the task, so that you can correctly identify phrases that will need special formatting, and
how to format them.

DIFFICULT UTTERANCES

Skipping a prompt
Hesitations and truncations
Background and foreground speech
Foreign language
Accents

AGREED SPELLING

Spelling out
Interjections
Proper names
Brand and product
Media title
Multiple spellings

PUNCTUATION

Fragments versus sentences
Commas
Intonation marks
Colon and quotation
Other symbols
Spoken punctuation

FORMAT

Number
Currency and unit
Date and time
Address
Web
Abbreviation

https://speech.google.com/annotation/guidelines/indic_test_set/index.html 1/10

Difficult utterances
Everything relating to problematic utterances (background noise, false starts, etc.) or
different language varieties.

Skipping a prompt

If utterances contain both human speech and speech from a machine (e.g. the
voices of Google Now or Siri or TV/radio), only transcribe the human speech.

If the audio contains both speech and laughter, transcribe only the speech.

Profanity should be fully transcribed. Under very rare circumstances, extremely
offensive profanity can be skipped.

Only audio that meets one or more of the following criteria may be skipped:

You can't hear anything / The audio doesn't load.

People in the audio are not speaking to the device.

After reading ALL of the foreign language section, you determine that all or part
of the speech is in a different language that is not commonly understood by
speakers of your language.

You can't understand one or more words in the audio.

The audio only has speech from a machine (TV, phone, recording, radio).

The audio only has human sounds (e.g. people singing, only laughter or babbling).

The audio only has sounds (car sounds, wind, kitchen sounds, animal sounds).

The audio has one of the following: credit card information, passport numbers,
driver licence, national IDs, health care information.

Hesitations and truncations

Do not transcribe false starts unless they are complete words.

If a very small part of a word (at most one syllable) has been cut off, and you know
what the word is supposed to be, transcribe the entire word. If you are not sure
what the word should be, do not transcribe the word at all. Do not put punctuation
after words that have been cut off.

If a quotation is cut off in the middle, use an end quotation mark anyway.

If a word is repeated five times or less, transcribe each repetition. But if a word is
repeated more than five times, skip the prompt.
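The repetition rule above can be sketched as a small check. This is only an illustration: it assumes that "repeated" means consecutive occurrences of the same word and that the count includes the first occurrence, neither of which the guideline states explicitly. The function names are hypothetical.

```python
from itertools import groupby

MAX_REPEATS = 5  # threshold stated in the guideline


def longest_run(words):
    """Return the length of the longest run of consecutive identical words."""
    return max((sum(1 for _ in group) for _, group in groupby(words)), default=0)


def should_skip_for_repetition(words):
    """True when some word occurs more than MAX_REPEATS times in a row.

    Assumption: repetitions are counted as consecutive occurrences,
    including the first one.
    """
    return longest_run(words) > MAX_REPEATS
```

With this reading, a word said five times in a row is transcribed five times, while a sixth consecutive occurrence means the prompt is skipped.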

Transcribe only the numbers that you hear, even if the speaker didn't finish saying the
entire number.

Do not transcribe filler words unless intended by the speaker to be transcribed.
Never lengthen them.

Background and foreground speech

Only transcribe foreground speech. A user's speech may go from the foreground
to the background based on the volume of their speech and who they are speaking
to.

If two people take turns, without overlap, and are both in the foreground at
roughly the same volume, transcribe the speech of both speakers as different
sentences. Separate different speakers' sentences with end punctuation.

If one person clearly speaks in the foreground and someone interrupts at roughly
the same volume with a brief (less than a second) overlapping speech segment,
transcribe the main speaker and ignore the rest.

If two or more people are speaking at once with no one clearly in the foreground,
tag as [overlapping]. Do this for overlaps longer than one second. Use this tag even
when one person is a bit louder than the other(s) and you can tell what they're
saying.

Foreign language

Do not skip utterances that contain words in English. They should all be
transliterated using your script's characters. English letters should only be used for
the following: measurement units, URLs, tech words and certain company names.

If words in any other foreign language (besides English) are included in a sentence
of your target language, transliterate them only if they can be commonly
understood by speakers of your language. Otherwise, skip the utterance.
Categorize each utterance with foreign words as "Contains English:" and choose
"Yes", "No" or "Unsure".

Accents

If you hear a word with non-standard pronunciation, transcribe the word using
the standard spelling according to the official dictionary of your language.

Agreed spelling
Spelling conventions for words where several spellings are possible, as well as proper
names.

Spelling out

If a word is spelled out, write it with spaces in between. This rule does not apply to
acronyms, URLs or email addresses.

Interjections

Transcribe words representing laughter or other non-speech vocalizations with up
to three syllables, but no more.

Ignore actual laughter that is included within speech. If the entire audio contains
only laughter, skip the audio.

Proper names

For proper names, always use the official spelling and punctuation.

If a personal name could have multiple spellings and context does not help choose
a spelling, use the spelling that yields the most Google search hits when you search
for the name followed by the word "name" (without quotation marks) (e.g. "Anna
name").

Brand and product

Format brand and company names as they are formatted on their Wikipedia page
in your language. Do not transliterate if the formatting on the Wikipedia page is
not transliterated.

Use the spelling "Ok" for the phrase "Ok Google", as well as related phrases like
"Ok Google Now" and "Ok Glass". For all other cases, transcribe the word as "okay".

Media title

Write media titles as they are formatted on Wikipedia in your language, if
available.

Do not use quotation marks for media titles.

Transcribe all media titles with their original punctuation. If punctuation from the
title occurs at the end of a sentence, do not transcribe another punctuation mark
(a period, question mark, or exclamation mark) for end of the sentence.

Multiple spellings

If you hear a word that does not sound like a standard word of your language
because there is a small sound change (e.g. accent, speech error, speech
impairment), transcribe the intended word.

Transcribe onomatopoeia when clearly spoken. Otherwise, skip the sentence.

If you hear a word that does not sound like a standard word of your language, but
it is obviously based on real words, suffixes, or prefixes, transcribe as is.

If you hear a word that does not sound like a standard word of your language
because it appears to be nonsense, first perform a Google search for the word. If
there is a clear candidate, transcribe that word.

If a word appears to be nonsense and a Google search returns no clear results but
it is easy to spell and articulated clearly, transcribe it anyway.

If a prompt contains nonsense words, search for them on the internet. If you still
cannot figure out how the word should be spelled, skip the prompt.

Punctuation
Follow the punctuation regulations of your locale. Additional conventions are outlined in
this section.

Fragments versus sentences

Only add punctuation when it is grammatically required.

Answers to questions and sentences with dropped subjects should be transcribed
with ending punctuation.

Interjections, greetings, and farewells said in isolation should be transcribed with
ending punctuation.

Transcribe web searches without ending punctuation.

Incomplete sentences in the audio may be a result of cut-off audio samples. Add
end punctuation to incomplete sentences that sound like the end of a sentence. If
they do not clearly sound like the end of a sentence, leave out ending punctuation.

Transcribe all voice actions with ending punctuation. A voice action is a request
spoken to a device and includes a verb.

Do not add ending punctuation to web search queries. Web search queries are
spoken versions of what a person might type into a Google search bar. Note: Web
search queries are different from voice action requests because they do not
include a command directed to a device.
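The distinction between voice actions and web search queries can be sketched as follows. This is a minimal illustration, not part of the guideline: the `UtteranceKind` type and the function name are hypothetical, deciding which kind an utterance belongs to is left to the transcriber, and the default end mark "." is an assumption (some scripts use a different sentence-final mark, such as the danda "।").

```python
from enum import Enum


class UtteranceKind(Enum):
    VOICE_ACTION = "voice_action"  # request spoken to a device, includes a verb
    WEB_SEARCH = "web_search"      # spoken version of a typed search query


END_MARKS = ".?!।"  # assumption: marks treated as sentence-final


def apply_end_punctuation(text, kind, end_mark="."):
    """Add ending punctuation to voice actions; leave web searches untouched."""
    stripped = text.rstrip()
    if kind is UtteranceKind.WEB_SEARCH:
        # Nothing is added; existing characters (e.g. a media title's own
        # punctuation) are deliberately left as they are.
        return stripped
    if stripped and stripped[-1] in END_MARKS:
        return stripped
    return stripped + end_mark
```

For example, `apply_end_punctuation("turn on the lights", UtteranceKind.VOICE_ACTION)` adds a period, while the same call with `UtteranceKind.WEB_SEARCH` returns the text unchanged.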

Commas

Only use commas if they are required according to your language grammar.

Put a comma after a discourse word, interjection, or yes/no word if it is followed
by a sentence. However, if there is a long pause after the discourse word,
interjection, or yes/no word, put ending punctuation after it.

Use commas in lists.

Use commas in sign-offs, such as those at the end of a message. Do not use end
punctuation.

Do not use commas in sentences that consist only of a greeting and an addressee.
If a greeting occurs at the beginning of a sentence or fragment, place a comma
after the greeting. If the greeting includes an addressee, place the comma after the
addressee.

Except in greetings, sentence-initial and sentence-final addressees should be
separated by a comma.

The phrase "Ok Google" in isolation is transcribed without a comma or end
punctuation. When the phrase appears before longer utterances, place a comma
after "Google".

Intonation marks

Punctuate the following as questions: 1) All utterances syntactically built as
questions, regardless of intonation. 2) All utterances which sound like they are
being used as questions, regardless of sentence structure.

Only use an exclamation point if the speaker clearly uses a loud or excited tone of
voice.

Colon and quotation

If a speaker is quoting another person, transcribe the quoted speech following the
punctuation rules of your language.

If the quoted text is a complete sentence, transcribe ending punctuation inside the
quotation marks. In cases like these, do not add additional ending punctuation
after the main sentence.

If a quote appears at the end of a voice action, use a colon. If the quote appears in
the middle of a voice action, put quotation marks around the quote.

Do not use quotation marks for metalinguistic uses of words or phrases. These
uses include defining the word, talking about the spelling of the word, or any other
type of reference to the word itself as a thing.

Other symbols

Apart from standard letters, you should not use any other symbol than: 0-9
äâàæÆçÇéèëêïîñÑôöŒœüûùμÿÄÂÀÉÈËÊÏÎÔÖÜÛÙŸ²³,?!'"_°:.()<>{}[]√/@#$€£₹
+=%*&-.;
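A transcript can be checked against the symbol list above with a short validator. This is a sketch under stated assumptions: "standard letters" is interpreted here as any Unicode letter or combining mark (so that Indic scripts with vowel signs pass), the space character is allowed, and the trailing ";" in the list is read as part of the permitted set. The function name is hypothetical.

```python
import unicodedata

# Symbols copied from the guideline's list (digits, accented Latin letters,
# and punctuation). Reading ";" as part of the set is an assumption.
ALLOWED_SYMBOLS = set(
    "0123456789"
    "äâàæÆçÇéèëêïîñÑôöŒœüûùμÿÄÂÀÉÈËÊÏÎÔÖÜÛÙŸ"
    "²³,?!'\"_°:.()<>{}[]√/@#$€£₹+=%*&-;"
)


def disallowed_characters(text):
    """Return the characters in text that fall outside the permitted set."""
    bad = []
    for ch in text:
        if ch == " " or ch in ALLOWED_SYMBOLS:
            continue
        # Assumption: letters (L*) and combining marks (M*) of any script
        # count as "standard letters".
        if unicodedata.category(ch)[0] in ("L", "M"):
            continue
        bad.append(ch)
    return bad
```

Note that invisible joiner characters (Unicode category Cf), which some Indic orthographies rely on, would be reported by this sketch; whether they should be permitted is not covered by the guideline.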

When two opposing teams are mentioned, include a hyphen between their
names.

Include a hyphen between locations in flight itineraries.

Spoken punctuation

For sentence-level spoken punctuation, write out the full word or words between
curly brackets. Do not add punctuation symbols after spoken punctuation. Be
careful with homonyms. (See exceptions in the next rule.)

However, don't spell out punctuation if it contradicts the established transcription
conventions of a certain phrase like web pages, email addresses, addresses, phone
numbers, etc.

If a word that can refer to a punctuation mark is spoken in isolation, it should be
written out between curly brackets.

Format
Transcribe numbers, abbreviations etc. following the formatting conventions in this section.

Number

Only Western Arabic numerals should be used.

Cardinals and ordinals from 0 to 9 are written with letters (except for measures
and currency - see Currency and Unit). Use digits for cardinals and ordinals 10 and
above, even if they are coordinated with numbers under 10. Transcribe all decimal
numbers as digits.

When two or more numbers refer to the same noun, and at least one number is 10
or greater, transcribe both as numerals.
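The two rules above (words for 0 to 9, digits from 10 upward and for decimals, and digits throughout when numbers share a noun and one of them is 10 or greater) can be sketched like this. The English number words are placeholders for the target language's own words, and the function names are hypothetical. The exception for measures and currency is not modelled here beyond the `force_digits` flag.

```python
# Placeholder words: a real transcript would use the target language's words.
SMALL_NUMBER_WORDS = [
    "zero", "one", "two", "three", "four",
    "five", "six", "seven", "eight", "nine",
]


def format_cardinal(value, force_digits=False):
    """Spell out integers 0-9; use digits for everything else."""
    if force_digits or isinstance(value, float) or not 0 <= value <= 9:
        return str(value)
    return SMALL_NUMBER_WORDS[value]


def format_group(values):
    """Format numbers that refer to the same noun.

    If any number is 10 or greater, all of them are written as digits.
    """
    force = any(v >= 10 for v in values)
    return [format_cardinal(v, force_digits=force) for v in values]
```

So `format_group([3, 4])` keeps both numbers as words, while `format_group([3, 12])` writes both as digits.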

Write lists of numbers with digits and without commas.

For long numbers (4+ digits) indicating quantity, use a comma to separate groups
of three digits.
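As a brief illustration, Python's format specification already produces the three-digit grouping described above. The function name is hypothetical, and the sketch assumes the value is an integer quantity (not a year, phone number, or list of digits).

```python
def format_quantity(n):
    """Group the digits of an integer quantity in threes, separated by commas."""
    return f"{n:,}"
```

Numbers with fewer than four digits come back unchanged. Note that this is the Western three-digit grouping the guideline asks for, not the lakh/crore grouping (such as 12,34,567) common in South Asian writing.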

In math expressions or units & measures, transcribe fraction words using
numerals and slashes. Be careful not to use pre-combined fractions like "¼".
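Pre-combined fraction characters can be caught and expanded with Unicode compatibility normalization. The sketch below is an illustration only: the function name is hypothetical, and inserting a space between a whole number and the fraction (as in "1 1/2") is an assumption, since the guideline does not specify how mixed numbers are spaced.

```python
import unicodedata

FRACTION_SLASH = "\u2044"  # the slash that NFKC produces inside "1⁄4"


def expand_precombined_fractions(text):
    """Replace characters such as "¼" with numerals and a plain slash."""
    out = []
    for ch in text:
        decomposed = unicodedata.normalize("NFKC", ch)
        if decomposed != ch and FRACTION_SLASH in decomposed:
            # Assumption: separate a preceding whole number with a space.
            if out and out[-1].isdigit():
                out.append(" ")
            out.append(decomposed.replace(FRACTION_SLASH, "/"))
        else:
            out.append(ch)
    return "".join(out)
```

Text that already uses numerals and an ordinary slash passes through unchanged.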

For mixed numbers in math expressions and units & measures, transcribe them
using numerals.

When referring to items (not units or measures), write fractions out in words. With
mixed numbers, write the whole number part out in words if it is under ten,
otherwise write it with numerals.

For mixed numbers that represent currency amounts, always use decimals.

Transcribe percentages using numerals followed by the "%" sign. In the unlikely
case that you encounter a number of a million or greater used as a percentage,
spell it out.

Use Roman numerals only when part of an official name or title.

Transcribe seasons and episodes of television shows with numerals.

If it is a product type or statistic, use the common written form.

Transcribe phone numbers using the most common format in the transcription
language.

Transcribe phone numbers as you would write them down in their natural groups.
Do not use hyphens between groups. When applicable, the STD code should be
surrounded by spaces.

Transcribe alpha-digit sequences (product codes, car models, etc.) in their most
natural way (there may be more than one way to transcribe). Do not transcribe
credit card numbers or any other personal information numbers.

Math expressions should be transcribed with numerals and math symbols with
spaces in between them.
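The spacing rule for math expressions can be sketched with a small tokenizer. This is a hypothetical helper, not a prescribed tool: it only recognises non-negative numerals and the operators `+ - * / =`, and it silently drops anything else, so it is meant for checking simple expressions rather than for general use.

```python
import re

# Numerals (optionally with a decimal part) or a single basic operator.
MATH_TOKEN = re.compile(r"\d+(?:\.\d+)?|[+\-*/=]")


def space_math_expression(expression):
    """Return the expression with single spaces between numerals and symbols."""
    return " ".join(MATH_TOKEN.findall(expression))
```

For instance, "2+2=4" becomes "2 + 2 = 4", and an expression that is already spaced correctly is returned as it was.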

Currency and unit

Transcribe currencies as commonly written in the transcription language.

Write specific ranges or multiple amounts of currency with numerals. If the range
is not specified, write the numbers out as spoken.

Abbreviate all units that follow numeric values.

If it is clear from context that a number or number sequence refers to currency or
time, format it as such.

Common technical abbreviations

For megabytes use MB
For kilobytes use KB
For gigabytes use GB
For terabytes use TB

Common measurements of distance and rate

For millimeter use mm
For centimeter use cm
For meter use m
For kilometer use km
For miles per hour use mph
For kilometers per hour use km/h

Common measurements of area

For square kilometers or kilometers squared use km²

Common measurements of weight and volume

For grams use g
For milligrams use mg
For kilograms use kg
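The unit abbreviations listed above can be collected into a lookup table. This is a minimal sketch: the English unit names used as keys are placeholders for whatever the speaker says in the target language, the function name is hypothetical, and the space between the number and the unit is an assumption that the guideline does not state.

```python
# Abbreviations taken from the lists above; keys are placeholder English names.
UNIT_ABBREVIATIONS = {
    "megabytes": "MB", "kilobytes": "KB", "gigabytes": "GB", "terabytes": "TB",
    "millimeters": "mm", "centimeters": "cm", "meters": "m", "kilometers": "km",
    "miles per hour": "mph", "kilometers per hour": "km/h",
    "square kilometers": "km²", "kilometers squared": "km²",
    "grams": "g", "milligrams": "mg", "kilograms": "kg",
}


def format_measure(value, unit_name):
    """Write a numeric value followed by the abbreviated unit.

    Units without a listed abbreviation are left as spoken.
    """
    abbreviation = UNIT_ABBREVIATIONS.get(unit_name.lower())
    if abbreviation is None:
        return f"{value} {unit_name}"
    return f"{value} {abbreviation}"
```

The value is always written with digits here, in line with the exception for measures in the Number section.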

Date and time

Use the natural form for transcribing dates.

Exception: When the date is spoken as a sequence of numbers, transcribe as such.

Use the natural form for transcribing times whenever possible.

Write times in hh:mm format whenever possible, unless it would look unnatural to
do so.

Address

Write out the full names of locations, roads, states, etc. Only use abbreviations
when explicitly spoken.

Transcribe entities and locations by using a comma between them: "ENTITY,
LOCATION".

Web

Write URLs, email addresses, and Twitter hashtags as they are spoken and don't
capitalize them.

Do not correct speaker errors such as transcribing a slash when the user actually
says "backslash".

If the speaker drops a "w" or dots and it's an obvious URL, you should correct these
errors. If the speaker doesn't say the "w"s at all, do not add them.

If a URL is spelled out in individual letters, transcribe without spaces between
individual letters.
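The web rules above (no capitalization, no spaces between spelled-out letters, and no correction of speaker errors such as "backslash") can be sketched as follows. The spoken symbol names are English placeholders and the function name is hypothetical; a real workflow would use the words of the target language.

```python
# Placeholder spoken forms mapped to the symbols they name.
SPOKEN_SYMBOLS = {
    "dot": ".",
    "slash": "/",
    "backslash": "\\",  # kept as spoken, even if a slash was probably meant
    "at": "@",
}


def format_spoken_web_address(tokens):
    """Join spoken tokens into a lowercase address without spaces."""
    return "".join(SPOKEN_SYMBOLS.get(t.lower(), t.lower()) for t in tokens)
```

For example, the tokens "W", "W", "W", "dot", "example", "dot", "com" become "www.example.com". The sketch does not add a missing "w" or dot; that judgment call is left to the transcriber.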

Abbreviation

Do not abbreviate unless the speaker says an abbreviated form.

In acronyms, do not use periods between letters.

If a brand name uses periods, include the periods.
