Indic Written Domain Conversion
https://speech.google.com/annotation/guidelines/indic_test_set/index.html 1/10
12/10/2018 Indic_TEST_SET
Difficult utterances
Everything relating to problematic utterances (background noise, false starts, etc.) or
different language varieties.
Skipping a prompt
If utterances contain both human speech and speech from a machine (e.g. the
voices of Google Now or Siri or TV/radio), only transcribe the human speech.
If the audio contains both speech and laughter, transcribe only the speech.
Only audio that meets one or more of the following criteria may be skipped:
You can't hear anything / The audio doesn't load.
People in the audio are not speaking to the device.
After reading ALL of the foreign language section, you determine that all or part of the speech is in a different language that is not commonly understood by speakers of your language.
You can't understand one or more words in the audio.
The audio only has speech from a machine (TV, phone, recording, radio).
The audio only has human sounds (e.g. people singing, only laughter or babbling).
The audio only has sounds (car sounds, wind, kitchen sounds, animal sounds).
The audio has one of the following: credit card information, passport numbers, driver licence, national ids, health care information.
If a very small part of a word (at most one syllable) has been cut off, and you know
what the word is supposed to be, transcribe the entire word. If you are not sure
what the word should be, do not transcribe the word at all. Do not put punctuation
after words that have been cut off.
If a quotation is cut off in the middle, use an end quotation mark anyway.
Transcribe only numbers that you hear even if the speaker didn't finish saying the
entire number.
Only transcribe foreground speech. A user's speech may go from the foreground
to the background based on the volume of their speech and who they are speaking
to.
If two people take turns, without overlap, and are both in the foreground at
roughly the same volume, transcribe the speech of both speakers as different
sentences. Separate different speakers' sentences with end punctuation.
If one person clearly speaks in the foreground and someone interrupts at roughly
the same volume with a brief (less than a second) overlapping speech segment,
transcribe the main speaker and ignore the rest.
If two or more people are speaking at once with no one clearly in the foreground,
tag as [overlapping]. Do this for overlaps longer than one second. Use this tag even
when one person is a bit louder than the other(s) and you can tell what they're
saying.
Foreign language
Do not skip utterances that contain words in English. They should all be
transliterated using your script's characters. English letters should only be used for
the following: measurement units, URLs, tech words and certain company names.
If words in any other foreign language (besides English) are included in a sentence
of your target language, transliterate them only if they can be commonly
understood by speakers of your language. Otherwise, skip the utterance.
Categorize each utterance with foreign words as "Contains English:" and choose
"Yes", "No" or "Unsure".
Accents
If you hear a word with non-standard pronunciation, transcribe the word using
the standard spelling according to the official dictionary of your language.
Agreed spelling
Spelling conventions for words that have several possible spellings, as well as for proper
names.
Spelling out
If a word is spelled out, write it with spaces in between. This rule does not apply to
acronyms, URLs or email addresses.
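The spelling-out rule can be sketched in Python as follows. The function name `spell_out` is hypothetical and not part of the guidelines. The sketch assumes the transcriber has already decided that the word was spelled out and is not an acronym, URL or email address. It also splits by code point, which is only safe for scripts without combining marks; an Indic script would need a grapheme-aware split, which is not shown here.

```python
def spell_out(word: str) -> str:
    """Return the letters of a spelled-out word separated by single spaces.

    Assumption: each character of `word` is one letter. This does not hold
    for Indic scripts, where vowel signs combine with the preceding consonant.
    """
    return " ".join(word)


print(spell_out("Anna"))  # A n n a
```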
Interjections
Ignore actual laughter that is included within speech. If the entire audio contains
only laughter, skip the audio.
Proper names
For proper names, always use the official spelling and punctuation.
If a personal name could have multiple spellings and context does not help choose
a spelling, use the spelling that yields the most Google search hits when you search
for the name followed by the word "name" (without quotation marks) (e.g. "Anna
name").
Format brand and company names as they are formatted on their Wikipedia page
in your language. Do not transliterate if the formatting on the Wikipedia page is
not transliterated.
Use the spelling "Ok" for the phrase "Ok Google", as well as related phrases like
"Ok Google Now" and "Ok Glass". For all other cases, transcribe the word as "okay".
Media title
Transcribe all media titles with their original punctuation. If punctuation from the
title occurs at the end of a sentence, do not transcribe another punctuation mark
(a period, question mark, or exclamation mark) for the end of the sentence.
Multiple spellings
If you hear a word that does not sound like a standard word of your language
because there is a small sound change (e.g. accent, speech error, speech
impairment, etc.), transcribe the intended word.
If you hear a word that does not sound like a standard word of your language, but
it is obviously based on real words, suffixes, or prefixes, transcribe as is.
If you hear a word that does not sound like a standard word of your language
because it appears to be nonsense, first perform a Google search for the word. If
there is a clear candidate, transcribe that word.
If a word appears to be nonsense and a Google search returns no clear results but
it is easy to spell and articulated clearly, transcribe it anyway.
If a prompt contains nonsense words, search them on the internet. If you still
cannot figure out how the word should be spelled, skip the prompt.
Punctuation
Follow the punctuation regulations of your locale. Additional conventions are outlined in
this section.
Incomplete sentences in the audio may be a result of cut-off audio samples. Add
end punctuation to incomplete sentences that sound like the end of a sentence. If
they do not clearly sound like the end of a sentence, leave out ending punctuation.
Transcribe all voice actions with ending punctuation. A voice action is a request
spoken to a device and includes a verb.
Do not add ending punctuation to web search queries. Web search queries are
spoken versions of what a person might type into a Google search bar. Note: Web
search queries are different from voice action requests because they do not
include a command directed to a device.
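The difference between voice actions and web search queries can be sketched as below. This is an illustration only: the function name, the `kind` labels and the default end mark are assumptions, and deciding whether an utterance is a voice action or a web search query remains a human judgment that the code does not attempt.

```python
END_MARKS = (".", "?", "!")


def apply_end_punctuation(text: str, kind: str, mark: str = ".") -> str:
    """Add end punctuation to voice actions and leave web searches bare.

    `kind` is a hypothetical label chosen by the transcriber:
    "voice_action" or "web_search". `mark` can be set to the end mark
    of the target language.
    """
    if kind == "voice_action":
        if text.endswith(END_MARKS) or text.endswith(mark):
            return text
        return text + mark
    if kind == "web_search":
        return text  # no ending punctuation is added
    raise ValueError(f"unknown utterance kind: {kind!r}")


print(apply_end_punctuation("Call Anna", "voice_action"))        # Call Anna.
print(apply_end_punctuation("weather in Delhi", "web_search"))  # weather in Delhi
```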
Commas
Only use commas if they are required according to your language grammar.
Use commas in sign-offs, such as those at the end of a message. Do not use end
punctuation.
Do not use commas in sentences that consist only of a greeting and an addressee.
If a greeting occurs at the beginning of a sentence or fragment, place a comma
after the greeting. If the greeting includes an addressee, place the comma after the
addressee.
Intonation marks
Only use an exclamation point if the speaker clearly uses a loud or excited tone of
voice.
If a speaker is quoting another person, transcribe the quoted speech following the
punctuation rules of your language.
If the quoted text is a complete sentence, transcribe ending punctuation inside the
quotation marks. In cases like these, do not add an additional ending punctuation
after the main sentence.
If a quote appears at the end of a voice action, use a colon. If the quote appears in
the middle of a voice action, put quotation marks around the quote.
Do not use quotation marks for metalinguistic uses of words or phrases. These
uses include defining the word, talking about the spelling of the word, or any other
type of reference to the word itself as a thing.
Other symbols
Apart from standard letters, do not use any symbols other than: 0-9
äâàæÆçÇéèëêïîñÑôöŒœüûùμÿÄÂÀÉÈËÊÏÎÔÖÜÛÙŸ²³,?!'"_°:.()<>{}[]√/@#$€£₹
+=%*&-.;
When two opposing teams are mentioned, include a hyphen between their
names.
Spoken punctuation
For sentence-level spoken punctuation, write out the full word or words between
curly brackets. Do not add punctuation symbols after spoken punctuation. Be
careful with homonyms. (See exceptions in the next rule.)
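As a small illustration of the curly-bracket convention, the sketch below wraps a spoken punctuation word in brackets. The function name and the example sentence are hypothetical, and the exceptions mentioned in the rule are not modeled.

```python
def mark_spoken_punctuation(words: str) -> str:
    """Wrap spoken punctuation words in curly brackets, with no symbol after."""
    return "{" + words.strip() + "}"


print("hello " + mark_spoken_punctuation("comma") + " how are you")
# hello {comma} how are you
```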
Format
Transcribe numbers, abbreviations etc. following the formatting conventions in this section.
Number
Cardinals and ordinals from 0 to 9 are written with letters (except for measures
and currency - see Currency and Unit). Use digits for cardinals and ordinals 10 and
above, even if they are coordinated with numbers under 10. Transcribe all decimal
numbers as digits.
When two or more numbers refer to the same noun, and at least one number is 10
or greater, transcribe both as numerals.
For long numbers (4+ digits) indicating quantity, use a comma to separate groups
of three digits.
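The cardinal-number rules above can be sketched in Python. Several things here are assumptions: the function names are hypothetical, English number words stand in for the words of the target language, and every integer passed in is assumed to be a plain quantity (not a measure, a currency amount, a year or a phone number). Ordinals and decimals are not covered.

```python
from typing import List

# English stand-ins for the target language's words for 0 to 9.
DIGIT_WORDS = ["zero", "one", "two", "three", "four",
               "five", "six", "seven", "eight", "nine"]


def format_cardinal(n: int, force_digits: bool = False) -> str:
    """Write 0-9 as words and larger quantities as digits with commas."""
    if 0 <= n <= 9 and not force_digits:
        return DIGIT_WORDS[n]
    return f"{n:,}"  # groups of three digits, e.g. 12500 -> "12,500"


def format_same_noun(numbers: List[int]) -> List[str]:
    """Use numerals for all numbers if any of them is 10 or greater."""
    force = any(n >= 10 for n in numbers)
    return [format_cardinal(n, force_digits=force) for n in numbers]


print(format_cardinal(7))         # seven
print(format_cardinal(12500))     # 12,500
print(format_same_noun([8, 12]))  # ['8', '12']
```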
For mixed numbers in math expressions and units & measures, transcribe them
using numerals.
When referring to items (not units or measures), write fractions out in words. With
mixed numbers, write the whole number part out in words if it is under ten,
otherwise write it with numerals.
For mixed numbers that represent currency amounts, always use decimals.
Transcribe percentages using numerals followed by the "%" sign. In the unlikely
case that you encounter a number of a million or greater used as a percentage,
spell it out.
Transcribe phone numbers using the most common format in the transcription
language.
Transcribe phone numbers as you would write them down in their natural groups.
Do not use hyphens between groups. When applicable, the STD code should be
surrounded by spaces.
Transcribe alpha-digit sequences (product codes, car models, etc.) in their most
natural way (there may be more than one way to transcribe). Do not transcribe
credit card numbers or any other personal information numbers.
Math expressions should be transcribed with numerals and math symbols with
spaces in between them.
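A minimal sketch of the math-expression rule, assuming the numbers have already been converted to numerals. The function name and the mapping from spoken English words to symbols are illustrative; only symbols from the permitted list in the punctuation section are used.

```python
from typing import List

# Hypothetical mapping from spoken operator words to permitted symbols.
MATH_SYMBOLS = {"plus": "+", "minus": "-", "times": "*", "equals": "="}


def format_math(tokens: List[str]) -> str:
    """Join numerals and math symbols with single spaces between them."""
    return " ".join(MATH_SYMBOLS.get(token, token) for token in tokens)


print(format_math(["2", "plus", "3", "equals", "5"]))  # 2 + 3 = 5
```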
Write specific ranges or multiple amounts of currency with numerals. If the range
is not specified, write the numbers out as spoken.
For megabytes use MB
For kilobytes use KB
For gigabytes use GB
For terabytes use TB
For millimeter use mm
For centimeter use cm
For meter use m
For kilometer use km
For miles per hour use mph
For kilometers per hour use km/h
For square kilometers or kilometers squared use km²
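The unit abbreviations listed above fit naturally into a lookup table. The sketch below is one way to hold them; the function name is hypothetical, and the single space between the numeral and the unit is an assumption that the guidelines do not state.

```python
UNIT_ABBREVIATIONS = {
    "megabytes": "MB",
    "kilobytes": "KB",
    "gigabytes": "GB",
    "terabytes": "TB",
    "millimeter": "mm",
    "centimeter": "cm",
    "meter": "m",
    "kilometer": "km",
    "miles per hour": "mph",
    "kilometers per hour": "km/h",
    "square kilometers": "km²",
    "kilometers squared": "km²",
}


def format_measure(amount: str, unit: str) -> str:
    """Combine a numeral with the abbreviation for a spoken unit name.

    Assumption: numeral and unit are separated by one space.
    """
    return f"{amount} {UNIT_ABBREVIATIONS[unit]}"


print(format_measure("5", "kilometer"))             # 5 km
print(format_measure("60", "kilometers per hour"))  # 60 km/h
```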
Write times in hh:mm format whenever possible, unless it would look unnatural to
do so.
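The hh:mm format can be sketched as below. The function name is hypothetical, and zero-padding the hour is an assumption read from the "hh" in the format; the guidelines do not say whether "5:30" or "05:30" is preferred, and the "unless it would look unnatural" exception is left to the transcriber.

```python
def format_time(hour: int, minute: int) -> str:
    """Format a clock time as hh:mm (assumed zero-padded)."""
    if not (0 <= hour <= 23 and 0 <= minute <= 59):
        raise ValueError("hour must be 0-23 and minute must be 0-59")
    return f"{hour:02d}:{minute:02d}"


print(format_time(5, 30))  # 05:30
print(format_time(17, 5))  # 17:05
```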
Address
Write out the full names of locations, roads, states, etc. Only use abbreviations
when explicitly spoken.
Web
Write URLs, email addresses, and Twitter hashtags as they are spoken and don't
capitalize them.
Do not correct speaker errors such as transcribing a slash when the user actually
says "backslash".
If the speaker drops a "w" or dots and it's an obvious URL, you should correct these
errors. If the speaker doesn't say the "w"s at all, do not add them.
Abbreviation