ro-RO TEST SET PDF
https://speech.google.com/annotation/guidelines/ro_ro_test_set/index.html 1/20
10/4/2017 ro-RO_TEST_SET
Transcription quality
Comply with the standard rules of the writing system.
Typo
Avoid making any typographical errors. Carefully check your work before marking
items as "complete".
Caută-mă pe Facebook.
NOT: Caută-mă pe Facebok.
Caută pe Google.
NOT: Caută pe Gogle.
Context error
A context error occurs when a real word is used incorrectly or when the incorrect
form of a word is used. This includes homophones and punctuation, among other
things.
El ia autobuzul.
NOT: El i-a autobuzul.
Do not transcribe words that are not spoken, even if they are obviously intended by
the speaker. Avoid putting words in the speaker's mouth. However, do transcribe
implied times and units of currency.
Transcribe all words spoken, even if they are not intended by the speaker. For
interjections and non-speech vocalizations, refer to Agreed Spelling > Interjections
YouTube YouTube
YouTube
Substitution
Spacing
For most types of punctuation, do not put a space between the preceding word and
the punctuation.
Vii?
NOT: Vii ?
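The spacing rule above lends itself to a quick automated check. The following is a minimal sketch, not part of the guidelines: the function name and the exact set of punctuation marks are assumptions, since the rule only says "most types of punctuation".

```python
import re

# Assumption: these marks stand in for the "most types of punctuation"
# that the rule refers to; the guidelines do not enumerate them.
_SPACE_BEFORE_PUNCT = re.compile(r"\s+[?!.,;:]")


def has_space_before_punctuation(text: str) -> bool:
    """Return True if any of the listed marks is preceded by whitespace."""
    return _SPACE_BEFORE_PUNCT.search(text) is not None


print(has_space_before_punctuation("Vii?"))   # False
print(has_space_before_punctuation("Vii ?"))  # True
```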
Punctuation
Follow the punctuation regulations of your locale. Additional conventions are outlined in this
section.
Add punctuation where needed, but err on the side of keeping it minimal.
La naiba. interjection
Bună. greeting
Noroc prietenului meu cel mai bun. Entire phrase is being used as an interjection.
Capitalize sentence fragments that sound like the beginning of a sentence. Add end
punctuation to sentence fragments that sound like the end of a sentence. For
fragments that do not clearly sound like the beginning or end of a sentence, leave out
capitalization and punctuation. Note that sentence fragments may be a result of cut-
off audio samples.
If an utterance is not clearly a sentence according to the above rules and examples,
do not capitalize or punctuate it as a sentence.
Commas
Only use commas where required. Err on the side of minimal punctuation. Do not
rely on intonation.
Use a comma when a sentence starts with a discourse word, interjection, or yes/no
word. However: If there is a long pause between a discourse word, interjection, or
yes/no word and a full sentence that follows it, treat that initial word as a separate
sentence.
Bine. Asta e foarte plăcut.
Substantial pause after "bine".
Ok Google
Intonation marks
Capitalize and punctuate the following as questions: 1) All queries syntactically built
as questions, regardless of intonation. 2) All queries which sound like they are being
used as questions, regardless of sentence structure.
vremea în Tucson
Query uses rising intonation, but is most likely a web search rather than a true question.
Use a colon between reported speech verbs and direct quotations. When the
quotation is a full sentence, it should be capitalized.
Prietenul meu a spus: „aligator crocodil”.
NOT: Prietenul meu a spus, „aligator crocodil”.
NOT: Prietenul meu a spus „aligator crocodil.”
NOT: Prietenul meu a spus „aligator crocodil”.
The word "spune" is the most common reported speech verb in Romanian, but other words ("cere", "răspunde") can be used for reported speech.
Spune „onomatopee”.
NOT: Spune: „onomatopee”.
Omit the colon if the verb is in the imperative.
When the sentence starts with the quotation, use a comma between the quotation
and reported speech verbs.
„Ana știe ce îmi place”, se miră Andrei.
NOT: „Ana știe ce îmi place” se miră Andrei.
Use a comma between reported speech verbs and direct quotation.
When the quotation qualifies as a sentence, question marks and exclamation marks
should be placed inside the quotation marks. Periods, on the other hand, should be
kept outside the quotes.
Use a colon but no quotation marks in quotative voice actions when the quote
follows the command. Use quotation marks when the quote is in the middle of the
sentence.
Tradu „Care este numele tău?” în franceză.
The quote is in the middle of a sentence, so use quotation marks and omit the colon.
Other symbols
Apart from the Latin letters a through z, you should not use any other symbol than:
0-9 âàăäéèëêîșțüùÂÀĂÄÉÈËÊÎȘȚÜÙ²³,?!~^\'"„”_°:.()<>{}[]√/@#$€£+=%*&-.;
The characters s-cedilla (ş) and t-cedilla (ţ) should not be used. The authorized
characters are s-comma (ș) and t-comma (ț).
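The character rules above can be sketched as a small normalizer and validator. This is an illustration rather than an official tool: the function names are hypothetical, and the sketch assumes that uppercase A through Z and the space character are allowed alongside the listed symbols.

```python
import string

# Disallowed cedilla forms mapped to the authorized comma-below forms.
CEDILLA_TO_COMMA = str.maketrans({
    "\u015f": "\u0219",  # s-cedilla -> s-comma
    "\u0163": "\u021b",  # t-cedilla -> t-comma
    "\u015e": "\u0218",  # S-cedilla -> S-comma
    "\u0162": "\u021a",  # T-cedilla -> T-comma
})

# Assumption: uppercase A-Z and the space character are allowed too.
ALLOWED = set(
    string.ascii_letters + string.digits + " "
    + "âàăäéèëêî\u0219\u021büùÂÀĂÄÉÈËÊÎ\u0218\u021aÜÙ²³"
    + ",?!~^\\'\"„”_°:.()<>{}[]√/@#$€£+=%*&-;"
)


def normalize_cedillas(text: str) -> str:
    """Replace s/t-cedilla with the authorized s/t-comma characters."""
    return text.translate(CEDILLA_TO_COMMA)


def find_disallowed(text: str) -> list:
    """List every character in text that is outside the allowed set."""
    return [ch for ch in text if ch not in ALLOWED]


print(find_disallowed("200 ¥"))  # ['¥']
```

Running `normalize_cedillas` before `find_disallowed` means a transcript is only flagged for characters that cannot be repaired automatically.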
Spoken punctuation
For sentence-level spoken punctuation, write out the full word or words between
curly brackets. Do not add punctuation symbols after spoken punctuation. Be careful
with homonyms. (See exceptions in the next rule.)
Don't spell out internal punctuation like hyphens in web pages, email addresses,
addresses, phone numbers, or other word-level punctuation.
E actriță/model. "e actriță slash model"
NOT: E actriță {slash} model.
NOT: E actriță slash model.
{punct}
Treat spoken punctuation as you would regular symbols, and capitalize the following
sentence as normal.
Format
Transcribe numbers, abbreviations etc. following the formatting conventions in this section.
Number
Cardinals and ordinals from 0 to 9 are written with letters (except for measures and
currency - see Currency and Unit). Use digits for cardinals and ordinals 10 and above,
even if they are coordinated with numbers under 10. Transcribe all decimal numbers
as digits.
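As a rough illustration of the cardinal rule above, the sketch below writes 0 to 9 as words and everything else as digits. The function name and the `with_unit` flag are hypothetical, and the word list uses masculine citation forms only, so gender agreement (for example "două") is deliberately left out.

```python
# Assumption: masculine citation forms; gender agreement is not handled.
SMALL_NUMBER_WORDS = [
    "zero", "unu", "doi", "trei", "patru",
    "cinci", "șase", "șapte", "opt", "nouă",
]


def format_cardinal(n: int, with_unit: bool = False) -> str:
    """Words for 0-9, digits for 10 and above.

    with_unit=True models the exception for measures and currency,
    where digits are used regardless of the value.
    """
    if with_unit or n < 0 or n >= 10:
        return str(n)
    return SMALL_NUMBER_WORDS[n]


print(format_cardinal(3))                  # trei
print(format_cardinal(12))                 # 12
print(format_cardinal(8, with_unit=True))  # 8
```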
In math expressions or units & measures, transcribe fraction words using numerals
and slashes.
Au nevoie de 1/4 de kg de zahăr.
"au nevoie de un sfert de kilogram de zahăr"
NOT: Au nevoie de ¼ de kg de zahăr.
NOT: Au nevoie de 1 / 4 de kg de zahăr.
Here, the "un" before "sfert" is part of the fraction, so don't include it in the transcription. Also, be careful not to include spaces or pre-combined fraction characters.
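The two NOT cases in the fraction example (pre-combined characters and spaces around the slash) can be detected mechanically. The following is a sketch under assumptions: the function name is hypothetical, and "pre-combined fraction characters" is taken to mean ¼, ½, ¾ and the Unicode vulgar fractions U+2150 through U+215E.

```python
import re

# Assumption: "pre-combined fraction characters" means these code points.
PRECOMPOSED_FRACTIONS = set("\u00bc\u00bd\u00be") | {
    chr(code) for code in range(0x2150, 0x215F)
}

# A slash between digits with whitespace on at least one side.
SPACED_FRACTION = re.compile(r"\d\s+/\s*\d|\d\s*/\s+\d")


def fraction_problems(text: str) -> list:
    """Return a description of each fraction formatting problem found."""
    problems = []
    if any(ch in PRECOMPOSED_FRACTIONS for ch in text):
        problems.append("precomposed fraction character")
    if SPACED_FRACTION.search(text):
        problems.append("spaces around slash")
    return problems


print(fraction_problems("Au nevoie de 1/4 de kg de zahăr."))  # []
```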
NOT: În trei sferturi de milă, fă dreapta.
For mixed numbers that represent currency amounts, always use decimals.
Poți să îmi împrumuți 2,50 $? "poți să îmi împrumuți doi dolari jumate"
A cumpărat casa de pe plajă pentru 7,5 miliarde $. "a cumpărat casa de pe plajă pentru șapte milioane jumate de dolari"
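For plain amounts such as "2,50 $", the decimal rule can be sketched as follows. The function name is hypothetical, and the sketch assumes two decimal places for non-whole amounts, a decimal comma, and the symbol placed after the number with a space, as in the examples. It does not handle amounts combined with a magnitude word such as "miliarde".

```python
from decimal import Decimal


def format_currency(amount: Decimal, symbol: str = "$") -> str:
    """Format an amount with a decimal comma and a trailing symbol.

    Assumption: whole amounts carry no decimals; all other amounts
    are shown with exactly two decimal places.
    """
    if amount == amount.to_integral_value():
        number = str(int(amount))
    else:
        number = f"{amount:.2f}".replace(".", ",")
    return f"{number} {symbol}"


print(format_currency(Decimal("2.5")))  # 2,50 $
print(format_currency(Decimal("10")))   # 10 $
```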
Transcribe percentages using numerals and the % sign. (In the unlikely case that you
encounter a number of a million or greater used as a percentage, spell it out.)
1 milion la sută
Transcribe phone numbers using the most common format in the transcription
language.
+40 21 335 44 55
"plus patruzeci doi unu trei trei cinci patru patru cinci cinci"
landline with country code (the leading "0" is removed)
021 321 11 11
"zero doi unu trei doi unu unu unu unu unu"
landline with two-digit area code preceded by the leading "0"
If it really sounds like a math expression, then transcribe it with numbers and
symbols, with spaces in between.
Cât înseamnă 8 ore * 12 $? "cât înseamnă opt ore ori doisprezece dolari"
Cât înseamnă trei crocodili împărțiți la două iguane? Does not sound like a true math expression with useful units.
10 $ "zece dolari"
For all other currencies and slang terms for money, spell out the words.
200 yeni
"două sute de yeni"
NOT: 200 ¥
Familia mea a cumpărat 10 L de suc de portocale. "familia mea a cumpărat zece litri de suc de portocale"
Write times in hh:mm format whenever possible, unless it would look unnatural to
do so.
Address
Favor full spellings over abbreviations where natural, but use abbreviations when
explicitly spoken.
Craigslist, Detroit
Web
Write URLs, email addresses, and Twitter hashtags as they are spoken and don't
capitalize them.
http://123.com "h t t p două puncte bară bară un doi trei punct com"
If the speaker drops a "w" or dots and it's an obvious URL, you should correct these
errors. If the speaker doesn't say the "w"s at all, do not add them.
Abbreviation
Capitalize and abbreviate titles for people only when they precede proper names.
Joacă la juniori.
A văzut un OZN.
Agreed spelling
Spelling conventions for words where several options are thinkable, as well as proper names.
Spelling out
If a word is spelled or obvious pauses are made between letters, spell it into letters as
it is said (often done for foreign names or businesses, for example). Use lowercase
letters for the spelled-out portion. This rule does not apply to acronyms or
initialisms, or to spelled-out web or email addresses.
Interjections
Ignore actual laughter that is included within speech. If the entire audio contains only
laughter, use the [skip] tag in PeraPera or select the appropriate reason from the
'Cannot transcribe' menu in Crowd Compute.
Proper names
Use official spelling, capitalization, and punctuation for proper names. Google them
and pay attention to the correct format. Official format and spelling of a proper name
may supersede the usual written transcription conventions detailed in this
document.
Vasile Alecsandri Romanian poet
NOT: Vasile Alexandri
will.i.am
Kristin Chenoweth The celebrity spells her name differently than the more common "Kristen".
If a personal name could have multiple spellings and context does not help choose a
spelling, use the spelling that yields the most Google search hits when you search for
the name followed by the word "name" (without quotation marks) (e.g. "Anna name").
Format proper names as they are most commonly formatted on the entity's website
(especially official documents), if available, or the Wikipedia or IMDb page. In cases of
ambiguity, defer to their privacy policy. If no other sources, use top Google hits.
Lucrează la Amazon.
Toys "R" Us
YouTube
TAROM
IBM
PayPal
The phrase "Ok Google", as well as possible derivatives such as "Ok Google Now" and
"Ok Glass", require their own particular spelling of "okay". This spelling is unique to
these cases.
Ok Google
Ok Google Now
Ok Google, dovleci
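The hotword spelling above can be enforced with a simple substitution. This is a sketch only: the function name is hypothetical, and the input variants it looks for ("ok" or "okay" in any casing) are assumptions about what a transcriber might type.

```python
import re

# Assumption: "ok"/"okay" in any casing are the variants to normalize.
_OK_HOTWORD = re.compile(
    r"\b(?:ok|okay)\s+(google(?:\s+now)?|glass)\b", re.IGNORECASE
)

_CANONICAL = {
    "google": "Ok Google",
    "google now": "Ok Google Now",
    "glass": "Ok Glass",
}


def normalize_hotword(text: str) -> str:
    """Rewrite hotword variants to the agreed 'Ok ...' spelling."""
    def _replace(match):
        key = " ".join(match.group(1).lower().split())
        return _CANONICAL[key]

    return _OK_HOTWORD.sub(_replace, text)


print(normalize_hotword("okay google, dovleci"))  # Ok Google, dovleci
```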
Media title
Refer to the Google Play Store for official spellings of media titles. For film/television,
IMDb is also available. If an utterance is ambiguous between a media title and a
sentence or web search, use your judgment for which is more likely; if truly unclear,
default to media title.
Multiple spellings
When multiple spellings are attested, use the first spelling used in the reference
dictionary for your language. If there is no entry, Google the word and use the form
with the most hits.
Transcribe onomatopoeia when clearly spoken. Otherwise, use the [skip] tag in
PeraPera or select the appropriate reason from the 'Cannot transcribe' menu in
Crowd Compute.
Sunt și eu p-acilea.
"sunt și eu p-acilea"
NOT: Sunt și eu pe aici.
Mergem s-o vedem pe bunica. Speaker said "s-o". Transcribe "să o" only when the speaker actually says two distinct words.
Nu-mi pasă. Speaker said "nu-mi". Transcribe "nu îmi" only when the speaker actually says two distinct words.
Use standard spelling for reductions that commonly occur in normal running speech,
like "lui" for "lu'" or "l-elisions" in neuter and masculine noun endings.
Am pierdut trenul.
"am pierdut trenu"
NOT: Am pierdut trenu'.
If you hear a word that does not sound like a standard word of your language, but it is
obviously based on real words, suffixes, or prefixes, transcribe as is.
Difficult utterances
Everything relating to problematic utterances (background noise, false starts, etc.) or different
language varieties.
Skipping a prompt
The instructions in this section are for PeraPera. In Crowd Compute, instead of
tagging as [skip] the utterances that cannot be transcribed, click the 'Cannot
transcribe' button and select the appropriate reason.
Skip the utterance if it: contains at least some word(s) that cannot be understood; is in
a different language typically not understood; contains no speech; contains only
laughter; contains singing; contains only synthesized speech (e.g. the voices of Google
Now or Siri) and/or pre-recorded speech (e.g. TV or radio).
Cum e vremea în Oakland?
User asks, "Cum e vremea în Oakland?" Machine responds, "Sunt 70 de grade Fahrenheit și este soare în Oakland."
If the speaker sings, [skip]. Use the tag [music] if an entire utterance is music from an
instrument, radio, TV, etc.
[skip] if audio contains only laughter. Ignore laughter that is interspersed with speech
(transcribe only the speech).
Profanity should be fully transcribed. However, feel free to skip a sentence that you
feel uncomfortable transcribing.
Complete words that have been truncated only if a very small portion of the word is
missing (one syllable or less in a multisyllable word) and it is obvious what the word
should be. In cases of ambiguity, do not transcribe the cut-off word. Do not put
punctuation at the end of truncated words.
Dinamo București
"inamo bucurești"
Initial sound "d" was cut off.
Transcribe repeated words as many times as uttered, but [skip] if a phrase is repeated
more than five times.
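A minimal sketch of the repetition check follows. It is simplified to single words repeated back to back, whereas the rule speaks of phrases, and it reads "more than five times" as more than five consecutive occurrences. Both simplifications, and the function names, are assumptions.

```python
from itertools import groupby


def max_consecutive_repeats(transcript: str) -> int:
    """Length of the longest run of the same word repeated back to back."""
    words = transcript.lower().split()
    return max((len(list(run)) for _, run in groupby(words)), default=0)


def should_skip_for_repetition(transcript: str, limit: int = 5) -> bool:
    """True when a word occurs more than `limit` times in a row.

    Assumption: `limit` counts occurrences, not additional repetitions.
    """
    return max_consecutive_repeats(transcript) > limit


print(should_skip_for_repetition("da da da da da"))     # False
print(should_skip_for_repetition("da da da da da da"))  # True
```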
Only transcribe foreground speech. A user's speech may go from the foreground to
the background or vice versa (determined by change in volume) and can be
accompanied by change in speaker audience.
If two people take turns, without overlap, and are both in the foreground at roughly
the same volume, transcribe the speech of both speakers. Separate the dialogue of
different speakers with end punctuation.
Vii și tu? Da.
First foreground speaker asked "Vii și tu?", other foreground speaker answered "Da."
If two or more people are speaking at once with no one clearly in the foreground, tag
as [overlapping]. Do this for overlaps longer than one second. Use this tag even when
one person is a bit louder than the other(s) and you can tell what they're saying.
Foreign language
If an utterance is in a foreign language, tag with [skip], unless it is an easily identifiable
media title or a foreign language phrase commonly understood in the transcription
language. Stick to the capitalization and punctuation conventions of your target
language.
Accents