User Guide - Colloquial Video Annotation

I. Description Of The Collection Task
This is a spoken language annotation project.
The data must meet the following requirements:
Data Volume    300 hours
Language       Malay
II. If you encounter any of the following situations, mark the data as invalid. If you
are unsure about a piece of data, send it to the project manager in advance to confirm
whether it can be annotated.
1. Much of the speech is in a non-target language, that is, the spoken words are not Malay.
2. The speaker is obviously reading aloud or has a strong accent.
3. Background music or environmental noise is present for a long time.
4. The audio is a telephone conversation or the playback of a recording.
5. The audio is truncated or contains electrical current noise.
III. Data Annotation Requirements

Annotation enables us to collect qualified data. We will use the Shujiajia online
annotation platform for clipping and transcription.
1. Annotation Requirements:
(1) Validity Judgment:
1) If two people speak in overlapping voices with similar volumes in a sentence and the
overlapping part is significant, the speech should be annotated as invalid. If the
overlapping part is small (only one or two words) and the content of the main speaker
can be heard clearly, the speech should be transcribed normally.
2) If a part of a sentence cannot be heard clearly and the content cannot be determined,
the sentence should be considered invalid.
3) If there is strong noise in a sentence (environmental noise, equipment noise) that
makes it difficult to hear the main speaker's content, the sentence should be judged as
invalid.
4) If a sentence has missing frames (such as "Have you had dinner?" becoming "Have
you dinner?"), it should be considered invalid.
5) If the voice is not a normal human voice (such as a machine customer service,
synthetic voice, or TV broadcast voice), the sentence should be considered invalid.
6) If a sentence contains non-native language parts, it should be considered invalid.
7) If a sentence involves sensitive information (political, religious, pornographic,
violent), it should be considered invalid.
(2) Effective Speech Extraction
1) Annotators need to consider semantic coherence and extract speech in sentence
units. Long sentences can be divided into phrases, and each sentence should not
exceed 8 seconds, but it should not be too short either. Based on annotation
experience, each natural language segment should be about 5-6 seconds on average.
2) The best position for each time boundary is at the lowest point of the waveform.
3) Speech from different speakers cannot be extracted in the same sentence.
4) When extracting, try to leave 0.2-0.3 seconds of silence around the annotated speech
segment. If there is no such long silent segment, it is not necessary to force it. Try to
extract speech segments without sudden noise. To avoid sudden noise, the reserved
time before and after the speech can be shortened, but it cannot result in cut-off
speech.
5) Even a one-word response still needs to be extracted, and adjacent
sentences should be merged as much as possible.
6) If there is a pause in a sentence due to the speaker that lasts more than 2 seconds, it
needs to be divided into two phrases without considering the sentence meaning. If
the pause time is less than two seconds and the sentence length does not exceed 8
seconds, it can be extracted as one sentence.
7) If a speaker pauses mid-speech for no more than two seconds and there is noise
during the pause, it is not necessary to split the sentence when splitting it would
leave the extracted parts incoherent or incomplete in meaning.
(3) Speaker Attribute Selection
1) If there are two people speaking, each person should be given a separate ID, and their
gender should be selected.
(4) Content transcription: Transcribe the content according to the audio heard, and
the transcribed content must be exactly the same as the spoken speech, without
any extra, missing, or wrong words.
1) Capitalization: If a word is usually capitalized, transcribe it according to normal
writing conventions, for example: China, Microsoft.
2) Numbers: When numbers appear in the text, do not transcribe them directly into
Arabic numerals. Instead, transcribe them into the written form of that language.
For example:
Original Article                                    Transcription
I’m 15 years old.                                   I’m fifteen years old.
The last four digits of my phone number are 6543.   The last four digits of my phone number are six, five, four and three.
3) Spelling words: separate the letters with a space and capitalize each letter. For
example:
Original Article     Transcription
five thirty pm       five thirty P M
FBI                  F B I
NFC                  N F C
4) Abbreviations: When transcribing, do not use abbreviations of words. You must use
the full word spelled out phonetically. For example:
Original Article     Transcription
This is Dr. Smith    this is doctor Smith

5) Punctuation
a. Use punctuation marks according to grammar rules.
b. Punctuation spoken by the speaker needs to be transcribed as words. For example, "@" is
transcribed as "at", ".com" is transcribed as "dot com".
c. Only commas (,), hyphens (-) (which can only appear in the middle of words), periods
(.), exclamation marks (!), single quotes ('), and question marks (?) are allowed in
the transcription. No other punctuation marks should be added, and any
added marks need to conform to the rules of grammar. All punctuation
marks should be entered in normal English input mode.
6) Modal particles: Modal particles should be transcribed accurately
according to their pronunciation and semantics.
7) Others
a. Dirty language should be transcribed normally, and do not use letters to replace
them.
b. Network buzzwords and common Internet words should be transcribed
according to common usage.
c. If there are repeated words in the speech, all of them should be transcribed.
d. If the pronunciation can be determined but the meaning is uncertain, as with
common names, a homophone can be used instead, but it is
necessary to ensure that the text matches the pronunciation. When the context is
clear, choose words that match both the pronunciation and the meaning for
annotation.
e. If a word is not finished, add "-" after it and there should be a space between it
and the next word, for example: "I want to go to s- school." Note that the end of
the sentence must be a complete word. If an unfinished word is at the end
of a sentence, it should be discarded without transcription.
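Several of the transcription rules above are mechanical enough to check automatically before submission. The following is a minimal sketch under stated assumptions, not a tool provided by the platform: the function name `check_transcript` and its messages are hypothetical. It checks only three things: no Arabic numerals, only the allowed punctuation marks, and no unfinished word at the end of a sentence. A trailing hyphen on a word inside the sentence is accepted as the unfinished-word marker from rule e.

```python
import re

# Punctuation the guide allows: comma, hyphen, period, exclamation mark,
# single quote, question mark (all typed in English input mode).
ALLOWED_PUNCTUATION = set(",-.!'?")


def check_transcript(text):
    """Return a list of problems found in one transcribed sentence (empty if none)."""
    problems = []

    # Numbers must be written out in words, never as Arabic numerals.
    if re.search(r"\d", text):
        problems.append("contains Arabic numerals; write numbers out in words")

    # Anything that is not a letter, digit, whitespace, or allowed mark is flagged.
    # Digits are already reported above, so they are not reported again here.
    bad_chars = sorted({
        ch for ch in text
        if not (ch.isalnum() or ch.isspace() or ch in ALLOWED_PUNCTUATION)
    })
    for ch in bad_chars:
        problems.append(f"disallowed character: {ch!r}")

    tokens = text.split()
    # A hyphen may sit inside a word or mark an unfinished word ("s-"),
    # but it should never start a word.
    for token in tokens:
        if token.startswith("-"):
            problems.append(f"hyphen at the start of a word: {token!r}")

    # The sentence must end with a complete word.
    if tokens and tokens[-1].rstrip(".,!?").endswith("-"):
        problems.append("sentence ends with an unfinished word; drop that word")

    return problems


print(check_transcript("I want to go to s- school."))  # prints []
print(check_transcript("I want to go to s-"))
```

Checks like these catch format slips only. Whether the words match the audio still has to be verified by listening.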
Acceptance Criteria:
2. The accuracy rate of word annotation is 98% or higher.
3. The format of the annotation transcription symbols must be 100% accurate.
4. Non-target language content appearing in the collection is unacceptable.
5. Repeated submission of collected data (including duplication within the
same center) will result in data invalidation and no payment.
6. Data that requires more than three annotations for correction will be released
automatically and will not be eligible for payment.
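The guide does not say how the 98% word accuracy rate is calculated. As an illustration only, the sketch below assumes it is measured against a reviewed reference transcription as 1 minus the word-level edit distance divided by the number of reference words. Both the formula and the function names are assumptions, not the project's official acceptance procedure.

```python
ACCURACY_THRESHOLD = 0.98  # from the guide: word accuracy must be 98% or higher


def word_edit_distance(reference, hypothesis):
    """Levenshtein distance between two word lists."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        current = [i]
        for j, hyp_word in enumerate(hypothesis, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current.append(min(
                previous[j] + 1,         # word missing from the annotation
                current[j - 1] + 1,      # extra word in the annotation
                previous[j - 1] + cost,  # wrong word, or a match when cost is 0
            ))
        previous = current
    return previous[-1]


def word_accuracy(reference_text, annotated_text):
    """Assumed metric: 1 - (word errors / number of reference words), floored at 0."""
    reference = reference_text.split()
    annotated = annotated_text.split()
    if not reference:
        return 1.0 if not annotated else 0.0
    errors = word_edit_distance(reference, annotated)
    return max(0.0, 1.0 - errors / len(reference))


accuracy = word_accuracy("this is doctor Smith", "this is doctor Smith")
print(accuracy >= ACCURACY_THRESHOLD)  # prints True
```

Under this assumed metric, one wrong word in a four-word sentence gives 0.75, so in practice the rate would be computed over a whole batch of sentences rather than one at a time.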
