
Shujiajia Annotation Cum Transcription & QA

Transcription is the process of converting audio to text.

Project Description

This is an audio annotation project: you will listen to audio and write exactly what you hear in the
target language. The work is done online, so you need a laptop or computer with a decent internet
connection.

You should know the target language very well, and it should be your native language.

You will receive the full audio, which you need to cut into segments (small valid portions), and then
write what you hear. You also need to identify each speaker and mark their gender.

The work is simple and doable with a little dedication, and you can earn a good amount by the end of
every month. The project is large, and we are looking for teams who can work over the long run.
Segments or Segmentation

You need to create segments according to the guidelines below.
Segments are created by clicking and dragging the mouse up to the point where you want the segment
to end.

Segments can be of two types: 1. Valid 2. Invalid

The sections below explain how to judge valid and invalid segments.

____________ ________________________________________________________________

Judgement of Valid Portion of audio segments

Valid Portion:

1) Intercept (cut out) speech one sentence at a time and keep semantic coherence. Each segment
should not exceed 8 seconds, but it should not be too short either. In practice, segments average
5-6 seconds. Sentences that are too long can be cut into several clauses or parts.

2) The best position to intercept sentences is at the lowest point of the waveform. If only a few
words cannot be included in a segment, it is recommended to discard those words.

3) If there are two speakers, intercept their speech separately; the voices of two speakers
cannot appear in the same segment.
This means 1 segment, only 1 speaker.

4) Try to leave a 0.2-0.3s silent section before and after each intercepted speech. If there is not
enough silence available, this is not required. Silent sections that are too long or too short are
not acceptable.

If possible, intercept segments without sudden noises. You can shorten the silent part to avoid
sudden noises (such as ambient noises or oral noises like coughing, slurping, etc.), but make sure
the word is finished.

5) A single word that only indicates a response does not need to be intercepted as its own
segment. If it can be merged with the following sentence as one segment, merge it; if not,
mark it as invalid.

6) If the speaker pauses mid-sentence for more than 2s, cut the sentence into two segments,
regardless of semantic coherence; if the pause is less than 2s and the whole sentence does not
exceed 8s, intercept it as one segment.
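The timing rules above (one speaker per segment, a cut at any pause longer than 2s, and an 8-second ceiling) can be sketched in code. This is only an illustration under stated assumptions: the `Chunk` type, the speaker ids, and the idea that speech has already been split into timed chunks are hypothetical and not part of the annotation tool. Semantic coherence still has to be judged by the annotator, and a pause of exactly 2s, which the guideline does not cover, is treated here as short enough to merge.

```python
from dataclasses import dataclass

MAX_SEGMENT_S = 8.0  # guideline: a segment should not exceed 8 seconds
MAX_PAUSE_S = 2.0    # guideline: a pause longer than 2 s forces a cut


@dataclass
class Chunk:
    """A stretch of continuous speech (hypothetical input format)."""
    speaker: str  # hypothetical speaker id, e.g. "O1" or "O2"
    start: float  # seconds from the start of the audio
    end: float    # seconds from the start of the audio


def group_chunks(chunks):
    """Greedily group consecutive chunks into segments (lists of chunks).

    A chunk joins the current segment only if it has the same speaker,
    follows a pause of at most MAX_PAUSE_S, and keeps the segment within
    MAX_SEGMENT_S. Otherwise it starts a new segment.
    """
    segments = []
    for chunk in chunks:
        if segments:
            current = segments[-1]
            same_speaker = current[-1].speaker == chunk.speaker
            short_pause = chunk.start - current[-1].end <= MAX_PAUSE_S
            fits = chunk.end - current[0].start <= MAX_SEGMENT_S
            if same_speaker and short_pause and fits:
                current.append(chunk)
                continue
        segments.append([chunk])
    return segments
```

For example, two chunks from the same speaker separated by half a second end up in one segment, while a change of speaker or a 2.5-second pause starts a new one.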

________ ________________________________________________________________________
Judgement of Invalid Portion of audio segments

Invalid Portion:

The following situations are judged to be unqualified, and there is no need to intercept,
transcribe or annotate unqualified sentences:

1) When two people's speech overlaps in a sentence and their voices are close in volume: if the
overlapping part is too long, mark it as invalid speech; if the overlapping part contains only
one or two words and the main speaker can be heard clearly, transcribe the content.

2) If there is an incoherent part that cannot be transcribed according to the context, the
sentence is invalid.

3) If there are strong noises, such as environmental or equipment noises, that make it
difficult to hear the main speaker, the sentence is invalid.

4) If there is frame loss in a sentence, it is invalid.

5) If the voice is not human but is synthesized, or comes from a customer service machine or a TV
broadcast, the sentence is invalid.

6) If a sentence contains large parts that are not spoken in the required language, it is invalid,
but common English words like Facebook and YouTube are acceptable.

7) If a sentence involves sensitive information (a politically or religiously sensitive topic,
pornography, violence), it is considered invalid.

(Invalid data can be left out directly: there is no need to create those segments or to mark them
as invalid.)

Speaker Identification

Different speakers should be marked with different identifiers, and the speaker's gender should be
marked for each segment.

Content transcription rules and explanation

The transcription must be consistent with the speech heard, without adding, omitting or misspelling
words.

The guidelines are as follows:

1) Number: Numbers in the speech must be transcribed as the corresponding words, not as Arabic
numerals.

2) English: If the sentence contains English words or letters, transcribe them according to the
pronunciation. Letters that are spelled out are capitalized with no spaces between them; words are
transcribed in lowercase with a space between words. For example, if the speaker pronounces
“O-P-P-O”, transcribe “OPPO”; if it is pronounced as a word, type "oppo".

3) Modal particles: Modal particles should be transcribed accurately according to
pronunciation and semantics.

4) Punctuation: Punctuation spoken aloud by the speaker must be transcribed as words.

e.g. "@" is written as "at", ".com" is written as "dot com"

Only commas ( , ), hyphens ( - ), periods ( . ), exclamation marks ( ! ), apostrophes ( ’ ) and
question marks ( ? ) are allowed in the transcription. No punctuation other than these 6 marks may
be used. All punctuation must be typed in the English input mode and comply with grammatical rules.

It’s not necessary to put a period at the end of the segment.

5) Slurred words: If speakers slur words, transcribe them according to the correct
semantics and grammar.

e.g. “del elefante” may be pronounced like “delefante”; instead of transcribing by pronunciation,
write “del elefante”.

6) Stuttering: If a word is not finished, add a hyphen (-) after it in place of the
unpronounced syllables, and leave a space between the unfinished word and the
following word.

e.g. If the speaker stutters on “estudiar”, first pronouncing "es" and then "estudiar" again,
transcribe “es- estudiar”.

Note: The end of the sentence must be a complete word. If it is not, discard the unfinished
word.

7) Other things
· Swear words should be transcribed in full; do not use abbreviations.

· Proper nouns should be written in their standard form.

e.g. TikTok is the official name and cannot be transcribed as “tiktok”

· Internet slang should be transcribed according to common usage.

· If words are repeated in the speech, do not omit any of them.

· If a word can be heard clearly but its meaning is uncertain, as with ordinary people’s
names, homophones are acceptable as long as the spelling and pronunciation are correct. When
the context makes the meaning clear, choose the word that matches both the pronunciation and
the semantics.

· Capitalize the first letter of the sentence. If a long sentence is divided into two segments, the
first letter of the second segment does not need to be capitalized.
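Two of the rules above are mechanical enough to check automatically: numbers must be spelled out, and only six punctuation marks are allowed. The following is a minimal sketch of such a check, not part of any official tooling. The function name `find_issues` is hypothetical, accepting both the straight and the curly apostrophe is an assumption, and the check does not know about special labels such as [OVERLAP/], which would need to be stripped first.

```python
import unicodedata

# The six allowed marks from rule 4. Accepting both the straight (')
# and the curly (’) apostrophe is an assumption of this sketch.
ALLOWED_PUNCTUATION = set(",-.!?'’")


def find_issues(text):
    """Return a list of human-readable problems found in a transcription."""
    issues = []
    for position, ch in enumerate(text):
        if ch.isdigit():
            # Rule 1: numbers must be written as words.
            issues.append(f"digit {ch!r} at {position}: spell numbers out as words")
        elif unicodedata.category(ch)[0] in "PS" and ch not in ALLOWED_PUNCTUATION:
            # Unicode categories starting with P (punctuation) or S (symbol).
            issues.append(f"symbol {ch!r} at {position}: not an allowed punctuation mark")
    return issues
```

A transcription such as `es- estudiar, del elefante` passes, while `I have 2 cats` is flagged for the digit and `mail me @ home; now` is flagged for both the `@` and the semicolon.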
Special symbols / TAGs

In the following situations, the corresponding special label must be added during annotation.
Labels must be well-formed: avoid unpaired labels, inconsistent case and unbalanced brackets.

Update: The [OVERLAP/] [/OVERLAP] labels are typed into the transcription as usual. For the other
labels, only select the corresponding attribute; do not add them to the transcription text. They
are applied afterwards by technical batch processing.

Valid data

Label: none (no noise)
Rule: Transcribe according to what you hear.
Role: O1 or O2…
Example: I went to dinner today

Label: [N]
Rule: A sentence containing noise must be marked with [N] at the end of the sentence, but the
noise type does not need to be distinguished.
Role: O1 or O2…
Example: I went to dinner today [N]

Label: [HM]
Rule: The speaker's rap content must be marked at the end of the sentence.
Role: O1 or O2…
Example: 一人我饮酒醉[HM]

Label: [OVERLAP/] [/OVERLAP]
Rule: The voices overlap, and one of them is particularly clear. Only the voice of the person
who speaks clearly is transcribed. The role marks the speaker, and the affected text is marked
with the label.
Role: O1 or O2…
Example: I went to [OVERLAP/] dinner today [/OVERLAP]

Invalid data

Invalid voice segment of recorded human voice
Label: [IVS]
Rule: Only paragraphs with noise greater than 0.5 seconds will be marked. For example:
· The voices overlap and their volume is about the same;
· Voice frame loss;
· Voice truncation;
· Voice echo;
· The tone of voice is not normal, such as singing, speaking with the throat pinched, etc.;
· Non-target language;
· Some words in the voice segment cannot be heard clearly or cannot be transcribed due to noise.
Role: N
Example: [IVS]

Invalid voice segment of non-recorded human voice
Label: [OIVS]
Rule: Only paragraphs with noise greater than 0.5 seconds will be marked. For example:
· TV voice;
· Voice-over advertising in a broadcast-style voice;
· Music with voice; etc.
Role: N
Example: [OIVS]

Sensitive information
Label: [PIL]
Rule: The voice contains private information of the recorder: detailed address, mobile phone
number, ID number, bank card number, social security number, passport number, etc.
Role: N
Example: [PIL]
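Because labels must be paired, case-consistent and bracket-balanced, a small automatic check can catch most slips before submission. The sketch below is an illustration only: the function name `check_overlap_tags` is hypothetical, and it assumes, following the update note above, that [OVERLAP/] and [/OVERLAP] are the only labels typed into the transcription text, so any other bracketed text is reported.

```python
import re

OPEN_TAG = "[OVERLAP/]"
CLOSE_TAG = "[/OVERLAP]"

# Matches any complete bracketed label, or a stray single bracket,
# so that malformed labels are caught as well as well-formed ones.
TOKEN = re.compile(r"\[[^\[\]]*\]|[\[\]]")


def check_overlap_tags(text):
    """Return a list of problems with the overlap labels in a transcription."""
    problems = []
    inside = False  # True between an opening and a closing label
    for match in TOKEN.finditer(text):
        token = match.group()
        if token == OPEN_TAG and not inside:
            inside = True
        elif token == CLOSE_TAG and inside:
            inside = False
        elif token in (OPEN_TAG, CLOSE_TAG):
            problems.append(f"unpaired {token} at position {match.start()}")
        else:
            problems.append(
                f"unexpected label or bracket {token!r} at position {match.start()}"
            )
    if inside:
        problems.append(f"{OPEN_TAG} is never closed")
    return problems
```

The example from the guideline, `I went to [OVERLAP/] dinner today [/OVERLAP]`, produces no problems, whereas a lowercase `[overlap/]` or a missing closing label is reported.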

Quality Requirements and Accuracy

· The accuracy of the transcription should be 95% or above (100% is required for the punctuation).
· If a segment has errors of validity, transcription, etc., then the whole segment is considered
incorrect.
· Accuracy rate = 1 - (number of incorrect segments / number of all segments)
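As a worked illustration of the accuracy formula, the sketch below computes the rate for a batch and compares it with the 95% requirement. The function names are hypothetical, and exact fractions are used only so that a batch sitting exactly on the 95% boundary is not misjudged by floating-point rounding.

```python
from fractions import Fraction

TARGET = Fraction(95, 100)  # the 95% requirement stated above


def accuracy_rate(incorrect_segments, total_segments):
    """Accuracy rate = 1 - (incorrect segments / all segments), as an exact fraction."""
    if total_segments <= 0:
        raise ValueError("total_segments must be positive")
    if not 0 <= incorrect_segments <= total_segments:
        raise ValueError("incorrect_segments must be between 0 and total_segments")
    return 1 - Fraction(incorrect_segments, total_segments)


def meets_target(incorrect_segments, total_segments):
    """True when the batch reaches the required accuracy."""
    return accuracy_rate(incorrect_segments, total_segments) >= TARGET
```

With 200 segments, 10 incorrect segments give exactly 95% and pass, while 11 incorrect segments give 94.5% and fail.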

FAQ

Q: When the end of one sentence and the beginning of the following sentence are relatively close,
is it okay to omit the silent section?
A: In this case, the best position to intercept is at the lowest point of the waveform. A small
number of segments can be allowed to exceed 8 seconds to make sure the word is finished.

Q: If there is one second of breathing following the speech, can we include a little breathing
noise in order to avoid an unfinished word?
A: If the breathing noise is continuous with the speech, it’s tolerable to include a little bit.

Q: When there’s a sentence about 10 seconds long with a short noise in the middle that can be cut
off, but cutting there makes it difficult to keep semantic coherence, should we maintain semantic
coherence or ensure that the length does not exceed 8s?
A: This kind of situation should be relatively rare. Generally speaking, sentences over ten seconds
can be divided into at least two segments. If this is the case, then the priority is to keep
semantic coherence.

Q: To keep semantic coherence, there will be segments about 7-9s long with some breathing noise in
the middle that does not affect the pronunciation. In this case, should we cut off the noise and
turn it into two 3-4s segments, or keep that breathing noise?
A: Keep the breathing noise if the transcription will not be affected.
Summary

1. When there is breathing noise in a 10s long sentence, generally it should be segmented
according to semantics.

2. For breathing noise at the beginning or the end of a sentence: if it is connected to the
speech, leave only 0.05s of the noise as the silent section and cut off the rest; if there is a
silent section between the noise and the speech, however short, cut off all the noise as long as
no word is left unfinished.
e.g. The following sample should be intercepted at the lowest points of the waveform like the red frame
indicates.

3. For unavoidable sudden noises (such as ambient noises or oral noises like inhalation, exhalation,
smacking of lips, etc.) in the middle of the segment, an error rate of 20% is tolerable.

4. Labelling priority: one speaker in one segment > semantic coherence > requirements of segment length

5. The authorization at the beginning of each audio does not need to be intercepted or transcribed. The
annotator needs to mark speakers 1 and 2 according to the order in which they give the authorization.

6. Every ID needs to complete at least 1 hour of audio transcription, and the quality should be at or
above 95%.

7. Every team needs to complete both transcription and annotation, for their own team and for others.

8. Payment will be made on a monthly basis via PayPal/Payoneer/NEFT/UPI/Western Union.
