Shujiajia Audio Transcription & QA
Project Description
This is an audio annotation project: you will listen to audio and write down exactly what you hear
in the target language. The work is done online, so you need a laptop or computer with a decent
internet connection.
You should know the language very well, and it should be your native language.
You will receive the full audio. You need to cut it into segments (small valid portions) and
write what you hear. You also need to identify each speaker and mark their gender.
The work is quite simple and doable if you show a little dedication, and you can earn a good amount
at the end of every month. The project is quite large, and we are looking for a team that can work
on it for the long run.
Segments or Segmentation
You need to create segments according to the guidelines below.
Segments can be created easily by clicking and dragging the mouse to the point where you want the
segment to end.
Judgement of Valid Portion of audio segments
Valid Portion:
1) You need to intercept speech in single sentences and keep semantic coherence. The
maximum length of each segment should not exceed 8 seconds, but it should not be too short,
either. In our experience, segments average 5-6 seconds. Sentences
that are too long can be cut into several clauses or parts.
2) The best position to intercept a sentence is at the lowest point of the waveform. If only
a few words cannot be included, it is recommended to discard those words.
3) If there are two speakers, intercept their speech separately; the voices of two
speakers cannot appear in the same segment.
This means one segment contains only one speaker.
4) Try to leave a 0.2-0.3s silent section before and after each intercepted speech. If there is
not enough silence available, this is not required. Silent sections that are too long or too short
are not acceptable.
If possible, intercept segments without sudden noises. You can shorten the silent part to avoid
sudden noises (such as ambient noises or oral noises like coughing, slurping, etc.), but make sure
the word is finished;
5) If there is only one word to indicate response, it does not need to be intercepted as a single
segment. If it can be merged with the latter sentence as one segment, then merge it; if not,
mark it as invalid;
6) If the speaker pauses in the middle for more than 2s, this sentence should be cut into two
segments, regardless of semantic coherence; if the pause is less than two seconds and the
whole sentence does not exceed 8s, intercept it as one segment.
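The numeric limits in the rules above can be sketched as a small pre-submission check. This is only an illustration: the `Segment` record, its field names, and `check_segment` are hypothetical and not part of any annotation tool, and the guidelines do not say what happens when a pause is exactly 2 seconds, so the sketch flags only pauses longer than that.

```python
from dataclasses import dataclass

MAX_SEGMENT_S = 8.0  # rule 1: a segment should not exceed 8 seconds
MAX_PAUSE_S = 2.0    # rule 6: a pause of more than 2 seconds forces a split


@dataclass
class Segment:
    """Hypothetical record of one intercepted segment (times in seconds)."""
    start: float
    end: float
    speakers: tuple       # identifiers of every speaker heard in the segment
    longest_pause: float  # longest mid-sentence pause inside the segment


def check_segment(seg: Segment) -> list:
    """Return the guideline problems found in one segment (empty if none)."""
    problems = []
    if seg.end - seg.start > MAX_SEGMENT_S:
        problems.append("longer than 8 seconds")
    if len(set(seg.speakers)) != 1:
        problems.append("must contain exactly one speaker")
    if seg.longest_pause > MAX_PAUSE_S:
        problems.append("pause over 2 seconds: cut into two segments")
    return problems


ok = Segment(start=0.0, end=5.5, speakers=("speaker1",), longest_pause=0.4)
bad = Segment(start=0.0, end=9.0, speakers=("speaker1", "speaker2"), longest_pause=2.5)
print(check_segment(ok))
print(check_segment(bad))
```

A result from this check is best treated as a prompt for review rather than an automatic rejection, since the FAQ allows a small number of segments to exceed 8 seconds so that a word can be finished.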
Judgement of Invalid Portion of audio segments
Invalid Portion:
The following situations are judged to be unqualified. There is no need to intercept,
transcribe, or annotate unqualified sentences:
1) When two people's speech overlaps in a sentence and their voices are close in volume, if the
overlapping part is too long, then mark it as invalid speech; if the overlapping part contains only
one or two words and the main speaker can be heard clearly, then transcribe the content;
2) If there is an incoherent part that cannot be transcribed according to the context, the
sentence is invalid;
3) If there are strong noises like environmental noises or equipment noises that make it
difficult to hear the main speaker, the sentence is invalid;
5) If the voice is not from a human but is synthesized or comes from a customer service machine
or a TV broadcast, the sentence is invalid;
6) If a sentence contains large parts that are not spoken in the required language, it is invalid,
but common English words like Facebook and YouTube are acceptable.
(Invalid audio can simply be skipped: there is no need to create segments for it or to mark it
as invalid.)
Speaker Identification
For each segment, mark the speaker with a distinct identification and mark the speaker's gender.
Content transcription rules and explanation
The transcription must be consistent with the speech heard, without adding, omitting, or misspelling
words.
The guidelines are as follows:
1) Numbers: Numbers in the speech must be transcribed as the corresponding words,
not as Arabic numerals.
2) English: If the sentence contains English words or letters, transcribe them according to the
pronunciation. Letters that are spelled out must be capitalized, with no spaces between them;
words must be transcribed in lowercase, with a space between words. For example, if
the speaker pronounces “O-P-P-O”, transcribe “OPPO”; if it is pronounced as a word,
type “oppo”.
3) Modal particles: Modal particles should be transcribed accurately according to
pronunciation and semantics.
5) Slurred words: If the speaker slurs words, transcribe them according to the correct
semantics and grammar.
e.g. “del elefante” may be pronounced like “delefante”; instead of transcribing the pronunciation,
write it as “del elefante”.
6) Stuttering: If a word is not finished, add a hyphen (-) after it in place of the
unpronounced syllables, and leave a space between the unfinished word and the
following word.
e.g. If the speaker stutters on “estudiar”, first pronouncing an “es” sound and then “estudiar” again,
transcribe it as “es- estudiar”.
Note: The sentence must end with a complete word. If it does not, discard the unfinished
word.
7) Other things
· Swear words should be transcribed in full; do not use abbreviations.
· Proper nouns should be written in their standard form.
· If there are repeated words in the speech, do not omit any of them.
· If a word can be heard clearly but its meaning is uncertain, as with ordinary people’s
names, homophones are acceptable as long as the spelling and pronunciation are correct. Where
the context makes the meaning clear, choose words that match both the pronunciation and
the meaning.
· Capitalize the first letter of the sentence. But if a long sentence is divided into two segments, the
first letter of the second segment does not need to be capitalized.
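The English-letter and stuttering rules are mechanical enough to sketch in code. The helper names below (`format_english`, `format_stutter`, `drop_trailing_fragment`) are hypothetical, and the sketch assumes the annotator has already decided by ear whether the speaker spelled the letters out or pronounced a word.

```python
def format_english(heard: str, spelled_out: bool) -> str:
    """Rule 2: spelled-out letters become capitals with no spaces;
    words become lowercase, separated by single spaces."""
    if spelled_out:
        return "".join(ch for ch in heard if ch.isalpha()).upper()
    return " ".join(heard.lower().split())


def format_stutter(fragment: str, word: str) -> str:
    """Rule 6: a hyphen replaces the unpronounced part of the unfinished
    word, and a space separates it from the following word."""
    return f"{fragment}- {word}"


def drop_trailing_fragment(text: str) -> str:
    """Note under rule 6: a sentence must end with a complete word,
    so unfinished words (ending in '-') at the end are discarded."""
    words = text.split()
    while words and words[-1].endswith("-"):
        words.pop()
    return " ".join(words)


print(format_english("O-P-P-O", spelled_out=True))   # OPPO
print(format_english("Oppo", spelled_out=False))     # oppo
print(format_stutter("es", "estudiar"))              # es- estudiar
print(drop_trailing_fragment("quiero es-"))          # quiero
```

These helpers only apply the formatting; the judgement calls in the rules (context, homophones, correct semantics for slurred words) still rest with the annotator.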
Special symbols / TAGs
If any of the following situations occur during marking, add the corresponding special label.
Labels must be well-formed: avoid missing halves of a label pair, inconsistent case, and unbalanced
brackets.
Update: The [OVERLAP/] [/OVERLAP] label is typed into the text as normal. For other labels, only
select the corresponding attribute; do not add them to the transcription text. They are handled in
subsequent technical batch processing.
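The three label errors named above (missing pairs, inconsistent case, unequal brackets) can be checked automatically. The following is a minimal sketch, assuming that `[OVERLAP/]` opens an overlapping span and `[/OVERLAP]` closes it, which is the order they appear in the guideline; the function name `tag_problems` is hypothetical.

```python
import re

OPEN_TAG = "[OVERLAP/]"   # assumed to open an overlapping span
CLOSE_TAG = "[/OVERLAP]"  # assumed to close it


def tag_problems(text: str) -> list:
    """Return the label problems found in one transcribed segment."""
    problems = []
    if text.count("[") != text.count("]"):
        problems.append("unequal brackets")
    inside = False  # True while an opened label is waiting for its close
    for tag in re.findall(r"\[[^\[\]]*\]", text):
        if tag not in (OPEN_TAG, CLOSE_TAG):
            same_letters = tag.upper() in (OPEN_TAG, CLOSE_TAG)
            kind = "inconsistent case" if same_letters else "unknown label"
            problems.append(f"{kind}: {tag}")
        elif (tag == OPEN_TAG) == inside:
            # an open inside an open span, or a close with nothing open
            problems.append(f"unpaired label: {tag}")
        else:
            inside = not inside
    if inside:
        problems.append("missing closing label")
    return problems


print(tag_problems("hola [OVERLAP/] si [/OVERLAP] claro"))
print(tag_problems("hola [OVERLAP/] si"))
```

Since only the OVERLAP label is typed into the text, any other bracketed token is reported as unknown here; that choice is an assumption of this sketch rather than a stated rule.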
Quality Requirements and Accuracy
The accuracy of the transcription should be 95% or above (100% is required for
the punctuation).
If a segment has errors of validity, transcription, etc., it is considered
incorrect.
Accuracy rate = 1 - (number of incorrect segments / number of all segments)
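As a worked example of the accuracy formula, the sketch below computes the rate and compares it with the 95% requirement. The function names are illustrative only, and exact fractions are used so that a result sitting exactly on the 95% boundary is not affected by floating-point rounding.

```python
from fractions import Fraction

REQUIRED_ACCURACY = Fraction(95, 100)  # "95% or above"


def accuracy_rate(incorrect_segments: int, all_segments: int) -> Fraction:
    """Accuracy rate = 1 - (incorrect segments / all segments)."""
    if all_segments <= 0:
        raise ValueError("at least one segment is required")
    if not 0 <= incorrect_segments <= all_segments:
        raise ValueError("incorrect segments must be between 0 and all segments")
    return 1 - Fraction(incorrect_segments, all_segments)


def meets_requirement(incorrect_segments: int, all_segments: int) -> bool:
    """True when the accuracy rate is at or above the required 95%."""
    return accuracy_rate(incorrect_segments, all_segments) >= REQUIRED_ACCURACY


# 4 incorrect segments out of 100 gives 1 - 4/100 = 96%, which passes;
# 6 incorrect out of 100 gives 94%, which does not.
print(float(accuracy_rate(4, 100)), meets_requirement(4, 100))
print(float(accuracy_rate(6, 100)), meets_requirement(6, 100))
```

Because a segment with any validity or transcription error counts as wholly incorrect, a batch of 100 segments can contain at most 5 flawed segments and still reach 95%.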
FAQ
Q: When the end of one sentence and the beginning of the following sentence are relatively close,
is it okay to omit the silent section?
A: In this case, the best position to intercept is at the lowest point of the waveform. A small
number of segments can be allowed to exceed 8 seconds to make sure the word is finished.
Q: If there is one second of breathing following the speech, can we include a little breathing
noise in order to avoid an unfinished word?
A: If the breathing noise is continuous with the speech, it is tolerable to include a little of it.
1. When there is breathing noise in a 10s long sentence, it should generally be segmented
according to semantics.
2. For breathing noise at the beginning or the end of a sentence: if it is connected to the
speech, leave only 0.05s of the noise as the silent section and cut off the rest; if there is a silent
section between the noise and the speech, even a very short one, cut off all the noise as long as the
word remains complete.
e.g. The following sample should be intercepted at the lowest points of the waveform like the red frame
indicates.
3. For unavoidable sudden noises (such as ambient noises or oral noises like inhalation, exhalation,
lip smacking, etc.) in the middle of a segment, an error rate of 20% is tolerable.
4. Labelling priority: one speaker in one segment > semantic coherence > requirements of segment length
5. The authorization at the beginning of each audio does not need to be intercepted or transcribed. The
annotator needs to mark speakers 1 and 2 according to the order in which they give the authorization.
6. Every ID needs to complete at least 1 hour of audio transcription, and the quality should be at or
above 95%.
7. Every team needs to complete both Transcription and Annotation, for their own team and for others.