OCR Transcription Rules-French

OCR Guideline

General Requirements

OCR image annotation involves four main operations: judging whether the picture is valid,
selecting the picture attribute, drawing text frames, and assigning frame attributes with text
transcription. Frame all the French text in the picture, one line per frame.

Step 1. Valid or Invalid

Before annotating, judge whether the picture is valid.

If invalid, only mark the image as invalid.
If valid, select the picture attribute, draw the text frames, assign frame attributes, and
transcribe the text.

In the following scenarios, mark the picture as “invalid” and do not transcribe it.


1) Overexposed or backlit picture.
Example:

2) Watermarked image.
Example:
3) Picture not captured by a camera. A picture captured from a computer screen or
downloaded from the internet is “invalid”.
Example:

4) The whole picture is blurred or unidentifiable.


Example:

5) Complicated text background that makes the text hard to recognize.


Example:

6) Printed document (not written).


Example:
7) Pornographic content.
8) Picture quality: identifiable characters must make up at least 90% of the total
characters in each picture. Otherwise, mark it “invalid”.
Example:

9) If French characters make up less than 70% of the text and the blurred part exceeds 30%,
mark the picture as bad data/invalid.
Example:
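
The two quantitative rules (8 and 9) can be sketched as a small check. This is only an illustration, not part of the annotation tool: the function name `is_valid_by_quality` and its parameters are hypothetical. It assumes that the character percentages are measured against the total number of characters in the picture, and that the "blurred part" is given as a fraction between 0 and 1.

```python
def is_valid_by_quality(total_chars: int, identifiable_chars: int,
                        french_chars: int, blurred_fraction: float) -> bool:
    """Return False when rule 8 or rule 9 marks the picture invalid."""
    if total_chars <= 0:
        # Assumption: a picture without any characters has nothing to annotate.
        return False

    # Rule 8: identifiable characters must be at least 90% of all characters.
    # Integer arithmetic avoids floating-point edge cases at exactly 90%.
    if identifiable_chars * 100 < 90 * total_chars:
        return False

    # Rule 9: French below 70% AND blurred part above 30% -> bad data/invalid.
    if french_chars * 100 < 70 * total_chars and blurred_fraction > 0.30:
        return False

    return True
```

The other invalid scenarios (watermarks, screen captures, printed documents, and so on) need a human judgment and are not covered by this sketch.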

Step 2. Select picture attribute

Classify each picture into one scene. For pictures that do not fit any scene, select
“other”.
Picture scene types:
1) slogan
2) Business card/menu = Business-menu
3) map
4) Store
5) Receipt slip/invoice = list
6) board
7) advertisement
8) packaging
9) written
10) other
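
For scripts that post-process annotations, the scene list above can be held in an enumeration. This is a minimal sketch: the class `Scene` and the helper `scene_from_label` are hypothetical names, and the label strings simply copy the spellings used in the list.

```python
from enum import Enum


class Scene(Enum):
    SLOGAN = "slogan"
    BUSINESS_MENU = "Business-menu"   # business card / menu
    MAP = "map"
    STORE = "Store"
    LIST = "list"                     # receipt slip / invoice
    BOARD = "board"
    ADVERTISEMENT = "advertisement"
    PACKAGING = "packaging"
    WRITTEN = "written"
    OTHER = "other"


def scene_from_label(label: str) -> Scene:
    """Map a label string to a Scene, falling back to OTHER when nothing matches."""
    normalized = label.strip().lower()
    for scene in Scene:
        if scene.value.lower() == normalized:
            return scene
    return Scene.OTHER
```

Falling back to `Scene.OTHER` mirrors the rule that pictures fitting no scene are labelled “other”.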

Step 3. Annotation Rules

Frame & transcribe

1) One line, one frame. Frame only one line at a time and transcribe the characters exactly
as they appear in the picture. (Exception: only special formulas may span more than one line.)

Incorrect Example:

(As the picture shows, framing two lines at the same time is wrong. If there are characters
above or below a line, such as inserted characters or phonetic notation, frame them
separately.)

Correct example:
Example 3: Include formulas within one frame (never separate them).

(When the picture contains complicated formulas, draw one frame and do not transcribe
the content.)

2) Frame accuracy requirements: draw the frame close to the characters but NEVER on them.

(As the picture shows, a frame drawn too far from the characters is wrong.)

(As the picture shows, a frame drawn on the characters is wrong.)
Correct Example:

3) Mark the upper and lower boundary points of the box:


The upper boundary point (red point) and the lower boundary point (green point) are shown in
the picture below.

Horizontal text: the point (red) at the top-left corner of the frame is the upper boundary
point, and the point (green) at the bottom-right corner is the lower boundary point.

Vertical text: the point (red) at the top-right corner is the upper boundary point, and the
point (green) at the bottom-left corner is the lower boundary point.

Rectangle frame: the upper and lower boundary points are generated automatically by default
when framing.
(Horizontal rectangular frame: the upper and lower boundary points are generated
automatically after the frame is drawn.
Vertical rectangular frame: you need to change the upper and lower boundary points after
drawing the frame.)
Polygon frame: add the upper and lower boundary points manually by right-clicking to select
each of them.
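
The corner convention for rectangular frames can be written down as a short function. This is a sketch under stated assumptions: `Rect` and `boundary_points` are hypothetical names, and coordinates are assumed to be image coordinates in which y grows downward.

```python
from typing import NamedTuple, Tuple

Point = Tuple[float, float]


class Rect(NamedTuple):
    left: float
    top: float
    right: float
    bottom: float  # image coordinates: y grows downward, so bottom > top


def boundary_points(rect: Rect, vertical: bool) -> Tuple[Point, Point]:
    """Return (upper, lower) boundary points for a rectangular frame."""
    if vertical:
        upper = (rect.right, rect.top)     # top-right corner (red point)
        lower = (rect.left, rect.bottom)   # bottom-left corner (green point)
    else:
        upper = (rect.left, rect.top)      # top-left corner (red point)
        lower = (rect.right, rect.bottom)  # bottom-right corner (green point)
    return upper, lower
```

For example, `boundary_points(Rect(10, 20, 110, 50), vertical=False)` gives the top-left corner as the upper point and the bottom-right corner as the lower point. Polygon frames are not covered, since their points are chosen by hand.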

Attributes

Each frame must be given exactly one attribute tag.
Judge “horizontal” or “vertical” according to the actual text.

French-horizontal (or French-vertical):

“French” indicates the language of the characters, and “horizontal” or “vertical” refers to
the direction of the text layout. Text with this attribute needs to be transcribed.

Smear Attribute

Smear-horizontal (or smear-vertical)


Choose the smear attribute when the French characters are incomplete, truncated, or occluded,
or when the text is incomplete due to reflection, etc. “Horizontal” and “vertical” refer to
the text layout. The text needs to be transcribed.

Smeared French needs to be transcribed. For the attribute, select smear-horizontal /
smear-vertical.
Smeared English does not need to be transcribed. The attribute is en-smear-horizontal /
en-smear-vertical.

Example: text is truncated.

Example: reflection

If truncated, occluded, or reflective characters can be inferred from the semantics and the
shape of the remaining characters, transcribe them.
If some truncated, occluded, or reflective characters in a frame cannot be inferred from the
rest of the characters and the semantics, replace the unclear characters with <ERR>. One
<ERR> can represent multiple characters as needed.
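
As an illustration of the <ERR> rule, the sketch below builds a transcription from per-character results. The function name `transcribe_with_err` and the convention of marking an unreadable character with `None` are assumptions made for this example only.

```python
from typing import List, Optional

ERR = "<ERR>"


def transcribe_with_err(chars: List[Optional[str]]) -> str:
    """Join readable characters; each run of unreadable ones (None) becomes one <ERR>."""
    out: List[str] = []
    in_unreadable_run = False
    for ch in chars:
        if ch is None:
            if not in_unreadable_run:
                out.append(ERR)        # one <ERR> stands for the whole unclear run
                in_unreadable_run = True
        else:
            out.append(ch)
            in_unreadable_run = False
    return "".join(out)
```

With the input `list("bon") + [None, None] + list("ur")`, the two unreadable characters collapse into a single <ERR> between the readable parts.
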
English-horizontal/English-vertical

If English characters or numbers (such as 123) appear in the picture, draw the box and select
English-horizontal (or English-vertical). There is no need to transcribe the text.

en-smear-horizontal/en-smear-vertical

When English is smeared, draw the box but do not transcribe. The attribute is
en-smear-horizontal / en-smear-vertical.

Not-care

When a whole line of text cannot be recognized, or a whole line consists of phonetic symbols,
select the “Blurry & Phonetic symbols” attribute. Frame that line but do not transcribe it.
Example:
Other-language

For languages other than French and English, select the other-language attribute and do not
transcribe.
Example:

Formula

When there are complex physics, chemistry, or mathematics formulas that are difficult to
transcribe, frame the text, select the formula attribute, and do not transcribe.
Example:

Apostrophe

This attribute is for runs of multiple dots (“multi-point”). Just draw a frame and do not
transcribe.

Example:
Frame the dots and mark them “apostrophe”. For the number “3”, frame it and mark it
“English-horizontal”, as “3” is a number.
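
The attribute rules in this section reduce to one question per frame: does the text need to be transcribed? A minimal sketch of that lookup follows. The dictionary `NEEDS_TRANSCRIPTION` and the function `check_frame` are hypothetical, and the label spellings are normalized for the example, so the tool's exact strings may differ.

```python
# Attribute label -> whether the framed text must be transcribed.
# Assumption: label spellings are normalized here; the tool may use other strings.
NEEDS_TRANSCRIPTION = {
    "French-horizontal": True,
    "French-vertical": True,
    "smear-horizontal": True,
    "smear-vertical": True,
    "English-horizontal": False,
    "English-vertical": False,
    "en-smear-horizontal": False,
    "en-smear-vertical": False,
    "not-care": False,           # shown as "Blurry & Phonetic symbols"
    "other-language": False,
    "formula": False,
    "apostrophe": False,
}


def check_frame(attribute: str, transcription: str) -> bool:
    """Return True when the transcription is consistent with the frame's attribute."""
    if attribute not in NEEDS_TRANSCRIPTION:
        return False  # every frame needs exactly one known attribute
    has_text = bool(transcription.strip())
    return has_text == NEEDS_TRANSCRIPTION[attribute]
```

A check like this can catch frames that were transcribed under a no-transcription attribute, or left empty under French or smear attributes.
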
Other requirements:

Blurred text

1) The whole picture is blurred: mark it as bad data/invalid and do not transcribe.


Example:

2) If part of the text in one frame is blurred: replace characters that are too blurry to be
inferred from the semantics with <ERR>, and transcribe the remaining clearly recognizable
characters normally. The attribute should be French-horizontal (or French-vertical).
Example:

3) If a whole line of text is unclear and cannot be transcribed, select the “Blurry &
Phonetic symbols” attribute and do not transcribe.
Example:

Incomplete text

1) Text that is truncated, blocked by leaves, wires, or other obstacles, obscured by
reflections, or missing strokes or characters is called incomplete text.
2) No matter how much of the text is truncated, you must draw a frame, select the smear
attribute, and transcribe. Use <ERR> to replace characters you cannot identify.
Example:

Non-French characters among French characters

1) If the non-French characters cannot be transcribed (e.g. emojis/emoticons): frame and
transcribe only the French text, and do not frame characters that cannot be transcribed.
Example:

2) If one line of text contains multiple languages and French makes up the larger proportion,
frame the languages separately: transcribe the French text normally, and frame the other
languages with the “other-language” attribute without transcribing their content.
If French makes up a small proportion, frame the whole line, select the “other-language”
attribute, and do not transcribe the content.
Example:

3) If French and English characters are on the same line and the space between the two
languages is less than 2 characters, draw one box, select the French-horizontal attribute,
and transcribe all characters. Do not draw separate boxes.

Table image

A picture in the form of a table, such as an ingredients list, must be framed and transcribed
according to the table cells and must also follow the “one line, one frame” rule:

Space problem between words

If the space between characters exceeds two character widths, split the text into two
frames.
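
The spacing rule can be sketched as a grouping step over the words of one line. The function name `split_into_frames` and the input format are assumptions for this example: each word comes with the gap that precedes it, measured in character widths.

```python
from typing import List, Tuple


def split_into_frames(words: List[Tuple[str, float]], max_gap: float = 2.0) -> List[str]:
    """Group the words of one line into frames.

    Each item is (word, gap_before), where gap_before is the space before the
    word in character widths (ignored for the first word).
    """
    frames: List[List[str]] = []
    for index, (word, gap_before) in enumerate(words):
        if index == 0 or gap_before > max_gap:
            frames.append([word])      # gap exceeds two characters: new frame
        else:
            frames[-1].append(word)    # small gap: stays in the current frame
    return [" ".join(frame) for frame in frames]
```

A gap of exactly two characters keeps the words in one frame here, because the rule only splits when the gap exceeds two characters.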

Space problem in one word

If there is a relatively large gap between letters or characters in one word, transcribe
normally without adding spaces between letters in the same word.
Example:

should be transcribed as:

should be transcribed as:

language symbol

Just transcribe the letters under the symbol, as in the following example:

should be transcribed as:

Tone symbol

If the tone symbol affects the meaning of the word, transcribe it. If it does not affect the
meaning of the word, do not transcribe it.

Special example

1) When drawing a frame, keep it close to the words without touching them. Different
frames may overlap and cross.
Example:

2) Use a polygonal frame to select the required text on an arc-shaped stamp as a whole, and
manually mark the upper and lower boundary points accordingly.
3) Reflected words
a. Clear reflected words (readable or recognizable) should be framed and marked with the
“Blurry & Phonetic symbols” attribute.

b. Reflected words that are not clearly visible can be ignored, and no frames are needed.

4) Special symbol:
a. The bullet before the word:
If the bullet can be transcribed, include it in a frame. For special symbols that cannot be
transcribed, frame only the text part.
The yellow frame is the correct way to frame. (If the distance between the symbol and the
first character is within two characters, frame them together.)
b. Graphic symbols

Graphical symbols that can be transcribed require a frame, such as “w” in the word graphics.

c. Special symbols that cannot be typed on the keyboard do not need to be framed:

Example: don’t transcribe the symbol marked in red.

5) Underline
a. An underline with no text before or after it (underline only): ignore it and draw no frame.
b. If there is text before or after the underline and no text above the underline, transcribe
just one “_” no matter how long the underline is.
Example:

If there is text on the underline, frame only the text and ignore the underline. (The spacing
rule mentioned earlier still applies.)
Example:
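
Underline rules a and b can be sketched as a clean-up step on a draft transcription. This assumes the draft represents each underline as a run of "_" characters, and `normalize_underlines` is a hypothetical helper name.

```python
import re


def normalize_underlines(text: str) -> str:
    """Collapse each run of underscores into one "_"; drop underline-only text."""
    collapsed = re.sub(r"_+", "_", text)
    if collapsed.strip(" _") == "":
        return ""   # rule a: underline only, nothing to frame or transcribe
    return collapsed  # rule b: one "_" per underline, however long it is
```

For instance, a draft such as `"Nom: ________"` becomes `"Nom: _"`. The case where text sits on the underline needs no clean-up, since only the text is framed.
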
7) If a superscript or subscript is on the same horizontal line as the text, frame them
together as one line. There is no need for a separate frame.
Example:
