OCR Transcription Rules-French
General Requirements
OCR image annotation involves four main operations: judging whether the picture is valid,
selecting the picture attribute, framing the text, and assigning an attribute to each frame
and transcribing its text. Frame all the French text in the picture. One line one frame.
2) Watermark image.
Example:
3) Picture not captured by a camera. If the picture was captured from a computer screen
or downloaded from the internet, it is “invalid”.
Example:
9) If French characters account for less than 70% and the blurred part is over 30%, mark
the picture as bad data/invalid.
Example:
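The threshold rule in item 9 can be sketched in code. This is only an illustration: the function name and the two ratio inputs are hypothetical, the rules do not say how the ratios are measured, and the "and" between the two conditions is taken literally.

```python
def is_invalid_by_ratio(french_ratio: float, blurred_ratio: float) -> bool:
    """Return True when a picture should be marked as bad data/invalid.

    french_ratio  -- share of French characters, 0.0 to 1.0 (hypothetical input)
    blurred_ratio -- share of the blurred part, 0.0 to 1.0 (hypothetical input)
    """
    # Item 9, read literally: French below 70% AND blurred part above 30%.
    return french_ratio < 0.70 and blurred_ratio > 0.30
```

For instance, `is_invalid_by_ratio(0.5, 0.4)` marks the picture invalid, while a picture with 80% French characters is not rejected by this rule even if it is partly blurred.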
Each picture is classified into a scene; for pictures that do not fit any scene, select
“other”.
Picture scene type:
1) slogan
2) Business card/menu= Business-menu
3) map
4) Store
5) Receipt slip/invoice= list
6) board
7) advertisement
8) packaging
9) written
10) other
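As a small sketch of the fallback to “other”, the scene list above could be held in a tuple. The label spellings are copied from the list; the names `SCENE_LABELS` and `scene_or_other` are hypothetical, and the real tool may use different identifiers.

```python
# Scene labels as written in the list above (the tool's real labels may differ).
SCENE_LABELS = (
    "slogan", "Business-menu", "map", "Store", "list",
    "board", "advertisement", "packaging", "written", "other",
)


def scene_or_other(candidate: str) -> str:
    """Return the candidate if it is a known scene label, otherwise 'other'."""
    return candidate if candidate in SCENE_LABELS else "other"
```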
Frame&transcribe
1) One line one frame. Frame only one line at a time and transcribe the characters exactly
as they appear in the picture. (Exception: only special formulas may span more than one line.)
Incorrect Example:
(As the picture shows, framing two lines at the same time is wrong. If there are characters
above or below a line, such as inserted characters or phonetic notation of characters,
frame them separately.)
Correct example:
Example 3: Include formulas within one frame (never separate)
(When there are complicated formulas in the picture, make one frame and do not transcribe
the content.)
2) Frame accuracy requirements: draw the frame close to the characters but NEVER on them.
(As the picture shows, a frame drawn too far from the characters is wrong.)
(As the picture shows, a frame drawn over the characters is wrong.)
Correct Example:
Horizontal text: The point (red point) at the top-left corner of the frame is the upper
boundary point, and the point (green point) at the bottom right is the lower boundary point.
Vertical text: The point (red point) at the top-right corner is the upper boundary point, and
the point (green point) at the bottom left is the lower boundary point.
Rectangle frame: the upper and lower boundary points are set automatically by default when
framing.
(Horizontal rectangular frame: the upper and lower boundary points are generated
automatically after the frame is drawn.
Vertical rectangular frame: you need to change the upper and lower boundary points after
drawing the frame.)
Polygon frame: add the upper and lower boundary points manually; right-click to select the
upper and lower boundary points respectively.
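The corner convention for boundary points can be written out as a short sketch. It assumes an axis-aligned rectangle given as (left, top, right, bottom) in image coordinates; the function name and this representation are hypothetical and are not part of the annotation tool.

```python
from typing import Tuple

Point = Tuple[float, float]


def boundary_points(left: float, top: float, right: float, bottom: float,
                    direction: str) -> Tuple[Point, Point]:
    """Return (upper boundary point, lower boundary point) for a rectangle."""
    if direction == "horizontal":
        # Upper point: top-left corner; lower point: bottom-right corner.
        return (left, top), (right, bottom)
    if direction == "vertical":
        # Upper point: top-right corner; lower point: bottom-left corner.
        return (right, top), (left, bottom)
    raise ValueError("direction must be 'horizontal' or 'vertical'")
```

The vertical branch shows why a vertical rectangular frame needs its points changed after drawing: the default points match the horizontal case.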
Attributes
Each frame must be given a corresponding Attribute Tag, one Attribute only.
“Horizontal or vertical” should be judged according to the actual text.
“French” indicates the language of the transcribed characters; “horizontal” or “vertical”
refers to the direction of the text layout. French text needs to be transcribed.
Smear Attribute
If French text is smeared, it still needs to be transcribed. For the attribute, select smear
horizontal / smear vertical.
If English text is smeared, there is no need to transcribe it. The attribute is en smear
horizontal / en smear vertical.
Example: reflection
If truncated, obscured, or reflective characters can be determined from the semantics and
the shape of the remaining characters, they need to be transcribed.
Within one frame, if some truncated, obscured, or reflective characters cannot be
determined from the remaining parts of the characters and the semantics, use <ERR> to
replace the unclear characters. One <ERR> can represent multiple characters as needed.
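A minimal sketch of the <ERR> substitution: it assumes the annotator's reading of a frame is given as a list in which each unreadable character is `None`. That input format and the function name are hypothetical. Consecutive unreadable characters collapse into a single <ERR>, since one <ERR> may stand for several characters.

```python
from typing import List, Optional


def transcribe_with_err(chars: List[Optional[str]]) -> str:
    """Join readable characters, replacing each unreadable run with one <ERR>."""
    out: List[str] = []
    for ch in chars:
        if ch is None:
            # Start a new <ERR> only if the previous item is not already one.
            if not out or out[-1] != "<ERR>":
                out.append("<ERR>")
        else:
            out.append(ch)
    return "".join(out)
```

For example, `["b", "o", "n", None, None, "o", "u", "r"]` becomes `bon<ERR>our`. A line that is unreadable as a whole is handled by the “Blurry& Phonetic symbols” attribute instead, not by this substitution.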
English-horizontal/English-vertical
If English characters or numbers (such as 123) appear in the picture, draw the frame and
select English horizontal (or English vertical); there is no need to transcribe the text.
When English is smeared, there is no need to transcribe after the frame is drawn. The
attribute is en smear horizontal / en smear vertical.
Not-care
When a whole line of text cannot be recognized, or a whole line consists of phonetic
symbols, select the “Blurry& Phonetic symbols” attribute; the line needs to be framed but
does not need to be transcribed.
Example:
Other-language
For languages other than French and English, select the other-language attribute and do not
transcribe.
Example:
Formula
When there are complex physics, chemistry, or mathematics formulas that are difficult to
transcribe, frame the text, select the formula attribute, and do not transcribe.
Example:
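The attribute rules in this section can be summarized as a lookup sketch. The content categories (`"french"`, `"english"`, `"unreadable"`, `"formula"`, `"other"`) and the function name are hypothetical; the tag strings follow the wording used in these rules, and the labels in the actual tool may be spelled differently.

```python
from typing import Tuple


def choose_attribute(content: str, direction: str,
                     smeared: bool = False) -> Tuple[str, bool]:
    """Return (attribute tag, whether the frame must be transcribed)."""
    if direction not in ("horizontal", "vertical"):
        raise ValueError("direction must be 'horizontal' or 'vertical'")
    if content == "french":
        # French is always transcribed, smeared or not.
        tag = f"smear {direction}" if smeared else f"French-{direction}"
        return tag, True
    if content == "english":
        # English letters and numbers are framed but never transcribed.
        tag = f"en smear {direction}" if smeared else f"English-{direction}"
        return tag, False
    if content == "unreadable":
        # Whole line unrecognizable, or a whole line of phonetic symbols.
        return "Blurry& Phonetic symbols", False
    if content == "formula":
        return "formula", False
    if content == "other":
        return "other-language", False
    raise ValueError(f"unknown content category: {content}")
```

Only the two French branches return `True`, which matches the rule that French is the only text transcribed.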
apostrophe
Example:
Frame and mark “apostrophe”; for the number “3”, frame and mark “Chinese and English
horizontal”, as “3” is a number.
Other requirements:
Blurred text
2) If part of the text in one frame is blurred: use <ERR> to replace characters that are too
blurry to be determined from the semantics, and transcribe the remaining text normally if
it is clearly recognizable. The attribute should be French-horizontal (or French-vertical).
Example:
3) If a whole line of text is unclear and cannot be transcribed, select the Blurry& Phonetic
symbols attribute and do not transcribe.
Example:
Incomplete text
1) Text is called incomplete if it is truncated, blocked by leaves, wires, or other
obstacles, affected by reflections, or missing strokes or characters.
2) No matter how much the text is truncated, you must draw a frame, select the smear
attribute, and transcribe. Use <ERR> to replace characters you cannot identify.
Example:
2) If one line of text has multiple languages and French makes up the larger proportion,
frame the languages separately: transcribe the French text normally, and frame the other
languages with the “other-language” attribute without transcribing the content.
If French makes up only a small proportion, frame the whole line, select the “other-language”
attribute, and do not transcribe the content.
Example:
3) If French characters and English characters are on the same line and the space between
the two languages is less than 2 characters, draw one frame, select the French horizontal
attribute, and transcribe all characters; do not draw separate frames.
Table image
A picture in the form of a table or list, such as an ingredients list, must be framed and
transcribed according to the table cells and must also follow the “one line one frame”
rule:
If the space between characters exceeds two characters, the text must be split into two
frames.
If there is a relatively large gap between letters or characters within one word, transcribe
normally without adding spaces between letters of the same word.
Example:
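The spacing rule can be sketched as follows. The sketch assumes each piece of text on a line comes with the blank space before it, measured in character widths; that input format, the function name, and joining pieces with a single space are all assumptions. A gap of exactly two characters is not covered by the rules; here it is kept in the same frame.

```python
from typing import List, Tuple


def split_into_frames(chunks: List[Tuple[str, float]],
                      max_gap: float = 2.0) -> List[str]:
    """Group text chunks on one line into frames.

    chunks  -- (text, gap_before) pairs, gap_before in character widths
    max_gap -- a gap larger than this starts a new frame
    """
    frames: List[str] = []
    for text, gap_before in chunks:
        if frames and gap_before <= max_gap:
            # Small gap: the chunk stays in the current frame.
            frames[-1] += " " + text
        else:
            # First chunk, or a gap over the limit: start a new frame.
            frames.append(text)
    return frames
```

With the chunks `("Farine", 0)`, `("de", 1)`, `("blé", 1)`, `("250 g", 4)`, the sketch yields two frames: `Farine de blé` and `250 g`.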
language symbol
Tone symbol
If the tone symbol affects the meaning of the word, it needs to be transcribed; if it does
not affect the meaning of the word, it is not transcribed.
Special example
1) When drawing a frame, keep it close to the words without covering them; different
frames may overlap and cross.
Example:
Use a polygonal frame to select the required text on the arc stamp as a whole, and manually
mark the upper and lower boundary points accordingly.
3) Reflection words
a. Clear reflection words (readable or recognizable) should be framed and marked with the
“Blurry& Phonetic symbols” attribute.
4) Special symbol:
a. The bullet before the word:
If the bullet can be transcribed, it needs to be framed. For special symbols that cannot be
transcribed, frame only the text.
The yellow frame is the correct way to frame. (If the distance between the symbol and the
first character is within two characters, they need to be framed together.)
b. Graphic symbols
Graphic symbols that can be transcribed require a frame, such as a “w” in the word graphics.
c. Special symbols that cannot be typed on the keyboard do not need to be framed:
5) Underline
a. If an underline has no text before or after it (underline only), ignore it and do not
frame it.
b. If there is text before or after the underline and no text above the underline,
transcribe a single “_” no matter how long the underline is.
Example:
If there is text on the underline, frame only the text and ignore the underline. (The
spacing rule mentioned earlier applies.)
Example:
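The underline cases can be sketched as one function. It assumes a line is described as a list of segments, each either `("text", value)` or `("underline", value)`, where the underline's value is the text written on it (empty if blank). This representation, the function name, and joining segments with a single space are hypothetical.

```python
from typing import List, Optional, Tuple


def transcribe_underline_line(segments: List[Tuple[str, str]]) -> Optional[str]:
    """Return the transcription for a line, or None if nothing is framed."""
    parts: List[str] = []
    has_text = False
    for kind, value in segments:
        if kind == "text" or value:
            # Plain text, or text written on an underline: keep the text only.
            parts.append(value)
            has_text = True
        else:
            # Blank underline: a single "_" whatever its length.
            parts.append("_")
    # Underline only, with no text anywhere on the line: ignore, no frame.
    return " ".join(parts) if has_text else None
```

A line holding only a blank underline returns `None`; `Nom:` followed by a blank underline gives `Nom: _`; `Nom:` followed by an underline with `Dupont` written on it gives `Nom: Dupont`.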
7) If the superscript and subscript are on the same horizontal line as the text, frame them
together with the text in one line; there is no need for a separate frame.
Example: