
Project Skokie

Image QA Validation

Presented by:
Appen Data Collection

March, 2021
Project Overview

• The purpose of Spruce OCR image collection projects is to collect images of
textual items from specific types of documents to improve AI/ML text
recognition technology.

• In this project, you validate Business Cards of an individual or Non-business
Cards of a company in different languages. The purpose of this task is to ensure
that poor-quality or irrelevant data are removed from the dataset and that
good-quality images are correctly allocated to their categories/languages.
Annotate and transcribe the images of business cards

You may need to reject the image for any of the following reasons:
● Incorrect language
● Weird angles
● Handwritings
● Digital Screen
● Irrelevant Image
● Bad quality: Blurry/unrecognisable, Not focused, Too Dark/Too Bright,
Interference (Watermark/Water Stain/Ink/Shadow/Reflection), High Density
● Other
Remember:
Reject!
Mandatory fields must be present in the image.
● giver's first name (mandatory)
● giver's last name (mandatory)
Remember:
Reject!
Not Target language (or Your language)
● Less than 60% of text in target language.
○ Please note that phone numbers, email addresses, social media
accounts, and organization names are not considered “TEXT”.

● Target language: English, German, Spanish, French, Dutch, Portuguese,
Italian, Polish, Hebrew, Japanese, Korean, Simplified Chinese,
Traditional Chinese, Russian
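
The 60% rule above can be sketched as a simple ratio check. This is only an illustration, not part of the project tooling: the `Fragment` type, the kind labels, and the choice to measure the share by character count are all assumptions, since the guideline does not say how the percentage is measured.

```python
from dataclasses import dataclass

# Kinds that the guideline says are not counted as "TEXT".
# The label strings themselves are hypothetical.
EXCLUDED_KINDS = {"phone", "email", "social_media", "organization_name"}


@dataclass
class Fragment:
    text: str
    language: str  # e.g. "German"
    kind: str      # e.g. "first_name", "job_title", "phone"


def target_language_share(fragments, target_language):
    """Fraction of counted characters written in the target language."""
    counted = [f for f in fragments if f.kind not in EXCLUDED_KINDS]
    total = sum(len(f.text) for f in counted)
    if total == 0:
        # No countable text at all: treat the share as zero (assumption).
        return 0.0
    in_target = sum(len(f.text) for f in counted
                    if f.language == target_language)
    return in_target / total


def should_reject_for_language(fragments, target_language, threshold=0.60):
    # Reject when less than 60% of the counted text is in the target language.
    return target_language_share(fragments, target_language) < threshold
```

For example, a card with a German name but an English job title would fall below the threshold for German under this character-count assumption, while the phone number would not affect the result either way.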
Remember:
Reject!
Do not accept cards with HANDWRITING.
● Any handwritten marks (check marks, tick marks, lines, or any strokes)
are considered handwriting, and the card must be rejected.
Remember:
Reject!
Digital Screen or Irrelevant Image

● Digital images: Images captured from a digital screen/monitor screen
(hint: they show click buttons).

● Irrelevant images: Images that are not part of our category scope.
Remember:
Reject!
Bad Quality

● Font size too small: If the font size of the text in the document is
below 8pt, the image should be rejected.

● Blurry/unrecognisable, Not focused: Text should always be readable.
If the text is blurry, the image should be rejected right away.
Remember:
Reject!
Bad Quality

● Image too dark

● Image too bright


Remember:
Reject!
Bad Quality
● Interference: Anything that severely obstructs the text and affects
its quality, such as writing, highlights, or shadow, should be rejected
as interference.

● Back-side text: When the document paper is THIN, text from the other
side can be seen. It would be hard for the machine to read the text
clearly.
Remember:
Reject!
Bad Quality

● Multiple objects in the image

● Mirror text: The text is reflected onto another area or space in the
image.
Remember:
Reject!
Bad Quality

● Image corrupted

● Image cannot load: ADAP latency issue
Max of 20° tilt or weird angle
Text is readable
General requirements

GENERAL - DOs

● Business Cards in the specified language only
● Only one-face business cards
● Only printed documents are relevant
● Documents must be aligned (maximum 20° angle), centered, and occupy
at least 80% of the image.

GENERAL - DON'Ts

● No background text from behind the page that comes through on the
original page. Avoid pages that are too thin.
● No handwritten text in the documents
● No blurry images
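
The alignment and coverage requirements above (at most 20° of tilt, at least 80% of the image) could be checked with a small helper like the one below. It is a rough sketch under assumptions: the tilt angle and the card's pixel area are taken as already measured by some upstream step, "occupy at least 80%" is interpreted as an area ratio, and the function and parameter names are hypothetical.

```python
def meets_geometry_requirements(tilt_degrees, card_area_px, image_area_px,
                                max_tilt=20.0, min_coverage=0.80):
    """Return True when the card is within the tilt and coverage limits."""
    if image_area_px <= 0:
        raise ValueError("image_area_px must be positive")
    # Share of the image occupied by the card (area ratio is an assumption).
    coverage = card_area_px / image_area_px
    # Tilt may be measured clockwise or counter-clockwise, so compare magnitude.
    return abs(tilt_degrees) <= max_tilt and coverage >= min_coverage
```

A card tilted by 10° that fills 85% of the frame would pass, while one tilted by 25° or filling only half the frame would not.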
Business Cards fields:

Image should contain at least 2 of the following fields:


● giver's first name (mandatory)
● giver's last name (mandatory)
● contact title (e.g. Dr./Prof.) (optional)
● company/organization name (optional), or at least written in the Logo
● logo (optional)
● job title (optional)
● job title department (optional)
● telephone number(s); fixed line and/or mobile phones (optional)
● fax (optional)
● email(s); contact email and/or company email (optional)
● website (optional)
● address: house number, street name, city, zip code, state, country (optional)
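
The mandatory-field rule can be expressed as a short check. This is an illustrative sketch only: the field identifiers (`first_name`, `last_name`) are hypothetical stand-ins for the giver's first and last name, and the way fields are detected on a card is assumed to happen elsewhere.

```python
# Hypothetical identifiers for the two mandatory fields in the list above.
MANDATORY_FIELDS = ("first_name", "last_name")


def missing_mandatory_fields(present_fields):
    """Return the mandatory fields that were not found on the card."""
    present = set(present_fields)
    return [field for field in MANDATORY_FIELDS if field not in present]


def has_mandatory_fields(present_fields):
    # The card passes this check only when nothing mandatory is missing.
    return not missing_mandatory_fields(present_fields)
```

A card showing only a last name, an email address, and a logo would fail this check because the first name is absent, regardless of how many optional fields it contains.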
Other (what is the comment area in ADAP for?)

If the reason for rejection is not in our reject reasons list, select
“Other” and then state the reject reason in a comment.
Ex. “Incorrect PII Redaction by Vendor”
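
The "Other" rule amounts to requiring a comment whenever that reason is chosen. The sketch below illustrates the idea; the reason labels are copied from the reject list earlier in this deck, but the function name and the returned dictionary shape are assumptions and do not describe how ADAP stores judgments.

```python
# Reject reasons as listed in this deck.
REJECT_REASONS = (
    "Incorrect language",
    "Weird angles",
    "Handwritings",
    "Digital Screen",
    "Irrelevant Image",
    "Bad quality",
    "Other",
)


def validate_rejection(reason, comment=""):
    """Check a reject reason and require a comment when it is 'Other'."""
    if reason not in REJECT_REASONS:
        raise ValueError(f"unknown reject reason: {reason!r}")
    cleaned = comment.strip()
    if reason == "Other" and not cleaned:
        raise ValueError("a comment is required when the reason is 'Other'")
    return {"reason": reason, "comment": cleaned}
```

For instance, `validate_rejection("Other", "Incorrect PII Redaction by Vendor")` is accepted, while `validate_rejection("Other")` raises an error because no comment was given.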
How to annotate

This video demonstrates how to do the job (Step 3). Its purpose is to give
a rough idea of how to perform your job. If the video cannot play, please
click this link to watch.

How to annotate:
1. Check if you need to reject the business card

2. Select the correct label and add a bounding box

3. Check the prefilled text to ensure it fits the format requirement,
and correct the text if there are any errors
Remember:
PII Labeling
● Bounding boxes should be as tight as possible to the relevant text.

● If the business card has PII and it was redacted by the vendor, we
should not accept it.
Remember:
Proper Annotation
● All required fields present in the document should be annotated.

Ex. The organization name appears twice in this image. Annotate both
names: the one in the logo and the one in the information section.
Remember:
Proper Annotation
● If the person has multiple roles mentioned in the document, annotate
each role separately.

● If the person has multiple middle names mentioned in the document,
annotate all middle names in ONE bounding box.
Remember:
Proper Annotation
● If the company name is abbreviated, the full form of the abbreviation
also appears in the document, and a different field between them would
be covered by another annotation box:

- We will annotate the company name separately
Remember:
Proper Annotation
● Punctuation

- The less punctuation the better
- Only annotate punctuation when it is required:
  - namePrefix: Dr.
  - Punctuation in addresses: No.1
  - Punctuation in the phone number: (+1)234 567, EXT1234
Labels To Annotate (please read carefully)
Thank you
