Build Accurate Training Datasets With Amazon SageMaker Ground Truth AIM308

AIM308
Build accurate training datasets with Amazon SageMaker Ground Truth

Warren Barkley, GM, Augmented AI, Amazon Web Services
Vikram Madan, Product Lead, Amazon Web Services
Kevin Dela Rosa, ML Engineer, Perception, Snap
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Data labeling is tedious and difficult
• Massive scale: ML models need large, labeled datasets
• High accuracy: ML models depend on accurately labeled data
As a result, building the training dataset takes up to 80% of a data scientist’s time
Build highly accurate training datasets
Reduce data labeling costs by up to 70%
▪ Built-in features to improve label accuracy
▪ Automated data labeling capability
▪ Access to multiple workforces: Amazon Mechanical Turk and third-party vendors
▪ Option to bring your own workers and secure handling of data
▪ Tight integration with Amazon SageMaker pipeline
More accurate and efficient data labeling
• Active learning model is trained from human-labeled data
• Human-labeled data is then sent back to retrain and improve the machine learning model
• An accurate training dataset is ready for use
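The loop on this slide can be illustrated with a minimal sketch. It assumes a simple confidence threshold decides which items keep the model's label and which go to human labelers; the slides do not describe Ground Truth's actual selection logic, and the `Prediction` type, the `route` function, and the 0.9 threshold are all hypothetical.

```go
package main

import "fmt"

// Prediction is a hypothetical record: one unlabeled item plus the model's best guess.
type Prediction struct {
	ItemID     string
	Label      string
	Confidence float64 // model confidence in [0, 1]
}

// route splits predictions into items that keep the model's label and
// items that should be sent to human labelers. The threshold rule is an
// assumption for illustration, not Ground Truth's documented behavior.
func route(preds []Prediction, threshold float64) (auto, human []Prediction) {
	for _, p := range preds {
		if p.Confidence >= threshold {
			auto = append(auto, p)
		} else {
			human = append(human, p)
		}
	}
	return auto, human
}

func main() {
	preds := []Prediction{
		{ItemID: "img-1", Label: "Hand", Confidence: 0.998},
		{ItemID: "img-2", Label: "Hand", Confidence: 0.62},
		{ItemID: "img-3", Label: "Dog", Confidence: 0.95},
	}
	auto, human := route(preds, 0.9)
	fmt.Println("auto-labeled:", len(auto), "sent to humans:", len(human))
}
```

The human-labeled items would then be fed back as training data, so the share of items that clear the threshold can grow from round to round.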
Built-in data labeling workflows
• Image classification
• Bounding boxes
• Semantic segmentation
• Text classification
• Named entity recognition
• Custom

Custom data labeling workflows
Learn more @ https://amzn.to/2OsREAk

Human workforce options
Labeling job creation
Scan is powered by AI
Machine learning & computer vision are at the core of the scan:
• Image classification
• Object detection
• Semantic segmentation
• Content-based information retrieval
• Ranking
Etc.
Machine learning – bird’s-eye view
Dataset collection → Model training → Evaluation → Deployment
“Data is the new oil”
Many large-scale public datasets are available; for example:
• ImageNet (1M images)
• Open Images (9M images)
• Places (10M images)
[Diagram: ImageNet data → CNN → ImageNet label; New task data → Pre-trained CNN → New output layer for target]
Data is great …
Getting to state of the art
{
"label": "Hand",
"score": 0.99794453
}
👍
Your application’s data is different
{
"label": "Hand",
"score": 0.9703855
}
👎
New problem
We have relevant images we can learn from, but no labels
Solution
What we liked:
• In AWS: we’re already using some AWS solutions in our training workflows (Amazon SageMaker training, Amazon S3), so it’s easy to point to data
• Speed: we can get images labeled on-demand quickly
• Flexibility to leverage public or private workforces to label data
• Ability to kick off labeling jobs programmatically or via UI
Integrating with Amazon SageMaker Ground Truth
Data labeling pipeline (Kubeflow): Get target images → Pre-processing → Submit Ground Truth job → Extract labels

smInput := sagemaker.CreateLabelingJobInput{
    HumanTaskConfig:          &humanTaskConfig,
    InputConfig:              &inputConfig,
    LabelAttributeName:       &attrName,
    LabelCategoryConfigS3Uri: &job.CategoryUri,
    LabelingJobName:          &fullJobName,
    OutputConfig:             &outputConfig,
    RoleArn:                  roleArn,
}
client.CreateLabelingJob(&smInput)
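Before a labeling job can be submitted, the pre-processing step has to tell Ground Truth which objects to label. A minimal sketch of that step follows, assuming the standard input manifest format of one JSON object per line with a `source-ref` key holding an S3 URI. The `buildManifest` helper and the bucket name are hypothetical and are not taken from the talk.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// manifestLine mirrors one line of an input manifest: a single S3 object to label.
type manifestLine struct {
	SourceRef string `json:"source-ref"`
}

// buildManifest renders one JSON object per line (JSON Lines) for the given S3 URIs.
// It is a hypothetical helper; uploading the result to S3 is left out.
func buildManifest(s3URIs []string) (string, error) {
	var sb strings.Builder
	for _, uri := range s3URIs {
		b, err := json.Marshal(manifestLine{SourceRef: uri})
		if err != nil {
			return "", err
		}
		sb.Write(b)
		sb.WriteByte('\n')
	}
	return sb.String(), nil
}

func main() {
	manifest, err := buildManifest([]string{
		"s3://example-bucket/images/0001.jpg", // example-bucket is a placeholder
		"s3://example-bucket/images/0002.jpg",
	})
	if err != nil {
		panic(err)
	}
	fmt.Print(manifest)
}
```

The resulting text would be uploaded to S3 and referenced from the job's input configuration.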

Integrating with Amazon SageMaker Ground Truth
Data labeling pipeline: Get target images → Pre-processing → Submit Ground Truth job → Extract labels
Model training pipeline: Labeled image list → Create TF record → SageMaker training → Extract model file
[Diagram also shows BigQuery]
Learn more @ https://bit.ly/35GzfH4

Impact
• We can gather labels for 1000s of images in hours
• Incorporating new data improves our predictions
• Our models do the right thing now
No longer labeled as hand
{
"label": "Hand",
"score": 0.06862099
}
👍
Tips for getting high-quality labels
Provide clear instructions
• Show images of good and bad examples
• Bootstrap this by running a small test set and gathering common mistakes
Opt for the smallest label set possible
• If you have multiple potential labels, narrow down the field instead of exposing all possible labels
Use multiple workers per task
• Helps improve your overall accuracy
Data labeling best practices
• Evaluate and improve your labels
• Make labeling easier for your labelers
• Use multiplicity to improve accuracy
• Measure accuracy & throughput of labelers
• Label only what you need to

Evaluate and improve your labels
☐ Yes
☑ No
Learn more @ https://amzn.to/364LQFh

Make labeling easier for your labelers
Auto-segment uses the Deep Extreme Cut (DEXTR) algorithm
Learn more @ https://amzn.to/33IbiyL
Use multiplicity to improve accuracy
https://amzn.to/2N9PrsD
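As a rough illustration of why multiplicity helps, the sketch below consolidates several workers' answers for one item by simple majority vote. This is a stand-in chosen for brevity: the slides do not say how Ground Truth consolidates annotations, and the `majorityVote` function is hypothetical.

```go
package main

import "fmt"

// majorityVote returns the most frequent label and the share of votes it won.
// Ties are broken alphabetically so the result is deterministic.
func majorityVote(labels []string) (string, float64) {
	if len(labels) == 0 {
		return "", 0
	}
	counts := make(map[string]int)
	for _, l := range labels {
		counts[l]++
	}
	best, bestCount := "", 0
	for label, n := range counts {
		if n > bestCount || (n == bestCount && label < best) {
			best, bestCount = label, n
		}
	}
	return best, float64(bestCount) / float64(len(labels))
}

func main() {
	// Three workers label the same image; one disagrees.
	label, share := majorityVote([]string{"Athlete", "Animal", "Athlete"})
	fmt.Printf("%s (%.2f of votes)\n", label, share)
}
```

The vote share doubles as a crude confidence signal: items with a low share are candidates for relabeling or clearer instructions.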
Measure accuracy and throughput of labelers
Raw worker responses emitted to S3 (two responses shown: worker 1 answered "Athlete", worker 2 answered "Animal")
{
  "answers": [
    {
      "answerContent": {
        "crowd-classifier": {"label": "Athlete"}
      },
      "submissionTime": "2019-10-16T03:25:56.656Z",
      "workerId": "private.us-west-2.2fa5a9d73ef73ba0",
      "workerMetadata": {
        "identityData": {
          "identityProviderType": "Cognito",
          "issuer": "https://cognito-idp.us-west-2.amazonaws.com/us-west-2_K2Rl3SHuq",
          "sub": "c9a8f4a4-ed4a-4dad-a722-8532d0d6016e"
        }
      }
    },
    {
      "answerContent": {
        "crowd-classifier": {"label": "Animal"}
      },
      "submissionTime": "2019-10-16T03:27:31.048Z",
      "workerId": "private.us-west-2.7dcbcca1ce3117d8",
      "workerMetadata": {
        "identityData": {
          "identityProviderType": "Cognito",
          "issuer": "https://cognito-idp.us-west-2.amazonaws.com/us-west-2_K2Rl3SHuq",
          "sub": "7eb0d3bc-2da5-4244-b14f-d9ec6ffe2e17"
        }
      }
    }
  ]
}
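A raw response file like this one can be read back to compare individual workers. The sketch below decodes only the fields needed to map each worker to the label they gave; the struct and the `labelsByWorker` helper are hypothetical, the field names come from the sample on this slide, and the `workerMetadata` block is omitted from the inline sample for brevity.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// workerResponses models only the fields of the raw response file used here.
// Fields not listed (such as workerMetadata) are ignored by encoding/json.
type workerResponses struct {
	Answers []struct {
		AnswerContent struct {
			CrowdClassifier struct {
				Label string `json:"label"`
			} `json:"crowd-classifier"`
		} `json:"answerContent"`
		SubmissionTime string `json:"submissionTime"`
		WorkerID       string `json:"workerId"`
	} `json:"answers"`
}

// labelsByWorker maps each worker ID to the label that worker submitted.
func labelsByWorker(raw []byte) (map[string]string, error) {
	var r workerResponses
	if err := json.Unmarshal(raw, &r); err != nil {
		return nil, err
	}
	out := make(map[string]string, len(r.Answers))
	for _, a := range r.Answers {
		out[a.WorkerID] = a.AnswerContent.CrowdClassifier.Label
	}
	return out, nil
}

func main() {
	raw := []byte(`{"answers":[
	  {"answerContent":{"crowd-classifier":{"label":"Athlete"}},
	   "submissionTime":"2019-10-16T03:25:56.656Z",
	   "workerId":"private.us-west-2.2fa5a9d73ef73ba0"},
	  {"answerContent":{"crowd-classifier":{"label":"Animal"}},
	   "submissionTime":"2019-10-16T03:27:31.048Z",
	   "workerId":"private.us-west-2.7dcbcca1ce3117d8"}]}`)
	labels, err := labelsByWorker(raw)
	if err != nil {
		panic(err)
	}
	for id, label := range labels {
		fmt.Println(id, label)
	}
}
```

Comparing each worker's label against the consolidated label over many items gives a per-worker agreement rate, which is one way to approximate labeler accuracy.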
Measure accuracy and throughput of labelers
Amazon CloudWatch Logs & metrics for worker throughput

{
"worker_id": "cd449a289e129409",
"cognito_user_pool_id": "us-east-2_IpicJXXXX",
"cognito_sub_id": "d6947aeb-0650-447a-ab5d-894db61017fd",
"task_accepted_time": "Wed Aug 14 16:00:59 UTC 2019",
"task_submitted_time": "Wed Aug 14 16:01:04 UTC 2019",
"task_returned_time": "",
"workteam_arn": "arn:aws:sagemaker:us-east-2:############:workteam/private-crowd/Sample-labeling-team",
"labeling_job_arn": "arn:aws:sagemaker:us-east-2:############:labeling-job/metrics-demo",
"work_requester_account_id": "############",
"job_reference_code": "############",
"job_type": "Private",
"event_type": "TasksSubmitted",
"event_timestamp": "1565798464"
}
Learn more @ https://amzn.to/34N2eZQ
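One way to turn such a log event into a throughput figure is to compute how long the worker held the task. The sketch below assumes the timestamp strings always follow the layout seen in the sample (`Wed Aug 14 16:00:59 UTC 2019`), which matches Go's `time.UnixDate`; the `workerEvent` struct and `taskDuration` helper are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// workerEvent models only the log fields needed to time a single task.
type workerEvent struct {
	WorkerID          string `json:"worker_id"`
	TaskAcceptedTime  string `json:"task_accepted_time"`
	TaskSubmittedTime string `json:"task_submitted_time"`
	EventType         string `json:"event_type"`
}

// taskDuration returns the time between accepting and submitting a task.
// It assumes both timestamps use the time.UnixDate layout.
func taskDuration(e workerEvent) (time.Duration, error) {
	accepted, err := time.Parse(time.UnixDate, e.TaskAcceptedTime)
	if err != nil {
		return 0, err
	}
	submitted, err := time.Parse(time.UnixDate, e.TaskSubmittedTime)
	if err != nil {
		return 0, err
	}
	return submitted.Sub(accepted), nil
}

func main() {
	raw := []byte(`{
	  "worker_id": "cd449a289e129409",
	  "task_accepted_time": "Wed Aug 14 16:00:59 UTC 2019",
	  "task_submitted_time": "Wed Aug 14 16:01:04 UTC 2019",
	  "event_type": "TasksSubmitted"}`)
	var e workerEvent
	if err := json.Unmarshal(raw, &e); err != nil {
		panic(err)
	}
	d, err := taskDuration(e)
	if err != nil {
		panic(err)
	}
	fmt.Println(e.WorkerID, "took", d)
}
```

Averaging these durations per `worker_id` over `TasksSubmitted` events gives a simple tasks-per-hour estimate for each labeler.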

Label only what you need to
“Not all data is created equal”
Check out the full blog @ https://amzn.to/2VZnBDv

Learn ML with AWS Training and Certification
The same training that our own developers use, now available on demand
Role-based ML learning paths for developers, data scientists, data platform engineers, and business decision makers
70+ free digital ML courses from AWS experts let you learn from real-world challenges tackled at AWS
Validate expertise with the AWS Certified Machine Learning - Specialty exam
Visit https://aws.training/machinelearning
Thank you!
