
AIM308

Build accurate training datasets with Amazon SageMaker Ground Truth
Warren Barkley, GM, Augmented AI, Amazon Web Services
Vikram Madan, Product Lead, Amazon Web Services
Kevin Dela Rosa, ML Engineer, Perception, Snap

© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Data labeling is tedious and difficult

• Massive scale: ML models need large, labeled datasets
• High accuracy: ML models depend on accurately labeled data

As a result, building the training dataset takes up to 80% of a data scientist’s time
Amazon SageMaker Ground Truth

Build highly accurate training datasets


Reduce data labeling costs by up to 70%

▪ Built-in features to improve label accuracy


▪ Automated data labeling capability
▪ Access to multiple workforces: Amazon Mechanical Turk and
third-party vendors
▪ Option to bring your own workers and secure handling of data
▪ Tight integration with Amazon SageMaker pipeline
More accurate and efficient data labeling

• An active learning model is trained from human-labeled data
• Human-labeled data is then sent back to retrain and improve the machine learning model
• An accurate training dataset is ready for use
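
The loop above can be sketched as a routing step: predictions the model is confident about are labeled automatically, and the rest go to human labelers whose answers feed the next round of training. This is a minimal illustration in Go, not Ground Truth's actual implementation; the `Prediction` type, the `route` function, and the 0.95 threshold are all assumptions made for the example.

```go
package main

import "fmt"

// Prediction is a hypothetical model output for one unlabeled item.
type Prediction struct {
	ItemID string
	Label  string
	Score  float64 // model confidence in [0, 1]
}

// route splits predictions into items confident enough to auto-label and
// items that should be sent to human labelers. The threshold is an
// assumption chosen by the caller, not a Ground Truth setting.
func route(preds []Prediction, threshold float64) (auto, human []Prediction) {
	for _, p := range preds {
		if p.Score >= threshold {
			auto = append(auto, p)
		} else {
			human = append(human, p)
		}
	}
	return auto, human
}

func main() {
	preds := []Prediction{
		{ItemID: "img-001", Label: "Hand", Score: 0.998},
		{ItemID: "img-002", Label: "Hand", Score: 0.62},
		{ItemID: "img-003", Label: "Animal", Score: 0.97},
	}
	auto, human := route(preds, 0.95)
	fmt.Printf("auto-labeled: %d, sent to humans: %d\n", len(auto), len(human))
}
```

Human answers for the low-confidence items would then be added to the training set before the model is retrained.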
Built-in data labeling workflows

• Image classification
• Bounding boxes
• Semantic segmentation
• Text classification
• Named entity recognition
• Custom: custom data labeling workflows

Learn more @ https://amzn.to/2OsREAk


Human workforce options
Labeling job creation
Scan is powered by AI
Machine learning & computer vision are at the core of the scan:
• Image classification
• Object detection
• Semantic segmentation
• Content-based information retrieval
• Ranking
Etc.
Machine learning – bird’s-eye view

Dataset collection → Model training → Evaluation → Deployment
“Data is the new oil”

Many large-scale public datasets are available; for example:
• ImageNet (1M images)
• Open Images (9M images)
• Places (10M images)

[Diagram: a CNN is trained on ImageNet data and labels; the pre-trained CNN is then reused on new task data with a new output layer for the target]
Data is great …
Getting to state of the art

{
"label": "Hand",
"score": 0.99794453
}

👍
Your application’s data is different

{
"label": "Hand",
"score": 0.9703855
}

👎
New problem
We have relevant images we can learn from, but no labels
Solution
Amazon SageMaker Ground Truth
What we liked:
• In AWS
• We’re already using some AWS solutions in our training workflows (Amazon SageMaker
training, Amazon S3), so it’s easy to point to data
• Speed
• We can get images labeled on-demand quickly
• Flexibility to leverage public or private workforces to label data
• Ability to kick off labeling jobs programmatically or via UI
Integrating with Amazon SageMaker Ground Truth

Data labeling pipeline (Kubeflow):
Get target images → Pre-processing → Submit Ground Truth job → Extract labels

The "Submit Ground Truth job" step:

    smInput := sagemaker.CreateLabelingJobInput{
        HumanTaskConfig:          &humanTaskConfig,
        InputConfig:              &inputConfig,
        LabelAttributeName:       &attrName,
        LabelCategoryConfigS3Uri: &job.CategoryUri,
        LabelingJobName:          &fullJobName,
        OutputConfig:             &outputConfig,
        RoleArn:                  roleArn,
    }

    client.CreateLabelingJob(&smInput)
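
As an illustration of the pre-processing step, the sketch below builds the kind of input manifest a labeling job reads: a JSON Lines file with one `source-ref` entry per S3 object. It uses only the Go standard library; the bucket name, object keys, and the `buildManifest` helper are hypothetical and are not taken from the pipeline shown on the slide.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// manifestLine is one line of an input manifest: a JSON object whose
// "source-ref" key points at the S3 object to be labeled.
type manifestLine struct {
	SourceRef string `json:"source-ref"`
}

// buildManifest (hypothetical helper) renders one JSON object per line.
func buildManifest(uris []string) (string, error) {
	var sb strings.Builder
	for _, u := range uris {
		b, err := json.Marshal(manifestLine{SourceRef: u})
		if err != nil {
			return "", err
		}
		sb.Write(b)
		sb.WriteByte('\n')
	}
	return sb.String(), nil
}

func main() {
	// Example URIs are placeholders, not real objects.
	manifest, err := buildManifest([]string{
		"s3://example-bucket/images/0001.jpg",
		"s3://example-bucket/images/0002.jpg",
	})
	if err != nil {
		fmt.Println("could not build manifest:", err)
		return
	}
	fmt.Print(manifest)
}
```

The resulting text would be uploaded to S3, and the job's input configuration would reference that manifest location.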


Integrating with Amazon SageMaker Ground Truth
Data labeling pipeline:
Get target images → Pre-processing → Submit Ground Truth job → Extract labels

Model training pipeline:
Labeled image list → Create TF record → SageMaker training → Extract model file

The diagram also includes BigQuery.

Learn more @ https://bit.ly/35GzfH4


Impact

• We can gather labels for 1000s of images in hours
• Incorporating new data improves our predictions
• Our models do the right thing now

No longer labeled as hand:
{
"label": "Hand",
"score": 0.06862099
}

👍
Tips for getting high-quality labels
Provide clear instructions
• Show images of good and bad examples
• Bootstrap this by running a small test set and gathering common
mistakes

Opt for the smallest label set possible


• If you have multiple potential labels, narrow down the field instead of
exposing all possible labels

Use multiple workers per task


• Helps improve your overall accuracy
Data labeling best practices

• Evaluate and improve your labels

• Make labeling easier for your labelers

• Use multiplicity to improve accuracy

• Measure accuracy & throughput of labelers

• Label only what you need to


Evaluate and improve your labels

☐ Yes
☑ No

Learn more @ https://amzn.to/364LQFh


Make labeling easier for your labelers

Auto-segment uses the Deep Extreme Cut (DEXTR) algorithm
Learn more @ https://amzn.to/33IbiyL
Use multiplicity to improve accuracy

Learn more @ https://amzn.to/2N9PrsD
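
One simple way to picture multiplicity is majority voting over the labels that several workers give the same item. The Go sketch below is a stand-in for illustration only; it is not the consolidation method Ground Truth itself applies, and the `majorityLabel` function and its tie-break rule (alphabetical order) are assumptions.

```go
package main

import "fmt"

// majorityLabel returns the most frequent label among the votes and the
// fraction of workers who chose it. Ties are broken alphabetically so the
// result is deterministic; that rule is an assumption of this sketch.
func majorityLabel(votes []string) (string, float64) {
	if len(votes) == 0 {
		return "", 0
	}
	counts := map[string]int{}
	for _, v := range votes {
		counts[v]++
	}
	best, bestN := "", 0
	for label, n := range counts {
		if n > bestN || (n == bestN && label < best) {
			best, bestN = label, n
		}
	}
	return best, float64(bestN) / float64(len(votes))
}

func main() {
	label, agreement := majorityLabel([]string{"Athlete", "Animal", "Athlete"})
	fmt.Printf("consolidated label: %s (agreement %.2f)\n", label, agreement)
}
```

A low agreement value is a hint that the item is ambiguous or that the instructions need work.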
Measure accuracy and throughput of labelers
Raw worker responses emitted to S3 (the first answer is from worker 1, the second from worker 2):
{
  "answers": [
    {
      "answerContent": {
        "crowd-classifier": {"label": "Athlete"}
      },
      "submissionTime": "2019-10-16T03:25:56.656Z",
      "workerId": "private.us-west-2.2fa5a9d73ef73ba0",
      "workerMetadata": {
        "identityData": {
          "identityProviderType": "Cognito",
          "issuer": "https://cognito-idp.us-west-2.amazonaws.com/us-west-2_K2Rl3SHuq",
          "sub": "c9a8f4a4-ed4a-4dad-a722-8532d0d6016e"
        }
      }
    },
    {
      "answerContent": {
        "crowd-classifier": {"label": "Animal"}
      },
      "submissionTime": "2019-10-16T03:27:31.048Z",
      "workerId": "private.us-west-2.7dcbcca1ce3117d8",
      "workerMetadata": {
        "identityData": {
          "identityProviderType": "Cognito",
          "issuer": "https://cognito-idp.us-west-2.amazonaws.com/us-west-2_K2Rl3SHuq",
          "sub": "7eb0d3bc-2da5-4244-b14f-d9ec6ffe2e17"
        }
      }
    }
  ]
}
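
A raw response file like the one above can be read with a few lines of Go to see which label each worker gave. This is a sketch under stated assumptions: the struct mirrors only the keys visible in the sample, the `labelsByWorker` helper is hypothetical, and the sample is trimmed to the fields it uses.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// workerResponses mirrors only the keys used here from the sample file.
type workerResponses struct {
	Answers []struct {
		AnswerContent struct {
			CrowdClassifier struct {
				Label string `json:"label"`
			} `json:"crowd-classifier"`
		} `json:"answerContent"`
		WorkerID string `json:"workerId"`
	} `json:"answers"`
}

// labelsByWorker (hypothetical helper) maps each worker ID to its label.
func labelsByWorker(raw []byte) (map[string]string, error) {
	var resp workerResponses
	if err := json.Unmarshal(raw, &resp); err != nil {
		return nil, err
	}
	labels := make(map[string]string, len(resp.Answers))
	for _, a := range resp.Answers {
		labels[a.WorkerID] = a.AnswerContent.CrowdClassifier.Label
	}
	return labels, nil
}

func main() {
	raw := []byte(`{"answers":[
	  {"answerContent":{"crowd-classifier":{"label":"Athlete"}},"workerId":"private.us-west-2.2fa5a9d73ef73ba0"},
	  {"answerContent":{"crowd-classifier":{"label":"Animal"}},"workerId":"private.us-west-2.7dcbcca1ce3117d8"}
	]}`)
	labels, err := labelsByWorker(raw)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	for worker, label := range labels {
		fmt.Printf("%s -> %s\n", worker, label)
	}
}
```

Comparing each worker's label with a trusted or consolidated label over many items gives a per-worker accuracy estimate.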
Measure accuracy and throughput of labelers

Amazon CloudWatch Logs & metrics for worker throughput


{
  "worker_id": "cd449a289e129409",
  "cognito_user_pool_id": "us-east-2_IpicJXXXX",
  "cognito_sub_id": "d6947aeb-0650-447a-ab5d-894db61017fd",
  "task_accepted_time": "Wed Aug 14 16:00:59 UTC 2019",
  "task_submitted_time": "Wed Aug 14 16:01:04 UTC 2019",
  "task_returned_time": "",
  "workteam_arn": "arn:aws:sagemaker:us-east-2:############:workteam/private-crowd/Sample-labeling-team",
  "labeling_job_arn": "arn:aws:sagemaker:us-east-2:############:labeling-job/metrics-demo",
  "work_requester_account_id": "############",
  "job_reference_code": "############",
  "job_type": "Private",
  "event_type": "TasksSubmitted",
  "event_timestamp": "1565798464"
}

Learn more @ https://amzn.to/34N2eZQ
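
Time spent per task can be derived from a log event like the one above by subtracting the accepted time from the submitted time. The Go sketch below assumes the timestamps always use the format shown in the sample, which matches Go's `time.UnixDate` layout; the `taskEvent` struct and the `taskDuration` helper are hypothetical names for this example.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// taskEvent mirrors only the fields needed from the sample log event.
type taskEvent struct {
	WorkerID  string `json:"worker_id"`
	Accepted  string `json:"task_accepted_time"`
	Submitted string `json:"task_submitted_time"`
}

// taskDuration (hypothetical helper) returns the worker ID and how long
// the worker held the task before submitting it.
func taskDuration(raw []byte) (string, time.Duration, error) {
	var ev taskEvent
	if err := json.Unmarshal(raw, &ev); err != nil {
		return "", 0, err
	}
	// Assumption: timestamps look like "Wed Aug 14 16:00:59 UTC 2019".
	accepted, err := time.Parse(time.UnixDate, ev.Accepted)
	if err != nil {
		return "", 0, err
	}
	submitted, err := time.Parse(time.UnixDate, ev.Submitted)
	if err != nil {
		return "", 0, err
	}
	return ev.WorkerID, submitted.Sub(accepted), nil
}

func main() {
	raw := []byte(`{
	  "worker_id": "cd449a289e129409",
	  "task_accepted_time": "Wed Aug 14 16:00:59 UTC 2019",
	  "task_submitted_time": "Wed Aug 14 16:01:04 UTC 2019"
	}`)
	worker, d, err := taskDuration(raw)
	if err != nil {
		fmt.Println("could not compute duration:", err)
		return
	}
	fmt.Printf("worker %s spent %v on the task\n", worker, d)
}
```

Averaging these durations per worker, and counting submitted events per hour, gives a rough throughput measure.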


Label only what you need to

“Not all data is created equal”

Check out the full blog @ https://amzn.to/2VZnBDv




Learn ML with AWS Training and Certification
The same training that our own developers use, now available on demand

Role-based ML learning paths for developers, data scientists, data platform engineers, and business decision makers

70+ free digital ML courses from AWS experts let you learn from real-world challenges tackled at AWS

Validate expertise with the AWS Certified Machine Learning - Specialty exam

Visit https://aws.training/machinelearning

Thank you!
