
The complete guide to data engines for AI


00 Content
01 Introduction
02 The benefits of a best-in-class data engine
    Understand your data
    Label data quickly and efficiently
    Train and evaluate models
    Integrate seamlessly with your tech stack
    Get operations expertise on demand
03 Who benefits from a data engine?
04 What to look for in a data engine
05 Optimize and accelerate AI development with Labelbox


01 Introduction
When engineers at Tesla debuted their Full Self-Driving (FSD) vehicle technology in 2020,
they had one significant advantage over other enterprises attempting to develop the same
AI-driven technology: a data engine. This data engine, designed by then-Director of AI
Andrej Karpathy, was built to solve one of the biggest roadblocks that AI engineers face
in their work. Training, deploying, and maintaining an AI solution requires vast amounts
of unstructured data, which need to be carefully collected, processed, and painstakingly
labeled before they can be used for their intended purpose.

Karpathy’s data engine, jokingly referred to as “Operation Vacation” because it could
ostensibly do this work even if the entire engineering team went on vacation, ensured not
only that data was efficiently collected and processed from the FSD computers on Tesla
vehicles, but also that the right data (the information that would best improve the model at
hand) was targeted for collection and labeling. Using techniques based on active learning, the
Tesla data engine examines model performance to find edge cases and rare events, collects
similar data samples from the fleet, and labels small batches of data to address specific
issues arising in the model.¹

Tesla isn’t the only AI-building enterprise to create an efficient data engine. OpenAI has
used its own to train, deploy, and maintain popular and successful models such as GPT-3 and
DALL-E 2. To train DALL-E 2, the team built a data engine that could not only feed data from stores
to labelers to models, but also filter out data that could violate the company’s content policy.
The organization has published a blog post² about the techniques used to modify and process
the data used to train DALL-E 2 to ensure that no violent or sexual images are included in the
training dataset. Like Tesla engineers, the OpenAI team used two different active learning
techniques to iterate on the image classifiers: one to find and fix false positives, and another
to find and fix false negatives.

Building AI products is a complex and resource-intensive process that requires consistent lock-step
coordination between many internal and external stakeholders.

¹ Watch Karpathy’s talk from CVPR 2020, Scalability in Autonomous Driving Workshop, to learn more about this data engine.
² https://openai.com/blog/dall-e-2-pre-training-mitigations/

AI teams and enterprises who don’t already have the time and resources to architect an
intricate and complex data engine for every use case, however, often face a slow and arduous
journey toward production AI, with common roadblocks that include:

• Siloed AI efforts that create multiple versions of ground truth, duplicate data, and other
challenges
• A large number of stakeholders working within a system that lacks efficient collaboration
tools and easy visibility into analytics
• Team members who want to speed up their projects and build temporary solutions for
data processing and labeling, creating broken processes in the long term
• Poorly built data engines that aren’t scalable and don’t meet requirements for active
learning techniques, labeling operations, model error analysis, and more

Luckily, AI teams today don’t have to build and maintain data engines for their projects like
Tesla and OpenAI did — they can invest in one instead.

Active learning is a technique that involves training a model on high-impact training data based on the
model’s performance. This model training method significantly improves the model with each iteration.
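
One common way to choose "high-impact" data is uncertainty sampling: score every unlabeled sample by how confident the current model is, then send the least confident ones to labelers first. The sketch below is a minimal illustration under stated assumptions, not the method used by Tesla, OpenAI, or Labelbox. The function names, the `toy_model` stand-in, and the sample IDs are all hypothetical.

```python
from typing import Callable, Dict, List, Sequence


def least_confident(
    unlabeled: Dict[str, Sequence[float]],
    predict_proba: Callable[[Sequence[float]], List[float]],
    batch_size: int,
) -> List[str]:
    """Return the IDs of the samples the model is least sure about."""
    # Confidence = probability of the most likely class (an assumed scoring rule).
    confidence = {
        sample_id: max(predict_proba(features))
        for sample_id, features in unlabeled.items()
    }
    # Lowest confidence first; ties are broken by ID so the result is deterministic.
    ranked = sorted(confidence, key=lambda sid: (confidence[sid], sid))
    return ranked[:batch_size]


def toy_model(features: Sequence[float]) -> List[float]:
    """Hypothetical stand-in for a real classifier: two class probabilities."""
    p = min(max(features[0], 0.0), 1.0)
    return [p, 1.0 - p]


# Hypothetical unlabeled pool: one feature per sample.
pool = {"img_a": [0.95], "img_b": [0.52], "img_c": [0.10], "img_d": [0.45]}
to_label = least_confident(pool, toy_model, batch_size=2)
print(to_label)  # ['img_b', 'img_d']
```

The two samples closest to the decision boundary are selected. Once they are labeled and the model is retrained, the same ranking can be repeated for the next iteration.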

An AI data engine is the foundational infrastructure for how team members interface
with data and models in order to build better AI products — and do it remarkably fast.

Teams that purchase data engine software gain significant benefits (besides not having to
build their own infrastructure), including:

• The flexibility to use it for any AI project, even projects that require entirely different
modalities of data, labeling requirements, and training techniques
• The ability to use pre-built solutions for data curation (integral to active learning),
labeling automation, and model evaluation and training
• The ability to work with internal labeling teams, external vendors, or any combination
of the two
• Support for any labeling operations, workflow, or other software issue

In short, investing in a data engine can free up your data scientists and AI engineers to work
on what they do best: fully leveraging their unstructured data to build the next generation
of AI solutions.

02 The benefits of a best-in-class data engine
As their needs for leveraging unstructured data grow, ML teams will require more data
management, quality and performance monitoring, and advanced techniques implemented
to improve the speed and efficiency of their labeling operations. Both large enterprises and
agile teams with smaller scale projects and datasets can realize significant benefits from a
data engine.

Understand and visualize your data


Understanding the data available to your AI team can have extraordinary benefits. In fact,
the majority of data (80% to 90%, according to multiple analyst estimates) is unstructured
information like text, images, video, audio, and more. That’s a valuable untapped resource
with the potential to create competitive advantage for companies that figure out how to
leverage it. Knowing what’s in your raw dataset can help you identify patterns, potential edge
cases for your model, and low quality data — that is, data that’s not likely to help your model
improve if it’s used as training data.

Manually sifting through your data without being able to visualize it, however, is an inefficient
use of the team’s time and resources. Additionally, datasets are complex and nuanced,
often including multiple formats and modalities with varying amounts of metadata.
Unstructured data may also flow to and from external providers outside of the enterprise.
Silos and a lack of standardized systems across individual teams can
distort teams’ idea of what data is available. Even with robust data lakes, enterprises often
struggle to fully leverage that data to drive machine learning model development. Because
data is a fundamental building block to scaling AI operations and machine learning (ML)
model development, the cost of data silos and disarray is exorbitant. These costs include:

• Unnecessary and costly procurement of additional data based on beliefs that there
is insufficient relevant data to support business use cases
• Redundant data due to data silos amongst teams across the enterprise
• Inability to search all your data in one place and visualize ground truth next to
predictions in order to quickly identify edge cases
• Lack of data provisioning, access, and transparency that blocks collaboration
on labeling efforts
• Inability to track data provenance, which causes costly security and compliance issues
• High costs of organizing data for AI development, which cause enterprises to delay
or abandon their efforts
• Limited visibility into additional relevant data across the enterprise, which
stalls model improvement

A best-in-class data engine will provide your team with easily implemented, intuitive
solutions for better understanding, exploring, and managing your data, including
every format, size, and source.

Labelbox Catalog, for example, paints a holistic picture of all your enterprise’s data assets,
organizing your unstructured datasets and helping your team select and prioritize the right
data to accelerate machine learning model development.

Past, present, and future

• Data annotations
• Metadata
• Model predictions
• Ingested data
• Curated data
• Multimodal data

Data formats

• Images
• Video
• Text
• Conversational text
• Geospatial imagery
• Simple tiled imagery
• Audio
• Documents (beta)
• HTML (beta)
• DICOM (beta)

Using built-in similarity and metadata search, any team member can explore existing data
and curate new datasets. With a few clicks in the Catalog interface, users can apply weak
supervision to automatically label and classify large tranches of previously unstructured data.
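
Similarity search of this kind is typically built on embeddings: each asset is represented as a vector, and assets whose vectors point in similar directions are treated as similar. The sketch below shows the general idea with cosine similarity. It is an assumption about how such a search can work, not a description of Catalog's internals, and the asset names and two-dimensional vectors are made up for illustration.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def find_similar(
    query_id: str,
    embeddings: Dict[str, Sequence[float]],
    top_k: int = 3,
) -> List[Tuple[str, float]]:
    """Rank every other asset by similarity to the query asset."""
    query = embeddings[query_id]
    scored = [
        (asset_id, cosine_similarity(query, vector))
        for asset_id, vector in embeddings.items()
        if asset_id != query_id
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy 2-D embeddings (hypothetical); real ones would come from a pretrained model.
embeddings = {
    "night_rain": [0.9, 0.1],
    "night_fog": [0.8, 0.2],
    "sunny_road": [0.1, 0.9],
    "dusk_rain": [0.7, 0.3],
}
neighbors = find_similar("night_rain", embeddings, top_k=2)
print([asset_id for asset_id, _ in neighbors])  # ['night_fog', 'dusk_rain']
```

A brute-force scan like this is fine for small datasets. At larger scale, teams usually swap in an approximate nearest-neighbor index, which is outside the scope of this sketch.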

Labelbox Catalog can be used to visually explore datasets, find similar data, identify duplicates and other
low quality data, and curate datasets for labeling.

Label data quickly and efficiently
Another vital aspect of a data engine involves the creation of large volumes of high-quality
training data. Often, AI teams outsource this task to labeling services, some of which are
large teams who receive training on the specific labeling task required and quickly proceed to
label large datasets. Others may use AI-powered solutions to speed up the labeling process.
Whatever these vendors provide, they typically have the same disadvantages:

• The labeling process itself is opaque, so AI teams have little insight into metrics such as
labeling quality, throughput, and efficiency
• Service providers have their preferred labeling software and processes, hindering AI
teams from experimenting within the labeling process and taking advantage of techniques
like automation and active learning
• For teams working with sensitive data, outsourcing labeling can be a greater challenge

In the face of these difficulties, AI teams may also turn to open source or homegrown
labeling tools and internal labelers. Open source tools, however, can make it difficult to find
support when issues arise, and can create challenges as the team scales up its labeling
efforts.³ Homegrown tools can result in a patchwork of complex, unreliable, and
unscalable tooling that hinders AI projects more often than it moves them forward, unless
your organization can invest large amounts of time and resources into its creation and
maintenance. Teams using these tools often face difficulties with:

• Delays from quality management and labeling iteration
• Poor ontology creation and management
• Miscommunication between stakeholders, SMEs, labelers, and engineers

³ Medical AI company VirtuSense used to rely on an open source labeling solution, which proved difficult to scale.
Worse, it caused long and unexpected delays whenever issues arose, because VirtuSense engineers had to solve them
themselves instead of relying on a support team.

A best-in-class data engine will provide AI teams with a powerful, flexible, and scalable
annotation tool, enabling them to use any labeling service or vendor with full transparency,
label any type of data, collaborate easily with all stakeholders throughout and within the
labeling process, and more. Below are a handful of benefits that a best-in-class data engine
can provide throughout the labeling process.

Labelbox enables labeling teams to quickly and easily annotate any data type with powerful, flexible,
and configurable labeling editors.

Automation

A data engine will enable users to automate several parts of their labeling process to
accelerate efforts without diminishing the quality of annotations.

1. Automated queuing lets labelers work continuously, eliminating the delays that
occur as they wait to receive datasets, instructions, and other materials
2. Auto-segmentation tools cut complex image segmentation drawing tasks
down to seconds
3. A Python SDK lets teams automate data operations and workflows programmatically
4. Model predictions can be imported as pre-labels, so that labelers can review and
correct them instead of labeling data from scratch

This final labeling automation technique, called pre-labeling or model-assisted labeling,
has been proven to reduce labeling time and costs by up to 50%⁴ for AI teams.
Pre-labeling decreases labeling costs as the model gets smarter with every iteration, leaving
teams more time to focus on manually labeling edge cases or areas where the model might
not be performing as well. It’s not only faster and less expensive, but also delivers
better model performance.

⁴ Blue River Technology, an AgTech company developing AI solutions, used Labelbox’s model-assisted labeling
workflow to import model predictions as pre-labeled data, reducing labeling costs by 50%. Watch this 30-minute
webinar to learn how they did it.

While teams who have already built models — either in production or in training — often use
that model’s output to generate pre-labels, teams that don’t yet have their own models can
still use this method. They can generate pre-labels with an off-the-shelf model trained on
a similar or broader task, or they can use programmatically generated labels. While these
latter two options are unlikely to save quite as much labeling time as using pre-labels from
your own model, they may still shave seconds or minutes per label, which can amount to
significant time savings when labeling large datasets.
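
As a rough illustration of the pre-labeling idea, the sketch below turns model predictions into pre-labels for review and sends low-confidence assets to be labeled from scratch. The confidence threshold is an assumption added for this example, since the text above does not prescribe one. The `Prediction` class, the `split_prelabels` function, and the asset IDs are hypothetical and are not part of any SDK.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Prediction:
    asset_id: str
    label: str
    confidence: float


def split_prelabels(
    predictions: List[Prediction], min_confidence: float = 0.6
) -> Tuple[List[Prediction], List[str]]:
    """Keep confident predictions as pre-labels; route the rest to manual labeling."""
    prelabels = [p for p in predictions if p.confidence >= min_confidence]
    from_scratch = [p.asset_id for p in predictions if p.confidence < min_confidence]
    return prelabels, from_scratch


# Hypothetical output of an existing or off-the-shelf model.
predictions = [
    Prediction("field_001", "weed", 0.92),
    Prediction("field_002", "crop", 0.41),
    Prediction("field_003", "weed", 0.75),
]
prelabels, from_scratch = split_prelabels(predictions)
print([p.asset_id for p in prelabels])  # ['field_001', 'field_003']
print(from_scratch)                     # ['field_002']
```

Labelers then review and correct the pre-labeled assets rather than drawing every annotation themselves. Lowering `min_confidence` produces more pre-labels at the cost of more corrections.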

For this Fortune 500 AgTech company, training a model to identify weeds in images of crop fields required
an hour of painstaking manual labeling per image. To accelerate their process, the team used a model that
identified plants to generate pre-labels, leaving labelers to draw masks without first combing the images for
every possible plant.

Custom workflows

Many AI labeling teams struggle to prioritize the right data to label and end up spending more
on data labeling than they should. For enterprise teams working on larger and more complex
projects, structuring, reviewing, and completing training data projects systematically is key to
getting models to production.

Custom workflows can help optimize how labeled data gets reviewed across multiple
tasks and reviewers. With increased collaboration and visibility, this feature gives teams
the flexibility to tailor their review workflows for faster iteration cycles. Teams can set up
dynamic workflows based on attributes, such as annotation type or labeler, to reduce costs
and increase labeling and review throughput, quality, and efficiency.
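
An attribute-based workflow can be thought of as an ordered list of rules, each pairing a condition on the label with the review step it should go to. The sketch below assumes a simple first-match-wins policy. The rule conditions, step names, and label fields are hypothetical examples, not a real product configuration.

```python
from typing import Callable, Dict, List, Tuple

Label = Dict[str, str]
Rule = Tuple[Callable[[Label], bool], str]

# Hypothetical rules: evaluated in order, first match wins.
RULES: List[Rule] = [
    (lambda label: label["annotation_type"] == "segmentation", "expert_review"),
    (lambda label: label["labeler"] == "new_hire", "full_review"),
]
DEFAULT_STEP = "spot_check"


def route(label: Label, rules: List[Rule] = RULES) -> str:
    """Return the review step for a label based on its attributes."""
    for matches, step in rules:
        if matches(label):
            return step
    return DEFAULT_STEP


print(route({"annotation_type": "segmentation", "labeler": "veteran"}))  # expert_review
print(route({"annotation_type": "bbox", "labeler": "new_hire"}))         # full_review
print(route({"annotation_type": "bbox", "labeler": "veteran"}))          # spot_check
```

Keeping the rules as data makes it straightforward to reorder them or add new ones when review priorities change.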

Customizing labeling workflows based on specific situations, including attributes, annotation types, labeler
or labeling team, and more will result in an ultimately more efficient data engine for your AI team.

Collaboration

Built-in collaboration tools enable labelers and other stakeholders to ask and answer
questions, raise issues, and receive feedback within the labeling interface and directly on the
asset instead of toggling between applications.

Collaboration is essential among ML teams, from data engineers to external labeling workforces.
Teams are able to move 3X to 5X faster when they can discuss and resolve labeling issues in real
time, all on one platform.

Get full transparency with labeler and annotation analytics

AI team managers and administrators will be able to track live analytics throughout their
labeling projects to monitor quality, throughput, and efficiency. With Labelbox, teams can
drill further into metrics on individual labeler progress and performance on labeling time,
review and rework time, total time spent, and more.
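
To make these metrics concrete, the sketch below aggregates a list of labeling events into per-labeler throughput, average labeling time, and rework rate. The event fields and labeler names are assumptions chosen for illustration. A real analytics dashboard would compute such figures from its own records.

```python
from collections import defaultdict
from typing import Dict, List


def summarize(events: List[dict]) -> Dict[str, Dict[str, float]]:
    """Aggregate per-labeler label count, average seconds per label, and rework rate."""
    totals = defaultdict(lambda: {"labels": 0, "seconds": 0, "reworked": 0})
    for event in events:
        row = totals[event["labeler"]]
        row["labels"] += 1
        row["seconds"] += event["seconds"]
        row["reworked"] += int(event["reworked"])
    return {
        name: {
            "labels": row["labels"],
            "avg_seconds": row["seconds"] / row["labels"],
            "rework_rate": row["reworked"] / row["labels"],
        }
        for name, row in totals.items()
    }


# Hypothetical event log: one entry per submitted label.
events = [
    {"labeler": "ana", "seconds": 40, "reworked": False},
    {"labeler": "ana", "seconds": 60, "reworked": True},
    {"labeler": "ben", "seconds": 30, "reworked": False},
    {"labeler": "ben", "seconds": 30, "reworked": False},
    {"labeler": "ben", "seconds": 60, "reworked": False},
]
report = summarize(events)
print(report["ana"])  # {'labels': 2, 'avg_seconds': 50.0, 'rework_rate': 0.5}
print(report["ben"])  # {'labels': 3, 'avg_seconds': 40.0, 'rework_rate': 0.0}
```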

AI teams can view detailed analytics on labeling projects with the Project Performance Dashboard
in Labelbox.

Train and evaluate models
Whether AI teams have multiple models in production or are only starting to develop their
first models, continuous, iterative improvement with high-quality training data is key to
progress. Just like data management and labeling, however, AI teams are often faced with a
lack of robust, data-centric solutions and workflows when it comes to developing, training,
and evaluating model performance. Instead, many teams rely on ad-hoc scripts or in-house
tools that are slow, not flexible enough to use for all their projects, and difficult to maintain.

While there are experiment management, training, and evaluation solutions available
separately from various vendors, it’s up to the AI teams to ensure that they are integrated
with the rest of their tech stack. This setup process can take additional time and resources,
and when teams use a multitude of tools patchworked together, an issue with any
independent solution can cause delays and issues with the entire AI development process.

Having an easy way to manage model training and retraining pipelines enables faster
development cycles. With a purpose-built data engine, AI teams have ready access not just
to the data management and labeling capabilities, but also model training and evaluation
features that already integrate seamlessly with the rest of the engine, dramatically
simplifying the data-to-model pipeline and setting teams up for faster iteration and a more
successful AI endeavor. Below are a few of the workflows readily available with a best-in-
class data engine.

Train models quickly and easily

Best-in-class data engines will offer an intuitive front-end UI for preparing training data,
configuring hyperparameters, and versioning data and model runs — and integrate easily
with your preferred model training framework or compute service. These features give
AI teams flexibility and control over every part of their data-centric model training
workflows.

With Labelbox Model, AI teams can manage model data, version data and model runs, run model error
analysis, and more.

With model training features, users of varying technical proficiency levels can:

• Improve their models by improving training data
• Quickly kick off low-code model development
• Validate a hypothesis, business, or mission use case with minimal manual effort
• Train a model to intentionally accelerate data labeling with active learning
• Train models from their labeled data with one click by connecting with a compute
service via API

Evaluate model performance and find errors

With this core data engine feature, AI teams can instantly visualize model errors, use the
data management tool to find similar raw data, and curate a dataset to address specific
model errors within minutes.
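
One simple form of model error analysis is to compare predictions with ground truth and break the error rate down by a metadata slice, which shows where new data would help most. The sketch below is a minimal, assumed version of that workflow. The record fields and the `time_of_day` slice are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def error_rate_by_slice(records: List[Dict[str, str]], slice_key: str) -> Dict[str, float]:
    """Fraction of wrong predictions within each value of a metadata field."""
    totals: Counter = Counter()
    errors: Counter = Counter()
    for record in records:
        totals[record[slice_key]] += 1
        if record["prediction"] != record["ground_truth"]:
            errors[record[slice_key]] += 1
    return {name: errors[name] / totals[name] for name in totals}


# Hypothetical evaluation records with one metadata field.
records = [
    {"time_of_day": "day", "ground_truth": "car", "prediction": "car"},
    {"time_of_day": "day", "ground_truth": "truck", "prediction": "truck"},
    {"time_of_day": "day", "ground_truth": "car", "prediction": "car"},
    {"time_of_day": "day", "ground_truth": "truck", "prediction": "car"},
    {"time_of_day": "night", "ground_truth": "car", "prediction": "truck"},
    {"time_of_day": "night", "ground_truth": "car", "prediction": "car"},
]
rates = error_rate_by_slice(records, "time_of_day")
worst_slice = max(rates, key=rates.get)
print(rates)        # {'day': 0.25, 'night': 0.5}
print(worst_slice)  # night
```

In this toy example the model is weakest on night-time data, so the next dataset to curate and label would focus on similar night-time samples.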

Labelbox enables AI teams to dig deep into model performance to surface and fix errors, inspect
performance on slices of data, track model versions, and more.

Manage model configurations and data selection

AI teams can trade in their DIY solutions that stitch together Jupyter notebooks,
spreadsheets, and homegrown evaluation tools for a data engine that enables them to:

• Interact with data visually, identify trends in model behavior, identify edge cases, and
compare model predictions to ground truth
• Split labeled data into training, validation, and test datasets with intuitive sliders in the
UI or make custom configurations based on previous model runs
• Cluster visually similar data to understand trends in model performance and data
distribution
• Comprehend the nuances of a model’s performance and add crucial context to key
decision-making metrics like confidence intervals, Intersection over Union (IoU),
precision and recall, and more
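
For readers less familiar with the metrics named above, the sketch below computes Intersection over Union (IoU) for two axis-aligned bounding boxes, plus precision and recall from true positive, false positive, and false negative counts. The box format `(x_min, y_min, x_max, y_max)` and the example numbers are assumptions for illustration.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max), assumed format


def area(box: Box) -> float:
    """Area of a box; degenerate boxes count as zero."""
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])


def iou(a: Box, b: Box) -> float:
    """Intersection over Union: overlap area divided by combined area."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    intersection = inter_w * inter_h
    union = area(a) + area(b) - intersection
    return intersection / union if union else 0.0


def precision_recall(tp: int, fp: int, fn: int) -> Tuple[float, float]:
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


ground_truth = (0.0, 0.0, 10.0, 10.0)
predicted = (5.0, 0.0, 15.0, 10.0)
print(round(iou(ground_truth, predicted), 3))  # 0.333

precision, recall = precision_recall(tp=8, fp=2, fn=4)
print(precision, round(recall, 3))  # 0.8 0.667
```

Here the two boxes overlap on half of each box, giving an intersection of 50 and a union of 150. In detection tasks, a prediction is often counted as a true positive only when its IoU with a ground truth box exceeds a chosen threshold.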

Integrate seamlessly with your tech stack


Even though a data engine can enable data management, labeling, and model training and
evaluation all on one platform, AI teams will still need to create a pipeline of data that flows
to and from their data storage, labeling services and teams, and their preferred deployment
and inference tools. With a patchworked data engine made from homegrown and/or open
source tools, this pipeline will be yet another challenge to build from scratch.

A best-in-class data engine will integrate seamlessly with leading data stores, services, and
tools as the foundation of your tech stack.

Get labeling operations expertise on demand
All AI teams have varying resources and expertise, and many AI projects can be delayed or
stymied by gaps in knowledge, management, or skills. These may include:

• Labeling experience and unique knowledge about your use case
• Adequate materials and processes for training a labeling team
• Liaisons between external labeling teams and the AI team
• Tools and talent for implementing specialized workflows for quality management,
automation, active learning, and more

While a data engine is primarily a software solution, a best-in-class vendor will also offer
on-demand guidance in labeler training, ontology management, and metrics tracking. They
will also be able to pair AI teams with labeling teams experienced in relevant use cases and
industries, and with any required security clearances.

03 Who benefits from a data engine?
Organizations investing in AI will need to equip themselves with the tools and processes
to produce high-quality training data in order to build performant models and applications.
Below are the specific types of team members who can derive outsized benefits when
adopting a data engine.

AI product managers will be empowered to quickly deploy and maintain
applications with timely insights into labeling productivity, label quality, and
improved integration between the models and data labeling pipelines. They’ll also
be able to implement advanced workflows to automate labeling, leverage active
learning techniques, and optimize model training, all on one platform.

Data scientists and ML engineers will be able to ensure that labeled data (the
source code for ML models) is of high quality and reflects required business
objectives.

Technical leaders (especially heads of engineering and senior data leaders) will
be able to unite all of their siloed unstructured data efforts on a single platform
that supports enterprise-grade compliance and collaboration and meets their trust and safety
standards.

Business leaders will be able to map areas of AI investment to key business
priorities to better correlate ML projects with business value, as well as develop and
maintain high-performing production AI fast to gain and keep a competitive edge.

Labeling operations leaders will be able to measure labeling efficiency and
performance to optimize workflows.

Labeling teams can access the fastest and most intuitive tools for labeling
training data.

04 What to look for in a data engine
Choosing the right data engine for your AI team or organization is no easy task. More options
are available now than ever before, and you’ll need to ensure that the platform you choose
will meet all your immediate and future requirements. Be sure to consider the following
questions as you evaluate your options:

• How is your organization currently accessing, storing, and leveraging
unstructured data (in the form of images, text, video, etc.)?
• How easily can your ML and data teams get access to this data in order to build
products and help the organization solve problems and grow revenue?
• What data types will you need to label? Will you need to label different types of
data in the future?
• Who will be labeling your data, and how will they access it? Will your labeling team
grow in the future?
• How will your data need to be labeled? Your data engine will need to support all
current and future labeling methods and ontologies.
• Will your team need to use advanced techniques to increase speed and
efficiency, such as labeling automation, active learning, or weak supervision?
• Can your data engine provide full visibility and control over your data throughout the
labeling process? Don’t forget that your labeled data is a valuable asset; many
enterprises consider it IP that will help them build a competitive advantage and
create unique/personalized experiences.
• How will your data engine help you increase iteration speed as you target
model performance?

Compared to enterprises relying on in-house tools and development, a comprehensive data
engine helps teams save years of custom R&D, given that it can be difficult to accurately
define the scope and construct a solution for needs across engineering and product groups.
Adopting dedicated software to handle one of the most time-intensive parts of their workflow
means spending less time on infrastructure planning, resource allocation, and preparing for
the unknown. In the absence of a data engine, companies find that their internal tools are
generally not built for usability, scalability, or cross-team support.

05 Optimize and accelerate AI development with Labelbox
Labelbox is an enterprise-grade data engine built by and for those who build AI products.
Every day, organizations of all sizes, including Fortune 500 companies and leading startup
teams, use Labelbox to better understand their data, produce high-quality training data,
set up proven advanced techniques such as pre-labeling and active learning, and train and
evaluate their models.

The Labelbox platform serves as an end-to-end data engine for leading AI teams.

Here’s what customers are saying about how they derive business value and realize
a strong ROI on their investment:

“With Labelbox, we’re able to generate high-quality
annotations by allowing our team of domain experts and
labelers to collaborate more efficiently. The workflow
we’ve built queues up all the work for our labelers to create
image annotations, which are then sampled and reviewed
by experts, and fed into ML models to make better AI
diagnoses.”

Miao Zhang, AI Scientist

“Reducing our data requirements is huge because we
can get the same amount of improvement in our model’s
performance in half the time and with half the effort. This
was enabled through targeting the model’s weaknesses with
Labelbox’s Model product and then being able to prioritize
the right data through Catalog. By doing so, we’ve reduced
our labeling spend and data needs by over 50%.”

Noé Barrell, ML Engineer

“Three years ago, our average label took more than an hour...
now, we use model-assisted labeling in Labelbox and have
optimized the labeling cycle. About 50% of our images are
pre-labeled...and we’re down to 25 minutes per label. And
the quality has been maintained.”

Arjun Adhikari, Data Scientist

“We were very pleasantly surprised by how easy [Labelbox]
is to use. People got the hang of it just like that. And it’s not
just about how easy it is to label data. We really like its ability
to distribute data to multiple people and keep track of what’s
happening — it gives us a base of operations to help traverse
through all that data... but the real beauty of it, the reason
we really love Labelbox, is that it integrates so well with our
tech. From a thousand-foot view, it looks like an AI labeling
engine, which feeds into the deep learning system and spits
out trained networks.”

Deepak Gaddipati, Founder & CTO of VirtuSense

“Tracking productivity and quality in other services felt more
like a black box because after submitting responses, there
was nothing else that we could do. In contrast, Labelbox
provides the ability to count the number of labels done,
revisit submitted labels, fix errors, run a full quality assurance
pipeline, and manage labeler productivity.”

Data science leader at a Fortune 500 education technology company

AI teams armed with a data engine are more likely to produce better quality training data,
iterate faster on their models, and deploy transformative AI solutions. While building a data
engine requires time, talent, and resources, investing in an end-to-end data engine solution
like Labelbox can help your team reap the same (and more) benefits, saving your resources
for building and deploying AI.

Try Labelbox today


Get started for free or see how Labelbox can fit
your specific needs by requesting a demo.

