Enterprise Artificial Intelligence and Machine Learning For Managers

A practical guide to AI and ML for business and government

Nikhil Krishnan, Ph.D.


C3.ai / nikhil.krishnan@c3.ai

Contents

Introduction

1. What is Machine Learning (ML) and Artificial Intelligence (AI)?
   Categories of AI and ML
   Supervised Learning
   Unsupervised Learning
   Reinforcement Learning
   Deep Learning

2. Tuning a Machine Learning Model
   Feature Engineering
   Loss Functions
   Regularization
   Hyperparameters

3. Evaluating Model Performance
   Classification Performance
   Regression Performance

4. Runtimes and Compute Requirements
   Machine Learning Libraries
   Programming Languages for Machine Learning
   Infrastructure: Machine Learning Hardware Requirements

5. Selecting the Right AI/ML Problems
   Tractable Problems
   Economic or Business Value
   Ethical Implications of the Problem
   Use Case Prioritization

6. Best Practices in Prototyping
   Problem Scope and Timeframes
   Cross-Functional Teams
   Getting Started by Visualizing Data
   Common Prototyping Problem – Information Leakage
   Common Prototyping Problem – Bias
   Pressure-Test Model Results by Visualizing Them
   Model the Impact to the Business Process
   Model Interpretability is Critical to Driving Adoption
   Ensuring Algorithm Robustness
   Planning for Risk Reviews and Audits

7. Best Practices in Ongoing Operations
   AI as Part of Software Development Process
   Testing and Planning for Scale
   Algorithm Maintenance and Support
   Tracking Value Captured
   Marketing Program Achievements and Reporting Success

8. Building a Strong Team
   Typical Required Skillsets
   Candidate Screening and Interview Process
   Team Organization
   Professional Development
   Summary

Acknowledgements
About the Author


Introduction
Artificial intelligence (AI) and machine learning (ML) are important emerging technologies that have the
potential to transform organizations. Exponential increases in computational capacity, the emergence of
cloud computing, and innovations in algorithms have resulted in tremendous advances in the application
of AI. Leveraging AI and ML techniques, organizations can unlock tremendous value through improved
customer service, streamlined operations, and the realization of new business models.

Despite tremendous advances in the field over the last decade, AI remains very much the domain of expert
data scientists. AI technologies are complex and relatively new to the business world, and few managers
and enterprise technology professionals understand these techniques well. Furthermore, implementing AI
techniques within the enterprise requires different skill sets and approaches than traditional software
products do.

In order for organizations to truly unlock value from AI, its practitioners and their methods need to be fully
integrated into the fabric of the enterprise. This integration requires that managers have a basic understanding
of the science behind AI/ML, the techniques leveraged, and the metrics used to measure success.

Managers who are well-versed in the complexities of leading AI/ML efforts will be able to capture the most
value from these new technologies. These managers will be able to select the best AI use cases, effectively
collaborate and problem-solve with data scientists during prototyping phases, support the transition of
algorithms into production use, and design the right business processes and change management activities to
capture value for the organization. In order to achieve this, managers need a “field guide” to AI and ML techniques.

As we have designed, developed, and implemented AI techniques and AI-enabled enterprise applications at
organizations across the world over the past decade, it has become clear to us that such a field guide does
not exist. Most books and articles are either too technical and focused on machine learning practitioners,
or too managerial, without sufficient mathematical and machine learning depth.

This publication attempts to provide such a managerial field guide. It captures essential information about
the practical application of enterprise AI/ML techniques, gathered from our extensive experience at C3.ai,
across a wide range of industries and business problems. It also captures our experience with AI/ML teams
– recruiting, organizing, and managing high-performance teams.

In this publication, I first provide a brief overview of AI techniques and the key metrics and criteria used to
measure algorithm performance. I explain what feasible enterprise AI problems look like, describe best
practices for prototyping and production transitions, and discuss how to build high-performing data
science teams.


1. What is Machine Learning (ML) and Artificial Intelligence (AI)?
Logic-based algorithms represent the core of traditional programming. For decades,
computer scientists were trained to think of algorithms as a logical series of steps or
processes that can be translated into machine-understandable instructions and used to solve
problems. Traditional algorithmic thinking is quite powerful and can be used to solve a range
of computer science problems, including data management, networking, and search.

Traditional logic-based algorithms effectively handle many different problems and tasks.
But they are often not effective at addressing tasks that are quite easy for humans to do.
Consider a basic task such as identifying an image of a cat. Writing a traditional computer
program to do this correctly would involve developing a methodology to encode and
parameterize all variations of cats – different sizes, breeds, and colors, as well as their
orientation and location within the image field. While a program like this would be
enormously complex, a two-year-old child can effortlessly recognize the image of a cat.

ML algorithms take a different approach from traditional logic-based approaches. ML
algorithms are based on the idea that, rather than coding a computer program to perform
a task, the program can instead be designed to learn directly from data. So instead of being
written explicitly to identify pictures of cats, the computer program learns to identify cats
using an ML algorithm that is derived by observing a large number of different cat images.
In essence, the algorithm infers what an image of a cat is by analyzing many examples of
such images, much as a human learns.

AI algorithms enable new classes of problems to be solved by computational approaches –
faster, with less code, and more effectively than traditional programming approaches.
Image classification tasks, for example, can be completed with over 98% less code when
developed using machine learning versus traditional programming.1


Categories of AI and ML
In this publication, I often use the terms AI and ML interchangeably. The overall taxonomy of the AI/ML space
can be confusing, and definitions vary based on points of view. However, AI is generally considered to be a
broad topic with several sub-fields, methodologies, and practitioners.

One of the key distinctions is the difference between general and specialty AI (Figure 1).2 General AI – or
artificial general intelligence (AGI) – involves the idea that computer programs can exhibit broad intelligence
and reason across domains like humans. Most researchers believe that true AGI may not be achievable in
the near future.3

Specialty AI involves the idea that computer programs can be trained to reason and solve specific dedicated
tasks. Examples of these tasks include detecting certain images, classifying potential failures for equipment
types, or detecting certain types of fraud. This field of specialty AI has been advancing rapidly in the last
couple of decades.

There are different sub-fields within specialty AI, including ML, optimization, and logic. ML describes a class
of algorithms that leverage powerful statistical learning techniques operating on data.4 The power of
machine learning is that the algorithms can be quite generic; just a few algorithms can address and solve
many problems.5 Additionally, ML algorithms can learn from data, and can therefore, as described in the cat
identification problem above, reduce the need for complex logic and code.

ML has proven its ability to unlock economic value by solving real-world problems, including enabling useful
search results, providing personalized recommendations, filtering spam, and identifying fraud.6

Figure 1: Overall taxonomy of the artificial intelligence field – artificial intelligence comprises artificial general intelligence and specialty AI, with machine learning, optimization, and logic as sub-fields of specialty AI


There are three main subcategories of ML techniques – supervised learning, unsupervised learning,
and reinforcement learning.7 Within each category there are "traditional" ML algorithms as well as
newer deep learning algorithms (described later in this chapter). The following figure
summarizes the most common machine learning categories and approaches.

Machine Learning

Supervised Learning
  Classification
    Traditional algorithms: support vector machines (SVM), XGBoost, random forest, gradient-boosted decision trees (GBDT)
    Deep learning algorithms: multi-layer perceptrons (MLPs), convolutional networks (examples of neural networks)9
  Regression
    Traditional algorithms: linear regression, ridge regression, random forest
    Deep learning algorithms: multi-layer perceptrons (MLPs), convolutional networks, long short-term memory (LSTM)

Unsupervised Learning
  Dimensionality Reduction
    Traditional algorithms: principal component analysis (PCA)
    Deep learning algorithms: auto-encoders
  Clustering
    Traditional algorithms: K-means, Gaussian mixture model (GMM), density-based spatial clustering (DBSCAN)
    Deep learning algorithms: deep Gaussian mixture model (DGMM)

Reinforcement Learning8
  Decision-making
    Traditional algorithms: Monte Carlo, Markov decision process, temporal difference learning
    Deep learning algorithms: deep Q-learning, hidden Markov models (HMM)

Figure 2: Common categories of machine learning algorithms


Supervised Learning
Supervised techniques require a set of inputs and corresponding outputs to “learn from” in order to build a
predictive model. Supervised learning algorithms learn by tuning a set of model parameters that operate on
the model’s inputs, and that best fit the set of outputs. The goal of supervised machine learning is to train a
model of the form y = f(x) that predicts outputs, y, based on inputs, x.10

There are two main types of supervised learning techniques. The first type is classification. Classification
techniques predict categorical outputs, such as whether a certain object is a cat or not, whether a
transaction represents fraud or not, or whether a customer will return or not. The second type is regression.
Regression techniques predict continuous values, such as a forecast of sales over the next week.

The inputs to machine learning algorithms are called features. Features can include mathematical
transformations of data elements that are relevant to the machine learning task, for example, the total value
of financial transactions in the last week, or the minimum transaction value over the last month, or the
12-week moving average of an account balance.

After features, x, are designed and implemented and observations, y, are identified, the model y = f(x) is ready
to be trained. During model training, the ML algorithm “learns” parameters or weights. These parameters
or weights are applied to the features to generate a trained model f(x) to best fit the outputs. Examples
of model parameters include coefficients in a linear regression or split points in a decision tree. The
following figure illustrates the concept.


Figure 3: A simplified supervised machine learning pipeline, including raw data input, features, outputs, the ML model and model parameters,
and prediction outputs. In this example, the machine learning model is trained to classify whether a customer will remain or leave.

After a model is trained, it is often evaluated and tested on a holdout data set to validate model performance.
Generating predictions on holdout data indicates how well the model performs with new data on which
it was not trained. Training and testing, or validation, are often iterative, time-intensive steps in a machine
learning project. These topics are discussed in additional detail in the following sections.
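As a minimal sketch of holdout validation – assuming the open source scikit-learn library and a synthetic stand-in dataset, neither of which is prescribed by the text – training and testing might look like this:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a labeled training set (hypothetical features).
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

# Hold out 20 percent of the data; the model never trains on it.
X_train, X_hold, y_train, y_hold = train_test_split(
    X, y, test_size=0.2, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
print("Holdout accuracy:", model.score(X_hold, y_hold))  # performance on unseen data
```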

Supervised techniques often require non-trivial dataset sizes to learn reliably from ground truth observations.
Models may require many thousands of input and output examples to learn from in order to perform effectively.
Larger datasets, including greater numbers of historic examples from which to learn, enable the algorithms to
incorporate a variety of edge cases and produce models that handle these edge cases elegantly. Depending on
the business problem at hand, multiple years of data may be necessary to account for seasonality.

Consider a machine learning model that aims to classify if something is “true” or “false.” This type of classifier
can be used to predict customer attrition: it may aim to predict if an existing business customer is likely to
remain or leave. The following figure shows a graphical representation of a supervised model, where the
horizontal and vertical axes display input features, x (e.g., level of digital engagement by customer and number
of purchases by customer), and the color-coded dots indicate labeled examples of past customer behavior,
y (blue indicating attrition, red indicating retention). The labeled examples teach the model to identify patterns
that indicate attrition. Once the model is trained, it can be applied to new data to predict the behavior of future
customers. In the figure below, the green dashed line represents a decision boundary that partitions the
feature space. On one side of the decision boundary, the model predicts “true” and on the other side “false.”


Figure 4: Supervised learning, with "ground truth" labels available – models are trained on labeled data to identify patterns that predict those labels on new data. The figure shows the training data, the resulting model, and the model applied to new data.

During model training, the supervised machine learning algorithm is fed examples of both model inputs and
outputs. The following figure demonstrates the design of a feature table with inputs and outputs that can be
used to train a machine learning model to predict customer attrition – in this case, retail customers who stop
subscribing to a service. Training data are aggregated at a monthly interval, with a single record for each
customer on the first of each month. We could just as easily create a similar feature matrix at more frequent,
or “rolling,” intervals, but this simple example illustrates the concept.

To aggregate at a monthly level, features are aggregated over the monthly time period – like total purchases
in the last month ($) and the month-to-month change in website traffic (clicks). Outcomes or outputs are
also captured monthly. In this case, the output value equals 1 if the customer stopped subscribing at any
time over the past month, and 0 otherwise.

This feature matrix is input into the supervised model and the model parameters are adjusted so that the
model best “fits” the example outputs. By leveraging historical examples to train the model, the model
learns the patterns that are predictive of customer attrition in the past. When new customer data are
available, the trained model can be used to predict customers who will unsubscribe in the future.

When developing a new machine learning model, it is just as important to recognize its limitations as it is to
understand its potential benefits. In the customer attrition example, the model is not predicting new, novel ways
to retain customers. The model is learning based on historical patterns and then applying those patterns to
predict future behavior.

Machine learning systems are, however, self-learning. As new data labels become available (e.g., new modes
of customer churn or attrition), models can be retrained to learn those new patterns.


Model Features, X (m columns), and Outputs, Y

Date     Customer ID   Purchases Past 30 Days ($)   Change in Web Traffic, Rolling 30 Days (Visits)   Customer Attrition (Unsubscribed?)
1-Jan    Adams         1,000.00                     +1                                                0
1-Jan    Bhat          -                            0                                                 0
1-Jan    Chu           -                            -4                                                1
1-Feb    Adams         200.00                       0                                                 0
1-Feb    Bhat          -                            -1                                                1
1-Feb    Chu           400.00                       -1                                                0
1-Mar    Adams         1,000.00                     +1                                                0
1-Mar    Bhat          650.00                       +3                                                0
1-Mar    Chu           300.00                       -1                                                0
1-Apr    Adams         1,000.00                     0                                                 0
1-Apr    Bhat          750.00                       +1                                                0
1-Apr    Chu           200.00                       -5                                                0

(The n data points form a matrix of n rows; the features form a matrix of width m.)

Figure 5: Examples of input signals and output data are required to train a supervised learning model.


One of the simplest machine learning formulations is described in the following equation:

Y = Xθ

In the above equation, model features are represented by a feature matrix, X, where columns represent
features and rows correspond to each data point. With m features and n data points, the dimensions of X
are n × m. Labels are represented by a vector Y, where rows correspond to each data point (dimension
n × 1). And model weights (or the importance of each feature) are represented by the vector θ
(dimension m × 1).

The training task of the supervised machine learning algorithm involves finding feature weights, θ, that
minimize a training loss function.
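As an illustrative sketch – using NumPy, with made-up data, not an example from the text – the training task for the Y = Xθ formulation can be reduced to a least-squares fit that minimizes the MSE loss:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 200, 3                       # n data points, m features
X = rng.normal(size=(n, m))         # feature matrix, dimension n x m
true_theta = np.array([2.0, -1.0, 0.5])
Y = X @ true_theta + rng.normal(scale=0.1, size=n)  # labels, dimension n x 1

# Least-squares solution: the theta that minimizes mean squared error.
theta, *_ = np.linalg.lstsq(X, Y, rcond=None)
print(theta)  # recovered weights, close to [2.0, -1.0, 0.5]
```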

The dimensionality of the problem is important to consider. The size of the feature space (m, in the above
formulation) should typically be smaller than the number of labeled data points (n, in the above formulation).

In practice, supervised machine learning problems are often limited by the number of labeled examples
that are available from which the algorithm can learn. Usually, the more examples available, the higher the
likelihood that a supervised technique will be successful.

There are two main categories of supervised learning techniques: classification and regression.


Classification
Classification models predict a class label, such as whether a customer will return or not, whether a
certain transaction represents fraud or not, or whether a certain image is a car or not. Classification
approaches are useful for business problems that have large amounts of historical data, including
labels, that specify if something is in one group or another.

Classification algorithms map inputs (X) to outputs (Y), where Y ∈ {1, …, C}, with C being the number of
classes. If C = 2, this is called binary classification; if C > 2, it is called multiclass classification.

An example of a classification task is predicting when a piece of equipment is likely to fail. This predictive
maintenance task is a common problem faced by manufacturing and operations-focused companies.
Predictive maintenance can help avoid failure events that may be expensive or potentially dangerous.

If sufficient historical failure examples are available, as well as other relevant input data (e.g., sensor data,
technician notes), a supervised machine learning classifier can be trained to predict if equipment will be
operating in the failed or not-failed class in the future. In supervised classification problems, training examples
are often referred to as labels. The following figure shows an example of failure labels and classifier predictions.

Figure 6: Time-series representation of a classifier label ("failed" or "not failed") that can be used to train a predictive
maintenance machine learning model using classification – historical training data on the left, new predictions on the right

Examples of supervised classifier models include support vector machines (SVM), XGBoost,
gradient-boosted decision trees (GBDT), random forest, and neural networks.
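A hedged sketch of one of these model families in practice – here a gradient-boosted tree classifier from scikit-learn, trained on synthetic stand-in data for the "failed"/"not failed" problem – might look like this:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for sensor-derived features; failures are the rare class.
X, y = make_classification(n_samples=500, n_features=8,
                           weights=[0.9, 0.1], random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

clf = GradientBoostingClassifier(random_state=1).fit(X_train, y_train)
print(clf.predict(X_test[:5]))        # predicted classes (1 = "failed")
print(clf.predict_proba(X_test[:5]))  # per-class probabilities
```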


Regression
Regression models predict quantities, such as how many customers are likely to churn or the sales
forecast over the next week. Regression techniques are useful for business problems that have large
historical datasets that correlate to numeric labels, including such things as sales, inventory, or loan value.

Reconsider the predictive maintenance example we explored with a classifier model, but this time we
want to predict equipment failure using a regression model. Instead of predicting a categorical label
like “failed” or “not failed” (as with the classifier model), a regression model can be trained to predict a
continuous value, such as time to failure, as shown in the following figure.

Training a regression algorithm is similar to training a classifier. The feature matrix comprises input
signals such as sensor data and work orders. The regression model also requires labels, but instead of
a binary (1 or 0) indicator of class ("failed" or "not failed"), the label is numeric (time to failure).

Figure 7: Time-series representation of a time-to-failure label that can be used to train a predictive
maintenance machine learning model using regression – historical training data with the training label on the left;
new predictions on the right, where the model predicts a failure is approaching


Examples of supervised regression models include linear regressions that predict a linear relationship
between input features and outputs, ridge regressions that are a more advanced variation of linear regression,
random forests that predict nonlinear relationships between inputs and outputs using decision trees, and neural
networks that predict nonlinear relationships between inputs and outputs using layers of complex nodes.
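As a brief sketch – again assuming scikit-learn and synthetic data rather than anything prescribed by the text – fitting a linear model and a tree-based nonlinear model to a continuous target looks very similar to the classification case:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for features and a continuous label such as time to failure.
X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=2)

linear = LinearRegression().fit(X, y)                     # linear relationship
forest = RandomForestRegressor(random_state=2).fit(X, y)  # nonlinear, tree-based

print(linear.predict(X[:3]))  # continuous predictions, not class labels
print(forest.predict(X[:3]))
```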

The following figure shows examples of classification and regression techniques. The left-hand side of the
figure illustrates the result of a classification algorithm that estimates a decision boundary to separate two
classes (represented with different symbols). The axes of the chart represent two input features. The
right-hand side of the figure illustrates the result of a regression algorithm that predicts a quantity (shown
on the y axis) as a function of a feature input (shown on the x axis).

Figure 8: Examples of classification (predicting a class label) and regression (predicting a quantity) techniques


Unsupervised Learning
In contrast to supervised learning techniques, unsupervised learning techniques operate without known
outputs or observations – that is, these techniques are not trying to predict any specific outcomes. Instead,
unsupervised techniques attempt to uncover patterns within data sets. Unsupervised learning is a useful
approach for problems that do not have sufficient output or example data to train a supervised model.

Unsupervised techniques include clustering algorithms that group data in meaningful ways. Clustering
algorithms are used, for example, to identify and segment retail bank customers who are similar, or to
identify similar sensor data feeds from equipment. Examples of clustering algorithms include k-means –
a method to create subgroups of similar data points using "distance" between data points based on
features – and Gaussian mixture models (GMM) – a method to identify subgroups of similar data points
using statistical probability distributions.
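A minimal sketch of these two clustering methods – assuming scikit-learn and synthetic, unlabeled data – follows; note that no outputs or labels are supplied to either algorithm:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Synthetic, unlabeled data with four natural groupings.
X, _ = make_blobs(n_samples=400, centers=4, random_state=3)

kmeans_labels = KMeans(n_clusters=4, n_init=10, random_state=3).fit_predict(X)
gmm_labels = GaussianMixture(n_components=4, random_state=3).fit_predict(X)

print(kmeans_labels[:10])  # cluster assignments based on distance
print(gmm_labels[:10])     # cluster assignments based on probability distributions
```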

Consider a simple machine learning model, as shown in the following figure, that aims to cluster data into
four categories (blue, yellow, red, and green). Here, the algorithm was told to look for four categories, but
the categories were not pre-defined or labeled in the training data.

This type of clustering model can be applied to the customer attrition prediction example discussed before,
and can be used, for example, to identify groups of similar customers. Although no outputs or labels are
known, an analyst can review the clusters to understand buying behavior, and identify outlier customers or
groups of customers who may be at risk of attrition.

Figure 9: Unsupervised machine learning, with no ground truth – these models do not require labels to train on past data. Instead, they
automatically detect patterns in data to generate predictions. This example illustrates a clustering algorithm trained on
training data and then applied to new input.


Another example of an unsupervised learning technique is dimensionality reduction. One of the central
problems in machine learning is representing human-interpretable patterns in complex data. Advanced
data science problems may involve work with large volumes of high-dimensional data, such as pixels in
images, sensory measurements of equipment, or human-gene distributions.

Dimensionality reduction is a powerful approach to construct a low-dimensional representation of
high-dimensional input data. The purpose of dimensionality reduction is to reduce noise so that a model
can identify strong signals among complex inputs – i.e., to identify useful information.

High dimensionality poses two challenges. First, it is hard for a person to conceptualize high-dimensional
space, meaning that interpreting a model is non-intuitive. Second, algorithms have a hard time learning
patterns when there are many sources of input data relative to the amount of available training data.

Examples of dimensionality reduction models include autoencoders, an artificial neural network approach
that “encodes” a complex feature space to capture important signals,9 and principal component analysis
(PCA), a statistical method that uses linear methods to combine a large number of input variables to
generate a smaller, more meaningful set of features.
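As an illustrative sketch of PCA – assuming scikit-learn and random stand-in data – a 50-dimensional input can be compressed to a handful of components:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 50))    # high-dimensional input: 200 points, 50 features

pca = PCA(n_components=3)         # keep the 3 strongest linear combinations
X_reduced = pca.fit_transform(X)  # low-dimensional representation, 200 x 3

print(X_reduced.shape)
print(pca.explained_variance_ratio_)  # share of variance each component preserves
```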

One of the most useful – and significant – purposes of unsupervised machine learning is to perform
anomaly detection. Anomaly detection is an approach that defines normal behavior in a data set and
identifies inconsistent patterns. Using anomaly detection, it is possible to surface patterns – such as
abnormal equipment behavior or a faulty sensor – that would otherwise require labeled history and a
supervised model to identify. The following figure shows example output of an anomaly detection
algorithm. Anomalies are highlighted as red circles.

Figure 10: Example of an unsupervised machine learning model for anomaly detection
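Many algorithms can play this role; one common unsupervised choice (used here purely for illustration, since the text does not prescribe a method) is an isolation forest:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(5)
normal = rng.normal(0, 1, size=(500, 2))      # "normal" sensor behavior
anomalies = rng.uniform(-6, 6, size=(10, 2))  # a few inconsistent readings
X = np.vstack([normal, anomalies])

detector = IsolationForest(contamination=0.02, random_state=5).fit(X)
flags = detector.predict(X)                   # -1 marks flagged points

print(np.where(flags == -1)[0])  # indices of points flagged as anomalous
```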


Reinforcement Learning
Reinforcement learning (RL) is a category of machine learning that uses a trial-and-error approach. RL is a
more goal-directed learning approach than either supervised or unsupervised machine learning.8

Reinforcement learning is a powerful means for solving business problems that do not have a large
historical dataset for training because it uses a dynamic model with rewards and penalties. Reinforcement
learning models learn from interaction – an entirely different approach than supervised and unsupervised
techniques that learn from history to predict the future.

Reinforcement learning models use a reward mechanism to update model actions (outputs) based on
feedback (rewards or penalties) from previous actions. The model is not told what actions to take, but
rather discovers what actions yield the most reward by trying different options. A reinforcement learning
model (“agent”) interacts with its environment to choose an action, and then moves to a new state in
the environment. In the transition to the new state, the model receives a reward (or punishment) that is
associated with its previous action. The objective of the model is to maximize its reward, thereby allowing
the model to improve continually with each new action and observation.

For example, if you want to train a machine learning model to play checkers, you are unlikely to have a
game tree that models all possible moves in a game or to have a comprehensive historical dataset of past
moves (there are 10²⁰ possible moves in checkers). Instead, reinforcement learning models can learn game
strategy using rewards and punishments.
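At the heart of many such methods is a simple value update. The following sketch – a generic tabular Q-learning update, offered as an illustration rather than a checkers engine – shows how rewards from individual trials gradually shape the model's action values:

```python
import numpy as np

n_states, n_actions = 5, 2
Q = np.zeros((n_states, n_actions))  # estimated value of each action in each state
alpha, gamma = 0.1, 0.9              # learning rate and discount factor

def update(state, action, reward, next_state):
    # Nudge Q(state, action) toward the reward received plus the
    # best value currently estimated for the next state.
    target = reward + gamma * Q[next_state].max()
    Q[state, action] += alpha * (target - Q[state, action])

update(state=0, action=1, reward=1.0, next_state=2)  # one trial-and-error step
print(Q[0])  # the value of action 1 in state 0 has moved toward the reward
```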

To test this approach, a team from software company DeepMind trained a reinforcement learning model
to play the strategy board game Go. With a game tree of 10³⁶⁰ possible combinations of moves, Go is
more than 100 orders of magnitude more complex than checkers. The DeepMind team trained a model to
successfully defeat reigning Go professional world champion Lee Sedol in 2016 in a five-game match.11


Deep Learning
Deep learning is a subset of machine learning that involves the application of complex, multi-layered artificial
neural networks to solve problems.9 Deep learning techniques are applicable across diverse problems and
used in all three machine learning subcategories discussed before. For example, a deep neural network
classifier is a form of supervised learning, while a deep neural network autoencoder is a form of unsupervised
learning, and a deep neural network Q-function is a form of reinforcement learning.

Deep learning takes advantage of yet another step change in compute capabilities. Deep learning models
are typically compute-intensive to train and much harder to interpret than conventional approaches.

In a deep neural network, data inputs are fed to an input layer of “neurons,” and the output of the neural
network is captured in the output layer. The layers in the middle are hidden “activation” layers that perform
various data transformations. The number of required layers generally (but not always) increases with the
complexity of the use case.

A single node in an artificial neural network takes input signals and produces an output, as shown in the
following figure.

Figure 11: A single node in a deep learning neural network, taking input signals and producing an output


A deep learning neural network is a collection of many nodes. The nodes are organized into layers, and the
outputs from neurons in one layer become the inputs for the nodes in the next layer.

Figure 12: Single nodes are combined to form the input, output, and hidden layers of a deep learning neural network.

In the network shown above, each layer is fully connected to the previous layer and the following layer. Each
layer enables complex mathematical transformations to be represented. Deep neural nets typically have
multiple (more than two or three) hidden layers.
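A compact NumPy sketch (illustrative, untrained weights only) shows the mechanics of a forward pass through such layers – each layer multiplies its inputs by weights, adds a bias, and applies an activation:

```python
import numpy as np

rng = np.random.default_rng(6)
x = rng.normal(size=3)                          # input layer: 3 input signals

W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)   # hidden layer of 4 neurons
W2, b2 = rng.normal(size=(1, 4)), np.zeros(1)   # output layer of 1 neuron

relu = lambda z: np.maximum(z, 0)               # a common activation function

hidden = relu(W1 @ x + b1)                      # hidden-layer activations
output = 1 / (1 + np.exp(-(W2 @ hidden + b2)))  # sigmoid squashes output to (0, 1)
print(output)
```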

Deep neural networks initially found broad application in the field of computer vision. A specific type of
deep neural network – the convolutional neural network (CNN) – is used broadly today in image and video
processing. Convolutional neural networks are not fully connected as in the previous example; instead,
they apply convolutional functions at each layer and transfer the results to the next layer. They simulate
how visual neurons work in animals.


2. Tuning a Machine Learning Model
Tuning a machine learning model is an iterative process. Data scientists typically run
numerous experiments to train and evaluate models, trying out different features,
different loss functions, different AI/ML models, and adjusting model parameters
and hyperparameters. Examples of steps involved in tuning and training a machine
learning model include feature engineering, loss function formulation, model testing
and selection, regularization, and selection of hyperparameters.

Feature Engineering
Feature engineering – a critical step to enhance AI/ML models – broadly refers to
mathematical transformations of raw data in order to feed appropriate signals into
AI/ML models.

In most real-world AI/ML use cases, data are derived from a variety of source
systems and typically are not reconciled or aligned in time and space. Data
scientists often put significant effort into defining data transformation pipelines
and building out their feature vectors. Furthermore, in most cases, mathematical
transformations applied to raw data can provide powerful signals to AI/ML
algorithms. In addition to feature engineering, data scientists should implement
requirements for feature normalization or scaling to ensure that no one feature
overpowers the algorithm.

For example, in a fraud detection use case, the customer’s actual account balance
at a point in time may be less meaningful than the average change in their account
balance over two 30-day rolling windows. Or, in a predictive maintenance use case,
the vibration signal related to a bearing may be less important than a vibration signal
that is normalized with respect to rotational velocity.
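As a small sketch of this kind of transformation – assuming the pandas library and hypothetical column names, not a pipeline from the text – the rolling-window account balance feature might be built as follows:

```python
import pandas as pd

# Hypothetical daily account balances for one customer.
txns = pd.DataFrame({
    "date": pd.date_range("2020-01-01", periods=90, freq="D"),
    "balance": 1000 + pd.Series(range(90)) * 3.0,
}).set_index("date")

# Average balance over a 30-day rolling window, and the change between
# this window and the preceding one: a candidate fraud-detection feature.
avg_30d = txns["balance"].rolling("30D").mean()
txns["avg_balance_change"] = avg_30d - avg_30d.shift(30)

print(txns.tail(3))
```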

Thoughtful feature engineering that is mindful of the underlying physics or functional
domain of the problem being solved, coupled with a mathematical explosion of the
feature search space, can be a powerful tool in a data scientist's arsenal.


Loss Functions
A loss function serves as the objective function that the AI/ML algorithm is seeking to optimize during
training efforts, and is often represented as a function of model weights, J(θ). During model training, the AI/
ML algorithm aims to minimize the loss function. Data scientists often consider different loss functions to
improve the model – e.g., make the model less sensitive to outliers, better handle noise, or reduce overfitting.

A simple example of a loss function is mean squared error (MSE), which is often used to optimize regression
models. MSE measures the average squared difference between predictions and actual output values.
The equation for a loss function using MSE can be written as follows:

J(θ) = (1/n) Σi (yi − ŷi)²

where ŷi represents a model prediction, yi represents an actual value, and there are n data points.
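Computed directly – here with NumPy on three made-up points – the loss is just the average squared gap between prediction and actual:

```python
import numpy as np

y = np.array([3.0, 5.0, 7.0])      # actual values
y_hat = np.array([2.5, 5.0, 8.0])  # model predictions

mse = np.mean((y - y_hat) ** 2)    # average of squared differences
print(mse)                         # 0.4166...
```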

It is important, however, to recognize the weaknesses of loss functions. Over-relying on loss functions as
an indicator of prediction accuracy may lead to erroneous model setpoints. For example, the two linear
regression models shown in the following figure have the same MSE, but the model on the left is under-
predicting while the model on the right is over-predicting.

Figure 13: The loss function is insufficient as the only evaluation metric – these two linear regression models have the
same MSE, but the model on the left is under-predicting and the model on the right is over-predicting.


Regularization
Regularization is a method to balance overfitting and underfitting a model during training. Both overfitting
and underfitting are problems that ultimately cause poor predictions on new data.

Overfitting occurs when a machine learning model is tuned to learn the noise in the data rather than the
patterns or trends in the data. Models are frequently overfit when there are a small number of training samples
relative to the flexibility or complexity of the model. Such a model is considered to have high variance or
low bias. A supervised model that is overfit will typically perform well on data the model was trained on, but
perform poorly on data the model has not seen before.

Underfitting occurs when the machine learning model does not capture variations in the data – where the
variations in data are not caused by noise. Such a model is considered to have high bias, or low variance.
A supervised model that is underfit will typically perform poorly on both data the model was trained on, and
on data the model has not seen before. Examples of overfitting, underfitting, and a good balanced model,
are shown in the following figure.

Figure 14: Regularization helps to balance variance and bias during model training – from left to right: overfitting
(low bias, high variance), underfitting (high bias, low variance), and a good balance.

Regularization is a technique to adjust how closely a model is trained to fit historical data. One way to apply
regularization is to add a parameter that penalizes the loss function when the tuned model is overfit. More
regularization prevents overfitting, while less regularization prevents underfitting. Balancing the
regularization parameter helps find a good tradeoff between bias and variance.


Regularization is incorporated into model training by adding a regularization term to the loss function, as
shown by the loss function example that follows. This regularization term can be understood as penalizing
the complexity of the model.

Recall that we defined the machine learning model to predict outcomes Y based on input features X as Y = f(X),
and we defined a loss function J(θ) for model training as the mean squared error (MSE):

J(θ) = (1/n) Σi (yi − ŷi)²

One type of regularization (L2 regularization) that can be applied to such a loss function, with
regularization parameter λ, is:

J(θ) = (1/n) Σi (yi − ŷi)² + λ Σj θj²

where ŷi represents a model prediction, yi represents an actual value, there are n data points, and the
second sum runs over the m feature weights θj.
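In practice, a data scientist rarely writes this penalty by hand. For example, scikit-learn's ridge regression applies L2 regularization directly (its alpha argument plays the role of the regularization parameter λ). A hedged sketch on synthetic data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(7)
X = rng.normal(size=(30, 10))                 # few samples, many features
y = X[:, 0] + rng.normal(scale=0.5, size=30)  # only the first feature matters

plain = LinearRegression().fit(X, y)          # no regularization
ridge = Ridge(alpha=10.0).fit(X, y)           # L2 penalty on large weights

# The regularized model keeps the irrelevant weights closer to zero,
# reducing the tendency to overfit noise.
print(np.abs(plain.coef_[1:]).mean(), np.abs(ridge.coef_[1:]).mean())
```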


Hyperparameters
Hyperparameters are parameters that are specified before training a model – i.e., they are different from
the model parameters, or weights, that an AI/ML model learns during model training.

For many machine learning problems, finding the best hyperparameters is an iterative and potentially time-
intensive process called “hyperparameter optimization.”

Examples of hyperparameters include the number of hidden layers and the learning rate of deep neural
network algorithms, the number of leaves and depth of trees in decision tree algorithms, and the number of
clusters in clustering algorithms.

Hyperparameters directly impact the performance of a trained machine learning model. Choosing the
right hyperparameters can dramatically improve prediction accuracy. However, hyperparameters can be
challenging to optimize because there is often a very large number of possible hyperparameter combinations.

To address the challenge of hyperparameter optimization, data scientists use specific optimization
algorithms designed for this task. Examples of hyperparameter optimization algorithms are grid search,
random search, and Bayesian optimization. These optimization approaches help narrow the search
space of all possible hyperparameter combinations to find the best (or near best) result. Hyperparameter
optimization is also a critical area where the data scientist’s experience and intuition matter.
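A minimal sketch of grid search – assuming scikit-learn and synthetic data – tries every combination in a small hyperparameter grid and reports the best cross-validated result:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=500, random_state=8)

grid = GridSearchCV(
    RandomForestClassifier(random_state=8),
    param_grid={"max_depth": [3, 5, 10],          # hyperparameter: tree depth
                "min_samples_leaf": [1, 5, 10]},  # hyperparameter: leaf size
    cv=5,  # cross-validation guards against tuning to a single lucky split
)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```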


3. Evaluating Model Performance
Managers implementing machine learning solutions to solve business
problems need to understand how to quantify model performance – a critical
step that informs model selection and tuning, helps architect the right business
processes around the model, and informs decisions about ongoing model
maintenance and operations.

Some examples of model performance measures follow.

Classification Performance
As described before, classifiers attempt to predict the probability of discrete
outcomes. The correctness of these predictions can be evaluated against
ground-truth results. Concepts of true or false positives, precision, recall,
F1 scores, and receiver operating characteristic (ROC) curves are key to
understanding classifier performance. Each of these concepts is described
further in the following sections.

True or False Positives


Consider, for example, a supervised classifier trained to predict customer
attrition. Such a classifier may make a prediction about how likely a customer is
to ‘unsubscribe’ over a certain period of time.

Evaluating the model’s performance can be summarized by four questions:

1. How many customers did the model correctly predict would unsubscribe?
2. How many customers did the model correctly predict would not unsubscribe?
3. How many customers did the model incorrectly predict would
unsubscribe (but did not)?
4. How many customers did the model incorrectly predict would not
unsubscribe (but did)?

These four questions characterize the fundamental performance metrics of true positives, true negatives,
false positives, and false negatives. Figure 15 visually describes this concept.


Attrition predictions made by the model are in the top half of the square. Each prediction is either correct (the
customer did unsubscribe) or not (the customer did not unsubscribe) – and is therefore a true or false positive.

The total number of customer attrition predictions made by the classifier is the number of true positives plus
the number of false positives.

There are two other concepts highlighted in the figure. False negatives (bottom left) refer to attrition predictions
that should have been made but were not. These represent customers who unsubscribed, but who were
not correctly classified. True negatives (bottom right) refer to customers who were correctly classified as
remaining (not unsubscribing).

Figure 15: Visual representation of true positives, false positives, true negatives, and false negatives – model predictions compared against reality

Precision

Precision refers to the number of true positives divided by the total number of positive predictions –
the number of true positives, plus the number of false positives. Precision therefore is an indicator of
the quality of a positive prediction made by the model.

Precision is defined as:

Precision = True Positives / (True Positives + False Positives)

In the customer attrition example, precision measures the number of customers that the model correctly
predicted would unsubscribe divided by the total number of customers the model predicted
would unsubscribe.


Recall

Recall refers to the number of true positives divided by the total number of positive cases in the data set (true
positives plus false negatives). Recall is a good indicator of the ability of the model to identify the positive class.

Recall is defined as:

Recall = True Positives / (True Positives + False Negatives)

In the customer attrition example, recall measures the ratio of customers that the model correctly predicted would
unsubscribe to the total number of customers who actually unsubscribed (whether correctly predicted or not).

While a perfect classifier may achieve 100 percent precision and 100 percent recall, real-world models never
do. Models inherently trade off between precision and recall; typically the higher the precision, the lower the
recall, and vice versa.

In the customer attrition example, a model that is tuned for high precision – each prediction is a high-quality
prediction – will usually have a lower recall; in other words, the model will not be able to identify a large portion
of customers who will actually unsubscribe.


F1 Score

The F1 score is a single evaluation metric that aims to account for and optimize both precision and recall.
It is defined as the harmonic mean between precision and recall.12 Data scientists use F1 scores to quickly
evaluate model performance during model iteration phases by collapsing both precision and recall into this
single metric. This helps teams test thousands of experiments simultaneously and identify top-performing
models quantitatively.

A model will have a high F1 score if both precision and recall are high. However, a model will have a low F1
score if one factor is low, even if the other is 100 percent.

The F1 score is defined as:

F1 = 2 × (Precision × Recall) / (Precision + Recall)
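These three metrics are quick to compute in practice. A small sketch with hypothetical attrition predictions (1 = unsubscribed), assuming scikit-learn:

```python
from sklearn.metrics import f1_score, precision_score, recall_score

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]  # actual outcomes
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]  # model predictions

# Here TP = 2, FP = 1, FN = 2.
print(precision_score(y_true, y_pred))  # 2 / (2 + 1) = 0.667
print(recall_score(y_true, y_pred))     # 2 / (2 + 2) = 0.5
print(f1_score(y_true, y_pred))         # harmonic mean = 0.571
```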

Receiver Operating Characteristic (ROC) Curve

Another tool used by data scientists to evaluate model performance is the receiver operating characteristic
(ROC) curve. The ROC curve plots the true positive rate (TPR) versus the false positive rate (FPR).

TPR refers to the number of positive cases surfaced as the model makes predictions divided by the total
number of positive cases in the data set. This metric is the same as recall.

TPR is defined as:

TPR = True Positives / (True Positives + False Negatives)


FPR refers to the number of negative cases that have been surfaced as the model makes predictions,
divided by the total number of negative cases in the data set.

FPR is defined as:

FPR = False Positives / (False Positives + True Negatives)

Note that the ROC curve is designed to plot the performance of the model as the model works through
a prioritized set of predictions on a data set. Imagine, for example, that the model started with its best
possible prediction, but then continued to surface data points until the entire data set has been worked
through. The ROC curve therefore provides a full view of the performance of the model – both how
good the initial predictions are and how the quality of the predictions is likely to evolve as one continues
down a prioritized list of scores.

The ROC and corresponding area under the curve (AUC) are useful measures of model performance
especially when comparing results across experiments. Unlike many other success metrics, these
measures are relatively insensitive to the composition and size of data sets.

The following figure illustrates an example ROC curve of a classifier in orange. The curve can be
interpreted by starting at the origin – bottom left – and working up to the top right of the chart.

Figure 16: Receiver operating characteristic (ROC) curve – an ideal classifier predicts with a high true positive rate at a low
false positive rate, while the diagonal straight line represents a predictive model that is randomly guessing


Imagine a classifier starting to make predictions against a finite, labeled data set, while seeking to identify
the positive labels within that data set. Before the classifier makes its first prediction, the ROC curve will
start at the origin (0, 0). As the classifier starts to make predictions, it sorts them in order of priority – in other
words, the predictions the classifier is most “certain” about, with the highest probability of success, are
plotted first. A data scientist would hope these initial predictions overwhelmingly consist of positive labels
so that there would be many more positive cases surfaced relative to negative ones (TPR should grow
faster than FPR). The orange line plotting the performance of the classifier should therefore be expected to
grow rapidly from the origin, at a steep slope – as shown in the figure.

At some point, the classifier is unable to distinguish clearly between positive and negative labels, so the
number of negative labels surfaced will grow and the number of positive labels remaining will start to
dwindle. The classifier performance therefore starts to level off and the FPR starts to grow faster than the
TPR. The classifier is forced to surface data points until both TPR and FPR are at 1 – the top right-hand
side of the plot.

What is important in this curve is the shape it takes. The start (0,0) and the end (1,1) are pre-determined. It
is the initial “steepness” of the curve’s slope and the AUC that matter. As shown in the following figure, the
greater the AUC, the better the classifier’s performance.

Figure 17: Area under the ROC curve (AUC) measures how much better a machine learning model predicts classification
versus random guessing – the greater the area under the curve compared to the straight line, the better the classifier.


Note that the classifier’s performance on an ROC curve is compared with random guessing. Random
guessing in this case is not a toss of a coin (in other words, a 50 percent probability of getting a class
right). Random guessing here refers to a classifier that is truly unable to discern the positive class from the
negative class. The predictions of such a classifier will reflect the baseline incidence rates of each class
within the data set.13

For example, a random classifier used to predict cases of customer attrition will randomly classify
customers who are likely to “unsubscribe” with the same incidence rate that is observed in the underlying
data set. If the data set includes 100 customers of which 20 unsubscribed and 80 did not, the likelihood of
the random classifier making a correct attrition prediction will be 20 percent (20 out of 100).

It can be shown that a random classifier, with predictions that correspond to class incidence rates, will on
average plot as a straight line on the ROC curve, connecting the origin to the top right-hand corner.

The AUC of a random classifier is therefore 0.5. Data scientists compare the AUC of their classifiers
against the 0.5 AUC of a random classifier to estimate the extent to which their classifier improves on
random guessing.
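As a short sketch – with illustrative scores rather than a real model – scikit-learn computes both the points of the curve and its AUC directly from a classifier's probability scores:

```python
from sklearn.metrics import roc_auc_score, roc_curve

y_true = [1, 1, 1, 0, 1, 0, 0, 1, 0, 0]                         # actual labels
scores = [0.95, 0.9, 0.8, 0.7, 0.65, 0.5, 0.4, 0.35, 0.2, 0.1]  # model scores

fpr, tpr, thresholds = roc_curve(y_true, scores)  # points along the ROC curve
print(roc_auc_score(y_true, scores))  # 0.84 here; 0.5 would be random guessing
```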

Setting Model Thresholds

A classifier model typically outputs a probability score (between 0 and 1) that reflects the model's confidence
in a specific prediction. While a good starting rule of thumb is that a prediction value greater than 0.5 can be
considered a positive case, most real-world use cases require careful tuning of the classifier value that is
declared to be a positive label.

Turning again to the customer attrition example: Should a customer be considered likely to unsubscribe if
their score is greater than 0.5 or 0.6, or lower than 0.5? There is no hard-and-fast rule; rather, the actual set
point should be tuned based on the specifics of the use case and trade-offs between precision, recall,
and specific business requirements. This value that triggers the declaration of a positive label is called the
model threshold.


Increasing the model threshold closer to 1 results in a model that is more selective; fewer predictions are
declared to be positive cases. Decreasing the threshold closer to 0 makes the model less selective; more
predictions are labeled as positives.
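This trade-off is easy to see in a sketch that applies three different thresholds to the same illustrative scores used in the ROC example above:

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

y_true = np.array([1, 1, 1, 0, 1, 0, 0, 1, 0, 0])
scores = np.array([0.95, 0.9, 0.8, 0.7, 0.65, 0.5, 0.4, 0.35, 0.2, 0.1])

for threshold in (0.3, 0.5, 0.7):
    y_pred = (scores >= threshold).astype(int)  # positive label above the cutoff
    print(threshold,
          precision_score(y_true, y_pred),  # rises as the threshold rises
          recall_score(y_true, y_pred))     # falls as the threshold rises
```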

The following figure demonstrates how changing the threshold parameter alters the selectivity of a classifier
predicting customer attrition. A higher threshold results in higher precision, but lower recall. A lower
threshold results in higher recall, but lower precision. It’s therefore important to identify an optimal model
threshold with favorable precision and recall.

Figure 18: Threshold selectivity of an ML classifier used to predict customer attrition (class 1 = positive class; class 2 = negative class)


Regression Performance
Evaluating performance of a regression model requires a different approach and different metrics than
are used to evaluate classification models. Regression models estimate continuous values; therefore,
regression performance metrics quantify how close model predictions are to actual (true) values.

The following are some commonly used regression performance metrics.

Coefficient of Determination, R-squared (R2)

R2 is an indicator of how well a regression model fits the data. It represents the extent to which the variation
of the dependent variable is predictable by the model.

For example, an R2 value of 1 indicates that the input variables in the model (such as sales history and
marketing engagement for customer attrition) are able to explain all of the variation observed in the output
(such as number of customers who unsubscribed). If a model has a low R2 value, it may indicate that other
inputs should be added to improve accuracy.

Mathematically, R2 is defined as:

R2 = 1 − [ Σi (yi − ŷi)² / Σi (yi − ȳ)² ]

where n is the total number of evaluated samples, yi is the ith observed output, ŷi is the ith predicted
output, and ȳ is the mean observed output. The quantity (yi − ŷi) can also be referred to as the prediction
error, denoted êi.

Let’s consider a simple regression model that is trained to forecast monthly sales at a company. The following
table illustrates the concept.


Company X Sales

Month      Sales Forecast ($ Millions), ŷi   Actual Sales ($ Millions), yi
Month 1    28                                32
Month 2    37                                35
Month 3    41                                42
Month 4    36                                45
Month 5    27                                33
Month 6    32                                35
Month 7    45                                51
Month 8    51                                45
Month 9    50                                48
Month 10   48                                55
Month 11   55                                51
Month 12   57                                49

Table 1. Example of a simple sales forecasting model

A data scientist may want to compare the model’s performance relative to actuals (for instance, over the last
year). A data scientist using R2 to estimate model performance would perform the calculation described
in the following table. The R2 value for this sales forecasting model is 0.7.


Company X Sales

Month       Sales Forecast ($ Millions), ŷi   Actual Sales ($ Millions), yi   (yi − ŷi)²   (yi − ȳ)²
Month 1     28                                32                              16           238
Month 2     37                                35                              4            41
Month 3     41                                42                              1            6
Month 4     36                                45                              81           55
Month 5     27                                33                              36           270
Month 6     32                                35                              9            130
Month 7     45                                51                              36           3
Month 8     51                                45                              36           58
Month 9     50                                48                              4            43
Month 10    48                                55                              49           21
Month 11    55                                51                              16           134
Month 12    57                                49                              64           185
Average ȳ                                     43                    Sum       352          1183
                                                                    R2        0.70

Table 2. Calculation of R2 for the Simple Sales Forecasting Model

Mean Absolute Error (MAE)

MAE measures the absolute error between predicted and observed values. For example, an MAE
value of 0 indicates there is no difference between predicted values and observed values. In practice,
MAE is a popular error metric because it is both intuitive and easy to compute.

Mathematically, MAE is defined as:

MAE = (1/n) Σi |yi − ŷi|

where n is the total number of evaluated samples, yi is the ith observed (actual) output, and ŷi is the ith
predicted output.


Mean Absolute Percent Error (MAPE)

MAPE measures the average absolute percent error of predicted values versus observed values.
Normalizing for the relative magnitude of observed values reduces skew in the reporting metric
so it is not overly weighted by large magnitude values. MAPE is commonly used to evaluate the
performance of forecasting models.

Mathematically, MAPE is defined as:

MAPE = (1/n) Σi |yi − ŷi| / |yi|

where n is the total number of evaluated samples, yi is the ith observed (actual) output, and ŷi is the ith
predicted output.

Root Mean Square Error (RMSE)

RMSE is a quadratic measure of the error between predicted and observed values. It is similar to MAE as a
way to measure the magnitude of model error, but because RMSE averages the square of errors, it provides
a higher weight to large magnitude errors. RMSE is a commonly used metric in business problems where higher
magnitude errors have a higher consequence – like predicting item sales prices, where high-priced items
matter more for bottom-line business goals. However, this also may result in over-sensitivity to outliers.

Mathematically, RMSE is defined as:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2}$$

where n is the total number of evaluated samples, yi is the ith observed (actual) output, and ŷi is the ith
predicted output.

We can now compute MAE, MAPE, and RMSE for the same monthly sales forecasting example, as outlined
in the following table.


Company X Sales

Month    | Sales Forecast ($ Millions), ŷi | Actual Sales ($ Millions), yi | MAE Metric, |yi − ŷi| | MAPE Metric, |yi − ŷi| / |yi| | RMSE Metric, (yi − ŷi)²
Month 1  | 28 | 32 | 4 | 0.13 | 16
Month 2  | 37 | 35 | 2 | 0.06 | 4
Month 3  | 41 | 42 | 1 | 0.02 | 1
Month 4  | 36 | 45 | 9 | 0.20 | 81
Month 5  | 27 | 33 | 6 | 0.18 | 36
Month 6  | 32 | 35 | 3 | 0.09 | 9
Month 7  | 45 | 51 | 6 | 0.12 | 36
Month 8  | 51 | 45 | 6 | 0.13 | 36
Month 9  | 50 | 48 | 2 | 0.04 | 4
Month 10 | 48 | 55 | 7 | 0.13 | 49
Month 11 | 55 | 51 | 4 | 0.08 | 16
Month 12 | 57 | 49 | 8 | 0.16 | 64
Average ȳ = 43.4 | Sum = 58 | Sum = 1.34 | Sum = 352
MAE = 4.8 | MAPE = 0.11 | RMSE = 5.4

Table 3. Calculation of MAE, MAPE, and RMSE for the Simple Sales Forecasting Model

As seen in Tables 2 and 3, the R2 (0.49) and MAPE (0.11) regression metrics provide a normalized, relative
sense of model performance. A “perfect” model would have an R2 value of 1. The MAPE metric provides
an intuitive sense of the average percentage deviation of model predictions from actuals. In this case, the
model is approximately 11 percent “off.”

The MAE (4.8) and RMSE (5.4) metrics provide a non-normalized, absolute sense of model performance
in the predicted unit (in this case millions of dollars). MAE provides a sense of the average absolute value of
the forecast’s deviation from actuals. Finally, RMSE provides a “root-mean-square” version of the forecasts’
average deviations from actuals.
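
To make these calculations concrete, the following sketch reproduces the metrics in Tables 2 and 3 from the raw forecast and actual values, using scikit-learn where a built-in metric exists and plain NumPy otherwise:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score

# Monthly sales data from Tables 1-3 ($ millions)
y_pred = np.array([28, 37, 41, 36, 27, 32, 45, 51, 50, 48, 55, 57])  # forecast, ŷ
y_true = np.array([32, 35, 42, 45, 33, 35, 51, 45, 48, 55, 51, 49])  # actual, y

r2 = r2_score(y_true, y_pred)                        # 1 - SS_res / SS_tot
mae = mean_absolute_error(y_true, y_pred)            # mean |y - ŷ|
mape = np.mean(np.abs((y_true - y_pred) / y_true))   # mean |y - ŷ| / |y|
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))      # sqrt of mean (y - ŷ)²

print(f"R2 = {r2:.2f}, MAE = {mae:.1f}, MAPE = {mape:.2f}, RMSE = {rmse:.1f}")
# R2 = 0.49, MAE = 4.8, MAPE = 0.11, RMSE = 5.4
```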


4. Runtimes and Compute Requirements
Machine Learning Libraries
Some of the earliest machine learning applications involved consumer-facing
use cases developed by companies like Google, Amazon, LinkedIn, Facebook,
and Yahoo. Machine learning practitioners at these companies applied their
skills to improve search engine results and advertisement placement and click-
throughs, and to build advanced recommender systems for products and offerings.

Many of the machine learning practitioners from these companies, as well
as many in the academic community, embraced the open source software
model, in which contributors would make their source code for core underlying
technical capabilities freely available to the broader community of scientists
and developers. The idea was that these contributions would accelerate the
pace of innovation for all.

As a result of this early work and the ongoing commitment to open source
technology, data scientists and machine learning engineers now have a wide
variety of machine learning libraries, languages, and infrastructure options with
which to develop applications.
Library | Supported Languages | Open Source/Proprietary
TensorFlow | Python, Java, C, Go, C++, JavaScript, Swift | Open source
Keras | Python | Open source
Torch/PyTorch | Lua, Python, C++, Java | Open source
MLlib | Python, Scala, Java, R | Open source
scikit-learn | Python | Open source
XGBoost | Python, Java, C++, R, Julia, Perl, Scala | Open source
NumPy | Python | Open source
Pandas | Python | Open source
NLTK | Python | Open source
statsmodels | Python | Open source
spaCy | Python, Cython | Open source
Matplotlib | Python | Open source
Theano | Python | Open source
SciPy | Python | Open source

Table 4. Commonly used machine learning libraries


Many core machine learning libraries – including scikit-learn, SciPy, Pandas, and NumPy – have emerged
as open source standards.

Machine learning libraries enable data scientists to rapidly train and test new models without having to write
all of an algorithm’s code from scratch. Python has emerged as the machine learning language of choice; a
significant share of source code contributions has involved Python libraries and tools.
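
As an illustration of how little code a library-based workflow requires, the following sketch trains and evaluates a scikit-learn classifier on a synthetic data set; the data and model choice here are placeholders rather than recommendations:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic data stands in for an enterprise data set
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Train a full model and score it on held-out data in two lines
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
print(f"Holdout accuracy: {accuracy_score(y_test, model.predict(X_test)):.2f}")
```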

Programming Languages for Machine Learning


Python has become the most widely adopted programming language for machine learning. Data
scientists choose different programming languages based on ease of use, simplicity in programming
syntax, number of machine learning libraries available, integration with other programs like cloud
infrastructure or visualization software, and computational speed and efficiency.

What Programming Language Do You Use on a Regular Basis?

Note: Data are from the 2018 Kaggle Machine Learning and Data Science Survey, available at
http://www.kaggle.com/kaggle/kaggle-survey-2018. A total of 18,827 respondents answered the question.

Figure 19: Data scientists have many options for programming languages to develop machine learning models.
Python has become a popular choice.14


Infrastructure: Machine Learning Hardware Requirements
Choosing the right hardware to train and operate machine learning programs will greatly impact the
performance and quality of a machine learning model. Most modern companies have transitioned data
storage and compute workloads to cloud services. Many companies operate hybrid cloud environments,
combining cloud and on-premise infrastructure. Others continue to operate entirely on-premise, usually
driven by regulatory requirements.

Cloud-based infrastructure provides flexibility for machine learning practitioners to easily select the
appropriate compute resources required to train and operate machine learning models.

The processor is a critical consideration in machine learning operations. The processor executes a
program’s arithmetic, logic, and input/output instructions; it is the central nervous system that carries out
machine learning model training and predictions. A faster processor can reduce the time it takes to train a
machine learning model and to generate predictions by as much as 100-fold or more.

There are two primary processors used as part of most AI/ML tasks: central processing units (CPUs) and
graphics processing units (GPUs). CPUs are suitable to train most traditional machine learning models
and are designed to execute complex calculations sequentially. GPUs are suitable to train deep learning
models and visual image-based tasks. These processors handle multiple, simple calculations in parallel.
In general, GPUs are more expensive than CPUs, so it is worthwhile to evaluate carefully which type of
processor is appropriate for a given machine learning task.
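
Frameworks typically make this hardware choice explicit in code. As one illustrative sketch, the following PyTorch snippet selects a GPU when one is available and falls back to the CPU otherwise:

```python
import torch

# Use a GPU if present; deep learning workloads typically benefit
# from GPU parallelism, while many traditional models run well on CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = torch.nn.Linear(128, 1).to(device)   # move model parameters to the device
batch = torch.randn(64, 128, device=device)  # allocate input data on the same device
predictions = model(batch)
```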

Other specialized hardware increasingly is used to accelerate training and inference times for complex,
deep learning algorithms, including Google’s tensor processing units (TPUs) and field-programmable gate
arrays (FPGAs).

In addition to processor requirements, memory and storage are other key considerations for the
AI/ML pipeline.


To train or operate a machine learning model, programs require data and code to be stored in local memory
to be executed by the processor. Some models, like deep neural networks, may require more fast local
memory because the models are larger. Others, like decision trees, may be trained with less memory
because the models are smaller.

As it relates to disk storage, cloud storage in a distributed file system typically removes any storage
limitations that were imposed historically by local hard disk size. However, AI/ML pipelines operating in
the cloud still need careful design of both data and model stores.

Many real-world AI/ML use cases involve complex, multi-step pipelines. Each step may require different
libraries and runtimes and may need to execute on specialized hardware profiles. It is therefore critical to factor
in management of libraries, runtimes, and hardware profiles during algorithm development and ongoing
maintenance activities. Design choices can have a significant impact on both costs and algorithm performance.


5. Selecting the Right AI/ML Problems
While the potential economic benefits of AI/ML are substantial, many organizations
struggle with capturing business value from AI. Many enterprises have
widespread AI prototype efforts, but few companies are able to run and scale AI
algorithms in production. Fewer still are able to unlock significant business value
from their AI/ML efforts.

Based on our work with several of the largest enterprises in the world, the most
critical factor in unlocking value from AI/ML is the selection of the right problems
to tackle and scale up across the company.

During problem selection, managers should think through three critical dimensions.
Managers should ensure that the problems they select (1) are tractable, with
reasonable scope and solution times; (2) unlock sufficient business value and can be
operationalized to enable value capture; and (3) address ethical considerations.

Tractable Problems
Ability to Solve the Problem

A first step to AI/ML problem selection is ensuring that the problem actually can
be solved. This involves thinking through the premise and formulation of the
problem. At their core, many AI/ML tasks are prediction problems – and data are
at the center of such problems.

That’s why a consideration of problem tractability should involve analysis of the
available data. For supervised learning problems, this involves thinking through
whether sufficient historical data are available and whether there are sufficient
data signals and labels for an algorithm to be trained successfully.

For many use cases that involve supervised learning problems, the number
and quality of available labels becomes a key limiting issue. Supervised models
typically require hundreds of labels for training and often thousands, or even
millions, of labels to learn to accurately predict outcomes. Many organizations may
not have the historical data sets needed to support supervised learning models,
particularly because the underlying enterprise IT systems and data models were
never designed with either machine learning or labels in mind.


For unsupervised learning problems, this involves thinking through whether sufficient “normal” historical
periods can be identified for the algorithms to determine what a range of normal operations look like.

Another factor that must be considered up front is whether data scientists and SMEs consider the
fundamental problem formulation to be tractable. There is an art to analyzing problem tractability.

Problem tractability analysis may involve assessing whether there are sufficient signals encoded within the
data set to predict a specific outcome, whether humans could solve the problem given the right data, or
whether a solution could be found given the fundamental physics involved.

For example, consider a problem in which a bank is trying to identify individuals involved in money
laundering activities. The bank has years of transaction records, with millions of transactions that contain
useful information about money transfers and counterparties. The bank also has significant contextual
information about its customers, their backgrounds, and their relationships, and access to external data
sources including news feeds and social media. The bank also may have thousands of historical suspicious
activity reports to act as labels from which an algorithm could learn.

Further, humans may be able to diagnose and identify individual money laundering cases very well – but
humans can’t scale to interpret data from millions of transactions and customer accounts. This is an example
of a data-rich and label-rich environment in which the tractability of the problem formulation is established
– humans can perform the diagnostic task to a limited degree – but the key challenge is performing the
identification task at scale and with high fidelity. This is a good, tractable problem for a supervised machine
learning algorithm.

Contrast this with a different example problem, in which an operator is trying to predict the failure of a very
expensive and complex bespoke machine. The machine may be only partially instrumented with few input
signals and may only have one or two historical failures from which an algorithm could learn. Given the
available signals and data inputs, human operators may not be able to effectively predict upcoming failures.
This is an example of a data-poor and label-poor environment and system in which the tractability of the
problem formulation is unclear. Ultimately, this may not be a tractable problem for a supervised machine
learning algorithm.

Depending on the amount of historical data and instrumentation available, this problem may be amenable
to other AI/ML techniques (for example, unsupervised anomaly detection methods). But this is an example
of a problem that solution teams may want to examine carefully before pursuing. The following figure
summarizes examples of tractable and intractable ML use cases.


Tractable Use Case for Supervised Learning: Anti-Money Laundering

• Large number of input signals
• Thousands of historical labels
• May be possible for humans to detect individual cases, but challenging for humans to do this precisely and at scale

Intractable Use Case for Supervised Learning: Predictive Maintenance for a Single Complex Machine

• Few input signals; data available for only one machine
• Only one or two historical failures
• May or may not be possible for individual operators to detect upcoming failures

Figure 20: Examples of tractable and intractable machine learning use cases

In most organizations, understanding and analyzing available data requires input and cooperation from
both IT managers and business managers or subject matter experts (SMEs). Business teams usually have a
good understanding of their data sets but not of the underlying source data systems. IT teams usually have
a good understanding of the data sources but often do not know what the data represent.

Based on our experience, the data complexity for most enterprise business problems is significant. It
is reasonable to assume that at least five or six disparate IT and operational software systems will be
required to solve most real-world enterprise AI use cases that unlock substantial business value. At most
organizations, the individual IT source systems weren’t designed to interoperate and typically have widely
varying definitions of business entities and ground truth.

A cross-functional business and IT team is required to identify a range of relevant data sources for any
problem (a combination of all sources that have relevant signals and labels) and then to analyze those
sources to characterize the available data.


Economic or Business Value


Economic Value of the Problem

A second criterion in problem selection involves the economic rationale – the business case – and an
analysis of the potential value that could be unlocked if the problem were addressed.

This is often a crucial step for most enterprises because the number of potential applications of AI in an
organization is enormous. Most companies are at the very start of their AI business transformations, which
means that almost all business processes potentially can be honed using AI.

The following figure shows an illustrative distribution of value and number of AI use cases we typically see
at large enterprises.

Usually, however, only a few business cases are likely to result in vastly disproportionate high returns – and
it is those business cases that warrant immediate consideration. A rough order of magnitude economic
value calculation before embarking on a specific use case can help focus and prioritize efforts.

Figure 21: Typical illustrative distribution of AI use cases at large enterprises


Performance Relative to a Baseline

To determine how much value can be unlocked by machine learning and AI applications, it is particularly
important to understand and articulate the baseline performance (e.g., efficacy, efficiency) of the
business function that the AI application is seeking to augment. In most business use cases, the baseline
performance directly reflects the problem that the company is seeking to address with AI/ML. For example,
an equipment operator who wants to use machine learning to predict failures may have a baseline known
as “run to failure,” where the operator uses the equipment until it breaks down. It is important to evaluate
baseline performance in order to understand the current economic performance (or other performance
measures) of the business use case today, as well as the benefits from the application of AI/ML techniques.

Consider an example from the financial services industry. C3 AI Anti-Money-Laundering (AML) is one
of the AI/ML applications offered by C3.ai that applies machine learning to identify whether a banking
customer is committing money-laundering fraud. At most financial institutions, the baseline financial crimes
process draws on a library of rules that flag suspicious client behavior.

The costs of such a baseline system to banks are two-fold. First, financial regulators impose fines and
penalties when banks fail to catch money launderers; these fines and penalties create significant
reputational and personal risk for bank executives. Fines totaled more than $8 billion globally in 2019.15
Second, banks hire thousands of analysts to manually review and investigate potential money laundering
cases each year.

In one example, a division of a bank evaluated the baseline operations of their rules-based system. The rules
identified 7,300 cases of potential money laundering, but only 33 of those cases were verified by an analyst
to be true. Upon later review, the financial institution discovered that there were 110 actual cases of money
laundering in the data set, meaning that 87 cases went undetected by the existing rules. As shown in the
following figure, the baseline precision of the rules-based system was just 0.5%. And the baseline recall of
the rules-based system was 30%.
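
These baseline figures follow directly from the precision and recall definitions discussed in Chapter 3, as the short calculation below illustrates:

```python
flagged        = 7300   # cases flagged by the rules-based system
true_positives = 33     # flagged cases confirmed as money laundering
actual_cases   = 110    # total money laundering cases later found in the data

precision = true_positives / flagged       # 33 / 7,300 ≈ 0.5%
recall    = true_positives / actual_cases  # 33 / 110   = 30%
print(f"Baseline precision: {precision:.1%}, baseline recall: {recall:.0%}")
```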

An AI/ML system that improves on these baseline numbers could add significant value to a bank by both
increasing efficiency (a smaller staff needed to review alerts) and effectiveness (catching more money launderers).

I am often told by clients, prospects, and partners that they will not accept the output of AI/ML systems if
the “accuracy” of such systems is not in the “90% range.” There are many problems with this statement,
including an often vague definition of accuracy.


However, the primary issue I take with this statement (as in the case above) is that there are many business
use cases where the performance of existing rules-based, physics-based, or business-logic-based systems
is far from “90%.” And, in most of those business use cases, even modest improvements to the baseline
performance numbers of an organization can unlock significant economic, social, and environmental benefits.

I am therefore convinced that the right question is not: “Does the AI/ML algorithm reach 90%?” Instead,
it should be: “What is the business performance gain that the AI/ML algorithm delivers?” We should also
determine whether that performance gain is worth the investment.

Figure 22: Baseline accuracy of a rules-based system to detect money laundering at a financial institution

By first evaluating the economic value of a business problem, the machine learning project can be
prioritized for development and accurately recognized for the economic value it will create. Demonstrating
value helps drive adoption and change management among the end users who will deploy the machine
learning program to make data-based decisions.


Ethical Implications of the Problem


In addition to considering the fundamental problem-solving capabilities of AI/ML techniques, practitioners
should also consider upfront the ethical implications of a machine learning project.

Ethical AI is a nuanced, complex, and emerging discipline, which means there are few concrete guidelines
for companies to follow today.16,17 Some technology companies are crafting their own AI principles,18,19 while
others are hiring chief ethics officers to set guidelines and steer organizations toward responsible actions.20

AI systems face several critical ongoing ethical challenges. While a detailed treatment and analysis of AI
ethics is outside the scope of this publication, managers need to be mindful of a few key themes as they
move forward in this evolving and sometimes controversial area.

Fairness and Bias

The most important and frequently occurring ethical issue with enterprise AI/ML systems involves the
management of fairness and bias. AI/ML algorithms fueled by big data are driving decisions about health
care, employment, education, housing, and policing even as an ever-growing body of evidence shows that
AI algorithms can be biased. Even models developed with the best of intentions may inadvertently exhibit
discriminatory biases against historically disadvantaged groups, perform relatively worse for certain
demographics, or promote inequality.

Enterprises must be concerned not only with statistical bias in their AI models (for example, selection, sampling,
inductive, and reporting bias) but also with ethical fairness. Discrimination – performing predictions and
classifications on data – is the very point of machine learning, but enterprises must avoid using statistical
discrimination as the basis for unjustified differentiation. Differentiation may be unjustified due to
practical irrelevance (for instance, incorporating race or gender in prediction tasks such as employment) and/or
moral irrelevance despite statistical relevance (such as factoring in disability). Where race is suspected of causing
unjustified bias, the “easy fix” – removing race as a feature – doesn’t work, because race may be correlated with other
features, for instance ZIP code, which may knowingly or unknowingly have been included in a model. Instead, the
best practice is to explicitly include race as a feature and subsequently correct for bias.

In a famous example from 2018, machine learning practitioners at Amazon.com were experimenting with
a new AI/ML-based recruiting system to help streamline their hiring process.21 No matter what the data
science team did, they found that the algorithm’s results were biased against women candidates. This
occurred in spite of the significant care Amazon’s data scientists had taken to strip out gender-related
information from resumes. The bias in results occurred because historical training labels of successful
hires were biased towards men. AI/ML algorithms are very good at identifying “successful” outcomes – in
this case, male hires. Despite the best efforts of data scientists, the algorithms identified and latched onto a
range of features that were highly correlated with gender. The system essentially figured out whether the
candidate would be a male hire. Amazon therefore had to stop using AI/ML algorithms for this purpose. The
following figure is from a news article discussing this challenge.

Figure 23: Amazon scraps AI/ML recruiting tool because of gender bias


Other questions around bias often relate to the use of AI in other human resources (HR) decisions – for
example, whether to promote people or make salary recommendations – and in situations where AI agents
determine the ability to receive loans or access healthcare.

Unfairness in ML systems is primarily caused by human bias inherent in historical training data. ML models
are prone to amplifying such biases. No consensus on an ideal definition of fairness exists today. Rather
than attempting to resolve questions of fairness within a single technical framework, the approach
should be to educate the people involved in building ML models to examine critically the many ways that
machine learning affects fairness. Several fairness criteria have been developed to quantify and correct for
discriminatory bias in classification tasks, including demographic parity,22 equal opportunity, and equalized
odds.12 It is important to select the right type of fairness; otherwise, the wrong metric can lead to harmful
decisions and risks propagating systematic discrimination at scale. When building fair ML
models, it is crucial to understand when to use each fairness metric and what to consider when applying
it. In practice, there is usually a trade-off between model performance and fairness that must be managed.
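
As an illustration of one such criterion, the following sketch computes a simple demographic parity difference – the gap in positive-prediction rates between two groups – on invented example predictions:

```python
import numpy as np

def demographic_parity_difference(y_pred, group):
    """Difference in positive-prediction rates between two groups (0 = parity)."""
    y_pred, group = np.asarray(y_pred), np.asarray(group)
    rate_a = y_pred[group == 0].mean()  # positive-prediction rate for group 0
    rate_b = y_pred[group == 1].mean()  # positive-prediction rate for group 1
    return abs(rate_a - rate_b)

# Hypothetical loan-approval predictions for two demographic groups
preds  = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 0])
groups = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])
print(demographic_parity_difference(preds, groups))  # 0.2
```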

Safety

The safety and reliability of AI systems is another critical issue. The key question here revolves around
whether we trust the AI system to make reliable and appropriate decisions. Safety considerations are often
thought about in the specific context of physical or real-world systems. But even information systems can
have broader and cascading effects on human, economic, social, or environmental safety.

An often-cited example of AI safety centers on the ethical considerations around self-driving cars. For
example, in the case of an emergency trade-off, what decisions should the car make? Should it seek to
protect its passengers or should it seek to protect others?23

While the self-driving car example may feel theoretical or contrived, other safety considerations are
driving concrete decisions today. For example, dating back to a directive in 2012, the U.S. government has
attempted to set guidelines around the use of autonomous and semi-autonomous weapons systems,
seeking to “allow commanders and operators to exercise appropriate levels of human judgment over the
use of force.”24 This is an evolving space. There is no international consensus on the use of autonomous AI
technologies and individual nations may make differing decisions.


Explainability and Transparency

Another frequent AI ethics concern centers around explainability, also referred to as interpretability. AI
algorithms often are perceived as black boxes making inexplicable decisions. Unlike traditional software, it
may not be possible to point to any “if/then” logic to explain a software outcome to a business stakeholder,
regulator, or customer. This lack of transparency can lead to significant losses if AI models – misunderstood
and improperly applied – are used to make bad business decisions. This lack of transparency can also result
in user distrust and refusal to use AI applications.

Certain use cases – for instance, leveraging AI to support a loan decision-making process – may present a
reasonable financial services tool if properly vetted for bias. But the financial services institution may require
that the algorithm be auditable and explainable to pass any regulatory inspections or tests and to allow
ongoing control over the decision support agent.

In fact, European Union Regulation 2016/679 (the General Data Protection Regulation), enacted in 2016,
gives consumers the “right to explanation of the decision reached after such assessment and to challenge
the decision” if it was affected by AI algorithms.25

Given our current understanding, certain classes of algorithms, including more traditional machine learning
algorithms, tend to be more readily explainable, while being potentially less performant. Others, such as
deep learning systems, while being more performant, remain much harder to explain. In such cases, it is
recommended to deploy the AI model with a second “interpreter module.” The interpreter module can deduce
what factors the AI model considered important for any particular prediction. For the more technical reader,
these might include model-agnostic approaches like LIME and Shapley values or model-specific approaches
like tree interpreters.26,27 Improving our ability to explain AI systems remains an area of active research.
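
As a sketch of such an interpreter module – assuming the open source shap package is installed alongside scikit-learn – the following code attaches Shapley-value attributions to a tree-based model:

```python
import shap  # assumes the open source shap package is installed
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
model = GradientBoostingClassifier(random_state=0).fit(X, y)

# The "interpreter module": Shapley-value attributions for each prediction
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X[:5])  # per-feature contributions, first 5 rows
print(shap_values.shape)                    # one attribution per feature per sample
```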


Auditability

Traditional software is relatively static. Once an application is released to production, occasional
enhancements and upgrades are made over time and carefully tracked through DevOps processes and
code control. AI systems are far more dynamic. Significant changes can arise with minimal notice and ML
models may continuously evolve. When developing an AI application, thousands of models, each with different
parameters and dependencies, may be developed, tested, deployed, and used in parallel to adjust dynamically
to changing data and business needs. With all of this complexity, auditing system outcomes and tracing the
many variants – past and present – of ML models can become an overwhelming task.

Smart ML model management is the necessary antidote to enable auditability of AI systems. A smart
ML management framework enables users to track the variety of ML models deployed or being used as
challenger models to current and past production deployments. Each of these ML models captures the
algorithm, libraries, and parameters, along with the times these models were deployed.

In conjunction with ML model management, ML results and associated data are tagged to allow end-to-end
traceability. This is key to establishing data lineage for the thousands of results being generated. Smart ML
model management approaches increase the users’ ability to track machine learning results against specific
models and parameters.

For example, auditability is a mandatory requirement for one C3.ai customer using AI for loan decision
support. Not only does this Fortune 100 enterprise need to be able to recall AI lending decisions immediately
for regulators, but the bank also must be able to highlight specific decisions used at the time of each
recommendation. To meet these requirements, the institution is using C3.ai’s out-of-the-box ML model
management capabilities to capture all models developed and deployed in production.

Behavioral Manipulation

AI agent manipulation is among the most widespread and most concerning of all ethical AI issues. Occurring
primarily in consumer-facing AI, manipulation spans a broad spectrum of concerns from targeted marketing
and behavioral nudges to “fake news” and social manipulation.28 Practices, policies, and guidelines around
behavioral manipulation remain fragmented and uncertain.

A full treatment of this topic is significantly beyond the scope of this publication, but managers working with
AI, particularly consumer-facing AI, should consider carefully the behavioral ethics of algorithms prior
to implementation.


Use Case Prioritization


A crucial initial step in a digital transformation effort can be to perform a use case prioritization exercise to
identify a portfolio of high-priority AI/ML problems appropriate for an enterprise, business unit, or division.
Setting priorities for use cases involves thinking through all of the dimensions mentioned in this chapter,
including problem tractability, economic value, and ethical considerations.

At C3.ai, we have reviewed hundreds of enterprise AI problems over the last decade. A typical first step
involves a full value-chain exploration of high-potential AI/ML use cases. The following figure depicts an
illustrative, high-level value chain map of AI use cases for a financial services company.

Figure 24. High-potential AI/ML use cases for a financial services company. Example from a C3.ai strategic workshop

After developing a case map, business leaders typically want to perform additional exploratory work in
certain areas to further flesh out the most tractable and valuable use cases for their organizations.

Most organizations can conduct a deep-dive exploration and understanding of AI/ML use cases quite
rapidly, without requiring a prolonged strategy phase. The leadership team usually already has the relevant
business knowledge with the help of SMEs. At C3.ai, we have developed a playbook over the past decade
that lets us rapidly identify a portfolio of high-potential AI/ML use cases through a series of screening and
scoping exercises and workshops. The basic principles are quite simple.


First, we ask business leaders and their management teams to fill out a template of the top business problems
that they think could benefit from the application of AI. This activity is performed as pre-work before more
detailed and in-depth workshops and discussions take place. The high-potential use cases outlined in the figure
above can serve as inspiration for such an exercise. But we have found that, in most cases, business leaders
and SMEs have already given significant thought to areas that can benefit from the application of AI/ML. The
following figure shows an example of a pre-work template for use case identification and prioritization.

Use Case Overview (describe in simple terms what value an AI solution would bring to the business):

• Identify early warning signals of clients who are likely to move their investments to another financial
services company, enabling timely engagement of those clients and proactive action to intercept churn.

Use Case 1: Churn Prevention

1. Describe the current business practice. What might be gained with AI as a part of the existing process?
A large financial services company provides a range of wealth management products to individuals, with
over 100,000 clients. AI can help us proactively reach out to customers who have a high propensity to
permanently move their investments to another provider.

2. What is the qualitative value that AI can provide in the use case? If possible, what is the potential
quantitative value driver (calculation methodology is fine here)?
The company believes that application of a machine learning system that aggregates data from internal
and external sources and proactively identifies customers who are likely to churn can unlock significant
value. If churn can be reduced by just 10%, this solution could be worth tens of millions of dollars annually.

3. What constraints exist currently on being able to take advantage of AI in the use case?
The data in the enterprise’s CRM system are very noisy because these systems are not used well. In
addition, financial instruments, fees, transactions, and investment/returns data are in a variety of source
systems and are not unified.

Data Systems and Scope

1. For each use case, what data systems are used currently? Are there any additional systems that would
be leveraged for the use case?
Investments, Deposits, Products/Vehicles and Fees, Returns, Customer Interactions, Contextual Account
Information, Financial Advisor Information.

2. Are there historical data for a machine learning model to be developed? If so, roughly how many records
per year and how many years?
3 years available. We will have to derive the instances of historical customer churn from the data with SME
help, but we think there should be thousands of labels.

Figure 25. Illustrative template to be filled out as pre-work, ahead of use case prioritization workshops


This pre-work activity is then followed by one or more use case prioritization workshops. The workshops can
take many forms. One of the most productive formats involves presentations made by individual managers
proposing their candidate AI/ML use cases to a leadership steering committee. In such a workshop format,
individual managers explain the reasons why they consider their use case to be high-potential and a
top priority.

This format accomplishes two objectives simultaneously. First, it ensures that the business requirements and
value around a specific use case have been thought through well and peer-reviewed by both leadership and
the AI/ML steering committee. Second, this format ensures that the business has bought into the opportunity’s
value and benefits. All too often, enterprise leaders delegate AI/ML initiatives to digital or IT teams, stepping
away from direct involvement. Ultimately, however, the entire business needs to incorporate AI/ML technology
as part of their day-to-day operations in order to unlock value. Incorporating AI/ML involves business process
change and challenging change management activities. A format in which the business actively asks for
investment, early in the process, ensures that there is strong buy-in from business managers, plus interest and
alignment in wholeheartedly implementing the AI/ML technology as part of daily business operations.

Following one or more use case presentation and discussion workshops, most businesses can assemble a
portfolio of AI/ML initiatives to prioritize, resource, and put into production to unlock significant business and
operational benefits to the full enterprise.

The following figure shows an illustrative example of a portfolio of high-potential AI/ML use cases for a
business unit within a financial services company. Time-to-value, tractability, and the actual economic
value are plotted on the chart. Business unit leaders can use this portfolio analysis to plan out their AI/ML
transformation roadmap. In the example below, for instance, they may start with AI projects for customer
churn or anti-money-laundering – efforts that may require a “medium” effort or time to implement, but
are very tractable and have high economic value. These initial projects then can fund additional efforts as
part of an enterprise AI transformation roadmap.


Figure 26: Illustrative portfolio of AI/ML use cases for a business unit at a financial services company

Once the initial use cases are prioritized, enterprises can then prototype those use cases and scale them
into production. The next chapter focuses on best practices in managing AI/ML prototyping efforts.


6. Best Practices in Prototyping
When evaluating and scaling machine learning systems, managers are faced
with constraints: people, technology, budget, and time. To profoundly impact
the organization, managers must balance the need to enable data science
experimentation with the realities that business value is usually only captured after
models are deployed to production and integrated into business processes.

If the prototyping phase is mismanaged or cut short, immature models can
succumb to real-world complexities and be rejected by business teams and end
users. If the experimentation phase is allowed to drag on, the business burns
through precious budget and wastes time on “science experiments.”

This chapter focuses on best practices for the prototyping phase: how to set
it up for success, what to watch out for, and how to know when a model is
good enough.

Problem Scope and Timeframes


Each organization will vary in its capacity to provide leeway for data science teams
to identify solutions for complex business problems. However, it is universal that
rapid demonstration of success and value generation is key to ongoing funding
and resourcing of AI use cases.

In our ten years deploying AI systems, we have seen a common pattern across
business teams. Most teams are interested in the capabilities of AI/ML and are
looking to verify the potential for algorithms to demonstrate operationalizable
business value, usually within 8 to 16 weeks or, at most, 24 weeks. Following
this, business teams either decide to double-down and operationalize the AI
capabilities, or move on to different problems.


In order to demonstrate the value of AI/ML, and to accelerate adoption and ensure future investments, we
recommend carefully managing the scope of the problem based on two criteria:

1. Reduce the scope of initial work to create boundaries around the problem so that it can be solved in
a short period of time, usually no more than 8 to 16 weeks, while at the same time ensuring sufficient
complexity to convince decision makers of the algorithms’ benefits.

2. Ensure that if the initial work is successful, it can be rapidly transitioned into production to unlock
significant economic value.

The following figure demonstrates a typical timeline for a small team to start and complete the
experimentation phase of AI use case development.

Figure 27: Typical timeframe for an AI/ML prototype

Note that managers should ensure that prototype efforts are conducted with a view towards a rapid
transition to production with minimal additional effort, and the corresponding ability to capture significant
economic value.

A good rule of thumb is to seek to operationalize into production an AI/ML prototype within six months of the
prototype effort’s start. This focus on value and time to production can greatly accelerate the organization’s
excitement and interest around AI/ML and its ability to scale up its AI digital transformation efforts.


Cross-Functional Teams
Like any new project, a collaborative, hard-working, well-functioning team is needed to ensure success
of AI/ML prototype efforts. But unlike many business projects, which are often functionally managed in a
single department, AI/ML projects require significant cross-functional expertise.

The number of people required varies based on the complexity of the project, but as with all good software
development efforts, a handful of fully committed resources will consistently produce better outcomes than
a larger, highly fragmented team.

The following figure highlights five key roles in any prototyping effort: data scientist(s), data engineer(s),
project manager, product manager, and developer(s).

Figure 28: A cross-functional team is required to prototype AI/ML applications

The data engineer(s) extract, wrangle, and unify big data from source systems (like data historians
or databases). They are responsible for collaborating with data scientists to ensure the data are
correlated and normalized in a manner conducive to the machine learning models.

The data scientist(s) explore, test, and evaluate machine learning models using historical data. They are
responsible for visualizing data, training machine learning models, computing performance metrics, and
visualizing results.


The developer(s) create the software interfaces that enable end users to consume and act on ongoing model
inferences. Such interfaces can include APIs (or other hooks) to existing systems and applications or full
workflow-enabled, browser-based application user interfaces.

The product manager defines the scope and requirements of the AI/ML application, including the AI insight(s)
necessary, how to surface those insights to users, and the best approach to operationalize those insights
within a business process. The product manager is a bridge between business users, data scientists, and
data engineers. The product manager has to understand the economic value drivers of the machine learning
problem while structuring and guiding the problem-solving effort.

The project manager tracks the activities, timelines, deliverables, and status of the development. The project
manager aligns work activities, tracks progress, identifies solutions to project barriers, and escalates issues
early and often to senior leadership.

Getting Started by Visualizing Data


It is always tempting to dive straight into prototyping and algorithm development activities. But taking some
time at the outset to better understand the data leads to more relevant results and more nuanced insights,
because teams are able to identify data issues early on.

The best way to know whether data contain useful and relevant information for solving a problem using machine
learning is to visualize and study the data. Where possible, it is best to work with SMEs who are familiar with the
problem statement and the nature of the business while visualizing and understanding the relevant data sets.

In a customer attrition problem, for example, a business SME may be intimately familiar with aggregate patterns
in the data, such as higher attrition following pricing changes and lower attrition among customers who use a
particularly sticky product. Such insights, observed in the data, will increase confidence in the overall data set
or, more importantly, will help data scientists focus on data issues to address early in the project.

We typically recommend various visualizations of data in order to best understand issues and develop an
intuition about the data set. Visualizations may include aggregate distributions, charting of data trends over
time, graphic subsets of the data, and visualizing summary statistics, including gaps, missing data, and
mean/median values.

For problem statements that support supervised modeling approaches, it is beneficial to view data with clear
labels for the multiple classes defined, for example customers who have left versus existing customers.


For example, when examining data distributions across multiple classes, it is helpful to confirm that the classes
display differences. If data differences, however minor, are not apparent during inspection and manual
analysis, it is unlikely that AI/ML systems will be successful at discovering them effectively. The following
figure shows an example of plotting data distributions across positive (in blue) and negative (in orange) classes
across different features in order to ascertain whether there are differences across the two classes.

In customer attrition problems, those customers who have left may represent a disproportionate number of
inbound requests to call centers. Such an insight during data visualization may lead data scientists to
explore customer engagement features more deeply in their experiments.

[Figure panels: Portfolio Return, Sentiment, Meetings, Product Rating, Total Return, Number of Sales Professionals Assigned]

Figure 29: Understanding the impact of individual features on outcomes or class distributions can offer
significant insight into the learning problem
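
A minimal sketch of this kind of class-conditional inspection is shown below; the file name and feature columns are hypothetical placeholders for a real customer data set:

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical customer data: 'churned' is the binary label, other columns are features
df = pd.read_csv("customers.csv")  # assumed file with a 'churned' column

for feature in ["portfolio_return", "sentiment", "meetings"]:  # assumed column names
    plt.figure()
    df[df.churned == 1][feature].plot.hist(alpha=0.5, label="churned", density=True)
    df[df.churned == 0][feature].plot.hist(alpha=0.5, label="retained", density=True)
    plt.xlabel(feature); plt.legend(); plt.title(f"{feature} by class")
plt.show()
```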

We also recommend that teams physically print out data sets on paper to visualize and mark up
observations and hypotheses. It is often incredibly challenging to understand and absorb data trends
on screens. Physical copies are more amenable to deep analysis and collaboration, especially if they are
prominently displayed for team members to interact with them, for instance on the walls of a team’s room.
Using wall space and printed paper is more effective than even a very large projector. The following figure
shows a picture of one of our conference room walls that is covered in data visualizations.


Figure 30: Conference room walls covered in data visualizations to facilitate understanding and collaboration during model development

In the following figure, from an AI-based predictive maintenance example, aligning individual time
series signals makes it possible to rapidly scan them for changes that occur before failures.

Figure 31: Example of a time series data visualization exercise as part of an AI-based predictive maintenance prototype.
By visualizing these data, scientists were able to identify small patterns that can later be learned by algorithms.


Common Prototyping Problem – Information Leakage


Information leakage – when information from the future is incorrectly and inadvertently used as part of an
AI/ML prediction task – is one of the most pernicious problems that can affect AI/ML prototyping efforts,
confounding data scientists.

Good models use relevant, available information – inputs – about the past and current states in order to
make an inference – prediction – about the future (e.g., is a customer going to churn?) or about other data
the model does not have access to (e.g., is a customer engaging in money laundering?).

Information leakage occurs when models inadvertently have access to future information presented as
model inputs. For example, in a customer attrition prediction problem, information leakage can be overt – for
instance, the customer may have closed their account but not completed the transaction – or it can be more
nuanced – for instance, if the customer engages in a transaction that is only available to those who have
already engaged services with other businesses.

The information leakage challenge exists because most AI/ML problems have a strong temporal element.
Data therefore have to be carefully represented over time. But most real-world data sets are complex, come
from disparate databases, are updated at differing frequencies and time granularities, and follow complex
business rules. Often, no one individual at a company understands all the data in scope for a problem. Plus,
data scientists are often unaware of the underlying data complexities.

Information leakage often presents itself in the form of terrific model results during prototyping efforts, but
poor results when models are transferred into production.

It can be easy to diagnose information leakage if prototyping results are “too good to be true.” However, in
some cases information leakage can occur even when results seem to be reasonable.

One example of information leakage comes from IBM.29 In 2010, data scientists at IBM developed a machine
learning model to predict potential customers who would purchase IBM software products. The inputs
to the model included information about each potential customer as of 2006, and the goal was to predict
who would become a customer by 2010. However, the IBM team did not have access to historical customer
websites from 2006. They used current website data from 2010 as an input to train the model, thinking this
could substitute for “real” 2006 data.


Figure 32: Overview of IBM machine learning model built in 2010 to predict which customers would purchase IBM products

At first IBM was pleased to see very good results from the machine learning model. But upon
analyzing the relative weights of feature contributions to model predictions, the team quickly realized
a disappointing fact: The top distinguishing characteristic that caused the model to predict which
customers would purchase IBM products was the customer website data. At first, that seemed to be a
reasonable input to the model. But because the website data was current as of 2010 when the model was
trained, the data included names of IBM products that customers had already purchased. Put simply, the
IBM team accidentally included labels identifying who became a customer in their training data.

Figure 33: IBM team had information leakage since model inputs – website data – included explicit labels of
outputs (who became customers by 2010)


Another example of information leakage comes from our own work in AI/ML-based fraud detection. In
this case, we were seeking to predict cases of electricity theft using information from smart connected
electricity meters, work order systems, electricity grid systems, customer information systems, and fraud
investigation systems. The data volumes and data complexity were significant.

One of the first prototype versions of our models had terrific performance. But we soon realized that the
model used a specific work order code – one of scores of codes – to predict a theft event from the official
fraud database. It turned out that the fraud investigation system was time-delayed and the work order code
was an early entry made by some investigators after a fraud event – so not predictive – in order to mark a
specific customer as a potential fraud case prior to the official adjudicated database entry.

This kind of issue can be incredibly complex to debug, especially in feature spaces with many thousands of
features and data from dozens of databases. Some approaches to address information leakage issues of
this nature involve “masking out” a buffer time period before labels – for instance, not using information that
is in close temporal proximity, say two to three days, of the label being predicted. The specific configuration
of the mask requires an understanding of the business problem to be solved and the nuances of the data
sets and databases.
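
As a sketch of such a mask – with hypothetical column names and an assumed three-day buffer – the following pandas code drops feature records that fall too close to the label timestamp:

```python
import pandas as pd

def mask_leakage_window(features: pd.DataFrame, label_time: pd.Timestamp,
                        buffer_days: int = 3) -> pd.DataFrame:
    """Drop feature rows recorded within `buffer_days` before the label timestamp."""
    cutoff = label_time - pd.Timedelta(days=buffer_days)
    return features[features["timestamp"] <= cutoff]

# Hypothetical usage: keep only signals recorded at least 3 days before the fraud label
history = pd.DataFrame({
    "timestamp": pd.to_datetime(["2020-01-01", "2020-01-05", "2020-01-07"]),
    "work_order_code": ["A12", "B07", "F99"],
})
print(mask_leakage_window(history, pd.Timestamp("2020-01-08")))  # last row is dropped
```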

Other approaches involve programmatically analyzing the correlations between variables and labels and
closely examining those that appear to be “too good to be true.”
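
A simple version of this scan, assuming a pandas DataFrame of numeric candidate features plus a binary label column, might look like the following; anything that correlates almost perfectly with the label deserves scrutiny:

```python
import pandas as pd

def suspicious_features(df: pd.DataFrame, label: str = "label",
                        threshold: float = 0.95) -> pd.Series:
    """Flag features whose absolute correlation with the label looks too good to be true."""
    corr = df.corr(numeric_only=True)[label].drop(label).abs()  # recent pandas versions
    return corr[corr > threshold].sort_values(ascending=False)
```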

Finally, examining feature contributions/explainability of the AI/ML algorithm can provide valuable clues
regarding potential information leakage events.


Common Prototyping Problem – Bias


While information leakage focuses on largely temporal effects in data science techniques, bias errors often
result from complexities in the underlying data distributions. There are two common kinds of bias that occur
when prototyping machine learning models: reporting bias and selection bias.

Reporting Bias: A common bias that is often overlooked relates to the provenance of the training data
available to data scientists. Early in C3.ai’s history, for example, we developed machine learning algorithms to
detect customer fraud. In one customer deployment, it was clear to us that the algorithms were significantly
underperforming in one particular geography, a remote island. When we examined the situation further, we
realized there was substantial reporting bias in the data set from the island. Every historical investigation
performed on the island was a fraud case, skewing the data distributions from that island.

It turns out that because of the island’s remoteness, investigators wanted to be sure that a case would be
fraudulent before they would travel there. Because AI/ML algorithms are inherently greedy, in this example
the algorithm incorrectly maximized performance by marking all customers on the island with a high
fraud score.

Because the frequency of events, properties, and outcomes in the training set from that island differed
from their real-world frequency, our teams had to counteract the implicit bias caused by the selective fraud
inspections on the island.

Reporting bias is common where humans are engaged in the initiation, sampling, or recording of data used
for eventual machine learning model training.
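
One hedged way to counteract this kind of reporting bias is to reweight training samples so that an over-inspected segment does not dominate the learned decision boundary. The sketch below uses scikit-learn's per-sample weights; the segment identifier and the 0.1 weight are hypothetical and would need to be calibrated against the real inspection rates:

```python
# Downweight samples from a segment whose labels were selectively reported
# (e.g., a remote region where only near-certain fraud cases were inspected).
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_with_segment_reweighting(X, y, segments, downweight_segment, weight=0.1):
    sample_weight = np.where(segments == downweight_segment, weight, 1.0)
    model = LogisticRegression(max_iter=1000)
    model.fit(X, y, sample_weight=sample_weight)
    return model
```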

Selection Bias: Another common bias in machine learning arises from how data are selected for model
training. It is imperative that teams ensure their training data are representative of
the real-world situation in which the model is to perform. For example, AI/ML models that seek to predict
customer attrition for a bank may need to carefully consider the demographics of the population. Attrition
for high-net-worth individuals is likely to have substantially different characteristics than attrition for lower-
net-worth individuals. A model trained on one set would likely perform quite poorly against the other.

Selection bias is common in situations where prototyping teams are narrowly focused on solving a specific
problem without regard to how the solution will be used and how the data sets will generalize. A machine
learning modeler must ensure that training data properly represent the population or take alternative steps
to mitigate introduction of bias to the model.


Pressure-Test Model Results by Visualizing Them


Data scientists will often focus, rightly, on aggregate model performance metrics like ROC curves and F1
scores, as discussed in Chapter 3. However, model performance metrics only tell part of the story; in
fact, they can obfuscate problematic model issues like information leakage and bias, as discussed above.
Managers should recognize that complex AI problems require nuanced approaches to model evaluation.

Imagine that you are a customer relationship manager and your team has given you a model that predicts
customer attrition with very high precision and recall. You may be excited to use the results from this
model to call on at-risk clients. When you see the daily model outputs, however, they don’t seem to
change much; in fact, customers start calling you to say they are abandoning your business. In
each case, you check the AI predictions and see that the attrition risk values are extremely high on the
day the customers call you, but extremely low on the preceding days. The data scientists have indeed
given you a model with very high precision and recall, but it has zero actionability: you need
sufficient advance warning in order to save those customers. While this is an extreme example –
and usually the formulation of the AI/ML problem would seek to perform the prediction with sufficient
advance notice – the visualization of risk scores is still extremely valuable.

To combat this potential problem before deploying models into production, we recommend a visual
inspection of example interim model results similar to the visual inspections performed on the data
inputs discussed above.

A commonly used technique that we recommend involves producing model outputs that mimic or simulate
how actions and business processes are likely to occur once the model is deployed in production.

For example, if a team is attempting to predict customer attrition, it is imperative to visually inspect attrition
risk scores over time; if the model is indeed useful, it will show a rising risk score with sufficient advance
warning to enable the business to act. For models like this, practitioners can also verify that risk scores
are indeed changing over time – in other words, that customers are potentially low risk for a long period
and that risk scores rise as they grow increasingly dissatisfied with such things as interactions with
their financial advisor, economic returns, costs incurred, or available products and services.
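
As a sketch of this inspection, the following produces a chart like Figure 34 below; `scores` (a pandas Series of predicted risk indexed by date) and `attrition_dates` are hypothetical names:

```python
import matplotlib.pyplot as plt

def plot_risk_over_time(scores, attrition_dates):
    """Overlay predicted risk scores (orange) with true attrition events (blue)."""
    fig, ax = plt.subplots(figsize=(10, 4))
    ax.plot(scores.index, scores.values, color="orange", label="Predicted risk score")
    for d in attrition_dates:
        ax.axvline(d, color="blue", alpha=0.6)  # true attrition events
    ax.set_xlabel("Timestamp")
    ax.set_ylabel("Risk score")
    ax.legend()
    return fig
```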

We recommend following a similar approach to what was used during the data visualization phase of
the prototype work – building out different visualizations and individual case charts, socializing these
among the team and experts, printing out physical copies, and placing them prominently to encourage
interaction, collaboration, and problem solving.


In the example in the following figure, the predicted risk score is visualized in orange. A score of 1.0
corresponds to 100 percent likelihood of customer attrition. The blue vertical lines represent true examples
of attrition.

By visually inspecting plots of model predictions over time, we can see how the models change and
evaluate their effectiveness in real-life situations.


Figure 34: Example output of a trained machine learning model to predict customer attrition (orange) overlaid with true attrition labels (blue)

Model the Impact to the Business Process


Similar to the advance-warning problem described above, prototyping teams must be mindful of the day-to-
day impact their models have on the business. A customer relationship manager may only have time to call
on five customers per day. But if the model requires them to call on an average of 15 per day to reduce the risk of
customer attrition, the business cannot act on the model’s recommendations without adopting fundamental
changes that may not be easy to implement.

We recommend evaluating models to understand the case load of alerts that may be generated – again, using
a replay of history to simulate a real-world scenario.

In the example figure above, it may appear at first glance that there are only four alerts, corresponding to the
four attrition events predicted. In reality, though, we have to take into account the time-based nature of the predictions.
If the model generates results daily, the actual number of daily alerts could easily total 40 (10x the
number of predicted events). If the model generates results hourly, we could be looking at 2,400 alerts!


There may be multiple solutions to the problem:

• Use software post-processing to generate new alerts only when risk scores exceed thresholds for
the first time
• Evaluate scores at an interval that is compatible with business operations
• Improve the model further before promoting it to production

Regardless of the final solution, it is important to think about the end users and what the model will (or
should) require of them and to do so during the prototyping phase.
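
As an illustration of the first option above – alerting only on the first threshold crossing – here is a minimal post-processing sketch, assuming a time-ordered pandas Series of risk scores (the 0.8 threshold is an example):

```python
def first_crossing_alerts(scores, threshold=0.8):
    """Emit an alert only where the score transitions from below to above threshold."""
    above = scores > threshold
    # An alert fires only at the first evaluation above the threshold, not on
    # every subsequent one while the score stays elevated.
    return scores[above & ~above.shift(1, fill_value=False)]
```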

It’s also critical to design the AI/ML-enabled business process appropriately to ensure value capture and
the organizational buy-in that will be required to support any necessary change management. AI/ML
techniques enable fundamental business transformation. Algorithms ideally should be designed – within
the constraints of feasibility – to simplify business process transitions while maximizing value capture.

Part of redesigning the business process could include designing an office review process in which
trained analysts evaluate algorithm results and formally adjudicate cases for promotion to others within
the organization. These analyst roles may not exist prior to the AI/ML transformation but are central to the
fully AI/ML-enabled organization.

Building on the previous customer attrition example, a financial institution could deploy AI/ML-enabled
applications for customer attrition prediction to be used by a team of central office analysts. These analysts
could review risk scores, examine evidence packages, affirm or reject algorithm recommendations, and
capture valuable feedback for data scientists.

After a thorough review, analysts could then promote selected cases to customer relationship
representatives who could interact with clients to reduce customer churn, improve customer
satisfaction, and capture value for the financial institution.

This type of multi-tier business process is just one example of how to design an AI/ML-enabled
organizational process. Other examples could include direct dispatch of cases to field representatives,
support of remote monitoring/engineering functions, or automated control/application of results in
certain cases – and those are just a start.

The principal requirement is a detailed business process evaluation at the time of algorithm prototyping.
The business process should guide algorithm design, including how algorithm performance is evaluated,
how often the algorithm is run, what business value can be captured, and the number of cases,
thresholds, setpoints, precision/recall tradeoffs, and model retraining paradigms.


Model Interpretability is Critical to Driving Adoption


A key barrier to broad AI/ML adoption is that the reasons behind the insights generated are often opaque
to users. People, especially SMEs, are naturally skeptical of AI/ML results when first encountering them;
model interpretability – explainability – is critical to helping drive change management and adoption.

Furthermore, interpretability helps evaluate and troubleshoot machine learning models. Exposing model
interpretability helps users to understand why a model is predicting certain outcomes and how input
features influence predictions.

In general, the more complex a machine learning model is, the harder it is for a human to interpret the
results. For example, deep learning models include many hidden layers of a neural network. With current AI
approaches, it is not possible to identify what the nodes in each layer really represent, and what their relative
importance is.30 In contrast, simpler models like regressions or trees support clearer interpretability because
it is possible to determine the relative importance of each decision element for every predicted output.

In many cases, a more complex model may deliver only a small marginal improvement in performance.
In those cases, managers may want to explicitly consider whether the more complex model
is “worth it” or whether a simpler model with better explainability works best within the overall business
process. In many situations, the organization can get started with a simpler model while it builds trust in AI/
ML techniques. More complex models can be deployed later to take advantage of the associated additional
business benefits once that foundation of trust is in place.

In some cases, more complex models like deep neural networks may be needed to achieve the required
performance. While it is still possible to interpret these models to some extent, there are significant
limitations given current understanding and capabilities.31,27

The following section presents design elements of interpretability and how these can be incorporated into a
new machine learning effort.


Interpretability Overview

Earlier chapters explain that machine learning models identify the feature weights, θ, that minimize a training
loss function. Interpretability techniques introspect θ to give relative importance to the weights.

When performed on the aggregate trained model, we consider the outputs as “global” interpretability. This
contrasts to “local” interpretability that is performed on a specific model prediction (e.g., a specific customer
attrition score). In addition, there are interpretability techniques that are machine learning model-specific or
model-agnostic. These techniques are rapidly evolving in scope and function, but they already open up the
algorithm “black box” to give users guidance on what the model deems important, both globally and locally.

Some machine learning frameworks include interpretability packages that expose the feature contributions for
each model. Feature contribution percentages tell you the relative importance of the inputs that are used by the
model to generate predictions.

A more in-depth treatment of interpretability is beyond the scope of this text, and we would direct readers to
other references.32 However, some of the techniques we use to provide interpretability as a part of model
prototyping and ongoing operations include:

1. Linear Models or Tree-Based Models: These models include logistic regressions, linear regressions, and decision
trees. For these models, it is often possible to understand feature importance explicitly: the model weights
determined during training offer direct insight into the relative importance of each feature.

2. Local Interpretable Model-Agnostic Explanations (LIME): For more complex models, techniques such
as LIME become more important. LIME works by perturbing the model inputs “locally” around a specific
prediction to examine the sensitivity of the prediction to individual features.

3. Shapley Values (SHAP): This concept comes from cooperative game theory and is named after Lloyd
Shapley. The technique computes the average marginal contribution of a feature across all possible coalitions
of features, attributing to each feature its share of the difference between a specific prediction and the
average prediction. Shapley values provide a powerful tool to interpret AI/ML model performance, but the
technique is computationally intensive, particularly for models that have a large number of features.
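
The following is a hedged, synthetic example of techniques 1 and 3 above, using scikit-learn model weights for global importance and the open-source `shap` package for a local explanation; the data and feature construction are fabricated purely for illustration:

```python
import numpy as np
import shap
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression

# Synthetic data: the label depends mostly on the first two features.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=500) > 0).astype(int)

# Technique 1: a linear model's weights double as global feature importances.
linear = LogisticRegression().fit(X, y)
print("Global weights:", linear.coef_)

# Technique 3: SHAP attributes a single prediction to individual features.
tree = GradientBoostingClassifier().fit(X, y)
explainer = shap.TreeExplainer(tree)
print("Local explanation:", explainer.shap_values(X[:1]))
```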


How to Use Interpretability During Model Prototyping

When evaluating models, it is best practice to review the local interpretability for model outputs across
true positives, false positives, and false negatives, where possible. A business user with context should
be able to read the interpretability outputs and understand how they would use the information to make
an informed decision based on the AI insights provided.
Feature contributions for a single prediction:

• Days since last interest rate change: .31
• Change in sum of profitability of cash processing product in the last 90 days, year over year: .26
• Change in revenue of ACH product in the last 30 days, year over year: .25
• Value of credit transactions in the last 30 days: .11
• Credit utilization ratio in the last 90 days: .06
• Count of debit transactions above $100,000 in the last 30 days: .01

Figure 35: Example of a risk score from an AI/ML model charted over time. Details of the local
feature contribution at a specific point in time appear in the list above.

In addition to exposing the feature contribution percentages, model interpretability can be improved
by using human-interpretable feature names. Data scientists may be tempted to use shorthand in
their code to label features with names like “X1” and “X2,” but this shortcut limits the ability to understand
the model results easily. Instead, encourage the use of descriptive names like “Days Since Last Interest Rate
Change” or “Value of Credit Transactions in Last 30 Days.”


Ensuring Algorithm Robustness


Another factor to consider during AI/ML model prototyping is whether the model will be robust when
deployed in production. Robustness involves thinking through whether the prototyped model pipelines
will gracefully handle real-world data, including poor data quality, gaps or missing data, and
potential adversarial attacks.

Model robustness is not entirely an algorithmic task. Business rules – for both pre-processing of data
sets and post-processing after an algorithm is run – can be appropriate to ensure model robustness.
However, robustness must be considered along the entire end-to-end pipeline – from data ingestion to
model outputs, including the way the output will be used in the business process.

One example of the effect of a non-robust model appears in the following figure. In this example, a deep
learning model was trained to label images of pigs. The original model successfully labeled the image on
the left as a “pig.”33

However, in this case, adding just 0.05% noise to the original image – as could occur in an adversarial
attack – leads to a dramatically different outcome. When the modified image was passed through the
same deep learning model, the new prediction labeled the image as an “airliner.”

Deep Learning Systems Can Be Sensitive to Adversarial Attacks33

Figure 36: Small changes that are imperceptible to the human eye – like applying 0.05% noise to an image –
can drastically change the predicted output.

One way of ensuring the robustness of deep learning models such as this involves injecting additional
noisy or potentially adversarial data as part of the model training process, allowing the model to “learn”
how to make predictions in the presence of noise or an adversarial attack. Other techniques may involve
the use of generative adversarial networks (GANs). But there usually is a tradeoff between the robustness
of the model and its performance.
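
Below is a minimal sketch of the noise-injection idea, assuming numpy feature and label arrays. Note that this simple augmentation is not a full adversarial-training scheme, which would generate perturbations against the model itself; the noise scale and number of copies are illustrative:

```python
import numpy as np

def augment_with_noise(X, y, noise_scale=0.05, copies=3, seed=0):
    """Append noisy copies of the training data so the model learns under noise."""
    rng = np.random.default_rng(seed)
    X_aug = [X] + [X + rng.normal(scale=noise_scale, size=X.shape) for _ in range(copies)]
    y_aug = [y] * (copies + 1)
    return np.vstack(X_aug), np.concatenate(y_aug)
```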


Planning for Risk Reviews and Audits


Another critical factor to consider during the algorithm prototyping stage is preparing for reviews by
model risk committees or model audits that may be required post-production.

The best strategy here is to put in place repeatable processes that thoroughly assess machine learning
models before they are used in production environments and that keep models open to audit and
management after they are deployed and operating in production.

If you do not already have one, we recommend instituting a model review board or process to inspect
algorithm details and pipelines before the machine learning model goes into production. The process
does not have to be onerous. At its core, the prototyping team should prepare a written summary and
presentation of the model and prototyping process, along with documentation.

The prototyping team should present:

• The machine learning model’s formulation
• Model explainability factors like feature contribution percentages
• Case study examples of predictions
• Economic value generated by the model
• Potential downside risks of deploying the model
• An action plan for maintaining and updating the model over time

The review board then must approve the project before the model is implemented in production. A clearly
defined process like this helps the project team think critically about the machine learning problem and
ensures that models are properly screened to mitigate potential risks.

The prototyping team should also ensure that appropriate information – including, for example, true/false
positives and evidence packages – is logged and stored when the model is running in production. This
ensures that an audit of the model’s performance can be readily performed.

To simplify tasks around post-production model reviews and audits, the best prototyping teams and
prototyping/production operating software tools will automate or build in the data management, model
management, AI/ML operations, and model audit processes as part of their software pipelines.


7 Best Practices in Ongoing Operations
Setting your team up for success in prototyping is a necessary step toward
positively impacting your business with AI-generated insights. Sustaining that
initial success, however, requires a thoughtful approach to scaling and monitoring
AI algorithms, ensuring user adoption, and monitoring and tracking business
value being generated. In this chapter, we outline best practices distilled from
our years of operating and monitoring AI models in production at global scale
– millions of models operating against ongoing data updates within single-
environment instances.

AI as Part of the Software Development Process
Developing and maintaining complex AI use cases requires a sophisticated
and rigorous approach that includes implementing a method to periodically
improve the deployed algorithms, designing for and monitoring nuanced edge
cases, creating a robust set of automated tests to prevent regressions, and
gracefully alerting administrators to issues. Because the process to implement
highly scalable solutions is analogous to modern software development and
deployment, those processes can be used as a model for AI development.

Given the highly iterative nature of algorithm configuration and application logic development, we
recommend that both algorithm development and application development proceed together, in lockstep,
using modern software development approaches.

This typical development process involves six steps to ensure reliable and
performant code is released to end users:

1. Code reviews: Developers/data scientists review each other’s code to identify potential bugs and to keep
solutions simple and elegant.

2. Unit and integration testing: Developers/data scientists test new functions with existing programs to
identify and resolve issues. Common issues arise when the data input requirements of a new machine
learning model do not match the available data format in existing programs.


3. Generation of a release candidate: Once a candidate “green build” is generated that includes the
required functionality and passes unit and integration tests, the software build is deployed to the
QA environment.

4. Quality assurance: QA testers use the QA environment to test the new functionality. Bugs are
identified and prioritized to be resolved quickly.

5. Testing in preproduction: After QA is complete, the program is promoted to a preproduction
environment. Preproduction is the final validation step to ensure the new features and bug fixes are
fully functional before they are released to all users.

6. Production deployment: A final version of the program is released and available for end users.

Recommended Practice: Manage AI as Part of the Software Development Process

Figure 37: The software development process requires code reviews, testing, release, QA,
preproduction, and production phases.


Testing and Planning for Scale


Having developed a model or set of models, practitioners must migrate them to a live “production”
environment where the models can generate ongoing inferences based on new data and trigger any
downstream actions or alerts.

For most enterprises, this may mean that the model is wrapped within an enterprise application used by
humans to make decisions. For example, a manufacturing organization may embed AI inferences about
equipment within an AI-based predictive maintenance application that provides valuable clues
to maintenance crews. Alternatively, the model could be embedded as a microservice within existing
applications and business processes, or the algorithm’s outputs could be distributed to existing operational
systems (for example, tuning setpoints for controllers).

Deploying a machine learning model to a production environment at scale requires close collaboration and
communication between business and technical stakeholders.

Thought should be given to the data volumes required, the frequency of inferences needed, the number of
end consumers of an application, and the impact on existing operational systems and business processes.

The production environment should meet the needs of the business problem at hand. If many end users
need to access insights simultaneously, be sure that the web-hosted environment can handle high traffic.
If new predictions are required to make rapid decisions, test the inference service level agreements (SLAs)
to be sure the algorithms execute quickly enough to meet the business requirement.
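
Below is a hedged sketch of such an SLA test: time repeated predictions and compare a high percentile – not the mean – against the agreed latency budget. `model` and `sample` are placeholders for the deployed artifact and a representative input; the budget and call count are illustrative:

```python
import time
import numpy as np

def check_inference_sla(model, sample, budget_ms=100.0, n_calls=200):
    """Return (meets_sla, p95_latency_ms) for repeated predictions on `sample`."""
    latencies = []
    for _ in range(n_calls):
        start = time.perf_counter()
        model.predict(sample)
        latencies.append((time.perf_counter() - start) * 1000.0)
    p95 = float(np.percentile(latencies, 95))
    return p95 <= budget_ms, p95
```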

Algorithm Maintenance and Support


Once deployed to production, model performance must be monitored and managed. Models typically
require frequent retraining by data science teams. Some retraining tasks can be automated – for example,
retraining triggered by the availability of new data or new labels, or by model performance drift.

However, for most real-world problems teams should be prepared for data scientists to put in ongoing
work to understand the underlying reasons for model performance degradation or data deviations, and
then to debug or seek to improve model performance over time.

Organizations therefore need to develop and apply significant technical agility to rapidly retrain and
deploy new AI/ML models as circumstances and business requirements evolve.


Practitioners should be aware: all AI/ML models, if not actively managed, will reach the end of their useful life
sooner than anticipated. Planning for that eventuality will ensure your business continues to benefit from AI
models operating at peak performance.

Model Monitoring

Model performance is expected to change over time. This is because business operations are
dynamic, with constant changes to data, business processes, and external environments. The trained model
is representative of a historical period that may or may not still be relevant. As new data are collected over
time, retraining and updating the model will drive ongoing performance improvements.

Model drift and performance monitoring are critical for continued adoption of AI across the organization.
Businesses may employ a variety of techniques to monitor model performance, including capturing data
drift relative to a reference data set, deploying champion as well as challenger models to enable the “hot
replacement” of one model with another as circumstances change, and creating model KPI dashboards to
track performance.

Reference data sets provide a clear baseline of performance for trained models. They are often developed
as part of the prototyping process, require a high degree of vetting, and establish a clear set of bounds in
which the model must perform.
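
One illustrative way to check data drift against such a reference set uses the two-sample Kolmogorov–Smirnov test from SciPy; the p-value threshold and per-feature framing below are assumptions, and other drift metrics (the population stability index, for example) are equally valid:

```python
from scipy.stats import ks_2samp

def drifted_features(reference, live, p_threshold=0.01):
    """Return (feature, KS statistic) pairs whose live distribution departs
    significantly from the vetted reference DataFrame."""
    flagged = []
    for col in reference.columns:
        stat, p_value = ks_2samp(reference[col], live[col])
        if p_value < p_threshold:
            flagged.append((col, stat))
    return flagged
```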

Champion/challenger methods are employed when there are multiple viable model solutions to an AI problem
and there is a clear performance benchmark against which all models can be evaluated. When operating in
production, one of multiple models is selected as the champion, and all or most of the predictions come from
that model. Challenger models also operate in production, but often in shadow mode.

For example, a challenger model will make predictions at the same frequency as the champion model
but may never be exposed to a user or downstream service. If the underlying data change, or if retraining
impacts the models, the challenger may perform better against the predefined benchmark and either alert a
user to promote it to be the new champion, or do so automatically.
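
A toy sketch of the promotion decision follows, assuming both models score the same shadow traffic against a shared benchmark metric (here AUC); the promotion margin and any alerting logic are hypothetical:

```python
from sklearn.metrics import roc_auc_score

def select_champion(y_true, champion_scores, challenger_scores, margin=0.02):
    """Recommend promotion only on a clear win, to avoid churn from noise."""
    champ_auc = roc_auc_score(y_true, champion_scores)
    chall_auc = roc_auc_score(y_true, challenger_scores)
    return "challenger" if chall_auc > champ_auc + margin else "champion"
```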

Most importantly, as we have scaled AI systems at C3.ai, we have emphasized the need to present
executives, business users, and data scientists with up-to-date views on model performance. These
insights, continually updated, provide clarity and transparency to support user adoption and highlight
potential emergent issues.


Model Updates

In practice, machine learning models may be retrained when new, relevant data are available. Retraining
can occur continuously as new data arrive or, more commonly, at a regular interval. It can also occur in an
automated manner or with a human in the loop to verify training data and performance metrics.

In our experience, retraining models at a regular interval or upon availability of a certain amount of new
data balances the tradeoff between high costs associated with compute time of retraining and declining
model performance over time.
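
A minimal sketch of a retraining trigger combining both criteria – elapsed time and accumulated new labeled data – might look like the following; all thresholds are illustrative:

```python
from datetime import datetime, timedelta

def should_retrain(last_trained, new_labeled_rows,
                   interval=timedelta(days=30), min_new_rows=10_000):
    """Retrain on a regular interval or once enough new labeled data arrives."""
    return (datetime.utcnow() - last_trained >= interval
            or new_labeled_rows >= min_new_rows)
```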

Plan for Ongoing Model Updates


Figure 38: Ongoing model retraining and updates drive continuous improvement and ensure that
predictions remain relevant and accurate.


Tracking Value Captured


Business process change is among the hardest challenges in scaling AI systems. Capturing value from AI
requires strong leadership and a flexible mindset.34

Organizations may need to adapt their workforce to accept recommendations from AI systems and provide
feedback to AI systems. This is often challenging. For example, maintenance practitioners who have been
doing their jobs in a specific way for decades are often resistant to new recommendations and practices
that AI algorithms may identify.

In practice, this means showing the model’s success through internal marketing and executive sponsorship,
building AI interpretability into the program, growing data science knowledge in the organization, and
tracking value captured so it is visible to the end users.

Tracking business value visibly in the application being developed or put into production helps to align
stakeholders so that all parties agree on business objectives and the value unlocked and captured by
machine learning models. These reporting numbers can be used for organizational marketing and internal
evangelism. The following figure provides an example of a C3.ai application, where the business value
captured is central to the application experience.


Figure 39: Example of how C3 AI Energy Management tracks and reports value created and captured
with machine learning model recommendations directly within the application

Marketing Program Achievements and Reporting Success

In planning for the successful launch of a new machine learning program, it is critical to engage with
corporate marketing to publish accomplishments across the enterprise. Marketing the achievements of the
program drives talent recruiting externally and technology adoption internally.

As the machine learning and software development team grows in the organization, it needs to be able to
recruit talent from universities with strong computer science, mathematics, and statistics programs. This
may be a new talent pool within an organization. Widespread marketing of the machine learning program’s
successes will help raise brand awareness among people with these skills.

Marketing content can demonstrate the value and benefits of the machine learning program and encourage
end users to feel pride in this new way of working. As an example, the following figure illustrates Shell’s
marketing of their AI efforts on their website as part of their Shell.ai program.


Figure 40: Global energy company Shell advertises the Shell.ai Artificial Intelligence program
directly on their website under Energy and Innovation

In addition to the direct economic value captured by the machine learning program, an organization implementing
a new digital transformation initiative can drive economic performance through analyst reporting and
engagement with the financial community. Reporting on the machine learning program tells the investment community that
meaningful changes in ways of working are taking place and highlights a roadmap for further developments.


8 Building a Strong Team


Typical Required Skillsets
There is significant demand for AI/ML practitioners today, with the existing talent
pool concentrated at a few companies like Google, Facebook, Amazon, and
Microsoft. These companies have often paid significant sums of money to attract
and retain strong AI and ML talent. For example, Google’s 2014 acquisition of
DeepMind Technologies – and its 75 employees – for an estimated $500 million
averaged out to more than $6 million per employee.35

Often, data scientists at businesses tend to be analysts or “citizen data scientists” who are typically trained
in business intelligence and data analysis, but who may have some AI/ML experience. Or they may be
statisticians who are trained in sampling from data sets to draw inferences. Most companies are just starting
their evolution towards AI and do not have a strong bench of AI specialists.

However, there seems to be emerging recognition that strong, technical AI/ML talent will be important in
many industries. On sites like Glassdoor and LinkedIn, machine learning engineers, data scientists, and
big data developers are among the most popular jobs.36 These job postings require candidates with
specialized backgrounds that include both advanced math and software expertise.

Given the strategic business value that stands to be captured from AI/ML and the
critical importance of these technologies in providing competitive advantage,
large enterprises should plan to develop in-house AI/ML expertise.


Many organizations ask us about leveraging citizen data scientists – analysts who have been trained in
some AI/ML techniques – to power their AI transformations. But based on our experience and given the
significant challenges involved in applying AI/ML algorithms – including challenges with framing the problem,
correlation vs. causality, bias in datasets, and information leakage – it is difficult for an enterprise to achieve its
AI transformation without a strong, core, technical AI/ML team. This technical team will be central to unlocking
disproportionate economic value from the most complex applications. In our opinion, the science of AI is still
too early in its development cycle to be entrusted to a non-technical team of citizen scientists and developers.

Nonetheless, we see a strong need for citizen data scientists to support and unlock significant value from a
long tail of less complex AI problems. But we recommend that these citizen scientists be complemented and
supported by a core, seasoned technical team.

The following figure illustrates the concept:

Figure 41: At most enterprises, there are a few AI applications with significant complexity and value plus
a long tail of other use cases with smaller value pools.


The core technical AI/ML team can author and publish advanced algorithms and services that can be leveraged
by citizen scientists. This team can also review algorithms published by citizen scientists to ensure they are robust
prior to their inclusion in critical business processes and decisions.

Given the significant increase in technical data science and AI/ML academic programs over the past decade,
it is feasible for companies to attract and retain this talent. The rise in well-paying data science jobs has caused
a surge in enrollment in data science programs: graduates with degrees in data science and analytics grew by
7.5 percent from 2010 to 2015, outpacing other degrees, which grew only 2.4 percent.37 Today, more than 120
master’s programs and 100 business analytics programs are available in the U.S.

There are also a growing number of data science boot camps and training programs for aspiring data scientists.
These programs take in professionals with strong quantitative backgrounds – in mathematics, physics, or
engineering disciplines – and prepare them for careers in AI. Some of these boot
camp courses are available online. Coursera, for example, offers online curricula for both machine learning
and deep learning.38 Other courses are in-person, such as the Insight Data Science program in the San
Francisco Bay Area.39

We typically recommend that companies seek to hire and develop their AI/ML talent from academic programs,
with potentially a few lateral senior hires to build out the team. Many of our clients already have active university
partnerships and programs in place. But these programs are usually not focused on technical AI/ML talent. Small
changes in engagement and focus for existing university programs can result in significant improvements to an
enterprise’s AI/ML technical team.

Internal recruiting can also play a key role. Many of our clients already have individuals with the right technical
profiles, but they often are dispersed across a wide range of internal teams and departments. We often help
our clients run internal recruiting campaigns using tools such as LinkedIn to recruit and consolidate their existing
talent into AI/ML Centers of Excellence (COEs) that can unlock disproportionate value for the enterprise.

In our own AI/ML recruiting, we have learned that, rather than looking for individuals with skills in specific
techniques, we should select candidates with strong technical skills who also have strong mathematical
foundations and intrinsic problem-solving skills that show their potential for learning a wide range of
algorithmic techniques. In general, algorithmic techniques can be coached and learned over time – and are
constantly evolving anyway – but mathematical fundamentals are much harder to learn.


The following figure summarizes the background and experience we recommend as part of building a
core AI/ML team.

MS or PhD in computer science, electrical engineering, statistics, or equivalent fields

Applied machine learning experience (regression analysis, time series, probabilistic models,
supervised classification and unsupervised learning)

Strong mathematical background (linear algebra, calculus, probability and statistics)

Experience with scalable ML (MapReduce, streaming)

Experience with JavaScript and prototyping languages such as Python and R

Knowledge of engineering or other relevant domains

Ability to drive a project and work both independently and in a team

Figure 42: Typical required skillset for AI/ML practitioners


Candidate Screening and Interview Process


It is important to design the interview process carefully in order to build a strong AI/ML team. Given the
challenges in finding strong technical talent, we typically recommend that our clients create as wide a
funnel as possible and be prepared for a recruiting process that involves screening a large number of
candidates to hire just a few individuals.

Our own AI/ML recruiting process has six steps starting with a fast resume screen, followed by an automated
technical assessment, and subsequently multiple rounds of interviews. The resume review and the automated
technical assessment enable us to screen a large number of candidates quickly.

We give significant thought to the technical assessment and we design it to help us understand a
candidate’s fundamental mathematical skills, general familiarity with AI/ML techniques, and coding skills.
In 2019, we received 7,715 applications to our AI/ML team. We screened most of these candidates based
on the resume and technical tests, interviewed nearly 400 candidates, and hired 17. Most organizations
seeking to build up technical AI/ML talent should expect a similar recruiting funnel. The following figure
illustrates the process.

Figure 43: C3.ai’s AI/ML recruiting funnel for 2019


Team Organization
Organizing data science teams is often a complex endeavor involving a combination of factors. First, data
science is a highly specialized field; it involves experienced practitioners with advanced, professional
degrees. These individuals often require specialized structures, environments, and professional
development opportunities.

Second, in order to capture value from data science, most organizations require data scientists to collaborate
effectively with one another as well as with business users, end users, and other developers and data
engineers in the context of specific products and projects.

Given our experience with managing complex, data science-based products and projects, we typically
recommend and leverage internally a data science organizational structure that operates across
three dimensions.

First, we recommend a core, relatively traditional data science management infrastructure, with individual data
scientists reporting to managers, who in turn report to a VP of Data Science. However, one key difference from
traditional technical management structures, such as those for software development, is that we generally
do not recommend the use of full-time people managers. That is, we typically find more success when
technical leaders who are also good managers are asked to manage small pods of data scientists, while at
the same time still retaining technical responsibility and technical leadership for several tasks.

Second, in addition to the data science management structure, we recommend a separate structure designed
to provide mentorship for data scientists within the organization. Senior scientists are assigned as mentors
to junior ones, with mentor assignments that intentionally do not map to the organizational chart.

Third, we recommend a separate organizational chart for project and product teams that are formed around
specific initiatives. That is, we formally staff data scientists to work on specific project or product initiatives
without regard for the organizational team to which they are assigned. For the duration of the project or
assignment, their primary day-to-day reporting structure follows the project or product team, not their
assigned organizational structure.

The following figure illustrates the concept.


Figure 44: Organizing successful data science teams

Over the last decade at C3.ai, we have also identified several best practices to organize project teams. Project
teams are often cross-functional, requiring data scientists to collaborate with developers and data engineers
to configure and develop AI applications or solutions. These cross-functional project teams require a day-
to-day management structure.

C3.ai and our clients often find a management structure combining both a project manager and a product
manager to be an ideal configuration. In some cases, these roles can be collapsed into a single individual.
The project manager focuses on deliverables, timelines, activities, and reporting – keeping the project train
on the tracks – while the product manager focuses on the application or solution being developed. This
structure maximizes function, scalability, and re-use, minimizes technical debt, and accounts for all aspects
of maintenance, management, and ongoing operations of the AI/ML algorithms.

We typically recommend assigning a senior data scientist or one of the data science team leaders to oversee –
using only part of their time – the AI/ML progress on each project. This senior scientist must spend sufficient time
to stay on top of the details of the work and must be involved in the team’s day-to-day problem solving (Figure 45).

We encourage this senior scientist to advocate actively for the data science perspective on the problem and its
solution. This senior scientist reports separately to executive data science leadership both on progress toward
resolving the data science problem and any roadblocks observed so that they can request assistance as needed.


Figure 45: Typical project team structure: a senior scientist is formally assigned part-time to each project team

Professional Development
It is important for AI/ML organizations at enterprises to consider opportunities for professional development
of data scientists. In addition to coaching and mentorship, we have observed that exposure to a wide range
of problem types and product areas is a critical requirement for the rapid professional development of
scientists early in their careers.

Rotating data scientists across projects is key to giving them exposure to a wide range of problems and to building
out their experience with different problem formulations, solution architectures, frameworks, and algorithms.

To maximize a data science team’s potential, consider regularly rotating scientists to new projects and
products. While there are exceptions, we typically recommend that data scientists be moved to different
projects every three to six months. Rotations are often a win-win – facilitating professional development
while also bringing to bear fresh perspectives on projects and problems.

We also recommend a mixture of project work and product work as part of rotations. We find that data scientists
may develop specific, useful generic or reusable artifacts while working on specific problems. We want to give
the data scientist who identifies and develops the first version of an artifact – for example, a novel unsupervised
anomaly detection pipeline – an opportunity to productize that service so that other data scientists can use it in
their work. We therefore explicitly recommend product rotations for data scientists for more complex work, as
depicted in the following figure.


Figure 46: Typical data scientist rotation across projects and products

We also recommend thinking carefully about the advancement and growth of data scientists across
several dimensions, including core problem-solving skills, AI/ML skills, coding skills, leadership skills, and
communication skills.

Summary
The upshot is that AI/ML is a human capital game. Recruiting, training, organizing, and developing a bench
of talented AI/ML practitioners is a critical success factor for enterprises seeking to transform their
business operations using AI.

Just as organizations must invest in building internal expertise throughout every functional area of the business –
from finance, marketing, and sales to research, manufacturing, and logistics – a strong data science team is critical
to succeeding in the digital era. Organizations that excel at AI/ML, particularly those that take an early lead
in building out AI/ML capabilities, will reap significant, sustained competitive advantages. Those that fall behind
in AI/ML will fare less well.

We have offered here some ideas and concepts that may be useful to enterprise business leaders seeking to
improve their organization’s AI/ML capabilities. Taken together with the rest of this book, we hope we have
presented an actionable and effective management guide to power successful AI/ML business transformations
that capture business value at scale.


Acknowledgements
Machine Learning for Managers has been eight years in the making. This book
synthesizes many of the lessons we have learned at C3.ai – often the hard way – in
designing, developing, and implementing some of the largest and most complex
global enterprise AI/ML applications.

This book is the result of endless intense discussions and debates with talented
colleagues and customers, with hard-won lessons emerging from trying
different solutions to specific challenging problems. The most important lessons
we have learned in data science have often emerged from genuine puzzlement
about why specific, often “textbook,” solutions do not work in the real world. This
book seeks to capture the management lessons we have learned from years of
practical experience.

I would like to acknowledge the significant support and contributions of my
colleagues – who have been instrumental in making this work possible – including
Turker Coskun, Eric Marti, Lila Fridley, Henrik Ohlsson, Louis Poirier, and
Adrian Rami.


About the Author


Nikhil Krishnan, Ph.D., is the Group Vice President of Products at C3.ai. At C3.ai, he is responsible for product
management, product marketing, and AI/machine learning.

Over nearly a decade at C3.ai, Dr. Krishnan has developed deep experience in designing, developing,
and implementing complex, large-scale enterprise AI and ML products and solutions to capture economic
value. This book offers practical advice and insights for managers gathered over years of managing
enterprise AI/ML products and projects.

Dr. Krishnan has extensive experience in unlocking business value from the application of enterprise AI
across industry verticals, including financial services, manufacturing, oil and gas, healthcare, utilities, and
government. He has been involved in large-scale enterprise AI transformations at many of the world’s
largest, most complex, and iconic organizations, including Enel, Bank of America, Baker Hughes, Shell,
Koch Industries, and the United States Air Force.

Prior to C3.ai, Dr. Krishnan was an associate principal at McKinsey & Company, where he was a leader in
McKinsey’s Advanced Industrials and Energy Practices.

Dr. Krishnan was formerly an assistant professor at Columbia University in earth and environmental
engineering. He also worked as a research engineer at Applied Materials, Inc.

Dr. Krishnan earned a bachelor’s degree from the Indian Institute of Technology, Madras, and holds a
master’s and Ph.D. in mechanical engineering from the University of California, Berkeley.


References
1. Hardesty, L., “Probabilistic programming does in 50 lines of code what used to take thousands,” https://phys.org/news/2015-04-probabilistic-lines-code-thousands.html, 2015. Accessed July 21, 2020.

2. There are two kinds of AI, and the difference is important, Popular Science, February 2017, https://www.popsci.com/narrow-and-general-ai.

3. “An Executive Primer on Artificial General Intelligence,” McKinsey & Company, April 2020, https://www.mckinsey.com/business-functions/
operations/our-insights/an-executive-primer-on-artificial-general-intelligence. Accessed July 21, 2020.

4. Ng, Andrew, Machine Learning, Coursera, https://www.coursera.org/learn/machine-learning.

5. Domingos, P., “The Master Algorithm: How the Quest for the Ultimate Learning Machine Will Remake Our World.” Basic Books, 2015.

6. Davenport, T. H., Ronanki, R., “Artificial Intelligence for the Real World,” 2018, https://hbr.org/2018/01/artificial-intelligence-for-the-real-world.
Accessed July 21, 2020.

7. Murphy, Kevin P., Machine Learning: A Probabilistic Perspective, MIT Press, 2012; page 3, online: https://www.worldcat.org/title/machine-learning-a-probabilistic-perspective/oclc/781277861/viewport.

8. Sutton, R. S., Barto, A. G., “Reinforcement Learning: An Introduction,” MIT Press, https://web.stanford.edu/class/psych209/Readings/SuttonBartoIPRLBook2ndEd.pdf. Accessed September 22, 2020.

9. Goodfellow, Ian; Bengio, Yoshua; Courville, Aaron, Deep Learning. MIT Press, 2016.

10. Hastie, T., Tibshirani, R., & Friedman, J., The Elements of Statistical Learning, Second Edition, Springer; e-book, https://web.stanford.edu/~hastie/Papers/ESLII.pdf. Accessed July 21, 2020.

11. AlphaGo, https://deepmind.com/research/case-studies/alphago-the-story-so-far. Accessed September 22, 2020.

12. Hardt, Moritz, Price, Eric & Srebro, Nati. “Equality of opportunity in supervised learning.” Advances in neural information processing systems. 2016.

13. James, G., Witten, D., Hastie, T., & Tibshirani, R., “An Introduction to Statistical Learning with Applications in R,” Springer. 2013, 16.

14. Hayes, B., “Programming Languages Most Used and Recommended by Data Scientists,” https://businessoverbroadway.com/2019/01/13/programming-languages-most-used-and-recommended-by-data-scientists/.

15. Money laundering fines total $8.14bn in 2019, International Investment, https://www.internationalinvestment.net/news/4009055/money-laundering-fines-total-usd-14bn-2019. Accessed September 23, 2020.

16. Jobin, A., Ienca, M. & Vayena, E. “The global landscape of AI ethics guidelines.” Nat Mach Intell 1, 389–399 (2019).
https://doi.org/10.1038/s42256-019-0088-2.

17. Müller, Vincent C., “Ethics of Artificial Intelligence and Robotics,” The Stanford Encyclopedia of Philosophy (Summer 2020 Edition),
Edward N. Zalta (ed.), forthcoming. https://plato.stanford.edu/archives/sum2020/entries/ethics-ai/. Accessed May 27, 2020.

18. “Artificial Intelligence at Google: Our Principles,” https://ai.google/principles/. Accessed May 27, 2020.

19. “Microsoft AI Principles,” https://www.microsoft.com/en-us/ai/responsible-ai?activetab=pivot1%3aprimaryr6. Accessed May 27, 2020.

20. “Rise of the Chief Ethics Officer,” Forbes Insights, March 27, 2019, https://www.forbes.com/sites/insights-intelai/2019/03/27/
rise-of-the-chief-ethics-officer/#5d3d50bb5aba. Accessed May 27, 2020.

21. Amazon scraps secret AI recruiting tool that showed bias against women, Reuters, October 2018, https://www.reuters.com/article/us-amazon-com-jobs-automation-insight-idUSKCN1MK08G. Accessed September 25, 2020.

22. Zemel, Richard S., Wu, Yu, Swersky, Kevin, Pitassi, Toniann & Dwork, Cynthia. Learning fair representations. In Proc. 30th ICML, 2013.

23. Etzioni, A., Etzioni, O. “Incorporating Ethics into Artificial Intelligence.” J Ethics 21, 403–418 (2017). https://doi.org/10.1007/s10892-017-9252-2.

24. Department of Defense Directive 3000.09: Autonomy in Weapon Systems, November 21, 2012. DoD Directive 3000.09.
https://www.hsdl.org/?view&did=726163. Accessed May 27, 2020.

25. Regulation (EU) 2016/679 of the European Parliament and of the Council of 27 April 2016 on the protection of natural persons with regard to the
processing of personal data and on the free movement of such data and repealing Directive 95/46/EC (General Data Protection Regulation)
(Text with EEA relevance). https://op.europa.eu/en/publication-detail/-/publication/3e485e15-11bd-11e6-ba9a-01aa75ed71a1. Accessed May 27, 2020.


26. Ribeiro, M. T., Singh, S., & Guestrin, C., “Why should I trust you?” Explaining the predictions of any classifier. In Proceedings of the 22nd ACM
SIGKDD international conference on knowledge discovery and data mining, pp. 1135–1144, 2016.

27. Lundberg, Scott M., & Lee, Su-In, “A Unified Approach to Interpreting Model Predictions,” Advances in Neural Information Processing Systems, 2017.

28. Costa, E., Halpern, D., “The behavioural science of online harm and manipulation, and what to do about it,” The Behavioral Insights Team, 2019.
https://www.bi.team/wp-content/uploads/2019/04/BIT_The-behavioural-science-of-online-harm-and-manipulation-and-what-to-do-about-
it_Single.pdf. Accessed May 27, 2020.

29. Rosset, S., Perlich, C., “Medical data mining: insights from winning two competitions,” Data Mining and Knowledge Discovery, 2009, https://www.prem-melville.com/publications/medical-mining-dmkd09.pdf. Accessed June 2, 2020.

30. Ghorbani, Amirata, Abid, Abubakar, & Zou, James, “Interpretation of neural networks is fragile,” Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, 2019.

31. Ancona, Marco, et al. “Towards better understanding of gradient-based attribution methods for deep neural networks.” Proceedings of ICLR, 2018.

32. Molnar, C., “Interpretable Machine Learning: A Guide for Making Black Box Models Explainable,” June 15, 2020, https://christophm.github.io/
interpretable-ml-book/shapley.html. Accessed June 26, 2020.

33. Madry, A., & Schmidt, Ludwig, “A Brief Introduction to Adversarial Examples,” https://gradientscience.org/intro_adversarial/.

34. “Putting Artificial Intelligence to Work,” Boston Consulting Group, September 28, 2017, https://www.bcg.com/publications/2017/technology-digital-strategy-putting-artificial-intelligence-work.aspx.

35. Shu, Catherine, “Google acquires artificial intelligence startup DeepMind for more than $500 million,” TechCrunch, January 26, 2014.

36. LinkedIn’s 2017 U.S. Emerging Jobs Report, December 2017, Accessed October 2020.

37. Henke, N., et. al., “The age of analytics: Competing in a data-driven world,” McKinsey Global Institute, December 2016, https://www.mckinsey.com/
business-functions/mckinsey-analytics/our-insights/the-age-of-analytics-competing-in-a-data-driven-world. Accessed October 2020.

38. Coursera Deep Learning Specialization, https://www.coursera.org/specializations/deep-learning. Accessed October 2020.

39. Insight Data Science Fellows Program, https://www.insightdatascience.com/. Accessed October 2020.

1300 Seaport Boulevard, Suite 500, Redwood City, CA 94063
