The Past, Present, and Future of Data Science – Part 2

451alliance.com/Reports/View/ArticleId/1325/The-Past-Present-and-Future-of-Data-Science-Part-2

March 27, 2019

Alliance Members-Only Report

by Krishna Roy & Matt Aslett

This is Part 2 of our three-part report series on the past, present and future of data science in the
modern enterprise. Be sure to read Part 1 if you missed it.

In this installment, we drill into the different elements that allow data science projects to thrive in a
business context.

The Critical Elements of Data Science

Data science platforms and tools are in vogue, and are available from a variety of vendors. But
selecting the appropriate one – or ones – isn’t easy.

There are at least 50 data science tools currently available, resulting in a plethora of choices and
potential confusion. And because data science tools and platforms mature at different rates, the
capabilities they provide vary, creating fresh difficulties when navigating the landscape of products
available.

However, there are some mandatory components and capabilities that 451 Research has identified
which, if left out, could detrimentally affect the quality of analyses an enterprise ends up using.

For successful enterprise data science, it is essential to include the following components for a unified
architecture and workflow process.

Rigorous data access and preparation are vital to prevent the ‘garbage in, garbage out’ problem that
can blight any type of analysis project and result in incomplete or faulty insights.

Tick-box functions include data connectivity to all the internal and external data sources required by a
company for model-building purposes, as well as support for database discovery and data ingestion
to ensure models are generated using all of the appropriate data.

In recent years, the data catalog has also emerged as an essential component to enable enterprises
to identify what data is available across the expanded data management estate (including operational
and analytic databases, as well as data lakes based on Hadoop or object storage, both on-premises
and in the cloud).

Automated Data Preparation

The data needs to be fit for purpose, so data preparation functions are also essential.

Many enterprise data science platforms employ machine learning to aid or automate data prep. For
example, they suggest appropriate data joins and other transformations a user might require, as well
as surfacing outliers, inconsistencies, missing values and other data-quality issues to prevent ‘dirty
data’ from entering the analysis. Machine-learning functionality can also have a role to play in
identifying data that might be associated with regulatory requirements or usage limitations (such as
personally identifiable information).
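As a rough illustration of the kinds of checks such platforms automate, here is a minimal pandas sketch that surfaces missing values and IQR-based outliers per column before data enters an analysis (the function name is hypothetical, not from any particular vendor's product):

```python
import pandas as pd

def profile_quality(df: pd.DataFrame) -> pd.DataFrame:
    """Surface missing values and simple IQR-based outlier counts per column."""
    report = []
    for col in df.columns:
        missing = int(df[col].isna().sum())
        outliers = 0
        if pd.api.types.is_numeric_dtype(df[col]):
            # Flag values beyond 1.5 * IQR as candidate outliers.
            q1, q3 = df[col].quantile([0.25, 0.75])
            iqr = q3 - q1
            mask = (df[col] < q1 - 1.5 * iqr) | (df[col] > q3 + 1.5 * iqr)
            outliers = int(mask.sum())
        report.append({"column": col, "missing": missing, "outliers": outliers})
    return pd.DataFrame(report)
```

A real platform would go further – suggesting joins and transformations, and flagging likely PII – but this captures the 'dirty data' screening step in miniature.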

Automated data preparation is vital if an organization intends to involve individuals with an analytics
background in a data science project. Even if it doesn’t, ML-based data preparation is still a good
idea, because it aspires to make repetitive, mundane data management tasks automatic – enabling
individuals involved in projects to get on with more sophisticated tasks.

Data preparation functionality driven by machine learning – such as automated recommendations – is
increasingly important because data science practitioners of various personas often spend the
majority of their time prepping data. For true data scientists in particular, this is a waste of their
time and skill sets – they should be working on models.

Data Visualization

The graphic depiction of data for exploratory purposes is an important aspect of data science. It can
identify potential relationships or insights that might be hidden in the data and only surface when
presented in a chart.

Visualization is now emerging in multiple aspects of the data science workflow, including dataset
creation, modeling and insight sharing.

It should be considered critical if analysts and line-of-business users will be heavily involved in data
science, because it helps ensure these individuals don’t build models blind, or misunderstand insights.

Natural Language Processing

Natural language processing (NLP) and natural language generation (NLG) are also starting to surface
in data science workflows.

Novice users can often misinterpret data visualizations. The aim of NLP- and NLG-based capabilities is
to support visualizations by explaining elements that could be misconstrued or may not be
immediately obvious. Organizations with large numbers of individuals who are not data-science-
literate will find these capabilities indispensable.
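At its simplest, NLG of this kind is template-based: derive a few salient facts from the data behind a chart and render them as a sentence. A minimal sketch, using only the Python standard library (the function name and metric are illustrative):

```python
from statistics import mean

def describe_series(name: str, values: dict) -> str:
    """Generate a plain-language summary of a metric to accompany a chart."""
    peak = max(values, key=values.get)   # period with the highest value
    low = min(values, key=values.get)    # period with the lowest value
    return (f"{name} peaked at {values[peak]} in {peak}, "
            f"bottomed out at {values[low]} in {low}, "
            f"and averaged {mean(values.values()):.1f}.")
```

Commercial NLG engines are far richer – ranking which facts are worth stating and varying phrasing – but the goal is the same: spell out what a novice might misread in the chart.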

Collaborative UI

Collaborative and role-appropriate features are also critical because enterprises typically have
individuals with varying skill levels who need to cooperate on a data science project.

Data scientists are likely to be comfortable using a collaborative notebook environment, while
marketeers or other line-of-business users might prefer a visual, drag-and-drop interface. Cross-
functional organizations should seek offerings with multiple interfaces to ensure all user personas’
comfort levels are met.

These user interfaces need to be collaborative to enable various participants in the data science
process to work together on data, features and models. Openness and extensibility are required to
enable individuals to plug in their tools and languages of choice and import pre-built models from
them.

Model Building

Model building and selection are other fundamental aspects. Many data science platforms now
incorporate machine learning to automate these functions to make them easier and faster.

However, data scientists should still be able to exert control over the model building process by
overriding automatic selections and actions, so it is important to ensure these functions exist as well.
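A minimal sketch of this pattern, assuming scikit-learn: candidate models are scored automatically by cross-validation, but a data scientist can override the automatic choice (the function, candidate names and `override` parameter are hypothetical, not any vendor's API):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def select_model(X, y, override=None):
    """Pick the best-scoring candidate, unless a human overrides the choice."""
    candidates = {
        "logreg": LogisticRegression(max_iter=1000),
        "forest": RandomForestClassifier(n_estimators=100, random_state=0),
    }
    if override is not None:
        # Manual control: the data scientist forces a specific candidate.
        return override, candidates[override]
    # Automated control: score each candidate and keep the best.
    scores = {name: cross_val_score(est, X, y, cv=5).mean()
              for name, est in candidates.items()}
    best = max(scores, key=scores.get)
    return best, candidates[best]

X, y = make_classification(n_samples=200, random_state=0)
name, model = select_model(X, y)
```

The key design point is the escape hatch: automation proposes, but the practitioner can always dispose.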

An Audit Trail

An audit trail of actions performed by ML-based automation is necessary for governance and
compliance. An audit trail also aids collaboration, letting various users see what their coworkers have
done.

A data and modeling audit trail, security and version management are must-have features: they
provide visibility into the model creation and management process while ensuring it is secure and
compliant with existing enterprise business processes.
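A minimal sketch of what such an append-only audit trail might look like, using only the Python standard library (the class and field names are illustrative, not a standard schema):

```python
import json
import time
from dataclasses import dataclass, field

@dataclass
class AuditTrail:
    """Append-only record of automated and manual modeling actions."""
    entries: list = field(default_factory=list)

    def record(self, actor: str, action: str, detail: str) -> None:
        # actor distinguishes automation (e.g. 'automl') from named users.
        self.entries.append({
            "ts": time.time(),
            "actor": actor,
            "action": action,
            "detail": detail,
        })

    def export(self) -> str:
        # Serialize for compliance review or for sharing with coworkers.
        return json.dumps(self.entries, indent=2)
```

In a real platform the trail would be tamper-evident and tied to version management, but the essentials are the same: who (or what) did what to the data and models, and when.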

Monitoring Data Science Deployment

Deployment and operationalization functions are the final critical elements for enterprise data
science projects to enter production. These capabilities need to be agnostic because an organization’s
data science team, which is typically in charge of creating models, uses different languages and tools
than the IT departments that are largely responsible for rolling them out, managing them and
maintaining them.

Model monitoring is one key component here. Monitoring and managing resource consumption – CPU
and GPU usage, memory, disk and network I/O – is another, since resource pressure will influence how
well the model is running.
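As a small illustration of the resource side, a coarse snapshot of a process's CPU time and peak memory can be taken with Python's standard `resource` module on Unix systems (the function name and the keys of the returned dict are our own, not a monitoring product's API):

```python
import resource

def resource_snapshot() -> dict:
    """Capture a coarse snapshot of this process's resource use (Unix only)."""
    usage = resource.getrusage(resource.RUSAGE_SELF)
    # Note: ru_maxrss is reported in kilobytes on Linux, bytes on macOS.
    return {
        "cpu_user_s": usage.ru_utime,     # CPU seconds in user mode
        "cpu_system_s": usage.ru_stime,   # CPU seconds in kernel mode
        "max_rss": usage.ru_maxrss,       # peak resident set size
    }
```

Production monitoring tools sample metrics like these continuously across hosts and GPUs and alert on drift; the point here is simply what 'resource consumption' means at the process level.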

Data science management is as much an organizational challenge as it is a technology issue. Why?
Debate often rages within an organization over who should be responsible for monitoring and
managing models after they are deployed into production to ensure they are operating as expected.

Most enterprises would prefer that their highly skilled data scientists concentrated on developing and
training new models, rather than ‘babysitting’ models in production. However, IT and other employees
responsible for data management throughout an organization don’t necessarily have the skills to
interpret the relative accuracy or performance of ML-based data science models. For this reason, it is
vital that data science management tools enable collaboration between data scientists and IT
operations teams.

Finally, it is worth noting that data science management is currently the least mature aspect of most
enterprise data science platforms, although 451 Research expects it to be the next frontier for
development. The operationalization of data science projects involving ML and other AI technologies
is set to be a significant aspect of the next wave of developments in the data management space.

Organizing for Data Science

Data science should be considered a team sport.

Like the quarterback, running backs, and wide receivers of American football, data scientists might be
the most visible and celebrated players, but they are unable to deliver success without the tackles,
guards, centers and other players that make up the combined team.

Similarly, business users, data analysts, programmers, and developers are all essential – alongside
data scientists – in ensuring that data science projects are successful in driving business outcomes.

Change Comes from the Top

As with any team sport, you need a head coach. At companies with successful data-driven initiatives,
there’s an executive with a mandate to encourage and enforce a data-driven culture.

Increasingly, that person has the title of chief data officer (CDO) or chief analytics officer (CAO), but we
have also seen head of analytics, as well as both chief digital officer and chief digital and data officer.

In addition to the multiple titles, we have also seen multiple chains of authority: the executive
mandate could mean that the CDO reports directly to the CEO, or at the very least the CIO, and could
potentially be part of the formal C-suite itself. The CDO and their team deliver business value by
changing attitudes to the ways data and analytics are used across a company.

Data Drivers & Data Drifters

This trend is reflected in an Alliance survey on Data and Analytics, in which companies were
categorized as Data Drivers or Data Drifters.

Of the Data Drifters (Alliance members that consider their company to be the least data-driven), 19%
cited lack of support/involvement from senior leadership as a barrier to using data platforms and
analytics. Only 3% of the most data-driven companies, or Data Drivers, had this problem.

The Data Drifters were also much more concerned about a lack of budget (23%) compared to Data
Drivers (16%). Data Drivers were significantly more advanced than the Data Drifters in terms of having
participants outside of IT department influence vendor selection, including executive management
(59% vs 41%), developers (23% vs 6%), and the data science and data analytics group (14% vs 8%), as
well as customer service and support (11% vs 3%).

Bringing these different constituents together to collaborate is easier said than done. One of the key
roles of a CDO is to foster a culture that places data at the heart of the decision-making process and
embraces experimentation as a route to success.

Often, larger organizations will create a center of excellence that drives analytics and data science
innovation through agile development, and acts as a funnel for experimenting around new business
ideas.

If this center of analytics excellence is to be allowed to fail fast, it needs autonomy from the IT
department in terms of technology selection. However, it must also work closely with IT to ensure
there is a smooth transition as projects move from tactical experiments into production deployments.
Maintaining this balance is part of the CDO’s job.
