
A Practical Approach to Discovery Environments

John Thuma, Teradata Corporation, January 2014


Table of Contents

INTRODUCTION: Discovery environments
Let's Get Started!
Sandbox
Production
Test
Development
Discovery
Discovery Explained
The Discovery Process
There are four parts to the Discovery process
Discovery Environment Conclusion
When Discovery becomes Repeatable
Environment Summary
CONCLUSION: How to select the right environment for your organization

INTRODUCTION: Discovery environments

Your organization has either decided, or is about to decide, to invest in a discovery platform. This is
a new paradigm for many organizations, and most struggle to figure out how to maintain, support, or
invest in one. This document attempts to answer several common questions, including:

What is discovery and how do I support it in my organization?
What kinds of discovery environments are there?
What are the differences between the discovery environments?
What happens when discovery becomes repeatable and relied upon?
When do I need a production discovery environment?
How do I govern the environments?

The rest of this document will attempt to answer these questions. It describes the five types of
environments involved in discovery: Sandbox, Discovery, Development, Test, and Production. It also
describes the discovery process and how a discovery becomes production ready, and discusses how
development and test environments can be leveraged to support Discovery and Production activities.
The document concludes with three approaches to building out discovery environments and their
limitations.

When reading this document, it is very important to understand that one size does not fit all. Every
organization will have its own nomenclature, guidelines, SDLC, controls, and support governance
processes. It is recommended that you use this document as a guide and map it to your organization's
practices and procedures. Other factors also bear on this document: data security, privacy, and risk
management. These areas cannot be ignored and will change how the environments are governed and
managed.

Let's Get Started!

The next section of this document will describe the following types of Discovery environments:
Sandbox, Discovery, Development, Test, and Production.

Sandbox: A sandbox environment is a computer system created primarily to support a department or
business team for the purpose of exploring technology or data. The business might use the environment
to explore solutions to problems, to test a technology direction, or for training purposes.

A sandbox environment is not under the support umbrella of the IT infrastructure team. The
environment does not have any availability or reliability service level agreements. If the environment
becomes unavailable, the IT support team has no real responsibility to restore it.

The business also takes full responsibility for security, data protection, and governance requirements
over the environment. The business owner of the environment understands these risks and the
potential harm that could be done to the business if a breach were to occur.

A sandbox environment is great for testing out new computing solutions for a department or specialized
area of the business. These environments are best used for proof of concept or proof of value projects
that have very specific definitions of scope, timelines, and data contents. The benefits of a sandbox
environment are that it does not have to fall under change management processes and allows for fast
capability assessments. Sandbox environments also don't have any performance expectations and
are usually built on underpowered hardware and systems.

A sandbox environment is not a good solution for a project that may become business critical, or for
data that is private or could cause harm if a breach were to occur. These environments are usually not
built with fault tolerance in mind or to adhere to support service level agreements. Sandbox
environments should be considered tactical, not strategic, and should have a short, scheduled life
cycle.

Production: On the opposite side of the environment spectrum is the production environment.
These environments are very strategic to the organization and, if they fail, could have a material
impact on the business as a whole. These environments are highly governed by Information Technology
and fall under strict change control, source control, and service quality controls. Production
environments typically have service level agreements, with penalties and fines when those agreements
are not met. Disruptions to these systems can cause loss of revenue, increased cost, or damage to the
organization's reputation.

Information Technology support organizations will place very tight authentication and authorization
governance over who has access and what kind of access those persons will have within the system.
They will have very specific workflow processes relating to system access, patch management, software
installation, and data manipulation. All activities on production environments will be deliberate and
planned. Performance, reliability, and security will be measured and reported by IT on a regular basis
against well-defined service level requirements. Production environments should be predictable.

Production environments are best suited for mission critical organizational goals that are
operationalized and seamless to the business. They are managed by very strict change controls,
software development lifecycles, and access restrictions. Because of these factors production
environments are not suited for ad-hoc change or data analysis. They also contain very private
information and transaction data that, if changed or lost, could carry regulatory compliance risk. A
security breach of a production system could leak very sensitive private information and could do
serious harm to the reputation of the organization.

Production environments have very tight change control governance, including development platforms,
software code management, and test infrastructure. Change is strictly managed and tested prior to its
implementation. As a result, ad-hoc practices are rare, if they occur at all. Production environments
are considered strategic and are therefore built for years of use and business/organizational
support.

Environment Pendulum: It is critical that we understand the two extreme edges of the discovery
environment chain: Production and Sandbox. These two environments are directly opposite of one
another as far as how they are used, managed, and serviced. There are several environments in between
the two edges that we will now explore.

Test: Test environments are used to support production environments. They typically have the same
environment artifacts as production, including operating systems, patches, and installed software
versions and editions. They also have the same application objects as production, including data
models, analytical algorithms, application data, and security infrastructure. If the organization has
the funds to invest, many like their test infrastructure to have the same power and capabilities as
the production environment. This is ideal, as you want to ensure that changes don't have a negative
impact on the production environment.

Scope of changes that are tested in test environments: There are two basic types of changes: functional
and non-functional. Functional changes are changes to the way software works or to the business
rules that define an application or set of algorithms. Functional changes would include new data science
capabilities or changes to existing models. Non-functional changes are changes that affect performance,
security, reliability, or scalability. Examples of non-functional changes include adding more
capacity, patching an operating system, or upgrading software packages.

Typically, test environments are locked down much like production. It is important that the test
environment has the same governance as the production environment so that when tests are executed
they are under the same constraints and limitations as the production infrastructure. Test
environments are a vital part of software and application promotion activities and change activities.
The major justification for a test environment is to protect the production environment from
disruptions in service by finding mistakes, defects, or bugs and making the required repairs.

Simply put, test environments are a vital part of change management and software development
lifecycles when it comes to managing a true production environment. Production environments were
discussed in the previous section of this document.

Development: Development environments exist specifically for software solution creation, and they
support data science and analytical development projects. These environments are typically not robust
in terms of capability, power, and disk size. They are used to create solutions with a limited set of
test data. Development environments can include source code repositories, integrated development
environments, query tools, defect/design change management tools, and database platforms. They can be
used for sandbox, discovery, and production software development.

Special Considerations for Development Environments: Virtualization and software-only solutions are
viable options when building out a development platform. These platforms enable backup and control
over the software development platform. The development team can also impose controls over who has
access to the development environment and its capabilities. This may be critical if there are data
protection requirements.

Sometimes having a completely portable and standalone development environment is also beneficial.
This is the case when solution developers are not in the office or are unable to reach the office via a
virtual private network. It is entirely possible to have a completely contained, software-only coding
environment on a laptop to support discovery/production development.

Development Support in a Production Discovery Environment: When a development project is for a
production discovery platform, it is important that implementation follows a software deployment
process that includes testing. It is also important that source code is managed through a source
control process with the proper technology applied. This is especially the case when developing custom
analytics and business critical solutions. The process can be as sophisticated as your organization
chooses. Most source control solutions, and even basic web sites, provide a simple check-in and
check-out versioning system. Source control is worthwhile even in pure discovery environments, as
reusing code for other solutions is very useful. We also want to be able to see how an analytical
output was created so that it can be vetted and tested.

Most organizations that deploy a discovery platform already have these tools in place and they should
be used with varying degrees of governance and System Development Lifecycles.

Discovery: Discovery systems are the new kid on the block and are extremely exciting and
somewhat misunderstood. Your organization may be considering a discovery platform, or may have
already implemented one, but is struggling to understand how to manage it and use it properly. The
good news is that you are not alone: most organizations are having this same challenge. The best way
to understand how to manage a discovery platform is to understand what Discovery is, so the next
section will focus on this area.

Discovery Explained: Discovery is a process used to ascertain or learn something about an
organization, its client base, a process, or an operational activity. Organizations that implement
discovery systems are attempting to use data in new ways to implement directional change. These
changes should have a material impact on the organization, such as cost avoidance, revenue creation,
or potentially both. The discovery outputs should be usable by people within the organization to
operationalize and take action. That means people must not only be willing to make changes but also
have the time to operationalize a change in direction. Discovery systems can be vital in understanding
and adapting to changes in customer behavior across a variety of channels, manufacturing
inefficiencies, and supply chain management, as well as in creating new products and services built
around collecting and analyzing data from products such as mobile phones, exercise sensors, healthcare
sensors, and telematics devices. The diagram below, Data Discovery Process, details the four-step
process of data discovery and the people involved.

Diagram: Data Discovery Process

The Discovery Process: Rapid Analytics Development: We have all heard of RAD, or Rapid
Application Development, and now we are starting to hear more and more about Rapid Analytics
Development. Rapid Analytics Development is a process which encompasses tools and people that
enable fail fast, change fast, and succeed fast analytical outputs on a massive scale of data. The
discovery process starts with an intuition or idea with respect to new forms of data, both structured
and unstructured. For many years we have focused on structured data but have lacked the tools and
platforms to add in customer interaction data, including email, clickstream, chat, productivity
documents, and machine logs. The idea is to be able to rapidly construct analytics on a variety of data
sources and structures in order to go from transactions to interactions between people, process, and/or
equipment.

There are four parts to the Discovery process: Data Acquisition, Data Preparation, Analysis, and
Visualization or Information Delivery. The next sections will discuss each part of the process and how it
could impact how you manage your discovery platform. As you read these sections keep in mind Rapid
Analytics Development.

Data Acquisition: Data Acquisition is the process of obtaining and loading data into a discovery
platform. It is critical that you are able to absorb massive amounts of data quickly through a variety
of channels and software/network capabilities. Moving data between systems has always been a major
area of cost in more traditional discovery solutions, and it also represents an opportunity cost to an
organization if slow data movement delays the ability to exploit the data through analytics. There
should be multiple ways to move data between systems, including the network or more traditional file
based mechanisms. Spending less time in data acquisition leaves more time for analytics. More time
for analytics means that analytical outputs have more time to be implemented through changes in the
operations of an organization.

Data Preparation: Data Preparation is the process of transforming the data into an analytics-ready
state. Ask whether your platform provides not only the ability to rapidly ingest data but also simple
tools that enable you to prepare data for analytics rapidly. Data preparation should be able to take
advantage of the hardware platform where the data is located; shared-nothing infrastructures and high
speed networks make this activity more efficient. Again, not spending excessive time in data
preparation enables Rapid Analytics Development.

Analysis: How many lines of code does it require for you to get from an idea to an answer? Does the
platform you are using require highly specialized skills, or does it require more commodity skills like
SQL or SQL-like commands? If you have to write thousands of lines of code to get to an analytic output
then you will be spending more time in solution development and will require detailed tests to validate
the output. The greater the code surface area, the greater the risk that your answer could be
incorrect. Not only is testing impacted, but changing course will also take longer. It is vital that
you select a platform that requires fewer lines of code to get to an answer.

Fail Fast Option: It is important that we understand how the discovery process is impacted by the ability
to rapidly change course for any reason. Does my output accurately answer the use case? Do I have the
right data? Have I prepared it accurately? Have I implemented the right genre of analytics? Rapid
Analytics Development requires a platform that enables the people using it to change fast by providing
the ability to add new data, modify data, or change analytics quickly in order to refine or get the
answer that satisfies the business requirements.

Visualization or Information Delivery: Once the analytics are complete we must be able to quickly
show and demonstrate the results. Discovery outputs can take the form of graphs, Sankey diagrams, and
hierarchies. It is important to be able to not only show data but also show different output styles
that reveal relationships and their strengths, decision patterns, and behavior patterns. These types
of visualizations are atypical in traditional business intelligence systems and offer a very powerful
"what if" style of communicating with your data. We often refer to this as having a conversation with
your data. Many times new intuitions are formed as a result of interacting with the outputs of a
discovery platform and thus start a new analytic process. This reinforces the need for a fail fast,
change fast infrastructure platform to support discovery.
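
As a rough illustration only, the four parts of the Discovery process described above can be pictured
as a small pipeline. The sketch below is a toy under stated assumptions: the record layout, the sample
values, and every function name are hypothetical, and a real discovery platform would perform these
steps at scale inside the platform rather than in a Python list.

```python
from collections import Counter
from typing import Dict, Iterable, List

# Hypothetical raw interaction records: "date,customer,channel" (layout is an assumption).
RAW_EVENTS = [
    "2014-01-03,cust-1,chat",
    "2014-01-03,cust-2,email",
    "2014-01-04,cust-1,clickstream",
    "bad-record",
    "2014-01-05,cust-1,chat",
]


def acquire(lines: Iterable[str]) -> List[str]:
    """Data Acquisition: load raw records into the platform (here, just a list)."""
    return list(lines)


def prepare(records: List[str]) -> List[Dict[str, str]]:
    """Data Preparation: parse records into an analytics-ready form, dropping malformed rows."""
    rows = []
    for record in records:
        parts = record.split(",")
        if len(parts) == 3:
            date, customer, channel = parts
            rows.append({"date": date, "customer": customer, "channel": channel})
    return rows


def analyze(rows: List[Dict[str, str]]) -> Counter:
    """Analysis: count interactions per customer."""
    return Counter(row["customer"] for row in rows)


def deliver(counts: Counter) -> List[str]:
    """Information Delivery: render a plain-text bar per customer, most active first."""
    return [f"{customer} {'#' * n}" for customer, n in counts.most_common()]


counts = analyze(prepare(acquire(RAW_EVENTS)))
report = deliver(counts)
```

Because each stage is a small, separate function, changing course (a new source, a different
preparation rule, a different analytic) touches only one stage, which is the fail fast, change fast
property the text argues for.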

Discovery Environment Conclusion: Rapid analytic development requires a hybrid of the two edges of
the environment pendulum (Sandbox and Production). It must enable flexibility, fast change, and a
more fluid set of standards, practices, and procedures. Data models can swiftly change, new data
sources can be absorbed, new analytic capabilities adapted, and so forth. Strict naming conventions
and waterfall type software development lifecycle approaches are not optimal.

However, there must be some health and monitoring controls over the environment. There must be
some sort of security to protect the data and its sensitivity. We also still want to test the
reliability of the outputs, and they should be defensible by explaining the process it took to achieve
them. Unlike a sandbox, we do not want the discovery platform to become unavailable.

Key differentiators between a Sandbox and a Discovery platform

1. Sandboxes are underpowered; Discovery platforms are massively powerful and scalable.
2. Sandboxes have a short life span; Discovery platforms have long life spans.
3. Sandboxes have little to no support from Information Technology; Discovery platforms are
expected to be available 99.99% of the time.
4. Capacity add-ons are not expected for a Sandbox; Discovery platforms are easily expanded.
5. A Sandbox may not have source control; a Discovery platform should have source control, either
through a source control tool or a Discovery team website.
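
To put the 99.99% availability expectation from item 3 into concrete terms, the arithmetic can be
sketched as follows. The function name and the 365-day year are assumptions made for illustration;
real service level agreements define their own measurement windows and exclusions.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumes a 365-day year, ignoring leap years


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Return the yearly downtime budget, in minutes, implied by an availability target."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be a percentage between 0 and 100")
    return (1 - availability_percent / 100) * MINUTES_PER_YEAR


# Yearly downtime budget for the 99.99% target mentioned in the list above.
budget = allowed_downtime_minutes(99.99)
```

Under these assumptions a 99.99% target leaves a budget of roughly 53 minutes of downtime per year,
which is a very different commitment from a sandbox that has no availability agreement at all.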

Key differentiators between a Production and a Discovery platform

1. Production environments support operational and organizational strategic goals; Discovery
environments support organizational change and esoteric understanding.
2. Production environments do not support ad-hoc activities, and almost all of their activities are
scheduled and deliberate; Discovery environments support both ad-hoc and scheduled
activities.
3. Production environments are tightly controlled by change control and SDLC governance;
Discovery platforms provide change flexibility and relaxed SDLC governance.
4. Production environments support repeatable and predictable activities; Discovery environments
support intuitions and unpredictable activities.

Discovery and the Environment Pendulum: In previous sections of this document we discussed the
Environment Pendulum and its extreme edges: Sandbox and Production. On the left edge is Sandbox and
on the right edge is Production. Somewhere in the middle is the Discovery environment. We think that
the Discovery platform is center/right leaning on the pendulum. Every organization will have its own
requirements, and the type of data will also determine its locus along the pendulum band. If the data
contained in the Discovery system is the single version of the truth for its source, then it will
definitely lean toward the right edge of the pendulum.

When Discovery becomes Repeatable: Consider a real-world scenario: a team of data
scientists worked with the business to discover patterns of behavior related to churn along one of its
subscription based product lines. They were given eight weeks to show whether they could do it or not.
If not, it was on to the next problem or back to the drawing board.

They collected data from three sources (data acquisition), applied some transformations to the data in
preparation for analytics (data preparation), built a churn prediction model (analysis), and then
presented their findings to the business through visualizations and lists of churning customers
(visualization and information delivery). They tested the model for several months and, sure enough,
by making some business adjustments to service center operations and other customer interfacing
systems, they saved the organization two million dollars by retaining high quality customers. The
system was built using the standard Discovery Process. They were so successful that they were then
tasked with turning over requirements to make that operation repeatable.

Let's frame this activity properly. Someone had an idea: what if we could bring together a set of
customer information from a variety of sources and channels and identify a list of customers that could
be on the path to churn? What the team is not doing is being burdened by process, project approvals,
paperwork, and standards and procedures. They also have a discovery environment in place to support
massive and rapid data collection and analytical application development. They aren't concerned with
project management, timelines, standards, naming conventions, and other internal teams. The
leadership of the organization is behind them because it hopes to raise revenue and lower costs by
retaining customers, so cooperation from data source owners is not a problem. Leadership is also
stressing that the business work with the data team to understand how to operationalize findings.

Conclusion: The organization is behind the team, and the team has the equipment and commodity data
science talent built around ANSI SQL. The business is willing to share data and operationalize
findings. The team is successful and is then asked to make the work repeatable and reliable, and it is
expected to produce outputs at the end of every month. This is no longer an ad-hoc project but is now
a repeatable project, with deadlines and deliverables, and is now business critical. It will become
production.

Now that this is production, we need to wrap it in all the traditional software project methodologies,
standards, practices, and other Information Technology governance practices.

We also have to consider investing in a new environment for production discovery, or other options. As
mentioned earlier in this document, production environments are not best suited for ad-hoc activities
like the ones supported in a discovery environment. However, this is just one operationalized
discovery project, so it might make sense for now to add capacity instead of investing in a new
environment.

Environment Summary:
The following matrix summarizes the differences between the environments.

CONCLUSION: How to select the right environment for your organization

All organizations will have different names for each of the environments mentioned in the previous
section of this document. It is important to understand how your organization defines these
environments, their limitations, and the scope of support that is provided. You can use the diagram
below to determine what environments you will need given the category of support required on the left.

Diagram: Environment/Support Matrix

It is recommended that you build your own support matrix based on your organization's naming
conventions and environment support standards. The diagram above is just a guideline to help you
determine your level of maturity along the Discovery Environment pendulum.
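
One possible way to capture a support matrix of your own is as a small lookup table. The sketch below
is hypothetical: the category names, the True/False values, and the environments_for helper are
assumptions that paraphrase the Sandbox, Discovery, and Production descriptions in this document. They
are not taken from the diagram and should be replaced with your organization's own standards.

```python
from typing import Dict, List

# Support categories (rows) by environment (columns). Values paraphrase this
# document's text and are illustrative assumptions, not the original diagram.
SUPPORT_MATRIX: Dict[str, Dict[str, bool]] = {
    "ad_hoc_activity":       {"Sandbox": True,  "Discovery": True,  "Production": False},
    "it_supported":          {"Sandbox": False, "Discovery": True,  "Production": True},
    "strict_change_control": {"Sandbox": False, "Discovery": False, "Production": True},
    "source_control":        {"Sandbox": False, "Discovery": True,  "Production": True},
}

ENVIRONMENTS = ["Sandbox", "Discovery", "Production"]


def environments_for(requirements: Dict[str, bool]) -> List[str]:
    """Return the environments whose support profile matches every stated requirement."""
    return [
        env for env in ENVIRONMENTS
        if all(SUPPORT_MATRIX[category][env] == wanted
               for category, wanted in requirements.items())
    ]


# A team that needs ad-hoc analysis but also expects IT to keep the platform available.
match = environments_for({"ad_hoc_activity": True, "it_supported": True})
```

With the values assumed here, that combination of requirements matches only the Discovery
environment, which mirrors the document's point that Discovery sits between Sandbox and Production.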

Investment also plays a significant role in developing production environments. There are three
approaches that can be followed. One approach is to create different schemas on a single platform that
represent Discovery, Development, Test, and Production functions. This works while the organization
is trying to figure out how to productionize repeatable discovery assets and may not have the
equipment to separate them physically. There are serious limitations, however. For example, if you
are running production jobs on a schedule on a discovery platform, then discovery activities will have
to stop while those jobs are running so that service levels and predictable performance can be
guaranteed. This may not be a big deal if scheduled properly. Sometimes, however, discovery activities
run for long durations, and data loading for production and discovery activities can overlap, causing
unpredictable performance. As more discovery and production activities are added, the risk of this
occurring increases. This is when an investment in a physically separate production environment is
required. Organizations can also add capacity to their existing environments to offset the load. This
is the least expensive option but has the highest level of risk.
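
The scheduling risk on a shared platform can be made concrete with a simple overlap check. This is a
minimal sketch under stated assumptions: the Job type, the job names, and the times are all
hypothetical, and a real workload manager would account for priorities and resource usage rather than
time windows alone.

```python
from datetime import datetime, timedelta
from typing import List, NamedTuple, Tuple


class Job(NamedTuple):
    """A scheduled workload; all names and times used below are hypothetical."""
    name: str
    start: datetime
    duration: timedelta

    @property
    def end(self) -> datetime:
        return self.start + self.duration


def overlaps(a: Job, b: Job) -> bool:
    """Two jobs overlap when each one starts before the other ends."""
    return a.start < b.end and b.start < a.end


def find_conflicts(production: List[Job], discovery: List[Job]) -> List[Tuple[str, str]]:
    """List every (production, discovery) pair that would compete for the platform."""
    return [(p.name, d.name) for p in production for d in discovery if overlaps(p, d)]


night = datetime(2014, 1, 31, 22, 0)
production_jobs = [Job("monthly_churn_scoring", night, timedelta(hours=2))]
discovery_jobs = [
    # A long-running discovery load that spills into the production window.
    Job("clickstream_load", night - timedelta(hours=3), timedelta(hours=4)),
    # A discovery job scheduled well after the production window closes.
    Job("path_analysis", night + timedelta(hours=3), timedelta(hours=1)),
]

conflicts = find_conflicts(production_jobs, discovery_jobs)
```

Here the long-running load collides with the production run while the later job does not. Every added
production or discovery job adds more pairs to check, which is why the risk grows as activity is added.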

Another approach is to separate discovery and production activities physically. This means investing
in a separate environment to support production and test activities only. A test schema and a
production schema can be added so that proper software implementation standards can be followed. There
are limitations to this approach as well, because testing will be going on within the production
environment. Again, adding capacity to the environment can help offset the load, but this is not an
exact science. This is probably a good enough approach.

The third approach is to have three physically separate environments: discovery, test, and production.
That way test is isolated in its own environment and does not negatively impact production. This is
the ideal approach; however, it is the most expensive option. As organizations mature in the discovery
process and repeatable, governed discovery activities become a priority, this environment type will be
required. The good thing is that there should be plenty of financial benefits to justify the
environment investments.

Development environments can be Aster Express installations located on a desktop or in a virtual
environment. It is important to establish good software development practices along with any software
development program. Discovery requires fewer controls with respect to practices, but it is always a
good idea to maintain a copy of the discovery artifacts.

Development environments that support a production discovery platform will definitely want to follow a
strict SDLC and use software development tools, source code control, and other tools to ensure vital
intellectual property is not lost.
