
ATW115
INTRODUCTION TO DATA ANALYTICS
Data Analytics Life Cycle
LEARNING OBJECTIVES
§ Explain the six phases in the data analytics life cycle.
§ Identify the key roles for a successful analytics project.
What is the Data Analytics Life Cycle?
§ The data analytics life cycle is designed for Big Data problems and data science projects. The cycle is iterative, reflecting how real projects actually proceed.
§ To address the distinct requirements for performing analysis on Big Data, a step-by-step methodology is needed to organize the activities and tasks involved with acquiring, processing, analyzing, and repurposing data.
Importance of the Data Analytics Life Cycle
§ The data analytics life cycle is the road map for how data is generated, collected, processed, used, and analyzed to achieve business objectives and goals.
§ It offers a systematic way of managing data into useful information that can help achieve organizational or project goals.
§ It provides guidance and strategies for extracting this information and moving in the appropriate direction to accomplish business goals.
Phases of the Data Analytics Life Cycle
The six phases of the data analytics life cycle are followed one after another to complete one cycle. Notably, the process can move both forward and backward between phases; the phases are iterative.
Phase 1: Data Discovery
§ Stakeholders regularly perform the following tasks: examine business trends, review case studies of similar data analytics projects, and study the domain of the business industry.
§ The entire team assesses the in-house resources, the in-house infrastructure, the total time involved, and the technology requirements.
§ Once all these assessments and evaluations are completed, the stakeholders start formulating the initial hypotheses for resolving the business challenges in terms of the current market scenario.
Key Points
§ The data science team learns about and investigates the problem, identifies the key stakeholders, identifies potential data sources, and interviews the analytics sponsors.
§ Develop context and understanding.
§ Frame the problem to establish which data sources are needed and available for the project.
§ Develop initial hypotheses that can later be tested with data.
Phase 1: Data Discovery
Understanding the domain area of the problem is essential. In many cases, data scientists will have deep computational and quantitative knowledge that can be broadly applied across many disciplines. The team needs to determine the knowledge needed to develop the models.
The team also needs to assess the resources available to support the project. In this context, resources include technology, tools, systems, data, and people. During this scoping, consider the available tools and technology the team will be using, and the types of systems needed in later phases to operationalize the models.
Phase 1: Data Discovery
Framing is the process of stating the analytics problem to be solved. At this point, it is a best practice to write down the problem statement and success criteria. It is also important to establish failure criteria to guide the team on when it is best to stop trying. Share these with key stakeholders.
Another important step is to identify the key stakeholders and their interests in the project. During these discussions, the team can identify the success criteria, key risks, and stakeholders, which should include anyone who will benefit from the project or will be significantly impacted by it.
Phase 1: Data Discovery
The team should plan to collaborate with the stakeholders to clarify and frame the analytics problem. At the outset, project sponsors may have a predetermined solution that may not necessarily realize the desired outcome. In these cases, the team must use its knowledge and expertise to identify the true underlying problem and the appropriate solution.
Phase 1: Data Discovery
Developing a set of initial hypotheses (IHs) is a key facet of the discovery phase. This step involves forming ideas that the team can test with data.
As part of the discovery phase, identify the kinds of data the team will need to solve the problem. Consider the volume, type, and time span of the data needed to test the hypotheses.
Phase 1: Data Discovery
The team should perform FIVE main activities when
identifying potential data sources:
§ Identify data sources
§ Capture aggregate data sources
§ Review the raw data
§ Evaluate the data structures and tools
needed
§ Scope the sort of data infrastructure
needed for this type of problem
Phase 2: Data Preparation
§ A critical stage, as the quality of the data used for analysis has a direct impact on the accuracy and reliability of the results.
§ Data is collected, cleaned, and transformed into a format that is suitable for analysis. This may involve data integration, data cleansing, data enrichment, and data transformation activities.
§ Data visualization techniques may also be used to gain a better understanding of the data and identify any data quality issues.
Phase 2: Data Preparation
In this phase, data is prepared by transforming it from a legacy system into a data analytics form by using the sandbox platform. A sandbox is a scalable platform commonly used by data scientists for data preprocessing. It provides large CPU capacity, high-capacity storage, and high I/O capacity.
The IBM Netezza 1000 is one such data sandbox platform, used by IBM for handling data marts. The stakeholders involved during this phase mostly work on preprocessing the data for preliminary results using a standard sandbox platform.
Key Points
§ Steps to explore, preprocess, and condition data prior to modeling and analysis.
§ Prepare the analytics sandbox; the team extracts, loads, and transforms data to get it into the sandbox, performing ETLT.
§ Survey and visualize the data.
§ Data preparation tasks are likely to be performed multiple times and not in a predefined order.
Phase 2: Data Preparation
The first subphase of data preparation requires the team to obtain an analytic sandbox (also commonly referred to as a workspace), in which the team can explore the data without interfering with live production databases.
ETLT is abbreviated from ETL (extract, transform, load) and ELT (extract, load, transform). Data is extracted in its raw form and loaded into the datastore, where analysts can choose to transform the data into a new state or leave it in its original raw condition. The team also needs to consider how to parallelize the movement of datasets into the sandbox.
Phase 2: Data Preparation
§ Clarifies the data that the data science team has access to at the start of the project.
§ Highlights gaps by identifying datasets within an organization that the team may find useful but may not be accessible to the team today.
§ Identifies datasets outside the organization that may be useful to obtain, through open APIs, data sharing, or purchasing data to supplement already existing datasets.
Phase 2: Data Preparation
Data conditioning refers to the process of cleaning data, normalizing datasets, and performing transformations on the data.
A critical step, data conditioning can involve many complex steps to join or merge datasets or otherwise get datasets into a state that enables analysis in further phases.
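For illustration, here is a minimal data-conditioning sketch in Python with pandas; the file names and column names (orders.csv, customers.csv, order_amount) are hypothetical placeholders, not from the course material.

```python
import pandas as pd

# Load two raw datasets that must be merged before analysis (hypothetical files).
orders = pd.read_csv("orders.csv")
customers = pd.read_csv("customers.csv")

# Join the datasets on a shared key so each order carries customer attributes.
df = orders.merge(customers, on="customer_id", how="left")

# Clean: drop exact duplicates and rows missing the target field.
df = df.drop_duplicates()
df = df.dropna(subset=["order_amount"])

# Normalize: rescale a numeric column to the [0, 1] range.
amin, amax = df["order_amount"].min(), df["order_amount"].max()
df["order_amount_scaled"] = (df["order_amount"] - amin) / (amax - amin)
```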
Phase 2: Data Preparation
§ What are the data sources? What are the target fields (for example, columns of the
tables)?
§ How clean is the data?
§ How consistent are the contents and files? Determine to what degree the data contains
missing or inconsistent values and if the data contains values deviating from normal.
§ Assess the consistency of the data types. For instance, if the team expects certain data to
be numeric, confirm it is numeric or if it is a mixture of alphanumeric strings and text.
§ Review the content of data columns or other inputs and check to ensure they make sense.
For instance, if the project involves analyzing income levels, preview the data to confirm
that the income values are positive or if it is acceptable to have zeros or negative values.
§ Look for any evidence of systematic error.
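Many of the questions above can be checked programmatically. Below is a minimal sketch, assuming a hypothetical income_data.csv with an income column.

```python
import pandas as pd

df = pd.read_csv("income_data.csv")  # hypothetical dataset

# To what degree does the data contain missing values?
print(df.isna().sum())

# Consistency of data types: coerce an expected-numeric column and count
# the entries that were actually alphanumeric strings or text.
coerced = pd.to_numeric(df["income"], errors="coerce")
print("non-numeric entries:", int(coerced.isna().sum() - df["income"].isna().sum()))

# Sanity-check the contents: is it acceptable for income to be zero or negative?
print("zero or negative incomes:", int((coerced <= 0).sum()))
```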
Phase 2: Data Preparation

After the team has collected and obtained at least some of the
datasets needed for the subsequent analysis, a useful step is to
leverage data visualization tools to gain an overview of the data.
Seeing high-level patterns in the data enables one to understand
characteristics about the data very quickly.
One example is using data visualization to examine data quality,
such as whether the data contains many unexpected values or
other indicators of dirty data.
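As a minimal sketch of this step, pandas and matplotlib can produce a quick statistical and visual overview (the file name is hypothetical).

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("income_data.csv")  # hypothetical dataset

# High-level summary statistics for a first look at the data.
print(df.describe())

# Histograms of every numeric column reveal unexpected values, skew,
# and other indicators of dirty data at a glance.
df.hist(figsize=(10, 6), bins=30)
plt.tight_layout()
plt.show()
```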
Phase 2: Data Preparation
When pursuing with a data visualizaAon tool or staAsAcal package, the
following guidelines and consideraAons are recommended.
A. Review data to ensure that calculaAons remained consistent within
columns or across tables for a given data field. For instance, did
customer lifeAme value change at some point in the middle of data
collecAon? Or if working with financials, did the interest calculaAon
change from simple to compound at the end of the year?
B. Does the data distribuAon stay consistent over all the data? If not,
what kinds of acAons should be taken to address this problem?
C. Assess the granularity of the data, the range of values, and the level
of aggregaAon of the data.
Phase 2: Data Preparation
D. Does the data represent the population of interest? For marketing
data, if the project is focused on targeting customers of child-rearing
age, does the data represent that, or is it full of senior citizens and
teenagers?
E. For time-related variables, are the measurements daily, weekly,
monthly? Is that good enough? Is time measured in seconds
everywhere? Or is it in milliseconds in some places? Determine the
level of granularity of the data needed for the analysis and assess
whether the current level of timestamps on the data meets that
need.
F. Is the data standardized/normalized? Are the scales consistent? If
not, how consistent or irregular is the data?
G. For geospatial datasets, are state or country abbreviations consistent
across the data? Are personal names normalized? English
units? Metric units?
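Several of these guidelines (B and E in particular) can be probed with a few lines of pandas; the sketch below assumes a hypothetical sales.csv with timestamp and amount columns.

```python
import pandas as pd

df = pd.read_csv("sales.csv", parse_dates=["timestamp"])  # hypothetical

# Guideline B: does the distribution stay consistent over all the data?
# Compare summary statistics of the target field month by month.
monthly = df.groupby(df["timestamp"].dt.to_period("M"))["amount"].agg(
    ["mean", "std", "count"]
)
print(monthly)

# Guideline E: the smallest gap between consecutive timestamps indicates
# whether measurements are daily, hourly, or finer-grained.
gaps = df["timestamp"].sort_values().diff().dropna()
print("smallest gap between measurements:", gaps.min())
```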
Phase 2: Data Preparation
Several tools are commonly used for this phase:
1. Hadoop
2. Alpine Miner
3. OpenRefine
4. Data Wrangler
The figure shows a summary of the data preparation steps.
Phase 3: Model Planning
§ The data analytics team plans the methods to be adopted and the various workflows to be followed.
§ The division of work among the team members is decided, to clearly define each member's workload.
§ The data prepared in the previous phase is further explored to understand the various features and their relationships, and feature selection is performed before applying the data to the model.
Phase 3: Model Planning
After mapping out your business goals and collecting a glut of data (structured, unstructured, or semi-structured), it is time to build a model that utilizes the data to achieve the goal.
There are several techniques available to load data into the system (illustrated in the sketch below):
§ ETL (Extract, Transform, and Load) transforms the data first, using a set of business rules, before loading it into a sandbox.
§ ELT (Extract, Load, and Transform) first loads raw data into the sandbox and then transforms it.
§ ETLT (Extract, Transform, Load, Transform) is a mixture; it has two transformation levels.
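To make the three loading techniques concrete, here is a minimal Python sketch using pandas, with SQLite standing in for the sandbox datastore; the file, table, and column names are hypothetical, and the "business rules" are a placeholder.

```python
import sqlite3
import pandas as pd

raw = pd.read_csv("raw_events.csv")       # hypothetical extract
sandbox = sqlite3.connect("sandbox.db")   # stand-in for the sandbox datastore

def business_rules(df):
    """Placeholder transformation: keep only valid, deduplicated rows."""
    return df.dropna(subset=["event_id"]).drop_duplicates("event_id")

# ETL: transform first, then load the cleaned data.
business_rules(raw).to_sql("events_etl", sandbox, if_exists="replace", index=False)

# ELT: load the raw data as-is; transform later inside the datastore.
raw.to_sql("events_raw", sandbox, if_exists="replace", index=False)

# ETLT: a light transform before loading, a second transform after.
pre = raw.drop_duplicates("event_id")  # first transformation level
pre.to_sql("events_staged", sandbox, if_exists="replace", index=False)
staged = pd.read_sql("SELECT * FROM events_staged", sandbox)
business_rules(staged).to_sql("events_etlt", sandbox, if_exists="replace", index=False)
```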
Phase 3: Model Planning
§ This step also includes the teamwork to determine the methods,
techniques, and workflow to build the model in the subsequent phase.
The model's building initiates with identifying the relation between data
points to select the key variables and eventually find a suitable model.

§ Data sets are developed by the team to test, train and produce the
data. In the later phases, the team builds and executes the models that
were created in the model planning stage.
Key Points
§ Learn about the relationships between variables and, subsequently, select a model.
§ The data science team develops datasets for training, testing, and production purposes (see the sketch below).
§ The team builds and executes models based on the work done in the model planning phase.
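A minimal sketch of developing training and testing datasets with scikit-learn, assuming a prepared dataset with a hypothetical "target" column:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("prepared_data.csv")  # hypothetical output of data preparation
X = df.drop(columns=["target"])
y = df["target"]

# Hold out 20% of the data for testing; the rest is used for training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```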
Phase 3: Model Planning
§ Aim to capture the most essential predictors and variables rather than considering every possible variable that may influence the outcome.
§ Test a range of variables to include in the model, and then focus on the most important and influential variables (see the sketch after this slide).
The team's main goal is to choose an analytical technique, or a short list of candidate techniques, based on the end goal of the project. Rules and conditions are grouped into several general sets of techniques, e.g. classification, association rules, and clustering.
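As a minimal sketch of focusing on the most essential predictors, numeric candidate variables can be ranked by their correlation with the outcome; the file and column names are hypothetical.

```python
import pandas as pd

df = pd.read_csv("prepared_data.csv")  # hypothetical prepared dataset

# Rank numeric candidate predictors by absolute correlation with the target,
# then focus modeling effort on the strongest ones.
corr = df.corr(numeric_only=True)["target"].drop("target").abs()
print(corr.sort_values(ascending=False).head(10))
```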
Phase 3: Model Planning
Common Tools for the Model Planning Phase
§ R has a complete set of modeling capabilities and provides a good environment for building interpretive models with high-quality code.
§ SQL Analysis Services can perform in-database analytics of common data mining functions, involved aggregations, and basic predictive models.
§ SAS/ACCESS provides integration between SAS and the analytics sandbox via multiple data connectors such as ODBC, JDBC, and OLE DB.
Phase 4: Model Building
§ The next phase of the life cycle is model building, in which the team works on developing datasets for training and testing as well as for production purposes.
§ The execution of the model, based on the planning made in the previous phase, is carried out.
§ The kind of environment needed for the execution of the model is decided and prepared, so that if a more robust environment is required, it can be applied accordingly.
Key Points
§ An analytical model is developed, fit on the training data, and evaluated (scored) against the test data.
§ Although the modelling techniques and logic required to develop models can be highly complex, the actual duration can be short compared to the time spent preparing the data and defining the approaches.
§ Consider whether existing tools will suffice for running the models or whether a more robust environment is needed for executing them.
§ Refine the models to optimise the results, for example by modifying variable inputs or reducing correlated variables where appropriate (see the sketch below).
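A minimal model-building sketch with scikit-learn, reusing the hypothetical X_train/X_test/y_train/y_test split shown earlier and assuming a classification target; the 0.95 correlation threshold is an illustrative choice, not a rule from the slides.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Fit on the training data and score against the held-out test data.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))

# Refinement example: find highly correlated input pairs and drop one
# column from each pair before refitting.
corr = X_train.corr(numeric_only=True).abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
model.fit(X_train.drop(columns=to_drop), y_train)
print("refit accuracy:", model.score(X_test.drop(columns=to_drop), y_test))
```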
Phase 4: Model Building
§ Does the model appear valid and accurate on the test data?
§ Does the model output/behaviour make sense to the domain experts? That is,
does it appear as if the model is giving answers that make sense in this context?
§ Do the parameter values of the fitted model make sense in the context of the
domain?
§ Is the model sufficiently accurate to meet the goal?
§ Does the model avoid intolerable mistakes? Depending on context, false positives
may be more serious or less serious than false negatives
§ Are more data or more inputs needed? Do any of the inputs need to be
transformed or eliminated? Will the kind of model chosen support the runtime
requirements?
§ Is a different form of the model required to address the business problem? If so, go
back to the model planning phase and revise the modelling approach.
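The false-positive versus false-negative question above can be quantified with a confusion matrix; a minimal sketch, assuming the fitted model and binary test set from the earlier sketches:

```python
from sklearn.metrics import confusion_matrix

# For a binary classifier, ravel() yields the four cell counts.
tn, fp, fn, tp = confusion_matrix(y_test, model.predict(X_test)).ravel()
print(f"false positives: {fp}, false negatives: {fn}")
# Whether false positives or false negatives are the intolerable mistake
# depends entirely on the business context.
```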
Phase 4: Model Building
Commercial Tools: Free or Open-Source tools:

§ SAS Enterprise Miner § R and PL/R


(PL/R is a procedural language for
§ SPSS Modeler (provided
PostgreSQL with R – Using this
by IBM and now called approach means that R commands
IBM SPSS Modeler) can be executed in database.)
§ MATLAB § Octave
§ Alpine Miner § WEKA
§ STATISTICA § Python
§ Mathematica § SQL in-database
implementations, such as MADlib
Phase 5: Communicate Results
§ The team checks the results of the project to determine whether it was a success or a failure.
§ The results are scrutinized by the entire team, along with the stakeholders, to draw inferences from the key findings and summarize the entire work done.
§ Business value is quantified, and an elaborate narrative on the key findings is prepared and discussed among the various stakeholders.
Key Points
§ After executing the model, the team needs to compare the outcomes of modeling to the criteria established for success and failure.
§ The team considers how best to articulate findings and outcomes to the various team members and stakeholders, taking into account warnings and assumptions.
§ The team should identify key findings, quantify the business value, and develop a narrative to summarize and convey the findings to stakeholders.
Phase 6: Operationalise
§ A final report is prepared by the team, along with the briefings, source code, and related documents.
§ Run a pilot project to implement the model and test it in a real-time environment.
§ As data analytics helps build models that lead to better decision-making, it in turn adds value to individuals, customers, business sectors, and other organizations.
Key Points
§ The team communicates the benefits of the project more broadly and sets up a pilot project to deploy the work in a controlled way before broadening it to the full enterprise of users.
§ This approach enables the team to learn about the performance and related constraints of the model in a production environment on a small scale and make adjustments before full deployment.
§ The team delivers the final reports, briefings, and code (a sketch of this hand-off follows).
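A minimal operationalisation sketch: the fitted model from the earlier sketches is persisted with joblib and reloaded in the pilot environment to score a small batch; the file names are hypothetical.

```python
import joblib
import pandas as pd

# Persist the fitted model so it can be delivered with the final report.
joblib.dump(model, "model.joblib")

# In the pilot environment, on a small scale:
pilot_model = joblib.load("model.joblib")
new_records = pd.read_csv("pilot_batch.csv")  # hypothetical production feed
scores = pilot_model.predict(new_records)
# Monitor performance and constraints here before full deployment.
```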
Key Outputs for Each Stakeholder
Example:
Consider a retail store chain that wants to optimize its product prices to boost revenue. The chain has thousands of products across hundreds of outlets, making this a highly complex scenario. Once you identify the store chain's objective, you find the data you need, prepare it, and go through the data analytics life cycle.
You observe different types of customers, such as ordinary customers and customers like contractors who buy in bulk. You suspect that treating the various types of customers differently could yield the solution. However, you don't have enough information about this and need to discuss it with the client team.
In this case, you need to agree on the definition, find the data, and conduct hypothesis testing to check whether the various customer types impact the model results and produce the right output (see the sketch below). Once you are satisfied with the model results, you can deploy the model, integrate it into the business, and roll out the prices you consider optimal across the chain's outlets.
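As a minimal sketch of the hypothesis test described above, a two-sample Welch t-test can check whether the two customer types really behave differently; the transactions.csv file and column names are hypothetical.

```python
import pandas as pd
from scipy import stats

df = pd.read_csv("transactions.csv")  # hypothetical transaction data
ordinary = df.loc[df["customer_type"] == "ordinary", "basket_value"]
contractor = df.loc[df["customer_type"] == "contractor", "basket_value"]

# Welch's t-test: do the two customer types have different mean basket values?
t_stat, p_value = stats.ttest_ind(ordinary, contractor, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
# A small p-value supports treating the two customer types differently.
```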
Advantages of Data Analytics
§ Identification of potential risks
Businesses operate in high-risk settings and thus need efficient risk management solutions to deal with problems. Creating efficient risk management procedures and strategies depends heavily on big data. The data analytics life cycle and its tools help minimize risks by optimizing complicated decisions about unforeseen occurrences and prospective threats.
§ Reducing costs
§ Increasing efficiency
Key Roles of a Successful Analytics Project
While proceeding through these six phases, the various stakeholders that can be involved in planning, implementation, and decision-making are data analysts, business intelligence analysts, database administrators, data engineers, executive project sponsors, project managers, and data scientists. All these stakeholders are rigorously involved in the proper planning and completion of the project, keeping in mind the various crucial factors for the project's success.
Key Roles of an Analytics Project
Each one of these roles plays a critical part in a successful analytics project.

Business User
§ Understands the domain
§ Usually benefits from the results
§ Can consult and advise the team on the context of the project, the value of the results, and how the outputs will be operationalised
§ Typically a business analyst, line manager, or deep subject matter expert

Project Sponsor
§ Provides the impetus and requirements for the project
§ Defines the core business problem
§ Provides the funding and gauges the degree of value from the final outputs
§ Sets the priorities for the project
§ Clarifies the desired outputs

Project Manager
§ Ensures key milestones and objectives are met on time and at the expected quality

Business Intelligence Analyst
§ Provides key performance indicators (KPIs), key metrics, and business intelligence from a reporting perspective
§ Creates dashboards and reports

Database Administrator (DBA)
§ Configures the database environment to support the analytics needs of the working team
§ Provides access to key databases or tables
§ Ensures the appropriate security levels are in place

Data Engineer
§ Leverages deep technical skills to assist with tuning SQL queries for data management and data extraction
§ Provides support for data ingestion into the analytic sandbox
§ Executes the actual data extractions
§ Performs substantial data manipulation to facilitate the analytics

Data Scientist
§ Provides subject matter expertise for analytical techniques, data modelling, and applying valid analytical techniques to given business problems
SUMMARY
§ A systematic way of managing data into useful information
§ A step-by-step methodology that provides guidance and strategies
§ Six phases that are iterative
Summary
Phase 1 – Discovery
§ Learn about the business domain
§ Assess available resources
§ Frame business problems and formulate initial hypotheses

Phase 2 – Data Preparation
§ Requires an analytic sandbox
§ Get data into the sandbox

Phase 3 – Model Planning
§ Determine methods, techniques, and workflow
Summary
Phase 4 – Model Building
§ Develop datasets
§ Build and execute models
§ Is the existing tool adequate?

Phase 5 – Communicate Results
§ Decide with stakeholders whether the project is a success or a failure
§ Convey findings to stakeholders

Phase 6 – Operationalise
§ Deliver reports, briefings, code, and technical documents
Summary
§ The seven (7) key roles needed in a
team for a successful analytics project
Tutorial Discussions
1. In which phase would the team expect to invest most of the project time? Why? Where would the team expect to spend the least time?
2. What are the benefits of doing a pilot program before a full-scale rollout of a new analytical methodology? Discuss this in the context of the mini case study.
3. What kinds of tools would be used in the following phases, and for which kinds of use scenarios?
   a. Phase 2: Data preparation
   b. Phase 4: Model building