Chapter 14 Big Data & Data Science


KEMENTERIAN KEUANGAN

REPUBLIK INDONESIA
INSPEKTORAT JENDERAL

WORKSHOP DAMA-DMBOK
BIG DATA & DATA SCIENCE

INSPEKTORAT I
Jakarta, 29 January 2021

Integrity Professionalism Synergy Service Excellence


AGENDA
1. Introduction
2. Essential Concepts
3. Activities
4. Tools
5. Techniques
6. Implementation & Guidelines
7. Big Data & Data Science Governance

2/3/2021 2
Inspektorat Jenderal toward IACM 4 2
1. Introduction

[Diagram: Big Data & Data Science]

1. Introduction (2)

❑ Big Data and Data Science are connected to significant technological changes that have allowed people to generate, store, and analyze ever larger amounts of data.
❑ People can use that data to predict and influence behavior, as well as to gain insight into a range of important subjects.
❑ Taking advantage of Big Data requires changes in the way that data is managed.

1. Introduction (3)
[Context diagram: Big Data & Data Science]

2 Essential Concepts

• Data Science
• Data Science Process
• Big Data Components
• Big Data Architecture
• Sources of Big Data
• Data Lake
• Service-Based Architecture
• Machine Learning
• Sentiment Analysis
• Data & Text Mining
• Predictive Analysis
• Prescriptive Analysis
• Unstructured Data Analysis
• Operational Analysis
• Data Visualization
• Data Mashups

2 Essential Concepts (2)
Data Science
• Developing predictive models that explore data content patterns uses the scientific method.
• The Data Science process follows the scientific method of refining knowledge: making observations, formulating and testing hypotheses, observing results, and formulating general theories that explain those results.

Data Science Process
1. Define Big Data strategy and business needs
2. Choose data sources
3. Acquire and ingest data sources
4. Develop Data Science hypotheses and methods
5. Integrate and align data for analysis
6. Explore data using models
7. Deploy and monitor

Big Data Components
1. Volume
2. Velocity
3. Variety
4. Viscosity
5. Volatility
6. Veracity

Big Data Architecture
• The selection, installation, and configuration of a Big Data and Data Science environment require specialized expertise.
• End-to-end architectures must be developed and rationalized against existing data exploratory tools and new acquisitions.
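The seven-step process above can be sketched as a toy pipeline. Everything below is a hypothetical illustration (the feed names, the values, and the trivial "predict the mean" model are invented for this sketch, not part of DMBOK):

```python
# Toy sketch of steps 2-7 of the Data Science process.

def choose_sources():
    # Step 2: choose data sources (two hypothetical feeds).
    return {"feed_a": [10, 12, 11], "feed_b": [13, 9, 14]}

def acquire_and_ingest(sources):
    # Step 3: ingest each source, capturing simple metadata alongside it.
    return [{"source": name, "values": vals, "count": len(vals)}
            for name, vals in sources.items()]

def integrate(ingested):
    # Step 5: align the common data for analysis (here, just merge values).
    merged = []
    for item in ingested:
        merged.extend(item["values"])
    return merged

def explore_with_model(data):
    # Step 6: a deliberately trivial "model" -- predict the observed mean.
    mean = sum(data) / len(data)
    return lambda: mean

# Step 7: "deploy" the model and monitor its prediction.
model = explore_with_model(integrate(acquire_and_ingest(choose_sources())))
print(round(model(), 2))  # -> 11.5
```

In a real program each step would be far richer (quality checks, metadata capture, model selection), but the chained shape of the process is the point here.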

2 Essential Concepts (3)
Sources of Big Data
• Big Data is produced through email, social media, online orders, and even online video games.
• Devices that interact directly with the Internet generate a large portion of Big Data.

Data Lake
• A data lake is an environment where a vast amount of data of various types and structures can be ingested, stored, assessed, and analyzed.
• A data lake can quickly become a data swamp: messy, unclean, and inconsistent. To establish an inventory of what is in a data lake, it is critical to manage Metadata as the data is ingested.

Service-Based Architecture
• SBA is emerging as a way to provide immediate data, as well as to update a complete, accurate historical data set, using the same source.

Machine Learning
• Programming machines to quickly learn from queries and adapt to changing data sets has led to a completely new field within Big Data.
• Machine Learning explores the construction and study of learning algorithms:
➢ Supervised learning: based on generalized rules
➢ Unsupervised learning: based on identifying hidden patterns
➢ Reinforcement learning: based on achieving a goal
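The learning styles above differ in what the algorithm is given. As a toy illustration (hypothetical function names, invented data, no external libraries), supervised learning generalizes from labelled examples while unsupervised learning finds hidden structure in unlabelled ones:

```python
def nearest_neighbour(train, x):
    # Supervised: labelled (value, label) pairs; predict by the closest example.
    return min(train, key=lambda pair: abs(pair[0] - x))[1]

def two_means(points, iters=10):
    # Unsupervised: no labels; discover two hidden clusters (1-D k-means).
    a, b = min(points), max(points)  # initial centroids
    for _ in range(iters):
        ca = [p for p in points if abs(p - a) <= abs(p - b)]
        cb = [p for p in points if abs(p - a) > abs(p - b)]
        a, b = sum(ca) / len(ca), sum(cb) / len(cb)
    return a, b

labelled = [(1.0, "low"), (2.0, "low"), (8.0, "high"), (9.0, "high")]
print(nearest_neighbour(labelled, 7.5))  # -> high
print(two_means([1.0, 2.0, 8.0, 9.0]))  # -> (1.5, 8.5)
```

Reinforcement learning is harder to show in a few lines: instead of labels or raw structure, the algorithm receives a reward signal and learns behavior that achieves a goal.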

2 Essential Concepts (4)
Sentiment Analysis
• Media monitoring and text analysis are automated methods for retrieving insights from large unstructured or semi-structured data.
• This is used to understand what people say and feel about brands, products, or services.
• Using Natural Language Processing (NLP), semantic analysis can detect sentiment and also reveal changes in sentiment to predict possible scenarios.

Data & Text Mining
• Data mining: analysis that reveals patterns in data using various algorithms.
• Text mining: analyzes documents to classify content automatically.
• Data and text mining use a range of techniques:
➢ Profiling
➢ Data reduction
➢ Association
➢ Clustering
➢ Self-organizing maps

Predictive Analysis
• A sub-field of supervised learning where users attempt to model data elements and predict future outcomes through evaluation of probability estimates.
• Insight: What is likely to happen?

Prescriptive Analysis
• Defines actions that will affect outcomes, rather than just predicting outcomes from actions that have occurred.
• Prescriptive analytics can continually take in new data to re-predict and re-prescribe; this process can improve prediction accuracy and result in better prescriptions.
• Scenario: What should we do to make things happen?
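To give the flavor of automated sentiment detection, here is a deliberately simplified lexicon-based scorer. Real NLP pipelines use trained models and far richer lexicons; the word lists and sentences below are hypothetical:

```python
# Tiny hypothetical sentiment lexicons (real systems use thousands of terms).
POSITIVE = {"great", "love", "excellent", "good"}
NEGATIVE = {"bad", "poor", "hate", "terrible"}

def sentiment_score(text):
    # Count positive hits minus negative hits; sign gives the sentiment.
    words = text.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

print(sentiment_score("The service was great and the product is excellent"))  # -> 2
print(sentiment_score("Terrible support, bad experience"))                    # -> -2
```

Tracking such scores over time is what lets media monitoring reveal *changes* in sentiment, not just a single snapshot.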

2 Essential Concepts (5)

Unstructured Data Analysis
• Combines text mining, association, clustering, and other unsupervised learning techniques to codify large data sets.

Operational Analysis
• Activities like user segmentation, sentiment analysis, geocoding, and other techniques applied to data sets, e.g., for marketing campaign analysis.
• Operational analytics involves tracking and integrating real-time streams of information, deriving conclusions based on predictive models of behavior, and triggering automatic responses and alerts.

Data Visualization
• The process of interpreting concepts, ideas, and facts by using pictures or graphical representations.
• Data visualizations condense and encapsulate the characteristics of data, making them easier to see.
• In doing so, they can surface opportunities, identify risks, or highlight messages.

Data Mashups
• Combine data and services to create visualizations for insight or analysis.
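Even the simplest visualization condenses data into a pattern the eye can pick up at once. A minimal sketch (the categories and counts are hypothetical) using a plain-text bar chart:

```python
def bar_chart(data, width=20):
    # Scale each value against the maximum and render it as a row of '#'.
    top = max(data.values())
    lines = []
    for label, value in data.items():
        bar = "#" * round(value / top * width)
        lines.append(f"{label:<10}{bar} {value}")
    return "\n".join(lines)

print(bar_chart({"email": 40, "social": 25, "orders": 10}))
```

The relative bar lengths make the dominant source obvious without reading a single number, which is exactly the "easier to see" property described above.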

3 Activities

1. Define Big Data Strategy & Business Need
❑ Define the requirements that identify desired outcomes with measurable tangible benefits.
❑ A Big Data strategy must include criteria to evaluate:
✓ What problems the organization is trying to solve
✓ What data sources to use or acquire
✓ The timeliness and scope of the data to provision
✓ The impact on and relation to other data structures
✓ The influence on existing modelled data

2. Choose Data Sources
❑ Identify gaps in the current data asset base and find data sources to fill those gaps.
❑ As more data becomes available, it needs to be evaluated for worth and reliability.

3 Activities (2)

3. Acquire and Ingest Data Sources
❑ Data sources need to be found and loaded into the Big Data environment. During this process, capture critical Metadata about the source.
❑ Once the data is in a data lake, it can be assessed for suitability for multiple analysis efforts.
❑ Before integrating the data, assess its quality. The assessment process provides valuable insight into how the data can be integrated with other data sets.

4. Develop Data Hypotheses and Methods
❑ Define model algorithm inputs, types, or model hypotheses and methods of analysis.
❑ Each model will operate depending on the analysis method chosen. It should be tested for a range of outcomes.
❑ Models depend on both the quality of input data and the soundness of the model itself.

5. Integrate and Align Data for Analysis
❑ Preparing the data for analysis involves understanding what is in the data, finding links between data from the various sources, and aligning common data for use.
❑ Apply appropriate data integration and cleansing techniques to increase the quality and usefulness of provisioned data sets.
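The pre-integration quality assessment mentioned above can be sketched as a few simple checks. The records, field names, and checks below are hypothetical examples; real assessments profile many more dimensions:

```python
# Hypothetical ingested records with two deliberate quality problems.
records = [
    {"id": 1, "amount": 100.0},
    {"id": 2, "amount": None},   # missing value
    {"id": 2, "amount": 150.0},  # duplicate id
    {"id": 3, "amount": 75.0},
]

def assess_quality(rows, key="id"):
    # Produce a small quality profile: row count, missing values, duplicate keys.
    ids = [r[key] for r in rows]
    return {
        "rows": len(rows),
        "missing_amount": sum(r["amount"] is None for r in rows),
        "duplicate_ids": len(ids) - len(set(ids)),
    }

print(assess_quality(records))
```

A profile like this tells the integration step what cleansing is needed (impute or drop the missing value, resolve the duplicate key) before the source can be aligned with other data sets.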

3 Activities (3)

6. Explore Data Using Models
❑ Process:
✓ Populate the predictive model
✓ Train the model
✓ Evaluate the model
✓ Create data visualizations
❑ Training entails repeated runs of the model against actual data to verify assumptions and make adjustments, such as identifying outliers.

7. Deploy and Monitor
❑ The presentation of findings and data insights is the final step in a Data Science investigation.
❑ Insights should be connected to action items so that the organization benefits from the Data Science work.
❑ The presentation of findings and data insights usually generates questions that start a new process of research.
❑ Data Science is iterative, so Big Data development is iterative to support it. Learning from a specific set of data sources often leads to the need for different or additional data sources, both to support the conclusions found and to add insights to the existing model(s).
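The populate / train / evaluate loop above can be sketched with a toy least-squares line fit and a held-out test set. The observations are invented and the model is intentionally minimal; the point is the shape of the loop, not the model:

```python
# Populate: hypothetical (x, y) observations, split into train and test sets.
train = [(1, 2.1), (2, 3.9), (3, 6.2), (4, 7.8)]
test  = [(5, 10.1), (6, 11.9)]

def fit_line(points):
    # Train: least-squares slope and intercept from the training data.
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    slope = (sum((x - mx) * (y - my) for x, y in points)
             / sum((x - mx) ** 2 for x, _ in points))
    return slope, my - slope * mx

def mse(model, points):
    # Evaluate: mean squared error on data the model has never seen.
    slope, intercept = model
    return sum((slope * x + intercept - y) ** 2 for x, y in points) / len(points)

model = fit_line(train)
print(round(mse(model, test), 3))  # -> 0.037
```

A small error on unseen data suggests the assumptions hold; a large one would send the investigation back to adjust the model or hunt for outliers, which is exactly the iteration the slide describes.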

4 Tools
Advances in technology have created the Big Data and Data Science industry.

MPP Shared-nothing Technologies & Architecture
In MPP databases, data is logically distributed across multiple processing servers, with each server having its own dedicated memory to process data locally.

Distributed File-based Databases
Distributed file-based solutions, such as the open source Hadoop, are an inexpensive way to store large amounts of data in different formats.

In-database Algorithms
An in-database algorithm uses the principle that each of the processors in an MPP shared-nothing platform can run queries independently, so a new form of analytics processing can be accomplished by providing mathematical and statistical functions at the computing node level.

Big Data Cloud Solutions
There are vendors who provide cloud storage and integration for Big Data, including analytic capabilities.

Statistical Computing and Graphical Languages
R is an open source scripting language and environment for statistical computing and graphics. It provides a wide variety of statistical techniques.

Data Visualization Tools
Advanced visualization and discovery tools use in-memory architecture to allow users to interact with the data. A visual pattern can be picked up quickly when thousands of data points are loaded into a sophisticated display.

5 Techniques

Analytic Modelling

Analytic models are associated with different depths of analysis:

❑ Descriptive modelling summarizes or represents the data structures in a compact manner. This approach does not always validate a causal hypothesis or predict outcomes. However, it does use algorithms to define or refine relationships across variables in a way that could provide input to such analysis.
❑ Explanatory modelling is the application of statistical models to data for testing causal hypotheses about theoretical constructs. While it uses techniques similar to data mining and predictive analytics, its purpose is different. It does not predict outcomes; it seeks to match model results only with existing data.

Key to predictive analytics is learning by example through training the model. The performance of a learning method relates to its predictive ability on independent test data.

Big Data Modelling

❑Apply proven data modelling techniques while accounting for the variety of sources.
❑Develop the subject area model so it can be related to proper contextual entities and placed into the overall
roadmap.
❑Understand how the data links between data sets.
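To make the contrast in depth concrete: a descriptive model only represents the data compactly, with no prediction at all. A minimal sketch (the observed values are hypothetical):

```python
from statistics import mean, pstdev

def describe(values):
    # Descriptive modelling: a compact summary of the data, nothing more.
    return {
        "n": len(values),
        "mean": mean(values),
        "stdev": round(pstdev(values), 2),
    }

observed = [4, 8, 6, 5, 7]
print(describe(observed))  # {'n': 5, 'mean': 6, 'stdev': 1.41}
```

An explanatory or predictive model would go further, fitting parameters against hypotheses or unseen outcomes; the summary here is purely a compact representation of what was observed.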

6 Implementation Guidelines

Strategy Alignment
Any Big Data/Data Science program should be strategically aligned with organizational objectives. Establishing a Big Data strategy drives activities related to the user community, data security, Metadata management (including lineage), and Data Quality Management.
The strategy should document goals, approach, and governance principles.
The ability to leverage Big Data requires building organizational skills and capabilities. Use capability management to align business and IT initiatives and project a roadmap.

Readiness Assessment / Risk Assessment
Assess organizational readiness in relation to critical success factors:
✓ Business relevance
✓ Business readiness
✓ Economic viability
✓ Prototype
Likely the most challenging decisions will be around data procurement, platform development, and resourcing.

Organization and Cultural Change
Business people must be fully engaged in order to realize benefits from advanced analytics. A communications and education program is required to effect this.
A Center of Excellence can provide training, start-up sets, design best practices, data source tips and tricks, and other point solutions or artifacts to help empower business users toward a self-service model.
Big Data implementation will bring together a number of key cross-functional roles, including: Big Data Platform Architect, Ingestion Architect, Metadata Specialist, Analytic Design Lead, and Data Scientist.

7 Big Data & Data Science Governance

Sourcing:
What to source, when to source, and what the best source of data is
for a particular study

Sharing:
What data sharing agreements and contracts to enter into,
terms and conditions both inside and outside the organization

Metadata:
What the data means on the source side, how to interpret the
results on the output side

Enrichment:
Whether to enrich the data, how to enrich the data, and the
benefits of enriching the data

Access:
What to publish, to whom, how, and when

THANK YOU

