Sne20130717 Bigdata Arch Brainstorming v02 NBDWG20130724
Big Data Architecture Framework (BDAF)
Outcome of the Brainstorming Session
at the University of Amsterdam
(17 July 2013)
Yuri Demchenko (facilitator, reporter),
SNE Group, University of Amsterdam
Presented to NIST Big Data WG, Subgroup Architecture
25 July 2013
Available http://bigdatawg.nist.gov/_uploadfiles/M0055_v1_7606723276.pdf
Outline
Big Data definition
5 Vs of Big Data: Volume, Velocity, Variety, Value, Veracity
Data Origin and Target
* Big Data Architecture Framework (BDAF) Proposed Context for the discussion
Data Models, Structures, Types
Data formats, non/relational, file systems, etc.
Big data is high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making.
Gartner, http://www.gartner.com/it-glossary/big-data/
Big Data: a massive volume of both structured and unstructured data that is so large that
it's difficult to process using traditional database and software techniques.
From The Big Data Long Tail blog post by Jason Bloomberg (Jan 17, 2013). http://www.devx.com/blog/the-big-datalong-tail.html
Data that exceeds the processing capacity of conventional database systems. The data is
too big, moves too fast, or doesn't fit the structures of your database architectures. To gain
value from this data, you must choose an alternative way to process it.
*) The Fourth Paradigm: Data-Intensive Scientific Discovery. Edited by Tony Hey, Stewart Tansley, and Kristin Tolle. Microsoft, 2009.
* 5 Vs of Big Data
Volume: Terabytes, Records/Arch, Transactions, Tables, Files
Velocity: Batch, Real/near-time, Processes, Streams
Variety: Structured, Unstructured, Multi-factor, Probabilistic
Value: Statistical, Events, Correlations, Hypothetical
Veracity: Trustworthiness, Authenticity, Origin, Reputation, Availability, Accountability
Volume, Velocity and Variety are the commonly accepted 3Vs of Big Data.
5 Vs of Big Data: extended attribute list
Volume: Terabytes, Records/Arch, Tables, Files, Distributed
Velocity: Batch, Real/near-time, Processes, Streams
Variety: Structured, Unstructured, Multi-factor, Probabilistic, Linked, Dynamic
Value: Statistical, Events, Correlations, Hypothetical
Veracity: Trustworthiness, Authenticity, Origin, Reputation, Availability, Accountability
Additional properties: Timeliness; Mobility (mobile/remote access; from other domain roaming; federation); Accountability; Availability
Service Oriented Architecture (SOA): First proposed in 1996 and revived with the
Web Services advent in 2001-2002
Currently standard for industry, and widely used
Provided a conceptual basis for Web Services development
Computer Grids: Initially proposed in 1998 and finally shaped in 2003 with the
Open Grid Services Architecture (OGSA) by Open Grid Forum (OGF)
Currently remains as a collaborative environment
Migrates to cloud and inter-cloud platform
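Data origin (rows) and target (columns). Cell-to-column alignment is not recoverable; ratings are listed in their original order.
Column header fragments: New Technology; Manufacturing; Transport; Personal services, campaigns; Living Environment, Infrastructure, Utility; Healthcare support
Science: +++++, ++++, ++, +++
Telecom: ++++, ++, ++++
Industry: ++, ++++, +++++, ++
Business: +++, ++, ++
Living environment, Cities: ++, ++, ++, ++, +++++
Social media, networks: ++, ++++, ++
Healthcare: +++, ++, ++, +++++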
Traditional model: BIG Storage and BIG Computer with FAT pipe
Move compute to data vs move data to compute
New Paradigm: continuous data production, continuous data processing
Diagram labels: Big Data, Network, Big Computer, Data Bus, Visualisation, Distributed Compute
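The "move compute to data vs move data to compute" trade-off can be illustrated with a rough transfer-time estimate. This is a minimal sketch, not part of the original slides: the function name, the dataset size, the code size and the link speed are all illustrative assumptions, and latency and protocol overhead are ignored.

```python
def transfer_seconds(size_bytes: float, bandwidth_bits_per_s: float) -> float:
    """Time to push size_bytes through a link, ignoring latency and overhead."""
    return size_bytes * 8 / bandwidth_bits_per_s


# Illustrative numbers only (assumptions, not measurements).
DATASET_BYTES = 1e12   # a 1 TB dataset
CODE_BYTES = 50e6      # a 50 MB analysis program or container image
LINK_BPS = 10e9        # a 10 Gbit/s "fat pipe"

# Traditional model: ship the dataset to the big computer.
move_data = transfer_seconds(DATASET_BYTES, LINK_BPS)
# Data-centric model: ship the code to where the data already lives.
move_compute = transfer_seconds(CODE_BYTES, LINK_BPS)

print(f"move data to compute: {move_data:.0f} s")
print(f"move compute to data: {move_compute:.2f} s")
```

Under these assumed numbers the dataset transfer dominates by several orders of magnitude, which is the intuition behind distributing compute towards the data.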
Architecture vs Ecosystem
Big Data undergo a number of transformations during their lifecycle
Big Data fuel the whole transformation chain
Model
Data (raw)
Metadata
Relations
Functions
Information
Presentation
Data: The lowest layer of abstraction (?) from which information can be
derived
Information: A combination of contextualised data that can provide meaningful
value or usage/action (scientific, business)
Actionable data
Presentation (?)
Where is knowledge (as a target of learning)?
Data/Process model(s)
Lifecycle stages: Data Source; Data Collection and Registration; Data Filter/Enrich, Classification; Data Analytics, Modeling, Prediction; Data Delivery, Visualisation; Consumer
Data forms (labelled with PID and Metadata in the diagram): Data (raw); Data (structured, datasets); Model data, statistical data; Data (archival, actionable); Datasets
Consumer outputs: Visualised models; Biz reports, Trends; Controlled Processes; Social Actions
Identifiers: PID=UID+time+Prj; DatasetID={PID+Pfj}; ModelID?=?
Security issues: Referral integrity; Traceability; Opacity
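The identifier rule PID=UID+time+Prj can be sketched as a small value type. This is only one possible reading: the slides do not define how the three parts are joined, so the field names, the colon separator and the timestamp format below are assumptions made for illustration, and the DatasetID and ModelID rules are left out because the slides leave them open.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class PID:
    """Persistent identifier built from user ID, time and project (PID=UID+time+Prj)."""
    uid: str
    time: datetime
    prj: str

    def __str__(self) -> str:
        # Assumed serialisation; the slides only name the three components.
        stamp = self.time.strftime("%Y%m%dT%H%M%SZ")
        return f"{self.uid}:{stamp}:{self.prj}"


def make_pid(uid: str, prj: str, when: Optional[datetime] = None) -> PID:
    """Create a PID, normalising the timestamp to UTC (naive datetimes are read as local time)."""
    when = when or datetime.now(timezone.utc)
    return PID(uid=uid, time=when.astimezone(timezone.utc), prj=prj)


# Hypothetical user and project names.
example = make_pid("user42", "projA", datetime(2013, 7, 17, 12, 0, tzinfo=timezone.utc))
print(example)
```

Making the type immutable (`frozen=True`) reflects the idea that a persistent identifier should not change once it has been registered.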
Matrix of components against components, rated ++ or +++ (cell alignment not recoverable):
Data Models, Structures
Data Lifecycle
BigData Infrastructure
BigData Analytics
BigData Applications
BigData Management, Operation
BigData Security
Matrix of five component groups against each other, rated ++ or +++ (cell alignment not recoverable):
Data Models & Structures
Data Management & Lifecycle
BigData Infrastructure & Operations
BigData Analytics & Applications
BigData Security
Data models
Depend on target/goal, or process/object?
Evolve or chain/stack?
[ref] NIST Big Data WG discussion http://bigdatawg.nist.gov/home.php
Papers/Reports
Archival Data
Usable Data
Classified/Structured Data
Raw Data
Referrals
Control information
Policy
Data patterns
Data lifecycle stages: Data Source; Data Collection & Registration; Data Filter/Enrich, Classification; Data Analytics, Modeling, Prediction; Data Delivery, Visualisation; Consumer
Storage: General Purpose
Compute: General Purpose; High Performance Computer Clusters
Storage: Specialised Databases, Archives (analytics DB, in memory, operational)
Data Management
Data categories: metadata, (un)structured, (non)identifiable
Network Infrastructure, Internal
Infrastructure Management/Monitoring
Analytics: Realtime, Interactive, Batch, Streaming
Analytics Applications: Link Analysis, Cluster Analysis, Entity Resolution, Complex Analysis
Storage: General Purpose
Compute: General Purpose; High Performance Computer Clusters
Storage: Specialised Databases, Archives
Data Management
Data categories: metadata, (un)structured, (non)identifiable
Federated Access and Delivery Infrastructure (FADI)
Network Infrastructure, Internal
Infrastructure Management/Monitoring
Data (inter)linking?
Lifecycle stages: Data Source; Data Collection & Registration; Data Filter/Enrich, Classification; Data Analytics, Modeling, Prediction; Data Delivery, Visualisation; Consumer
Data Analytics Application
Data repurposing, Analytics re-factoring, Secondary processing
Persistent identifier
Traceability vs Opacity
Referral integrity
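Persistent identifiers, traceability and referral integrity can be made concrete with a small registry in which every derived record points back to the record it came from. This is a minimal sketch under stated assumptions: the class names, the stage strings and the single-parent link are hypothetical simplifications, not a design taken from the slides.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Record:
    pid: str                      # persistent identifier of this record
    stage: str                    # lifecycle stage that produced it
    parent: Optional[str] = None  # PID of the record it was derived from


class Registry:
    def __init__(self) -> None:
        self.records: Dict[str, Record] = {}

    def register(self, pid: str, stage: str, parent: Optional[str] = None) -> Record:
        record = Record(pid, stage, parent)
        self.records[pid] = record
        return record

    def trace(self, pid: str) -> List[str]:
        """Traceability: follow parent links back towards the data source."""
        chain: List[str] = []
        current: Optional[str] = pid
        while current in self.records and current not in chain:
            chain.append(current)
            current = self.records[current].parent
        return chain

    def dangling(self) -> List[str]:
        """Referral integrity check: records whose parent PID is not registered."""
        return sorted(
            pid for pid, rec in self.records.items()
            if rec.parent is not None and rec.parent not in self.records
        )


reg = Registry()
reg.register("raw-1", "collection_registration")
reg.register("cls-1", "filter_enrich_classification", parent="raw-1")
reg.register("model-1", "analytics_modeling_prediction", parent="cls-1")
reg.register("report-1", "delivery_visualisation", parent="missing-0")

print(reg.trace("model-1"))
print(reg.dangling())
```

Repurposed or secondary-processed data would simply add further records with their own parent links; opacity, in contrast, would mean deliberately withholding such links from some consumers.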
Protected data
Encrypted data
Masked data (scrambled, padded, manipulated, etc.)
Anonymised and privacy enhanced
Identifiable and non-identifiable
Policy attached/enforced
Tiered/auto-tiered
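Two of the categories above, masked data and privacy-enhanced identifiers, can be illustrated with standard-library calls. This is a hedged sketch: the function names, the choice of HMAC-SHA256 and the sample values are assumptions for illustration, and keyed hashing of this kind is generally regarded as pseudonymisation rather than full anonymisation.

```python
import hashlib
import hmac


def mask(value: str, visible: int = 4, fill: str = "*") -> str:
    """Masked data: hide everything except the last `visible` characters."""
    visible = max(visible, 0)
    hidden = max(len(value) - visible, 0)
    return fill * hidden + value[hidden:]


def pseudonymise(identifier: str, key: bytes) -> str:
    """Replace an identifier with a keyed hash; same input and key give the same token."""
    return hmac.new(key, identifier.encode("utf-8"), hashlib.sha256).hexdigest()


# Hypothetical sample values.
masked = mask("1234567890")
token_a = pseudonymise("patient-001", b"secret-key-1")
token_b = pseudonymise("patient-001", b"secret-key-2")

print(masked)
print(token_a != token_b)
```

Note that values no longer than `visible` are returned unmasked by this sketch; a real policy would need to decide how to treat short values.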
Publications
and Linked
Data
Published Data
Structured Data
Raw Data
Actors: User, Researcher
Lifecycle stages: Project/Experiment Planning; Data collection and filtering; Data analysis; Data sharing/Data publishing; End of project
Data forms: Raw Data; Experimental Data; Structured Scientific Data; DB
Supporting activities: Data discovery; Data Curation (including retirement and clean up); Data recycling; Data archiving; Data Re-purpose; Data linkage to papers; Metadata & Management
Data Linkage Issues: Linked Data; Data Links
Other labels: Data Detainment; Open Public Use
Additional Information
Existing proposed Big Data architectures
e-Science and Scientific Data Infrastructure (SDI)
Cloud computing as a platform for SDI
Using integrated/unified storage
New DB/storage technologies allow storing data throughout the whole lifecycle
[ref] HPCC Systems: Introduction to HPCC (High Performance Computer Cluster), Author: A.M.
Middleton, LexisNexis Risk Solutions, Date: May 24, 2011
Cloud Standardisation
RORA model provides basis for business processes definition, SLA and access control
CSM layers
(C6) User/Customer side Functions
(C5) Intercloud Services Access and Delivery
(C4) Cloud Services (Infrastructure, Platform, Applications)
(C3) Virtual Resources Composition and Orchestration
(C2) Virtualisation Layer
(C1) Hardware platform and dedicated network infrastructure
Diagram labels by layer
Layer C6 User/Customer side Functions: User/Client Services (Identity services (IDP), Visualisation); Administration and Management Functions (Client); Content/Data Services (Data, Content, Sensor, Device)
Layer C5 Services Access/Delivery: Endpoint Functions (Service Gateway, Portal/Desktop); Inter-cloud Functions (Registry and Discovery, Federation Infrastructure); Access/Delivery Infrastructure
Layer C4 Cloud Services (Infrastructure, Platforms, Applications, Software): SaaS, PaaS, IaaS; PaaS-IaaS Interface
Layer C3 Virtual Resources Composition and Control (Orchestration): Cloud Management Software (Generic Functions): OpenStack, Other CMS
Layer C2 Virtualisation: Virtualisation Platform: KVM, XEN, VMware, Network Virtualisation; VM; VPN
Layer C1 Physical Hardware Platform and Network: Hardware/Physical Resources: Storage Resources, Compute Resources, Network Infrastructure
Cross-layer: Security Infrastructure; Management; Control/Management Links; Data Links
Other labels: GN3+ Cloud+
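The six CSM layers form an ordered stack, which can be captured as an ordered enumeration. This is a small sketch for orientation only: the class and helper names are hypothetical, and the layer titles are copied from the list of CSM layers in the slides.

```python
from enum import IntEnum
from typing import List


class CSMLayer(IntEnum):
    C1 = 1
    C2 = 2
    C3 = 3
    C4 = 4
    C5 = 5
    C6 = 6


LAYER_NAMES = {
    CSMLayer.C1: "Hardware platform and dedicated network infrastructure",
    CSMLayer.C2: "Virtualisation Layer",
    CSMLayer.C3: "Virtual Resources Composition and Orchestration",
    CSMLayer.C4: "Cloud Services (Infrastructure, Platform, Applications)",
    CSMLayer.C5: "Intercloud Services Access and Delivery",
    CSMLayer.C6: "User/Customer side Functions",
}


def describe(layer: CSMLayer) -> str:
    """Render a layer in the '(Cn) title' form used on the slide."""
    return f"({layer.name}) {LAYER_NAMES[layer]}"


def layers_below(layer: CSMLayer) -> List[CSMLayer]:
    """Layers that sit beneath the given layer in the stack, lowest first."""
    return [other for other in CSMLayer if other < layer]


print(describe(CSMLayer.C4))
print([l.name for l in layers_below(CSMLayer.C3)])
```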
E-Science Features
Support for long running experiments and large data volumes generated
at high speed
Multi-tier inter-linked data distribution and replication
On-demand infrastructure provisioning to support data sets and
scientific workflows, mobility of data-centric scientific applications
Support of virtual scientists communities, addressing dynamic user
groups creation and management, federated identity management
Support for the whole data lifecycle including metadata and data source
linkage
Trusted environment for data storage and processing
Layers, with technologies and solutions
Layer B6 Scientific Applications: Scientific Applications, Scientific Dataset, User portals. Technologies and solutions: Scientific specialist applications; Library resources
Layer B5 Federated Access and Delivery Layer. Technologies and solutions: Federated Identity Management: eduGAIN, REFEDS, VOMS, InCommon
Layer B4 Scientific Platform and Instruments. Technologies and solutions: PRACE/DEISA
Layer B3 Infrastructure Virtualisation: Cloud/Grid Infrastructure Virtualisation and Management Middleware; Middleware security. Technologies and solutions: Grid/Cloud
Layer B2 Datacenter and Computing Facility: Compute Resources, Storage Resources, Sensors and Devices. Technologies and solutions: Clouds
Layer B1 Network Infrastructure: Optical Network Infrastructure. Technologies and solutions: Autobahn, eduroam
Federated Cloud Instance Customer B (VO B)
Common FADI Services: Trust Broker, Broker, Trusted Introducer, Discovery, FedIDP, Directory (RepoSLA)
Each (I/P/S)aaS Provider (four in the diagram): Gateway, AAA, IDP