Professional Documents
Culture Documents
Bd2013 Fineberg
Bd2013 Fineberg
Copyright 2013 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.
Overview
Introduction
What is big data?
Big Data I/O
Hadoop/HDFS
SAN
Distributed FS
Cloud
Summary
Research Areas
Copyright 2013 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.
Examples
MapReduce
Pattern matching
Log analysis
Unstructured data analysis
Copyright 2013 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.
Sequential scans
Map/reduce
Sorting
Shuffle
Index/graph traversal
Copyright 2013 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.
MapReduce in Hadoop
Input (HDFS)
Mapper
Reducer
Output (HDFS)
Copyright 2013 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.
HDFS
Protocol
Copyright 2013 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.
HDFS basics
Data is organized into files & directories
Files are divided into 64-128MB blocks, distributed across nodes
Block placement is handled by the NameNode
Placement coordinated with job tracker = writes always co-located, reads co-located with computation whenever possible
Blocks replicated to handle failure, replica blocks can be used by compute tasks
Checksums used to ensure data integrity
Replication: one and only strategy for error handling, recovery and fault tolerance
Self Healing
Makes multiple copies (typically 3)
Copyright 2013 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.
3-way
replication
Copyright 2013 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.
Copyright 2013 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.
Cons
10
Copyright 2013 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.
Compute
nodes
iSCSI or FC SAN
Hadoop
Cluster
11
Copyright 2013 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.
Compute
nodes
iSCSI or FC SAN
Cache
C a ch
Hadoop
Cluster
12
Copyright 2013 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.
Array
Replicatio
n
13
Copyright 2013 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.
Array redundancy,
means only a single
source for data
14
Copyright 2013 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.
Pros
Cons
Cost? It depends
Unless if multiple arrays are used, scale is
limited
Copyright 2013 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.
16
Copyright 2013 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.
17
Copyright 2013 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.
Distributed
FS inter-node
Protocol
Distributed FS
inter-node
Protocol
18
Copyright 2013 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.
19
Cons
HDFS is highly optimized for Hadoop, unlikely
to get same optimization for a general
purpose DFS
Large file striping is not regular, based on compute
distribution
Copies are simultaneously readable
Copyright 2013 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.
Differences
Copyright 2013 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.
Cloud Storage
VM Host
SAS/
SATA
VM
VM
VM
VM
VM
VM
Object Storage
REST
VM Host
VM
VM
VM
VM
VM
VM
21
Copyright 2013 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.
Hybrid Options
Local disk
Dedicated cluster
Object storage
22
Remote DFS
Copyright 2013 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.
Summary
There are lots of storage architectures to choose from
What is the predominant access mode?
23
Copyright 2013 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.
Research areas
What types of storage make the most sense for what workloads?
Choices are often based on religion more than data hardware cost vs. TCO
Consider entire chain, not just a micro benchmark
Small improvements dont matter if you have to give up manageability/ease of use
24
Does NVM address the entire supply chain, or just the data processing part?
How can we take advantage of NVM without losing manageability?
Copyright 2013 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.
Thank you
Copyright 2013 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.