Big Data Fundamentals, Part 1


Big data fundamentals

Understanding the optimization choices in big data components

© Cloudera, Inc. All rights reserved. 1


Presentation goals
✓ Teach you something
✓ Help you see the potential of Big Data beyond MapReduce
✓ Be fair to Cloudera’s competitors
✓ Inspire you to learn more

If something doesn’t make sense, please ask.

© Cloudera, Inc. All rights reserved. 2


Notification
• The information in this document is proprietary to Cloudera. No part of this document may be reproduced,
copied or transmitted in any form for any purpose without the express prior written permission of Cloudera.

• This document is a preliminary version and not subject to your license agreement or any other agreement
with Cloudera. This document contains only intended strategies, developments and functionalities of
Cloudera products and is not intended to be binding upon Cloudera to any particular course of business,
product strategy and/or development. Please note that this document is subject to change and may be
changed by Cloudera at any time without notice.

• Cloudera assumes no responsibility for errors or omissions in this document. Cloudera does not warrant
the accuracy or completeness of the information, text, graphics, links or other items contained within this
material. This document is provided without a warranty of any kind, either express or implied, including but
not limited to the implied warranties of merchantability, fitness for a particular purpose or non-infringement.

• Cloudera shall have no liability for damages of any kind including without limitation direct, special, indirect
or consequential damages that may result from the use of these materials. The limitation shall not apply in
cases of gross negligence.

© Cloudera, Inc. All rights reserved. 3


Agenda
• Open source software
• Data storage and stewardship
• Data integration
• Data engineering
• Data analytics
• Life after Lambda architectures and IoT
• Data science at scale
• Big data in the clouds
• Cybersecurity as a Big Data problem
• Cluster management and security

• Customer success stories


• Questions and answers

© Cloudera, Inc. All rights reserved. 4


Big data fundamentals

Open source software


Optimizing to benefit from community innovation

© Cloudera, Inc. All rights reserved. 5


3 Reasons open source is good for companies
1. Free evaluation – Install, test, inspect, and evaluate open source code in perpetuity, with no financial obligation.
2. Freedom from lock-in – Multiple vendors supporting the same core technology makes it easier to move.
3. Scalable innovation – The collective work of a global, passionate community keeps the code base evolving.

These benefits derive from use of the permissive Apache License.
© Cloudera, Inc. All rights reserved. 6
3 Reasons open source adds risk for companies
1. Not business focus – Company assets should be working on core competency.
2. Real cost hard to measure – Time developers spend solving problems or adding features often isn't visible.
3. Multiple projects – Each project is managed by a separate committee, and there is not necessarily an overriding design.

© Cloudera, Inc. All rights reserved. 7


“Open source software is free
like a puppy is free”

- Scott McNealy
CEO Sun Microsystems

© Cloudera, Inc. All rights reserved. 8


What if you got a dog for a reason?
• Can take years to mature
• Months of intensive training (when your attention should be elsewhere)
• Dog becomes very bonded to the handler (and vice versa)
• Poor training results in a misbehaving dog

Developers don't want to be tied to one system.

You don't want your developers tied to one system.

© Cloudera, Inc. All rights reserved. 9


What is a distribution?

© Cloudera, Inc. All rights reserved. 10


Benefits of using a distribution

• Stability – Each Apache project has its own dependencies and release cycle. Getting them to work together requires effort and thorough testing.
• Regular upgrades – Code in open source changes constantly. Cloudera provides a new feature release every quarter that is tested and supported.
• 24x7 support and bug fixes – Distribution vendors should employ open source committers who can make sure fixes are added to the open source base.

© Cloudera, Inc. All rights reserved. 11


More benefits of using a distribution

• Faster to market – With a distribution, you can start developing applications right away. Building an environment from scratch would take months.
• Minimize risk – With a distribution, you know what it will cost and you know that it will work. Building an environment from scratch provides no such guarantees.
• Focus on business problems – Building an environment from scratch would require the focus of a few of your best developers. Get them working on the real problem.

© Cloudera, Inc. All rights reserved. 12


The big data ecosystem vendors
[Diagram: vendor logos grouped into three tiers]
• Comprehensive distributions
• Proprietary + Hadoop in the gaps (e.g. Google Cloud Dataproc)
• Single-project specialists (Spark, Kafka, Cassandra)

© Cloudera, Inc. All rights reserved. 13


Apache software foundation
The ASF board of directors appoints, for each project:

• Project management committee (PMC) chair – ensures the project complies with ASF requirements
• PMC members – decide the architecture, feature set and direction of the project; usually are also committers
• Committers – have write access to the code, although contributions are approved by the PMC
• Developers (aka contributors) – anyone may propose changes to the code or documentation, but those changes have to be picked up and used by a committer
• Users – provide feedback, bug reports and feature suggestions

© Cloudera, Inc. All rights reserved. 14


Apache project requirements
• Must be Apache licensed (may include compatibly licensed elements)
• Free to download and use for any purpose
• Branding requirements and restrictions
• Source code must be open and available on the ASF website
• Must provide sufficient documentation on the website to use the project
• Releases must follow the ASF PMC voting policies
• Corporations may not directly contribute – only individuals
• Must govern themselves independently of undue commercial influence
• Must not discourage new contributions from competing vendors
• Low diversity may incur ‘extra scrutiny’ from the board

However, there are NO requirements to:


• Have more than one commercial entity involved (random community members are ok)
• Contribute to an existing project when there is overlap in functionality (competitive projects are ok)
• Contribute modifications or enhancements back to the project
• Employ Committers or PMC members if you are a commercial vendor

© Cloudera, Inc. All rights reserved. 15


Cloudera’s commitment to our customers
Open source (anything that stores your data; any APIs your applications call)
• Uses open source code
• Our contributions and fixes go back to open source first
• When possible, use projects supported by multiple commercial vendors
• Employ* committers, if not PMC members, on the projects we support

Free to use forever (keeping your cluster running; managing your applications)
• Cloudera Express edition
• No limit to the number of servers
• High availability features
• RBAC over your data
• License expiration won't stop the cluster

Provide enterprise value (ensure your success; minimize your risk)
• 24x7 support
• Rolling upgrades
• Data governance and lineage
• Automated backup and recovery
• Full disk encryption
• Multi-tenant usage reports

* People manage their own careers. Temporary gaps may exist.

© Cloudera, Inc. All rights reserved. 16


Big data fundamentals

Data storage and stewardship


Optimizing for inexpensive, reliable storage accessed by
multiple execution engines

© Cloudera, Inc. All rights reserved. 17


Anatomy of a big data cluster

[Diagram: cluster hosts grouped into masters, workers and gateways]
• Masters – NameNode and Secondary NameNode, YARN, Impala Catalog Store and Statestore, HiveServer, a ZooKeeper quorum, Kudu Masters, HBase HMasters, HUE Server, Sentry Server and Oozie Server, plus Cloudera Manager (with optional Cloudera Director and cloud plugin) and its metadata database(s).
• Workers – DataNodes, YARN resource pools, Impala daemons, Search daemons, HBase RegionServers and Kudu tablet servers.
• Gateways – edge hosts for user applications and CDSW sessions.
• Every host runs a Cloudera Manager (CM) agent.
© Cloudera, Inc. All rights reserved. 18
HDFS

[Diagram: HDFS block placement]
• The NameNode (with a Standby or Secondary NameNode) tracks the file namespace and the location of every block.
• Files (e.g. FileQ) are split into blocks (BX, BY, BZ); each block is replicated three times across DataNodes placed on different racks (Rack1, Rack2, Rack3).
• Default block size = 256 MB.
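
As a hedged illustration of the block layout above, the standard HDFS client can report where each block of a file lives; this minimal Python sketch assumes an HDFS client is configured on the host, and the file names and paths are hypothetical:

```python
import subprocess

HDFS_PATH = "/data/example.csv"  # hypothetical target path

# Copy a local file into HDFS, overriding the block size for this one write
# (268435456 bytes = 256 MB); -D supplies a per-command configuration value.
subprocess.run(
    ["hdfs", "dfs", "-D", "dfs.blocksize=268435456",
     "-put", "-f", "example.csv", HDFS_PATH],
    check=True,
)

# Ask the NameNode which blocks make up the file, how many replicas each has,
# and which DataNodes (and racks) hold them.
report = subprocess.run(
    ["hdfs", "fsck", HDFS_PATH, "-files", "-blocks", "-locations"],
    check=True, capture_output=True, text=True,
)
print(report.stdout)
```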


© Cloudera, Inc. All rights reserved. 19
HDFS Snapshots

[Diagram: snapshot of a directory tree]
• A snapshot captures the state of a directory (e.g. /user/hive/tables/sales/subscriptions containing Data1.parquet and Data2.parquet) at a point in time.
• The snapshot is exposed under a hidden .snapshot directory (e.g. .snapshot/snap1) by the NameNode; it references the same blocks on the DataNodes, so no data is copied and later changes to the live files do not alter the snapshot's view.
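
A minimal sketch of the snapshot workflow, driving the standard hdfs commands from Python; the directory and snapshot name match the diagram, everything else is a stand-in:

```python
import subprocess

DIR = "/user/hive/tables/sales/subscriptions"  # directory from the diagram

# An administrator first marks the directory as snapshottable.
subprocess.run(["hdfs", "dfsadmin", "-allowSnapshot", DIR], check=True)

# Create a named snapshot; it appears under <DIR>/.snapshot/snap1.
subprocess.run(["hdfs", "dfs", "-createSnapshot", DIR, "snap1"], check=True)

# A file deleted or corrupted later can be restored by copying it back out
# of the read-only snapshot view.
subprocess.run(
    ["hdfs", "dfs", "-cp", f"{DIR}/.snapshot/snap1/Data1.parquet", DIR],
    check=True,
)
```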
© Cloudera, Inc. All rights reserved. 20
Public cloud blob storage
Public clouds offer low-cost, highly available storage designed for access inside and outside of Hadoop.

• Amazon Simple Storage Service (S3)
  – Uses a 'bucket' paradigm
  – Requires S3Guard (Apache open source) to achieve consistency
  – Use the protocol s3a://<bucket name>/<filename>

• Microsoft Azure Data Lake Store (ADLS)
  – 'Feels' more like a normal (POSIX) file system
  – Use the protocol adl://<directory>/<directory>/<filename>
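
As a hedged illustration, once the connector jars and credentials are configured on the cluster, Spark addresses these stores purely through the path scheme; the bucket, account and paths in this minimal sketch are hypothetical:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("blob-storage-example").getOrCreate()

# Read Parquet straight out of an S3 bucket via the s3a connector.
events = spark.read.parquet("s3a://example-bucket/raw/events/")

# Write the same data to an ADLS account via the adl connector.
events.write.mode("overwrite").parquet(
    "adl://exampleadls.azuredatalakestore.net/curated/events/"
)
```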

© Cloudera, Inc. All rights reserved. 21


Compute over storage

[Diagram: compute engines layered over shared storage]
• Compute: Hive, Pig, Search, Impala, Spark, MapReduce, HBase
• Storage: HDFS and Kudu on the cluster, or cloud filesystems (S3, ADLS)
• The engines share the storage layer rather than each keeping its own copy of the data.

© Cloudera, Inc. All rights reserved. 22


Schema on write, or 'structured data'

1. Define schema
2. Create table(s)
3. Map known fields
4. Discard unknown fields
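
A minimal PySpark sketch of those four steps, using a hypothetical table and landing path; the schema is fixed before any data is written, and fields outside it are dropped:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = (SparkSession.builder.appName("schema-on-write")
         .enableHiveSupport().getOrCreate())

# 1-2. Define the schema up front and create the table.
spark.sql("""
    CREATE TABLE IF NOT EXISTS subscriptions (
        customer_id BIGINT,
        plan        STRING,
        started_on  DATE
    )
    STORED AS PARQUET
""")

# 3-4. Map only the known fields from the incoming records; anything else in
# the source is discarded before the write.
incoming = spark.read.json("/landing/subscriptions/")  # hypothetical landing path
(incoming
    .select(col("customer_id").cast("bigint"),
            col("plan").cast("string"),
            col("started_on").cast("date"))
    .write.insertInto("subscriptions"))
```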

© Cloudera, Inc. All rights reserved. 23


Schema on read, or 'unstructured data'

1. Write whole record(s) to the filesystem (compressed)
2. Register schema with the metastore
3. Query engine applies the schema to the data
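
A minimal PySpark sketch of the same three steps with a hypothetical path; the raw records stay untouched on the filesystem and the schema is applied only when the data is queried (a production setup would typically register the schema as an external table in the Hive metastore rather than a temporary view):

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, LongType

spark = SparkSession.builder.appName("schema-on-read").getOrCreate()

# 1. Raw records were landed as-is (e.g. gzipped JSON) under a hypothetical path.
RAW_PATH = "/raw/clickstream/"

# 2. Declare the schema separately from the data and register it as a view.
schema = StructType([
    StructField("user_id", LongType()),
    StructField("url", StringType()),
    StructField("ts", StringType()),
])
spark.read.schema(schema).json(RAW_PATH).createOrReplaceTempView("clickstream")

# 3. The query engine applies the schema at read time.
spark.sql("SELECT url, COUNT(*) AS hits FROM clickstream GROUP BY url").show()
```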

© Cloudera, Inc. All rights reserved. 24


Popular file format options

• XML, JSON files – Can't be both split and compressed.
• Text/Delimited/CSV/JSON records – Usable everywhere; schema on read; poor performance and poor compression.
• Avro – Contains the schema, but also allows schema on read; usable inside and outside of Hadoop.
• Parquet – Columnar and splittable, with query performance benefits and excellent compression; supports schema evolution (adding columns); skips columns well during scans.
• ORC (not supported by Cloudera; HDP Hive only) – Similar to Parquet, with higher compression but poor data skipping; Hortonworks is working on ACID transactions and secondary indexes.

Example sizes for the same data set:

File type                         Example size
Uncompressed CSV                  1.8 GB
Avro                              1.5 GB
Avro w/ snappy compression        750 MB
Parquet w/ snappy compression     300 MB
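
A minimal PySpark sketch (hypothetical paths) of the conversion behind the size comparison above, rewriting delimited text as snappy-compressed Parquet:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("format-conversion").getOrCreate()

# Read delimited text with a header row, letting Spark infer the column types.
csv_df = (spark.read.option("header", "true")
          .option("inferSchema", "true")
          .csv("/raw/transactions.csv"))

# Rewrite it as snappy-compressed Parquet; columnar layout plus compression is
# what shrinks the example data set from ~1.8 GB of CSV to ~300 MB of Parquet.
(csv_df.write
    .option("compression", "snappy")
    .mode("overwrite")
    .parquet("/curated/transactions_parquet/"))
```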

© Cloudera, Inc. All rights reserved. 25


Raw and formatted data copies

• Keep the raw version if there is a chance that information will be lost in the translation.
• Use columnar storage for the formatted copy to improve analytic performance immensely.
• Think about a metadata tagging policy (e.g. with Cloudera Navigator) to assist with data stewardship.

© Cloudera, Inc. All rights reserved. 26


Big data pipelines

Data ingestion   Data engineering   Data stewardship   Data science   Data analytics
Capture          Cleanse            Store              Model          BI
Move             Conform            Secure             Score          Online
Stream           Transform          Govern             Enrich         APIs
                 Enrich             Tag                Predict

© Cloudera, Inc. All rights reserved. 27


Which do you want?

Data lake Data hub

© Cloudera, Inc. All rights reserved. 28


Data lake to a data hub
• Comprehensive, planned and enforced data hierarchy
• Carefully administered versioning and retention policies
• Comprehensive, unified security, governance and lineage
• Encourage and support metadata
• Establish standards for data, metadata and analytic models
• Maximize reuse of data without making copies
  – Balanced with security and performance concerns – don't be an ideologue!
• Plan staffing around new roles

© Cloudera, Inc. All rights reserved. 29


Big data fundamentals

Data integration
Optimizing for data ingestion with volume, velocity and variety

© Cloudera, Inc. All rights reserved. 30


Apache Flume

[Diagram: tiers of Flume agents filtering, transforming, encrypting and compressing events on their way into HDFS]
• Collect data as it is produced
  – Files, syslogs, stdout or a custom source
  – Process in place, such as encrypt or compress
• Pre-process data before storing
  – Such as transform, scrub or enrich
• Write in parallel
  – Scalable throughput
• Store in any format
  – Text, compressed, binary, or a custom sink

© Cloudera, Inc. All rights reserved. 31


Apache Kafka

[Diagram: producers, brokers and consumers]
• Producers push to Kafka; consumers pull from Kafka.
• A topic (TopicA) is split into partitions (Partition0–2), each hosted by a broker (Broker1–3).
• Consumers can share a consumer group, so a topic's partitions are divided among the group's members.
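
A minimal sketch of the push/pull pattern using the kafka-python client (assumed to be installed; the broker addresses and group name are stand-ins):

```python
from kafka import KafkaProducer, KafkaConsumer

BROKERS = ["broker1:9092", "broker2:9092", "broker3:9092"]

# Producer: push messages to a topic; Kafka assigns each message to a partition.
producer = KafkaProducer(bootstrap_servers=BROKERS)
producer.send("TopicA", b"hello from the producer")
producer.flush()

# Consumer: pull messages; consumers sharing a group_id divide the topic's
# partitions among themselves.
consumer = KafkaConsumer(
    "TopicA",
    bootstrap_servers=BROKERS,
    group_id="example-consumer-group",
    auto_offset_reset="earliest",
)
for record in consumer:
    print(record.partition, record.offset, record.value)
    break  # read a single record for the sketch
```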

© Cloudera, Inc. All rights reserved. 32


Kafka redundancy

[Diagram: partition replicas spread across three brokers]
• Each broker leads one partition of TopicA and holds replicas of the other two (replication factor 3).
• If a broker fails, another broker holding a replica can take over as leader for its partition.
© Cloudera, Inc. All rights reserved. 33


Apache Sqoop

▪ Rapidly moves large amounts of data between relational databases and HDFS
  – Import tables (or partial tables) from an RDBMS into HDFS
  – Export data from HDFS to a database table
▪ Uses JDBC to connect to the database
  – Works with virtually all standard RDBMSs
▪ Custom "connectors" for some RDBMSs provide much higher throughput
  – Available for certain databases, such as Teradata and Oracle
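
A minimal sketch of an import, invoking the sqoop command line from Python; the connection string, credentials, table and target directory are hypothetical stand-ins:

```python
import subprocess

subprocess.run(
    [
        "sqoop", "import",
        "--connect", "jdbc:mysql://dbhost/sales",
        "--username", "etl_user",
        "--password-file", "/user/etl/.db_password",  # keeps the password off the command line
        "--table", "subscriptions",
        "--target-dir", "/raw/subscriptions",
        "--num-mappers", "4",        # parallel map tasks doing the copy
        "--as-parquetfile",          # land the data directly as Parquet
    ],
    check=True,
)
```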

© Cloudera, Inc. All rights reserved. 34


Thank you

The modern platform for machine learning and analytics, optimized for the cloud

© Cloudera, Inc. All rights reserved. 35
