Big Data Fundamentals, Part 1


Big data fundamentals

Understanding the optimization choices in big data components

© Cloudera, Inc. All rights reserved. 1


Presentation goals
✓ Teach you something
✓ Help you see the potential of Big Data beyond MapReduce
✓ Be fair to Cloudera’s competitors
✓ Inspire you to learn more

If something doesn’t make sense, please ask.

© Cloudera, Inc. All rights reserved. 2


Notification
• The information in this document is proprietary to Cloudera. No part of this document may be reproduced,
copied or transmitted in any form for any purpose without the express prior written permission of Cloudera.

• This document is a preliminary version and not subject to your license agreement or any other agreement
with Cloudera. This document contains only intended strategies, developments and functionalities of
Cloudera products and is not intended to be binding upon Cloudera to any particular course of business,
product strategy and/or development. Please note that this document is subject to change and may be
changed by Cloudera at any time without notice.

• Cloudera assumes no responsibility for errors or omissions in this document. Cloudera does not warrant
the accuracy or completeness of the information, text, graphics, links or other items contained within this
material. This document is provided without a warranty of any kind, either express or implied, including but
not limited to the implied warranties of merchantability, fitness for a particular purpose or non-infringement.

• Cloudera shall have no liability for damages of any kind including without limitation direct, special, indirect
or consequential damages that may result from the use of these materials. The limitation shall not apply in
cases of gross negligence.

© Cloudera, Inc. All rights reserved. 3


Agenda
• Open source software
• Data storage and stewardship
• Data integration
• Data engineering
• Data analytics
• Life after Lambda architectures and IoT
• Data science at scale
• Big data in the clouds
• Cybersecurity as a Big Data problem
• Cluster management and security

• Customer success stories


• Questions and answers

© Cloudera, Inc. All rights reserved. 4


Big data fundamentals

Open source software


Optimizing to benefit from community innovation

© Cloudera, Inc. All rights reserved. 5


3 Reasons open source is good for companies
1. Free evaluation – Install, test, inspect, and evaluate open source code in perpetuity, with no financial obligation.
2. Freedom from lock-in – Multiple vendors supporting the same core technology makes it easier to move.
3. Scalable innovation – The collective work of a global, passionate community keeps the code base evolving.

These benefits derive from use of the permissive Apache License.
© Cloudera, Inc. All rights reserved. 6
3 Reasons open source adds risk for companies
1. Not business focus – Company assets should be working on core competency.
2. Real cost hard to measure – Time developers spend solving problems or adding features often isn't visible.
3. Multiple projects – Each project is managed by a separate committee, and there is not necessarily an overriding design.

© Cloudera, Inc. All rights reserved. 7


“Open source software is free
like a puppy is free”

- Scott McNealy
CEO Sun Microsystems

© Cloudera, Inc. All rights reserved. 8


What if you got a dog for a reason?
• Can take years to mature
• Months of intensive training (when your attention should be elsewhere)
• Dog becomes very bonded to the handler (and vice versa)
• Poor training results in a misbehaving dog

Developers don't want to be tied to one system.

You don't want your developers tied to one system.

© Cloudera, Inc. All rights reserved. 9


What is a distribution?

© Cloudera, Inc. All rights reserved. 10


Benefits of using a distribution

• Stability – Each Apache project has its own dependencies and release cycle. Getting them to work together requires effort and thorough testing.
• Regular upgrades – Code in open source changes constantly. Cloudera provides a new feature release every quarter that is tested and supported.
• 24x7 support and bug fixes – Distribution vendors should employ open source committers who can make sure fixes are added to the open source base.

© Cloudera, Inc. All rights reserved. 11


More benefits of using a distribution

• Faster to market – With a distribution, you can start developing applications right away. Building an environment from scratch would take months.
• Minimize risk – With a distribution, you know what it will cost and you know that it will work. Building an environment from scratch provides no such guarantees.
• Focus on business problems – Building an environment from scratch would require the focus of a few of your best developers. Get them working on the real problem.

© Cloudera, Inc. All rights reserved. 12


The big data ecosystem vendors
[Diagram: vendor logos grouped into three tiers]
• Comprehensive distributions
• Proprietary + Hadoop in the gaps (e.g. Google Cloud Dataproc)
• Single-project specialists (Spark, Kafka, Cassandra)

© Cloudera, Inc. All rights reserved. 13


Apache software foundation
The ASF board of directors appoints, for each project:

• Project management committee (PMC) chair – ensures the project complies with ASF requirements
• PMC members – decide the architecture, feature set and direction of the project; usually are also committers
• Committers – have write access to the code, although contributions are approved by the PMC
• Developers (aka contributors) – anyone may propose changes to the code or documentation, but those changes have to be picked up and used by a committer
• Users – provide feedback, bug reports and feature suggestions

© Cloudera, Inc. All rights reserved. 14


Apache project requirements
• Must be Apache licensed (may include compatibly licensed elements)
• Free to download and use for any purpose
• Branding requirements and restrictions
• Source code must be open and available on the ASF website
• Must provide sufficient documentation on the website to use the project
• Releases must follow the ASF PMC voting policies
• Corporations may not directly contribute – only individuals
• Must govern themselves independently of undue commercial influence
• Must not discourage new contributions from competing vendors
• Low diversity may incur ‘extra scrutiny’ from the board

However, there are NO requirements to:


• Have more than one commercial entity involved (random community members are ok)
• Contribute to an existing project when there is overlap in functionality (competitive projects are ok)
• Contribute modifications or enhancements back to the project
• Employ Committers or PMC members if you are a commercial vendor

© Cloudera, Inc. All rights reserved. 15


Cloudera’s commitment to our customers
Open source (anything that stores your data; any APIs your applications call)
• Uses open source code
• Our contributions and fixes go back to open source first
• When possible, use projects supported by multiple commercial vendors
• Employ* committers, if not PMC members, on the projects we support

Free to use forever (keeping your cluster running; managing your applications)
• Cloudera Express edition
• No limit to the number of servers
• High availability features
• RBAC over your data
• License expiration won't stop the cluster

Provide enterprise value (ensure your success; minimize your risk)
• 24x7 support
• Rolling upgrades
• Data governance and lineage
• Automated backup and recovery
• Full disk encryption
• Multi-tenant usage reports

* People manage their own careers. Temporary gaps may exist.

© Cloudera, Inc. All rights reserved. 16


Big data fundamentals

Data storage and stewardship


Optimizing for inexpensive, reliable storage accessed by
multiple execution engines

© Cloudera, Inc. All rights reserved. 17


Anatomy of a big data cluster

[Diagram: cluster hosts grouped into masters, workers and gateways]
• Masters – NameNode and Secondary NameNode, YARN, Impala Catalog Store and Statestore, HiveServer, a ZooKeeper quorum, Kudu Masters, HBase HMasters, HUE Server, Sentry Server and Oozie Server, plus Cloudera Manager (with optional Cloudera Director and cloud plugin) and its metadata database(s).
• Workers – DataNodes, YARN resource pools, Impala daemons, Search daemons, HBase RegionServers and Kudu tablet servers.
• Gateways – edge hosts for user applications and CDSW sessions.
• Every host runs a Cloudera Manager (CM) agent.
© Cloudera, Inc. All rights reserved. 18
HDFS

[Diagram: HDFS block placement]
• The NameNode (with a Standby or Secondary NameNode) tracks the file namespace and the location of every block.
• Files (e.g. FileQ) are split into blocks (BX, BY, BZ); each block is replicated three times across DataNodes placed on different racks (Rack1, Rack2, Rack3).
• Default block size = 256 MB.
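
As a hedged illustration of the block layout above, the standard HDFS client can report where each block of a file lives; this minimal Python sketch assumes an HDFS client is configured on the host, and the file names and paths are hypothetical:

```python
import subprocess

HDFS_PATH = "/data/example.csv"  # hypothetical target path

# Copy a local file into HDFS, overriding the block size for this one write
# (268435456 bytes = 256 MB); -D supplies a per-command configuration value.
subprocess.run(
    ["hdfs", "dfs", "-D", "dfs.blocksize=268435456",
     "-put", "-f", "example.csv", HDFS_PATH],
    check=True,
)

# Ask the NameNode which blocks make up the file, how many replicas each has,
# and which DataNodes (and racks) hold them.
report = subprocess.run(
    ["hdfs", "fsck", HDFS_PATH, "-files", "-blocks", "-locations"],
    check=True, capture_output=True, text=True,
)
print(report.stdout)
```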


© Cloudera, Inc. All rights reserved. 19
HDFS Snapshots

[Diagram: snapshot of a directory tree]
• A snapshot captures the state of a directory (e.g. /user/hive/tables/sales/subscriptions containing Data1.parquet and Data2.parquet) at a point in time.
• The snapshot is exposed under a hidden .snapshot directory (e.g. .snapshot/snap1) by the NameNode; it references the same blocks on the DataNodes, so no data is copied and later changes to the live files do not alter the snapshot's view.
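
A minimal sketch of the snapshot workflow, driving the standard hdfs commands from Python; the directory and snapshot name match the diagram, everything else is a stand-in:

```python
import subprocess

DIR = "/user/hive/tables/sales/subscriptions"  # directory from the diagram

# An administrator first marks the directory as snapshottable.
subprocess.run(["hdfs", "dfsadmin", "-allowSnapshot", DIR], check=True)

# Create a named snapshot; it appears under <DIR>/.snapshot/snap1.
subprocess.run(["hdfs", "dfs", "-createSnapshot", DIR, "snap1"], check=True)

# A file deleted or corrupted later can be restored by copying it back out
# of the read-only snapshot view.
subprocess.run(
    ["hdfs", "dfs", "-cp", f"{DIR}/.snapshot/snap1/Data1.parquet", DIR],
    check=True,
)
```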
© Cloudera, Inc. All rights reserved. 20
Public cloud blob storage
Public clouds offer low-cost, highly available storage designed for access inside and outside of Hadoop.

• Amazon Simple Storage Service (S3)
  – Uses a 'bucket' paradigm
  – Requires S3Guard (Apache open source) to achieve consistency
  – Use the protocol s3a://<bucket name>/<filename>

• Microsoft Azure Data Lake Store (ADLS)
  – 'Feels' more like a normal (POSIX) file system
  – Use the protocol adl://<directory>/<directory>/<filename>
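
As a hedged illustration, once the connector jars and credentials are configured on the cluster, Spark addresses these stores purely through the path scheme; the bucket, account and paths in this minimal sketch are hypothetical:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("blob-storage-example").getOrCreate()

# Read Parquet straight out of an S3 bucket via the s3a connector.
events = spark.read.parquet("s3a://example-bucket/raw/events/")

# Write the same data to an ADLS account via the adl connector.
events.write.mode("overwrite").parquet(
    "adl://exampleadls.azuredatalakestore.net/curated/events/"
)
```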

© Cloudera, Inc. All rights reserved. 21


Compute over storage

[Diagram: compute engines layered over shared storage]
• Compute: Hive, Pig, Search, Impala, Spark, MapReduce, HBase
• Storage: HDFS and Kudu on the cluster, or cloud filesystems (S3, ADLS)
• The engines share the storage layer rather than each keeping its own copy of the data.

© Cloudera, Inc. All rights reserved. 22


Schema on write, or 'structured data'

1. Define schema
2. Create table(s)
3. Map known fields
4. Discard unknown fields
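
A minimal PySpark sketch of those four steps, using a hypothetical table and landing path; the schema is fixed before any data is written, and fields outside it are dropped:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = (SparkSession.builder.appName("schema-on-write")
         .enableHiveSupport().getOrCreate())

# 1-2. Define the schema up front and create the table.
spark.sql("""
    CREATE TABLE IF NOT EXISTS subscriptions (
        customer_id BIGINT,
        plan        STRING,
        started_on  DATE
    )
    STORED AS PARQUET
""")

# 3-4. Map only the known fields from the incoming records; anything else in
# the source is discarded before the write.
incoming = spark.read.json("/landing/subscriptions/")  # hypothetical landing path
(incoming
    .select(col("customer_id").cast("bigint"),
            col("plan").cast("string"),
            col("started_on").cast("date"))
    .write.insertInto("subscriptions"))
```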

© Cloudera, Inc. All rights reserved. 23


Schema on read, or 'unstructured data'

1. Write whole record(s) to the filesystem (compressed)
2. Register schema with the metastore
3. Query engine applies the schema to the data
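
A minimal PySpark sketch of the same three steps with a hypothetical path; the raw records stay untouched on the filesystem and the schema is applied only when the data is queried (a production setup would typically register the schema as an external table in the Hive metastore rather than a temporary view):

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, LongType

spark = SparkSession.builder.appName("schema-on-read").getOrCreate()

# 1. Raw records were landed as-is (e.g. gzipped JSON) under a hypothetical path.
RAW_PATH = "/raw/clickstream/"

# 2. Declare the schema separately from the data and register it as a view.
schema = StructType([
    StructField("user_id", LongType()),
    StructField("url", StringType()),
    StructField("ts", StringType()),
])
spark.read.schema(schema).json(RAW_PATH).createOrReplaceTempView("clickstream")

# 3. The query engine applies the schema at read time.
spark.sql("SELECT url, COUNT(*) AS hits FROM clickstream GROUP BY url").show()
```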

© Cloudera, Inc. All rights reserved. 24


Popular file format options

• XML, JSON files – Can't be both split and compressed.
• Text/Delimited/CSV/JSON records – Usable everywhere; schema on read; poor performance and poor compression.
• Avro – Contains the schema, but also allows schema on read; usable inside and outside of Hadoop.
• Parquet – Columnar and splittable, with query performance benefits and excellent compression; supports schema evolution (adding columns); skips columns well during scans.
• ORC (not supported by Cloudera; HDP Hive only) – Similar to Parquet, with higher compression but poor data skipping; Hortonworks is working on ACID transactions and secondary indexes.

Example sizes for the same data set:

File type                         Example size
Uncompressed CSV                  1.8 GB
Avro                              1.5 GB
Avro w/ snappy compression        750 MB
Parquet w/ snappy compression     300 MB
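
A minimal PySpark sketch (hypothetical paths) of the conversion behind the size comparison above, rewriting delimited text as snappy-compressed Parquet:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("format-conversion").getOrCreate()

# Read delimited text with a header row, letting Spark infer the column types.
csv_df = (spark.read.option("header", "true")
          .option("inferSchema", "true")
          .csv("/raw/transactions.csv"))

# Rewrite it as snappy-compressed Parquet; columnar layout plus compression is
# what shrinks the example data set from ~1.8 GB of CSV to ~300 MB of Parquet.
(csv_df.write
    .option("compression", "snappy")
    .mode("overwrite")
    .parquet("/curated/transactions_parquet/"))
```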

© Cloudera, Inc. All rights reserved. 25


Raw and formatted data copies

• Keep the raw version if there is a chance that information will be lost in the translation.
• Use columnar storage for the formatted copy to improve analytic performance immensely.
• Think about a metadata tagging policy (e.g. with Cloudera Navigator) to assist with data stewardship.

© Cloudera, Inc. All rights reserved. 26


Big data pipelines

Data ingestion   Data engineering   Data stewardship   Data science   Data analytics
Capture          Cleanse            Store              Model          BI
Move             Conform            Secure             Score          Online
Stream           Transform          Govern             Enrich         APIs
                 Enrich             Tag                Predict

© Cloudera, Inc. All rights reserved. 27


Which do you want?

Data lake Data hub

© Cloudera, Inc. All rights reserved. 28


Data lake to a data hub
• Comprehensive, planned and enforced data hierarchy
• Carefully administered versioning and retention policies
• Comprehensive, unified security, governance and lineage
• Encourage and support metadata
• Establish standards for data, metadata and analytic models
• Maximize reuse of data without making copies
  – Balanced with security and performance concerns – don't be an ideologue!
• Plan staffing around new roles

© Cloudera, Inc. All rights reserved. 29


Big data fundamentals

Data integration
Optimizing for data ingestion with volume, velocity and variety

© Cloudera, Inc. All rights reserved. 30


Apache Flume

[Diagram: tiers of Flume agents filtering, transforming, encrypting and compressing events on their way into HDFS]
• Collect data as it is produced
  – Files, syslogs, stdout or a custom source
  – Process in place, such as encrypt or compress
• Pre-process data before storing
  – Such as transform, scrub or enrich
• Write in parallel
  – Scalable throughput
• Store in any format
  – Text, compressed, binary, or a custom sink

© Cloudera, Inc. All rights reserved. 31


Apache Kafka

[Diagram: producers, brokers and consumers]
• Producers push to Kafka; consumers pull from Kafka.
• A topic (TopicA) is split into partitions (Partition0–2), each hosted by a broker (Broker1–3).
• Consumers can share a consumer group, so a topic's partitions are divided among the group's members.
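
A minimal sketch of the push/pull pattern using the kafka-python client (assumed to be installed; the broker addresses and group name are stand-ins):

```python
from kafka import KafkaProducer, KafkaConsumer

BROKERS = ["broker1:9092", "broker2:9092", "broker3:9092"]

# Producer: push messages to a topic; Kafka assigns each message to a partition.
producer = KafkaProducer(bootstrap_servers=BROKERS)
producer.send("TopicA", b"hello from the producer")
producer.flush()

# Consumer: pull messages; consumers sharing a group_id divide the topic's
# partitions among themselves.
consumer = KafkaConsumer(
    "TopicA",
    bootstrap_servers=BROKERS,
    group_id="example-consumer-group",
    auto_offset_reset="earliest",
)
for record in consumer:
    print(record.partition, record.offset, record.value)
    break  # read a single record for the sketch
```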

© Cloudera, Inc. All rights reserved. 32


Kafka redundancy

[Diagram: partition replicas spread across three brokers]
• Each broker leads one partition of TopicA and holds replicas of the other two (replication factor 3).
• If a broker fails, another broker holding a replica can take over as leader for its partition.
© Cloudera, Inc. All rights reserved. 33


Apache Sqoop

▪ Rapidly moves large amounts of data between relational databases and HDFS
  – Import tables (or partial tables) from an RDBMS into HDFS
  – Export data from HDFS to a database table
▪ Uses JDBC to connect to the database
  – Works with virtually all standard RDBMSs
▪ Custom "connectors" for some RDBMSs provide much higher throughput
  – Available for certain databases, such as Teradata and Oracle
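
A minimal sketch of an import, invoking the sqoop command line from Python; the connection string, credentials, table and target directory are hypothetical stand-ins:

```python
import subprocess

subprocess.run(
    [
        "sqoop", "import",
        "--connect", "jdbc:mysql://dbhost/sales",
        "--username", "etl_user",
        "--password-file", "/user/etl/.db_password",  # keeps the password off the command line
        "--table", "subscriptions",
        "--target-dir", "/raw/subscriptions",
        "--num-mappers", "4",        # parallel map tasks doing the copy
        "--as-parquetfile",          # land the data directly as Parquet
    ],
    check=True,
)
```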

© Cloudera, Inc. All rights reserved. 34


Thank you

The modern platform for machine learning and analytics, optimized for the cloud

© Cloudera, Inc. All rights reserved. 35
