Bigdatafundamentals Part1
• This document is a preliminary version and not subject to your license agreement or any other agreement
with Cloudera. This document contains only intended strategies, developments and functionalities of
Cloudera products and is not intended to be binding upon Cloudera to any particular course of business,
product strategy and/or development. Please note that this document is subject to change and may be
changed by Cloudera at any time without notice.
• Cloudera assumes no responsibility for errors or omissions in this document. Cloudera does not warrant
the accuracy or completeness of the information, text, graphics, links or other items contained within this
material. This document is provided without a warranty of any kind, either express or implied, including but
not limited to the implied warranties of merchantability, fitness for a particular purpose or non-infringement.
• Cloudera shall have no liability for damages of any kind including without limitation direct, special, indirect
or consequential damages that may result from the use of these materials. The limitation shall not apply in
cases of gross negligence.
- Scott McNealy
CEO Sun Microsystems
Project management committee (PMC) chair – ensures the project complies with ASF requirements
PMC members – decide the architecture, feature set and direction of the project; they are usually also Committers
Committers – have write access to the code, although contributions are approved by the PMC
[Diagram: cluster layout. Master services: Name Node, Secondary Name Node, HiveServer. Workers: six nodes, each with a Data Node, YARN Resource Pool(s) and a Server. Gateway(s): CM Agent, CDSW, CDSW Session, User App.]
© Cloudera, Inc. All rights reserved. 18
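HDFS
[Diagram: a Name Node with Standby and Secondary Name Nodes; a file (labeled FileQ) divided into blocks BX, BY and BZ, with block replicas (BX1, BY1, BZ1, BX2, BY2) spread across nodes; a directory tree containing hive, tables, sales, subscriptions and Data2.parquet.]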
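The block labels in the HDFS diagram (BX1, BY1, BZ1, BX2, BY2) suggest a file that is cut into blocks, with each block stored as several replicas on different data nodes. A minimal sketch of that idea follows. The 128 MB block size, the replication factor of 3, the node names and the round-robin placement are all assumptions made for illustration; real HDFS placement is rack-aware and is decided by the Name Node.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # assumed block size (128 MB)
REPLICATION = 3                 # assumed replication factor


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the byte length of each block for a file of file_size bytes."""
    if file_size <= 0:
        return []
    full, rest = divmod(file_size, block_size)
    return [block_size] * full + ([rest] if rest else [])


def place_replicas(num_blocks, data_nodes, replication=REPLICATION):
    """Assign each block to `replication` distinct data nodes, round-robin.

    Round-robin is a simplification chosen for this sketch.
    """
    if replication > len(data_nodes):
        raise ValueError("not enough data nodes for the requested replication")
    placement = {}
    for block in range(num_blocks):
        placement[block] = [
            data_nodes[(block + r) % len(data_nodes)] for r in range(replication)
        ]
    return placement


blocks = split_into_blocks(300 * 1024 * 1024)  # a hypothetical 300 MB file
nodes = ["dn1", "dn2", "dn3", "dn4"]           # hypothetical data node names
placement = place_replicas(len(blocks), nodes)

for block_id, replicas in placement.items():
    print(f"block {block_id}: {blocks[block_id]} bytes on {replicas}")
```

Because every block lives on more than one node, losing a single data node in this sketch never removes the only copy of a block.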
Public cloud blob storage
Public clouds offer low-cost, highly available storage
Designed for access inside and outside of Hadoop
[Diagram labels: Hive, Pig, HBase]
2. Create table(s)
3. Query engine applies schema to data
4. Discard unknown fields
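Steps 2 to 4 above can be sketched in a few lines: the raw file is stored untouched, a table definition supplies column names and types, and the schema is applied only when the data is read. The schema, the sample CSV and the function name below are hypothetical. Turning an unparseable value into `None` is an assumption that mirrors how SQL engines commonly return NULL in that case.

```python
import csv
import io

# Hypothetical table schema: column name -> type used to parse the raw text.
SALES_SCHEMA = {"id": int, "region": str, "amount": float}

# Hypothetical raw file; it contains a field the table does not know about.
RAW_CSV = """id,region,amount,legacy_code
1,EMEA,19.99,x7
2,APAC,5.00,
3,AMER,not_a_number,q2
"""


def read_with_schema(raw_text, schema):
    """Yield rows parsed against `schema`; the raw text is never modified."""
    for record in csv.DictReader(io.StringIO(raw_text)):
        row = {}
        for column, cast in schema.items():
            value = record.get(column)
            try:
                row[column] = cast(value) if value not in (None, "") else None
            except ValueError:
                row[column] = None  # unparseable value surfaces as a null
        yield row  # fields not named in the schema (legacy_code) are dropped


rows = list(read_with_schema(RAW_CSV, SALES_SCHEMA))
for row in rows:
    print(row)
```

The same raw file could be read later with a different schema, for example one that adds `legacy_code`, without rewriting the stored data.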
Text/Delimited/CSV/JSON Records
• Usable everywhere
• Schema on read
• Poor performance, poor compression
Avro
• Contain schema, but also allow schema on read
• Usable inside and outside of Hadoop
Parquet
• Columnar, splittable, query performance benefits, excellent compression
• Support schema evolution (adding columns)
• Skips columns well during scans
ORC (not supported by Cloudera, HDP Hive only)
• Similar to Parquet, with higher compression but poor data skipping
• Hortonworks working on ACID transactions, secondary indexes

File type                        Example size
Uncompressed CSV                 1.8 GB
Avro                             1.5 GB
Avro w/ snappy compression       750 MB
Parquet w/ snappy compression    300 MB
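To illustrate why a columnar format "skips columns well during scans", the sketch below stores the same records in a row layout and in a column layout, then counts how many values must be read to total a single column. It is a plain-Python model of the idea only; it does not use or reproduce the Parquet file format, and the records and function names are hypothetical.

```python
# Hypothetical sales records used only for illustration.
records = [
    {"id": 1, "region": "EMEA", "amount": 19.99},
    {"id": 2, "region": "APAC", "amount": 5.00},
    {"id": 3, "region": "EMEA", "amount": 42.50},
]


def to_columnar(rows):
    """Pivot a list of row dicts into one list of values per column."""
    columns = {name: [] for name in rows[0]}
    for row in rows:
        for name, value in row.items():
            columns[name].append(value)
    return columns


def sum_amount_row_layout(rows):
    """Row layout: every field of every record is read to reach `amount`."""
    values_read = sum(len(row) for row in rows)
    return sum(row["amount"] for row in rows), values_read


def sum_amount_column_layout(columns):
    """Column layout: only the `amount` column is read; others are skipped."""
    amounts = columns["amount"]
    return sum(amounts), len(amounts)


columns = to_columnar(records)
row_total, row_reads = sum_amount_row_layout(records)
col_total, col_reads = sum_amount_column_layout(columns)

print(f"row layout:    total={row_total:.2f}, values read={row_reads}")
print(f"column layout: total={col_total:.2f}, values read={col_reads}")
```

Grouping values of the same column together also places similar values next to each other, which is one reason columnar files tend to compress well.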
Data ingestion Data engineering Data stewardship Data science Data analytics
Data integration
Optimizing for data ingestion with volume, velocity and variety
Flume Agent(s)
• Write in parallel
• Scalable throughput
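[Diagram: producers publish to TopicA, which is split into Partition0, Partition1 and Partition2 hosted on brokers (Broker2 and Broker3 are labeled); a consumer group containing ConsumerA and ConsumerB reads the partitions. A second panel shows a producer, a consumer, TopicA partitions on Broker3, and HDFS.]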
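The topic, partition and consumer-group labels above follow the vocabulary of a partitioned message log such as Apache Kafka. The sketch below models two ideas the diagram appears to show: a producer maps each message key to one partition of TopicA, and the partitions are divided among the consumers of one group so that each partition has a single reader. It is a pure-Python simulation, not client code; the CRC32 hash, the round-robin assignment and the function names are assumptions made for this sketch.

```python
import zlib

NUM_PARTITIONS = 3  # TopicA has Partition0..Partition2 in the diagram


def choose_partition(key, num_partitions=NUM_PARTITIONS):
    """Map a message key to a partition with a stable hash.

    CRC32 is a stand-in chosen for this sketch, not the hash a real client uses.
    """
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def assign_partitions(partitions, consumers):
    """Spread partitions over the consumers of one group, round-robin."""
    assignment = {consumer: [] for consumer in consumers}
    for index, partition in enumerate(partitions):
        assignment[consumers[index % len(consumers)]].append(partition)
    return assignment


partitions = list(range(NUM_PARTITIONS))
assignment = assign_partitions(partitions, ["ConsumerA", "ConsumerB"])

for key in ["customer-1", "customer-2", "customer-1"]:  # hypothetical keys
    print(f"{key} -> TopicA-Partition{choose_partition(key)}")
print(assignment)
```

Messages with the same key always land in the same partition in this model, and adding consumers to the group (up to the number of partitions) spreads the read load further.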