
HCMC UNIVERSITY OF TECHNOLOGY

Big Data Management
CO4033 - L01 - Group 04

Tuesday, October 25, 2022


Today's Agenda
APACHE SQOOP

APACHE DRILL

APACHE KAFKA

APACHE CASSANDRA

APACHE SQOOP

01 What is Apache Sqoop?

02 Why do we use Apache Sqoop?

03 How does Apache Sqoop work?

04 Features of Apache Sqoop

What is Apache Sqoop?

Apache Sqoop is a tool designed for data transfer between the Hadoop Distributed File System (HDFS) and relational databases or mainframes.
Why do we use Apache Sqoop?
Problems with Relational Databases?
Needs of Sqoop

01 Apache Sqoop automates the process of data import and export.

02 Internally, Sqoop converts the Sqoop command into MapReduce tasks, which are then executed over HDFS.

03 Sqoop uses the YARN (Yet Another Resource Negotiator) framework for importing and exporting the data, which provides fault tolerance on top of parallelism.

How does Apache Sqoop work?
Sqoop Architecture

01 The client submits the import/export command.

02 Sqoop fetches data from different databases. Here, we have an enterprise data warehouse, document-based systems, and a relational database. We have a connector for each of these; connectors help Sqoop work with a range of accessible databases.

03 Multiple mappers perform map tasks to load the data onto HDFS.

04 Similarly, multiple map tasks export the data from HDFS onto the RDBMS using the Sqoop export command.
Sqoop Import

01 Sqoop first performs an introspection of the database to gather metadata (such as primary-key information) for the data to be imported; all this metadata is sent to the Sqoop import job.

02 It then submits a map-only job. Sqoop divides the input dataset into splits and uses individual map tasks to push the splits to HDFS.
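
For illustration, a minimal sketch of an import run driven from Python; the connection string, credentials file, table name, and HDFS path are all hypothetical, and the sqoop CLI must be on the PATH:

import subprocess

# Import the hypothetical "employees" table into HDFS. Sqoop first
# introspects the table's metadata, then launches a map-only job in
# which each of the 4 mappers (-m 4) pushes one split onto HDFS.
subprocess.run([
    "sqoop", "import",
    "--connect", "jdbc:mysql://dbhost/shop",       # source RDBMS
    "--username", "sqoop_user",
    "--password-file", "/user/sqoop/.password",    # avoids a plaintext password
    "--table", "employees",                        # table to import
    "--target-dir", "/user/hadoop/employees",      # HDFS destination
    "-m", "4",                                     # number of map tasks
], check=True)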
Sqoop Export

01 The first step is to gather the metadata through introspection.

02 Sqoop then divides the input dataset into splits and uses individual map tasks to push the splits to the RDBMS.
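
The export direction is symmetric; a sketch under the same assumptions as the import example (the target table must already exist in the RDBMS):

import subprocess

# Export the HDFS directory back into the hypothetical
# "employees_backup" table; map tasks push the splits in parallel.
subprocess.run([
    "sqoop", "export",
    "--connect", "jdbc:mysql://dbhost/shop",
    "--username", "sqoop_user",
    "--password-file", "/user/sqoop/.password",
    "--table", "employees_backup",                 # target table
    "--export-dir", "/user/hadoop/employees",      # HDFS source directory
    "-m", "4",
], check=True)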
Features of Apache Sqoop
Features of Sqoop

Advantages of Sqoop

01 Sqoop allows data transfer with different structured data stores such as Teradata, Postgres, Oracle, and so on.

02 Since the data from the RDBMS is transferred to and stored in Hadoop, Apache Sqoop allows us to offload the processing done in the ETL (Extract, Transform, and Load) process onto fast, low-cost, and effective Hadoop processes.

03 Apache Sqoop executes data transfer in parallel, so its execution is quick and cost-effective.

04 Sqoop helps in integration with sequential data from the mainframe. This helps reduce the high cost of executing specific jobs on mainframe hardware.
Disadvantages of Sqoop

01 We cannot pause or resume Apache Sqoop once it is started; it is an automatic step. If it fails, we have to clean things up and start again.

02 The performance of Sqoop Export depends on the hardware configuration (memory, hard disk) of the RDBMS server.

03 It is slow because it uses MapReduce for backend processing.

04 Failures need special handling in the case of a partial export or import.

05 It has bulkier connectors for a few databases.

06 Sqoop 1 uses a JDBC connection for connecting with the RDBMS, which can be inefficient and less performant.

07 Sqoop 1 does not provide a Graphical User Interface for easy use.
APACHE DRILL

01 What is Apache Drill?

02 Key Features

03 Why Drill?

04 Drill Query Execution

What is Apache Drill?
Drill is an Apache open-source SQL query engine for Big Data exploration. Drill is designed from
the ground up to support high-performance analysis on the semi-structured and rapidly evolving
data coming from modern Big Data applications, while still providing the familiarity and
ecosystem of ANSI SQL, the industry-standard query language. Drill provides plug-and-play
integration with existing Apache Hive and Apache HBase deployments.
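
As a taste of this plug-and-play model, a minimal sketch that submits a query to Drill's REST API from Python; it assumes a Drill instance on localhost:8047 and a hypothetical self-describing file /tmp/logs.json, which needs no metadata registration:

import requests

# Submit ANSI SQL over Drill's REST API; the dfs storage plugin
# ships with Drill, so the JSON file can be queried in place.
resp = requests.post(
    "http://localhost:8047/query.json",
    json={
        "queryType": "SQL",
        "query": "SELECT t.log_level, COUNT(*) AS n "
                 "FROM dfs.`/tmp/logs.json` t GROUP BY t.log_level",
    },
)
resp.raise_for_status()
for row in resp.json()["rows"]:
    print(row)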

Key Features

01 Low-latency SQL queries.

02 Dynamic queries on self-describing data in files (such as JSON, Parquet, text) and HBase tables, without requiring metadata definitions in the Hive metastore.

03 ANSI SQL.

04 Nested data support.

05 Integration with Apache Hive (queries on Hive tables and views, support for all Hive file formats and Hive UDFs).

06 BI/SQL tool integration using standard JDBC/ODBC drivers.
Why Drill?

01 Get started in minutes.

02 Schema-free JSON model.

03 Query complex, semi-structured data in-situ.

04 Real SQL – not “SQL-like”.

05 Leverage standard BI tools.

06 Interactive queries on Hive tables.

07 Access multiple data sources.

08 User-Defined Functions (UDFs) for Drill and Hive.

09 High performance.

10 Scales from a single laptop to a 1000-node cluster.
Drill Query Execution

Use Cases

01 Cloud JSON and Sensor Analytics.

02 Works well with Hive.

03 SQL for NoSQL.

APACHE KAFKA

01 Ideal Publish-Subscribe System

02 Kafka Architecture

03 Why Kafka?

04 The Data Ecosystem

Publish/Subscribe Messaging
Publish/subscribe messaging is a pattern that is characterized by the sender (publisher) of a
piece of data (message) not specifically directing it to a receiver.

A single, direct metrics publisher


(Source: Kafka: The Definitive Guide, Neha Narkhede)
Publish/Subscribe Messaging

Many metrics publishers, using direct connections


(Source: Kafka: The Definitive Guide, Neha Narkhede)
Individual Queue Systems

Multiple publish/subscribe systems


(Source: Kafka: The Definitive Guide, Neha Narkhede)
Enter Kafka

Apache Kafka is often described as a “distributed commit log” or, more recently, as a “distributed streaming platform.”

A filesystem or database commit log is designed to provide a durable record of all transactions so that they can be replayed to consistently build the state of a system.

Data within Kafka is stored durably, in order, and can be read deterministically.
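
A minimal producer sketch using the third-party kafka-python client; the broker address and the "metrics" topic are assumptions:

from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers="localhost:9092")

# Messages with the same key are hashed to the same partition, so
# reads of that partition replay them in the order they were written.
for reading in (b"cpu=0.42", b"cpu=0.57", b"cpu=0.61"):
    producer.send("metrics", key=b"webserver-01", value=reading)
producer.flush()  # block until the broker acknowledges the sends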

Topics and Partitions

Topics in a Publish-Subscribe System


(Source: Apache Kafka Guide, Cloudera, Inc.)
Topics and Partitions

Representation of a topic with multiple partitions


(Source: Kafka: The Definitive Guide, Neha Narkhede)
Producers and Consumers

A consumer group reading from a topic


(Source: Kafka: The Definitive Guide, Neha Narkhede)
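
A matching consumer sketch, under the same assumptions as the producer example; starting a second process with the same group_id would split the topic's partitions between the two consumers, as in the figure above:

from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "metrics",
    group_id="dashboard",                  # hypothetical consumer group
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",          # start from the oldest retained offset
)
for msg in consumer:
    print(msg.partition, msg.offset, msg.key, msg.value)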
Brokers and Clusters

Brokers in a Publish-Subscribe System

(Source: Apache Kafka Guide, Cloudera, Inc.)
Brokers and Clusters

Replication of partitions in a cluster


(Source: Kafka: The Definitive Guide, Neha Narkhede)
Brokers and Clusters

Multiple datacenter architecture


(Source: Kafka: The Definitive Guide, Neha Narkhede)
Why Kafka?
Multiple producers

Multiple consumers

Disk-based retention

Scalable

High performance

Disadvantages of Kafka
No complete set of monitoring tools

Issues with message tweaking

No support for wildcard topic selection

Lack of pace

Reduced performance
The Data Ecosystem

A big data ecosystem


(Source: Kafka: The Definitive Guide, Neha Narkhede)
Use Cases
Activity tracking

Messaging

Metrics and logging

Commit log

Stream processing

APACHE CASSANDRA

01 Architecture

02 Data Model

03 Partitioner and Replication Strategies

04 Read and Write

Architecture

Architecture
01. Node

02. Rack

03. Datacenter

04. Cluster

Data Model

01 Keyspaces

02 Tables (Column families)

03 Columns

Primary Key

01 Partition Key

Partition Key:

CREATE TABLE application_logs (
    id INT,
    app_name VARCHAR,
    hostname VARCHAR,
    log_datetime TIMESTAMP,
    env VARCHAR,
    log_level VARCHAR,
    log_message TEXT,
    PRIMARY KEY (app_name)
);

Composite Partition Key:

CREATE TABLE application_logs (
    id INT,
    app_name VARCHAR,
    hostname VARCHAR,
    log_datetime TIMESTAMP,
    env VARCHAR,
    log_level VARCHAR,
    log_message TEXT,
    PRIMARY KEY ((app_name, env))
);

02 Clustering Key

Clustering Key:

CREATE TABLE application_logs (
    id INT,
    app_name VARCHAR,
    hostname VARCHAR,
    log_datetime TIMESTAMP,
    env VARCHAR,
    log_level VARCHAR,
    log_message TEXT,
    PRIMARY KEY ((app_name, env), hostname, log_datetime)
);

Clustering Order By:

CREATE TABLE application_logs (
    id INT,
    app_name VARCHAR,
    hostname VARCHAR,
    log_datetime TIMESTAMP,
    env VARCHAR,
    log_level VARCHAR,
    log_message TEXT,
    PRIMARY KEY ((app_name, env), hostname, log_datetime)
) WITH CLUSTERING ORDER BY (hostname ASC, log_datetime DESC);
Partitioner and Replication Strategies

01 Partitioner

Random Partitioner

Murmur3 Partitioner

02 Replication Strategies

Simple Strategy

Network Topology Strategy
Data Writing

Steps

01 Write request: the client sends a write request to a node, which acts as the coordinator.

02 Coordinator node: the coordinator forwards the write to the replica nodes that own the data.

03 Commit log and Memtable: each replica appends the write to its commit log, then updates its in-memory memtable.

04 Response: acknowledgements flow back through the coordinator to the client.
Data Reading

Steps

01 Read request: the client sends a read request to a coordinator node.

02 Memtable: each queried replica checks its in-memory memtable first.

03 SSTable: on a memtable miss, the replica reads from its on-disk SSTables.

04 Result: the coordinator returns the most recent result to the client.
Consistency Level

01 ONE

02 QUORUM

03 LOCAL_QUORUM

04 ALL

Strong Consistency

W + R > RF

where W is the number of replicas that must acknowledge a write, R the number of replicas consulted on a read, and RF the replication factor. For example, with RF = 3 and QUORUM (2 replicas) for both writes and reads, 2 + 2 = 4 > 3, so every read overlaps at least one replica holding the latest write (see the sketch below).
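
A sketch of tunable consistency with the DataStax Python driver, assuming the application_logs table from the data-model slides lives in the logs_prod keyspace created earlier:

from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

session = Cluster(["127.0.0.1"]).connect("logs_prod")

# With RF = 3, QUORUM means 2 replicas: W + R = 2 + 2 = 4 > 3,
# so reads always overlap a replica that saw the latest write.
write = SimpleStatement(
    "INSERT INTO application_logs "
    "(app_name, env, hostname, log_datetime, log_message) "
    "VALUES ('api', 'prod', 'web-01', toTimestamp(now()), 'started')",
    consistency_level=ConsistencyLevel.QUORUM,
)
session.execute(write)

read = SimpleStatement(
    "SELECT log_message FROM application_logs "
    "WHERE app_name = 'api' AND env = 'prod'",
    consistency_level=ConsistencyLevel.QUORUM,
)
for row in session.execute(read):
    print(row.log_message)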

Pros and Cons

Pros

01 Open Source

02 Scalable and Elastic

03 High Availability and Fault Tolerance

04 Consistency

Cons

01 Query

02 Sorting

03 Aggregation

04 Storage
References

01 Apache Drill - Introduction (tutorialspoint.com)

02 Cassandra Partition Key, Composite Key, and Clustering Key | Baeldung

03 Cassandra Performance: The Most Comprehensive Overview You’ll Ever See (scnsoft.com)

04 Cluster, Datacenters, Racks and Nodes in Cassandra | Baeldung

05 Drill Introduction - Apache Drill

06 What is a Cassandra Data Model? Definition & FAQs | ScyllaDB

07 Apache Kafka Guide, Cloudera, Inc.

08 Kafka: The Definitive Guide, Neha Narkhede

09 KAFKA: The Modern Platform for Data Management and Analysis in Big Data Domain, Rishika Shree
Thank you!
Q&A
Any questions?

Tuesday, October 25, 2022
