
HCMC UNIVERSITY OF TECHNOLOGY

Big Data Management
CO4033 - L01 - Group 04

Tuesday, October 25, 2022


Today's Agenda
APACHE SQOOP

APACHE DRILL

APACHE KAFKA

APACHE CASSANDRA

APACHE SQOOP

01 What is Apache Sqoop?

02 Why do we use Apache Sqoop?

03 How does Apache Sqoop work?

04 Features of Apache Sqoop

What is Apache Sqoop?

Apache Sqoop is a tool designed for data transfer between the Hadoop Distributed File System (HDFS) and relational databases or mainframes.
Why do we use Apache Sqoop?
Problems with Relational Databases?
Needs of Sqoop

01 Apache Sqoop automates the process of data import and export.

02 Internally, Sqoop converts the Sqoop command into MapReduce tasks, which are then executed over HDFS.

03 Sqoop uses the YARN (Yet Another Resource Negotiator) framework for importing and exporting the data, which provides fault tolerance on top of parallelism.

How does Apache Sqoop work?
Sqoop Architecture

01 The client submits the import/export command.

02 Sqoop fetches data from different databases. Here, we have an enterprise data warehouse, document-based systems, and a relational database. We have a connector for each of these; connectors help Sqoop work with a range of accessible databases.

03 Multiple mappers perform map tasks to load the data onto HDFS.

04 Similarly, multiple map tasks export the data from HDFS onto the RDBMS using the Sqoop export command.
Sqoop Import

01 Sqoop first performs an introspection of the database to gather metadata (such as primary-key information) for the data to be imported; all this metadata is sent to the Sqoop import job.

02 It then submits a map-only job. Sqoop divides the input dataset into splits and uses individual map tasks to push the splits to HDFS.
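
For illustration, a minimal sketch of an import run driven from Python; the connection string, credentials file, table name, and HDFS path are all hypothetical, and the sqoop CLI must be on the PATH:

import subprocess

# Import the hypothetical "employees" table into HDFS. Sqoop first
# introspects the table's metadata, then launches a map-only job in
# which each of the 4 mappers (-m 4) pushes one split onto HDFS.
subprocess.run([
    "sqoop", "import",
    "--connect", "jdbc:mysql://dbhost/shop",       # source RDBMS
    "--username", "sqoop_user",
    "--password-file", "/user/sqoop/.password",    # avoids a plaintext password
    "--table", "employees",                        # table to import
    "--target-dir", "/user/hadoop/employees",      # HDFS destination
    "-m", "4",                                     # number of map tasks
], check=True)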
Sqoop Export

01 The first step is to gather the metadata through introspection.

02 Sqoop then divides the input dataset into splits and uses individual map tasks to push the splits to the RDBMS.
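
The export direction is symmetric; a sketch under the same assumptions as the import example (the target table must already exist in the RDBMS):

import subprocess

# Export the HDFS directory back into the hypothetical
# "employees_backup" table; map tasks push the splits in parallel.
subprocess.run([
    "sqoop", "export",
    "--connect", "jdbc:mysql://dbhost/shop",
    "--username", "sqoop_user",
    "--password-file", "/user/sqoop/.password",
    "--table", "employees_backup",                 # target table
    "--export-dir", "/user/hadoop/employees",      # HDFS source directory
    "-m", "4",
], check=True)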
Features of Apache Sqoop
Features of Sqoop

Advantages of Sqoop

01 Sqoop allows data transfer with different structured data stores such as Teradata, Postgres, Oracle, and so on.

02 Since the data from the RDBMS is transferred to and stored in Hadoop, Apache Sqoop allows us to offload the processing done in the ETL (Extract, Transform, and Load) process onto fast, low-cost, and effective Hadoop processes.

03 Apache Sqoop executes data transfer in parallel, so its execution is quick and cost-effective.

04 Sqoop helps in integration with sequential data from the mainframe. This helps reduce the high cost of executing specific jobs on mainframe hardware.
Disadvantages of Sqoop

01 We cannot pause or resume Apache Sqoop once it is started; it is an automatic step. If it fails, we have to clean things up and start again.

02 The performance of Sqoop Export depends on the hardware configuration (memory, hard disk) of the RDBMS server.

03 It is slow because it uses MapReduce for backend processing.

04 Failures need special handling in the case of a partial export or import.

05 It has bulkier connectors for a few databases.

06 Sqoop 1 uses a JDBC connection for connecting with the RDBMS, which can be inefficient and less performant.

07 Sqoop 1 does not provide a Graphical User Interface for easy use.
APACHE DRILL

01 What is Apache Drill?

02 Key Features

03 Why Drill?

04 Drill Query Execution

What is Apache Drill?
Drill is an Apache open-source SQL query engine for Big Data exploration. Drill is designed from
the ground up to support high-performance analysis on the semi-structured and rapidly evolving
data coming from modern Big Data applications, while still providing the familiarity and
ecosystem of ANSI SQL, the industry-standard query language. Drill provides plug-and-play
integration with existing Apache Hive and Apache HBase deployments.
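
As a taste of this plug-and-play model, a minimal sketch that submits a query to Drill's REST API from Python; it assumes a Drill instance on localhost:8047 and a hypothetical self-describing file /tmp/logs.json, which needs no metadata registration:

import requests

# Submit ANSI SQL over Drill's REST API; the dfs storage plugin
# ships with Drill, so the JSON file can be queried in place.
resp = requests.post(
    "http://localhost:8047/query.json",
    json={
        "queryType": "SQL",
        "query": "SELECT t.log_level, COUNT(*) AS n "
                 "FROM dfs.`/tmp/logs.json` t GROUP BY t.log_level",
    },
)
resp.raise_for_status()
for row in resp.json()["rows"]:
    print(row)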

Key Features

01 Low-latency SQL queries.

02 Dynamic queries on self-describing data in files (such as JSON, Parquet, text) and HBase tables, without requiring metadata definitions in the Hive metastore.

03 ANSI SQL.

04 Nested data support.

05 Integration with Apache Hive (queries on Hive tables and views, support for all Hive file formats and Hive UDFs).

06 BI/SQL tool integration using standard JDBC/ODBC drivers.
Why Drill?

01 Get started in minutes.

02 Schema-free JSON model.

03 Query complex, semi-structured data in-situ.

04 Real SQL – not “SQL-like”.

05 Leverage standard BI tools.

06 Interactive queries on Hive tables.

07 Access multiple data sources.

08 User-Defined Functions (UDFs) for Drill and Hive.

09 High performance.

10 Scales from a single laptop to a 1000-node cluster.
Drill Query Execution

Use Cases

01 Cloud JSON and Sensor Analytics.

02 Works well with Hive.

03 SQL for NoSQL.

APACHE KAFKA

01 Ideal Publish-Subscribe System

02 Kafka Architecture

03 Why Kafka?

04 The Data Ecosystem

Publish/Subscribe Messaging
Publish/subscribe messaging is a pattern that is characterized by the sender (publisher) of a
piece of data (message) not specifically directing it to a receiver.

A single, direct metrics publisher


(Source: Kafka: The Definitive Guide, Neha Narkhede)
Publish/Subscribe Messaging

Many metrics publishers, using direct connections


(Source: Kafka: The Definitive Guide, Neha Narkhede)
Individual Queue Systems

Multiple publish/subscribe systems


(Source: Kafka: The Definitive Guide, Neha Narkhede)
Enter Kafka

Apache Kafka is often described as a “distributed commit log” or, more recently, as a “distributed streaming platform.”

A filesystem or database commit log is designed to provide a durable record of all transactions so that they can be replayed to consistently build the state of a system.

Data within Kafka is stored durably, in order, and can be read deterministically.
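
A minimal producer sketch using the third-party kafka-python client; the broker address and the "metrics" topic are assumptions:

from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers="localhost:9092")

# Messages with the same key are hashed to the same partition, so
# reads of that partition replay them in the order they were written.
for reading in (b"cpu=0.42", b"cpu=0.57", b"cpu=0.61"):
    producer.send("metrics", key=b"webserver-01", value=reading)
producer.flush()  # block until the broker acknowledges the sends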

Topics and Partitions

Topics in a Publish-Subscribe System


(Source: Apache Kafka Guide, Cloudera, Inc.)
Topics and Partitions

Representation of a topic with multiple partitions


(Source: Kafka: The Definitive Guide, Neha Narkhede)
Producers and Consumers

A consumer group reading from a topic


(Source: Kafka: The Definitive Guide, Neha Narkhede)
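
A matching consumer sketch, under the same assumptions as the producer example; starting a second process with the same group_id would split the topic's partitions between the two consumers, as in the figure above:

from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "metrics",
    group_id="dashboard",                  # hypothetical consumer group
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",          # start from the oldest retained offset
)
for msg in consumer:
    print(msg.partition, msg.offset, msg.key, msg.value)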
Brokers and Clusters

Brokers in a Publish-Subscribe System

(Source: Apache Kafka Guide, Cloudera, Inc.)
Brokers and Clusters

Replication of partitions in a cluster


(Source: Kafka: The Definitive Guide, Neha Narkhede)
Brokers and Clusters

Multiple datacenter architecture


(Source: Kafka: The Definitive Guide, Neha Narkhede)
Why Kafka?
Multiple producers

Multiple consumers

Disk-based retention

Scalable

High performance

Disadvantages of Kafka
No complete set of monitoring tools

Issues with message tweaking

No support for wildcard topic selection

Lack of pace

Reduced performance
The Data Ecosystem

A big data ecosystem


(Source: Kafka: The Definitive Guide, Neha Narkhede)
Use Cases
Activity tracking

Messaging

Metrics and logging

Commit log

Stream processing

APACHE CASSANDRA

01 Architecture

02 Data Model

03 Partitioner and Replication Strategies

04 Read and Write

Architecture

Architecture
01. Node

02. Rack

03. Datacenter

04. Cluster

Data Model

01 Keyspaces

02 Tables (Column families)

03 Columns

Primary Key

01 Partition Key

Partition Key:

CREATE TABLE application_logs (
    id INT,
    app_name VARCHAR,
    hostname VARCHAR,
    log_datetime TIMESTAMP,
    env VARCHAR,
    log_level VARCHAR,
    log_message TEXT,
    PRIMARY KEY (app_name)
);

Composite Partition Key:

CREATE TABLE application_logs (
    id INT,
    app_name VARCHAR,
    hostname VARCHAR,
    log_datetime TIMESTAMP,
    env VARCHAR,
    log_level VARCHAR,
    log_message TEXT,
    PRIMARY KEY ((app_name, env))
);

02 Clustering Key

Clustering Key:

CREATE TABLE application_logs (
    id INT,
    app_name VARCHAR,
    hostname VARCHAR,
    log_datetime TIMESTAMP,
    env VARCHAR,
    log_level VARCHAR,
    log_message TEXT,
    PRIMARY KEY ((app_name, env), hostname, log_datetime)
);

Clustering Order By:

CREATE TABLE application_logs (
    id INT,
    app_name VARCHAR,
    hostname VARCHAR,
    log_datetime TIMESTAMP,
    env VARCHAR,
    log_level VARCHAR,
    log_message TEXT,
    PRIMARY KEY ((app_name, env), hostname, log_datetime)
) WITH CLUSTERING ORDER BY (hostname ASC, log_datetime DESC);
Partitioner and Replication Strategies

01 Partitioner

Random Partitioner

Murmur3 Partitioner

02 Replication Strategies

Simple Strategy

Network Topology Strategy
Data Writing

Steps

01 Write request: the client sends a write request to a node, which acts as the coordinator.

02 Coordinator node: the coordinator forwards the write to the replica nodes that own the data.

03 Commit log and Memtable: each replica appends the write to its commit log, then updates its in-memory memtable.

04 Response: acknowledgements flow back through the coordinator to the client.
Data Reading

Steps

01 Read request: the client sends a read request to a coordinator node.

02 Memtable: each queried replica checks its in-memory memtable first.

03 SSTable: on a memtable miss, the replica reads from its on-disk SSTables.

04 Result: the coordinator returns the most recent result to the client.
Consistency Level

01 ONE

02 QUORUM

03 LOCAL_QUORUM

04 ALL

Strong Consistency

W + R > RF

where W is the number of replicas that must acknowledge a write, R the number of replicas consulted on a read, and RF the replication factor. For example, with RF = 3 and QUORUM (2 replicas) for both writes and reads, 2 + 2 = 4 > 3, so every read overlaps at least one replica holding the latest write (see the sketch below).
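
A sketch of tunable consistency with the DataStax Python driver, assuming the application_logs table from the data-model slides lives in the logs_prod keyspace created earlier:

from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

session = Cluster(["127.0.0.1"]).connect("logs_prod")

# With RF = 3, QUORUM means 2 replicas: W + R = 2 + 2 = 4 > 3,
# so reads always overlap a replica that saw the latest write.
write = SimpleStatement(
    "INSERT INTO application_logs "
    "(app_name, env, hostname, log_datetime, log_message) "
    "VALUES ('api', 'prod', 'web-01', toTimestamp(now()), 'started')",
    consistency_level=ConsistencyLevel.QUORUM,
)
session.execute(write)

read = SimpleStatement(
    "SELECT log_message FROM application_logs "
    "WHERE app_name = 'api' AND env = 'prod'",
    consistency_level=ConsistencyLevel.QUORUM,
)
for row in session.execute(read):
    print(row.log_message)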

Pros and Cons

Pros

01 Open Source

02 Scalable and Elastic

03 High Availability and Fault Tolerance

04 Consistency

Cons

01 Query

02 Sorting

03 Aggregation

04 Storage
References

01 Apache Drill - Introduction (tutorialspoint.com)

02 Cassandra Partition Key, Composite Key, and Clustering Key | Baeldung

03 Cassandra Performance: The Most Comprehensive Overview You’ll Ever See (scnsoft.com)

04 Cluster, Datacenters, Racks and Nodes in Cassandra | Baeldung

05 Drill Introduction - Apache Drill

06 What is a Cassandra Data Model? Definition & FAQs | ScyllaDB

07 Apache Kafka Guide, Cloudera, Inc.

08 Kafka: The Definitive Guide, Neha Narkhede

09 KAFKA: The Modern Platform for Data Management and Analysis in Big Data Domain, Rishika Shree
Thank you!
Q&A
Any questions?

Tuesday, October 25, 2022
