Big Data Management
CO4033 - L01 - Group 04
APACHE SQOOP
APACHE DRILL
APACHE KAFKA
APACHE CASSANDRA
01 What is Apache Sqoop?
What is Apache Sqoop?
Problems with Relational Databases
Needs of Sqoop
01 Apache Sqoop automates the process of data import and export between relational databases and Hadoop.
How does Apache Sqoop work?
Sqoop Architecture
01 The client submits the import/export command.
02 Sqoop fetches data from different databases. Here, we have an enterprise data warehouse, document-based systems, and a relational database. We have a connector for each of these; connectors help Sqoop work with a range of accessible databases.
Sqoop Architecture
04 Similarly, numerous map tasks export the data from HDFS to the RDBMS using the Sqoop export command.
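Both flows are driven from the command line. A typical pair of Sqoop 1 invocations might look like the following sketch; the JDBC URL, credentials, table names, and HDFS paths are placeholders, not values from this deck:

```shell
# Import a table from an RDBMS into HDFS using 4 parallel map tasks
sqoop import \
  --connect jdbc:mysql://db.example.com/sales \
  --username sqoop_user -P \
  --table orders \
  --target-dir /data/orders \
  -m 4

# Export the (possibly processed) HDFS data back into an RDBMS table
sqoop export \
  --connect jdbc:mysql://db.example.com/sales \
  --username sqoop_user -P \
  --table orders_summary \
  --export-dir /data/orders_summary \
  -m 4
```

The `-m` flag sets the number of map tasks, i.e. the degree of parallelism for the transfer.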
Sqoop Import
01 Sqoop performs an introspection of the database to gather metadata (primary key information); all this metadata is sent to the Sqoop import job.
02 It then submits a map-only job. Sqoop divides the input dataset into splits and uses individual map tasks to push the splits to HDFS.
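The splitting step can be sketched conceptually: Sqoop queries the minimum and maximum of the split column (by default the primary key) and carves that range into one slice per map task. The function below is a toy Python model of that idea, not Sqoop's actual Java implementation:

```python
def compute_splits(min_key: int, max_key: int, num_mappers: int):
    """Divide [min_key, max_key] into contiguous ranges, one per map task,
    mimicking Sqoop's split-by logic for an integer key column."""
    total = max_key - min_key + 1
    base, extra = divmod(total, num_mappers)
    splits, lo = [], min_key
    for i in range(num_mappers):
        size = base + (1 if i < extra else 0)
        hi = lo + size - 1
        # each map task imports rows WHERE key BETWEEN lo AND hi
        splits.append((lo, hi))
        lo = hi + 1
    return splits

# e.g. ids 1..100 across 4 mappers -> [(1, 25), (26, 50), (51, 75), (76, 100)]
print(compute_splits(1, 100, 4))
```

Each map task then runs its own bounded SELECT, which is what makes the import parallel.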
Sqoop Export
02 Sqoop then divides the input dataset into splits and uses individual map tasks to push the splits to the RDBMS.
Features of Apache Sqoop
Advantages of Sqoop
01 Sqoop allows data transfer with different structured data stores such as Teradata, Postgres, Oracle, and so on.
02 Since the data from the RDBMS is transferred and stored in Hadoop, Apache Sqoop allows us to offload the processing done in the ETL (Extract, Transform, and Load) process onto fast, low-cost, and effective Hadoop processes.
03 Apache Sqoop executes data transfer in parallel, so its execution is quick and cost-effective.
04 Sqoop helps in integration with sequential data from the mainframe. This helps reduce the high cost of executing specific jobs on mainframe hardware.
Disadvantages of Sqoop
We cannot pause or resume the Apache Sqoop once It has a bulkier connector for a few
01 05
it is started. It is an automatic step. If in case it fails, databases.
then we have to clear the things and start it again.
The performance of Sqoop Export depends on Sqoop 1 uses a JDBC connection for
02 06
hardware configuration such as Memory, Hard connecting with RDBMS. This can be less
disk of the RDBMS server. performance and inefficient.
It is slow because it uses MapReduce in backend Sqoop 1 does not provide a Graphical User
03 07
processing. Interface for easy use.
17
APACHE DRILL
01 What is Apache Drill?
02 Key Features
03 Why Drill?
What is Apache Drill?
Drill is an Apache open-source SQL query engine for Big Data exploration. Drill is designed from
the ground up to support high-performance analysis on the semi-structured and rapidly evolving
data coming from modern Big Data applications, while still providing the familiarity and
ecosystem of ANSI SQL, the industry-standard query language. Drill provides plug-and-play
integration with existing Apache Hive and Apache HBase deployments.
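As an illustration of that plug-and-play model, Drill can query a raw JSON file in place with ANSI SQL and no upfront schema. The `dfs` storage plugin and backtick-quoted path below follow standard Drill usage, though the file path and column names here are invented for the example:

```sql
-- Query a self-describing JSON file directly; no metastore definition needed
SELECT t.name, t.age
FROM dfs.`/data/users.json` t
WHERE t.age > 30;
```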
Key Features
02 Dynamic queries on self-describing data in files (such as JSON, Parquet, text) and HBase tables, without requiring metadata definitions in the Hive metastore.
05 Integration with Apache Hive (queries on Hive tables and views, support for all Hive file formats and Hive UDFs).
Why Drill?
01 Get started in minutes.
03 Query complex, semi-structured data in-situ.
06 Interactive queries on Hive tables.
08 User-Defined Functions (UDFs) for Drill and Hive.
Drill Query Execution
Use Cases
01 Cloud JSON and Sensor Analytics.
APACHE KAFKA
01 Ideal Publish/Subscribe System
02 Kafka Architecture
03 Why Kafka?
Publish/Subscribe Messaging
Publish/subscribe messaging is a pattern that is characterized by the sender (publisher) of a
piece of data (message) not specifically directing it to a receiver.
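This decoupling can be sketched with a toy in-memory broker. The sketch below is a conceptual model of the publish/subscribe pattern itself, not the Kafka API; publishers address a named topic, and the broker, not the publisher, decides which subscribers receive the message:

```python
from collections import defaultdict

class Broker:
    """Toy pub/sub broker: the publisher never names a receiver."""
    def __init__(self):
        self.subscribers = defaultdict(list)  # topic -> list of callbacks

    def subscribe(self, topic, callback):
        self.subscribers[topic].append(callback)

    def publish(self, topic, message):
        # fan the message out to every current subscriber of the topic
        for callback in self.subscribers[topic]:
            callback(message)

broker = Broker()
received = []
broker.subscribe("clicks", received.append)
broker.subscribe("clicks", lambda m: received.append(m.upper()))
broker.publish("clicks", "page_view")
print(received)  # ['page_view', 'PAGE_VIEW']
```

Note how the publisher only knows the topic name; adding a third subscriber would require no change on the publishing side.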
Topics and Partitions
Multiple Consumers
Disk-Based Retention
Scalable
High Performance
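The "Topics and Partitions" point above can be illustrated with Kafka's default keyed-partitioning idea: a message key is hashed onto one of the topic's partitions, so all messages with the same key land in the same partition and keep their relative order. This is a conceptual sketch; Kafka's default partitioner uses murmur2, and CRC32 merely stands in for the same idea here:

```python
import zlib

def partition_for(key: bytes, num_partitions: int) -> int:
    """Map a message key to a partition index, hash-then-modulo style."""
    return zlib.crc32(key) % num_partitions

# All messages for one user hit the same partition, preserving their order
p1 = partition_for(b"user-42", 6)
p2 = partition_for(b"user-42", 6)
assert p1 == p2
print(0 <= p1 < 6)  # True
```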
Disadvantages of Kafka
No Complete Set of Monitoring Tools
Lack of Pace
Reduces Performance
The Data Ecosystem
Messaging
Commit log
Stream processing
APACHE CASSANDRA
01 Architecture
02 Data Model
Architecture
01. Node: the basic unit where data is stored.
02. Rack: a group of nodes, typically sharing a location or failure domain.
03. Datacenter: a collection of racks.
04. Cluster: the full set of datacenters participating in the ring.
Data Model
01 Keyspaces
03 Columns
01 Partition Key
Primary Key with a simple partition key:
CREATE TABLE application_logs (
    id INT,
    app_name VARCHAR,
    hostname VARCHAR,
    log_datetime TIMESTAMP,
    env VARCHAR,
    log_level VARCHAR,
    log_message TEXT,
    PRIMARY KEY (app_name)
);
Primary Key with a composite partition key:
CREATE TABLE application_logs (
    id INT,
    app_name VARCHAR,
    hostname VARCHAR,
    log_datetime TIMESTAMP,
    env VARCHAR,
    log_level VARCHAR,
    log_message TEXT,
    PRIMARY KEY ((app_name, env))
);
Partitioners: Random Partitioner, Murmur3 Partitioner
02 Replication Strategies
Simple Strategy
Network Topology Strategy
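The partitioner's job can be sketched as: hash the partition key to a token, then place the row on the node that owns that token's range. Cassandra's default is the Murmur3 partitioner; in this toy model, MD5 stands in as a stable hash, the node names are invented, and the modulo ring is a simplification of real token ranges:

```python
import hashlib

# hypothetical four-node cluster
NODES = ["node-a", "node-b", "node-c", "node-d"]

def token(partition_key: str) -> int:
    # Cassandra hashes the partition key to a token (Murmur3 in practice);
    # MD5 stands in here for a deterministic hash.
    return int(hashlib.md5(partition_key.encode()).hexdigest(), 16)

def replica_for(partition_key: str) -> str:
    # Simplified ring: token modulo node count picks the owning node.
    return NODES[token(partition_key) % len(NODES)]

# The same partition key always maps to the same node
assert replica_for("payments:env1") == replica_for("payments:env1")
print(replica_for("payments:env1") in NODES)  # True
```

This determinism is why the choice of partition key (simple vs. composite, as in the tables above) controls how evenly data spreads across the cluster.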
Data Writing Steps
01 Write Request
03 Coordinator Node
04 Response
Data Reading Steps
01 Read Request
02 Memtable
03 SSTable
04 Result
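The memtable-then-SSTable read path above can be modelled in a few lines. This is a conceptual sketch only; it ignores the commit log, bloom filters, tombstones, and compaction:

```python
class Table:
    """Toy write/read path: writes land in an in-memory memtable; a flush
    turns it into an immutable SSTable; reads check the memtable first,
    then SSTables from newest to oldest."""
    def __init__(self):
        self.memtable = {}
        self.sstables = []  # immutable flushed snapshots, oldest first

    def write(self, key, value):
        self.memtable[key] = value

    def flush(self):
        if self.memtable:
            self.sstables.append(dict(self.memtable))
            self.memtable = {}

    def read(self, key):
        if key in self.memtable:          # newest data wins
            return self.memtable[key]
        for sstable in reversed(self.sstables):  # then newest SSTable first
            if key in sstable:
                return sstable[key]
        return None

t = Table()
t.write("k", "v1")
t.flush()
t.write("k", "v2")      # newer value lives in the memtable
print(t.read("k"))      # v2: the memtable shadows the flushed SSTable
```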
Consistency Level
01 ONE
02 QUORUM
03 LOCAL_QUORUM
04 ALL
Strong Consistency: W + R > RF
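The condition W + R > RF guarantees strong consistency because any set of R read replicas must then overlap any set of W write replicas, so a read always touches at least one replica holding the latest acknowledged write. A quick check, assuming a replication factor of 3 (so QUORUM = 2):

```python
def is_strongly_consistent(write_replicas: int, read_replicas: int, rf: int) -> bool:
    """W + R > RF: every read quorum intersects every write quorum,
    so a read always sees the latest acknowledged write."""
    return write_replicas + read_replicas > rf

RF = 3
print(is_strongly_consistent(2, 2, RF))  # QUORUM + QUORUM -> True
print(is_strongly_consistent(1, 1, RF))  # ONE + ONE -> False (may read stale data)
```

Writing and reading at QUORUM is the usual way to get this guarantee without paying the latency cost of ALL.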
Pros and Cons
Pros:
04 Consistency
Cons:
04 Storage
References
01 Apache Drill - Introduction (tutorialspoint.com)
03 Cassandra Performance: The Most Comprehensive Overview You'll Ever See (scnsoft.com)
07 Apache Kafka Guide, Cloudera, Inc.
09 KAFKA: The Modern Platform for Data Management and Analysis in Big Data Domain, Rishika Shree,
Thank you!
Q&A
Any questions?