
ECS640U/ECS765P Big Data Processing

Big data Ingestion and Storage


Lecturer: Ahmed M. A. Sayed
School of Electronic Engineering and Computer Science

Credit: Joseph Doyle, Jesus Carrion, Felix Cuadrado, …


Today’s Lecture Contents

● Big data sources


● Data ingestion
● Data storage
● Ingestion and storage in the cloud
Apache Hadoop components
Apache Hadoop: A framework for distributed storage and processing.

[Diagram: MapReduce (processing) and HDFS (storage)]
Apache Hive: Querying HDFS
How can we make simple queries to extract data
stored in a Hadoop cluster?

• Option 1: Writing Java MapReduce code?
• Option 2: Writing Python MRjob code?
• Option 3: Using Apache Hive, which extends the functionality of Hadoop by providing an SQL interface and translating SQL into Java MapReduce (see the sketch below).

• Apache Hive can be used to build warehouse solutions on top of Hadoop – SQL is the status quo for building database applications.
• These solutions are decoupled from Hadoop, hence other frameworks can also be built on top of Hadoop (e.g., Apache Tez for processing complex directed acyclic graphs (DAGs)).
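
For illustration (not part of the original slides), a minimal Python sketch of querying Hive via the PyHive library; the HiveServer2 host, port, username, table and columns are placeholder assumptions.

# Minimal sketch: running a HiveQL query from Python via PyHive.
# Assumes a HiveServer2 instance at localhost:10000 and a table named 'rides'
# (placeholder assumptions, not from the lecture material).
from pyhive import hive

conn = hive.Connection(host="localhost", port=10000, username="student")
cursor = conn.cursor()

# Hive translates this SQL into MapReduce (or Tez) jobs behind the scenes.
cursor.execute("SELECT vendor, COUNT(*) FROM rides GROUP BY vendor")
for vendor, n in cursor.fetchall():
    print(vendor, n)

cursor.close()
conn.close()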
Apache HBase: Hadoop’s database
Is Hadoop all about MapReduce?

• Since HDFS and MapReduce are decoupled, we can combine HDFS storage with processing frameworks other than MapReduce.
• Apache HBase allows for building a NoSQL big data store on top of Hadoop HDFS.
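
A hedged sketch of what using HBase from Python might look like, via the happybase Thrift client; the Thrift server address, table name and column family are placeholder assumptions.

# Minimal sketch: writing and reading a row in HBase from Python via happybase.
import happybase

connection = happybase.Connection("localhost")          # HBase Thrift server (assumed)
table = connection.table("users")                       # assumed existing table

# HBase stores values under (row key, column family:qualifier).
table.put(b"user#42", {b"profile:name": b"Ada", b"profile:city": b"London"})

# Random-access read by row key, something plain HDFS files do not offer.
print(table.row(b"user#42"))

connection.close()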
Revisit the Pipeline
Big Data storage and ingestion

• Hadoop HDFS is one Big Data storage solution. What other options exist?
• How did data get there in the first place?
• This week we will focus on big data storage and the process by which data are made available to Big Data platforms, namely ingestion.

Ingestion → Storage → Processing


Big Data Processing: Week 6
Topic List:

● Big data sources


● Data ingestion
● Data storage
● Ingestion and storage in the cloud
Big data sources
Where do our data come from?

Data can be produced by a variety of data sources, including:


• Data stores
• Measurement systems
• Application interfaces
• Communication networks
• Systems processing data

From a production perspective, it is convenient to distinguish between data generated:


• In bulk, i.e. large data batches
• Continuously, as a stream of small data units, such as messages or time-stamped events
Big data sources
Three basic kinds of data

Another classification of data distinguishes between:

• Structured data consist of items described by a consistent, well-defined set of individual features.
• Semi-structured data consist of items with individual features, but these are not consistent across all
the items.
• With unstructured data, no individual feature is distinguished.

Structured: tables/databases, Comma Separated Values (CSV). These have features and consistency.
Semi-structured: XML, JSON. These have features, but no consistency: each line may have a different number of features or, as in HTML, each object has its own schema.
Unstructured: objects such as images/voice. No features, no consistency.
Big data sources
The 3 Vs: Variety, volume and velocity

In Big Data scenarios we should expect:

• Multiple data sources of different nature producing data of different types (Variety),
• in large quantities (Volume),
• at high rates (Velocity).

Big data platforms such as Apache Hadoop provide the computing, storage and networking
infrastructure necessary to deal with the 3 Vs.

Q: How do we ingest the data?


Big Data Processing: Week 6
Topic List:

● Big data sources


● Data ingestion
● Data storage
● Ingestion and storage in the cloud
The Data Warehouse
From OLTP to OLAP

Many database systems are optimised for OnLine Transaction Processing (OLTP) tasks, where many applications initiate small, simple transactions in real time.

These databases store data that might be useful for business intelligence and decision-making, but this requires retrieving and analysing long-term, large volumes of data.

A common solution is to create data warehouses, which are optimised for OnLine Analytical Processing (OLAP).

[Diagram: OLTP DB → Warehouse]

Examples: Amazon Redshift, Google BigQuery, Azure Synapse Analytics, IBM Db2, Snowflake
Extract-Transform-Load (ETL)
Building warehouses
Data warehouses integrate data from different data sources to provide a single solution for OLAP workloads.

Warehouses are built by following an ETL process consisting of three steps (a toy sketch follows below):
• Extraction: Data are retrieved from possibly heterogeneous sources.
• Transformation: Data are prepared to match the destination warehouse’s schema, ensuring quality. Typical tasks include formatting, cleansing, aggregation and enrichment.
• Loading: The warehouse is populated with the transformed data.

[Diagram: Data Sources 1–3 → ETL → Warehouse]

Data in warehouses are added and retrieved for OLAP, but not frequently updated, hence they can be seen as historical snapshots.
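
A toy ETL sketch in Python; sqlite3 stands in for the warehouse, and the source file name and schema are assumptions made purely for illustration.

# Toy ETL sketch: extract rows from a CSV source, transform them to match the
# warehouse schema, and load them into a database.
import csv
import sqlite3

# Extract: read raw records from a (possibly heterogeneous) source.
with open("sales_2024.csv", newline="") as f:
    raw_rows = list(csv.DictReader(f))

# Transform: cleanse and format to match the destination schema.
rows = [
    (r["order_id"], r["customer"].strip().title(), float(r["amount_gbp"]))
    for r in raw_rows
    if r["amount_gbp"]                       # drop records with missing amounts
]

# Load: populate the warehouse table.
conn = sqlite3.connect("warehouse.db")
conn.execute("CREATE TABLE IF NOT EXISTS sales (order_id TEXT, customer TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
conn.commit()
conn.close()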
The notion of data ingestion
Beyond Extract-Transform-Load

Data ingestion is the process by which data are made available for processing, either immediately or
after storage.

The ETL process constitutes one approach to data ingestion. However, ETL might not be suitable in many Big Data scenarios because it:
• Requires the schema of the destination warehouse and the type of processing to be known in advance.
• Requires data sources to be clearly identified.
• Is a time-consuming task.

In many Big Data problems the processing workloads are expected to change dynamically, and/or outputs need to be produced in real time. Hence, the rigid solutions provided by ETL-type data ingestion might be unsuitable.
The notion of data ingestion
From data warehouses to data streams and lakes

Note: in this step, data are only extracted from the (different) sources and loaded; no transformation is executed here.
In many Big Data scenarios it is more convenient to have access to raw, untransformed data, to
produce immediate outputs or to iteratively experiment with the data.
In these cases, data ingestion excludes rigid, time-consuming transformations, and focuses on making
large volumes of raw data available.

Data streams describe the continuous generation of data and data lakes refer to large repositories of raw
data from heterogeneous sources. NOTE: Data transformations are left to subsequent processing stages.

[Diagram: Warehouse vs Stream vs Lake]


Back to Apache Hadoop
Ingestion in the Hadoop ecosystem

In Apache Hadoop, MapReduce (processing) can be applied to data stored in HDFS (storage). The HDFS component can be used to implement a data lake, where raw data ingested from a variety of sources are stored.

Data ingestion is, however, a separate component from data storage.

There exist several frameworks that provide a data ingestion service to Apache Hadoop (and other Big Data frameworks), including:
• Sqoop: relational databases
• Flume: streaming data
Apache Sqoop (https://sqoop.apache.org)
Ingesting batches of data from relational databases

Sqoop (“SQL to Hadoop”) automates the process of transferring data between relational databases and HDFS (see the sketch below).

• By using MapReduce, Sqoop imports database tables row by row.
• Importing is performed in parallel and in a fault-tolerant manner, resulting in ingested data being stored in multiple files in HDFS.
• After processing, results can be inserted back into a target database, i.e. data can be transferred in both directions.
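
Sqoop itself is driven from the command line; below is a hedged Python sketch that launches one import job. The JDBC URL, credentials, table name and target directory are placeholder assumptions.

# Sketch: launching a Sqoop import from Python.
import subprocess

subprocess.run([
    "sqoop", "import",
    "--connect", "jdbc:mysql://dbhost/shop",    # source relational database (assumed)
    "--username", "etl_user", "--password-file", "/user/etl/.pw",
    "--table", "orders",                        # table imported row by row
    "--target-dir", "/data/lake/orders",        # HDFS directory for the output files
    "--num-mappers", "4",                       # parallel map tasks
], check=True)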
Apache Flume (https://flume.apache.org)
Ingesting streaming data in real time

Flume collects, aggregates and loads large amounts of streaming data from multiple sources into HDFS.

• Flume acts as a buffer between multiple streaming sources and HDFS.
• Flume can scale out easily, hence it can handle irregular bursts of streaming data from multiple sources.
• Avoids having multiple sources writing simultaneously to the target Hadoop cluster.
• Examples of streaming data sources include social network applications, IoT systems and network traffic.
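
A hedged sketch of an application pushing events to a Flume agent, assuming the agent is configured with an HTTP source and its default JSON handler; the host, port and event fields are placeholder assumptions.

# Sketch: sending streaming events from an application to a Flume HTTP source.
import json
import requests

events = [
    {"headers": {"source": "web-app"}, "body": "user 42 clicked checkout"},
    {"headers": {"source": "web-app"}, "body": "user 7 opened basket"},
]

# Flume buffers these events and eventually writes them to an HDFS sink.
resp = requests.post("http://flume-agent:44444",
                     data=json.dumps(events),
                     headers={"Content-Type": "application/json"})
resp.raise_for_status()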
Big Data Processing: Week 6
Topic List:

● Big data sources


● Data ingestion
● Data storage
● Ingestion and storage in the cloud

Quiz and Break


Big Data storage
Essential Functions of Storage Systems: Sharding, Scaling and Replication

Big data storage solutions need to deal with large volumes of data.

• Partitioning the data into smaller chunks known as shards allows for easy distribution of data among the nodes in a cluster.
• Increasing volumes of data can be accommodated by scaling out (horizontally), i.e. adding more nodes
to the cluster.
• Data availability and fault tolerance constraints can be met by implementing replication strategies.
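
A minimal, purely illustrative Python sketch of these three ideas: hash-based sharding, scaling out and simple replication (no real storage system involved).

import hashlib

def shard_for(key: str, num_nodes: int) -> int:
    """Pick a shard/node by hashing the key."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % num_nodes

def replicas_for(key: str, num_nodes: int, replication_factor: int = 3):
    """Place copies on the next nodes after the primary (simple replication)."""
    primary = shard_for(key, num_nodes)
    return [(primary + i) % num_nodes for i in range(replication_factor)]

print(replicas_for("user#42", num_nodes=5))   # primary node plus the next two
# Scaling out = increasing num_nodes; data must then be re-partitioned.
print(replicas_for("user#42", num_nodes=8))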
Distributed File System
The file as a storage unit

Distributed file systems store files physically across the nodes of a cluster and organise them logically in a
tree structure. One obvious example is HDFS.

Since distributed file systems treat files as storage units without any consideration of the data being stored, they are suitable for storing semi-structured and unstructured data, which makes them a natural basis for data lakes.
In addition, distributed file systems can provide high availability and redundancy by implementing replication mechanisms.

Distributed file systems work best with a few large files, rather than many small ones, as disk-seeking activity can slow down data retrieval.
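
A hedged sketch of interacting with HDFS from Python via the hdfs (WebHDFS) package; the NameNode address, user and paths are placeholder assumptions.

# Sketch: uploading to and reading from HDFS through WebHDFS.
from hdfs import InsecureClient

client = InsecureClient("http://namenode:9870", user="hadoop")

# Upload one large file (HDFS splits it into blocks and replicates them).
client.upload("/data/lake/events.log", "events.log")

# Files are organised logically in a tree structure, as on a local filesystem.
print(client.list("/data/lake"))

with client.read("/data/lake/events.log") as reader:
    content = reader.read()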
NoSQL databases
Databases designed for flexibility

Traditional databases are optimised for transaction processing (OLTP). However, they are not well suited as a primary Big Data storage solution; hence the usual option is to store/load database data into other Big Data storage solutions.

NoSQL databases constitute a suitable Big Data storage solution as they:

• Are schema-less and can store structured, semi-structured or unstructured data.
• Can easily scale out.
• Are highly available and fault-tolerant via replication mechanisms.
• Can handle large amounts of data.
• Are, in many cases, open source.

NoSQL databases are classified as (1) key-value stores, (2) document stores, (3) column-oriented databases and (4) graph-based databases.
1 Key-Value Stores
Leveraging hash functions

In key-value stores, data are stored as (key, value) pairs. The value is not assumed to have any structure
and is stored as an object known as blob. Blobs are retrieved by using the key as the only search criterion.

Keys are mapped to values via a hash function, which produces an address for the location of the value.
The hashing stage facilitates distributed storage via sharding and horizontal scaling, e.g., different portions of the hash prefix can represent the node address and/or shard address.
[Diagram: Key = MyIP → hash function → index 4]
Node 1: index 1 → "John Smith", index 3 → "011101101110"
Node 2: index 2 → "<name>Jo</name>", index 4 → "127.0.0.1"
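
A toy key-value store in Python, purely illustrative: values are opaque blobs retrievable only by key, and the hash prefix of the key decides which node holds them.

import hashlib

NODES = {0: {}, 1: {}}            # two "nodes", each a dict of index -> blob

def put(key: str, blob: bytes) -> None:
    h = hashlib.sha1(key.encode()).hexdigest()
    node = int(h[0], 16) % len(NODES)     # hash prefix picks the node
    NODES[node][h] = blob                 # blob stored without interpretation

def get(key: str) -> bytes:
    h = hashlib.sha1(key.encode()).hexdigest()
    return NODES[int(h[0], 16) % len(NODES)][h]

put("MyIP", b"127.0.0.1")
print(get("MyIP"))                        # the key is the only search criterion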
2 Document stores
Key-values stores with “query-able” document objects/blobs

Document stores can be seen as key-value stores, where value objects, known as documents, consist of a
collection of labelled and unordered attributes. Documents can have different attributes and new
attributes can be added. Document stores hence offer flexible, semi-structured storage.

In contrast to key-value stores, where blobs are opaque, documents can be queried and retrieved based
on the value of their attributes. Common implementations of document stores include JSON and XML text
files.
{
  "year": 2014,
  "title": "The Grand Budapest Hotel",
  "info": {
    "directors": ["Wes Anderson"],
    "actors": ["Ralph Fiennes", "F. Murray Abraham"]
  }
}
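
A hedged sketch of attribute-based queries in a document store, using MongoDB via pymongo; the connection string, database and collection names are placeholder assumptions.

# Sketch: inserting a document and querying it by the value of a nested attribute.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
movies = client["media"]["movies"]

movies.insert_one({
    "year": 2014,
    "title": "The Grand Budapest Hotel",
    "info": {"directors": ["Wes Anderson"],
             "actors": ["Ralph Fiennes", "F. Murray Abraham"]},
})

# Unlike an opaque blob, a document can be retrieved by its attribute values.
for doc in movies.find({"info.directors": "Wes Anderson"}):
    print(doc["title"], doc["year"])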
3 Column-oriented databases
Tabular data as collections of attribute values (or columns of items)

Rows and columns in tabular data represent items and attributes respectively.
OLTP-oriented databases are designed to retrieve the attributes of individual items, hence by storing a
table row-by-row, OLTP databases can retrieve data more efficiently.

Many Big Data applications are not concerned with individual items and process all the data values from
each attribute. Hence, efficiency can be enhanced by storing tables column-by-column. This brings
additional storage benefits, such as compression.
An example is Google’s BigTable : https://searchdatamanagement.techtarget.com/definition/Google-BigTable

First name | Last name | Salary
Pepe       | Lopez     | 30
Mary       | Jones     | 40
Amit       | Bhatia    | 35
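
A purely illustrative Python sketch of the same table stored row-by-row versus column-by-column; computing the average salary only touches one column in the columnar layout.

rows = [                                   # row-oriented: one record per item
    {"first": "Pepe", "last": "Lopez", "salary": 30},
    {"first": "Mary", "last": "Jones", "salary": 40},
    {"first": "Amit", "last": "Bhatia", "salary": 35},
]

columns = {                                # column-oriented: one list per attribute
    "first":  ["Pepe", "Mary", "Amit"],
    "last":   ["Lopez", "Jones", "Bhatia"],
    "salary": [30, 40, 35],
}

# Row store: every record must be visited to pull out its salary field.
avg_row = sum(r["salary"] for r in rows) / len(rows)

# Column store: the salary values are already contiguous (and compress well).
avg_col = sum(columns["salary"]) / len(columns["salary"])

print(avg_row, avg_col)                    # both print 35.0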
4 Graph databases
Navigating relationships and connections between entities

In graph databases relationships between data items are as important as the data items themselves.
Alongside data, graph databases store the connections between data items and are optimised to traverse
the resulting graph structure efficiently.

Graph databases are suitable for scenarios such as social networking, recommendation engines or fraud
detection, where it is necessary to query relationships between data items.
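
A toy Python sketch (no real graph database) of the kind of relationship traversal described above: a "follows" graph and a friend-of-a-friend recommendation query.

follows = {
    "alice": {"bob", "carol"},
    "bob":   {"dave"},
    "carol": {"dave", "erin"},
    "dave":  set(),
    "erin":  set(),
}

def recommend(user: str) -> set:
    """People followed by the people 'user' follows, but not followed by 'user' yet."""
    candidates = set()
    for friend in follows[user]:
        candidates |= follows[friend]
    return candidates - follows[user] - {user}

print(recommend("alice"))   # {'dave', 'erin'}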
Big Data Processing: Week 6
Topic List:

● Big data sources


● Data ingestion
● Data storage
● Ingestion and storage in the cloud
Introduction to Cloud Computing
Big Data in the Cloud

Storage, compute and network resources can be hosted on-premises or in external datacentres and accessed over the internet as a service. The delivery of such services over the internet is known as cloud computing.

Cloud computing platforms provide on-demand, flexible access to storage, compute, network and other IT services. This makes cloud computing a suitable choice for building Big Data solutions, as the amount of available resources is virtually unlimited and can easily scale as the needs grow.

Examples of cloud computing platforms include AWS (Amazon Web Services), GCP (Google Cloud Platform), Microsoft Azure, IBM Cloud and Alibaba Cloud.
Cloud services and architectures
Building Big Data solutions in the Cloud

Cloud computing platforms offer the tools necessary to build Big Data solutions by combining decoupled, independent services that communicate via interfaces known as Application Programming Interfaces (APIs).

Cloud services can offer different levels of control, flexibility and management. It is common to distinguish between three families of services:
• Infrastructure as a Service (IaaS): Includes fundamental services (storage, compute and network) and offers the highest level of flexibility, control and management.
• Platform as a Service (PaaS): Focused on building and delivering applications; the cloud provider manages the underlying infrastructure (resource procurement, OS maintenance, patching).
• Software as a Service (SaaS): End-user applications, whose software and underlying infrastructure are managed by the cloud provider.
Amazon Web Services
Big Data solutions on AWS

• S3: Storage
• EC2: Compute
• EMR cluster: Hadoop and Spark
• DynamoDB: NoSQL database
• Glue: ETL
• Redshift: Warehouse
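
A hedged sketch of using AWS storage from Python via boto3; the bucket name and object keys are placeholder assumptions, and valid AWS credentials are required.

# Sketch: landing a raw file in an S3 "data lake" bucket and listing its contents.
import boto3

s3 = boto3.client("s3")

# Ingest a raw file into cloud storage...
s3.upload_file("events.log", "my-data-lake-bucket", "raw/events.log")

# ...which services such as EMR, Glue or Redshift can then read.
for obj in s3.list_objects_v2(Bucket="my-data-lake-bucket", Prefix="raw/")["Contents"]:
    print(obj["Key"], obj["Size"])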
Google Cloud Platform
Big Data solutions on GCP

• Cloud Storage: Storage
• Compute Engine: Compute
• DataProc: Hadoop and Spark
• Bigtable: NoSQL database
• Dataflow: ETL
• BigQuery: Warehouse
Microsoft Azure
Big Data Solutions on Azure

• Blob Storage: Storage
• Virtual Machines: Compute
• HDInsight: Hadoop and Spark
• Cosmos DB: NoSQL database
• Data Factory: ETL
• Synapse: Warehouse
Big Data Processing: Week 6
Topic List:

● Big data sources


● Data ingestion
● Data storage
● Ingestion and storage in the cloud

Quiz and End
