ECS765P - W6 - Big Data Ingestion and Storage
Apache Hive: Querying HDFS
How can we make simple queries to extract data stored in a Hadoop cluster?
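Apache Hive addresses this question by exposing files stored in HDFS as tables that can be queried with SQL-like (HiveQL) statements. As a minimal sketch, assuming a HiveServer2 instance is reachable and the third-party PyHive client is installed (the host, database and table names below are placeholders, not part of the lecture material):

from pyhive import hive  # HiveServer2 client (assumption: installed)

# Connect to HiveServer2 (host/port/database are placeholders)
conn = hive.Connection(host="hive.example.org", port=10000, database="default")
cursor = conn.cursor()

# HiveQL reads like SQL, but the table is backed by files stored in HDFS
cursor.execute("SELECT user_id, COUNT(*) FROM clickstream GROUP BY user_id LIMIT 10")
for row in cursor.fetchall():
    print(row)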
• Structured data consist of items described by a consistent, well-defined set of individual features.
• Semi-structured data consist of items with individual features, but these are not consistent across all
the items.
• With unstructured data, no individual feature is distinguished.
• Multiple data sources of different nature producing data of different types (Variety),
• in large quantities (Volume),
• at high rates (Velocity).
Big data platforms such as Apache Hadoop provide the computing, storage and networking
infrastructure necessary to deal with the 3 Vs.
Data warehouses
Data in warehouses are added and retrieved for online analytical processing (OLAP) but not frequently updated; hence they can be seen as historical snapshots.
The notion of data ingestion
Beyond Extract-Transform-Load
Data ingestion is the process by which data are made available for processing, either immediately or
after storage.
The ETL process constitutes one approach to data ingestion. However, ETL might not be suitable in many Big Data scenarios because ETL:
• Requires the schema of the destination warehouse and the type of processing to be known in advance.
• Requires data sources to be clearly identified.
• Is a time-consuming task.
In many Big Data problems the processing workloads are expected to change course dynamically, and/or
outputs need to be produced in real-time. Hence, rigid solutions provided by ETL-type data ingestion
might be unsuitable.
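To make the contrast concrete, a classic ETL step might look like the toy sketch below (file names, column names and the cleaning rule are illustrative assumptions): data are extracted from a source, transformed to fit a schema that must be known in advance, and only then loaded into the warehouse.

import csv

# Extract: read raw records exported by a source system (path is a placeholder)
with open("sales_export.csv", newline="") as src:
    raw_rows = list(csv.DictReader(src))

# Transform: conform the records to the warehouse schema, known in advance
transformed = [
    {"sale_id": int(r["id"]), "amount_gbp": round(float(r["amount"]), 2)}
    for r in raw_rows
    if r.get("amount")  # drop malformed rows as part of cleaning
]

# Load: write the conformed rows into the destination table (simulated as a file here)
with open("warehouse_sales.csv", "w", newline="") as dst:
    writer = csv.DictWriter(dst, fieldnames=["sale_id", "amount_gbp"])
    writer.writeheader()
    writer.writerows(transformed)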
An alternative is to only extract data from the (different) sources and load them; no transformation is executed at this stage.
From data warehouses to data streams and lakes
In many Big Data scenarios it is more convenient to have access to raw, untransformed data, to
produce immediate outputs or to iteratively experiment with the data.
In these cases, data ingestion excludes rigid, time-consuming transformations, and focuses on making
large volumes of raw data available.
Data streams describe the continuous generation of data and data lakes refer to large repositories of raw
data from heterogeneous sources. NOTE: Data transformations are left to subsequent processing stages.
Big data storage solutions need to deal with large volumes of data.
• Partitioning the data into smaller chunks known as shards, allows for easy distribution of data among
the nodes in a cluster.
• Increasing volumes of data can be accommodated by scaling out (horizontally), i.e. adding more nodes
to the cluster.
• Data availability and fault tolerance constraints can be met by implementing replication strategies, as illustrated in the sketch after this list.
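As a minimal, self-contained illustration of sharding combined with replication (the node names, shard count, replication factor and hashing scheme below are assumptions, not a description of any particular system):

import hashlib

NODES = ["node-a", "node-b", "node-c"]  # cluster nodes (placeholders)
NUM_SHARDS = 4                          # number of partitions (assumption)
REPLICATION_FACTOR = 2                  # each shard is stored on two nodes

def shard_of(key: str) -> int:
    # Hash the record key and map it to one of the shards
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

def replicas_of(shard: int) -> list:
    # Place each shard on REPLICATION_FACTOR consecutive nodes (round-robin)
    return [NODES[(shard + i) % len(NODES)] for i in range(REPLICATION_FACTOR)]

shard = shard_of("user:42")
print("user:42", "-> shard", shard, "-> stored on", replicas_of(shard))

Adding nodes to the cluster (scaling out) simply extends the NODES list, so more shards can be spread across more machines.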
Distributed File System
The file as a storage unit
Distributed file systems store files physically across the nodes of a cluster and organise them logically in a
tree structure. One obvious example is HDFS.
Since distributed file systems treat files as storage units without any consideration of the data being stored, they are well suited to storing semi-structured and unstructured data, which makes them a natural basis for data lakes.
In addition, distributed file systems can provide high availability and redundancy by implementing
replication mechanisms.
Distributed file systems work best with few large files, rather than multiple small ones, as disk-seeking
activity can slow down data retrieval.
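For example, assuming a configured Hadoop client is available on the PATH (the local and HDFS paths below are placeholders), files can be copied into HDFS and listed with the standard hdfs dfs commands, invoked here from Python:

import subprocess

# Copy a local file into the distributed file system (paths are placeholders)
subprocess.run(["hdfs", "dfs", "-put", "events.log", "/data/raw/events.log"], check=True)

# List the directory to confirm the file is now stored across the cluster
subprocess.run(["hdfs", "dfs", "-ls", "/data/raw"], check=True)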
NoSQL databases
Databases designed for flexibility
Traditional databases are optimised for transaction processing (OLTP). However, they are not well suited as a primary Big Data storage solution; the usual option is to store/load database data into other Big Data storage solutions.
In key-value stores, data are stored as (key, value) pairs. The value is not assumed to have any structure
and is stored as an object known as blob. Blobs are retrieved by using the key as the only search criterion.
Keys are mapped to values via a hash function, which produces an address for the location of the value.
The hashing stage facilitates distributed storage via sharding and horizontal scaling; e.g., a prefix of the hash can represent the node and/or shard address.
[Figure: a key (e.g. "MyIP") is passed through a hash function to obtain an index; the resulting (index, value) pairs are spread across the cluster, e.g. indexes 1 and 3 stored on Node 1 and indexes 2 and 4 on Node 2.]
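A minimal in-memory sketch of this put/get path (the two-node layout and hashing choice are assumptions made purely for illustration):

import hashlib

nodes = [{}, {}]  # two nodes, each holding a dictionary of (index, value) pairs

def put(key: str, value):
    # Hash the key to an address; use it to pick both the node and the slot
    index = int(hashlib.sha1(key.encode()).hexdigest(), 16)
    nodes[index % len(nodes)][index] = value

def get(key: str):
    index = int(hashlib.sha1(key.encode()).hexdigest(), 16)
    return nodes[index % len(nodes)].get(index)

put("MyIP", "127.0.0.1")   # the value is an opaque blob to the store
put("owner", "John Smith")
print(get("MyIP"))          # the key is the only search criterion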
Document stores
Key-values stores with “query-able” document objects/blobs
Document stores can be seen as key-value stores, where value objects, known as documents, consist of a
collection of labelled and unordered attributes. Documents can have different attributes and new
attributes can be added. Document stores hence offer flexible, semi-structured storage.
In contrast to key-value stores, where blobs are opaque, documents can be queried and retrieved based on the values of their attributes. Documents are commonly represented as JSON or XML text, as in the example below.
{
  "year": 2014,
  "title": "The Grand Budapest Hotel",
  "info": {
    "directors": ["Wes Anderson"],
    "actors": ["Ralph Fiennes", "F. Murray Abraham"]
  }
}
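For instance, assuming a MongoDB server is running locally and the third-party pymongo client is installed (the database and collection names are placeholders), the document above could be stored and then retrieved by one of its attribute values:

from pymongo import MongoClient  # MongoDB client (assumption: installed)

client = MongoClient("mongodb://localhost:27017")
films = client["media"]["films"]  # database and collection names are placeholders

films.insert_one({
    "year": 2014,
    "title": "The Grand Budapest Hotel",
    "info": {"directors": ["Wes Anderson"],
             "actors": ["Ralph Fiennes", "F. Murray Abraham"]},
})

# Query by attribute value rather than by key: all films directed by Wes Anderson
for doc in films.find({"info.directors": "Wes Anderson"}):
    print(doc["title"], doc["year"])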
Column-oriented databases
Tabular data as collections of attribute values (or columns of items)
Rows and columns in tabular data represent items and attributes respectively.
OLTP-oriented databases are designed to retrieve all the attributes of individual items, so storing a table row-by-row lets them retrieve such items efficiently.
Many Big Data applications are not concerned with individual items and process all the data values from
each attribute. Hence, efficiency can be enhanced by storing tables column-by-column. This brings
additional storage benefits, such as compression.
An example is Google’s BigTable : https://searchdatamanagement.techtarget.com/definition/Google-BigTable
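The difference can be illustrated with a toy table (the values are invented): computing an aggregate over one attribute touches every full row in a row-oriented layout, but only a single contiguous list in a column-oriented one.

# The same small table stored in two layouts
row_store = [
    {"id": 1, "city": "London", "amount": 10.0},
    {"id": 2, "city": "Paris",  "amount": 25.5},
    {"id": 3, "city": "Berlin", "amount": 7.2},
]
column_store = {
    "id":     [1, 2, 3],
    "city":   ["London", "Paris", "Berlin"],
    "amount": [10.0, 25.5, 7.2],
}

# Analytical query: total amount over all items
print(sum(row["amount"] for row in row_store))  # must scan every full row
print(sum(column_store["amount"]))              # reads one column only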
In graph databases relationships between data items are as important as the data items themselves.
Alongside data, graph databases store the connections between data items and are optimised to traverse
the resulting graph structure efficiently.
Graph databases are suitable for scenarios such as social networking, recommendation engines or fraud
detection, where it is necessary to query relationships between data items.
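As a minimal sketch (the social-network data below is invented for illustration), a graph can be stored as an adjacency list and traversed to answer a relationship query such as "who is within two hops of a given user":

from collections import deque

# Adjacency list: each person maps to the people they are connected to
follows = {
    "alice": ["bob", "carol"],
    "bob": ["dave"],
    "carol": ["dave", "erin"],
    "dave": [],
    "erin": [],
}

def within_two_hops(start: str) -> set:
    # Breadth-first traversal limited to two hops from the start node
    seen, frontier = set(), deque([(start, 0)])
    while frontier:
        person, depth = frontier.popleft()
        if depth == 2:
            continue
        for neighbour in follows[person]:
            if neighbour not in seen and neighbour != start:
                seen.add(neighbour)
                frontier.append((neighbour, depth + 1))
    return seen

print(within_two_hops("alice"))  # bob, carol, dave and erin

Dedicated graph databases store and index the connections natively, so such traversals remain efficient even over very large graphs.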
Cloud computing
Storage, compute and network resources can be hosted on-premises or in external datacentres and accessed over the internet as a service. The delivery of such services over the internet is known as cloud computing.
Cloud computing platforms provide on-demand and flexible access to storage, compute, network and other IT services. This makes cloud computing a suitable choice for building Big Data solutions, as the amount of available resources is virtually unlimited and can easily scale as needs grow.
Examples of cloud computing platforms include AWS (Amazon Web Services), GCP (Google Cloud Platform), Microsoft Azure, IBM Cloud and Alibaba Cloud.
Cloud services and architectures
Building Big Data solutions in the Cloud
Cloud computing platforms offer the tools necessary to build Big Data solutions by combining decoupled, independent services that communicate via interfaces known as Application Programming Interfaces (APIs).
Cloud services can offer different levels of control, flexibility and management. It is common to
distinguish between three families of services:
• Infrastructure as a Service (IaaS): Includes fundamental services (storage, compute and network) and offers the highest level of flexibility, control and management.
• Platform as a Service (PaaS): Focused on building and delivering applications, the cloud provider
manages the underlying infrastructure (resource procurement, OS maintenance, patching).
• Software as a Service (SaaS): End-user applications whose software and underlying infrastructure are managed by the cloud provider.
Amazon Web Services
Big Data solutions on AWS
• S3: Storage
• EC2: Compute
• Glue: ETL
• Redshift: Warehouse
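For example, assuming AWS credentials are configured and the boto3 SDK is installed (the bucket name and object keys below are placeholders), raw data can be landed in and listed from S3 as follows:

import boto3  # AWS SDK for Python (assumption: installed, credentials configured)

s3 = boto3.client("s3")

# Upload a raw data file into an S3 bucket acting as the data lake (names are placeholders)
s3.upload_file("events.log", "my-data-lake-bucket", "raw/2024/events.log")

# List what has been ingested so far under the raw/ prefix
response = s3.list_objects_v2(Bucket="my-data-lake-bucket", Prefix="raw/")
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])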
Google Cloud Platform
Big Data solutions on GCP
• Bigtable: NoSQL database
• Dataproc: Hadoop and Spark