Unit-4 Relational Database and Big Data

Department of Informatics, Nizam College


Relational Database and Big Data
• Traditional database systems typically use a relational model in which all data is stored using predetermined schemas and linked through the values in specific columns of each table.
• Requiring a schema to be applied when data is written can mean that some information hidden in the data is lost.
• Big data solutions do not force a schema onto the stored data. Instead, you can store almost any type of structured, semi-structured, or unstructured data and then apply a suitable schema when you query it (see the sketch below).
• Because big data solutions store the data in its raw format and apply a schema only when the data is read, all of the information within the data is preserved.
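As a minimal illustration of schema-on-read, the sketch below (plain Python; the file name and fields are invented for the example) stores records as raw JSON lines and applies a schema only at query time:

```python
import json

# Schema-on-write (relational style): the schema is fixed before any data
# is stored, and fields outside it are rejected or silently dropped.
# Schema-on-read (big data style): store the raw record untouched, and
# project whichever fields a given query needs when it runs.

raw_events = [
    '{"user": "alice", "action": "click", "page": "/home"}',
    '{"user": "bob", "action": "purchase", "amount": 19.99, "coupon": "SAVE5"}',
]

# Write path: append raw text as-is; no schema enforced, nothing lost.
with open("events.jsonl", "w") as f:
    for line in raw_events:
        f.write(line + "\n")

# Read path: apply a schema suited to *this* query (user + action only);
# the other fields remain in the raw file for future queries to use.
with open("events.jsonl") as f:
    for line in f:
        record = json.loads(line)
        print(record["user"], record.get("action", "unknown"))
```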

• Traditional database systems typically consist of a central node where all processing takes place, which means that all the data must be moved from storage to the central location for processing. The capacity of this central node can be increased only by scaling up, and there is a physical limit on the number of CPUs and the amount of memory, depending on the chosen hardware platform.
• The consequence is a ceiling on processing capacity, as well as network latency whenever the data is moved to the central node.
• In contrast, big data solutions are optimized for storing vast quantities of data using simple file formats and highly distributed storage mechanisms, and the initial processing of the data occurs at each storage node (see the sketch below).
• This means that, assuming you have already loaded the data into the cluster storage, the bulk of the data does not need to be moved over the network for processing.
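To make the "processing occurs at each storage node" point concrete, here is a toy simulation (plain Python; the node names and partitions are made up, and real clusters use frameworks such as MapReduce or Spark) in which each node computes a small partial result locally, so only those partials, not the raw data, cross the network:

```python
# Toy simulation of data locality: each "node" holds a partition of the
# data and computes a local (partial) aggregate. Only the tiny partial
# results travel to the coordinator, not the raw records.

node_partitions = {
    "node1": [3, 7, 1, 9],     # raw data resident on node 1
    "node2": [4, 4, 2],        # raw data resident on node 2
    "node3": [8, 6, 5, 2, 1],  # raw data resident on node 3
}

def local_aggregate(records):
    """Runs *on* the storage node; returns a small summary."""
    return {"count": len(records), "total": sum(records)}

# Each node processes its own data in place.
partials = {node: local_aggregate(data) for node, data in node_partitions.items()}

# The coordinator combines the small partials into the final answer.
count = sum(p["count"] for p in partials.values())
total = sum(p["total"] for p in partials.values())
print(f"mean = {total / count:.2f} from {count} records")
```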

Will a big data solution replace relational databases?
• Big data batch processing solutions offer a way to avoid storage limitations, or to reduce the cost of storage and processing, for huge and growing volumes of data.
• Big data batch processing solutions are extremely unlikely ever to replace existing relational databases; in the majority of cases they complement and augment the capabilities of relational systems for managing data and generating BI.
• Big data is also a valuable tool when you need to handle data that arrives very quickly and can be processed later.
• You can dump the data into the storage cluster in its original format, and then process it when required using a query that extracts the required result set and stores it in a relational database, or makes it available for reporting (see the sketch below).
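A minimal sketch of that "dump now, process later" pattern, using Python's built-in sqlite3 as a stand-in for the relational database (the file name and fields are hypothetical):

```python
import json, sqlite3

# Step 1 (ingest time): raw events are dumped to cluster storage in their
# original format. Here a JSON-lines file stands in for that raw store.
raw = [
    {"sku": "A1", "qty": 2, "price": 5.0},
    {"sku": "B2", "qty": 1, "price": 12.5},
    {"sku": "A1", "qty": 3, "price": 5.0},
]
with open("raw_orders.jsonl", "w") as f:
    for r in raw:
        f.write(json.dumps(r) + "\n")

# Step 2 (later, when required): a batch query extracts the result set...
revenue = {}
with open("raw_orders.jsonl") as f:
    for line in f:
        r = json.loads(line)
        revenue[r["sku"]] = revenue.get(r["sku"], 0) + r["qty"] * r["price"]

# ...and stores it in a relational database for BI and reporting.
db = sqlite3.connect("reporting.db")
db.execute("CREATE TABLE IF NOT EXISTS revenue (sku TEXT PRIMARY KEY, total REAL)")
db.executemany("INSERT OR REPLACE INTO revenue VALUES (?, ?)", revenue.items())
db.commit()
print(db.execute("SELECT * FROM revenue ORDER BY total DESC").fetchall())
```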


Combining big data batch processing with a relational database
• In this kind of environment, additional capabilities become available, because big data batch processing systems can work with almost any type of data.
• It is quite feasible to implement a bidirectional data management solution in which data held in a relational database or BI system is processed by the big data batch processing mechanism, then fed back into the relational database or used for analysis and reporting.
• This is exactly the type of environment that the Microsoft Analytics Platform System (APS) provides.

Advantages of Relational Models
• Works with structured data
• Supports strict ACID transactional consistency (see the sketch below)
• Supports joins
• Built-in data integrity
• Large ecosystem
• Relationships enforced via constraints
• Extensive indexing
• Strong, mature SQL
• Most off-the-shelf applications run on an RDBMS
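The two advantages most often cited, ACID transactions and joins, can be shown in a few lines with Python's built-in sqlite3 (the tables here are invented purely for illustration):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("PRAGMA foreign_keys = ON")  # enable constraint enforcement in SQLite
db.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(id),  -- built-in integrity
        amount REAL NOT NULL
    );
""")

# ACID: both inserts commit together or not at all.
with db:  # the connection as context manager wraps a transaction
    db.execute("INSERT INTO customers VALUES (1, 'Alice')")
    db.execute("INSERT INTO orders VALUES (1, 1, 42.0)")

# Joins: relate rows across tables declaratively.
rows = db.execute("""
    SELECT c.name, SUM(o.amount)
    FROM customers c JOIN orders o ON o.customer_id = c.id
    GROUP BY c.name
""").fetchall()
print(rows)  # [('Alice', 42.0)]
```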

Issues with Relational Models
• Does not scale out horizontally (for concurrency and data size), only vertically, unless sharding is used. (Sharding is the process of breaking up large tables into smaller chunks called shards; see the sketch after this list.)
• Data is normalized, meaning lots of joins, which affects speed
• Difficulty in working with semi-structured data
• Schema-on-write
• Cost
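A minimal sketch of hash-based sharding (the shard layout and keys are invented), showing how a large logical table can be split so that no single server holds or scans all the data:

```python
import hashlib

# Hash-based sharding: route each row to one of N shards by hashing its
# key, so each shard holds only a fraction of the whole table.
NUM_SHARDS = 4
shards = [dict() for _ in range(NUM_SHARDS)]  # stand-ins for 4 separate servers

def shard_for(key: str) -> int:
    # Stable hash so the same key always routes to the same shard.
    # (Python's built-in hash() is randomized per process, hence a digest.)
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % NUM_SHARDS

def put(key, value):
    shards[shard_for(key)][key] = value

def get(key):
    return shards[shard_for(key)].get(key)

put("user:1001", {"name": "Alice"})
put("user:2002", {"name": "Bob"})
print(get("user:1001"), "| shard sizes:", [len(s) for s in shards])
```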


Non Relational Databases
• A database that does not follow the relational (RDBMS) style is known as a non-relational database; that is, it does not use the table/key model of an RDBMS.
• Non-relational databases require effective data operation techniques and processes, are often custom designed, and provide solutions to many big data problems.
• NoSQL databases are the best-known examples of this emerging class of non-relational database.

Advantages of Non Relational Databases
• Works with semi-structured data (JSON, XML)
• Scales out (horizontal scaling; parallel query performance, replication)
• High concurrency, high-volume random reads and writes
• Massive data stores
• Schema-free, schema-on-read; supports documents with different fields (see the sketch below)
• High availability
• Lower cost
• Simplicity of design: no object-relational "impedance mismatch"
• Finer control over availability
• Speed, due to not having to join tables
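To illustrate "schema-free" and "documents with different fields", the sketch below uses plain Python dicts as a stand-in for a document store (the collection and fields are invented): two documents with different shapes live in the same collection, and each query decides what it needs at read time.

```python
# A "collection" of documents: same store, different fields per document.
products = [
    {"_id": 1, "name": "T-shirt", "sizes": ["S", "M", "L"]},
    {"_id": 2, "name": "E-book", "file_format": "epub", "pages": 214},
]

# No schema was declared up front; each query applies its own view of the
# data (schema-on-read) and tolerates fields that are absent.
for doc in products:
    name = doc["name"]
    sizes = doc.get("sizes", "n/a")   # only physical goods have sizes
    pages = doc.get("pages", "n/a")   # only e-books have pages
    print(f"{name}: sizes={sizes}, pages={pages}")
```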


Disadvantages of Non Relational Databases
• Weaker or eventual consistency (BASE) instead of ACID
• Limited support for joins; does not support star schemas
• Data is denormalized, requiring mass updates (e.g. a product name change; see the sketch below)
• Does not have built-in data integrity (must be enforced in application code)
• No relationship enforcement
• Limited indexing
• Weak SQL
• Limited transaction support
• Slow mass updates
• Uses 10-50x more space (replication, denormalized documents)
• Difficulty tracking schema changes over time
• Most NoSQL databases are still too immature for reliable enterprise operational applications
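The denormalization drawback is easy to see in miniature. In this sketch (invented documents), the product name is embedded in every order, so renaming the product means touching every copy, whereas a normalized relational design would change a single row:

```python
# Denormalized documents: the product name is copied into every order.
orders = [
    {"order_id": 1, "product": {"sku": "A1", "name": "Widget"}, "qty": 2},
    {"order_id": 2, "product": {"sku": "A1", "name": "Widget"}, "qty": 5},
    {"order_id": 3, "product": {"sku": "B2", "name": "Gadget"}, "qty": 1},
]

# Renaming product A1 requires a mass update over every order document...
for doc in orders:
    if doc["product"]["sku"] == "A1":
        doc["product"]["name"] = "Widget Pro"

# ...whereas a normalized relational schema needs only one row changed:
#   UPDATE products SET name = 'Widget Pro' WHERE sku = 'A1';
print(orders)
```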


Polyglot Persistence
• In 2006, Neal Ford coined the term Polyglot Programming to express the idea that applications should be written in a mix of languages, to take advantage of the fact that different languages are suitable for tackling different problems.
• Complex applications combine different types of problems, so picking the right language for each job may be more productive than trying to fit all aspects into a single language.


• Polyglot Persistence applies the same idea to storage: when storing data, it is best to use multiple data storage technologies, chosen according to the way the data is used by individual applications, or by components of a single application.
• Different kinds of data are best dealt with by different data stores.
• In short, it means picking the right tool for the right use case, just as Polyglot Programming picks the right language for each problem.


• Looking at a Polyglot Persistence example, an e-commerce platform deals with many types of data (e.g. shopping cart, inventory, completed orders).
• Instead of trying to store all this data in one database, which would require a lot of conversion to force it all into one format, store each kind of data in the database best suited to it (see the sketch below).
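A sketch of the routing idea (the store names and the mapping are illustrative only; the three "stores" are in-memory stand-ins for real systems such as a key-value cache, a document database, and an RDBMS):

```python
# Polyglot persistence: route each kind of data to the store best suited
# to it, instead of forcing everything into one database.
key_value_store = {}   # e.g. a cache/KV store for volatile shopping carts
document_store = []    # e.g. a document DB for flexible product catalogs
relational_rows = []   # e.g. an RDBMS for completed orders (transactions)

def save_cart(session_id, items):
    key_value_store[session_id] = items        # fast, ephemeral

def save_product(doc):
    document_store.append(doc)                 # varying fields welcome

def save_completed_order(order_id, total):
    relational_rows.append((order_id, total))  # durable, transaction-worthy

save_cart("sess-42", ["A1", "B2"])
save_product({"sku": "A1", "name": "Widget", "tags": ["sale"]})
save_completed_order(1001, 17.50)
print(key_value_store, document_store, relational_rows, sep="\n")
```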

So the e-commerce platform might look like this:

[Figure: e-commerce data types mapped to the data stores best suited to each]


The web application platform might look like this:

[Figure: web application components mapped to the data stores best suited to each]


Integrating Big Data with Traditional Data Warehouse
• In data warehouses, data is:
  - homogeneous in nature
  - highly structured
  - adjusted for custom purposes
  - held in highly centralized structures of records
• While the worlds of big data and the traditional data warehouse will intersect, they are unlikely to merge anytime soon.
• Think of a data warehouse as a system of record for business intelligence, much like a customer relationship management (CRM) or accounting system. These systems are highly structured and optimized for specific purposes. In addition, these systems of record tend to be highly centralized.
• Organizations will inevitably continue to use data warehouses to manage the type of structured and operational data that characterizes systems of record.
• These data warehouses will still provide business analysts with the ability to analyze key data, trends, and so on. However, the advent of big data is both challenging the role of the data warehouse and providing a complementary approach.

"Think of the relationship between the data warehouse and big data as merging to become a hybrid structure. In this hybrid model, the highly structured, optimized operational data remains in the tightly controlled data warehouse, while the data that is highly distributed and subject to change in real time is controlled by a Hadoop-based (or similar NoSQL) infrastructure."
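As a toy version of that hybrid model (sqlite3 stands in for the warehouse side, a JSON-lines file for the Hadoop-like side; all table, file, and field names are invented), a report can combine the tightly controlled operational data with loosely structured data aggregated at read time:

```python
import json, sqlite3

# Warehouse side: structured, controlled operational data.
dw = sqlite3.connect(":memory:")
dw.execute("CREATE TABLE sales (region TEXT, total REAL)")
dw.executemany("INSERT INTO sales VALUES (?, ?)",
               [("north", 1200.0), ("south", 950.0)])

# Big-data side: raw clickstream kept in its original form.
clicks = [{"region": "north", "clicks": 3400}, {"region": "south", "clicks": 5100}]
with open("clicks.jsonl", "w") as f:
    for c in clicks:
        f.write(json.dumps(c) + "\n")

# Hybrid report: aggregate the raw side at read time, combine with
# warehouse data keyed on region.
click_totals = {}
with open("clicks.jsonl") as f:
    for line in f:
        c = json.loads(line)
        click_totals[c["region"]] = click_totals.get(c["region"], 0) + c["clicks"]

for region, total in dw.execute("SELECT region, total FROM sales"):
    print(region, "revenue:", total, "clicks:", click_totals.get(region, 0))
```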

• It’s inevitable that operational and structured data will have to interact with the world of big data, where the information sources have not (necessarily) been cleansed or profiled.
• Increasingly, organizations understand that they have a business requirement to combine traditional data warehouses, and their historical business data sources, with less structured and less vetted big data sources. A hybrid approach supporting both traditional and big data sources can help accomplish these business goals.

• The main challenges confronting the physical architecture of a next-generation data warehouse platform include:
✓ Data availability
✓ Loading
✓ Storage performance
✓ Data volume
✓ Scalability
✓ Assorted and varying query demands against the data
✓ Operational cost of maintaining the environment
