SSZG554 Merged DDS PDF
BITS Pilani
Pilani | Dubai | Goa | Hyderabad
Anil Kumar Ghadiyaram
T1 : M. Tamer Özsu, Patrick Valduriez, Principles of Distributed Database Systems, Third Edition
T2 : Pradeep K. Sinha, Distributed Operating Systems: Concepts and Design, PHI Learning Pvt. Ltd.
R1 : Ulf Troppens, Wolfgang Muller-Freidt, Rainer Wolafka (IBM Storage Software Development, Germany), Storage Networks Explained, Wiley
• Server-centric IT architecture and its limitations
• Storage-centric IT architecture and its advantages
• Architecture of intelligent disk subsystems
• Hard disks and internal I/O channels
• JBOD: Just a bunch of disks
• Storage virtualisation using RAID
• Different RAID levels
– RAID 0: block-by-block striping
– RAID 1: block-by-block mirroring
– RAID 0+1/RAID 10: striping and mirroring combined
• Storage devices are normally only connected to a single server
• To increase fault tolerance, storage devices are sometimes connected to two
servers, with only one server actually able to use the storage device at any
one time
• In both cases, the storage device exists only in relation to the server to which
it is connected. Other servers cannot directly access the data; they always
have to go through the server that is connected to the storage device
• Servers and storage devices are generally connected together by SCSI
cables
In a server-centric IT
architecture storage devices
exist only in relation to
servers.
• The failure of both of these computers would make it impossible to access the data, even though a company's data, for example patient files or websites, must be available around the clock
• Each computer can accommodate only a limited number of I/O cards (for
example, SCSI cards).
• The length of SCSI cables is limited to a maximum of 25 m. This means that the storage capacity that can be connected to a computer using conventional technologies is limited
• Conventional technologies are therefore no longer sufficient to satisfy the
growing demand for storage capacity.
• Computer cannot access storage devices that are connected to a different
computer
SSZG554 - Distributed Data Systems 8th Aug 2020 7 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Server-Centric IT Architecture and Its Limitations
The storage capacity on server 2
is full. It cannot make use of the
fact that there is still storage
space free on server 1 and
server 3.
• Storage networks can solve the problems of server-centric IT architecture
• Storage networks open up new possibilities for data management.
• The idea behind storage networks is that the SCSI cable is replaced by a network that is installed in addition to the existing LAN and is primarily used for data exchange between computers and storage devices. In storage networks, storage devices exist completely independently of any computer.
• Several servers can access the same storage device directly over the
storage network without another server having to be involved
• When a storage network is introduced, the storage devices are usually also
consolidated.
• This involves replacing the many small hard disks attached to the computers
with a large disk subsystem.
• Disk subsystems currently (in the year 2009) have a maximum storage
capacity of up to a petabyte.
• The storage network permits all computers to access the disk subsystem
and share it.
• Free storage capacity can thus be flexibly assigned to the computer that
needs it at the time. In the same manner, many small tape libraries can be
replaced by one big one
• A disk subsystem can be visualised as a hard disk server.
• Servers are connected to the connection port of the disk subsystem using
standard I/O techniques such as Small Computer System Interface (SCSI),
Fibre Channel or Internet SCSI (iSCSI) and can thus use the storage
capacity that the disk subsystem provides
• The internal structure of the disk subsystem is completely hidden from the
server, which sees only the hard disks that the disk subsystem provides to
the server.
• The connection ports are extended to the hard disks of the disk subsystem
by means of internal I/O channels
• In most disk subsystems there is a controller between the connection ports
and the hard disks.
• The controller can significantly increase the data availability and data
access performance with the aid of a so-called RAID procedure.
• Furthermore, some controllers realize the copying services instant copy and
remote mirroring and further additional services.
• The controller uses a cache in an attempt to accelerate read and write
accesses to the server.
• Standard I/O techniques such as SCSI, Fibre Channel, increasingly Serial
ATA (SATA) and Serial Attached SCSI (SAS) and, still to a degree, Serial
Storage Architecture (SSA) are being used for internal I/O channels between
connection ports and controller as well as between controller and internal
hard disks
• The I/O channels can be designed with built-in redundancy in order to
increase the fault-tolerance of a disk subsystem.
The following cases can be differentiated here:
– Active
– Active/passive
– Active/active (no load sharing)
– Active/active (load sharing)
Disk subsystem – Active
• In active cabling each hard disk is connected via only a single I/O channel; if this channel fails, the data on the disks can no longer be accessed.
Disk subsystem – Active/passive
• In active/passive cabling the individual hard disks
are connected via two I/O channels
• In normal operation the controller communicates
with the hard disks via the first I/O channel and
the second I/O channel is not used.
• In the event of the failure of the first I/O channel,
the disk subsystem switches from the first to the
second I/O channel.
Disk subsystem – Active/active (no load sharing)
• In this cabling method the controller uses both I/O
channels in normal operation
• The hard disks are divided into two groups
• In normal operation the first group is addressed
via the first I/O channel and the second via the
second I/O channel.
• If one I/O channel fails, both groups are
addressed via the other I/O channel.
Disk subsystem – Active/active (load sharing)
• In this approach all hard disks are addressed via
both I/O channels in normal operation
• The controller divides the load dynamically
between the two I/O channels so that the available
hardware can be optimally utilised.
• If one I/O channel fails, then the communication
goes through the other channel only.
If we compare disk subsystems with regard to their controllers we can
differentiate between three levels of complexity:
1. No controller
2. RAID controller
3. Intelligent controller with additional services such as instant copy and
remote mirroring
• If the disk subsystem has no internal controller, it is only an enclosure full of disks (JBOD: "just a bunch of disks"). In this instance, the hard disks are permanently fitted into the enclosure and the connections for I/O channels and power supply are taken outwards at a single point.
• Therefore, a JBOD is simpler to manage than a few loose hard disks.
• Typical JBOD disk subsystems have space for 8 or 16 hard disks.
• A connected server recognizes all these hard disks as independent disks.
Therefore, 16 device addresses are required for a JBOD disk subsystem
incorporating 16 hard disks
• In some I/O techniques such as SCSI and Fibre Channel arbitrated loop this
can lead to a bottleneck at device addresses
• In contrast to intelligent disk subsystems, a JBOD disk subsystem is not capable of supporting RAID or other forms of virtualisation.
• If required, however, these can be realised outside the JBOD disk
subsystem, for example, as software in the server or as an independent
virtualisation entity in the storage network
• RAID has two main goals: to increase performance by striping and to
increase fault-tolerance by redundancy.
• Striping distributes the data over several hard disks and thus distributes the
load over more hardware.
• Redundancy means that additional information is stored so that the operation
of the application itself can continue in the event of the failure of a hard disk.
• The bundle of physical hard disks brought together by the RAID controller
are also known as virtual hard disks.
• A server that is connected to a RAID system sees only the virtual hard disk;
the fact that the RAID controller actually distributes the data over several
physical hard disks is completely hidden to the server
• Only the administrator can see this distribution from outside.
• The hot spare disks are not used in normal operation. If a disk fails, the RAID
controller immediately begins to copy the data of the remaining intact disk
onto a hot spare disk.
• After the defective disk has been replaced, the replacement is included in the pool of hot spare disks
• The RAID controller combines several physical hard disks to create a virtual
hard disk.
• The server sees only a single virtual hard disk. The controller hides the
assignment of the virtual hard disk to the individual physical hard disks.
• RAID has developed since its original definition in 1987. Due to technical
progress some RAID levels are now practically meaningless, whilst others
have been modified or added at a later date.
• RAID 0 distributes the data that the server writes to the virtual hard disk onto
one physical hard disk after another block-by-block
• RAID 0 increases the performance of the virtual hard disk as follows: the individual hard disks can exchange data with the RAID controller via the I/O channel significantly more quickly than they can write to or read from the rotating disk. Because consecutive blocks go to different disks, several disks read or write in parallel, making the virtual hard disk faster than any single physical disk.
• RAID 0 increases the performance of the virtual hard disk, but not its fault-
tolerance.
• If a physical hard disk is lost, all the data on the virtual hard disk is lost.
• To be precise, therefore, the ‘R’ for ‘Redundant’ in RAID is incorrect in the
case of RAID 0, with ‘RAID 0’ standing instead for ‘zero redundancy’.
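The block-by-block striping described above can be sketched in a few lines of Python. The function name and layout here are illustrative assumptions, not taken from the text:

```python
# A minimal sketch of RAID 0's block-by-block striping: which physical
# disk, and which block on that disk, a logical block of the virtual
# hard disk lands on.

def raid0_map(logical_block: int, num_disks: int):
    """Return (disk_index, block_on_disk) for round-robin striping."""
    return logical_block % num_disks, logical_block // num_disks

# With four disks, blocks 0..3 go to disks 0..3; block 4 wraps back to
# disk 0, so consecutive blocks keep several disks busy in parallel.
```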
• In contrast to RAID 0, in RAID 1 fault-tolerance is of primary importance.
• The basic form of RAID 1 brings together two physical hard disks to form a
virtual hard disk by mirroring the data on the two physical hard disks.
• If the server writes a block to the virtual hard disk, the RAID controller writes
this block to both physical hard disks
• The individual copies are also called mirrors. Normally, two or sometimes
three copies of the data are kept (three-way mirror).
• When writing with RAID 1, a slight reduction in performance may even have to be accepted, because the RAID controller has to send the data to both hard disks.
• This disadvantage can be disregarded for an individual write operation
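A toy model of the mirroring just described; the dict-backed "disks" and the function name are stand-ins of our own, not from the text:

```python
# Toy model of RAID 1 mirroring: the controller writes every block to
# both physical disks, so either disk alone still holds all the data.

def raid1_write(mirrors, block_no, data):
    for disk in mirrors:          # the data must go to every mirror,
        disk[block_no] = data     # which is why writes can be slower

disk_a, disk_b = {}, {}
raid1_write([disk_a, disk_b], 0, b"payload")
# If disk_a now fails, disk_b still serves block 0 unchanged.
```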
• The problem with RAID 0 and RAID 1 is that they increase either
performance (RAID 0) or fault-tolerance (RAID 1).
• In RAID 10 (striped mirrors) the sequence of RAID 0 (striping) and RAID 1 (mirroring) is reversed in relation to RAID 0+1 (mirrored stripes).
• Consider first the principle behind RAID 0+1 (mirrored stripes): in the example, eight physical hard disks are used.
• The RAID controller first combines four physical hard disks at a time into a total of two virtual hard disks by means of RAID 0 (striping); these virtual hard disks are visible only within the RAID controller.
• In the second level, it consolidates these two virtual hard disks into a single
virtual hard disk by means of RAID 1 (mirroring); only this virtual hard disk is
visible to the server.
• The principle underlying RAID 10 is again based on eight physical hard disks.
• In RAID 10 the RAID controller initially brings together the physical hard
disks in pairs by means of RAID 1 (mirroring) to form a total of four virtual
hard disks that are only visible within the RAID controller.
• In the second stage, the RAID controller consolidates these four virtual hard
disks into a virtual hard disk by means of RAID 0 (striping). Here too, only
this last virtual hard disk is visible to the server.
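The two-level RAID 10 mapping can be sketched as follows; the helper name and layout are illustrative assumptions, not from the text:

```python
# Sketch of the RAID 10 address mapping over eight disks: four mirrored
# pairs (RAID 1), striped block-by-block across the pairs (RAID 0).

def raid10_map(logical_block: int, num_pairs: int = 4):
    """Return (pair_index, offset): the block is stored on BOTH disks
    of that mirrored pair at the given offset."""
    return logical_block % num_pairs, logical_block // num_pairs

# In RAID 0+1 the order is reversed: blocks are first striped across
# the four disks of each RAID 0 set, and the two sets mirror each other.
```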
• RAID 4 and RAID 5
• RAID 6: double parity
• RAID 2
• RAID 3
• Comparison of the RAID levels
• Basic forms of storage
• RAID 10 provides excellent performance at a high level of fault-tolerance.
The problem with this is that mirroring using RAID 1 means that all data is
written to the physical hard disk twice.
• RAID 10 thus doubles the required storage capacity.
• The idea of RAID 4 and RAID 5 is to replace all mirror disks of RAID 10 with
a single parity hard disk.
• The RAID controller calculates a parity block for every four blocks and writes it onto the fifth physical hard disk.
• For example, the RAID controller calculates the parity block PABCD for the
blocks A, B, C and D.
• If one of the four data disks fails, the RAID controller can reconstruct the data of the defective disk using the three other data disks and the parity disk.
• From a mathematical point of view the parity block is calculated with the aid
of the logical XOR operator (Exclusive OR).
• For example, the equation PABCD = A XOR B XOR C XOR D applies.
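The XOR parity arithmetic can be checked directly in Python; the block values below are arbitrary small integers chosen for illustration:

```python
# Parity calculation and reconstruction with XOR, as in RAID 4 and RAID 5.

def parity(blocks):
    p = 0
    for b in blocks:
        p ^= b                  # P_ABCD = A XOR B XOR C XOR D
    return p

A, B, C, D = 0b1010, 0b0110, 0b1111, 0b0001
P = parity([A, B, C, D])

# If the disk holding C fails, XOR-ing the surviving blocks with the
# parity block recovers its contents.
recovered_C = parity([A, B, D, P])
```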
• The space saving offered by RAID 4 and RAID 5 comes at a price in relation to RAID 10.
• Changing a data block changes the value of the associated parity block.
This means that each write operation to the virtual hard disk requires
(1) the physical writing of the data block,
(2) the recalculation of the parity block and
(3) the physical writing of the newly calculated parity block.
This extra cost for write operations in RAID 4 and RAID 5 is called the write
penalty of RAID 4 or the write penalty of RAID 5
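One common way to realise step (2), assumed here for illustration rather than stated in the text, is to derive the new parity from the old data block and the old parity instead of re-reading all the other data disks:

```python
# The write penalty in miniature: changing one data block forces a
# parity update. The controller can compute the new parity as
#   P_new = P_old XOR D_old XOR D_new
# (read old data + old parity, then write new data + new parity).

def update_parity(p_old, d_old, d_new):
    return p_old ^ d_old ^ d_new
```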
• RAID 4 (parity disk) is designed to reduce the storage requirement of RAID
0+1 and RAID 10.
• In the example, the data blocks are distributed over four physical hard disks
by means of RAID 0 (striping).
• Instead of mirroring all data once again, only a parity block is stored for each
four blocks.
• RAID 6 offers a compromise between RAID 5 and RAID 10 by adding a
second parity hard disk to extend RAID 5, which then uses less storage
capacity than RAID 10.
• There are different approaches available today for calculating the two parity
blocks of a parity group.
• However, none of these procedures has been adopted yet as an industry
standard.
• Irrespective of an exact procedure, RAID 6 has a poor write performance
because the write penalty for RAID 5 strikes twice
• The early work on RAID began at a time when disks were not yet very reliable: bit errors were possible that could lead to a written ‘one’ being read as a ‘zero’ or a written ‘zero’ being read as a ‘one’.
• In RAID 2 the Hamming code is used, so that redundant information is stored
in addition to the actual data.
• This additional data permits the recognition of read errors and to some
degree also makes it possible to correct them.
• Today, comparable functions are performed by the controller of each
individual hard disk, which means that RAID 2 no longer has any practical
significance.
• Like RAID 4 or RAID 5, RAID 3 stores parity data. RAID 3 distributes the
data of a block amongst all the disks of the RAID 3 system so that, in
contrast to RAID 4 or RAID 5, all disks are involved in every read or write
access. RAID 3 only permits the reading and writing of whole blocks, thus
dispensing with the write penalty that occurs in RAID 4 and RAID 5.
• The writing of individual blocks of a parity group is thus not possible.
• In addition, in RAID 3 the rotation of the individual hard disks is synchronised
so that the data of a block can truly be written simultaneously.
• RAID 3 was for a long time the recommended RAID level for sequential write and read profiles such as data mining and video processing.
• Current hard disks come with a large cache of their own, which means that
they can temporarily store the data of an entire track, and they have
significantly higher rotation speeds than the hard disks of the past.
• As a result of these innovations, other RAID levels are now suitable for
sequential load profiles, meaning that RAID 3 is becoming less and less
important.
There are three basic forms of Storage
1. Direct access storage (DAS)
2. Network attached storage (NAS)
3. Storage area network (SAN)
And a number of variations on each (especially the last two)
Source Courtesy: Some of the contents of this PPT are sourced from materials provided by Publishers of T1 & T2
• File
• File System
• A distributed file system
• File Models
– Unstructured and Structured Files
– Mutable and Immutable Files
• File Accessing Models
– Accessing Remote Files
– Unit of Data Transfer
• File Sharing Semantics
• File Caching Schemes
• File Replication
• Fault Tolerance
• File Security
• Authentication
– Approaches to Authentication
– Proof by knowledge
– Proof by possession
– Proof by property
• Access Control
In a computer system, a file is a named object that comes into
existence by explicit creation, is immune to temporary failures in the system,
and persists until explicitly destroyed. The two main purposes of using files are
as follows:
1. Permanent storage of information. This is achieved by storing a file on a secondary storage medium such as a magnetic disk.
2. Sharing of information. Files provide a natural and easy means of
information sharing. That is, a file can be created by one application and
then shared with different applications at a later time.
• A file system is a subsystem of an operating system that performs file
management activities such as organization, storing, retrieval, naming,
sharing, and protection of files.
• It is designed to allow programs to use a set of operations that characterize
the file abstraction and free the programmers from concerns about the
details of space allocation and layout of the secondary storage device.
• Therefore, a file system provides an abstraction of a storage device; that is, it is a convenient mechanism for storing and retrieving information from the storage device.
• A distributed file system provides similar abstraction to the users of a
distributed system and makes it convenient for them to use files in a
distributed environment.
• In addition to the advantages of permanent storage and sharing of
information provided by the file system of a single-processor system, a
distributed file system normally supports the following:
1. Remote information sharing.
2. User mobility
3. Availability
4. Diskless workstations
A distributed file system typically provides the following three types of
services:
1. Storage Service
2. True File Service
3. Name Service
A good distributed file system should have the following features:
– Transparency
– User Mobility
– Performance
– Simplicity and Ease of Use
– Scalability
– High Availability
– High Reliability
– Data Integrity
– Security
– Heterogeneity
• Different file systems use different conceptual models of a file.
• The two most commonly used criteria for file modeling are
– Structure
– Modifiability.
• A file is an unstructured sequence of data.
• In this model, there is no substructure known to the file server, and the contents of each file of the file system appear to the file server as an uninterpreted sequence of bytes.
• The operating system is not interested in the information stored in the files.
• Hence, the interpretation of the meaning and structure of the data stored in
the files are entirely up to the application programs.
• UNIX and MS-DOS use this file model.
• Another file model that is rarely used nowadays is the structured file model.
• In this model, a file appears to the file server as an ordered sequence of
records.
• Structured files are again of two types
– Files with non-indexed records and
– Files with indexed records.
• In the non-indexed model, a file record is accessed by specifying its position
within the file, for example, the fifth record from the beginning of the file or
the second record from the end of the file.
• In the indexed model, records have one or more key fields and can be
addressed by specifying the values of the key fields.
• In file systems that allow indexed records, a file is maintained as a B-tree or
other suitable data structure or a hash table is used to locate records quickly.
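A minimal sketch of the indexed model, with a Python dict standing in for the hash-table index; the record layout is a made-up example:

```python
# Indexed records: each record carries a key field ("id"), and an
# in-memory index maps key values to record positions so that records
# can be addressed by key rather than by position.

records = [
    {"id": 7, "name": "alice"},
    {"id": 3, "name": "bob"},
    {"id": 9, "name": "carol"},
]
index = {rec["id"]: pos for pos, rec in enumerate(records)}

def lookup(key):
    """Locate a record quickly by its key field."""
    return records[index[key]]
```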
• According to the modifiability criteria, files are of two types
– Mutable
– Immutable.
• Most existing operating systems use the mutable file model.
• In this model, an update performed on a file overwrites its old contents to produce the new contents
• That is, a file is represented as a single stored sequence that is altered by
each update operation.
• On the other hand, some more recent file systems, such as the Cedar File
System (CFS) use the immutable file model.
• In this model, a file cannot be modified once it has been created except to be
deleted.
• The file versioning approach is normally used to implement file updates, and
each file is represented by a history of immutable versions.
• That is, rather than updating the same file, a new version of the file is
created each time a change is made to the file contents and the old version
is retained unchanged.
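The versioning idea behind the immutable model can be sketched as follows; the class and method names are ours, not those of any real system:

```python
# Immutable file model: an "update" never overwrites the file; it
# appends a new version and leaves every older version readable.

class ImmutableFile:
    def __init__(self, contents):
        self.versions = [contents]       # version history, oldest first

    def update(self, new_contents):
        self.versions.append(new_contents)  # old version kept unchanged

    def read(self, version=-1):
        """Read the latest version by default, or any older one."""
        return self.versions[version]
```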
• The manner in which a client's request to access a file is serviced depends
on the file accessing model used by the file system.
• The file-accessing model of a distributed file system mainly depends on two
factors
– The method used for accessing remote files and
– The unit of data access.
1. Remote service model
• In this model, the processing of the client's request is performed at the server's node: the client's request and the server's reply are exchanged across the network as messages.
• Notice that the data packing and communication overheads per message can be significant.
• Therefore, if the remote service model is used, the file server interface and the communication protocols must be designed carefully to minimize the overhead of generating messages as well as the number of messages that must be exchanged in order to satisfy a particular request.
2. Data-caching model
• In the remote service model, every remote file access request results in
network traffic.
• The data-caching model attempts to reduce the amount of network traffic by
taking advantage of the locality feature found in file accesses.
• In this model, if the data needed to satisfy the client's access request is not
present locally, it is copied from the server's node to the client’s node and is
cached there.
• The client's request is processed on the client's node itself by using the
cached data.
• Recently accessed data are retained in the cache for some time so that
repeated accesses to the same data can be handled locally.
• A replacement policy, such as the least recently used (LRU), is used to keep
the cache size bounded.
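A minimal LRU cache of the kind described, sketched with Python's OrderedDict; the capacity and names are chosen arbitrarily for illustration:

```python
from collections import OrderedDict

# Bounded client-side cache of file blocks for the data-caching model,
# using least-recently-used (LRU) replacement.

class LRUCache:
    def __init__(self, capacity=2):
        self.capacity = capacity
        self.data = OrderedDict()

    def get(self, block):
        if block in self.data:
            self.data.move_to_end(block)   # mark as most recently used
            return self.data[block]
        return None                        # miss: must fetch from server

    def put(self, block, value):
        self.data[block] = value
        self.data.move_to_end(block)
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)  # evict least recently used
```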
• In file systems that use the data-caching model, an important design issue is
to decide the unit of data transfer
• Unit of data transfer refers to the fraction (or its multiples) of a file data that is
transferred to and from clients as a result of a single read or write operation.
• A shared file may be simultaneously accessed by multiple users.
• In such a situation, an important design issue for any file system is to clearly
define when modifications of file data made by a user are observable by
other users.
• This is defined by the type of file sharing semantics adopted by a file system.
1. UNIX semantics.
• This semantics enforces an absolute time ordering on all operations and ensures that every read operation on a file sees the effects of all previous write operations performed on that file
• In particular, writes to an open file by a user immediately become visible to
other users who have this file open at the same time.
• The UNIX semantics is commonly implemented in file systems for single
processor systems because it is the most desirable semantics and also
because it is easy to serialize all read/write requests.
2. Session semantics.
• A client opens a file, performs a series of read/write operations on the file,
and finally closes the file when he or she is done with the file.
• A session is a series of file accesses made between the open and close
operations.
• In session semantics, all changes made to a file during a session are initially
made visible only to the client process that opened the session and are
invisible to other remote processes who have the same file open
simultaneously.
• Once the session is closed, the changes made to the file are made visible to
remote processes only in later starting sessions.
• Already open instances of the file do not reflect these changes.
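Session semantics can be modelled in a few lines; the class names are ours, and a real distributed file system would of course enforce this across nodes rather than in one process:

```python
# Toy model of session semantics: writes made during a session stay
# private until the session is closed; only sessions opened later see
# the new contents.

class SessionFile:
    def __init__(self, contents):
        self.committed = contents

    def open(self):
        return Session(self)

class Session:
    def __init__(self, f):
        self.file = f
        self.view = f.committed     # private working copy

    def write(self, contents):
        self.view = contents        # invisible to other open sessions

    def read(self):
        return self.view

    def close(self):
        self.file.committed = self.view  # visible to later sessions
```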
3. Immutable shared-files semantics.
• With this approach, since shared files cannot be changed at all, the problem of when to make the changes made to a file by a user visible to other users simply disappears.
4. Transaction-like semantics.
• This semantics is based on the transaction mechanism, which is a high-level
mechanism for controlling concurrent access to shared, mutable data.
• A transaction is a set of operations enclosed between a pair of begin_transaction and end_transaction operations.
• The transaction mechanism ensures that the partial modifications made to
the shared data by a transaction will not be visible to other concurrently
executing transactions until the transaction ends
• Therefore, in multiple concurrent transactions operating on a file, the final file
content will be the same as if all the transactions were run in some
sequential order.
SSZG554 - Distributed Data Systems 5th Sep 2020 101 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
File Caching Schemes
• File caching has been implemented in several file systems for centralized
time-sharing systems to improve file I/O performance
• In implementing a file-caching scheme for a centralized file system, one has
to make several key decisions, such as the granularity of cached data (large
versus small), cache size (large versus small, fixed versus dynamically
changing), and the replacement policy.
In addition to these issues, a file-caching scheme for a distributed file
system should also address the following key decisions:
1. Cache location
2. Modification propagation
3. Cache validation
• Cache location refers to the place where the cached data is stored.
• Assuming that the original location of a file is on its server's disk, there are
three possible cache locations in a distributed file system
– Server’s Main Memory
– Client’s Disk
– Client’s Main Memory
Modification propagation
• Keeping file data cached at multiple client nodes consistent is an important
design issue in those distributed file systems that use client caching.
Write-through Scheme
• In this scheme, when a cache entry is modified, the new value is immediately
sent to the server for updating the master copy of the file.
• This scheme has two main advantages—high degree of reliability and
suitability for UNIX-like semantics.
• Since every modification is immediately propagated to the server having the
master copy of the file, the risk of updated data getting lost (when a client
crashes) is very low.
• A major drawback of this scheme is its poor write performance.
Delayed-Write Scheme
• Although the write-through scheme helps on reads, it does not help in
reducing the network traffic for writes.
• Therefore, to reduce network traffic for writes as well, some systems use the
delayed-write scheme.
• In this scheme, when a cache entry is modified, the new value is written only
to the cache and the client just makes a note that the cache entry has been
updated.
• Some time later, all updated cache entries corresponding to a file are gathered together and sent to the server in a single batch.
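The two modification-propagation schemes can be contrasted in a small sketch; the `server` dict stands in for the master copy at the server node, and all names are ours:

```python
# Write-through vs. delayed-write, in miniature.

server = {}        # master copy of the file blocks at the server node
pending = set()    # dirty blocks not yet propagated (delayed-write)

def write_through(cache, block, value):
    cache[block] = value
    server[block] = value      # propagated immediately: reliable, slow

def delayed_write(cache, block, value):
    cache[block] = value
    pending.add(block)         # just note that the entry is dirty

def flush(cache):
    for block in pending:      # later, send all updates in one batch
        server[block] = cache[block]
    pending.clear()
```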
Cache validation
• The modification propagation policy only specifies when the master copy of a
file at the server node is updated upon modification of a cache entry.
• It does not tell anything about when the file data residing in the cache of
other nodes is updated.
• Obviously, a client's cache entry becomes stale as soon as some other client
modifies the data corresponding to the cache entry in the master copy of the
file.
• Therefore, it becomes necessary to verify if the data cached at a client node
is consistent with the master copy.
• If not, the cached data must be invalidated and the updated version of the
data must be fetched again from the server.
• There are basically two approaches to verify the validity of cached data
– the client-initiated approach and
– the server-initiated approach
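A sketch of the client-initiated approach using per-file version numbers; the version-number mechanism is one common choice, assumed here for illustration rather than taken from the text:

```python
# Client-initiated cache validation: before using a cached entry, the
# client compares its version number with the server's current one.

server_versions = {"f.txt": 3}
client_cache = {"f.txt": {"version": 2, "data": "old"}}

def validate(name):
    """Return True if the cached copy is still consistent."""
    entry = client_cache.get(name)
    if entry is None:
        return False               # nothing cached at all
    return entry["version"] == server_versions[name]

# Here validate("f.txt") is False: the stale copy must be invalidated
# and the updated version fetched again from the server.
```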
File Replication
• High availability is a desirable feature of a good distributed file system and
file replication is the primary mechanism for improving file availability.
• A replicated file is a file that has multiple copies, with each copy located on a
separate file server.
• Each copy of the set of copies that comprises a replicated file is referred to
as a replica of the replicated file
The replication of data in a distributed system offers the following potential
benefits:
– Increased Availability
– Increased Reliability
– Improved Response Time
– Reduced Network Traffic
– Improved System Throughput
– Better Scalability
– Autonomous Operation
Fault Tolerance
• Various types of faults could harm the integrity of the data stored by such a
system.
• For instance, a processor loses the contents of its main memory in the event
of a crash.
• Such a failure could result in logically complete but physically incomplete file
operations, making the data that are stored by the file system inconsistent.
• Similarly, during a request processing, the server or client machine may
crash, resulting in the loss of state information of the file being accessed.
• This may have an uncertain effect on the integrity of file data.
• Also, other adverse environmental phenomena such as transient faults
(caused by electromagnetic fluctuations) or decay of disk storage devices
may result in the loss or corruption of data stored by a file system.
• A portion of a disk storage device is said to be “decayed” if the data on that
portion of the device are irretrievable.
File Security
• Computer systems store large amounts of information, some of which is
highly sensitive and valuable to their users.
• Users can trust the system and rely on it only if the various resources and
information of a computer system are protected against destruction and
unauthorized access.
• Obviously, the security requirements are different for different computer
systems depending on the environment in which they are supposed to
operate.
File Security
• For example, security requirements for systems meant to operate in a
military environment are different from those for systems that are meant to
operate in an educational environment.
• The security goals of a computer system are decided by its security policies,
and the methods used to achieve these goals are called security
mechanisms.
File Security
Irrespective of the operation environment, some of the common goals
of computer security are as follows
1. Secrecy. Information within the system must be accessible only to
authorized users
2. Privacy. Misuse of information must be prevented. That is, a piece
of information given to a user should be used only for the purpose
for which it was given.
File Security
3. Authenticity. When a user receives some data, the user must be able to
verify its authenticity. That is, the data arrived indeed from its expected sender
and not from any other source.
4. Integrity. Information within the system must be protected against accidental
destruction or intentional corruption by an unauthorized user.
File Security
• A total approach to computer security involves both external and internal
security.
• External security deals with securing the computer system against external
factors such as fires, floods, earthquakes, stolen disks/tapes, leaking out of
stored information by a person who has access to the information, and so on.
File Security
Internal security, on the other hand, mainly deals with the following
two aspects:
1. User authentication. Once a user is allowed physical access to the
computer facility, the user's identification must be checked by the system
before the user can actually use the facility.
2. Access control. A computer system contains many resources and several
types of information. Obviously, not all resources and information are
meant for all users.
File Security
• In fact, a secure system requires that at any time a subject (person or
program) should be allowed to access only those resources that it currently
needs to complete its task. This requirement is commonly referred to as the
need-to-know principle or the principle of least privilege.
Authentication
• Authentication deals with the problem of verifying the identity of a user
(person or program) before permitting access to the requested resource.
• That is, an authentication mechanism prohibits the use of the system (or
some resource of the system) by unauthorized users by verifying the identity
of a user making a request.
• Authentication basically involves identification and verification.
Authentication
• Identification is the process of claiming a certain identity by a user, while
verification is the process of verifying the user's claimed identity.
• Thus, the correctness of an authentication process relies heavily on the
verification procedure employed.
Approaches to Authentication
• Proof by knowledge.
• Proof by Possession
• Proof by Property
Proof by knowledge
• In this approach, authentication involves verifying something that can only be
known by an authorized principal.
• Authentication of a user based on the password supplied by him or her is an
example of proof by knowledge.
• Authentication methods based on the concept of proof by knowledge are
again of two types
– Direct demonstration method and
– Challenge-response method.
Proof by knowledge
• In the direct demonstration method, a user claims his or her identity by
supplying information (like typing in a password) that the verifier checks
against prestored information.
• On the other hand, in the challenge-response method, a user proves his or
her identity by responding correctly to the challenge questions asked by the
verifier.
• For instance, when signing up as a user, the user picks a function, for
example, x + 18.
Proof by knowledge
• For further security improvement, several functions may be used by the
same user.
• At the time of login, the function to be used will depend on when the login is
made.
• For example, a user may use seven different functions, one for each day of
the week.
• In another variation of this method, a list of questions such as what is the
name of your father, what is the name of your mother, what is the name of
your street on which your house is located, is maintained by the system.
Proof by knowledge
• When signing up as a user, the user has to reply to all these questions and
the system stores the answers.
• At login, the system asks one of these questions at random and verifies the
answer supplied by the user.
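The challenge-response method described above can be sketched as follows. The function x + 18 (with a per-day variant) is the toy secret from the text, every other name here is illustrative, and a real system would use cryptographic challenge-response (for example, an HMAC over a nonce) rather than a guessable arithmetic function:

```python
import random

# Stored by the verifier at sign-up: one function per day of the week,
# as suggested in the text (here simply x + 18 + day).
stored_functions = {
    "alice": {day: (lambda d: (lambda x: x + 18 + d))(day) for day in range(7)},
}

def verify(user, day, answer_fn):
    """The verifier issues a random challenge and checks the response."""
    challenge = random.randint(1, 10_000)
    expected = stored_functions[user][day](challenge)
    return answer_fn(challenge) == expected

# A legitimate user knows the day's function; an impostor does not.
ok = verify("alice", day=2, answer_fn=lambda x: x + 18 + 2)   # True
bad = verify("alice", day=2, answer_fn=lambda x: x + 7)       # False
```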
Proof by possession
• In this approach, a user proves his or her identity by producing some item
that can only be possessed by an authorized principal.
• The system is designed to verify the produced item to confirm the claimed
identity.
• For example, a plastic card with a magnetic strip on it that has a user
identifier number written on it in invisible, electronic form may be used as the
item to be produced by the user.
Proof by possession
• The user inserts the card in a slot meant for this purpose in the system's
terminal, which then extracts the user identifier number from the card and
checks to see if the card produced belongs to an authorized user.
• Obviously, security can be ensured only if the item to be produced is
unforgeable and safely guarded.
Proof by property
• In this approach, the system is designed to verify the identity of a user by
measuring some physical characteristics of the user that are hard to forge.
• The measured property must be distinguishing, that is, unique among all
possible users.
• For example, a special device (known as a biometric device) may be
attached to each terminal of the system that verifies some physical
characteristic of the user, such as the person's appearance, fingerprints,
hand geometry, voice, signature.
Proof by property
• In deciding the physical characteristic to be measured, an important factor to
be considered is that the scheme must be psychologically acceptable to the
user community.
• Biometric systems offer the greatest degree of confidence that a user
actually is who he or she claims to be, but they are also generally the most
expensive to implement.
• Moreover, they often have user acceptance problems because users see
biometric devices as unduly intrusive.
• Of the three authentication approaches described above, proof by
knowledge and possession can be applied to all types of authentication
needs in a secure distributed system, while proof by property is generally
limited to the authentication of human users by a system equipped with
specialized measuring instruments.
• Moreover, in practice, a system may use a combination of two or more of
these authentication methods.
• For example, the authentication mechanism used by automated cash-
dispensing machines usually employs a combination of the first two
approaches.
• That is, a user is allowed to withdraw money only if he or she produces a
valid identification card and specifies the correct password corresponding to
the identification number on the card.
Access Control
• Once a user or a process has been authenticated, the next step in security is
to devise ways to prohibit the user or the process from accessing those
resources/information that he or she or it is not authorized to access.
• This issue is called authorization and is dealt with by using access control
mechanisms.
• Access control mechanisms (also known as protection mechanisms) used in
distributed systems are basically the same as those used in centralized
systems.
Access Control
• The main difference is that since all resources are centrally located in a
centralized system, access control can be performed by a central authority.
• However, in a distributed client-server environment, each server is
responsible for controlling access to its own resources.
Access Control
Objects
• An object is an entity to which access must be controlled.
• An object may be an abstract entity, such as a process, a file, a database, a
semaphore, a tree data structure, or a physical entity, such as a CPU, a
memory segment, a printer, a card reader, a tape drive, a site of a network.
• An object is referenced by its unique name. In addition, associated with each
object is a “type” that determines the set of operations that may be
performed on it.
Access Control
• For example, the set of operations possible on objects belonging to the type
“data file” may be Open, Close, Create, Delete, Read, and Write, whereas
for objects belonging to the type “program file,” the set of possible operations
may be Read, Write, and Execute.
• Similarly, for objects of type “semaphore,” the set of possible operations may
be Up and Down, and for objects of type "tape drive,” the set of possible
operations may be Read, Write, and Rewind.
Access Control
Subjects
• A subject is an active entity whose access to objects must be controlled.
• That is, entities wishing to access and perform operations on objects and to
which access authorizations are granted are called subjects.
• Examples of subjects are processes and users.
• Note that subjects are also objects since they too must be protected.
Therefore, each subject also has a unique name.
Access Control
Protection rules.
• Protection rules define the possible ways in which subjects and objects are
allowed to interact.
• Therefore, associated with each (subject, object) pair is an access right that
defines the subset of the set of possible operations for the object type that
the subject may perform on the object.
• The complete set of access rights of a system defines which subjects can
perform what operations on which objects.
• At any particular instance of time, this set defines the protection state of the
system at that time.
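A minimal sketch of these ideas: an access matrix records the access right for each (subject, object) pair, with the allowed operations constrained by the object's type. All subjects, objects, and rights below are made up for illustration:

```python
TYPE_OPERATIONS = {                     # operations allowed by object type
    "data file":    {"open", "close", "create", "delete", "read", "write"},
    "program file": {"read", "write", "execute"},
    "semaphore":    {"up", "down"},
}

object_types = {"payroll.dat": "data file", "report.exe": "program file"}

# The protection state at this instant: access rights per (subject, object).
access_matrix = {
    ("alice", "payroll.dat"): {"read", "write"},
    ("bob",   "payroll.dat"): {"read"},
    ("bob",   "report.exe"):  {"execute"},
}

def check_access(subject, obj, operation):
    # The operation must be meaningful for the object's type AND granted
    # to this subject (least privilege: the default is no access).
    if operation not in TYPE_OPERATIONS[object_types[obj]]:
        return False
    return operation in access_matrix.get((subject, obj), set())
```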
THANK YOU
BITS Pilani
Pilani | Dubai | Goa | Hyderabad
Anil Kumar Ghadiyaram
Source Courtesy: Some of the contents of this PPT are sourced from materials provided by
Text Books
R1 : Storage Networks Explained– by Ulf Troppens, Wolfgang Muller-
Freidt, Rainer Wolafka, IBM Storage Software Development, Germany.
Publishers: Wiley
Note: In order to broaden understanding of concepts as applied to Indian IT
industry, students are
advised to refer books of their choice and case-studies in their own organizations
SSZG554 - Distributed Data Systems 12th Sep 2020 14 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
PRESENTATION OVERVIEW
1. Level of sharing
2. Behavior of access patterns
3. Level of knowledge on access pattern behavior
• Along the second dimension of access pattern behavior, it is possible to identify two
alternatives.
• The access patterns of user requests may be static, so that they do not change over
time, or dynamic
• The significant question, then, is not whether a system is static or dynamic, but how
dynamic it is
• The third dimension of classification is the level of knowledge about the access
pattern behavior. One possibility, of course, is that the designers do not have any
information about how users will access the database.
• The more practical alternatives are that the designers have complete information,
where the access patterns can reasonably be predicted and do not deviate
significantly from these predictions, or partial information, where there are
deviations from the predictions
• The view design activity deals with defining the interfaces for end users.
• The conceptual design , on the other hand, is the process by which the enterprise is
examined to determine entity types and relationships among these entities
• This process can be divided into two related activity groups
– Entity analysis and
– Functional analysis.
• Entity analysis is concerned with determining the entities, their attributes, and the
relationships among them.
• Functional analysis is concerned with determining the fundamental functions with
which the modeled enterprise is involved.
• The results of these two steps need to be cross-referenced to get a better
understanding of which functions deal with which entities.
• There is a relationship between the conceptual design and the view design.
• The conceptual design can be interpreted as being an integration of user views.
• Even though this view integration activity is very important, the conceptual model
should support not only the existing applications, but also future applications.
• View integration should be used to ensure that entity and relationship
requirements for all the views are covered in the conceptual schema.
• In conceptual design and view design activities the user needs to specify the data
entities and must determine the applications that will run on the database as well
as statistical information about these applications.
• Statistical information includes the specification of the frequency of user
applications, the volume of various information, and the like.
• So far we have not considered the implications of the distributed
environment.
• The global conceptual schema (GCS) and access pattern information collected as a
result of view design are inputs to the distribution design step.
• The objective is to design the local conceptual schemas (LCSs) by distributing the
entities over the sites of the distributed system.
• Rather than distributing relations, it is quite common to divide them into
subrelations, called fragments , which are then distributed.
• Thus, the distribution design activity consists of two steps: fragmentation and
allocation .
• The last step in the design process is the physical design, which maps the local
conceptual schemas to the physical storage devices available at the corresponding
sites.
• The inputs to this process are the local conceptual schema and the access pattern
information about the fragments in them.
• It is well known that design and development activity of any kind is an ongoing
process requiring constant monitoring and periodic adjustment and tuning;
observation and monitoring are therefore a major activity in this process.
• This covers not only the behavior of the database implementation but also the
suitability of user views.
• The result is some form of feedback, which may result in backing up to one of the
earlier steps in the design.
• Second, if the applications that have views defined on a given relation reside at
different sites, two alternatives can be followed, with the entire relation being the
unit of distribution.
• Either the relation is not replicated and is stored at only one site, or it is replicated
at all or some of the sites where the applications reside.
• Finally, the decomposition of a relation into fragments, each being treated as a unit,
permits a number of transactions to execute concurrently.
• In addition, the fragmentation of relations typically results in the parallel execution
of a single query by dividing it into a set of subqueries that operate on fragments.
• Thus fragmentation typically increases the level of concurrency and therefore the
system throughput.
• Relation instances are essentially tables, so the issue is one of finding alternative
ways of dividing a table into smaller ones.
• There are clearly two alternatives for this:
– dividing it horizontally or
– dividing it vertically
• The degree of fragmentation goes from one extreme, that is, not to fragment at all,
to the other extreme, to fragment to the level of individual tuples (in the case of
horizontal fragmentation) or to the level of individual attributes (in the case of
vertical fragmentation).
• This level can only be defined with respect to the applications that will run on the
database. The issue is, how?
• In general, the applications need to be characterized with respect to a number of
parameters.
• According to the values of these parameters, individual fragments can be identified
• Assuming that the database is fragmented properly, one has to decide on the
allocation of the fragments to various sites on the network.
• When data are allocated, it may either be replicated or maintained as a single copy.
• The reasons for replication are reliability and efficiency of read-only queries.
• If there are multiple copies of a data item, there is a good chance that some copy of
the data will be accessible somewhere even when system failures occur.
• On the other hand, the execution of update queries causes trouble since the system
has to ensure that all the copies of the data are updated properly.
• One aspect of distribution design is that too many factors contribute to an optimal
design.
• The logical organization of the database, the location of the applications, the access
characteristics of the applications to the database, and the properties of the
computer systems at each site all have an influence on distribution decisions
• The information needed for distribution design can be divided into four categories:
– Database information
– Application information
– Communication network information
– Computer system information
• Each horizontal fragment Ri is defined by a selection on the global relation R, that
is, Ri = σFi(R), where Fi is the selection formula used to obtain fragment Ri, also
called the fragmentation predicate.
• The decomposition of relation PROJ into horizontal fragments PROJ1 and PROJ2 is
defined as follows:
• If a new tuple with a BUDGET value of, say, $600,000 were to be inserted into PROJ,
one would have had to review the fragmentation to decide if the new tuple is to go
into PROJ2 or if the fragments need to be revised and a new fragment needs to be
defined as
• Consider relation PROJ we can define the following horizontal fragments based on
the project location.
• The resulting fragments are
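Assuming the sample PROJ tuples of the textbook's running example, the location-based horizontal fragments can be sketched as selections, one fragment per fragmentation predicate:

```python
PROJ = [
    {"PNO": "P1", "PNAME": "Instrumentation",   "BUDGET": 150000, "LOC": "Montreal"},
    {"PNO": "P2", "PNAME": "Database Develop.", "BUDGET": 135000, "LOC": "New York"},
    {"PNO": "P3", "PNAME": "CAD/CAM",           "BUDGET": 250000, "LOC": "New York"},
    {"PNO": "P4", "PNAME": "Maintenance",       "BUDGET": 310000, "LOC": "Paris"},
]

def select(relation, predicate):
    """sigma_F(R): keep the tuples satisfying the fragmentation predicate F."""
    return [t for t in relation if predicate(t)]

# Location-based horizontal fragments, one per site.
PROJ1 = select(PROJ, lambda t: t["LOC"] == "Montreal")
PROJ2 = select(PROJ, lambda t: t["LOC"] == "New York")
PROJ3 = select(PROJ, lambda t: t["LOC"] == "Paris")
```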
• PROJ relation is partitioned vertically into two subrelations, PROJ1 and PROJ2 .
• PROJ1 contains only the information about project budgets, whereas PROJ2
contains project names and locations.
• It is important to notice that the primary key to the relation (PNO) is included in
both fragments
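A sketch of this vertical fragmentation, again using illustrative PROJ tuples; keeping the primary key PNO in both fragments is what allows the original relation to be reconstructed by a join:

```python
PROJ = [
    {"PNO": "P1", "PNAME": "Instrumentation",   "BUDGET": 150000, "LOC": "Montreal"},
    {"PNO": "P2", "PNAME": "Database Develop.", "BUDGET": 135000, "LOC": "New York"},
]

def project(relation, attrs):
    """pi_attrs(R): keep only the listed attributes of every tuple."""
    return [{a: t[a] for a in attrs} for t in relation]

PROJ1 = project(PROJ, ["PNO", "BUDGET"])          # budget information
PROJ2 = project(PROJ, ["PNO", "PNAME", "LOC"])    # names and locations

def join_on_pno(r1, r2):
    """Reconstruct PROJ by joining the fragments on the shared key PNO."""
    index = {t["PNO"]: t for t in r2}
    return [{**t, **index[t["PNO"]]} for t in r1]
```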
• Consider relation PROJ Assume that the following applications are defined to run on
this relation.
• According to these four applications, the attribute usage values can be defined.
• As a notational convenience, we let
– A1 = PNO,
– A2 = PNAME,
– A3 = BUDGET, and
– A4 = LOC.
• Attribute usage values are not sufficiently general to form the basis of attribute
splitting and fragmentation.
• This is because these values do not represent the weight of application frequencies.
• The frequency measure can be included in the definition of the attribute affinity
measure aff(Ai;Aj) , which measures the bond between two attributes of a relation
according to how they are accessed by applications
• The affinity measure is computed as aff(Ai, Aj) = Σk Σl refl(qk) · accl(qk), where the
outer sum ranges over all applications qk that access both Ai and Aj, refl(qk) is the
number of accesses to attributes (Ai, Aj) for each execution of application qk at site
Sl, and accl(qk) is the application access frequency measure previously defined,
modified to include frequencies at different sites.
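Using the usage and access-frequency values of the textbook's four-application example (assumed here, with refl(qk) taken as 1 for simplicity, a common textbook simplification), the affinity values can be computed as:

```python
ATTRS = ["A1", "A2", "A3", "A4"]          # PNO, PNAME, BUDGET, LOC

# use(qk, Ai) = 1 iff application qk references attribute Ai.
use = {
    "q1": {"A1", "A3"},
    "q2": {"A2", "A3"},
    "q3": {"A2", "A4"},
    "q4": {"A3", "A4"},
}
acc = {"q1": 45, "q2": 5, "q3": 75, "q4": 3}   # total access frequencies

def affinity(ai, aj):
    # Sum the frequencies of every application that uses both attributes.
    return sum(acc[q] for q, attrs in use.items() if ai in attrs and aj in attrs)

# The full attribute affinity matrix.
aff = {(ai, aj): affinity(ai, aj) for ai in ATTRS for aj in ATTRS}
```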
• Assume that there are a set of fragments F = {F1, F2, …, Fn} and a distributed
system consisting of sites S = {S1, S2, …, Sm} on which a set of applications Q =
{q1, q2, …, qq} is running.
• The allocation problem involves finding the “optimal” distribution of F to S.
Minimal cost
• The cost function consists of the cost of storing each Fi at a site Sj, the cost of
querying Fi at site Sj , the cost of updating Fi at all sites where it is stored, and the
cost of data communication.
• The allocation problem, then, attempts to find an allocation scheme that minimizes
a combined cost function.
• Why are such models not available?
Performance
• The allocation strategy is designed to maintain a performance metric.
• Two well-known ones are to minimize the response time and to maximize the
system throughput at each site.
• It is apparent that the “optimality” measure should include both the performance
and the cost factors.
• In other words, one should be looking for an allocation scheme that, for example,
answers user queries in minimal time while keeping the cost of processing minimal.
• A similar statement can be made for throughput maximization.
• In local-as-view (LAV) systems, the GCS definition exists, and each LCS is treated as a
view definition over it.
• In global-as-view systems (GAV), on the other hand, the GCS is defined as a set of
views over the LCSs.
• These views indicate how the elements of the GCS can be derived, when needed,
from the elements of LCSs
• In the first step, the component database schemas are translated to a common
intermediate canonical representation
• The use of a canonical representation facilitates the translation process by reducing
the number of translators that need to be written.
• The choice of the canonical model is important; it should be one that is sufficiently
expressive to incorporate the concepts available in all the databases that will later
be integrated.
• Alternatives that have been used include the entity-relationship model, object-
oriented model or a graph that may be simplified to a tree
• The translation step is necessary only if the component databases are
heterogeneous and local schemas are defined using different data models
• In the second step of bottom-up design, the intermediate schemas are used to
generate a GCS. In some methodologies, local external (or export) schemas are
considered for integration rather than full database schemas
• Local systems may only be willing to contribute some of their data to the
multidatabase
• Aside from schema heterogeneity, other issues that complicate the matching
process are the following
– Insufficient schema and instance information
– Unavailability of schema documentation
– Subjectivity of matching
Subjectivity of matching:
• Finally, we need to note (and admit) that matching schema elements can be highly
subjective; two designers may not agree on a single “correct” mapping.
• This makes the evaluation of a given algorithm’s accuracy significantly difficult.
• Once schema matching is done, the correspondences between the various LCSs
have been identified.
• The next step is to create the GCS, and this is referred to as schema integration.
• This step is only necessary if a GCS has not already been defined and matching was
performed on individual LCSs.
• If the GSC was defined up-front, then the matching step would determine
correspondences between it and each of the LCSs and there would be no need for
the integration step.
• N-ary integration mechanisms integrate more than two schemas at each iteration.
• One-pass integration occurs when all schemas are integrated at once, producing
the global conceptual schema after one iteration
• Once a GCS (or mediated schema) is defined, it is necessary to identify how the
data from each of the local databases (source) can be mapped to GCS (target) while
preserving semantic consistency as defined by both the source and the target
• In the case of data warehouses, schema mappings are used to explicitly extract data
from the sources, and translate them to the data warehouse schema for populating
it.
• In the case of data integration systems, these mappings are used in query
processing phase by both the query processor and the wrappers
• There are two issues related to schema mapping that we will be studying: mapping
creation and mapping maintenance.
• Mapping creation is the process of creating explicit queries that map data from a
local database to the global data.
• Mapping maintenance is the detection and correction of mapping inconsistencies
resulting from schema evolution.
• Errors in source databases can always occur, requiring cleaning in order to correctly
answer user queries.
• Data cleaning is a problem that arises in both data warehouses and data integration
systems, but in different contexts.
• In data warehouses where data are actually extracted from local operational
databases and materialized as a global database, cleaning is performed as the
global database is created.
• In the case of data integration systems, data cleaning is a process that needs to be
performed during query processing when data are returned from the source
databases
Data Cleaning
• The errors that are subject to data cleaning can generally be broken down into
either schema-level or instance-level concerns
• Schema-level problems can arise in each individual LCS due to violations of explicit
and implicit constraints.
Data Cleaning
• Instance level errors are those that exist at the data level.
• For example, the values of some attributes may be missing although they were
required, there could be misspellings and word transpositions or differences in
abbreviations embedded values that were erroneously placed in other fields,
duplicate values, and contradicting values
THANK YOU
BITS Pilani
Pilani | Dubai | Goa | Hyderabad
Anil Kumar Ghadiyaram
T1 : M. Tamer Özsu • Patrick Valduriez Principles
of Distributed Database Systems Third Edition
T2 : Distributed Operating Systems: Concepts And
Design By Pradeep K. Sinha, PHI Learning Pvt.
Ltd
R1 : Storage Networks Explained– by Ulf Troppens, Wolfgang Muller-
Freidt, Rainer Wolafka, IBM Storage Software Development, Germany.
Publishers: Wiley
• Introduction
• Consistency of Replicated Databases
• Mutual Consistency
• Mutual Consistency versus Transaction Consistency
• Update Management Strategies
– Eager Update Propagation
– Lazy Update Propagation
– Centralized Techniques
– Distributed Techniques
• Replication Protocols
– Eager Centralized Protocols
• Single Master with Limited Replication Transparency
• Single Master with Full Replication Transparency
• Primary Copy with Full Replication Transparency
– Eager Distributed Protocols
– Lazy Centralized Protocols
• Single Master with Limited Transparency
• Single Master or Primary Copy with Full Replication Transparency
– Lazy Distributed Protocols
• Replication and Failures
• Failures and Lazy Replication
• Failures and Eager Replication
– Available copies protocol
– Distributed ROWA-A protocol
• Replication Mediator Service
SSZG554 - Distributed Data Systems Date : 26th Sep 2020 249 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Introduction
Consistency of Replicated Databases
• There are two issues related to consistency of a replicated database
• One is mutual consistency, which deals with the convergence of the values of
physical data items corresponding to one logical data item
• Second is transaction consistency
Mutual Consistency
• Mutual consistency criteria for replicated databases can either be strong or
weak
• Strong mutual consistency criteria require that all copies of a data item have
the same value at the end of the execution of an update transaction.
• This is achieved by a variety of means, but the execution of 2PC at the
commit point of an update transaction is a common way to achieve strong
mutual consistency
Mutual Consistency
• Weak mutual consistency criteria do not require the values of replicas of a
data item to be identical when an update transaction terminates.
• What is required is that, if the update activity ceases for some time, the
values eventually become identical.
• This is commonly referred to as eventual consistency, which refers to the fact
that replica values may diverge over time, but will eventually converge.
• It is hard to define this concept formally or precisely
Mutual Consistency versus Transaction Consistency
• Mutual consistency refers to the replicas converging to the same value, while
transaction consistency requires that the global execution history be
serializable.
• It is possible for a replicated DBMS to ensure that data items are mutually
consistent when a transaction commits, but the execution history may not be
globally serializable.
Mutual Consistency versus Transaction Consistency
• Consider three sites (A, B, and C) and three data items (x, y, z) that are distributed as follows:
– Site A hosts x,
– Site B hosts x and y,
– Site C hosts x, y, and z.
• We will use site identifiers as subscripts on the data items to refer to a
particular replica.
• Now consider the following three transactions:
Mutual Consistency versus Transaction Consistency
Mutual Consistency versus Transaction Consistency
• Note that T1’s Write has to be executed at all three sites (since x is replicated at all three sites),
• T2’s Write has to be executed at B and C, and
• T3’s Write has to be executed only at C.
• We are assuming a transaction execution model where transactions can
read their local replicas, but have to update all of the replicas.
Mutual Consistency versus Transaction Consistency
Mutual Consistency versus Transaction Consistency
• The serialization order in HB is T1 -> T2, while in HC it is T2 -> T3 -> T1.
• Therefore, the global history is not serializable. However, the database is
mutually consistent.
• Assume, for example, that initially xA = xB = xC = 10, yB = yC = 15, and zC = 7.
• With the above histories, the final values will be xA = xB = xC = 20, yB = yC = 35, and zC = 3.5. All the physical copies (replicas) have indeed converged to the same values.
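The convergence above can be checked with a small simulation. This is a minimal sketch; the transaction bodies (T1: x <- 20, T2: y <- x + y, T3: z <- (x * y)/100) are assumptions reconstructed so as to reproduce the initial and final values stated on the slides:

```python
# Physical copies per site (initial values from the slides)
A = {"x": 10}
B = {"x": 10, "y": 15}
C = {"x": 10, "y": 15, "z": 7}

# History at B: T1 -> T2
B["x"] = 20                      # T1's Write(x) applied at B
y_new = B["x"] + B["y"]          # T2 reads its local copies at B: 20 + 15
B["y"] = y_new

# History at C: T2 -> T3 -> T1 (a different serialization order)
C["y"] = y_new                   # T2's Write(y) propagated to C
z_new = C["x"] * C["y"] / 100    # T3 reads x=10 (T1 not yet applied), y=35
C["z"] = z_new
C["x"] = 20                      # T1's Write(x) finally applied at C

A["x"] = 20                      # T1's Write(x) applied at A

print(A, B, C)   # replicas converge: x=20 everywhere, y=35, z=3.5
```

Although the two sites serialize the transactions in different orders, every replica ends with the same value, illustrating mutual consistency without global serializability.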
Update Management Strategies
• The replication protocols can be classified according to when the updates
are propagated to copies (eager versus lazy) and where updates are allowed
to occur (centralized versus distributed).
• These two decisions are generally referred to as update management
strategies
Eager Update Propagation
• The eager update propagation approaches apply the changes to all the
replicas within the context of the update transaction.
• Eager propagation may use synchronous propagation of each update, applying it to all the replicas at the same time the Write is issued.
Eager Update Propagation
Eager Update Propagation
• Eager techniques typically enforce strong mutual consistency criteria.
• Since all the replicas are mutually consistent at the end of an update
transaction, a subsequent read can read from any copy
• However, a Write(x) has to be applied to all copies.
• Thus, protocols that follow eager update propagation are known as read-one/write-all (ROWA) protocols.
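The ROWA rule can be sketched in a few lines; the in-memory structures and function names below are illustrative assumptions, not part of the protocol's definition:

```python
# Minimal sketch of read-one/write-all (ROWA): a read may use any single
# replica, but a write must be applied to every replica before the
# transaction terminates.
replicas = [{"x": 10}, {"x": 10}, {"x": 10}]  # copies of x at three sites

def read(item):
    # Read-one: under eager propagation any replica is up to date.
    return replicas[0][item]

def write(item, value):
    # Write-all: the update is applied to every replica (in a real system,
    # within the update transaction, committed via 2PC).
    for copy in replicas:
        copy[item] = value

write("x", 20)
print([copy["x"] for copy in replicas])  # every copy holds 20
print(read("x"))                         # 20, read from a single replica
```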
Eager Update Propagation
• The advantages of eager update propagation are threefold.
• First, they typically ensure that mutual consistency is enforced; therefore, there are no transactional inconsistencies.
• Second, a transaction can read a local copy of the data item (if a local copy
is available) and be certain that an up-to-date value is read. Thus, there is no
need to do a remote read.
• Finally, the changes to replicas are done atomically; thus recovery from
failures can be managed.
Eager Update Propagation
• The main disadvantage of eager update propagation is that a transaction has to update all the copies before it can terminate. This has two consequences.
• First, the response time performance of the update transaction suffers, since it typically has to participate in a 2PC execution, and because the update speed is restricted by the slowest machine.
• Second, if one of the copies is unavailable, then the transaction cannot terminate, since all the copies need to be updated.
Lazy Update Propagation
• In lazy update propagation the replica updates are not all performed within
the context of the update transaction.
• The transaction does not wait until its updates are applied to all the copies
before it commits – it commits as soon as one replica is updated.
Lazy Update Propagation
• The propagation to other copies is done asynchronously from the original
transaction, by means of refresh transactions that are sent to the replica
sites some time after the update transaction commits.
• A refresh transaction carries the sequence of updates of the corresponding
update transaction
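The refresh-transaction mechanism can be sketched as follows, assuming one master and one slave copy; the queue-based propagation is an illustrative simplification:

```python
# Sketch of lazy update propagation: the update transaction commits after
# updating one replica; a refresh transaction carrying the same updates is
# queued and applied at the other sites some time later.
from collections import deque

master = {"x": 10}
slave = {"x": 10}
refresh_queue = deque()   # refresh transactions awaiting propagation

def update_transaction(item, value):
    master[item] = value                 # commit as soon as one copy updates
    refresh_queue.append((item, value))  # record the refresh transaction

def propagate_refresh():
    # Runs asynchronously, after the update transaction has committed.
    while refresh_queue:
        item, value = refresh_queue.popleft()
        slave[item] = value

update_transaction("x", 20)
print(slave["x"])     # 10: a local read at the slave still sees stale data
propagate_refresh()
print(slave["x"])     # 20: the replicas have converged
```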
Lazy Update Propagation
• Lazy propagation is used in those applications for which strong mutual
consistency may be unnecessary and too restrictive.
• These applications may be able to tolerate some inconsistency among the
replicas in return for better performance
Lazy Update Propagation
• The primary advantage of lazy update propagation techniques is that they
generally have lower response times for update transactions, since an
update transaction can commit as soon as it has updated one copy.
Lazy Update Propagation
• The disadvantages are that the replicas are not mutually consistent and some replicas may be out-of-date; consequently, a local read may return stale data and is not guaranteed to return the up-to-date value.
Centralized Techniques
Centralized Techniques
• In some techniques, there is a single master for all replicated data; these are known as single master centralized techniques.
• In other protocols, the master copy for each data item may be different; these are typically known as primary copy centralized techniques.
Centralized Techniques
• The advantages of centralized techniques are two-fold.
• First, application of the updates is easy since they happen at only the master
site, and they do not require synchronization among multiple replica sites.
Centralized Techniques
• Second, there is the assurance that at least one site – the site that holds the
master copy – has up-to-date values for a data item.
• These protocols are generally suitable in data warehouses and other
applications where data processing is centralized at one or a few master
sites.
Centralized Techniques
• The primary disadvantage is that, as in any centralized algorithm, if there is
one central site that hosts all of the masters, this site can be overloaded and
can become a bottleneck.
Distributed Techniques
• Distributed techniques apply the update on the local copy at the site where
the update transaction originates, and then the updates are propagated to
the other replica sites.
• These are called distributed techniques since different transactions can
update different copies of the same data item located at different sites.
• They can more evenly distribute the load, and may provide the highest
system availability if coupled with lazy propagation techniques.
Distributed Techniques
• A serious complication that arises in these systems is that different replicas
of a data item may be updated at different sites (masters) concurrently.
• If distributed techniques are coupled with eager propagation methods, then the distributed concurrency control methods can adequately address the concurrent updates problem.
Replication Protocols
• We discussed two dimensions along which update management techniques
can be classified.
• These dimensions are orthogonal; therefore, four combinations are possible:
– Eager Centralized
– Eager Distributed
– Lazy Centralized
– Lazy Distributed
Eager Centralized Protocols
• In eager centralized replica control, a master site controls the operations on
a data item.
• These protocols are coupled with strong consistency techniques, so that
updates to a logical data item are applied to all of its replicas within the
context of the update transaction, which is committed using the 2PC protocol
• Consequently, once the update transaction completes, all replicas have the
same values for the updated data items
Single Master with Limited Replication Transparency
• The simplest case is to have a single master for the entire database with
limited replication transparency so that user applications know the master
site.
• In this case, global update transactions are submitted directly to the
transaction manager (TM) at the master site.
Single Master with Limited Replication Transparency
• At the master, each Read(x) operation is performed on the master copy as follows:
• A read lock is obtained on the data item, the read is performed, and the result is returned to the user.
• Similarly, each Write(x) causes an update of the master copy by first
obtaining a write lock and then performing the write operation.
Single Master with Limited Replication Transparency
• The master TM then forwards the Write to the slave sites either
synchronously or in a deferred fashion
• In either case, it is important to propagate updates such that conflicting
updates are executed at the slaves in the same order they are executed at
the master.
• This can be achieved by timestamping or by some other ordering scheme.
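One simple ordering scheme is sketched below: each propagated write carries the master's timestamp, and a slave discards any write older than the last one it applied, so conflicting updates take effect in master order. This is an illustrative last-writer-wins variant; it is only one of several possible ordering schemes:

```python
# Sketch of timestamp-ordered propagation to a slave replica.
slave_copy = {"x": 10}
last_applied_ts = 0

def apply_at_slave(ts, item, value):
    # Enforce the master's execution order: ignore writes whose timestamp
    # is not newer than the last write already applied.
    global last_applied_ts
    if ts > last_applied_ts:
        slave_copy[item] = value
        last_applied_ts = ts

# Propagated writes may arrive out of order over the network:
apply_at_slave(2, "x", 30)   # the later write arrives first
apply_at_slave(1, "x", 20)   # the stale write is discarded
print(slave_copy["x"])       # 30: master order is preserved
```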
Single Master with Limited Replication Transparency
(1) A Write is applied on the master copy;
(2) Write is then propagated to the other replicas;
(3) Updates become permanent at commit time;
(4) Read-only transaction’s Read goes to any slave copy
Single Master with Full Replication Transparency
• Single master eager centralized protocols require each user application to
know the master site, and they put significant load on the master that has to
deal with the Read operations within update transactions as well as acting as
the coordinator for these transactions during 2PC execution.
• These issues can be addressed, to some extent, by involving, in the
execution of the update transactions, the TM at the site where the application
runs.
Single Master with Full Replication Transparency
• Thus, the update transactions are not submitted to the master, but to the TM
at the site where the application runs
• This TM can act as the coordinating TM for both update and read-only
transactions.
• Applications can simply submit their transactions to their local TM, providing
full transparency.
Single Master with Full Replication Transparency
Primary Copy with Full Replication Transparency
• Each data item can have a different master.
• In this case, for each replicated data item, one of the replicas is designated
as the primary copy.
• Consequently, there is no single master to determine the global serialization
order
Primary Copy with Full Replication Transparency
• In the case of fully replicated databases, any replica can be the primary copy for a data item; however, for partially replicated databases, the limited replication transparency option makes sense only if an update transaction accesses only data items whose primary copies are at the same site.
• Otherwise, the application program cannot forward the update transaction to a single master; it will have to do it operation-by-operation, and, furthermore, it is not clear which primary copy master would serve as the coordinator for 2PC execution.
Primary Copy with Full Replication Transparency
(1) Operations (Read or Write) for each data item are routed to that data item’s master, and a Write is first applied at the master;
(2) Write is then propagated to the other replicas;
(3) Updates become permanent at commit time.
Eager Distributed Protocols
• In eager distributed replica control, the updates can originate anywhere, and
they are first applied on the local replica, then the updates are propagated to
other replicas.
• If the update originates at a site where a replica of the data item does not
exist, it is forwarded to one of the replica sites, which coordinates its
execution.
• Again, all of these are done within the context of the update transaction, and
when the transaction commits, the user is notified and the updates are made
permanent.
Eager Distributed Protocols
(1) Two Write operations are applied on two local replicas of the same data item;
(2) The Write operations are independently propagated to the other replicas;
(3) Updates become permanent at commit time
Lazy Centralized Protocols
Lazy Centralized Protocols
• Lazy centralized techniques, like eager centralized ones, apply updates first at a master copy; the difference is that the updates are propagated to the slaves only after the update transaction commits.
• Consequently, if a slave site performs a Read(x) operation on its local copy, it may read stale (non-fresh) data, since x may have been updated at the master, but the update may not have yet been propagated to the slaves.
Single Master with Limited Transparency
• In Single Master with Limited Transparency the update transactions are
submitted and executed directly at the master site (as in the eager single
master); once the update transaction commits, the refresh transaction is sent
to the slaves
Single Master with Limited Transparency
(1) Update is applied on the local replica;
(2) Transaction commit makes the updates permanent at the master;
(3) Update is propagated to the other replicas in refresh transactions;
(4) Transaction 2 reads from local copy.
Single Master or Primary Copy with Full Replication Transparency
• Alternatives that provide full transparency allow (both read and update) transactions to be submitted at any site, forwarding their operations to either the single master or to the appropriate primary master site.
• This involves two problems: the first is that, unless one is careful, a one-copy serializable (1SR) global history may not be guaranteed; the second is that a transaction may not see its own updates.
Lazy Distributed Protocols
• Lazy distributed replication protocols are the most complex ones owing to
the fact that updates can occur on any replica and they are propagated to
the other replicas
• The operation of the protocol at the site where the transaction is submitted is
straightforward: both Read and Write operations are executed on the local
copy, and the transaction commits locally.
• Sometime after the commit, the updates are propagated to the other sites by
means of refresh transactions.
Lazy Distributed Protocols
(1) Two updates are applied on two local replicas;
(2) Transaction commit makes the updates permanent;
(3) The updates are independently propagated to the other replicas.
Replication and Failures
• So far, we have focused on replication protocols in the absence of failures.
• What happens to mutual consistency concerns if there are system failures?
• The handling of failures differs between eager replication and lazy replication
approaches.
Failures and Lazy Replication
• Dealing with failures in lazy replication techniques is relatively easy, since these protocols allow divergence between the master copies and the replicas.
• Consequently, when communication failures make one or more sites unreachable (the latter due to network partitioning), the sites that are available can simply continue processing.
• Even in the case of network partitioning, one can allow operations to proceed in multiple partitions independently and then worry about the convergence of the database states upon repair.
Failures and Lazy Replication
• Before the merge, databases at multiple partitions diverge, but they are
reconciled at merge time.
Failures and Eager Replication
• All eager techniques implement some sort of ROWA protocol, ensuring that,
when the update transaction commits, all of the replicas have the same
value.
• ROWA family of protocols is attractive and elegant.
• It has one significant drawback: if even one of the replicas is unavailable, the update transaction cannot be terminated.
• So, ROWA fails in meeting one of the fundamental goals of replication,
namely providing higher availability.
Failures and Eager Replication
• An alternative to ROWA which attempts to address the low availability
problem is the Read-One/Write-All Available (ROWA-A) protocol.
• The general idea is that the write commands are executed on all the
available copies and the transaction terminates.
• The copies that were unavailable at the time will have to “catch up” when
they become available.
Failures and Eager Replication
• There have been various versions of this protocol two of which are popular.
– Available copies protocol
– Distributed ROWA-A protocol
Available copies protocol
• The coordinator of an update transaction Ti (i.e., the master where the transaction is executing) sends each Wi(x) to all the slave sites where replicas of x reside, and waits for confirmation of execution or rejection.
• If it times out before it gets acknowledgements from all the sites, it considers those that have not replied to be unavailable and continues with the update on the available sites.
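The coordinator's behaviour can be sketched as follows; the site records and the stand-in for a timed-out network call are illustrative assumptions:

```python
# Sketch of the available copies idea: the coordinator sends the write to
# all slave replica sites, and any site that fails to acknowledge before
# the timeout is treated as unavailable.

def send_write(site, item, value):
    # Stand-in for a network call; returns True if the site acknowledged
    # within the timeout.
    if site["up"]:
        site["data"][item] = value
        return True
    return False     # timed out: no acknowledgement received

sites = [
    {"name": "S1", "up": True,  "data": {"x": 10}},
    {"name": "S2", "up": False, "data": {"x": 10}},  # currently unreachable
    {"name": "S3", "up": True,  "data": {"x": 10}},
]

available = [s["name"] for s in sites if send_write(s, "x", 20)]
unavailable = [s["name"] for s in sites if s["name"] not in available]
print(available)     # ['S1', 'S3']: the update continues on these sites
print(unavailable)   # ['S2']: must catch up when it recovers
```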
Available copies protocol
• The unavailable slave sites update their databases to the latest state when
they recover.
• These sites may not even be aware of the existence of Ti and the update to
x that Ti has made if they had become unavailable before Ti started.
Available copies protocol
• There are two complications that need to be addressed.
• The first one is the possibility that the sites that the coordinator thought were
unavailable were in fact up and running and may have already updated x
but their acknowledgement may not have reached the coordinator before its
timer ran out.
Available copies protocol
• Second, some of these sites may have been unavailable when Ti started
and may have recovered since then and have started executing transactions.
• Therefore, the coordinator undertakes a validation procedure before
committing Ti
Distributed ROWA-A protocol
• In the distributed ROWA-A protocol, each site S maintains a set, Vs, of sites that it believes to be available; this is the “view” that S has of the system configuration.
• In particular, when a transaction Ti is submitted, its coordinator’s view reflects all the sites that the coordinator knows to be available; let this view be denoted VC(Ti).
Distributed ROWA-A protocol
• A Ri(x) is performed on any replica in VC(Ti) and a Wi(x) updates all copies in VC(Ti).
• The coordinator checks its view at the end of Ti , and if the view has changed
since Ti ’s start, then Ti is aborted.
• To modify V , a special atomic transaction is run at all sites, ensuring that no
concurrent views are generated.
• This can be achieved by assigning timestamps to each V when it is
generated and ensuring that a site only accepts a new view if its version
number is greater than the version number of that site’s current view.
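A minimal sketch of this view check follows; the dictionary-based view representation and function names are illustrative assumptions:

```python
# Sketch of the distributed ROWA-A view check: a transaction records the
# coordinator's view when it starts and aborts at commit time if the view
# has changed. Views carry version numbers so a site only installs a
# newer view.

current_view = {"version": 3, "sites": {"A", "B", "C"}}

def install_view(site_view, new_view):
    # Accept a new view only if its version number is greater.
    return new_view if new_view["version"] > site_view["version"] else site_view

def begin_transaction():
    return dict(current_view)        # VC(Ti): view at transaction start

def try_commit(start_view):
    # Commit succeeds only if membership is unchanged since Ti started.
    return start_view["sites"] == current_view["sites"]

t_view = begin_transaction()
current_view = install_view(current_view,
                            {"version": 4, "sites": {"A", "B"}})  # C failed
print(try_commit(t_view))   # False: Ti must abort
```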
Replication Mediator Service
• The replication protocols we have covered so far are suitable for tightly
integrated distributed database systems where we can insert the protocols
into each component DBMS.
• In multidatabase systems, replication has to be supported outside the DBMSs by mediators.
Replication Mediator Service
Replication Mediator Service
Replication Mediator Service
• This fact is exploited by NODO to overlap the total ordering of the transaction
request with the transaction execution at the master node, thus masking the
latency of total ordering.
• The protocol also executes transactions optimistically, and may abort them if
necessary.
Replication Mediator Service
Replication Mediator Service
• Similar eager replication protocols have been proposed to support partial replication, where copies can be stored at subsets of nodes.
• Unlike full replication, partial replication increases access locality and
reduces the number of messages for propagating updates to replicas.
THANK YOU
BITS Pilani Presentation
BITS Pilani
Pilani | Dubai | Goa | Hyderabad
Anil Kumar Ghadiyaram
Source Courtesy: Some of the contents of this PPT are sourced from materials provided by Publishers of T1 & T2
Text Books
T1 : M. Tamer Özsu • Patrick Valduriez Principles
of Distributed Database Systems Third Edition
T2 : Distributed Operating Systems: Concepts And
Design By Pradeep K. Sinha, PHI Learning Pvt.
Ltd
R1 : Storage Networks Explained– by Ulf Troppens, Wolfgang Muller-
Freidt, Rainer Wolafka, IBM Storage Software Development, Germany.
Publishers: Wiley
• Parallel Database Systems
– Objectives
– Functional Architecture
• Parallel DBMS Architectures
• Shared-Memory
• Shared-Disk
• Shared-Nothing
• Hybrid Architectures
– NUMA
– Cluster
• Parallel Data Placement
• Load Balancing
– Parallel Execution Problems
– Intra-Operator Load Balancing
– Inter-Operator Load Balancing
– Intra-Query Load Balancing
• Database Clusters
SSZG554 - Distributed Data Systems Date : 24th Oct 2020 321 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Parallel Database Systems
Parallel Database Systems
• The first kind of access is representative of On-Line Transaction Processing
(OLTP) applications while the second is representative of On-Line Analytical
Processing (OLAP) applications.
• Supporting very large databases efficiently for either OLTP or OLAP can be
addressed by combining parallel computing and distributed database
management
Parallel Database Systems
Objectives
Objectives
• An important result, however, is that the general solution to the I/O bottleneck is to increase the I/O bandwidth through parallelism.
• For instance, if we store a database of size D on a single disk with throughput T, the system throughput is bounded by T.
• On the contrary, if we partition the database across n disks, each with capacity D/n and throughput T' (hopefully equivalent to T), we get an ideal throughput of n * T' that can be better consumed by multiple processors (ideally n).
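Plugging in hypothetical numbers makes the speed-up concrete; all values below are assumptions for illustration only:

```python
# Ideal I/O throughput from partitioning a database across n disks.
T = 100          # single-disk throughput, MB/s (assumed)
n = 8            # number of disks (assumed)
T_prime = T      # per-disk throughput T', ideally equivalent to T

single_disk_bound = T          # system throughput bounded by T on one disk
ideal_parallel = n * T_prime   # ideal aggregate throughput with n disks

print(single_disk_bound, ideal_parallel)  # 100 800
```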
Objectives
• Note that the main memory database system solution, which tries to maintain the database in main memory, is complementary rather than an alternative.
• In particular, the “memory access bottleneck” in main memory systems can
also be tackled using parallelism in a similar way.
• Therefore, parallel database system designers have strived to develop
software-oriented solutions in order to exploit parallel computers.
Objectives
• The objectives of parallel database systems are subsumed by those of distributed DBMSs. Ideally, a parallel database system should provide the following advantages:
– High-performance
– High-availability
– Extensibility
Functional Architecture
Functional Architecture
Session Manager
• It plays the role of a transaction monitor, providing support for client
interactions with the server.
• In particular, it performs the connections and disconnections between the
client processes and the two other subsystems.
• Therefore, it initiates and closes user sessions which may contain multiple
transactions.
• In case of OLTP sessions, the session manager is able to trigger the
execution of pre-loaded transaction code within data manager modules.
Transaction Manager
• It receives client transactions related to query compilation and execution.
• It can access the database directory that holds all meta-information about
data and programs.
• The directory itself should be managed as a database in the server.
• Depending on the transaction, it activates the various compilation phases,
triggers query execution, and returns the results as well as error codes to the
client application.
Data Manager
• It provides all the low-level functions needed to run compiled queries in
parallel, i.e., database operator execution, parallel transaction support,
cache management, etc.
• If the transaction manager is able to compile dataflow control, then
synchronization and communication among data manager modules is
possible.
• Otherwise, transaction control and synchronization must be done by a transaction manager module.
Parallel DBMS Architectures
• There are three basic parallel computer architectures depending on how
main memory or disk is shared:
– shared-memory
– shared-disk
– shared-nothing
• Hybrid architectures such as NUMA or cluster try to combine the benefits of
the basic architectures.
Shared-Memory
• In the shared-memory approach, any processor has access to any memory
module or disk unit through a fast interconnect
• All the processors are under the control of a single operating system.
• All shared-memory parallel database products today can exploit inter-query
parallelism to provide high transaction throughput and intra-query parallelism
to reduce response time of decision-support queries
• Shared-memory has two strong advantages: simplicity and load balancing.
• Since meta-information (directory) and control information (e.g., lock tables) can be shared by all processors, writing database software is not very different than for single-processor computers.
• In particular, inter-query parallelism comes for free.
• Intra-query parallelism requires some parallelization but remains rather
simple.
• Load balancing is easy to achieve at run-time using the shared-memory, by allocating each new task to the least busy processor.
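A minimal sketch of this run-time policy, assuming task costs are known in advance; the heap stands in for the shared control information, and the processor count and costs are illustrative:

```python
import heapq

# Hedged sketch of shared-memory load balancing: every new task is
# allocated to the least busy processor. Task costs are illustrative.

def assign_tasks(n_procs, task_costs):
    heap = [(0.0, p) for p in range(n_procs)]   # (current load, processor id)
    heapq.heapify(heap)
    assignment = {}
    for task, cost in enumerate(task_costs):
        load, proc = heapq.heappop(heap)        # least busy processor
        assignment[task] = proc
        heapq.heappush(heap, (load + cost, proc))
    return assignment

placement = assign_tasks(2, [5.0, 3.0, 2.0])
```

After the first two tasks go to the two idle processors, the third goes to processor 1, whose accumulated load (3.0) is lower than processor 0's (5.0).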
• Shared-memory has three problems: high cost, limited extensibility and low
availability.
• High cost is incurred by the interconnect that requires fairly complex
hardware because of the need to link each processor to each memory
module or disk.
• With faster processors (even with larger caches), conflicting accesses to the
shared-memory increase rapidly and degrade performance
• Therefore, extensibility is limited to a few tens of processors
• Finally, since the memory space is shared by all processors, a memory fault
may affect most processors thereby hurting availability. The solution is to use
duplex memory with a redundant interconnect.
Shared-Disk
• In the shared-disk approach, any processor has access to any disk unit
through the interconnect but exclusive (non-shared) access to its main
memory.
• Each processor-memory node is under the control of its own copy of the
operating system.
• Then, each processor can access database pages on the shared disk and
cache them into its own memory.
• Since different processors can access the same page in conflicting update
modes, global cache consistency is needed.
• This is typically achieved using a distributed lock manager
• The first parallel DBMS that used shared-disk is Oracle with an efficient
implementation of a distributed lock manager for cache consistency.
• Other major DBMS vendors such as IBM, Microsoft and Sybase provide
shared-disk implementations.
• Shared-disk has a number of advantages: lower cost, high extensibility, load
balancing, availability, and easy migration from centralized systems.
• The cost of the interconnect is significantly less than with shared-memory
since standard bus technology may be used.
• Given that each processor has enough main memory, interference on the
shared disk can be minimized.
• Thus, extensibility can be better, typically up to a hundred processors. Since
memory faults can be isolated from other nodes, availability can be higher.
• Finally, migrating from a centralized system to shared-disk is relatively
straightforward since the data on disk need not be reorganized.
Shared-Nothing
• In the shared-nothing approach, each processor has exclusive access to its
main memory and disk unit(s).
• Similar to shared-disk, each processor memory-disk node is under the
control of its own copy of the operating system.
• Then, each node can be viewed as a local site (with its own database and
software) in a distributed database system.
• Shared-nothing has three main virtues: lower cost, high extensibility, and
high availability.
• The cost advantage is better than that of shared-disk, which requires a special interconnect for the disks.
• By implementing a distributed database design that favors the smooth
incremental growth of the system by the addition of new nodes, extensibility
can be better (in the thousands of nodes).
• With careful partitioning of the data on multiple disks, almost linear speedup
and linear scaleup could be achieved for simple workloads. Finally, by
replicating data on multiple nodes, high availability can also be achieved.
• Shared-nothing is much more complex to manage than either shared-memory or shared-disk.
• Higher complexity is due to the necessary implementation of distributed
database functions assuming large numbers of nodes.
• In addition, load balancing is more difficult to achieve because it relies on the
effectiveness of database partitioning for the query workloads.
• Unlike shared-memory and shared-disk, load balancing is decided based on
data location and not the actual load of the system.
• Furthermore, the addition of new nodes in the system presumably requires
reorganizing the database to deal with the load balancing issues.
Hybrid Architectures
• There are two popular hybrid architectures:
– NUMA
– Cluster
NUMA
• The objective of NUMA is to provide a shared-memory programming model
and all its benefits, in a scalable architecture with distributed memory.
• The term NUMA reflects the fact that an access to the (virtually) shared
memory may have a different cost depending on whether the physical
memory is local or remote to the processor
• The most successful class of NUMA multiprocessors is Cache Coherent
NUMA (CC-NUMA)
• With CC-NUMA, the main memory is physically distributed among the nodes
as with shared-nothing or shared-disk.
• However, any processor has access to all other processors’ memories
• Most SMP manufacturers are now offering NUMA systems that can scale up
to a hundred processors.
• The strong argument for NUMA is that it does not require any rewriting of the
application software
Cluster
• A cluster is a set of independent server nodes interconnected to share
resources and form a single system.
• The shared resources, called clustered resources, can be hardware such as
disk or software such as data management services.
• The server nodes are made of off-the-shelf components ranging from simple
PC components to more powerful SMP.
• Using many off-the-shelf components is essential to obtain the best
cost/performance ratio while exploiting continuing progress in hardware
components.
• However, because each disk is directly connected to a computer via a bus,
adding or replacing cluster nodes requires disk and data reorganization.
• Shared-disk avoids such reorganization but requires disks to be globally
accessible by the cluster nodes.
There are two main technologies to share disks in a cluster:
– Network-attached storage (NAS) and
– Storage-area network (SAN)
• A NAS is a dedicated device that shares disks over a network (usually TCP/IP) using a distributed file system protocol such as Network File System (NFS).
• NAS is well suited for low throughput applications such as data backup and
archiving from PC’s hard disks.
• However, it is relatively slow and not appropriate for database management
as it quickly becomes a bottleneck with many nodes.
• A storage area network (SAN) provides similar functionality but with a lower
level interface.
• For efficiency, it uses a block-based protocol thus making it easier to manage
cache consistency (at the block level).
• In fact, disks in a SAN are attached to the network instead of to the bus, as happens in Directly Attached Storage (DAS).
• A cluster architecture has important advantages.
• It combines the flexibility and performance of shared-memory at each node
with the extensibility and availability of shared-nothing or shared-disk.
Parallel Data Placement
• There are two important differences with the distributed database approach.
• First, there is no need to maximize local processing (at each node) since
users are not associated with particular nodes.
• Second, load balancing is much more difficult to achieve in the presence of a
large number of nodes.
• The main problem is to avoid resource contention, which may result in the entire system thrashing, e.g., one node ends up doing all the work while the others remain idle.
• Since programs are executed where the data reside, data placement is a
critical performance issue
• Data placement must be done so as to maximize system performance, which
can be measured by combining the total amount of work done by the system
and the response time of individual queries.
• In terms of data placement, we have the following trade-off: minimizing response time or maximizing inter-query parallelism leads to partitioning,
• whereas minimizing the total amount of work leads to clustering.
• An alternative solution to data placement is full partitioning, whereby each relation is horizontally fragmented across all the nodes in the system.
• There are three basic strategies for data partitioning:
– Round-Robin
– Hash
– Range partitioning
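The first of these strategies can be sketched directly: round-robin places the i-th tuple on node i mod n, giving uniform placement regardless of tuple content. Tuple values and node count here are illustrative:

```python
# Hedged sketch of round-robin partitioning: the i-th tuple is placed
# on node i mod n, spreading tuples evenly regardless of their content.

def round_robin_partition(tuples, n_nodes):
    partitions = [[] for _ in range(n_nodes)]
    for i, t in enumerate(tuples):
        partitions[i % n_nodes].append(t)
    return partitions

parts = round_robin_partition(["a", "b", "c", "d", "e"], 3)
```

The drawback, relative to hash and range partitioning, is that a query on a specific attribute value still has to visit every node.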
• Hash partitioning applies a hash function to some attribute that yields the
partition number.
• This strategy allows exact-match queries on the selection attribute to be
processed by exactly one node and all other queries to be processed by all
the nodes in parallel
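As a hedged sketch of this routing behavior (the node count and keys are illustrative, and Python's built-in hash stands in for the DBMS's hash function):

```python
# Hash partitioning sketch: a hash of the partitioning attribute yields
# the partition number. An exact-match query on that attribute maps to
# exactly one node; any other query is sent to all nodes in parallel.

N_NODES = 4

def node_for(key):
    return hash(key) % N_NODES          # partition number for this key

def route_exact_match(key):
    return [node_for(key)]              # exactly one node is involved

def route_scan():
    return list(range(N_NODES))         # all nodes process in parallel
```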
• Range partitioning distributes tuples based on the value intervals (ranges) of
some attribute.
• In addition to supporting exact-match queries (as in hashing), it is well-suited
for range queries
• However, range partitioning can result in high variation in partition size.
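A minimal sketch of range partitioning, assuming illustrative splitter values dividing the attribute domain into one interval per node:

```python
import bisect

# Range partitioning sketch: splitter values divide the attribute
# domain into intervals, one per node. Splitters are illustrative.

SPLITTERS = [100, 200, 300]    # node 0: v < 100, node 1: 100 <= v < 200, ...

def node_for_value(v):
    return bisect.bisect_right(SPLITTERS, v)

def nodes_for_range(lo, hi):
    """A range query touches only the nodes whose intervals overlap [lo, hi]."""
    return list(range(node_for_value(lo), node_for_value(hi) + 1))
```

Unlike hashing, a range query here visits only the overlapping nodes; but if most values fall in one interval, that node's partition grows much larger than the others, which is exactly the size-variation problem noted above.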
Load Balancing
• Good load balancing is crucial for the performance of a parallel system.
• The response time of a set of parallel operators is that of the longest one.
• Thus, minimizing the time of the longest one is important for minimizing
response time.
• Balancing the load of different transactions and queries among different
nodes is also essential to maximize throughput.
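The first two bullets amount to taking a max over operator times; a toy calculation with illustrative times shows why skew dominates response time even when total work is unchanged:

```python
# The response time of a set of parallel operators is that of the
# longest one, so a skewed partition dominates even when the total
# amount of work is the same. Operator times are illustrative.

def parallel_response_time(operator_times):
    return max(operator_times)

balanced = parallel_response_time([10, 10, 10, 10])  # total work 40
skewed = parallel_response_time([4, 4, 4, 28])       # same total work 40
```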
• Although the parallel query optimizer incorporates decisions on how to execute a parallel execution plan, load balancing can be hurt by several problems occurring at execution time.
• Solutions to these problems can be obtained at the intra- and inter-operator
levels.
Parallel Execution Problems
• The principal problems introduced by parallel query execution are
– Initialization,
– Interference and
– Skew.
Initialization
• Before the execution takes place, an initialization step is necessary.
• This first step is generally sequential. It includes process (or thread) creation and initialization, communication initialization, etc.
• The duration of this step is proportional to the degree of parallelism and can
actually dominate the execution time of simple queries
Interference
• A highly parallel execution can be slowed down by interference.
• Interference occurs when several processors simultaneously access the
same resource, hardware or software.
Skew
• Load balancing problems can appear with intra-operator parallelism
(variation in partition size), namely data skew, and inter-operator parallelism
(variation in the complexity of operators).
Intra-Operator Load Balancing
• Good intra-operator load balancing depends on the degree of parallelism
and the allocation of processors for the operator.
• For some algorithms, e.g., the parallel hash join algorithm, these parameters
are not constrained by the placement of the data.
• Thus, the home of the operator (the set of processors where it is executed)
must be carefully decided.
• The skew problem makes it hard for a parallel query optimizer to make this
decision statically (at compile-time) as it would require a very accurate and
detailed cost model.
• Therefore, the main solutions rely on adaptive or specialized techniques that
can be incorporated in a hybrid query optimizer.
Inter-Operator Load Balancing
• In order to obtain good load balancing at the inter-operator level, it is
necessary to choose, for each operator, how many and which processors to
assign for its execution.
• This should be done taking into account pipeline parallelism, which requires inter-operator communication.
Intra-Query Load Balancing
• Intra-query load balancing must combine intra- and inter-operator parallelism.
• To some extent, given a parallel architecture, the techniques for either intra- or inter-operator load balancing can be combined.
• In the context of hybrid systems such as NUMA or cluster, the problems of
load balancing are exacerbated because they must be addressed at two
levels, locally among the processors of each shared-memory node (SM-
node) and globally among all nodes.
• A general solution to load balancing in hybrid systems is the execution model
called Dynamic Processing (DP)
• The fundamental idea is that the query is decomposed into self-contained
units of sequential processing, each of which can be carried out by any
processor.
• A processor can migrate horizontally (intra-operator parallelism) and
vertically (inter-operator parallelism) along the query operators.
• This minimizes the communication overhead of inter-node load balancing by maximizing intra- and inter-operator load balancing within shared-memory nodes.
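A hedged toy sketch of this idea: the query is broken into self-contained units, and whichever processor is free next simply takes the next unit, whatever operator it belongs to. Unit names and costs are illustrative, and the sequential loop stands in for concurrent threads:

```python
from collections import deque

# Dynamic Processing sketch: self-contained units of sequential
# processing are pulled by whichever processor becomes free first,
# regardless of which operator the unit belongs to.

def dynamic_processing(units, n_procs):
    queue = deque(units)                       # (unit name, cost) pairs
    done_by = {p: [] for p in range(n_procs)}
    finish = [0.0] * n_procs                   # per-processor busy time
    while queue:
        p = min(range(n_procs), key=lambda i: finish[i])  # next free proc
        name, cost = queue.popleft()
        done_by[p].append(name)
        finish[p] += cost
    return done_by, max(finish)

done_by, makespan = dynamic_processing(
    [("scan", 3.0), ("join", 5.0), ("agg", 2.0)], n_procs=2)
```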
Database Clusters
• Clusters of PC servers are another form of parallel computer that provides a cost-effective alternative to supercomputers or tightly-coupled multiprocessors.
• They have been used successfully in scientific computing, web information
retrieval (e.g., Google search engine) and data warehousing.
• However, these applications are typically read-intensive, which makes it
easier to exploit parallelism
• In order to support update-intensive applications that are typical of business
data processing, full parallel database capabilities, including transaction
support, must be provided.
• This can be achieved using a parallel DBMS implemented over a cluster.
• In this case, all cluster nodes are homogeneous, under the full control of the
parallel DBMS.
• The parallel DBMS solution may not be viable for some businesses such as Application Service Providers (ASPs).
• In the ASP model, customers’ applications and databases (including data
and DBMS) are hosted at the provider site and need to be available, typically
through the Internet, as efficiently as if they were local to the customer site.
• A major requirement is that applications and databases remain autonomous,
i.e., remain unchanged when moved to the provider site’s cluster and under
the control of the customers.
• Thus, preserving autonomy is critical to avoid the high costs and problems
associated with application code modification.
• Using a parallel DBMS in this case is not appropriate as it is expensive,
requires heavy migration to the parallel DBMS and hurts database autonomy.
• A solution is to use a database cluster, which is a cluster of autonomous databases, each managed by an off-the-shelf DBMS.
• A major difference with a parallel DBMS implemented on a cluster is the use
of a “black-box” DBMS at each node.
• Since the DBMS source code is not necessarily available and cannot be
changed to be “cluster-aware”, parallel data management capabilities must
be implemented via middleware.
• In its simplest form, a database cluster can be viewed as a multidatabase
system on a cluster.
Source Courtesy: Some of the contents of this PPT are sourced from materials provided by Publishers of T1 & T2
• Web Graph Management
• Compressing Web Graphs
• Storing Web Graphs as S-Nodes
– Supernode graph
– Intranode graph
– Positive superedge graph
– Negative superedge graph
• Web Search
• Web Crawling
• Indexing
– Structure Index
– Text Index
• Ranking and Link Analysis
• Evaluation of Keyword Search
• Web Querying
– Semistructured Data Approach
– Web Query Language Approach
• Question Answering
• Searching and Querying the Hidden Web
• Crawling the Hidden Web
SSZG554 - Distributed Data Systems Date : 31st Aug 2020 404 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Web Graph Management
• The web consists of “pages” that are connected by hyperlinks, and this
structure can be modelled as a directed graph that reflects the hyperlink
structure.
• In this graph, commonly referred to as the web graph, static HTML web
pages are the nodes and the links between them are represented as directed
edges
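A minimal sketch of this model, with pages as nodes and hyperlinks as directed edges in an adjacency list; the URLs are illustrative:

```python
# The web graph as an adjacency list: static pages are the nodes and
# hyperlinks between them are directed edges. URLs are illustrative.

web_graph = {
    "a.html": ["b.html", "c.html"],   # a links to b and c
    "b.html": ["c.html"],
    "c.html": [],                     # a page with no out-links
}

def out_degree(page):
    return len(web_graph.get(page, []))

def in_degree(page):
    return sum(page in links for links in web_graph.values())
```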
• This brings us to a discussion of the structure of the web graph, which has a
“bowtie” shape
• It has a strongly connected component (the knot in the middle) in which there
is a path between each pair of pages.
• The strongly connected component (SCC) accounts for about 28% of the
web pages.
• A further 21% of the pages constitute the “IN” component from which there
are paths to pages in SCC, but to which no paths exist from pages in SCC.
• Symmetrically, the “OUT” component has pages to which paths exist from pages in SCC but not vice versa; these also constitute 21% of the pages.
• What is referred to as “tendrils” consist of pages that cannot be reached from
SCC and from which SCC pages cannot be reached either.
• These constitute about 22% of the web pages.
• These are pages that have not yet been “discovered” and have not yet been
connected to the better known parts of the web.
• Finally, there are disconnected components that have no links to/from
anything except their own small communities.
• This makes up about 8% of the web.
• This structure is interesting in that it determines the results that one gets
from web searches and from querying the web.
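As a quick sanity check on the figures quoted above, the five components do account for (roughly) the whole web:

```python
# Bowtie decomposition percentages quoted above: SCC, IN, OUT,
# tendrils, and disconnected components.

components = {"SCC": 28, "IN": 21, "OUT": 21,
              "tendrils": 22, "disconnected": 8}
total_percent = sum(components.values())
```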
Compressing Web Graphs
• A specific proposal for compressing the web graph takes advantage of the
fact that we can attempt to find nodes that share several common out-edges,
corresponding to the case where one node might have copied links from
another node
• The main idea behind this technique is that when a new node is added to the
graph, it takes an existing page and copies some of the links from that page
to itself.
Storing Web Graphs as S-Nodes
• An alternative to compressing the web graph is to develop special storage
structures that allow efficient storage and querying.
• S-Nodes is one such structure that provides a two-level representation of the
web graph.
• In this scheme, the web graph is represented by a set of smaller directed
subgraphs.
• Given a web graph WG, the S-Node representation can be constructed as
follows.
• Let P = {N1, N2, ..., Nn} be a partition of the vertex set of WG.
• The following types of directed graphs can be defined
– Supernode graph
– Intranode graph
– Positive superedge graph
– Negative superedge graph
Supernode graph
• A supernode graph contains n vertices, one for each partition in P.
• There is a supernode for each of the partitions N1, N2 and N3.
• Supernodes are linked using superedges.
• A superedge Eij is created from Ni to Nj if there is at least one page in Ni that
points to some page in Nj .
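The superedge rule above can be sketched as follows (a minimal sketch; the function name, page names, and partition labels are illustrative, not from the slides). Intra-partition links are skipped because they belong to the intranode graphs:

```python
def build_supernode_graph(edges, partition):
    """edges: iterable of (src, dst) page links.
    partition: dict mapping each page to its partition label, e.g. "N1".
    Returns the set of superedges (Ni, Nj): a superedge exists when at
    least one page in Ni points to some page in Nj (i != j)."""
    superedges = set()
    for src, dst in edges:
        ni, nj = partition[src], partition[dst]
        if ni != nj:                     # intra-partition links go to the intranode graph
            superedges.add((ni, nj))
    return superedges
```

For example, with pages P1, P2 in N1 and P3, P4 in N2, a single link P2 to P3 is enough to create the superedge (N1, N2).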
Intranode graph
• Each partition Ni is associated with an intranode graph IntraNodei that
represents all the interconnections between the pages that belong to Ni.
• For example, IntraNode1 represents the hyperlinks between pages P1 and
P2.
Positive superedge graph
• For a superedge Eij, the positive superedge graph records the page-level
links that actually exist from pages in Ni to pages in Nj.
Negative superedge graph
• The negative superedge graph instead records the page pairs from Ni to Nj
that are not linked; it is the more compact choice when most pages in Ni
point to pages in Nj.
Storing Web Graphs as S-Nodes
• Given a partition P on the vertex set of WG, an S-Node representation
SNode(WG, P) can be constructed by using the supernode graph, which points
to the intranode graphs and to a set of positive and negative superedge
graphs.
• The decision as to whether to use the positive or the negative superedge
graph depends on which representation has the lower number of edges.
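This decision rule can be sketched as follows (a minimal sketch; the function and variable names are illustrative): for one pair of partitions, enumerate all possible page-level links and keep whichever of the existing-link set (positive) or missing-link set (negative) is smaller.

```python
from itertools import product

def choose_superedge_repr(edges, ni_pages, nj_pages):
    """edges: set of (src, dst) links that actually exist.
    Returns ("positive", links) or ("negative", missing_links),
    whichever representation has fewer edges."""
    all_pairs = set(product(ni_pages, nj_pages))  # every possible Ni -> Nj link
    positive = all_pairs & set(edges)             # links that exist
    negative = all_pairs - positive               # links that are absent
    if len(positive) <= len(negative):
        return "positive", positive
    return "negative", negative
```

When pages in Ni link to nearly every page in Nj, the negative set is much smaller, which is exactly the case the negative superedge graph is meant for.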
Web Search
• Web search involves finding “all” the web pages that are relevant, i.e.,
that have content related to the keyword(s) a user specifies.
• Naturally, it is not possible to find all the pages, or even to know if one has
retrieved all the pages; thus the search is performed on a database of web
pages that have been collected and indexed.
• There are usually multiple pages that are relevant to a query; these pages
are presented to the user in ranked order of relevance, as determined by the
search engine.
• In every search engine the crawler plays one of the most crucial roles.
• A crawler is a program used by a search engine to scan the web on its
behalf and collect data about web pages.
• A crawler is given a starting set of pages – more accurately, it is given a set
of Uniform Resource Locators (URLs) that identify these pages.
• The crawler retrieves and parses the page corresponding to each URL,
extracts any URLs in it, and adds these URLs to a queue.
• In the next cycle, the crawler extracts a URL from the queue (based on some
order) and retrieves the corresponding page.
• This process is repeated until the crawler stops.
• A control module is responsible for deciding which URLs should be visited
next.
• The retrieved pages are stored in a page repository
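The crawl cycle above can be sketched as follows. A real crawler fetches pages over HTTP and parses their HTML; this minimal sketch simulates the web as a dictionary mapping each URL to the links it contains (all names are illustrative):

```python
from collections import deque

def crawl(web, seeds):
    """web: dict mapping a URL to the list of URLs it links to.
    seeds: the starting set of URLs given to the crawler."""
    queue = deque(seeds)          # URLs waiting to be visited
    repository = []               # stands in for the page repository
    seen = set(seeds)
    while queue:                  # repeated "until the crawler stops"
        url = queue.popleft()     # control module: pick the next URL (FIFO here)
        repository.append(url)    # "retrieve" the page and store it
        for link in web.get(url, []):   # extract URLs from the page
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return repository
```

Using a FIFO queue visits URLs in the order they were discovered, i.e., the breadth-first ordering discussed later; swapping the queue discipline changes the crawl order.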
• The indexer module is responsible for constructing indexes on the pages
that have been downloaded by the crawler
• The ranking module is responsible for sorting the large number of results so
that those that are considered to be most relevant to the user’s search are
presented first
Web Crawling
• A crawler scans the web on behalf of a search engine to extract information
about the visited web pages.
• Given the size of the web, the changing nature of web pages, and the limited
computing and storage capabilities of crawlers, it is impossible to crawl the
entire web.
• Thus, a crawler must be designed to visit “most important” pages before
others.
• The issue, then, is to visit the pages in some ranked order of importance.
• Measures of the importance of a given page can be static, such that the
importance of a page is determined independently of the retrieval queries
that will run against it, or dynamic, in that they take the queries into
consideration.
• The PageRank of a page Pi (denoted r(Pi)) is simply the normalized sum of
the PageRanks of all of Pi's backlink pages (the set denoted BPi):
r(Pi) = Σ_{Pj ∈ BPi} r(Pj) / |Pj|
where |Pj| is the number of links going out of page Pj.
• This formula calculates the rank of a page based on the backlinks, but
normalizes the contribution of each backlinking page Pj using the number of
links that Pj has to other pages.
• The idea here is that it is more important to be pointed at by pages that
link conservatively to other pages than by those that link to others
indiscriminately.
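A minimal sketch of iterating this formula to a fixed point, using the simplified form given above (no damping factor; every page is assumed to have at least one out-link; function and page names are illustrative):

```python
def pagerank(out_links, iterations=50):
    """out_links: dict mapping each page to the list of pages it links to."""
    pages = list(out_links)
    rank = {p: 1.0 / len(pages) for p in pages}   # start from a uniform rank
    for _ in range(iterations):
        new = {p: 0.0 for p in pages}
        for pj, targets in out_links.items():
            share = rank[pj] / len(targets)       # normalize by |Pj| out-links
            for pi in targets:
                new[pi] += share                  # Pj is a backlink page of Pi
        rank = new
    return rank
```

Each iteration redistributes every page's rank equally over its out-links, so a page linked to by conservative linkers receives larger shares than one linked to indiscriminately.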
• A second issue is how the crawler chooses the next page to visit once it has
crawled a particular page.
• The crawler maintains a queue in which it stores the URLs for the pages that
it discovers as it analyzes each page.
• Thus, the issue is one of ordering the URLs in this queue.
• One possibility is to visit the URLs in the order in which they were
discovered; this is referred to as the breadth-first approach
• Another alternative is to use random ordering whereby the crawler chooses a
URL randomly from among those that are in its queue of unvisited pages.
• Other alternatives are to use metrics that combine ordering with importance
ranking such as backlink counts or PageRank
• Since many web pages change over time, crawling is a continuous activity
and pages need to be re-visited.
• Instead of restarting from scratch each time, it is preferable to selectively re-
visit web pages and update the gathered information.
• Crawlers that follow this approach are called incremental crawlers
• Some search engines specialize in searching pages belonging to a particular
topic.
• These engines use crawlers optimized for the target topic, and are referred
to as focused crawlers
Indexing
• In order to efficiently search the crawled pages and the gathered information,
a number of indexes are built.
• The two more important indexes are
– Structure (or link ) index
– Text (or content ) index
Structure Index
• The structure index is based on the graph model with the graph representing
the structure of the crawled portion of the web.
• The efficient storage and retrieval of these pages is important
• The structure index can be used to obtain important information about the
linkage of web pages such as information regarding the neighborhood of a
page and the siblings of a page.
Text Index
• The most important and most widely used index is the text index.
• Indexes to support text-based retrieval can be implemented using any of the
access methods traditionally used to search over text document collections.
• Examples include suffix arrays, inverted files or inverted indexes and
signature files
• An inverted index is a collection of inverted lists, where each list is
associated with a particular word.
• In general, an inverted list for a given word is a list of document identifiers in
which the particular word occurs
• If needed, the location of the word in a particular page can also be saved as
part of the inverted list.
• This information is usually needed in proximity queries and query result
ranking
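A minimal sketch of such an inverted index, storing (document id, position) pairs so that word locations are available for proximity queries and result ranking (the document contents and function name are illustrative):

```python
from collections import defaultdict

def build_inverted_index(docs):
    """docs: dict mapping a document id to its text.
    Returns one inverted list per word: a list of
    (document id, word position) pairs."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.lower().split()):
            index[word].append((doc_id, pos))   # record where the word occurs
    return index
```

Dropping the position and keeping only document ids gives the simpler form described first; keeping positions is what makes proximity queries answerable.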
Ranking and Link Analysis
• A typical search engine returns a large number of web pages that are
expected to be relevant to a user query.
• However, these pages are likely to be different in terms of their quality and
relevance.
• The user is not expected to browse through this large collection to find a high
quality page.
• Clearly, there is a need for algorithms to rank these pages so that
higher-quality web pages appear among the top results.
• Link-based algorithms can be used to rank a collection of pages.
• A page that has a large number of incoming links is expected to be of good
quality; hence, the number of incoming links to a page can be used as a
ranking criterion.
• This intuition is the basis of ranking algorithms
• HITS is also a link-based algorithm.
• It is based on identifying “authorities” and “hubs”.
• A good authority page receives a high rank.
• Hubs and authorities have a mutually reinforcing relationship: a good
authority is a page that is linked to by many good hubs, and a good hub is a
page that links to many good authorities.
• Thus, a page pointed to by many hubs (a good authority page) is likely to be
of high quality
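The mutually reinforcing update can be sketched with a small power iteration (a minimal sketch; function and page names are illustrative, and the graph is assumed to contain at least one link so the normalizations are well defined):

```python
def hits(out_links, iterations=50):
    """out_links: dict mapping each page to the list of pages it links to.
    Returns (hub, authority) score dicts."""
    pages = list(out_links)
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # authority of p = sum of hub scores of pages linking to p
        auth = {p: sum(hub[q] for q in pages if p in out_links[q]) for p in pages}
        norm = sum(v * v for v in auth.values()) ** 0.5
        auth = {p: v / norm for p, v in auth.items()}
        # hub of p = sum of authority scores of pages p links to
        hub = {p: sum(auth[q] for q in out_links[p]) for p in pages}
        norm = sum(v * v for v in hub.values()) ** 0.5
        hub = {p: v / norm for p, v in hub.items()}
    return hub, auth
```

On a graph where two hub pages both point at one page, that page ends up with the top authority score while the two pointers share the hub score, matching the intuition in the bullets above.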
Evaluation of Keyword Search
• Keyword-based search engines are the most popular tools to search
information on the web.
• They are simple, and one can specify fuzzy queries that may not have an
exact answer, but may only be answered approximately by finding facts that
are “similar” to the keywords.
• The limitation is that keyword search is not sufficiently powerful to express
complex queries
• A second limitation is that keyword search does not offer support for a
global view of information on the web in the way that database querying
exploits database schema information.
Web Querying
• Declarative querying and the efficient execution of queries have been a
major focus of database technology.
• It would be beneficial if these database techniques could be applied to the
web.
• In this way, accessing the web could be treated, to a certain extent,
similarly to accessing a large database.
• There are difficulties in carrying over traditional database querying concepts
to web data.
• Perhaps the most important difficulty is that database querying assumes the
existence of a strict schema.
• Web data are semistructured: the data may have some structure, but it may
not be as rigid, regular, or complete as that of databases, so that
different instances of the data may be similar but not identical.
• There are, obviously, inherent difficulties in querying schemaless data.
• A second issue is that the web is more than semistructured data.
• The links that exist between web data entities (e.g., pages) are important
and need to be considered. This requires links to be treated as first-class
objects.
• A third major difficulty is that there is no commonly accepted language,
similar to SQL, for querying web data.
• A standardized language for XML (XQuery) has emerged; as XML becomes more
prevalent on the web, this language is likely to become dominant and more
widely used.
Semistructured Data Approach
• One way to approach querying the web data is to treat it as a collection of
semi structured data.
• Then, models and languages that have been developed for this purpose can
be used to query the data.
• Semistructured data models and languages were not originally developed to
deal with web data; rather, they addressed the requirements of growing data
collections that did not have as strict a schema as their relational
counterparts.
• OEM (Object Exchange Model) is a self-describing semistructured data
model. Self-describing means that each object specifies the schema that it
follows.
• An OEM object is defined as a four-tuple (label, type, value, oid), where
– label - is a character string describing what the object represents,
– type - specifies the type of the object's value,
– value - is the object's value, and
– oid - is the object identifier that distinguishes it from other objects.
• The type of an object can be atomic, in which case the
object is called an atomic object, or complex, in which
case the object is called a complex object.
• An atomic object contains a primitive value such as an
integer, a real, or a string, while a complex object
contains a set of other objects, which can themselves be
atomic or complex.
• The value of a complex object is a set of oids.
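A minimal sketch of the OEM four-tuple in Python (the class name and the sample objects are illustrative, not from the text): two atomic objects hold primitive values, and a complex object's value is the set of oids of its components.

```python
class OEMObject:
    """An OEM object as a four-tuple: (label, type, value, oid)."""
    def __init__(self, label, type_, value, oid):
        self.label, self.type, self.value, self.oid = label, type_, value, oid

# atomic objects: primitive values (string, integer)
title = OEMObject("paper_title", "string", "An Example Paper", "o2")
year = OEMObject("pub_year", "integer", 1997, "o3")
# complex object: its value is the set of oids of its component objects
paper = OEMObject("paper", "complex", {"o2", "o3"}, "o1")
```

Because each object carries its own label and type, the "schema" travels with the data, which is what self-describing means here.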
Web Query Language Approach
• The approaches in this category aim to directly address the
characteristics of web data, particularly focusing on handling links
properly.
• Their starting point is to overcome the shortcomings of keyword search by
providing proper abstractions for capturing the content structure of
documents
• They combine the content-based queries (e.g., keyword expressions) and
structure-based queries (e.g., path expressions).
• A number of languages have been proposed specifically to deal with web
data; these can be categorized as first-generation and second-generation.
• The first-generation languages model the web as an interconnected
collection of atomic objects.
• Consequently, these languages can express queries that search the link
structure among web objects and their textual content, but they cannot
express queries that exploit the document structure of these web objects
• The second-generation languages model the web as a linked collection of
structured objects, allowing them to express queries that exploit document
structure, much as semistructured-data languages do.
• First-generation approaches include WebSQL, W3QL and WebLog.
• Second-generation approaches include WebOQL and StruQL.
Question Answering
• These systems accept natural language questions that are then analyzed to
determine the specific query that is being posed.
• They then conduct a search to find the appropriate answer.
• Question answering systems have grown within the context of IR systems,
where the objective is to determine the answer to posed queries within a
well-defined corpus of documents.
• These are usually referred to as closed-domain systems.
• They extend the capabilities of keyword search queries in two fundamental
ways.
• First, they allow users to specify complex queries in natural language that
may be difficult to specify as simple keyword search requests
• In the context of web querying, they also enable asking questions without a
full knowledge of the data organization.
• This does not mean that they return exact answers as traditional DBMSs do,
but they may return a (ranked) list of explicit responses to the query, rather
than a set of web pages.
General architecture of QA Systems
Searching and Querying the Hidden Web
• Most general-purpose search engines only operate on the publicly indexable
web (PIW), while a considerable amount of valuable data is kept in hidden
databases, either as relational data, as embedded documents, or in many
other forms.
• The current trend in web searching is to find ways to search the hidden web
as well as the PIW, for two main reasons.
• The first is size – the size of the hidden web (in terms of generated HTML
pages) is considerably larger than that of the PIW, so the probability of
finding answers to users' queries is much higher if the hidden web can also
be searched.
• The second is data quality – the data stored in the hidden web are usually
of much higher quality than those found on public web pages, since they are
properly curated.
• Ordinary crawlers cannot be used to search the hidden web, since there are
neither HTML pages, nor hyperlinks to crawl.
• Usually, the data in hidden databases can only be accessed through a
search interface or another special-purpose interface.
• In most cases, the underlying structure of the database is unknown, and
the data providers are usually reluctant to provide any information about
their data that might help in the search process; one has to work through
the interfaces provided by these data sources.
Crawling the Hidden Web
• One approach to address the issue of searching the hidden web is to try
crawling in a manner similar to that of the PIW.
• As already mentioned, the only way to deal with hidden web databases is
through their search interfaces.
A hidden web crawler should be able to perform two tasks:
(a) submit queries to the search interface of the database, and
(b) analyze the returned result pages and extract relevant information from
them.
• Hadoop Architecture
• HDFS
– Advantages
– Features
• Architecture
• HDFS Operations
• Big Data
• MapReduce
SSZG554 - Distributed Data Systems Date : 7th Nov 2020 473 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Introduction
• Hadoop is an open-source framework that allows one to store and process
big data in a distributed environment across clusters of computers using
simple programming models.
• It is designed to scale up from single servers to thousands of machines,
each offering local computation and storage.
• All the modules in Hadoop are designed with a fundamental assumption that
hardware failures are common and should be automatically handled by the
framework
• The core of Apache Hadoop consists of a storage part, known as Hadoop
Distributed File System (HDFS), and a processing part called MapReduce
• Hadoop splits files into large blocks and distributes them across nodes in a
cluster.
• To process the data, Hadoop transfers packaged code to the nodes so that
the data can be processed in parallel where it resides.
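The split/process-in-parallel idea can be illustrated with the classic MapReduce word-count example, here run in a single process (a sketch of the programming model only; Hadoop would execute the map and reduce tasks on the cluster nodes holding the data blocks, and all function names are illustrative):

```python
from collections import defaultdict

def map_fn(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word, 1) for word in line.split()]

def reduce_fn(word, counts):
    """Reduce: combine all counts emitted for one word."""
    return word, sum(counts)

def mapreduce(lines):
    groups = defaultdict(list)
    for line in lines:                     # map phase, one call per input split
        for key, value in map_fn(line):
            groups[key].append(value)      # shuffle: group values by key
    return dict(reduce_fn(k, v) for k, v in groups.items())   # reduce phase
```

The same map and reduce functions run unchanged whether the input is two strings in memory or terabytes of blocks spread across a cluster; only the framework around them changes.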
Hadoop Architecture
The base Apache Hadoop framework is composed of the following
modules:
– Hadoop Common: the common libraries and utilities needed by the other
modules
– Hadoop Distributed File System (HDFS): the distributed storage layer
– Hadoop YARN: the resource-management and job-scheduling platform
– Hadoop MapReduce: the programming model for large-scale data processing
Hadoop Distributed File System
• Hadoop can work directly with any mountable distributed file system such as
Local FS, HFTP FS, S3 FS, and others, but the most common file system
used by Hadoop is the Hadoop Distributed File System (HDFS).
• The Hadoop Distributed File System (HDFS) is based on the Google File
System (GFS) and provides a distributed file system that is designed to run
on large clusters (thousands of computers) of small computer machines in a
reliable, fault-tolerant manner.
• The DataNodes take care of read and write operations on the file system.
• They also take care of block creation, deletion, and replication based on
instructions given by the NameNode.
• HDFS provides a shell like any other file system, and a list of commands
is available to interact with the file system.
How Does Hadoop Work?
Stage 1
• A user/application can submit a job to Hadoop (via a Hadoop job client)
for the required process by specifying the following items:
• The location of the input and output files in the distributed file system.
• The Java classes, in the form of a JAR file, containing the implementation
of the map and reduce functions.
• The job configuration by setting different parameters specific to the job.
Stage 2
• The Hadoop job client then submits the job (jar/executable etc) and
configuration to the JobTracker which then assumes the responsibility of
distributing the software/configuration to the slaves, scheduling tasks and
monitoring them, providing status and diagnostic information to the job-client.
Stage 3
• The TaskTrackers on different nodes execute the task as per MapReduce
implementation and output of the reduce function is stored into the output
files on the file system.
Advantages of Hadoop
• Hadoop framework allows the user to quickly write and test distributed
systems.
• It is efficient: it automatically distributes the data and work across the
machines and, in turn, utilizes the underlying parallelism of the CPU cores.
• Hadoop does not rely on hardware to provide fault tolerance and high
availability (FTHA); rather, the Hadoop library itself has been designed to
detect and handle failures at the application layer.
• Servers can be added or removed from the cluster dynamically and Hadoop
continues to operate without interruption.
• Another big advantage of Hadoop is that, apart from being open source, it
is compatible with all platforms, since it is Java-based.
HDFS
• Hadoop File System was developed using distributed file system design.
• It is run on commodity hardware.
• Unlike other distributed systems, HDFS is highly fault-tolerant and
designed using low-cost hardware.
• HDFS holds very large amounts of data and provides easy access.
• To store such huge data, the files are stored across multiple machines.
• These files are stored in a redundant fashion to rescue the system from
possible data loss in case of failure.
• HDFS also makes applications available for parallel processing.
Features of HDFS
• It is suitable for distributed storage and processing.
• Hadoop provides a command interface to interact with HDFS.
• The built-in servers of the namenode and datanode help users easily check
the status of the cluster.
• Streaming access to file system data.
• HDFS provides file permissions and authentication.
HDFS Architecture
• HDFS follows a master-slave architecture and has the following elements
– Namenode
– Datanode
Namenode
• The namenode is commodity hardware that contains the GNU/Linux operating
system and the namenode software, which can be run on low-cost machines.
• The system having the namenode acts as the master server and it does the
following tasks:
– Manages the file system namespace.
– Regulates client’s access to files.
– It also executes file system operations such as renaming, closing, and
opening files and directories.
SSZG554 - Distributed Data Systems Date : 7th Nov 2020 492 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Datanode
• The datanode is commodity hardware having the GNU/Linux operating system
and the datanode software.
• For every node (commodity hardware/system) in a cluster, there will be a
datanode. These nodes manage the data storage of their system.
• Datanodes perform read-write operations on the file systems, as per client
request.
• They also perform operations such as block creation, deletion, and
replication according to the instructions of the namenode.
Block
• Generally, user data is stored in the files of HDFS.
• A file in the file system is divided into one or more segments and stored
on individual datanodes.
• These file segments are called blocks. In other words, the minimum amount
of data that HDFS can read or write is called a block. The default block
size is 64 MB, but it can be increased as needed by changing the HDFS
configuration.
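To make the block division concrete, a small sketch (assuming the 64 MB default above and a hypothetical 200 MB file; the variable names are illustrative, not part of any Hadoop API):

```python
import math

# How many 64 MB blocks does a hypothetical 200 MB file occupy?
BLOCK_SIZE = 64 * 1024 * 1024   # default HDFS block size (64 MB)
file_size = 200 * 1024 * 1024   # hypothetical 200 MB file

# The file is cut into full-size blocks plus one final, possibly
# smaller, block holding the remainder.
num_blocks = math.ceil(file_size / BLOCK_SIZE)
print(num_blocks)  # 4: three full 64 MB blocks plus one 8 MB block
```

Note that the last block occupies only as much physical storage as the remaining 8 MB of data, not a full 64 MB.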
Goals of HDFS
Huge datasets :
• HDFS should scale to hundreds of nodes per cluster to manage applications
having huge datasets.
Goals of HDFS
Hardware at data :
• A requested task can be done efficiently when the computation takes place
near the data. Especially where huge datasets are involved, this reduces
network traffic and increases throughput.
HDFS Operations
Starting HDFS
• Initially you have to format the configured HDFS file system, open the
namenode (HDFS server), and execute the following command.
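The command itself is missing from the slide; assuming a Hadoop 1.x installation (matching the 64 MB default block size used earlier in this deck) with the Hadoop binaries on the PATH, it would be:

```shell
$ hadoop namenode -format
```

On Hadoop 2.x and later the equivalent command is `hdfs namenode -format`.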
HDFS Operations
• After formatting the HDFS, start the distributed file system. The following
command will start the namenode as well as the datanodes as a cluster.
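The start command referenced on the slide is not shown; assuming a standard Hadoop installation with its scripts on the PATH, it would be:

```shell
$ start-dfs.sh
```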
HDFS Operations
Step 1
• You have to create an input directory.
Step 2
• Transfer and store a data file from the local system to the Hadoop file
system using the put command.
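The commands for these two steps are not shown on the slide; a minimal sketch, assuming a hypothetical HDFS directory `/user/input` and a local file `file.txt`:

```shell
$ hadoop fs -mkdir /user/input         # Step 1: create an input directory
$ hadoop fs -put file.txt /user/input  # Step 2: copy a local file into HDFS
```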
HDFS Operations
Step 3
• You can verify the file using ls command.
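The verification command is missing from the slide; reusing the hypothetical `/user/input` directory from the previous steps:

```shell
$ hadoop fs -ls /user/input
```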
HDFS Operations
Step 1
• Initially, view the data from HDFS using the cat command.
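The command is not shown on the slide; a sketch using the hypothetical file stored in the earlier steps:

```shell
$ hadoop fs -cat /user/input/file.txt
```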
HDFS Operations
Step 2
• Get the file from HDFS to the local file system using the get command.
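The command is not shown on the slide; a sketch with hypothetical source and destination paths:

```shell
$ hadoop fs -get /user/input/file.txt /home/user/
```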
Big Data
• Big data is a collection of large datasets that cannot be processed using
traditional computing techniques.
• Big data is not merely data; rather, it has become a complete subject, which
involves various tools, techniques and frameworks.
Big Data
• Due to the advent of new technologies, devices, and communication means
like social networking sites, the amount of data produced by mankind is
growing rapidly every year.
• The amount of data produced by us from the beginning of time till 2003 was
5 billion gigabytes.
• The same amount was created every two days in 2011, and every ten minutes
in 2013.
• This rate is still growing enormously. Though all this information is
meaningful and can be useful when processed, much of it is being neglected.
Big Data
• Big data involves the data produced by different devices and applications.
• Given below are some of the fields that come under the umbrella of big data.
Black Box Data :
It is a component of helicopters, airplanes, jets, etc. It captures the voices
of the flight crew, recordings of microphones and earphones, and the
performance information of the aircraft.
Big Data
Thus big data includes huge volume, high velocity, and an extensible
variety of data. The data in it will be of three types:
– Structured data : relational data.
– Semi-structured data : XML data.
– Unstructured data : Word, PDF, text, media logs.
Benefits of Big Data
• Using the information kept in social networks like Facebook, marketing
agencies are learning about the response to their campaigns, promotions,
and other advertising media.
• Using information in social media such as the preferences and product
perception of their consumers, product companies and retail organizations
are planning their production.
• Using the data regarding the previous medical history of patients, hospitals
are providing better and quicker service.
Big Data Technologies
• Big data technologies are important in providing more accurate analysis,
which may lead to more concrete decision-making, resulting in greater
operational efficiencies, cost reductions, and reduced risks for the business.
• To harness the power of big data, you require an infrastructure that
can manage and process huge volumes of structured and unstructured data
in real time and can protect data privacy and security.
Big Data Technologies
• There are various technologies in the market from different vendors including
Amazon, IBM, Microsoft, etc., to handle big data.
• While looking into the technologies that handle big data, we examine the
following two classes of technology:
– Operational Big Data
– Analytical Big Data
Operational Big Data
• These include systems like MongoDB that provide operational capabilities for
real-time, interactive workloads where data is primarily captured and stored.
• NoSQL big data systems are designed to take advantage of new cloud
computing architectures that have emerged over the past decade, allowing
massive computations to be run inexpensively and efficiently.
Operational Big Data
• This makes operational big data workloads much easier to manage, cheaper,
and faster to implement.
• Some NoSQL systems can provide insights into patterns and trends based
on real-time data with minimal coding and without the need for data
scientists and additional infrastructure.
Analytical Big Data
• These include systems like Massively Parallel Processing (MPP) database
systems and MapReduce that provide analytical capabilities for retrospective
and complex analysis that may touch most or all of the data.
Big Data Solutions
Traditional Approach
• In this approach, an enterprise will have a computer to store and process
big data.
• Here data is stored in an RDBMS like Oracle Database, MS SQL Server
or DB2, and sophisticated software can be written to interact with the
database, process the required data and present it to the users for analysis
purposes.
Big Data Solutions
• This approach works well where we have a smaller volume of data that can be
accommodated by standard database servers, or up to the limit of the
processor that is processing the data.
• But when it comes to dealing with huge amounts of data, it is really a
tedious task to process such data through a traditional database server.
Big Data Solutions
Google’s Solution
• Google solved this problem using an algorithm called MapReduce.
• This algorithm divides the task into small parts, assigns those parts to
many computers connected over the network, and collects the results to form
the final result dataset.
Big Data Solutions
• Doug Cutting, Mike Cafarella and their team took the solution provided by
Google and started an open source project called Hadoop in 2005; Doug
named it after his son's toy elephant.
• Now Apache Hadoop is a registered trademark of the Apache Software
Foundation.
Big Data Solutions
• Hadoop runs applications using the MapReduce algorithm, where the data is
processed in parallel on different CPU nodes.
• In short, the Hadoop framework is capable enough to develop applications
that run on clusters of computers and perform complete statistical analysis
for huge amounts of data.
MapReduce
• MapReduce is a programming model and an associated implementation for
processing and generating large data sets with a parallel, distributed
algorithm on a cluster.
• A MapReduce program is composed of a Map() procedure (method) that
performs filtering and sorting (such as sorting students by first name into
queues, one queue for each name) and a Reduce() method that performs a
summary operation (such as counting the number of students in each queue,
yielding name frequencies).
MapReduce
• Processing can occur on data stored either in a filesystem (unstructured) or
in a database (structured).
• MapReduce can take advantage of the locality of data, processing it near the
place it is stored in order to reduce the distance over which it must be
transmitted.
MapReduce
"Map" step:
Each worker node applies the "map()" function to its local data and writes the
output to temporary storage. A master node ensures that only one copy of
redundant input data is processed.
"Shuffle" step:
Worker nodes redistribute data based on the output keys (produced by the
"map()" function), such that all data belonging to one key is located on the
same worker node.
MapReduce
"Reduce" step:
Worker nodes now process each group of output data, per key, in parallel.
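The three steps above can be sketched as a single-process simulation. The word-count task, function names, and data below are illustrative only, not part of Hadoop's actual API:

```python
from collections import defaultdict

def map_step(document):
    # "Map" step: emit a (word, 1) pair for every word in the local data
    return [(word, 1) for word in document.split()]

def shuffle_step(mapped):
    # "Shuffle" step: group all values belonging to one key together
    groups = defaultdict(list)
    for key, value in mapped:
        groups[key].append(value)
    return groups

def reduce_step(key, values):
    # "Reduce" step: summarize each key's group (here, a count)
    return (key, sum(values))

# Two "worker nodes", each holding one document of local data
docs = ["big data big cluster", "big data"]
mapped = [pair for d in docs for pair in map_step(d)]
grouped = shuffle_step(mapped)
result = dict(reduce_step(k, v) for k, v in grouped.items())
print(result)  # {'big': 3, 'data': 2, 'cluster': 1}
```

In a real cluster each list comprehension above would run on a different worker node, and the shuffle would move data across the network so that all pairs sharing a key land on the same reducer.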
MapReduce
• MapReduce allows for distributed processing of the map and reduction
operations.
• Provided that each mapping operation is independent of the others, all maps
can be performed in parallel – though in practice this is limited by the number
of independent data sources and/or the number of CPUs near each source.
• Similarly, a set of 'reducers' can perform the reduction phase, provided that
all outputs of the map operation that share the same key are presented to
the same reducer at the same time, or that the reduction function is
associative.
MapReduce
Another way to look at MapReduce is as a 5-step parallel and distributed
computation:
• Prepare the Map() input – the "MapReduce system" designates Map
processors, assigns the input key value K1 that each processor would work
on, and provides that processor with all the input data associated with that
key value.
• Run the user-provided Map() code – Map() is run exactly once for
each K1 key value, generating output organized by key values K2.
MapReduce
• "Shuffle" the Map output to the Reduce processors – the MapReduce system
designates Reduce processors, assigns the K2 key value each processor
should work on, and provides that processor with all the Map-generated data
associated with that key value.
• Run the user-provided Reduce() code – Reduce() is run exactly once for
each K2 key value produced by the Map step.
• Produce the final output – the MapReduce system collects all the Reduce
output and sorts it by K2 to produce the final outcome.
MapReduce
As an example, imagine that for a database of 1.1 billion people, one would
like to compute the average number of social contacts a person has according
to age. In SQL, such a query could be expressed as:
SELECT age, AVG(contacts)
FROM social.person
GROUP BY age
ORDER BY age
MapReduce
• Using MapReduce, the K1 key values could be the integers 1 through 1100,
each representing a batch of 1 million records; the K2 key value could be a
person's age in years; and this computation could be achieved using the
following functions.
MapReduce
function Map is
    input: integer K1 between 1 and 1100, representing a batch of 1 million social.person records
    for each social.person record in the K1 batch do
        let Y be the person's age
        let N be the number of contacts the person has
        produce one output record (Y, (N, 1))
    repeat
end function

function Reduce is
    input: age (in years) Y
    for each input record (Y, (N, C)) do
        Accumulate in S the sum of N*C
        Accumulate in Cnew the sum of C
    repeat
    let A be S/Cnew
    produce one output record (Y, (A, Cnew))
end function
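A runnable rendering of this pseudocode, as a sketch only: the small in-memory batches stand in for the 1-million-record batches, each "person" is reduced to an (age, contacts) tuple, and names like `map_batch` are illustrative:

```python
from collections import defaultdict

def map_batch(batch):
    # Map: emit (Y, (N, 1)) per person - age, (contacts, count)
    return [(age, (contacts, 1)) for age, contacts in batch]

def reduce_age(y, records):
    # Reduce: combine partial (N, C) pairs into a weighted average
    s = sum(n * c for n, c in records)       # Accumulate N*C in S
    cnew = sum(c for _, c in records)        # Accumulate C in Cnew
    return (y, (s / cnew, cnew))             # (Y, (A, Cnew))

# Three tiny "batches" of (age, contacts) records, matching the
# worked example later in the deck
batches = [[(10, 9), (10, 9), (10, 9)], [(10, 9), (10, 9)], [(10, 10)]]
mapped = [rec for b in batches for rec in map_batch(b)]

# Shuffle: group all (N, C) pairs by age Y
grouped = defaultdict(list)
for y, nc in mapped:
    grouped[y].append(nc)

result = [reduce_age(y, recs) for y, recs in grouped.items()]
print(result)  # [(10, (9.166..., 6))] - average 55/6 over 6 records
```

Carrying the count `Cnew` alongside the average is what keeps repeated reductions correct, as the worked example that follows demonstrates.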
MapReduce
• The MapReduce system would line up the 1100 Map processors, and would
provide each with its corresponding 1 million input records.
• The Map step would produce 1.1 billion (Y, (N, 1)) records, with Y values
ranging between, say, 8 and 103.
• The MapReduce system would then line up the 96 Reduce processors by
performing a shuffle operation on the key/value pairs (since we need one
average per age), and provide each with its millions of corresponding
input records.
MapReduce
• The Reduce step would result in the much reduced set of only 96 output
records (Y, A), which would be put in the final result file, sorted by Y.
• The count info in the record is important if the reduction is applied more
than once.
• If we did not add the count of the records, the computed average would be
wrong. For example:
MapReduce
-- map output #1: age, quantity of contacts
10, 9
10, 9
10, 9
-- map output #2: age, quantity of contacts
10, 9
10, 9
-- map output #3: age, quantity of contacts
10, 10
MapReduce
• If we reduce files #1 and #2, we will have a new file with an average of 9
contacts for a 10-year-old person ((9+9+9+9+9)/5):
-- reduce step #1: age, average of contacts
10, 9
• If we then reduce this result with file #3, we lose the count of how many
records we have already seen, so we end up with an average of 9.5 contacts
for a 10-year-old person ((9+10)/2), which is wrong.
• The correct answer is
9.166… = 55 / 6 = (9*3 + 9*2 + 10*1)/(3 + 2 + 1).
THANK YOU