Dbms Unit

CS8492-DBMS-UNIT-5 ADVANCED TOPICS
Unit-V Advanced TOPICS

Distributed Databases: Architecture, Data Storage, Transaction Processing -
Object-based Databases: Object Database Concepts, Object-Relational Features,
ODMG Object Model, ODL, OQL - XML Databases: XML Hierarchical Model, DTD,
XML Schema, XQuery - Information Retrieval: IR Concepts, Retrieval Models,
Queries in IR Systems
DISTRIBUTED DATABASES:-
A distributed database is a database that is not limited to one system; it is
spread over different sites, i.e., over multiple computers or a network of
computers. A distributed database system is located on various sites that do not
share physical components. This may be required when a particular database needs
to be accessed by users across the globe. It must be managed so that, to its
users, it looks like a single database.
A distributed database is a collection of multiple interconnected databases, which
are spread physically across various locations that communicate via a computer
network.
Types:
1. Homogeneous Database:
In a homogeneous database, all sites store the database identically. The
operating system, database management system and data structures used are the
same at all sites. Hence, they are easy to manage.
II YEAR / IV SEM

2. Heterogeneous Database:
In a heterogeneous distributed database, different sites can use different
schemas and software, which can lead to problems in query processing and
transactions. Also, a particular site might be completely unaware of the other
sites. Different computers may use different operating systems and different
database applications. They may even use different data models for the database.
Hence, translations are required for different sites to communicate.
Distributed Data Storage
There are two ways in which data can be stored on different sites. These are:
1. Replication
In this approach, the entire relation is stored redundantly at two or more
sites. If the entire database is available at all sites, it is a fully redundant
database. Hence, in replication, systems maintain copies of data.
This is advantageous as it increases the availability of data at different sites.
Also, query requests can now be processed in parallel.
However, it has certain disadvantages as well. Data needs to be constantly
updated. Any change made at one site needs to be recorded at every site where
that relation is stored, or else it may lead to inconsistency. This is a lot of
overhead. Also, concurrency control becomes much more complex, as concurrent
access now needs to be checked over a number of sites.
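The update overhead of replication can be illustrated with a minimal Python sketch. The class name ReplicatedRelation and the site names are hypothetical, and the sketch assumes a simple write-all, read-one scheme with no locking or failure handling.

```python
class ReplicatedRelation:
    """Toy relation replicated in full at every site (write-all, read-one)."""

    def __init__(self, sites):
        # One independent copy of the relation per site: {site: {key: row}}
        self.copies = {site: {} for site in sites}

    def write(self, key, row):
        # Every replica must record the change, otherwise the copies diverge.
        for copy in self.copies.values():
            copy[key] = dict(row)
        return len(self.copies)  # number of sites touched by one update

    def read(self, site, key):
        # Any single site can answer a query from its local copy.
        return self.copies[site].get(key)


accounts = ReplicatedRelation(["chennai", "delhi", "mumbai"])
sites_updated = accounts.write(101, {"owner": "Asha", "balance": 500})
```

One logical update costs one write per site, while a read needs only the local copy. This is the trade-off between availability and update overhead described above.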
2. Fragmentation
In this approach, the relations are fragmented (i.e., divided into smaller
parts) and each fragment is stored at the sites where it is required.
It must be ensured that the fragments can be used to reconstruct the original
relation (i.e., there is no loss of data).
Fragmentation is advantageous because it does not create copies of data, so
consistency is not a problem.
Fragmentation of relations can be done in two ways:
• Horizontal fragmentation – Splitting by rows – The relation is fragmented
into groups of tuples so that each tuple is assigned to at least one fragment.
• Vertical fragmentation – Splitting by columns – The schema of the relation
is divided into smaller schemas. Each fragment must contain a common
candidate key so as to ensure lossless join.
In certain cases, an approach that is a hybrid of fragmentation and replication
is used.
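The reconstruction requirement can be sketched in a few lines of Python. The rows and column names below are hypothetical sample data. Horizontal fragments are recombined by a union, and vertical fragments by a join on the shared candidate key.

```python
# Hypothetical STUDENT rows; regd_no plays the role of the candidate key.
student = [
    {"regd_no": 1, "name": "Anu", "course": "CSE", "fees": 5000},
    {"regd_no": 2, "name": "Ravi", "course": "ECE", "fees": 4500},
    {"regd_no": 3, "name": "Meena", "course": "CSE", "fees": 5000},
]

# Horizontal fragmentation: split by rows using a predicate on course.
cse_rows = [r for r in student if r["course"] == "CSE"]
other_rows = [r for r in student if r["course"] != "CSE"]

# Reconstruction is the union of the horizontal fragments.
from_horizontal = sorted(cse_rows + other_rows, key=lambda r: r["regd_no"])

# Vertical fragmentation: split by columns, repeating the key in each fragment.
profile = [
    {"regd_no": r["regd_no"], "name": r["name"], "course": r["course"]}
    for r in student
]
fees = [{"regd_no": r["regd_no"], "fees": r["fees"]} for r in student]

# Reconstruction is a join of the vertical fragments on the shared key.
fees_by_key = {f["regd_no"]: f for f in fees}
from_vertical = [{**p, **fees_by_key[p["regd_no"]]} for p in profile]
```

If the key column were left out of either vertical fragment, the join could not match rows and the decomposition would no longer be lossless.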
Types of Distributed Databases
Distributed databases can be broadly classified into homogeneous and
heterogeneous distributed database environments, each with further sub-divisions,
as shown in the following illustration.

Homogeneous Distributed Databases
In a homogeneous distributed database, all the sites use identical DBMS and
operating systems. Its properties are −
• The sites use very similar software.
• The sites use identical DBMS or DBMS from the same vendor.
• Each site is aware of all other sites and cooperates with other sites to process
user requests.
• The database is accessed through a single interface as if it is a single
database.
Types of Homogeneous Distributed Database
There are two types of homogeneous distributed database −
• Autonomous − Each database is independent and functions on its own.
They are integrated by a controlling application and use message passing to
share data updates.
• Non-autonomous − Data is distributed across the homogeneous nodes and a
central or master DBMS co-ordinates data updates across the sites.
Heterogeneous Distributed Databases
In a heterogeneous distributed database, different sites have different operating
systems, DBMS products and data models. Its properties are −
• Different sites use dissimilar schemas and software.
• The system may be composed of a variety of DBMSs like relational,
network, hierarchical or object oriented.
• Query processing is complex due to dissimilar schemas.
• Transaction processing is complex due to dissimilar software.
• A site may not be aware of other sites and so there is limited co-operation in
processing user requests.
Types of Heterogeneous Distributed Databases
• Federated − The heterogeneous database systems are independent in nature
and integrated together so that they function as a single database system.
• Un-federated − The database systems employ a central coordinating
module through which the databases are accessed.
Distributed DBMS Architectures
DDBMS architectures are generally developed depending on three parameters −
• Distribution − It states the physical distribution of data across the different
sites.
• Autonomy − It indicates the distribution of control of the database system
and the degree to which each constituent DBMS can operate independently.
• Heterogeneity − It refers to the uniformity or dissimilarity of the data
models, system components and databases.
Architectural Models
Some of the common architectural models are −
• Client - Server Architecture for DDBMS
• Peer - to - Peer Architecture for DDBMS
• Multi - DBMS Architecture
Client - Server Architecture for DDBMS
This is a two-level architecture where the functionality is divided into servers and
clients. The server functions primarily encompass data management, query
processing, optimization and transaction management. Client functions mainly
include the user interface. However, clients also perform some functions such as
consistency checking and transaction management.
The two different client - server architectures are −
• Single Server Multiple Client
• Multiple Server Multiple Client (shown in the following diagram)

Peer- to-Peer Architecture for DDBMS
In these systems, each peer acts both as a client and a server for imparting database
services. The peers share their resources with other peers and co-ordinate their
activities.
This architecture generally has four levels of schemas −
• Global Conceptual Schema − Depicts the global logical view of data.
• Local Conceptual Schema − Depicts logical data organization at each site.
• Local Internal Schema − Depicts physical data organization at each site.
• External Schema − Depicts user view of data.

Multi - DBMS Architectures
This is an integrated database system formed by a collection of two or more
autonomous database systems.
Multi-DBMS can be expressed through six levels of schemas −
• Multi-database View Level − Depicts multiple user views comprising
subsets of the integrated distributed database.
• Multi-database Conceptual Level − Depicts the integrated multi-database that
comprises global logical multi-database structure definitions.
• Multi-database Internal Level − Depicts the data distribution across
different sites and multi-database to local data mapping.
• Local database View Level − Depicts public view of local data.
• Local database Conceptual Level − Depicts local data organization at each
site.
• Local database Internal Level − Depicts physical data organization at each
site.
There are two design alternatives for multi-DBMS −
• Model with multi-database conceptual level.
• Model without multi-database conceptual level.

Design Alternatives
The distribution design alternatives for the tables in a DDBMS are as follows −
• Non-replicated and non-fragmented
• Fully replicated
• Partially replicated
• Fragmented
• Mixed
Non-replicated & Non-fragmented
In this design alternative, different tables are placed at different sites. Data is
placed so that it is in close proximity to the site where it is used most. It is most
suitable for database systems where the percentage of queries that need to join
information in tables placed at different sites is low. If an appropriate distribution
strategy is adopted, then this design alternative helps to reduce the communication
cost during data processing.
Fully Replicated
In this design alternative, one copy of all the database tables is stored at each site.
Since each site has its own copy of the entire database, queries are very fast,
requiring negligible communication cost. On the other hand, the massive redundancy
in data incurs a huge cost during update operations. Hence, this is suitable for
systems where a large number of queries must be handled whereas the
number of database updates is low.
Partially Replicated
Copies of tables or portions of tables are stored at different sites. The distribution
of the tables is done in accordance with the frequency of access. This takes into
consideration the fact that the frequency of accessing the tables varies considerably
from site to site. The number of copies of the tables (or portions) depends on how
frequently the access queries execute and on the sites that generate the access
queries.
Fragmented
In this design, a table is divided into two or more pieces referred to as fragments or
partitions, and each fragment can be stored at different sites. This considers the fact
that it seldom happens that all data stored in a table is required at a given site.
Moreover, fragmentation increases parallelism and provides better disaster
recovery. Here, there is only one copy of each fragment in the system, i.e. no
redundant data.
The three fragmentation techniques are −
• Vertical fragmentation
• Horizontal fragmentation
• Hybrid fragmentation
Mixed Distribution
This is a combination of fragmentation and partial replications. Here, the tables are
initially fragmented in any form (horizontal or vertical), and then these fragments
are partially replicated across the different sites according to the frequency of
accessing the fragments.
Data Replication
Data replication is the process of storing separate copies of the database at two or
more sites. It is a popular fault tolerance technique of distributed databases.
Advantages of Data Replication
• Reliability − In case of failure of any site, the database system continues to
work since a copy is available at another site(s).
• Reduction in Network Load − Since local copies of data are available,
query processing can be done with reduced network usage, particularly
during prime hours. Data updating can be done at non-prime hours.
• Quicker Response − Availability of local copies of data ensures quick
query processing and consequently quick response time.
• Simpler Transactions − Transactions require fewer joins of tables
located at different sites and minimal coordination across the network. Thus,
they become simpler in nature.
Disadvantages of Data Replication
• Increased Storage Requirements − Maintaining multiple copies of data is
associated with increased storage costs. The storage space required is in
multiples of the storage required for a centralized system.
• Increased Cost and Complexity of Data Updating − Each time a data item
is updated, the update needs to be reflected in all the copies of the data at the
different sites. This requires complex synchronization techniques and
protocols.
• Undesirable Application-Database Coupling − If complex update
mechanisms are not used, removing data inconsistency requires complex
co-ordination at application level. This results in undesirable
application-database coupling.
Some commonly used replication techniques are −
• Snapshot replication
• Near-real-time replication
• Pull replication
Fragmentation
Fragmentation is the task of dividing a table into a set of smaller tables. The
subsets of the table are called fragments. Fragmentation can be of three types:
horizontal, vertical, and hybrid (combination of horizontal and vertical). Horizontal
fragmentation can further be classified into two techniques: primary horizontal
fragmentation and derived horizontal fragmentation.
Fragmentation should be done in such a way that the original table can be
reconstructed from the fragments whenever required. This requirement is called
"reconstructiveness."
Advantages of Fragmentation
• Since data is stored close to the site of usage, efficiency of the database
system is increased.
• Local query optimization techniques are sufficient for most queries since
data is locally available.
• Since irrelevant data is not available at the sites, security and privacy of the
database system can be maintained.
Disadvantages of Fragmentation
• When data from different fragments is required, access may be
very slow.
• In case of recursive fragmentations, the job of reconstruction will need
expensive techniques.
• Lack of back-up copies of data in different sites may render the database
ineffective in case of failure of a site.
Vertical Fragmentation
In vertical fragmentation, the fields or columns of a table are grouped into
fragments. In order to maintain reconstructiveness, each fragment should contain
the primary key field(s) of the table. Vertical fragmentation can be used to enforce
privacy of data.
For example, let us consider that a University database keeps records of all
registered students in a Student table having the following schema.
STUDENT
Regd_No | Name | Course | Address | Semester | Fees | Marks
Now, the fees details are maintained in the accounts section. In this case, the
designer will fragment the database as follows −
CREATE TABLE STD_FEES AS
SELECT Regd_No, Fees
FROM STUDENT;
Horizontal Fragmentation
Horizontal fragmentation groups the tuples of a table in accordance with the values of
one or more fields. Horizontal fragmentation should also conform to the rule of
reconstructiveness. Each horizontal fragment must have all columns of the original
base table.
For example, in the student schema, if the details of all students of the Computer
Science course need to be maintained at the School of Computer Science, then
the designer will horizontally fragment the database as follows −
CREATE TABLE COMP_STD AS
SELECT * FROM STUDENT
WHERE COURSE = 'Computer Science';
Hybrid Fragmentation
In hybrid fragmentation, a combination of horizontal and vertical fragmentation
techniques is used. This is the most flexible fragmentation technique since it
generates fragments with minimal extraneous information. However,
reconstruction of the original table is often an expensive task.
Hybrid fragmentation can be done in two alternative ways −
• At first, generate a set of horizontal fragments; then generate vertical
fragments from one or more of the horizontal fragments.
• At first, generate a set of vertical fragments; then generate horizontal
fragments from one or more of the vertical fragments.
The three dimensions of distribution transparency are −
• Location transparency
• Fragmentation transparency
• Replication transparency
Location Transparency
Location transparency ensures that the user can query any table(s) or
fragment(s) of a table as if they were stored locally at the user's site. The fact that
the table or its fragments are stored at a remote site in the distributed database
system should be completely hidden from the end user. The address of the remote
site(s) and the access mechanisms are completely hidden.
In order to incorporate location transparency, the DDBMS should have access to an
updated and accurate data dictionary and DDBMS directory, which contains the
details of locations of data.
Fragmentation Transparency
Fragmentation transparency enables users to query upon any table as if it were
unfragmented. Thus, it hides the fact that the table the user is querying on is
actually a fragment or union of some fragments. It also conceals the fact that the
fragments are located at diverse sites.
This is somewhat similar to users of SQL views, where the user may not know that
they are using a view of a table instead of the table itself.
Replication Transparency
Replication transparency ensures that replication of databases is hidden from the
users. It enables users to query upon a table as if only a single copy of the table
exists.
Replication transparency is associated with concurrency transparency and failure
transparency. Whenever a user updates a data item, the update is reflected in all the
copies of the table. However, this operation should not be known to the user. This
is concurrency transparency. Also, in case of failure of a site, the user can still
proceed with their queries using replicated copies without any knowledge of the failure.
This is failure transparency.

Combination of Transparencies
In any distributed database system, the designer should ensure that all the stated
transparencies are maintained to a considerable extent. The designer may choose to
fragment tables, replicate them and store them at different sites, all without the
knowledge of the end user. However, complete distribution transparency is a tough
task and requires considerable design effort.
Database control refers to the task of enforcing regulations so as to provide correct
data to authentic users and applications of a database. In order that correct data is
available to users, all data should conform to the integrity constraints defined in the
database. Besides, data should be screened away from unauthorized users so as to
maintain security and privacy of the database. Database control is one of the
primary tasks of the database administrator (DBA).
The three dimensions of database control are −
• Authentication
• Access rights
• Integrity constraints
Authentication
In a distributed database system, authentication is the process through which only
legitimate users can gain access to the data resources.
Authentication can be enforced at two levels −
• Controlling Access to Client Computer − At this level, user access is
restricted at login to the client computer that provides the user interface to
the database server. The most common method is a username/password
combination. However, more sophisticated methods like biometric
authentication may be used for high security data.
• Controlling Access to the Database Software − At this level, the database
software/administrator assigns some credentials to the user. The user gains
access to the database using these credentials. One of the methods is to
create a login account within the database server.
Access Rights
A user's access rights refer to the privileges that the user is given regarding
DBMS operations, such as the rights to create a table, drop a table,
add/delete/update tuples in a table or query upon the table.
In distributed environments, since there is a large number of tables and a still larger
number of users, it is not feasible to assign individual access rights to users. So,
the DDBMS defines certain roles. A role is a construct with certain privileges within a
database system. Once the different roles are defined, the individual users are
assigned one of these roles. Often a hierarchy of roles is defined according to the
organization's hierarchy of authority and responsibility.
For example, the following SQL statements create a role "Accountant" and then
assign this role to user "ABC".
CREATE ROLE ACCOUNTANT;
GRANT SELECT, INSERT, UPDATE ON EMP_SAL TO ACCOUNTANT;
GRANT INSERT, UPDATE, DELETE ON TENDER TO ACCOUNTANT;
GRANT INSERT, SELECT ON EXPENSE TO ACCOUNTANT;
COMMIT;
GRANT ACCOUNTANT TO ABC;
COMMIT;
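The effect of these grants can be modelled with a minimal Python sketch. The dictionaries and the is_allowed helper are hypothetical illustrations of role-based checking, not how any particular DBMS stores privileges.

```python
# Privileges granted to each role: {role: {table: set of operations}}
ROLE_PRIVILEGES = {
    "ACCOUNTANT": {
        "EMP_SAL": {"SELECT", "INSERT", "UPDATE"},
        "TENDER": {"INSERT", "UPDATE", "DELETE"},
        "EXPENSE": {"INSERT", "SELECT"},
    }
}

# Roles assigned to each user (mirrors GRANT ACCOUNTANT TO ABC).
USER_ROLES = {"ABC": {"ACCOUNTANT"}}


def is_allowed(user, operation, table):
    """Return True if any role held by the user grants the operation on the table."""
    for role in USER_ROLES.get(user, set()):
        if operation in ROLE_PRIVILEGES.get(role, {}).get(table, set()):
            return True
    return False
```

For instance, is_allowed("ABC", "SELECT", "EMP_SAL") holds, while a DELETE on EMP_SAL by the same user does not, because the role was never granted that privilege. Adding a new accountant only requires one entry in USER_ROLES.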
Semantic Integrity Control
Semantic integrity control defines and enforces the integrity constraints of the
database system.
The integrity constraints are as follows −
• Data type integrity constraint
• Entity integrity constraint
• Referential integrity constraint
Data Type Integrity Constraint
A data type constraint restricts the range of values and the type of operations that
can be applied to the field with the specified data type.
For example, let us consider that a table "HOSTEL" has three fields - the hostel
number, hostel name and capacity. The hostel number should start with capital
letter "H" and cannot be NULL, and the capacity should not be more than 150. The
following SQL command can be used for data definition −
CREATE TABLE HOSTEL (
H_NO VARCHAR2(5) NOT NULL,
H_NAME VARCHAR2(15),
CAPACITY INTEGER,
CHECK (H_NO LIKE 'H%'),
CHECK (CAPACITY <= 150)
);
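As a rough check of how these constraints behave, the following sketch recreates the table in an in-memory SQLite database through Python's sqlite3 module. SQLite is an assumption made for illustration (the notes use Oracle-style VARCHAR2, replaced here by TEXT), and the sample rows and the try_insert helper are hypothetical. Note that SQLite's LIKE is case-insensitive for ASCII by default, so a lowercase 'h' prefix would also pass there.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE HOSTEL (
        H_NO     TEXT NOT NULL,
        H_NAME   TEXT,
        CAPACITY INTEGER,
        CHECK (H_NO LIKE 'H%'),
        CHECK (CAPACITY <= 150)
    )
    """
)


def try_insert(row):
    """Insert a row and report whether the constraints accepted it."""
    try:
        conn.execute("INSERT INTO HOSTEL VALUES (?, ?, ?)", row)
        return True
    except sqlite3.IntegrityError:
        return False


results = [
    try_insert(("H1", "Ganga", 120)),    # satisfies every constraint
    try_insert(("X2", "Yamuna", 100)),   # H_NO does not start with 'H'
    try_insert(("H3", "Kaveri", 200)),   # CAPACITY exceeds 150
    try_insert((None, "Narmada", 50)),   # H_NO is NULL
]
```

Only the first row is stored. The other three are rejected by the database itself, so no application code has to repeat the checks.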
Entity Integrity Control
Entity integrity control enforces the rules so that each tuple can be uniquely
identified from other tuples. For this a primary key is defined. A primary key is a
set of minimal fields that can uniquely identify a tuple. Entity integrity constraint
states that no two tuples in a table can have identical values for primary keys and
that no field which is a part of the primary key can have NULL value.
For example, in the above hostel table, the hostel number can be assigned as the
primary key through the following SQL statement (ignoring the checks) −
CREATE TABLE HOSTEL (
H_NO VARCHAR2(5) PRIMARY KEY,
H_NAME VARCHAR2(15),
CAPACITY INTEGER
);
Referential Integrity Constraint
Referential integrity constraint lays down the rules of foreign keys. A foreign key
is a field in a data table that is the primary key of a related table. The referential
integrity constraint lays down the rule that the value of the foreign key field should
either be among the values of the primary key of the referenced table or be entirely
NULL.
For example, let us consider a student table where a student may opt to live in a
hostel. To include this, the primary key of hostel table should be included as a
foreign key in the student table. The following SQL statement incorporates this −
CREATE TABLE STUDENT (
S_ROLL INTEGER PRIMARY KEY,
S_NAME VARCHAR2(25) NOT NULL,
S_COURSE VARCHAR2(10),
S_HOSTEL VARCHAR2(5) REFERENCES HOSTEL
);
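The foreign key rule can be tried out with a small sketch using Python's sqlite3 module. SQLite is an assumption made for illustration; it needs PRAGMA foreign_keys = ON before it enforces foreign keys. The sample rows and the add_student helper are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FK checks off by default
conn.execute(
    "CREATE TABLE HOSTEL (H_NO TEXT PRIMARY KEY, H_NAME TEXT, CAPACITY INTEGER)"
)
conn.execute(
    """
    CREATE TABLE STUDENT (
        S_ROLL   INTEGER PRIMARY KEY,
        S_NAME   TEXT NOT NULL,
        S_COURSE TEXT,
        S_HOSTEL TEXT REFERENCES HOSTEL (H_NO)
    )
    """
)
conn.execute("INSERT INTO HOSTEL VALUES ('H1', 'Ganga', 120)")


def add_student(roll, name, course, hostel):
    """Insert a student and report whether referential integrity allowed it."""
    try:
        conn.execute(
            "INSERT INTO STUDENT VALUES (?, ?, ?, ?)", (roll, name, course, hostel)
        )
        return True
    except sqlite3.IntegrityError:
        return False


in_hostel = add_student(1, "Anu", "CSE", "H1")      # references an existing hostel
day_scholar = add_student(2, "Ravi", "ECE", None)   # NULL foreign key is allowed
dangling = add_student(3, "Meena", "CSE", "H9")     # no such hostel, so rejected
```

The first two inserts succeed and the third fails, matching the rule that a foreign key value must either exist in the referenced table or be NULL.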
Advantages:
1. Sharing data.
2. Autonomy: each site is able to retain control over data stored locally.
3. Availability: if one site fails, the other sites are able to continue operating.
Disadvantages:
1. Software development cost is very high.
2. Greater potential for bugs: it is harder to ensure the correctness of algorithms,
especially during failures of part of the system.
3. Increased processing overhead: the exchange of messages and the additional
computation needed to achieve inter-site coordination.
OBJECT-BASED DATABASES
• Object-oriented database systems are an alternative to relational and
other database systems.
• In an object-oriented database, information is represented in the form of objects.
• Object-oriented databases use the same concepts as object-oriented programming
languages. If we combine the features of the relational model (transactions,
concurrency, recovery) with these object-oriented concepts, the resultant model is
called the object-oriented database model.
Data models designed for traditional data processing are not adequate to support
computer-aided design (CAD), image databases, multimedia, hypertext databases and
so on.
To support such new kinds of data, the following are required:
• Complex data types
• Data encapsulation
• Efficient indexing and query techniques
So, to satisfy the above requirements, object-based data models are needed.
Storing data in a database as objects is termed an object-oriented database. An
object has both attributes and methods associated with it. An object has two
components, state and behavior: state implies the current values of the object, and
behavior implies the operations carried out by the object. An object corresponds to
an entity in the ER model.
The attributes of an object are of two types:
• Simple attributes can be of a normal data type such as integer, real etc.
• Complex attributes contain references to other attributes or objects.
An object has the following characteristics:
• Identifier (unique id for the object)
• Lifetime (determines the persistence of the object)
The object-oriented data model is a logical model, which describes entities and
the relationships among them.
A group of similar objects is termed a class. An object is an instance of a class.
An object includes:
• Variable types
• Methods
• Message interface

class Student
{
String name;
int rollno;
String get_name();
void set_name(String name);
};
An object-oriented database (OODB) always maintains a direct link between
real-world objects and data objects, and each object has a unique system-generated
identifier called the object identifier (OID).
In an OODB, every object has an object structure that holds all relevant
information corresponding to the real-world entity.
An OODB provides complete encapsulation of objects. Thus it specifies
predefined operations for an object. However, predefined operations do not
support ad hoc queries by a user.
Encapsulation comes in two parts:
1. Signature or interface − specifies the operation name and its arguments.
2. Method or body − specifies the operation to be performed by an object.
An OODB supports inheritance to reuse existing type definitions when creating new
objects.
A consortium of object-oriented database vendors is called the ODMG (Object Data
Management Group).
An OODB supports operator overloading. That is, an operation name may refer to
several implementations depending on the type of object it is applied to. This
is known as polymorphism.
OBJECT IDENTITY (OID)
An OODB provides a unique, system-generated identifier (OID) for each object.
It is not known to external users, but is used internally by
the system to identify objects.
The object identity cannot be changed, so it is immutable. This immutability
preserves the identity of the real-world object being represented. Each OID can
be used only once. Even if an object is removed from the database, the same OID
cannot be assigned to another object.
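The "used only once" rule can be sketched as follows. The ObjectStore class is a hypothetical toy that assumes OIDs come from a simple increasing counter. Real systems may generate OIDs differently.

```python
import itertools


class ObjectStore:
    """Toy object store that hands out system-generated, never-reused OIDs."""

    def __init__(self):
        self._next_oid = itertools.count(1)  # monotonically increasing counter
        self._objects = {}

    def create(self, state):
        oid = next(self._next_oid)  # generated by the system, not by the user
        self._objects[oid] = state
        return oid

    def delete(self, oid):
        # The OID is retired with the object; the counter never goes back.
        del self._objects[oid]

    def exists(self, oid):
        return oid in self._objects


store = ObjectStore()
first = store.create({"name": "Anu"})
store.delete(first)
second = store.create({"name": "Ravi"})
```

Even though the first object has been deleted, the second object receives a fresh OID, so an old reference can never silently point at a different object.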
Object Structure
The state of an object can be constructed from other objects by using type
constructors.
Each object is denoted as a triple (i, c, v), where i is the unique object identifier,
c is the type constructor and v is the object state (current value).
Example:
(i, atom, 10)
Three basic type constructors are atom, tuple, and set.
Atom represents all atomic values such as integers, real numbers, character
strings, Booleans etc.
1. The object-oriented paradigm is based on encapsulating code and data into a
single unit. Conceptually, all interactions between an object and the rest of
the system are via messages. Thus, the interface between an object and the
rest of the system is defined by a set of allowed messages.
2. In general, an object has associated with it:
o A set of variables that contain the data for the object. The value of
each variable is itself an object.
o A set of messages to which the object responds.
o A set of methods, each of which is a body of code to implement a
message; a method returns a value as the response to the message.
3. Motivation for using messages and methods.
All employee objects respond to the annual-salary message, but with different
computations for managers, tellers, etc. By encapsulating within the
employee object itself the information about how to compute the annual
salary, all employee objects present the same interface.
Since the only external interface presented by an object is the set of
messages to which it responds, it is possible to (i) modify the definition of
methods and variables without affecting the rest of the system, and (ii)
replace a variable with a method that computes a value, e.g., age from
birth_date.
The ability to modify the definition of an object without affecting the rest of
the system is considered to be one of the major advantages of the OO
programming paradigm.
4. Methods of an object may be classified as either read-only or update.
Messages can also be classified as read-only or update. Derived attributes of
an entity in the ER model can be expressed as read-only messages.
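The annual-salary and age examples from the list above can be sketched in Python. The class names, pay figures and the bonus rule are hypothetical; the point is only that every employee object answers the same message while the method body differs.

```python
from datetime import date


class Employee:
    def __init__(self, name, monthly_pay, birth_date):
        self.name = name
        self.monthly_pay = monthly_pay
        self.birth_date = birth_date

    def annual_salary(self):
        # Default computation shared by ordinary employees such as tellers.
        return 12 * self.monthly_pay

    def age(self, today):
        # Derived, read-only value computed from birth_date instead of stored.
        before_birthday = (today.month, today.day) < (
            self.birth_date.month,
            self.birth_date.day,
        )
        return today.year - self.birth_date.year - before_birthday


class Manager(Employee):
    def __init__(self, name, monthly_pay, birth_date, bonus):
        super().__init__(name, monthly_pay, birth_date)
        self.bonus = bonus

    def annual_salary(self):
        # Same message, different method body.
        return 12 * self.monthly_pay + self.bonus


staff = [
    Employee("Teller A", 2000, date(1990, 6, 15)),
    Manager("Manager B", 5000, date(1980, 1, 10), bonus=10000),
]
salaries = [e.annual_salary() for e in staff]
```

The caller never needs to know which kind of employee it holds, and age could later be replaced by a stored variable without changing any code that sends the message.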

ODMG









XML DATABASE
• XML (eXtensible Markup Language) is a markup language.
• XML is designed to store and transport data.
• XML was released in the late 1990s. It was created to provide an easy way to use
and store self-describing data.
An XML database is a data persistence software system used for storing large
amounts of information in XML format. It provides a secure place to store XML
documents.
You can query the stored data by using XQuery, and export and serialize it into the
desired format. XML databases are usually associated with document-oriented databases.
Types of XML databases
There are two types of XML databases.
1. XML-enabled database
2. Native XML database (NXD)
XML-enabled Database
An XML-enabled database works just like a relational database. It is like an extension
provided for the conversion of XML documents. In this database, data is stored in
tables, in the form of rows and columns.
Native XML Database
A native XML database is used to store large amounts of data. Instead of a table format,
a native XML database is based on a container format. You can query data with XPath
expressions.
A native XML database is preferred over an XML-enabled database because it is highly
capable of storing, maintaining and querying XML documents.
Let's take an example of XML database:

<?xml version="1.0"?>
<contact-info>
  <contact1>
    <name>Vimal Jaiswal</name>
    <company>SSSIT.org</company>
    <phone>(0120) 4256464</phone>
  </contact1>
  <contact2>
    <name>Mahesh Sharma</name>
    <company>SSSIT.org</company>
    <phone>09990449935</phone>
  </contact2>
</contact-info>
In the above example, a root element named contact-info holds the contacts
(contact1 and contact2). Each one contains three elements: name, company and phone.
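Querying such a document with path expressions can be sketched with Python's standard xml.etree.ElementTree module, which supports a limited subset of XPath. This is only an illustration on an in-memory string, not a native XML database.

```python
import xml.etree.ElementTree as ET

document = """<?xml version="1.0"?>
<contact-info>
  <contact1>
    <name>Vimal Jaiswal</name>
    <company>SSSIT.org</company>
    <phone>(0120) 4256464</phone>
  </contact1>
  <contact2>
    <name>Mahesh Sharma</name>
    <company>SSSIT.org</company>
    <phone>09990449935</phone>
  </contact2>
</contact-info>"""

root = ET.fromstring(document)

# XPath-style path: every <name> element two levels below the root.
names = [n.text for n in root.findall("./*/name")]

# Look up one contact's phone number by following element names.
phone = root.findtext("contact2/phone")
```

Here names holds both contact names in document order, and phone holds the number stored under contact2.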
Features and Advantages of XML
1) XML separates data from HTML
2) XML simplifies data sharing
3) XML simplifies data transport
XML DOCUMENT
XML is the basis for all new-generation data interchange formats.
XML provides detailed information about the structure and meaning of the data in
web pages.
It is suited for:
• Structured data
• Semi-Structured data
• Unstructured data
Structured data
In structured data, all information is stored in a predefined format. There
are no deviations in the arrangement of the data or in the operations performed on
it. Data stored in a database in a specified format is normally termed structured
data.
Example:
Data retrieval about department and course offered in college.
Semi-Structured data
In some applications, data is collected in an ad hoc manner. Since data is collected
from different sources, the formats of the data also vary. However, data collected in
different formats is stored in a known format. Schema information is not known
directly, but it can be inferred from the data values present in the attributes. For
this reason, this type of data is called self-describing data.

Unstructured data
User text entry in a web search is the best example of unstructured data. A text
document that has information embedded in it is called unstructured data.
The nested structure of tags helps to gain in-depth knowledge about the database, but
it may lead to redundancy.
Two concepts used to construct an XML document are:
1) Element
2) Attribute
Attributes in XML are strings that do not contain markup and appear only once in
a given tag. XML is used to exchange data between applications with the help of a
mechanism called namespaces. The namespace mechanism enables globally unique
names to be used for element tags in a document.
Example: Web address
https://www.annauniv.edu
XML data has to be exchanged between organizations, so namespaces are used to
keep element names unique.
<college xmlns:au="//www.au.com">
<au:department>
<au:course>CSE</au:course>
<au:Block>Milton</au:Block>
</au:department>
</college>
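How a namespace prefix expands to a globally unique name can be seen with a short Python sketch using xml.etree.ElementTree. The document is the college example with its quotes and closing tags repaired, and the ns mapping is just a local convenience for writing paths.

```python
import xml.etree.ElementTree as ET

document = """
<college xmlns:au="//www.au.com">
  <au:department>
    <au:course>CSE</au:course>
    <au:Block>Milton</au:Block>
  </au:department>
</college>
"""

root = ET.fromstring(document)

# Map the prefix used in our paths to the namespace name declared in the document.
ns = {"au": "//www.au.com"}
course = root.find("au:department/au:course", ns).text

# Internally the prefix is replaced by the full namespace name in braces.
expanded_tag = root[0].tag
```

The parser stores the department element under the expanded name, so two organizations that both use a tag called department remain distinguishable as long as their namespace names differ.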
XML HIERARCHICAL DATA MODEL
An XML document can be modeled as a tree structure using nodes. Each node has a
corresponding element and attributes. The terms parent, child, ancestor, sibling and
descendant are used to describe the XML tree model.
There are two types of elements: complex elements, which are internal nodes, and
simple elements, which are leaf nodes.
Simple elements contain data values, whereas complex elements are constructed
hierarchically from other elements.
Example:
<bookstore>
<book category="children">
<title>Harry Potter</title>
<author>J K. Rowling</author>
<year>2005</year>
<price>29.99</price>
</book>
<book category="web">
<title>Learning XML</title>
<author>Erik T. Ray</author>
<year>2003</year>
</book>
</bookstore>
Here title, author, year, and price are simple elements that hold data values, while
bookstore and book are complex elements constructed from other elements.
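The distinction between simple and complex elements can be checked mechanically. The following is a small sketch, assuming Python's standard xml.etree.ElementTree parser; the function name classify is hypothetical. An element is reported as complex when it has child elements and as simple when it is a leaf.

```python
import xml.etree.ElementTree as ET

XML_TEXT = """
<bookstore>
  <book category="children">
    <title>Harry Potter</title>
    <author>J K. Rowling</author>
    <year>2005</year>
    <price>29.99</price>
  </book>
  <book category="web">
    <title>Learning XML</title>
    <author>Erik T. Ray</author>
    <year>2003</year>
  </book>
</bookstore>
"""

def classify(element, depth=0, out=None):
    """Walk the tree depth-first and record (depth, tag, kind) per element."""
    if out is None:
        out = []
    # len(element) is the number of child elements of this node.
    kind = "complex" if len(element) > 0 else "simple"
    out.append((depth, element.tag, kind))
    for child in element:
        classify(child, depth + 1, out)
    return out

rows = classify(ET.fromstring(XML_TEXT))
for depth, tag, kind in rows:
    print("  " * depth + f"{tag} ({kind})")
```

The indentation of the printed lines mirrors the parent and child levels of the tree.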
It is possible to characterize three main types of XML documents:
Data-centric XML documents. These documents have many small
data items that follow a specific structure and hence may be extracted from a
structured database. They are formatted as XML documents in order to exchange
them over or display them on the Web. These usually follow a predefined schema
that defines the tag names.
Document-centric XML documents. These are documents with
large amounts of text, such as news articles or books. There are few or no
structured data elements in these documents.
Hybrid XML documents. These documents may have parts that
contain structured data and other parts that are predominantly textual or
unstructured. They may or may not have a predefined schema.
XML document Schema
A schema specifies the overall structure of an XML document. An XML document
schema specifies:
 Information being stored

 Type of information being stored.
An XML schema lets applications on the Web interpret the structure of a
document and navigate it correctly.
A schema document identifies the specific set of XML schema language elements (tags)
it uses by referring to a file stored at a web site location, for example:
http://www.au.deu/2005/XMLSchema
Each such definition is called an XML namespace, because it defines the set of names
that can be used. The xmlns attribute binds this location to a prefix (such as xsd),
and the prefix is then used on all XML schema tags.
There are two types of XML document:
1) Well-formed XML
2) Valid XML

A well-formed XML document has a single root element, and every element
must have a start tag and an end tag that are properly nested within the start and end
tags of its parent element. The DOM (Document Object Model) allows programs to
manipulate the tree representation of a well-formed XML document; a DOM parser reads
the entire document into a tree structure.
A valid XML document is well-formed and also satisfies the structure specified in an
XML schema file or an XML DTD (Document Type Definition).
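A quick way to see the well-formedness rules in action is to hand a few strings to a DOM parser. The sketch below assumes Python's standard xml.dom.minidom module; the helper name is_well_formed and the sample strings are made up for illustration. Note that this checks well-formedness only: the standard library parser does not validate a document against a DTD or an XML schema.

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError

def is_well_formed(text):
    """Return True if text parses as XML, False if the parser rejects it."""
    try:
        minidom.parseString(text)
        return True
    except ExpatError:
        return False

good = "<note><to>Tove</to><from>Jani</from></note>"
bad_nesting = "<note><to>Tove</from></to></note>"   # end tags do not match
two_roots = "<note/><note/>"                         # more than one root element

print(is_well_formed(good), is_well_formed(bad_nesting), is_well_formed(two_roots))

# DOM access: the whole document is loaded into memory as a tree of nodes.
dom = minidom.parseString(good)
root = dom.documentElement
child_names = [child.tagName for child in root.childNodes]
print(root.tagName, child_names)
```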
Example: XML Schema for project
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
<xsd:annotation>
<xsd:documentation xml:lang="en">project expo</xsd:documentation>
</xsd:annotation>
<xsd:complexType name="Name">
<xsd:sequence>
<xsd:element name="first_name" type="xsd:string"/>
<xsd:element name="last_name" type="xsd:string"/>
<xsd:element name="middle_name" type="xsd:string"/>
</xsd:sequence>
</xsd:complexType>
<!-- the type name "Employee" is assumed -->
<xsd:complexType name="Employee">
<xsd:sequence>
<xsd:element name="emp_id" type="xsd:integer"/>
<xsd:element name="f_name" type="xsd:string"/>
<xsd:element name="dept" type="xsd:string"/>
</xsd:sequence>
</xsd:complexType>
</xsd:schema>
xsd:key is used to specify a primary key, and xsd:keyref is used to specify a
foreign key that refers to it.
xsd:annotation and xsd:documentation are used for providing comments and
other descriptions in the XML schema document.
The structure of a root element (for example, a company element) is specified with
xsd:complexType.


DTD (Document Type Definition)
A DTD is a Document Type Definition.
A DTD defines the structure and the legal elements and attributes of an XML
document.
Use a DTD
With a DTD, independent groups of people can agree on a standard DTD for
interchanging data.
An application can use a DTD to verify that XML data is valid.
Syntax
Basic syntax of a DTD is as follows −
<!DOCTYPE element DTD identifier

[
declaration1
declaration2
........
]>
XML document with an internal DTD
<?xml version="1.0"?>
<!DOCTYPE note [

<!ELEMENT note (to,from,heading,body)>

<!ELEMENT to (#PCDATA)>
<!ELEMENT from (#PCDATA)>
<!ELEMENT heading (#PCDATA)>
<!ELEMENT body (#PCDATA)>
]>
<note>
<to>Tove</to>
<from>Jani</from>
<heading>Reminder</heading>
<body>Don't forget me this weekend</body>
</note>
An External DTD Declaration
If the DTD is declared in an external file, the <!DOCTYPE> definition must

contain a reference to the DTD file:
XML document with a reference to an external DTD
<?xml version="1.0"?>
<!DOCTYPE note SYSTEM "note.dtd">
<note>
<to>Tove</to>
<from>Jani</from>
<heading>Reminder</heading>
<body>Don't forget me this weekend!</body>
</note>

Limitations
1) Data types in a DTD are not very general; element content is essentially treated
as a string.
2) A DTD has its own special syntax, so it needs a specialized processor.
3) Elements are forced to follow the ordering specified in the DTD; unordered
elements are not permitted.
4) As a result, it is difficult to specify unordered sets of subelements.
XML Schema
An XML Schema describes the structure of an XML document, just like a DTD.
An XML document with correct syntax is called "Well Formed".
An XML document validated against an XML Schema is both "Well Formed" and
"Valid".
XML Schema is an XML-based alternative to DTD:
<xs:element name="note">
<xs:complexType>
<xs:sequence>
<xs:element name="to" type="xs:string"/>
<xs:element name="from" type="xs:string"/>
<xs:element name="heading" type="xs:string"/>
<xs:element name="body" type="xs:string"/>
</xs:sequence>
</xs:complexType>

</xs:element>
The Schema above is interpreted like this:
 <xs:element name="note"> defines the element called "note"

 <xs:complexType> the "note" element is a complex type
 <xs:sequence> the complex type is a sequence of elements
 <xs:element name="to" type="xs:string"> the element "to" is of type string
(text)
 <xs:element name="from" type="xs:string"> the element "from" is of type
string
 <xs:element name="heading" type="xs:string"> the element "heading" is of
type string
 <xs:element name="body" type="xs:string"> the element "body" is of type
string
XML Schemas are More Powerful than DTD
 XML Schemas are written in XML

 XML Schemas are extensible to additions
 XML Schemas support data types
 XML Schemas support namespace
Use an XML Schema
With XML Schema, your XML files can carry a description of their own format.
With XML Schema, independent groups of people can agree on a standard for
interchanging data.
With XML Schema, you can verify data.
XML Schemas Support Data Types
One of the greatest strengths of XML Schemas is the support for data types:
 It is easier to describe document content

 It is easier to define restrictions on data
 It is easier to validate the correctness of data
 It is easier to convert data between different data types
XML Schemas use XML Syntax
Another great strength about XML Schemas is that they are written in XML:
 don't have to learn a new language

 can use your XML editor to edit your Schema files
 can use your XML parser to parse your Schema files
 can manipulate your Schemas with the XML DOM
 can transform your Schemas with XSLT
XQUERY
XQuery is to XML what SQL is to databases.
XQuery is designed to query XML data.
What is XQuery?

 XQuery is the language for querying XML data

 XQuery for XML is like SQL for databases
 XQuery is built on XPath expressions
 XQuery is supported by all major databases
 XQuery is a W3C Recommendation
The XML Example Document
We will use the following XML document in the examples below.
"books.xml":
<?xml version="1.0" encoding="UTF-8"?>
<bookstore>
<book category="COOKING">
<title lang="en">Everyday Italian</title>
<author>Giada De Laurentiis</author>
<year>2005</year>
</book>
<book category="CHILDREN">
<title lang="en">Harry Potter</title>
<author>J K. Rowling</author>
<year>2005</year>
</book>
<book category="WEB">
<title lang="en">XQuery Kick Start</title>
<author>James McGovern</author>
<author>Per Bothner</author>
<author>Kurt Cagle</author>
<author>James Linn</author>
<author>Vaidyanathan Nagarajan</author>
<year>2003</year>
</book>
<book category="WEB">
<title lang="en">Learning XML</title>
<author>Erik T. Ray</author>
<year>2003</year>
</book>
</bookstore>
Path Expressions
XQuery uses path expressions to navigate through elements in an XML document.
The following path expression is used to select all the title elements in the
"books.xml" file:

doc("books.xml")/bookstore/book/title
(/bookstore selects the bookstore element, /book selects all the book elements
under the bookstore element, and /title selects all the title elements under each
book element)
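For readers without an XQuery processor, the same navigation can be tried in Python. This is only a sketch under assumptions: it uses the standard xml.etree.ElementTree module, which supports a small subset of XPath rather than XQuery, and it embeds a shortened copy of books.xml as a string (BOOKS_XML is a name chosen here).

```python
import xml.etree.ElementTree as ET

# Shortened copy of books.xml: only the title of each book is kept.
BOOKS_XML = """
<bookstore>
  <book category="COOKING"><title lang="en">Everyday Italian</title></book>
  <book category="CHILDREN"><title lang="en">Harry Potter</title></book>
  <book category="WEB"><title lang="en">XQuery Kick Start</title></book>
  <book category="WEB"><title lang="en">Learning XML</title></book>
</bookstore>
"""

# The parsed root is the <bookstore> element, so the remaining steps of
# /bookstore/book/title are "book" and then "title".
root = ET.fromstring(BOOKS_XML)
titles = [title.text for title in root.findall("./book/title")]
print(titles)
```

Each step of the path narrows the selection, exactly as described for /bookstore, /book, and /title above.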
XQuery Data Types
XQuery shares the same data types as XML Schema 1.0 (XSD).
XSD String
XSD Date
XSD Numeric
XSD Misc
Selecting and Filtering Elements
Elements can be selected and filtered with either a path expression or a FLWOR
expression.
Look at the following FLWOR expression (it assumes that each book element has a price
child element):
for $x in doc("books.xml")/bookstore/book
where $x/price>30
order by $x/title
return $x/title
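The four clauses used in this expression can be mimicked step by step in Python, which may help in seeing what each clause contributes. This is a sketch, not an XQuery implementation: it uses the standard xml.etree.ElementTree module, and the price values in the sample data are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Assumption: the price values below are illustrative sample data.
BOOKS_XML = """
<bookstore>
  <book><title>Everyday Italian</title><price>30.00</price></book>
  <book><title>Harry Potter</title><price>29.99</price></book>
  <book><title>XQuery Kick Start</title><price>49.99</price></book>
  <book><title>Learning XML</title><price>39.95</price></book>
</bookstore>
"""

root = ET.fromstring(BOOKS_XML)

# for $x in doc("books.xml")/bookstore/book
books = root.findall("./book")
# where $x/price > 30
expensive = [b for b in books if float(b.findtext("price")) > 30]
# order by $x/title
expensive.sort(key=lambda b: b.findtext("title"))
# return $x/title
result = [b.findtext("title") for b in expensive]
print(result)   # ['Learning XML', 'XQuery Kick Start']
```

A book priced at exactly 30.00 is not returned, because the condition is strictly greater than 30.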
 for - (optional) binds a variable to each item returned by the in expression

 let - (optional)

 where - (optional) specifies a criteria

 order by - (optional) specifies the sort-order of the result
 return - specifies what to return in the result
XPath
XPath is a major element in the XSLT standard.
XPath can be used to navigate through elements and attributes in an XML

document.
 XPath stands for XML Path Language

 XPath uses "path like" syntax to identify and navigate
nodes in an XML document
 XPath contains over 200 built-in functions
 XPath is a major element in the XSLT standard
XPath Path Expressions
XPath uses path expressions to select nodes or node-sets in an XML document.
These path expressions look very much like the path expressions you use with
traditional computer file systems.

XPath Standard Functions
XPath includes over 200 built-in functions.
There are functions for string values, numeric values, booleans, date and time
comparison, node manipulation, sequence manipulation, and much more.
Today XPath expressions can also be used in JavaScript, Java, XML Schema,
PHP, Python, C and C++, and lots of other languages.
The XML Example Document
We will use the following XML document in the examples below.
<?xml version="1.0" encoding="UTF-8"?>
<bookstore>
<book>
<title lang="en">Harry Potter</title>
</book>
<book>
<title lang="en">Learning XML</title>
</book>
</bookstore>
Selecting Nodes
XPath uses path expressions to select nodes in an XML document. The node is
selected by following a path or steps. The most useful path expressions are listed
below:
Expression   Description
nodename     Selects all nodes with the name "nodename"
/            Selects from the root node
//           Selects nodes in the document from the current node that
             match the selection no matter where they are
.            Selects the current node
..           Selects the parent of the current node
@            Selects attributes
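Most of these expressions can be tried with Python's standard xml.etree.ElementTree module. The sketch below comes with two caveats: ElementTree implements only a subset of XPath, so paths are written relative to the parsed root element, and attributes can be tested inside a predicate but not selected as nodes.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring("""
<bookstore>
  <book><title lang="en">Harry Potter</title></book>
  <book><title lang="en">Learning XML</title></book>
</bookstore>
""")

# nodename : child elements of the current node with that name
books = root.findall("book")
# //       : matching elements at any depth below the current node
titles = root.findall(".//title")
# .        : the current node itself
same = root.find(".")
# ..       : the parent of each selected node (here, the parents of the titles)
parents = root.findall(".//title/..")
# @        : attribute test inside a predicate
english = root.findall(".//title[@lang='en']")
langs = [t.get("lang") for t in titles]

print(len(books), [t.text for t in titles], [p.tag for p in parents], langs)
```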
Selecting Several Paths
By using the | operator in an XPath expression you can select several paths.
In the table below we have listed some path expressions and the result of the
expressions:

Path Expression                    Result
//book/title | //book/price        Selects all the title AND price elements of all
                                   book elements
//title | //price                  Selects all the title AND price elements in the
                                   document
/bookstore/book/title | //price    Selects all the title elements of the book element
                                   of the bookstore element AND all the price
                                   elements in the document
Location Path Expression
A location path can be absolute or relative.
An absolute location path starts with a slash ( / ) and a relative location path does
not. In both cases the location path consists of one or more steps, each separated by
a slash:
An absolute location path:
/step/step/...
A relative location path:
step/step/...
Each step is evaluated against the nodes in the current node-set.
A step consists of:

 an axis (defines the tree-relationship between the selected nodes and the
current node)
 a node-test (identifies a node within an axis)
 zero or more predicates (to further refine the selected node-set)
The syntax for a location step is:
axisname::nodetest[predicate]
INFORMATION RETRIEVAL
Information retrieval, as the name implies, concerns the retrieving of relevant

information from databases. It is basically concerned with facilitating the user's
access to large amounts of (predominantly textual) information. The process of
information retrieval involves the following stages:
1. Representing Collections of Documents - how to represent, identify and process

the collection of documents.
2. User-initiated querying - understanding and processing of the queries.
3. Retrieval of the appropriate documents - the searching mechanism used to obtain
and retrieve the relevant documents
IR Components
An IR system makes use of the following components to perform information retrieval.
User query: the text entered by the user to search for information.
Text operations: receive the text entered by the user and convert it into tokens
that serve as keywords for matching during the search.

Indexing: maps each generated token to pointers to the documents that contain it.
Based on how well the query tokens match these pointers, candidate query responses
are generated.
Ranking: the query responses are ranked according to a relevance metric.
User interface: manages user interaction by processing input and visualizing
output.
Before indexing, a logical view of the documents must be created by the query
manager. Different index structures may be used for the text database.
The time and storage space spent on defining the text database and building the index
are amortized by querying the retrieval system many times.
The search text provided by the user is sent to text acquisition, which gathers all the
search tokens. The matching tokens are then used to identify pointers to documents. The
best-matching results are selected, and an index is created for those links
and stored in the index DB together with their rankings.

The document data store holds the metadata for all documents, including document
type, structure, length, and so on.
The search keywords are called index terms. The set of terms
indexed for a document is called the document vocabulary. The index is updated
whenever a new search is performed.
The inverted file (inverted index) stores the index, whereas the document file stores
the documents.
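The text operation and indexing components described above can be illustrated with a very small inverted index. This is a sketch under assumptions: the three documents, the stop-word list, and the function names are all made up for illustration, and a real IR system would also store term positions and frequencies.

```python
from collections import defaultdict
import re

# Hypothetical three-document collection and stop-word list.
DOCS = {
    1: "Distributed databases store data at several sites",
    2: "XML databases store semi-structured data",
    3: "Information retrieval ranks documents for a query",
}
STOP_WORDS = {"a", "at", "for", "the", "of"}

def tokenize(text):
    """Text operation: lowercase, split into words, drop stop words."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [w for w in words if w not in STOP_WORDS]

def build_inverted_index(docs):
    """Map each index term to the sorted list of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

index = build_inverted_index(DOCS)
print(index["databases"])   # [1, 2]
print(index["query"])       # [3]
print(index.get("a"))       # None, because "a" is a stop word
```

Looking up a query term in the index gives the candidate documents directly, without scanning the document file.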
Query processing in IR system involves the following steps
 User interface
 Query process
 Ranking and evaluation
 Delivering information
Whenever a user queries a search engine, it processes the query, evaluates
the matching documents, ranks them, and uses the index to display the
corresponding search results to the user. Log data is maintained to keep track of the
operations performed during information retrieval.

Relevance Ranking
Relevance ranking orders the retrieved documents by their estimated relevance to the
query.
Different relevance ranking approaches are:
(i) Relevance using terms
 Ranking using TF-IDF
 Similarity based retrieval
(ii) Relevance using Hyperlinks

 Popularity rank
 Page rank
 Search engine spamming
(iii) Synonyms, Homonyms and ontologies
(iv) Indexing of documents

Ranking using TF-IDF:

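For a term t and a document d, the number of occurrences of t in d is used to measure
the relevance of d to t.
TF: Term Frequency [relevance of a document to a term]
IDF: Inverse Document Frequency [weight assigned to a term]
TF(d,t) = log(1 + n(d,t)/n(d))
IDF(t) = 1/n(t)
Here n(d,t) is the number of occurrences of term t in document d, n(d) is the number
of terms in d, and n(t) is the number of documents that contain t.
Stop words are common words (such as "a", "the", or "of") that an IR system
ignores when indexing a document.
Proximity refers to how close the query terms are to each other in a document.
Documents in which the terms occur close together are given higher priority.
The relevance of a document d to a query Q is
r(d,Q) = sum over all terms t in Q of TF(d,t) * IDF(t)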
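The formulas above translate almost line by line into code. The sketch below is an illustration under assumptions: the three tokenized documents and the query are invented, and the function names tf, idf, and relevance are chosen here to mirror TF(d,t), IDF(t), and r(d,Q).

```python
import math

# Hypothetical tokenized documents.
DOCS = {
    "d1": ["xml", "schema", "xml", "dtd"],
    "d2": ["xml", "query", "language"],
    "d3": ["page", "rank", "query", "query"],
}

def tf(doc_terms, term):
    """TF(d,t) = log(1 + n(d,t)/n(d))."""
    return math.log(1 + doc_terms.count(term) / len(doc_terms))

def idf(docs, term):
    """IDF(t) = 1/n(t); taken as 0 when no document contains the term."""
    n_t = sum(1 for terms in docs.values() if term in terms)
    return 1 / n_t if n_t else 0.0

def relevance(docs, doc_id, query_terms):
    """r(d,Q) = sum over t in Q of TF(d,t) * IDF(t)."""
    return sum(tf(docs[doc_id], t) * idf(docs, t) for t in query_terms)

query = ["xml", "dtd"]
ranking = sorted(DOCS, key=lambda d: relevance(DOCS, d, query), reverse=True)
print(ranking)   # ['d1', 'd2', 'd3']
```

Document d1 ranks first because it contains both query terms, and the rarer term dtd carries the larger weight.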
Similarity based Retrieval
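The user provides a document A to the system, and the system returns the documents that
are most similar to A.
One approach is to pick the terms t in A with the highest values of
TF(A,t)*IDF(t) and use them to rank the other documents.
If the user marks some of the returned documents as relevant and the system searches
again for documents similar to them, the process is called relevance feedback.
Representing documents as points or vectors in an n-dimensional space is called the
vector space model.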
Popularity Ranking
It is also called prestige ranking. It gives higher priority to pages that are popular
on the Web.
Pages that many users store in their bookmark files can be considered popular. A page
that is linked to from other popular websites is another example.
The popularity of a page can be measured by the number of pages that link to it.

Page Rank
Google's PageRank measures the rank of a page based on the links pointing to that page.
It can be understood through the random walk model. The PageRank algorithm itself does
not take the query keywords into account; this is addressed by also using the keywords
in the anchor text of links to a page.
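The random walk model can be made concrete with a short power-iteration sketch. Everything specific here is an assumption for illustration: the four-page link graph, the damping value 0.85 (a commonly quoted choice), the iteration count, and the function name pagerank. It is not presented as the exact algorithm used by any search engine.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power iteration for the random walk model.

    links maps each page to the list of pages it links to. With probability
    `damping` the walker follows a random outgoing link; otherwise it jumps
    to a page chosen uniformly at random.
    """
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, targets in links.items():
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank evenly.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank

# Hypothetical four-page web graph: C is linked to by A, B, and D.
LINKS = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(LINKS)
best = max(ranks, key=ranks.get)
print(best)   # C
```

Page C receives the highest rank because the most pages link to it, and page A benefits in turn because C links to it. Note that no query keyword appears anywhere in the computation.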
Search Engine Spamming
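Spamming refers to techniques by which websites that are not actually popular obtain a
high relevance rank for some queries.
Synonyms
Retrieval based on synonyms or on the context of the query terms is also used in
practice.
Retrieval effectiveness is measured by:
Percentage of false negatives (relevant documents that were not retrieved)
Percentage of false positives (irrelevant documents that were retrieved).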
Information Retrieval Models
The classical models are the Boolean model, the vector model, the probabilistic
model, the fuzzy model, and the semantic model.
Boolean Model: In the Boolean model, index term weights are binary, i.e. in {0,1}. It
is a simple model based on set theory.

Each index term is either present in or absent from a document. The Boolean
set-theoretic operations are AND, OR, and NOT. A query written as a Boolean expression
can be represented as a disjunction of conjunctive vectors. For example, the query
ba AND (bb OR NOT bc) becomes
Q = (1,1,1) V (1,1,0) V (1,0,0)
where each component is a binary weighted vector
associated with the tuple (ba,bb,bc).
Advantages:
1. Simplicity of model
Disadvantages
1. Retrieval performance is low.
Vector Space Model
Documents are represented as vectors of term weights in an n-dimensional vector
space of terms.
The weight associated with a pair (Ki,dj) of index term and document is positive and
non-binary.
The cosine of the angle between the query vector and the document vector is commonly
used for assessing similarity.
The term weights are used to compute the degree of similarity between each document
and the user query.
Query vector
Q = (w1,Q, w2,Q, ..., wn,Q)

where n is the total number of index terms in the collection.

Advantages:
1. Term weighting improves retrieval performance.
2. Partial matching retrieves documents that approximate the query conditions.
3. Documents are ranked by their degree of similarity to the query.
Disadvantage:
Index terms are assumed to be mutually independent.
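As a worked illustration of the cosine measure used by the vector space model, the sketch below compares a query vector with three document vectors. The weights are hypothetical and the function name cosine_similarity is chosen here; in practice the weights would come from a scheme such as TF-IDF.

```python
import math

def cosine_similarity(u, v):
    """cos(theta) = (u . v) / (|u| * |v|); 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical weights over three index terms (k1, k2, k3).
query = [1.0, 1.0, 0.0]
documents = {
    "d1": [2.0, 2.0, 0.0],   # same direction as the query
    "d2": [1.0, 0.0, 1.0],   # shares only k1 with the query
    "d3": [0.0, 0.0, 3.0],   # shares no term with the query
}
scores = {d: cosine_similarity(query, w) for d, w in documents.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)   # ['d1', 'd2', 'd3']
```

Document d1 scores close to 1 even though its weights are twice those of the query, because the cosine depends on the direction of the vectors and not on their length.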
QUERIES IN IR SYSTEMS
Keywords generally consist of words, phrases, author names, creation dates,
last-updated dates, and so on. Indexes are created based on these keywords. The user
query is compared with the set of index keywords, and the matching results are returned
for the query.
Keyword-based queries: Queries are combinations of
words. The document collection is searched for documents that contain these
words. Word queries are intuitive, easy to express and provide fast ranking. The
concept of word must be defined. A word is a sequence of letters terminated by a
separator (period, comma, blank, etc). Definition of letter and separator is flexible;
e.g., hyphen could be defined as a letter or as a separator. Usually, “trivial
words” (such as “a”, “the”, or “of”) are ignored.
Boolean queries: Describe the information needed by relating multiple words with
Boolean operators. Operators: and, or, except (except corresponds to and not).
Semantics: For each query word w, a corresponding set Dw is constructed that
includes the documents that contain w. The Boolean expression is then interpreted
as an expression on the corresponding document sets with the corresponding set
operators: and corresponds to intersection, or to union, and except to difference.
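The set semantics of Boolean queries maps directly onto set operations in code. The following sketch uses a made-up three-document collection; the function name doc_set stands for the set Dw described above.

```python
# Hypothetical document collection.
DOCS = {
    1: "venetian blind installation guide",
    2: "history of venetian painters",
    3: "window blind and curtain catalogue",
}

def doc_set(word):
    """D_w: the set of ids of documents that contain the word w."""
    return {doc_id for doc_id, text in DOCS.items() if word in text.split()}

both = doc_set("venetian") & doc_set("blind")            # and    -> intersection
either = doc_set("venetian") | doc_set("blind")          # or     -> union
only_venetian = doc_set("venetian") - doc_set("blind")   # except -> difference
print(both, either, only_venetian)   # {1} {1, 2, 3} {2}
```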
Basic queries
Single-word queries: A query is a single word. This is the simplest form of
query. All documents that include this word are retrieved. Documents may be
ranked by the frequency of this word in the document.
Phrase queries: A query is a sequence of words treated as a single unit. Also
called “literal string” or “exact phrase” query. The phrase is usually surrounded by
quotation marks. All documents that include this phrase are retrieved. Usually,
separators (commas, colons, etc.) and “trivial words” (e.g., “a”, “the”, or “of”) in
the phrase are ignored. In effect, this query is for a set of words that must appear in
sequence. It allows users to specify a context and thus gain precision. Example:
“United States of America”.
Multiple-word queries: A query is a set of words (or phrases). Two
interpretations: A document is retrieved if it includes any of the query words, or a
document is retrieved if it includes each of the query words. Documents may be
ranked by the number of query words they contain: A document containing n query
words is ranked higher than a document containing m < n query words. Documents
containing all the query words are ranked at the top. Documents containing only
one query word are ranked at the bottom. Frequency counts may still be used to break
ties among documents that contain the same query words. Example: The
phrase “Venetian blind” finds documents that discuss Venetian blinds. The
set (Venetian, blind) finds in addition documents that discuss blind Venetians.
Proximity queries: Restrict the distance within a document between two search
terms. Important for large documents in which the two search words may appear in
different contexts. Proximity specifications limit the acceptable occurrences and
hence increase the precision of the search. General format: Word1 within m units
of Word2. Unit may be character, word, paragraph, etc.
Examples:
united within 5 words of american: Finds documents that discuss
“United Airlines and American Airlines” but not “United States of America and
the American dream”.
nuclear within 0 paragraphs of cleanup: Finds documents
that discuss “nuclear” and “cleanup” in the same paragraph.
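The first example can be checked with a small sketch of a word-level proximity test. The function name within and the exact-word matching rule are assumptions made here: the unit is the word, the distance is the difference between word positions, and "america" does not count as a match for "american".

```python
import re

def within(text, word1, word2, max_words):
    """True if word1 and word2 occur at most max_words word positions apart."""
    words = re.findall(r"[a-z]+", text.lower())
    pos1 = [i for i, w in enumerate(words) if w == word1]
    pos2 = [i for i, w in enumerate(words) if w == word2]
    return any(abs(i - j) <= max_words for i in pos1 for j in pos2)

near = "United Airlines and American Airlines"
far = "United States of America and the American dream"

print(within(near, "united", "american", 5))   # True: 3 positions apart
print(within(far, "united", "american", 5))    # False: 6 positions apart
```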
PART-A(2 MARK)
1.Define a distributed database system.

A distributed database system consists of a collection of sites,
connected together via some kind of communications network, in which:
a. Each site is a full database system site in its own right, but
b. The sites have agreed to work together so that a user at any site can
access data anywhere in the network exactly as if the data were all
stored at the user's own site.
2.Define a distributed database management system.(R)(NOV2016)

A new software component at each site logically an extension of the
local DBMS provides the necessary partnership functionality, and it is the
combination of these new components together with the existing DBMSs that
constitutes what is usually called the distributed database management
system
3.What are the advantages of distributed databases?

It enables the structure of the database to mirror the structure of the
enterprise: local data can be kept locally, where it most logically belongs,
while at the same time remote data can be accessed when necessary.
4.What is the fundamental principle of distributed database?

The fundamental principle of distributed database is that, to the user, a
distributed system should look exactly like a non-distributed system.
5.What is OODBMS?
Object-oriented database management systems (OODBMSs) combine

database capabilities with object-oriented programming language

capabilities. OODBMSs allow object-oriented programmers to develop
the product, store them as objects, and replicate or modify existing
objects to make new objects within the OODBMS. Because the database
is integrated with the programming language, the programmer can
maintain consistency within one environment, in that both the OODBMS
and the programming language will use the same model of representation.
6.What is meant by information retrieval?(R)

An information retrieval process begins when a user enters a query into
the system. Queries are formal statements of information needs, for example
search strings in web search engines. In information retrieval a query does not
uniquely identify a single object in the collection. Instead, several objects may
match the query, perhaps with different degrees of relevancy.
7.What is meant by Relevancy ranking? (NOV2014)(U)

Relevancy ranking is the process of sorting the document results so that those
documents which are most likely to be relevant to your query are shown at the
top.
8.Define Crawling. (NOV2014)(R)

Web crawling is the process of search engines combing through web
pages in order to properly index them. These "web crawlers" systematically
crawl pages and look at the keywords contained on the page, the kind of content,
and all the links on the page, and then return that information to the search engine's
server for indexing. Then they follow all the hyperlinks on the website to get to

other websites. When a search engine user enters a query, the search engine will
go to its index and return the most relevant search results based on the keywords
in the search term. Web crawling is an automated process and provides quick, up
to date data.
9.What is meant by XML Database?(R)

An XML database is a data persistence software system that allows data
to be specified, and sometimes stored, in XML format. These data can then be
queried, transformed, exported and returned to a calling system.
PART-B AND C
1. Explain about Distributed Databases. (NOV 2014)
2. Explain in detail Information retrieval and Relevance Ranking.
3. Describe about OODBMS and XML Database.
4. Explain in detail Threats and risks in Database Management System.
5. Explain in detail about IR components.
6. Write about XML schema and XPath query with example.
