
CS8492-DBMS-UNIT-5 ADVANCED TOPICS

Unit-V Advanced TOPICS


Distributed Databases: Architecture, Data Storage, Transaction Processing -
Object-based Databases: Object Database Concepts, Object-Relational Features,
ODMG Object Model, ODL, OQL - XML Databases: XML Hierarchical Model, DTD, XML
Schema, XQuery - Information Retrieval: IR Concepts, Retrieval Models, Queries in
IR Systems

DISTRIBUTED DATABASES
A distributed database is a database that is not limited to one system: it is spread over different sites, i.e., over multiple computers or a network of computers. The sites of a distributed database system do not share physical components. Such a setup may be required when a database needs to be accessed by users across the globe. It must be managed so that, to the users, it looks like one single database.
Put differently, a distributed database is a collection of multiple interconnected databases, spread physically across various locations, that communicate via a computer network.

Types:
1. Homogeneous Database:

In a homogeneous database, all sites store the database identically. The operating system, the database management system and the data structures used are the same at all sites. Hence, such databases are easy to manage.

IIYEAR/IV SEM Page 1



2. Heterogeneous Database:

In a heterogeneous distributed database, different sites can use different schemas and software, which can lead to problems in query processing and transactions. A particular site might also be completely unaware of the other sites. Different computers may use different operating systems and different database applications. They may even use different data models for the database. Hence, translations are required for the different sites to communicate.

Distributed Data Storage

There are two ways in which data can be stored on different sites:
1. Replication
In this approach, the entire relation is stored redundantly at two or more sites. If the entire database is available at all sites, it is a fully redundant database. Hence, in replication, systems maintain copies of data.
This is advantageous because it increases the availability of data at different sites. Query requests can also be processed in parallel.
However, it has disadvantages as well. The copies must be kept up to date: any change made at one site must be recorded at every site where that relation is stored, or the copies become inconsistent. This is a lot of overhead. Concurrency control also becomes much more complex, because concurrent access must now be checked across a number of sites.
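
The trade-off described above can be sketched in a few lines of Python. This is a minimal illustration, not how a real DDBMS is implemented: the class name ReplicatedRelation, the site names and the "write to every copy, read from one copy" policy are all assumptions made for the example.

```python
class ReplicatedRelation:
    """Toy model of one relation replicated at several sites (write-all, read-one)."""

    def __init__(self, site_names):
        # One independent copy of the relation (key -> row) per site.
        self.copies = {site: {} for site in site_names}

    def write(self, key, row):
        # An update must reach every replica, otherwise the copies diverge.
        for copy in self.copies.values():
            copy[key] = dict(row)
        return len(self.copies)  # number of site updates performed

    def read(self, site, key):
        # A read is served by the local copy alone; no other site is contacted.
        return self.copies[site].get(key)


# Hypothetical sites and data.
accounts = ReplicatedRelation(["chennai", "mumbai", "delhi"])
updates = accounts.write(101, {"owner": "Asha", "balance": 500})
print(updates)                      # one write costs one update per site
print(accounts.read("delhi", 101))  # any site can answer the read locally
```

A single logical write turns into one update per site, while a read touches only the local copy, which is why replication favours read-heavy workloads.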

2. Fragmentation
In this approach, the relations are fragmented (i.e., divided into smaller parts) and each fragment is stored at the sites where it is required. The fragments must be chosen so that the original relation can be reconstructed from them (i.e., there is no loss of data).

Fragmentation is advantageous because it does not create copies of data, so consistency is not a problem.
Fragmentation of relations can be done in two ways:

• Horizontal fragmentation – Splitting by rows – The relation is fragmented into groups of tuples so that each tuple is assigned to at least one fragment.
• Vertical fragmentation – Splitting by columns – The schema of the relation is divided into smaller schemas. Each fragment must contain a common candidate key so as to ensure lossless join.
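
The two ways of fragmenting, and the matching ways of reconstructing the relation, can be sketched in plain Python. The employee relation, its columns and the choice of emp_id as the candidate key are hypothetical; the point is only that a union undoes a horizontal split and a join on the shared key undoes a vertical split.

```python
# Hypothetical relation; emp_id is the candidate key.
employee = [
    {"emp_id": 1, "name": "Asha", "city": "Chennai", "salary": 50000},
    {"emp_id": 2, "name": "Ravi", "city": "Mumbai", "salary": 42000},
    {"emp_id": 3, "name": "Meena", "city": "Chennai", "salary": 61000},
]

# Horizontal fragmentation: split by rows using a predicate on city.
chennai = [row for row in employee if row["city"] == "Chennai"]
others = [row for row in employee if row["city"] != "Chennai"]

# Reconstruction is the union of the fragments.
from_horizontal = sorted(chennai + others, key=lambda row: row["emp_id"])

# Vertical fragmentation: split by columns, repeating the key in each fragment.
personal = [{"emp_id": r["emp_id"], "name": r["name"], "city": r["city"]} for r in employee]
payroll = [{"emp_id": r["emp_id"], "salary": r["salary"]} for r in employee]

# Reconstruction is a join on the shared key.
pay_by_id = {r["emp_id"]: r for r in payroll}
from_vertical = [{**p, **pay_by_id[p["emp_id"]]} for p in personal]

print(from_horizontal == employee)
print(from_vertical == employee)
```

If the key were left out of either vertical fragment, the join would have nothing to match on, which is why each fragment must carry a common candidate key.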

In certain cases, an approach that is a hybrid of fragmentation and replication is used.

Types of Distributed Databases

Distributed databases can be broadly classified into homogeneous and heterogeneous distributed database environments, each with further sub-divisions, as shown in the following illustration.

Homogeneous Distributed Databases

In a homogeneous distributed database, all the sites use identical DBMS and operating systems. Its properties are −

• The sites use very similar software.
• The sites use identical DBMS or DBMS from the same vendor.
• Each site is aware of all other sites and cooperates with other sites to process user requests.
• The database is accessed through a single interface as if it is a single database.

Types of Homogeneous Distributed Database

There are two types of homogeneous distributed database −

• Autonomous − Each database is independent and functions on its own. The databases are integrated by a controlling application and use message passing to share data updates.
• Non-autonomous − Data is distributed across the homogeneous nodes, and a central or master DBMS co-ordinates data updates across the sites.

Heterogeneous Distributed Databases

In a heterogeneous distributed database, different sites have different operating systems, DBMS products and data models. Its properties are −

• Different sites use dissimilar schemas and software.
• The system may be composed of a variety of DBMSs like relational, network, hierarchical or object oriented.
• Query processing is complex due to dissimilar schemas.
• Transaction processing is complex due to dissimilar software.
• A site may not be aware of other sites, and so there is limited co-operation in processing user requests.

Types of Heterogeneous Distributed Databases

• Federated − The heterogeneous database systems are independent in nature and integrated together so that they function as a single database system.
• Un-federated − The database systems employ a central coordinating module through which the databases are accessed.

Distributed DBMS Architectures

DDBMS architectures are generally developed depending on three parameters −

• Distribution − It states the physical distribution of data across the different sites.
• Autonomy − It indicates the distribution of control of the database system and the degree to which each constituent DBMS can operate independently.
• Heterogeneity − It refers to the uniformity or dissimilarity of the data models, system components and databases.

Architectural Models

Some of the common architectural models are −

• Client-Server Architecture for DDBMS
• Peer-to-Peer Architecture for DDBMS
• Multi-DBMS Architecture

Client - Server Architecture for DDBMS

This is a two-level architecture in which the functionality is divided between servers and clients. The server functions primarily encompass data management, query processing, optimization and transaction management. Client functions mainly include the user interface; however, clients also perform some functions such as consistency checking and transaction management.

The two different client-server architectures are −

• Single Server Multiple Client
• Multiple Server Multiple Client (shown in the following diagram)

Peer- to-Peer Architecture for DDBMS

In these systems, each peer acts both as a client and as a server for providing database services. The peers share their resources with other peers and co-ordinate their activities.

This architecture generally has four levels of schemas −

• Global Conceptual Schema − Depicts the global logical view of data.
• Local Conceptual Schema − Depicts logical data organization at each site.
• Local Internal Schema − Depicts physical data organization at each site.
• External Schema − Depicts user view of data.

Multi - DBMS Architectures

This is an integrated database system formed by a collection of two or more autonomous database systems.

Multi-DBMS can be expressed through six levels of schemas −

• Multi-database View Level − Depicts multiple user views comprising subsets of the integrated distributed database.
• Multi-database Conceptual Level − Depicts the integrated multi-database, comprising global logical multi-database structure definitions.
• Multi-database Internal Level − Depicts the data distribution across different sites and the multi-database to local data mapping.
• Local database View Level − Depicts the public view of local data.
• Local database Conceptual Level − Depicts local data organization at each site.
• Local database Internal Level − Depicts physical data organization at each site.

There are two design alternatives for multi-DBMS −

• Model with multi-database conceptual level.
• Model without multi-database conceptual level.

Design Alternatives

The distribution design alternatives for the tables in a DDBMS are as follows −

• Non-replicated and non-fragmented
• Fully replicated
• Partially replicated
• Fragmented
• Mixed

Non-replicated & Non-fragmented

In this design alternative, different tables are placed at different sites. Data is placed close to the site where it is used most. This alternative is most suitable for database systems where the percentage of queries that need to join information in tables placed at different sites is low. If an appropriate distribution strategy is adopted, this design alternative helps to reduce the communication cost during data processing.

Fully Replicated

In this design alternative, one copy of all the database tables is stored at each site. Since each site has its own copy of the entire database, queries are very fast and incur negligible communication cost. On the other hand, the massive redundancy in data makes update operations very costly. Hence, this alternative is suitable for systems where a large number of queries must be handled but the number of database updates is low.

Partially Replicated

Copies of tables or portions of tables are stored at different sites. The tables are distributed according to the frequency of access. This takes into consideration the fact that the frequency of accessing the tables varies considerably from site to site. The number of copies of a table (or portion) depends on how frequently the access queries execute and on the sites that generate those queries.

Fragmented

In this design, a table is divided into two or more pieces referred to as fragments or partitions, and each fragment can be stored at a different site. This reflects the fact that it seldom happens that all data stored in a table is required at a given site. Moreover, fragmentation increases parallelism and provides better disaster recovery. Here, there is only one copy of each fragment in the system, i.e. no redundant data.

The three fragmentation techniques are −

• Vertical fragmentation
• Horizontal fragmentation
• Hybrid fragmentation

Mixed Distribution

This is a combination of fragmentation and partial replication. Here, the tables are initially fragmented in any form (horizontal or vertical), and then these fragments are partially replicated across the different sites according to the frequency of accessing the fragments.

Data Replication

Data replication is the process of storing separate copies of the database at two or
more sites. It is a popular fault tolerance technique of distributed databases.

Advantages of Data Replication

• Reliability − In case of failure of any site, the database system continues to work since a copy is available at another site(s).
• Reduction in Network Load − Since local copies of data are available, query processing can be done with reduced network usage, particularly during prime hours. Data updating can be done at non-prime hours.
• Quicker Response − Availability of local copies of data ensures quick query processing and consequently quick response time.
• Simpler Transactions − Transactions require fewer joins of tables located at different sites and minimal coordination across the network. Thus, they become simpler in nature.

Disadvantages of Data Replication

• Increased Storage Requirements − Maintaining multiple copies of data is associated with increased storage costs. The storage space required is a multiple of the storage required for a centralized system.
• Increased Cost and Complexity of Data Updating − Each time a data item is updated, the update needs to be reflected in all the copies of the data at the different sites. This requires complex synchronization techniques and protocols.
• Undesirable Application – Database Coupling − If complex update mechanisms are not used, removing data inconsistency requires complex co-ordination at the application level. This results in undesirable coupling between the application and the database.

Some commonly used replication techniques are −

• Snapshot replication
• Near-real-time replication
• Pull replication

Fragmentation

Fragmentation is the task of dividing a table into a set of smaller tables. The subsets of the table are called fragments. Fragmentation can be of three types: horizontal, vertical, and hybrid (a combination of horizontal and vertical). Horizontal fragmentation can further be classified into two techniques: primary horizontal fragmentation and derived horizontal fragmentation.

Fragmentation should be done in such a way that the original table can be reconstructed from the fragments whenever required. This requirement is called "reconstructiveness."

Advantages of Fragmentation

• Since data is stored close to the site of usage, efficiency of the database system is increased.
• Local query optimization techniques are sufficient for most queries since data is locally available.
• Since irrelevant data is not available at the sites, security and privacy of the database system can be maintained.

Disadvantages of Fragmentation

• When data from different fragments is required, access times may be very high.
• In case of recursive fragmentation, the job of reconstruction will need expensive techniques.
• Lack of back-up copies of data at different sites may render the database ineffective in case of failure of a site.

Vertical Fragmentation

In vertical fragmentation, the fields or columns of a table are grouped into fragments. In order to maintain reconstructiveness, each fragment should contain the primary key field(s) of the table. Vertical fragmentation can be used to enforce privacy of data.

For example, let us consider that a University database keeps records of all
registered students in a Student table having the following schema.

STUDENT

Regd_No | Name | Course | Address | Semester | Fees | Marks

Now, the fees details are maintained in the accounts section. In this case, the
designer will fragment the database as follows −

CREATE TABLE STD_FEES AS
SELECT Regd_No, Fees
FROM STUDENT;
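
As a rough check of reconstructiveness, the fragment can be built and joined back in SQLite through Python's sqlite3 module. The sample rows, the column types and the second fragment STD_INFO (holding the remaining columns plus the key) are assumptions added for illustration; only STD_FEES comes from the text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE STUDENT (Regd_No INTEGER PRIMARY KEY, Name TEXT, Course TEXT,"
    " Address TEXT, Semester INTEGER, Fees INTEGER, Marks INTEGER)"
)
conn.executemany(
    "INSERT INTO STUDENT VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        (1, "Asha", "Computer Science", "Chennai", 4, 45000, 82),
        (2, "Ravi", "Mathematics", "Madurai", 4, 30000, 74),
    ],
)

# Fragment for the accounts section, as in the text.
conn.execute("CREATE TABLE STD_FEES AS SELECT Regd_No, Fees FROM STUDENT")
# Complementary fragment (hypothetical name): remaining columns plus the key.
conn.execute(
    "CREATE TABLE STD_INFO AS"
    " SELECT Regd_No, Name, Course, Address, Semester, Marks FROM STUDENT"
)

# Reconstruct the original rows by joining the fragments on the key.
rebuilt = conn.execute(
    "SELECT i.Regd_No, i.Name, i.Course, i.Address, i.Semester, f.Fees, i.Marks"
    " FROM STD_INFO AS i JOIN STD_FEES AS f ON i.Regd_No = f.Regd_No"
    " ORDER BY i.Regd_No"
).fetchall()
original = conn.execute("SELECT * FROM STUDENT ORDER BY Regd_No").fetchall()
print(rebuilt == original)
```

Both fragments carry Regd_No, so the join recovers every original row exactly once.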

Horizontal Fragmentation

Horizontal fragmentation groups the tuples of a table according to the values of one or more fields. Horizontal fragmentation should also conform to the rule of reconstructiveness. Each horizontal fragment must have all columns of the original base table.

For example, in the student schema, if the details of all students of the Computer Science course need to be maintained at the School of Computer Science, then the designer will horizontally fragment the database as follows −

CREATE TABLE COMP_STD AS
SELECT * FROM STUDENT
WHERE Course = 'Computer Science';

Hybrid Fragmentation

In hybrid fragmentation, a combination of horizontal and vertical fragmentation techniques is used. This is the most flexible fragmentation technique since it generates fragments with minimal extraneous information. However, reconstruction of the original table is often an expensive task.

Hybrid fragmentation can be done in two alternative ways −

• First generate a set of horizontal fragments; then generate vertical fragments from one or more of the horizontal fragments.
• First generate a set of vertical fragments; then generate horizontal fragments from one or more of the vertical fragments.

The three dimensions of distribution transparency are −

• Location transparency
• Fragmentation transparency
• Replication transparency

Location Transparency

Location transparency ensures that the user can query any table(s) or fragment(s) of a table as if they were stored locally at the user's site. The fact that the table or its fragments are stored at a remote site in the distributed database system should be completely hidden from the end user. The address of the remote site(s) and the access mechanisms are completely hidden.

In order to incorporate location transparency, the DDBMS should have access to an updated and accurate data dictionary and DDBMS directory, which contain the details of the locations of data.
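
The role of the directory can be sketched as a simple lookup. This is a toy model under stated assumptions: DIRECTORY, SITES, the site names and the query function are hypothetical, and a dictionary access stands in for a real network request.

```python
# Hypothetical DDBMS directory: table or fragment name -> site that stores it.
DIRECTORY = {
    "STUDENT": "site_A",
    "STD_FEES": "site_B",
}

# Stand-ins for the remote sites; each site holds its own tables.
SITES = {
    "site_A": {"STUDENT": [(1, "Asha"), (2, "Ravi")]},
    "site_B": {"STD_FEES": [(1, 45000), (2, 30000)]},
}


def query(table_name):
    """Return all rows of a table; the caller never names a site."""
    site = DIRECTORY[table_name]    # location lookup done by the DDBMS
    return SITES[site][table_name]  # stands in for a remote fetch


print(query("STD_FEES"))
```

The caller names only the table; moving STD_FEES to another site would change the directory entry but not the query.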

Fragmentation Transparency

Fragmentation transparency enables users to query any table as if it were unfragmented. Thus, it hides the fact that the table the user is querying is actually a fragment or a union of some fragments. It also conceals the fact that the fragments are located at diverse sites.

This is somewhat similar to the use of SQL views, where the user may not know that they are using a view of a table instead of the table itself.
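
The analogy with views can be made concrete with a small SQLite sketch. It assumes two horizontal fragments, COMP_STD and a hypothetical OTHER_STD, with a reduced set of columns; both live in one in-memory database here, whereas in a real DDBMS they would sit at different sites.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Two horizontal fragments of a STUDENT table, split by course.
conn.execute("CREATE TABLE COMP_STD (Regd_No INTEGER, Name TEXT, Course TEXT)")
conn.execute("CREATE TABLE OTHER_STD (Regd_No INTEGER, Name TEXT, Course TEXT)")
conn.execute("INSERT INTO COMP_STD VALUES (1, 'Asha', 'Computer Science')")
conn.execute("INSERT INTO OTHER_STD VALUES (2, 'Ravi', 'Mathematics')")

# The view presents the union of the fragments under the original table name.
conn.execute(
    "CREATE VIEW STUDENT AS"
    " SELECT * FROM COMP_STD UNION ALL SELECT * FROM OTHER_STD"
)

# The user queries STUDENT without knowing that it is fragmented.
names = [row[0] for row in conn.execute("SELECT Name FROM STUDENT ORDER BY Regd_No")]
print(names)
```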

Replication Transparency

Replication transparency ensures that the replication of databases is hidden from the users. It enables users to query a table as if only a single copy of the table exists.

Replication transparency is associated with concurrency transparency and failure transparency. Whenever a user updates a data item, the update is reflected in all the copies of the table. However, this operation should not be visible to the user. This is concurrency transparency. Also, in case of failure of a site, the user can still proceed with queries using replicated copies, without any knowledge of the failure. This is failure transparency.

Combination of Transparencies

In any distributed database system, the designer should ensure that all the stated transparencies are maintained to a considerable extent. The designer may choose to fragment tables, replicate them and store them at different sites, all hidden from the end user. However, complete distribution transparency is difficult to achieve and requires considerable design effort.

Database control refers to the task of enforcing regulations so as to provide correct data to authenticated users and applications of a database. In order that correct data is available to users, all data should conform to the integrity constraints defined in the database. Besides, data should be kept away from unauthorized users so as to maintain the security and privacy of the database. Database control is one of the primary tasks of the database administrator (DBA).

The three dimensions of database control are −

• Authentication
• Access rights
• Integrity constraints

Authentication

In a distributed database system, authentication is the process through which only legitimate users can gain access to the data resources.

Authentication can be enforced at two levels −

• Controlling Access to Client Computer − At this level, user access is restricted at login to the client computer that provides the user interface to the database server. The most common method is a username/password combination. However, more sophisticated methods like biometric authentication may be used for high-security data.
• Controlling Access to the Database Software − At this level, the database software/administrator assigns some credentials to the user. The user gains access to the database using these credentials. One of the methods is to create a login account within the database server.

Access Rights

A user's access rights are the privileges that the user is given regarding DBMS operations, such as the rights to create a table, drop a table, add/delete/update tuples in a table or query the table.

In distributed environments, since there is a large number of tables and a still larger number of users, it is not feasible to assign individual access rights to users. So, the DDBMS defines certain roles. A role is a construct with certain privileges within a database system. Once the different roles are defined, the individual users are assigned one of these roles. Often a hierarchy of roles is defined according to the organization's hierarchy of authority and responsibility.

For example, the following SQL statements create a role "Accountant" and then assign this role to user "ABC".

CREATE ROLE ACCOUNTANT;
GRANT SELECT, INSERT, UPDATE ON EMP_SAL TO ACCOUNTANT;
GRANT INSERT, UPDATE, DELETE ON TENDER TO ACCOUNTANT;
GRANT INSERT, SELECT ON EXPENSE TO ACCOUNTANT;
COMMIT;

GRANT ACCOUNTANT TO ABC;
COMMIT;
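
What the role buys can be sketched with two lookup tables in Python. This is an illustrative model of role-based checking, not the way any particular DBMS stores privileges; ROLE_PRIVILEGES, USER_ROLE, is_allowed and the user "XYZ" are names invented for the example, while the privileges mirror the GRANT statements in the text.

```python
# Privileges granted to each role, mirroring the GRANT statements in the text.
ROLE_PRIVILEGES = {
    "ACCOUNTANT": {
        "EMP_SAL": {"SELECT", "INSERT", "UPDATE"},
        "TENDER": {"INSERT", "UPDATE", "DELETE"},
        "EXPENSE": {"INSERT", "SELECT"},
    }
}

# Role assigned to each user (GRANT ACCOUNTANT TO ABC).
USER_ROLE = {"ABC": "ACCOUNTANT"}


def is_allowed(user, operation, table):
    """Check a request against the privileges of the user's role."""
    role = USER_ROLE.get(user)
    privileges = ROLE_PRIVILEGES.get(role, {})
    return operation in privileges.get(table, set())


print(is_allowed("ABC", "SELECT", "EMP_SAL"))  # granted through the role
print(is_allowed("ABC", "DELETE", "EMP_SAL"))  # not among the role's privileges
print(is_allowed("XYZ", "SELECT", "EMP_SAL"))  # user has no role at all
```

Adding a second accountant takes one new entry in USER_ROLE rather than a fresh set of per-table grants, which is the reason roles scale better than individual access rights.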

Semantic Integrity Control

Semantic integrity control defines and enforces the integrity constraints of the
database system.

The integrity constraints are as follows −

• Data type integrity constraint
• Entity integrity constraint
• Referential integrity constraint

Data Type Integrity Constraint

A data type constraint restricts the range of values and the type of operations that
can be applied to the field with the specified data type.

For example, let us consider that a table "HOSTEL" has three fields - the hostel
number, hostel name and capacity. The hostel number should start with capital
letter "H" and cannot be NULL, and the capacity should not be more than 150. The
following SQL command can be used for data definition −

CREATE TABLE HOSTEL (
   H_NO VARCHAR2(5) NOT NULL,
   H_NAME VARCHAR2(15),
   CAPACITY INTEGER,
   CHECK (H_NO LIKE 'H%'),
   CHECK (CAPACITY <= 150)
);
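
The effect of these constraints can be observed with SQLite through Python's sqlite3 module. This is a sketch under assumptions: SQLite's TEXT type stands in for VARCHAR2, the sample rows and the helper try_insert are invented, and SQLite's LIKE is case-insensitive for ASCII letters, so a lowercase "h" would also pass the first check there.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE HOSTEL (
        H_NO     TEXT NOT NULL,
        H_NAME   TEXT,
        CAPACITY INTEGER,
        CHECK (H_NO LIKE 'H%'),
        CHECK (CAPACITY <= 150)
    )
    """
)


def try_insert(row):
    """Return True if the row satisfies every constraint, False if rejected."""
    try:
        conn.execute("INSERT INTO HOSTEL VALUES (?, ?, ?)", row)
        return True
    except sqlite3.IntegrityError:
        return False


results = [
    try_insert(("H1", "Ganga", 120)),   # valid row
    try_insert(("X2", "Yamuna", 100)),  # hostel number does not start with H
    try_insert(("H3", "Kaveri", 200)),  # capacity above 150
    try_insert((None, "Nila", 50)),     # hostel number is NULL
]
print(results)
```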

Entity Integrity Control

Entity integrity control enforces the rules so that each tuple can be uniquely identified from other tuples. For this, a primary key is defined. A primary key is a minimal set of fields that can uniquely identify a tuple. The entity integrity constraint states that no two tuples in a table can have identical values for the primary key and that no field which is part of the primary key can have a NULL value.

For example, in the above hostel table, the hostel number can be assigned as the
primary key through the following SQL statement (ignoring the checks) −

CREATE TABLE HOSTEL (
   H_NO VARCHAR2(5) PRIMARY KEY,
   H_NAME VARCHAR2(15),
   CAPACITY INTEGER
);
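
Both halves of the entity integrity rule (no duplicate key, no NULL key) can be tried out in SQLite. The sketch below uses TEXT in place of VARCHAR2 and invented sample rows. NOT NULL is written explicitly because SQLite, as a documented quirk, otherwise accepts NULL in a non-INTEGER primary key column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# NOT NULL is spelled out because SQLite otherwise permits NULL in a
# non-INTEGER primary key column.
conn.execute(
    "CREATE TABLE HOSTEL ("
    " H_NO TEXT NOT NULL PRIMARY KEY, H_NAME TEXT, CAPACITY INTEGER)"
)
conn.execute("INSERT INTO HOSTEL VALUES ('H1', 'Ganga', 120)")

outcomes = {}
for label, row in [("duplicate key", ("H1", "Yamuna", 90)),
                   ("null key", (None, "Kaveri", 80))]:
    try:
        conn.execute("INSERT INTO HOSTEL VALUES (?, ?, ?)", row)
        outcomes[label] = "accepted"
    except sqlite3.IntegrityError:
        outcomes[label] = "rejected"

print(outcomes)
```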

Referential Integrity Constraint

Referential integrity constraint lays down the rules of foreign keys. A foreign key
is a field in a data table that is the primary key of a related table. The referential
integrity constraint lays down the rule that the value of the foreign key field should
either be among the values of the primary key of the referenced table or be entirely
NULL.

For example, let us consider a student table where a student may opt to live in a
hostel. To include this, the primary key of hostel table should be included as a
foreign key in the student table. The following SQL statement incorporates this −

CREATE TABLE STUDENT (
   S_ROLL INTEGER PRIMARY KEY,
   S_NAME VARCHAR2(25) NOT NULL,
   S_COURSE VARCHAR2(10),
   S_HOSTEL VARCHAR2(5) REFERENCES HOSTEL
);
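
The rule "either a value of the referenced primary key or NULL" can be checked in SQLite. This is a sketch with assumed sample data and a trimmed column list; the helper add_student is hypothetical. SQLite enforces foreign keys only after PRAGMA foreign_keys = ON has been issued on the connection.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite checks foreign keys only when enabled
conn.execute("CREATE TABLE HOSTEL (H_NO TEXT PRIMARY KEY, H_NAME TEXT)")
conn.execute(
    "CREATE TABLE STUDENT ("
    " S_ROLL INTEGER PRIMARY KEY, S_NAME TEXT NOT NULL,"
    " S_HOSTEL TEXT REFERENCES HOSTEL(H_NO))"
)
conn.execute("INSERT INTO HOSTEL VALUES ('H1', 'Ganga')")


def add_student(roll, name, hostel):
    """Return True if the row is accepted, False if a constraint rejects it."""
    try:
        conn.execute("INSERT INTO STUDENT VALUES (?, ?, ?)", (roll, name, hostel))
        return True
    except sqlite3.IntegrityError:
        return False


lives_in_h1 = add_student(1, "Asha", "H1")      # refers to an existing hostel
day_scholar = add_student(2, "Ravi", None)      # NULL foreign key is permitted
unknown_hostel = add_student(3, "Meena", "H9")  # no such hostel exists
print(lives_in_h1, day_scholar, unknown_hostel)
```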

Advantages:
1. Sharing data: users at one site can access data residing at other sites.
2. Autonomy: each site is able to retain a degree of control over data stored locally.
3. Availability: if one site fails, the other sites are able to continue operating.
Disadvantages:
1. Software development cost is very high.
2. Greater potential for bugs: it is harder to ensure the correctness of algorithms, especially during failures of part of the system.
3. Increased processing overhead: the exchange of messages and the additional computation required to achieve intersite coordination.

OBJECT-BASED DATABASES

• Object oriented database systems are an alternative to relational and other database systems.
• In an object oriented database, information is represented in the form of objects.
• Object oriented databases use the same concepts as object oriented programming languages. Combining these concepts with database features (transactions, concurrency control, recovery) gives the object oriented database model.

Data models designed for traditional data-processing applications are not adequate to support computer aided design (CAD), image databases, multimedia, hypertext databases and so on.

To support these newer kinds of data, a database requires:

• Complex data types
• Data encapsulation
• Efficient indexing and query techniques
To satisfy the above requirements, object based data models are needed.
Storing data in a database as objects is termed an object oriented database. An object has both attributes and methods associated with it. An object has two components, state and behavior: state is the current value of the object, and behavior is the set of operations carried out by the object. An object corresponds to an entity in the ER model.

Attributes of an object are of two types:

• Simple attributes have an ordinary data type such as integer, real, etc.
• Complex attributes hold references to other attributes or objects.
An object has the following characteristics:
• Identifier (a unique id for the object)
• Lifetime (determines the persistence of the object)
The object oriented data model is a logical model that describes entities and the relationships among them.
A group of similar objects is termed a class. An object is an instance of a class.
An object includes:
• Variable types
• Methods
• Message interface

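#include <string>

class Student {
    std::string name;   // attributes (state)
    int rollno;
public:
    std::string get_name() const { return name; }        // methods (behavior)
    void set_name(const std::string& n) { name = n; }
};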

An object oriented database (OODB) maintains a direct correspondence between real-world objects and data objects. Each data object has a unique, system-generated identifier called the object identifier (OID).

In an OODB, every object has an object structure that holds all relevant information about the corresponding real-world entity.
An OODB provides complete encapsulation of objects: an object is accessed only through its predefined operations. However, predefined operations alone do not support ad hoc queries by a user.

Encapsulation comes in two parts:

1. Signature or interface: specifies the operation name and its arguments (parameters).
2. Method or body: specifies the implementation of the operation performed by the object.

An OODB supports inheritance, which reuses existing type definitions to create new types of objects.
The ODMG (Object Data Management Group) is a consortium of object oriented database vendors.
An OODB also supports operator overloading: an operation name may refer to several implementations, depending on the type of object it is applied to. This is known as polymorphism.

OBJECT IDENTITY (OID)

An OODB provides a unique, system-generated identifier (OID) for each object. The OID is not visible to external users, but is used internally by the system to identify objects.
An object's identity cannot be changed, i.e., it is immutable. This immutability preserves the identity of the real-world object being represented. Each OID can be used only once: even if an object is removed from the database, its OID cannot be assigned to another object.
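
A minimal sketch of the "used only once" rule follows. The ObjectStore class and the counter-based OIDs are assumptions made for illustration; real systems may generate OIDs quite differently.

```python
import itertools


class ObjectStore:
    """Toy object store that hands out system-generated, never-reused OIDs."""

    def __init__(self):
        self._next_oid = itertools.count(1)  # monotonically increasing counter
        self._objects = {}

    def add(self, value):
        oid = next(self._next_oid)  # generated by the system, not by the user
        self._objects[oid] = value
        return oid

    def remove(self, oid):
        # The object disappears, but the counter never moves backwards,
        # so the OID is not handed out again.
        del self._objects[oid]


store = ObjectStore()
first = store.add("pump")
store.remove(first)
second = store.add("valve")
print(first, second)
```

Because the counter only moves forward, the OID of the removed object is never given to the object added later.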
Object Structure
The state of an object can be constructed from other objects by using type constructors.
Each object is denoted as a triple (i, c, v), where i is a unique object identifier, c is a type constructor, and v is the object state (current value).
Example:
(i, atomic, 10)
Three basic type constructors are atom, tuple, and set.
Atom represents all atomic values such as integers, real numbers, character strings, Booleans, etc.
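
The triple notation can be sketched in Python as follows. The Obj class, the identifiers i1 to i4, the attribute names and the resolve helper are hypothetical; the sketch only shows how set and tuple objects are built from the OIDs of other objects.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class Obj:
    oid: str          # i: unique object identifier
    constructor: str  # c: type constructor ("atom", "tuple" or "set")
    state: Any        # v: current state (value)


# Atomic objects hold a single value.
o1 = Obj("i1", "atom", "Research")
o2 = Obj("i2", "atom", 10)
# A set object's state is a set of OIDs of other objects.
o3 = Obj("i3", "set", frozenset({"i1", "i2"}))
# A tuple object's state pairs attribute names with OIDs.
o4 = Obj("i4", "tuple", (("dname", "i1"), ("dnumber", "i2")))

database = {o.oid: o for o in (o1, o2, o3, o4)}


def resolve(oid):
    """Follow OIDs recursively to build the plain value of an object."""
    obj = database[oid]
    if obj.constructor == "atom":
        return obj.state
    if obj.constructor == "set":
        return {resolve(member) for member in obj.state}
    return {attr: resolve(ref) for attr, ref in obj.state}


print(resolve("i4"))
```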

1. The object-oriented paradigm is based on encapsulating code and data into a single unit. Conceptually, all interactions between an object and the rest of the system are via messages. Thus, the interface between an object and the rest of the system is defined by a set of allowed messages.
2. In general, an object has associated with it:
   o A set of variables that contain the data for the object. The value of each variable is itself an object.
   o A set of messages to which the object responds.
   o A set of methods, each of which is a body of code to implement a message; a method returns a value as the response to the message.
3. Motivation for using messages and methods.

   All employee objects respond to the annual-salary message, but with different computations for managers, tellers, etc. By encapsulating within the employee object itself the information about how to compute the annual salary, all employee objects present the same interface.

   Since the only external interface presented by an object is the set of messages to which it responds, it is possible to (i) modify the definition of methods and variables without affecting the rest of the system, and (ii) replace a variable with a method that computes a value, e.g., age from birth_date.

   The ability to modify the definition of an object without affecting the rest of the system is considered to be one of the major advantages of the OO programming paradigm.

4. Methods of an object may be classified as either read-only or update. Messages can also be classified as read-only or update. Derived attributes of an entity in the ER model can be expressed as read-only messages.
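
The annual-salary and age examples from the list above can be sketched in Python. The class names, the pay figures and the bonus rule are assumptions for illustration; the sketch only shows one message answered by different method bodies, and a derived value computed on request instead of being stored.

```python
from datetime import date


class Employee:
    def __init__(self, name, birth_date, monthly_pay):
        self.name = name
        self.birth_date = birth_date
        self.monthly_pay = monthly_pay

    def annual_salary(self):
        # Default computation shared by ordinary employees.
        return 12 * self.monthly_pay

    def age(self, today):
        # Read-only derived attribute: computed from birth_date, not stored.
        had_birthday = (today.month, today.day) >= (
            self.birth_date.month, self.birth_date.day)
        return today.year - self.birth_date.year - (0 if had_birthday else 1)


class Manager(Employee):
    def __init__(self, name, birth_date, monthly_pay, bonus):
        super().__init__(name, birth_date, monthly_pay)
        self.bonus = bonus

    def annual_salary(self):
        # Same message, different method body.
        return 12 * self.monthly_pay + self.bonus


staff = [
    Employee("Ravi", date(1990, 6, 15), 3000),
    Manager("Asha", date(1985, 1, 10), 5000, bonus=10000),
]
salaries = [person.annual_salary() for person in staff]
print(salaries)
print(staff[0].age(date(2024, 6, 14)))
```

The caller sends the same annual_salary message to every object and never needs to know which computation runs behind it.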


ODMG


XML DATABASE
• XML (Extensible Markup Language) is a markup language.
• XML is designed to store and transport data.
• XML was released in the late 1990s. It was created to provide an easy way to use and store self-describing data.

An XML database is a data persistence software system used for storing large amounts of information in XML format. It provides a secure place to store XML documents.

You can query the stored data using XQuery, and export and serialize it into the desired format. XML databases are usually associated with document-oriented databases.

Types of XML databases

There are two types of XML databases.

1. XML-enabled database
2. Native XML database (NXD)

XML-enabled Database

An XML-enabled database works just like a relational database, with an extension provided for the conversion of XML documents. In this database, data is stored in tables, in the form of rows and columns.

Native XML Database

A native XML database is used to store large amounts of data. Instead of a table format, a native XML database is based on a container format. You can query data using XPath expressions.

A native XML database is preferred over an XML-enabled database because it is highly capable of storing, maintaining and querying XML documents.

Let's take an example of an XML database:

<?xml version="1.0"?>
<contact-info>
  <contact1>
    <name>Vimal Jaiswal</name>
    <company>SSSIT.org</company>
    <phone>(0120) 4256464</phone>
  </contact1>
  <contact2>
    <name>Mahesh Sharma</name>
    <company>SSSIT.org</company>
    <phone>09990449935</phone>
  </contact2>
</contact-info>

In the above example, the root element contact-info holds two contacts (contact1 and contact2). Each contact contains three child elements: name, company and phone.
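
As a rough illustration of how an application might read such a document, the sketch below parses the contact data with Python's standard xml.etree.ElementTree module. The function name read_contacts and the dictionary layout are assumptions made for this example only; they are not part of any XML database product.

```python
import xml.etree.ElementTree as ET

CONTACTS_XML = """<?xml version="1.0"?>
<contact-info>
  <contact1>
    <name>Vimal Jaiswal</name>
    <company>SSSIT.org</company>
    <phone>(0120) 4256464</phone>
  </contact1>
  <contact2>
    <name>Mahesh Sharma</name>
    <company>SSSIT.org</company>
    <phone>09990449935</phone>
  </contact2>
</contact-info>"""


def read_contacts(xml_text):
    """Return one dict per contact, keyed by child element name (hypothetical helper)."""
    root = ET.fromstring(xml_text)  # root is the contact-info element
    return [{field.tag: field.text.strip() for field in contact} for contact in root]


contacts = read_contacts(CONTACTS_XML)
for contact in contacts:
    print(contact["name"], "-", contact["phone"])
```

Each contact becomes a dictionary with the keys name, company and phone, which mirrors the three child elements described above.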

Features and Advantages of XML

1) XML separates data from HTML

2) XML simplifies data sharing

3) XML simplifies data transport

XML DOCUMENT

XML is the basis for many new-generation data interchange formats.

XML provides detailed information about the structure and meaning of the data in web pages.

It is suited for:

• Structured data
• Semi-structured data
• Unstructured data

Structured data

In structured data, all information is stored in a predefined format. There are no deviations in the arrangement of the data or in the operations performed on it. Data that is stored in a database in a specified format is normally termed structured data.

Example:

Data about the departments and courses offered in a college.

Semi-Structured data

In some applications, data is collected in an ad hoc manner. Since the data is collected from different sources, its formats also vary, but the collected data is stored in a known format. The schema information is not stated separately; it can be inferred from the names and values stored with the data itself. Because of this, this type of data is called self-describing data.

Unstructured data

Text typed by a user into a web search box is a typical example of unstructured data. A text document that has information embedded in it is called unstructured data.

The nested structure of tags gives in-depth knowledge of how the data is organized, but it may lead to redundancy.

The two concepts used to construct an XML document are:

1) Element
2) Attribute

Attributes in XML are strings that do not contain markup, and a given attribute name may appear only once in a given tag. When XML is used to exchange data between applications, name clashes are avoided by a mechanism called namespaces. The namespace mechanism allows globally unique names to be used for the element tags in a document.

Example: Web address

https://www.annauniv.edu

XML data often has to be exchanged between organizations, so namespaces are used to keep the tag names unique.

<college xmlns:au="//www.au.com">
  <au:department>
    <au:course>CSE</au:course>
    <au:Block>Milton</au:Block>
  </au:department>
</college>
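
To see what the prefix does in practice, the following sketch (an illustration only, using Python's standard xml.etree.ElementTree) parses the college example and looks up an element through its namespace. The variable names NS and course are assumptions introduced here; the closing </college> tag is added so that the text parses.

```python
import xml.etree.ElementTree as ET

COLLEGE_XML = """<college xmlns:au="//www.au.com">
  <au:department>
    <au:course>CSE</au:course>
    <au:Block>Milton</au:Block>
  </au:department>
</college>"""

# Map the prefix used in our search paths to the namespace name from the document.
NS = {"au": "//www.au.com"}

root = ET.fromstring(COLLEGE_XML)
course = root.find("au:department/au:course", NS)

# The parser stores the globally unique name as {namespace}localname.
print(course.tag)
print(course.text)
```

The prefix au is only shorthand: two documents may use different prefixes and still refer to the same element, as long as both prefixes are bound to the same namespace name.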

XML HIERARCHICAL DATA MODEL

An XML document can be modeled as a tree structure of nodes. Each node corresponds to an element and its attributes. The terms parent, child, ancestor, sibling and descendant are used to describe the XML tree model.

There are two types of elements: complex elements, which are the internal nodes of the tree, and simple elements, which are the leaf nodes.

Simple elements contain data values, whereas complex elements are constructed hierarchically from other elements.

Example:

<bookstore>
  <book category="children">
    <title>Harry Potter</title>
    <author>J K. Rowling</author>
    <year>2005</year>
    <price>29.99</price>
  </book>
  <book category="web">
    <title>Learning XML</title>
    <author>Erik T. Ray</author>
    <year>2003</year>
    <price>39.95</price>
  </book>
</bookstore>

Here title, author, year and price are simple elements that hold data values, while bookstore and book are complex elements constructed from other elements.
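
The distinction between internal and leaf nodes can be checked mechanically. The sketch below is a minimal illustration using Python's standard xml.etree.ElementTree; the helper name classify is an assumption for this example. It walks the tree and separates elements that have child elements from those that do not.

```python
import xml.etree.ElementTree as ET

BOOKSTORE_XML = """<bookstore>
  <book category="children">
    <title>Harry Potter</title>
    <author>J K. Rowling</author>
    <year>2005</year>
    <price>29.99</price>
  </book>
  <book category="web">
    <title>Learning XML</title>
    <author>Erik T. Ray</author>
    <year>2003</year>
    <price>39.95</price>
  </book>
</bookstore>"""


def classify(root):
    """Split tag names into complex (internal nodes) and simple (leaf nodes)."""
    complex_tags, simple_tags = set(), set()
    for elem in root.iter():  # visits root and every descendant
        if len(elem) > 0:     # len() of an element is its number of child elements
            complex_tags.add(elem.tag)
        else:
            simple_tags.add(elem.tag)
    return complex_tags, simple_tags


complex_tags, simple_tags = classify(ET.fromstring(BOOKSTORE_XML))
print("complex:", sorted(complex_tags))
print("simple:", sorted(simple_tags))
```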

It is possible to characterize three main types of XML documents:

Data-centric XML documents. These documents have many small data items that follow a specific structure and hence may be extracted from a structured database. They are formatted as XML documents in order to exchange them over or display them on the Web. These usually follow a predefined schema that defines the tag names.

Document-centric XML documents. These are documents with large amounts of text, such as news articles or books. There are few or no structured data elements in these documents.

Hybrid XML documents. These documents may have parts that contain structured data and other parts that are predominantly textual or unstructured. They may or may not have a predefined schema.

XML document Schema

A schema specifies the overall structure of an XML document. An XML document schema specifies:

• The information being stored
• The type of information being stored

An XML schema tells the applications that process a document how to interpret and navigate its contents.
An XML schema document must identify the specific set of XML schema language elements (tags) it uses, by naming a file stored at a website location, for example:
http://www.au.deu/2005/XMLSchema
Each such definition is called a namespace. The attribute xmlns assigns this location to a variable (such as xsd), and that variable is used as a prefix on all XML schema tags.
There are two types of XML document:
1) Well-formed XML
2) Valid XML

A well-formed XML document has a single root element, and every element has a start tag and an end tag nested within the start and end tags of its parent element. The DOM (Document Object Model) allows programs to manipulate the tree representation of a well-formed XML document; a DOM parser reads the entire document into a tree structure.

A valid XML document must, in addition, satisfy the structure specified in an XML schema file or an XML DTD (Document Type Definition).
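
Well-formedness can be tested by simply trying to parse the text. The sketch below uses Python's standard xml.etree.ElementTree; the helper name is_well_formed is an assumption for this example. Note that it checks well-formedness only: this standard-library parser does not validate a document against a DTD or an XML schema.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if xml_text parses as XML, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


print(is_well_formed("<note><to>Tove</to></note>"))   # properly nested tags
print(is_well_formed("<note><to>Tove</note></to>"))   # end tags in the wrong order
print(is_well_formed("<a/><b/>"))                     # two root elements
```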
Example: XML Schema for project
<!-- A schema processor recognises these tags only when the xsd prefix is bound to the W3C namespace http://www.w3.org/2001/XMLSchema -->
<xsd:schema xmlns:xsd="http://www.project.com/2010/xmlschema">
  <xsd:annotation>
    <xsd:documentation xml:lang="en">project expo</xsd:documentation>
  </xsd:annotation>
  <xsd:complexType name="Name">
    <xsd:sequence>
      <xsd:element name="first_name" type="xsd:string"/>
      <xsd:element name="last_name" type="xsd:string"/>
      <xsd:element name="middle_name" type="xsd:string"/>
    </xsd:sequence>
  </xsd:complexType>
  <!-- The opening tag of this second type is missing in the source; the name "Employee" is an assumed placeholder. -->
  <xsd:complexType name="Employee">
    <xsd:sequence>
      <xsd:element name="emp_id" type="xsd:integer"/>
      <xsd:element name="f_name" type="xsd:string"/>
      <xsd:element name="dept" type="xsd:string"/>
    </xsd:sequence>
  </xsd:complexType>
</xsd:schema>

Primary keys are declared with xsd:key, and xsd:keyref is used to refer to them as foreign keys.
The xsd:annotation and xsd:documentation elements are used for providing comments and other descriptions in the XML schema document.
The structure of a root element such as company is defined as an xsd:complexType.

DTD (Document Type Definition)

A DTD is a Document Type Definition.

A DTD defines the structure and the legal elements and attributes of an XML
document.

Use a DTD

With a DTD, independent groups of people can agree on a standard DTD for
interchanging data.

An application can use a DTD to verify that XML data is valid.

Syntax

Basic syntax of a DTD is as follows −

<!DOCTYPE element DTD identifier
[
   declaration1
   declaration2
   ........
]>

XML document with an internal DTD

<?xml version="1.0"?>
<!DOCTYPE note [
<!ELEMENT note (to,from,heading,body)>
<!ELEMENT to (#PCDATA)>
<!ELEMENT from (#PCDATA)>
<!ELEMENT heading (#PCDATA)>
<!ELEMENT body (#PCDATA)>
]>
<note>
<to>Tove</to>
<from>Jani</from>
<heading>Reminder</heading>
<body>Don't forget me this weekend</body>
</note>
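
The content model (to,from,heading,body) says that a note must contain exactly these four children in this order. The sketch below is a hand-written check of that one rule, assuming Python's standard xml.etree.ElementTree; it is not a DTD validator (the standard-library parser does not validate), and the names NOTE_MODEL and matches_note_dtd are assumptions for this example.

```python
import xml.etree.ElementTree as ET

# Child order required by <!ELEMENT note (to,from,heading,body)>
NOTE_MODEL = ["to", "from", "heading", "body"]


def matches_note_dtd(xml_text):
    """Check the note content model by hand: right root, right children, right order."""
    root = ET.fromstring(xml_text)
    if root.tag != "note":
        return False
    right_order = [child.tag for child in root] == NOTE_MODEL
    text_only = all(len(child) == 0 for child in root)  # (#PCDATA): no nested elements
    return right_order and text_only


good = "<note><to>Tove</to><from>Jani</from><heading>Reminder</heading><body>Hi</body></note>"
bad = "<note><from>Jani</from><to>Tove</to><heading>Reminder</heading><body>Hi</body></note>"
print(matches_note_dtd(good), matches_note_dtd(bad))
```

The second document is well-formed but would not be valid against the DTD, because to and from appear in the wrong order.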

An External DTD Declaration

If the DTD is declared in an external file, the <!DOCTYPE> definition must contain a reference to the DTD file:

XML document with a reference to an external DTD

<?xml version="1.0"?>
<!DOCTYPE note SYSTEM "note.dtd">
<note>
<to>Tove</to>
<from>Jani</from>
<heading>Reminder</heading>
<body>Don't forget me this weekend!</body>
</note>

Limitations
1) The data types in a DTD are not very general; values are essentially treated as strings.
2) A DTD has its own special syntax (it is not written in XML), so it needs a specialized processor.
3) Elements are always forced to follow the ordering specified in the DTD, so unordered elements are not permitted.
4) As a result, it is difficult to specify unordered sets of subelements.
XML Schema

An XML Schema describes the structure of an XML document, just like a DTD.

An XML document with correct syntax is called "Well Formed".

An XML document validated against an XML Schema is both "Well Formed" and
"Valid".

XML Schema

XML Schema is an XML-based alternative to DTD:

<xs:element name="note">
  <xs:complexType>
    <xs:sequence>
      <xs:element name="to" type="xs:string"/>
      <xs:element name="from" type="xs:string"/>
      <xs:element name="heading" type="xs:string"/>
      <xs:element name="body" type="xs:string"/>
    </xs:sequence>
  </xs:complexType>
</xs:element>

The Schema above is interpreted like this:

• <xs:element name="note"> defines the element called "note"
• <xs:complexType> the "note" element is a complex type
• <xs:sequence> the complex type is a sequence of elements
• <xs:element name="to" type="xs:string"> the element "to" is of type string (text)
• <xs:element name="from" type="xs:string"> the element "from" is of type string
• <xs:element name="heading" type="xs:string"> the element "heading" is of type string
• <xs:element name="body" type="xs:string"> the element "body" is of type string
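
Because a schema is itself an XML document, it can be read with an ordinary XML parser. The sketch below is an illustration using Python's standard xml.etree.ElementTree: it wraps the fragment in an xs:schema root bound to the W3C namespace http://www.w3.org/2001/XMLSchema (the wrapper is an addition made here so that the text parses) and lists the declared child elements with their types. It only reads the schema; it does not validate documents against it.

```python
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"  # W3C XML Schema namespace

NOTE_XSD = f"""<xs:schema xmlns:xs="{XS}">
  <xs:element name="note">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="to" type="xs:string"/>
        <xs:element name="from" type="xs:string"/>
        <xs:element name="heading" type="xs:string"/>
        <xs:element name="body" type="xs:string"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>"""

schema = ET.fromstring(NOTE_XSD)
note = schema.find(f"{{{XS}}}element")  # the top-level declaration of "note"

# iter() also yields the note declaration itself, so it is filtered out.
children = [(e.get("name"), e.get("type"))
            for e in note.iter(f"{{{XS}}}element") if e is not note]

print(note.get("name"), "contains:", children)
```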

XML Schemas are More Powerful than DTD

• XML Schemas are written in XML
• XML Schemas are extensible to additions
• XML Schemas support data types
• XML Schemas support namespaces

Use an XML Schema

With XML Schema, your XML files can carry a description of their own format.

With XML Schema, independent groups of people can agree on a standard for interchanging data.

With XML Schema, you can verify data.

XML Schemas Support Data Types

One of the greatest strengths of XML Schemas is the support for data types:

• It is easier to describe document content
• It is easier to define restrictions on data
• It is easier to validate the correctness of data
• It is easier to convert data between different data types

XML Schemas use XML Syntax

Another great strength about XML Schemas is that they are written in XML:

• You don't have to learn a new language
• You can use your XML editor to edit your Schema files
• You can use your XML parser to parse your Schema files
• You can manipulate your Schemas with the XML DOM
• You can transform your Schemas with XSLT

XQUERY

XQuery is to XML what SQL is to databases.

XQuery is designed to query XML data.

What is XQuery?

• XQuery is the language for querying XML data
• XQuery for XML is like SQL for databases
• XQuery is built on XPath expressions
• XQuery is supported by many major database systems
• XQuery is a W3C Recommendation

The XML Example Document

We will use the following XML document in the examples below.

"books.xml":

<?xml version="1.0" encoding="UTF-8"?>

<bookstore>

  <book category="COOKING">
    <title lang="en">Everyday Italian</title>
    <author>Giada De Laurentiis</author>
    <year>2005</year>
    <price>30.00</price>
  </book>

  <book category="CHILDREN">
    <title lang="en">Harry Potter</title>
    <author>J K. Rowling</author>
    <year>2005</year>
    <price>29.99</price>
  </book>

  <book category="WEB">
    <title lang="en">XQuery Kick Start</title>
    <author>James McGovern</author>
    <author>Per Bothner</author>
    <author>Kurt Cagle</author>
    <author>James Linn</author>
    <author>Vaidyanathan Nagarajan</author>
    <year>2003</year>
    <price>49.99</price>
  </book>

  <book category="WEB">
    <title lang="en">Learning XML</title>
    <author>Erik T. Ray</author>
    <year>2003</year>
    <price>39.95</price>
  </book>

</bookstore>

Path Expressions

XQuery uses path expressions to navigate through elements in an XML document.

The following path expression is used to select all the title elements in the
"books.xml" file:

doc("books.xml")/bookstore/book/title

(/bookstore selects the bookstore element, /book selects all the book elements
under the bookstore element, and /title selects all the title elements under each
book element)
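
For readers without an XQuery processor at hand, the same navigation can be imitated in Python. The sketch below is an analogy, not XQuery: it uses the limited path syntax of the standard xml.etree.ElementTree module on a shortened copy of the book data (authors and years are left out to keep it brief).

```python
import xml.etree.ElementTree as ET

BOOKS_XML = """<bookstore>
  <book category="COOKING"><title lang="en">Everyday Italian</title><price>30.00</price></book>
  <book category="CHILDREN"><title lang="en">Harry Potter</title><price>29.99</price></book>
  <book category="WEB"><title lang="en">XQuery Kick Start</title><price>49.99</price></book>
  <book category="WEB"><title lang="en">Learning XML</title><price>39.95</price></book>
</bookstore>"""

# The parsed root plays the role of doc("books.xml")/bookstore.
bookstore = ET.fromstring(BOOKS_XML)

# "book/title" corresponds to the remaining steps /book/title.
titles = [title.text for title in bookstore.findall("book/title")]
print(titles)
```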

XQuery Data Types

XQuery shares the same data types as XML Schema 1.0 (XSD).

These fall into four groups: string, date, numeric and miscellaneous data types.

Selecting and Filtering Elements

Elements are selected and filtered with either a path expression or a FLWOR expression.

Look at the following FLWOR expression:

for $x in doc("books.xml")/bookstore/book
where $x/price>30
order by $x/title
return $x/title

• for - (optional) binds a variable to each item returned by the in expression
• let - (optional) binds a variable to the result of an expression
• where - (optional) specifies a criterion
• order by - (optional) specifies the sort-order of the result
• return - specifies what to return in the result
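
The clauses of the FLWOR expression above map closely onto ordinary list processing. The following sketch is a Python analogy on a shortened copy of the book data, using the standard xml.etree.ElementTree module; it is not an XQuery engine, and it returns the title text rather than the title elements that XQuery would return.

```python
import xml.etree.ElementTree as ET

BOOKS_XML = """<bookstore>
  <book><title>Everyday Italian</title><price>30.00</price></book>
  <book><title>Harry Potter</title><price>29.99</price></book>
  <book><title>XQuery Kick Start</title><price>49.99</price></book>
  <book><title>Learning XML</title><price>39.95</price></book>
</bookstore>"""

books = ET.fromstring(BOOKS_XML).findall("book")                    # for $x in .../bookstore/book
expensive = [b for b in books if float(b.findtext("price")) > 30]   # where $x/price > 30
expensive.sort(key=lambda b: b.findtext("title"))                   # order by $x/title
result = [b.findtext("title") for b in expensive]                   # return $x/title

print(result)
```

The book priced at exactly 30.00 is excluded, because the condition is strictly greater than 30.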

XPath

XPath is a major element in the XSLT standard.

XPath can be used to navigate through elements and attributes in an XML document.

• XPath stands for XML Path Language
• XPath uses "path like" syntax to identify and navigate nodes in an XML document
• XPath contains over 200 built-in functions

XPath Path Expressions

XPath uses path expressions to select nodes or node-sets in an XML document.

These path expressions look very much like the path expressions you use with traditional computer file systems.

XPath Standard Functions

XPath includes over 200 built-in functions.

There are functions for string values, numeric values, booleans, date and time
comparison, node manipulation, sequence manipulation, and much more.

Today XPath expressions can also be used in JavaScript, Java, XML Schema,
PHP, Python, C and C++, and lots of other languages.

The XML Example Document

We will use the following XML document in the examples below.

<?xml version="1.0" encoding="UTF-8"?>

<bookstore>

  <book>
    <title lang="en">Harry Potter</title>
    <price>29.99</price>
  </book>
  <book>
    <title lang="en">Learning XML</title>
    <price>39.95</price>
  </book>
</bookstore>

Selecting Nodes

XPath uses path expressions to select nodes in an XML document. The node is
selected by following a path or steps. The most useful path expressions are listed
below:

Expression   Description
nodename     Selects all nodes with the name "nodename"
/            Selects from the root node
//           Selects nodes in the document from the current node that match the selection no matter where they are
.            Selects the current node
..           Selects the parent of the current node
@            Selects attributes
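
Several of these expressions can be tried with Python's standard xml.etree.ElementTree module, which supports only a subset of XPath. The sketch below is an illustration under that limitation: in this module, paths are evaluated relative to an element (absolute paths starting with / are not accepted), and attribute values are read with get() rather than selected as nodes.

```python
import xml.etree.ElementTree as ET

BOOKSTORE_XML = """<bookstore>
  <book>
    <title lang="en">Harry Potter</title>
    <price>29.99</price>
  </book>
  <book>
    <title lang="en">Learning XML</title>
    <price>39.95</price>
  </book>
</bookstore>"""

root = ET.fromstring(BOOKSTORE_XML)                     # the bookstore element

books = root.findall("book")                            # nodename: children named "book"
titles = [t.text for t in root.findall(".//title")]     # //: title nodes at any depth
current = root.findall(".")                             # . : the current node
parents = root.findall(".//title/..")                   # ..: the parent of each title
langs = [t.get("lang") for t in root.iter("title")]     # @lang: read as an attribute value

print(len(books), titles, [p.tag for p in parents], langs)
```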

Selecting Several Paths

By using the | operator in an XPath expression you can select several paths.

In the table below we have listed some path expressions and the result of the
expressions:

Path Expression                    Result
//book/title | //book/price        Selects all the title AND price elements of all book elements
//title | //price                  Selects all the title AND price elements in the document
/bookstore/book/title | //price    Selects all the title elements of the book element of the bookstore element AND all the price elements in the document

Location Path Expression

A location path can be absolute or relative.

An absolute location path starts with a slash ( / ) and a relative location path does
not. In both cases the location path consists of one or more steps, each separated by
a slash:

An absolute location path:

/step/step/...

A relative location path:

step/step/...

Each step is evaluated against the nodes in the current node-set.

A step consists of:

• an axis (defines the tree-relationship between the selected nodes and the current node)
• a node-test (identifies a node within an axis)
• zero or more predicates (to further refine the selected node-set)

The syntax for a location step is:

axisname::nodetest[predicate]

INFORMATION RETRIEVAL

Information retrieval, as the name implies, concerns the retrieval of relevant information from databases. It is basically concerned with facilitating the user's access to large amounts of (predominantly textual) information. The process of information retrieval involves the following stages:

1. Representing collections of documents - how to represent, identify and process the collection of documents.
2. User-initiated querying - understanding and processing of the queries.
3. Retrieval of the appropriate documents - the searching mechanism used to obtain and retrieve the relevant documents.

IR Components

An IR system makes use of the following components to perform information retrieval.

User query is the text entered by the user to search for information.

Text operations receive the text entered by the user and convert it into tokens, which serve as the keywords to be matched during the search.

Indexing maps the generated tokens to pointers. Depending on which tokens match which pointers, different query responses are generated.

These query responses are ranked according to a relevance metric.

User interface manages the user interaction by processing input and visualizing output. Before indexing, a logical view of the documents must be created by the query manager. Different index structures may be used for defining the database index.
The time and memory space spent on defining the text database and building the index are amortized by querying the retrieval system many times.

The search information provided by the user is sent to text acquisition, which gathers all the search tokens. The matching tokens are then used to identify pointers. The most acceptable level of search results is selected; an index is created for those links and stored in the index DB with rankings.

The document data store holds the metadata for all documents, including the document type, structure, length and so on.
The set of search keywords is called the index terms. The collection of terms indexed for a document is called the document vocabulary. The index is updated whenever a new search is performed.
The inverted file (or inverted index) stores the index, whereas the document file stores the documents.
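
As a small illustration of the inverted index idea, the sketch below maps each term to the ids of the documents that contain it. It is a simplified assumption-laden example: terms are obtained by lower-casing and splitting on whitespace, the three sample documents are invented, and the helper name build_inverted_index is hypothetical.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each term to the sorted list of ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():   # naive tokenization, for illustration only
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


# The document file: document id -> document text (invented sample data).
docs = {
    1: "distributed database systems",
    2: "XML database",
    3: "information retrieval systems",
}

index = build_inverted_index(docs)
print(index["database"], index["systems"], index["xml"])
```

A query term is then answered by a single lookup in the index instead of a scan over every document.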
Query processing in an IR system involves the following steps:
• User interface
• Query processing
• Ranking and evaluation
• Delivering information

Whenever a user queries a search engine, it processes the query, evaluates the search data, ranks the results and uses the index to display the corresponding search results to the user. Log data is maintained to keep track of all operations performed for information retrieval.

Relevance Ranking
Relevance ranking orders the retrieved documents by their relevance to the query.
The different relevance ranking approaches are:
(i) Relevance using terms
• Ranking using TF-IDF
• Similarity based retrieval
(ii) Relevance using Hyperlinks
• Popularity rank
• Page rank
• Search engine spamming
(iii) Synonyms, Homonyms and ontologies
(iv) Indexing of documents

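Ranking using TF-IDF:

The relevance of a document d to a term t is measured from n(d,t), the number of occurrences of t in d, and n(d), the total number of terms in d.
TF: Term Frequency [Relevance of a document to a term]
IDF: Inverse Document Frequency [Assigning weights to terms]

TF(d,t) = log(1 + n(d,t)/n(d))
IDF(t) = 1/n(t), where n(t) is the number of documents that contain the term t
Stop words are common words (such as "a", "the" or "of") that the IR system ignores when indexing a document.
Proximity refers to how close together multiple query terms occur in a document. Documents in which the terms occur close to each other are given higher priority.
The relevance of a document d to a query Q is then
r(d,Q) = sum over all terms t in Q of TF(d,t) * IDF(t)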
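
A worked sketch of these formulas follows. It assumes the natural logarithm, takes n(t) as the number of documents containing t, and scores a query by summing TF(d,t) * IDF(t) over its terms; the three tiny documents and the function names are invented for illustration.

```python
import math


def tf(doc_terms, term):
    """TF(d,t) = log(1 + n(d,t)/n(d)), using the natural logarithm (an assumption)."""
    return math.log(1 + doc_terms.count(term) / len(doc_terms))


def idf(docs, term):
    """IDF(t) = 1/n(t), where n(t) is the number of documents containing the term."""
    n_t = sum(1 for doc_terms in docs if term in doc_terms)
    return 1 / n_t if n_t else 0.0


def relevance(doc_terms, query_terms, docs):
    """r(d,Q): sum of TF(d,t) * IDF(t) over the query terms."""
    return sum(tf(doc_terms, t) * idf(docs, t) for t in query_terms)


# Invented sample collection; each document is already split into terms.
docs = [
    ["xml", "database", "xml"],
    ["relational", "database"],
    ["information", "retrieval"],
]
scores = [relevance(d, ["xml", "database"], docs) for d in docs]
print(scores)
```

The first document scores highest because it contains both query terms and "xml" occurs in only one document, so that term carries the full weight IDF = 1.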
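Similarity based Retrieval
The user provides a document A to the system, and the system returns the documents that are similar to A.
One approach is to take the terms t in A with the highest values of
TF(A,t) * IDF(t)
and use them as a query. The resulting set of documents is the search result for the user. When the documents supplied by the user are themselves chosen from the results of an earlier search, this is called relevance feedback. A model that represents documents as points or vectors in an n-dimensional space is called the vector space model.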
Popularity Ranking
It is also called prestige ranking. It gives higher priority to pages that are popular on the web.
Pages that appear in many users' bookmark files count as popular. A website that is linked to by other popular websites is another example.
The popularity of a page can be measured by the number of pages that link to it.
Page Rank
Google's PageRank measures the rank of a page based on the links to that page. It can be understood through the random walk model. The PageRank algorithm does not take the query keywords into account; this can be addressed by using the keywords in the anchor text of the links that point to a page.
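
The random walk idea can be sketched in a few lines. The code below is a simplified illustration, not Google's implementation: a surfer follows a random outgoing link with probability damping and jumps to a random page otherwise. The damping value 0.85, the iteration count and the four-page link graph are assumptions chosen for this example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Estimate page ranks by repeatedly simulating one step of the random walk."""
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}       # start with equal ranks
    for _ in range(iterations):
        new_rank = {page: (1 - damping) / n for page in pages}   # random jump share
        for page, outlinks in links.items():
            targets = outlinks or pages            # a page without links jumps anywhere
            share = damping * rank[page] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


# Invented link graph: page -> pages it links to (every target is also a key).
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(links)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

Page C ends up with the highest rank because three pages link to it, while D, which no page links to, receives only the random-jump share.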
Search Engine Spamming
Spam pages are not popular websites, but they are engineered to receive a high relevance rank for some queries.
Synonyms
Retrieval based on synonyms or on context is also used in practice.
Retrieval effectiveness is measured by:
Percentage of false negatives (relevant documents that were not retrieved)
Percentage of false positives (irrelevant documents that were retrieved)
Information Retrieval Models
Classical Models are:
• Boolean Model
• Vector Model
• Probabilistic Model
• Fuzzy Model
• Semantic Model
Boolean Model: In the Boolean model, index term weights are binary variables taking values in {0,1}. It is a simple model based on set theory.

An index term is either present in a document or absent from it. The Boolean set-theoretic operations are AND, OR and NOT. A query given as a Boolean expression is represented as a disjunction of conjunctive vectors, for example:

Q = (1,1,1) V (1,1,0) V (1,0,0)

where each component is a binary weighted vector associated with the tuple (ba,bb,bc). This is the disjunctive normal form of the query ba AND (bb OR NOT bc).
Advantages:
1. Simplicity of the model
Disadvantages:
1. Retrieval performance is low.
Vector Space Model
Documents are represented as features and weights in an n-dimensional vector space of terms.
The weight associated with a pair (Ki,dj) of index term Ki and document dj is positive and non-binary.
The cosine of the angle between the query and document vectors is commonly used for assessing similarity.
The term weights are used to compute the degree of similarity between each document and the user query.
Query vector:
Q = (w1,Q, w2,Q, ..., wn,Q)
where n is the total number of index terms in the collection.

Advantages:
1. Term weighting improves performance.
2. Retrieval is appropriate to the query conditions.
3. The degree of similarity can be evaluated.
Disadvantage:
Index terms are assumed to be mutually independent.
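
The cosine measure can be written out directly. In the sketch below, document and query vectors are plain lists of term weights over the same index terms; the sample weights and the function name are invented for illustration.

```python
import math


def cosine_similarity(doc_vec, query_vec):
    """Cosine of the angle between two weight vectors of equal length."""
    dot = sum(d * q for d, q in zip(doc_vec, query_vec))
    norm = math.sqrt(sum(d * d for d in doc_vec)) * math.sqrt(sum(q * q for q in query_vec))
    return dot / norm if norm else 0.0   # a zero vector is treated as dissimilar


# Invented weights over three index terms.
query = [1.0, 1.0, 0.0]
documents = {"d1": [0.5, 0.8, 0.0], "d2": [0.0, 0.2, 0.9], "d3": [0.0, 0.0, 0.0]}

ranking = sorted(documents, key=lambda name: cosine_similarity(documents[name], query),
                 reverse=True)
print(ranking)
```

Unlike the Boolean model, a document that matches the query only partially still receives a non-zero score and a place in the ranking.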

QUERIES IN IR SYSTEMS

Keywords generally consist of words, phrases, author name, date created, last updated and so on. Indexes are created based on these keywords. The user query is compared with the set of index keywords, and the matching results are output.

Keyword-based queries: Queries are combinations of words. The document collection is searched for documents that contain these words. Word queries are intuitive, easy to express and provide fast ranking. The concept of a word must be defined. A word is a sequence of letters terminated by a separator (period, comma, blank, etc.). The definition of letter and separator is flexible; e.g., a hyphen could be defined as a letter or as a separator. Usually, “trivial words” (such as “a”, “the”, or “of”) are ignored.

Boolean queries: Describe the information needed by relating multiple words with Boolean operators. Operators: and, or, except (except corresponds to and not). Semantics: For each query word w, a corresponding set Dw is constructed that includes the documents that contain w. The Boolean expression is then interpreted as an expression on the corresponding document sets with the corresponding set operators: and corresponds to intersection, or to union, and except to difference.

Basic queries

Single-word queries: A query is a single word. This is the simplest form of query. All documents that include this word are retrieved. Documents may be ranked by the frequency of this word in the document.

Phrase queries: A query is a sequence of words treated as a single unit. Also called “literal string” or “exact phrase” query. The phrase is usually surrounded by quotation marks. All documents that include this phrase are retrieved. Usually, separators (commas, colons, etc.) and “trivial words” (e.g., “a”, “the”, or “of”) in the phrase are ignored. In effect, this query is for a set of words that must appear in sequence. It allows users to specify a context and thus gain precision. Example: “United States of America”.

Multiple-word queries: A query is a set of words (or phrases). There are two interpretations: a document is retrieved if it includes any of the query words, or a document is retrieved if it includes each of the query words. Documents may be ranked by the number of query words they contain: a document containing n query words is ranked higher than a document containing m < n query words. Documents containing all the query words are ranked at the top. Documents containing only one query word are ranked at the bottom. Frequency counts may still be used to break ties among documents that contain the same query words. Example: The phrase “Venetian blind” finds documents that discuss Venetian blinds. The set (Venetian, blind) finds in addition documents that discuss blind Venetians.

Proximity queries: Restrict the distance within a document between two search terms. Important for large documents in which the two search words may appear in different contexts. Proximity specifications limit the acceptable occurrences and hence increase the precision of the search. General format: Word1 within m units of Word2. The unit may be character, word, paragraph, etc.

Examples: united within 5 words of american: Finds documents that discuss “United Airlines and American Airlines” but not “United States of America and the American dream”. nuclear within 0 paragraphs of cleanup: Finds documents that discuss “nuclear” and “cleanup” in the same paragraph.
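
A word-level proximity test can be sketched as follows. This is an illustrative assumption about how "within m words" might be measured (the difference between word positions, after lower-casing and splitting on whitespace); the function name within is hypothetical.

```python
def within(text, word1, word2, m):
    """True if some occurrence of word1 is at most m word positions from word2."""
    words = text.lower().split()
    positions1 = [i for i, w in enumerate(words) if w == word1]
    positions2 = [i for i, w in enumerate(words) if w == word2]
    return any(abs(i - j) <= m for i in positions1 for j in positions2)


print(within("United Airlines and American Airlines", "united", "american", 5))
print(within("United States of America and the American dream", "united", "american", 5))
```

In the first text the two words are three positions apart, so it matches; in the second they are six positions apart ("America" is a different word from "American"), so it does not.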

PART-A(2 MARK)

1. Define a distributed database system.

A distributed database system consists of a collection of sites, connected together via some kind of communications network, in which:
a. Each site is a full database system site in its own right, but
b. The sites have agreed to work together so that a user at any site can access data anywhere in the network exactly as if the data were all stored at the user's own site.

2. Define a distributed database management system. (R)(NOV2016)

A new software component at each site, logically an extension of the local DBMS, provides the necessary partnership functionality. It is the combination of these new components together with the existing DBMSs that constitutes what is usually called the distributed database management system.

3. What are the advantages of distributed databases?

It enables the structure of the database to mirror the structure of the enterprise: local data can be kept locally, where it most logically belongs, while at the same time remote data can be accessed when necessary.

4. What is the fundamental principle of distributed database?

The fundamental principle of a distributed database is that, to the user, a distributed system should look exactly like a non-distributed system.

5. What is OODBMS?

Object-oriented database management systems (OODBMSs) combine database capabilities with object-oriented programming language capabilities. OODBMSs allow object-oriented programmers to develop a product, store it as objects, and replicate or modify existing objects to make new objects within the OODBMS. Because the database is integrated with the programming language, the programmer can maintain consistency within one environment, in that both the OODBMS and the programming language use the same model of representation.

6. What is meant by information retrieval? (R)

An information retrieval process begins when a user enters a query into the system. Queries are formal statements of information needs, for example search strings in web search engines. In information retrieval a query does not uniquely identify a single object in the collection. Instead, several objects may match the query, perhaps with different degrees of relevancy.

7. What is meant by Relevancy ranking? (NOV2014)(U)

Relevancy ranking is the process of sorting the document results so that those documents which are most likely to be relevant to your query are shown at the top.

8. Define Crawling. (NOV2014)(R)

Web crawling is the process of search engines combing through web pages in order to properly index them. These “web crawlers” systematically crawl pages and look at the keywords contained on the page, the kind of content, and all the links on the page, and then return that information to the search engine's server for indexing. Then they follow all the hyperlinks on the website to get to other websites. When a search engine user enters a query, the search engine will go to its index and return the most relevant search results based on the keywords in the search term. Web crawling is an automated process and provides quick, up-to-date data.

9. What is meant by XML Database? (R)

An XML database is a data persistence software system that allows data to be specified, and sometimes stored, in XML format. These data can then be queried, transformed, exported and returned to a calling system.

PART-B AND C

1. Explain about Distributed Databases. (NOV 2014)

2. Explain in detail Information retrieval and Relevance Ranking.

3. Describe about OODBMS and XML Database.

4. Explain in detail Threats and risks in Database Management System.

5. Explain detail about IR components.

6. Write about XML schema and Xpath query with example.
