Dbms Unit
DISTRIBUTED DATABASES:-
A distributed database is a database that is not limited to one system; it is
spread over different sites, i.e., on multiple computers or over a network of
computers. A distributed database system is located on various sites that don't
share physical components. This may be required when a particular database needs
to be accessed by various users globally. It needs to be managed such that, to the
users, it looks like one single database.
A distributed database is a collection of multiple interconnected databases, which
are spread physically across various locations that communicate via a computer
network.
Types:
1. Homogeneous Database:
2. Heterogeneous Database:
There are 2 ways in which data can be stored on different sites. These are:
1.Replication
In this approach, the entire relation is stored redundantly at 2 or more sites. If the
entire database is available at all sites, it is a fully redundant database. Hence, in
replication, systems maintain copies of data.
This is advantageous as it increases the availability of data at different sites. Also,
now query requests can be processed in parallel.
However, it has certain disadvantages as well. Data needs to be constantly updated.
Any change made at one site needs to be recorded at every site where that relation is
stored, or else it may lead to inconsistency. This is a lot of overhead. Also,
concurrency control becomes much more complex, as concurrent access now needs
to be checked over a number of sites.
2.Fragmentation
In this approach, the relations are fragmented (i.e., they're divided into smaller
parts) and each of the fragments is stored at the sites where it is required.
It must be ensured that the fragments are such that they can be used to
reconstruct the original relation.
In a homogeneous distributed database, all the sites use identical DBMS and
operating systems. Its properties are −
Architectural Models
The client-server model is a two-level architecture where the functionality is divided into servers and
clients. The server functions primarily encompass data management, query
processing, optimization and transaction management. Client functions mainly
include the user interface. However, clients also have some functions like consistency
checking and transaction management.
In peer-to-peer systems, each peer acts both as a client and a server for providing database
services. The peers share their resources with other peers and coordinate their
activities.
Design Alternatives
The distribution design alternatives for the tables in a DDBMS are as follows −
In the non-replicated and non-fragmented design alternative, different tables are placed at different sites. Data is
placed so that it is in close proximity to the site where it is used most. It is most
suitable for database systems where the percentage of queries needed to join
information in tables placed at different sites is low. If an appropriate distribution
strategy is adopted, then this design alternative helps to reduce the communication
cost during data processing.
Fully Replicated
In this design alternative, one copy of all the database tables is stored at each site.
Since each site has its own copy of the entire database, queries are very fast,
requiring negligible communication cost. On the other hand, the massive redundancy
in data incurs a huge cost during update operations. Hence, this is suitable for
systems where a large number of queries must be handled whereas the
number of database updates is low.
Partially Replicated
Copies of tables or portions of tables are stored at different sites. The distribution
of the tables is done in accordance with the frequency of access. This takes into
consideration the fact that the frequency of accessing the tables varies considerably
from site to site. The number of copies of the tables (or portions) depends on how
frequently the access queries execute and on the sites which generate the access
queries.
Fragmented
In this design, a table is divided into two or more pieces referred to as fragments or
partitions, and each fragment can be stored at different sites. This considers the fact
that it seldom happens that all data stored in a table is required at a given site.
Moreover, fragmentation increases parallelism and provides better disaster
recovery. Here, there is only one copy of each fragment in the system, i.e. no
redundant data.
Vertical fragmentation
Horizontal fragmentation
Hybrid fragmentation
Mixed Distribution
This is a combination of fragmentation and partial replication. Here, the tables are
initially fragmented in any form (horizontal or vertical), and then these fragments
are partially replicated across the different sites according to the frequency of
accessing the fragments.
Data Replication
Data replication is the process of storing separate copies of the database at two or
more sites. It is a popular fault tolerance technique of distributed databases.
Snapshot replication
Near-real-time replication
Pull replication
Fragmentation
Fragmentation is the task of dividing a table into a set of smaller tables. The
subsets of the table are called fragments. Fragmentation can be of three types:
horizontal, vertical, and hybrid (combination of horizontal and vertical). Horizontal
Advantages of Fragmentation
Since data is stored close to the site of usage, efficiency of the database
system is increased.
Local query optimization techniques are sufficient for most queries since
data is locally available.
Since irrelevant data is not available at the sites, security and privacy of the
database system can be maintained.
Disadvantages of Fragmentation
When data from different fragments are required, the access times may be
very high.
In case of recursive fragmentations, the job of reconstruction will need
expensive techniques.
Lack of back-up copies of data in different sites may render the database
ineffective in case of failure of a site.
Vertical Fragmentation
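In vertical fragmentation, the columns of a table are grouped into fragments.
So that the original table can be reconstructed, each fragment should contain
the primary key field(s) of the table. Vertical fragmentation can be used to enforce
privacy of data.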
For example, let us consider that a University database keeps records of all
registered students in a Student table having the following schema.
STUDENT
Now, the fees details are maintained in the accounts section. In this case, the
designer will fragment the database as follows −
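A minimal sketch of vertical fragmentation, written in Python with the standard sqlite3 module. The STUDENT columns (regd_no, name, course, fees), the sample rows and the fragment names STD_FEES and STD_INFO are assumptions made for illustration. Each fragment keeps the primary key, so a join on that key rebuilds the original table.

```python
import sqlite3

# Hypothetical STUDENT schema; column names and rows are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE STUDENT (
    regd_no INTEGER PRIMARY KEY,
    name    TEXT,
    course  TEXT,
    fees    REAL
);
INSERT INTO STUDENT VALUES (1, 'Asha', 'Computer Science', 5000.0);
INSERT INTO STUDENT VALUES (2, 'Ravi', 'Physics', 4500.0);

-- Vertical fragments: each one keeps the primary key regd_no.
CREATE TABLE STD_FEES AS SELECT regd_no, fees FROM STUDENT;
CREATE TABLE STD_INFO AS SELECT regd_no, name, course FROM STUDENT;
""")

# Reconstruct the original rows by joining the fragments on the primary key.
rebuilt = conn.execute("""
    SELECT i.regd_no, i.name, i.course, f.fees
    FROM STD_INFO AS i JOIN STD_FEES AS f ON i.regd_no = f.regd_no
    ORDER BY i.regd_no
""").fetchall()
original = conn.execute("SELECT * FROM STUDENT ORDER BY regd_no").fetchall()
print(rebuilt == original)
```

In this sketch the accounts section would only need STD_FEES, which is how vertical fragmentation keeps the remaining columns away from that site.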
Horizontal Fragmentation
For example, in the student schema, if the details of all students of the Computer
Science course need to be maintained at the School of Computer Science, then
the designer will horizontally fragment the database as follows −
CREATE TABLE COMP_STD AS
SELECT * FROM STUDENT
WHERE COURSE = 'Computer Science';  -- the column name COURSE is assumed
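A minimal sketch of horizontal fragmentation and reconstruction, using Python's sqlite3 module. The column names, the sample rows and the second fragment OTHER_STD are assumptions for illustration. Horizontal fragments hold disjoint subsets of rows with the same columns, so their union gives back the original relation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical schema and rows, for illustration only.
CREATE TABLE STUDENT (regd_no INTEGER PRIMARY KEY, name TEXT, course TEXT);
INSERT INTO STUDENT VALUES (1, 'Asha', 'Computer Science');
INSERT INTO STUDENT VALUES (2, 'Ravi', 'Physics');
INSERT INTO STUDENT VALUES (3, 'Meena', 'Computer Science');

-- Horizontal fragments: disjoint subsets of rows, same columns.
CREATE TABLE COMP_STD  AS SELECT * FROM STUDENT WHERE course =  'Computer Science';
CREATE TABLE OTHER_STD AS SELECT * FROM STUDENT WHERE course <> 'Computer Science';
""")

comp_rows = conn.execute("SELECT * FROM COMP_STD ORDER BY regd_no").fetchall()

# The original relation is the union of its horizontal fragments.
rebuilt = conn.execute(
    "SELECT * FROM COMP_STD UNION ALL SELECT * FROM OTHER_STD ORDER BY regd_no"
).fetchall()
original = conn.execute("SELECT * FROM STUDENT ORDER BY regd_no").fetchall()
print(rebuilt == original)
```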
Hybrid Fragmentation
Location transparency
Fragmentation transparency
Replication transparency
Location Transparency
Location transparency ensures that the user can query any table(s) or
fragment(s) of a table as if they were stored locally at the user's site. The fact that
the table or its fragments are stored at a remote site in the distributed database
system should be completely hidden from the end user. The address of the remote
site(s) and the access mechanisms are completely hidden.
Fragmentation Transparency
This is somewhat similar to users of SQL views, where the user may not know that
they are using a view of a table instead of the table itself.
Replication Transparency
Replication transparency ensures that the replication of databases is hidden from the
users. It enables users to query a table as if only a single copy of the table
exists.
Combination of Transparencies
In any distributed database system, the designer should ensure that all the stated
transparencies are maintained to a considerable extent. The designer may choose to
fragment tables, replicate them and store them at different sites, all without the
end user being aware of it. However, complete distribution transparency is a tough task and requires
considerable design effort.
Authentication
Access rights
Integrity constraints
Authentication
Access Rights
A user's access rights refer to the privileges that the user is given regarding
DBMS operations, such as the rights to create a table, drop a table,
add/delete/update tuples in a table or query the table.
In distributed environments, since there are a large number of tables and a still larger
number of users, it is not feasible to assign individual access rights to users. So,
a DDBMS defines certain roles. A role is a construct with certain privileges within a
database system. Once the different roles are defined, the individual users are
assigned one of these roles. Often a hierarchy of roles is defined according to the
organization's hierarchy of authority and responsibility.
For example, the following SQL statements create a role "Accountant" and then
assign this role to user "ABC".
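As a language-neutral illustration of the idea, the following minimal Python sketch models roles as named sets of privileges and assigns a role to a user. The privilege names and the second role "Auditor" are assumptions, not part of the original example. In an actual DBMS, roles are normally created and assigned with SQL statements such as CREATE ROLE and GRANT.

```python
# Roles map to sets of privileges; users are assigned a role rather than
# individual rights. The privilege sets below are illustrative assumptions.
ROLE_PRIVILEGES = {
    "Accountant": {"SELECT", "INSERT", "UPDATE"},
    "Auditor": {"SELECT"},  # hypothetical second role
}
USER_ROLES = {"ABC": "Accountant"}


def is_allowed(user: str, operation: str) -> bool:
    """Return True if the user's role carries the requested privilege."""
    role = USER_ROLES.get(user)
    return role is not None and operation in ROLE_PRIVILEGES.get(role, set())


print(is_allowed("ABC", "UPDATE"), is_allowed("ABC", "DROP"))
```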
Semantic integrity control defines and enforces the integrity constraints of the
database system.
A data type constraint restricts the range of values and the type of operations that
can be applied to the field with the specified data type.
For example, let us consider that a table "HOSTEL" has three fields - the hostel
number, hostel name and capacity. The hostel number should start with capital
letter "H" and cannot be NULL, and the capacity should not be more than 150. The
following SQL command can be used for data definition −
);
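The constraints described for the HOSTEL table can be tried out with a minimal sketch in Python's sqlite3 module. The column names h_no, h_name and capacity and the sample rows are assumptions. GLOB is used instead of LIKE because LIKE is case-insensitive for ASCII letters in SQLite by default, whereas the rule asks for a capital "H".

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE HOSTEL (
    h_no     TEXT NOT NULL CHECK (h_no GLOB 'H*'),   -- must start with capital H
    h_name   TEXT,
    capacity INTEGER CHECK (capacity <= 150)         -- at most 150
)
""")


def try_insert(row):
    """Insert a row and report whether the constraints accepted it."""
    try:
        conn.execute("INSERT INTO HOSTEL VALUES (?, ?, ?)", row)
        return True
    except sqlite3.IntegrityError:
        return False


results = [
    try_insert(("H1", "Ganga", 120)),     # satisfies every constraint
    try_insert(("A2", "Yamuna", 100)),    # does not start with 'H'
    try_insert(("H3", "Kaveri", 200)),    # capacity above 150
    try_insert((None, "Godavari", 50)),   # NULL hostel number
]
print(results)
```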
Entity integrity control enforces the rules so that each tuple can be uniquely
identified from other tuples. For this a primary key is defined. A primary key is a
set of minimal fields that can uniquely identify a tuple. Entity integrity constraint
states that no two tuples in a table can have identical values for primary keys and
that no field which is a part of the primary key can have NULL value.
For example, in the above hostel table, the hostel number can be assigned as the
primary key through the following SQL statement (ignoring the checks) −
Referential integrity constraint lays down the rules of foreign keys. A foreign key
is a field in a data table that is the primary key of a related table. The referential
integrity constraint lays down the rule that the value of the foreign key field should
either be among the values of the primary key of the referenced table or be entirely
NULL.
For example, let us consider a student table where a student may opt to live in a
hostel. To include this, the primary key of hostel table should be included as a
foreign key in the student table. The following SQL statement incorporates this −
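Entity and referential integrity can be sketched together in Python's sqlite3 module. The column names (h_no, s_roll, s_name, s_hostel) and the sample rows are assumptions for illustration. SQLite enforces foreign keys only after PRAGMA foreign_keys is switched on for the connection.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # off by default in SQLite
conn.executescript("""
CREATE TABLE HOSTEL (
    h_no     TEXT PRIMARY KEY NOT NULL,   -- entity integrity: unique, not NULL
    h_name   TEXT,
    capacity INTEGER
);
CREATE TABLE STUDENT (
    s_roll   INTEGER PRIMARY KEY,
    s_name   TEXT NOT NULL,
    s_hostel TEXT REFERENCES HOSTEL(h_no) -- may be NULL: student not in a hostel
);
INSERT INTO HOSTEL VALUES ('H1', 'Ganga', 120);
""")


def accepted(sql, params):
    """Run an INSERT and report whether the integrity constraints allowed it."""
    try:
        conn.execute(sql, params)
        return True
    except sqlite3.IntegrityError:
        return False


ok_fk = accepted("INSERT INTO STUDENT VALUES (?, ?, ?)", (1, "Asha", "H1"))
ok_null = accepted("INSERT INTO STUDENT VALUES (?, ?, ?)", (2, "Ravi", None))
bad_fk = accepted("INSERT INTO STUDENT VALUES (?, ?, ?)", (3, "Meena", "H9"))
dup_pk = accepted("INSERT INTO HOSTEL VALUES (?, ?, ?)", ("H1", "Copy", 10))
print(ok_fk, ok_null, bad_fk, dup_pk)
```

The foreign key value is accepted when it matches an existing hostel or is NULL, and rejected otherwise; the duplicate primary key is rejected as well.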
Advantages:
1. Sharing data
2. Autonomy: each site is able to retain control over the data stored locally.
3. Availability: if one site fails, the other sites are able to continue operating.
Disadvantages:
1. Software development cost is very high.
2. Greater potential for bugs: it is harder to ensure the correctness of algorithms, especially during
failures of parts of the system.
3. Increased processing overhead: the exchange of messages and the additional computation needed to achieve
intersite coordination.
OBJECT-BASED DATABASES
Data models designed for data-processing applications are not adequate to support computer-aided
design (CAD), image databases, multimedia, hypertext databases and so on.
class Student
{
String name;
int rollno;
get_name();
set_name();
};
In an OODB, all objects have an object structure that holds all relevant
information corresponding to the real-world entity.
An OODB provides complete encapsulation of objects. Thus it specifies
predefined operations for an object. But predefined operations do not
support ad hoc queries by a user.
o A set of variables that contain the data for the object. The value of
each variable is itself an object.
o A set of messages to which the object responds.
o A set of methods, each of which is a body of code to implement each
message; a method returns a value as the response to the message.
3. Motivation of using messages and methods.
The ability to modify the definition of an object without affecting the rest of
the system is considered to be one of the major advantages of the OO
programming paradigm.
ODMG
XML DATABASE
XML (Extensible Markup Language) is a markup language.
XML is designed to store and transport data.
XML was released in the late 1990s. It was created to provide an easy way to use and store
self-describing data.
An XML database is a data persistence software system used for storing huge
amounts of information in XML format. It provides a secure place to store XML
documents.
You can query the stored data using XQuery, and export and serialize it into the desired
format. XML databases are usually associated with document-oriented databases.
1. XML-enabled database
2. Native XML database (NXD)
XML-enabled Database
Native XML database is used to store large amount of data. Instead of table format,
Native XML database is based on container format. You can query data by XPath
expressions.
<?xml version="1.0"?>
<contact-info>
<contact1>
<name>Vimal Jaiswal</name>
<company>SSSIT.org</company>
<phone>(0120) 4256464</phone>
</contact1>
<contact2>
<name>Mahesh Sharma</name>
<company>SSSIT.org</company>
<phone>09990449935</phone>
</contact2>
</contact-info>
In the above example, a container named contact-info is created and holds the contacts
(contact1 and contact2). Each one contains 3 elements: name, company and phone.
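To show how such a document can be read programmatically, here is a minimal sketch using Python's standard xml.etree.ElementTree module. It is an ordinary XML parser, not a native XML database, and the variable names are illustrative.

```python
import xml.etree.ElementTree as ET

xml_text = """<?xml version="1.0"?>
<contact-info>
  <contact1>
    <name>Vimal Jaiswal</name>
    <company>SSSIT.org</company>
    <phone>(0120) 4256464</phone>
  </contact1>
  <contact2>
    <name>Mahesh Sharma</name>
    <company>SSSIT.org</company>
    <phone>09990449935</phone>
  </contact2>
</contact-info>
"""

root = ET.fromstring(xml_text)  # root is the <contact-info> element

# Each child of the root is one contact holding name, company and phone.
contacts = [
    (c.tag, c.findtext("name"), c.findtext("phone"))
    for c in root
]
for tag, name, phone in contacts:
    print(tag, name, phone)
```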
XML DOCUMENT
XML provides detailed information about the structure and meaning of the data in
web pages.
It is suited for:
Structured data
Semi-Structured data
Unstructured data
Structured data
In structured data, all information is stored in a predefined format. There
are no deviations in the arrangement of the data or in the operations performed on it. Data
that is stored in a database in a specified format is normally termed structured
data.
Example:
Semi-Structured data
Unstructured data
User text entered in a web search is the best example of unstructured data. A text
document which has information embedded in it is called unstructured data.
The nested structure of tags helps to give in-depth knowledge about the database, but it
may lead to redundancy.
1) Element
2) Attribute
Attributes in XML are strings that do not contain markup and appear only once in
a given tag. XML is used to exchange data between applications, with the help of a
mechanism called namespaces. The namespace mechanism provides globally unique
names for the element tags in a document.
Example: Webaddress
https://www.annauniv.edu
<college xmlns:au="//www.au.com">
<au:department>
<au:course>CSE</au:course>
<au:Block>Milton</au:Block>
</au:department>
</college>
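A minimal sketch of how a program resolves namespace-qualified names, using Python's xml.etree.ElementTree. The prefix-to-URI mapping passed to find() is a local choice; what actually identifies an element is the namespace URI, which ElementTree shows in the form {uri}name.

```python
import xml.etree.ElementTree as ET

xml_text = """
<college xmlns:au="//www.au.com">
  <au:department>
    <au:course>CSE</au:course>
    <au:Block>Milton</au:Block>
  </au:department>
</college>
"""

root = ET.fromstring(xml_text)
ns = {"au": "//www.au.com"}  # prefix -> namespace URI used for lookups

course = root.find("au:department/au:course", ns).text
qualified = root.find("au:department", ns).tag  # fully qualified tag name
print(course, qualified)
```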
An XML document can be modeled as a tree structure using nodes. Each node has a
corresponding element and attributes. The terms parent, child, ancestor, sibling and
descendant are used in the XML tree model.
There are two types of elements: complex elements, which are internal nodes, and
simple elements, which are leaf nodes.
Simple elements contain data values, whereas complex elements are constructed hierarchically
from other elements.
Example:
<bookstore>
<book category="children">
<title>HarryPotter</title>
<author>JK. Rowling</author>
<year>2005</year>
<price>29.99</price>
</book>
<book category="web">
<title>Learning XML</title>
<author>Erik T. Ray</author>
<year>2003</year>
<price>39.95</price>
</book>
</bookstore>
Here title, author, year and price are simple elements that hold data values, while
bookstore and book, which are constructed from other elements, are complex elements.
them over or display them on the Web. These usually follow a predefined schema
that defines the tag names.
In well-formed XML, there is only a single root element, and every element
must have a start and end tag within the start and end tags of its parent
element. DOM (Document Object Model) allows programs to manipulate the tree
representation of a well-formed XML document. It parses the entire tree
structure.
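The well-formedness rules above can be checked with a minimal sketch in Python's xml.etree.ElementTree, which raises ParseError on input that is not well-formed. The helper name is_well_formed and the sample strings are illustrative. This parser does not validate against a DTD or schema, so it tests well-formedness only.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text: str) -> bool:
    """Return True if the text parses as a well-formed XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


good = "<note><to>Tove</to><from>Jani</from></note>"
bad_nesting = "<note><to>Tove</note></to>"  # end tags cross each other
two_roots = "<note/><note/>"                # more than one root element

print(is_well_formed(good), is_well_formed(bad_nesting), is_well_formed(two_roots))
```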
A valid XML document must satisfy the structure specified in an XML schema
file or an XML DTD (Document Type Definition).
Example: XML Schema for project
<xsd:schema xmlns:xsd="http://www.project.com/2010/xmlschema">
<xsd:annotation>
<xsd:documentation xml:lang="en">project expo</xsd:documentation>
</xsd:annotation>
<xsd:complexType name="Name">
<xsd:sequence>
<xsd:element name="first_name" type="xsd:string"/>
<xsd:element name="last_name" type="xsd:string"/>
<xsd:element name="middle_name" type="xsd:string"/>
</xsd:sequence>
</xsd:complexType>
<!-- the type name "Employee" is an assumption -->
<xsd:complexType name="Employee">
<xsd:sequence>
<xsd:element name="emp_id" type="xsd:integer"/>
<xsd:element name="f_name" type="xsd:string"/>
<xsd:element name="dept" type="xsd:string"/>
</xsd:sequence>
</xsd:complexType>
</xsd:schema>
IIYEAR/IV SEM Page 45
CS8492-DBMS-UNIT-5 ADVANCED TOPICS
To refer to the primary key and foreign key of a schema, xsd:keyref is used.
xsd:annotation and xsd:documentation are used for providing comments and
other descriptions in an XML document.
The structure of the company root element is xsd:complexType.
A DTD defines the structure and the legal elements and attributes of an XML
document.
Use a DTD
With a DTD, independent groups of people can agree on a standard DTD for
interchanging data.
Syntax
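<?xml version="1.0"?>
<!DOCTYPE note [
<!ELEMENT note (to,from,heading,body)>
<!ELEMENT to (#PCDATA)>
<!ELEMENT from (#PCDATA)>
<!ELEMENT heading (#PCDATA)>
<!ELEMENT body (#PCDATA)>
]>
A DTD kept in an external file is referenced with the SYSTEM keyword:
<?xml version="1.0"?>
<!DOCTYPE note SYSTEM "note.dtd">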
<note>
<to>Tove</to>
<from>Jani</from>
<heading>Reminder</heading>
<body>Don't forget me this weekend!</body>
</note>
Limitations
1) Data types are not very general or sophisticated; all values are treated as
strings.
2) Since a DTD has its own syntax, it needs a specialized processor to handle it.
3) Elements must follow the specified ordering; unordered elements are not permitted.
4) It is difficult to specify unordered sets of sub-elements.
XML Schema
An XML Schema describes the structure of an XML document, just like a DTD.
An XML document validated against an XML Schema is both "Well Formed" and
"Valid".
XML Schema
<xs:element name="note">
<xs:complexType>
<xs:sequence>
<xs:element name="to" type="xs:string"/>
<xs:element name="from" type="xs:string"/>
<xs:element name="heading" type="xs:string"/>
<xs:element name="body" type="xs:string"/>
</xs:sequence>
</xs:complexType>
</xs:element>
With XML Schema, your XML files can carry a description of their own format.
With XML Schema, independent groups of people can agree on a standard for
interchanging data.
One of the greatest strengths of XML Schemas is the support for data types:
Another great strength about XML Schemas is that they are written in XML:
XQUERY
What is XQuery?
"books.xml":
<bookstore>
<book category="COOKING">
<title lang="en">Everyday Italian</title>
<author>Giada De Laurentiis</author>
<year>2005</year>
<price>30.00</price>
</book>
<book category="CHILDREN">
<title lang="en">Harry Potter</title>
<author>J K. Rowling</author>
<year>2005</year>
<price>29.99</price>
</book>
<book category="WEB">
<title lang="en">XQuery Kick Start</title>
<author>James McGovern</author>
<author>Per Bothner</author>
<author>Kurt Cagle</author>
<author>James Linn</author>
<author>Vaidyanathan Nagarajan</author>
<year>2003</year>
<price>49.99</price>
</book>
<book category="WEB">
<title lang="en">Learning XML</title>
<author>Erik T. Ray</author>
<year>2003</year>
<price>39.95</price>
</book>
</bookstore>
Path Expressions
The following path expression is used to select all the title elements in the
"books.xml" file:
doc("books.xml")/bookstore/book/title
(/bookstore selects the bookstore element, /book selects all the book elements
under the bookstore element, and /title selects all the title elements under each
book element)
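The same step-by-step selection can be sketched in Python with xml.etree.ElementTree, which supports a subset of XPath rather than XQuery itself. The inline document is a shortened copy of "books.xml", with authors and years left out for brevity.

```python
import xml.etree.ElementTree as ET

# Shortened stand-in for books.xml (only title and price are kept).
books_xml = """
<bookstore>
  <book category="COOKING"><title lang="en">Everyday Italian</title><price>30.00</price></book>
  <book category="CHILDREN"><title lang="en">Harry Potter</title><price>29.99</price></book>
  <book category="WEB"><title lang="en">XQuery Kick Start</title><price>49.99</price></book>
  <book category="WEB"><title lang="en">Learning XML</title><price>39.95</price></book>
</bookstore>
"""

root = ET.fromstring(books_xml)  # root is the <bookstore> element

# Equivalent of /bookstore/book/title, evaluated relative to the root element.
titles = [t.text for t in root.findall("./book/title")]
print(titles)
```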
XQuery shares the same data types as XML Schema 1.0 (XSD).
XSD String
XSD Date
XSD Numeric
XSD Misc
As we have seen in the previous chapters, we are selecting and filtering elements
with either a Path expression or with a FLWOR expression.
for $x in doc("books.xml")/bookstore/book
where $x/price>30
order by $x/title
return $x/title
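For readers without an XQuery processor, the FLWOR expression above can be mimicked in plain Python. This is a sketch of the for / where / order by / return steps, not an XQuery implementation, and the inline document is a shortened copy of "books.xml".

```python
import xml.etree.ElementTree as ET

# Shortened stand-in for books.xml (only title and price are kept).
books_xml = """
<bookstore>
  <book><title lang="en">Everyday Italian</title><price>30.00</price></book>
  <book><title lang="en">Harry Potter</title><price>29.99</price></book>
  <book><title lang="en">XQuery Kick Start</title><price>49.99</price></book>
  <book><title lang="en">Learning XML</title><price>39.95</price></book>
</bookstore>
"""
root = ET.fromstring(books_xml)

selected = sorted(
    # for $x in .../book  where $x/price > 30
    (b for b in root.findall("book") if float(b.findtext("price")) > 30),
    # order by $x/title
    key=lambda b: b.findtext("title"),
)
# return $x/title
result = [b.findtext("title") for b in selected]
print(result)
```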
XPath
These path expressions look very much like the path expressions you use with
traditional computer file systems:
There are functions for string values, numeric values, booleans, date and time
comparison, node manipulation, sequence manipulation, and much more.
Today XPath expressions can also be used in JavaScript, Java, XML Schema,
PHP, Python, C and C++, and lots of other languages.
<bookstore>
<book>
<title lang="en">Harry Potter</title>
<price>29.99</price>
</book>
<book>
<title lang="en">Learning XML</title>
<price>39.95</price>
</book>
</bookstore>
Selecting Nodes
XPath uses path expressions to select nodes in an XML document. The node is
selected by following a path or steps. The most useful path expressions are listed
below:
Expression Description
nodename Selects all nodes with the name "nodename"
/ Selects from the root node
@ Selects attributes
By using the | operator in an XPath expression you can select several paths.
In the table below we have listed some path expressions and the result of the
expressions:
An absolute location path starts with a slash ( / ) and a relative location path does
not. In both cases the location path consists of one or more steps, each separated by
a slash:
/step/step/...
step/step/...
an axis (defines the tree-relationship between the selected nodes and the
current node)
a node-test (identifies a node within an axis)
zero or more predicates (to further refine the selected node-set)
axisname::nodetest[predicate]
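A few of these selections can be tried with Python's xml.etree.ElementTree. It implements only a limited subset of XPath: the explicit axisname:: syntax and the | operator are not supported, so the sketch below uses abbreviated paths with predicates.

```python
import xml.etree.ElementTree as ET

xml_text = """
<bookstore>
  <book>
    <title lang="en">Harry Potter</title>
    <price>29.99</price>
  </book>
  <book>
    <title lang="en">Learning XML</title>
    <price>39.95</price>
  </book>
</bookstore>
"""
root = ET.fromstring(xml_text)

# Node test: every title element anywhere below the root.
all_titles = [t.text for t in root.findall(".//title")]

# Attribute predicate: titles whose lang attribute equals 'en'.
english = [t.text for t in root.findall(".//title[@lang='en']")]

# Child-value predicate: books whose price child has the text 29.99.
cheap = [b.findtext("title") for b in root.findall("./book[price='29.99']")]

# Attribute access, the counterpart of selecting @lang.
langs = [t.get("lang") for t in root.iter("title")]
print(all_titles, english, cheap, langs)
```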
INFORMATION RETRIEVAL
IR Components
User Query is the text entered by the user to search for the information.
Text operations receive the text entered by the user and convert it into tokens
that serve as the matching keywords for the search.
Indexing directs the generated tokens to different pointers. Based upon how well
the tokens match the pointers, different query responses are
generated.
These query responses are ranked based on a relevance metric.
The user interface manages user interaction by processing input and visualizing
output. Before indexing, a logical view of the documents must be created by the query
manager. Differs are used for defining the database index structures.
The time and memory space spent on defining the text database and building the index are
amortized by querying the retrieval system many times.
The search information provided by the user is sent to text acquisition, which gathers all the
search tokens. The matching tokens are then used to identify pointers. The
best-matching results are selected; an index is created for those links
and stored in the index DB with rankings.
The document data store holds the metadata for all documents, including document
type, structure, length and so on.
The set of search keywords is called the index terms. The collection of terms
indexed for a document is called the document vocabulary. The index is updated
whenever a new search is performed.
The inverted file, or inverted index, stores the index, whereas the document file stores
the documents.
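A minimal sketch of an inverted index in Python: each index term points to the set of documents that contain it. The sample documents, the whitespace tokenizer and the stop-word list are assumptions for illustration.

```python
from collections import defaultdict

# Hypothetical document file: document id -> text.
documents = {
    1: "distributed database systems",
    2: "the xml database",
    3: "information retrieval systems",
}
STOP_WORDS = {"the", "a", "of"}  # illustrative stop-word list


def build_inverted_index(docs):
    """Map every index term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            if token not in STOP_WORDS:
                index[token].add(doc_id)
    return index


index = build_inverted_index(documents)
print(sorted(index["database"]), sorted(index["systems"]))
```

A keyword lookup then becomes a dictionary access instead of a scan over every document.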
Query processing in IR system involves the following steps
User interface
Query process
Ranking and evaluation
Delivering information
Relevance Ranking
IR aims to identify relevant documents and rank them by relevance.
Different relevance ranking approaches are:
(i) Relevance using terms
Ranking using TF-IDF
Similarity based retrieval
TF(d,t)=log(1+n(d,t)/n(d))
IDF (t)=1/n(t)
Stop words are a collection of common words that the IR system
ignores while indexing a document.
Proximity refers to how close together the query terms occur in a document. Documents in which the terms occur close together are given higher priority.
r(d,Q) = sum over all terms t in Q of TF(d,t) * IDF(t)
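The TF and IDF formulas above can be combined into a relevance score, as in the following minimal Python sketch. It assumes the usual reading of the symbols, which the text does not spell out: n(d,t) is the number of occurrences of term t in document d, n(d) is the number of terms in d, and n(t) is the number of documents containing t. The sample documents and the query are hypothetical, and the relevance of a document to a query is taken as the sum of TF * IDF over the query terms.

```python
import math

# Hypothetical tokenized documents.
docs = {
    "d1": "database systems store data in a database".split(),
    "d2": "xml stores data as documents".split(),
    "d3": "retrieval systems rank documents".split(),
}


def tf(doc_id, term):
    """TF(d,t) = log(1 + n(d,t) / n(d))."""
    words = docs[doc_id]
    return math.log(1 + words.count(term) / len(words))


def idf(term):
    """IDF(t) = 1 / n(t); taken as 0 when no document contains the term."""
    n_t = sum(1 for words in docs.values() if term in words)
    return 1 / n_t if n_t else 0.0


def relevance(doc_id, query):
    """Sum of TF(d,t) * IDF(t) over the query terms."""
    return sum(tf(doc_id, t) * idf(t) for t in query)


query = ["database", "systems"]
ranking = sorted(docs, key=lambda d: relevance(d, query), reverse=True)
print(ranking)
```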
Similarity based Retrieval
The user provides a document A to the system, and the system produces as output the documents that are
similar to A.
TF(A,t) * IDF(t)
The resulting set of documents is the search result shown to the user. This is called
relevance feedback. Representing documents as points and vectors in an
n-dimensional space is called the vector space model.
Popularity Ranking
It is also called prestige ranking. It gives higher priority to pages that are popular on the
web.
Pages stored in users' bookmark files count as popular. One website
linking to another popular website is also an example of this.
Popularity can be measured by the number of pages that link to a particular page.
Page Rank
Google measures page rank based on the links to a page. It can be understood
through the random walk model. The PageRank algorithm does not take query
keywords into account. This can be resolved by using the keywords in the anchor text of links.
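The random walk idea can be sketched as a short power iteration in Python. The link graph, the damping factor of 0.85 and the iteration count are assumptions chosen for illustration, not values from the text. A surfer follows a random outgoing link with probability equal to the damping factor and otherwise jumps to a random page.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Approximate PageRank by repeatedly redistributing rank along links."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives the "random jump" share first.
        new_rank = {p: (1 - damping) / n for p in pages}
        for page, outgoing in links.items():
            if outgoing:
                share = damping * rank[page] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share
            else:
                # A page without links spreads its rank evenly over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical link graph: A and B both link to C, and C links back to A.
links = {"A": ["C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(links)
print(max(ranks, key=ranks.get))
```

In this graph C collects links from two pages and ends up with the highest rank, while B, which nobody links to, keeps only the random-jump share.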
Search Engine Spamming
These are sites that are not popular but are made to obtain a high relevance rank for some queries.
Synonyms
Synonym-based or context-based information retrieval is also in practice.
Retrieval effectiveness is measured by,
Percentage of false negatives
Percentage of false positives.
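One common way to compute these two measures is sketched below in Python. The document ids are hypothetical, and the definitions used are assumptions: false positives are taken as a fraction of the retrieved documents and false negatives as a fraction of the relevant documents, which makes them the complements of precision and recall.

```python
relevant = {1, 2, 3, 4}   # documents that actually answer the query
retrieved = {3, 4, 5}     # documents the system returned

false_positives = retrieved - relevant   # returned but not relevant
false_negatives = relevant - retrieved   # relevant but missed

fp_fraction = len(false_positives) / len(retrieved)
fn_fraction = len(false_negatives) / len(relevant)

precision = 1 - fp_fraction   # share of retrieved documents that are relevant
recall = 1 - fn_fraction      # share of relevant documents that were retrieved
print(false_positives, false_negatives)
```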
Information Retrieval Models
Classical Models are,
Boolean Model
Vector Model
Probabilistic Model
Fuzzy Model
Semantic Model
Boolean Model: In the Boolean model, index term weights are binary, {0,1}. It
is a simple model based on set theory.
Index terms are considered either present or absent. The set-theoretic Boolean
operations are AND, OR and NOT. A query expressed as a Boolean expression is
represented as a disjunction of conjunctive vectors.
Q = (1,1,1) ∨ (1,1,0) ∨ (1,0,0), where each component is a binary weighted vector
associated with the tuple (ba,bb,bc).
Advantages:
1. Simplicity of model
Disadvantages
1. Retrieval performance is low.
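The Boolean model maps directly onto set operations, as this minimal Python sketch shows. The index contents and the query are hypothetical. AND becomes set intersection, OR becomes union, and NOT becomes the complement with respect to the whole collection.

```python
# Hypothetical inverted index: term -> ids of the documents containing it.
index = {
    "database": {1, 2, 4},
    "xml": {2, 3},
    "retrieval": {3, 4},
}
all_docs = {1, 2, 3, 4}

# Query: database AND (xml OR NOT retrieval)
not_retrieval = all_docs - index["retrieval"]
result = index["database"] & (index["xml"] | not_retrieval)
print(sorted(result))
```

A document either matches or it does not, so the result carries no ranking.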
Vector Space Model
Documents are represented as features and weights in an n-dimensional vector
space of terms.
The weight associated with a pair (ki,dj) is positive and non-binary.
The cosine of the angle between the query and document vectors is commonly used for
assessing similarity.
The term weights are used to compute the degree of similarity to the user query.
Query vector:
Q = (w1,Q, w2,Q, ..., wn,Q)
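Cosine similarity between a query vector and document vectors can be sketched in a few lines of Python. The three-term vocabulary and the weights are hypothetical. A value near 1 means the vectors point in the same direction, and 0 means they share no terms.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two term-weight vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an all-zero vector has no direction
    return dot / (norm_u * norm_v)


# Hypothetical weights over a three-term vocabulary.
query = [1.0, 1.0, 0.0]
doc_a = [2.0, 2.0, 0.0]   # same direction as the query
doc_b = [0.0, 0.0, 3.0]   # shares no terms with the query

sim_a = cosine_similarity(query, doc_a)
sim_b = cosine_similarity(query, doc_b)
print(sim_a > sim_b)
```

Because the measure depends only on the angle, doc_a scores as a perfect match even though its weights are twice those of the query.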
QUERIES IN IR SYSTEMS
Proximity queries: Restrict the distance within a document between two search
terms. Important for large documents in which the two search words may appear in
PART-A(2 MARK)
5.What is OODBMS?
Object-oriented database management systems (OODBMSs) combine
other websites. When a search engine user enters a query, the search engine will
go to its index and return the most relevant search results based on the keywords
in the search term. Web crawling is an automated process and provides quick, up
to date data.
PART-B AND C