Table Of Contents

1 INTRODUCTION
  1.1 Database Management System
  1.2 Information Retrieval And Database Querying
  1.3 Ranking Based Querying

2 QUERIED UNITS
  2.1 Definition
  2.2 QUNIT Utility
  2.3 QUNIT Based Search

3 QUNIT DERIVATION
  3.1 Using Schema And Data
  3.2 Using Query Log
  3.3 Using External Evidences

4 EXPERIMENTS AGAINST STRUCTURED DATABASE
  4.1 Understanding Search
  4.2 Movie Querylog Benchmark
  4.3 Evaluating Result Quality

5 CONCLUSION

6 REFERENCES

1 INTRODUCTION

1.1 DATABASE MANAGEMENT SYSTEM


A Database Management System (DBMS) is a set of computer programs that controls the creation, maintenance, and use of the database of an organization and its end users, with the computer as a platform. It allows organizations to place control of organization-wide database development in the hands of database administrators (DBAs) and other specialists. A DBMS is a system software package that supports the use of an integrated collection of data records and files known as databases. It allows different user application programs to easily access the same database. DBMSs may use any of a variety of database models, such as the network model or the relational model. In large systems, a DBMS allows users and other software to store and retrieve data in a structured way. Instead of having to write computer programs to extract information, users can ask simple questions in a query language. Thus, many DBMS packages provide fourth-generation programming languages (4GLs) and other application development features. A DBMS helps to specify the logical organization of a database and to access and use the information within it. It provides facilities for controlling data access, enforcing data integrity, managing concurrency control, and restoring the database.

Structured data is data that resides in fixed fields within a record or file. Relational databases and spreadsheets are examples of structured data. The information stored in databases is known as structured data because it is represented in a strict format. Each record in a relational database table follows the same format as the other records in that table. For structured data, it is common to carefully design the database in order to create the database schema. The DBMS then checks to ensure that all data follows the structures and constraints specified in the schema.

However, not all data is collected and inserted into carefully designed structured databases. In some applications, data is collected in an ad hoc manner before it is known how it will be stored and managed. This data may have a certain structure, but not all the information collected will have an identical structure. Some attributes may be shared among the various entities, but other attributes may exist only in a few entities. Moreover, additional attributes can be introduced in some of the newer data items at any time, and there is no predefined schema. This type of data is known as semi-structured data. A number of data models have been introduced for representing semi-structured data, often based on tree or graph data structures rather than the flat relational model structures.

A key difference between structured and semi-structured data concerns how the schema constructs (such as the names of attributes, relationships, and entity types) are handled. In semi-structured data, the schema information is mixed in with the data values, since each data object can have different attributes that are not known in advance.

Figure 1

The database system contains not only the database itself but also a complete definition or description of the database structure and constraints. This definition is stored in the DBMS catalog, which contains information such as the structure of each file, the type and storage format of each data item, and various constraints on the data. The information stored in the catalog is called meta-data.

The description of a database is called the database schema, which is specified during database design and is not expected to change frequently. A schema diagram displays only some aspects of a schema, such as the names of record types and data items, and some types of constraints. The actual data in a database may change quite frequently. The data in the database at a particular moment in time is called a database state or snapshot. It is also called the current set of occurrences or instances in the database.

1.2 INFORMATION RETRIEVAL & DATABASE QUERYING

Information retrieval (IR) is the science of searching for documents, for information within documents, and for metadata about documents, as well as that of searching relational databases and the World Wide Web. IR is interdisciplinary, based on computer science, mathematics, library science, information science, information architecture, cognitive psychology, linguistics, statistics, and physics. Automated information retrieval systems are used to reduce what has been called "information overload". Many universities and public libraries use IR systems to provide access to books, journals and other documents. Web search engines are the most visible IR applications.

Figure 1

Figure 2

A database query is an operation that extracts a record set from a database. A query consists of search criteria expressed in a database query language such as SQL. For example, the query can specify that only certain columns or only certain records be included in the record set.
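As a minimal sketch of a query that restricts both the columns and the records of the result, the snippet below uses Python's built-in sqlite3 module. The movie table and its three rows are assumptions made up for illustration; they do not come from the report.

```python
import sqlite3

# Hypothetical in-memory table, used only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE movie (id INTEGER PRIMARY KEY, title TEXT, year INTEGER)")
conn.executemany(
    "INSERT INTO movie (id, title, year) VALUES (?, ?, ?)",
    [(1, "Movie A", 1977), (2, "Movie B", 2000), (3, "Movie C", 1997)],
)

# Only one column (title) and only some records (year before 2000) are returned.
recordset = conn.execute(
    "SELECT title FROM movie WHERE year < 2000 ORDER BY title"
).fetchall()
print(recordset)
```

The SELECT list controls which columns appear and the WHERE clause controls which records appear, which is the Boolean selection behaviour discussed in the next section.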

1.3 RANKING BASED QUERY

To rank documents, IR systems assign a score for each document as an estimation of the document relevance to the given query. Automated ranking of the results of a query is a popular aspect of the query model in Information Retrieval (IR). In contrast, database systems support only a Boolean query model. For example, a selection query on a SQL database returns all tuples that satisfy the conditions in the query. Therefore, the following two scenarios are not gracefully handled by a SQL system:

Empty answers: When the query is too selective, the answer may be empty. In that case, it is desirable to have the option of requesting a ranked list of approximately matching tuples without having to specify the ranking function that captures proximity to the query. An FBI agent or an analyst involved in data exploration will find such functionality appealing.

Many answers: When the query is not selective enough, too many tuples may be in the answer. In such a case, it is desirable to have the option of ordering the matches automatically, so that globally more important answer tuples are ranked higher and only the best matches are returned.

Conceptually, the problem of automated ranking of query results is really that of taking a user query (say, a conjunctive selection query) and mapping it to a Top-K query with a ranking function that depends on the given conditions in the user query. The key questions are:
1) How to derive such ranking functions automatically?
2) How well do ranking functions from IR apply?
3) Are the ranking techniques for handling the empty answers and many answers problems different?
4) How to execute such Top-K queries efficiently over large databases?
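The mapping from a conjunctive selection query to a Top-K query can be sketched as follows. This is only an illustration under stated assumptions: the ranking function (the number of query conditions a tuple satisfies) is a deliberately simple stand-in chosen here, not a ranking function proposed in the literature, and the rows and conditions are hypothetical.

```python
from typing import Any, Callable, Dict, List, Tuple

Row = Dict[str, Any]
Condition = Callable[[Row], bool]


def boolean_select(rows: List[Row], conditions: List[Condition]) -> List[Row]:
    """Boolean model: keep only the rows that satisfy every condition."""
    return [row for row in rows if all(cond(row) for cond in conditions)]


def top_k(rows: List[Row], conditions: List[Condition], k: int) -> List[Tuple[int, Row]]:
    """Rank rows by how many conditions they satisfy and keep the best k.

    Counting satisfied conditions is an assumed, simplistic ranking function.
    """
    scored = [(sum(1 for cond in conditions if cond(row)), row) for row in rows]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # stable: ties keep input order
    return scored[:k]


# Hypothetical tuples; the query is genre = 'sci-fi' AND year >= 2000 AND rating >= 8.
rows = [
    {"title": "m1", "genre": "sci-fi", "year": 1977, "rating": 8.6},
    {"title": "m2", "genre": "drama", "year": 2000, "rating": 7.8},
    {"title": "m3", "genre": "sci-fi", "year": 1999, "rating": 8.7},
    {"title": "m4", "genre": "comedy", "year": 1995, "rating": 6.0},
]
conditions = [
    lambda r: r["genre"] == "sci-fi",
    lambda r: r["year"] >= 2000,
    lambda r: r["rating"] >= 8,
]

exact = boolean_select(rows, conditions)  # empty: the query is too selective
best = [row["title"] for _, row in top_k(rows, conditions, k=2)]
print(exact, best)
```

Here the Boolean query returns nothing, while the Top-K variant still returns the two closest tuples, which is the behaviour asked for in the empty answers scenario.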

2 QUERIED UNITS (QUNITS)

2.1 DEFINITION
Qunit: A qunit (Queried Unit) is the basic, independent semantic unit of information in a database. Conceptually, the database is a collection of independent qunits, each of which represents the desired result for some query against the database. Qunits can be treated as documents for standard IR-like document retrieval.

A qunit definition can be considered as a combination of these two parts:
Base expression: It can be considered as a view on a database. It consists of a stored query accessible as a virtual table composed of the result set of a query.
Conversion expression: It converts the data into the form in which we want to present it. Thus it can be used to determine various presentations of the given data.

A Basic Qunit Example

Consider the IMDB (Internet Movie Database) database. We would like to create a qunit definition corresponding to the information need "cast". A cast is defined as the people associated with a movie. We do not want the name of the movie repeated with each record; instead, we would like a presentation with the movie title on top and one record for each cast member. The base data in IMDB is relational, and against its schema, we would write the base expression in SQL with the conversion expression in XSL-like markup as follows:

SQL: SQL, often referred to as Structured Query Language, is a database computer language designed for managing data in relational database management systems (RDBMS), and originally based upon relational algebra. Its scope includes data query and update, schema creation and modification, and data access control.

XSL: Extensible Stylesheet Language (XSL) is used to refer to a family of languages used for transforming and rendering XML documents. XML (Extensible Markup Language) is a set of rules for encoding documents electronically. Although XML's design focuses on documents, it is widely used for the representation of arbitrary data structures.

Base Expression

    SELECT * FROM person, cast, movie
    WHERE cast.movie_id = movie.id AND
          cast.person_id = person.id AND
          movie.title = "$x"
    RETURN

Conversion Expression

    <cast movie="$x">
      <foreach:tuple>
        <person>$person.name</person>
      </foreach:tuple>
    </cast>

The combination of these two expressions (Base and Conversion) forms our qunit definition. On applying this definition to a database, we derive qunit instances, one per movie.
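To make the two parts of a qunit definition concrete, the sketch below applies a base expression and a conversion step to a tiny in-memory database using Python's sqlite3 and xml.etree.ElementTree modules. The table contents, the helper name qunit_instance, and the choice to select only person.name (instead of SELECT *) are assumptions for illustration; the XSL-like conversion is emulated with ordinary Python code.

```python
import sqlite3
import xml.etree.ElementTree as ET

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE movie  (id INTEGER PRIMARY KEY, title TEXT);
    -- "cast" is quoted because CAST is also an SQL keyword.
    CREATE TABLE "cast" (person_id INTEGER, movie_id INTEGER);
    INSERT INTO person VALUES (1, 'Person A'), (2, 'Person B'), (3, 'Person C');
    INSERT INTO movie  VALUES (10, 'Movie X'), (20, 'Movie Y');
    INSERT INTO "cast" VALUES (1, 10), (2, 10), (3, 20);
""")

# Base expression: a parameterised view; "?" plays the role of $x.
BASE_EXPRESSION = """
    SELECT person.name
    FROM person, "cast", movie
    WHERE "cast".movie_id = movie.id
      AND "cast".person_id = person.id
      AND movie.title = ?
    ORDER BY person.name
"""


def qunit_instance(title: str) -> str:
    """Evaluate the base expression, then apply the conversion step."""
    rows = conn.execute(BASE_EXPRESSION, (title,)).fetchall()
    root = ET.Element("cast", {"movie": title})  # movie title once, on top
    for (name,) in rows:                         # one <person> per tuple
        ET.SubElement(root, "person").text = name
    return ET.tostring(root, encoding="unicode")


print(qunit_instance("Movie X"))
```

Calling qunit_instance once per movie title yields one qunit instance per movie, with the title stated once and one record per cast member.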

2.2 QUNIT UTILITY

The utility of a qunit applies to both qunit definitions and qunit instances. By qunit utility we mean the importance of a qunit in relation to a user query against a database. The total number of possible views in a database is very large, so the total number of candidate qunit definitions is massive. The importance of a qunit is expressed as a utility score. The utility score of a qunit is used to select the most relevant qunits from the large pool of candidate qunits, so that the most relevant output is returned for a particular user query. Qunit utility is subjective: it depends on each user's purpose and need, and differs from user to user. We therefore quantify qunit utility with a well-defined objective surrogate. This is similar to measuring document relevance in information retrieval, where TF/IDF is used to approximate document relevance.

The tf-idf weight (term frequency-inverse document frequency) is a weight often used in information retrieval and text mining. This weight is a statistical measure used to evaluate how important a word is to a document in a collection or corpus. The importance increases proportionally to the number of times a word appears in the document but is offset by the frequency of the word in the corpus. Variations of the tf-idf weighting scheme are often used by search engines as a central tool in scoring and ranking a document's relevance given a user query.
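A small worked example may help with the tf-idf description above. The sketch uses one common variant (term frequency normalised by document length, natural-log inverse document frequency); other variants exist, and the three-document corpus is hypothetical.

```python
import math
from collections import Counter
from typing import List


def tf_idf(term: str, doc: List[str], corpus: List[List[str]]) -> float:
    """tf-idf of `term` in `doc`, relative to `corpus` (one common variant)."""
    tf = Counter(doc)[term] / len(doc)           # share of the document taken by the term
    df = sum(1 for d in corpus if term in d)     # number of documents containing the term
    if df == 0:
        return 0.0
    idf = math.log(len(corpus) / df)             # rarer terms get a larger weight
    return tf * idf


# Hypothetical corpus of tokenised documents.
corpus = [
    ["star", "wars", "cast"],
    ["star", "trek", "cast"],
    ["star", "wars", "plot"],
]

print(tf_idf("star", corpus[0], corpus))  # appears in every document, so its weight is 0.0
print(tf_idf("wars", corpus[0], corpus))
print(tf_idf("trek", corpus[1], corpus))
```

A word that occurs in every document ("star") carries no weight, while a word confined to a single document ("trek") scores highest, matching the offset by corpus frequency described above.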

2.3 QUNIT BASED SEARCH

Figure 3

Consider the user query star wars cast, as shown in Fig 3. Queries are first processed to identify entities using standard query segmentation techniques. In this case one high-ranking segmentation is [movie.name][cast], and this has a very high overlap with the qunit definition that involves a join between movie.name and cast. Standard IR techniques can be used to evaluate this query against qunit instances of the identified type; each instance is considered independently even if they contain elements in common. The qunit instance describing the cast of the movie Star Wars is chosen as the appropriate result. The qunits-based approach is a far cleaner way to model database search.

The benefit of maintaining a clear separation between ranking and database content is that structured information can be considered as one source of information amongst many others. This makes our system easier to extend and enhance with additional IR methods for ranking, such as relevance feedback. Additionally, it allows us to concentrate on making the database more efficient using indices and query optimization, without having to worry about extraneous issues such as search and ranking. Note that this conceptual demarcation of rankings and results does not imply materialization of all qunits.

3 QUNIT DERIVATION

Qunits can be identified in two ways: by manual identification and by automated techniques.

They can be identified manually by the database creator at the time of database creation. The creator has the best knowledge of the data in the database, so qunit identification done by the creator is likely to be superior to anything that automated techniques can provide. Identifying qunits involves writing a set of view definitions for commonly expected query result types; the manual effort involved is only a small part of the total cost of database design.

Manual identification of qunits may not always be feasible. For example, legacy systems have already been created without qunits being identified. Automated techniques for deriving qunits from a database are therefore important. There are several possible sources of information that can be used to infer qunits, and knowledge of the database schema is the starting point. Independent sources of information that are used for deriving qunits are:
1) Data contained in the database
2) History of keyword queries posed to the system previously
3) Published results and reports based on information from the database in question

3.1 USING SCHEMA AND DATA

In this method, the concept of queriability of a database schema is used to infer the important schema entities and attributes of the database.

Queriability: It is defined as the likelihood of a schema element being used in a query, and is computed using the cardinality of the data that the schema element represents.

The base expression of a qunit is generated by looking at the top-k schema entities in descending order of queriability score. Each of the top-k1 schema entities is then expanded to include the top-k2 neighbouring entities as a join, where k1 and k2 are tunable parameters. This method does not always derive optimal qunits, for example in the case of under-specified queries.

A Basic Example

Suppose the query george clooney is issued. Given the schema in the figure above, creating a qunit for person would result in the inclusion of both the important movie genre table and the unimportant movie location table, since every movie has a genre and a location. However, information about the shooting location is not of much interest to most people. Thus this method of generating qunits is not optimal.
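The top-k1 / top-k2 expansion described in this section can be sketched as below. All names and numbers are assumptions for illustration: the queriability scores are made-up values standing in for cardinality-based estimates, and the neighbour lists describe a hypothetical schema, not the one in the report's figure.

```python
from typing import Dict, List


def derive_base_expressions(
    queriability: Dict[str, float],
    neighbours: Dict[str, List[str]],
    k1: int,
    k2: int,
) -> Dict[str, List[str]]:
    """For each of the top-k1 entities, pick the top-k2 neighbours to join with."""
    ranked = sorted(queriability, key=lambda e: queriability[e], reverse=True)
    plans: Dict[str, List[str]] = {}
    for entity in ranked[:k1]:
        adjacent = sorted(
            neighbours.get(entity, []),
            key=lambda e: queriability[e],
            reverse=True,
        )
        plans[entity] = adjacent[:k2]
    return plans


# Hypothetical scores: every movie has a genre and a location, so both score alike.
queriability = {"person": 0.9, "movie": 0.8, "genre": 0.6, "location": 0.6}
neighbours = {
    "person": ["movie"],
    "movie": ["person", "genre", "location"],
}

plans = derive_base_expressions(queriability, neighbours, k1=2, k2=3)
print(plans)
```

Because location scores as high as genre on data cardinality alone, it is pulled into the join plan even though few users care about it, which illustrates the shortcoming noted in the example above.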

3.2 USING QUERY LOG


This method uses a query rollup strategy for query logs, inspired by the observation that keyword queries are inherently under-specified, and hence the qunit definition for an under-specified query is an aggregation of the qunit definitions of its specializations. For example, if the expected qunit for the query george clooney is a personality profile about the actor George Clooney, it can be constructed by considering the popular specialized variations of this query, such as george clooney actor, george clooney movies, and so on.

We perform the following steps:
1) Sample the database for entities, and look them up in the search query log.
2) Collect the query log entries that are found, along with the number of times they occur in the query log (query frequency).
3) For each query, map each recognized entity onto the schema, constructing simple join plans.
4) Consider the popular plan fragments for the qunit definition.

An example using query log

We consider a schema element person.name. Instances of this element are:
i. George Clooney
ii. Tom Hanks

These instances are looked up in the query log, where we find search queries like:
i. George Clooney actor
ii. George Clooney batman
iii. Tom Hanks castaway

Using these 3 queries we build an annotated set of schema links where person.name links to cast.role once and to movie.name twice. This suggests that the rollup of the qunit representing person.name should contain movie.name and cast.role, in that order.
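The counting step of this example can be sketched as follows. The schema annotations attached to each query are assumed to be the output of an earlier entity-recognition step and are written by hand here; the function name rollup and the data layout are hypothetical.

```python
from collections import Counter
from typing import List, Tuple

# (query, schema elements recognised in it, query frequency); annotations are assumed.
annotated_log = [
    ("George Clooney actor", ["person.name", "cast.role"], 1),
    ("George Clooney batman", ["person.name", "movie.name"], 1),
    ("Tom Hanks castaway", ["person.name", "movie.name"], 1),
]


def rollup(log: List[Tuple[str, List[str], int]], anchor: str) -> List[Tuple[str, int]]:
    """Count how often each schema element co-occurs with `anchor`, most frequent first."""
    links: Counter = Counter()
    for _query, elements, frequency in log:
        if anchor in elements:
            for element in elements:
                if element != anchor:
                    links[element] += frequency
    return links.most_common()


print(rollup(annotated_log, "person.name"))
```

For the three queries above this yields movie.name with a count of 2 followed by cast.role with a count of 1, the same ordering that the text derives for the person.name qunit.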

3.3 USING EXTERNAL EVIDENCE

In this third method, external evidence is used to create qunits for the database. The goal is to learn qunit definitions by considering the useful information found in external evidence.

External Evidence: External evidence consists of useful pieces of information that exist in the following forms:
1) Reports
2) Published results of queries to the database
3) Relevant web pages that present parts of the data

Example: Movie information from sources such as Wikipedia and IMDB is organized, and the information from these two sources overlaps greatly. The aim is to learn the organization of this overlapping data from Wikipedia.

The Document Object Model (DOM) is an application programming interface (API) for valid HTML and well-formed XML documents. It defines the logical structure of documents and the way a document is accessed and manipulated. In the DOM specification, the term "document" is used in the broad sense: increasingly, XML is being used as a way of representing many different kinds of information that may be stored in diverse systems, and much of this would traditionally be seen as data rather than as documents. Nevertheless, XML presents this data as documents, and the DOM may be used to manage this data.

In the DOM, documents have a logical structure which is very much like a tree; to be more precise, it is like a "forest" or "grove", which can contain more than one tree. Each document contains zero or one doctype nodes, one root element node, and zero or more comments or processing instructions; the root element serves as the root of the element tree for the document. However, the DOM does not specify that documents must be implemented as a tree or a grove, nor does it specify how the relationships among objects are to be implemented. The DOM is a logical model that may be implemented in any convenient manner. The DOM specification uses the term structure model to describe the tree-like representation of a document, and the term "tree" when referring to the arrangement of those information items which can be reached by using "tree-walking" methods (this does not include attributes). One important property of DOM structure models is structural isomorphism: if any two Document Object Model implementations are used to create a representation of the same document, they will create the same structure model, in accordance with the XML Information Set.

A type signature defines the inputs and outputs for a function, subroutine or method. A type signature includes at least the function name and the number of its arguments. In some programming languages, it may also specify the function's return type, the types of its arguments, or errors it may pass back.

The foreach statement repeats a group of embedded statements for each element in an array or an object collection. The foreach statement is used to iterate through the collection to get the desired information.

Cardinality specifies how many instances of an entity relate to one instance of another entity.

Using the records in the database, entities in each document are identified. The signature for each web page is then computed, utilizing the DOM tree and the frequency of each occurrence. An example of a type signature for the cast page for a movie on Wikipedia would be ((person.name:1) (movie.name:40)), which would suggest using person.name as a label field, followed by a foreach consisting of movie.name, based on the relative cardinality in the signature and the number of the tuples generated in our qunit base expression. By aggregating the type signatures over a collection of pages, we can infer the appropriate qunit definition.
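The aggregation of type signatures can be sketched as below. The report does not spell out the aggregation rule, so this sketch assumes the simplest one: sum the per-type counts over all pages, use the least frequent type as the label field, and place the remaining types in the foreach. DOM parsing and entity recognition are left out; the second signature is hypothetical, and the first is the one quoted in the text.

```python
from collections import Counter
from typing import Dict, List


def infer_layout(signatures: List[Dict[str, int]]) -> Dict[str, object]:
    """Aggregate page signatures and guess a label field and foreach fields.

    Assumed rule: the type with the lowest total count becomes the label,
    every other type is iterated over in a foreach.
    """
    total: Counter = Counter()
    for signature in signatures:
        total.update(signature)
    ordered = sorted(total, key=lambda t: total[t])  # lowest cardinality first
    return {"label": ordered[0], "foreach": ordered[1:]}


signatures = [
    {"person.name": 1, "movie.name": 40},  # signature quoted in the text
    {"person.name": 1, "movie.name": 25},  # hypothetical second page
]

layout = infer_layout(signatures)
print(layout)
```

With these inputs person.name becomes the label and movie.name the foreach field, which agrees with the reading of the signature given above.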

4 EXPERIMENTS AND EVALUATION

An experiment is performed to explore the nature of keyword searches that users pose against a structured database. The experiment uses a real-world query log to evaluate the efficacy of qunit-based methods.

4.1 UNDERSTANDING SEARCH


The Internet Movie Database, or IMDb, is a well-known repository of movie-related information on the Internet. We performed a user study with five users (A, B, C, D, E), all familiar with IMDb and all with a moderate interest in movies. The subjects had a large variance in knowledge about databases. Two were graduate students specializing in databases, while the other three were non-computer science majors. The subjects were asked to consider a hypothetical movie database that could answer all movie-related queries. Given this, the users were asked to come up with five information needs, and the corresponding keyword queries that they would use to query the database.

The summary page of a movie was the most sought-after page; this information need is expressed in five different ways by the users (row 1). The cast of a movie and finding connections between two actors are also common interests, and these are expressed in many different ways (row 2, row 6). A query that only specifies the title of the movie may be issued on account of four different information needs (column 1). Users may specify an actor's name when they mean to look for either of two different pieces of information: the actor's filmography, or information about co-actors (row 3, row 4). There exists a many-to-many relationship between information needs and queries.

Table 1: Information Needs vs Keyword Queries. Five users (A, B, C, D, E) were each asked for their movie-related information needs, and what queries they would use to search for them.

Information needs (rows): Movie summary, Cast, Filmography, Coactorship, Posters, Related movies, Awards, Movies of period, Charts / lists, Recommendation, Soundtracks, Trivia, Box office

Query types (columns): [Title], [Actor], [Title] box office, [Genre], [Movie], [Title] plot, [Title] OST, [Title] cast, [Title] year, freetext, Don't know, [Title] poster, [Award] [year], [Title] freetext, [Actor] [Actor]

Another key observation is that 10 of the 25 queries here are single entity queries, 8 of which are underspecified: the query could be written better by adding additional predicates.

The results of the interviews are displayed in Table 1. Each row in this table is an information need suggested by one or more users. Each column is the query structure the user thought to use to obtain an answer for this information need. The users themselves stated specific examples. For example, if the user said they would query for star wars cast, it was abstracted to the query type [title] cast. The unmatched portion of the query (cast) is still relevant to the schema structure and is hence considered an attribute. Conversely, users often issue queries with words that are non-structural details about the result, such as movie space transponders. We consider these words free-form text in our query analysis. Some users came up with multiple queries to satisfy the same information need, and hence are entered more than once in the corresponding rows.

4.2 MOVIE QUERYLOG BENCHMARK

To construct a typical workload, we use a real-world dataset of web search engine query logs spanning 650,000 users and 20,000,000 queries. All query strings are first aggregated to combine all identities into a single anonymous crowd, and the queries that resulted in a navigation to the www.imdb.com domain are considered, resulting in 98,549 queries, or 46,901 unique queries. We consider this to be our base query log for the IMDB dataset. 93% of the unique queries (calculated by sampling) were identified as movie-related terms. We then construct a benchmark query log by first classifying the base query log into various types.

TOKEN: A token is an instance of a type. In knowledge representation, the type-token distinction separates an abstract concept from the objects which are particular instances of the concept. For example, the particular bicycle in your garage is a token of the type of thing known as "the bicycle".

Tokens in the query log are first replaced with schema types by looking for the largest possible string overlaps with entities in the database. This leaves us with typed templates, such as [name] movies for george clooney movies. We then randomly pick two queries that match each of the top 14 templates (by frequency), giving us 28 queries that we use as a workload for qualitative assessment.

We observed that our dataset reflects properties consistent with previous reports on query logs. At least 36% of the distinct queries to the search engine were single entity queries that were just the name of an actor or the title of a movie, while 20% were entity attribute queries, such as terminator cast. Approximately 2% of the queries contained more than one entity, such as angelina jolie tombraider, while less than 2% of the queries contained a complex query structure involving aggregate functions, such as highest box office revenue.
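The replacement of tokens with schema types by largest string overlap can be sketched with a greedy longest-match pass. The entity dictionary, the type labels and the four-query log below are assumptions for illustration; a real system would look phrases up in the database instead of a hard-coded dictionary.

```python
from collections import Counter
from typing import Dict, List

# Hypothetical entity dictionary mapping database strings to schema types.
ENTITY_TYPES: Dict[str, str] = {
    "george clooney": "[name]",
    "angelina jolie": "[name]",
    "terminator": "[title]",
}


def to_template(query: str) -> str:
    """Replace the longest matching phrases with their schema type."""
    words = query.lower().split()
    out: List[str] = []
    i = 0
    while i < len(words):
        # Try the longest phrase starting at position i first.
        for j in range(len(words), i, -1):
            phrase = " ".join(words[i:j])
            if phrase in ENTITY_TYPES:
                out.append(ENTITY_TYPES[phrase])
                i = j
                break
        else:
            out.append(words[i])  # unmatched word stays as an attribute or free text
            i += 1
    return " ".join(out)


log = ["george clooney movies", "angelina jolie movies", "terminator cast", "george clooney"]
templates = Counter(to_template(q) for q in log)
print(templates.most_common(1))
```

Counting the resulting templates gives the frequency ranking from which the top templates, and then the benchmark queries, are picked.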

4.3 EVALUATING RESULT QUALITY


The result quality of a search system is measured by its ability to satisfy a user's information need. A result relevance study was conducted against the Internet Movie Database, using the real-world search query log described in Section 4.2. 20 users were asked to compare the results returned by each search algorithm for 25 different search queries, rating each result between 1 (result is correct and relevant) and 0 (result is wrong or irrelevant).

For the experiments, a survey was conducted using 25 of the 28 queries from the movie querylog benchmark. The workload generated using the query log is first issued on each of the competing algorithms and their results are collected. Users were then provided with a sample information need and query combination: "need to find out more about julio iglesias" being the need, and "julio iglesias" being the search query term. Users were then presented with a set of possible answers from a search engine, and were asked to rate the answers presented with one of the options listed in Table 2.

score    rating
0        provides incorrect information
0        provides no information above the query
0.5      provides correct, but incomplete information
0.5      provides correct, but excessive information
1.0      provides correct information

Table 2: Survey Options

Users were then asked to repeat this task for the 25 search queries mentioned above. The table also shows the score we internally assigned for each option. If the answer is incorrect or uninformative, it obviously should be scored 0. Where an answer is partially correct (incomplete or excessive), we should give it a score between 0 and 1 depending on how correct it is. An average value for this is 0.5. We now compare the performance of currently available approaches against qunits.
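The scoring scheme in Table 2 and the averaging of scores across a workload can be sketched as follows. The option keys, the two algorithm names and their ratings are hypothetical and are not survey data from the report.

```python
from typing import Dict, List

# Scores per survey option, following Table 2 (the short keys are assumed labels).
SCORES: Dict[str, float] = {
    "incorrect": 0.0,
    "no information": 0.0,
    "incomplete": 0.5,
    "excessive": 0.5,
    "correct": 1.0,
}


def average_relevance(ratings: List[str]) -> float:
    """Mean score of the ratings collected for one algorithm."""
    return sum(SCORES[rating] for rating in ratings) / len(ratings)


# Hypothetical ratings for two hypothetical algorithms over four queries.
ratings_by_algorithm = {
    "algorithm_a": ["correct", "incomplete", "incorrect", "excessive"],
    "algorithm_b": ["correct", "correct", "incomplete", "no information"],
}

averages = {name: average_relevance(r) for name, r in ratings_by_algorithm.items()}
print(averages)
```

An algorithm that always returned correct answers would average 1.0, which is the theoretical maximum against which the approaches are compared.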

Results are presented in Fig. 3 by considering the average relevance score for each algorithm across the query workload. As seen, we are still quite far from reaching the theoretical maximum for result quality. Yet, qunit-based querying clearly outperforms existing methods.

Figure 3

5 CONCLUSION

Keyword queries against structured data are a recognized hard problem. Researchers in both the database and information retrieval communities have attacked this problem and made considerable progress. But the incompatibility of the two paradigms still prevails: IR techniques are not designed to deal with structured, interlinked data, and database techniques are not designed to produce results for queries that are under-specified and vague. This paper has elegantly bridged this gap through the concept of a qunit. Additionally, this paradigm allows both techniques to co-exist under the same ranking model.

Qunits provide a clean approach to solving this problem. For the keyword query front end, the structured database is nothing more than a collection of independent qunits, so standard information retrieval techniques can be applied to choose the appropriate qunits, rank them, and return them to the user. For the structured database backend, each qunit is nothing more than a view definition, with specific instance tuples in the view being computed on demand; there is no issue of underspecified or vague queries.

In this paper, multiple techniques are presented to derive the qunits for a structured database and determine their utility for various query types. The concept of a qunit is powerful, and this is merely a first piece of research in this area.

6 REFERENCES
1. Arnab Nandi, H. V. Jagadish (2009); Qunits: queried units for database search, 4th Biennial Conference on Innovative Data Systems Research (CIDR).
2. Fundamentals Of Database Systems; by Ramez Elmasri, Department of Computer Science Engineering, University of Texas at Arlington, and Shamkant B. Navathe, College of Computing, Georgia Institute of Technology.
3. Automated Ranking of Database Query Results; by Sanjay Agrawal, Microsoft Research, Surajit Chaudhuri, Microsoft Research, Gautam Das, Microsoft Research, and Aristides Gionis, Computer Science Dept, Stanford University.
4. Internet
