
ASSIGNMENT-2 Image Database & Text Database

Submitted by M.Sreekrishna 11MCS17

Image Database
An image database is a searchable electronic catalog that lets you organize and list images by topics, modules, or categories. The Image Database provides students with important information such as the image title, description, and thumbnail picture. Additional information, such as the creator of the image, the filename, and keywords, can be provided to help students search the database for specific images. Before you and your students can use the Image Database, you must add it to your course.

An image retrieval system is a computer system for browsing, searching and retrieving images from a large database of digital images. Most traditional and common methods of image retrieval add metadata such as captions, keywords, or descriptions to the images so that retrieval can be performed over the annotation words. Manual image annotation is time-consuming, laborious and expensive; to address this, a large amount of research has been done on automatic image annotation. Additionally, the growth of social web applications and the semantic web has inspired the development of several web-based image annotation tools.

Image search is a specialized data search used to find images. To search for images, a user may provide query terms such as a keyword or an image file/link, or click on an image, and the system will return images "similar" to the query. The similarity criteria used for search could be meta tags, color distribution in images, region/shape attributes, etc.

Image meta search - search of images based on associated metadata such as keywords and text.

Content-based image retrieval (CBIR) - the application of computer vision to image retrieval. CBIR aims to avoid textual descriptions and instead retrieves images based on similarities in their contents (textures, colors, shapes, etc.) to a user-supplied query image or user-specified image features.

List of CBIR engines - engines that search for images based on visual content such as color, texture, and shape/object.
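As an illustration of content-based retrieval by color distribution, the sketch below ranks images by how much their coarse color histograms overlap with a query image. It is a minimal sketch, not a description of any particular CBIR engine: images are assumed to be plain lists of (r, g, b) tuples, and the names `color_histogram`, `histogram_intersection`, and `rank_by_similarity`, as well as the sample data, are hypothetical.

```python
from collections import Counter

def color_histogram(pixels, bins_per_channel=4):
    """Quantize each (r, g, b) pixel into coarse bins; return normalized bin frequencies."""
    step = 256 // bins_per_channel
    counts = Counter((r // step, g // step, b // step) for r, g, b in pixels)
    total = len(pixels)
    return {bin_: n / total for bin_, n in counts.items()}

def histogram_intersection(h1, h2):
    """Similarity in [0, 1]: the summed overlap of two normalized histograms."""
    return sum(min(value, h2.get(bin_, 0.0)) for bin_, value in h1.items())

def rank_by_similarity(query_pixels, database):
    """Return image names ordered from most to least similar to the query."""
    query_hist = color_histogram(query_pixels)
    scored = [
        (histogram_intersection(query_hist, color_histogram(pixels)), name)
        for name, pixels in database.items()
    ]
    return [name for _, name in sorted(scored, key=lambda item: item[0], reverse=True)]

# Hypothetical toy "images": each is just a short list of pixels.
database = {
    "sunset": [(250, 120, 30)] * 6 + [(200, 60, 20)] * 2,
    "forest": [(20, 140, 40)] * 7 + [(60, 90, 30)],
    "ocean": [(10, 60, 200)] * 8,
}
query = [(240, 110, 25)] * 8
ranking = rank_by_similarity(query, database)
print(ranking[0])  # the image whose color distribution best matches the query
```

Real systems typically combine several features (texture, shape, regions) rather than color alone; the histogram here only stands in for "color distribution in images".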

Data Scope
It is crucial to understand the scope and nature of image data in order to determine the complexity of image search system design. The design is also largely influenced by factors such as the diversity of user-base and expected user traffic for a search system. Along this dimension, search data can be classified into the following categories:

Archives - usually contain large volumes of structured or semi-structured homogeneous data pertaining to specific topics.

Domain-Specific Collection - a homogeneous collection providing access to controlled users with very specific objectives. Examples of such a collection are biomedical and satellite image databases.

Enterprise Collection - a heterogeneous collection of images that is accessible to users within an organization's intranet. Pictures may be stored in many different locations.

Personal Collection - usually consists of a largely homogeneous collection and is generally small in size, accessible primarily to its owner, and usually stored on local storage media.

Web - World Wide Web images are accessible to everyone with an Internet connection. These image collections are semi-structured, non-homogeneous and massive in volume, and are usually stored in large disk arrays.

There are evaluation workshops for image retrieval systems aiming to investigate and improve the performance of such systems.

ImageCLEF - a continuing track of the Cross Language Evaluation Forum that evaluates systems using both textual and pure-image retrieval methods.

Content-based Access of Image and Video Libraries - a series of IEEE workshops from 1998 to 2001.

Create an Image Database


An Image Database can contain as many images as you would like. You can put all images in one database or create multiple databases.

Upload the image files that you want to include in the database. See How to set up WebDAV to drag and drop files from your desktop to your course, or see Manage Files to upload files.

From the Homepage or the Course Menu, select the Image Database link. The Image Database page displays. Select the Add image database button from Options.

The Add Image Database page displays. Type the desired database title in the Title: field and click the Add button.

The new image database displays in the Available databases. Select the link to the new image database you just created.

The Image Database Screen displays. Select the Add Image button.

The Add Image screen displays. Type in relevant keywords in the *Keywords field. Type the owner of the image in the Creator: field. Type the path and filename in the *Filename: field or click the browse button and find the file in the My-Files area. Type a relevant title for this image in the Title: field. Type in the image description in the Description: field. Type the path and filename of the image thumbnail in the Thumbnail: field or click the browse button and find the file in the My-Files area. Select the Add button.

The Image Database page displays with the new image and information.

To add additional images to the database repeat the above steps.

Edit an Image Record


You may find that you have information about an image that needs to be edited. If you have text in one column that needs to be changed, see Columns/Edit. If you have additional image information that needs to be changed, follow the steps below.

From the Homepage or the Course Menu, select the Image Database link.

The Available Database page displays. Select the link to the image database that contains the image you want to edit. The Image Database page displays. Select the radio button beside the image you would like to edit and select the Edit button.

The Edit Record page displays. To change the *Filename: field select the New Image button. The New Image Screen displays. Type the path and filename in the field or click the browse button and find the file in the My-Files area. Select the Regenerate thumbnail checkbox if you would like the image database to create a new thumbnail for you. Select the Update button.

The Edit Record page displays again with the new image filename in the *Filename: field. If you did not have the image database regenerate the thumbnail for you on the previous screen, select the New thumbnail button. The New Thumbnail page displays. Type the path and filename of the image thumbnail in the Thumbnail: field or click the browse button and find the file in the My-Files area. Select the Update button. The Edit Record page displays again with the new image filename in the Thumbnail: field.

Type the corrected information in the *Keywords field, Creator: field, Title: field, and/or Description: field. Select the Update button.

The Image Database page displays with the new image and/or information.

Delete an Image Record


You may find that you no longer want an image to be included in your image database. You can delete images from an image database, but they must be deleted one at a time.

From the Homepage or the Course Menu, select the Image Database link. The Available Database page displays. Select the link to the image database that contains the image you want to delete. The Image Database page displays. Select the radio button beside the image you would like to delete and select the Delete button.

The Delete Image confirmation window displays. Select the OK button.

The Image Database page displays without the deleted image. To delete additional images from the database repeat the above steps.

Text Database
A text database is an online platform for original text and translations. The original text simply needs to be uploaded into the text database. All texts that are to be placed on a pack are separated into individual blocks and updated simultaneously. This avoids keeping various versions of text at different locations. Corrections to source text and translations happen online, instantly replacing the original copy. This process eliminates errors and simplifies labour-intensive tasks, boosting efficiency.

A full text database, or complete text database, is a database that contains the complete text of books, dissertations, journals, magazines, newspapers or other kinds of textual documents.

Search engine indexing collects, parses, and stores data to facilitate fast and accurate information retrieval. Index design incorporates interdisciplinary concepts from linguistics, cognitive psychology, mathematics, informatics, physics, and computer science. An alternate name for the process, in the context of search engines designed to find web pages on the Internet, is web indexing.

Indexing
The purpose of storing an index is to optimize speed and performance in finding relevant documents for a search query. Without an index, the search engine would scan every document in the corpus, which would require considerable time and computing power.

Index design factors


Major factors in designing a search engine's architecture include:

Merge factors - How data enters the index, or how words or subject features are added to the index during text corpus traversal, and whether multiple indexers can work asynchronously. The indexer must first check whether it is updating old content or adding new content. Traversal typically correlates to the data collection policy. Search engine index merging is similar in concept to the SQL Merge command and other merge algorithms.

Storage techniques - How to store the index data, that is, whether information should be compressed or filtered.

Index size - How much computer storage is required to support the index.

Lookup speed - How quickly a word can be found in the inverted index. The speed of finding an entry in a data structure, compared with how quickly it can be updated or removed, is a central focus of computer science.

Maintenance - How the index is maintained over time.

Fault tolerance - How important it is for the service to be reliable. Issues include dealing with index corruption, determining whether bad data can be treated in isolation, dealing with bad hardware, partitioning, and schemes such as hash-based or composite partitioning, as well as replication.

Index data structures


Search engine architectures vary in the way indexing is performed and in methods of index storage to meet the various design factors. Types of indices include:

Suffix tree - Figuratively structured like a tree, supports linear time lookup. Built by storing the suffixes of words. The suffix tree is a type of trie. Tries support extendable hashing, which is important for search engine indexing. Used for searching for patterns in DNA sequences and clustering. A major drawback is that storing a word in the tree may require space beyond that required to store the word itself. An alternate representation is a suffix array, which is considered to require less virtual memory and supports data compression such as the BWT algorithm.

Inverted index - Stores a list of occurrences of each atomic search criterion, typically in the form of a hash table or binary tree.

Citation index - Stores citations or hyperlinks between documents to support citation analysis, a subject of bibliometrics.

N-gram index - Stores sequences of length n of data to support other types of retrieval or text mining.

Document-term matrix - Used in latent semantic analysis, stores the occurrences of words in documents in a two-dimensional sparse matrix.
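To make the suffix array mentioned above more concrete, here is a minimal sketch that builds one for a short string and uses binary search to find every occurrence of a pattern. The construction shown (sorting all suffixes directly) is a simple illustrative choice that is only practical for small inputs, and the function names are hypothetical.

```python
def build_suffix_array(text):
    """Return the start positions of all suffixes of text, in sorted suffix order."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def find_occurrences(text, suffix_array, pattern):
    """Binary-search the suffix array for every suffix that starts with pattern."""
    m = len(pattern)

    def prefix(rank):
        start = suffix_array[rank]
        return text[start:start + m]

    lo, hi = 0, len(suffix_array)
    while lo < hi:                      # first suffix whose prefix is >= pattern
        mid = (lo + hi) // 2
        if prefix(mid) < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo

    hi = len(suffix_array)
    while lo < hi:                      # first suffix whose prefix is > pattern
        mid = (lo + hi) // 2
        if prefix(mid) <= pattern:
            lo = mid + 1
        else:
            hi = mid

    return sorted(suffix_array[first:lo])

text = "banana"
suffix_array = build_suffix_array(text)
matches = find_occurrences(text, suffix_array, "ana")
print(matches)  # start positions of "ana" in "banana"
```

Because the suffixes are sorted, all suffixes sharing a given prefix sit next to each other, which is what lets two binary searches delimit the full set of matches.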

Index merging
The inverted index is filled via a merge or rebuild. A rebuild is similar to a merge but first deletes the contents of the inverted index. The architecture may be designed to support incremental indexing,[17][18] where a merge identifies the document or documents to be added or updated and then parses each document into words. For technical accuracy, a merge conflates newly indexed documents, typically residing in virtual memory, with the index cache residing on one or more computer hard drives. After parsing, the indexer adds the referenced document to the document list for the appropriate words. In a larger search engine, the process of finding each word in the inverted index (in order to report that it occurred within a document) may be too time consuming, and so this process is commonly split up into two parts, the development of a forward index and a process which sorts the contents of the forward index into the inverted index. The inverted index is so named because it is an inversion of the forward index.
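The two-step process described above (build a forward index, then invert it) and a subsequent merge can be sketched as follows. This is a simplified in-memory illustration with hypothetical function names and whitespace splitting in place of real parsing; it does not model the disk-resident index cache.

```python
from collections import defaultdict

def build_forward_index(documents):
    """Map each document id to the list of words it contains."""
    return {doc_id: text.lower().split() for doc_id, text in documents.items()}

def invert(forward_index):
    """Turn doc -> words into word -> sorted list of document ids."""
    inverted = defaultdict(set)
    for doc_id, words in forward_index.items():
        for word in words:
            inverted[word].add(doc_id)
    return {word: sorted(doc_ids) for word, doc_ids in inverted.items()}

def merge(index, new_postings):
    """Merge postings for newly indexed documents into an existing inverted index."""
    merged = {word: list(doc_ids) for word, doc_ids in index.items()}
    for word, doc_ids in new_postings.items():
        merged[word] = sorted(set(merged.get(word, [])) | set(doc_ids))
    return merged

# Initial build from two documents.
index = invert(build_forward_index({1: "the cow says moo", 2: "the cat and the hat"}))

# Incremental update: index one new document and merge it in.
new_postings = invert(build_forward_index({3: "the dish ran away"}))
index = merge(index, new_postings)

print(index["the"])  # document ids that contain "the"
```

A rebuild, in these terms, would simply discard `index` and call `invert` on the forward index of the whole corpus.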

Compression
Generating or maintaining a large-scale search engine index represents a significant storage and processing challenge. Many search engines utilize a form of compression to reduce the size of the indices on disk.
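The text does not say which compression scheme is used. One commonly taught option for posting lists, shown here purely as an illustration, is to store the gaps between sorted document ids and encode each gap with a variable number of bytes. The byte layout below (7 data bits per byte, high bit marking the last byte of a number) is an assumption of this sketch.

```python
def encode_postings(doc_ids):
    """Delta-encode a sorted list of document ids, then variable-byte encode each gap."""
    out = bytearray()
    previous = 0
    for doc_id in doc_ids:
        gap = doc_id - previous
        previous = doc_id
        while gap >= 128:
            out.append(gap & 0x7F)   # low 7 bits; more bytes follow
            gap >>= 7
        out.append(gap | 0x80)       # high bit set marks the final byte of this gap
    return bytes(out)

def decode_postings(data):
    """Reverse encode_postings: rebuild gaps, then running sums give the document ids."""
    doc_ids = []
    current, shift, previous = 0, 0, 0
    for byte in data:
        if byte & 0x80:              # final byte of a gap
            current |= (byte & 0x7F) << shift
            previous += current
            doc_ids.append(previous)
            current, shift = 0, 0
        else:
            current |= byte << shift
            shift += 7
    return doc_ids

postings = [3, 130, 131, 100000]
encoded = encode_postings(postings)
print(len(encoded), "bytes instead of", 4 * len(postings))
```

Small gaps, which dominate for frequent words, fit in a single byte, which is where the saving over fixed-width integers comes from.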

Document parsing
Document parsing breaks apart the components (words) of a document or other form of media for insertion into the forward and inverted indices. The words found are called tokens, and so, in the context of search engine indexing and natural language processing, parsing is more commonly referred to as tokenization. It is also sometimes called word boundary disambiguation, tagging, text segmentation, content analysis, text analysis, text mining, concordance generation, speech segmentation, lexing, or lexical analysis. The terms 'indexing', 'parsing', and 'tokenization' are used interchangeably in corporate slang.

Challenges in natural language processing


Word Boundary Ambiguity - Native English speakers may at first consider tokenization to be a straightforward task, but this is not the case when designing a multilingual indexer. In digital form, the texts of other languages such as Chinese, Japanese or Arabic represent a greater challenge, as words are not clearly delineated by whitespace. The goal during tokenization is to identify words for which users will search. Language-specific logic is employed to properly identify the boundaries of words, which is often the rationale for designing a parser for each language supported (or for groups of languages with similar boundary markers and syntax).

Language Ambiguity - To assist with properly ranking matching documents, many search engines collect additional information about each word, such as its language or lexical category (part of speech). These techniques are language-dependent, as the syntax varies among languages. Documents do not always clearly identify the language of the document or represent it accurately. In tokenizing the document, some search engines attempt to automatically identify the language of the document.

Diverse File Formats - In order to correctly identify which bytes of a document represent characters, the file format must be correctly handled. Search engines which support multiple file formats must be able to correctly open and access the document and be able to tokenize the characters of the document.

Faulty Storage - The quality of the natural language data may not always be perfect. An unspecified number of documents, particularly on the Internet, do not closely obey proper file protocol. Binary characters may be mistakenly encoded into various parts of a document. Without recognition of these characters and appropriate handling, the index quality or indexer performance could degrade.

Tokenization
Unlike literate humans, computers do not understand the structure of a natural language document and cannot automatically recognize words and sentences. To a computer, a document is only a sequence of bytes. Computers do not 'know' that a space character separates words in a document. Instead, humans must program the computer to identify what constitutes an individual or distinct word, referred to as a token. Such a program is commonly called a tokenizer or parser or lexer. Many search engines, as well as other natural language processing software, incorporate specialized programs for parsing, such as YACC or Lex.
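As a small illustration of what "programming the computer to identify a word" can look like, the sketch below uses a regular expression to pull tokens out of English-like text. The pattern is an assumption chosen for this example (letters and digits, with an optional internal apostrophe); it would not work for languages that do not separate words with whitespace or punctuation.

```python
import re

# A token is a run of letters/digits, optionally followed by an apostrophe and more letters.
TOKEN_PATTERN = re.compile(r"[A-Za-z0-9]+(?:'[A-Za-z]+)?")

def tokenize(text):
    """Return the lowercase tokens found in text, in order of appearance."""
    return [match.group(0).lower() for match in TOKEN_PATTERN.finditer(text)]

tokens = tokenize("Computers don't 'know' that a space separates words.")
print(tokens)
```

Note that the quotation marks around 'know' are dropped while the apostrophe inside "don't" is kept, which is exactly the kind of decision a tokenizer has to encode explicitly.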

Language recognition
If the search engine supports multiple languages, a common initial step during tokenization is to identify each document's language; many of the subsequent steps are language dependent (such as stemming and part of speech tagging). Language recognition is the process by which a computer program attempts to automatically identify, or categorize, the language of a document. Other names for language recognition include language classification, language analysis, language identification, and language tagging. Automated language recognition is the subject of ongoing research in natural language processing. Finding which language the words belong to may involve the use of a language recognition chart.
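A very rough way to illustrate language recognition is to count how many of a document's words appear in a short list of common words for each candidate language. The word lists and the function name below are hypothetical and far too small for real use; production systems rely on much richer statistical models.

```python
# Tiny illustrative word lists; real systems use far larger models.
COMMON_WORDS = {
    "english": {"the", "and", "is", "of", "to", "in"},
    "german": {"der", "die", "und", "ist", "das", "nicht"},
    "french": {"le", "la", "et", "est", "les", "des"},
}

def guess_language(text):
    """Return the language whose common words occur most often, or None if none match."""
    words = text.lower().split()
    scores = {
        language: sum(word in common for word in words)
        for language, common in COMMON_WORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

print(guess_language("the cat is in the garden"))
print(guess_language("der Hund ist nicht im Garten und die Katze"))
```

Once a language is chosen, the indexer can select the matching tokenizer, stop-word list, and stemmer for the rest of the pipeline.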

Format analysis
If the search engine supports multiple document formats, documents must be prepared for tokenization. The challenge is that many document formats contain formatting information in addition to textual content. Certain file formats are proprietary with very little information disclosed, while others are well documented. Common, well-documented file formats that many search engines support include:

HTML
ASCII text files (a text document without specific computer readable formatting)
Adobe's Portable Document Format (PDF)
PostScript (PS)
LaTeX
UseNet netnews server formats
XML and derivatives like RSS
SGML
Multimedia meta data formats like ID3
Microsoft Word
Microsoft Excel
Microsoft PowerPoint
IBM Lotus Notes

Options for dealing with various formats include using a publicly available commercial parsing tool that is offered by the organization which developed, maintains, or owns the format, and writing a custom parser.

Some search engines support inspection of files that are stored in a compressed or encrypted file format. When working with a compressed format, the indexer first decompresses the document; this step may result in one or more files, each of which must be indexed separately. Commonly supported compressed file formats include:

ZIP - Zip archive file
RAR - Roshal ARchive file
CAB - Microsoft Windows Cabinet file
Gzip - File compressed with gzip
BZIP - File compressed using bzip2
Tape ARchive (TAR) - Unix archive file, not (itself) compressed
TAR.Z, TAR.GZ or TAR.BZ2 - Unix archive files compressed with Compress, GZIP or BZIP2

Format analysis can involve quality improvement methods to avoid including 'bad information' in the index. Content can manipulate the formatting information to include additional content. Examples of abusing document formatting for spamdexing:

Including hundreds or thousands of words in a section which is hidden from view on the computer screen, but visible to the indexer, by use of formatting (e.g. a hidden "div" tag in HTML, which may incorporate the use of CSS or JavaScript to do so).

Setting the foreground font color of words to the same as the background color, making words hidden on the computer screen to a person viewing the document, but not hidden to the indexer.
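The decompression step for archives can be sketched with Python's standard `zipfile` module: each member of the archive becomes a separate document to hand to the indexer. The helper name `extract_documents` is hypothetical, the archive is built in memory only to keep the example self-contained, and members are assumed to be UTF-8 text.

```python
import io
import zipfile

def extract_documents(archive_bytes):
    """Yield (member name, decoded text) for every file inside a ZIP archive."""
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue  # directories carry no text to index
            raw = archive.read(info)
            yield info.filename, raw.decode("utf-8", errors="replace")

# Build a small archive in memory so the example does not depend on any file on disk.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
    archive.writestr("a.txt", "first document")
    archive.writestr("b.txt", "second document")

documents = dict(extract_documents(buffer.getvalue()))
for name, text in documents.items():
    print(name, "->", text)
```

Each extracted text would then go through the same format analysis and tokenization as an uncompressed file.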

Section recognition
Some search engines incorporate section recognition, the identification of major parts of a document, prior to tokenization. Not all the documents in a corpus read like a well-written book, divided into organized chapters and pages. Many documents on the web, such as newsletters and corporate reports, contain erroneous content and side-sections which do not contain primary material. Two primary problems are noted:

Content in different sections is treated as related in the index, when in reality it is not.

Organizational 'side bar' content is included in the index, but the side bar content does not contribute to the meaning of the document, and the index is filled with a poor representation of its documents.

Meta tag indexing


Specific documents often contain embedded meta information such as author, keywords, description, and language. For HTML pages, the meta tag contains keywords which are also included in the index. Earlier Internet search engine technology would only index the keywords in the meta tags for the forward index; the full document would not be parsed. At that time full-text indexing was not as well established, nor was the hardware able to support such technology. The design of the HTML markup language initially included support for meta tags for the very purpose of being properly and easily indexed, without requiring tokenization.
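Meta tag extraction of the kind described above can be illustrated with Python's standard `html.parser` module. This is a minimal sketch: the class name, helper function, and sample page are hypothetical, and only `<meta name="..." content="...">` tags are collected.

```python
from html.parser import HTMLParser

class MetaTagExtractor(HTMLParser):
    """Collect the name/content pairs of <meta> tags while parsing an HTML page."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name")
        content = attributes.get("content")
        if name and content is not None:
            self.meta[name.lower()] = content

def extract_meta(html):
    parser = MetaTagExtractor()
    parser.feed(html)
    return parser.meta

# Hypothetical sample page.
page = """
<html><head>
  <meta name="author" content="Example Author">
  <meta name="keywords" content="image database, text database">
  <meta name="description" content="Notes on image and text databases">
</head><body><p>Body text that is not inspected here.</p></body></html>
"""
meta = extract_meta(page)
keywords = [keyword.strip() for keyword in meta["keywords"].split(",")]
print(keywords)
```

An indexer following the early approach described in the text would add only `keywords` to the forward index and ignore the page body.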

Full text search

In text retrieval, full text search refers to techniques for searching a single computer-stored document or a collection in a full text database.

Indexing
When dealing with a small number of documents it is possible for the full-text search engine to directly scan the contents of the documents with each query, a strategy called serial scanning. This is what some rudimentary tools, such as grep, do when searching. However, when the number of documents to search is potentially large or the quantity of search queries to perform is substantial, the problem of full text search is often divided into two tasks: indexing and searching. The indexing stage will scan the text of all the documents and build a list of search terms, often called an index, but more correctly named a concordance. In the search stage, when performing a specific query, only the index is referenced rather than the text of the original documents.

The indexer will make an entry in the index for each term or word found in a document and possibly its relative position within the document. Usually the indexer will ignore stop words, such as the English "the", which are both too common and carry too little meaning to be useful for searching. Some indexers also employ language-specific stemming on the words being indexed, so for example any of the words "drives", "drove", or "driven" will be recorded in the index under a single concept word "drive".
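The indexing and search stages described above can be sketched as follows, including word positions, stop-word removal, and stemming. The stop-word list and the stemming table are tiny illustrative stand-ins (a real indexer would use a proper stemming algorithm), and all names are hypothetical.

```python
from collections import defaultdict

STOP_WORDS = {"the", "a", "an", "of", "to"}   # illustrative, not a standard list
STEMS = {"drives": "drive", "drove": "drive", "driven": "drive"}  # toy lookup, not a real stemmer

def build_index(documents):
    """Indexing stage: map each stemmed term to a list of (document id, word position)."""
    index = defaultdict(list)
    for doc_id, text in documents.items():
        for position, word in enumerate(text.lower().split()):
            if word in STOP_WORDS:
                continue
            index[STEMS.get(word, word)].append((doc_id, position))
    return index

def search(index, term):
    """Search stage: consult only the index, never the original documents."""
    term = STEMS.get(term.lower(), term.lower())
    return sorted({doc_id for doc_id, _ in index.get(term, [])})

documents = {
    1: "she drove to the station",
    2: "he drives a taxi",
    3: "the taxi was driven home",
}
index = build_index(documents)
print(search(index, "drive"))   # documents containing any form of "drive"
print(search(index, "the"))     # stop words are not indexed
```

Positions are counted before stop words are removed, so they still reflect where each term sat in the original text.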

Document-oriented database
A document-oriented database is a computer program designed for storing, retrieving, and managing document-oriented information, also known as semi-structured data. Document-oriented databases are one of the main categories of so-called NoSQL databases, and the popularity of the term "document-oriented database" (or "document store") has grown with the use of the term NoSQL itself.

Documents
The central concept of a document-oriented database is the notion of a Document. While each document-oriented database implementation differs on the details of this definition, in general, they all assume documents encapsulate and encode data (or information) in some standard formats or encodings. Encodings in use include XML, YAML, JSON, and BSON, as well as binary forms like PDF and Microsoft Office documents (MS Word, Excel, and so on). Documents inside a document-oriented database are similar, in some ways, to records or rows, in relational databases, but they are less rigid. They are not required to adhere to a standard schema nor will they have all the same sections, slots, parts, keys, or the like.

Keys, Retrieval, and Organization


Keys
Documents are addressed in the database via a unique key that represents that document. Often, this key is a simple string. In some cases, this string is a URI or path. Regardless, you can use this key to retrieve the document from the database. Typically, the database retains an index on the key such that document retrieval is fast.

Retrieval
One of the other defining characteristics of a document-oriented database is that, beyond the simple key-document (or key-value) lookup that you can use to retrieve a document, the database will offer an API or query language that will allow you to retrieve documents based on their contents. For example, you may want a query that gets you all the documents with a certain field set to a certain value. The set of query APIs or query language features available, as well as the expected performance of the queries, varies significantly from one implementation to the next.
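Both access paths, lookup by key and retrieval by content, can be illustrated with a toy in-memory store. The class and method names are hypothetical and do not correspond to the API of any real document database; documents are assumed to be JSON-encodable dictionaries.

```python
import json

class DocumentStore:
    """Toy in-memory document store: JSON documents addressed by a unique string key."""

    def __init__(self):
        self._documents = {}

    def put(self, key, document):
        # Documents are stored encoded, as the text describes (here: JSON).
        self._documents[key] = json.dumps(document)

    def get(self, key):
        """Key lookup: fetch a single document by its unique key."""
        return json.loads(self._documents[key])

    def find(self, field, value):
        """Content query: keys of all documents whose field equals value."""
        return sorted(
            key
            for key, raw in self._documents.items()
            if json.loads(raw).get(field) == value
        )

store = DocumentStore()
store.put("/images/sunset", {"type": "image", "title": "Sunset", "creator": "A"})
store.put("/texts/report", {"type": "text", "title": "Report"})
store.put("/images/forest", {"type": "image", "title": "Forest"})

print(store.get("/texts/report")["title"])
print(store.find("type", "image"))
```

The documents above do not share a schema (only one has a `creator` field), which reflects the flexibility described earlier. The `find` method scans every document; a real implementation would typically keep secondary indexes so that content queries do not require a full scan.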

Organization
Implementations offer a variety of ways of organizing documents, including notions of collections, tags, non-visible metadata, and directory hierarchies.
