B.Tech 8 Semester Project 2012: Anshul Goyal (20084027) Shrish Chandra Mishra (20084050) Amit Singh (20084057)
B.Tech 8th Semester Project 2012 Team Members: Akash (20084087), Anshul Goyal (20084027), Shrish Chandra Mishra (20084050), Amit Singh (20084057)
When we download images from the World Wide Web, they arrive in arbitrary order.
We often want these images in some semantic order. Image processing is impractical for this purpose because of the large computational overhead involved.
Images on the Web are not found alone; they are embedded in an HTML page along with text related to the image.
The surrounding text of an image can be used to determine its context. By performing semantic analysis on the surrounding text, we can obtain a relevant ordering of the images.
When the surrounding text is considered, the information we get is: "Valley of Flowers National Park is an Indian national park, nestled in the West Himalayas. It is located in Uttarakhand state."
(Source: Wikipedia.org)
1. The search term is checked against the cache of already downloaded images.
2. If the search term is not cached, Module 1 downloads the images and their HTML documents using the search term. Images are downloaded using producer-consumer threads.
3. Thumbnails are created from the downloaded images and placed in an array.
4. Module 2 extracts the concepts from the corpora and arranges the images using semantic analysis.
5. The user can view an image and its corresponding corpus by clicking its thumbnail.
Module 1
Flow: Search Term -> Create Web Search URL -> Web search and get resulting image URLs -> Download images and their corresponding HTML documents
Search Constraints
1. Number of images to be downloaded
2. Image size
The web search URL is created from the search term and the search constraints, and the image URLs are retrieved from the results page. The images and their HTML pages are then downloaded from those URLs using Python's mechanize library and threads.
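As a minimal sketch of the URL-building step, the snippet below combines a search term with the two search constraints. The endpoint `SEARCH_ENDPOINT` and the parameter names `q`, `num` and `size` are hypothetical placeholders; the slides do not say which search service or query parameters the project used.

```python
from urllib.parse import urlencode

# Hypothetical endpoint; the real search service is not named in the slides.
SEARCH_ENDPOINT = "https://images.example.com/search"


def build_search_url(term, max_images=20, size="medium"):
    """Combine the search term and the search constraints into one URL."""
    # Parameter names are assumptions made for this illustration only.
    params = {"q": term, "num": max_images, "size": size}
    return f"{SEARCH_ENDPOINT}?{urlencode(params)}"


url = build_search_url("valley of flowers", max_images=10, size="large")
print(url)
```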
Threading
Downloading is implemented with producer-consumer threads. The producer thread creates a new thread for every new image and puts it in a queue.
Each of these threads downloads an image and its corresponding HTML file. The consumer thread takes a thread from the queue, checks the validity of the image, and then creates an entry for it in the database. Entries are created serially to maintain consistency.
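The producer-consumer arrangement described above could look roughly like the following sketch. It uses only the standard library; `fake_download` stands in for the real network download, the validity check is a placeholder, and a plain list plays the role of the database. All of these names are assumptions for illustration, not the project's code.

```python
import queue
import threading


def fake_download(url):
    """Stand-in for the real download: returns (image_bytes, html_text)."""
    return b"IMG:" + url.encode(), f"<html>page for {url}</html>"


def producer(urls, jobs, results):
    """Start one download thread per image and queue the threads in order."""
    for url in urls:
        def work(u=url):
            results[u] = fake_download(u)
        worker = threading.Thread(target=work)
        worker.start()
        jobs.put((url, worker))
    jobs.put(None)  # sentinel: no more downloads


def consumer(jobs, results, database):
    """Take threads off the queue one at a time and record valid images."""
    while True:
        job = jobs.get()
        if job is None:
            break
        url, worker = job
        worker.join()  # wait for this download to finish
        image, _html = results[url]
        if image:  # placeholder validity check
            # Only the consumer writes entries, so they are created serially.
            database.append({"id": len(database) + 1, "image_url": url})


def run(urls):
    jobs, results, database = queue.Queue(), {}, []
    p = threading.Thread(target=producer, args=(urls, jobs, results))
    c = threading.Thread(target=consumer, args=(jobs, results, database))
    p.start()
    c.start()
    p.join()
    c.join()
    return database
```

Because a single consumer drains a FIFO queue, the database entries follow the order in which the producer queued the downloads, even though the downloads themselves run concurrently.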
Surrounding text is extracted from the image's HTML page using Python's BeautifulSoup library and is saved as a .corpora file on the system alongside the corresponding image. For every new image downloaded, a new entry is created in the database with information such as the image URL, the website URL, the search term and the image's file name on the hard disk.
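The project used BeautifulSoup for this step. To stay self-contained, the sketch below swaps in the standard library's `html.parser` instead, and it assumes one simple definition of "surrounding text": the few text chunks just before and after the target `<img>` tag. The class name, the `window` parameter and that definition are all assumptions for illustration.

```python
from html.parser import HTMLParser


class SurroundingText(HTMLParser):
    """Collect page text in order and remember where the target image sits."""

    def __init__(self, image_src):
        super().__init__()
        self.image_src = image_src
        self.chunks = []          # text chunks in document order
        self.image_index = None   # position of the image within chunks
        self._skip_depth = 0      # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "img" and self.image_index is None:
            attrs = dict(attrs)
            if attrs.get("src") == self.image_src:
                self.image_index = len(self.chunks)
                if attrs.get("alt"):
                    self.chunks.append(attrs["alt"].strip())

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self._skip_depth:
            self.chunks.append(text)


def surrounding_text(html, image_src, window=2):
    """Return the text chunks within `window` positions of the image."""
    parser = SurroundingText(image_src)
    parser.feed(html)
    parser.close()
    if parser.image_index is None:
        return ""
    start = max(0, parser.image_index - window)
    return " ".join(parser.chunks[start:parser.image_index + window + 1])
```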
Module 2
Flow: Corpora -> Removing Stop Words -> Arranging Images -> Output
Module 2 takes the corpora from Module 1 as input. Irrelevant stop words are removed from the corpora using a predefined dictionary of stop words.
Ex.
Before: Stopwords are common words that carry less important meaning than keywords. Usually search engines remove these words from Keyword phrase
After: Stopwords common words important meaning keywords search engines remove stopwords keyword phrase
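Stop-word removal against a predefined dictionary can be sketched as below. The `STOP_WORDS` set here is a tiny illustrative sample, not the dictionary the project used, and the tokenizer (lowercase alphanumeric runs) is likewise an assumption.

```python
import re

# Small illustrative sample; the project's actual stop-word list is not given.
STOP_WORDS = {
    "a", "an", "and", "are", "carry", "from", "in", "is", "less",
    "of", "than", "that", "the", "these", "to", "usually",
}


def remove_stop_words(text, stop_words=STOP_WORDS):
    """Lowercase the text, split it into words, and drop the stop words."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [word for word in words if word not in stop_words]


sentence = "Stopwords are common words that carry less important meaning than keywords."
print(" ".join(remove_stop_words(sentence)))
```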
A stemmer algorithm is used to obtain the morphological root of each word in the corpora. The stemmer is applied using Python's whoosh library.
Ex. (Stemmer Algo.)
Chatter -> Chat
Running -> Run
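To make the idea of stemming concrete, here is a deliberately simplified suffix-stripping sketch. It is a toy written for this illustration, not the algorithm that whoosh implements, and a real stemmer may produce different roots for the same words.

```python
# Toy suffix list, checked in order; chosen only to illustrate the idea.
SUFFIXES = ("ing", "er", "ed", "s")


def toy_stem(word):
    """Strip one common suffix and undo consonant doubling."""
    word = word.lower()
    for suffix in SUFFIXES:
        # Require at least three letters to remain, so "sing" is left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[: -len(suffix)]
            # "runn" -> "run", "chatt" -> "chat"
            if word[-1] == word[-2] and word[-1] not in "aeiou":
                word = word[:-1]
            break
    return word


for example in ("Chatter", "Running"):
    print(example, "->", toy_stem(example))
```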
Concepts are extracted from the corpora by comparing them against a given standard ontology in OWL format. The images are then arranged on the basis of the concepts extracted for each image. The extracted concepts and the keywords from the surrounding text are entered into the database for that image.
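A very rough sketch of the comparison step follows. It does not parse OWL; instead it assumes the ontology has already been reduced to a mapping from concept names to the terms that signal them. The mapping `ONTOLOGY_TERMS` and its contents are hypothetical.

```python
# Hypothetical ontology fragment: concept name -> stemmed terms that indicate it.
ONTOLOGY_TERMS = {
    "Flower": {"flower", "blossom"},
    "Mountain": {"mountain", "himalaya", "peak"},
    "Park": {"park", "reserve"},
}


def extract_concepts(stems, ontology=ONTOLOGY_TERMS):
    """Return the concepts whose terms appear among the corpus stems."""
    stems = set(stems)
    return sorted(name for name, terms in ontology.items() if terms & stems)


print(extract_concepts(["valley", "flower", "national", "park"]))
```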
A common RDF file is created for storing the information. A general entry for an image in the database is shown below.
<rdf:Description rdf:about="URI of IMAGE">
  <image:image_name> </image:image_name>
  <image:URL_webpage> </image:URL_webpage>
  <image:URL_image> </image:URL_image>
  <image:Search_term> </image:Search_term>
  <image:keyword> </image:keyword>
  <image:concept_name> </image:concept_name>
</rdf:Description>
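One way to generate such an entry is with the standard library's `xml.etree.ElementTree`, as sketched below. The namespace URI bound to the `image` prefix is a hypothetical placeholder because the slides do not give one, and the function name and arguments are illustrative.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
IMAGE_NS = "http://example.org/image#"  # hypothetical namespace URI

ET.register_namespace("rdf", RDF_NS)
ET.register_namespace("image", IMAGE_NS)


def image_description(uri, name, page_url, image_url, search_term,
                      keywords, concepts):
    """Build one rdf:Description element for a downloaded image."""
    desc = ET.Element(f"{{{RDF_NS}}}Description", {f"{{{RDF_NS}}}about": uri})

    def add(tag, text):
        ET.SubElement(desc, f"{{{IMAGE_NS}}}{tag}").text = text

    add("image_name", name)
    add("URL_webpage", page_url)
    add("URL_image", image_url)
    add("Search_term", search_term)
    for keyword in keywords:    # one element per keyword
        add("keyword", keyword)
    for concept in concepts:    # one element per concept
        add("concept_name", concept)
    return desc


entry = image_description(
    "urn:image:1", "valley.jpg", "http://example.org/page",
    "http://example.org/valley.jpg", "valley of flowers",
    ["flower", "park"], ["Flower"],
)
print(ET.tostring(entry, encoding="unicode"))
```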
Keywords are suggested by comparing the frequencies of the words in the surrounding text. The words with the highest frequency are shown to the user.
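Frequency-based keyword suggestion can be sketched with `collections.Counter`. The function name and the `top_n` cutoff are assumptions; the slides do not say how many keywords are shown.

```python
from collections import Counter


def suggest_keywords(words, top_n=3):
    """Return the top_n most frequent words, most frequent first."""
    return [word for word, _count in Counter(words).most_common(top_n)]


words = ["park", "flower", "park", "valley", "flower", "park"]
print(suggest_keywords(words, top_n=2))
```

The input is expected to be the corpus after stop-word removal and stemming, so that common function words do not dominate the counts.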
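The similarity between two images is computed from the intersection and the union of the concepts extracted from the surrounding text of both images. Formula used: $\text{similarity} = \frac{|C_1 \cap C_2|}{|C_1 \cup C_2|}$, where $C_1$ is the set of concepts from image 1 and $C_2$ is the set of concepts from image 2.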
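A ratio of intersection size to union size over two concept sets is the Jaccard coefficient. The sketch below assumes that this ratio is the similarity measure intended by the slide, and the choice to return 0.0 when both sets are empty is also an assumption.

```python
def concept_similarity(concepts_a, concepts_b):
    """Jaccard coefficient of two concept collections."""
    a, b = set(concepts_a), set(concepts_b)
    union = a | b
    if not union:
        return 0.0  # assumed convention when neither image has concepts
    return len(a & b) / len(union)


print(concept_similarity({"Flower", "Park"}, {"Park", "Mountain"}))
```

Images could then be arranged so that pairs with a higher score are placed closer together.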
A user can query the database on 3 attributes of the image:
1. Surrounding text keyword
2. Search term
3. Concept
The images with the corresponding attributes are shown to the user.
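The three-attribute query can be illustrated with a simple in-memory filter. The record layout (keys `keywords`, `search_term`, `concepts`) is hypothetical and stands in for whatever the project's RDF store returns.

```python
def query_images(records, keyword=None, search_term=None, concept=None):
    """Return the records that match every attribute that was supplied."""
    matches = []
    for record in records:
        if keyword is not None and keyword not in record["keywords"]:
            continue
        if search_term is not None and search_term != record["search_term"]:
            continue
        if concept is not None and concept not in record["concepts"]:
            continue
        matches.append(record)
    return matches


records = [
    {"name": "valley.jpg", "search_term": "valley of flowers",
     "keywords": ["flower", "park"], "concepts": ["Flower", "Park"]},
    {"name": "peak.jpg", "search_term": "himalaya",
     "keywords": ["mountain", "snow"], "concepts": ["Mountain"]},
]
print([r["name"] for r in query_images(records, concept="Mountain")])
```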
Applying the arrangement of images to other domains. Removing duplicate images by finding relationships between their surrounding texts. Optimizing the data structures used.