
Artificial Intelligence Assignment 2 Darren Curran L00070815

The Effect That AI Will Have On the Future Semantic Web (Web 3.0)

Introduction to Search Engines

For a long time now we have been sitting in amazement at the development of search engines such as Google and Yahoo, and at how they have enhanced our web browsing experience. We can find almost anything we want as long as we provide relevant information. The problem is that most of the time websites are titled according to the author's preference, and some of them may be sitting on servers with cryptic names. This is why we have come to rely so much on search engines.

Before the introduction of the World Wide Web, internet users had to rely on simple indexing programs such as Gopher and Archie, which basically stored indexes of the files held on servers across the internet. At the time this greatly reduced the amount of time it took to locate a file or program online.

How Search Engines Work

The way that search engines work nowadays is quite different. Once you provide your information, the search engine reacts in a few ways:

- They search the Internet -- or select pieces of the Internet -- based on important words.
- They keep an index of the words they find, and where they find them.
- They allow users to look for words or combinations of words found in that index.

The way that they actually search the internet is by using special software agents known as spiders, which build lists of the words found on websites; this process is known as web crawling. In order to build a useful list of results, these spiders have to look at many pages. A spider starts off with popular sites and servers, indexes the words on their pages and follows every link found within each site, and it is because of this that the spider branches out ever more widely across the World Wide Web.
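A rough sketch of this crawl-and-branch-out behaviour is shown below, using only Python's standard library. The start URL and page limit are placeholders, and a real spider would respect robots.txt and parse HTML far more carefully.

import re
import urllib.request
from collections import deque

def crawl(start_url, max_pages=10):
    # Visit pages starting from one URL, recording the words found on each
    # page and queueing every absolute link so the crawl branches outwards.
    seen, queue, word_lists = set(), deque([start_url]), {}
    while queue and len(seen) < max_pages:
        url = queue.popleft()
        if url in seen:
            continue
        seen.add(url)
        try:
            html = urllib.request.urlopen(url, timeout=5).read().decode("utf-8", "ignore")
        except Exception:
            continue
        # Build the list of words found on this page (tags stripped out)
        word_lists[url] = re.findall(r"[a-zA-Z]+", re.sub(r"<[^>]+>", " ", html).lower())
        # Follow every absolute link found within the page
        for link in re.findall(r'href="(http[^"]+)"', html):
            queue.append(link)
    return word_lists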

Google began as an academic search engine that used multiple spiders, usually three at once, each keeping up to 300 connections to web pages open at one time. At its best performance, using four spiders, the system could crawl over 100 pages per second, generating around 600 kilobytes of data per second. To keep everything running quickly and efficiently they had to build a system to relay important information to the spiders: the early Google system had a server dedicated to providing URLs to the spiders. Rather than depending on an Internet Service Provider for the Domain Name Server (DNS) that translates a server's name into an address, Google had its own DNS in order to keep delays to a minimum. When the Google spider looked at an HTML page, it took note of two things: the words within the page and where those words were found. This allowed the Google spider to index all the important words.
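The sketch below illustrates the general shape of that "URL server feeding several spiders" arrangement, not Google's actual code: a shared queue stands in for the URL server and a few worker threads stand in for the spiders. The URLs are placeholders.

import queue
import threading
import urllib.request

url_server = queue.Queue()          # stands in for the dedicated URL server
for u in ["http://example.com/", "http://example.org/"]:   # placeholder URLs
    url_server.put(u)

def spider(worker_id):
    # Each spider repeatedly asks the URL server for the next page to fetch.
    while True:
        try:
            url = url_server.get_nowait()
        except queue.Empty:
            return
        try:
            page = urllib.request.urlopen(url, timeout=5).read()
            print(f"spider {worker_id} fetched {len(page)} bytes from {url}")
        except Exception as err:
            print(f"spider {worker_id} failed on {url}: {err}")
        finally:
            url_server.task_done()

threads = [threading.Thread(target=spider, args=(i,)) for i in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()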

Meta Tags

More recently we have also started using what are called meta tags, special keywords by which the author of a webpage can have it indexed. This is most helpful where a word may have more than one meaning, as meta tags can indicate which meaning is intended. However, this has been abused quite frequently, with authors using very popular meta tags to draw people towards their webpages. Search engines had to adapt to this by comparing the meta tags to the content of the webpage and rejecting those that don't add up.
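One way such a check could work is sketched below, under the simple assumption that a declared keyword is suspect if it never appears in the page body; real engines use far more sophisticated signals.

import re

def suspicious_keywords(html):
    # Pull the declared keywords out of the meta tag
    match = re.search(r'<meta\s+name="keywords"\s+content="([^"]*)"', html, re.I)
    keywords = [k.strip().lower() for k in match.group(1).split(",")] if match else []
    # Collect the words that actually appear in the page body (tags stripped)
    body_words = set(re.findall(r"[a-z]+", re.sub(r"<[^>]+>", " ", html.lower())))
    # Flag keywords that the content does not back up
    return [k for k in keywords if k and k not in body_words]

page = '<meta name="keywords" content="widgets, celebrities"><p>Cheap widgets for sale</p>'
print(suspicious_keywords(page))    # ['celebrities'] -- a keyword the page never mentions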

Indexing

Once we have as many possible matches as we can find, the search engine has to store the information in a convenient manner. The simplest way would be to store just the URL and the word you entered, but this would be quite limited: you wouldn't know how important your term was on the webpage, or whether it provides you with other related links; in essence there would be no way of rating a website's relevance. Different search engines use different techniques for measuring the relevance of each item in the index, giving each item a weighting depending on where the sought-after term was found most: meta tags, titles, body text and so on. The data then needs to be encoded so that information on the results is stored in compact form. In Google's first published paper they described their encoding as 2 bytes of 8 bits each, which stored information on whether the word was capitalized, its font size, its position and so on. As time has gone on, Google has become more and more successful with its patented PageRank algorithm, which assigns each search result a relevancy score. Each result's relevancy score is based on a few things:

- The frequency and location of keywords within the Web page: if the keyword only appears once within the body of a page, it will receive a low score for that keyword.
- How long the Web page has existed: people create new Web pages every day, and not all of them stick around for long. Google places more value on pages with an established history.
- The number of other Web pages that link to the page in question: Google looks at how many Web pages link to a particular site to determine its relevance (a simplified sketch of this link-counting idea follows below).
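The sketch below shows only the link-based part of that scoring in a much-simplified form; the real, patented PageRank algorithm has many more refinements, and the tiny three-page web used here is purely illustrative.

def pagerank(links, damping=0.85, iterations=20):
    # links maps each page to the list of pages it links to
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1 - damping) / len(pages) for p in pages}
        for page, outgoing in links.items():
            for target in outgoing:
                # a page passes a share of its own rank to every page it links to
                new_rank[target] += damping * rank[page] / len(outgoing)
        rank = new_rank
    return rank

toy_web = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
print(pagerank(toy_web))   # "c" scores highest: two pages link to it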

The method by which indexes are usually built is to construct a hash table, which uses a formula to assign a numerical value (score) to each word. Distributing words by these numbers rather than by their letters is key to the success of this principle: with such a broad range of words in the English language you will find many more words beginning with D than, say, Z, which would mean that searching for a word beginning with D would take a lot more time. Hashing evens this out and makes it easier to find entries, allowing us to store the hashed number along with a pointer to the actual entry in our hash table. The combination of efficient indexing and effective storage is what allows us to find data and web pages quickly.
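A minimal sketch of this hashed index idea is given below. Python's dict already hashes its keys, which gives exactly the even spread described above; the example page and URL are made up.

from collections import defaultdict

index = defaultdict(list)            # hashed word -> list of (page, position) entries

def add_page(url, text):
    # Record, for every word, which page it appears on and where
    for position, word in enumerate(text.lower().split()):
        index[word].append((url, position))   # the pointer to the actual entry

def lookup(word):
    return index.get(word.lower(), [])

add_page("http://example.com/widgets", "cheap widgets for sale")
print(lookup("widgets"))   # [('http://example.com/widgets', 1)]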

Google's PageRank algorithm has revolutionized the way we browse the web, helping us to find web pages with minimal effort. Based on some simple criteria it intelligently returns the results that are most relevant to us, trawling through millions of pages and evaluating them before returning them to us.

I think the average web user would say that our search engines are working to perfection at the minute, taking just milliseconds to return an abundance of quality search results which more often than not allow us to find what we are looking for. So what exactly is in store for the future of the web? What can we do to make our browsing even better and our search engines more accurate?

The Semantic Web

Experts see the future of the Web, Web 3.0, as being similar to a personal assistant. At the minute we are happy enough to search through our Google results and find information out for ourselves, but in the future we envisage the web being able to forge a profile of you based on your interests while you browse, basically letting you sit back while the web does the work. They see the future search engine being able not only to find keywords but to interpret their context in order to return relevant results. It would treat the entire internet as a database of information.

How Semantic Web Will Work

The semantic web as described by Tim Berners-Lee (inventor of the WWW) is: "The Semantic Web is not a separate Web but an extension of the current one, in which information is given well-defined meaning, better enabling computers and people to work in cooperation." To make this happen we need to give computers the ability to read, which is pointed out in the W3C's definition: "The Semantic Web is a vision: the idea of having data on the web defined and linked in a way that it can be used by machines not just for display purposes, but for automation, integration and reuse of data across various applications."

To even get close to this, it is a case of out with the old and in with the new.

Problems With HTML

Files on a computer can mainly be characterized as documents or data. Documents are files which are primarily meant to be read by humans, such as e-mail, reports and brochures. Data files, such as spreadsheets, can be viewed, searched, edited or even combined. HTML falls under documents, which means it is mainly for display purposes and holds no real information; it only serves to present information on a page in a certain manner. XML, on the other hand, allows us to store information about our pages through tags such as the meta tags shown below:

<meta name="keywords" content="computing, computer studies, computer">
<meta name="description" content="Cheap widgets for sale">
<meta name="author" content="John Doe">

We can describe the content of a webpage using tags which can be read by the browser, rather than having it trawl through the whole content of the webpage.
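As a small illustration of reading those descriptive tags instead of the body text, the sketch below pulls the meta tags out of a page using only the standard library's HTMLParser; the sample markup echoes the example above.

from html.parser import HTMLParser

class MetaReader(HTMLParser):
    # Collects name/content pairs from every <meta> tag encountered
    def __init__(self):
        super().__init__()
        self.meta = {}
    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            attrs = dict(attrs)
            if "name" in attrs and "content" in attrs:
                self.meta[attrs["name"]] = attrs["content"]

reader = MetaReader()
reader.feed('<meta name="author" content="John Doe">'
            '<meta name="description" content="Cheap widgets for sale">')
print(reader.meta["description"])   # Cheap widgets for sale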

RDF

For our browsers to understand these tags and this information we need a few new technologies. The first is the Resource Description Framework (RDF), which uses XML tags to provide a framework for describing resources. This is what lets us describe the content in the previous example, but it can also be used to record the authors of a resource, its date of creation or updating, the organization of the pages on a site (the sitemap), information that describes content in terms of audience or content rating, keywords for search engine data collection, and subject categories. This is advantageous for a number of reasons:

- Because RDF will include a standard syntax for describing and querying data, software that exploits metadata will be easier and faster to produce.
- The standard syntax and query capability will allow applications to exchange information more easily.
- Searchers will get more precise results from searching, based on metadata rather than on indexes derived from full-text gathering.
- Intelligent software agents will have more precise data to work with, as the sketch below illustrates.
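The sketch below describes a page as RDF data rather than display markup and then queries it. It assumes the third-party rdflib package (pip install rdflib); the page URL and the choice of Dublin Core properties are purely illustrative.

from rdflib import Graph, Literal, URIRef
from rdflib.namespace import DC

page = URIRef("http://example.com/widgets")   # hypothetical page being described

g = Graph()
g.bind("dc", DC)
g.add((page, DC.creator, Literal("John Doe")))
g.add((page, DC.description, Literal("Cheap widgets for sale")))
g.add((page, DC.subject, Literal("computing")))

# Any RDF-aware agent can now query or merge this data, e.g. with SPARQL:
for row in g.query(
    "SELECT ?who WHERE { ?page <http://purl.org/dc/elements/1.1/creator> ?who }"
):
    print(row.who)                  # John Doe

print(g.serialize(format="xml"))    # the same triples serialized as RDF/XML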
