
So, now we're going to write a set of applications, and the code for them is in pagerank.zip. That's a simple webpage crawler, then a simple webpage indexer, and then we're going to visualize the resulting network using a visualization tool called d3.js. So, in a search engine, there are three basic things that we do.
First, we have a process that's usually done sort of when the computers are bored. They crawl the web by retrieving a page, pulling out all the links, and keeping a list, an input queue of links; they go through those links one at a time, mark off the ones we've got, pick the next one, and on and on. So, that first step is the front-end process: spidering or crawling. Then, once you have the data, you do what's called index building, where you look at the links between the pages to get a sense of which are the most centrally located and which are the most respected pages, where respect is defined by who points to whom. Then, we actually look through and search it. In this case we won't really search it; we'll visualize the index when we're done.

So, a web crawler is a program that browses the web in some automated manner. The idea is that Google and other search engines, including the one that you're going to run, don't actually want the web. They want a copy of the web, and then they can do data mining within their own copy of the web. It's just so much more efficient than having to go out and look at the web each time; you just copy it all. So, the crawler just slowly but surely crawls and gets as good a copy of the web as it can. Like I said, its goal is to retrieve a page, pull out all the links, add the links to the queue, then pull the next one off and do it again, and again, and again, and save all the text of those pages into storage. In our case, that storage will be a database; in Google's case, it's literally thousands or hundreds of thousands of servers, but for us we'll just do this in a database.
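
Here is a minimal sketch of that retrieve, parse, and queue loop. It is not the actual spider code from pagerank.zip; it just illustrates the idea, using urllib and BeautifulSoup, with dr-chuck.com as an assumed starting point and a small page limit so the sketch stops on its own.

    # Minimal sketch of the crawl loop, not the actual spider from pagerank.zip.
    import urllib.request
    from bs4 import BeautifulSoup

    queue = ['https://www.dr-chuck.com/']   # input queue of links to visit
    seen = set()                            # links we have already retrieved
    pages = {}                              # our copy of the web: url -> html

    while queue and len(seen) < 10:         # small limit so the sketch stops
        url = queue.pop(0)                  # pick the next link off the queue
        if url in seen:
            continue                        # mark off the ones we've got
        seen.add(url)
        try:
            html = urllib.request.urlopen(url).read()
        except Exception:
            continue                        # skip pages we cannot retrieve
        pages[url] = html                   # save the page text into storage
        soup = BeautifulSoup(html, 'html.parser')
        for tag in soup('a'):               # pull out all the links
            href = tag.get('href', None)
            if href and href.startswith('http'):
                queue.append(href)          # add the links to the queue
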
Now, web crawling is a bit of a science. We're going to be really simple: we're just going to try to get to the point where we've crawled every page that we can find once. That's what this application is going to do. But in the real world, you have to pick and choose how often to revisit pages and which pages are more valuable. So, real search engines tend to revisit pages more often if they consider those pages more valuable, but they also don't want to revisit them too often, because Google could crush your website and make it so that your users can't use it, because Google is hitting you so hard.

There's also, in the world of web crawling, this file called robots.txt. It's a simple file that a website publishes; when search engines see a domain for the first time, they download it, and it informs them where to look and where not to look. So, you can take a look at py4e.com, look at its robots.txt, and see what my website is telling all the spiders about where to go look and where the good stuff is.
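
As a small sketch of how a crawler might honor that file, Python's standard library includes urllib.robotparser, which can download a robots.txt and answer whether a given URL may be fetched. The py4e.com address is the one mentioned above; the path passed to can_fetch is just an illustrative example.

    # Sketch of checking robots.txt with the standard library's robotparser.
    import urllib.robotparser

    rp = urllib.robotparser.RobotFileParser()
    rp.set_url('https://www.py4e.com/robots.txt')
    rp.read()                                     # download and parse robots.txt

    # Ask whether a crawler (user agent '*') may fetch a particular URL
    print(rp.can_fetch('*', 'https://www.py4e.com/lessons'))
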
So, at some point you've built this and you have your own storage, and it's time to build an index. The idea is to figure out which pages are better than other pages, and certainly you start by looking at all the words in the pages, with Python word splits and so on. But the other thing we're going to do is look at the links between the pages and use those links as a way to ascribe value.

So, here's the process that we're going to run. There are a couple of different programs in the code, and all of it is sitting here in pagerank.zip. The way it works is that it actually only spiders a single website; you can spider dr-chuck.com, or you can spider Wikipedia. It's kind of interesting, but it takes a little longer before the links start to go back to one another on Wikipedia. Still, Wikipedia is not a bad place to start if you want to run something long, because at least Wikipedia doesn't get mad at you for using it too much. So, as with all of these sort of data mining things, there's a process here.
The crawling program basically builds up a list of URLs, so we end up with a list of URLs where some of the URLs have data and some do not, and it randomly picks one of the unretrieved URLs. It goes and grabs that URL, parses it, and puts the data in for that URL, but then it also reads through the page to see if there are more links. So, in this database, there are a few pages that have been retrieved and lots of pages yet to retrieve. Then it goes back and says, oh, let's randomly pick another unretrieved URL. Go get that one, pull it in, put the text for that one in, but then look at all the links and add those links to our list. If you watch this, even if you do like one or two documents at a time, you'll be like "Wow, that was a lot of links," and then you grab another page and there's 20 links, or 60 links, or 100 links. You're not Google, so you don't have the whole internet, and what you find is that as you touch any part of the internet, the number of links explodes and you end up with so many links that you haven't retrieved. But if you're Google, after a year you've seen it all once, and then your data gets more dense. So, that's why in this program we stay with one website.
So eventually, you get some of those links filled in and you have more than one set of pointers. The other thing in here is that we keep track of which pages point to which pages, right, the little arrows. Each page gets a number inside this database, like a primary key, and so we can keep track of which pages point to which, and we're going to use these inbound and outbound links to compute the PageRank. That is, the more inbound links you have from sites that themselves have a good number of inbound links, the better we like that site. So, that's a better site.
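
The exact tables in pagerank.zip may differ, but a sketch of that idea in SQLite looks something like this: a Pages table where each page gets an integer primary key, and a Links table recording the little arrows, which page id points to which page id. The table and column names here are illustrative assumptions.

    # Illustrative sketch of the idea, not the exact schema used in pagerank.zip:
    # each page gets an integer primary key, and Links records the arrows.
    import sqlite3

    conn = sqlite3.connect('spider.sqlite')
    cur = conn.cursor()

    cur.executescript('''
    CREATE TABLE IF NOT EXISTS Pages
        (id INTEGER PRIMARY KEY, url TEXT UNIQUE, html TEXT, rank REAL);

    CREATE TABLE IF NOT EXISTS Links
        (from_id INTEGER, to_id INTEGER, UNIQUE(from_id, to_id));
    ''')
    conn.commit()
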
So, the PageRank algorithm is a thing that reads through this data and then writes the data back, and it takes a number of passes through all of the data to get the PageRank values to converge. These are numbers that converge toward the goodness of each page, and you can run this as many times as you want. The ranking runs really quickly; the spidering runs really slowly because it's got to talk to the network, pull these things back, and talk to the network again, and that's why we can restart it. The PageRank step is all just talking to data inside that database, and it's super fast. Then, if you want to reset everything to the initial value of the PageRank algorithm, you can do that, and it just sets them all back to the initial value, I think a value of one, so they all start with a goodness of one, and then some of them end up with goodnesses of five or 0.01. The more you run this, the more this data converges. So, these data items tend to converge after a while. The first few times they jump around a bunch, and then later they jump around less and less.
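
Here is a simplified, in-memory sketch of that kind of iteration; the real ranking program in pagerank.zip works against the database and is a bit more involved, and this sketch leaves out the damping factor. Every page starts with a goodness of 1.0, and on each pass a page gives its rank away through its outbound links, so pages with well-ranked inbound links accumulate more. The little three-page graph is hypothetical.

    # Simplified in-memory sketch of the ranking pass, without the damping
    # factor; the real ranking code in pagerank.zip works against the database.
    links = {                   # hypothetical graph: page -> pages it links to
        'A': ['B', 'C'],
        'B': ['C'],
        'C': ['A'],
    }

    ranks = {page: 1.0 for page in links}       # everyone starts with goodness 1.0

    for _ in range(20):                         # more passes, better convergence
        new_ranks = {page: 0.0 for page in links}
        for page, outbound in links.items():
            share = ranks[page] / len(outbound) # give this page's rank away...
            for target in outbound:
                new_ranks[target] += share      # ...through its outbound links
        ranks = new_ranks

    print(ranks)    # pages with more well-ranked inbound links end up higher
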
Then, at any point in time as you run this ranking application, you can pull the data out and dump it to look at the PageRank values; this particular page, say, has a PageRank value of one. If you dump them out and they all have the same PageRank, it's probably because you've just run spreset. After you've run the ranker, when you run spdump you will see that these numbers start to change. This stuff is all in the README file that's sitting in the zip file once you unzip it. So, spdump just reads the stuff and prints it out, and then spjson also reads through all the stuff that's in here, takes the best 20 or so links with the best PageRank, and dumps them into a JavaScript file. Then there is some HTML and d3.js, which is a visualization that produces this pretty picture where the bigger dots are the ones with a better PageRank, and you can grab this and move all this stuff around, and it's nice and fun and exciting.
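
A sketch in the spirit of that last step: take the pages with the best rank out of whatever you stored, and write the top 20 into a small JavaScript file that a d3.js page can load. The pretend ranks, the variable name, and the output file name below are illustrative assumptions, not the exact ones in pagerank.zip.

    # Sketch in the spirit of spjson: dump the best-ranked pages into a
    # JavaScript file for a d3.js visualization to read.
    import json

    ranks = {                              # pretend output of the ranking step
        'https://www.dr-chuck.com/': 5.0,
        'https://www.py4e.com/': 2.5,
        'https://www.example.com/': 0.01,
    }

    top = sorted(ranks.items(), key=lambda pair: pair[1], reverse=True)[:20]
    nodes = [{'url': url, 'rank': rank} for url, rank in top]

    with open('spider.js', 'w') as handle:
        # a <script> tag in the HTML can load this as a global variable
        handle.write('spiderJson = ' + json.dumps({'nodes': nodes}, indent=2) + ';\n')
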
So, we visualize, right? Again, we have a multi-step process: a slow, restartable crawling process, then a sort of fast data analysis and cleanup process, and then a final output process that pulls stuff out of there. So, it's another one of these multi-step data mining processes. The last thing that we're going to talk about is visualizing mail data. We're going to go from Mbox-short, to Mbox, to Mbox-super-gigantomatic. That's what we're going to do next.
