Web Crawler
A web crawler, or spider, is a type of bot typically operated by search engines such as Google and
Bing. Its purpose is to index the content of websites across the Internet so that those websites can
appear in search engine results.
Example:
import urllib.request

def get_page(url):
    # Fetch the raw HTML of a page; return an empty string on failure.
    try:
        return urllib.request.urlopen(url).read().decode('utf-8', errors='ignore')
    except Exception:
        return ""

def get_next_target(s):
    # Find the next '<a href=' tag and return the quoted URL plus the
    # position just past it, or (None, 0) if no link remains.
    start_link = s.find('<a href=')
    if start_link == -1:
        return None, 0
    start_quote = s.find('"', start_link)
    end_quote = s.find('"', start_quote + 1)
    url = s[start_quote + 1:end_quote]
    return url, end_quote

def print_all_links(page):
    # Repeatedly extract and print links until none remain.
    while True:
        url, endpos = get_next_target(page)
        if url:
            print(url)
            page = page[endpos:]
        else:
            break

print_all_links(get_page("https://www.facebook.com/"))
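Printing a page's links is only the first step; a crawler actually follows those links to discover and index further pages. A minimal sketch of that idea, reusing the same link-extraction logic, is below. The `crawl_web` function, the `get_all_links` helper, and the in-memory `SITE` dictionary (used in place of real HTTP fetches so the example runs offline) are illustrative additions, not part of the original example.

```python
def get_next_target(s):
    # Same extraction logic as above: find the next '<a href=' and pull the URL.
    start_link = s.find('<a href=')
    if start_link == -1:
        return None, 0
    start_quote = s.find('"', start_link)
    end_quote = s.find('"', start_quote + 1)
    return s[start_quote + 1:end_quote], end_quote

def get_all_links(page):
    # Collect every link on a page instead of printing them.
    links = []
    while True:
        url, endpos = get_next_target(page)
        if url:
            links.append(url)
            page = page[endpos:]
        else:
            break
    return links

def crawl_web(seed, get_page):
    # Breadth-first crawl: visit each page once, starting from the seed,
    # and return the list of pages crawled (the crawler's "index").
    to_crawl = [seed]
    crawled = []
    while to_crawl:
        page = to_crawl.pop(0)
        if page not in crawled:
            to_crawl.extend(get_all_links(get_page(page)))
            crawled.append(page)
    return crawled

# Hypothetical three-page site standing in for the live web.
SITE = {
    "/index": '<a href="/a">A</a> <a href="/b">B</a>',
    "/a": '<a href="/index">home</a>',
    "/b": "no links here",
}

print(crawl_web("/index", lambda url: SITE.get(url, "")))
```

Because each visited page is recorded in `crawled` before its links are expanded again, the crawler terminates even when pages link back to each other, as `/a` links back to `/index` here.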