Chapter 3
● Python Project
○ Write a Python program that takes a URL on the command line, fetches the page, and outputs (one per line); see the sketch below
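The project statement above does not say what to output; a minimal sketch, assuming the goal is to print every link found on the page one per line, using only the standard library:

import sys
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    # Collects href values from <a> tags and resolves them against the page URL.
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))

if __name__ == "__main__":
    url = sys.argv[1]
    html = urlopen(url).read().decode("utf-8", errors="replace")
    parser = LinkExtractor(url)
    parser.feed(html)
    for link in parser.links:
        print(link)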
● Even crawling a site slowly will anger some web server administrators, who
object to any copying of their data
● Robots.txt file can be used to control crawlers
○ Try loading google.com/robots.txt
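A small sketch of how a crawler can honor robots.txt, using the standard library's urllib.robotparser; the "*" user agent and the example paths are illustrative, and the answers depend on whatever rules the live file contains:

from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://www.google.com/robots.txt")
rp.read()

# Ask whether a generic crawler may fetch these (illustrative) URLs.
print(rp.can_fetch("*", "https://www.google.com/search?q=test"))
print(rp.can_fetch("*", "https://www.google.com/maps"))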
○ Build history over time for sites and pages to find sites and/or URLs that change frequently
● Focused (topical) crawling attempts to download only those pages that are about a particular topic
○ used by vertical search applications
● Rely on the fact that pages about a topic tend to have links to other pages
on the same topic
○ popular pages for a topic are typically used as seeds
● Sites that are difficult for a crawler to find are collectively referred to as the
deep (or hidden) Web
○ much larger than conventional Web
○ form results
■ sites that can be reached only after entering some data into a form
○ scripted pages
■ pages that use JavaScript, Flash, or another client-side language to generate links
● Sitemaps contain lists of URLs and data about those URLs, such as
modification time and modification frequency
● Generated by web server administrators
● Tells crawler about pages it might not otherwise find
● Gives crawler a hint about when to check a page for changes
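A sketch of reading a sitemap with the standard library; the sitemap URL is a placeholder, and real sitemaps may omit lastmod or changefreq for some entries:

import xml.etree.ElementTree as ET
from urllib.request import urlopen

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

with urlopen("https://www.example.com/sitemap.xml") as resp:  # placeholder URL
    root = ET.parse(resp).getroot()

for url in root.findall("sm:url", NS):
    loc = url.findtext("sm:loc", default="", namespaces=NS)
    lastmod = url.findtext("sm:lastmod", default="?", namespaces=NS)
    changefreq = url.findtext("sm:changefreq", default="?", namespaces=NS)
    print(loc, lastmod, changefreq)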
● Python Project
○ Write a Python program that builds a list of all files on your computer.
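A minimal sketch for this project; it assumes the user's home directory is a reasonable starting point (pass another directory as the first argument to change that) and writes the list to file_list.txt:

import os
import sys

root = sys.argv[1] if len(sys.argv) > 1 else os.path.expanduser("~")

with open("file_list.txt", "w", encoding="utf-8") as out:
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            out.write(os.path.join(dirpath, name) + "\n")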
● Two types of document feeds:
○ A push feed alerts the subscriber to new documents
○ A pull feed requires the subscriber to check periodically for new documents
○ Try Feedly
○ Try [india filetype:pdf] in Google, click on three dots and look for cached version
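A sketch of the pull-feed model described above: the subscriber polls an RSS feed and reports entries it has not seen before. The feed URL and the one-hour polling interval are placeholder choices; a push feed would instead notify the subscriber when a new document appears.

import time
import xml.etree.ElementTree as ET
from urllib.request import urlopen

FEED_URL = "https://www.example.com/rss.xml"  # placeholder feed
seen = set()

while True:
    with urlopen(FEED_URL) as resp:
        root = ET.parse(resp).getroot()
    for item in root.iter("item"):            # RSS 2.0 <item> elements
        link = item.findtext("link")
        if link and link not in seen:
            seen.add(link)
            print("new document:", item.findtext("title"), link)
    time.sleep(3600)                          # check again in an hour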
● Unicode: a single mapping from numbers to glyphs that attempts to include all glyphs in common use in all known languages
● Unicode is a mapping between numbers and glyphs
○ it does not uniquely specify the bits-to-glyph mapping! (that is the job of encodings such as UTF-8 and UTF-32)
● Many applications use UTF-32 for internal text encoding (fast random
lookup) and UTF-8 for disk storage (less space)
○ UTF-8 is an encoding system for Unicode. It can translate any Unicode character to a
matching unique binary string, and can also translate the binary string back to a Unicode
character.
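A short illustration of the two bullets above: the same string occupies far fewer bytes in UTF-8 than in UTF-32, and decoding recovers the original text exactly. The sample string is arbitrary.

text = "naïve 東京"                         # mixes 1-, 2-, and 3-byte UTF-8 characters

utf8_bytes = text.encode("utf-8")
utf32_bytes = text.encode("utf-32")

print(len(text), "code points")
print(len(utf8_bytes), "bytes in UTF-8")    # variable-length, compact for ASCII-heavy text
print(len(utf32_bytes), "bytes in UTF-32")  # 4 bytes per character, plus a byte-order mark
print(utf8_bytes.decode("utf-8") == text)   # True: the encoding is reversible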
○ A document data store provides efficient access to text for snippet generation, information extraction, etc.
○ Update
■ e.g., adding new anchor text to a stored page as other web pages that link to it are crawled
○ 30% of the web pages in a large crawl are exact or near duplicates of pages in the other
70%
● Search:
○ find near-duplicates of a document D
● Discovery:
○ find all pairs of near-duplicate documents in the collection
○ O(N²) comparisons
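For exact duplicates, the discovery case does not actually need O(N²) comparisons: hashing each document once and grouping equal checksums finds them in a single pass. Near-duplicates still need a fingerprint such as simhash (see the project below). The file names here are placeholders.

import hashlib
from collections import defaultdict

docs = ["page1.html", "page2.html", "page3.html"]  # placeholder document files

groups = defaultdict(list)
for path in docs:
    with open(path, "rb") as f:
        digest = hashlib.sha256(f.read()).hexdigest()
    groups[digest].append(path)

for digest, paths in groups.items():
    if len(paths) > 1:
        print("exact duplicates:", paths)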
● Many web pages contain text, links, and pictures that are not directly
related to the main content of the page
● This additional material is mostly noise that could negatively affect the
ranking of the page
● Techniques have been developed to detect the content blocks in a web
page
○ Non-content material is either ignored or reduced in importance in the indexing process
https://timesofindia.indiatimes.com/india/attack-on-merchant-vessels-mv-ruen-mv-chem-pluto-indian-navy-enhances-maritime-surveillance-efforts-in-arabian-sea/articleshow/106417489.cms
○ Main text content of the page corresponds to the “plateau” in the middle of the cumulative tag-token distribution (see the sketch below)
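One published way to find that plateau (a sketch, not necessarily the exact method covered in the slides) is to mark each token as tag or text and pick the span that keeps as many text tokens inside it, and as many tag tokens outside it, as possible. The tokenizer below is deliberately crude and the search is a simple O(n²) scan.

import re

def content_span(html):
    # Crude tokenization: a token is either a tag or a run of non-tag text.
    tokens = re.findall(r"<[^>]+>|[^<\s]+", html)
    is_tag = [1 if t.startswith("<") else 0 for t in tokens]
    n = len(tokens)

    # Prefix sums of tag counts let each candidate span be scored in O(1).
    prefix = [0]
    for b in is_tag:
        prefix.append(prefix[-1] + b)

    best, best_span = -1, (0, n)
    for i in range(n):
        for j in range(i, n + 1):
            tags_outside = prefix[i] + (prefix[n] - prefix[j])
            text_inside = (j - i) - (prefix[j] - prefix[i])
            score = tags_outside + text_inside
            if score > best:
                best, best_span = score, (i, j)

    i, j = best_span
    return " ".join(t for t in tokens[i:j] if not t.startswith("<"))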
○ Write a 64-bit hash function for a word using a polynomial rolling hash
○ Here s[i] is the ASCII code of letter i in the word; use p = 53 and m = 2^64
○ Compute Simhash for the document (as shown in slide 57)
○ Modify your program to take two URLs on the command line and print how many bits their simhashes have in common.
○ Due date: One Week
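A sketch of the whole project under the stated parameters (p = 53, m = 2^64) and the usual simhash construction: weight each word by its frequency, add the weight to a per-bit counter for every 1 bit of the word's hash and subtract it for every 0 bit, then keep the sign of each counter. The tag-stripping and tokenization below are deliberately simple.

import re
import sys
from collections import Counter
from urllib.request import urlopen

P, M = 53, 2 ** 64

def word_hash(word):
    # Polynomial rolling hash: sum of s[i] * p**i over the letters, mod 2**64.
    h = 0
    for i, ch in enumerate(word):
        h = (h + ord(ch) * pow(P, i, M)) % M
    return h

def simhash(words):
    counts = Counter(words)                   # word weights = term frequencies
    v = [0] * 64
    for word, weight in counts.items():
        h = word_hash(word)
        for bit in range(64):
            v[bit] += weight if (h >> bit) & 1 else -weight
    fingerprint = 0
    for bit in range(64):
        if v[bit] > 0:
            fingerprint |= 1 << bit
    return fingerprint

def page_words(url):
    html = urlopen(url).read().decode("utf-8", errors="replace")
    text = re.sub(r"<[^>]+>", " ", html)      # strip tags, keep the visible text
    return re.findall(r"[a-z0-9]+", text.lower())

if __name__ == "__main__":
    a = simhash(page_words(sys.argv[1]))
    b = simhash(page_words(sys.argv[2]))
    common = sum(1 for bit in range(64)
                 if ((a >> bit) & 1) == ((b >> bit) & 1))
    print(common, "of 64 bits agree")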