import sgmllib


class MyParser(sgmllib.SGMLParser):
    "a simple parser class"

    def parse(self, s):
        self.feed(s)
        self.close()

    def __init__(self, verbose=0):
        "Initialize an object, passing verbose to the superclass"
        sgmllib.SGMLParser.__init__(self, verbose)
        self.inside_a_element = 0
        self.descriptions = []
        self.hyperlinks = []

    def start_a(self, attributes):
        # Entering an <a> tag: remember absolute hyperlinks from its href attribute
        self.inside_a_element = 1
        for name, value in attributes:
            if name == "href" and "http://" in value:
                self.hyperlinks.append(value)

    def end_a(self):
        self.inside_a_element = 0

    def handle_data(self, data):
        # Text inside an <a> element is treated as that link's description
        if self.inside_a_element == 1:
            self.descriptions.append(data)

    def get_hyperlinks(self):
        "returns hyperlinks found"
        return self.hyperlinks

    def get_descriptions(self):
        return self.descriptions
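# Usage sketch (commented out so the script below runs unchanged): feed MyParser a
# small HTML snippet and read back what it collected. The snippet is made up for
# illustration only.
#
#   sample_html = '<a href="http://example.com">Example</a>'
#   p = MyParser()
#   p.parse(sample_html)
#   print p.get_hyperlinks()    # ['http://example.com']
#   print p.get_descriptions()  # ['Example']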

import urllib
import threading
from Queue import PriorityQueue, Queue


class URLGetter(threading.Thread):
    "Fetches a single URL in its own thread and collects the links it contains"

    def __init__(self, ud):
        # ud is a (depth, url) pair taken from the priority queue
        self.url = ud[1]
        self.result = {}
        self.parser = MyParser()
        self.depth = ud[0]
        threading.Thread.__init__(self)

    def get_result(self):
        return self.result

    def run(self):
        try:
            print "parsing\n"
            f = urllib.urlopen(self.url)
            contents = f.read()
            f.close()
            try:
                self.parser.parse(contents)
            except:
                print "Exception while parsing....."
            # Every link found on this page is recorded at the current depth
            for link in self.parser.get_hyperlinks():
                self.result[link] = self.depth
        except:
            print "Could not open document: %s" % self.url


class crawler:
    def __init__(self, target, maxdepth=3, MAXTHREADS=4):
        self.maxdepth = maxdepth  # maximum depth of crawling
        self.to_visit = PriorityQueue()  # (depth, url) pairs, shallowest first
        self.to_visit.put((1, target))
        self.results = {}  # url -> depth at which it was found
        # Bounded queue of running workers; also caps concurrency
        self.q = Queue(MAXTHREADS - 1)

    def producer(self):
        # Start a worker thread for each queued URL
        while (not self.to_visit.empty()) or (not self.q.empty()):
            print "remain to visit: " + str(self.to_visit.qsize()), "\nthreads running: " + str(self.q.qsize()) + "\n"
            thread = URLGetter(self.to_visit.get())
            thread.start()
            self.q.put(thread, True)

    def consumer(self):
        # Collect finished workers and queue newly discovered URLs
        while (not self.to_visit.empty()) or (not self.q.empty()):
            thread = self.q.get(True)
            thread.join()
            res = thread.get_result()
            for url in res.keys():
                try:
                    # URL already recorded at some depth: skip re-queueing it
                    if self.results[url] is not None:
                        print "skipping, depth =", self.results[url]
                except KeyError:
                    # First time this URL is seen
                    print res[url], " ", self.maxdepth, "\n"
                    if res[url] <= self.maxdepth:
                        self.results[url] = res[url]
                        self.to_visit.put((res[url] + 1, url))

    def crawl(self):
        prod_thread = threading.Thread(target=self.producer)
        cons_thread = threading.Thread(target=self.consumer)
        prod_thread.start()
        cons_thread.start()
        prod_thread.join()
        cons_thread.join()

    def get_results(self):
        return self.results.keys()


# single threaded version
class webcrawler:
    def __init__(self):
        pass

    def _webcrawl(self, seed, depth, search_text, l, next_urls):
        import urllib
        print "Depth:", depth
        print "Seed:", seed
        if depth > 0:
            r = urllib.urlopen(seed)
            s = r.read()
            r.close()
            print s
            if search_text in s:
                l.append(seed)
            p = MyParser()
            p.parse(s)
            urls = p.get_hyperlinks()
            for url in urls:
                if url not in next_urls:
                    next_urls.append(url)
                    self._webcrawl(url, depth - 1, search_text, l, next_urls)

    def webcrawl(self, seed, depth, search_text):
        a = []
        self._webcrawl(seed, depth, search_text, a, [])
        print a


def main():
    a = crawler("http://www.google.com", 1)
    a.crawl()
    print a.get_results()
    f = open("/home/moshe/Desktop/50 weeks/week1.py", "w")
    # results maps url -> depth
    for l, d in a.results.items():
        f.write(str(d) + l + "\n")
    f.close()


main()
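# Usage sketch for the single-threaded webcrawler class above (commented out;
# the seed URL, depth and search text are illustrative values only):
#
#   wc = webcrawler()
#   wc.webcrawl("http://www.google.com", 2, "search")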
