Download as docx, pdf, or txt
Download as docx, pdf, or txt
You are on page 1of 7

IR

ASSIGNMENT NO. 1

NAME: Tejas Kshirsagar


ROLL NO.: 233
DIV: B
BATCH: IR1
PRN: 0120180218

Implement your own web crawler to download web pages with seed URLs
stored in a queue.

CODE:
import logging
from urllib.parse import urljoin
import requests
from bs4 import BeautifulSoup

logging.basicConfig(
    format='%(asctime)s %(levelname)s:%(message)s',
    level=logging.INFO)

class Crawler:

    def __init__(self, urls=[]):
        self.visited_urls = []
self.urls_to_visit = urls
    def download_url(self, url):
        return requests.get(url).text

    def get_linked_urls(self, url, html):
        soup = BeautifulSoup(html, 'html.parser')
        for link in soup.find_all('a'):
            path = link.get('href')
            if path and path.startswith('/'):
                path = urljoin(url, path)
            yield path

    def add_url_to_visit(self, url):
        if url not in self.visited_urls and url not in self.urls_to_visit:
            self.urls_to_visit.append(url)

    def crawl(self, url):
        html = self.download_url(url)
        for url in self.get_linked_urls(url, html):
            self.add_url_to_visit(url)

    def run(self):
        while self.urls_to_visit:
            url = self.urls_to_visit.pop(0)
            logging.info(f'Crawling: {url}')
            try:
                self.crawl(url)
            except Exception:
                logging.exception(f'Failed to crawl: {url}')
            finally:
                self.visited_urls.append(url)

if __name__ == '__main__':
    Crawler(urls=['https://www.facebook.com/']).run()

OUTPUT:

2021-09-15 17:30:22,404 INFO:Crawling: https://www.facebook.com/

2021-09-15 17:30:22,829 INFO:Crawling: #

2021-09-15 17:30:22,829 ERROR:Failed to crawl: #

Traceback (most recent call last):

File "C:\Python\hello.py", line 41, in run

self.crawl(url)

File "C:\Python\hello.py", line 32, in crawl

html = self.download_url(url)

File "C:\Python\hello.py", line 17, in download_url

return requests.get(url).text

File
"C:\Users\user\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\LocalC
ache\local-packages\Python39\site-packages\requests\api.py", line 75, in get

return request('get', url, params=params, **kwargs)

File
"C:\Users\user\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\LocalC
ache\local-packages\Python39\site-packages\requests\api.py", line 61, in request

return session.request(method=method, url=url, **kwargs)

File
"C:\Users\user\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\LocalC
ache\local-packages\Python39\site-packages\requests\sessions.py", line 528, in request
prep = self.prepare_request(req)

File
"C:\Users\user\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\LocalC
ache\local-packages\Python39\site-packages\requests\sessions.py", line 456, in prepare_request

p.prepare(

File
"C:\Users\user\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\LocalC
ache\local-packages\Python39\site-packages\requests\models.py", line 316, in prepare

self.prepare_url(url, params)

File
"C:\Users\user\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\LocalC
ache\local-packages\Python39\site-packages\requests\models.py", line 390, in prepare_url

raise MissingSchema(error)

requests.exceptions.MissingSchema: Invalid URL '#': No schema supplied. Perhaps you meant http://#?

2021-09-15 17:30:22,842 INFO:Crawling: https://www.facebook.com/recover/initiate/?


ars=facebook_login&privacy_mutation_token=eyJ0eXBlIjowLCJjcmVhdGlvbl90aW1lIjoxNjMxMzYxNjI0LC
JjYWxsc2l0ZV9pZCI6MzgxMjI5MDc5NTc1OTQ2fQ%3D%3D

2021-09-15 17:30:23,548 INFO:Crawling: https://www.facebook.com/pages/create/?


ref_type=registration_form

2021-09-15 17:30:23,835 INFO:Crawling: https://hi-in.facebook.com/

2021-09-15 17:30:24,165 INFO:Crawling: https://ur-pk.facebook.com/

2021-09-15 17:30:24,770 INFO:Crawling: https://gu-in.facebook.com/

2021-09-15 17:30:25,193 INFO:Crawling: https://kn-in.facebook.com/

2021-09-15 17:30:25,606 INFO:Crawling: https://pa-in.facebook.com/

2021-09-15 17:30:26,015 INFO:Crawling: https://ta-in.facebook.com/

2021-09-15 17:30:26,424 INFO:Crawling: https://bn-in.facebook.com/

2021-09-15 17:30:26,759 INFO:Crawling: https://te-in.facebook.com/

2021-09-15 17:30:27,344 INFO:Crawling: https://ml-in.facebook.com/

2021-09-15 17:30:27,758 INFO:Crawling: https://en-gb.facebook.com/

2021-09-15 17:30:28,151 INFO:Crawling: https://www.facebook.com/r.php

2021-09-15 17:30:28,696 INFO:Crawling: https://www.facebook.com/login/

2021-09-15 17:30:29,078 INFO:Crawling: https://messenger.com/


2021-09-15 17:30:30,145 INFO:Crawling: https://www.facebook.com/lite/

2021-09-15 17:30:30,472 INFO:Crawling: https://www.facebook.com/watch/

2021-09-15 17:30:30,759 INFO:Crawling: https://www.facebook.com/places/

2021-09-15 17:30:31,056 INFO:Crawling: https://www.facebook.com/games/

2021-09-15 17:30:31,466 INFO:Crawling: https://www.facebook.com/marketplace/

2021-09-15 17:30:31,847 INFO:Crawling: https://pay.facebook.com/

2021-09-15 17:30:32,801 INFO:Crawling: https://www.facebook.com/jobs/

2021-09-15 17:30:33,185 INFO:Crawling: https://www.oculus.com/

2021-09-15 17:30:35,396 INFO:Crawling: https://portal.facebook.com/

2021-09-15 17:30:37,915 INFO:Crawling: https://l.facebook.com/l.php?u=https%3A%2F


%2Fwww.instagram.com
%2F&h=AT2I6zPR6qj6k7DcLNMF0N05vyVpaJgS3B6mrWYupTMgzslUVZrcqi_XTR6B9UMyNxHXZCZEzgpu
OXbg5U5dekPja6fJjeiP4DRXR8ouCrbVXqcXNSxevG7aQU0FmHA2rdtTy8Ginjcb6C8J

2021-09-15 17:30:38,152 INFO:Crawling: https://www.facebook.com/local/lists/245019872666104/

2021-09-15 17:30:39,261 INFO:Crawling: https://www.facebook.com/fundraisers/

2021-09-15 17:30:42,394 INFO:Crawling: https://www.facebook.com/biz/directory/

2021-09-15 17:30:43,196 INFO:Crawling: https://www.facebook.com/votinginformationcenter/?


entry_point=c2l0ZQ%3D%3D

2021-09-15 17:30:43,851 INFO:Crawling: https://about.facebook.com/

2021-09-15 17:30:44,568 INFO:Crawling: https://www.facebook.com/ad_campaign/landing.php?


placement=pflo&campaign_id=402047449186&nav_source=unknown&extra_1=auto

2021-09-15 17:30:45,676 INFO:Crawling: https://www.facebook.com/pages/create/?


ref_type=site_footer

2021-09-15 17:30:46,052 INFO:Crawling: https://developers.facebook.com/?ref=pf

2021-09-15 17:30:48,033 INFO:Crawling: https://www.facebook.com/careers/?ref=pf

2021-09-15 17:30:48,655 INFO:Crawling: https://www.facebook.com/privacy/explanation

2021-09-15 17:30:49,081 INFO:Crawling: https://www.facebook.com/policies/cookies/

2021-09-15 17:30:49,478 INFO:Crawling: https://www.facebook.com/help/568137493302217

2021-09-15 17:30:49,870 INFO:Crawling: https://www.facebook.com/policies?ref=pf

2021-09-15 17:30:50,476 INFO:Crawling: https://www.facebook.com/help/?ref=pf

2021-09-15 17:30:50,942 INFO:Crawling: https://www.facebook.com/settings


2021-09-15 17:30:51,730 INFO:Crawling: https://www.facebook.com/allactivity?
privacy_source=activity_log_top_menu

2021-09-15 17:30:52,119 INFO:Crawling: https://www.facebook.com/login.php

2021-09-15 17:30:52,642 INFO:Crawling: https://hi-in.facebook.com/login/identify/?


ctx=recover&ars=facebook_login&from_login_screen=0

2021-09-15 17:30:53,051 INFO:Crawling: https://ur-pk.facebook.com/login/identify/?


ctx=recover&ars=facebook_login&from_login_screen=0

2021-09-15 17:30:53,453 INFO:Crawling: https://gu-in.facebook.com/login/identify/?


ctx=recover&ars=facebook_login&from_login_screen=0

2021-09-15 17:30:53,862 INFO:Crawling: https://kn-in.facebook.com/login/identify/?


ctx=recover&ars=facebook_login&from_login_screen=0

2021-09-15 17:30:54,379 INFO:Crawling: https://pa-in.facebook.com/login/identify/?


ctx=recover&ars=facebook_login&from_login_screen=0

2021-09-15 17:30:54,695 INFO:Crawling: https://ta-in.facebook.com/login/identify/?


ctx=recover&ars=facebook_login&from_login_screen=0

2021-09-15 17:30:55,200 INFO:Crawling: https://bn-in.facebook.com/login/identify/?


ctx=recover&ars=facebook_login&from_login_screen=0

2021-09-15 17:30:55,617 INFO:Crawling: https://te-in.facebook.com/login/identify/?


ctx=recover&ars=facebook_login&from_login_screen=0

2021-09-15 17:30:56,009 INFO:Crawling: https://ml-in.facebook.com/login/identify/?


ctx=recover&ars=facebook_login&from_login_screen=0

2021-09-15 17:30:56,310 INFO:Crawling: https://en-gb.facebook.com/login/identify/?


ctx=recover&ars=facebook_login&from_login_screen=0

2021-09-15 17:30:56,633 INFO:Crawling: https://l.facebook.com/l.php?u=https%3A%2F


%2Fwww.instagram.com
%2F&h=AT3GvL51uzkc8hoaI4HDxUfJaxyiqBdGJ3_V4bxOg18Rup0FgDyFzg1IjNX2p5ePCJblm5aG_3F_AKZ
QPPbDFZgbdFN908UNATZGZLkMnfF7nscKtC3ipZVluRrbF3dU4VepjyG1scJelNZL

2021-09-15 17:30:56,897 INFO:Crawling: https://www.facebook.com/help/

2021-09-15 17:30:57,225 INFO:Crawling: https://hi-in.facebook.com/recover/initiate/?


ars=facebook_login&privacy_mutation_token=eyJ0eXBlIjowLCJjcmVhdGlvbl90aW1lIjoxNjMxMzYxNjI1LC
JjYWxsc2l0ZV9pZCI6MzgxMjI5MDc5NTc1OTQ2fQ%3D%3D

2021-09-15 17:30:57,660 INFO:Crawling: https://hi-in.facebook.com/pages/create/?


ref_type=registration_form

2021-09-15 17:30:57,847 INFO:Crawling: https://mr-in.facebook.com/

2021-09-15 17:30:58,186 INFO:Crawling: https://hi-in.facebook.com/r.php


2021-09-15 17:30:58,523 INFO:Crawling: https://hi-in.facebook.com/login/

2021-09-15 17:30:58,826 INFO:Crawling: https://hi-in.messenger.com/

2021-09-15 17:30:59,895 INFO:Crawling: https://hi-in.facebook.com/lite/

2021-09-15 17:31:00,226 INFO:Crawling: https://hi-in.facebook.com/watch/

2021-09-15 17:31:00,526 INFO:Crawling: https://hi-in.facebook.com/places/

2021-09-15 17:31:00,818 INFO:Crawling: https://hi-in.facebook.com/games/

2021-09-15 17:31:07,351 ERROR:Failed to crawl: https://hi-in.facebook.com/games/

You might also like