Web Scraping: a process of automatically collecting (stealing?) information from the Internet
THE TOOLS

You need these tools to steal (oops) that data:
- Python (2.6 or 2.7) with some packages*
- the Scrapy** framework
- Google Chrome with an XPath*** review plugin
- a computer, of course, and a functional brain

*) http://doc.scrapy.org/en/latest/intro/install.html#requirements
**) refer to http://scrapy.org/ (these slides won't cover the installation of those things)
***) I use the XPath Helper plugin
S C R A P Y
Not Crappy

Scrapy is an application framework for crawling websites and extracting structured data, which can be used for a wide range of useful applications, like data mining, information processing, or historical archival.
Scrapy works by creating logical spiders that crawl any website you like; you define each spider's logic in Python. To extract data, Scrapy uses a mechanism based on XPath expressions, called XPath selectors.
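For a first taste, here is a minimal sketch of an XPath selector in the Scrapy 0.x API used throughout these slides (the callback body and the "//h1/text()" expression are illustrative assumptions, not from the original deck; XPath itself is covered on the next slides):

from scrapy.selector import HtmlXPathSelector

def parse(self, response):                        ## a spider callback
    hxs = HtmlXPathSelector(response)             ## wrap the downloaded page
    names = hxs.select("//h1/text()").extract()   ## evaluate an XPath expression
    ## names is now a plain Python list of strings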
X P A T H

XPath is a W3C standard for navigating through an XML document (and so an HTML one as well). XML documents are treated as trees of nodes; the topmost element of the tree is called the root element.

For more, refer to: http://www.w3schools.com/xpath/
<?xml version="1.0" encoding="ISO-8859-1"?>
<bookstore>
  <book>
    <title lang="en">Harry Potter</title>
    <author>J K. Rowling</author>
    <year>2005</year>
    <price>29.99</price>
  </book>
</bookstore>
Selecting Nodes
XPath uses path expressions to select nodes in an XML document. A node is selected by following a path, or a series of steps:
Expression    Result
nodename      Selects all nodes with the name "nodename"
/             Selects from the root node
//            Selects matching nodes from the current node, no matter where they are
.             Selects the current node
..            Selects the parent of the current node
@attr         Selects the attribute attr of nodes
text()        Selects the text value of the chosen node
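To make the table concrete, here is a small sketch that runs a few of these expressions against the bookstore document above (lxml is just one XPath-capable library and is an assumption of this example, not part of the original slides):

from lxml import etree

doc = etree.fromstring("""<bookstore>
  <book>
    <title lang="en">Harry Potter</title>
    <author>J K. Rowling</author>
    <year>2005</year>
    <price>29.99</price>
  </book>
</bookstore>""")

print(doc.xpath("//title/text()"))                ## ['Harry Potter'] -- titles anywhere in the tree
print(doc.xpath("/bookstore/book/year/text()"))   ## ['2005']         -- a path from the root
print(doc.xpath("//title/@lang"))                 ## ['en']           -- attribute selection
print(doc.xpath("//title/../price/text()"))       ## ['29.99']        -- '..' steps to the parent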
Predicate Expressions
Predicates are used to find a specific node, or a node that contains a specific value. Predicates are always embedded in square brackets.
Expression                      Result
/bookstore/book[1]              Selects the first book element that is the child of the bookstore element
/bookstore/book[last()]         Selects the last book element that is the child of the bookstore element
/bookstore/book[last()-1]       Selects the last but one book element that is the child of the bookstore element
/bookstore/book[position()<3]   Selects the first two book elements that are children of the bookstore element
//title[@lang]                  Selects all the title elements that have an attribute named lang
//title[@lang='eng']            Selects all the title elements whose lang attribute has the value 'eng'
/bookstore/book[price>35.00]    Selects all the book elements of the bookstore element that have a price element with a value greater than 35.00
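The same kind of sketch works for predicates; a second book is added here (again an illustrative assumption) so that the position and value filters have something to choose between:

from lxml import etree

doc = etree.fromstring("""<bookstore>
  <book><title lang="eng">Harry Potter</title><price>29.99</price></book>
  <book><title lang="eng">Learning XML</title><price>39.95</price></book>
</bookstore>""")

print(doc.xpath("/bookstore/book[1]/title/text()"))            ## ['Harry Potter']
print(doc.xpath("/bookstore/book[last()]/title/text()"))       ## ['Learning XML']
print(doc.xpath("/bookstore/book[position()<3]/title/text()")) ## both titles
print(doc.xpath("//title[@lang='eng']/text()"))                ## both titles carry lang='eng'
print(doc.xpath("/bookstore/book[price>35.00]/title/text()"))  ## ['Learning XML']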
X P A T H  H E L P E R

By using XPath Helper, you can easily get the XPath expression of a given node in an HTML document. The plugin is toggled by pressing <Ctrl>+<Shift>+X in Chrome.
REAL ACTION

A generated Scrapy project contains, among others, these files:

items.py       /* Definition of Items to scrape */
pipelines.py   /* Pipeline config for advanced use */
settings.py    /* Advanced settings file */
spiders/       /* Directory to put spider files */
So, of all that data, we want to collect: the name of each place, photo, description, address (if any), contact number (if any), opening hours (if any), website (if any), and video (if any).
Items Definition
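A minimal sketch of the Item definition (inferred from the field names used by the spiders and from the CSV header shown later; the class name ComesgItem is taken from the second spider):

from scrapy.item import Item, Field

class ComesgItem(Item):
    ## one Field per attribute we want to collect
    name = Field()
    photo = Field()
    desc = Field()
    address = Field()
    contact = Field()
    hours = Field()
    website = Field()
    video = Field()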
Basically, here is our strategy:
1. Implement a first spider that collects the URLs of the listed items
2. Crawl to those URLs one by one
3. Implement a second spider that fetches all the required data
First spider
from scrapy.contrib.spiders import CrawlSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.http import Request
from scrapy import log


class AttractionSpider(CrawlSpider):
    name = "get-attraction"
    allowed_domains = ["comesingapore.com"]  ## will never go outside the playground
    start_urls = [  ## starting URL
        "http://comesingapore.com/travel-guide/category/285/attractions"
    ]
    rules = ()

    def __init__(self, name=None, **kwargs):
        super(AttractionSpider, self).__init__(name, **kwargs)
        self.items_buffer = {}
        self.base_url = "http://comesingapore.com"
        from scrapy.conf import settings
        settings.overrides['DOWNLOAD_TIMEOUT'] = 360  ## be patient with slow pages

    def parse(self, response):
        print "Start scraping Attractions...."
        try:
            hxs = HtmlXPathSelector(response)
            ## XPath expression to get the URLs of the item detail pages
            links = hxs.select("//*[@id='content']//a[@style='color:black']/@href")
            if not links:
                log.msg("No data to scrape")
                return
            for link in links:
                v_url = ''.join(link.extract())
                if not v_url:
                    continue
                else:
                    _url = self.base_url + v_url
                    ## the real work is handled by the second spider
                    yield Request(url=_url, callback=self.parse_details)
        except Exception as e:
            log.msg("Parsing failed for URL {%s}" % format(response.request.url))
Second spider
    def parse_details(self, response):
        print "Start scraping Detailed Info...."
        try:
            hxs = HtmlXPathSelector(response)
            l_venue = ComesgItem()

            ## the venue name lives in a slightly different place on some pages
            v_name = hxs.select("/html/body/div[@id='wrapper']/div[@id='page']/div[@id='page-bgtop']/div[@id='page-bgbtm']/div[@id='content']/div[3]/h1/text()").extract()
            if not v_name:
                v_name = hxs.select("/html/body/div[@id='wrapper']/div[@id='page']/div[@id='page-bgtop']/div[@id='page-bgbtm']/div[@id='content']/div[2]/h1/text()").extract()
            l_venue["name"] = v_name[0].strip()

            ## the details block shifts position depending on the page layout
            base = hxs.select("//*[@id='content']/div[7]")
            if base.extract()[0].strip() == "<div style=\"clear:both\"></div>":
                base = hxs.select("//*[@id='content']/div[8]")
            elif base.extract()[0].strip() == "<div style=\"padding-top:10px;margin-top:10px;border-top:1px dotted #DDD;\">\n You must be logged in to add a tip\n </div>":
                base = hxs.select("//*[@id='content']/div[6]")

            ## labels (<b> tags) and their values arrive as parallel lists
            x_datas = base.select("div[1]/b").extract()
            v_datas = base.select("div[1]/text()").extract()
            i_d = 0
            if x_datas:
                for x_data in x_datas:
                    print "data is: " + x_data.strip()
                    if x_data.strip() == "<b>Address:</b>":
                        l_venue["address"] = v_datas[i_d].strip()
                    if x_data.strip() == "<b>Contact:</b>":
                        l_venue["contact"] = v_datas[i_d].strip()
                    if x_data.strip() == "<b>Operating Hours:</b>":
                        l_venue["hours"] = v_datas[i_d].strip()
                    if x_data.strip() == "<b>Website:</b>":
                        l_venue["website"] = (base.select("div[1]/a/@href").extract())[0].strip()
                    i_d += 1

            v_photo = base.select("img/@src").extract()
            if v_photo:
                l_venue["photo"] = v_photo[0].strip()

            v_desc = base.select("div[3]/text()").extract()
            if v_desc:
                desc = ""
                for dsc in v_desc:
                    desc += dsc
                l_venue["desc"] = desc.strip()
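The slide ends mid-function; for the scraped venue to reach the exporter, the method still has to emit the item and close the try block. A plausible ending (an assumption, not shown on the slide) would be:

            yield l_venue   ## hand the populated item to Scrapy's pipeline / feed exporter
        except Exception as e:
            log.msg("Parsing failed for URL {%s}" % format(response.request.url))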
In the end, running the spider (e.g. with scrapy crawl get-attraction -o attr.csv -t csv) produces a file attr.csv with the scraped data, like the following:
> head -3 attr.csv website,name,photo,hours,contact,video,address,desc http://www.tigerlive.com.sg,TigerLIVE,http://tn.comesingapore.com/img/others/240x240/f/6/0000246.jpg,Daily from 11am to 8pm (Last admission at 6.30pm).,(+65) 6270 7676,,"St. James Power Station, 3 Sentosa Gateway, Singapore 098544", http://www.zoo.com.sg,Singapore Zoo,http://tn.comesingapore.com/img/others/240x240/6/2/0000098.jpg,Daily from 8.30am - 6pm (Last ticket sale at 5.30pm),(+65) 6269 3411,http://www.youtube.com/embed/p4jgx4yNY9I,"80 Mandai Lake Road, Singapore 729826","See exotic and endangered animals up close in their natural habitats in the . Voted the best attraction in Singapore on Trip Advisor, and considered one of the best zoos in the world, this attraction is a must see, housing over 2500 mammals, birds and reptiles.
THANK YOU!
- Anton -