Nutch in a Nutshell
Presented by
Liew Guo Min
Zhao Jin
Outline
Recap
Special features
Running Nutch in a distributed environment (with demo)
Q&A
Discussion
Recap
Complete web search engine
Nutch = Crawler + Indexer/Searcher (Lucene) + GUI
+ Plugins
+ MapReduce & Distributed FS (Hadoop)
Features:
Customizable
Extensible
Distributed
Nutch as a crawler
[Diagram: crawl data flow. Nodes: Initial URLs, Injector, CrawlDB, Segment, Parser, Web (webpages/files). Edge labels: generate, get, read/write, update.]
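The node and edge labels of the crawler diagram suggest a loop: inject seed URLs into the CrawlDB, generate a segment of URLs to fetch, get and parse the pages, then update the CrawlDB with what was found. The sketch below is a toy, in-memory model of that loop, not Nutch's real classes. The class name CrawlCycleSketch, the `web` map standing in for the Web, and the "fetched yet?" flag are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy, in-memory model of the crawl loop; not Nutch's real classes.
class CrawlCycleSketch {
    // CrawlDB stand-in: URL -> has it been fetched yet?
    final Map<String, Boolean> crawlDb = new LinkedHashMap<>();
    // Hypothetical stand-in for the Web: URL -> outlinks found on that page.
    final Map<String, List<String>> web;

    CrawlCycleSketch(Map<String, List<String>> web) { this.web = web; }

    // Injector: seed the CrawlDB with the initial URLs.
    void inject(List<String> seeds) {
        for (String url : seeds) crawlDb.putIfAbsent(url, false);
    }

    // generate: the URLs not fetched yet form one segment.
    List<String> generate() {
        List<String> segment = new ArrayList<>();
        for (Map.Entry<String, Boolean> e : crawlDb.entrySet())
            if (!e.getValue()) segment.add(e.getKey());
        return segment;
    }

    // get + parse: "fetch" each URL in the segment and extract its outlinks.
    Map<String, List<String>> fetchAndParse(List<String> segment) {
        Map<String, List<String>> parsed = new LinkedHashMap<>();
        for (String url : segment)
            parsed.put(url, web.getOrDefault(url, List.of()));
        return parsed;
    }

    // update: mark fetched URLs and add newly discovered ones.
    void update(Map<String, List<String>> parsed) {
        for (Map.Entry<String, List<String>> e : parsed.entrySet()) {
            crawlDb.put(e.getKey(), true);
            for (String out : e.getValue()) crawlDb.putIfAbsent(out, false);
        }
    }

    // Run rounds until nothing is left to fetch or the depth limit is hit.
    int crawl(List<String> seeds, int depth) {
        inject(seeds);
        int rounds = 0;
        while (rounds < depth) {
            List<String> segment = generate();
            if (segment.isEmpty()) break;
            update(fetchAndParse(segment));
            rounds++;
        }
        return rounds;
    }

    public static void main(String[] args) {
        Map<String, List<String>> web = Map.of(
            "http://a.example/", List.of("http://b.example/", "http://c.example/"),
            "http://b.example/", List.of("http://a.example/"));
        CrawlCycleSketch c = new CrawlCycleSketch(web);
        int rounds = c.crawl(List.of("http://a.example/"), 3);
        System.out.println(rounds + " rounds, CrawlDB = " + c.crawlDb);
    }
}
```

Each pass through the loop corresponds to one segment; the depth limit plays the role of a maximum number of crawl rounds.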
Special Features
Extensible (Plugin system)
Most of Nutch's essential functionality is implemented as plugins
Three layers
Extension points
What can be extended: Protocol, Parser, ScoringFilter, etc.
Extensions
The interfaces to be implemented for the extension points
Plugins
The actual implementation
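Read loosely, the layering can be mirrored in plain Java: an interface stands for the contract attached to an extension point, a class implementing it stands for the plugin code, and a registry keyed by extension-point name ties the two together. This is only a sketch; TextParser, TagStrippingParser and ExtensionRegistrySketch are hypothetical names, and Nutch's real plugin loader is considerably more involved.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical contract for a "Parser"-style extension point.
interface TextParser {
    String parse(String rawContent);
}

// Hypothetical plugin code: one concrete implementation of the contract.
class TagStrippingParser implements TextParser {
    @Override
    public String parse(String rawContent) {
        // Crude tag removal, only to give the sketch something to do.
        return rawContent.replaceAll("<[^>]*>", "").trim();
    }
}

// Toy registry: extension-point name -> implementations registered for it.
class ExtensionRegistrySketch {
    private final Map<String, List<Object>> byPoint = new HashMap<>();

    void register(String point, Object impl) {
        byPoint.computeIfAbsent(point, k -> new ArrayList<>()).add(impl);
    }

    // Return the implementations of the given type registered for a point.
    <T> List<T> lookup(String point, Class<T> type) {
        List<T> found = new ArrayList<>();
        for (Object o : byPoint.getOrDefault(point, List.of()))
            if (type.isInstance(o)) found.add(type.cast(o));
        return found;
    }

    public static void main(String[] args) {
        ExtensionRegistrySketch registry = new ExtensionRegistrySketch();
        registry.register("Parser", new TagStrippingParser());
        for (TextParser p : registry.lookup("Parser", TextParser.class))
            System.out.println(p.parse("<p>Hello, Nutch</p>"));
    }
}
```

The caller only ever sees the interface, which is what lets a different implementation be swapped in without touching the calling code.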
Special Features
Extensible (Plugin system)
Anyone can write a plugin
Write the code
Prepare metadata files
plugin.xml: declares which extension points are extended, and by which classes
build.xml: tells Ant how to build your source code
wiki.apache.org/nutch/PluginCentral
Special Features
Extensible (Plugin system)
To use a plugin
Make sure you have modified nutch-site.xml to include the plugin
Then, either
Nutch calls it automatically when needed, or
You write code that loads it by its class name and then uses it
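Both steps can be sketched in plain Java. The first method filters plugin ids against an include regex, in the spirit of the regex-valued plugin.includes property set in nutch-site.xml; the exact matching rules inside Nutch may differ. The second loads an implementation by its class name through standard Java reflection. UrlFilterSketch, LowerCaseUrlFilter and PluginUseSketch are hypothetical names, and the plugin ids in main are only examples.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Pattern;

// Hypothetical plugin contract and implementation, for illustration only.
interface UrlFilterSketch {
    String filter(String url);
}

class LowerCaseUrlFilter implements UrlFilterSketch {
    @Override
    public String filter(String url) { return url.toLowerCase(Locale.ROOT); }
}

class PluginUseSketch {
    // Keep only the plugin ids that fully match the include regex
    // (an assumption about the matching rule, not Nutch's own code).
    static List<String> enabled(String includeRegex, List<String> pluginIds) {
        Pattern p = Pattern.compile(includeRegex);
        List<String> kept = new ArrayList<>();
        for (String id : pluginIds)
            if (p.matcher(id).matches()) kept.add(id);
        return kept;
    }

    // Instantiate an implementation from its class name via reflection.
    static UrlFilterSketch byClassName(String className)
            throws ReflectiveOperationException {
        Class<?> cls = Class.forName(className);
        return (UrlFilterSketch) cls.getDeclaredConstructor().newInstance();
    }

    public static void main(String[] args) throws Exception {
        List<String> ids = List.of("protocol-http", "parse-html", "parse-pdf");
        System.out.println(enabled("protocol-http|parse-(text|html)", ids));
        UrlFilterSketch f = byClassName(LowerCaseUrlFilter.class.getName());
        System.out.println(f.filter("HTTP://Example.COM/"));
    }
}
```

With the pattern shown, parse-pdf is filtered out because it matches neither alternative of the regex.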
Special Features
Distributed (Hadoop)
Map-Reduce (Diagram)
A framework for distributed programming
Map -- Process the splits of data to get
[Diagram: Splits 1-4 are processed by Workers that emit the pairs k1:v1, k3:v2, k1:v3, k2:v4, k2:v5 and k4:v6; further Workers group them by key into Outputs 1-3 (k1:v1,v2; k3:v2; k2:v4,v5; k4:v6)]
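The flow in the diagram (splits go to workers, workers emit key:value pairs, pairs with the same key are brought together and combined) can be imitated in a single process. The sketch below is a word count written with plain Java collections; it does not use the Hadoop API, and the class name MapReduceSketch and the sample splits are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Single-process imitation of map -> group by key -> reduce; no Hadoop involved.
class MapReduceSketch {
    // Map: turn one split into (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String split) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : split.trim().split("\\s+"))
            if (!word.isEmpty()) pairs.add(Map.entry(word, 1));
        return pairs;
    }

    // Shuffle: collect all values emitted for the same key.
    static Map<String, List<Integer>> group(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs)
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        return grouped;
    }

    // Reduce: fold each key's values into one result (here, a sum).
    static Map<String, Integer> reduce(Map<String, List<Integer>> grouped) {
        Map<String, Integer> out = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            int sum = 0;
            for (int v : e.getValue()) sum += v;
            out.put(e.getKey(), sum);
        }
        return out;
    }

    // Each split is mapped independently, as if by its own worker.
    static Map<String, Integer> run(List<String> splits) {
        List<Map.Entry<String, Integer>> all = new ArrayList<>();
        for (String split : splits) all.addAll(map(split));
        return reduce(group(all));
    }

    public static void main(String[] args) {
        System.out.println(run(List.of("nutch lucene", "hadoop nutch", "nutch")));
    }
}
```

Because each split is mapped without reference to the others, and each key is reduced without reference to other keys, both phases can be spread across machines; that independence is the property a distributed framework exploits.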