StormCrawler

Developer(s): DigitalPebble, Ltd.
Initial release: September 11, 2014
Stable release: 2.8 / March 29, 2023
Repository: github.com/DigitalPebble/storm-crawler (https://github.com/DigitalPebble/storm-crawler)
Written in: Java
Type: Web crawler
License: Apache License
Website: stormcrawler.net (http://stormcrawler.net)

StormCrawler is an open-source collection of resources for building low-latency, scalable web crawlers on Apache Storm. It is provided under the Apache License and is written mostly in Java.

StormCrawler is modular and consists of a core module, which provides the basic building blocks of a web crawler such as fetching, parsing and URL filtering. Apart from the core components, the project also provides external resources, such as spouts and bolts for Elasticsearch and Apache Solr, or a ParserBolt which uses Apache Tika to parse various document formats.

The project is used by various organisations,[1] notably Common Crawl,[2] which uses it to generate a large, publicly available dataset of news.

Linux.com published a Q&A with the author of StormCrawler in October 2016.[3] InfoQ ran one in December 2016.[4] A comparative benchmark with Apache Nutch was published on dzone.com in January 2017.[5]

Several research papers have mentioned the use of StormCrawler, in particular:

Crawling the German Health Web: Exploratory Study and Graph Analysis.[6]
The generation of a multi-million page corpus for the Persian language.[7]
The SIREN - Security Information Retrieval and Extraction engine.[8]

The project Wiki contains a list of videos and slides available online.[9]
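
The URL filtering step mentioned above as one of the basic building blocks of a crawler can be illustrated with a small, self-contained sketch. This is not StormCrawler's API: the class name UrlFilterChainSketch, the UrlFilter interface and both filters are hypothetical names chosen for illustration, and the sketch uses only the Java standard library. It assumes that each filter either returns a (possibly rewritten) URL or drops it, and that filters are applied in order to every outlink found on a page.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.function.Function;

// Hypothetical sketch of a URL filter chain; not StormCrawler's own classes.
public class UrlFilterChainSketch {

    /** A filter returns the (possibly rewritten) URL, or empty to drop it. */
    interface UrlFilter extends Function<String, Optional<String>> {}

    /** Illustrative filter: keep only http and https URLs. */
    static final UrlFilter HTTP_ONLY = url -> {
        try {
            String scheme = URI.create(url).getScheme();
            if ("http".equalsIgnoreCase(scheme) || "https".equalsIgnoreCase(scheme)) {
                return Optional.of(url);
            }
            return Optional.empty();
        } catch (IllegalArgumentException e) {
            return Optional.empty(); // malformed URL: drop it
        }
    };

    /** Illustrative filter: strip the fragment so page anchors do not create duplicates. */
    static final UrlFilter STRIP_FRAGMENT = url -> {
        int hash = url.indexOf('#');
        return Optional.of(hash >= 0 ? url.substring(0, hash) : url);
    };

    /** Applies the filters in order; stops as soon as one of them drops the URL. */
    static Optional<String> applyAll(List<UrlFilter> filters, String url) {
        Optional<String> current = Optional.of(url);
        for (UrlFilter filter : filters) {
            current = current.flatMap(filter);
        }
        return current;
    }

    /** Runs every outlink through the chain and collects the survivors. */
    static List<String> filterOutlinks(List<UrlFilter> filters, List<String> outlinks) {
        List<String> kept = new ArrayList<>();
        for (String link : outlinks) {
            applyAll(filters, link).ifPresent(kept::add);
        }
        return kept;
    }

    public static void main(String[] args) {
        List<UrlFilter> chain = List.of(HTTP_ONLY, STRIP_FRAGMENT);
        List<String> outlinks = List.of(
                "https://example.com/news#top",
                "mailto:someone@example.com",
                "http://example.com/about");
        // prints [https://example.com/news, http://example.com/about]
        System.out.println(filterOutlinks(chain, outlinks));
    }
}
```

Keeping each filter as a small independent function mirrors the modular design described in the article: filters can be added, removed or reordered without touching the code that fetches or parses pages.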

See also
Apache Storm
Apache Nutch
Apache Solr
Elasticsearch

References
1. "Powered By · DigitalPebble/storm-crawler Wiki · GitHub" (https://github.com/DigitalPebble/storm-crawler/wiki/Powered-By). Github.com. 2017-03-02. Retrieved 2017-04-19.
2. "News Dataset Available – Common Crawl" (http://commoncrawl.org/2016/10/news-dataset-available/).
3. "StormCrawler: An Open Source SDK for Building Web Crawlers with ApacheStorm | Linux.com | The source for Linux information" (https://www.linux.com/news/stormcrawler-open-source-sdk-building-web-crawlers-apachestorm). Linux.com. 2016-10-12. Retrieved 2017-04-19.
4. "Julien Nioche on StormCrawler, Open-Source Crawler Pipelines Backed by Apache Storm" (http://www.infoq.com/news/2016/12/nioche-stormcrawler-web-crawler). Infoq.com. 2016-12-15. Retrieved 2017-04-19.
5. "The Battle of the Crawlers: Apache Nutch vs. StormCrawler - DZone Big Data" (https://dzone.com/articles/the-battle-of-the-crawlers-apache-nutch-vs-stormcr). Dzone.com. Retrieved 2017-04-19.
6. Zowalla, Richard; Wetter, Thomas; Pfeifer, Daniel (2020). "Crawling the German Health Web: Exploratory Study and Graph Analysis" (https://www.jmir.org/2020/7/e17853/). Journal of Medical Internet Research. 22 (7): e17853. doi:10.2196/17853 (https://doi.org/10.2196%2F17853). PMC 7414401 (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7414401). PMID 32706701 (https://pubmed.ncbi.nlm.nih.gov/32706701).
7. "MirasText: An Automatically Generated Text Corpus for Persian" (https://www.researchgate.net/publication/325324201).
8. Sanagavarapu, Lalit Mohan; Mathur, Neeraj; Agrawal, Shriyansh; Reddy, Y. Raghu (2018). Advances in Information Retrieval. Lecture Notes in Computer Science. Vol. 10772. pp. 811–814. doi:10.1007/978-3-319-76941-7_81 (https://doi.org/10.1007%2F978-3-319-76941-7_81). ISBN 978-3-319-76940-0.
9. "Presentations · DigitalPebble/storm-crawler Wiki · GitHub" (https://github.com/DigitalPebble/storm-crawler/wiki/Presentations). Github.com. 2017-04-04. Retrieved 2017-04-19.

Retrieved from "https://en.wikipedia.org/w/index.php?title=StormCrawler&oldid=1166107752"
