A Multi-Levels Geo-Location Based Crawling Method For Social Media Platforms
Abstract—The large size and the dynamic nature of the Web highlight the need for continuous support and updating of Web-based information retrieval systems. Crawlers facilitate this process by following the hyperlinks in Web pages to automatically download a partial snapshot of the Web. While some systems rely on crawlers that exhaustively crawl the Web, others incorporate focus within their crawlers to harvest application- or topic-specific collections. This project studied web crawling and scraping at many different levels. It aggregates information from multiple sources into one central location and specifies a program for downloading web pages: given an initial set of seed URLs, it recursively downloads every page that is linked from pages in the set and whose content satisfies a specific criterion. Social media, web applications, and mobile applications have been employed together in the proposed system to manage search in the rapidly growing World Wide Web. Applying the proposed system results in a fast and convenient search engine that fulfills user requests based on specific geo-locations.

Index Terms—Social Media, Data set sampling, Crawling, Scraping, Search engines, Geo-Locations

I. INTRODUCTION

A web crawler is a piece of code that travels over the Internet and collects data from various web pages, a process also known as web scraping. An application program interface (API) is a set of routines, protocols, and tools for building software applications, and APIs have been used widely with web crawlers. Basically, an API specifies how software components should interact; it allows programmers to use predefined functions to interact with systems instead of writing them from scratch [1].
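As a brief illustration of this point, the short sketch below fetches a page and lists its hyperlinks using only predefined library functions (Python's urllib.request and html.parser are assumed here purely for illustration; the paper itself does not prescribe a language), rather than hand-writing the HTTP and HTML handling from scratch.

from html.parser import HTMLParser
from urllib.request import urlopen

class LinkCollector(HTMLParser):
    """Collects the href attribute of every <a> tag encountered."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def fetch_links(url):
    # The library handles the HTTP request and response for us.
    html = urlopen(url).read().decode("utf-8", errors="replace")
    parser = LinkCollector()
    parser.feed(html)
    return parser.links

if __name__ == "__main__":
    for link in fetch_links("https://example.com"):
        print(link)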
Social media applications have been employed significantly in this field; retrieving information from social networks is the first and primordial step in many data-analysis fields such as Natural Language Processing, Sentiment Analysis, and Machine Learning. The use of the Facebook API, LinkedIn API, Twitter API, and other public platforms for collecting public streams of information makes this possible [2].

Many researchers have surveyed the motivations for crawling or scraping, including content indexing for search engines, automated security testing and vulnerability assessment, and automated testing and model checking [3]–[6]. Therefore, in this research we aim to seek out pages that are relevant to a pre-defined set of topics while avoiding irrelevant regions of the web, which leads to significant savings in hardware and network resources without losing data sources.
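The following minimal sketch mirrors this selective crawl: starting from a set of seed URLs, it keeps only pages whose content satisfies a simple keyword criterion and follows their links. The keyword list, the regular-expression link extraction, and the page budget are illustrative assumptions, not part of the proposed system.

from urllib.request import urlopen
from urllib.parse import urljoin
import re

KEYWORDS = {"event", "concert", "festival"}   # assumed topic keywords
HREF_RE = re.compile(r'href=["\'](http[^"\']+)["\']')

def is_relevant(text):
    """A page is kept only if its content satisfies the keyword criterion."""
    lowered = text.lower()
    return any(word in lowered for word in KEYWORDS)

def crawl(seeds, max_pages=50):
    frontier, visited, kept = list(seeds), set(), []
    while frontier and len(visited) < max_pages:
        url = frontier.pop(0)
        if url in visited:
            continue
        visited.add(url)
        try:
            page = urlopen(url, timeout=10).read().decode("utf-8", errors="replace")
        except OSError:
            continue                      # skip unreachable pages
        if is_relevant(page):
            kept.append(url)              # keep the relevant page
            for link in HREF_RE.findall(page):
                frontier.append(urljoin(url, link))
    return kept

if __name__ == "__main__":
    print(crawl(["https://example.com"]))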
The idea behind this work is to have the website's server talk directly to another server with a request to create an event with the given details. The server then receives the response, processes it, and sends the relevant information back to the browser, such as a confirmation message to the user. This project studied web crawling and scraping at many different levels. It aggregates information from multiple sources into one central location and specifies a program for downloading web pages: given an initial set of seed URLs, it recursively downloads every page that is linked from pages in the set and whose content satisfies a specific criterion. Social media, web applications, and mobile applications have been employed together in the proposed system to manage the searching process in the web.

The remainder of this paper is structured as follows. Section II reviews the related literature in the field and what previous researchers have achieved in the same area. The methodology followed in this research is introduced in Section III, including the data set preparation and how the focused crawler is implemented. Section IV presents experimental results and a discussion of the proposed work. Finally, the work is concluded in Section V.

II. LITERATURE REVIEW

Many studies have investigated the quality of data in social media and social networks, but there is still a huge gap between what has been achieved and what is expected. Several studies of social media and social networks rely mainly on data acquired from Twitter, and such data carry the risk of being a disturbed and misleading sample of the complete data. Many research efforts in the last decade have focused on acquiring data sets from social media [7]–[13]. The proposed project idea is unique in that it focuses on implementing and providing a new research platform related to big data, serving the community with different services and events associated with companies; data collected from social media is therefore a significant concern in this research.

Hai Dong et al. reviewed in [14] recent studies on one category of semantic focused crawlers: a series of crawlers that utilize ontologies to link the documents acquired by the fetching process with ontological concepts. They organized documents on the web and filtered out webpages irrelevant to the search topics. The research team compared the crawlers from several perspectives, including domain, working environment, evaluation metrics, special functions, technologies utilized, and evaluation results.

Gautam Pant et al. discussed in [15] the issues related to developing crawling infrastructures. They reviewed several crawling algorithms that might be used to evaluate crawl quality. Crawling social media has been considered by many researchers as well. In [16]–[19], research studies investigated the quality of social media data, focusing on how online recommendation systems and social media data can be evaluated.

According to Gjoka et al. in [20], social network sampling studies, which are quite common, can be considered part of social media crawling. Gjoka used in [21] the original graph sampling study by Leskovec and Faloutsos [22] as a baseline. A motivating work on sampling social networks efficiently with a restricted budget was presented by Wang et al. in [23].

It is known that Facebook users' content is hard to access because of the default privacy policy of Facebook [16], [24]–[28]; therefore, the amount of private Facebook data that can be collected is limited. Furthermore, since Facebook does not offer the option of selling data, crawling methods that collect social interactions from publicly available Facebook content are needed, which is the main challenge of the proposed work.

Buccafurri et al. discussed in [29] different methods to traverse social networks from a crawling viewpoint. They focused on groups instead of personal user profiles.

Erlandsson et al. presented in [30] a novel User-guided Social Media Crawling method (USMC). USMC was built to gather data from social networks; it employs the knowledge of users engaging with user-generated content in order to cover as many user interactions as possible. The research team validated USMC by crawling a large number of Facebook pages, with content from millions of users and billions of interactions, and compared USMC with other crawling methods. The achieved results showed that most of a Facebook page's interactions can be covered by sampling only a few posts.

Ahlers and Boll presented in [31]–[34] a search engine based on geo-location. Their engine automatically derives spatial context from unorganized resources on the web and allows for location-based search. A focused crawler presented in that research applies heuristics to the crawl and analyzes web pages that probably relate to a region or place; the actual location is identified using a location extractor. The presented work showed good results for location-based web search applications that deliver fast, on-the-spot results.

III. METHODOLOGY

A. Data Collection and Preparation

Modern APIs adhere to standards (typically HTTP and REST) that are developer-friendly, easily accessible, and broadly understood. An API has its own software development lifecycle (SDLC) of designing, testing, building, and managing [35]. Modern APIs are well documented for consumption and versioning for specific audiences; they are much more standardized, follow stronger discipline for security and governance, and are monitored and managed for performance and scale.

Most of the collected data come from several sources, including direct user input (surveys and search forms), third-party APIs (social media), server logs (logs from web servers and tools such as Apache, Heritrix, and Octoparse), and web crawling or scraping. Several components are needed to prepare the data set and implement the proposed system, including Apache Nutch, Apache Tomcat, CYGWIN, Apache Hadoop, HERITRIX, the Cloudera virtual machine, Oracle VirtualBox, Octoparse, Facebook Graph Search, LinkedIn Lead Extractor, and Netvizz. Data for the Artificial Intelligence agent has been collected through the Facebook Graph API, which provides many spoken or published texts in English. Given that Facebook is the most widely used social media network, it is the best place to get random and accurate data [36]. We have also collected data from LinkedIn and Twitter through the LinkedIn Lead Extractor tool and REST APIs.
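A minimal sketch of this kind of REST-based collection is shown below. It pages through a JSON endpoint and stores the returned records; the endpoint URL, the access-token parameter, and the field names are placeholders standing in for the actual Graph API or Twitter REST calls, whose exact contracts are not reproduced here.

import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Placeholder endpoint and token: the real Graph API / Twitter REST
# endpoints, parameters, and response fields differ and require credentials.
BASE_URL = "https://api.example.com/v1/posts"
ACCESS_TOKEN = "YOUR_TOKEN_HERE"

def collect_posts(query, max_pages=5):
    """Collect post records from a paginated REST endpoint."""
    records = []
    url = BASE_URL + "?" + urlencode({"q": query, "access_token": ACCESS_TOKEN})
    for _ in range(max_pages):
        with urlopen(url, timeout=10) as response:
            payload = json.load(response)
        records.extend(payload.get("data", []))
        # Follow the "next page" link if the API provides one.
        url = payload.get("paging", {}).get("next")
        if not url:
            break
    return records

if __name__ == "__main__":
    posts = collect_posts("events in Amman")
    print(len(posts), "records collected")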
B. Focused Crawler

The role of the focused crawler in the proposed system is to selectively seek out pages that are relevant to a pre-defined set of topics. The topics are specified using keywords, rather than by collecting and indexing all accessible hypertext links. The focused crawler analyzes its crawl boundary to find the links that are likely to be most relevant for the crawl and avoids irrelevant regions of the web. This leads to significant savings in hardware and network resources and helps keep the crawl more up to date. A web crawler is a relatively simple automated program, used for example by linguists and market researchers, that fetches information from the Internet in an organized manner [37]. Alternative names for a web crawler include web spider, web robot, bot, crawler, and automatic indexer.

The crawler begins as a basic exercise in search algorithms and can then be extended in several directions to include information retrieval, statistical learning, unsupervised learning, natural language processing, and knowledge representation.
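To make the boundary analysis concrete, the sketch below keeps a priority frontier in which each discovered link is scored by how many topic keywords appear in its surrounding text, so the most promising links are fetched first. The scoring rule and keyword set are illustrative assumptions rather than the exact heuristic used in the proposed system.

import heapq

TOPIC_KEYWORDS = {"festival", "event", "amman", "jordan"}   # assumed topics

def relevance(text):
    """Score a piece of text by the number of topic keywords it contains."""
    words = set(text.lower().split())
    return len(words & TOPIC_KEYWORDS)

class CrawlFrontier:
    """Best-first frontier: links with higher relevance are fetched earlier."""
    def __init__(self):
        self._heap = []
        self._seen = set()

    def add(self, url, context_text):
        if url not in self._seen:
            self._seen.add(url)
            # heapq is a min-heap, so store the negated score.
            heapq.heappush(self._heap, (-relevance(context_text), url))

    def next_url(self):
        return heapq.heappop(self._heap)[1] if self._heap else None

if __name__ == "__main__":
    frontier = CrawlFrontier()
    frontier.add("https://example.com/festival", "Jerash festival event programme")
    frontier.add("https://example.com/contact", "contact us page")
    print(frontier.next_url())   # the festival page is scheduled first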
IV. EXPERIMENTAL RESULTS

A. Apache Nutch

In 2003 Doug Cutting, the creator of Lucene, and Mike Cafarella founded Apache Nutch [38], an open-source web crawler written in Java and used for crawling websites. Apache Nutch facilitates parsing, indexing, creating a search engine, customizing the search according to needs, scalability, robustness, and a Scoring Filter for custom implementations. Apache Nutch can run on a single machine as well as in a distributed environment such as Apache Hadoop. It can be integrated with Eclipse and CYGWIN easily and can index all the web pages it crawls to Cygwin or to Eclipse. Figure 1 illustrates the operation classes within Nutch.

Fig. 1. Operation classes within Nutch

Crawling is driven by the Apache Nutch crawling tool; once Apache Nutch has indexed the web pages to Cygwin or to Eclipse, the user can search the required web pages in Cygwin. According to [39], the CrawlDB is generated by Apache Nutch, and a crawling cycle has four steps, each implemented as a Hadoop MapReduce job: GeneratorJob, FetcherJob, ParserJob, and DbUpdaterJob [40]. The Nutch crawl is distributed by creating the seed file, copying it into a "urls" directory, copying that directory up to HDFS, and copying the configuration to the Hadoop configuration directory.
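A possible way to drive this cycle is sketched below as a small Python wrapper around the usual Nutch 1.x step commands (inject, generate, fetch, parse, updatedb). The paths, the number of rounds, and the exact command layout are assumptions that may differ across Nutch and Hadoop versions; this is not the authors' own setup.

import glob
import subprocess

# Assumed local layout (single-machine run): seed URLs in ./urls, crawl data in ./crawl.
NUTCH = "bin/nutch"
CRAWLDB = "crawl/crawldb"
SEGMENTS = "crawl/segments"

def run(*cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

def latest_segment():
    # Each generate step creates a new timestamped segment directory.
    return sorted(glob.glob(SEGMENTS + "/*"))[-1]

def crawl_cycle(rounds=3):
    # For a distributed run the seed directory would first be copied to HDFS,
    # e.g. with "hadoop fs -put urls urls"; here we stay on the local filesystem.
    run(NUTCH, "inject", CRAWLDB, "urls")
    for _ in range(rounds):
        run(NUTCH, "generate", CRAWLDB, SEGMENTS)       # GeneratorJob
        segment = latest_segment()
        run(NUTCH, "fetch", segment)                    # FetcherJob
        run(NUTCH, "parse", segment)                    # ParserJob
        run(NUTCH, "updatedb", CRAWLDB, segment)        # DbUpdaterJob

if __name__ == "__main__":
    crawl_cycle()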
Apache Nutch can be easily integrated with Apache Hadoop, which makes the process much faster than running Apache Nutch on a single machine. After integrating Apache Nutch with Apache Hadoop, crawling can be performed on a Hadoop cluster environment, so the process is much faster and throughput is maximized.
B. Cygwin

In [41], Morteza defined Cygwin as a POSIX-compatible environment that runs natively on Microsoft Windows. Its goal is to allow Unix programs to be recompiled and run natively on Windows with minimal source code modifications, while providing the same underlying POSIX API they would expect. Figures 2, 3, and 4 illustrate a successful crawl within CYGWIN.

Fig. 2. Injector progress in CYGWIN
Fig. 3. Starting Tomcat with CYGWIN
Fig. 4. Successful crawling within CYGWIN

C. Heritrix

Heritrix is a command-line tool that can optionally be used to initiate crawls. It was developed jointly by the Internet Archive and the Nordic national libraries on specifications written in early 2003 [42], and it has since been continually improved by employees of the Internet Archive and other interested parties. In the proposed method, Heritrix visits web pages and searches for links; it then follows those links to new pages, where it once again identifies links, follows them, and so on, so that a large number of links is gathered rapidly. In the proposed system a three-hop limit has been set: at the limit, Heritrix stops collecting links and moves on to the next seed in the list. This allows many territories to be covered while moving rapidly through the large governmental domains. The following figure illustrates how Heritrix is extracted in the proposed system at the Cloudera home.
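The hop-limit policy described above can be summarized by the small sketch below: each seed is expanded breadth-first, a hop counter is carried with every discovered link, and once the three-hop limit is reached the crawler abandons that branch and moves on to the next seed. The link-extraction function is a stand-in; this is not Heritrix itself, only an illustration of the limiting behaviour.

from collections import deque

HOP_LIMIT = 3   # matches the three-hop limit used in the proposed system

def get_links(url):
    """Stand-in for real link extraction from a fetched page."""
    return []   # replace with actual fetching and parsing

def crawl_with_hop_limit(seeds):
    collected = set()
    for seed in seeds:                     # one seed at a time, as in the seed list
        queue = deque([(seed, 0)])
        while queue:
            url, hops = queue.popleft()
            if url in collected:
                continue
            collected.add(url)
            if hops >= HOP_LIMIT:
                continue                   # limit reached: stop expanding this branch
            for link in get_links(url):
                queue.append((link, hops + 1))
    return collected

if __name__ == "__main__":
    print(len(crawl_with_hop_limit(["https://example.gov"])), "URLs collected")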