
Web Parser for Cleaning Services

Markus Tenghamn
Kvarnbacksvägen 5
72233 Västerås
+46 73 995 06 88
mtn11005@student.mdh.se
1. SUMMARY
This report summarizes the work of Markus Tenghamn on a web parsing application for Cleaning Services. Cleaning Services needed a tool to make it easier to gather information used for marketing. This application is the result of that need and of close work with the company. The application is written in PHP and utilizes several external libraries and tools to accomplish the tasks presented. The result is a fully functional web application which went beyond the initial requirements.

2. INTRODUCTION
A web application was created with PHP which uses external websites and APIs to gather information which can later be exported and used for marketing. The application uses parsers to gather specific information from these websites, which it then stores in a database along with information gathered from various APIs. The information can then be exported in Excel, PDF, or text format, the last of which follows the format used by Posten to import shipping addresses.
3. BACKGROUND
The project requirements were laid out by Cleaning Services, who needed a better way of collecting information for marketing from real estate listings. This had previously been done by hand, and the company was now looking to build a web application that would automate the process.

4. TECHNOLOGY
Several different languages were considered for this project to see which might be the most efficient at accomplishing the task at hand. To begin with, a simple test was laid out to find out which language performed best. The test involved starting an instance of the application or script needed to run the parser, downloading the HTML from a specific URL, and finding an element in the DOM. 100 tests were done for each language, with 1 initialization and 100 requests each time. Requests were performed on the same computer to eliminate as many anomalies or errors as possible. The averages of the findings are shown below.

PHP
Average time to initialize: 1.057 sec
Average time for a request: 0.971 sec

Python
Average time to initialize: 1.443 sec
Average time for a request: 0.161 sec

C#
Average time to initialize: 2.480 sec
Average time for a request: 0.062 sec

The results were surprising: C# was the fastest at completing the task once it had been initialized, while PHP had the fastest initialization time but much slower request times. The expected result was that all of the languages would have almost equal performance. The data may not be completely accurate, as internet speeds and load on the tested website may have affected the request times.
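As an illustration, a minimal sketch of what the PHP variant of this timing test might look like is shown below. The test URL and the element looked up are placeholders; the actual test scripts used for the comparison are not reproduced here.

<?php
// Minimal sketch of one timing run: repeated fetch-and-parse requests
// against a placeholder test page. Interpreter start-up (the
// "initialization" time above) is measured outside the script itself.
$testUrl  = 'https://example.com/';   // placeholder test page
$requests = 100;
$total    = 0.0;

for ($i = 0; $i < $requests; $i++) {
    $start = microtime(true);
    $html  = file_get_contents($testUrl);      // download the HTML
    if ($html !== false) {
        $doc = new DOMDocument();
        @$doc->loadHTML($html);                // @ silences warnings from sloppy markup
        $doc->getElementsByTagName('title');   // find an element in the DOM
    }
    $total += microtime(true) - $start;
}

printf("Average time for a request: %.3f sec\n", $total / $requests);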

As the work on the project progressed it was decided that the script would be run in a Linux environment, which limited the choice to PHP and Python. PHP was chosen as the language for the project due to its ease of use when developing a web application where a user interface is needed.

As the parsers were tested it was realized that the initial tests were not very important for the current project, since creating as many requests as possible in a very short period of time would have negative side effects on the sites being parsed. Parsing a site too quickly may slow down its performance for the real users visiting it. Most sites also have firewalls or other features which will put a stop to requests that could cause performance issues. Therefore a pause has been implemented in the parsers, causing each parser to wait a few seconds between requests.
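A minimal sketch of this pause, assuming an illustrative URL list and delay value rather than the production configuration, could look as follows:

<?php
// Wait a few seconds between requests so the parsed site is not
// overloaded. The URLs and the delay are illustrative values only.
$urls  = ['https://example.com/page1', 'https://example.com/page2'];
$pause = 5; // seconds to wait between requests

foreach ($urls as $url) {
    $html = file_get_contents($url);
    // ... parse $html here ...
    sleep($pause); // pause before the next request
}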
5. MARKET ANALYSIS
There are many companies that use parsers to gather data; however, most companies will not disclose how or what they parse, since it is usually a tool they can use to gain an advantage over other companies. It is therefore hard to find good examples of parsers that directly relate to the project, but some types of parsers, such as search engines, are a bit easier to analyze.

5.1 Google
Figure 1: Front page of Google.com

The search engine Google is a great example of a company that uses web scraping. Google has hundreds of thousands of servers, of which a large portion is used for scraping websites and storing information used in search results. Most users can only speculate on how the Google search algorithm actually works, but it is widely believed that a large emphasis, among other factors, is put on the meta tags of the web page, where a webmaster will usually specify the preferred description he or she would like Google to use for search results and the keywords which he or she feels are relevant to the website.
5.2 MDH Schedule
MDH Schedule is a schedule app for students which uses parsers to index course and schedule information, which is then sent to users' mobile phones. These parsers are PHP based and parse both HTML and XML from the school's schedule in order to later present this data to students who use the app.
6. DESIGN PROCESS
The most important aspect of the web application is the database, since this is where all of the parsed data is stored and what is used to export data when such a request is made. The other part of the application is the interface, which is there to make it easier for a user to handle the data and to prevent someone from having to deal with the data directly in the database.

6.1 Database
A MySQL database was used to store the data from the parsers. To make the database as effective as possible, it followed guidelines to reach a normalization form of at least 3NF.

6.2 User interface
Figure 2: User interface for the application

The user interface for the application provides easy access for authorized users to view data and current parsing statistics. An export function is implemented to allow the user to easily export information stored in the database as either an Excel file or a text file formatted to fit Posten's import format.

7. IMPLEMENTATION
A combination of many different tools is used to make it possible to pull information from the various websites which provide real estate listings.

7.1 Parsers
Several parsers and API readers were created to find and store information from several sites. The following sites are parsed: Kvalster, Blocket, Bovision, and Hemnet. Booli and 118100 have open API protocols that were utilized instead of parsing the sites. Several different technologies went into these parsers in order to make them work.

7.1.1 Curl
Curl is a tool for transferring data with URL syntax over HTTP. It is an open source library which can easily be used with PHP. In the web application the built-in curl library is used to make requests to websites, which return the HTML DOM and allow the application to parse it with XPath and PHP functions. By finding specific URLs on the pages, the parser can navigate to sub pages of a site to find additional data. In some cases sites would block content when certain conditions were not met; this required the parsers to simulate form post and get requests before navigating to the listings where the data could be parsed [1].
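A hedged sketch of this usage is shown below: it fetches a page with curl and, when required, simulates a form post first. The URLs and form fields are placeholders, not details of the actual sites.

<?php
// Fetch a page over HTTP with curl, optionally simulating a form post
// first. URLs and field names below are placeholders.
function fetchHtml(string $url, array $postFields = []): string
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects to the listing page
    if ($postFields) {
        curl_setopt($ch, CURLOPT_POST, true);       // simulate a form post
        curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query($postFields));
    }
    $html = curl_exec($ch);
    curl_close($ch);
    return $html === false ? '' : $html;
}

// Plain GET request for a listings page (placeholder URL).
$html = fetchHtml('https://example.com/listings');

// Simulated form post, e.g. selecting a region before listings are shown.
$html = fetchHtml('https://example.com/search', ['region' => 'Västerås']);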
7.1.2 XPath
XML Path Language, or XPath, is mainly used to address and manipulate parts of an XML document, but it can also be used to manipulate the DOM of an HTML document. With the use of DOMDocument, XPath can be used to navigate the DOM [2]. With XPath, specific elements can easily be selected and turned into nodes and lists.
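For example, a node list of listing links could be extracted as follows; the URL and the query target hypothetical markup, not any of the actual sites:

<?php
// Navigate fetched HTML with DOMDocument and DOMXPath.
$html = file_get_contents('https://example.com/listings'); // placeholder page

$doc = new DOMDocument();
@$doc->loadHTML($html); // @ silences warnings caused by invalid HTML

$xpath = new DOMXPath($doc);

// Turn all listing links into a node list and read their URLs.
$nodes = $xpath->query("//div[@class='listing']//a/@href");
foreach ($nodes as $node) {
    echo $node->nodeValue, "\n"; // sub page to visit for additional data
}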
7.1.3 API
Several of the sites had open API protocols that could be used to gather listings and additional information. With specific API requests, new listings can be found as well as new information to complement existing listings.
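As an illustration, such a request could look like the sketch below. The endpoint, parameters, and response fields are assumptions; the real Booli and 118100 APIs have their own formats and authentication.

<?php
// Read an open API and decode the JSON response. Everything about the
// endpoint and the response layout here is a placeholder assumption.
$endpoint = 'https://api.example.com/listings?area=Västerås';

$json = file_get_contents($endpoint);
$data = json_decode($json, true); // decode to an associative array

foreach ($data['listings'] ?? [] as $listing) {
    // Store the listing, or merge it with an existing one, here.
    echo $listing['address'] ?? 'unknown address', "\n";
}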
7.1.4 Cron
Figure 3: Cron jobs for Cleaning Services

Cron is a job scheduler for Unix-like systems which was used to schedule the different parsers of the web application at different times. Parsers are initiated every 60 minutes, with roughly 10 minutes left between each parser in order to keep server load down.
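One way such a staggered schedule could be expressed in a crontab is sketched below; the script paths are placeholders, and the actual job definitions are those shown in Figure 3.

# Each parser runs once per hour, offset by 10 minutes from the
# previous one. Script paths are placeholders.
0  * * * * php /var/www/app/parsers/kvalster.php
10 * * * * php /var/www/app/parsers/blocket.php
20 * * * * php /var/www/app/parsers/bovision.php
30 * * * * php /var/www/app/parsers/hemnet.php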
7.2 Data handling
It is very important that the data that is parsed is the type of data that needs to be stored. Therefore the application has various rules which throw away bad information. Basic things, such as whether the object has an address, are checked initially. After the initial check, regular expressions are used to make sure that the address meets certain standards, such as containing at least one number [3].
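A minimal sketch of such a check, assuming a simplified rule set compared to the application's actual rules, might look like this:

<?php
// Reject parsed objects whose address is missing or contains no number.
// The real application applies more rules than are shown here.
function isValidAddress(?string $address): bool
{
    if ($address === null || trim($address) === '') {
        return false; // basic check: the object must have an address
    }
    return preg_match('/\d/', $address) === 1; // at least one number
}

var_dump(isValidAddress('Kvarnbacksvägen 5')); // bool(true)
var_dump(isValidAddress('Storgatan'));         // bool(false)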
7.3 Export
Information stored in the database of the web application can be exported as PDF, Excel, and text files. The PDF export is available from the user interface and is used to export information for one specific listing. On the export page of the application, the export functions for both text and Excel files can be found; these can export all listings for a specific day, a week, or all listings in the database. The Excel file is complete with all the information found in the database, while the text file is formatted to fit the import format of Posten.se.
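A minimal sketch of the text export is shown below. The database credentials, table, and columns are placeholders, and the tab-separated layout is illustrative only; the actual file follows Posten's import format, which is not reproduced here.

<?php
// Write one line per listing from the database to a text file.
// Credentials, table, and column names are placeholder assumptions.
$pdo  = new PDO('mysql:host=localhost;dbname=parser', 'user', 'secret');
$rows = $pdo->query('SELECT name, address, postcode, city FROM listings');

$out = fopen('export.txt', 'w');
foreach ($rows as $row) {
    fwrite($out, $row['name'] . "\t" . $row['address'] . "\t"
               . $row['postcode'] . "\t" . $row['city'] . "\n");
}
fclose($out);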
7.4 Tools
Several tools were used while working on the project. These helped speed up the process, making work faster and more reliable at the same time.

7.4.1 PHPStorm
PHPStorm is an IDE for editing projects written in PHP. It also comes with a user-contributed library of plugins which add extra helpful features to the IDE. The IDE provided automatic completion and indentation of code, which made the process quicker and less likely to result in errors in the code. The IDE also features an automatic upload function which uploads and syncs a local copy with a live copy on the working server via FTP. It is also very helpful when including a file, whether a PHP file or an image: it will search the directories to make sure the file exists, and if it does not it will display a relevant error to let the user know.

7.4.2 Git
Git is an open source version control system which could be installed locally for the project. PHPStorm also supports Git, so version handling could be done on everything which was edited in the project. This rarely had to be used, but it was an extra layer of security in case code was lost or broken during the project.

8. EVALUATION
8.1 Results
When presenting the final version of the web application to Cleaning Services, the feedback was extremely positive. Everything works well, and the results were better than what the client had hoped for. The application is easy to use and outputs all the information needed in a format which the client prefers.

8.2 Future additions
There is an endless number of things that could be added to the parsing application to make it better; however, some things really stand out. The parsers will eventually stop working, either because the websites being parsed change or because they simply close down, and this will stop the application from being useful for the client. To adapt to these changes, an interface could be built into the application which would let the client create new parsers that either try to find the information on their own automatically or allow the user to specify which elements of the pages contain the information that should be parsed. Another aspect is user handling: currently, users are entered directly into the database manually, and there is no way to create a new user via the interface. This is not a huge problem, as there won't be a need for many users, but if a new user is needed or a password needs to be changed, it requires someone with knowledge of the database to take care of it.
9. CONCLUSION
The final product has gone beyond the initial requirements which the client asked for and has resulted in a robust and easy to use web application which runs on its own. Future additions to the application may be features that let the parsers determine whether the parsed listings were sold in the end. In addition to this, features that make it easier to create and manage users, and to delete and manage old data, could be added to make the application easier to use and prevent it from endlessly taking up more and more memory.

10. REFERENCES
[1] Schrenk, M. 2012. Webbots, Spiders, and Screen Scrapers: A Guide to Developing Internet Agents with PHP/cURL. Ch. 6. ISBN 978-1-59327-397-2.
[2] Macintyre, P., Danchilla, B., Gogala, M. 2011. Pro PHP Programming. Ch. 14. ISBN 978-1-4302-3560-6.
[3] Beighley, L., Morrison, M. 2009. Head First PHP & MySQL. Ch. 10. ISBN 978-0-596-00630-
