6.2: Parsing HTML
When you download a web page, the contents are written in HyperText Markup Language, aka HTML. For example, here is a
minimal HTML document:

<!DOCTYPE html>
<html>
  <head>
    <title>This is a title</title>
  </head>
  <body>
    <p>Hello world!</p>
  </body>
</html>

The phrases “This is a title” and “Hello world!” are the text that actually appears on the page; the other elements are tags that
indicate how the text should be displayed.
When our crawler downloads a page, it will need to parse the HTML in order to extract the text and find the links. To do that,
we’ll use jsoup, which is an open-source Java library that downloads and parses HTML.
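To make "extract the text and find the links" concrete, here is a minimal sketch. It does not use jsoup: it swaps in the XML parser that ships with the JDK, which only accepts well-formed markup, so it is a stand-in for illustration rather than a replacement for an HTML parser. The class name, method names, and sample page are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Hypothetical helper; the names here are illustrative, not part of jsoup. */
final class LinkExtractionSketch {

    // A small, well-formed sample page (an assumption made for this sketch).
    static final String PAGE =
        "<html><body><p>See <a href=\"/wiki/Java\">Java</a> and "
        + "<a href=\"/wiki/HTML\">HTML</a>.</p></body></html>";

    /** Parses well-formed markup with the JDK's XML parser (not an HTML parser). */
    static Document parse(String markup) {
        try {
            return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(markup)));
        } catch (Exception e) {
            throw new IllegalArgumentException("markup is not well-formed", e);
        }
    }

    /** Returns the text of the page with all tags removed. */
    static String text(Document doc) {
        return doc.getDocumentElement().getTextContent();
    }

    /** Returns the href attribute of every <a> element, in document order. */
    static List<String> links(Document doc) {
        List<String> hrefs = new ArrayList<>();
        NodeList anchors = doc.getElementsByTagName("a");
        for (int i = 0; i < anchors.getLength(); i++) {
            String href = ((Element) anchors.item(i)).getAttribute("href");
            if (!href.isEmpty()) {  // getAttribute returns "" when the attribute is absent
                hrefs.add(href);
            }
        }
        return hrefs;
    }

    public static void main(String[] args) {
        Document doc = parse(PAGE);
        System.out.println(text(doc));
        System.out.println(links(doc));
    }
}
```

Real web pages are often not well-formed XML, which is one reason to use a parser designed for HTML.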
The result of parsing HTML is a Document Object Model tree, or DOM tree, that contains the elements of the document,
including text and tags. The tree is a linked data structure made up of nodes; the nodes represent text, tags, and other document
elements.
The relationships between the nodes are determined by the structure of the document. In the example above, the first node,
called the root, is the <html> tag. It links to the two nodes it contains, <head> and <body>; these
nodes are the children of the root node.
The <head> node has one child, <title>, and the <body> node has one child, <p>
(which stands for “paragraph”). Figure 6.2.1 represents this tree graphically.

Figure 6.2.1: DOM tree for a simple HTML page.

Each node contains links to its children; in addition, each node contains a link to its parent, so from any node it is possible to
navigate up and down the tree. The DOM tree for real pages is usually more complicated than this example.
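The parent and child links described above can be explored with a short sketch. It uses the XML DOM parser bundled with the JDK as a stand-in for jsoup, which works here only because the example page is well-formed; the DOCTYPE line is left out for simplicity. The class and method names are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

/** Hypothetical helper; the names here are illustrative only. */
final class DomTreeSketch {

    // The minimal page from the text, without the DOCTYPE line.
    static final String PAGE =
        "<html><head><title>This is a title</title></head>"
        + "<body><p>Hello world!</p></body></html>";

    /** Parses well-formed markup and returns the root element of the tree. */
    static Element parseRoot(String markup) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(markup)));
            return doc.getDocumentElement();
        } catch (Exception e) {
            throw new IllegalArgumentException("markup is not well-formed", e);
        }
    }

    /** Follows child links: returns the tag names of a node's element children. */
    static List<String> childTags(Node node) {
        List<String> tags = new ArrayList<>();
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            if (c.getNodeType() == Node.ELEMENT_NODE) {
                tags.add(c.getNodeName());
            }
        }
        return tags;
    }

    /** Follows parent links: returns the tag names from a node up to the root. */
    static List<String> pathToRoot(Node node) {
        List<String> path = new ArrayList<>();
        for (Node n = node;
             n != null && n.getNodeType() == Node.ELEMENT_NODE;
             n = n.getParentNode()) {
            path.add(n.getNodeName());
        }
        return path;
    }

    public static void main(String[] args) {
        Element root = parseRoot(PAGE);
        System.out.println(root.getNodeName() + " -> " + childTags(root)); // html -> [head, body]
        Node p = root.getElementsByTagName("p").item(0);
        System.out.println(pathToRoot(p)); // [p, body, html]
    }
}
```

Walking down from the root uses the child links; walking up from <p> uses the parent links and stops at the root, whose parent is the document itself rather than an element.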
Most web browsers provide tools for inspecting the DOM of the page you are viewing. In Chrome, you can right-click on any
part of a web page and select “Inspect” from the menu that pops up. In Firefox, you can right-click and select “Inspect
Element” from the menu. Safari provides a tool called Web Inspector, which you can read about at thinkdast.com/safari. For
Internet Explorer, you can read the instructions at thinkdast.com/explorer.
Figure 6.2.2 shows a screenshot of the DOM for the Wikipedia page on Java, thinkdast.com/java. The element that’s
highlighted is the first paragraph of the main text of the article, which is contained in a <div> element with
id="mw-content-text". We’ll use this element id to identify the main text of each article we download.
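As a rough illustration of identifying the main text by element id, the sketch below looks up the first paragraph inside the element with a given id. It uses the JDK's XPath support on a small well-formed sample rather than jsoup on a real Wikipedia page. The sample markup, the class name, and the method name are hypothetical, and the id value mw-content-text is an assumption about Wikipedia's markup.

```java
import java.io.StringReader;
import java.util.Optional;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

/** Hypothetical helper; the names here are illustrative only. */
final class MainTextSketch {

    // A made-up, well-formed stand-in for an article page.
    static final String ARTICLE =
        "<html><body>"
        + "<div id=\"sidebar\"><p>Navigation</p></div>"
        + "<div id=\"mw-content-text\"><div>"
        + "<p>Java is a programming language.</p><p>More text.</p>"
        + "</div></div>"
        + "</body></html>";

    /** Returns the text of the first <p> under the element with the given id, if any. */
    static Optional<String> firstParagraph(String markup, String id) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        // Bind $id in the expression to the argument instead of concatenating strings.
        xpath.setXPathVariableResolver(
            name -> "id".equals(name.getLocalPart()) ? id : null);
        try {
            Node p = (Node) xpath.evaluate(
                "(//*[@id=$id]//p)[1]",
                new InputSource(new StringReader(markup)),
                XPathConstants.NODE);
            return Optional.ofNullable(p).map(Node::getTextContent);
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("could not evaluate against markup", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(firstParagraph(ARTICLE, "mw-content-text").orElse("(not found)"));
    }
}
```

Selecting by id skips content such as the sidebar, which is why an id that marks the article body is a convenient anchor for a crawler.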

Allen B. Downey 6/14/2021 6.2.1 CC-BY-NC-SA https://eng.libretexts.org/@go/page/12758

Figure 6.2.2: Screenshot of the Chrome DOM Inspector.
