6.2: Parsing HTML
When you download a web page, the contents are written in HyperText Markup Language, aka HTML. For example, here is a
minimal HTML document:
<!DOCTYPE html>
<html>
<head>
<title>This is a title</title>
</head>
<body>
<p>Hello world!</p>
</body>
</html>
The phrases “This is a title” and “Hello world!” are the text that actually appears on the page; the other elements are tags that
indicate how the text should be displayed.
When our crawler downloads a page, it will need to parse the HTML in order to extract the text and find the links. To do that,
we’ll use jsoup, which is an open-source Java library that downloads and parses HTML.
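As a rough illustration of that step, the sketch below parses an HTML string with jsoup and pulls out the title, the visible text, and the link targets. It assumes the jsoup library is on the classpath. The class name JsoupParseSketch, the helper extractLinks, and the example.com link added to the minimal page are hypothetical, introduced only for this example; it is not the crawler's actual code.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

class JsoupParseSketch {
    // The minimal page from the text, plus one hypothetical link
    // (not in the original) so there is something to extract.
    static final String HTML =
        "<!DOCTYPE html>\n"
        + "<html>\n"
        + "<head>\n"
        + "<title>This is a title</title>\n"
        + "</head>\n"
        + "<body>\n"
        + "<p>Hello world! <a href=\"https://example.com/next\">Next</a></p>\n"
        + "</body>\n"
        + "</html>\n";

    // Collect the href attribute of every <a> element that has one.
    static List<String> extractLinks(Document doc) {
        List<String> links = new ArrayList<>();
        for (Element anchor : doc.select("a[href]")) {
            links.add(anchor.attr("href"));
        }
        return links;
    }

    public static void main(String[] args) {
        // Parse the string into a jsoup Document (the root of the parsed tree).
        Document doc = Jsoup.parse(HTML);

        System.out.println("Title: " + doc.title());
        System.out.println("Text:  " + doc.body().text());
        System.out.println("Links: " + extractLinks(doc));
    }
}
```

Here the HTML comes from a string; jsoup can also fetch a page over the network, which is the "downloads" part mentioned above.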
The result of parsing HTML is a Document Object Model tree, or DOM tree, that contains the elements of the document,
including text and tags. The tree is a linked data structure made up of nodes; the nodes represent text, tags, and other document
elements.
The relationships between the nodes are determined by the structure of the document. In the example above, the first node,
called the root, is the <html> tag. It holds links to the two nodes it contains, <head> and <body>; these
nodes are the children of the root node.
The <head> node has one child, <title>, and the <body> node has one child, <p>
(which stands for “paragraph”). Figure 6.2.1 represents this tree graphically.
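To make the parent and child links concrete, here is a minimal sketch that builds the tree for the example document by hand and prints it with one level of indentation per generation. The Node class here is a hypothetical, simplified stand-in and not jsoup's own Node type. It also treats each piece of text as a child of its enclosing tag, and it leaves out details a real parser may record, such as the doctype and whitespace between tags.

```java
import java.util.ArrayList;
import java.util.List;

class DomTreeSketch {
    // Hypothetical simplified node: a label (a tag or a piece of text)
    // plus links to its child nodes.
    static class Node {
        final String label;
        final List<Node> children = new ArrayList<>();

        Node(String label, Node... kids) {
            this.label = label;
            for (Node kid : kids) {
                children.add(kid);
            }
        }
    }

    // Build the tree for the minimal document by hand.
    static Node buildExample() {
        return new Node("<html>",
            new Node("<head>",
                new Node("<title>", new Node("This is a title"))),
            new Node("<body>",
                new Node("<p>", new Node("Hello world!"))));
    }

    // Render the subtree rooted at node, indenting two spaces per level.
    static String render(Node node, int depth) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < depth; i++) {
            sb.append("  ");
        }
        sb.append(node.label).append('\n');
        for (Node child : node.children) {
            sb.append(render(child, depth + 1));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(render(buildExample(), 0));
    }
}
```

The printed outline starts at the root <html>, then shows <head> and <body> one level in, each followed by its own descendants.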