
Contents

Quick Start
How to create HTML DOM object?
How to find HTML elements?
How to access the HTML element's attributes
How to traverse the DOM tree?
How to dump contents of DOM object?
How to customize the parsing behavior?
API Reference
Camel naming conventions
Quick Start

• Get HTML elements

// Create DOM from URL or file
$html = file_get_html('http://www.google.com/');

// Find all images
foreach($html->find('img') as $element)
    echo $element->src . '<br>';

// Find all links
foreach($html->find('a') as $element)
    echo $element->href . '<br>';

• Modify HTML elements

// Create DOM from string
$html = str_get_html('<div id="hello">Hello</div><div id="world">World</div>');

$html->find('div', 1)->class = 'bar';

$html->find('div[id=hello]', 0)->innertext = 'foo';

echo $html; // Output: <div id="hello">foo</div><div id="world" class="bar">World</div>

• Extract contents from HTML

// Dump contents (without tags) from HTML
echo file_get_html('http://www.google.com/')->plaintext;

How to create HTML DOM object?

• Quick way

// Create a DOM object from a string
$html = str_get_html('<html><body>Hello!</body></html>');

// Create a DOM object from a URL
$html = file_get_html('http://www.google.com/');

// Create a DOM object from an HTML file
$html = file_get_html('test.htm');

• Object-oriented way

// Create a DOM object
$html = new simple_html_dom();

// Load HTML from a string
$html->load('<html><body>Hello!</body></html>');

// Load HTML from a URL
$html->load_file('http://www.google.com/');

// Load HTML from an HTML file
$html->load_file('test.htm');

How to find HTML elements?

• Basics

// Find all anchors, returns an array of element objects
$ret = $html->find('a');

// Find the (N)th anchor, returns an element object or null if not found (zero-based)
$ret = $html->find('a', 0);

// Find the last anchor, returns an element object or null if not found (-1 selects the last match)
$ret = $html->find('a', -1);

// Find all <div> with the id attribute
$ret = $html->find('div[id]');

// Find all <div> with the attribute id=foo
$ret = $html->find('div[id=foo]');

• Advanced

// Find all elements with id=foo
$ret = $html->find('#foo');

// Find all elements with class=foo
$ret = $html->find('.foo');

// Find all elements that have an id attribute
$ret = $html->find('*[id]');

// Find all anchors and images
$ret = $html->find('a, img');

// Find all anchors and images with the "title" attribute
$ret = $html->find('a[title], img[title]');

• Descendant selectors

// Find all <li> in <ul>
$es = $html->find('ul li');

// Find nested <div> tags
$es = $html->find('div div div');

// Find all <td> in <table> with class=hello
$es = $html->find('table.hello td');

// Find all <td> tags with attribute align=center in <table> tags
$es = $html->find('table td[align=center]');

• Nested selectors

// Find all <li> in <ul>
foreach($html->find('ul') as $ul)
{
    foreach($ul->find('li') as $li)
    {
        // do something...
    }
}

// Find first <li> in first <ul>
$e = $html->find('ul', 0)->find('li', 0);

• Attribute Filters

Supports these operators in attribute selectors:

Filter               Description
[attribute]          Matches elements that have the specified attribute.
[!attribute]         Matches elements that don't have the specified attribute.
[attribute=value]    Matches elements that have the specified attribute with a certain value.
[attribute!=value]   Matches elements that don't have the specified attribute with a certain value.
[attribute^=value]   Matches elements that have the specified attribute and it starts with a certain value.
[attribute$=value]   Matches elements that have the specified attribute and it ends with a certain value.
[attribute*=value]   Matches elements that have the specified attribute and it contains a certain value.
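
The operator semantics in the table can be sketched in plain PHP, independent of the parser. The function name attribute_filter_matches and its array-based input are hypothetical and exist only for illustration. The sketch also assumes that [attribute!=value] requires the attribute to be present, which is one reading of the table row rather than a statement about how the library behaves.

```php
<?php
// Hypothetical helper (not part of the library): applies one filter from the
// table to a single element, given as an array of attribute name => string value.
function attribute_filter_matches(array $attrs, string $name, string $op = '', string $value = ''): bool
{
    $has = array_key_exists($name, $attrs);
    $actual = $has ? (string) $attrs[$name] : '';

    switch ($op) {
        case '':   // [attribute]
            return $has;
        case '!':  // [!attribute]
            return !$has;
        case '=':  // [attribute=value]
            return $has && $actual === $value;
        case '!=': // [attribute!=value] (assumption: the attribute must exist)
            return $has && $actual !== $value;
        case '^=': // [attribute^=value] starts with
            return $has && substr($actual, 0, strlen($value)) === $value;
        case '$=': // [attribute$=value] ends with
            return $has && ($value === '' || substr($actual, -strlen($value)) === $value);
        case '*=': // [attribute*=value] contains
            return $has && ($value === '' || strpos($actual, $value) !== false);
    }
    throw new InvalidArgumentException("Unknown operator: $op");
}
```

For example, calling it with array('href' => 'https://example.com/file.pdf'), 'href', '$=' and '.pdf' corresponds to the selector a[href$=.pdf] matching that anchor.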

• Text & Comments

// Find all text blocks
$es = $html->find('text');

// Find all comment (<!--...-->) blocks
$es = $html->find('comment');

How to access the HTML element's attributes

• Get, Set and Remove attributes

// Get an attribute (for a non-value attribute such as checked or selected,
// this returns true or false)
$value = $e->href;

// Set an attribute (for a non-value attribute such as checked or selected,
// set its value to true or false)
$e->href = 'my link';

// Remove an attribute by setting its value to null
$e->href = null;

// Determine whether an attribute exists
if(isset($e->href))
    echo 'href exists!';

• Magic attributes

// Example
$html = str_get_html("<div>foo <b>bar</b></div>");
$e = $html->find("div", 0);

echo $e->tag; // Returns: "div"
echo $e->outertext; // Returns: "<div>foo <b>bar</b></div>"
echo $e->innertext; // Returns: "foo <b>bar</b>"
echo $e->plaintext; // Returns: "foo bar"

Attribute Name   Usage
$e->tag          Read or write the tag name of element.
$e->outertext    Read or write the outer HTML text of element.
$e->innertext    Read or write the inner HTML text of element.
$e->plaintext    Read or write the plain text of element.

• Tips

// Extract contents from HTML
echo $html->plaintext;

// Wrap an element
$e->outertext = '<div class="wrap">' . $e->outertext . '</div>';

// Remove an element by setting its outertext to an empty string
$e->outertext = '';

// Append an element
$e->outertext = $e->outertext . '<div>foo</div>';

// Insert an element
$e->outertext = '<div>foo</div>' . $e->outertext;

How to traverse the DOM tree?

• Background Knowledge

// If you are not so familiar with HTML DOM, check this link to learn more...

// Example
echo $html->find("#div1", 0)->children(1)->children(1)->children(2)->id;
// or
echo $html->getElementById("div1")->childNodes(1)->childNodes(1)->childNodes(2)->getAttribute('id');

• Traverse the DOM tree

You can also call methods with camel naming conventions.

Method                                Description
mixed $e->children ( [int $index] )   Returns the Nth child object if index is set, otherwise returns an array of children.
element $e->parent ()                 Returns the parent of element.
element $e->first_child ()            Returns the first child of element, or null if not found.
element $e->last_child ()             Returns the last child of element, or null if not found.
element $e->next_sibling ()           Returns the next sibling of element, or null if not found.
element $e->prev_sibling ()           Returns the previous sibling of element, or null if not found.

How to dump contents of DOM object?

• Quick way

// Dumps the internal DOM tree back into string
$str = $html;

// Print it!
echo $html;

• Object-oriented way

// Dumps the internal DOM tree back into string
$str = $html->save();

// Dumps the internal DOM tree back into a file
$html->save('result.htm');

How to customize the parsing behavior?

• Callback function

// Write a function with parameter "$element"
function my_callback($element) {
    // Hide all <b> tags
    if ($element->tag=='b')
        $element->outertext = '';
}

// Register the callback function by its function name
$html->set_callback('my_callback');

// Callback function will be invoked while dumping
echo $html;

API Reference

Helper functions

Name                                        Description
object str_get_html ( string $content )     Creates a DOM object from a string.
object file_get_html ( string $filename )   Creates a DOM object from a file or a URL.

DOM methods & properties

Name                                             Description
void __construct ( [string $filename] )          Constructor. Setting the filename parameter automatically loads the contents, either text or file/URL.
string plaintext                                 Returns the contents extracted from HTML.
void clear ()                                    Cleans up memory.
void load ( string $content )                    Loads contents from a string.
string save ( [string $filename] )               Dumps the internal DOM tree back into a string. If $filename is set, the result string is saved to that file.
void load_file ( string $filename )              Loads contents from a file or a URL.
void set_callback ( string $function_name )      Sets a callback function.
mixed find ( string $selector [, int $index] )   Finds elements by the CSS selector. Returns the Nth element object if index is set, otherwise returns an array of objects.

Element methods & properties

Name                                             Description
string [attribute]                               Read or write the element's attribute value.
string tag                                       Read or write the tag name of element.
string outertext                                 Read or write the outer HTML text of element.
string innertext                                 Read or write the inner HTML text of element.
string plaintext                                 Read or write the plain text of element.
mixed find ( string $selector [, int $index] )   Finds children by the CSS selector. Returns the Nth element object if index is set, otherwise returns an array of objects.

DOM traversing

Name                                  Description
mixed $e->children ( [int $index] )   Returns the Nth child object if index is set, otherwise returns an array of children.
element $e->parent ()                 Returns the parent of element.
element $e->first_child ()            Returns the first child of element, or null if not found.
element $e->last_child ()             Returns the last child of element, or null if not found.
element $e->next_sibling ()           Returns the next sibling of element, or null if not found.
element $e->prev_sibling ()           Returns the previous sibling of element, or null if not found.

Camel naming conventions

You can also call methods using the W3C standard camel-case naming conventions.

Method                                                Mapping
array $e->getAllAttributes ()                         array $e->attr
string $e->getAttribute ( $name )                     string $e->attribute
void $e->setAttribute ( $name, $value )               void $e->attribute = $value
bool $e->hasAttribute ( $name )                       bool isset($e->attribute)
void $e->removeAttribute ( $name )                    void $e->attribute = null
element $e->getElementById ( $id )                    mixed $e->find ( "#$id", 0 )
mixed $e->getElementsById ( $id [, $index] )          mixed $e->find ( "#$id" [, int $index] )
element $e->getElementByTagName ( $name )             mixed $e->find ( $name, 0 )
mixed $e->getElementsByTagName ( $name [, $index] )   mixed $e->find ( $name [, int $index] )
element $e->parentNode ()                             element $e->parent ()
mixed $e->childNodes ( [$index] )                     mixed $e->children ( [int $index] )
element $e->firstChild ()                             element $e->first_child ()
element $e->lastChild ()                              element $e->last_child ()
element $e->nextSibling ()                            element $e->next_sibling ()
element $e->previousSibling ()                        element $e->prev_sibling ()

Examples

// Include the library
include('simple_html_dom.php');

// Retrieve the DOM from a given URL
$html = file_get_html('https://davidwalsh.name/');

// Find all "A" tags and print their HREFs
foreach($html->find('a') as $e)
    echo $e->href . '<br>';

// Retrieve all images and print their SRCs
foreach($html->find('img') as $e)
    echo $e->src . '<br>';

// Find all images, print their text with the "<>" included
foreach($html->find('img') as $e)
    echo $e->outertext . '<br>';

// Find the DIV tag with an id of "myId"
foreach($html->find('div#myId') as $e)
    echo $e->innertext . '<br>';

// Find all SPAN tags that have a class of "myClass"
foreach($html->find('span.myClass') as $e)
    echo $e->outertext . '<br>';

// Find all TD tags with "align=center"
foreach($html->find('td[align=center]') as $e)
    echo $e->innertext . '<br>';

// Extract all text from a given cell
echo $html->find('td[align="center"]', 1)->plaintext.'<br><hr>';

Like I said earlier, this library is a dream for finding elements, much like the early JavaScript
frameworks and selector engines. Armed with the ability to pick content from
DOM nodes with PHP, it's time to analyze websites for changes.

The Script

The following script checks two websites for changes:

// Pull in PHP Simple HTML DOM Parser
include("simplehtmldom/simple_html_dom.php");

// Settings on top
$sitesToCheck = array(
    // "selector" identifies the element to watch on each page
    array("url" => "http://www.arsenal.com/first-team/players", "selector" => "#squad"),
    array("url" => "http://www.liverpoolfc.tv/news", "selector" => "ul[style='height:400px;']")
);
$savePath = "cachedPages/";
$emailContent = "";

// For every page to check...
foreach($sitesToCheck as $site) {
    $url = $site["url"];

    // Calculate the cachedPage name, reset the old and current content
    $fileName = md5($url);
    $oldContent = "";
    $currentContent = "";

    // Get the URL's current page content
    $html = file_get_html($url);

    // Find content by querying with a selector, just like a selector engine!
    foreach($html->find($site["selector"]) as $element) {
        $currentContent = $element->plaintext;
    }

    // If a cached file exists
    if(file_exists($savePath.$fileName)) {
        // Retrieve the old content
        $oldContent = file_get_contents($savePath.$fileName);
    }

    // If different, notify!
    if($oldContent && $currentContent != $oldContent) {
        // Here's where we can do a whoooooooooooooole lotta stuff
        // We could tweet to an address
        // We can send a simple email
        // We can text ourselves

        // Build simple email content; append so one digest covers every changed page
        $emailContent .= "David, the following page has changed!\n\n".$url."\n\n";
    }

    // Save new content
    file_put_contents($savePath.$fileName, $currentContent);
}

// Send the email if there's content!
if($emailContent) {
    // Sendmail!
    mail("david@davidwalsh.name", "Sites Have Changed!", $emailContent, "From: alerts@davidwalsh.name");
    // Debug
    echo $emailContent;
}

The code and comments are self-explanatory. I've set the script up such that I get one
"digest" alert if many of the pages change. The script is the hard part -- to enact the script,
I've set up a CRON job to run the script every 20 minutes.

This solution isn't specific to just spying on footy -- you could use this type of script on any
number of sites. This script, however, is a bit simplistic. If you wanted to spy
on a website that had extremely dynamic code (e.g. a timestamp in the markup), you
would want to create a regular expression that isolates the content to just the block
you're looking for. Since each website is constructed differently, I'll leave it up to you to
create page-specific isolators. Have fun spying on websites though...and be sure to let me
know if you hear a good, reliable footy rumor!
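
As a rough illustration of such an isolator, the sketch below strips volatile fragments from the extracted text before it is compared with the cached copy. The function name strip_volatile and the default timestamp pattern are assumptions made for this example; a real page would need patterns written for its own markup.

```php
<?php
// Hypothetical helper (not part of the original script): removes fragments that
// change on every request so they do not trigger a false "page changed" alert.
// The default pattern matches timestamps such as "2011-03-01 10:15:00".
function strip_volatile(string $text, array $patterns = array('/\d{4}-\d{2}-\d{2}[ T]\d{2}:\d{2}(:\d{2})?/')): string
{
    // Drop every match of every pattern
    $clean = preg_replace($patterns, '', $text);
    // Collapse whitespace so spacing differences alone are ignored
    return trim(preg_replace('/\s+/', ' ', $clean));
}

// Two snapshots that differ only by their timestamp compare as equal
$old = "Squad list\nUpdated 2011-03-01 10:15:00";
$new = "Squad list\nUpdated 2011-03-01 10:35:00";
var_dump(strip_volatile($old) === strip_volatile($new)); // bool(true)
```

In the script above, one option would be to pass $element->plaintext through such a function before both the comparison and the file_put_contents() call, so the cached copy and the fresh copy are normalized the same way.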
